This project is based on "Pretraining an Ultra-Mini LLaMA 3 from Scratch: Reproducing Tiny Stories".
As described in the previous article, fine-tuning boils down to three simple steps:
- 1 model and tokenizer
- 2 data / dataset / datacollator
- 3 trainer
Pretraining follows the same three steps.
1 model and tokenizer
For the model we use LLaMA. The transformers library already includes this architecture, so we only need to load the corresponding config object, set its parameters, and build the model from it.
The code is as follows:
import torch
from transformers import AutoConfig, AutoModelForCausalLM

device = 'cuda' if torch.cuda.is_available() else 'cpu'

hidden_size = 256
# Size the SwiGLU MLP at roughly 8/3 * hidden_size, rounded up to a multiple of 128
intermediate_size = (int(hidden_size * 8 / 3 / 128) + 1) * 128

config = AutoConfig.for_model(
    model_type="llama",
    hidden_size=hidden_size,
    intermediate_size=intermediate_size,
    num_attention_heads=16,
    num_hidden_layers=4,
    num_key_value_heads=8
)
model = AutoModelForCausalLM.from_config(
    config,
    torch_dtype=torch.float32
).to(device)
The model has a hidden size of 256, 16 attention heads, grouped-query attention (GQA) with 8 key/value heads (2 query heads per group), and 4 layers, for a total of about 19.5M parameters.
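To sanity-check that figure, you can count the parameters directly:

n_params = sum(p.numel() for p in model.parameters())
print(f"total parameters: {n_params / 1e6:.1f}M")  # roughly 19.5M with the default 32k LLaMA vocab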
For the tokenizer, we simply reuse the original LLaMA tokenizer:
tokenizer = AutoTokenizer.from_pretrained('./model', local_files_only=True)
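One caveat: the original LLaMA tokenizer does not define a pad token, which the data collator in the next step needs in order to pad batches. If you hit that error, a common workaround is to reuse the EOS token:

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token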
2 data / dataset / datacollator
As the title suggests, we want to train a TinyStories model, so we use the noanabeshima/TinyStoriesV2 dataset.
The dataset step tokenizes the raw data; the data collator then pads the tokenized examples into batches and builds the labels for causal language modeling.
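The map calls below rely on a process_func that tokenizes each batch of examples; the function itself is not shown here. A minimal sketch, assuming each JSONL record has a text field and using an assumed 512-token context window, could look like this:

def process_func(examples):
    # Tokenize a batch of stories, truncating each to the context window
    return tokenizer(examples['text'], truncation=True, max_length=512)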
The code:
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling

# Train on 20% of the training split; keep the full validation split
ds_train = load_dataset('json', data_files="./data/TinyStoriesV2-GPT4-train.jsonl", split='train[:20%]')
ds_val = load_dataset('json', data_files="./data/TinyStoriesV2-GPT4-valid.jsonl", split='train')

ds_train = ds_train.shuffle().map(
    process_func,
    batched=True,
    num_proc=8,
    remove_columns=ds_train.column_names,
    desc='Running tokenizer on train_set: '
)
ds_val = ds_val.map(
    process_func,
    batched=True,
    num_proc=8,
    remove_columns=ds_val.column_names,
    desc='Running tokenizer on val_set: '
)
print(ds_val)

# Causal LM collator: pads each batch and copies input_ids into labels
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
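As a quick sanity check, you can run the collator on a couple of tokenized examples: with mlm=False it pads the batch and copies input_ids into labels, masking padding positions with -100 (this assumes the pad token workaround above):

batch = data_collator([ds_train[0], ds_train[1]])
print(batch['input_ids'].shape, batch['labels'].shape)  # same shape; padded positions get label -100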
3 Trainer
Once the training arguments are set, we can start training.
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./outputs',
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    evaluation_strategy='steps',    # required for eval_steps to take effect (renamed eval_strategy in transformers >= 4.41)
    eval_steps=1000,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # effective batch size of 32 per device
    learning_rate=1e-4,
    lr_scheduler_type='cosine',
    bf16=torch.cuda.is_bf16_supported(),    # prefer bf16, fall back to fp16
    fp16=not torch.cuda.is_bf16_supported(),
    logging_steps=50,
    report_to='none',               # the string 'none' disables logging integrations
    num_train_epochs=1,
    save_steps=1000,
    save_total_limit=2,
    seed=3407
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_train,
    eval_dataset=ds_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
trainer.train()
trainer.save_model("./model")
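Finally, we generate a sample story. The inference helper used below is not defined in this post; a minimal sketch built on model.generate (the sampling settings are placeholder values) might look like this:

def inference(model, tokenizer, prompt, max_new_tokens=256):
    # Encode the prompt, sample a continuation, and print the decoded story
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.9,
        temperature=0.8,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))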
inference(
    model,
    tokenizer,
    "Once upon a time, in a beautiful garden, there lived a little rabbit named Peter Rabbit."
)