wassname/Open-Assistant

mirror of https://github.com/wassname/Open-Assistant.git synced 2026-06-27 16:10:30 +08:00

Yannic Kilcher b6a0eedf81 various fixes to the SFT code

2023-02-11 21:37:35 +01:00

README.md

Train using supervised examples

Requirements

pip install -r requirements.txt

Start training SFT model

python trainer.py --configs defaults galactica-125m

To log to wandb, set the entity argument in trainer.py's call to wandb.init to your Weights & Biases username, as described in the wandb docs.
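The `--configs defaults galactica-125m` arguments in the command above name sections of the config file. A minimal sketch of how such sections could be layered, assuming later names override earlier ones; `CONFIG_FILE`, its values, and `merge_configs` are hypothetical and are not taken from trainer.py:

```python
from typing import Any, Dict, List

# Stand-in for the parsed config file. Values are illustrative, not the repository's.
CONFIG_FILE: Dict[str, Dict[str, Any]] = {
    "defaults": {"learning_rate": 1e-5, "max_length": 512, "datasets": ["webgpt"]},
    "galactica-125m": {"model_name": "facebook/galactica-125m", "learning_rate": 5e-5},
}


def merge_configs(sections: Dict[str, Dict[str, Any]], names: List[str]) -> Dict[str, Any]:
    """Merge the named sections in order; later sections override earlier ones."""
    merged: Dict[str, Any] = {}
    for name in names:
        if name not in sections:
            raise KeyError(f"unknown config section: {name!r}")
        merged.update(sections[name])
    return merged


config = merge_configs(CONFIG_FILE, ["defaults", "galactica-125m"])
print(config)
```

Under this assumption, the order of the names matters: a key set in both sections takes its value from the section listed last.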

Dataset choices

To select the translation pair for the WMT and TED Talk translation datasets, append a supported language pair as a suffix to the dataset name:

  datasets:
    - wmt2019_zh-en
    - wmt2019_ru-en
    - wmt2019_de-en
    - ted_trans_nl-en
    - ted_trans_de-ja

Currently only these languages are supported via prompt translation:

ar,de,fr,en,it,nl,tr,ru,ms,ko,ja,zh
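As an illustration of the naming convention above, the following sketch splits a dataset name into its base name and language pair and checks both codes against the list. `split_language_pair` is a hypothetical helper, not the repository's parser, and it assumes the language list applies to both sides of the pair:

```python
from typing import Tuple

# Languages listed above as supported via prompt translation.
SUPPORTED_LANGS = {"ar", "de", "fr", "en", "it", "nl", "tr", "ru", "ms", "ko", "ja", "zh"}


def split_language_pair(dataset_name: str) -> Tuple[str, str, str]:
    """Split e.g. 'wmt2019_zh-en' into ('wmt2019', 'zh', 'en')."""
    # The pair is the part after the last underscore, so 'ted_trans_nl-en' also works.
    base, sep, pair = dataset_name.rpartition("_")
    if not sep or pair.count("-") != 1:
        raise ValueError(f"expected '<dataset>_<src>-<tgt>', got {dataset_name!r}")
    src, tgt = pair.split("-")
    for lang in (src, tgt):
        if lang not in SUPPORTED_LANGS:
            raise ValueError(f"unsupported language code: {lang!r}")
    return base, src, tgt


print(split_language_pair("ted_trans_de-ja"))
```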

Dataset sub-sampling

To subsample the training data, pass either the fraction or the size argument for a dataset in the configs/config.yaml file. Remember to add a colon ":" after the dataset name when doing this.

Example:

  datasets:
    - webgpt:
        fraction: 0.05
    - prompt_dialogue:
        size: 500
    - adversarial_qa
    - trivia_qa_nocontext

In this example, per epoch we will use:

A random 5% of webgpt;
A random 500 examples from prompt_dialogue;
All examples from datasets for which we don't specify the fraction or size argument.

The subsample is redrawn every epoch, so each epoch uses a different 5% of webgpt and a different 500 examples from prompt_dialogue.

Sub-sampling also works with torch.distributed.
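The per-epoch behaviour can be sketched as follows. This is a minimal illustration, not the repository's sampler: `subsample_indices` is a hypothetical helper, and seeding the generator with `seed + epoch` is an assumed way to get a fresh draw each epoch while keeping every distributed worker on the same subset.

```python
import random
from typing import List, Optional


def subsample_indices(
    n_examples: int,
    epoch: int,
    fraction: Optional[float] = None,
    size: Optional[int] = None,
    seed: int = 0,
) -> List[int]:
    """Return the example indices to use for one epoch."""
    if fraction is not None and size is not None:
        raise ValueError("pass either fraction or size, not both")
    if fraction is None and size is None:
        return list(range(n_examples))  # no sub-sampling: use every example
    k = size if size is not None else int(n_examples * fraction)
    k = min(k, n_examples)
    # Same seed and epoch give the same draw; a new epoch gives a new draw.
    rng = random.Random(seed + epoch)
    return sorted(rng.sample(range(n_examples), k))


epoch0 = subsample_indices(1000, epoch=0, fraction=0.05)
epoch1 = subsample_indices(1000, epoch=1, fraction=0.05)
print(len(epoch0), len(epoch1))
```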

Training only on OA internal data:

To experiment with the Open Assistant data simply run:

python trainer.py --configs defaults oa_dataset_only galactica-125m

Set data_path in the oa_dataset_only section of the configs/config.yaml file to the correct path.

Model

In most cases you can add a new model by adding a section for it to configs/config.yaml:

your-model-name:
  learning_rate: 2e-6
  model_name: <huggingface model name>
  weight_decay: 0.01
  max_length: 812
  warmup_steps: 600
  gradient_checkpointing: false
  gradient_accumulation_steps: 5
  per_device_train_batch_size: 4
  per_device_eval_batch_size: 4

python trainer.py --configs defaults your-model-name

However, if the tokenizer of your chosen model does not define pad_token, eos_token, or sep_token, you have to update get_tokenizer in utils.py to use the right tokens.
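One way such an update could look is sketched below. This is not the repository's `get_tokenizer`: `add_missing_special_tokens` and the strings in `FALLBACK_TOKENS` are hypothetical placeholders. The only library call used is `add_special_tokens`, the Hugging Face tokenizer method for registering special tokens.

```python
from typing import Dict

# Placeholder strings (an assumption); choose tokens that suit your model's vocabulary.
FALLBACK_TOKENS: Dict[str, str] = {
    "pad_token": "<pad>",
    "eos_token": "</s>",
    "sep_token": "<sep>",
}


def add_missing_special_tokens(tokenizer, fallbacks: Dict[str, str] = FALLBACK_TOKENS) -> Dict[str, str]:
    """Register any of pad/eos/sep that the tokenizer lacks; return what was added."""
    missing = {
        name: token
        for name, token in fallbacks.items()
        if getattr(tokenizer, name, None) is None
    }
    if missing:
        # Accepts a dict such as {"pad_token": "<pad>"} on Hugging Face tokenizers.
        tokenizer.add_special_tokens(missing)
    return missing
```

If any tokens are added this way, the model's embedding matrix usually has to grow to match, for example with `model.resize_token_embeddings(len(tokenizer))` on a Hugging Face model.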

Deepspeed support

You can edit configs/zero_config.json to use any ZeRO stage you wish. The current config uses ZeRO stage 3. For more details on how to set up the config, check out this page.

Once you are satisfied with your DeepSpeed ZeRO config, add the --deepspeed flag at the end of the command to enable DeepSpeed:

python trainer.py --configs defaults your-model-name --deepspeed
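Switching the ZeRO stage can also be scripted. The sketch below rewrites the `zero_optimization.stage` key, which is the DeepSpeed config field that selects the stage; `set_zero_stage` is a hypothetical helper, and it assumes the file is plain JSON. Stage-specific options already present in the file are left untouched and may need manual review.

```python
import json
from pathlib import Path


def set_zero_stage(config_path: str, stage: int) -> dict:
    """Set the ZeRO stage in a DeepSpeed JSON config and write the file back."""
    if stage not in (0, 1, 2, 3):
        raise ValueError(f"ZeRO stage must be 0, 1, 2 or 3, got {stage}")
    path = Path(config_path)
    config = json.loads(path.read_text())
    # Create the section if it is missing, then overwrite only the stage.
    config.setdefault("zero_optimization", {})["stage"] = stage
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

For example, `set_zero_stage("configs/zero_config.json", 2)` would switch the config mentioned above to stage 2 before launching with `--deepspeed`.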

Results

Experimental results in wandb here.

TODOS

Decide on a model
Merge utils etc with reward model
Causal modelling for GPT-JT does not leverage the bidirectional mask for the prompt? (https://huggingface.co/togethercomputer/GPT-JT-6B-v1)