mirror of https://github.com/wassname/Open-Assistant.git synced 2026-06-27 16:10:30 +08:00

README.md

Train using supervised examples

Requirements

pip install -r requirements.txt

Start training SFT model

python trainer.py --configs defaults galactica-125m

To log to wandb, set the entity argument of the wandb.init call in trainer.py to your Weights & Biases username, per the docs.

Dataset choices

To choose the translation pair for the WMT and TED Talk translation datasets, append a supported language pair as a suffix to the dataset name:

  datasets:
    - wmt2019_zh-en
    - wmt2019_ru-en
    - wmt2019_de-en
    - ted_trans_nl-en
    - ted_trans_de-ja

Currently only these languages are supported via prompt translation:

ar,de,fr,en,it,nl,tr,ru,ms,ko,ja,zh
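The naming convention above can be checked with a few lines of Python. This is only a sketch: `split_language_pair` and `SUPPORTED_LANGS` are hypothetical names, not part of this repository, and it assumes the pair is everything after the last underscore, written as `<source>-<target>`, with both codes drawn from the list above.

```python
# Language codes listed above as supported via prompt translation.
SUPPORTED_LANGS = {"ar", "de", "fr", "en", "it", "nl", "tr", "ru", "ms", "ko", "ja", "zh"}


def split_language_pair(dataset_name: str) -> tuple[str, str, str]:
    """Split 'ted_trans_de-ja' into ('ted_trans', 'de', 'ja').

    Hypothetical helper: assumes the pair is the text after the last
    underscore, written as '<source>-<target>'.
    """
    base, sep, pair = dataset_name.rpartition("_")
    src, dash, tgt = pair.partition("-")
    if not sep or not dash or not src or not tgt:
        raise ValueError(f"expected '<dataset>_<src>-<tgt>', got {dataset_name!r}")
    unsupported = {src, tgt} - SUPPORTED_LANGS
    if unsupported:
        raise ValueError(f"unsupported language code(s): {sorted(unsupported)}")
    return base, src, tgt
```

Running a check like this over the datasets list before training would catch a misspelled suffix early, rather than when the dataset is loaded.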

Dataset sub-sampling

We can subsample the training data by passing either the fraction or size argument in the configs/config.yml file. Don't forget the additional colon ":" after the dataset name when doing this.

Example:

  datasets:
    - webgpt:
        fraction : 0.05
    - prompt_dialogue:
        size : 500
    - adversarial_qa
    - trivia_qa_nocontext

In this example, per epoch we will use:

A random 5% of webgpt;
A random 500 examples from prompt_dialogue;
All examples from datasets for which we don't specify the fraction or size argument.

The subsample is redrawn every epoch, so each epoch uses a different 5% of webgpt and a different 500 examples from prompt_dialogue.

This works with torch.distributed.
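One way to picture per-epoch sub-sampling is the sketch below. It is not the code in trainer.py: `subsample_indices` and its parameters are made up for illustration, and seeding the generator with `seed + epoch` is an assumption about how a redraw could stay consistent across processes that share a seed.

```python
import random
from typing import Optional


def subsample_indices(n_examples: int, epoch: int, fraction: Optional[float] = None,
                      size: Optional[int] = None, seed: int = 0) -> list[int]:
    """Return the example indices to use for one epoch (illustrative only)."""
    if fraction is not None and size is not None:
        raise ValueError("pass either fraction or size, not both")
    if fraction is not None:
        k = int(n_examples * fraction)
    elif size is not None:
        k = min(size, n_examples)
    else:
        return list(range(n_examples))  # no sub-sampling requested
    # Seeding with (seed + epoch) redraws the subset each epoch while keeping
    # the draw identical on every process that uses the same seed.
    rng = random.Random(seed + epoch)
    return sorted(rng.sample(range(n_examples), k))
```

Calling `subsample_indices(1000, epoch=0, fraction=0.05)` and then the same with `epoch=1` yields two independently drawn sets of 50 indices, while omitting both `fraction` and `size` returns every index.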

Model

Normally you should be able to add new models in configs/config.yml:

your-model-name:
  learning_rate: 2e-6
  model_name: <huggingface model name>
  weight_decay: 0.01
  max_length: 812
  warmup_steps: 600
  gradient_checkpointing: false
  gradient_accumulation_steps: 5
  per_device_train_batch_size: 4
  per_device_eval_batch_size: 4

python trainer.py --configs defaults your-model-name
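The command passes two section names, defaults and your-model-name. A plausible reading, stated here as an assumption rather than a description of trainer.py, is that the sections are merged left to right so the model section overrides the defaults. The sketch below shows that merge with plain dictionaries; `merge_configs` is a hypothetical name and the defaults values are invented.

```python
from typing import Any


def merge_configs(all_configs: dict[str, dict[str, Any]], names: list[str]) -> dict[str, Any]:
    """Merge the named sections left to right; later sections win."""
    merged: dict[str, Any] = {}
    for name in names:
        if name not in all_configs:
            raise KeyError(f"unknown config section: {name!r}")
        merged.update(all_configs[name])
    return merged


# Stand-in for the parsed contents of configs/config.yml. The "defaults"
# values are made up; the model values are copied from the example above.
configs = {
    "defaults": {"learning_rate": 1e-5, "weight_decay": 0.0, "max_length": 512},
    "your-model-name": {"learning_rate": 2e-6, "weight_decay": 0.01, "max_length": 812},
}

settings = merge_configs(configs, ["defaults", "your-model-name"])
```

Under this assumption, any key left out of the model section keeps its value from defaults.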

If the model of your choice lacks a pad_token, eos_token, or sep_token, you have to update get_tokenizer in utils.py so that it uses the right tokens.
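A minimal sketch of filling in missing special tokens follows. It is not the body of get_tokenizer: `ensure_special_tokens` is a hypothetical helper, and it only assumes a tokenizer that behaves like a Hugging Face one, exposing pad_token, eos_token and sep_token attributes and an add_special_tokens method.

```python
def ensure_special_tokens(tokenizer, fallbacks: dict[str, str]) -> dict[str, str]:
    """Add any special tokens the tokenizer lacks and return what was added.

    `tokenizer` is expected to behave like a Hugging Face tokenizer: it exposes
    `pad_token`, `eos_token` and `sep_token` attributes (None when unset) and an
    `add_special_tokens(dict)` method.
    """
    missing = {
        name: token
        for name, token in fallbacks.items()
        if getattr(tokenizer, name, None) is None
    }
    if missing:
        tokenizer.add_special_tokens(missing)
    return missing
```

With a real model, the tokenizer would come from `AutoTokenizer.from_pretrained(model_name)`, and the fallback strings (for example `{"pad_token": "<pad>"}`) are placeholders to choose per model. If new tokens are added, the model's embedding matrix usually needs resizing with `model.resize_token_embeddings(len(tokenizer))`.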

Deepspeed support

You can edit configs/zero_config.json and use any ZeRO stage you wish. The current config uses ZeRO stage 3. For more details on how to set up the config, check out this page.

Once you are satisfied with your DeepSpeed ZeRO config, add the --deepspeed flag at the end of the command to enable DeepSpeed:

python trainer.py --configs defaults your-model-name --deepspeed
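Switching ZeRO stages amounts to changing one field in the JSON config. The sketch below does that from Python; `set_zero_stage` is a hypothetical helper, and it assumes the file follows the standard DeepSpeed layout in which the stage lives under the "zero_optimization" key. Other stage-specific settings in the file may also need adjusting by hand.

```python
import json
from pathlib import Path


def set_zero_stage(config_path: str, stage: int) -> dict:
    """Rewrite the ZeRO stage in a DeepSpeed JSON config and return the config."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2 or 3")
    path = Path(config_path)
    config = json.loads(path.read_text())
    # DeepSpeed reads the stage from the "zero_optimization" section.
    config.setdefault("zero_optimization", {})["stage"] = stage
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

For example, `set_zero_stage("configs/zero_config.json", 2)` would switch the current stage 3 config to stage 2 before launching with --deepspeed.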

Results

Experimental results in wandb here.

TODOS

Decide on a model
Merge utils etc with reward model
Causal modelling for GPT-JT does not leverage the bidirectional mask for the prompt? (https://huggingface.co/togethercomputer/GPT-JT-6B-v1)