To use the 8-bit Adam optimizer (adam8bit) on the FAIR cluster gshard branch, install the following dependencies (from inside the fairseq env, assuming CUDA 11.0):
pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda110 -U
pip install -U fairscale
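Before launching a long run, it can help to confirm that the CUDA-specific bitsandbytes build imports cleanly. This is a minimal sketch, not part of the branch; only the bitsandbytes import and its Adam8bit optimizer class are real names, the rest is illustrative:

import torch
import bitsandbytes.optim  # bitsandbytes-cuda110 installs under the plain `bitsandbytes` module name

# PyTorch's CUDA version should match the cuda110 build installed above
print('torch CUDA version:', torch.version.cuda)
# the 8-bit Adam optimizer that bitsandbytes provides
print('Adam8bit available:', hasattr(bitsandbytes.optim, 'Adam8bit'))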
WARNING: if you don't do this step, your checkpoints will not be usable!
Remove your old --optimizer hyperparam and add the following:
grid.extend([
    # assumption: `original_opt` holds the name of the optimizer you removed (e.g. 'adam')
    hyperparam('--optimizer', 'adam8bit', save_dir_key=lambda val: original_opt),
    hyperparam('--no-scale-embedding'),
    hyperparam('--use-stable-embedding', save_dir_key=lambda x: 'stable' if x else ''),
    hyperparam('--block-wise', save_dir_key=lambda x: 'blockwise' if x else ''),
])
If you are using FSDP, you also need to add
grid.append(hyperparam('--use-sharded-state'))
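Putting the pieces together, the additions might sit in your sweep script like this. This is a minimal sketch under assumptions: the hyperparam import path, the use_fsdp flag, and original_opt are placeholders for whatever your sweep script actually uses.

from fb_sweep.sweep import hyperparam  # adjust to wherever hyperparam lives in your sweep script

def get_grid(args):
    grid = []
    # ... your existing model / data / training hyperparams ...
    original_opt = 'adam'  # placeholder: name of the optimizer you removed
    use_fsdp = True        # placeholder: however your script decides to train with FSDP
    grid.extend([
        hyperparam('--optimizer', 'adam8bit', save_dir_key=lambda val: original_opt),
        hyperparam('--no-scale-embedding'),
        hyperparam('--use-stable-embedding', save_dir_key=lambda x: 'stable' if x else ''),
        hyperparam('--block-wise', save_dir_key=lambda x: 'blockwise' if x else ''),
    ])
    if use_fsdp:
        grid.append(hyperparam('--use-sharded-state'))
    return grid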
Using --use-sharded-state will make your checkpoint files look different, like this:
checkpoint_last-rank-0-shard0.pt
checkpoint_last-rank-1-shard1.pt
checkpoint_last-shared-shard0.pt
checkpoint_last-shared-shard1.pt
which eval_lm and gpt3_eval cannot consume directly.
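A quick way to spot a sharded run before pointing eval at it (a minimal sketch; only the file-name pattern above is taken from this page, SAVE_DIR is a placeholder):

import glob
import os

save_dir = 'SAVE_DIR'  # placeholder: your actual save directory
sharded = sorted(glob.glob(os.path.join(save_dir, 'checkpoint_last-*shard*.pt')))
if sharded:
    print('sharded checkpoint detected; consolidate before running eval_lm/gpt3_eval:')
    for path in sharded:
        print(' ', path)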
After training with sharded state, you can run, for example:
python scripts/consolidate_fsdp_shards.py SAVE_DIR/checkpoint_1_1000.pt
which will save files like
SAVE_DIR/checkpoint_1_1000_consolidated-shared.pt
SAVE_DIR/checkpoint_1_1000_consolidated-rank-0.pt
so that
fairseq_cli/eval_lm.py SAVE_DIR/checkpoint_1_1000.pt ...
works.
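If you want to sanity-check the consolidated output before eval (a minimal sketch; the file name comes from the consolidation step above, but the keys you see will depend on your checkpoint):

import torch

# load on CPU so this also works on a machine without a GPU
ckpt = torch.load('SAVE_DIR/checkpoint_1_1000_consolidated-shared.pt', map_location='cpu')
if isinstance(ckpt, dict):
    print('top-level keys:', sorted(ckpt.keys()))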