To use the 8-bit Adam optimizer (adam8bit) on the FAIR cluster gshard branch, install the following dependencies (from inside the fairseq env, assuming CUDA 11.0):
pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda110 -U
pip install -U fairscale
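Before launching a long run, it can help to confirm that the CUDA-specific bitsandbytes build imports cleanly. This is a minimal sketch, not part of the branch; only the bitsandbytes import and its Adam8bit optimizer class are real names, the rest is illustrative:

import torch
import bitsandbytes.optim  # bitsandbytes-cuda110 installs under the plain `bitsandbytes` module name

# PyTorch's CUDA version should match the cuda110 build installed above
print('torch CUDA version:', torch.version.cuda)
# the 8-bit Adam optimizer that bitsandbytes provides
print('Adam8bit available:', hasattr(bitsandbytes.optim, 'Adam8bit'))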
WARNING: if you don't do this step, your checkpoints will not be usable!
Remove your old --optimizer hyperparam and add the following:
grid.extend([
    # assumption: `original_opt` holds the name of the optimizer you removed (e.g. 'adam')
    hyperparam('--optimizer', 'adam8bit', save_dir_key=lambda val: original_opt),
    hyperparam('--no-scale-embedding'),
    hyperparam('--use-stable-embedding', save_dir_key=lambda x: 'stable' if x else ''),
    hyperparam('--block-wise', save_dir_key=lambda x: 'blockwise' if x else ''),
])
If you are using FSDP, you also need to add
grid.append(hyperparam('--use-sharded-state'))
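Putting the pieces together, the additions might sit in your sweep script like this. This is a minimal sketch under assumptions: the hyperparam import path, the use_fsdp flag, and original_opt are placeholders for whatever your sweep script actually uses.

from fb_sweep.sweep import hyperparam  # adjust to wherever hyperparam lives in your sweep script

def get_grid(args):
    grid = []
    # ... your existing model / data / training hyperparams ...
    original_opt = 'adam'  # placeholder: name of the optimizer you removed
    use_fsdp = True        # placeholder: however your script decides to train with FSDP
    grid.extend([
        hyperparam('--optimizer', 'adam8bit', save_dir_key=lambda val: original_opt),
        hyperparam('--no-scale-embedding'),
        hyperparam('--use-stable-embedding', save_dir_key=lambda x: 'stable' if x else ''),
        hyperparam('--block-wise', save_dir_key=lambda x: 'blockwise' if x else ''),
    ])
    if use_fsdp:
        grid.append(hyperparam('--use-sharded-state'))
    return grid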
Using --use-sharded-state will make your checkpoint files look different, like this:
checkpoint_last-rank-0-shard0.pt
checkpoint_last-rank-1-shard1.pt
checkpoint_last-shared-shard0.pt
checkpoint_last-shared-shard1.pt
which eval_lm and gpt3_eval cannot consume directly.
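A quick way to spot a sharded run before pointing eval at it (a minimal sketch; only the file-name pattern above is taken from this page, SAVE_DIR is a placeholder):

import glob
import os

save_dir = 'SAVE_DIR'  # placeholder: your actual save directory
sharded = sorted(glob.glob(os.path.join(save_dir, 'checkpoint_last-*shard*.pt')))
if sharded:
    print('sharded checkpoint detected; consolidate before running eval_lm/gpt3_eval:')
    for path in sharded:
        print(' ', path)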
After training with sharded state, you can run, for example:
python scripts/consolidate_fsdp_shards.py SAVE_DIR/checkpoint_1_1000.pt
which will save files like
SAVE_DIR/checkpoint_1_1000_consolidated-shared.pt
SAVE_DIR/checkpoint_1_1000_consolidated-rank-0.pt
so that
fairseq_cli/eval_lm.py SAVE_DIR/checkpoint_1_1000.pt ...
works.
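If you want to sanity-check the consolidated output before eval (a minimal sketch; the file name comes from the consolidation step above, but the keys you see will depend on your checkpoint):

import torch

# load on CPU so this also works on a machine without a GPU
ckpt = torch.load('SAVE_DIR/checkpoint_1_1000_consolidated-shared.pt', map_location='cpu')
if isinstance(ckpt, dict):
    print('top-level keys:', sorted(ckpt.keys()))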