TLDR: 8B gets a big bump across the board, 70B instruct shows minor improvements, and 405B is the SoTA open model. But 405B still lags behind flagship models.
Here are the notable upgrades:
- Every model now supports 128k context length (up from 8k)
- Trained on a massive ~15T tokens of public data
- Fine-tuning data includes publicly available instruction datasets and over 25M synthetically generated examples
- Multilingual support for 7 languages: French, German, Hindi, Italian, Portuguese, Spanish, and Thai
- Training required a whopping 39.3M GPU hours on H100-80GB hardware: roughly 1.46M for 8B, 7.0M for 70B, and 30.84M for 405B
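As a quick sanity check on those compute numbers, the per-model figures sum to the reported 39.3M total. A minimal sketch (the per-model hours here are the model card's figures, which round slightly differently than casual summaries of the release):

```python
# Per-model pretraining compute from the leaked model card (approximate).
gpu_hours = {"8B": 1.46e6, "70B": 7.0e6, "405B": 30.84e6}

total = sum(gpu_hours.values())
print(f"Total: {total / 1e6:.1f}M H100-80GB GPU hours")

for name, hours in gpu_hours.items():
    # Share of total compute per model size.
    print(f"  {name}: {hours / 1e6:.2f}M ({hours / total:.0%} of total)")
```

Note that 405B alone accounts for nearly four-fifths of the total compute budget.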
Meta's own benchmarks are in, but independent evaluations will deliver the real verdict.
Check out the leaked model card for yourself: https://lnkd.in/gSVt3Ex7
A list of base model files with their sizes:
405B is a dense model rather than a Mixture-of-Experts (MoE) release.