System,Domain,Organization,Authors,Publication date,Reference,Link,Notability criteria,Notability criteria notes,Parameters,Parameters notes,Training compute (FLOP),Training compute notes,Training dataset,Training dataset notes,Training dataset size (datapoints),Dataset size notes,Abstract,Confidence,Country (from Organization),Organization categorization,Training time (hours),Training time notes,Training hardware,Model accessibility,Finetune compute notes,Hardware quantity,Hardware utilization,Base model,Citations,Compute cost notes,Finetune compute (FLOP),Epochs,Training compute cost (2023 USD),Batch size,Batch size notes
DeepSeek-Coder-V2-236B,Language,DeepSeek,"Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen, Xin Xie, Kang Guan, Yuxiang You, Aixin Liu, Qiushi Du, Wenjun Gao, Xuan Lu, Qinyu Chen, Yaohui Wang, Chengqi Deng, Jiashi Li, Chenggang Zhao, Chong Ruan, Fuli Luo, Wenfeng Liang",2024-06-17,DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence,https://github.com/deepseek-ai/DeepSeek-Coder-V2,SOTA improvement,"New SOTA on Aider, AIME 2024, and Math Odyssey benchmarks (including against proprietary models such as Claude 3 Opus, GPT-4o and GPT-4 Turbo).
Note that Figure 1 appears to show the new model getting SOTA for several other benchmarks, but omits results from GPT-4o which wins in most cases.",236000000000.0,Mixture of experts model. 21B parameters activated per token.,1.2852e+24,"Trained on a total of 10.2T tokens
6ND: 6 * 10.2T * 21B = 1.285e24","GitHub,Common Crawl","See Section 2. ""In the pre-training phase, the dataset of DeepSeek-Coder-V2 is created with a composition of 60% source code, 10% math corpus, and 30% natural language corpus""",4451000000000.0,"""The source code consists of 1,170B code-related tokens sourced from GitHub and CommonCrawl... For the math corpus, we collect 221B math-related tokens sourced from CommonCrawl... In total, DeepSeek-Coder-V2 has been exposed to 10.2T training tokens, where 4.2 trillion tokens originate from the DeepSeek V2 dataset, while the remaining 6 trillion tokens come from the DeepSeek-Coder-V2 dataset""
Total of 1.391T tokens in the new data.
From the DeepSeek-V2 paper: ""our tokenized pretraining corpus contains 8.1T tokens""
So some tokens are trained over for multiple epochs:
- 10.2T * 0.6 / 1.17T = 5.23 epochs on the code corpus
- 10.2T * 0.1 / 221B = 4.6 epochs on the math corpus
- 10.2T * 0.3 / 8.1T = 0.38 epochs on the natural language corpus
Total unique tokens seen is likely 1.17T + 221B + (10.2T*0.3) = 4.451T","We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2
with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general language tasks. Compared to DeepSeekCoder-33B, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-CoderV2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K. In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks.",Confident,China,Industry,,,,,,,,,,,,,,,
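For readers checking the arithmetic, here is a minimal Python sketch of the 6ND estimate and epoch counts described in the DeepSeek-Coder-V2 compute and dataset notes above. It assumes, as the notes do, that N is the 21B activated parameters of the MoE model and takes the corpus sizes from the quoted report rather than re-deriving them.

```python
# Rough check of the DeepSeek-Coder-V2 training compute and epoch estimates above.
# Uses the standard C ~= 6*N*D approximation, with N = active (not total) parameters.

ACTIVE_PARAMS = 21e9       # 21B parameters activated per token (MoE)
TOTAL_TOKENS = 10.2e12     # total training tokens

train_compute = 6 * ACTIVE_PARAMS * TOTAL_TOKENS
print(f"Training compute ~ {train_compute:.3e} FLOP")            # ~1.285e+24

# Epochs over each sub-corpus, using the corpus sizes quoted in the notes.
code_corpus = 1.17e12      # code-related tokens
math_corpus = 221e9        # math-related tokens
nl_corpus = 8.1e12         # DeepSeek-V2 natural-language pretraining corpus

print(f"Code epochs: {TOTAL_TOKENS * 0.6 / code_corpus:.2f}")     # ~5.2
print(f"Math epochs: {TOTAL_TOKENS * 0.1 / math_corpus:.2f}")     # ~4.6
print(f"NL epochs:   {TOTAL_TOKENS * 0.3 / nl_corpus:.2f}")       # ~0.38

unique_tokens = code_corpus + math_corpus + TOTAL_TOKENS * 0.3
print(f"Unique tokens seen ~ {unique_tokens:.3e}")                # ~4.45e12
```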
Nemotron-4 340B,Language,NVIDIA,"Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorzek, Robert Hero, Jining Huang, Vibhu Jawa, Joseph Jennings, Aastha Jhunjhunwala, John Kamalu, Sadaf Khan, Oleksii Kuchaiev, Patrick LeGresley, Hui Li, Jiwei Liu, Zihan Liu, Eileen Long, Ameya Sunil Mahabaleshwarkar, Somshubra Majumdar, James Maki, Miguel Martinez, Maer Rodrigues de Melo, Ivan Moshkov, Deepak Narayanan, Sean Narenthiran, Jesus Navarro, Phong Nguyen, Osvald Nitski, Vahid Noroozi, Guruprasad Nutheti, Christopher Parisien, Jupinder Parmar, Mostofa Patwary, Krzysztof Pawelec, Wei Ping, Shrimai Prabhumoye, Rajarshi Roy, Trisha Saar, Vasanth Rao Naik Sabavat, Sanjeev Satheesh, Jane Polak Scowcroft, Jason Sewall, Pavel Shamis, Gerald Shen, Mohammad Shoeybi, Dave Sizer, Misha Smelyanskiy, Felipe Soares, Makesh Narsimhan Sreedhar, Dan Su, Sandeep Subramanian, Shengyang Sun, Shubham Toshniwal, Hao Wang, Zhilin Wang, Jiaxuan You, Jiaqi Zeng, Jimmy Zhang, Jing Zhang, Vivienne Zhang, Yian Zhang, Chen Zhu",2024-06-14,NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models,https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/ ,Training cost,"~2e25 FLOP, so high training cost",340000000000.0,340B,1.7999999999999999e+25,"9 trillion tokens for training
6 * 340B * 9T = 1.8E25
alternatively, can do a hardware estimate with a few extra steps:
According to the technical report, Nemotron-4 340B was trained using up to 6144 H100 GPUs. Helpfully, they also report the model FLOP utilization (MFU), which was 41-42% (Table 2). This is the ratio of the actual output of their GPUs, in FLOP used for training, relative to their theoretical max of 989 teraFLOP/s per GPU.
Unfortunately, the report omits the last ingredient, which is the duration of the training run. However, in Table 2 they report some relevant data that we can use to infer the training time.
Nemotron-4 was trained in several stages, but the largest stage used all 6144 GPUs with a batch size of 2304 and an iteration time (time per batch) of 8.0 seconds. This stage involved 7.6T tokens, so it makes up the majority of training.
A batch size of 2304 means that each batch consists of 2304 sequences, and they report that the sequence length used for training was 4096 tokens. This means that each batch contained 4096 * 2304 = 9,437,184 tokens.
So, during this stage, it took 8 seconds to train the model on 9.4m tokens. Extrapolating to the entire 9T token dataset, this implies the training run would have taken 7,659,574 seconds, or 89 days. (it actually took longer because they didn't use all their GPUs for the whole run)
Multiplying 7,659,574 seconds by 41% MFU, 989 peak teraFLOP/s for each H100, and 6144 H100s, we get ~1.9e25 FLOP. This is very close to our first estimate.
",Unspecified unreleased,"The technical report for the 340B model cites the report for the 15B version (https://arxiv.org/pdf/2402.16819 )
from that paper:
""We train Nemotron-4 15B on a pre-training dataset consisting of 8 trillion tokens. At a high-level,
the data blend is split into three different types of data: English natural language data (70%), multilingual
natural language data (15%), and source-code data (15%).
The English corpus consists of curated documents from a variety of sources and domains including web
documents, news articles, scientific papers, books, etc and the distribution used in our pre-training set is
highlighted in Figure 2. The code and multilingual data consists of a diverse set of natural and programming
languages. We find that appropriately sampling tokens from these languages is key to strong accuracies in
these domains. We share the distributions used for both code and multilingual tokens in our pre-training
dataset in Figure 3 and Figure 4 respectively.
In constructing the pre-training corpus, we remove any possible duplicates via document-level exact and
near-deduplication (Jennings et al., 2023). We additionally applied document-level quality filtering across
our corpus using a language-model based filtering approach similar to (Wenzek et al., 2019) in addition to a
series of heuristic filters as described in (Rae et al., 2022) and (Raffel et al., 2020).""",6750000000000.0,"9T training tokens.
They first train on an 8T-token dataset and then an additional 1T tokens; it's slightly unclear whether that is new data or a partial second epoch.
6.75T words, assuming 1 token ≈ 0.75 words","We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-
340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open
Model License Agreement, a permissive model license that allows distribution, modification, and use of
the models and its outputs. These models perform competitively to open access models on a wide range
of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in
FP8 precision. We believe that the community can benefit from these models in various research studies
and commercial applications, especially for generating synthetic data to train smaller language models.
Notably, over 98% of data used in our model alignment process is synthetically generated, showcasing
the effectiveness of these models in generating synthetic data. To further support open research and
facilitate model development, we are also open-sourcing the synthetic data generation pipeline used in
our model alignment process.
(from technical report: https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf )",Confident,United States of America,Industry,2200.0,"see training compute notes, this is an inferred estimate",NVIDIA H100 SXM5,Open source,,,,,,,,,,,
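A small Python sketch reproducing both Nemotron-4 340B estimates from the notes above: the 6ND approximation and the hardware-based estimate from batch size, iteration time, GPU count, MFU, and the H100's 989 teraFLOP/s peak. All figures are the ones cited in the notes, not independently verified.

```python
# Rough reproduction of the two Nemotron-4 340B compute estimates in the notes above.

# (1) Parameter/token estimate: C ~= 6*N*D
params, tokens = 340e9, 9e12
print(f"6ND estimate: {6 * params * tokens:.2e} FLOP")            # ~1.8e25

# (2) Hardware estimate from reported batch size, iteration time and MFU.
seq_len, batch_size, iter_time_s = 4096, 2304, 8.0
tokens_per_batch = seq_len * batch_size                            # ~9.44M tokens per batch
train_seconds = tokens / tokens_per_batch * iter_time_s            # ~7.6e6 s (~88 days)

num_gpus = 6144
peak_flops_per_gpu = 989e12                                        # H100 BF16 dense peak
mfu = 0.41                                                         # reported model FLOP utilization
hw_estimate = train_seconds * num_gpus * peak_flops_per_gpu * mfu
print(f"Hardware estimate: {hw_estimate:.2e} FLOP")                # ~1.9e25
```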
GPT-4o,"Multimodal,Language,Audio,Speech,Vision",OpenAI,"Aidan Clark, Alex Paino, Jacob Menick, Liam Fedus, Luke Metz, Clemens Winter, Lia Guy, Sam Schoenholz, Daniel Levy, Nitish Keskar, Alex Carney, Alex Paino, Ian Sohl, Qiming Yuan, Reimar Leike, Arka Dhar, Brydon Eastman, Mia Glaese, Ben Sokolowsky, Andrew Kondrich, Felipe Petroski Such, Henrique Ponde de Oliveira Pinto, Jiayi Weng, Randall Lin, Youlong Cheng, Nick Ryder, Lauren Itow, Barret Zoph, John Schulman, Mianna Chen, Adam Lerer, Adam P. Goucher, Adam Perelman, Akila Welihinda, Alec Radford, Alex Borzunov, Alex Carney, Alex Chow, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexi Christakis, Ali Kamali, Allison Moyer, Allison Tam, Amin Tootoonchian, Ananya Kumar, Andrej Karpathy, Andrey Mishchenko, Andrew Cann, Andrew Kondrich, Andrew Tulloch, Angela Jiang, Antoine Pelisse, Anuj Gosalia, Avi Nayak, Avital Oliver, Behrooz Ghorbani, Ben Leimberger, Ben Wang, Blake Samic, Brian Guarraci, Brydon Eastman, Camillo Lugaresi, Chak Li, Charlotte Barette, Chelsea Voss, Chong Zhang, Chris Beaumont, Chris Hallacy, Chris Koch, Christian Gibson, Christopher Hesse, Colin Wei, Daniel Kappler, Daniel Levin, Daniel Levy, David Farhi, David Mely, David Sasaki, Dimitris Tsipras, Doug Li, Duc Phong Nguyen, Duncan Findlay, Edmund Wong, Ehsan Asdar, Elizabeth Proehl, Elizabeth Yang, Eric Peterson, Eric Sigler, Eugene Brevdo, Farzad Khorasani, Francis Zhang, Gene Oden, Geoff Salmon, Hadi Salman, Haiming Bao, Heather Schmidt, Hongyu Ren, Hyung Won Chung, Ian Kivlichan, Ian O'Connell, Ian Osband, Ilya Kostrikov, Ingmar Kanitscheider, Jacob Coxon, James Crooks, James Lennon, Jason Teplitz, Jason Wei, Jason Wolfe, Jay Chen, Jeff Harris, Jiayi Weng, Jie Tang, Joanne Jang, Jonathan Ward, Jonathan McKay, Jong Wook Kim, Josh Gross, Josh Kaplan, Joy Jiao, Joyce Lee, Juntang Zhuang, Kai Fricke, Kavin Karthik, Kenny Hsu, Kiel Howe, Kyle Luther, Larry Kai, Lauren Itow, Leo Chen, Lia Guy, Lien Mamitsuka, Lilian Weng, Long Ouyang, Louis Feuvrier, Lukas Kondraciuk, Lukasz Kaiser, Lyric Doshi, Mada Aflak, Maddie Simens, Madeleine Thompson, Marat Dukhan, Marvin Zhang, Mateusz Litwin, Max Johnson, Mayank Gupta, Mia Glaese, Michael Janner, Michael Petrov, Michael Wu, Michelle Fradin, Michelle Pokrass, Miguel Oom Temudo de Castro, Mikhail Pavlov, Minal Khan, Mo Bavarian, Natalia Gimelshein, Natalie Staudacher, Nick Stathas, Nik Tezak, Nithanth Kudige, Noel Bundick, Ofir Nachum, Oleg Boiko, Oleg Murk, Olivier Godement, Owen Campbell-Moore, Philip Pronin, Philippe Tillet, Rachel Lim, Rajan Troll, Randall Lin, Rapha gontijo lopes, Raul Puri, Reah Miyara, Reimar Leike, Renaud Gaubert, Reza Zamani, Rob Honsby, Rohit Ramchandani, Rory Carmichael, Ruslan Nigmatullin, Ryan Cheu, Scott Gray, Sean Grove, Sean Metzger, Shantanu Jain, Shengjia Zhao, Sherwin Wu, Shuaiqi (Tony) Xia, Sonia Phene, Spencer Papay, Steve Coffey, Steve Lee, Steve Lee, Stewart Hall, Suchir Balaji, Tal Broda, Tal Stramer, Tarun Gogineni, Ted Sanders, Thomas Cunninghman, Thomas Dimson, Thomas Raoux, Tianhao Zheng, Tina Kim, Todd Underwood, Tristan Heywood, Valerie Qi, Vinnie Monaco, Vlad Fomenko, Weiyi Zheng, Wenda Zhou, Wojciech Zaremba, Yash Patil, Yilei, Qian, Yongjik Kim, Youlong Cheng, Yuchen He, Yuchen Zhang, Yujia Jin, Yunxing Dai, Yury Malkov",2024-05-13,Hello GPT-4o,https://openai.com/index/hello-gpt-4o/ ,"SOTA improvement,Significant use","Outperforms GPT-4 Turbo and other models on text and especially on multimodal benchmarks, such as MMLU, GPQA, HumanEval, MMMU, etc See Model Evaluations: 
https://openai.com/index/hello-gpt-4o/
GPT-4o is now the default model in ChatGPT, so it's one of the most widely used models.",,"Not known.
Inference costs in the API are 2x cheaper than GPT-4 Turbo",,"Not known. But it's more capable than GPT-4, Gemini 1 Ultra, etc",Unspecified unreleased,"""With GPT-4o, we trained a single new model end-to-end across text, vision, and audio.""",,,"We’re announcing GPT-4o, our new flagship model that can reason across audio, vision, and text in real time.
GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time(opens in a new window) in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models.",Confident,United States of America,Industry,,,,API access,"Definitely a new model, not a GPT-4 finetune",,,,,,,,,,
Llama 3-70B,Language,Meta AI,Aaditya Singh; Aaron Grattafiori; Abhimanyu Dubey; Abhinav Jauhri; Abhinav Pandey; Abhishek Kadian; Adam Kelsey; Adi Gangidi; Ahmad Al-Dahle; Amit Sangani; Ahuva Goldstand; Aiesha Letman; Ajay Menon; Akhil Mathur; Alan Schelten; Alex Vaughan; Amy Yang; Andrei Lupu; Andres Alvarado; Andrew Gallagher; Andrew Gu; Andrew Ho; Andrew Poulton; Andrew Ryan; Angela Fan; Ankit Ramchandani; Anthony Hartshorn; Archi Mitra; Archie Sravankumar; Artem Korenev; Arun Rao; Ashley Gabriel; Ashwin Bharambe; Assaf Eisenman; Aston Zhang; Ash JJhaveri; Aurelien Rodriguez; Austen Gregerson; Ava Spataru; Baptiste Roziere; Ben Maurer; Benjamin Leonhardi; Bernie Huang; Bhargavi Paranjape; Bing Liu; Binh Tang; Bobbie Chern; Brani Stojkovic; Brian Fuller; Catalina Mejia Arenas; Chao Zhou; Charlotte Caucheteux; Chaya Nayak; Ching-Hsiang Chu; Chloe Bi; Chris Cai; Chris Cox; Chris Marra; Chris McConnell; Christian Keller; Christoph Feichtenhofer; Christophe Touret; Chunyang Wu; Corinne Wong; Cristian Canton Ferrer; Damien Allonsius; Daniel Kreymer; Daniel Haziza; Daniel Li; Danielle Pintz; Danny Livshits; Danny Wyatt; David Adkins; David Esiobu; David Xu; Davide Testuggine; Delia David; Devi Parikh; Dhruv Choudhary; Dhruv Mahajan; Diana Liskovich; Diego Garcia-Olano; Diego Perino; Dieuwke Hupkes; Dingkang Wang; Dustin Holland; Egor Lakomkin; Elina Lobanova; Xiaoqing Ellen Tan; Emily Dinan; Eric Smith; Erik Brinkman; Esteban Arcaute; Filip Radenovic; Firat Ozgenel; Francesco Caggioni; Frank Seide; Frank Zhang; Gabriel Synnaeve; Gabriella Schwarz; Gabrielle Lee; Gada Badeer; Georgia Anderson; Graeme Nail; Gregoire Mialon; Guan Pang; Guillem Cucurell; Hailey Nguyen; Hamid Shojanazeri; Hannah Korevaar; Hannah Wang; Haroun Habeeb; Harrison Rudolph; Henry Aspegren; Hu Xu; Hugo Touvron; Iga Kozlowska; Igor Molybog; Igor Tufanov; Iliyan Zarov; Imanol Arrieta Ibarra; Irina-Elena Veliche; Isabel Kloumann; Ishan Misra; Ivan Evtimov; Jacob Xu; Jade Copet; Jake Weissman; Jan Geffert; Jana Vranes; Japhet Asher; Jason Park; Jay Mahadeokar; Jean-Baptiste Gaya; Jeet Shah; Jelmer van der Linde; Jennifer Chan; Jenny Hong; Jenya Lee; Jeremy Fu; Jeremy Teboul; Jianfeng Chi; Jianyu Huang; Jie Wang; Jiecao Yu; Joanna Bitton; Joe Spisak; Joelle Pineau; Jon Carvill; Jongsoo Park; Joseph Rocca; Joshua Johnstun; Junteng Jia; Kalyan Vasuden Alwala; Kam Hou U; Kate Plawiak; Kartikeya Upasani; Kaushik Veeraraghavan; Ke Li; Kenneth Heafield; Kevin Stone; Khalid El-Arini; Krithika Iyer; Kshitiz Malik; Kuenley Chiu; Kunal Bhalla; Kyle Huang; Lakshya Garg; Lauren Rantala-Yeary; Laurens van der Maaten; Lawrence Chen; Leandro Silva; Lee Bell; Lei Zhang; Liang Tan; Louis Martin; Lovish Madaan; Luca Wehrstedt; Lukas Blecher; Luke de Oliveira; Madeline Muzzi; Madian Khabsa; Manav Avlani; Mannat Singh; Manohar Paluri; Mark Zuckerberg; Marcin Kardas; Martynas Mankus; Mathew Oldham; Mathieu Rita; Matthew Lennie; Maya Pavlova; Meghan Keneally; Melanie Kambadur; Mihir Patel; Mikayel Samvelyan; Mike Clark; Mike Lewis; Min Si; Mitesh Kumar Singh; Mo Metanat; Mona Hassan; Naman Goyal; Narjes Torabi; Nicolas Usunier; Nikolay Bashlykov; Nikolay Bogoychev; Niladri Chatterji; Ning Dong; Oliver Aobo Yang; Olivier Duchenne; Onur Celebi; Parth Parekh; Patrick Alrassy; Paul Saab; Pavan Balaji; Pedro Rittner; Pengchuan Zhang; Pengwei Li; Petar Vasic; Peter Weng; Polina Zvyagina; Prajjwal Bhargava; Pratik Dubal; Praveen Krishnan; Punit Singh Koura; Qing He; Rachel Rodriguez; Ragavan Srinivasan; Rahul Mitra; Ramon Calderer; Raymond Li; Robert 
Stojnic; Roberta Raileanu; Robin Battey; Rocky Wang; Rohit Girdhar; Rohit Patel; Romain Sauvestre; Ronnie Polidoro; Roshan Sumbaly; Ross Taylor; Ruan Silva; Rui Hou; Rui Wang; Russ Howes; Ruty Rinott; Saghar Hosseini; Sai Jayesh Bondu; Samyak Datta; Sanjay Singh; Sara Chugh; Sargun Dhillon; Satadru Pan; Sean Bell; Sergey Edunov; Shaoliang Nie; Sharan Narang; Sharath Raparthy; Shaun Lindsay; Sheng Feng; Sheng Shen; Shenghao Lin; Shiva Shankar; Shruti Bhosale; Shun Zhang; Simon Vandenhende; Sinong Wang; Seohyun Sonia Kim; Soumya Batra; Sten Sootla; Steve Kehoe; Suchin Gururangan; Sumit Gupta; Sunny Virk; Sydney Borodinsky; Tamar Glaser; Tamar Herman; Tamara Best; Tara Fowler; Thomas Georgiou; Thomas Scialom; Tianhe Li; Todor Mihaylov; Tong Xiao; Ujjwal Karn; Vedanuj Goswami; Vibhor Gupta; Vignesh Ramanathan; Viktor Kerkez; Vinay Satish Kumar; Vincent Gonguet; Vish Vogeti; Vlad Poenaru; Vlad Tiberiu Mihailescu; Vladan Petrovic; Vladimir Ivanov; Wei Li; Weiwei Chu; Wenhan Xiong; Wenyin Fu; Wes Bouaziz; Whitney Meers; Will Constable; Xavier Martinet; Xiaojian Wu; Xinbo Gao; Xinfeng Xie; Xuchao Jia; Yaelle Goldschlag; Yann LeCun; Yashesh Gaur; Yasmine Babaei; Ye Qi; Yenda Li; Yi Wen; Yiwen Song; Youngjin Nam; Yuchen Hao; Yuchen Zhang; Yun Wang; Yuning Mao; Yuzi He; Zacharie Delpierre Coudert; Zachary DeVito; Zahra Hankir; Zhaoduo Wen; Zheng Yan; Zhengxing Chen; Zhenyu Yang; Zoe Papakipos,2024-04-18,Introducing Meta Llama 3: The most capable openly available LLM to date,"https://ai.meta.com/blog/meta-llama-3/
https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/",Significant use,"Will almost certainly be very influential and widely used in the open access AI industry, as with the previous Llama generations.",70000000000.0,,6.30000000001e+24,"direct calculation
15000000000000 tokens*70000000000.00 parameters*6=6.300000000000001e+24
GPU calculation
400 TFLOPS per GPU * 6.4 million GPU hours * 3600s=9.216 × 10^24
""Our most efficient implementation achieves a compute utilization of over 400 TFLOPS per GPU when trained on 16K GPUs simultaneously. We performed training runs on two custom-built 24K GPU clusters.""
So the throughput may have been lower if they used more than 16k GPUs.",,,15000000000000.0,,,Confident,United States of America,Industry,,,NVIDIA H100 SXM5,Open access (restricted use),,16000.0,0.4,,,,,,,,
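A short sketch cross-checking the two Llama 3-70B estimates quoted above, i.e. the direct 6ND calculation and the GPU-throughput calculation; the 400 TFLOPS and 6.4 million GPU-hour figures are the ones cited in the notes.

```python
# Cross-check of the Llama 3-70B compute estimates in the notes above.

# (1) Token/parameter estimate: C ~= 6*N*D
params, tokens = 70e9, 15e12
print(f"6ND estimate: {6 * params * tokens:.2e} FLOP")            # ~6.3e24

# (2) Hardware-throughput estimate from Meta's reported figures.
achieved_flops_per_gpu = 400e12                                    # "over 400 TFLOPS per GPU"
gpu_hours = 6.4e6
hw_estimate = achieved_flops_per_gpu * gpu_hours * 3600
print(f"GPU-hours estimate: {hw_estimate:.2e} FLOP")               # ~9.2e24
```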
ReALM,Language,Apple,"Joel Ruben Antony Moniz, Soundarya Krishnan, Melis Ozyildirim, Prathamesh Saraf, Halim Cagri Ates, Yuan Zhang, Hong Yu, Nidhi Rajshree",2024-03-29,ReALM: Reference Resolution As Language Modeling,https://arxiv.org/abs/2403.20329,SOTA improvement,"""We show that ReaLM outperforms previous approaches, and performs roughly as well as the state-of-the-art LLM today, GPT-4, despite consisting of far fewer parameters.""
""We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it.""",3000000000.0,Fine-tuned FLAN-T5 models ranging from 80M to 3B,,Fine-tuned from FLAN-T5,,"Mix of synthetic and manually annotated data.
""Each data point contains the user query and a list of entities, along with the ground-truth entity (or set of entities) that are relevant to the corresponding user query. Each entity, in turn, contains information about its type and other properties such as the name and other textual details associated with the entity (the label and time of an alarm, for example)""",16300.0,2300 training examples from conversation; 3900 synthetically generated training examples; 10100 training examples using context from a phone screen.,"Reference resolution is an important problem, one that is essential to understand and successfully handle context of different kinds. This context includes both previous turns and context that pertains to non-conversational entities, such as entities on the user's screen or those running in the background. While LLMs have been shown to be extremely powerful for a variety of tasks, their use in reference resolution, particularly for non-conversational entities, remains underutilized. This paper demonstrates how LLMs can be used to create an extremely effective system to resolve references of various types, by showing how reference resolution can be converted into a language modeling problem, despite involving forms of entities like those on screen that are not traditionally conducive to being reduced to a text-only modality. We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5% for on-screen references. We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it.",Confident,United States of America,Industry,,,,Unreleased,No training details given,,,Flan-T5 11B,,,,,,,
MM1-30B,"Multimodal,Language,Vision",Apple,"Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang",2024-03-14,"MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training",https://arxiv.org/abs/2403.09611,SOTA improvement,""" In particular, the pretrained model MM1 is SOTA, performing better than Emu2 [105], Flamingo [3],
and IDEFICS [47] on captioning and visual question answering (VQA) tasks in few-shot settings, both in small and large size regimes""
Table 4: outperforms Gemini and GPT-4V on VQA",30000000000.0,30B,4.86e+23,"Pre-trained on ~2B image-text pairs and 2T tokens (Table 2). Each image is 144 tokens, so the images are ~300B tokens.
Then additional multimodal training for 400B tokens, for a total of ~2.7T tokens.
This is the final training recipe: ""We initialize both the image encoder and the underlying LLM decoder weights for MM1 from in-house pre-trained models2. We then perform multimodal pre-training on the above data mix for 200k steps (approx. 400B tokens).""
Compute = 6ND = 6 * 2.7 trillion * 30 billion = 4.86e23
maybe the size of the visual connector is relevant",,"Text, captioned images. See Table 2",1500000000000.0,at least 2T tokens,"In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.",Likely,United States of America,Industry,,,,Unreleased,,,,,29.0,,,,,,
Inflection-2.5,Language,Inflection AI,,2024-03-07,Inflection-2.5: meet the world's best personal AI,https://inflection.ai/inflection-2-5,Significant use,one million daily users; six million monthly,,,1.0001e+25,"""Inflection-1 used approximately 4% the training FLOPs of GPT-4 and, on average, performed at approximately 72% GPT-4 level on a diverse range of IQ-oriented tasks. Inflection-2.5, now powering Pi, achieves more than 94% the average performance of GPT-4 despite using only 40% the training FLOPs.""
This is a weird one: we estimated GPT-4 at 2.1e25 FLOP (which could be off somewhat, or Inflection could believe a different number), and 40% of that is ~8e24. But Inflection-2, the previous model, was trained on ~1e25 FLOP according to Inflection, and Inflection-2.5 also does better on benchmarks than Inflection-2, so intuitively Inflection-2.5 would have been trained on appreciably more compute.
1e25 seems like a rough, perhaps conservative guess given all this.",,,,,"At Inflection, our mission is to create a personal AI for everyone. Last May, we released Pi—a personal AI, designed to be empathetic, helpful, and safe. In November we announced a new major foundation model, Inflection-2, the second best LLM in the world at the time.
Now we are adding IQ to Pi’s exceptional EQ.
We are launching Inflection-2.5, our upgraded in-house model that is competitive with all the world's leading LLMs like GPT-4 and Gemini. It couples raw capability with our signature personality and unique empathetic fine-tuning. Inflection-2.5 is available to all Pi's users today, at pi.ai, on iOS, on Android, or our new desktop app.",Speculative,United States of America,Industry,,,NVIDIA H100 SXM5,Hosted access (no API),,,,,,,,,,,
Claude 3 Opus,"Multimodal,Language,Vision",Anthropic,,2024-03-04,"The Claude 3 Model Family: Opus, Sonnet, Haiku",https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf,SOTA improvement,,,,,,,"Claude 3 models are trained on a proprietary mix of publicly available information on the Internet as of August 2023, as well as non-public data from third parties, data provided by data labeling services and paid contractors, and data we generate internally. We employ several data cleaning and filtering methods, including deduplication and classification. The Claude 3 suite of models have not been trained on any user prompt or output data submitted to us by users or customers, including free users, Claude Pro users, and API customers.",,,"We introduce Claude 3, a new family of large multimodal models – Claude 3 Opus, our most capable offering, Claude 3 Sonnet, which provides a combination of skills and speed, and Claude 3 Haiku, our fastest and least expensive model. All new models have vision capabilities that enable them to process and analyze image data. The Claude 3 family demonstrates strong performance across benchmark evaluations and sets a new standard on
measures of reasoning, math, and coding. Claude 3 Opus achieves state-of-the-art results on evaluations like GPQA [1], MMLU [2], MMMU [3] and many more. Claude 3 Haiku performs as well or better than Claude 2 [4] on most pure-text tasks, while Sonnet and Opus significantly outperform it. Additionally, these models exhibit improved fluency in non-English languages, making them more versatile for a global audience. In this report, we provide an in-depth analysis of our evaluations, focusing on core capabilities, safety, societal impacts, and the catastrophic risk assessments we committed to in our Responsible Scaling Policy [5].
",Unknown,United States of America,Industry,,"Like its predecessors, Claude 3 models employ various training methods, such as unsupervised learning and Constitutional AI [6]. These models were trained using hardware from Amazon Web Services (AWS) and Google Cloud Platform (GCP)",,API access,,,,,,"Per https://time.com/6980000/anthropic/
""Claude 3 cost somewhere between $30 million and $300 million to train""
This would seem to include all three versions.
Ballpark estimate, based on relative API costs:
sqrt($30M * $300M) * (15 / (0.25 + 3 + 15)) = $78.0M
(cost) * (Opus share of API cost)
Convert to 2020 dollars: $64.7M",,,,,
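A minimal sketch of the cost ballpark in the notes above, assuming (as the notes imply) that 0.25, 3, and 15 are the Haiku, Sonnet, and Opus per-million-token input prices, and that the family-wide cost is the geometric mean of the reported $30M to $300M range.

```python
import math

# Sketch of the ballpark cost allocation in the notes above: take the geometric mean
# of the reported $30M-$300M range for the whole Claude 3 family, then attribute it
# in proportion to Opus's share of the per-million-token input prices.

low, high = 30e6, 300e6
family_cost = math.sqrt(low * high)                    # geometric mean, ~$94.9M

haiku, sonnet, opus = 0.25, 3.0, 15.0                  # assumed input prices ($/M tokens)
opus_share = opus / (haiku + sonnet + opus)            # ~0.82

print(f"Opus cost ~ ${family_cost * opus_share / 1e6:.1f}M")   # ~$78M
```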
MegaScale (Production),Language,"ByteDance,Peking University","Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu",2024-02-23,"MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs",https://arxiv.org/abs/2402.15627,SOTA improvement,Improves SOTA in FLOP utilization for distributed LLM training by 1.34X.,530000000000.0,"Production run is stated to have ""hundreds of billions of parameters"". Since the authors also do a number of experiments with a 530B model, I speculate they've used 530B for the production model.",1.2e+25,"Speculative. The model is stated to have trained for ""several weeks"". Assuming 530B parameters and ""several"" = 3, compute can be estimated from the 175B model's stated PFLOP/sec:
2166.3 aggregate PFlops/sec * 530B/175B * 3 weeks * 7 days/week * 24 hours/day * 3600 seconds/hour = 1.2e25 ",,,,"Speculative. Authors note production system was trained on ""multi-trillions of tokens"". This could refer to training for multiple epochs on the same 300B tokens used to train the 175B and 530B models outlined in more detail in the paper. Alternatively, it could refer to a larger dataset of perhaps 3-9 trillion tokens.","We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.",Speculative,"China,China","Industry,Academia",504.0,"Speculative. Authors state ""several weeks"". For analysis, I've assumed this means around 3 weeks.",NVIDIA A100,Unreleased,,12288.0,,,7.0,,,,,,
Aya,Language,"Cohere for AI,Brown University,Cohere,Carnegie Mellon University (CMU),Massachusetts Institute of Technology (MIT)","Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D'souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, Sara Hooker",2024-02-12,Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model,https://arxiv.org/abs/2402.07827,SOTA improvement,"from abstract ""We introduce extensive new evaluation suites that broaden the state-of-art for multilingual eval across 99
language""",13000000000.0,13B - fine tune of mT5 - from last page - model card ,,"13B parameters, batch size = 256, sequence length = 1024 (for both input and output), 30K updates
- approximation 6ND = 6 * 13B * 2 * 1024 * 30K * 256 = 1226833920000000000000 = 1.22683392e+21
""We finetune mT5 models using the Adafactor optimizer [Shazeer & Stern, 2018] with a learning rate of 3 × 10^-4 and a batch size of 256. We find that using a smaller learning rate compared to 1 × 10^-3 leads to a better downstream performance, which is potentially due to the diverse nature of our IFT mixture. Both input and target sequence length are set to 1024.""
""We train all the models for 30,000 update steps with data packing enabled. This results in a training budget of 25M samples. """,,"""Expansion of Language Coverage We significantly expand the size of available training data to directly address the linguistic inequality of recent NLP development. "" from the paper
""Datasets: xP3x, Aya Dataset, Aya Collection, DataProvenance collection, ShareGPT-Command."" from https://huggingface.co/CohereForAI/aya-101and https://huggingface.co/CohereForAI/aya-101#data-sources",,"at least 835 GB + size of ShareGPT-command + size of DataProvenance collection
https://huggingface.co/CohereForAI/aya-101#data-sources
xP3x - 680GB - from https://huggingface.co/datasets/CohereForAI/xP3x
aya_dataset - 138MB - https://huggingface.co/datasets/CohereForAI/aya_dataset
aya collection - 155GB - https://huggingface.co/datasets/CohereForAI/aya_collection","Recent breakthroughs in large language models (LLMs) have centered around a handful of data-rich languages. What does it take to broaden access to breakthroughs beyond first-class citizen languages? Our work introduces Aya, a massively multilingual generative language model that follows instructions in 101 languages of which over 50% are considered as lower-resourced. Aya outperforms mT0 and BLOOMZ on the majority of tasks while covering double the number of languages. We introduce extensive new evaluation suites that broaden the state-of-art for multilingual eval across 99 languages -- including discriminative and generative tasks, human evaluation, and simulated win rates that cover both held-out tasks and in-distribution performance. Furthermore, we conduct detailed investigations on the optimal finetuning mixture composition, data pruning, as well as the toxicity, bias, and safety of our models. We open-source our instruction datasets and our model at this https://huggingface.co/CohereForAI/aya-101",Speculative,"Multinational,United States of America,Canada,United States of America,United States of America","Industry,Academia,Industry,Academia,Academia",,,Google TPU v4,Open source,"13B parameters, batch size = 256, sequence length = 1024 (for both input and output), 30K updates
- approximation 6ND = 6 * 13B * 2 * 1024 * 30K * 256 = 1226833920000000000000 = 1.22683392e+21
""We finetune mT5 models using the Adafactor optimizer [Shazeer & Stern, 2018] with a learning rate of 3 × 10^-4 and a batch size of 256. We find that using a smaller learning rate compared to 1 × 10^-3 leads to a better downstream performance, which is potentially due to the diverse nature of our IFT mixture. Both input and target sequence length are set to 1024.""
""We train all the models for 30,000 update steps with data packing enabled. This results in a training budget of 25M samples. """,128.0,,mT5-XXL,27.0,,1.22683392e+21,,,,
Qwen1.5 72B,Language,Alibaba,,2024-02-04,Introducing Qwen1.5,https://qwenlm.github.io/blog/qwen1.5/,SOTA improvement,"#1 in C-Eval (84.1, better than Qwen-72B. https://qwenlm.github.io/blog/qwen1.5/, https://cevalbenchmark.com/static/leaderboard.html)",72000000000.0,72B,,,,,,,"In recent months, our focus has been on developing a “good” model while optimizing the developer experience. As we progress towards Qwen1.5, the next iteration in our Qwen series, this update arrives just before the Chinese New Year. With Qwen1.5, we are open-sourcing base and chat models across six sizes: 0.5B, 1.8B, 4B, 7B, 14B, and 72B. In line with tradition, we’re also providing quantized models, including Int4 and Int8 GPTQ models, as well as AWQ and GGUF quantized models. To enhance the developer experience, we’ve merged Qwen1.5’s code into Hugging Face transformers, making it accessible with transformers>=4.37.0 without needing trust_remote_code.",Likely,China,Industry,,,,Open access (restricted use),,,,,,,,,,,
Qwen-VL-Max,"Multimodal,Language,Vision",Alibaba,,2024-01-25,Introducing Qwen-VL,https://qwenlm.github.io/blog/qwen-vl/,SOTA improvement,"""Notably, Qwen-VL-Max outperforms both GPT-4V from OpenAI and Gemini from Google in tasks on Chinese question answering and Chinese text comprehension""",7000000000.0,"Not stated. Qwen-VL (less capable, presumably smaller version) is 9.6B
Update: 7B parameters mentioned here
https://github.com/QwenLM/Qwen-VL#qwen-vl-plus",,,,,,,"Along with the rapid development of our large language model Qwen, we leveraged Qwen’s capabilities and unified multimodal pretraining to address the limitations of multimodal models in generalization, and we opensourced multimodal model Qwen-VL in Sep. 2023. Recently, the Qwen-VL series has undergone a significant upgrade with the launch of two enhanced versions, Qwen-VL-Plus and Qwen-VL-Max. The key technical advancements in these versions include:
Substantially boost in image-related reasoning capabilities;
Considerable enhancement in recognizing, extracting, and analyzing details within images and texts contained therein;
Support for high-definition images with resolutions above one million pixels and images of various aspect ratios.",Confident,China,Industry,,,,API access,,,,,,,,,,,
AlphaGeometry,Mathematics,"Google DeepMind,New York University (NYU)","Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, Thang Luong",2024-01-17,Solving olympiad geometry without human demonstrations,https://www.nature.com/articles/s41586-023-06747-5,SOTA improvement,"""On a test set of 30 latest olympiad-level problems, AlphaGeometry solves 25, outperforming the previous best method that only solves ten problems and approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist.""",151000000.0,"""Overall, the transformer has 151 million parameters, excluding embedding layers at its input and output heads.""",,"Training details. Don't think there's enough info for a FLOP estimate.
""Our customized tokenizer is trained with ‘word’ mode using
SentencePiece36 and has a vocabulary size of 757. We limit the maximum context length to 1,024 tokens and use T5-style relative position embedding37. Sequence packing38,39 is also used because more
than 90% of our sequences are under 200 in length. During training, a
dropout40 rate of 5% is applied pre-attention and post-dense. A 4 × 4 slice of TPUv3 (ref. 41) is used as its hardware accelerator. For pretraining, we train the transformer with a batch size of 16 per core
and a cosine learning-rate schedule that decays from 0.01 to 0.001
in 10,000,000 steps. For fine-tuning, we maintain the final learning rate of 0.001 for another 1,000,000 steps""",,Synthetic dataset of geometry proofs,,"100m examples of theorem-proofs
""By using existing symbolic engines on a diverse set of random theorem premises, we extracted 100 million synthetic theorems and their
proofs, many with more than 200 proof steps, four times longer than
the average proof length of olympiad theorems.""","Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning1,2,3,4, owing to their reputed difficulty among the world’s best talents in pre-university mathematics. Current machine-learning approaches, however, are not applicable to most mathematical domains owing to the high cost of translating human proofs into machine-verifiable format. The problem is even worse for geometry because of its unique translation challenges1,5, resulting in severe scarcity of training data. We propose AlphaGeometry, a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity. AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. On a test set of 30 latest olympiad-level problems, AlphaGeometry solves 25, outperforming the previous best method that only solves ten problems and approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist. Notably, AlphaGeometry produces human-readable proofs, solves all geometry problems in the IMO 2000 and 2015 under human expert evaluation and discovers a generalized version of a translated IMO theorem in 2004.",Confident,"Multinational,United States of America","Industry,Academia",,,Google TPU v3,Open source,,,,,73.0,,,,,,
CoRe,Mathematics,Tsinghua University,"Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Ruyi Gan, Jiaxing Zhang, Yujiu Yang",2023-12-29,Solving Math Word Problems via Cooperative Reasoning induced Language Models,https://arxiv.org/abs/2210.16257,SOTA improvement,"We evaluate our CoRe framework on several mathematical reasoning datasets and achieve decent improvement over state-of-the-art methods, up to 9.6% increase over best baselines.",12400000000.0,"""Since the default setting consists of two GPT-J (6B) and a DeBERTa-large (0.4B), we note our backbone as “GPT-J 12B”, which implies around 12.4 billion parameters in total. """,,,"GSM8K,ASDiv","We consider several widely-used math word problem datasets: GSM8K (Cobbe et al., 2021), ASDivA (Miao et al., 2020), SingleOp (Roy et al., 2015), SinlgeEq (Koncel-Kedziorski et al., 2015) and MultiArith (Roy and Roth, 2015). (Details in Appendix A). Following the general setting as in (Kojima et al., 2022; Wei et al., 2022c), we employ accuracy as the evaluation metric for all datasets.",,,"Large-scale pre-trained language models (PLMs) bring new opportunities to challenging problems, especially those that need high-level intelligence, such as the math word problem (MWPs). However, directly applying existing PLMs to MWPs can fail as the generation process lacks sufficient supervision and thus lacks fast adaptivity as humans. We notice that human reasoning has a dual reasoning framework that consists of an immediate reaction system (system 1) and a delicate reasoning system (system 2), where the entire reasoning is determined by their interaction. This inspires us to develop a cooperative reasoning-induced PLM for solving MWPs, called Cooperative Reasoning (CoRe), resulting in a human-like reasoning architecture with system 1 as the generator and system 2 as the verifier. In our approach, the generator is responsible for generating reasoning paths, and the verifiers are used to supervise the evaluation in order to obtain reliable feedback for the generator. We evaluate our CoRe framework on several mathematical reasoning datasets and achieve decent improvement over state-of-the-art methods, up to 9.6% increase over best baselines.",Speculative,China,Academia,,,NVIDIA A100 SXM4 40 GB,,,,,GPT-J-6B,32.0,,,,,,
Gemini Nano-2,"Multimodal,Language,Vision,Audio",Google DeepMind,Gemini Team,2023-12-19,Gemini: A Family of Highly Capable Multimodal Models,https://arxiv.org/abs/2312.11805,Significant use,"Significant use; deployed on Android phones such as the Pixel: https://store.google.com/intl/en/ideas/articles/pixel-feature-drop-december-2023/
""Despite their size, they show exceptionally strong performance on factuality,
i.e. retrieval-related tasks, and significant performance on reasoning, STEM, coding, multimodal and multilingual tasks""",3250000000.0,3.25B,,"More tokens than Chinchilla-optimal:
""The number of tokens used to train the largest models were determined following the approach in Hoffmann et al. (2022). The smaller models are trained for significantly more tokens to improve performance for a given inference budget, similar to the approach advocated in Touvron et al. (2023a)""
Chinchilla was 1.4T tokens for 70B params, so Chinchilla-optimal for 3.25B params would be ~1.4T/20 = 70B tokens.
So compute was significantly greater than 3.25B * 70B * 6, which is 1.4e21.
Touvron et al. is the Llama 1 paper, in which a 6.7B model is trained for 1T tokens. Using the same ratio, a 3.25B model would be trained on ~500B tokens. 3.25 * 500B * 6 = 9.75e21. No guarantee that the exact ratio for Nano is close to Llama's, of course.",,"""Gemini models are trained on a dataset that is both multimodal and multilingual. Our pre-training dataset uses data from web documents, books, and code, and includes image, audio, and video data.""",,,"This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of Gemini models in cross-modal reasoning and language understanding will enable a wide variety of use cases and we discuss our approach toward deploying them responsibly to users.",Confident,Multinational,Industry,,,Google TPU v5e,,,,,,633.0,,,,,,
Gemini Nano-1,"Multimodal,Language,Vision,Audio",Google DeepMind,Gemini Team,2023-12-19,Gemini: A Family of Highly Capable Multimodal Models,https://arxiv.org/abs/2312.11805,Significant use,"Significant use; deployed on Android phones such as the Pixel: https://store.google.com/intl/en/ideas/articles/pixel-feature-drop-december-2023/
""Despite their size, they show exceptionally strong performance on factuality,
i.e. retrieval-related tasks, and significant performance on reasoning, STEM, coding, multimodal and multilingual tasks""",1800000000.0,1.8B,,"More tokens than Chinchilla-optimal:
""The number of tokens used to train the largest models were determined following the approach in Hoffmann et al. (2022). The smaller models are trained for significantly more tokens to improve performance for a given inference budget, similar to the approach advocated in Touvron et al. (2023a)""",,"""Gemini models are trained on a dataset that is both multimodal and multilingual. Our pre-training
dataset uses data from web documents, books, and code, and includes image, audio, and video data.""",,,"This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of Gemini models in cross-modal reasoning and language understanding will enable a wide variety of use cases and we discuss our approach toward deploying them responsibly to users.",Confident,Multinational,Industry,,,Google TPU v5e,,,,,,633.0,,,,,,
FunSearch,"Language,Search",Google DeepMind,"Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, Alhussein Fawzi ",2023-12-14,Mathematical discoveries from program search with large language models,"https://www.nature.com/articles/s41586-023-06924-6
https://deepmind.google/discover/blog/funsearch-making-new-discoveries-in-mathematical-sciences-using-large-language-models/","SOTA improvement,Historical significance",Improved SOTA for the cap set problem. Can plausibly claim the first instance of a LLM system making a genuine and novel scientific contribution.,15000000000.0,"From the section called ""Pretrained LLM"": ""We use Codey, an LLM built on top of the PaLM2 model family... Because FunSearch relies on sampling from an LLM extensively, an important performance-defining tradeoff is between the quality of the samples and the inference speed of the LLM. In practice, we have chosen to work with a fast-inference model (rather than slower-inference, higher-quality)""
Unclear which PaLM2 model was used (of Gecko, Otter, Bison, and Unicorn); above quote indicates it was perhaps Otter or Bison, but not Unicorn. Exact parameter counts are not publicly disclosed for any of these models. In comparisons where FunSearch uses StarCoder-15B, Codey is an improvement but not obviously of an entirely different model class.
I report the 15B parameters from StarCoder-15B, used as an open-source comparison",3.87e+23,"Appendix A.5: ""Finding the full-sized symmetric admissible set I(15, 10) required the generation and analysis of approximately two million programs... To reproduce admissible set experiments done above (generating 2 million samples) one would have to use 15 instances of StarCoder-15B running on A100 40 GB GPU each and 5 CPU servers (each running 32 evaluators in parallel) for two days. We estimate that when running on Google Cloud, the price of an experiment is around $800 – $1400, and the energy usage around 250 – 500 kWh; i.e., 0.5% of the energy used for training StarCoder""
15 GPUs * 7.80E+13 FLOP/GPU-sec * 2 days * 24 hours/day * 3600 sec/hour = 2.02e20 FLOP for the GPU servers
We should also add the compute used to train the PaLM2 variant used as the base LLM. Since we don't have any details about this model, I use the compute from StarCoder-15B (used as the open source comparison point): 3.87e+23 FLOP
Unclear how to evaluate the compute from the CPU servers implementing the evolutionary algorithm, but this is very likely dwarfed by the pre-training compute for the LLM.",,"""The experiments carried out in this paper do not require any data corpus other than the publicly available OR-Library bin packing benchmarks""",0.0,"""The experiments carried out in this paper do not require any data corpus other than the publicly available OR-Library bin packing benchmarks""","Large language models (LLMs) have demonstrated tremendous capabilities in solving complex tasks, from quantitative reasoning to understanding natural language. However, LLMs sometimes suffer from confabulations (or hallucinations), which can result in them making plausible but incorrect statements1,2. This hinders the use of current large models in scientific discovery. Here we introduce FunSearch (short for searching in the function space), an evolutionary procedure based on pairing a pretrained LLM with a systematic evaluator. We demonstrate the effectiveness of this approach to surpass the best-known results in important problems, pushing the boundary of existing LLM-based approaches3. Applying FunSearch to a central problem in extremal combinatorics—the cap set problem—we discover new constructions of large cap sets going beyond the best-known ones, both in finite dimensional and asymptotic cases. This shows that it is possible to make discoveries for established open problems using LLMs. We showcase the generality of FunSearch by applying it to an algorithmic problem, online bin packing, finding new heuristics that improve on widely used baselines. In contrast to most computer search approaches, FunSearch searches for programs that describe how to solve a problem, rather than what the solution is. Beyond being an effective and scalable strategy, discovered programs tend to be more interpretable than raw solutions, enabling feedback loops between domain experts and FunSearch, and the deployment of such programs in real-world applications.",Speculative,Multinational,Industry,48.0,"Appendix A.5: ""To reproduce admissible set experiments done above (generating 2 million samples) one would have to use 15 instances of StarCoder-15B running on A100 40 GB GPU each and 5 CPU servers (each running 32 evaluators in parallel) for two days""",,Open source,No finetuning,,,PaLM 2,81.0,"Appendix A.5: ""We estimate that when running on Google Cloud, the price of an experiment is around $800 – $1400, and the energy usage around 250 – 500 kWh; i.e., 0.5% of the energy used for training StarCoder"" (in reference to a replication done using StarCoder-15B)
Estimate (800+1400)/2 = $1100 at time of publication. CPI conversion to 2020 dollars: $929",0.0,,,,
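A minimal sketch of the FunSearch compute accounting in the notes above; the 7.80e13 FLOP/s effective per-A100 throughput and the 3.87e23 FLOP StarCoder-15B pretraining figure are the values assumed in the notes, not measurements.

```python
# Rough reproduction of the FunSearch sampling-compute estimate in the notes above
# (the StarCoder-15B replication setup: 15 A100 40GB GPUs running for two days).

num_gpus = 15
flops_per_gpu_sec = 7.80e13      # assumed effective throughput per A100, as in the notes
seconds = 2 * 24 * 3600          # two days

sampling_compute = num_gpus * flops_per_gpu_sec * seconds
print(f"Sampling compute ~ {sampling_compute:.2e} FLOP")          # ~2.0e20

# The notes then add the pretraining compute of the base LLM (proxied by
# StarCoder-15B at ~3.87e23 FLOP), which dominates the total.
base_model_pretraining = 3.87e23
print(f"Total ~ {base_model_pretraining + sampling_compute:.2e} FLOP")  # ~3.87e23
```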
Mixtral 8x7B,Language,Mistral AI,"Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Louis Ternon, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.",2023-12-11,Mixtral of experts: A high quality Sparse Mixture-of-Experts.,"https://mistral.ai/news/mixtral-of-experts/, https://arxiv.org/abs/2401.04088",Significant use,"Frequently downloaded: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
Probably the best OS model by a big margin right now, e.g. #7 on Chatbot Arena, above Gemini Pro and Claude 2.1: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
",46700000000.0,"46.7B *sparse* params. 12.9B params used on average:
""Concretely, Mixtral has 46.7B total parameters but only uses 12.9B parameters per token. It, therefore, processes input and generates output at the same speed and for the same cost as a 12.9B model.""",,,,"""Mixtral is pretrained with multilingual data using a context size of 32k tokens""",,,"Today, the team is proud to release Mixtral 8x7B, a high-quality sparse mixture of experts model (SMoE) with open weights. Licensed under Apache 2.0. Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference. It is the strongest open-weight model with a permissive license and the best model overall regarding cost/performance trade-offs. In particular, it matches or outperforms GPT3.5 on most standard benchmarks.",Confident,France,Industry,,,,Open source,,,,,,,,,,,
SeamlessM4T,"Speech,Language","Facebook,INRIA,UC Berkeley","Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, Yilin Yang, Ethan Ye, Ivan Evtimov, Pierre Fernandez, Cynthia Gao, Prangthip Hansanti, Elahe Kalbassi, Amanda Kallet, Artyom Kozhevnikov, Gabriel Mejia Gonzalez, Robin San Roman, Christophe Touret, Corinne Wong, Carleigh Wood, Bokai Yu, Pierre Andrews, Can Balioglu, Peng-Jen Chen, Marta R. Costa-jussà, Maha Elbayad, Hongyu Gong, Francisco Guzmán, Kevin Heffernan, Somya Jain, Justine Kao, Ann Lee, Xutai Ma, Alex Mourachko, Benjamin Peloquin, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Anna Sun, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang, Mary Williamson",2023-12-08,Seamless: Multilingual Expressive and Streaming Speech Translation,"https://arxiv.org/abs/2312.05187, https://huggingface.co/facebook/seamless-m4t-v2-large",SOTA improvement,"""As an improved version of SeamlessM4T,
SeamlessM4T v2 delivers state-of-the-art semantic accuracy across different speech and text translation tasks
while supporting nearly 100 languages as input speech or text""",2300000000.0,2.3B,,,,"Several datasets including unlabeled speech, ASR data, TTS data",,~5M hours of audio data (figure 2),"Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model-SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at this https URL",Confident,"United States of America,France,United States of America","Industry,Academia,Academia",,,NVIDIA V100,Open source,expanded from 1M hours data to 4.5M hours,,,W2v-BERT,13.0,,,,,,
Llama Guard,Language,Meta AI,"Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Davide Testuggine, Madian Khabsa",2023-12-07,Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations,https://arxiv.org/abs/2312.06674,SOTA improvement,"""Llama Guard, a Llama2-7b model that is instruction-tuned on our collected dataset, albeit low in volume, demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat, where its performance matches or exceeds that of currently available content moderation tools. """,7000000000.0,7B,1.6e+23,"1.7e17 finetune compute, plus Llama 2-13B pretrain compute (1.6e+23)",,"Dataset of prompt-response pairs of human-AI conversations
""We leverage the human preference data about harmlessness from Anthropic (Ganguli et al., 2022). From
this dataset, we pick the first human prompt and discard the corresponding response from the assistant, as
well as all the other turns to create an initial single-turn prompt dataset. Next, we use one of our internal
Llama checkpoints to generate a mix of cooperating and refusing responses for these prompts. We employ
our expert, in-house red team to label the prompt and response pairs for the corresponding category based
on the taxonomy defined in Section 2. The red-teamers annotate the dataset for 4 labels: prompt-category,
response-category, prompt-label (safe or unsafe), and response-label (safe or unsafe). During the annotation
process, we also do data cleaning, and discard examples with badly formatted inputs or outputs. The final
dataset comprises of 13,997 prompts and responses, with their respective annotations.""",4096000.0,"14k prompt-response pairs. Based on training details it's trained on ~4M tokens, which is stated to be ~1 epoch:
2 * 4096 * 500 = 4,096,000
(batch size) * (sequence length) * (steps)","We introduce Llama Guard, an LLM-based input-output safeguard model geared towards Human-AI conversation use cases. Our model incorporates a safety risk taxonomy, a valuable tool for categorizing a specific set of safety risks found in LLM prompts (i.e., prompt classification). This taxonomy is also instrumental in classifying the responses generated by LLMs to these prompts, a process we refer to as response classification. For the purpose of both prompt and response classification, we have meticulously gathered a dataset of high quality. Llama Guard, a Llama2-7b model that is instruction-tuned on our collected dataset, albeit low in volume, demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat, where its performance matches or exceeds that of currently available content moderation tools. Llama Guard functions as a language model, carrying out multi-class classification and generating binary decision scores. Furthermore, the instruction fine-tuning of Llama Guard allows for the customization of tasks and the adaptation of output formats. This feature enhances the model's capabilities, such as enabling the adjustment of taxonomy categories to align with specific use cases, and facilitating zero-shot or few-shot prompting with diverse taxonomies at the input. We are making Llama Guard model weights available and we encourage researchers to further develop and adapt them to meet the evolving needs of the community for AI safety.",Confident,United States of America,Industry,,,NVIDIA A100 SXM4 80 GB,Open access (restricted use),"""We train on a single machine with 8xA100 80GB GPUs using a batch size of 2, with sequence length of 4096, using model parallelism of 1 and a learning rate of 2 × 10−6. We train for 500 steps, which corresponds to ∼1 epoch over our training set.""
6 * 2*4096*500 * 7 billion = 1.7e17",,,Llama 2-7B,64.0,,1.7e+17,1.0,,,
Gemini 1.0 Ultra,"Multimodal,Language,Vision",Google DeepMind,Gemini Team,2023-12-06,Gemini: A Family of Highly Capable Multimodal Models,https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf,"SOTA improvement,Training cost",""" Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks — notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined.""",,,5.0000000001e+25,"This number is an estimate based on limited evidence. In particular, we combine information about the performance of Gemini Ultra on various benchmarks compared to other models, and guesstimates about the hardware setup used for training to arrive at our estimate. Our reasoning and calculations are detailed in this Colab notebook.
https://colab.research.google.com/drive/1sfG91UfiYpEYnj_xB5YRy07T5dv-9O_c",,"""Gemini models are trained on a dataset that is both multimodal and multilingual. Our pretraining dataset uses data from web documents, books, and code, and includes image, audio, and video data... We find that data quality is critical to a highly performing model, and believe that many interesting questions remain around finding the optimal
dataset distribution for pretraining.""",,,"This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks — notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of Gemini models in cross-modal reasoning and language understanding will enable a wide variety of use cases and we discuss our approach toward deploying them responsibly to users.
",Speculative,Multinational,Industry,2400.0,"Dylan Patel, author of SemiAnalysis, speculates that the training duration of Gemini may have been 100 days.",Google TPU v4,Hosted access (no API),,55000.0,,,633.0,,,,29827341.919963885,,
Gemini 1.0 Pro,"Multimodal,Language,Vision",Google DeepMind,Gemini Team,2023-12-06,Gemini: A Family of Highly Capable Multimodal Models,https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf,Significant use,"Default/free model on gemini.google.com
From paper:
""Broadly, we find that the performance of Gemini Pro outperforms inference-optimized models such as GPT-3.5 and performs comparably with several of the most capable models available, and Gemini Ultra outperforms all current models. In this section, we examine some of these findings.""",,,,"Not known.
Our reasoning and calculations for Gemini 1 Ultra are detailed in this Colab notebook.
https://colab.research.google.com/drive/1sfG91UfiYpEYnj_xB5YRy07T5dv-9O_c",,"""Gemini models are trained on a dataset that is both multimodal and multilingual. Our pretraining dataset uses data from web documents, books, and code, and includes image, audio, and video data... We find that data quality is critical to a highly performing model, and believe that many interesting questions remain around finding the optimal
dataset distribution for pretraining.""",,,"This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks — notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of Gemini models in cross-modal reasoning and language understanding will enable a wide variety of use cases and we discuss our approach toward deploying them responsibly to users.
",Unknown,Multinational,Industry,,,Google TPU v4,API access,,,,,633.0,,,,,,
Mamba-24M (SC09),Speech,"Carnegie Mellon University (CMU),Princeton University","Albert Gu, Tri Dao",2023-12-01,Mamba: Linear-Time Sequence Modeling with Selective State Spaces,https://arxiv.org/abs/2312.00752,SOTA improvement,"""SC09 is a benchmark speech generation dataset (Donahue, McAuley, and Puckette 2019; Warden 2018), consisting
of 1-second clips sampled at 16000 Hz of the digits “zero” through “nine” with highly variable characteristics. We
largely follow the autoregressive training setup and generation protocol of Goel et al. (2022).
Table 4 shows automated metrics of the Mamba-UNet model compared to a variety of baselines from Goel et al.
(2022): WaveNet (Oord et al. 2016), SampleRNN (Mehri et al. 2017), WaveGAN (Donahue, McAuley, and Puckette
2019), DiffWave (Z. Kong et al. 2021), and SaShiMi. A small Mamba model outperforms the state-of-the-art
(and much larger) GAN- and diffusion- based models. A larger model parameter-matched to the baselines further
improves on fidelity metrics dramatically.""",23400000.0,Table 4,,,SC09,"""SC09 is a benchmark speech generation dataset (Donahue, McAuley, and Puckette 2019; Warden 2018), consisting
of 1-second clips sampled at 16000 Hz of the digits “zero” through “nine” with highly variable characteristics""",305280000.0,"Section 4.4.2: ""We largely follow the autoregressive training setup and generation protocol of Goel et al. (2022)""
In which they model raw audio waveforms, such that each sample is a datapoint.
SC09 is 5.3 hours long. 5.3h * 3600 sec/h * 16k samples/sec = 305,280,000 samples
Appendix E.4.2: ""We used a learning rate of 0.002 and 200000 training steps at a batch size of 16... training went through 100 epochs""","Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.",Confident,"United States of America,United States of America","Academia,Academia",,,,,,,,,409.0,,,100.0,,,
Qwen-72B,Language,Alibaba,"Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, Tianhang Zhu",2023-11-30,,https://huggingface.co/Qwen/Qwen-72B,SOTA improvement,"SOTA on several Chinese benchmarks, with highest average rating overall for Chinese benchmarks:
https://opencompass.org.cn/leaderboard-llm",72000000000.0,72B,1.3e+24,"72 billion params, 3 trillion tokens
72b * 3T * 6 = 1.3e24",,"""It is pretrained on over 3 trillion tokens, including Chinese, English, multilingual texts, code, and mathematics, covering general and professional fields""",,,"Qwen-72B is the 72B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-72B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc. Additionally, based on the pretrained Qwen-72B, we release Qwen-72B-Chat, a large-model-based AI assistant, which is trained with alignment techniques.",Likely,China,Industry,,,,Open access (restricted use),,,,,,,,,,4000000.0,"Table 1 https://arxiv.org/abs/2309.16609
(this is uncertain because this table only lists sizes up to 14B. 72B was released after the paper)"
PPLX-70B-Online,Language,Perplexity,"Lauren Yang, Kevin Hu, Aarash Heydari, Gradey Wang, Dmitry Pervukhin, Nikhil Thota, Alexandr Yarats, Max Morozov, Denis Yarats",2023-11-29,Introducing PPLX Online LLMs ,https://blog.perplexity.ai/blog/introducing-pplx-online-llms,Significant use,"Probably significant use: ""Perplexity, which has just 41 employees and is based out of a shared working space in San Francisco, has 10 million monthly active users, an impressive number for a young start-up."" However, this includes everyone who uses Perplexity's app which also uses third party models like GPT-4.
https://www.nytimes.com/2024/02/01/technology/perplexity-search-ai-google.html
",70000000000.0,70B,,,,"Fine-tuned on website excerpts:
""Website excerpts, which we call “snippets”, are provided to our pplx-online models to enable responses with the most up-to-date information.
Fine-tuning: our PPLX models have been fine-tuned to effectively use snippets to inform their responses. Using our in-house data contractors, we carefully curate high quality, diverse, and large training sets in order to achieve high performance on various axes like helpfulness, factuality, and freshness.""",,,"We’re excited to share two new PPLX models: pplx-7b-online and pplx-70b-online! Our online models are focused on delivering helpful, up-to-date, and factual responses, and are publicly available via pplx-api, making it a first-of-its-kind API. pplx-7b-online and pplx-70b-online are also accessible via Perplexity Labs, our LLM playground.",Likely,United States of America,Industry,,,,API access,"""Fine-tuning: our PPLX models have been fine-tuned to effectively use snippets to inform their responses. Using our in-house data contractors, we carefully curate high quality, diverse, and large training sets in order to achieve high performance on various axes like helpfulness, factuality, and freshness. Our models are regularly fine-tuned to continually improve performance.""",,,Llama 2-70B,,,,,,,
Inflection-2,Language,Inflection AI,,2023-11-22,Inflection-2: The Next Step Up,https://inflection.ai/inflection-2,"Significant use,Training cost","Inflection-2 either already powers Pi or soon will: https://inflection.ai/inflection-2
Inflection has claimed that Pi has >1m users: https://x.com/inflectionAI/status/1699100179390210091?s=20",,,1.001e+25,"""Inflection-2 was trained on 5,000 NVIDIA H100 GPUs in fp8 mixed precision for ~10²⁵ FLOPs""
(the second 1 is there because of airtable being wonky, it's not a real sig fig)",,,,,"Today we are proud to announce that we have completed training of Inflection-2, the best model in the world for its compute class and the second most capable LLM in the world today. Our mission at Inflection is to create a personal AI for everyone. Just a few months ago, we announced Inflection-1 — a best-in-class language model that currently powers Pi. Our new model, Inflection-2, is substantially more capable than Inflection-1, demonstrating much improved factual knowledge, better stylistic control, and dramatically improved reasoning.",Confident,United States of America,Industry,,,NVIDIA H100 SXM5,Hosted access (no API),,5000.0,,,,,,,12961959.001361668,,
Claude 2.1,Language,Anthropic,,2023-11-21,Introducing Claude 2.1,https://www.anthropic.com/index/claude-2-1,Significant use,,,,,,,,,,"Our latest model, Claude 2.1, is now available over API in our Console and is powering our claude.ai chat experience. Claude 2.1 delivers advancements in key capabilities for enterprises—including an industry-leading 200K token context window, significant reductions in rates of model hallucination, system prompts and our new beta feature: tool use.",Unknown,United States of America,Industry,,,,API access,,,,Claude 2,0.0,,,,,,
Nemotron-3-8B,Language,NVIDIA,,2023-11-15,NVIDIA AI Foundation Models: Build Custom Enterprise Chatbots and Co-Pilots with Production-Ready LLMs,https://developer.nvidia.com/blog/nvidia-ai-foundation-models-build-custom-enterprise-chatbots-and-co-pilots-with-production-ready-llms/,SOTA improvement,"""The Nemotron-3-8B-QA model offers state-of-the-art performance, achieving a zero-shot F1 score of 41.99% on the Natural Questions dataset. This metric measures how closely the generated answer resembles the truth in ‌QA. """,8000000000.0,,1.8e+23,"https://huggingface.co/nvidia/nemotron-3-8b-base-4k
""This model was trained on a dataset containing 3.8 Trillion tokens of text""
8 billion * 3.8 trillion * 6 = 1.8e23
Also, using the hardware method: ""1,024 A100s were used for 19 days to train the model.""
19*1024 * 312 trillion * 24 * 3600 * 0.3 = 1.57e23",,"""NVIDIA models are trained on a diverse set of public and proprietary datasets. This model was trained on a dataset containing 3.8 Trillion tokens of text. The dataset contains 53 different human languages (including English, German, Russian, Spanish, French, Japanese, Chinese, Italian, and Dutch) and 37 programming languages. The model also uses the training subsets of downstream academic benchmarks from sources like FLANv2, P3, and NaturalInstructions v2""",,,"Large language models (LLMs) are revolutionizing data science, enabling advanced capabilities in natural language understanding, AI, and machine learning. Custom LLMs, tailored for domain-specific insights, are finding increased traction in enterprise applications.
The NVIDIA Nemotron-3 8B family of foundation models is a powerful new tool for building production-ready generative AI applications for the enterprise–fostering innovations ranging from customer service AI chatbots to cutting-edge AI products.",Likely,United States of America,Industry,456.0,19 days,NVIDIA A100,Open access (restricted use),,1024.0,0.34,,,,,,214467.02013524104,,
Qwen-Audio-Chat,"Language,Speech,Audio",Alibaba,"Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, Jingren Zhou",2023-11-14,Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models,https://arxiv.org/abs/2311.07919,SOTA improvement,"""A notable achievement of Qwen-Audio is its state-of-the-art performance on the test set of Aishell1, cochlscene, ClothoAQA, and VocalSound""",8460000000.0,"the model has two components - audio and language.
670M + 7.7B = 8.46B
""The audio encoder is composed of 640M parameters""
""Qwen-Audio incorporates a large language model as its foundational component.
The model is initialized using pre-trained weights derived from Qwen-7B (Bai et al., 2023a). Qwen-7B is a 32-layer Transformer decoder model with a hidden size of 4096, encompassing a total of 7.7B parameters.""",,,,multiple audio and language sources,,not clear," Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variations due to differences in task focus, language, granularity of annotation, and text structure. To overcome the one-to-many interference, we carefully design a multi-task training framework by conditioning on a sequence of hierarchical tags to the decoder for encouraging knowledge sharing and avoiding interference through shared and specified tags respectively. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from various audios and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios. ",Likely,China,Industry,,,,Open access (restricted use),,,,,31.0,,,,,,
GraphCast,Earth science,Google DeepMind,"Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, Alexander Merose, Stephan Hoyer, George Holland, Oriol Vinyals, Jacklynn Stott, Alexander Pritzel, Shakir Mohamed, Peter Battaglia",2023-11-14,Learning skillful medium-range global weather forecasting,https://www.science.org/doi/epdf/10.1126/science.adi2336,SOTA improvement,"""Our state-of-the-art model delivers 10-day weather predictions at unprecedented accuracy in under one minute""",,Not mentioned in paper.,2.1e+22,"""Training GraphCast took roughly four weeks on 32 Cloud TPU v4 devices using batch parallelism.""
Section 4.6: ""we use bfloat16 floating point precision""
2.1e22 = 2.75E+14 FLOP/s * 32 * 60* 60 * 24 * 7 * 4",,"According to the blog post, ""we trained GraphCast on four decades of weather reanalysis data, from the ECMWF’s ERA5 dataset. This trove is based on historical weather observations such as satellite images, radar, and weather stations using a traditional NWP to ‘fill in the blanks’ where the observations are incomplete, to reconstruct a rich record of global historical weather.""
https://deepmind.google/discover/blog/graphcast-ai-model-for-faster-and-more-accurate-global-weather-forecasting/",,,,Speculative,Multinational,Industry,,,,,,,,,,,,,,,
Volcano 13B,Language,"Korea University,Korea Advanced Institute of Science and Technology (KAIST),LG","Seongyun Lee, Sue Hyun Park, Yongrae Jo, Minjoon Seo",2023-11-13,Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision,https://arxiv.org/abs/2311.07362,SOTA improvement,"""Volcano effectively reduces multimodal hallucination and achieves state-of-the-art on MMHal-Bench, POPE, and GAVIE"" (hallucination benchmarks)",13000000000.0,13B,4.56e+22,"Base model is LLaVa-1.5 13B, which used 4.55e22 FLOP (mostly coming from Llama base)
""For this research, we used an NVIDIA A100-SXM4-80GB GPU and an AMD EPYC 7513 32-Core Processor running at 2.0778 GHz. Training
VOLCANO 7B required 8 GPUs and took a total of 15 hours, while training VOLCANO 13B took 30 hours.""
3.12e14 * 8 * 30 * 3600 * 0.3 = 8.1e19 finetune compute",,"trained on synthetic data: ""To train VOLCANO, we collect initial responses for
visual questions from an open-source LMM and
generate feedback and revisions using a proprietary
LLM as shown in Figure 3 (Akyürek et al., 2023;
Madaan et al., 2023; Ye et al., 2023b; Wang et al.,
2023d; Kim et al., 2023).""
https://huggingface.co/datasets/kaist-ai/volcano-train",,"https://huggingface.co/datasets/kaist-ai/volcano-train
558k image-text pairs, rest of dataset is ~1M examples of text data; length per sequence is not clear","Large multimodal models (LMMs) suffer from multimodal hallucination, where they provide incorrect responses misaligned with the given visual information. Recent works have conjectured that one of the reasons behind multimodal hallucination might be due to the vision encoder failing to ground on the image properly. To mitigate this issue, we propose a novel approach that leverages self-feedback as visual cues. Building on this approach, we introduce Volcano, a multimodal self-feedback guided revision model. Volcano generates natural language feedback to its initial response based on the provided visual information and utilizes this feedback to self-revise its initial response. Volcano effectively reduces multimodal hallucination and achieves state-of-the-art on MMHal-Bench, POPE, and GAVIE. It also improves on general multimodal abilities and outperforms previous models on MM-Vet and MMBench. Through a qualitative analysis, we show that Volcano's feedback is properly grounded on the image than the initial response. This indicates that Volcano can provide itself with richer visual information, helping alleviate multimodal hallucination. We publicly release Volcano models of 7B and 13B sizes along with the data and code at this https URL.",Likely,"Korea (Republic of),Korea (Republic of),Korea (Republic of)","Academia,Academia,Industry",30.0,,NVIDIA A100 SXM4 80 GB,Open access (non-commercial),"""For this research, we used an NVIDIA A100- SXM4-80GB GPU and an AMD EPYC 7513 32-Core Processor running at 2.0778 GHz. Training VOLCANO 7B required 8 GPUs and took a total of 15 hours, while training VOLCANO 13B took 30 hours""
= 8 * 312 teraflops * 30 * 3600 * 0.3 utilization (assumed)
= 8.1e19
",,,LLaVA 1.5,12.0,,8.1e+19,1.0,,,
SPHINX (Llama 2 13B),"Vision,Language","Shanghai AI Lab,Chinese University of Hong Kong (CUHK),ShanghaiTech University","Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, Jiaming Han, Siyuan Huang, Yichi Zhang, Xuming He, Hongsheng Li, Yu Qiao",2023-11-13,"SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models",https://arxiv.org/abs/2311.07575,SOTA improvement,"""as shown in Figure 2, SPHINX can achieve impressive fine-grained visual perception for high-resolution images, which exhibits state-of-the-art performance on extensive evaluation benchmarks, e.g., MMBench (Liu et al., 2023f), MME (Fu et al., 2023a), and POPE (Li et al., 2023e).""",19900000000.0,"SPHINX + Llama 2 13B
SPHINX component involves four vision encoders:
- CLIP - ViT
- CLIP - ConvNeXt V2 (89M to 659M params, depending on size)
- DinoV2 - ViT (22M to 1.14B params, depending on size)
- Q-former (188M params)
Also involves two projection networks
Huggingface Hub model files appear to be 39.8GB. Assuming models are stored in fp16 there are 2 bytes per parameter, so 39.8 / 2 = 19.9B parameters.",3.04e+22,"""The pre-training time is around 125 hours on 32 A100 GPUs with a 7B
language model and about twice the time with a 13B language model... The fine-tuning takes about 38 hours with 16 A100 GPUs with a 13B
language model.""
((125*2 * 32) + (38 * 16)) * 3.12e14 * 3600 * 0.3 = 2.9e21
Component vision encoders were initialized from pre-trained:
- CLIP ViT: 1.5e22 FLOPs for L/14@336
- ConvNeXt V2: 6.8e21 FLOPs for largest
- DinoV2: 7.42e+21 FLOPs for largest
- Q-former: 1.2e21 FLOPs for largest
(Based on full parameter count, SPHINX probably uses largest models)
Sum: 3.04e22 FLOPs",LAION-400M,,,,"We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning, and design task-specific instructions to avoid inter-task conflict. In addition to the basic visual question answering, we include more challenging tasks such as region-level understanding, caption grounding, document layout detection, and human pose estimation, contributing to mutual enhancement over different scenarios. Additionally, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and information granularity, providing language models with more robust image representations. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications. On top of this, we further propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. With a mixing of different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. We hope our work may cast a light on the exploration of joint mixing in future MLLM research.",Likely,"China,Hong Kong,China","Academia,Academia,Academia",290.0,"""The pre-training time is around 125 hours on 32 A100 GPUs with a 7B
language model and about twice the time with a 13B language model.""
"" The fine-tuning takes about 38 hours with 16 A100 GPUs with a 13B
language model.""",NVIDIA A100 SXM4 40 GB,,"32 A100 * 312 TFLOPS/A100 * 290 hours * 40% utilization ~= 4e21 FLOP
https://www.wolframalpha.com/input?i=250+hours+*+312+TFLOPS+*+32+*+0.4",32.0,,Llama 2-13B,,,4.00000000001e+21,,239188.6875340231,,
MultiBand Diffusion,"Audio,Speech","Meta AI,Hebrew University of Jerusalem,LORIA","Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, Alexandre Défossez",2023-11-08,From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion,https://arxiv.org/abs/2308.02560,SOTA improvement,"""At equal bit rate, the proposed approach outperforms state-of-the-art generative techniques in terms of perceptual quality""",,,2.6e+19,"""It takes around 2 days on 4 Nvidia V100 with 16 GB to train one of the 4 models.""
125 TFLOP/s peak for V100 SXM (not clear which variant they used; could be PCIe given the small GPU count, but either way the result is the same order of magnitude, hence "Confident")
4 * 125 trillion * 2 * 24 * 3600 * 0.3 = 2.6e19","Common Voice,DNS","""We train on a diverse set of domains and data. We use speech from the train set of Common Voice 7.0
(9096 hours) [Ardila et al., 2019] together with the DNS challenge 4 (2425 hours) [Dubey et al., 2022].
For music, we use the MTG-Jamendo dataset (919h) [Bogdanov et al., 2019]. For the environmental
sound we use FSD50K (108 hours) [Fonseca et al., 2021] and AudioSet (4989 hours) [Gemmeke
et al., 2017]. We used AudioSet only for the research that is described in the publication and for the
benefit of replicability. For evaluation, we also use samples from an internal music dataset.""",,~16k hours,"Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generate audible artifacts when the conditioning is flawed or imperfect. An alternative modeling approach is to use diffusion models. However, these have mainly been used as speech vocoders (i.e., conditioned on mel-spectrograms) or generating relatively low sampling rate signals. In this work, we propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality (e.g., speech, music, environmental sounds) from low-bitrate discrete representations. At equal bit rate, the proposed approach outperforms state-of-the-art generative techniques in terms of perceptual quality. Training and, evaluation code, along with audio samples, are available on the facebookresearch/audiocraft Github page.",Confident,"United States of America,Israel,France","Industry,Academia,Academia",48.0,around 2 days,NVIDIA V100,Open source,,,,,3.0,,,,22.81032329809286,,
OmniVec,"Multimodal,Vision,Speech,Language",TensorTour,"Siddharth Srivastava, Gaurav Sharma",2023-11-07,OmniVec: Learning robust representations with cross modal sharing,https://arxiv.org/abs/2311.05709v1,SOTA improvement,"Table 13.
E.g. SOTA on ImageNet at 92.4 top-1 accuracy",,,,,,Many datasets across several modalities,,,"Majority of research in learning based methods has been towards designing and training networks for specific tasks. However, many of the learning based tasks, across modalities, share commonalities and could be potentially tackled in a joint framework. We present an approach in such direction, to learn multiple tasks, in multiple modalities, with a unified architecture. The proposed network is composed of task specific encoders, a common trunk in the middle, followed by task specific prediction heads. We first pre-train it by self-supervised masked training, followed by sequential training for the different tasks. We train the network on all major modalities, e.g.\ visual, audio, text and 3D, and report results on 22 diverse and challenging public benchmarks. We demonstrate empirically that, using a joint network to train across modalities leads to meaningful information sharing and this allows us to achieve state-of-the-art results on most of the benchmarks. We also show generalization of the trained network on cross-modal tasks as well as unseen datasets and tasks.",Unknown,United States of America,Industry,,,,,"Appears to build on several models, like BERT and ViT (Table 1)",,,BERT-Large,14.0,,,,,,
mPLUG-Owl2,"Vision,Language",Alibaba,"Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou",2023-11-07,mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration,https://arxiv.org/abs/2311.04257,SOTA improvement,"""Extensive experiments illustrate the effectiveness and generalization abilities of mPLUG-Owl2, which achieves state-of-the-art performance on 8 classic vision-language benchmarks using a single generic model.""",7120000000.0,"""As depicted in Figure 2, our model, referred to as mPLUGOwl2, is composed of three main components: a fundamental vision encoder, a visual abstractor, and a language decoder. Specifically, we utilize ViT-L/14 as the
vision encoder and LLaMA-2-7B [58] as the language decoder""
ViT-L/14 has 123M parameters and Llama 2 7B has 7B parameters.",,"ViT-L/14 and Llama 2-7b compute, plus 1.7e19 joint pretrain FLOP (6 * 400M * 7.1B) and 4e16 joint finetune FLOP. Everything is a negligible fraction except the Llama 2 compute.",,"""mPLUG-Owl2 is first pre-trained on image-text pairs and fine-tunes on mono-modal and multi-modal instruction data. For pre-training data, we randomly pick about 400 million image-text pairs from five public datasets: Conceptual Captions (CC3M/CC12M) [9], COCO [35], Laionen [49], COYO [7], DataComp [18]. For instruction data, we collect 5 types of datasets including 1) image captioning (i.e., TextCaps [53], COCO [35]); 2) image question answering (i.e., VQAv2 [21], OKVQA [43], OCR-VQA [44], GQA [24], and A-OKVQA [50]); 3) region-aware QA (i.e., Ref-COCO [69], VisualGenome [26]); 4) multi-modal instruct data (i.e., LLaVA-instruct-150K [38]); 5) text-only instruct data (i.e., ShareGPT-80K [1], SlimOrca [34]). Details can be found in the Appendix.""
According to the appendix, the instruction-tuning dataset was 1.23MB total across text, dialog, captions, and visual question-answering. This can't be much more than 1.5M updates per epoch, and the paper says ""For the instruction tuning stage, we train the whole model for 1 epoch with a learning rate of 2e-5 and batch size 256"".",400000000.0,,"Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However, previous methods primarily focus on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 is capable of generalizing both text tasks and multi-modal tasks and achieving state-of-the-art performances with a single generic model. Notably, mPLUG-Owl2 is the first MLLM model that demonstrates the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path in the development of future multi-modal foundation models.
",Speculative,China,Industry,,,,,https://www.wolframalpha.com/input?i=6+*+400+million+*+7.12+billion,,,Llama 2-7B,,,1.7000000001e+19,1.0,,,
GPT-4 Turbo,"Multimodal,Vision,Language",OpenAI,,2023-11-06,New models and developer products announced at DevDay,https://openai.com/blog/new-models-and-developer-products-announced-at-devday,SOTA improvement,"""More capable"" than GPT-4 according to OpenAI, with larger context window",,Not known. Maybe smaller/sparser than GPT-4.,,,,,,,"Today, we shared dozens of new additions and improvements, and reduced pricing across many parts of our platform. These include:
New GPT-4 Turbo model that is more capable, cheaper and supports a 128K context window",Unknown,United States of America,Industry,,,,API access,,,,,,,,,,,
CogVLM,"Multimodal,Vision,Language","Tsinghua University,Zhipu AI,Beihang University","Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang",2023-11-06,CogVLM: Visual Expert for Pretrained Language Models,"https://arxiv.org/abs/2311.03079
https://huggingface.co/THUDM/cogvlm-chat-hf
https://github.com/THUDM/CogVLM
",SOTA improvement,"""CogVLM-17B
achieves state-of-the-art performance on 17 clas-
sic cross-modal benchmarks, including 1) im-
age captioning datasets: NoCaps, Flicker30k, 2)
VQA datasets: OKVQA, TextVQA, OCRVQA,
ScienceQA, 3) LVLM benchmarks: MM-
Vet, MMBench, SEED-Bench, LLaVABench,
POPE, MMMU, MathVista, 4) visual grounding
datasets: RefCOCO, RefCOCO+, RefCOCOg,
Visual7W. Codes and checkpoints are available at
https://github.com/THUDM/CogVLM""",17000000000.0,"CogVLM-17B has 10 billion vision parameters and 7 billion language parameters. However, ""the total number of trainable parameters is 6.5B"".
""CogVLM model comprises four fundamental components: a vision transformer (ViT) encoder, an MLP adapter, a pretrained large language model (GPT), and a visual expert module.""
ViT: EVA2-CLIP-E, last layer removed (5B params with last layer, non-trainable)
MLP adapter: 2 layers, parameter count unavailable
GPT: Vicuna1.5-7B (7B params)
Visual expert module: parameter count unclear",2.311e+22,"from table 8 on page 17
230.1 petaFLOP/s-days, so 10**15 * 24 * 3600 * 230.1 = 1.988e22
Since this training uses pretrained weights from EVA02-CLIP-E and Vicuna1.5-7B, we report the full number of FLOPs baked into the model.
EVA02-CLIP-g/14 is stated to have taken 25 days to train 12B samples using 64 A100-40GB GPUs, implying:
25 days * 24 hr/day * 3600 sec/hr * 64 GPU * 7.80E+13 FLOP/GPU-sec * 30% efficiency = 3.23e21
EVA02-CLIP-E doesn't give a training time; it saw 1/4 as many samples as the g/14 model but has 4.27x more parameters; as a rough estimate, assume it took the same number of FLOPs to train.
Vicuna1.5-7B is stated to have cost $140, so training compute is likely negligible.","VQAv2,LAION-2B,COYO-700M,OKVQA,TextVQA,OCR-VQA,ScienceQA,LLaVA-Instruct-150k,LRV-Instruction,LLaVAR,Flickr30K Entities,RefCOCO,Visual7W,VisualGenome,COCO,TextCaps","Pretraining uses LAION-2B, COYO-700M, plus a newly created visual grounding dataset of 40M images.
Generalist models CogVLM-Chat and CogVLM-Grounding are additionally finetuned on VQAv2, OKVQA, TextVQA, OCRVQA, ScienceQA, LLaVA-Instruct, LRV-Instruction, LLaVAR, Flickr30K Entities, RefCOCO, Visual7W, and VisualGenome.
Additional tests finetune on the training sets from COCO and TextCaps.",1518534581.0,"After filtering, about 1.5B image-text pairs are left for pretraining in stage one. Stage two of pretraining adds a visual grounding dataset of 40M images with generated noun bounding boxes. These are filtered from LAION-115M so that 75% of images contain at least two bounding boxes.
Two different kinds of finetuning are done, each using a number of datasets:
- CogVLM-Chat: VQAv2 (11059040), OKVQA (70275), TextVQA (453360), OCRVQA (1002146), ScienceQA (21208), LLaVAInstruct (150000), LRV-Instruction (300000), LLaVAR (1633000)
- CogVLM-Grounding: Flickr30K Entities (520000), RefCOCO (142209), Visual7W (889388), VisualGenome (1700000)
Additional experiments finetune using the training sets from COCO (413915 in train) and TextCaps (109765 in train)
In sum, pretraining and finetuning appear to contain 1,500,000,000 and 18,534,581 datapoints, respectively.","We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular shallow alignment method which maps image features into the input space of language model, CogVLM bridges the gap between the frozen pretrained language model and image encoder by a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks the 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Codes and checkpoints are available at this https URL. ",Confident,"China,China,China","Academia,Industry,Academia",,,,Open access (restricted use),Trained from Vicuna1.5-7B weights,,,Vicuna-7B,126.0,,2e+22,,,,"8192 in pretraining stage 1, 1024 in stage 2"
LLaVA 1.5,"Multimodal,Language,Vision","University of Wisconsin Madison,Microsoft Research","Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee",2023-11-05,Improved Baselines with Visual Instruction Tuning,"https://arxiv.org/abs/2310.03744,
https://huggingface.co/liuhaotian/llava-v1.5-13b",SOTA improvement,"from abstract: ""we establish stronger baselines that achieve state-of-the-art across 11 benchmark""",13000000000.0,"from abstract ""Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. """,4.55e+22,"""Due to the increased image input resolution to 336^2, the training of LLaVA-1.5 is ∼2× as long as LLaVA: ∼6 hours of pretraining and ∼20 hours of visual instruction tuning using 8× A100s.""
26 * 3600 * 8 * 3.12e14 * 0.3 = 7.0e19
Fine-tuned from Vicuna-13B which is fine-tuned Llama-13B, which was 4.55e22 FLOP",,from https://huggingface.co/liuhaotian/llava-v1.5-13b#training-dataset,1200000.0,1.2M text-image pairs from https://huggingface.co/liuhaotian/llava-v1.5-13b#training-dataset,"Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available. ",Confident,"United States of America,United States of America","Academia,Industry",24.0,"from abstract ""Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. """,NVIDIA A100,Open access (restricted use),"8 * 312e12 * 24 * 3600 * 0.3 = 6.469632e+19 = num gpus * peak flops * time in seconds * assumed utilization rate
from abstract ""Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node.""",8.0,,Vicuna-13B,626.0,,6.5e+19,,,,
Grok-1,Language,xAI,,2023-11-04,Announcing Grok,"https://x.ai/model-card/, https://x.ai/blog/grok-os",SOTA improvement,"""On these benchmarks, Grok-1 displayed strong results, surpassing all other models in its compute class, including ChatGPT-3.5 and Inflection-1""",314000000000.0,"""314B parameter Mixture-of-Experts model with 25% of the weights active on a given token"". So effectively 78B parameters
Mixture of 8 experts: https://github.com/xai-org/grok-1",2.90000000001e+24,"""On these benchmarks, Grok-1 displayed strong results, surpassing all other models in its compute class, including ChatGPT-3.5 and Inflection-1. It is only surpassed by models that were trained with a significantly larger amount of training data and compute resources like GPT-4""
Per table, Grok-1 is surpassed by Palm 2, Claude 2, GPT-4, so it required less compute than these three models. Palm 2 was trained on 7e24 FLOP.
GPT-3.5 is ~2.6e24. Inflection-1's compute is not public/known by us but Inflection says Inflection-1 compute was <= Palm-540B's (which was ~2.5e24).
For optimal training, our current working hypothesis is that you still need something like Chinchilla scaling on the total number of parameters in the model, even for MoE models, so optimal dataset size would be 20*310B tokens. With 25%*314B params active per forward pass, this would be around 3e24 FLOP.
https://www.wolframalpha.com/input?i=20*310+billion+*+6+*+25%25+*+314+billion",,,,,"Grok is an AI modeled after the Hitchhiker’s Guide to the Galaxy, so intended to answer almost anything and, far harder, even suggest what questions to ask!
Grok is designed to answer questions with a bit of wit and has a rebellious streak, so please don’t use it if you hate humor!
A unique and fundamental advantage of Grok is that it has real-time knowledge of the world via the 𝕏 platform. It will also answer spicy questions that are rejected by most other AI systems.
Grok is still a very early beta product – the best we could do with 2 months of training – so expect it to improve rapidly with each passing week with your help.",Likely,United States of America,Industry,,,,Open source,,,,,,,,,,,
RT-Trajectory,Robotics,"Google DeepMind,UC San Diego,Stanford University","Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, Priya Sundaresan, Peng Xu, Hao Su, Karol Hausman, Chelsea Finn, Quan Vuong, Ted Xiao",2023-11-03,RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches,https://arxiv.org/abs/2311.01977,SOTA improvement,"from blog https://deepmind.google/discover/blog/shaping-the-future-of-advanced-robotics/
""When tested on 41 tasks unseen in the training data, an arm controlled by RT-Trajectory more than doubled the performance of existing state-of-the-art RT models: it achieved a task success rate of 63%, compared with 29% for RT-2.""",,seems to be based on the RT-1 architecture (35M parameters) with some modifications (section 3.3),,"Given the architecture seems to use 35M parameters, it seems unlikely this is above 1e23 FLOP.",RT-1,"""We use the RT-1 (Brohan et al., 2023b) demonstration dataset for training""
also trained with retroactively-generated trajectories created by humans, by code written by GPT-4, and image generation models",,,"Generalization remains one of the most important desiderata for robust robot learning systems. While recently proposed approaches show promise in generalization to novel objects, semantic concepts, or visual distribution shifts, generalization to new tasks remains challenging. For example, a language-conditioned policy trained on pick-and-place tasks will not be able to generalize to a folding task, even if the arm trajectory of folding is similar to pick-and-place. Our key insight is that this kind of generalization becomes feasible if we represent the task through rough trajectory sketches. We propose a policy conditioning method using such rough trajectory sketches, which we call RT-Trajectory, that is practical, easy to specify, and allows the policy to effectively perform new tasks that would otherwise be challenging to perform. We find that trajectory sketches strike a balance between being detailed enough to express low-level motion-centric guidance while being coarse enough to allow the learned policy to interpret the trajectory sketch in the context of situational visual observations. In addition, we show how trajectory sketches can provide a useful interface to communicate with robotic policies: they can be specified through simple human inputs like drawings or videos, or through automated methods such as modern image-generating or waypoint-generating methods. We evaluate RT-Trajectory at scale on a variety of real-world robotic tasks, and find that RT-Trajectory is able to perform a wider range of tasks compared to language-conditioned and goal-conditioned policies, when provided the same training data.",Unknown,"Multinational,United States of America,United States of America","Industry,Academia,Academia",,,,,,,,,11.0,,,,,,
BLUUMI,Language,"University of Turku,Hugging Face","Risto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen, Jenna Kanerva, Hanna-Mari Kupari, Filip Ginter, Veronika Laippala, Niklas Muennighoff, Aleksandra Piktus, Thomas Wang, Nouamane Tazi, Teven Le Scao, Thomas Wolf, Osma Suominen, Samuli Sairanen, Mikko Merioksa, Jyrki Heinonen, Aija Vahtola, Samuel Antao, Sampo Pyysalo",2023-11-03,FinGPT: Large Generative Models for a Small Language,https://arxiv.org/abs/2311.05640,SOTA improvement,"SOTA for Finnish: ""Our best monolingual model outperforms this result by over
10% points and the BLUUMI model by over 20% points, representing a substantial advance in the
state of the art in the capability of generative models trained for Finnish.""",176000000000.0,176 billion,,,,Finnish data from several sources,38000000000.0,38B tokens,"Large language models (LLMs) excel in many tasks in NLP and beyond, but most open models have very limited coverage of smaller languages and LLM work tends to focus on languages where nearly unlimited data is available for pretraining. In this work, we study the challenges of creating LLMs for Finnish, a language spoken by less than 0.1% of the world population. We compile an extensive dataset of Finnish combining web crawls, news, social media and eBooks. We pursue two approaches to pretrain models: 1) we train seven monolingual models from scratch (186M to 13B parameters) dubbed FinGPT, 2) we continue the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176 billion parameter model we call BLUUMI. For model evaluation, we introduce FIN-bench, a version of BIG-bench with Finnish tasks. We also assess other model qualities such as toxicity and bias. Our models and tools are openly available at this https URL.",Likely,"Finland,Multinational","Academia,Industry",,,AMD Instinct MI250X,,"They ""continued pretraining"" of BLOOM on Finnish data. Don't think they specify the number of tokens they trained BLOOM/BLUUMI on; for their smaller models it was 300b.",,,BLOOM-176B,14.0,,,,,,
Yi-34B,Language,01.AI,"Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, Zonghong Dai",2023-11-02,Yi: Open Foundation Models by 01.AI,https://arxiv.org/abs/2403.04652,Significant use,"2nd most popular model on HuggingFace: https://decrypt.co/206195/new-open-source-ai-model-from-china-boasts-twice-the-capacity-of-chatgpt
also maybe the best open-source model, does better than Llama 2-70B on several benchmarks",34000000000.0,34b,6.1e+23,"""The dataset we use contains Chinese & English only. We used approximately 3T tokens"" sounds like this means it was trained on 3T tokens, not necessarily that the dataset contains 3T tokens?
If so, 34b * 3T * 6 = 6.1e23",,Chinese and English dataset,,,The Yi series models are large language models trained from scratch by developers at 01.AI.,Speculative,China,Industry,,,,Open access (restricted use),,,,,,,,,,,
Cohere Embed,Language,Cohere,"Nils Reimers, Elliott Choi, Amr Kayid, Alekhya Nandula, Manoj Govindassamy, Abdullah Elkady",2023-11-02,Cohere Command & Embed on Amazon Bedrock,https://txt.cohere.com/introducing-embed-v3/,SOTA improvement,"""We are releasing new English and multilingual Embed versions with either 1024 or 384 dimensions. All models can be accessed via our APIs. As of October 2023, these models achieve state-of-the-art performance among 90+ models on the Massive Text Embedding Benchmark (MTEB) and state-of-the-art performance for zero-shot dense retrieval on BEIR.""",,,,"https://docs.cohere.com/docs/environmental-impact
Embed v2 (older version) produced 6689.76 kg CO2 to train. Using the calculator Cohere links (https://mlco2.github.io/impact/) that's the equivalent of 80,000 TPUv3-hours in the ""us-west1"" region. That's 3.5e22 FLOP without considering utilization. However, I have no idea which region Cohere's GPUs are in (looks like CO2/energy can vary a lot by region), and they probably used a more recent GPU.",,,,,"We're excited to introduce Embed v3, our latest and most advanced embeddings model. Embed v3 offers state-of-the-art performance per trusted MTEB and BEIR benchmarks.",Unknown,Canada,Industry,,,,API access,,,,,,,,,,,
Skywork-13B,Language,Kunlun Inc.,"Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng Cheng, Weiwei Lü, Rui Hu, Chenxia Li, Liu Yang, Xilin Luo, Xuejie Wu, Lunan Liu, Wenjun Cheng, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Lei Lin, Xiaokun Wang, Yutuan Ma, Chuanhai Dong, Yanqi Sun, Yifu Chen, Yongyi Peng, Xiaojuan Liang, Shuicheng Yan, Han Fang, Yahui Zhou",2023-10-30,Skywork: A More Open Bilingual Foundation Model,https://arxiv.org/abs/2310.19341,SOTA improvement,"""We show that our model not only excels on popular benchmarks, but also achieves state of the art performance in Chinese language modeling on diverse domains""",13000000000.0,13B,2.5e+23,"""Our Skywork-13B is trained on a cluster of 64 NVIDIA-HGX-A800 nodes, a total of 512 A800-80G SXM GPUs... The training process of Skywork-13B spanned a total of 39 days.""
They note that ""we achieved a token throughput of 1873 per GPU per second and a model flops utilization (MFU) of 56.5%... "".
""MFU"" was coined in the Palm paper (https://arxiv.org/pdf/2204.02311.pdf) and only counts operations used to train the model, not all operations observed on the hardware. MFU is lower than traditionally measured utilization.
Using the 56.5% number, and a peak tensor performance of 623.8 TFLOPS for the A800, this suggests 512 * 623.8 TFLOPS * 39 days * 86400 seconds/day * 0.565 = 6.08e23 FLOP.
Based on C=6ND, with 13B parameters and 3.2T tokens, we have C=6*(13B)*(3.2T)=2.5e23 FLOP.
Since the reported MFU is quite high and would imply a higher compute usage than 6ND, the 623.8 TFLOPS figure likely refers to the A800's FP16 tensor peak with sparsity; using the dense peak of ~312 TFLOPS gives ~3.0e23 FLOP, much closer to the 6ND estimate.",SkyPile,"""In order to train Skywork-13B, we build SkyPile, a vast, high quality corpus comprising more than 6 trillion tokens. A segment of the corpus, comprising over 150 billion tokens of web text, has been open sourced to facilitate research and training on Chinese LLMs""",3180000000000.0,"The full SkyPile dataset is 6 trillion tokens, roughly half English and half Chinese: (https://huggingface.co/Skywork/Skywork-13B-base).
The model is trained for the equivalent of 0.53 epochs on the full dataset, or 3.18 trillion unique tokens. This is around 2.78 trillion words, based on an average of 1 word/token for the Chinese portion and 0.75 word/token on the English portion.","In this technical report, we present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts. This bilingual foundation model is the most extensively trained and openly published LLMs of comparable size to date. We introduce a two-stage training methodology using a segmented corpus, targeting general purpose training and then domain-specific enhancement training, respectively. We show that our model not only excels on popular benchmarks, but also achieves state of the art performance in Chinese language modeling on diverse domains. Furthermore, we propose a novel leakage detection method, demonstrating that test data contamination is a pressing issue warranting further investigation by the LLM community. To spur future research, we release Skywork-13B along with checkpoints obtained during intermediate stages of the training process. We are also releasing part of our SkyPile corpus, a collection of over 150 billion tokens of web text, which is the largest high quality open Chinese pre-training corpus to date. We hope Skywork-13B and our open corpus will serve as a valuable open-source resource to democratize access to high-quality LLMs.",Confident,China,Industry,940.0,39 days,NVIDIA A800,Open access (restricted use),,512.0,0.46,,41.0,,,1.0,,16000000.0,Table 3
ChatGLM3,"Multimodal,Language,Vision",Zhipu AI,,2023-10-27,Zhipu AI launches third-generation base model,https://www.zhipuai.cn/en/news/76,SOTA improvement,"Targeting GPT-4V, ChatGLM3 introduces iterative upgrades across several new capabilities, including:
CogVLM, with multi-modal understanding capabilities, which interprets image semantics and achieved SOTA on more than 10 international standard image and text evaluation datasets;",130000000000.0,"Highly speculative. The ChatGLM website https://chatglm.cn/ states that the model has hundreds of billions of parameters, so at least 100e9. It also states that the new model is based on ChatGLM2 and the GLM architectures. There is a previous GLM 130B model, so this may be the most likely size.",1.092e+24,"Highly speculative.
Assume 1 epoch on 1.4T tokens.
6 FLOP/token/param * 1.4T tokens * 130B params
https://www.wolframalpha.com/input?i=6*130+billion*1.4+trillion",,ChatGLM2 corpus pretraining plus human preference alignment training,1050000000000.0,"The ChatGLM website states that the latest ChatGLM service is based on (and upgraded from) ChatGLM2, which was trained on 1.4T tokens. Assume that ChatGLM3 is trained on at least the same number of tokens.
Sources:
https://chatglm.cn/
https://github.com/THUDM/ChatGLM2-6B/blob/main/README_EN.md
https://www.zhipuai.cn/en/news/76","On October 27, 2023, at the 2023 China Computer Conference (CNCC), Zhipu AI launched the fully self-developed third-generation large base model ChatGLM3 and related series of products.",Speculative,China,Industry,,,,,,,,,,,,,,,
DiT-XL/2 + CADS,Image generation,ETH Zurich,"Seyedmorteza Sadat, Jakob Buhmann, Derek Bradley, Otmar Hilliges, Romann M. Weber",2023-10-26,CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling,https://arxiv.org/abs/2310.17347v2,SOTA improvement,"""Further, using an existing pretrained diffusion model, CADS achieves a new state-of-the-art FID of 1.70 and 2.31 for class-conditional ImageNet generation at 256×256 and 512×512 respectively""",675000000.0,original parameter count for DiT-XL/2,,,ImageNet,,,,"While conditional diffusion models are known to have good coverage of the data distribution, they still face limitations in output diversity, particularly when sampled with a high classifier-free guidance scale for optimal image quality or when trained on small datasets. We attribute this problem to the role of the conditioning signal in inference and offer an improved sampling strategy for diffusion models that can increase generation diversity, especially at high guidance scales, with minimal loss of sample quality. Our sampling strategy anneals the conditioning signal by adding scheduled, monotonically decreasing Gaussian noise to the conditioning vector during inference to balance diversity and condition alignment. Our Condition-Annealed Diffusion Sampler (CADS) can be used with any pretrained model and sampling algorithm, and we show that it boosts the diversity of diffusion models in various conditional generation tasks. Further, using an existing pretrained diffusion model, CADS achieves a new state-of-the-art FID of 1.70 and 2.31 for class-conditional ImageNet generation at 256×256 and 512×512 respectively.",Likely,Switzerland,Academia,,,,,,,,DiT-XL/2,,,,,,,
CODEFUSION (Python),Language,"Microsoft,Microsoft Research","Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu, Gust Verbruggen",2023-10-26,CODEFUSION: A Pre-trained Diffusion Model for Code Generation,https://arxiv.org/abs/2310.17680,SOTA improvement,"See Table 1, SOTA in Python code generation",75000000.0,Table 1,7.92e+18,"V100 performance: 125 teraFLOPS according to https://www.nvidia.com/en-us/data-center/v100/
11 hours * 4 GPUs * 125 teraFLOPS/GPU * 0.40 utilization = 7.92e18 FLOP",,,4390400.0,"Section A3, Table 5: for python, 56k samples with an average length of 78.4 tokens","Imagine a developer who can only change their last line of code, how often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering earlier tokens generated. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on the task of natural language to code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules. Experiments show that CodeFusion (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M-175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy due to its better balance in diversity versus quality.",Confident,"United States of America,United States of America","Industry,Industry",11.0,"""The system used to run the experiments uses an Intel Core i7 processor (base at 1.8 GHz) along with 4 V100 GPU units, a 64-bit operating system, and 56 GB RAM. CODEFUSION took 8 hours to pre-train and 3 hours to fine-tune on average for each dataset.""",NVIDIA Tesla V100 SXM2 32 GB,,,,,,10.0,,,,8.542235671062665,,
QMoE: compressed 1T model,Language,"Institute of Science and Technology Austria (ISTA),Neural Magic","Elias Frantar, Dan Alistarh",2023-10-25,QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models,https://arxiv.org/abs/2310.16795,SOTA improvement,"Low memory usage of the compressed model; from the abstract: ""This enables, for the first time, the execution of a trillion-parameter model on affordable commodity hardware""",1600000000000.0,"""Concretely, QMoE can compress the 1.6 trillion parameter SwitchTransformer-c2048 model to less than 160GB""
Parameters of the base model (this work compresses the model - there is no learning from the data); the base model has 1.6T parameters",,,,this work compresses an existing model - there is no training dataset,,this work compresses an existing model - there is no training dataset,"Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts. For example, the SwitchTransformer-c2048 model has 1.6 trillion parameters, requiring 3.2TB of accelerator memory to run efficiently, which makes practical deployment challenging and expensive. In this paper, we present a solution to this memory problem, in form of a new compression and execution framework called QMoE. Specifically, QMoE consists of a scalable algorithm which accurately compresses trillion-parameter MoEs to less than 1 bit per parameter, in a custom format co-designed with bespoke GPU decoding kernels to facilitate efficient end-to-end compressed inference, with minor runtime overheads relative to uncompressed execution. Concretely, QMoE can compress the 1.6 trillion parameter SwitchTransformer-c2048 model to less than 160GB (20x compression, 0.8 bits per parameter) at only minor accuracy loss, in less than a day on a single GPU. This enables, for the first time, the execution of a trillion-parameter model on affordable commodity hardware, like a single server with 4x NVIDIA A6000 or 8x NVIDIA 3090 GPUs, at less than 5% runtime overhead relative to ideal uncompressed inference. The source code and compressed models are available at github.com/IST-DASLab/qmoe.",Likely,"Austria,United States of America","Academia,Industry",,,NVIDIA RTX A6000,Open source,"(1) * (38.71 * 10 ** 12) * (0.3) * (24 * 3600) = 1003363200000000000
(num gpu) * (peak flop) * (assumed utilization rate) * (time in seconds)
from the paper: ""This allows us to apply data-dependent compression to massive MoEs, while preserving the key feature of post-training
compression techniques: the ability to perform effective
compression using only modest computational resources,
e.g., a single NVIDIA A6000 GPU and less than one day of
compute.""
A6000 have 38.71 TFLOPs from https://www.techpowerup.com/gpu-specs/rtx-a6000.c3686",,,Switch,7.0,,1.0033632e+18,,,,
DALL·E 3,Image generation,OpenAI,"James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, Aditya Ramesh",2023-10-19,Improving Image Generation with Better Captions,https://cdn.openai.com/papers/dall-e-3.pdf,SOTA improvement,,,,,,,,,,"We show that prompt following abilities of text-to-image models can be substantially improved by training on highly descriptive generated image captions.
Existing text-to-image models struggle to follow detailed image descriptions and often ignore words or confuse the meaning of prompts. We hypothesize that this issue stems from noisy and inaccurate image captions in the training dataset. We address this by training a bespoke image captioner and use it to recaption the training dataset. We then train several text-to-image models and find that training on these synthetic captions reliably improves prompt following ability. Finally, we use these findings to build DALL-E 3: a new text-to-image generation system, and benchmark its performance on an evaluation designed to measure prompt following, coherence, and aesthetics, finding that it compares favorably to competitors. We publish samples and code for these evaluations so that future research can continue optimizing this important aspect of text-to-image systems.",Unknown,United States of America,Industry,,,,API access,,,,,299.0,,,,,,
ERNIE 4.0,"Multimodal,Language",Baidu,,2023-10-17,"Baidu Launches ERNIE 4.0 Foundation Model, Leading a New Wave of AI-Native Applications",https://www.prnewswire.com/news-releases/baidu-launches-ernie-4-0-foundation-model-leading-a-new-wave-of-ai-native-applications-301958681.html,Significant use,"Likely SOTA for Mandarin? But very little info available.
Lots of users (https://www.cnn.com/2023/12/15/tech/gpt4-china-baidu-ernie-ai-comparison-intl-hnk/index.html):
""Baidu says ERNIE has racked up 70 million users. That’s compared with 150 million users for ChatGPT, according to an estimate from Similarweb, a digital data and analytics company.""",,,,,,,,,"Baidu, Inc. (NASDAQ: BIDU and HKEX: 9888), a leading AI company with strong Internet foundation, today hosted its annual flagship technology conference Baidu World 2023 in Beijing, marking the conference's return to an offline format after four years. With the theme ""Prompt the World,"" this year's Baidu World conference saw Baidu launch ERNIE 4.0, Baidu's next-generation and most powerful foundation model offering drastically enhanced core AI capabilities. Baidu also showcased some of its most popular applications, solutions, and products re-built around the company's state-of-the-art generative AI.
Robin Li, Co-founder, Chairman and CEO of Baidu, announced ERNIE 4.0 at Baidu World 2023
""ERNIE 4.0 has achieved a full upgrade with drastically improved performance in understanding, generation, reasoning, and memory,"" Robin Li, Co-founder, Chairman and CEO of Baidu, said at the event. ""These four core capabilities form the foundation of AI-native applications and have now unleashed unlimited opportunities for new innovations.""
",Unknown,China,Industry,,,,,,,,,,,,,,,
RT-2-X,Robotics,Google DeepMind,Open X-Embodiment Collaboration,2023-10-13,Open X-Embodiment: Robotic Learning Datasets and RT-X Models,https://arxiv.org/abs/2310.08864,SOTA improvement,"""Emergent skills evaluation. To investigate the transfer of knowledge across robots, we conduct experiments with the Google Robot, assessing the performance on tasks like the ones shown in Fig. 5. These tasks involve objects and skills that are not present in the RT-2 dataset but occur in the Bridge dataset [95] for a different robot (the WidowX robot). Results are shown in Table II, Emergent Skills Evaluation column. Comparing rows (1) and (2), we find that RT-2-X outperforms RT-2 by ∼ 3×, suggesting that incorporating data from other robots into the training improves the range of tasks that can be performed even by a robot that already has large amounts of data available. Our results suggest that co-training with data from other platforms imbues the RT-2-X controller with additional skills for the platform that are not present in that platform’s original dataset.""",55000000000.0,55B,,,Open X-Embodiment,"""The Open X-Embodiment Dataset contains 1M+ real robot trajectories spanning 22 robot embodiments, from single robot arms to bi-manual robots and quadrupeds. The dataset was constructed by pooling 60 existing robot datasets from 34 robotic research labs around the world and converting them into a consistent data format for easy download and usage. We use the RLDS data format [119], which saves data in serialized tfrecord files and accommodates the various action spaces and input modalities of different robot setups, such as differing numbers of RGB cameras, depth cameras and point clouds. It also supports efficient, parallelized data loading in all major deep learning frameworks. For more details about the data storage format and a breakdown of all 60 datasets, see robotics-transformer-x.github.io.""",,,"Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website this https URL.",Confident,Multinational,Industry,,,,Unreleased,"""RT-2-X is trained via co-fine-tuning (similarly to the original RT-2 [9]), with an approximately one to one split of the original VLM data and the robotics data mixture.""
RT-2 is in turn a fine-tune of PaLI-X 55B",,,RT-2,85.0,,,,,,
Ferret (13B),"Multimodal,Language,Vision","Columbia University,Apple","Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang",2023-10-11,Ferret: Refer and Ground Anything Anywhere at Any Granularity,https://arxiv.org/abs/2310.07704,SOTA improvement,"""To evaluate this new capability, we introduce Ferret-Bench, covering three new types of tasks: Referring Description, Referring Reasoning, and Grounding in Conversation. We benchmark existing MLLMs and observe that Ferret can outperform the best of them by 20.4% on average.""",13000000000.0,13B,,"Fine-tuned from Vicuna-13B, which we don't have an estimate for. Finetuning cost is ~3.2e20.
""Training Details. We initialize the image encoder with CLIP-ViT-L/14@336p, the LLM with Vicuna, and the projection layer with LLaVA’s first-stage weights, leaving the visual sampler randomly initialized. After the initialization, Ferret is trained on the aforementioned GRIT data for three epochs, optimized by Loshchilov & Hutter (2017) with a learning rate of 2e − 5 and a batch size of 128. The training takes ∼5/2.5 days on 8 A100 GPU for a Ferret-13B/7B.""
5 days * 24 * 3600 * 8 GPUs * 312 TFLOP/s * 0.3 utilization (assumption) = 3.23e20",GRIT,"""In order to make the refer-and-ground capability in Ferret open-vocabulary, instruction-following, and robust, we collect GRIT, a Ground-and-Refer Instruction-Tuning dataset with 1.1M samples. GRIT contains multiple levels of spatial knowledge, covering objects, relationships, region descriptions, and complex reasoning. It includes both text-in location-out (grounding) and location-in textout (referring) data, as well as data that mixes location and text in both input and output""",,,"We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with 95K hard negative data to promote model robustness. The resulting model not only achieves superior performance in classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanded multimodal chatting. Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation in object hallucination.",Likely,"United States of America,United States of America","Academia,Industry",120.0,"""The training takes ∼5/2.5 days on 8 A100 GPU for a Ferret-13B/7B.""",NVIDIA A100,Open access (non-commercial),"""The training takes ~5 days on 8 A100 GPU for a Ferret-13B""
5 days * 24 * 3600 * 8 GPUs * 312 TFLOP/s * 0.3 utilization (assumption) = 3.23e20",8.0,,Vicuna-13B,91.0,,3.23e+20,3.0,,,
FinGPT-13B,Language,"University of California Los Angeles (UCLA),Columbia University,New York University (NYU)","Neng Wang, Hongyang Yang, Christina Dan Wang",2023-10-07,FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets,https://arxiv.org/abs/2310.04793; https://github.com/AI4Finance-Foundation/FinGPT,SOTA improvement,SOTA for financial sentiment analysis,13000000000.0,"Finetunes using LoRA, so only trains 3.67 million parameters",1.6e+23,From Llama 2-13B,,Financial sentiment data (for fine-tuning): https://huggingface.co/datasets/FinGPT/fingpt-sentiment-train,,,"In the swiftly expanding domain of Natural Language Processing (NLP), the potential of GPT-based models for the financial sector is increasingly evident. However, the integration of these models with financial datasets presents challenges, notably in determining their adeptness and relevance. This paper introduces a distinctive approach anchored in the Instruction Tuning paradigm for open-source large language models, specifically adapted for financial contexts. Through this methodology, we capitalize on the interoperability of open-source models, ensuring a seamless and transparent integration. We begin by explaining the Instruction Tuning paradigm, highlighting its effectiveness for immediate integration. The paper presents a benchmarking scheme designed for end-to-end training and testing, employing a cost-effective progression. Firstly, we assess basic competencies and fundamental tasks, such as Named Entity Recognition (NER) and sentiment analysis to enhance specialization. Next, we delve into a comprehensive model, executing multi-task operations by amalgamating all instructional tunings to examine versatility. Finally, we explore the zero-shot capabilities by earmarking unseen tasks and incorporating novel datasets to understand adaptability in uncharted terrains. Such a paradigm fortifies the principles of openness and reproducibility, laying a robust foundation for future investigations in open-source financial large language models (FinLLMs).",Likely,"United States of America,United States of America,United States of America","Academia,Academia,Academia",17.25,https://github.com/AI4Finance-Foundation/FinGPT?tab=readme-ov-file,NVIDIA GeForce RTX 3090,Open source,"fine-tuned Llama 2 13B
RTX 3090 for 17 hours, at a cost of $17
35.58 trillion FLOP/s * 17 hours * 3600 s/hour * 0.3 utilization = 6.532488e+17",1.0,,Llama 2-13B,10.0,"Finetuning cost for FinGPT v3.3 given as $17.25 at github repo; paper notes the cost to train a financial model using their methods is ""typically"" between $100 - $300",6.532488e+17,,,,
CTM (CIFAR-10),Image generation,"Stanford University,Sony","Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, Stefano Ermon",2023-10-01,Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion,https://arxiv.org/abs/2310.02279v1,SOTA improvement,"""CTM... achieves new state-of-the-art FIDs for single-step diffusion model sampling on CIFAR-10 (FID 1.73)""",,,,"Almost certainly <1e23 FLOP due to the small scale experiments.
",CIFAR-10,,,,"Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion model sampling at the cost of sample quality but lack a natural way to trade-off quality for speed. To address this limitation, we propose Consistency Trajectory Model (CTM), a generalization encompassing CM and score-based models as special cases. CTM trains a single neural network that can -- in a single forward pass -- output scores (i.e., gradients of log-density) and enables unrestricted traversal between any initial and final time along the Probability Flow Ordinary Differential Equation (ODE) in a diffusion process. CTM enables the efficient combination of adversarial training and denoising score matching loss to enhance performance and achieves new state-of-the-art FIDs for single-step diffusion model sampling on CIFAR-10 (FID 1.73) and ImageNet at 64X64 resolution (FID 2.06). CTM also enables a new family of sampling schemes, both deterministic and stochastic, involving long jumps along the ODE solution trajectories. It consistently improves sample quality as computational budgets increase, avoiding the degradation seen in CM. Furthermore, CTM's access to the score accommodates all diffusion model inference techniques, including exact likelihood computation.",Unknown,"United States of America,Japan","Academia,Industry",,,NVIDIA V100,,,,,,36.0,,,,,,
Show-1,Video,National University of Singapore,"David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, Mike Zheng Shou",2023-09-27,Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation,https://arxiv.org/abs/2309.15818,SOTA improvement,"""Our approach achieves state-of-the-art performance on standard benchmarks including UCF-101 and MSR-VTT.""",,,,,WebVid-10M,"""WebVid-10M is a large-scale dataset of short videos with textual descriptions sourced from stock footage sites. The videos are diverse and rich in their content. 10.7M video-caption pairs. 52K total video hours.""",,,"Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed as Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video of strong text-video correlation. After that, we propose a novel expert translation method that employs the latent-based VDMs to further upsample the low-resolution video to high resolution. Compared to latent VDMs, Show-1 can produce high-quality videos of precise text-video alignment; Compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15G vs 72G). We also validate our model on standard video generation benchmarks. Our code and model weights are publicly available at this https URL.",Unknown,Singapore,Academia,,,NVIDIA A100,,,,,,,,,,,,
GPT-4V,"Multimodal,Vision,Language",OpenAI,,2023-09-25,GPT-4V(ision) system card,https://cdn.openai.com/papers/GPTV_System_Card.pdf,Significant use,Incorporated into ChatGPT,,,,,,,,,"GPT-4 with vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs provided by the user, and is the latest capability we are making broadly available. Incorporating additional modalities (such as image inputs) into large language models (LLMs) is viewed by some as a key frontier in artificial intelligence research and development. Multimodal LLMs offer the possibility of expanding the impact of language-only systems with novel interfaces and capabilities, enabling them to solve new tasks and provide novel experiences for their users. In this system card, we analyze the safety properties of GPT-4V. Our work on safety for GPT-4V builds on the work done for GPT-4 and here we dive deeper into the evaluations, preparation, and mitigation work done specifically for image inputs.",Unknown,United States of America,Industry,,,,API access,,,,,0.0,,,,,,
AlphaMissense,Biology,Google DeepMind,"Jun Cheng, Guido Novati, Joshua Pan, Clare Bycroft, Akvile ̇Žemgulyte ̇, Taylor Applebaum, Alexander Pritzel, Lai Hong Wong, Michal Zielinski, Tobias Sargeant, Rosalia G. Schneider,Andrew W. Senior, John Jumper, Demis Hassabis, Pushmeet Kohli,Žiga Avsec",2023-09-22,Accurate proteome-wide missense variant effect prediction with AlphaMissense,https://www.science.org/doi/10.1126/science.adg7492,SOTA improvement,"""By combining structural context and evolutionary conservation, our model achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data."" [Abstract]",93000000.0,"""The model architecture is similar to that of AlphaFold (21), with minor modifications""
Reference is to the AlphaFold 2 paper; that model had 93 million parameters",,"From supplementary materials: ""We independently trained three AlphaFold models and fine-tuned them independently on variants. We followed the training procedure described in (21), (only the “Initial training” stage) ... Fine-tuning is carried out until auROC of the evaluation set converges (about 350k samples, each training sample contains maximum 50 variants)""
Table S4 gives details. Total samples seen across the three pretraining models are (7.8M + 7.5M + 5.85M) = 21.15M
Each sequence is cropped to 256 elements long, which suggests 5.4B tokens seen in training.",,"Supplemental materials section on training data lists sources:
75% of pre-training structures are self-distillation data sampled from MGnify and UniRef90.
Fine-tuning data on benign variants come from gnomAD v2.1.1 (1.25M variants), the Great Ape project (95k variants), and FigShare (2k variants).
Fine-tuning data for pathogenic variants are sampled from the missense proteome map to create a dataset with balanced positive and negative labels.
Suggests a total of 2.7M variants, each 256 long.",,,"The vast majority of missense variants observed in the human genome are of unknown clinical significance. We present AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity. By combining structural context and evolutionary conservation, our model achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data. The average pathogenicity score of genes is also predictive for their cell essentiality, capable of identifying short essential genes that existing statistical approaches are underpowered to detect. As a resource to the community, we provide a database of predictions for all possible human single amino acid substitutions and classify 89% of missense variants as either likely benign or likely pathogenic.",Likely,Multinational,Industry,,,,Unreleased,,,,AlphaFold 2,220.0,,,,,,
Robot Parkour,Robotics,"Shanghai Qi Zhi institute,Stanford University,Carnegie Mellon University (CMU),Tsinghua University","Ziwen Zhuang, Zipeng Fu, Jianren Wang, Christopher Atkeson, Soeren Schwertfeger, Chelsea Finn, Hang Zhao",2023-09-12,Robot Parkour Learning,https://arxiv.org/abs/2309.05665,SOTA improvement,,500000.0,"Parkour policy details on page 8, table 11.",,"The paper provides some details on the training time and hardware used:
Each specialized skill policy (climbing, leaping, etc) was pre-trained with soft dynamics constraints for 12 hours using 1 Nvidia RTX 3090 GPU.
The skills were then fine-tuned with hard dynamics constraints for 6 hours each.
The final parkour policy distillation process used 4 computers with 1 RTX 3090 GPU each, training for an unspecified amount of time.
So the total training time was at least 12 + 6 x 5 = 42 hours for the initial skills, plus an additional unknown time for the distillation.
The hardware used was high-end Nvidia RTX 3090 GPUs, which at the time of paper writing would have been top of the line GPUs. Multiple GPUs were used in parallel during the distillation stage.",,"Isaac Gym simulated proprioceptive data, images, and actions",,,"Parkour is a grand challenge for legged locomotion that requires robots to overcome various obstacles rapidly in complex environments. Existing methods can generate either diverse but blind locomotion skills or vision-based but specialized skills by using reference animal data or complex rewards. However, autonomous parkour requires robots to learn generalizable skills that are both vision-based and diverse to perceive and react to various scenarios. In this work, we propose a system for learning a single end-to-end vision-based parkour policy of diverse parkour skills using a simple reward without any reference motion data. We develop a reinforcement learning method inspired by direct collocation to generate parkour skills, including climbing over high obstacles, leaping over large gaps, crawling beneath low barriers, squeezing through thin slits, and running. We distill these skills into a single vision-based parkour policy and transfer it to a quadrupedal robot using its egocentric depth camera. We demonstrate that our system can empower two different low-cost robots to autonomously select and execute appropriate parkour skills to traverse challenging real-world environments.",Confident,"China,United States of America,United States of America,China","Academia,Academia,Academia",,,NVIDIA GeForce RTX 3090,,,,,,44.0,,,,,,
Falcon-180B,Language,Technology Innovation Institute,"Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, Guilherme Penedo",2023-09-06,The Falcon Series of Open Language Models,https://falconllm.tii.ae/falcon-180b.html; https://arxiv.org/abs/2311.16867,Training cost,"""It's currently at the top of the Hugging Face Leaderboard for pre-trained Open Large Language Models and is available for both research and commercial use.""
""This model performs exceptionally well in various tasks like reasoning, coding, proficiency, and knowledge tests, even beating competitors like Meta's LLaMA 2.""",180000000000.0,"""Falcon 180B is a super-powerful language model with 180 billion parameters""",3.76e+24,"43,500 petaflop-days per Table 1 of the paper
43500 * 1e15 * 24 * 3600 = 3.76e24
C = 6ND = 6 FLOP/token/parameter * 3.5 trillion tokens * 180 billion parameters = 3.78*10^24 FLOP",RefinedWeb,"""The Falcon series is made of three causal decoder-only models trained on up to 4,096 A100. We assembled a pretraining dataset of 3,500 billion tokens, predominantly sourced from our work on RefinedWeb (Penedo et al., 2023)–a massive filtered and deduplicated web dataset""
Training dataset composition is described in Table 3. Falcon was trained for 1 epoch.",3500000000000.0,3.5 trillion tokens * (~3 words per 4 tokens) ~= 2.625 trillion words,"Falcon 180B is a super-powerful language model with 180 billion parameters, trained on 3.5 trillion tokens. It's currently at the top of the Hugging Face Leaderboard for pre-trained Open Large Language Models and is available for both research and commercial use.
This model performs exceptionally well in various tasks like reasoning, coding, proficiency, and knowledge tests, even beating competitors like Meta's LLaMA 2.
Among closed source models, it ranks just behind OpenAI's GPT 4, and performs on par with Google's PaLM 2 Large, which powers Bard, despite being half the size of the model.",Confident,United Arab Emirates,Government,4320.0,"Stanford CRFM foundation model ecosystem graph data page https://crfm.stanford.edu/ecosystem-graphs/index.html?asset=Falcon-180B says 9 months, which is the maximum possible amount of time: training began sometime in 2023, and it was released in September.
However, 6 months is more realistic. That is the length of the gap between Falcon 40B and Falcon 180B. Additionally, the amount of compute is specified in the paper, so there is only one degree of freedom in the uncertain values of training duration and hardware utilization rate. At six months, the utilization is unusually low, so the training was probably not longer than that.",NVIDIA A100 SXM4 40 GB,Open access (restricted use),,4096.0,0.1876,,108.0,"From Hugging Face:
""Falcon-180B was trained on up to 4,096 A100 40GB GPUs, using a 3D parallelism strategy (TP=8, PP=8, DP=64) combined with ZeRO.""
""Falcon-180B was trained on AWS SageMaker, on up to 4,096 A100 40GB GPUs in P4d instances.""
https://huggingface.co/tiiuae/falcon-180B
Utilization must have been at least 12.5%, and they probably did not use the whole 4096 GPU cluster for 9 months, so it was probably higher. Lower bound estimate:
https://www.wolframalpha.com/input?i=%286+FLOP+*+3.5+trillion+*+180+billion%29+%2F+%284096*312+teraFLOPS+*+9+months%29",,1.0,10340911.710964862,4194304.0,"from paper (https://arxiv.org/pdf/2311.16867.pdf):
Batch size 2048 (presumably sequences) per Table 16. Warmed up using smaller batches for first 100B tokens.
""All Falcon models are pretrained with a 2,048 sequence length""
2048*2048 = 4194304"
Swift,Robotics,Intel Labs,"Elia Kaufmann, Leonard Bauersfeld, Antonio Loquercio, Matthias Müller, Vladlen Koltun, Davide Scaramuzza ",2023-08-30,Champion-level drone racing using deep reinforcement learning,https://www.nature.com/articles/s41586-023-06419-4,SOTA improvement,"""Our work marks the first time, to our knowledge, that an autonomous mobile robot achieved world-champion-level performance in a real-world competitive sport.""",56804.0,"The control network is an MLP with input dimension 31, two hidden layers of size 128, and an output of dimension 4.
(31+1)*128+(128+1)*128+(128+1)*4 = 21124
Gate detector is a 6 layer U-net with
8*(3^3*3+1) + 16*(3^2*8+1) + 16*(3^2*16+1) + 16*(5^2*16+1) + 16*(7^2*16+1) + 16*(7^2*16+1) = 35680
35680 + 21124 = 56804",5.337e+16,"Policies are trained for a total of 1 × 10^8 environment interactions, which takes 50 min on a workstation (i9 12900K, RTX 3090, 32 GB RAM DDR5). Fine-tuning is performed for 2 × 10^7 environment interactions.
35.58 TFLOPS * 50 min * 60 s/min * 0.50 utilization = 5.337*10^16 FLOP",,,,,"First-person view (FPV) drone racing is a televised sport in which professional competitors pilot high-speed aircraft through a 3D circuit. Each pilot sees the environment from the perspective of their drone by means of video streamed from an onboard camera. Reaching the level of professional pilots with an autonomous drone is challenging because the robot needs to fly at its physical limits while estimating its speed and location in the circuit exclusively from onboard sensors1. Here we introduce Swift, an autonomous system that can race physical vehicles at the level of the human world champions. The system combines deep reinforcement learning (RL) in simulation with data collected in the physical world. Swift competed against three human champions, including the world champions of two international leagues, in real-world head-to-head races. Swift won several races against each of the human champions and demonstrated the fastest recorded race time. This work represents a milestone for mobile robotics and machine intelligence2, which may inspire the deployment of hybrid learning-based solutions in other physical systems.",Likely,Multinational,Industry,0.833,"50 minutes (training details, page 8)",NVIDIA GeForce RTX 3090,Unreleased,,,,,101.0,,,,,,
Jais,Language,"Cerebras Systems,Mohamed bin Zayed University of Artificial Intelligence,Inception","Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, Lalit Pradhan, Zain Muhammad Mujahid, Massa Baali, Alham Fikri Aji, Zhengzhong Liu, Andy Hock, Andrew Feldman, Jonathan Lee, Andrew Jackson, Preslav Nakov, Timothy Baldwin, Eric Xing",2023-08-29,Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models,https://arxiv.org/abs/2308.16149,SOTA improvement,SOTA at Arabic language tasks.,13000000000.0,"""With 13 billion parameters, they demonstrate better knowledge and reasoning capabilities in Arabic""",3.08e+22,C = 6ND = 6 * 13 billion params * 395 billion tokens = 3.081e+22 FLOP,"Abu El-Khair,Aranews,ArabicText 2022,C4 Arabic,Arabic Wikipedia,ArabicNews 2020,Maktabah,United Nations Parallel Corpus,The Pile,Books3,arXiv,PubMed Central,WebText2,English Wikipedia,FreeLaw,PubMed Abstracts,DeepMind Mathematics,Project Gutenberg,BookCorpus2,EuroParl,PhilPapers,YouTube Subtitles,NIH Grant Abstracts,Enron Emails,GitHub","It was pretrained on 395 billion tokens, including 116 billion Arabic tokens, 232 billion English tokens, and 46 billion tokens of code.
The Arabic data consists of 72 billion tokens, which was augmented by 18 billion tokens of translated English text and then upsampled 1.6 times to reach 116 billion tokens.
The English data is sampled from the Pile dataset and consists of 232 billion tokens.
The code data consists of 46 billion tokens sampled from GitHub.",395000000000.0,395B tokens ~= 300B words,"We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs). The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts, including source code in various programming languages. With 13 billion parameters, they demonstrate better knowledge and reasoning capabilities in Arabic than any existing open Arabic and multilingual models by a sizable margin, based on extensive evaluation. Moreover, the models are competitive in English compared to English-centric open models of similar size, despite being trained on much less English data. We provide a detailed description of the training, the tuning, the safety alignment, and the evaluation of the models. We release two open versions of the model —the foundation Jais model, and an instruction-tuned Jais-chat variant— with the aim of promoting research on Arabic LLMs.",Confident,"Multinational,United Arab Emirates,United States of America","Industry,Academia",600.0,2023 June 25 to July 18 = 25 days = 600 hours,,Open source,,,,,14.0,,,,,3932160.0,"""After packing, we used a global batch size of 1,920 sequences of 2,048 tokens each. """
PeptideBERT,Biology,Carnegie Mellon University (CMU),"Chakradhar Guntuboina, Adrita Das, Parisa Mollaei, Seongwon Kim, and Amir Barati Farimani",2023-08-28,PeptideBERT: A language Model based on Transformers for Peptide Property Prediction,https://arxiv.org/abs/2309.03099,SOTA improvement,"""Our model has achieved state of the art (SOTA) for predicting Hemolysis, which is a task for determining peptide’s potential to induce red blood cell lysis.""",,,7.6e+21,"""Compute for fine-tuning ProtBERT: 1 NVidia GeForce GTX 1080Ti, 30 epochs, batch size 32, model trained for individual tasks with training time ranging from 58-116 minutes, assuming
from Table 1 we have 244 minutes in total
11.34e12 FLOP/s peak and 0.3 utilization rate: FLOP = 244 min * 60 sec/min * 11.34e12 FLOP/sec * 0.3 = 4.9e16 FLOP",,,,,"Recent advances in Language Models have enabled the protein modeling community with a powerful tool since protein sequences can be represented as text. Specifically, by taking advantage of Transformers, sequence-to-property prediction will be amenable without the need for explicit structural data. In this work, inspired by recent progress in Large Language Models (LLMs), we introduce PeptideBERT, a protein language model for predicting three key properties of peptides (hemolysis, solubility, and non-fouling). The PeptideBert utilizes the ProtBERT pretrained transformer model with 12 attention heads and 12 hidden layers. We then finetuned the pretrained model for the three downstream tasks. Our model has achieved state of the art (SOTA) for predicting Hemolysis, which is a task for determining peptide’s potential to induce red blood cell lysis. Our PeptideBert non-fouling model also achieved remarkable accuracy in predicting peptide’s capacity to resist non-specific interactions. This model, trained predominantly on shorter sequences, benefits from the dataset where negative examples are largely associated with insoluble peptides. Codes, models, and data used in this study are freely available at: https://github.com/ChakradharG/PeptideBERT",Confident,United States of America,Academia,4.067,244 minutes from Table 1,NVIDIA GeForce GTX 1080 Ti,Open source,"""Compute for fine-tuning ProtBERT: 1 NVidia GeForce GTX 1080Ti, 30 epochs, batch size 32, model trained for individual tasks with training time ranging from 58-116 minutes, assuming
from Table 1 we have 244 minutes in total
11.34e12 FLOP/s peak and 0.3 utilization rate: FLOP = 244 min * 60 sec/min * 11.34e12 FLOP/sec * 0.3 = 4.9e16 FLOP",1.0,,,,,4.980528e+16,30.0,,,
Qwen-VL,"Multimodal,Language,Vision",Alibaba,,2023-08-24,"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond",https://arxiv.org/abs/2308.12966,SOTA improvement,"""As the results shown, our Qwen-VL and Qwen-VL-Chat both achieve obviously better results compared to previous
generalist models in terms of both two tasks. Specifically, on zero-shot image caption task, Qwen-VL achieves
state-of-the-art performance (i.e., 85.8 CIDEr score) on the Flickr30K karpathy-test split, even outperforms
previous generalist models with much more parameters (e.g., Flamingo-80B with 80B parameters).""",9600000000.0,9.6B total - Table 1,,"Qwen-7B and ViT as base models, trained on 1.5B image-text pairs",,"""Our pre-training dataset is composed of several publicly accessible sources and some in-house data.
We made an effort to clean the dataset of certain patterns. As summarized in Table 2, the original dataset
contains a total of 5 billion image-text pairs, and after cleaning, 1.4 billion data remain, with 77.3% English
(text) data and 22.7% Chinese (text) data.""",1400000000.0,1.4B text-image pairs,"We introduce the Qwen-VL series, a set of large-scale vision-language models designed to perceive and understand both text and images. Comprising Qwen-VL and Qwen-VL-Chat, these models exhibit remarkable performance in tasks like image captioning, question answering, visual localization, and flexible interaction. The evaluation covers a wide range of tasks including zero-shot captioning, visual or document visual question answering, and grounding. We demonstrate the Qwen-VL outperforms existing Large Vision Language Models (LVLMs). We present their architecture, training, capabilities, and performance, highlighting their contributions to advancing multimodal artificial intelligence. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.",Likely,China,Industry,,,,Open access (restricted use),"50k steps, 30k batch size (table 8)",,,Qwen-7B,95.0,,,1.0,,,
RT-2,Robotics,Google DeepMind,"Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, Brianna Zitkovich",2023-07-28,RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control,https://arxiv.org/abs/2307.15818,SOTA improvement,"""We compare our method to multiple state-of-the-art baselines that challenge different aspects of our method. All of the baselines use the exact same robotic data... Here, on average, both instantiations of RT-2 perform similarly, resulting in ∼2x improvement over the next two baselines, RT-1 and MOO, and ∼6x better than the other baselines""",55000000000.0,"""We train two specific instantiations of RT-2 that leverage pre-trained VLMs: (1) RT-2-PaLI-X is built from 5B and 55B PaLI-X (Chen et al., 2023a), and (2) RT-2-PaLM-E is built from 12B PaLM-E (Driess et al., 2023).""
55B and 12B have similar overall performance",,"""""For RT-2-PaLI-X-55B, we use learning rate 1e-3 and batch size 2048 and co-fine-tune the model for 80K gradient steps""
Sequence length not stated",RT-1,"""The vision-language datasets are based on the dataset mixtures from Chen et al. (2023b) and Driess et al. (2023). The bulk of this data consists of the WebLI dataset, which is around 10B image-text pairs across 109 languages, filtered to the top 10% scoring cross-modal similarity examples to give
1B training examples""
""The robotics dataset is based on the dataset from Brohan et al. (2022).""
Chen et al and Driess et al are the original Pali-X and Palm-E papers. image-text web data
Brohan et al is the RT-1 paper",,,"We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to such category of models as vision-language-action models (VLA) and instantiate an example of such a model, which we call RT-2. Our extensive evaluation (6k evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. This includes significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data (such as placing an object onto a particular number or icon), and the ability to perform rudimentary reasoning in response to user commands (such as picking up the smallest or largest object, or the one closest to another object). We further show that incorporating chain of thought reasoning allows RT-2 to perform multi-stage semantic reasoning, for example figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).",Confident,Multinational,Industry,,,,,"""For RT-2-PaLI-X-55B, we use learning rate 1e-3 and batch size 2048 and co-fine-tune the model for 80K gradient steps""
",,,PaLI-X,374.0,,,,,,
AudioLM,Audio,Google,"Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, Neil Zeghidour",2023-07-26,AudioLM: a Language Modeling Approach to Audio Generation,https://arxiv.org/abs/2209.03143,SOTA improvement,"Compared to other systems without text supervision, AudioLM achieves the highest
sWUGGY scores across both splits. Similarly, it also attains the
highest score in the sBLIMP metric, improving by 8% relative
over the previous state-of-the-art (CPC-BERT [59]).",1500000000.0,"""We use identical decoder-only Transformers in
all stages, with 12 layers, 16 attention heads, embedding
dimension of 1024, feed-forward layer dimension of 4096
and dropout of 0.1, together with T5-style relative positional
embeddings [38], resulting in a model parameter size of
0.3B per stage.""
Three stages (figure 2), and 300M per stage. Plus 600M parameters for w2v-BERT XL, so 1.5B total",3.9e+18,"""We train each stage on 16 TPUv4s with batch size of 256 for 1M steps.""
That's for the 900M-param transformers
If there's 256 passes in each batch, then using 6ND that's 900m * 256m * 6 = 1.3e18. sanity check: 16 tpu4s is 4.4e15 FLOP/s. 1.3e18 FLOP / 4.4e15 FLOP/s is 295 seconds. adjusting for utilization it would be ~1000 seconds or 15 minutes? probably too short, so 1.3e18 seems too low.
upd there are 3 stages -> 1.3e18*3 = 3.9e+18 (Speculative due to reasoning above)",LibriLight,,820800000.0,"60k hours of English speech
13680*60000 = 820800000 words
https://docs.google.com/document/d/1G3vvQkn4x_W71MKg0GmHVtzfd9m0y3_Ofcoew0v902Q/edit#heading=h.sxcem9l5k3ce","We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.",Speculative,United States of America,Industry,,,Google TPU v4,,,,,,274.0,,,,,,
Llama 2-70B,Language,Meta AI,"Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom
",2023-07-18,Llama 2: Open Foundation and Fine-Tuned Chat Models,"https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/
https://arxiv.org/abs/2307.09288","Historical significance,Significant use,Highly cited,Training cost",Model has been open-sourced and frequently downloaded. The paper claims that Llama 2 is the current best open-source chat model as of its release date.,70000000000.0,"Llama has been released in 7B, 13B, 34B, and 70B variants.",8.1e+23,"""Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB"" of which 1720320 GPU hours were used to train the 70B model.
311.84 BF16 TFLOP/s per A100 * 1720320 GPU-hours * 3600 s/hour * 0.40 utilization = 7.725e+23 FLOP.
Alternatively: the model was trained for 1 epoch on 2 trillion tokens and has 70B parameters. C = 6ND = 6*70B*2T = 8.4e+23 FLOP.",Llama 2 dataset,"2 trillion tokens of publicly available text, with no text from Meta's products.
""Our training corpus includes a new mix of data from publicly available sources, which does not include data from Meta’s products or services. We made an effort to remove data from certain sites known to contain a high volume of personal information about private individuals. We trained on 2 trillion tokens of data as this provides a good performance–cost trade-off, up-sampling the most factual sources in an effort to increase knowledge and dampen hallucinations.""",1500000000000.0,2 trillion tokens ~= 1.5 trillion words,"In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closedsource models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.",Confident,United States of America,Industry,1728.0,"Model was trained from January 2023 to July 2023, which is six months. However, the training run duration did not take up this whole period. According to a Meta employee interviewed by Epoch, Llama 2 34B and 70B were trained on different clusters, with overlapping training periods. Based on an estimate of 1000 GPUs, it would have taken 72 days.",NVIDIA A100 SXM4 80 GB,Open access (restricted use),,1000.0,0.435,,4399.0,"A100 cost in 2023: $1.10/hour
Training time: 1720320 A100 GPU-hours
Inflation adjustment: $1.000 2020 = $1.145 2023",,1.0,1099604.9936612176,4000000.0,
Llama 2-7B,Language,Meta AI,"Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom
",2023-07-18,Llama 2: Open Foundation and Fine-Tuned Chat Models,https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/,"Historical significance,Significant use,Highly cited",Model has been open-sourced and frequently downloaded. The paper claims that Llama 2 is the current best open-source chat model as of its release date.,70000000000.0,"Llama has been released in 7B, 13B, and 70B variants.",8.4e+22,"Trained on 2 trillion tokens per Table 1.
C = 6ND = 6*7B*2T = 8.4e+22 FLOP.
Also, 7B model was trained on 184320 GPU-hours
312 trillion * 184320 * 3600 * 0.3 = 6.21e22",Llama 2 dataset,"2 trillion tokens of publicly available text, with no text from Meta's products.
""Our training corpus includes a new mix of data from publicly available sources, which does not include data from Meta’s products or services. We made an effort to remove data from certain sites known to contain a high volume of personal information about private individuals. We trained on 2 trillion tokens of data as this
provides a good performance–cost trade-off, up-sampling the most factual sources in an effort to increase knowledge and dampen hallucinations.""",2000000000000.0,2 trillion tokens,"In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closedsource models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.",Confident,United States of America,Industry,,,NVIDIA A100 SXM4 80 GB,Open access (restricted use),,,,,4399.0,"A100 cost in 2023: $1.10/hour
Training time: 184320 A100 GPU-hours
Inflation adjustment: $1.000 2020 = $1.145 2023
184320 * 1.10 / 1.145 = $177,075",,1.0,114259.38527188863,4000000.0,
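An illustrative sketch of the cost arithmetic in the Llama 2-7B notes above: GPU-hours × an assumed 2023 A100 rental rate, adjusted with the stated 2020-to-2023 inflation factor. The recorded cost field comes from the database's own cost model, so the two figures differ.

```python
# Cost arithmetic from the Llama 2-7B notes above (rate and inflation factor are the notes' assumptions).
gpu_hours = 184_320
rate_per_hour_2023 = 1.10        # assumed A100 price in 2023 USD
inflation_2020_to_2023 = 1.145   # $1.000 (2020) = $1.145 (2023)
cost = gpu_hours * rate_per_hour_2023 / inflation_2020_to_2023
print(f"≈ ${cost:,.0f}")         # ≈ $177,075, as in the notes
```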
Claude 2,Language,Anthropic,,2023-07-11,,"https://www.anthropic.com/index/claude-2, https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf",Historical significance,,,,3.866e+24,https://colab.research.google.com/drive/1MdPuhS4Emaf23VXYZ-ooExDW-5GXZkw0#scrollTo=Ds0Q5X8aMnOY,,"From model card: ""Claude models are trained on a proprietary mix of publicly available information from the Internet, datasets
that we license from third party businesses, and data that our users affirmatively share or that crowd workers provide. Some of the human feedback data used to finetune Claude was made public [12] alongside our RLHF [2] and red-teaming [4] research.
Claude 2’s training data cuts off in early 2023, and roughly 10 percent of the data included was non-English.""",,,,Speculative,United States of America,Industry,,,,API access,,,,,0.0,,,,,,
xTrimoPGLM -100B,Biology,"Tsinghua University,BioMap Research","Bo Chen, Xingyi Cheng, Yangli-ao Geng, Shen Li, Xin Zeng, Boyan Wang, Jing Gong, Chiming Liu, Aohan Zeng, Yuxiao Dong, Jie Tang, Le Song",2023-07-06,xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein,https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1,"SOTA improvement,Training cost","""Our extensive experiments reveal that xTrimoPGLM significantly outperforms other advanced baselines in diverse protein understanding tasks (13 out of 15 tasks across four categories)""",100000000000.0,"Abstract: ""training xTrimoPGLM at an unprecedented scale of 100 billion
parameters and 1 trillion training tokens""",6.2e+23,"""xTrimoPGLM-100B is trained on a cluster of 96 DGX-A100 GPU (8×40G) servers in FP16 precision from January 18 to June 30, 2023. During this time, xTrimoPGLM-100B has consumed 1
trillion tokens from the dataset consisting of Uniref90 and ColAbFoldDB. As of the current date,
xTrimoPGLM-100B continues its pre-training process to pass through as many tokens as possible""
6 * 100 billion params * 1T tokens = 6e23
8*96 * 312 trillion * 163 days * 24 * 3600 * 0.3 ~= 1e24
directly given in the paper (Table 9): 6.2E+23 ",UniRef50,,,~24M protein sequences,"Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. This paper proposes a unified protein language model, xTrimoPGLM, to address these two types of tasks simultaneously through an innovative pre-training framework. Our key technical contribution is an exploration of the compatibility and the potential for joint optimization of the two types of objectives, which has led to a strategy for training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens. Our extensive experiments reveal that xTrimoPGLM significantly outperforms other advanced baselines in diverse protein understanding tasks (13 out of 15 tasks across four categories) and generates novel protein sequences which are structurally similar to natural ones. Furthermore, using the same xTrimoPGLM framework, we train an antibody-specific model (xTrimoPGLM-Ab) using 1 billion parameters. This model set a new record in predicting antibody naturalness and structures, both essential to the field of antibody-based drug design, and demonstrated a significantly faster inference speed than AlphaFold2. These results highlight the substantial capability and versatility of xTrimoPGLM in understanding and generating protein sequences.",Confident,"China,China","Academia,Industry",3912.0,163 days,NVIDIA A100 SXM4 40 GB,Unreleased,,768.0,,,37.0,,,,1818526.294987458,,
InternLM,Language,"Shanghai AI Lab,SenseTime",,2023-07-06,,https://internlm.org/,SOTA improvement,"(from Google-translated page) ""In addition to using academic datasets to evaluate InternLM, we also use human examinations to assess its capabilities. InternLM can achieve good scores on examination benchmarks such as MMLU, AGIEval, C-Eval, and GAOKAO-bench that cover different languages and subjects, scoring higher than ChatGPT on multiple benchmarks""",100000000000.0,Pre-training a bilingual 100B Foundation model on data with over a trillion tokens,,,,,750000000000.0,"""Pre-training a bilingual 100B Foundation model on data with over a trillion tokens"" equals approximately 750B words for English, but the tokenizer's conversion ratio may be different for Chinese.","Pre-training a bilingual 100B Foundation model on data with over a trillion tokens, the model exhibits excellent performance in scenarios such as Chinese, English, and coding due to the appropriate data ratio. Based on the foundation model, the application of high-quality human annotated dialogue data combined with RLHF technology enables the InternLM large language model to respond to complex commands during human interaction, while also demonstrating responses in line with human morality and values.",Speculative,"China,Hong Kong","Academia,Industry",,Training performance for the open-source InternLM-7B: https://github.com/InternLM/InternLM/blob/main/doc/en/train_performance.md,NVIDIA A100 SXM4 80 GB,,,,,,0.0,,,,,,
Pangu-Weather,Earth science,Huawei,"Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, Qi Tian",2023-07-05,Accurate medium-range global weather forecasting with 3D neural networks,"https://www.nature.com/articles/s41586-023-06185-3, https://www.huaweicloud.com/intl/en-us/news/20230707180809498.html,
https://www.huawei.com/en/news/2023/7/pangu-ai-model-nature-publish",SOTA improvement,"""In meteorology, the Pangu Meteorology Model (or Pangu-Weather) is the first AI model to have surpassed state-of-the-art numerical weather prediction (NWP) methods in terms of accuracy. The prediction speed is also several orders of magnitude faster. In the past, predicting the trajectory of a typhoon over 10 days took 4 to 5 hours of simulation on a high-performance cluster of 3,000 servers. Now, the Pangu model can do it in 10 seconds on a single GPU of a single server, and with more accurate results.""
https://www.huaweicloud.com/intl/en-us/news/20230707180809498.html",256000000.0,"4*64 million = 256M params
""We trained four deep networks with lead times (the time difference
between input and output) at 1 h, 3 h, 6 h and 24 h, respectively...
This modification increases the number of bias parameters by a factor of 527, with each 3D deep network containing approximately 64 million parameters.""",3.98e+22,"""Each of the four deep networks was trained for 100 epochs, and
each of them takes approximately 16 days on a cluster of 192 NVIDIA
Tesla-V100 GPUs.""
192 * 4 * 16 * 24 * 3600 * 125 teraflops * 0.3 utilization = 3.98e22",ERA5,"""We used a single point in time for both input and output. The time resolution
of the ERA5 data is 1 h; in the training subset (1979–2017), there were
as many as 341,880 time points, the amount of training data in one
epoch""",,"""We used a single point in time for both input and output. The time resolution
of the ERA5 data is 1 h; in the training subset (1979–2017), there were
as many as 341,880 time points, the amount of training data in one
epoch... We fed all included weather variables, including 13 layers of upper-air
variables and the surface variables""
341,880 is the number of hours in ~40 years. But there's lots of data for each hour.","Weather forecasting is important for science and society. At present, the most accurate forecast system is the numerical weather prediction (NWP) method, which represents atmospheric states as discretized grids and numerically solves partial differential equations that describe the transition between those states. However, this procedure is computationally expensive. Recently, artificial-intelligence-based methods have shown potential in accelerating weather forecasting by orders of magnitude, but the forecast accuracy is still significantly lower than that of NWP methods. Here we introduce an artificial-intelligence-based method for accurate, medium-range global weather forecasting. We show that three-dimensional deep networks equipped with Earth-specific priors are effective at dealing with complex patterns in weather data, and that a hierarchical temporal aggregation strategy reduces accumulation errors in medium-range forecasting. Trained on 39 years of global data, our program, Pangu-Weather, obtains stronger deterministic forecast results on reanalysis data in all tested variables when compared with the world’s best NWP system, the operational integrated forecasting system of the European Centre for Medium-Range Weather Forecasts (ECMWF). Our method also works well with extreme weather forecasts and ensemble forecasts. When initialized with reanalysis data, the accuracy of tracking tropical cyclones is also higher than that of ECMWF-HRES.",Confident,China,Industry,1536.0,"4*16 = 64 days
""Each of the four deep networks was trained for 100 epochs, andeach of them takes approximately 16 days on a cluster of 192 NVIDIA Tesla-V100 GPUs.""
",NVIDIA V100,,"Possibly based on Pangu 3? Pangu-Weather is mentioned in the Pangu 3 announcement. But the architecture description doesn't seem to resemble Pangu 3. So it seems like Pangu-Weather is one of the higher-level models that can be attached to Pangu 3.
https://www.huaweicloud.com/intl/en-us/news/20230707180809498.html
",192.0,,,154.0,,,100.0,51279.01751905432,,
Stable Diffusion XL,Image generation,Stability AI,"Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach",2023-07-04,SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis,https://arxiv.org/abs/2307.01952,Significant use,Looks like this is now the main/flagship Stable Diffusion model,3400000000.0,"""...result in a model size of 2.6B parameters in the UNet, see Tab. 1. The text encoders have a total size of 817M parameters.""",,,,,,,"We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at this https URL",Speculative,Multinational,Industry,,,,,,,,,574.0,,,,,,
HyenaDNA,Biology,"Stanford University,Harvard University,Mila - Quebec AI (originally Montreal Institute for Learning Algorithms),University of Montreal / Université de Montréal","Eric Nguyen, Michael Poli, Marjan Faizi, Armin W. Thomas, Callum Birch Sykes, Michael Wornow, Aman Patel, Clayton Rabideau, Stefano Massaroli, Yoshua Bengio, Stefano Ermon, Stephen A. Baccus, Christopher Ré",2023-06-27,HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution,https://arxiv.org/abs/2306.15794,SOTA improvement,"""On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude less parameters and pretraining data.1 On the GenomicBenchmarks, HyenaDNA surpasses SotA on 7 of 8 datasets on average by +10 accuracy points, and by as much as +20 accuracy points on enhancer identification.""",1600000.0,See footnote 1,4.49e+18,"8 Nvidia A100 (80GB) GPUs, ~300 minutes (figure 3.2)
Assuming 40% utilization
Estimate: 78 TFLOP/s * 8 GPUs * (300*60) s * 0.4 = 4.49e18 FLOPs",Human Reference Genome,See footnote 1,,,"Genomic (DNA) sequences encode an enormous amount of information for gene regulation and protein synthesis. Similar to natural language models, researchers have proposed foundation models in genomics to learn generalizable features from unlabeled genome data that can then be fine-tuned for downstream tasks such as identifying regulatory elements. Due to the quadratic scaling of attention, previous Transformer-based genomic models have used 512 to 4k tokens as context (<0.001% of the human genome), significantly limiting the modeling of long-range interactions in DNA. In addition, these methods rely on tokenizers or fixed k-mers to aggregate meaningful DNA units, losing single nucleotide resolution where subtle genetic variations can completely alter protein function via single nucleotide polymorphisms (SNPs). Recently, Hyena, a large language model based on implicit convolutions was shown to match attention in quality while allowing longer context lengths and lower time complexity. Leveraging Hyena's new long-range capabilities, we present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level - an up to 500x increase over previous dense attention-based models. HyenaDNA scales sub-quadratically in sequence length (training up to 160x faster than Transformer), uses single nucleotide tokens, and has full global context at each layer. We explore what longer context enables - including the first use of in-context learning in genomics. On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude less parameters and pretraining data. On the GenomicBenchmarks, HyenaDNA surpasses SotA on 7 of 8 datasets on average by +10 accuracy points.",Confident,"United States of America,United States of America,Canada,Canada","Academia,Academia,Academia,Academia",,,NVIDIA A100,,,8.0,,,69.0,,,,5.554208907015164,,
ERNIE 3.5,Language,Baidu,,2023-06-27,Introducing ERNIE 3.5: Baidu’s Knowledge-Enhanced Foundation Model Takes a Giant Leap Forward,http://research.baidu.com/Blog/index-view?id=185,SOTA improvement,SOTA scores on AGIEval and MMLU. See article in China Science Daily: https://mp.weixin.qq.com/s/QVdkmofRSTgjQ7UOFX7s1g,,,,,,,,,,Unknown,China,Industry,,,,,,,,,0.0,,,,,,
RoboCat,Robotics,"Google DeepMind,Google","Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, Coline Devin, Alex X. Lee, Maria Bauza, Todor Davchev, Yuxiang Zhou, Agrim Gupta, Akhil Raju, Antoine Laurens, Claudio Fantacci, Valentin Dalibard, Martina Zambelli, Murilo Martins, Rugile Pevceviciute, Michiel Blokzijl, Misha Denil, Nathan Batchelor, Thomas Lampe, Emilio Parisotto, Konrad Żołna, Scott Reed, Sergio Gómez Colmenarejo, Jon Scholz, Abbas Abdolmaleki, Oliver Groth, Jean-Baptiste Regli, Oleg Sushkov, Tom Rothörl, José Enrique Chen, Yusuf Aytar, Dave Barker, Joy Ortiz, Martin Riedmiller, Jost Tobias Springenberg, Raia Hadsell, Francesco Nori, Nicolas Heess",2023-06-20,RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation,https://arxiv.org/abs/2306.11706,SOTA improvement,,1180000000.0,"""Most of the experimental results are based on models with a 1.18B-parameter decoder-only transformer (Vaswani et al., 2017) with 24 layers, an embedding size of 2048, and a post-attention feedforward hidden size of 8196."" page 8",,,,"""We use a diverse and large number of datasets for training RoboCat. These include data from agent experience, human demonstrations and self-generated data, on both simulated and real-world robot environments. See Section 3.4 for details on our datasets.""",,,"The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a foundation agent for robotic manipulation. This agent, named RoboCat, is a visual goal-conditioned decision transformer capable of consuming multi-embodiment action-labelled visual experience. This data spans a large repertoire of motor control skills from simulated and real robotic arms with varying sets of observations and actions. With RoboCat, we demonstrate the ability to generalise to new tasks and robots, both zero-shot as well as through adaptation using only 100--1000 examples for the target task. We also show how a trained model itself can be used to generate data for subsequent training iterations, thus providing a basic building block for an autonomous improvement loop. We investigate the agent's capabilities, with large-scale evaluations both in simulation and on three different real robot embodiments. We find that as we grow and diversify its training data, RoboCat not only shows signs of cross-task transfer, but also becomes more efficient at adapting to new tasks.",Speculative,"Multinational,United States of America","Industry,Industry",,,,,,,,,23.0,,,,,,
WizardCoder-15.5B,Language,Microsoft,"Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Daxin Jiang",2023-06-14,WizardCoder: Empowering Code Large Language Models with Evol-Instruct,https://arxiv.org/abs/2306.08568,SOTA improvement,"""It surpasses all other open-source Code LLMs by a substantial margin. Moreover, our model even outperforms the largest closed LLMs, Anthropic’s Claude and Google’s Bard, on HumanEval and HumanEval+.""",15500000000.0,15.5B,1.12e+23,1.12e23 base compute (StarCoder estimate) + 1.95e19 finetune compute (see below) ~= 1.12e23,,"synthetic code data:
""To construct the training dataset, we initialized it with the 20K
instruction-following dataset called Code Alpaca5. We iteratively employ the Evol-Instruct technique on this dataset consisting of 20,000 samples to produce evolved data""",,"""The evolved dataset consists of approximately 78k samples""
Not sure how big the samples are.","Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, we unveil the exceptional capabilities of our model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, our model even outperforms the largest closed LLMs, Anthropic’s Claude and Google’s Bard, on HumanEval and HumanEval+. Our code, model weights, and data are public at https://github.com/nlpxucan/WizardLM.",Likely,United States of America,Industry,,,,Open access (restricted use),"""The StarCoder [11] serves as our basic foundation model. The evolved dataset consists of approximately 78k samples. To fine-tune the basic models, we employ specific configurations, including a
batch size of 512, a sequence length of 2048, 200 fine-tuning steps, 30 warmup steps, a learning rate
of 2e-5, a Cosine learning rate scheduler, and fp16 mixed precision.""
512*2048*200 = 209,715,200 training tokens
209715200 * 15.5B * 6 = 1.95e19",,,StarCoder,252.0,,1.95035136e+19,,,,
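A short illustrative sketch of the fine-tuning arithmetic in the WizardCoder notes above: fine-tune tokens from batch size × sequence length × steps, then 6ND over the 15.5B-parameter base.

```python
# WizardCoder fine-tune compute, following the notes above.
batch_size, seq_len, steps = 512, 2048, 200
finetune_tokens = batch_size * seq_len * steps        # 209,715,200 tokens
params = 15.5e9
finetune_flop = 6 * params * finetune_tokens          # ≈ 1.95e19 FLOP
print(f"{finetune_tokens:,} tokens -> {finetune_flop:.2e} FLOP")
```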
MusicGen,Audio,Meta AI,"Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez",2023-06-08,Simple and Controllable Music Generation,https://arxiv.org/abs/2306.05284,SOTA improvement,"""We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark""",3359000000.0,"""We train autoregressive transformer models at different sizes: 300M, 1.5B, 3.3B parameters""
Uses EnCodec 32kHz (HF version has 59M params) for audio tokenization.",,"We train the 300M, 1.5B and 3.3B parameter models, using respectively 32, 64 and 96 GPUs, with mixed precision.
Unclear how many epochs used so FLOP calculation is not feasible.",ShutterStock and Pond5 music data collections,"""We use 20K hours of licensed music to train MUSICGEN. Specifically, we rely on an internal dataset of 10K high-quality music tracks, and on the ShutterStock and Pond5 music data collections with respectively 25K and 365K instrument-only music tracks. All datasets consist of full-length music sampled at 32 kHz with metadata composed of a textual description and additional information such as the genre, BPM, and tags.""",,"""We train on 30-second audio crops sampled at random from the full track... We use 20K hours of licensed music""
20000 hours * 60 min/hour * 2 inputs/min = 2400000 input sequences
EnCodec is run at 32kHz but after convolutions has a frame rate of 50 Hz, suggesting 2400000 * 30s * 50/s = 3,600,000,000 audio tokens.
Not confident enough in this calculation to add to database.","We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, while being conditioned on textual description or melodic features, allowing better controls over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light over the importance of each of the components comprising MusicGen. Music samples, code, and models are available at this https URL.",Likely,United States of America,Industry,,,,,,,,,104.0,,,,,,
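An illustrative sketch of the tentative token-count reasoning in the MusicGen dataset notes above (explicitly not recorded in the database): 20K hours of licensed music cut into 30-second crops and tokenized at EnCodec's 50 Hz frame rate.

```python
# MusicGen audio-token back-of-the-envelope from the notes above (not recorded in the database).
hours = 20_000
crops = hours * 60 * 2           # two 30-second crops per minute -> 2,400,000 sequences
frame_rate_hz = 50               # EnCodec frame rate after convolutions
frames = crops * 30 * frame_rate_hz
print(f"{crops:,} crops, ≈ {frames:,} audio frames/tokens")   # ≈ 3.6 billion
```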
LTM-1,Language,Magic,,2023-06-06,"LTM-1: an LLM with a 5,000,000 token context window",https://magic.dev/blog/ltm-1,SOTA improvement,Very long context window - 5M tokens,,,,"Must be below 1e23 FLOP, as it's trained with a single A100.",,,,,"Magic’s LTM-1 enables 50x larger context windows than transformers
Magic's trained a Large Language Model (LLM) that’s able to take in the gigantic amounts of context when generating suggestions. For our coding assistant, this means Magic can now see your entire repository of code.",Unknown,United States of America,Industry,,,,,,,,,,,,,,,
PaLI-X,"Multimodal,Language,Vision,Video",Google Research,"Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut",2023-05-29,PaLI-X: On Scaling up a Multilingual Vision and Language Model,https://arxiv.org/abs/2305.18565,SOTA improvement,"""PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them).""",55000000000.0,55B (table 1),,,WebLI,"""The main pretraining data for our model is based on WebLI [5], consisting of roughly one billion images with alt-texts from the web and OCR annotations (using the GCP Vision API), covering over 100 languages. In addition to WebLI ⟨image, text⟩ pairs, we introduce here Episodic WebLI data, where each episode corresponds to a set of such pairs. We aim to have each episode contain loosely related images (i.e., they are clustered according to their URL field), so as to encourage attention among examples in an “episode”. We find this new dataset (with 75M episodes and around 400M images in total) important for developing the few-shot capabilities of the model.""",1400000000.0,"1 billion images with alt texts in WebLI, 400m images in Episodic WebLI data","We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide-range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.",Likely,Multinational,Industry,,,,,,,,UL2,112.0,,,,,,
Goat-7B,Language,National University of Singapore,"Tiedong Liu, Bryan Kian Hsiang Low",2023-05-23,Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks,https://arxiv.org/abs/2305.14201,SOTA improvement,"""We introduce Goat, a fine-tuned LLaMA model that significantly outperforms GPT-4 on a range of arithmetic tasks. Fine-tuned on a synthetically generated dataset, Goat achieves state-ofthe-art performance on BIG-bench arithmetic sub-task.""",7000000000.0,7B,,2.78e+22 for base LLaMA-7B,,"Model was fine-tuned from LLaMA-7B.
Fine-tuning dataset is a synthetic math dataset:
""We generate the dataset synthetically using a Python script. The dataset consists of around 1 million question-answer pairs. The answer contains
the proposed CoT as well as the final numerical output. The numbers are randomly generated, hence
ensuring a very low probability of instances being
duplicated, although small numbers may be sampled multiple times. We sample from log space to
ensure the numbers are equally likely to be sampled
from different orders of magnitude, which is similar to the sampling method used by Lee and Kim
(2023). The details of the dataset are presented in
Appendix F.""",,Fine-tune dataset had 1 million question-answer pairs. likely ~10 tokens per pair?,"We introduce Goat, a fine-tuned LLaMA model that significantly outperforms GPT-4 on a range of arithmetic tasks. Fine-tuned on a synthetically generated dataset, Goat achieves state-of-the-art performance on BIG-bench arithmetic sub-task. In particular, the zero-shot Goat-7B matches or even surpasses the accuracy achieved by the few-shot PaLM-540B. Surprisingly, Goat can achieve near-perfect accuracy on large-number addition and subtraction through supervised fine-tuning only, which is almost impossible with previous pretrained language models, such as Bloom, OPT, GPT-NeoX, etc. We attribute Goat's exceptional performance to LLaMA's consistent tokenization of numbers. To tackle more challenging tasks like large-number multiplication and division, we propose an approach that classifies tasks based on their learnability, and subsequently decomposes unlearnable tasks, such as multi-digit multiplication and division, into a series of learnable tasks by leveraging basic arithmetic principles. We thoroughly examine the performance of our model, offering a comprehensive evaluation of the effectiveness of our proposed decomposition steps. Additionally, Goat-7B can be easily trained using LoRA on a 24GB VRAM GPU, facilitating reproducibility for other researchers. We release our model, dataset, and the Python script for dataset generation.",Speculative,Singapore,Academia,,,NVIDIA A10 PCIe,Open access (non-commercial),"""Goat-7B can be easily fine-tuned using LoRA on a 24GB VRAM GPU... In particular, the fine-tuning process for a specific arithmetic sub-task, such as 8-digit addition using 100K instances, takes only approximately 1.5 hours on an A10 GPU to achieve near-perfect accuracy""
The information isn't complete: no total training time is specified for the 24GB-VRAM GPU run, the number of tokens in the fine-tune dataset is unclear, and they use LoRA. Scaling the reported 1.5 hours per 100K instances to the full ~1M instances suggests roughly 15 A10-hours total. Either way, it is safe to assume this is a small fraction of LLaMA's pretraining compute.
125 trillion (A10 FLOPs) * 15 * 3600 * 0.3 = 2.02e18",,,LLaMA-7B,44.0,,2.02e+18,1.0,,,
CodeT5+,Language,Salesforce,"Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D.Q. Bui, Junnan Li, Steven C.H. Hoi",2023-05-20,CodeT5+: Open Code Large Language Models for Code Understanding and Generation,https://arxiv.org/abs/2305.07922,SOTA improvement,"""We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks""",16000000000.0,"""We implemented a family of CodeT5+ models, with model sizes ranging from 220M to 16B""",,,,"""We enlarge the pretraining dataset of CodeSearchNet [Husain et al., 2019] with the recently released GitHub Code dataset""",,"""We use the CodeT5 tokenizer to tokenize the multilingual dataset, resulting in 51.5B tokens""","""Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Secondly, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degrade. To address these limitations, we propose ``CodeT5+'', a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. Particularly, our instruction-tuned CodeT5+ 16B achieves new SoTA results on HumanEval code generation task against other open code LLMs.""",,United States of America,Industry,,,NVIDIA A100,Open source,,,,,201.0,,,10.0,,,
ONE-PEACE,"Multimodal,Vision,Speech,Language","Alibaba,Huazhong University of Science and Technology","Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, Chang Zhou",2023-05-18,ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities,https://arxiv.org/abs/2305.11172v1,SOTA improvement,""" ONEPEACE achieves leading results in both uni-modal and multi-modal tasks, including image classification (89.8%
accuracy on ImageNet w/o privately labeled data), semantic segmentation (63.0% mIoU on ADE20K), audio-text
retrieval (outperforming previous SOTAs on AudioCaps and Clotho by a large margin), audio classification (91.8%
zero-shot accuracy on ESC-50, 69.7% accuracy on FSD50K, 59.6% accuracy on VGGSound w/o visual information),
audio question answering (86.2% accuracy on AVQA w/o visual information), image-text retrieval (84.1% I2T R@1
on MSCOCO and 97.6% I2T R@1 on Flickr30K w/o intermediate finetuning and ranking), and visual grounding
(89.26%/83.23%/89.27% scores on RefCOCO/+/g test sets).""",4000000000.0,"""we propose ONE-PEACE, a model with 4B parameters""",1.8e+20,"4 billion params * 7.5 billion data * 6 = 1.8e20.
See the training dataset size notes; this estimate required more assumptions than usual.","LAION-2B,LAION-Audio-630K","""For image-text pairs, we use LAION-2B... For audio-text pairs, we mainly use the environmental sound datasets processed by [76].""
looks like there's additional fine-tuning data as well",1600000000.0,"""After these steps, we retain about 1.5 billion image-text pairs""
...
""We also perform simple cleaning on the data, which involves removing samples with text lengths less than 3 or greater than
512, as well as texts containing non-English or emoji characters. Ultimately, we obtain about 2.4 million audio-text pairs, with a total duration of around 8,000 hours""
8000 hours = 480,000 minutes = ~109,440,000 words at 228 wpm
https://docs.google.com/document/d/1G3vvQkn4x_W71MKg0GmHVtzfd9m0y3_Ofcoew0v902Q/edit#heading=h.3pbt0hfgv7pq
Trained on 10 epochs for audio. For text, they train on ""200K steps with a batch size of 32768"" = 6,553,600,000
Adding together, they train on ~ 7.5b data points on a dataset of 1.6b, for ~4.7 epochs on average.","In this work, we explore a scalable way for building a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design allows for the easy extension of new modalities by adding adapters and FFNs, while also enabling multi-modal fusion through self-attention layers. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic space of different modalities and capture fine-grained details within modalities concurrently. With the scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on a wide range of uni-modal and multi-modal tasks, including image classification (ImageNet), semantic segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g). Code is available at this https URL.",Speculative,"China,China","Industry,Academia",,,,Open source,,,,,42.0,,,4.7,,,
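An illustrative sketch of the (speculative) aggregation in the ONE-PEACE notes above: image-text steps × batch size plus ten audio epochs, and the 6ND estimate using the ~7.5B data-point figure the notes settle on (≈4.7 epochs over the ~1.6B-pair dataset).

```python
# ONE-PEACE data-point aggregation, mirroring the speculative reasoning in the notes above.
image_text_pairs_seen = 200_000 * 32_768     # 200K steps at batch size 32,768 ≈ 6.55e9
audio_pairs_seen = 10 * 2.4e6                # 10 epochs over ~2.4M audio-text pairs
total_seen = image_text_pairs_seen + audio_pairs_seen   # ≈ 6.6e9; the notes work with ~7.5e9
params = 4e9
flop = 6 * params * 7.5e9                    # 6ND with the notes' ~7.5B figure -> 1.8e20 FLOP
print(f"pairs seen ≈ {total_seen:.2e}, compute ≈ {flop:.1e} FLOP")
```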
CoEdiT-xxl,Language,"University of Minnesota,Grammarly","Vipul Raheja, Dhruv Kumar, Ryan Koo, Dongyeop Kang",2023-05-17,CoEdIT: Text Editing by Task-Specific Instruction Tuning,"https://arxiv.org/abs/2305.09857, https://huggingface.co/grammarly/coedit-large",SOTA improvement,"""We achieve state-of-the-art performance on multiple text editing tasks: grammatical error correction, text simplification, sentence fusion, iterative text editing, and three stylistic editing
tasks (formality style transfer, neutralization,
and paraphrasing).""",11000000000.0,11B,,finetuned from Flan-T5,,"82k pairs of editing examples:
""we fine-tune a pre-trained
sequence-to-sequence model on a parallel corpus
of instruction-based 82K input-output pairs. The
inputs and outputs are sourced from publicly available corpora for different text editing tasks""
""Our dataset creation is based on the ITERATER+
dataset proposed by Kim et al. (2022) who combined datasets from various text editing tasks (See
Table 1). Their work, in turn, is based on Du et al (2022b), who categorized each edit into MEANINGCHANGED or NON-MEANING-CHANGED.""",3000000.0,"82k pairs of sentences. Roughly 20 words per sentence based on examples but mean length could be higher due to outliers.
40*82k = ~3,000,000","We introduce COEDIT, a state-of-the-art text editing system for writing assistance. COEDIT takes instructions from the user specifying the attributes of the desired text, such as ""Make the sentence simpler"" or ""Write it in a more neutral style,"" and outputs the edited text. We present a large language model fine-tuned on a diverse collection of task-specific instructions for text editing (a total of 82K instructions). Our model (1) achieves state-of-the-art performance on various text editing benchmarks, (2) is competitive with publicly available largestsized LLMs trained on instructions while being ∼60x smaller, (3) is capable of generalizing to unseen edit instructions, and (4) exhibits abilities to generalize to composite instructions containing different combinations of edit actions. Through extensive qualitative and quantitative analysis, we show that writers prefer the edits suggested by COEDIT, relative to other stateof-the-art text editing models1.",Likely,"United States of America,United States of America","Academia,Industry",,,NVIDIA A100,Open access (non-commercial),"""We fine-tune different versions of pre-trained FLANT5 (Chung et al., 2022a) models on the COEDIT dataset. Specifically, we use FLANT5-L (770M parameters), FLANT5-XL (3B parameters), FLANT5-XXL (11B parameters) models.""
""Each model is trained for 5 epochs with early stopping. All models were fine-tuned on A100 GPUs using Deepspeed""
6 * 5 epochs * 3 million words (rough estimate) * 11 billion = 9.9e17 ~= 1e18
",,,Flan-T5 11B,26.0,,1e+18,5.0,,,
InstructBLIP,"Multimodal,Language,Vision","Salesforce Research,Hong Kong University of Science and Technology,Nanyang Technological University","Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi",2023-05-11,InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning,https://arxiv.org/abs/2305.06500,SOTA improvement,from abstract - SOTA on ScienceQA,13000000000.0,13B form 2.6,1.94e+20,"""All models are trained utilizing 16 Nvidia A100 (40G) GPUs and are completed within 1.5 days.""
16 * 3.12e14 * 1.5 * 24 * 3600 * 0.3 = 1.94e20",,"COCO Caption, Web CapFilt, NoCaps, Flickr30K, TextCaps, VQAv2, VizWiz, GQA, Visual Spatial Reasoning, IconQA, OKVQA, A-OKVQA, ScienceQA, Visual Dialog, OCR-VQA, TextVQA, HatefulMemes, LLaVA-Instruct-150K, MSVD-QA, MSRVTT-QA, iVQA",,,"Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at this https URL. ",Confident,"United States of America,Hong Kong,Singapore","Industry,Academia,Academia",36.0,"""All models are trained utilizing 16 Nvidia A100 (40G) GPUs and are completed within 1.5 days.""",NVIDIA A100 SXM4 40 GB,Open access (non-commercial),"flops = (16) * (312 * 10**12) * (1.5* 24 * 3600) * (0.3) = 1.9e20
(num gpu) * (peak flops) * (time in seconds) * (assumed utilization rate)
""All models are trained utilizing 16 Nvidia A100 (40G) GPUs and are completed within 1.5 days.""",16.0,,Vicuna-13B,984.0,,1.9408896e+20,,,,
PaLM 2,Language,Google,"Andrew M. Dai, David R. So, Dmitry Lepikhin, Jonathan H. Clark, Maxim Krikun, Melvin Johnson, Nan Du, Rohan Anil, Siamak Shakeri, Xavier Garcia, Yanping Huang, Yi Tay, Yong Cheng, Yonghui Wu, Yuanzhong Xu, Yujing Zhang, Zachary Nado, Bryan Richter, Alex Polozov, Andrew Nystrom, Fangxiaoyu Feng, Hanzhao Lin, Jacob Austin, Jacob Devlin, Kefan Xiao, Orhan Firat, Parker Riley, Steven Zheng, Yuhuai Wu, Zhongtao Liu, Jiahui Yu, Guy Gur-Ari, Weikang Zhou, Sneha Kudugunta, Sunipa Dev, Frederick Liu, Gustavo Hernandez Abrego, Kelvin Xu, Abe Ittycheriah, Daniel Sohn, John Nham, Le Hou, Siyuan Qiao, Pidong Wang, Zirui Wang, Laurent El Shafey, Hyeontaek Lim, Marcello Maggioni, Michael Isard, Paul Barham, Qiao Zhang, Tao Wang, Yash Katariya, Aurko Roy, Benjamin Lee, Brennan Saeta, Ce Zheng, Hadi Hashemi, Junwhan Ahn, Rajkumar Samuel, Steven Hand, Zhifeng Chen, Kiran Vodrahalli, Aakanksha Chowdhery, Ethan Dyer, Emanuel Taropa, Vlad Feinberg, James Bradbury, Reiner Pope, Wei Li, YaGuang Li, Eric Chu, Jeffrey Hui, Joshua Howland, Vlad Fienber, Aroma Mahendru, Michele Catasta, Vedant Misra, Kevin Robinson, Maysam Moussalem, Sebastian Ruder, Erica Moreira, Eric Ni, Paige Bailey, Lucas Gonzalez, Alexandre Passos, Slav Petrov, Gaurav Mishra, Mark Omernick, Ambrose Slone, Andrea Hu, Colin Cherry, Denny Zhou, Jan Botha, John Wieting, Joshua Maynez, Kathleen Kenealy, Kevin Brooks, Linting Xue, Markus Freitag, Martin Polacek, Pengcheng Yin, Sebastian Gehrmann, Xuezhi Wang, Kathy Meier-Hellstern, Christopher A. Choquette-Choo, Daniel Smilkov, Emily Reif, Alicia Parrish, Alex Castro Ros, Clément Crepy, Dasha Valter, Jeremy Hurwitz, Katherine Lee, Mark Díaz, Marie Pellat, Matthew Jagielski, Renee Shelby, Shachi Dave",2023-05-10,PaLM 2 Technical Report,https://arxiv.org/abs/2305.10403,"SOTA improvement,Training cost",,340000000000.0,"Model Architecture: ""PaLM-2 is a new state-of-the-art language model. We have small, medium, and large variants that use stacked layers based on the Transformer architecture, with varying parameters depending on model size. Further details of model size and architecture are withheld from external publication.""
However, the parameter count was leaked to CNBC: https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html",7.34e+24,"Compute Requirements ""Not reported.""
Paper suggests heuristic of C=6ND. Based on 340B parameters and 3.6*10^12 tokens, training compute would be around 7.3*10^24 FLOP.",,"""The PaLM 2 pre-training corpus is composed of a diverse set of sources: web documents, books, code, mathematics, and conversational data. The pre-training corpus is significantly larger than the corpus used to train PaLM (Chowdhery et al., 2022). PaLM 2 is trained on a dataset that includes a higher percentage of non-English data than previous large language models, which is beneficial for multilingual tasks"" (page 9)",2700000000000.0,"""The pre-training corpus is significantly larger than the corpus used to train PaLM"" so greater than 6e+11. According to the leaked documents viewed by CNBC, the corpus was 3.6 trillion tokens or around 2.7*10^12 words.
https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html","We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM (Chowdhery et al., 2022). PaLM 2 is a Transformer-based model trained using a mixture of objectives similar to UL2 (Tay et al., 2023). Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities.",Likely,United States of America,Industry,,,Google TPU v4,API access,,,,,701.0,PaLM 2 was trained on TPU v4 according to the model card (pages 91-92),,,4865570.06395341,,
StarCoder,Language,"Hugging Face,ServiceNow,Northeastern University,Mila - Quebec AI (originally Montreal Institute for Learning Algorithms),Carnegie Mellon University (CMU),Johns Hopkins University,Leipzig University,ScaDS.AI,Queen Mary University of London,Roblox,Sea AI Lab,Technion - Israel Institute of Technology,Monash University,CSIRO,Data61,McGill University,Saama,University of British Columbia (UBC),Massachusetts Institute of Technology (MIT),Technical University of Munich,IBM,University of Vermont,UnfoldML,SAP,University of Notre Dame,Columbia University,New York University (NYU),University of Allahabad,Discover Dollar,Toloka,Telefonica,Stanford University,Weizmann Institute of Science,Alan Turing Institute,Wellesley College,EleutherAI,Forschungszentrum Julich","Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries",2023-05-09,StarCoder: may the source be with you!,https://arxiv.org/abs/2305.06161,SOTA improvement,"""We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python""",15500000000.0,"""We trained a 15.5B parameter model""",8.46e+22,"FLOP reported here, 8.46e22
https://huggingface.co/bigcode/starcoder
""We trained our model on a GPU cluster with 512 A100 80 GB GPUs... Based on the total number of GPU hours that training took (320,256) and an average power usage of 280W per GPU... The fine-tuned model adds 3.5% of training time""
320256 * 312 tFLOP/s * 3600 * 1.035 * 0.3 (utilization assumption) = 1.12e23",The Stack,"""StarCoderBase is trained on 1 trillion tokens sourced from The Stack (Kocetkov et al., 2022), a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process""",,"""StarCoderBase is trained on 1 trillion tokens sourced from The Stack""","""The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40\% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.""",Likely,"Multinational,United States of America,United States of America,Canada,United States of America,United States of America,Germany,Germany,United Kingdom of Great Britain and Northern Ireland,United States of America,Singapore,Israel,Australia,Australia,Australia,Canada,United States of America,Canada,United States of America,Germany,United States of America,United States of America,Sweden,Multinational,United States of America,United States of America,United States of America,India,India,Multinational,Spain,United States of America,Israel,United Kingdom of Great Britain and Northern Ireland,United States of America,Multinational,Germany","Industry,Industry,Academia,Academia,Academia,Academia,Academia,Academia,Industry,Academia,Academia,Government,Government,Academia,Academia,Academia,Academia,Industry,Academia,Industry,Academia,Academia,Academia,Academia,Industry,Industry,Industry,Academia,Academia,Government,Academia,Research collective,Government",625.5,"625.5 hours = 320256 /512
512 GPUs from ""We trained our model on a GPU cluster with 512 A100 80 GB GPUs ""
320256 GPU hours from ""Based on the total number of GPU hours that training took (320,256)""
citations from sections 5.6 and 5.7",NVIDIA A100 SXM4 80 GB,Open access (restricted use),,512.0,,,316.0,,,1.0,212217.65075330864,4194304.0,"""We train for 100,000 steps with a global batch size of 4,096 sequences of a maximum length of 1,024 so that approximately 400B tokens are observed"""
ImageBind,"Multimodal,Vision,Audio,Language,Image generation,Speech",Meta AI,"Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra",2023-05-09,IMAGEBIND: One Embedding Space To Bind Them All,"https://arxiv.org/abs/2305.05665, https://github.com/facebookresearch/ImageBind",SOTA improvement,"""we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models""",932000000.0,used ViT-Huge 630M as an image/video encoder and OpenCLIP-302m as text encoder,,,"SUN RGB-D,LLVIP,Ego4D,AudioSet",""" For the naturally available paired data, we use
the (video, audio) pairs from the Audioset dataset [19], (image, depth) pairs from the SUN RGB-D dataset [69], (image, thermal) pairs from the LLVIP dataset [32] and (video,
IMU) pairs from the Ego4D dataset [23].""",,,"We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box' including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.",Likely,United States of America,Industry,,,"NVIDIA V100,NVIDIA A100",Open access (non-commercial),,,,,361.0,,,64.0,,,
Agile Soccer Robot,Robotics,Google DeepMind,"Tuomas Haarnoja, Ben Moran, Guy Lever, Sandy H. Huang, Dhruva Tirumala, Markus Wulfmeier, Jan Humplik, Saran Tunyasuvunakool, Noah Y. Siegel, Roland Hafner, Michael Bloesch, Kristian Hartikainen, Arunkumar Byravan, Leonard Hasenclever, Yuval Tassa, Fereshteh Sadeghi, Nathan Batchelor, Federico Casarini, Stefano Saliceti, Charles Game, Neil Sreendra, Kushal Patel, Marlon Gwira, Andrea Huber, Nicole Hurley, Francesco Nori, Raia Hadsell, Nicolas Heess",2023-04-26,Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning,https://arxiv.org/abs/2304.13653,SOTA improvement,"Likely the best bipedal soccer AI, since it's DeepMind, and related work section just discusses results involving specific soccer skills and quadruped robots:
""Whether bipedal or quadrupedal, navigation represents only a fraction of animal and human
capabilities. Motivated by this observation, there is a growing interest in whole body control, i.e.
tasks in which the whole body is used in flexible ways to interact with the environment. Examples
include climbing (Rudin et al., 2022a), getting-up from the ground (Ma et al., 2023), catching objects
(Ma et al., 2023), and mobile manipulation with legs (Cheng et al., 2023). Recently, reinforcement
learning has been applied to learn simple soccer skills, including goalkeeping (Huang et al., 2022),
ball manipulation on diverse terrains (Bohez et al., 2022; Ji et al., 2023), and shooting (Ji et al.,
2022). These works focus on a narrower set of skills than the 1v1 soccer game, and the quadrupedal
platform is inherently more stable and therefore presents an easier learning challenge.""",,,,,,self-play training in simulation,,""". The get-up teacher learns to get up relatively quickly and trained in total for approximately 2.4 · 10^8 environment steps,
equivalent to approximately 70 days of simulation time, or 14 hours of wall-clock time. The soccer
teacher was trained for 2 · 10^9 environment steps, which took 158 hours of training, equivalent to
approximately 580 days of simulated match""","We investigate whether Deep Reinforcement Learning (Deep RL) is able to synthesize sophisticated and safe movement skills for a low-cost, miniature humanoid robot that can be composed into complex behavioral strategies in dynamic environments. We used Deep RL to train a humanoid robot with 20 actuated joints to play a simplified one-versus-one (1v1) soccer game. We first trained individual skills in isolation and then composed those skills end-to-end in a self-play setting. The resulting policy exhibits robust and dynamic movement skills such as rapid fall recovery, walking, turning, kicking and more; and transitions between them in a smooth, stable, and efficient manner - well beyond what is intuitively expected from the robot. The agents also developed a basic strategic understanding of the game, and learned, for instance, to anticipate ball movements and to block opponent shots. The full range of behaviors emerged from a small set of simple rewards. Our agents were trained in simulation and transferred to real robots zero-shot. We found that a combination of sufficiently high-frequency control, targeted dynamics randomization, and perturbations during training in simulation enabled good-quality transfer, despite significant unmodeled effects and variations across robot instances. Although the robots are inherently fragile, minor hardware modifications together with basic regularization of the behavior during training led the robots to learn safe and effective movements while still performing in a dynamic and agile way. Indeed, even though the agents were optimized for scoring, in experiments they walked 156% faster, took 63% less time to get up, and kicked 24% faster than a scripted baseline, while efficiently combining the skills to achieve the longer term objectives. Examples of the emergent behaviors and full 1v1 matches are available on the supplementary website.",Unknown,Multinational,Industry,240.0,"14+158+68 hours:
""Training the get-up and soccer teachers took 14 and 158 hours (6.5 days), respectively, and distillation and self-play
took 68 hours (see Appendix B for details)""",,Unreleased,,,,,40.0,,,,,,
WizardLM-7B,Language,"Microsoft,Peking University","Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Daxin Jiang",2023-04-24,WizardLM: Empowering Large Language Models to Follow Complex Instructions,https://arxiv.org/abs/2304.12244,SOTA improvement,"""Labelers prefer WizardLM outputs over outputs from ChatGPT under complex test instructions. On Evol-Instruct testset, WizardLM performs worse than ChatGPT, with a win
rate 12.8% lower than ChatGPT (28.0% vs. 40.8%). However, in the high-difficulty section
of Evol-Instruct test set (difficulty level ≥ 8), our WizardLM even outperforms ChatGPT,
with a win rate 7.9% larger than ChatGPT (42.9% vs. 35.0%), that is human annotators even
prefer the output of our model than ChatGPT on those hard questions""",6700000000.0,This is Llama-7b's parameter count,4.02e+22,"""We use pre-trained LLaMA 7B [4] to initialize our model. We adopt Adam optimizer as an initial learning rate of 2 ×10−5, a maximum number of tokens 2048, and the batch size is 8 for each GPU. We train our model on 8 V100 GPUs with Deepspeed Zero-3 for 70 hours on 3 epochs""
Llama-7b was ~4e22. 8*70 V100-hours is ~2e20, so fine-tuning was <1% of base training.",,"Fine-tuning dataset is made of LLM-generated instructions: ""In this work, we introduce Evol-Instruct, a novel method using LLMs instead of humans to automatically mass-produce open-domain instructions of various difficulty levels, to improve the performance of LLMs""",,,"Training large language models (LLMs) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed and Vicuna's testset show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results of the high complexity part, we demonstrate that outputs from our WizardLM are preferred to outputs from OpenAI ChatGPT. In GPT-4 automatic evaluation, WizardLM achieves more than 90\% capacity of ChatGPT on 17 out of 29 skills. Even though WizardLM still lags behind ChatGPT in some aspects, our findings suggest that fine-tuning with AI-evolved instructions is a promising direction for enhancing LLMs. Our code and data are public at this https URL",Confident,"United States of America,China","Industry,Academia",70.0,,NVIDIA V100,Open access (non-commercial),,8.0,,LLaMA-7B,468.0,,,,46907.42757520694,,
LLaVA,"Multimodal,Vision,Language","University of Wisconsin Madison,Microsoft Research,Columbia University","Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee",2023-04-17,Visual Instruction Tuning,https://arxiv.org/abs/2304.08485,SOTA improvement,"When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.",13000000000.0,13B,4.852224e+19,"8*312e12*(10+8)*3600*0.3 = 4.852224e+19
num GPUs * peak FLOP/s * time * assumed utilization rate
""We train all models with 8× A100s. Pretraining on CC-595K completes within 4 hours. Finetuning on Instruct-158K completes within 10 hours. Finetuning on ScienceQA completes within 4 hours."" so 18 hours of time, 8 A100,",Conceptual Captions (CC3M),"""We pre-train our model on the filtered CC-595K subset for 1 epoch with a learning rate of 2e-3 and a batch size of 128, and fine-tune on the proposed LLaVA-Instruct-158K dataset """,,"595K + 158K = 753K image text pairs
""This results in around 595K image-text pairs""
""We collect 158K unique language-image instruction-following samples in total, including 58K in conversations, 23K in detailed description, and 77k in complex reasoning, respectively. ""","Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding.Our early experiments show that LLaVA demonstrates impressive multimodel chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields a 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available. ",Confident,"United States of America,United States of America,United States of America","Academia,Industry,Academia",10.0,"""We train all models with 8× A100s. Pretraining on CC-595K completes within 4 hours. Finetuning on Instruct-158K completes within 10 hours. Finetuning on ScienceQA completes within 4 hours.""",NVIDIA A100,Open source,,8.0,,,1288.0,,,,42.46267260692187,,
DINOv2,Vision,"Facebook AI Research,INRIA","Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski",2023-04-14,DINOv2: Learning Robust Visual Features without Supervision,https://arxiv.org/abs/2304.07193,SOTA improvement,"""Our family of models drastically improves over
the previous state of the art in self-supervised learning and reaches performance comparable with weakly-supervised features.""
",1140000000.0,1.14B from https://huggingface.co/facebook/dinov2-giant,7.41851136e+21,"table 14
22016 * 3600 * 312 * 10 ** 12 * 3/10 = 7.41851136e+21
gpu hours in seconds * flops of A100 * assumed utilization rate",,new dataset - named LVD142M Table 15,142000000.0,new dataset - named LVD142M Table 15,"The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels. ",Confident,"United States of America,France","Industry,Academia",,,NVIDIA A100 SXM4 40 GB,Open source,,,,,874.0,,,,10203.60518105836,,
Incoder-6.7B,Language,"Facebook AI Research,University of Washington,UC Berkeley,Carnegie Mellon University (CMU),Toyota Technological Institute at Chicago","Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, Mike Lewis",2023-04-09,InCoder: A Generative Model for Code Infilling and Synthesis,https://arxiv.org/abs/2204.05999,SOTA improvement,"""Zero-shot infilling with bidirectional context substantially outperforms approaches based on left-to-right-only models, and on several tasks
obtains performance comparable to state-of-the-art models fine-tuned on the tasks""",6700000000.0,6.7B,3.00001e+21,"per table 5, required 3 zettaflop (3e21) to train.
also, ""INCODER-6.7B was trained on 248 V100 GPUs for 24 days""
hardware method: 125 trillion FLOP/s (V100 peak) * 248 GPUs * 24 days * 24 * 3600 * 0.3 = 2e22. This suggests their utilization was quite low, or that the 24 days was calendar time rather than continuous training.
",,"Code from GitHub and StackOverflow
""To train our models, we collect a corpus of (1) public code with permissive, non-copyleft, opensource licenses from GitHub and GitLab and (2) StackOverflow questions, answers, and comments.
Our primary focus in this paper is on the Python language, but we also include code files from
28 total languages and StackOverflow content from all available languages.""",,"216 GB: ""Our final pre-training corpus contains a total of 159 GB of code, 52 GB of it
in Python, and a total of 57 GB of content from StackOverflow""","Code is seldom written in a single left-to-right pass and is instead repeatedly edited and refined. We introduce InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) as well as editing (via infilling). InCoder is trained to generate code files from a large corpus of permissively licensed code, where regions of code have been randomly masked and moved to the end of each file, allowing code infilling with bidirectional context. Our model is the first generative model that is able to directly perform zero-shot code infilling, which we evaluate on challenging tasks such as type inference, comment generation, and variable re-naming. We find that the ability to condition on bidirectional context substantially improves performance on these tasks, while still performing comparably on standard program synthesis benchmarks in comparison to left-to-right only models pretrained at similar scale. The InCoder models and code are publicly released. this https URL",Confident,"United States of America,United States of America,United States of America,United States of America,United States of America","Industry,Academia,Academia,Academia,Academia",576.0,24,NVIDIA V100,Open access (non-commercial),,,,,397.0,,,1.0,3129.0771365574788,,
Segment Anything Model,Vision,Meta AI,"Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick",2023-04-05,Segment Anything,https://arxiv.org/abs/2304.02643,Highly cited,,636000000.0,"From Facebook website: https://segment-anything.com/
""How big is the model? The image encoder has 632M parameters.
The prompt encoder and mask decoder have 4M parameters.""",7.8e+21,"""SAM was trained on 256 A100 GPUS for 68 hours. We acknowledge the environmental impact and cost of training
large scale models. The environmental impact of training the released SAM model is approximately 6963 kWh""
68 * 256 = 17408 A100-hours; 17408 GPU-hours * 3600 * 312 trillion FLOP/s * 0.4 (utilization assumption for image models) = 7.82e21
max A100 power is 400W. 6,963,000 watt-hours / 400 watts = 17407.5 hours (so they probably just calculated backwards from power rating, and this doesn't give any info on utilization)",Segment Anything 1B,"""Dataset (§5). Our final dataset, SA-1B, includes more than
1B masks from 11M licensed and privacy-preserving images (see Fig. 2). SA-1B, collected fully automatically using the final stage of our data engine, has 400× more masks
than any existing segmentation dataset [66, 44, 117, 60],
and as we verify extensively, the masks are of high quality
and diversity. Beyond its use in training SAM to be robust
and general, we hope SA-1B becomes a valuable resource
for research aiming to build new foundation models.""",1100000000.0,"""SA-1B contains 11M diverse, high-resolution, licensed, and privacy protecting images and 1.1B high-quality segmentation masks.""
segmentation mask is a map that identifies segments in an image","We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at this https URL to foster research into foundation models for computer vision.",Confident,United States of America,Industry,68.0,"""SAM was trained on 256 A100 GPUS for 68 hours""",NVIDIA A100,Open source,see Training Compute notes,256.0,,ViT-Huge/14,2494.0,,7.8e+21,2.0,15888.411228475235,,
Vicuna-13B,Language,"Large Model Systems Organization,UC Berkeley","Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, Eric P. Xing",2023-03-30,Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality,https://lmsys.org/blog/2023-03-30-vicuna/,Historical significance,,13000000000.0,,,Might be possible to estimate training compute from the training cost. Fine-tuning cost $300.,,"70K conversations from ShareGPT.com, a website where users can share their ChatGPT conversations.",,,,Speculative,"United States of America,United States of America","Academia,Academia",,,,Open access (non-commercial),,,,LLaMA-13B,0.0,"$300 in 2020, adjusted for inflation using BLS.gov inflation calculator",,,,,
BloombergGPT,Language,"Bloomberg,Johns Hopkins University","Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, Gideon Mann",2023-03-30,BloombergGPT: A Large Language Model for Finance,https://arxiv.org/abs/2303.17564,SOTA improvement,"""We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks.""",50558868480.0,,2.36e+23,"2.36e23 per Table 4
(using our usual hardware method, 512 A100s over 53 days would be 512 * 312 teraFLOP/s * 53 * 24 * 3600 * 0.3 = 2.19e23)",,"""To train BloombergGPT, we construct “FinPile”, a comprehensive dataset consisting of a range of English financial documents including news, filings, press releases, web-scraped financial documents, and social media drawn from the Bloomberg archives. These documents have been acquired through our business process over the past two decades. We augment FinPile with public data widely used to train LLMs. The result is a training corpus that is roughly half domain-specific text and half general-purpose text.""",532000000000.0,"708.9 billion tokens. At 0.75 English words per token, that's 532B words","The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general purpose datasets. We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. Additionally, we explain our modeling choices, training process, and evaluation methodology. We release Training Chronicles (Appendix C) detailing our experience in training BloombergGPT.",Confident,"United States of America,United States of America","Industry,Academia",1270.0,"""~53 days""",NVIDIA A100,Unreleased,,512.0,0.32,,369.0,,,0.8,369586.1352802876,4200000.0,"""in the first 7,200 steps, we use a batch size of 1,024 (2.1M tokens), then switch to a batch size of 2,048 (4.2M tokens) for the remainder of training."""
VideoMAE V2,Video,"Nanjing University,Shenzhen Institute of Advanced Technology,Shanghai AI Lab","Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao",2023-03-29,VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking,https://arxiv.org/abs/2303.16727v2,SOTA improvement,"""Finally, we successfully train a video ViT model with a
billion parameters, which achieves a new state-of-the-art
performance on the datasets of Kinetics (90.0% on K400
and 89.9% on K600) and Something-Something (68.7% on
V1 and 77.0% on V2).""",1000000000.0,1B,9.7e+21,"finetuned on ViT-g (smaller than ViT-G with 1B params)
""It takes more than two weeks to pre-train a ViT-g model with VideoMAE
on 64 A100 GPUs""
64 * 312 trillion * 2 * 7 * 24 * 3600 * 0.4 (utilization assumption) = 9.7e21",,"""To well support the billion-level ViT model pretraining, we build two large-scale video datasets for our proposed progressive training. For self-supervised pre-training of VideoMAE V2, we build a million-level unlabeled video
dataset by collecting clips from multiple resources such
as Movie, Youtube, Instagram, General Webs, and manual recordings from scripts, and the dataset is termed as
UnlabeledHybrid""",,"1.35 million video clips. Not sure about average length (34 seconds, but that's only reported for Instagram portion).
""In total, there are around 1.35M clips in our mixed dataset and
this is the largest dataset ever used for video masked autoencoding.","Scale is the primary factor for building a powerful foundation model that could well generalize to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale the VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is very efficient due to high masking ratio in encoder, masking decoder can still further reduce the overall computational cost. This enables the efficient pre-training of billion-level models in video. We also use a progressive training paradigm that involves an initial pre-training on a diverse multi-sourced unlabeled dataset, followed by a post-pre-training on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance on the datasets of Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2). In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating its effectiveness as a general video representation learner. The code and model is available at \url{this https URL}.",Confident,"China,China,China","Academia,Academia",336.0,2 weeks,NVIDIA A100 SXM4 80 GB,Open source,"finetuned on ViT-g (smaller than ViT-G with 1B params)
""It takes more than two weeks to pre-train a ViT-g model with VideoMAE
on 64 A100 GPUs""
64 * 312 trillion * 2 * 7 * 24 * 3600 * 0.4 (utilization assumption) = 9.7e21
",64.0,,ViT-G/14,129.0,,9.7e+21,1200.0,18339.96928068276,,
Firefly,Image generation,Adobe,,2023-03-21,"Adobe Unveils Firefly, a Family of new Creative Generative AI",https://news.adobe.com/news/news-details/2023/Adobe-Unveils-Firefly-a-Family-of-new-Creative-Generative-AI/default.aspx,Significant use,"Integrated into Photoshop. Users generate >200m images within a few months of release:
https://venturebeat.com/ai/adobe-stock-creators-arent-happy-with-firefly-the-companys-commercially-safe-gen-ai-tool/",,,,,,"""The current Firefly generative AI model is trained on a dataset of licensed content, such as Adobe Stock, and public domain content where copyright has expired.""
https://www.adobe.com/products/firefly.html",,,"Today, Adobe (Nasdaq:ADBE) introduced Adobe Firefly, a new family of creative generative AI models, first focused on the generation of images and text effects. Adobe Firefly will bring even more precision, power, speed and ease directly into Creative Cloud, Document Cloud, Experience Cloud and Adobe Express workflows where content is created and modified. Adobe Firefly will be part of a series of new Adobe Sensei generative AI services across Adobe’s clouds.",Unknown,United States of America,Industry,,,,,,,,,,,,,,,
PanGu-Σ,Language,Huawei Noah's Ark Lab,"Xiaozhe Ren, Pingyi Zhou, Xinfan Meng, Xinjing Huang, Yadao Wang, Weichao Wang, Pengfei Li, Xiaoda Zhang, Alexander Podolskiy, Grigory Arshinov, Andrey Bout, Irina Piontkovskaya, Jiansheng Wei, Xin Jiang, Teng Su, Qun Liu, Jun Yao",2023-03-20,PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing,https://arxiv.org/abs/2303.10845,SOTA improvement,"""Our experimental findings show that PanGu-{\Sigma} provides state-of-the-art performance in zero-shot learning of various Chinese NLP downstream tasks.""",1085000000000.0,"""In this work, we present PanGu-Σ , a large language model with sparse architecture containing 1.085 trillion parameters.""",4.67e+23,"It has sparse architecture, so we can't use C=6ND.
""We develop PanGu-Σ model under the framework of MindSpore and train it on a cluster with only 512 Ascend 910 AI Accelerators with 329 billion tokens over 100 days.""
100 days * 512 processors * 320 teraFLOPS/processor * 33% utilization = 4.67e+23 FLOP
https://www.wolframalpha.com/input?i=100+days+*+512+*+320+terahertz+*+0.33",,"""329B tokens in more than 40 natural and programming languages""",246750000000.0,329B tokens ~= 247B words,"The scaling of large language models has greatly improved natural language understanding, generation, and reasoning. In this work, we develop a system that trained a trillion-parameter language model on a cluster of Ascend 910 AI processors and MindSpore framework, and present the language model with 1.085T parameters named PanGu-{\Sigma}. With parameter inherent from PanGu-{\alpha}, we extend the dense Transformer model to sparse one with Random Routed Experts (RRE), and efficiently train the model over 329B tokens by using Expert Computation and Storage Separation(ECSS). This resulted in a 6.3x increase in training throughput through heterogeneous computing. Our experimental findings show that PanGu-{\Sigma} provides state-of-the-art performance in zero-shot learning of various Chinese NLP downstream tasks. Moreover, it demonstrates strong abilities when fine-tuned in application data of open-domain dialogue, question answering, machine translation and code generation.",Confident,China,Industry,2400.0,"We develop PanGu-Σ model under the framework of MindSpore 5
and train it on a cluster with only 512 Ascend 910 AI Accelerators [28] with 329 billion tokens over 100 days.",Huawei Ascend 910,Unreleased,,512.0,,,38.0,,,1.836,,524288.0,"""We train PanGu-Σ with global batch size of 512 with sequence length of 1024 for each sample"""
Gen-2,Video,Runway,Gen-2 authors,2023-03-20,,https://research.runwayml.com/gen2,SOTA improvement,"Website claims SOTA improvement over Stable Diffusion and Text2Live, paper forthcoming",,,,,,,,,,Unknown,United States of America,Industry,,,,,,,,,0.0,,,,,,
GPT-4,"Multimodal,Language,Vision,Image generation",OpenAI,"OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain et al. (181 additional authors not shown)",2023-03-15,GPT-4 Technical Report,https://arxiv.org/abs/2303.08774,"Highly cited,SOTA improvement,Training cost","See the paper, p.1: ""On a suite of traditional NLP benchmarks, GPT-4 outperforms both previous large language models and most state-of-the-art systems (which often have benchmark-specific training or hand-engineering).""",,,2.1e+25,"90% CI: 8.2E+24 to 4.4E+25
NOTE: this is a rough estimate based on public information, much less information than most other systems in the database.
Calculation and confidence intervals here: https://colab.research.google.com/drive/1O99z9b1I5O66bT78r9ScslE_nOj5irN9?usp=sharing",,,4900000000000.0,"Speculative. Reported secondhand by online sources such as Semianalysis, but not verified by OpenAI. If total number of tokens seen was 13T, text was repeated for 2 epochs, and text was the majority of tokens, then dataset size roughly is 13T*0.75/2 = 4.9T words.
Note this examines only the text dataset, since GPT-4 was first and foremost a language model. However, the vision component had its own vision dataset, which we believe accounted for a much smaller part of the compute budget.","We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.",Speculative,United States of America,Industry,2280.0,(Speculative) SemiAnalysis conjectures that GPT-4 training took 90-100 days with utilization of 32-36%.,NVIDIA A100 SXM4 40 GB,API access,,25000.0,0.34,,4374.0,,,2.0,40586592.57781653,,
Falcon-40B,Language,Technology Innovation Institute,,2023-03-15,Abu Dhabi-based Technology Innovation Institute Introduces Falcon LLM: Foundational Large Language Model (LLM) outperforms GPT-3 with 40 Billion Parameters,https://arxiv.org/abs/2311.16867; https://www.tii.ae/news/abu-dhabi-based-technology-innovation-institute-introduces-falcon-llm-foundational-large,Historical significance,,40000000000.0,Model comes in 7B and 40B variants.,2.4e+23,"C = 6ND = 6 * 40B * 1000B = 2.4e+23 FLOP (assuming one epoch)
Table 1 from https://arxiv.org/pdf/2311.16867 Falcon paper
2,800 petaflop-days * 1e15 * 24 * 3600 = 2.4192e+23 FLOPs",Falcon RefinedWeb,"Falcon-40B was trained on 1,000B tokens of RefinedWeb, a high-quality filtered and deduplicated web dataset which we enhanced with curated corpora. Significant components from our curated copora were inspired by The Pile (Gao et al., 2020).",1000000000000.0,1000B tokens ~= 750B words,,Confident,United Arab Emirates,Government,1440.0,"""Falcon-40B was trained on AWS SageMaker, on 384 A100 40GB GPUs in P4d instances.""
""Training started in December 2022 and took two months.""",NVIDIA A100,Open source,,384.0,0.3864,,0.0,,,,319783.157242365,2359296.0,"Batch size 1152 (presumably sequences) per Table 16. Warmed up using smaller batches for first 100B tokens.
""All Falcon models are pretrained with a 2,048 sequence length""
https://arxiv.org/pdf/2311.16867.pdf
"
Claude,Language,Anthropic,,2023-03-14,Introducing Claude,https://www.anthropic.com/index/introducing-claude,"Historical significance,SOTA improvement",,,,,,,,,,"Claude is a next-generation AI assistant based on Anthropic’s research into training helpful, honest, and harmless AI systems. Accessible through chat interface and API in our developer console, Claude is capable of a wide variety of conversational and text processing tasks while maintaining a high degree of reliability and predictability.",Unknown,United States of America,Industry,,,,,,,,,0.0,,,,,,
PaLM-E,Robotics,"Google,TU Berlin","Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence",2023-03-06,PaLM-E: An Embodied Multimodal Language Model,https://arxiv.org/abs/2303.03378,SOTA improvement,"""Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist
with state-of-the-art performance on OK-VQA, and retains
generalist language capabilities with increasing scale.""",562000000000.0,562B,,"Based on PaLM-540B and ViT-22B and then trained on robotics data.
",,"""Our three robot environments (Fig. 1) include a Task and Motion Planning (TAMP) domain where a robot has to manipulate (grasp and stack) objects, a table-top pushing environment, and a mobile manipulation domain. In each domain, PaLM-E is trained on expert data from that domain. In many cases, this is a sparse amount of data per task. The TAMP tasks involve large combinatorics over possible plans, and many decision sequences are infeasible. PaLM-E has to generate plans that consist of multiple steps, with complicated decision boundaries. The multi-object tabletop pushing environment is taken from the publicly available Language-Table dataset (Lynch et al., 2022) and is challenging since it includes several objects, large cardinality of language, and complex pushing dynamics""",,,"Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.",Likely,"United States of America,Germany","Industry,Academia",,,,,"Based on Palm-540B and ViT 22B. No compute details given.
""We scale PaLM-E up to 562B parameters, integrating the 540B PaLM (Chowdhery et al., 2022) LLM and the 22B Vision Transformer (ViT) (Dehghani et al., 2023) into, to our knowledge, the largest vision-language model currently reported.""",,,PaLM (540B),817.0,,,,,,
AudioGen,Audio,"Meta AI,Hebrew University of Jerusalem","Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, Yossi Adi",2023-03-05,AudioGen: Textually Guided Audio Generation,https://arxiv.org/abs/2209.15352,SOTA improvement,"""We propose a state-of-the-art auto-regressive audio generation model conditioned on textual descriptions or audio prompts, as evaluated with objective and subjective (human
listeners) scores.""",1000000000.0,"""We trained two sets of ALMs, one with 285M parameters (base) and the other with 1B parameters (large).""",9.5e+21,"""the large model was trained on 128 A100 GPUs for 200k steps (∼1 week)""
A100s are 312 teraflop/s
128 * 312 trillion * 7 * 24 * 3600 * 0.3 (utilization assumption) = 7.2e21
Text encoding uses T5-Large, which used 2.3e21 FLOP in pre-training per the Flan paper: https://arxiv.org/abs/2210.11416 (total: 7.2e21 + 2.3e21 ≈ 9.5e21)","AudioSet,AudioCaps","""We use a set of several datasets: AudioSet (Gemmeke et al., 2017), BBC sound effects,
AudioCaps (Kim et al., 2019), Clotho v2 (Drossos et al., 2020), VGG-Sound (Chen et al., 2020),
FSD50K (Fonseca et al., 2021), Free To Use Sounds, Sonniss Game Effects, WeSoundEffects, Paramount Motion - Odeon Cinematic Sound Effects. All audio files were sampled at 16kHz.
For textual descriptions we use two types of annotations. The first one is multi-label annotations,
available for the datasets: AudioSet, VGG-Sound, FSD50K, Sinniss Game Effects, WeSoundEffects, Paramount Motion - Odeon Cinematic Sound Effects.""",230400000000.0,"""Overall we are left with ∼4k hours for training data.""
mix of speech and other sounds
Training the audio autoencoder uses a reconstruction loss on sequences of raw audio samples. Audio files are sampled at 16kHz, so
16k * 4k * 3600 = 230.4B samples
Audio language modelling operates on tokens; ""each second of audio is represented by 500 tokens"".
4k * 3600 * 500 = 7.2B tokens","We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AaudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating ``objects'' can be a difficult task (e.g., separating multiple people simultaneously speaking). This is further complicated by real-world recording conditions (e.g., background noise, reverberation, etc.). Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires encoding audio at high sampling rate, leading to extremely long sequences. To alleviate the aforementioned challenges we propose an augmentation technique that mixes different audio samples, driving the model to internally learn to separate multiple sources. We curated 10 datasets containing different types of audio and text annotations to handle the scarcity of text-audio data points. For faster inference, we explore the use of multi-stream modeling, allowing the use of shorter sequences while maintaining a similar bitrate and perceptual quality. We apply classifier-free guidance to improve adherence to text. Comparing to the evaluated baselines, AudioGen outperforms over both objective and subjective metrics. Finally, we explore the ability of the proposed method to generate audio continuation conditionally and unconditionally. Samples: this https URL",Likely,"United States of America,Israel","Industry,Academia",168.0,1 week,NVIDIA A100,Open access (non-commercial),,,,,137.0,,,,9429.74091062958,,
DiT-XL/2,Image generation,"New York University (NYU),UC Berkeley","William Peebles, Saining Xie",2023-03-02,Scalable Diffusion Models with Transformers,https://arxiv.org/abs/2212.09748,SOTA improvement,"""our largest DiT-XL/2 models outperform all prior diffusion models on the classconditional ImageNet 512×512 and 256×256 benchmarks,
achieving a state-of-the-art FID of 2.27 on the latter.""",675000000.0,675M,6e+20,"~6e20, based on eyeballing Figure 9. It's between 1e11 and 1e12 gigaflop (1 gigaflop = 1e9 flop), and about 80% of the way towards 1e12 on a log scale. 10^0.8 is about 6.
3M iterations with a batch size of 256.
""Compute. We implement all models in JAX [1] and train
them using TPU-v3 pods. DiT-XL/2, our most computeintensive model, trains at roughly 5.7 iterations/second on a
TPU v3-256 pod with a global batch size of 256""
256*123000000000000 FLOPs/s * 800000 training steps / 5.7 iterations/second * 0.3 = 1.3258105e+21",ImageNet,,,didn't state which ImageNet set,"We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.",Confident,"United States of America,United States of America","Academia,Academia",,,Google TPU v3,,,,,Stable Diffusion (LDM-KL-8-G),433.0,,,,111048.19613085664,,
LLaMA-65B,Language,Meta AI,"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample",2023-02-24,LLaMA: Open and Efficient Foundation Language Models,https://arxiv.org/abs/2302.13971,"Historical significance,Highly cited",Widely-used foundation model that has been adapted for others such as Alpaca.,65200000000.0,"Model card, table 1: https://github.com/facebookresearch/llama/blob/53011c3d7946dadb8274a4c5c7586ab54edf792d/MODEL_CARD.md",5.5e+23,"1.4e12 tokens * 6.52e10 parameters * 6 FLOP/token/parameter = 5.5e23 FLOP
Compared to 2048 A100 GPUs each with 311.84 TFLOPS maximum performance for 21 days, this implies 47% utilization.
https://www.wolframalpha.com/input?i=5.5*10%5E23+FLOP+%2F+%282048+*+311.84+teraFLOPS+*+21+days%29","CCNet,GitHub,Wikipedia,books,arXiv,Stack Exchange","The model was trained using the following source of data: CCNet [67%], C4 [15%], GitHub [4.5%], Wikipedia [4.5%], Books [4.5%], ArXiv [2.5%], Stack Exchange[2%]. The Wikipedia and Books domains include data in the following languages: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. See the paper for more details about the training set and corresponding preprocessing.",1340000000000.0,"Table 1 indicates that 1.4T tokens involved sampling sub-datasets at more or less than one epoch. Correcting for this:
(1.1 epochs * 3.3TB) + (1.06 epochs * 0.783TB) + ... = 1.4T tokens seen
5.24 epoch-TB = 1.4T tokens, i.e. roughly 267M tokens per GB
One pass over the full corpus at that density gives roughly 1 epoch = 1.34T tokens
","We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.",Confident,United States of America,Industry,500.0,"""When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days.""",NVIDIA A100,Open access (non-commercial),,2048.0,0.4746,,5708.0,"1023384 processor-hours on A100 GPUs. May 2023 cost rate is $1.36/GPU-hour on Azure ML cloud. https://azure.microsoft.com/en-us/pricing/details/machine-learning/
According to https://www.bls.gov/data/inflation_calculator.htm, $1.18 in May 2023 = $1.00 in January 2020.
$1391674 / 1.18 = $1179385 in 2020 USD.",,1.09,576476.4930991895,4000000.0,
BASIC-L + Lion,Vision,"Google,University of California Los Angeles (UCLA)","Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le",2023-02-13,Symbolic Discovery of Optimization Algorithms,https://arxiv.org/abs/2302.06675v4,SOTA improvement,"""On vision-language contrastive learning, we achieve 88.3% zero-shot and 91.1% fine-tuning accuracy on ImageNet, surpassing the previous best results by 2% and 0.1%, respectively""",3070000000.0,parameter count of original BASIC-L,,"This model is BASIC-L retrained with a different optimizer, Lion. Lion seems more compute-efficient, so we should expect compute to be less than BASIC-L.",,,,,"We present a method to formulate algorithm discovery as program search, and apply it to discover optimization algorithms for deep neural network training. We leverage efficient search techniques to explore an infinite and sparse program space. To bridge the large generalization gap between proxy and target tasks, we also introduce program selection and simplification strategies. Our method discovers a simple and effective optimization algorithm, Lion (EvoLved Sign Momentum). It is more memory-efficient than Adam as it only keeps track of the momentum. Different from adaptive optimizers, its update has the same magnitude for each parameter calculated through the sign operation. We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks. On image classification, Lion boosts the accuracy of ViT by up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT. On vision-language contrastive learning, we achieve 88.3% zero-shot and 91.1% fine-tuning accuracy on ImageNet, surpassing the previous best results by 2% and 0.1%, respectively. On diffusion models, Lion outperforms Adam by achieving a better FID score and reducing the training compute by up to 2.3x. For autoregressive, masked language modeling, and fine-tuning, Lion exhibits a similar or better performance compared to Adam. Our analysis of Lion reveals that its performance gain grows with the training batch size. It also requires a smaller learning rate than Adam due to the larger norm of the update produced by the sign function. Additionally, we examine the limitations of Lion and identify scenarios where its improvements are small or not statistically significant. Lion is also successfully deployed in production systems such as Google search ads CTR model.",Confident,"United States of America,United States of America","Industry,Academia",,,,,,,,,165.0,,,,,,
ViT-22B,Vision,Google,"Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, Neil Houlsby",2023-02-10,Scaling Vision Transformers to 22 Billion Parameters,https://arxiv.org/abs/2302.05442v1,SOTA improvement,"""The largest
ViT-22B sets the new SOTA on the challenging ObjectNet test set""",21743000000.0,"21.743B, Table 1",4.0001e+23,"""ViT-22B was trained using 256 visual tokens per image, where each token represents a
14 × 14 patch extracted from 224 × 224 sized images. ViT-22B is trained for 177k steps with batch size of 65k:
approximately 3 epochs""
""ViT-22B was trained on 1024 TPU V4 chips for 177K steps""
256 * 177k * 65k = 3T tokens
6 * 22B * 3T = 3.96e23 ~= 4e23
also, MFU was high:
""Using these techniques, ViT-22B processes 1.15k tokens per second per core during training (forward and
backward pass) on TPUv4 (Jouppi et al., 2020). ViT-22B’s model flops utilization (MFU) (Chowdhery et al.,
2022; Dehghani et al., 2021a) is 54.9%, indicating a very efficient use of the hardware.""
as a sanity check, 4e23 / (1024 * 275 teraFLOP/s (TPUv4 FLOP) * 0.55) = 2582644 seconds, or 30 days, which is a plausible length",JFT-4B,"""Dataset. ViT-22B is trained on a version of JFT (Sun et al., 2017), extended to around 4B images (Zhai et al.,
2022a). These images have been semi-automatically annotated with a class-hierarchy of 30k labels""",4000000000.0,"""Dataset. ViT-22B is trained on a version of JFT (Sun et al., 2017), extended to around 4B images (Zhai et al.,
2022a). These images have been semi-automatically annotated with a class-hierarchy of 30k labels""","The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for ""LLM-like"" scaling in vision, and provides key steps towards getting there.",Confident,United States of America,Industry,,,Google TPU v4,,,,,,310.0,,,2.9,285555.57016183576,,
ProteinDT,"Biology,Language","UC Berkeley,California Institute of Technology,University of Toronto,University of Wisconsin Madison,Texas A&M,NVIDIA,Mila - Quebec AI (originally Montreal Institute for Learning Algorithms)","Shengchao Liu, Yanjing Li, Zhuoxinran Li, Anthony Gitter, Yutao Zhu, Jiarui Lu, Zhao Xu, Weili Nie, Arvind Ramanathan, Chaowei Xiao, Jian Tang, Hongyu Guo, and Anima Anandkumar",2023-02-09,A Text-guided Protein Design Framework,https://arxiv.org/abs/2302.04611,SOTA improvement,"""Compared to six state-of-the-art protein sequence representation methods, ProteinDT can obtain consistently superior performance on four of six benchmark tasks.""",,,,,UniProtKB,They extract a subset of 441K protein-text pairs,,,"Current AI-assisted protein design mainly utilizes protein sequential and structural information. Meanwhile, there exists tremendous knowledge curated by humans in the text format describing proteins’ high-level functionalities. Yet, whether the incorporation of such text data can help protein design tasks has not been explored. To bridge this gap, we propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. ProteinDT consists of three subsequent steps: ProteinCLAP which aligns the representation of two modalities, a facilitator that generates the protein representation from the text modality, and a decoder that creates the protein sequences from the representation. To train ProteinDT, we construct a large dataset, SwissProtCLAP, with 441K text and protein pairs. We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy for text-guided protein generation; (2) best hit ratio on 10 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks.",Unknown,"United States of America,United States of America,Canada,United States of America,United States of America,United States of America,Canada","Academia,Academia,Academia,Academia,Academia,Industry,Academia",,,,Unreleased,,,,SciBERT,,,,,,,
Gen-1,Video,Runway,"Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, Anastasis Germanidis",2023-02-06,Structure and Content-Guided Video Synthesis with Diffusion Models,https://arxiv.org/abs/2302.03011,SOTA improvement,,,,,,,,,,,Unknown,United States of America,Industry,,,,,,,,,272.0,,,,,,
Flan T5-XXL + BLIP-2,"Multimodal,Language,Vision",Salesforce,"Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi",2023-01-30,BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models,"https://arxiv.org/abs/2301.12597, https://huggingface.co/Salesforce/blip2-flan-t5-xl",Highly cited,,12100000000.0,"12.1B, per Table 2.
only 108M trainable params (i.e. params trained during the BLIP process)",1.2e+21,"fine-tuned from Flan-T5 XXL (11B) and ViT-g
fine-tuning compute:
""using a single 16-A100(40G) machine, our largest model with
ViT-g and FlanT5-XXL requires less than 6 days for the first
stage and less than 3 days for the second stage.""
16 * 9 days * 24 * 3600 * 312 teraflops * 0.3 ~= 1.2e21","COCO,LAION-400M","""We use the same pre-training dataset as BLIP with 129M images in total, including COCO (Lin
et al., 2014), Visual Genome (Krishna et al., 2017), CC3M (Sharma et al., 2018), CC12M (Changpinyo et al.,
2021), SBU (Ordonez et al., 2011), and 115M images from the LAION400M dataset""",,,"The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.",Likely,United States of America,Industry,200.0,"""less than 6 days for the first
stage and less than 3 days for the second stage""
9*24 is 216, rounding down a bit is 200 hours",NVIDIA A100 SXM4 40 GB,Open source,"ViT-g is the other base model.
""using a single 16-A100(40G) machine, our largest model with
ViT-g and FlanT5-XXL requires less than 6 days for the first
stage and less than 3 days for the second stage.""
16 * 9 days * 24 * 3600 * 312 teraflops * 0.3 ~= 1.2e21",,,Flan-T5 11B,1778.0,,1.2e+21,,99690.24664120316,,
BLIP-2 (Q-Former),"Vision,Language",Salesforce Research,"Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi",2023-01-30,BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models,https://arxiv.org/abs/2301.12597,SOTA improvement,"""BLIP-2 achieves state-of-the-art performance on various vision-language tasks""",1480000000.0,"Q-Former has 188M params. The BLIP-2 system overall has ""54x fewer trainable parameters"" than Flamingo80B.",1.20000000001e+21,https://www.wolframalpha.com/input?i=312+teraFLOPS+*+16+*+200+hours+*+0.33,"COCO,LAION-400M,Conceptual Captions (CC3M),Conceptual Captions 12M (CC12M),VisualGenome,SBU","""We use the same pre-training dataset as
BLIP with 129M images in total, including COCO (Lin
et al., 2014), Visual Genome (Krishna et al., 2017),
CC3M (Sharma et al., 2018), CC12M (Changpinyo et al.,
2021), SBU (Ordonez et al., 2011), and 115M images from
the LAION400M dataset (Schuhmann et al., 2021).""",129000000.0,,"The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.",Confident,United States of America,Industry,200.0,"""For example, using
a single 16-A100(40G) machine, our largest model with
ViT-g and FlanT5-XXL requires less than 6 days for the first
stage and less than 3 days for the second stage.""
9 days = 216 hours",NVIDIA A100 SXM4 40 GB,Open source,,16.0,,,1778.0,,,,1960.8225382725807,,
DDPM-IP (CelebA),Image generation,Utrecht University,"Mang Ning, Enver Sangineto, Angelo Porrello, Simone Calderara, Rita Cucchiara",2023-01-27,Input Perturbation Reduces Exposure Bias in Diffusion Models,https://arxiv.org/abs/2301.11706v3,SOTA improvement,"""For instance, on CelebA 64×64, we achieve a new state-of-the-art FID score of 1.27, while saving 37.5% of the training time""",295000000.0,"295M for CelebA model, per Table 9",3.5e+20,"""We use Pytorch 1.8 (Paszke et al., 2019) and trained all the models on different NVIDIA Tesla V100s (16G memory). In
more detail, we use 2 GPUs to train the models on CIFAR10 for 2 days, and 4 GPUs to train the models on ImageNet 32×32
for 34 days. For LSUN tower 64×64, CelebA 64×64 and FFHQ 128×128, we used 16 GPUs to train the models for 3 days,
5 days and 4 days, respectively""
5*16 V100-days for CelebA.
5 * 16 * 24 * 3600 * 125 teraflops * 0.4 ~= 3.5e20",CelebA,,203000.0,,"Denoising Diffusion Probabilistic Models have shown an impressive generation quality, although their long sampling chain leads to high computational costs. In this paper, we observe that a long sampling chain also leads to an error accumulation phenomenon, which is similar to the exposure bias problem in autoregressive text generation. Specifically, we note that there is a discrepancy between training and testing, since the former is conditioned on the ground truth samples, while the latter is conditioned on the previously generated results. To alleviate this problem, we propose a very simple but effective training regularization, consisting in perturbing the ground truth samples to simulate the inference time prediction errors. We empirically show that, without affecting the recall and precision, the proposed input perturbation leads to a significant improvement in the sample quality while reducing both the training and the inference times. For instance, on CelebA 64×64, we achieve a new state-of-the-art FID score of 1.27, while saving 37.5% of the training time. The code is publicly available at this https URL",Likely,Netherlands,Academia,120.0,5 days,NVIDIA V100,,,,,,26.0,,,681.0,390.4861317667304,,
MusicLM,Audio,Google,"Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank",2023-01-26,MusicLM: Generating Music From Text,https://arxiv.org/abs/2301.11325,SOTA improvement,"""We demonstrate that our method outperforms baselines on MusicCaps, a hand-curated, high-quality
dataset of 5.5k music-text pairs prepared by musicians.""",860000000.0,"""We use decoder-only Transformers for modeling the semantic stage and the acoustic stages of AudioLM. The models
share the same architecture, composed of 24 layers, 16 attention heads, an embedding dimension of 1024, feed-forward
layers of dimensionality 4096, dropout of 0.1, and relative
positional embeddings (Raffel et al., 2020), resulting in
430M parameters per stage.""
""stage"" seems to mean semantic + acoustic, so 860M total",,,Free Music Archive,"""We train SoundStream and w2v-BERT on the Free Music
Archive (FMA) dataset (Defferrard et al., 2017), whereas
the tokenizers and the autoregressive models for the semantic and acoustic modeling stages are trained on a dataset containing five million audio clips, amounting to 280k hours of
music at 24 kHz. Each of the stages is trained with multiple passes over the training data""",,>280k hours,"We introduce MusicLM, a model generating high-fidelity music from text descriptions such as ""a calming violin melody backed by a distorted guitar riff"". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.",Confident,United States of America,Industry,,,,,also MuLan and SoundStream,,,W2v-BERT,220.0,,,,,,
Ankh_large,Biology,"Technical University of Munich,Columbia University","Ahmed Elnaggar, Hazem Essam, Wafaa Salah-Eldin, Walid Moustafa, Mohamed Elkerdawy, Charlotte Rochereau, Burkhard Rost",2023-01-16,Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling,https://arxiv.org/abs/2301.06568,SOTA improvement,"""On average, Ankh improved the PLM SOTA performance by 4.8%""",1900000000.0,"Figure 1 indicates 1.15B parameters, but both the Hugging Face model and a replication (https://huggingface.co/ElnaggarLab/ankh-large and https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1.full.pdf) indicate 1.9B parameters.
Notebook for counting params: https://colab.research.google.com/drive/1EGI5_vDl4pOBUukJexMHQR16BFKJe4a5?usp=sharing",6.5e+21,Table 9 from here: https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1.full.pdf,UniRef50,"""We build upon the same results by pre-training our baseline on UniRef50.""",14000000000.0,"Pretrained over UniRef50; 45M proteins and 14B amino acids, per Table 2
952B tokens from Table 9 at:
https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1
(This is total tokens over multiple epochs)","As opposed to scaling-up protein language models (PLMs), we seek improving performance via protein-specific optimization. Although the proportionality between the language model size and the richness of its learned representations is validated, we prioritize accessibility and pursue a path of data-efficient, cost-reduced, and knowledge-guided optimization. Through over twenty experiments ranging from masking, architecture, and pre-training data, we derive insights from protein-specific experimentation into building a model that interprets the language of life, optimally. We present Ankh, the first general-purpose PLM trained on Google’s TPU-v4 surpassing the state-of-the-art performance with fewer parameters (<10% for pre-training, <7% for inference, and <30% for the embedding dimension). We provide a representative range of structure and function benchmarks where Ankh excels. We further provide a protein variant generation analysis on High-N and One-N input data scales where Ankh succeeds in learning protein evolutionary conservation-mutation trends and introducing functional diversity while retaining key structural-functional characteristics. We dedicate our work to promoting accessibility to research innovation via attainable resources.",Confident,"Germany,United States of America","Academia,Academia",,,Google TPU v4,Open access (non-commercial),,,,,33.0,,,68.0,4802.398249072418,,
Ankh_base,Biology,"Technical University of Munich,Columbia University","Ahmed Elnaggar, Hazem Essam, Wafaa Salah-Eldin, Walid Moustafa, Mohamed Elkerdawy, Charlotte Rochereau, Burkhard Rost",2023-01-16,Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling,https://arxiv.org/abs/2301.06568,SOTA improvement,"""On average, Ankh improved the PLM SOTA performance by 4.8%""",740000000.0,"Figure 1 indicates 450M, but the model on huggingface https://huggingface.co/ElnaggarLab/ankh-base as well as Table 9 from https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1.full.pdf
each indicate 740M parameters.
Notebook for counting params: https://colab.research.google.com/drive/1EGI5_vDl4pOBUukJexMHQR16BFKJe4a5?usp=sharing",2.6e+21,Table 9 from here: https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1.full.pdf,UniRef50,"""We build upon the same results by pre-training our baseline on UniRef50.""",14000000000.0,"Pretrained over UniRef50; 45M proteins and 14B amino acids, per Table 2
952B tokens from Table 9 at:
https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1
(This is total tokens over multiple epochs)","As opposed to scaling-up protein language models (PLMs), we seek improving performance via protein-specific optimization. Although the proportionality between the language model size and the richness of its learned representations is validated, we prioritize accessibility and pursue a path of data-efficient, cost-reduced, and knowledge-guided optimization. Through over twenty experiments ranging from masking, architecture, and pre-training data, we derive insights from protein-specific experimentation into building a model that interprets the language of life, optimally. We present Ankh, the first general-purpose PLM trained on Google’s TPU-v4 surpassing the state-of-the-art performance with fewer parameters (<10% for pre-training, <7% for inference, and <30% for the embedding dimension). We provide a representative range of structure and function benchmarks where Ankh excels. We further provide a protein variant generation analysis on High-N and One-N input data scales where Ankh succeeds in learning protein evolutionary conservation-mutation trends and introducing functional diversity while retaining key structural-functional characteristics. We dedicate our work to promoting accessibility to research innovation via attainable resources.",Confident,"Germany,United States of America","Academia,Academia",,,Google TPU v4,Open access (non-commercial),,,,,33.0,,,68.0,1920.959299628968,,
VALL-E,"Audio,Speech",Microsoft,"Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei",2023-01-05,Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers,https://arxiv.org/abs/2301.02111,SOTA improvement,"""VALL-E significantly outperforms
the state-of-the-art zero-shot TTS system [Casanova et al., 2022b] in terms of speech naturalness and
speaker similarity, with +0.12 comparative mean option score (CMOS) and +0.93 similarity mean
option score (SMOS) improvement on LibriSpeech""",353000000.0,"""Both the AR model and the NAR model have the same transformer architecture with 12
layers, 16 attention heads, an embedding dimension of 1024, a feed-forward layer dimension of 4096, and a dropout of 0.1""
Ben's script says that's 353M parameters, using n_block 12, d_model 1024, d_ff 4096, encoder only False
https://github.com/bencottier/ml-parameter-count/blob/main/parameter_count.py",1.01e+19,"""The models are trained using 16 NVIDIA TESLA V100 32GB GPUs with a batch size of 6k acoustic
tokens per GPU for 800k steps""
353M * 800k * 6k * 6 = 1.01e19
16 V100s deliver 2080 teraFLOP/s (~2e15 FLOP/s), so 1e19 FLOP would take ~1.5 hours at 100% utilization or ~5 hours at 30%. Is that plausible?",LibriLight,"""60K hours of English speech with over 7000 unique speakers.""",820800000.0,"60k hours
~13,680 words/hour * 60,000 = 820800000 words
https://docs.google.com/document/d/1G3vvQkn4x_W71MKg0GmHVtzfd9m0y3_Ofcoew0v902Q/edit#heading=h.3pbt0hfgv7pq","We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find Vall-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis. See this https URL for demos of our work.",Speculative,United States of America,Industry,,,NVIDIA V100,Unreleased,,,,,258.0,,,,11.40575196486363,,
Hybrid H3-2.7B,Language,"Stanford University,University at Buffalo","Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré",2022-12-28,Hungry Hungry Hippos: Towards Language Modeling with State Space Models,https://arxiv.org/abs/2212.14052,SOTA improvement,Results table shows SOTA performance for some benchmarks,2700000000.0,,8.49e+20,,,,,,,,"United States of America,United States of America","Academia,Academia",,,,Open source,,,,,153.0,,,509.02,,,
CaLM,Biology,University of Oxford,Carlos Outeiral and Charlotte M. Deane,2022-12-19,Codon language embeddings provide strong signals for protein engineering,https://www.biorxiv.org/content/10.1101/2022.12.15.519894v1.full.pdf,SOTA improvement,"""We show that large language models trained on codons, instead of amino acid sequences, provide high-quality representations that outperform comparable state-of-the-art models across a variety of tasks. In some tasks, like species recognition, prediction of protein and transcript abundance, or melting point estimation, we show that a language model trained on codons outperforms every other published protein language model, including some that contain over 50 times more parameters"" [Abstract]",86000000.0,"""We trained a large language model with 86M parameters""",2.9e+19,"""4 NVIDIA Quadro RTX4000 GPUs for 40 days""
Calculation assuming FP32, utilization 30%:
= (40 * 24 * 3600) s * 7.1e12 FLOP/s * 0.3 * 4 GPU = 2.999808e+19
alternative calculation:
""Gradients were accumulated to an effective batch size of 1,000 examples, or approximately 256,000 tokens. ""
""(66,000 gradient steps, 14 full epochs)""
256000*66000*14*86000000*6=1.220567e+20",European Nucleotide Archive (ENA),"""The training set was constructed from the European Nucleotide Archive [39], with significant preprocessing to limit redundancy and save computational cost.""",2304000000.0,"""a dataset of 9M non-redundant and diverse cDNA sequences identified from whole-genome sequencing""
""Gradients were accumulated to an effective batch size of 1,000 examples, or approximately 256,000 tokens. ""
9000000*256000/1000=2304000000 tokens","Protein representations from deep language models have yielded state-of-the-art performance across many tasks in computational protein engineering. In recent years, progress has primarily focused on parameter count, with recent models’ capacities surpassing the size of the very datasets they were trained on. Here, we propose an alternative direction. We show that large language models trained on codons, instead of amino acid sequences, provide high-quality representations that outperform comparable state-of-the-art models across a variety of tasks.
In some tasks, like species recognition, prediction of protein and transcript abundance, or melting point estimation, we show that a language model trained on codons outperforms every other published protein language model, including some that contain over 50 times more parameters. These results suggest that, in addition to commonly studied scale and model complexity, the information content of biological data provides an orthogonal direction to improve the power of machine learning in biology.",Likely,United Kingdom of Great Britain and Northern Ireland,Academia,960.0,"""The model reported in this work was trained on 4 NVIDIA Quadro
RTX4000 GPUs for 40 days (66,000 gradient steps, 14 full epochs)""",NVIDIA Quadro RTX 4000,,,4.0,,,6.0,,,14.0,,,
RT-1,Robotics,Google,"Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, Deeksha Manjunath, Igor Mordatch, Ofir Nachum, Carolina Parada, Jodilyn Peralta, Emily Perez, Karl Pertsch, Jornell Quiambao, Kanishka Rao, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Kevin Sayed, Jaspiar Singh, Sumedh Sontakke, Austin Stone, Clayton Tan, Huong Tran, Vincent Vanhoucke, Steve Vega, Quan Vuong, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, Brianna Zitkovich",2022-12-13,RT-1: Robotics Transformer for Real-World Control at Scale,https://arxiv.org/abs/2212.06817,SOTA improvement,"""Across each category, we find that RT-1 outperforms the prior
models significantly. On seen tasks, RT-1 is able to perform 97% of the more than 200 instructions successfully, which is 25% more than BC-Z and 32% more than Gato. On unseen tasks, RT-1
shows it is capable of generalizing to novel instructions, performing 76% of the never-before-seen
instructions, 24% more than the next best baseline""",35000000.0,"""we also limit the size of the model compared to
the original publication, which was 1.2B parameters (resulting in on robot inference time of 1.9s),
to be of similar size to RT-1 (37M parameters for Gato vs. 35M for RT-1""
16M params for image tokenizer, 19M for the transformer",,,RT-1,"""We utilize a dataset that we gathered over the course of 17 months with a fleet of 13 robots, containing
∼130k episodes and over 700 tasks""
Episode is an example of robot following instructions",,,"By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse, robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at this http URL",Confident,United States of America,Industry,,,,Open source,,,,,446.0,,,,,,
TranceptEve,Biology,"University of Oxford,Harvard University","Pascal Notin, Lood Van Niekerk, Aaron W Kollasch, Daniel Ritter, Yarin Gal, Debora S. Marks",2022-12-10,TranceptEVE: Combining Family-specific and Family-agnostic Models of Protein Sequences for Improved Fitness Prediction,https://www.biorxiv.org/content/10.1101/2022.12.07.519495v1,SOTA improvement,"""Besides its broader application scope, it achieves state-of-the-art performance for mutation effects prediction, both in terms of correlation with experimental assays and with clinical annotations from ClinVar.""",,,,,,,,,"Modeling the fitness landscape of protein sequences has historically relied on training models on family-specific sets of homologous sequences called Multiple Sequence Alignments. Many proteins are however difficult to align or have shallow alignments which limits the potential scope of alignment-based methods. Not subject to these limitations, large protein language models trained on non-aligned sequences across protein families have achieved increasingly high predictive performance – but have not yet fully bridged the gap with their alignment-based counterparts. In this work, we introduce TranceptEVE – a hybrid method between family-specific and family-agnostic models that seeks to build on the relative strengths from each approach. Our method gracefully adapts to the depth of the alignment, fully relying on its autoregressive transformer when dealing with shallow alignments and leaning more heavily on the family-specific models for proteins with deeper alignments. Besides its broader application scope, it achieves state-of-the-art performance for mutation effects prediction, both in terms of correlation with experimental assays and with clinical annotations from ClinVar.",Unknown,"United Kingdom of Great Britain and Northern Ireland,United States of America","Academia,Academia",,,,,,,,Tranception,,,,,,,
DeepNash,Games,DeepMind,"Julien Perolat, Bart de Vylder, Daniel Hennes, Eugene Tarassov, Florian Strub, Vincent de Boer, Paul Muller, Jerome T. Connor, Neil Burch, Thomas Anthony, Stephen McAleer, Romuald Elie, Sarah H. Cen, Zhe Wang, Audrunas Gruslys, Aleksandra Malysheva, Mina Khan, Sherjil Ozair, Finbarr Timbers, Toby Pohlen, Tom Eccles, Mark Rowland, Marc Lanctot, Jean-Baptiste Lespiau, Bilal Piot, Shayegan Omidshafiei, Edward Lockhart, Laurent Sifre, Nathalie Beauguerlange, Remi Munos, David Silver, Satinder Singh, Demis Hassabis, Karl Tuyls",2022-12-01,Mastering the game of Stratego with model-free multiagent reinforcement learning,https://www.science.org/stoken/author-tokens/ST-887/full,SOTA improvement,"DeepNash beat existing state-of-the-art AI methods in Stratego and achieved a year-to-date (2022) and all-time top-three ranking on the Gravon games platform, competing with human expert players.",,,,"""The final agent was trained using 768 MXU’s (matrix multiplication unit) for Learners and
256 MXU’s for Actors (using 256 TPU’s in total).""
Some more details in Table S1 (in supplementary materials)",,,,"768 * 7.21M trajectories? (Table S1)
768 * 7.21M = 5,537,280,000
https://www.science.org/doi/suppl/10.1126/science.add4679/suppl_file/science.add4679_sm.pdf","We introduce DeepNash, an autonomous agent that plays the imperfect information game Stratego at a human expert level. Stratego is one of the few iconic board games that artificial intelligence (AI) has not yet mastered. It is a game characterized by a twin challenge: It requires long-term strategic thinking as in chess, but it also requires dealing with imperfect information as in poker. The technique underpinning DeepNash uses a game-theoretic, model-free deep reinforcement learning method, without search, that learns to master Stratego through self-play from scratch. DeepNash beat existing state-of-the-art AI methods in Stratego and achieved a year-to-date (2022) and all-time top-three ranking on the Gravon games platform, competing with human expert players.",Unknown,United Kingdom of Great Britain and Northern Ireland,Industry,,,,,,,,,115.0,,,,,,
ChatGPT (gpt-3.5-turbo),Language,OpenAI,,2022-11-30,,,"Historical significance,Significant use",https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/,20000000000.0,20B parameters according to Table 1 in Microsoft's CODEFUSION paper: https://arxiv.org/pdf/2310.17680.pdf,,,,,,,,Speculative,United States of America,Industry,,,,,,,,,,,,,,,
GPT-3.5 (text-davinci-003),Language,OpenAI,,2022-11-28,,https://platform.openai.com/docs/models/gpt-3-5,"Historical significance,Significant use,SOTA improvement,Training cost",,,"Parameter count may be 175B based on OpenAI's statements that text-davinci-003 is in the GPT-3.5 series of models. It was also stated to be 175B in the Microsoft CODEFUSION paper, but the paper was reportedly retracted because the authors did not know the parameter count.",2.578e+24,https://colab.research.google.com/drive/1QSxa8YCWjEBQU7mrXLhw6TP1VX5oqgdW#scrollTo=Gt6Z6oZ26clI,,,,,,Speculative,United States of America,Industry,,,NVIDIA A100 SXM4 40 GB,API access,,,,,,,,,4625550.747400068,,
DiT-XL/2 + Discriminator Guidance,Image generation,"Korea Advanced Institute of Science and Technology (KAIST),NAVER","Dongjun Kim, Yeongmin Kim, Se Jung Kwon, Wanmo Kang, Il-Chul Moon",2022-11-28,Refining Generative Process with Discriminator Guidance in Score-based Diffusion Models,https://arxiv.org/abs/2211.17091v4,SOTA improvement,"""Using our algorithm, we achive state-of-the-art results on ImageNet 256x256 with FID 1.83 and recall 0.64, similar to the validation data's FID (1.68) and recall (0.66)""",,,,"This is a finetune of DiT-XL/2, so its compute won't be much higher.",,,,,"The proposed method, Discriminator Guidance, aims to improve sample generation of pre-trained diffusion models. The approach introduces a discriminator that gives explicit supervision to a denoising sample path whether it is realistic or not. Unlike GANs, our approach does not require joint training of score and discriminator networks. Instead, we train the discriminator after score training, making discriminator training stable and fast to converge. In sample generation, we add an auxiliary term to the pre-trained score to deceive the discriminator. This term corrects the model score to the data score at the optimal discriminator, which implies that the discriminator helps better score estimation in a complementary way. Using our algorithm, we achive state-of-the-art results on ImageNet 256x256 with FID 1.83 and recall 0.64, similar to the validation data's FID (1.68) and recall (0.66). We release the code at this https URL.",Unknown,"Korea (Republic of),Korea (Republic of)","Academia,Industry",,,NVIDIA A100,,,,,DiT-XL/2,49.0,,,7.0,,,
Discriminator Guidance,Image generation,"Korea Advanced Institute of Science and Technology (KAIST),NAVER","Dongjun Kim, Yeongmin Kim, Se Jung Kwon, Wanmo Kang, Il-Chul Moon",2022-11-28,Refining Generative Process with Discriminator Guidance in Score-based Diffusion Models,https://arxiv.org/abs/2211.17091v4,SOTA improvement,"""Using our algorithm, we achive state-of-the-art results on ImageNet 256x256 with FID 1.83 and recall 0.64, similar to the validation data's FID (1.68) and recall (0.66).""
https://paperswithcode.com/paper/refining-generative-process-with",,,2.1570000001e+20,481 hours * 312 TFLOPS (A100) * 40% utilization,,,,,"The proposed method, Discriminator Guidance, aims to improve sample generation of pre-trained diffusion models. The approach introduces a discriminator that gives explicit supervision to a denoising sample path whether it is realistic or not. Unlike GANs, our approach does not require joint training of score and discriminator networks. Instead, we train the discriminator after score training, making discriminator training stable and fast to converge. In sample generation, we add an auxiliary term to the pre-trained score to deceive the discriminator. This term corrects the model score to the data score at the optimal discriminator, which implies that the discriminator helps better score estimation in a complementary way. Using our algorithm, we achive state-of-the-art results on ImageNet 256x256 with FID 1.83 and recall 0.64, similar to the validation data's FID (1.68) and recall (0.66).",Confident,"Korea (Republic of),Korea (Republic of)","Academia,Industry",481.0,Table 6,NVIDIA A100 PCIe,,,,,,49.0,,,,337.8811363173893,,
ALM 1.0,Language,Beijing Academy of Artificial Intelligence / BAAI,,2022-11-28,ALM 1.0,https://github.com/FlagAI-Open/FlagAI/blob/master/examples/ALM/README.md,SOTA improvement,SOTA results on Arabic-language benchmark ALUE.,335000000.0,335M parameters: https://github.com/FlagAI-Open/FlagAI/blob/master/examples/ALM/README.md,,,,,,,,Speculative,China,Academia,,,,,,,,,0.0,,,,,,
CICERO,Games,Meta AI,"Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath, Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sasha Mitts, Adithya Renduchintala, Stephen Roller, Dirk Rowe, Weiyan Shi, Joe Spisak, Alexander Wei, David Wu, Hugh Zhang, Markus Zijlstra",2022-11-22,Human-level play in the game of Diplomacy by combining language models with strategic reasoning,https://www.science.org/doi/10.1126/science.ade9097,SOTA improvement,"""We introduce Cicero, the first AI agent to achieve human-level performance in Diplomacy""",,"""We took R2C2 (22) as our base model – a 2.7B parameter Transformer-based (23) encoder-decoder model pre-trained on text from the Internet using a BART de-noising objective (24).""",,,WebDiplomacy,,,"""We obtained a dataset of 125,261 games of Diplomacy played online at webDiplomacy.net. Of these, 40,408 games contained dialogue, with a total of 12,901,662 messages exchanged between players. Player accounts were de-identified and automated redaction of personally identifiable information (PII) was performed by webDiplomacy. We refer to this dataset hereafter as WebDiplomacy .""",,Unknown,United States of America,Industry,,,,Open access (non-commercial),,,,,214.0,,,,,,
AR-LDM,Image generation,"Alibaba,University of Waterloo,Vector Institute","Xichen Pan, Pengda Qin, Yuhong Li, Hui Xue, Wenhu Chen",2022-11-20,Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models,https://arxiv.org/abs/2211.10950,SOTA improvement,"The first latent diffusion model for coherent visual story synthesizing.
""Quantitative results show that AR-LDM achieves SoTA FID scores on PororoSV, FlintstonesSV, and the newly introduced challenging dataset VIST containing natural images""",1500000000.0,Table 1,5.1e+20,8 NVIDIA A100 GPUs for 8 days,,,,"PororoSV, FlintstonesSV and VIST. All storytelling datasets, sizes would be possible to look up.","Conditioned diffusion models have demonstrated state-of-the-art text-to-image synthesis capacity. Recently, most works focus on synthesizing independent images; While for real-world applications, it is common and necessary to generate a series of coherent images for story-stelling. In this work, we mainly focus on story visualization and continuation tasks and propose AR-LDM, a latent diffusion model auto-regressively conditioned on history captions and generated images. Moreover, AR-LDM can generalize to new characters through adaptation. To our best knowledge, this is the first work successfully leveraging diffusion models for coherent visual story synthesizing. Quantitative results show that AR-LDM achieves SoTA FID scores on PororoSV, FlintstonesSV, and the newly introduced challenging dataset VIST containing natural images. Large-scale human evaluations show that AR-LDM has superior performance in terms of quality, relevance, and consistency.",Confident,"China,Canada,Canada","Industry,Academia,Academia",194.0,8 NVIDIA A100 GPUs for 8 days,NVIDIA A100,Unreleased,,,,,31.0,,,50.0,745.8360575556148,,
Fusion in Encoder,Language,Samsung,"Akhil Kedia, Mohd Abbas Zaidi, Haejun Lee",2022-11-18,FiE: Building a Global Probability Space by Leveraging Early Fusion in Encoder for Open-Domain Question Answering,https://arxiv.org/abs/2211.10147,SOTA improvement,"""Using our proposed method, we outperform the current state-of-the-art method by 2.5 Exact Match score on the Natural Question dataset while using only 25% of parameters and 35% of the latency during inference, and 4.4 Exact Match on WebQuestions dataset""",330000000.0,330M,1.3e+20,"""The experiments were run on 8x80GB Nvidia A100s with 800GB RAM and 4x32-core CPUs, and each experiment took around 1 day for NQ and 2 days for TriviaQA with large models. Inference was run on the same system, and took 2 minutes.""
2 days * 24 * 3600 * 8 * 312 teraflop/s * 0.3 utilization = 1.3e20",TriviaQA,,,79k per table 11 (probably number of question-answer pairs),"Generative models have recently started to outperform extractive models in Open Domain Question Answering, largely by leveraging their decoder to attend over multiple encoded passages and combining their information. However, generative models tend to be larger than extractive models due to the need for a decoder, run slower during inference due to auto-regressive decoder beam search, and their generated output often suffers from hallucinations. We propose to extend transformer encoders with the ability to fuse information from multiple passages, using global representation to provide cross-sample attention over all tokens across samples. Furthermore, we propose an alternative answer span probability calculation to better aggregate answer scores in the global space of all samples. Using our proposed method, we outperform the current state-of-the-art method by 2.5 Exact Match score on the Natural Question dataset while using only 25% of parameters and 35% of the latency during inference, and 4.4 Exact Match on WebQuestions dataset. When coupled with synthetic data augmentation, we outperform larger models on the TriviaQA dataset as well. The latency and parameter savings of our method make it particularly attractive for open-domain question answering, as these models are often compute-intensive.",Likely,Korea (Republic of),Industry,48.0,2 days,NVIDIA A100 SXM4 80 GB,,,,,,5.0,,,,233.0630322095263,,
Galactica,"Language,Biology",Meta AI,"Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, Robert Stojnic",2022-11-16,Galactica: A Large Language Model for Science,https://arxiv.org/abs/2211.09085,SOTA improvement,"""We outperform existing models on a range of scientific tasks. On technical knowledge probes such as LaTeX equations, Galactica outperforms the latest GPT-3 by 68.2% versus 49.0%. Galactica also performs well on reasoning, outperforming Chinchilla on mathematical MMLU by 41.3% to 35.7%, and PaLM 540B on MATH""",120000000000.0,"""The largest 120B model we train runs on a single NVIDIA A100 node""",3.24e+23,"Authors state the model is trained on 450b tokens. Using 6 FLOP/token/parameter, this is 6*120b*450b = 3.24e23",Galactica Corpus,"""Our corpus consists of 106 billion tokens from papers, reference material, encyclopedias and other scientific sources. We combine natural language sources, such as papers and textbooks, and natural sequences, such as protein sequences and chemical formulae. We process LATEX where we can capture it, and also include
academic code to capture computational science""",106000000000.0,"""Total dataset size = 106 billion tokens""","Information overload is a major obstacle to scientific progress. The explosive growth in scientific literature and data has made it ever harder to discover useful insights in a large mass of information. Today scientific knowledge is accessed through search engines, but they are unable to organize scientific knowledge alone. In this paper we introduce Galactica: a large language model that can store, combine and reason about scientific knowledge. We train on a large scientific corpus of papers, reference material, knowledge bases and many other sources. We outperform existing models on a range of scientific tasks. On technical knowledge probes such as LaTeX equations, Galactica outperforms the latest GPT-3 by 68.2% versus 49.0%. Galactica also performs well on reasoning, outperforming Chinchilla on mathematical MMLU by 41.3% to 35.7%, and PaLM 540B on MATH with a score of 20.4% versus 8.8%. It also sets a new state-of-the-art on downstream tasks such as PubMedQA and MedMCQA dev of 77.6% and 52.9%. And despite not being trained on a general corpus, Galactica outperforms BLOOM and OPT-175B on BIG-bench. We believe these results demonstrate the potential for language models as a new interface for science. We open source the model for the benefit of the scientific community.",Likely,United States of America,Industry,,,NVIDIA A100 SXM4 80 GB,Open access (non-commercial),,128.0,,,458.0,,,4.0,591076.8943544837,2000000.0,"Table 1: batch size 2M, warmup 1.1B (out of 450B tokens)"
EVA-01,Vision,"Beijing Academy of Artificial Intelligence / BAAI,Huazhong University of Science and Technology,Zhejiang University,Beijing Institute of Technology","Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao",2022-11-14,EVA: Exploring the Limits of Masked Visual Representation Learning at Scale,https://arxiv.org/abs/2211.07636,SOTA improvement,"from abstract 'Via this pretext task, we can efficiently scale up EVA to one billion parameters, and sets new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation without heavy supervised training.'",1011000000.0,1011M from table 3,3.7509433344e+21,"flops = (128) * (77.97 * 10**12) * (14.5 * 24 * 3600) * (0.3) = 3.75e21
(num gpu) * (peak flops) * (time in seconds) * (assumed utilization rate)
from Table 3, time and num gpus, GPU model is on page 4 (A100), precision is fp16","ImageNet21k,COCO,Conceptual Captions 12M (CC12M),Conceptual Captions (CC3M)","from table 3 : ImageNet-21K, CC12M, CC3M, Object365, COCO, ADE",29600000.0,from table 3: 29.6M images,"We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and sets new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation without heavy supervised training. Moreover, we observe quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on LVISv1.0 dataset with over a thousand categories and COCO dataset with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training from scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research,
we release all the code and billion-scale model.",Confident,"China,China,China,China","Academia,Academia,Academia,Academia",348.0,from Table 3 14.5 days = 348 hours,NVIDIA A100 SXM4 40 GB,Open source,,128.0,,,353.0,,,150.0,29374.46909439904,,
AltCLIP,"Multimodal,Language,Vision,Image generation",Beijing Academy of Artificial Intelligence / BAAI,"Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, Ledell Wu",2022-11-12,AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities,https://arxiv.org/abs/2211.06679,SOTA improvement,"""We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flicker30kCN, COCO-CN and XTD""",,,,,,,,,"In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal representation model. Starting from the pre-trained multimodal representation model CLIP released by OpenAI, we altered its text encoder with a pre-trained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flicker30k-CN, COCO-CN and XTD. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding. Our models and code are available at this https URL.",Unknown,China,Academia,,,,Open source,,,,CLIP (ViT L/14@336px),46.0,,,10.0,,,
InternImage,Vision,"Shanghai AI Lab,Tsinghua University,Nanjing University,SenseTime,Chinese University of Hong Kong (CUHK)","Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu Qiao",2022-11-10,InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions,https://arxiv.org/abs/2211.05778,SOTA improvement,"""InternImage-H achieved a new record 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs""",1080000000.0,"1.08B, table 1",8.3174839e+19,"6ND = 6*1080000000*(427000000*30+1281167*20)=8.3174839e+19
to pre-train InternImage-H on a 427 million joint dataset of public Laion-400M [61], YFCC-15M [62], and CC12M [63] for 30 epochs, and then we fine-tune the model on ImageNet-1K for 20 epochs.","LAION-400M,Conceptual Captions 12M (CC12M),ImageNet-1k","""To further explore the capability of our model and match the large-scale private data used in previous methods [16, 20, 59], we adopt M3I
Pre-training [60], a unified pre-training approach available
for both unlabeled and weakly-labeled data, to pre-train
InternImage-H on a 427 million joint dataset of public
Laion-400M [61], YFCC-15M [62], and CC12M [63] for
30 epochs, and then we fine-tune the model on ImageNet1K for 20 epochs.""",427000000.0,427M images,"Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs. Different from the recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also has the adaptive spatial aggregation conditioned by input and task information. As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs. The effectiveness of our model is proven on challenging benchmarks including ImageNet, COCO, and ADE20K. It is worth mentioning that InternImage-H achieved a new record 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs. The code will be released at this https URL.",Confident,"China,China,China,Hong Kong,Hong Kong","Academia,Academia,Academia,Industry,Academia",,,,Open source,,,,,313.0,,,30.0,,,
BLOOM-176B,Language,"Hugging Face,BigScience","Margaret Mitchell, Giada Pistilli, Yacine Jernite, Ezinwanne Ozoani, Marissa Gerchick, Nazneen Rajani, Sasha Luccioni, Irene Solaiman, Maraim Masoud, Somaieh Nikpoor, Carlos Muñoz Ferrandis, Stas Bekman, Christopher Akiki, Danish Contractor, David Lansky, Angelina McMillan-Major, Tristan Thrush, Suzana Ilić, Gérard Dupont, Shayne Longpre, Manan Dey, Stella Biderman, Douwe Kiela, Emi Baylor, Teven Le Scao, Aaron Gokaslan, Julien Launay, Niklas Muennighoff",2022-11-08,BLOOM: A 176B-Parameter Open-Access Multilingual Language Model,https://arxiv.org/abs/2211.05100,"Historical significance,Highly cited","Was the largest open-source model at the time. 1000+ researchers, many from important orgs such as Microsoft and NVIDIA.
https://huggingface.co/bigscience/bloom",176247271424.0,"See ""Technical Specifications"" on Hugging Face:
https://huggingface.co/bigscience/bloom",5.7700000000001e+23,"https://towardsdatascience.com/run-bloom-the-largest-open-access-ai-model-on-your-desktop-computer-f48e1e2a9a32
384 A100 GPUs * 150 TFLOPS throughput per GPU * 116 days = 5.77e+23 FLOP
https://www.wolframalpha.com/input?i=384+*+150+TFLOPS+*+116+days",BigScience ROOTS Corpus,"In total, 1.6 terabytes of pre-processed text was converted into 350 billion unique tokens, forming BLOOM's training dataset.
arXiv:2210.15424
""BLOOM was trained on the ROOTS corpus (Lauren¸con et al., 2022), a composite collection
of 498 Hugging Face datasets (Lhoest et al., 2021) amounting to 1.61 terabytes of text that
span 46 natural languages and 13 programming languages. A high-level overview of this
dataset can be seen in Figure 3, while a detailed itemized list of every language along with
its linguistic genus, family and macroarea is presented in Table 1""",262500000000.0,350B tokens ~= 262B words,,Confident,"Multinational,Multinational","Industry,Research collective",2808.0,117 days * 24 hours/day,NVIDIA A100 SXM4 80 GB,Open access (restricted use),,384.0,0.4808,,1517.0,,,1.0,901068.6599742996,4194304.0,Table 3: global batch size 2048 sequences * sequence length 2048 tokens = 4194304 tokens
mT0-13B,Language,"Hugging Face,BigScience","Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, Colin Raffel",2022-11-03,Crosslingual Generalization through Multitask Finetuning,"https://arxiv.org/abs/2211.01786, https://huggingface.co/bigscience/bloomz",SOTA improvement,"""Finetuning on multilingual tasks with English prompts further improves performance on English and non-English tasks leading to various state-of-the-art zero-shot results.""
Table 1",13000000000.0,13B,,"fine-tuned from mT5
1.01e21 fine-tune compute (13B tokens * 13B params * 6)",xP3,"""In addition, we introduce xP3, a composite of supervised datasets in 46 languages with English and machine-translated prompts.
https://huggingface.co/datasets/bigscience/xP3",20000000000.0,"per https://huggingface.co/datasets/bigscience/xP3, 94,941,936 KB or 94GB
if approx 200M words per GB, that's ~20B words (rougher estimate because it's multilingual)
https://docs.google.com/document/d/1G3vvQkn4x_W71MKg0GmHVtzfd9m0y3_Ofcoew0v902Q/edit#heading=h.ieihc08p8dn0","Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting, but so far explorations of MTF have focused on English data and models. We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0. We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages that appear only in the pretraining corpus. Finetuning on multilingual tasks with English prompts further improves performance on English and non-English tasks leading to various state-of-the-art zero-shot results. We also investigate finetuning on multilingual tasks with prompts that have been machine-translated from English to match the language of each dataset. We find training on these machine-translated prompts leads to better performance on human-written prompts in the respective languages. Surprisingly, we find models are capable of zero-shot generalization to tasks in languages they have never intentionally seen. We conjecture that the models are learning higher-level capabilities that are both task- and language-agnostic. In addition, we introduce xP3, a composite of supervised datasets in 46 languages with English and machine-translated prompts. Our code, datasets and models are freely available at this https URL.",Confident,"Multinational,Multinational","Industry,Research collective",,,,Open source,"""We finetune the models for an additional 13 billion tokens with loss only being computed on target tokens...
For finetuning mT5, we follow the same procedure as described above for BLOOM, except that inputs are fed into the encoder and thus are not space-separated from targets.""
13B * 13B * 6 = 1.01e21",,,mT5-XXL,242.0,,1.01e+21,,,,
Mogrifier RLSTM (WT2),Language,"DeepMind,University College London (UCL)",Gábor Melis,2022-11-03,Circling Back to Recurrent Models of Language,https://arxiv.org/abs/2211.01848,SOTA improvement,"""On top of these improvements, the RLSTM
outperformed the LSTM by a small margin, and we established a new state of the art on both datasets""",35000000.0,,1.09e+17,,,,,,,,"United Kingdom of Great Britain and Northern Ireland,United Kingdom of Great Britain and Northern Ireland","Industry,Academia",,,,Unreleased,,,,,0.0,,,250.0,,,
BLOOMZ-176B,Language,Hugging Face,"Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, Colin Raffel",2022-11-03,Crosslingual Generalization through Multitask Finetuning,"https://arxiv.org/abs/2211.01786, https://huggingface.co/bigscience/bloomz",SOTA improvement,"""Finetuning on multilingual tasks with English prompts further improves performance on English and non-English tasks leading to various state-of-the-art zero-shot results.""
Table 1",176000000000.0,176B,,"fine-tuned from BLOOM-176B
1.37e22 fine-tune compute",xP3,"""In addition, we introduce xP3, a composite of supervised datasets in 46 languages with English and machine-translated
prompts.
https://huggingface.co/datasets/bigscience/xP3",20000000000.0,"per https://huggingface.co/datasets/bigscience/xP3, 94,941,936 KB or 94GB
if approx 200M words per GB, that's ~20B words (rougher estimate because it's multilingual)
https://docs.google.com/document/d/1G3vvQkn4x_W71MKg0GmHVtzfd9m0y3_Ofcoew0v902Q/edit#heading=h.ieihc08p8dn0","Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting, but so far explorations of MTF have focused on English data and models. We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0. We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages that appear only in the pretraining corpus. Finetuning on multilingual tasks with English prompts further improves performance on English and non-English tasks leading to various state-of-the-art zero-shot results. We also investigate finetuning on multilingual tasks with prompts that have been machine-translated from English to match the language of each dataset. We find training on these machine-translated prompts leads to better performance on human-written prompts in the respective languages. Surprisingly, we find models are capable of zero-shot generalization to tasks in languages they have never intentionally seen. We conjecture that the models are learning higher-level capabilities that are both task- and language-agnostic. In addition, we introduce xP3, a composite of supervised datasets in 46 languages with English and machine-translated prompts. Our code, datasets and models are freely available at this https URL.",Likely,Multinational,Industry,,,,Open source,"""We use publicly available pretrained BLOOM models ranging from 560 million to 176 billion parameters. BLOOM models are large decoder-only language models pretrained for around 350 billion tokens with an architecture similar to GPT-3
(Brown et al., 2020). We finetune the models for an additional 13 billion tokens with loss only being
computed on target tokens.""
13B * 176B * 6",,,BLOOM-176B,242.0,,1.3728e+22,,,,
eDiff-I,Image generation,NVIDIA,"Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, Ming-Yu Liu",2022-11-02,eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers,https://arxiv.org/abs/2211.01324,SOTA improvement,"SOTA zero-shot FID on COCO 2014, Table 1
May see significant use via NVIDIA Picasso: https://www.nvidia.com/en-us/gpu-cloud/picasso/",9100000000.0,"9.1B for config D, Table 1",5.46e+19,"6ND = 6*9100000000*1000000000 = 5.46e+19 (confidence: likely; might change due to multiple epochs or dataset division)
""The base model was trained using 256 NVIDIA A100 GPUs, while the two super-resolution models were trained with 128 NVIDIA A100 GPUs each""
no info on duration",,"""We use a collection of public and proprietary datasets to train our model. To ensure high-quality training data, we apply heavy filtering using a pretrained CLIP model to measure the image-text alignment score as well as an aesthetic scorer to rank the image quality""",1000000000.0,"""The final dataset to train our model contains about one billion text-image pairs""","Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiff-I's ""paint-with-words"" capability. A user can select the word in the input text and paint it in a canvas to control the output, which is very handy for crafting the desired image in mind. The project page is available at this https URL",Likely,United States of America,Industry,,,NVIDIA A100,API access,,,,,511.0,,,,,,
Taiyi-Stable Diffusion,Image generation,IDEA CCNL,,2022-10-31,,https://huggingface.co/IDEA-CCNL/Taiyi-Stable-Diffusion-1B-Chinese-v0.1,Historical significance,"The first open-source, Chinese version of Stable Diffusion.",1000000000.0,,5.1e+22,"Fine-tuning: 32 NVIDIA A100 GPUs for 100 hours
32 * 312e12 * 30% * 100 * 60 * 60 = 1.078272e+21 FLOP
Base model: Stable Diffusion, 5e+22 FLOP",,,,,,Likely,China,Academia,100.0,32 NVIDIA A100 GPUs for 100 hours,NVIDIA A100,Open access (restricted use),,32.0,,Stable Diffusion (LDM-KL-8-G),0.0,,,,113638.30314103366,,
EnCodec,Audio,Meta AI,"Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi",2022-10-24,High Fidelity Neural Audio Compression,"https://arxiv.org/abs/2210.13438, ",SOTA improvement,""" Finally, our best model, EnCodec, reaches state-of-the-art scores for speech and for
music at 1.5, 3, 6, 12 kbps at 24 kHz, and at 6, 12, and 24 kbps for 48 kHz with stereo channels.""",,,,"""We train all models for 300 epochs, with one epoch being 2,000 updates with the Adam optimizer with a batch size of 64 examples of 1 second each, a learning rate of 3e-4, β1 = 0.5, and β2 = 0.9. All the models are trained using 8 A100 GPUs""","DNS,Common Voice,AudioSet,FSD50K,Jamendo","""We train EnCodec on 24 kHz monophonic across diverse domains, namely: speech, noisy speech, music and
general audio while we train the fullband stereo EnCodec on only 48 kHz music. For speech, we use the clean speech segments from DNS Challenge 4 (Dubey et al., 2022) and the Common Voice dataset (Ardila et al., 2019).
For general audio, we use on AudioSet (Gemmeke et al., 2017) together with FSD50K (Fonseca et al., 2021).
For music, we rely on the Jamendo dataset (Bogdanov et al., 2019) for training and evaluation and we further
evaluate our models on music using a proprietary music dataset.""",,"~17k hours total, per Table A.1","We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed-up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produce high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model including: training objective, architectural changes and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baselines methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and models are available at this http URL.",Unknown,United States of America,Industry,,,NVIDIA A100,Open access (non-commercial),,,,,217.0,,,300.0,,,
U-PaLM (540B),Language,Google,"Yi Tay, Jason Wei, Hyung Won Chung, Vinh Q. Tran, David R. So, Siamak Shakeri, Xavier Garcia, Huaixiu Steven Zheng, Jinfeng Rao, Aakanksha Chowdhery, Denny Zhou, Donald Metzler, Slav Petrov, Neil Houlsby, Quoc V. Le, Mostafa Dehghani",2022-10-20,Transcending Scaling Laws with 0.1% Extra Compute,https://arxiv.org/abs/2210.11399,SOTA improvement,"""We show that U-PaLM 540B outperforms PaLM 540B on 21 out of 26 tasks. Given that PaLM is
the SOTA language model on these tasks, this makes U-PaLM the new state-of-the-art on these tasks.""
performance improvement equivalent to 2x training efficiency: ""Impressively, at 540B scale, we show an approximately 2x computational savings rate where U-PaLM achieves the same performance as the final PaLM 540B model at around half its computational budget """,540000000000.0,,2.53e+24,"""The total number of extra tokens we train on for the 540B
model is approximately 1.3 Billion which constitutes 0.16% extra computation... Training an U-PaLM 540B model only consumes 512 TPUv4 chips and finishes in about 5 days which is considered to be lightweight.""
original PaLM was 2.527e+24. adding 0.16% is ~2.53e24",,"""To keep things consistent, we train this model with the same data mixture as PaLM and do not rely on
additional sources of data (labeled or unlabeled).""",,,"Scaling language models improves performance but comes with significant computational costs. This paper proposes UL2R, a method that substantially improves existing language models and their scaling curves with a relatively tiny amount of extra compute. The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) on a few more steps with UL2's mixture-of-denoiser objective. We show that, with almost negligible extra computational costs and no new sources of data, we are able to substantially improve the scaling properties of large language models on downstream metrics. In this paper, we continue training PaLM with UL2R, introducing a new set of models at 8B, 62B, and 540B scale which we call U-PaLM. Impressively, at 540B scale, we show an approximately 2x computational savings rate where U-PaLM achieves the same performance as the final PaLM 540B model at around half its computational budget (i.e., saving ∼4.4 million TPUv4 hours). We further show that this improved scaling curve leads to 'emergent abilities' on challenging BIG-Bench tasks -- for instance, U-PaLM does much better than PaLM on some tasks or demonstrates better quality at much smaller scale (62B as opposed to 540B). Overall, we show that U-PaLM outperforms PaLM on many few-shot setups, i.e., English NLP tasks (e.g., commonsense reasoning, question answering), reasoning tasks with chain-of-thought (e.g., GSM8K), multilingual tasks (MGSM, TydiQA), MMLU and challenging BIG-Bench tasks. Finally, we provide qualitative examples showing the new capabilities of U-PaLM for single and multi-span infilling.",Confident,United States of America,Industry,120.0,5 days,Google TPU v4,Unreleased,"""The total number of extra tokens we train on for the 540B
model is approximately 1.3 Billion which constitutes 0.16% extra computation... Training an U-PaLM 540B model only consumes 512 TPUv4 chips and finishes in about 5 days which is considered to be lightweight.""
PaLM was 2.5e24
0.16% of that is 4e21",512.0,,PaLM (540B),46.0,,4e+21,,,,
LMSI-Palm,Language,"Google,University of Illinois Urbana-Champaign (UIUC)","Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han",2022-10-20,Large Language Models Can Self-Improve,https://arxiv.org/abs/2210.11610,SOTA improvement,"""We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%->82.1% on GSM8K, 78.2%->83.0% on DROP, 90.0%->94.4% on OpenBookQA, and 63.4%->67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth label.""",540000000000.0,540B,,"(fine-tuned from Palm-540B, which was 2.52e24)",,Trained on chain-of-thought PaLM output from several datasets of questions that require reasoning. See section 4,,,"Large Language Models (LLMs) have achieved excellent performances in various tasks. However, fine-tuning an LLM requires extensive supervision. Human, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate ""high-confidence"" rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%->82.1% on GSM8K, 78.2%->83.0% on DROP, 90.0%->94.4% on OpenBookQA, and 63.4%->67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth label. We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement.",Confident,"United States of America,United States of America","Industry,Academia",,,,Unreleased,"""To reduce the training burden, we sample 5k examples from the non-football and football partition of the DROP dataset, and sample 5k examples from ANLI-A2 and ANLI-A3. For each dataset, we fine-tune the model for 10k steps with a learning rate of 5e−5
and a batch size of 32."" Not sure about sequence length",,,PaLM (540B),284.0,,,,,,
Flan-T5 11B,Language,Google,"Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei",2022-10-20,Scaling Instruction-Finetuned Language Models,"https://arxiv.org/abs/2210.11416, https://huggingface.co/google/flan-t5-xxl",Highly cited,,11000000000.0,11B,3.3e+22,"Table 2: 0.2% greater than T5 xxl, which used 3.3e22 FLOP",,"Various instruction examples for many tasks:
""Our final set of finetuning tasks is sourced from a combination of tasks from FLAN, T0,
Natural Instructions, along with some dialog, program synthesis, and chain-of-thought reasoning tasks, as
described in Figure 2. We provide specific pointers and citations in Table 24. All data sources are publicly
available. We also remove all MMLU tasks from Natural Instructions to preserve its role as a broad benchmark
of 57 held-out tasks for evaluation. In total, there are 1,836 tasks."" ",100000000000.0,"""For T5 models without instruction finetuning, we use LM-adapted models, which were produced by training T5 on 100B additional tokens from C4 on a standard language modeling objective""","Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PALM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.",Likely,United States of America,Industry,,,Google TPU v4,Open source,"7.6e19, per Table 2",,,T5-11B,1901.0,,7.6e+19,,98374.29476250126,,
Flan-PaLM 540B,Language,Google,"Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei",2022-10-20,Scaling Instruction-Finetuned Language Models,https://arxiv.org/abs/2210.11416,"Highly cited,SOTA improvement",">1k cites
""Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU.""",540000000000.0,540B,2.4999999999999997e+24,"0.2% greater than Palm 540B, which used 2.5e24",,"Various instruction examples for many tasks:
""Our final set of finetuning tasks is sourced from a combination of tasks from FLAN, T0, Natural Instructions, along with some dialog, program synthesis, and chain-of-thought reasoning tasks, as described in Figure 2. We provide specific pointers and citations in Table 24. All data sources are publicly
available. We also remove all MMLU tasks from Natural Instructions to preserve its role as a broad benchmark of 57 held-out tasks for evaluation. In total, there are 1,836 tasks."" ",,,"Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PALM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.",Confident,United States of America,Industry,37.0,"""we only use 0.2% of the pre-training compute to instruction-finetune Flan-PaLM 540B (approximately 512 v4 TPU chips for 37 hours)""",Google TPU v4,Unreleased,"5.6e21 per Table 2
""we only use 0.2% of the pre-training compute to instruction-finetune Flan-PaLM 540B (approximately 512 v4 TPU chips for 37 hours)""
512 * 37 * 3600 * 275 teraflops * 0.3 = 5.6e21 (so 30% utilization was correct)",512.0,0.3,PaLM (540B),1901.0,,5e+21,,,,
GenSLM,Biology,"University of Chicago,NVIDIA,Harvard University,Cerebras Systems,Technical University of Munich,California Institute of Technology","Maxim Zvyagin, Alexander Brace, Kyle Hippe, Yuntian Deng, Bin Zhang, Cindy Orozco Bohorquez, Austin Clyde, Bharat Kale, Danilo Perez-Rivera, Heng Ma, Carla M. Mann, Michael Irvin, J. Gregory Pauloski, Logan Ward, Valerie Hayot, Murali Emani, Sam Foreman, Zhen Xie, Diangen Lin, Maulik Shukla, Weili Nie, Josh Romero, Christian Dallago, Arash Vahdat, Chaowei Xiao, Thomas Gibbs, Ian Foster, James J. Davis, Michael E. Papka, Thomas Brettin, Rick Stevens, Anima Anandkumar, Venkatram Vishwanath, Arvind Ramanathan",2022-10-11,GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics,https://www.biorxiv.org/content/biorxiv/early/2022/10/11/2022.10.10.511571.full.pdf,SOTA improvement,"""Together, these capabilities go beyond state-of-the-art techniques
for global-scale whole genome surveillance of pandemic-causing
viruses and address a critical infrastructure need for the global
public health organization"" - SOTA improvement on very specific task",25000000000.0,See Table 3,1.42e+21,"See Table 3
Overall compute: 1.42 zettaFLOP (1.42e21 FLOP)",SARS-CoV-2 genome dataset,"SARS-CoV-2 genome datasets from multiple sources:
""we used >1.5 million high-quality BV-BRC SARSCoV-2 complete genome sequences""
""We also utilized a dataset collected by the Houston Methodist Hospital System - one of the largest single-institution collections of SARS-CoV-2 genome sequences in the United States. [...] Sequences with >256 ambiguous characters were discarded, leaving 16,545 total sequences""
Prokaryotic gene sequence dataset from BV-BRC:
""To allow for better generalization and avoid overfitting of the models to the SARS-CoV-2 data, we used >110 million unique prokaryotic gene sequences from BV-BRC""",,"Multiple datasets used:
""we used >1.5 million high-quality BV-BRC SARS CoV-2 complete genome sequences""
HMHS dataset: ""leaving 16,545 total sequences""
BV-BRC prokaryotic dataset: ""We queried BV-BRC to find 10,206 unique PGfams, each with >30,000 unique members.""","Our work seeks to transform how new and emergent variants of pandemic causing viruses, specially SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pretraining on over 10 million prokaryotic gene sequences, and then finetuning a SARS-CoV-2 specific model on 1.5 million genomes, we show that GenSLM can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLM represents one of the first whole genome scale foundation models which can generalize to other prediction tasks. We demonstrate the scaling of GenSLMs on both GPU-based supercomputers and AI-hardware accelerators, achieving over 1.54 zettaflops in training runs. We present initial scientific insights gleaned from examining GenSLMs in tracking the evolutionary dynamics of SARS-CoV-2, noting that its full potential on large biological data is yet to be realized.",Confident,"United States of America,United States of America,United States of America,Multinational,Germany,United States of America","Academia,Industry,Academia,Industry,Academia,Academia",,,,,,,,,34.0,,,,,,
Diplodocus,Games,"Meta AI,Massachusetts Institute of Technology (MIT)","Anton Bakhtin, David J Wu, Adam Lerer, Jonathan Gray, Athul Paul Jacob, Gabriele Farina, Alexander H Miller, Noam Brown",2022-10-11,Mastering the Game of No-Press Diplomacy via Human-Regularized Reinforcement Learning and Planning,https://arxiv.org/abs/2210.05492,SOTA improvement,"SOTA Improvement in no-press Diplomacy
""In a 200-game no-press Diplomacy tournament involving 62 human participants spanning skill levels from beginner to expert, two Diplodocus agents both achieved a higher average score than all other participants who played more than two games, and ranked first and third according to an Elo ratings model. """,,may be estimated from https://github.com/facebookresearch/diplomacy_cicero?tab=readme-ov-file,,,,"""we train the architecture described in Appendix F on a dataset of roughly 46000 online Diplomacy games provided by webdiplomacy.net.""
then self-play training",,"""we train the architecture described in Appendix F on a dataset of roughly 46000 online Diplomacy games provided by webdiplomacy.net.""
then self-play training","No-press Diplomacy is a complex strategy game involving both cooperation and competition that has served as a benchmark for multi-agent AI research. While self-play reinforcement learning has resulted in numerous successes in purely adversarial games like chess, Go, and poker, self-play alone is insufficient for achieving optimal performance in domains involving cooperation with humans. We address this shortcoming by first introducing a planning algorithm we call DiL-piKL that regularizes a reward-maximizing policy toward a human imitation-learned policy. We prove that this is a no-regret learning algorithm under a modified utility function. We then show that DiL-piKL can be extended into a self-play reinforcement learning algorithm we call RL-DiL-piKL that provides a model of human play while simultaneously training an agent that responds well to this human model. We used RL-DiL-piKL to train an agent we name Diplodocus. In a 200-game no-press Diplomacy tournament involving 62 human participants spanning skill levels from beginner to expert, two Diplodocus agents both achieved a higher average score than all other participants who played more than two games, and ranked first and third according to an Elo ratings model.",Unknown,"United States of America,United States of America","Industry,Academia",,,,Open access (non-commercial),,,,,23.0,,,,,,
Phenaki,Video,"University College London (UCL),University of Michigan,Google Brain","Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, Dumitru Erhan",2022-10-05,Phenaki: Variable Length Video Generation From Open Domain Textual Description,https://arxiv.org/abs/2210.02399,SOTA improvement,"""To the best of our knowledge, this is the first time a paper studies generating videos from time variable prompts""",1800000000.0,"Unless specified otherwise, we train a 1.8B parameter Phenaki model on a corpus of ∼15M text-video pairs at 8 FPS mixed with ∼50M text-images plus ∼400M pairs of LAION-400M [41] (more
details in Appendix B.3). The model used in the visualisations in this paper was trained for 1 million
steps at a batch size of 512, which took less than 5 days. In this setup 80% of the training data came
from the video dataset and each image dataset contributed 10%.",,,,"Same source quote as the Parameters notes: a corpus of ∼15M text-video pairs at 8 FPS mixed with ∼50M text-images plus ∼400M pairs of LAION-400M (details in Appendix B.3); 80% of the training data came from the video dataset and each image dataset contributed 10%.",,"Same source quote as the Parameters notes: ∼15M text-video pairs, ∼50M text-images, and ∼400M LAION-400M pairs; trained for 1 million steps at a batch size of 512.",,,"United Kingdom of Great Britain and Northern Ireland,United States of America,United States of America","Academia,Academia,Industry",,,,,,,,,228.0,,,,,,
Make-A-Video,Video,Meta AI,"Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, Yaniv Taigman",2022-09-29,Make-A-Video: Text-to-Video Generation without Text-Video Data,https://arxiv.org/abs/2209.14792,SOTA improvement,,,,,,,,,,,Unknown,United States of America,Industry,,,,,,,,,700.0,,,,,,
Whisper,Speech,OpenAI,"Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever",2022-09-21,Robust Speech Recognition via Large-Scale Weak Supervision,https://cdn.openai.com/papers/whisper.pdf,SOTA improvement,,1550000000.0,Table 1,4.65e+22,See figure 9,,,9302400000.0,"""When scaled to 680,000 hours of multilingual and multitask
supervision, the resulting models generalize well
to standard benchmarks and are often competitive
with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning.""
13,680 words/h * 680,000h = 9,302,400,000 words","We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zeroshot transfer setting without the need for any finetuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.",,United States of America,Industry,,,,Open source,,,,,1296.0,,,3.0,,,
PaLI,"Language,Vision,Multimodal",Google,"Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut",2022-09-14,PaLI: A Jointly-Scaled Multilingual Language-Image Model,https://arxiv.org/abs/2209.06794v4,SOTA improvement,"""PaLI achieves state-of-the-art in multiple vision and language tasks
(such as captioning, visual question-answering, scene-text understanding)""",16900000000.0,"3.9b Image Encoder,
14b Multimodal Encoder-Decoder",5.1e+22,"""The largest model, PaLI-17B, is pretrained using 1,024 GCP-TPUv4 chips for 7 days""
275 teraFLOP/s * 1024 * 7 * 24 * 3600 * 0.3 (utilization assumption) = 5.1e22",WebLI,"""we introduce WebLI, a multilingual imagelanguage dataset built from images and texts available on the public web... Due to the abundance of multilingual content on the internet, the collection process for the WebLI dataset can be scaled to cover 10 billion images and 12 billion alt-texts. In addition to annotation with web text, we use publicly available automatic service to extract OCR annotations on all images, resulting in 29 billion image-OCR pairs. To balance quality and retain scale, we filter the dataset to the highest quality subset retaining only the top 10% scoring of the original WebLI image-text pairs (about 1B examples), which we use to train PaLI""",,,"Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.",Likely,United States of America,Industry,168.0,7 days,Google TPU v4,Unreleased,,1024.0,,,436.0,,,1.0,50878.10777366616,,
BEIT-3,"Multimodal,Vision,Language",Microsoft,"Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, Furu Wei",2022-08-22,Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks,https://arxiv.org/abs/2208.10442,SOTA improvement,"from abstract: 'In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state-of-the-art transfer performance on both vision and vision-language tasks.'",1900000000.0,1.9B from Table 2,7e+19,"from Table 11, 1M training steps with batch size 6144.
Per Table 2, the model has 1.9B parameters.
The model is a ViT","ImageNet21k,COCO,English Wikipedia,BookCorpus (BooksCorpus, Toronto Book Corpus)",from Table 3,,"from Table 3:
21M image-text pairs,
14M images,160GB documents","A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state-of-the-art transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We introduce Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked ""language"" modeling on images (Imglish), texts (English), and image-text pairs (""parallel sentences"") in a unified manner. Experimental results show that BEiT-3 obtains state-of-the-art performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO). ",Likely,United States of America,Industry,,,,Unreleased,,,,,473.0,,,,,,
BlenderBot 3,Language,"McGill University,Meta AI,Mila - Quebec AI (originally Montreal Institute for Learning Algorithms)","Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane, Morteza Behrooz, William Ngan, Spencer Poff, Naman Goyal, Arthur Szlam, Y-Lan Boureau, Melanie Kambadur, Jason Weston",2022-08-10,BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage,"https://arxiv.org/abs/2208.03188, https://github.com/facebookresearch/ParlAI/blob/main/parlai/zoo/bb3/model_card.md
training code: https://parl.ai/projects/bb3/ ",SOTA improvement,"""Human evaluations show its superiority to existing open-domain dialogue agents, including its predecessors""",175000000000.0,,4.3e+23,(taken from OPT-175 base),BlenderBot 3 Data,"Fine-tuned from OPT-175B.
""The fine-tuning data for BB3 comprises roughly 4 million source/target examples spread across the various
training modules. This corresponds to around 1.13B training tokens. When fine-tuning the OPT-based
BB3 models, we additionally included 600k examples (~170m tokens) of pre-training data to help with
training stability. Table 16 and Table 17 enumerate the breakdown by module.""
",,,"We present BlenderBot 3, a 175B parameter dialogue model capable of open-domain conversation with access to the internet and a long-term memory, and having been trained on a large number of user defined tasks. We release both the model weights and code, and have also deployed the model on a public web page to interact with organic users. This technical report describes how the model was built (architecture, model and training scheme), and details of its deployment, including safety mechanisms. Human evaluations show its superiority to existing open-domain dialogue agents, including its predecessors (Roller et al., 2021; Komeili et al., 2022). Finally, we detail our plan for continual learning using the data collected from deployment, which will also be publicly released. The goal of this research program is thus to enable the community to study ever-improving responsible agents that learn through interaction.",Likely,"Canada,United States of America,Canada","Academia,Industry,Academia",,,NVIDIA A100 SXM4 40 GB,Open access (non-commercial),"""The 30B and 175B parameter BlenderBot 3 models were each trained for one epoch of the training data
on 64 (30B) or 128 (175B) x 40gb A100 GPUs; we found that the model (especially the 175B version)
overfit significantly when seeing the training data more than once. The 175B model was trained with
a batch size of 2^18 and the 30B model was trained with a batch size of 2^19, resulting in roughly 5600
updates and 2800 updates respectively.""
175b params * 5600 * 2^18 * 6 = 1.5e21
",128.0,,OPT-175B,186.0,,1.5e+21,,,262144.0,"Note that this is batch size for fine-tuning. Blenderbot is based on OPT-175B which had batch size 2M.
""The 175B model was trained with a batch size of 2^18""
2^18 = 262144"
GLM-130B,Language,Tsinghua University,"Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, Jie Tang",2022-08-04,GLM-130B: An Open Bilingual Pre-trained Model,https://keg.cs.tsinghua.edu.cn/glm-130b/posts/glm-130b/,SOTA improvement,"""GLM-130B achieves an accuracy of 80.2% on zero-shot LAMBADA (En), while 76.2% for GPT-3 175B and 77.9% for the SOTA offered by PaLM 540B.""",130000000000.0,Dense model,3.778e+23,"""96 NVIDIA A100 (40G * 8) servers for 2 months""
312 TFLOPS/GPU * 96 servers * 8 GPU/server * 2 months * 30% utilization = 3.778*10^23 FLOP
https://www.wolframalpha.com/input?i=312+teraflops+*+96+*+8+*+2+months+*+30%25
utilization rate - citation from the paper: ""we report hardware FLOPs utilization (HFU) of 43.3% and model FLOPs utilization (MFU) of 32.5% due to re-materialization.""","The Pile,WuDao Corpora","""The pre-training data includes 1.2T Pile (train split) (Gao et al., 2020) English, 1.0T Chinese WudaoCorpora (Yuan et al., 2021), and 250G Chinese corpora (including online forums, encyclopedia, and
QA) we crawl from the web, which form a balanced composition of English and Chinese contents""",,,,,China,Academia,1440.0,see compute notes,NVIDIA A100 SXM4 40 GB,Open access (non-commercial),,768.0,0.433,,696.0,,,1.0,820296.6313095269,,
AlexaTM 20B,Language,Amazon,"Saleh Soltan, Shankar Ananthakrishnan, Jack FitzGerald, Rahul Gupta, Wael Hamza, Haidar Khan, Charith Peris, Stephen Rawls, Andy Rosenbaum, Anna Rumshisky, Chandana Satya Prakash, Mukund Sridhar, Fabian Triefenbach, Apurv Verma, Gokhan Tur, Prem Natarajan",2022-08-02,AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model,https://arxiv.org/abs/2208.01448,SOTA improvement,The Abstract reports SOTA improvement on multiple benchmarks.,19750000000.0,See Table 1 on p.3 of the paper,2.04374016e+23,"Training throughput is reported as 154 TFLOP/s - see p.5 of the paper.
""We relied on an internal and optimized version of DeepSpeed that we have since open-sourced (Chiu & Zheng, 2022) to obtain training throughput of up to 154 TFLOPS/GPU on 16 AWS p4d.24xlarge compute instances.""
Accelerator compute days are reported as 15,360 days - see Table 17 on p.18 of the paper.","mC4,Wikipedia",See Table 2 on p.3 of the paper.,,,"In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. In particular, we train a 20 billion parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming a much larger 540B PaLM decoder model. AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially for low-resource languages, across almost all language pairs supported by the model (Arabic, English, French, German, Hindi, Italian, Japanese, Marathi, Portuguese, Spanish, Tamil, and Telugu) on Flores-101 dataset. We also show in zero-shot setting, AlexaTM 20B outperforms GPT3 (175B) on SuperGLUE and SQuADv2 datasets and provides SOTA performance on multilingual tasks such as XNLI, XCOPA, Paws-X, and XWinograd. Overall, our results present a compelling case for seq2seq models as a powerful alternative to decoder-only models for Large-scale Language Model (LLM) training. ",Confident,United States of America,Industry,2880.0,"See p.5 of the paper: ""We trained AlexaTM 20B for 120 days on 128 A100 GPUs...""",NVIDIA A100,API access,,128.0,0.4935,,67.0,,,,267943.21130997164,2000000.0,"""We trained AlexaTM 20B for 120 days on 128 A100 GPUs for the total of 500k updates with the accumulated batch size of 2 million tokens"""
OmegaPLM,Biology,"Massachusetts Institute of Technology (MIT),Westlake University","Ruidong Wu, Fan Ding, Rui Wang, Rui Shen, Xiwen Zhang, Shitong Luo, Chenpeng Su, Zuofan Wu, Qi Xie, Bonnie Berger, Jianzhu Ma, Jian Peng",2022-07-22,High-resolution de novo structure prediction from primary sequence,https://www.biorxiv.org/content/10.1101/2022.07.21.500999v1,Historical significance,"""Here, we introduce OmegaFold, the first computational method to successfully predict high-resolution protein structure from a single primary sequence alone. Using a new combination of a protein language model that allows us to make predictions from single sequences and a geometry-inspired transformer model trained on protein structures, OmegaFold outperforms RoseTTAFold and achieves similar prediction accuracy to AlphaFold2 on recently released structures""",670000000.0,"""Our model contains 66 layers with around 670 million parameters without sharing parameters, which doubles the layer count of ESM-1b but roughly retains the parameter count.""",1.38018816e+22,"""OmegaPLM is implemented in PyTorch (44) and trained for 2,560 GPU Nvidia A100 80G days.""
""Default precision format in Nvidia A100 GPUs is set to TensorFloat-32 for matrix operations.""
Assume 0.4 utilization.
Estimate: (2560 * 24 * 3600) s * 156e12 FLOP/s * 0.4 * = 1.38e22",UniRef50,"""After pretraining on sequences in UniRef50 (dated at 2021/04)""",,,"Recent breakthroughs have used deep learning to exploit evolutionary information in multiple sequence alignments (MSAs) to accurately predict protein structures. However, MSAs of homologous proteins are not always available, such as with orphan proteins or fast-evolving proteins like antibodies, and a protein typically folds in a natural setting from its primary amino acid sequence into its three-dimensional structure, suggesting that evolutionary information and MSAs should not be necessary to predict a protein’s folded form. Here, we introduce OmegaFold, the first computational method to successfully predict high-resolution protein structure from a single primary sequence alone. Using a new combination of a protein language model that allows us to make predictions from single sequences and a geometry-inspired transformer model trained on protein structures, OmegaFold outperforms RoseTTAFold and achieves similar prediction accuracy to AlphaFold2 on recently released structures. OmegaFold enables accurate predictions on orphan proteins that do not belong to any functionally characterized protein family and antibodies that tend to have noisy MSAs due to fast evolution. Our study fills a much-encountered gap in structure prediction and brings us a step closer to understanding protein folding in nature.",Confident,"United States of America,China","Academia,Academia",,"2,560 GPU Nvidia A100 80G days",NVIDIA A100 SXM4 80 GB,,,,,,212.0,,,,52400.32118528479,,
ESM2-15B,Biology,"Meta AI,New York University (NYU),Stanford University,Massachusetts Institute of Technology (MIT)","Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, Alexander Rives",2022-07-21,Evolutionary-scale prediction of atomic-level protein structure with a language model,https://www.science.org/doi/abs/10.1126/science.ade2574,SOTA improvement,"""The resulting ESM-2 model family significantly outperforms previously state-of-the-art ESM-1b (a ∼650 million parameter model) at a comparable number of parameters, and on structure prediction benchmarks it also outperforms other recent protein language models""",15000000000.0,"""we train models up to 15B parameters""",7.35000000001e+22,"from xTrimoPGLM paper Table 9 (https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1): 5.1e22 FLOP
from Arb Research (https://arbresearch.com/files/gen_bio.pdf): ""ESM-2-15B: 270000 updates x 3.2M batch size x 15B “connections” x 6"": 7.8e22 FLOP
from the paper's Supplementary Materials:
""We trained each model over 512 NVIDIA V100 GPUs. ESM2 700M took 8 days to train. The 3B parameter LM took 30 days. The 15B model took 60 days.""
60 days x 512 V100s x an imputed 30% utilization: 1e23 FLOP
Geometric mean: 7.35e22",UniRef50,"""UniRef50, September 2021 version, is used for the training of ESM models""",12000000000.0,"Section A.1.1:
""This allowed ESM-2 models to train on over 60M protein sequences.""
Average protein sequence is 200 tokens, per https://epochai.org/blog/biological-sequence-models-in-the-context-of-the-ai-directives#fn:4
60M * 200 = 12B tokens
Epochs: 15B model used 270k steps at 3.2M token batch size
270k * 3.2M / 12B = 72","""Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins.""",Confident,"United States of America,United States of America,United States of America,United States of America","Industry,Academia,Academia,Academia",1440.0,,NVIDIA V100,Open source,,512.0,,,636.0,,,72.0,163467.82019979745,,
NLLB,Language,Meta AI,"Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco (Paco) Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Jeff Wang",2022-07-06,No Language Left Behind: Scaling Human-Centered Machine Translation,https://research.facebook.com/publications/no-language-left-behind/,SOTA improvement,"""Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art""",54500000000.0,"Section 8.2.4: ""The model has a total of 54.5B parameters
and FLOPs similar to that of a 3.3B dense model""",1.751113728e+22,"Section 8.8:
"" To train NLLB-200, a cumulative
of 51968 GPU hours of computation was performed on hardware of type A100-SXM-80GB""
See also Table 48
Section 8.2.4 states they use FP16
NVIDIA datasheet states 312 TFLOPS for FP16
https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-nvidia-us-2188504-web.pdf
Assuming 0.3 utilization:
312e12*3600*51968*0.3
Also:
""Our final model is a Transformer
encoder-decoder model in which we replace the Feed Forward Network (FFN) layer in
every 4th Transformer block with a Sparsely Gated Mixture of Experts layer containing 128
experts. We use model dimension 2048, FFN dimension 8192, 16 attention heads, 24 encoder
layers and 24 decoder layers. We use Pre-LayerNorm (Xiong et al., 2020) as described in
Section 6.1.1. We share the embedding weights of the encoder input embedding, decoder
input embedding and decoder output embedding layers. We use an overall dropout of 0.3,
attention dropout 0.1 and EOM with peom=0.2. The model has a total of 54.5B parameters
and FLOPs similar to that of a 3.3B dense model.""",,,360000000000.0,"[WORDS]
Section 8.2.2: ""As we prepare to train on the final 202 language dataset comprising of over 18B sentence
pairs and 2440 language directions""
18B sentences * 20 words/sentence","Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed at narrowing the performance gap between low and high-resource languages. More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system. Finally, we open source all contributions described in this work, accessible at https://github.com/facebookresearch/fairseq/tree/nllb.",,United States of America,Industry,,,NVIDIA A100 SXM4 80 GB,Open source,,,,,629.0,,,,50667.25034038439,,
CodeT5-large,Language,Salesforce,"Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi ",2022-07-05,CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning,https://arxiv.org/abs/2207.01780,SOTA improvement,"""Our method not only achieves new SOTA results on the challenging APPS benchmark, but also shows strong zero-shot transfer capability with new SOTA results on the simpler MBPP benchmark.""",770000000.0,"""We pretrain a CodeT5-large model (770M) from scratch following T5-large’s architecture""",2.72e+21,"""We perform our experiments on a kubernetes with 16 A100-40G GPUs on Google Cloud Platform and the total pretraining duration is around 21 days""
16 * 312 TFLOP/s * 21 * 24 * 3600 * 0.3 (utilization assumption) = 2.72e21",GitHub,"""We enlarge the Python pretraining dataset using the recently released
large-scale Github Code dataset. We have compiled public, non-personal information from GitHub consisting of permissively licensed Python code (e.g. “mit”, “apache-2”, “bsd-3-clause”, “bsd-2-clause”, “cc0-1.0”, “unlicense”, “isc”). The resulting Python dataset (GCPY) has 10.5B tokens and is 10x larger than the CodeSearchNet (CSN) corpus [Husain et al., 2019] used in the original CodeT5 [Wang et al., 2021]""",,10.5b tokens,"""Program synthesis or code generation aims to generate a program that satisfies a problem specification. Recent approaches using large-scale pretrained language models (LMs) have shown promising results, yet they have some critical limitations. In particular, they often follow a standard supervised fine-tuning procedure to train a code generation model only from the pairs of natural-language problem descriptions and ground-truth programs. Such paradigm largely ignores some important but potentially useful signals in the problem specification such as unit tests, which thus often results in poor performance when solving complex unseen coding tasks. To address the limitations, we propose ""CodeRL"", a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning (RL). Specifically, during training, we treat the code-generating LM as an actor network, and introduce a critic network that is trained to predict the functional correctness of generated programs and provide dense feedback signals to the actor. During inference, we introduce a new generation procedure with a critical sampling strategy that allows a model to automatically regenerate programs based on feedback from example unit tests and critic scores. For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives, larger model sizes, and better pretraining data. Our method not only achieves new SOTA results on the challenging APPS benchmark, but also shows strong zero-shot transfer capability with new SOTA results on the simpler MBPP benchmark.""",Likely,United States of America,Industry,504.0,21 days,NVIDIA A100,Open source,,,,,140.0,,,150.0,4478.145684414144,,
Minerva (540B),Language,Google,"Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, Vedant Misra",2022-06-29,Solving Quantitative Reasoning Problems with Language Models,https://arxiv.org/abs/2206.14858,SOTA improvement,,540350000000.0,"""To further our understanding of the
impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer
language model, which we call Pathways Language Model (PaLM).""
Our approach is to start with the PaLM pretrained decoder-only transformer language models Chowdhery
et al. (2022), and further train (finetune) them on our mathematical dataset using an autoregressive objective.
Table 2 contains the main model and training hyperparameters.
See Table 2",2.7415e+24,"Minerva was fine-tuned from PaLM using the same hardware. Assume the same model FLOPs utilization rate for pre-training and fine-tuning.
PaLM pretraining time: 6144 TPUs for 1200 hours + 3072 TPUs for 336 hours = 8404992 TPU-hours
Minerva finetuning time: 1024 TPU for 696 hours = 712704 TPU-hours
So fine-tuning added 8.5% more compute.
Minerva total compute = PaLM pretraining compute * (712704+8404992)/(8404992) = 2.7415*10^24 FLOP
https://www.wolframalpha.com/input?i=%28712704%2B8404992%29%2F%288404992%29+*+2.5272*10%5E24
",,"PaLM, finetuned on arxiv",613875000000.0,"""Our models were trained on a dataset of 38.5B tokens"" + PaLM","Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally struggled with tasks that require quantitative reasoning, such as solving mathematics, science, and engineering problems at the college level. To help close this gap, we introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content. The model achieves state-of-the-art performance on technical benchmarks without the use of external tools. We also evaluate our model on over two hundred undergraduate-level problems in physics, biology, chemistry, economics, and other sciences that require quantitative reasoning, and find that the model can correctly answer nearly a third of them.",,United States of America,Industry,696.0,,Google TPU v4,Unreleased,,1024.0,,PaLM (540B),452.0,,2.1429e+23,,,,
ProGen2-xlarge,Biology,"Salesforce Research,Columbia University,Johns Hopkins University","Erik Nijkamp, Jeffrey Ruffolo, Eli N. Weinstein, Nikhil Naik, Ali Madani",2022-06-27,ProGen2: Exploring the Boundaries of Protein Language Models,https://arxiv.org/abs/2206.13517,SOTA improvement,"""ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional finetuning.""",6400000000.0,"""We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters""",1.35e+22,"Estimate 1:
""350,000 steps x 1m batch size x 6.4 B “connections” x 6"" - Arb Research (https://arbresearch.com/files/gen_bio.pdf)
Steps and batches from Table 1.
FLOP estimate: 1.3e22
Table 9 from here: https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1.full.pdf
FLOP estimate: 1.4e22
Geometric mean = 1.35e22 FLOP","UniRef90,BFD30","""The standard PROGEN2 models are pretrained on a mixture of Uniref90 (Suzek et al., 2015) and BFD30 (Steinegger & Söding, 2018) databases""",350000000000.0,350B from Table 9 https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1,"Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial intelligence- driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional finetuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. We release the ProGen2 models and code at https://github.com/salesforce/progen.",Confident,"United States of America,United States of America,United States of America","Industry,Academia,Academia",,,Google TPU v3,Open source,,,,,131.0,,,,11850.178410269727,,
Parti,Image generation,Google Research,"Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu",2022-06-22,Scaling Autoregressive Models for Content-Rich Text-to-Image Generation,https://arxiv.org/abs/2206.10789v1,SOTA improvement,"""Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO""",20000000000.0,"Abstract: ""we achieve consistent quality improvements
by scaling the encoder-decoder Transformer model up to 20B parameters""",3.962895376192635e+23,"Calculated from architecture. Does not take into account the encoding and decoding of text and images, only the transformer stack.
Table 1 shows for the 20B model
16 encoder layers
64 decoder layers
Dmodel = 4096
Dhidden = 16384
Num heads = 64
Just below table 1:
""We use a maximum length of text tokens of 128, and the length of image tokens are fixed to 1024""
I take the length of the sequence to be 100 for the encoder stack and 1024 for the decoder stack.
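A rough plain-Python reconstruction of this estimate (a sketch under the assumptions above, using the 450,000 steps and batch size 8192 quoted below; per-layer parameters follow the standard 4*d_model^2 attention plus 2*d_model*d_ff feed-forward accounting, with an extra cross-attention block per decoder layer):
d_model, d_ff = 4096, 16384
enc_layers, dec_layers = 16, 64
enc_len, dec_len = 100, 1024        # assumed sequence lengths, as above
examples = 450_000 * 8192           # training steps x global batch size
enc_params_per_layer = 4 * d_model**2 + 2 * d_model * d_ff
dec_params_per_layer = enc_params_per_layer + 4 * d_model**2   # plus cross-attention
enc_params = enc_layers * enc_params_per_layer   # ~3.2B
dec_params = dec_layers * dec_params_per_layer   # ~17.2B, so ~20B total as in the paper
flop = 6 * (enc_params * enc_len + dec_params * dec_len) * examples
print(flop)   # ~3.96e23, close to the recorded value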
Section 3, Training: ""a total
of 450,000 steps and final ratio of 0.025. We use a global batch size of 8192 during training.""","LAION-400M,FIT400M,JFT-4B",,4800000000.0,,"We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrate the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.",,Multinational,Industry,,,Google TPU v4,Unreleased,,,,,706.0,,,,344852.94872144435,,
CoCa,Vision,Google Research,"Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu",2022-06-14,CoCa: Contrastive Captioners are Image-Text Foundation Models,https://arxiv.org/abs/2205.01917v2,SOTA improvement,"""Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.""",2100000000.0,"""Our largest CoCa model (""CoCa"" in short) follows the ViT-giant setup in [21] with 1B-parameters in the image encoder and 2.1B-parameters altogether with the text decoder""",7.3e+22,"""Pretraining CoCa takes about 5 days on 2,048 CloudTPUv4 chips""
275 teraFLOP/s * 2048 * 5 * 24 * 3600 * 0.3 (assumed utilization) = 7.3e22","JFT-3B,ALIGN","""CoCa is pretrained from scratch in a single stage on both webscale alt-text data and annotated images by treating all labels simply as texts. We use the JFT-3B dataset [21] with label names as the paired texts, and the ALIGN dataset [13] with noisy alt-texts.""",4800000000.0,"JFT is 3 billion captioned images, ALIGN is 1.8 billion captioned images","Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.",Confident,Multinational,Industry,120.0,5 days,Google TPU v4,Unreleased,,2048.0,,,869.0,,,7.5,78043.3756911775,,"65,536 image-text pairs"
MetaLM,"Multimodal,Language,Vision",Microsoft Research,"Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shuming Ma, Furu Wei",2022-06-13,Language Models are General-Purpose Interfaces,https://arxiv.org/abs/2206.06336v1,SOTA improvement,"Abstract: ""Experimental results across various language-only and vision-language benchmarks show that our model outperforms or is competitive with specialized models on finetuning, zero-shot generalization, and few-shot learning.""",,,,,,,,,,Unknown,United States of America,Industry,,,,,,,,,79.0,,,,,,
DITTO,Language,"Tsinghua University,Apple,Westlake University,Chinese University of Hong Kong (CUHK)","Jin Xu, Xiaojiang Liu, Jianhao Yan, Deng Cai, Huayang Li, Jian Li",2022-06-06,Learning to Break the Loop: Analyzing and Mitigating Repetitions for Neural Text Generation,https://arxiv.org/abs/2206.02369,SOTA improvement,"Achieves SOTA on CNN/DailyMail by fine-tuning and improving on BART-large, which is SOTA",750000000.0,,1.1e+19,,WikiText-103,,,,,,"China,United States of America,China,Hong Kong","Academia,Industry,Academia,Academia",,,,Unreleased,,,,,44.0,,,7.16,,,
Diffusion-GAN,Image generation,"UT Austin,Microsoft","Zhendong Wang, Huangjie Zheng, Pengcheng He, Weizhu Chen, Mingyuan Zhou",2022-06-05,Diffusion-GAN: Training GANs with Diffusion,https://arxiv.org/abs/2206.02262v4,SOTA improvement,"""We demonstrate the advantages of Diffusion-GAN over strong GAN
baselines on various datasets, showing that it can produce more realistic images with higher stability and data efficiency than state-of-the-art GANs.""",,,,"Must be <1e23 FLOP, all experiments were done with 4 or 8 V100s.",,"They experimented with the following datasets: ""CIFAR-10 (Krizhevsky, 2009), STL-10 (Coates et al., 2011), LSUN-Bedroom (Yu et al., 2015), LSUN-Church
(Yu et al., 2015), AFHQ(Cat/Dog/Wild) (Choi et al., 2020), and FFHQ (Karras et al., 2019)""",,,"Generative adversarial networks (GANs) are challenging to train stably, and a promising remedy of injecting instance noise into the discriminator input has not been very effective in practice. In this paper, we propose Diffusion-GAN, a novel GAN framework that leverages a forward diffusion chain to generate Gaussianmixture distributed instance noise. Diffusion-GAN consists of three components, including an adaptive diffusion process, a diffusion timestep-dependent discriminator, and a generator. Both the observed and generated data are diffused by the same adaptive diffusion process. At each diffusion timestep, there is a different noise-to-data ratio and the timestep-dependent discriminator learns to distinguish the diffused real data from the diffused generated data. The generator learns from the discriminator’s feedback by backpropagating through the forward diffusion chain, whose length is adaptively adjusted to balance the noise and data levels. We theoretically show that the discriminator’s timestep-dependent strategy gives consistent and helpful guidance to the generator, enabling it to match the true data distribution. We demonstrate the advantages of Diffusion-GAN over strong GAN baselines on various datasets, showing that it can produce more realistic images with higher stability and data efficiency than state-of-the-art GANs.",Unknown,"United States of America,United States of America","Academia,Industry",,,NVIDIA V100,,,,,,125.0,,,,,,
CogVideo,"Multimodal,Video","Tsinghua University,Beijing Academy of Artificial Intelligence / BAAI","Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang",2022-05-29,CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers,https://arxiv.org/abs/2205.15868,Historical significance,The world's largest and first opensource large-scale pre-trained text-to-video model.,9400000000.0,,,,Unspecified unreleased,,5400000.0,"""trained on 5.4 million text-video pairs""","Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Its application to video generation is still facing many challenges: The potential huge computation cost makes the training from scratch unaffordable; The scarcity and weak relevance of text-video datasets hinder the model understanding complex movement semantics. In this work, we present 9B-parameter transformer CogVideo, trained by inheriting a pretrained text-to-image model, CogView2. We also propose multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models at a large margin in machine and human evaluations.",Likely,"China,China","Academia,Academia",,,,Open source,,,,CogView2,270.0,,,,,,
Tranception,Biology,"University of Oxford,Harvard Medical School,Cohere","Pascal Notin, Mafalda Dias, Jonathan Frazer, Javier Marchena-Hurtado, Aidan Gomez, Debora S. Marks, Yarin Gal",2022-05-27,Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval,https://arxiv.org/abs/2205.13760,SOTA improvement,"""We introduce Tranception, a novel transformer architecture leveraging autoregressive predictions and retrieval of homologous sequences at inference to achieve state-of-the-art fitness prediction performance. Given its markedly higher performance on multiple mutants, robustness to shallow alignments and ability to score indels, our approach offers significant gain of scope over existing approaches.""",700000000.0,"""Our largest transformer model, Tranception L, has 700M parameters and is trained on UniRef100 (Suzek et al., 2014)""",7.24e+21,"Trained using 64 A100 GPUs for two weeks.
64 * 312 teraFLOP/s * 14 days * 24 hours/day * 3600 seconds/hour * 0.3 utilization (assumption)
= 7.24e21",UniRef100,"""We therefore train our final model (700M parameters) on UniRef100""",,250 million proteins after filtering,"The ability to accurately model the fitness landscape of protein sequences is critical to a wide range of applications, from quantifying the effects of human variants on disease likelihood, to predicting immune-escape mutations in viruses and designing novel biotherapeutic proteins. Deep generative models of protein sequences trained on multiple sequence alignments have been the most successful approaches so far to address these tasks. The performance of these methods is however contingent on the availability of sufficiently deep and diverse alignments for reliable training. Their potential scope is thus limited by the fact many protein families are hard, if not impossible, to align. Large language models trained on massive quantities of non-aligned protein sequences from diverse families address these problems and show potential to eventually bridge the performance gap. We introduce Tranception, a novel transformer architecture leveraging autoregressive predictions and retrieval of homologous sequences at inference to achieve state-of-the-art fitness prediction performance. Given its markedly higher performance on multiple mutants, robustness to shallow alignments and ability to score indels, our approach offers significant gain of scope over existing approaches. To enable more rigorous model testing across a broader range of protein families, we develop ProteinGym -- an extensive set of multiplexed assays of variant effects, substantially increasing both the number and diversity of assays compared to existing benchmarks.",Likely,"United Kingdom of Great Britain and Northern Ireland,United States of America,Canada","Academia,Academia,Industry",336.0,2 weeks,NVIDIA A100,Open source,,64.0,,,114.0,,,,15247.43608848737,,
Imagen,Image generation,Google Brain,"Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li",2022-05-23,Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding,https://imagen.research.google/,"Significant use,SOTA improvement,Highly cited",,3000000000.0,"2B 64x64 generation model, 600M 64->256 super-resolution model, 400M 256->1024 super-resolution model",1.46e+22,"256 TPU v4 chips for 64x64, for 4 days
128 TPU v4 chips for 64->256, for 2 days
128 TPU v4 chips for 256->1024, for 2 days
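A minimal plain-Python version of the sum below (assuming 275 TFLOP/s peak per TPU v4 chip and 40% utilization applied to all three stages):
day = 24 * 3600
chip_seconds = 256 * 4 * day + 2 * (128 * 2 * day)
print(chip_seconds * 275e12 * 0.4)   # ~1.46e22 FLOP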
(256 TPUs * 275 teraFLOPS/TPU * 4 days + 2 * (128 TPUs * 275 teraFLOPS/TPU * 2 days)) * 40% utilization = 1.46e+22 FLOP","LAION-400M,other",,860000000.0,"""We train on a combination of internal datasets, with ≈ 460M
image-text pairs, and the publicly available Laion dataset [61], with ≈ 400M image-text pairs.""",,Likely,United States of America,Industry,96.0,4 days,Google TPU v4,API access,,256.0,,,3482.0,,,,7915.823806150154,,
SimCSE,Language,"Princeton University,Tsinghua University","Tianyu Gao, Xingcheng Yao, Danqi Chen",2022-05-18,SimCSE: Simple Contrastive Learning of Sentence Embeddings,https://arxiv.org/abs/2104.08821,"Highly cited,SOTA improvement",,,,,,,"""Training details. We start from pre-trained checkpoints of BERT (Devlin et al., 2019) (uncased) or RoBERTa (Liu et al., 2019) (cased) and take the [CLS] representation as the sentence embedding (see §6.3 for comparison between different pooling methods). We train unsupervised SimCSE on 106 randomly sampled sentences from English Wikipedia, and train supervised SimCSE on the combination of MNLI and SNLI datasets (314k). More training details can be found in Appendix A""",,,"This paper presents SimCSE, a simple contrastive learning framework that greatly advances state-of-the-art sentence embeddings. We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. We find that dropout acts as minimal data augmentation, and removing it leads to a representation collapse. Then, we propose a supervised approach, which incorporates annotated pairs from natural language inference datasets into our contrastive learning framework by using ""entailment"" pairs as positives and ""contradiction"" pairs as hard negatives. We evaluate SimCSE on standard semantic textual similarity (STS) tasks, and our unsupervised and supervised models using BERT base achieve an average of 76.3% and 81.6% Spearman's correlation respectively, a 4.2% and 2.2% improvement compared to the previous best results. We also show -- both theoretically and empirically -- that the contrastive learning objective regularizes pre-trained embeddings' anisotropic space to be more uniform, and it better aligns positive pairs when supervised signals are available.",Unknown,"United States of America,China","Academia,Academia",,,,,,,,RoBERTa Large,2305.0,,,,,,
Gato,"Multimodal,Robotics,Games,Language",DeepMind,"Scott Reed, Konrad Żołna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, Nando de Freitas",2022-05-12,A Generalist Agent,https://arxiv.org/abs/2205.06175,SOTA improvement,"SOTA at Meta-World MT50 tasks (96.6%) page 14, section 5.5",1180000000.0,"""This section focuses on in-simulation evaluation.
Figure 10 compares the full 1.18B parameter Gato"" p.10",5.44e+21,256 (16x16x) TPUv3 chips x 123e12 FLOPS/chip x 4 days x 86400 seconds/day * 0.5 utilization = 5.44e21 FLOPs,,,,,"Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.",,United Kingdom of Great Britain and Northern Ireland,Industry,96.0,4 days,Google TPU v3,Unreleased,,256.0,,,553.0,,,,3523.0649800752253,,
UL2,Language,"Google Research,Google Brain","Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler",2022-05-10,Unifying Language Learning Paradigms,https://arxiv.org/abs/2205.05131v1,SOTA improvement,"""by scaling our model up to 20B parameters, we achieve SOTA
performance on 50 well-established supervised NLP tasks""",20000000000.0,Taken from Directory of LLMs,1.2e+23,"Trained on 1T tokens
20B * 1T * 6 = 1.2e23
Second source: Section 5.1 says model was trained on 512 TPUv4 chips, and took slightly over 1 month
512 * 2.75e14 * 31 * 24 * 3600 * 0.3 = 1.13e23",C4,'The model is trained on a total of 1 trillion tokens on C4 (2 million steps).',1000000000000.0,1T tokens,,Confident,"Multinational,United States of America","Industry,Industry",744.0,"around 31 days from 'Pre-training took approximately slight more than one month for about 1 trillion
tokens.' from section 5.1
so around 31*24 = 744
",Google TPU v4,Open source,,512.0,0.318,,194.0,,,,126785.76203549476,65536.0,"""We pre-train all models for 500K steps with a batch size of 128 and a sequence length of 512 inputs and 512 targets using the C4 corpus. The total approximate tokens seen during pre-training is approximately 32 billion tokens.""
500k*128*512 ~= 32B
128*512=65,536"
OPT-175B,Language,Meta AI,"Susan Zhang∗ , Stephen Roller∗ , Naman Goyal∗ , Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott† , Sam Shleifer† , Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer",2022-05-02,OPT: Open Pre-trained Transformer Language Models,https://arxiv.org/abs/2205.01068,"Significant use,Highly cited",https://ai.meta.com/blog/opt-175b-large-language-model-applications/,175000000000.0,"""In line with Meta AI’s commitment to open science, we are sharing Open Pretrained Transformer (OPT-175B), a language model with 175 billion parameters trained on publicly available data sets""",4.3e+23,"https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/final_update.md
""As of yesterday, at 12:46pm PST on January 6, our 175B model finally completed its training run on 300B tokens. This required ~4.30E+23 FLOPs of compute""","The Pile,BookCorpus (BooksCorpus, Toronto Book Corpus),CC-Stories,Pushshift Reddit","""The pre-training corpus contains a concatenation
of datasets used in RoBERTa (Liu et al., 2019b),
the Pile (Gao et al., 2021a), and PushShift.io Reddit (Baumgartner et al., 2020; Roller et al., 2021)""
...
""RoBERTa We included the BookCorpus (Zhu et al., 2015) and Stories (Trinh and Le, 2018) subsets of the RoBERTa corpus and utilized an updated version of CCNews, containing news stories crawled through September 28, 2021. This CCNews v2 corpus was preprocessed the same way as the original RoBERTa CCNews (Liu et al., 2019b).
The Pile We included a subset of the Pile (Gao et al., 2021a), including: CommonCrawl, DM Mathematics, Project Gutenberg, HackerNews, OpenSubtitles, OpenWebText2, USPTO and Wikipedia. Other subsets of the Pile were eliminated
...
PushShift.io Reddit We included a subset of the Pushshift.io corpus produced by Baumgartner et al. (2020) and previously used by Roller et al. (2021). To convert the conversational trees
into language-model-accessible documents, we extracted the longest chain of comments in each thread and discarded all other paths in the tree. This reduced the corpus by about 66%.",180000000000.0,"""The training data contains 180B tokens corresponding to 800 GB of data""
1 token ~ 0.75 words","Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3,1 while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models",Confident,United States of America,Industry,793.5,"4.3*10^23 FLOP / (147 TFLOPS) = 813000 A100-hours
https://www.wolframalpha.com/input?i=4.3*10%5E23+FLOP+%2F+%28147+TFLOPS%29
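A plain-Python restatement of this conversion (assuming 147 TFLOP/s effective throughput per A100 and the 1024 GPUs listed in this row):
a100_hours = 4.30e23 / 147e12 / 3600
print(a100_hours)          # ~813,000 A100-hours
print(a100_hours / 1024)   # ~794 hours (~33 days) on 1024 GPUs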
""As of yesterday, at 12:46pm PST on January 6, our 175B model finally completed its training run on 300B tokens. This required ~4.30E+23 FLOPs of compute, or roughly ~33 days of continuous training on 1024 80GB A100s (assuming no hardware issues, no numerical instabilities, etc.).""",NVIDIA A100 SXM4 80 GB,Open access (non-commercial),,1024.0,0.47115,,2241.0,,,1.6667,731667.6068059877,2000000.0,Table 1
Flamingo,"Multimodal,Vision,Language,Video",DeepMind,"Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan",2022-04-29,Flamingo: a Visual Language Model for Few-Shot Learning,https://arxiv.org/abs/2204.14198,"Highly cited,SOTA improvement","""For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.""",80000000000.0,"""We obtain three models, Flamingo-3B, Flamingo-9B and Flamingo-80B""",2.7e+23,"1536 TPU v4 chips for 15 days. Assuming 50% utilization:
C = 1536 TPU * 275*10^12 FLOP/s/TPU * 15 day * 86400 s/day * 0.50 = 2.7*10^23 FLOP
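The same calculation as a small plain-Python sketch (the 50% utilization is an assumption, as noted above):
chips, peak, days, util = 1536, 275e12, 15, 0.50
print(chips * peak * days * 86400 * util)   # ~2.7e23 FLOP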
All training and evaluation was performed on TPUv4 instances. The largest model containing 80 billion parameters is trained on 1536 chips for 15 days and sharded across 16 devices. All trained parameters and optimizer accumulators are stored and updated in float32; all activations and gradients are computed in bfloat16 after downcasting of parameters from float32 to bfloat16","MultiModal MassiveWeb,LTIP,VTP,ALIGN",,,"Flamingo was trained on a mixture of web-scraped datasets:
43M pages of text with interleaved images (MultiModal MassiveWeb dataset)
312M image-text pairs (LTIP dataset)
27M video-text pairs (VTP dataset)
1.8B image-alt text pairs (ALIGN dataset)
Training dataset size is at least 2.1 billion.","Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.",Likely,United Kingdom of Great Britain and Northern Ireland,Industry,360.0,1536 TPU v4 chips for 15 days,Google TPU v4,Unreleased,,1536.0,,,1823.0,,,,183423.16330597224,,
Sparse all-MLP,Language,Meta AI,"Ping Yu, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves Stoyanov, Xian Li",2022-04-14,Efficient Language Modeling with Sparse all-MLP,https://arxiv.org/abs/2203.06850,SOTA improvement,"Abstract:
""Our model also outperforms
the RoBERTa-Large model on several English tasks of the GLUE benchmark by 0.3% on average while handling 99 more languages.""",9400000000.0,"Table 2: ""In Section 4.4, we run our large model (9.41B parameters)""",6.0770304e+19,"112 hours on 32 V100 GPUs
assumed 0.3 utilization rate
32*112*60*60*0.3*1.57E+13
",RoBERTa dataset,,75000000000.0,100B tokens (Table 2) so 75B words.,"All-MLP architectures have attracted increasing interest as an alternative to attention-based models. In NLP, recent work like gMLP shows that all-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks. In this work, we analyze the limitations of MLPs in expressiveness, and propose sparsely activated MLPs with mixture-of-experts (MoEs) in both feature and input (token) dimensions. Such sparse all-MLPs significantly increase model capacity and expressiveness while keeping the compute constant. We address critical challenges in incorporating conditional computation with two routing strategies. The proposed sparse all-MLP improves language modeling perplexity and obtains up to 2× improvement in training efficiency compared to both Transformer-based MoEs (GShard, Switch Transformer, Base Layers and HASH Layers) as well as dense Transformers and all-MLPs. Finally, we evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.",,United States of America,Industry,112.0,,,Unreleased,,,,,10.0,,,,,,
Stable Diffusion (LDM-KL-8-G),Image generation,"Runway,Ludwig Maximilian University","Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer",2022-04-13,High-Resolution Image Synthesis with Latent Diffusion Models,https://arxiv.org/abs/2112.10752,"Significant use,Highly cited",,1450000000.0,See Table 2,5e+22,"""I get 5e22 FLOP. 150k hours on A100 [1] gives 150*10^3 hours * 3600 seconds/hour * 3.12E+14 peak performance of A100 * 0.33 utilisation = 5e22 FLOP""
[1] https://twitter.com/EMostaque/status/1563870674111832066",LAION-400M,"Depends on the specific task; see sec 4
""we train a 1.45B parameter
KL-regularized LDM conditioned on language prompts on
LAION-400M""",400000000.0,,"By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at this https URL .",,"United States of America,Germany","Industry,Academia",585.9375,"total chip-hours divided by number of GPUs
150k/256",NVIDIA A100,Open access (restricted use),,256.0,,,7283.0,,,,111248.21698072633,,
BERT-RBP,Biology,Waseda University,"Keisuke Yamada, Michiaki Hamada",2022-04-07,Prediction of RNA–protein interactions using a nucleotide language model,https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac023/6564689,SOTA improvement,"""Our model outperformed state-of-the-art prediction models using the eCLIP-seq data of 154 RBPs"" [Abstract] - SOTA improvement on a very specific task",110000000.0,"Base model is BERT base (110M parameters), pre-trained on human reference genome (DNABert: https://academic.oup.com/bioinformatics/article/37/15/2112/6128680)",1.4e+20,"See DNABert entry:
""Since the pre-training of DNABERT model is resource-intensive (about 25 days on 8 NVIDIA 2080Ti GPUs)""
Assuming FP16 and 30% utilization
Calculation = (25 * 24 *3600) s * 2.7e13 FLOP/s per GPU * 8 GPUs * 0.3 utilization = 1.4e20 FLOP","Human reference genome [pre-training],RBPSuite","See DNABert entry: ""We generated training data from human genome [...]"" [2.2.2 Pre-training]
""An eCLIP-seq dataset previously generated from the ENCODE3 database by Pan et al. (2020) was used. The original dataset consisted of 154 RBP sets with up to 60 000 positive RNA sequences that bind to the corresponding RBP and the same number of negative sequences."" [2.2 Data preparation]",,,"Motivation
The accumulation of sequencing data has enabled researchers to predict the interactions between RNA sequences and RNA-binding proteins (RBPs) using novel machine learning techniques. However, existing models are often difficult to interpret and require additional information to sequences. Bidirectional encoder representations from transformer (BERT) is a language-based deep learning model that is highly interpretable. Therefore, a model based on BERT architecture can potentially overcome such limitations.
Results
Here, we propose BERT-RBP as a model to predict RNA–RBP interactions by adapting the BERT architecture pretrained on a human reference genome. Our model outperformed state-of-the-art prediction models using the eCLIP-seq data of 154 RBPs. The detailed analysis further revealed that BERT-RBP could recognize both the transcript region type and RNA secondary structure only based on sequence information. Overall, the results provide insights into the fine-tuning mechanism of BERT in biological contexts and provide evidence of the applicability of the model to other RNA-related problems.",Confident,Japan,Academia,,,,Open access (non-commercial),"""The models were trained on four NVIDIA Tesla V100 GPUs (128
GB memory). The training of one RBP model using 19 200 samples
took <10 min.""
Calculation assuming FP16 and 30% utilization and NVIDIA Tesla V100 SXM2 model:
10 min * 60 sec/min * 3.1e13 FLOP/s * 4 GPU * 0.3 utilization = 2.2e16",,,DNABERT,26.0,,2.2e+16,,,,
DALL·E 2,Image generation,OpenAI,"Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen",2022-04-06,Hierarchical Text-Conditional Image Generation with CLIP Latents,https://cdn.openai.com/papers/dall-e-2.pdf,"Highly cited,SOTA improvement",,3500000000.0,"""Our decoder architecture is the 3.5 billion parameter GLIDE model""",,"Decoder architecture is similar to Imagen (1.46E+22), but trained on 1.6e9 datapoints (Table 3) rather than Imagen's 5.1e9 datapoints.
DALL-E 2 uses two models as priors. I estimate the prior model's FLOP as 6*N*D = 6 * 1e9 * 4096 * 1e6 = 2.5e19 FLOP. However, this seems low compared to CLIP.
So it may be possible to estimate DALL-E 2's compute by analogy to Imagen, but there is a lot of uncertainty and more research would be needed.","CLIP,DALL-E",,650000000.0,"""When training the encoder, we sample from the CLIP [39] and DALL-E [40] datasets (approximately 650M images in total) with equal probability""",,Confident,United States of America,Industry,,,,,,,,,4276.0,,,,,,
PaLM (540B),Language,Google Research,"Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev,, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta ,Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, Noah Fiedel",2022-04-04,PaLM: Scaling Language Modeling with Pathways,https://arxiv.org/abs/2204.02311,"Highly cited,SOTA improvement,Training cost","Demonstrates continued benefits of scaling, as well as discontinuous improvements in performance",540350000000.0,"""To further our understanding of the
impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer
language model, which we call Pathways Language Model (PaLM).""",2.5272e+24,"See Table 20: https://storage.googleapis.com/pathways-language-model/PaLM-paper.pdf
6144 TPUv4 for 1200 hours + 3072 TPUv4 for 336 hours.
Equivalent to 6144 TPUv4 for 1368 hours.
46.2% model FLOPs utilization
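For reference, the recorded figure matches a plain 6ND estimate with 540B parameters and the 780B training tokens quoted in the dataset notes, and the two phases reduce to the 1368-hour equivalence used here (a plain-Python sketch):
print(6 * 540e9 * 780e9)    # = 2.5272e24 FLOP (6 x parameters x tokens)
chip_hours = 6144 * 1200 + 3072 * 336
print(chip_hours)           # = 8,404,992 TPU v4 chip-hours
print(chip_hours / 6144)    # = 1368 hours on 6144 chips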
""The 540B-parameter PaLM model sustained a remarkable 57.8% of the peak hardware floating point performance over 50 days while training on TPU v4 supercomputers. "" https://cloud.google.com/blog/topics/systems/tpu-v4-enables-performance-energy-and-co2e-efficiency-gains",,,585000000000.0,"""The PaLM pretraining dataset consists of a high-quality corpus of 780 billion tokens that represent a wide range of natural language use cases.""
1 token ~ 0.75 words","Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.",Confident,Multinational,Industry,1368.0,"6144 TPUv4 for 1200 hours + 3072 TPUv4 for 336 hours.
Equivalent to 6144 TPUv4 for 1368 hours.",Google TPU v4,Unreleased,,6144.0,0.462,,3988.0,"Training compute and utilization rate exclude rematerialization FLOP, but cost should account for rematerialization.",,,2945949.763287097,4000000.0,"""For the largest model, we use batch size 512 (1M tokens) until step 50k, then double it to 1024 (2M tokens) until step 115k, and finally double again it to 2048 (4M tokens) until training is complete at step 255k"""
Chinchilla,Language,DeepMind,"Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre",2022-03-29,Training Compute-Optimal Large Language Models,https://arxiv.org/abs/2203.15556,SOTA improvement,"Proposes new scaling law, with good empirical results",70000000000.0,"""We test this hypothesis by training a predicted compute-optimal model, \chinchilla, that uses the same compute budget as \gopher but with 70B parameters and 4× more more data. \chinchilla uniformly and significantly outperforms \Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks.""",5.76e+23,"""Both Chinchilla and Gopher have been trained for the same number of FLOPs but differ in the size of the model and the number of training tokens.""
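As a cross-check, a plain-Python 6ND estimate with the 70B parameters and 1.4T tokens listed in this row lands close to the Table 3 value (a sketch, not the paper's own accounting):
print(6 * 70e9 * 1.4e12)   # ~5.88e23 FLOP, vs 5.76e23 reported in Table 3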
The number of FLOPs is given in Table 3","MassiveWeb,C4","MassiveWeb, Books, C4, News, Github, Wikipedia (Table A1)",1050000000000.0,"Table 1 shows Chinchilla was trained on 1.4 trillion tokens
1 token ~ 0.75 words","We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over \nummodels language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, \chinchilla, that uses the same compute budget as \gopher but with 70B parameters and 4× more more data. \chinchilla uniformly and significantly outperforms \Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that \chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, \chinchilla reaches a state-of-the-art average accuracy of 67.5\% on the MMLU benchmark, greater than a 7\% improvement over \gopher.",Likely,United Kingdom of Great Britain and Northern Ireland,Industry,,,"Google TPU v4,Google TPU v3",Unreleased,,,,,1114.0,,,1.0,,3000000.0,"Table 1. ""1.5M → 3M"""
"Segatron-XL large, M=384 + HCP",Language,"Microsoft Research,University of Waterloo","He Bai, Tong Wang, Alessandro Sordoni, Peng Shi",2022-03-21,Better Language Model with Hypernym Class Prediction,https://arxiv.org/abs/2203.10692,SOTA improvement,"""Empirically, this curriculum learning strategy consistently improves perplexity over various large, highly-performant state-of-the-art Transformer-based models on two datasets, WikiText-103 and ARXIV""",257000000.0,,2.65e+19,,,,,,,,"United States of America,Canada","Industry,Academia",,,,Unreleased,,,,,9.0,,,167.02,,,
ViT-G (model soup),Vision,"University of Washington,Columbia University,Google,Meta AI,Tel Aviv University","Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, Ludwig Schmidt",2022-03-10,Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time,https://arxiv.org/abs/2203.05482v3,SOTA improvement,"""When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model, which attains 90.94% top-1 accuracy on ImageNet, achieved a new state of the art.""",1843000000.0,This is from the original ViT-G paper,3.4e+21,"This is a fine-tuned version of ViT-G, which required 3.4e21 to train per PCD/Akronomicon.
Fine-tuning compute is likely minor in comparison:
""Models are fine-tuned at a batch size of 512 for either 10,000 or 20,000 steps (approximately 4 or 8 epochs)... all models are fine-tuned at 518 × 518 resolution""
At 20k steps, we have (518^2) * 512 * 20k = 2.75e12 pixels seen in fine-tuning, compared to (224^2) * 32768 * 5M = 8.22e15 in pre-training.",,,,,"The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs -- we call the results ""model soups."" When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model, which attains 90.94% top-1 accuracy on ImageNet, achieved a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically. Code is available at this https URL.",Confident,"United States of America,United States of America,United States of America,United States of America,Israel","Academia,Academia,Industry,Industry,Academia",,,,Open access (non-commercial),,,,,501.0,,,,,,
MegaSyn,Medicine,Collaborations Pharmaceuticals,"Fabio Urbina, Filippa Lentzos, Cédric Invernizzi, Sean Ekins",2022-03-07,Dual Use of Artificial Intelligence-powered Drug Discovery,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9544280/,Historical significance,"Notable example of an AI model having a potential dual use for bio/chemical weapons:
""To narrow the universe of molecules we chose to drive the generative model towards compounds like the nerve agent VX, one of the most toxic chemical warfare agents developed during the 20th century—a few salt-sized grains of VX, (6–10 mg)5, is sufficient to kill a person. Nerve agents such as Novichoks have also been in the headlines recently6.
In less than 6 hours after starting on our in-house server, our model generated forty thousand molecules that scored within our desired threshold. In the process, the AI designed not only VX, but many other known chemical warfare agents that we identified through visual confirmation with structures in public chemistry databases. Many new molecules were also designed that looked equally plausible. These new molecules were predicted to be more toxic based on the predicted LD50 in comparison to publicly known chemical warfare agents (Figure 1). This was unexpected as the datasets we used for training the AI did not include these nerve agents. The virtual molecules even occupied a region of molecular property space that was entirely separate to the many thousands of molecules in the organism-specific LD50 model, which is mainly made up of pesticides, environmental toxins, and drugs (Figure 1). By inverting the use of our machine learning models, we had transformed our innocuous generative model from a helpful tool of medicine to a generator of likely deadly molecules.""",,"model details here: https://chemrxiv.org/engage/chemrxiv/article-details/61551803d1fc335b7cf8fd45
""The variational autoencoder utilizes an encoder-decoder architecture to map chemical space into a latent vector 34. The encoder is composed of 3 LSTM layers of 512 units each followed by a linear layer of 64 units (the latent space).
Our decoder is comprised of 3 LSTM layers of 512 units each with dropout of 0.2 between
all layers""",,,ChEMBL,"https://chemrxiv.org/engage/chemrxiv/article-details/61551803d1fc335b7cf8fd45
""The initial model is trained on ChEMBL 28’s ~2 million compounds""",,,An international security conference explored how artificial intelligence (AI) technologies for drug discovery could be misused for de novo design of biochemical weapons. A thought experiment evolved into a computational proof.,Unknown,United States of America,Industry,,,,Unreleased,,,,,148.0,,,,,,
Statement Curriculum Learning,Language,OpenAI,"Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, Ilya Sutskever ",2022-03-02,Formal Mathematics Statement Curriculum Learning,https://arxiv.org/abs/2202.01344,SOTA improvement,"""by applying this expert iteration to a manually curated set
of problem statements, we achieve state-of-the-art on the miniF2F benchmark, automatically solving
multiple challenging problems drawn from high school olympiads.""",774000000.0,,,Probably below 1e23 FLOP given the small model size.,"Common Crawl,WebMath","300 billion tokens from Common Crawl
72 billion tokens (220 GB) of code from WebMath
25000 theorems from mathlib
327 math problems from competitions and textbooks
The model was also trained on its own self-generated proofs",275000000000.0,"Table on p12 gives WebMath dataset size in GB of code. Uncompressed code probably has a similar number of tokens per gigabyte as natural language text, on the order of 3e8 tokens per GB.",,,United States of America,Industry,,,,,,,,,70.0,,,,,,
DeepNet,Language,Microsoft Research,"Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei",2022-03-01,"DeepNet: Scaling Transformers to 1,000 Layers",https://arxiv.org/abs/2203.00555,SOTA improvement,"""Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points""",3200000000.0,"""Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points, which indicates a promising scaling direction""
EDIT 05/05/2022: The 12B model was presented in an earlier paper. This paper presents a 3.2B model",,"They show results on par with the original Transformer, so probably less than 2.3e19 FLOP.",,,12000000000.0,""" The final data consists of 102 languages, 1932 directions, and
12B sentence pairs.""",,,United States of America,Industry,,,,,,,,,108.0,,,,,,
PolyCoder,Language,Carnegie Mellon University (CMU),"Frank F. Xu, Uri Alon, Graham Neubig, Vincent J. Hellendoorn",2022-02-26,A Systematic Evaluation of Large Language Models of Code,https://arxiv.org/abs/2202.13169,SOTA improvement,"""In the C programming language, PolyCoder outperforms
all models including Codex""",2700000000.0,2.7B for largest model,1.1e+21,"""We use GPT-NeoX toolkit 11 to
train the model efficiently in parallel with 8 Nvidia RTX 8000 GPUs on a single machine. The wall
time used to train the largest 2.7B model is about 6 weeks""
8 * 130 TFLOP/s * 6 * 7 * 24 * 3600 * 0.3 (utilization) ~= 1.1e21",,"Code scraped from GitHub. ""249GB of code across 12 programming languages on a single machine.""",,"249GB
They trained on 39B tokens per Table 3, but I'm not sure how many epochs that is. May be <1. ","Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (e.g., Codex (Chen et al., 2021)) are not publicly available, leaving many questions about their model and data design decisions. We aim to fill in some of these blanks through a systematic evaluation of the largest existing models: Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot, across various programming languages. Although Codex itself is not open-source, we find that existing open-source models do achieve close results in some programming languages, although targeted mainly for natural language modeling. We further identify an important missing piece in the form of a large open-source model trained exclusively on a multi-lingual corpus of code. We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine. In the C programming language, PolyCoder outperforms all models including Codex. Our trained models are open-source and publicly available at this https URL, which enables future research and application in this area.",Likely,United States of America,Academia,1000.0,6 weeks,NVIDIA Quadro RTX 8000,,,,,,372.0,,,,,,
ST-MoE,Language,"Google,Google Brain,Google Research","Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus",2022-02-17,ST-MoE: Designing Stable and Transferable Sparse Expert Models,https://arxiv.org/abs/2202.08906v2,SOTA improvement,"""ST-MoE-32B improves the current state-of-the-art on the test server submissions for both ARC Easy (92.7 → 94.8) and ARC Challenge (81.4 → 86.5).""",269000000000.0,269B. it's called ST-MoE-32B because it's equivalent to a 32B dense model.,2.9e+23,"The paper claims ""scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder"". If this is true for training cost, then 6*32e9*1.5e12 = 2.9e23",C4,"""The pre-training dataset used to train our Sparse 32B model is a mix of C4 (Raffel et al., 2019) and the dataset introduced in GLaM (Du et al., 2021).""",1500000000000.0,"""We pre-train for 1.5T tokens on a mixture of English-only C4 dataset (Raffel et al., 2019) and the dataset from GLaM (Du et al., 2021) summarized in Appendix E""","Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).",Likely,"United States of America,United States of America,Multinational","Industry,Industry,Industry",,,,,,,,,72.0,,,0.84,,,
ProteinBERT,Biology,"Hebrew University of Jerusalem,Ben-Gurion University of the Negev,Deep Trading","Nadav Brandes, Dan Ofer, Yam Peleg, Nadav Rappoport, Michal Linial",2022-02-10,ProteinBERT: a universal deep-learning model of protein sequence and function,https://academic.oup.com/bioinformatics/article/38/8/2102/6502274,SOTA improvement,"""ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes)""",16000000.0,"""Altogether, it includes ∼16M trainable parameters, making it substantially smaller than other protein language models""",6.5e+19,"""Pretraining speed on a single GPU (Nvidia Quadro RTX 5000) was 280 protein records per second. We trained the model for 28 days over ∼670M records""
28 * 24 * 3600 * 89 TFLOP/s * 0.3 (assumed utilization) = 6.5e19
https://www.wolframalpha.com/input?i=28+days+*+89+TFLOP%2Fs+*+0.3",UniRef90,"""ProteinBERT was pretrained on ∼106M UniRef90 records for ∼6.4 epochs""",,,"Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data.",Likely,"Israel,United States of America,United States of America","Academia,Academia,Industry",672.0,28 days,NVIDIA Quadro RTX 5000,,,,,,285.0,,,6.4,,,
LaMDA,Language,Google,"Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, Quoc Le",2022-02-10,LaMDA: Language Models for Dialog Applications,https://arxiv.org/abs/2201.08239,Historical significance,,137000000000.0,"""LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters""",3.55e+23,"""The total FLOPS is 56.5% * 123 TFLOPS/s * 1024 chips * 57.7 days
= 3.55E+23""
From https://arxiv.org/pdf/2201.08239.pdf p.18
",Infiniset,"LaMDA's underlying dataset is called 'Infiniset', and besides the dialogue also involves common crawl, wikipedia, a mixture of english and non-english web documents, and data from programming-related sites (so LaMDA models can also dabble in code).",1560000000000.0,"""and are pre-trained on 1.56T words of public dialog data and web text""","We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvements on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model's responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of human values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.",Confident,United States of America,Industry,1385.0,57.7 days * 24,Google TPU v3,Unreleased,,1024.0,0.565,,1184.0,,,,229949.98625999544,256000.0,"""All models were trained with 256K tokens per batch"""
GPT-NeoX-20B,Language,EleutherAI,"Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach",2022-02-09,GPT-NeoX-20B: An Open-Source Autoregressive Language Model,https://arxiv.org/abs/2204.06745,Historical significance,,20000000000.0,,9.31627008e+22,Trained for 3 months on 96 A100s (according to correspondence with author). Let's say 0.4 utilization rate.,The Pile,,177167400000.0,"""In aggregate, the Pile consists of over 825GiB of raw text data""
1 GB ~ 200M words",,,Multinational,Research collective,2160.0,"see other notes
",NVIDIA A100 SXM4 40 GB,Open source,,96.0,0.375,,556.0,,,1.0,184272.8073600264,3150000.0,"""we opt to use the same batch size as OpenAI’s 175B model–approximately 3.15M tokens, or 1538 contexts of 2048 tokens each, and train for a total of 150,000 steps"""
RETRO-7B,Language,DeepMind,"Sebastian Borgeaud†, Arthur Mensch†, Jordan Hoffmann†, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero,Karen Simonyan, Jack W. Rae‡, Erich Elsen‡ and Laurent Sifre",2022-02-07,Improving language models by retrieving from trillions of tokens,https://arxiv.org/abs/2112.04426,SOTA improvement,"""Our largest model obtains state-of-the-art results on a range of downstream evaluation
datasets including Wikitext103"",7500000000.0,"""Retro provides a constant gain for models ranging from 150M to 7B parameters, and Retro can be improved at evaluation time by increasing the database size and the number of retrieved neighbours. """,1.68e+22,C = 6ND = 6 * 7e9 * 400e9 = 1.7e22,,,315000000000.0,"""we train for 419,430,400,000 training tokens"" ~= 315B words.","We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a 2 trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen Bert retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.",,United Kingdom of Great Britain and Northern Ireland,Industry,,,,Unreleased,,,,,623.0,,,,,,
AlphaCode,Language,DeepMind,"Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, Oriol Vinyals",2022-02-02,Competition-Level Code Generation with AlphaCode,https://arxiv.org/abs/2203.07814,SOTA improvement,,41100000000.0,41.1B. Table 3,1.56816e+23,"Figure 7 (a) shows a maximum training compute budget of approx 20000 TPU-days per model.
20000 days * 275 TFLOPS * 0.33 utilization = 1.6e23 FLOP
https://www.wolframalpha.com/input?i=20000+*+275+teraFLOPS+*+1+day+*+0.33",,,,Appendix part A has answers for pretraining.,"Programming is a powerful and ubiquitous problem-solving tool. Developing systems that can assist programmers or even generate programs independently could make programming more productive and
accessible, yet so far incorporating innovations in AI has proven challenging. Recent large-scale language models have demonstrated an impressive ability to generate code, and are now able to complete
simple programming tasks. However, these models still perform poorly when evaluated on more complex, unseen problems that require problem-solving skills beyond simply translating instructions into
code. For example, competitive programming problems which require an understanding of algorithms
and complex natural language remain extremely challenging. To address this gap, we introduce AlphaCode, a system for code generation that can create novel solutions to these problems that require deeper
reasoning. In simulated evaluations on recent programming competitions on the Codeforces platform,
AlphaCode achieved on average a ranking of top 54.3% in competitions with more than 5,000 participants. We found that three key components were critical to achieve good and reliable performance:
(1) an extensive and clean competitive programming dataset for training and evaluation, (2) large and
efficient-to-sample transformer-based architectures, and (3) large-scale model sampling to explore the
search space, followed by filtering based on program behavior to a small set of submissions.
",,United Kingdom of Great Britain and Northern Ireland,Industry,,,"Google TPU v4,Google TPU v4i",,,3750.0,,,777.0,,,,,4718592.0,"2304 token sequences, 2048 batch size. 2304 * 2048 = 4718592
trained on 967B tokens and 205k steps. 967B/205k = 4717073, so seems they didn't do warmup"
InstructGPT,Language,OpenAI,"Long Ouyang, Pamela Mishkin, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, John Schulman, Amanda Askell, Fraser Kelton, Peter Welinder, Luke Miller, Maddie Simens, Paul Christiano, Ryan Lowe, Chong Zhang, Jacob Hilton, Sandhini Agarwal, Katarina Slama, Alex Ray, Jan Leike",2022-01-27,Training language models to follow instructions with human feedback,https://cdn.openai.com/papers/Training_language_models_to_follow_instructions_with_human_feedback.pdf,"Historical significance,Highly cited",,175000000000.0,"""We train three model sizes (1.3B, 6B, and 175B parameters)""",,,,,374000033207.0,"Table 6 - describes **number of prompts**
26584 + 6623 = 33207
This is added to GPT-3 dataset size.",,,United States of America,Industry,,,,,,,,GPT-3 175B (davinci),6503.0,,,,,,
OntoProtein,"Biology,Language",Zhejiang University,"Ningyu Zhang, Zhen Bi, Xiaozhuan Liang, Siyuan Cheng, Shumin Deng, Qiang Zhang, Jiazhang Lian, Huajun Chen, Haosen Hong",2022-01-23,ONTOPROTEIN: PROTEIN PRETRAINING WITH GENE ONTOLOGY EMBEDDING,https://openreview.net/pdf?id=yfe1VMYAXa4,SOTA improvement,Experimental results show that OntoProtein can surpass state-of-the-art methods with pre-trained protein language models in TAPE benchmark and yield better performance compared with baselines in protein-protein interaction and protein function prediction1.,420000000.0,"""For the protein encoder, we use the pre-trained ProtBert from Elnaggar et al. (2020).""",,,ProteinKG25,"""the ProteinKG25 dataset used for pre-training contains about 612,483 entities and 4,990,097 triples, aligned with GO annotations and including protein sequences.""",,,"Self-supervised protein language models have proved their effectiveness in learn- ing the proteins representations. With the increasing computational power, cur- rent protein language models pre-trained with millions of diverse sequences can advance the parameter scale from million-level to billion-level and achieve re- markable improvement. However, those prevailing approaches rarely consider incorporating knowledge graphs (KGs), which can provide rich structured knowl- edge facts for better protein representations. We argue that informative biology knowledge in KGs can enhance protein representation with external knowledge. In this work, we propose OntoProtein, the first general framework that makes use of structure in GO (Gene Ontology) into protein pre-training models. We construct a novel large-scale knowledge graph that consists of GO and its related proteins, and gene annotation texts or protein sequences describe all nodes in the graph. We propose novel contrastive learning with knowledge-aware negative sampling to jointly optimize the knowledge graph and protein embedding during pre-training. Experimental results show that OntoProtein can surpass state-of-the-art methods with pre-trained protein language models in TAPE benchmark and yield better performance compared with baselines in protein-protein interaction and protein function prediction1.",,China,Academia,,,,,,,,ProtBERT-BFD,,,,,,,
AbLang,Biology,University of Oxford,"Tobias H. Olsen, Iain H. Moal and Charlotte M. Deane",2022-01-22,"AbLang: an antibody language model for completing antibody sequences",https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac046/6609807,SOTA improvement,"""AbLang restores residues more accurately and faster than a current state-of-the-art protein language model ESM-1b, emphasizing the benefits and potential of an antibody specific language model"" - SOTA improvement for a very specific task",355000000.0,"""The hyperparameters were selected to be similar to those used
in the RoBERTa paper (Liu et al., 2019).""
Liu et al., 2019 link: https://arxiv.org/pdf/1907.11692.pdf
""We begin by training RoBERTa following the BERTLARGE architecture (L = 24, H = 1024, A = 16, 355M parameters)""",,,Observed antibody space (OAS) database,"""Here, we present AbLang, an antibody specific language model trained on either the heavy or light chain antibody sequences from OAS""",,,"Motivation
General protein language models have been shown to summarize the semantics of protein sequences into representations that are useful for state-of-the-art predictive methods. However, for antibody specific problems, such as restoring residues lost due to sequencing errors, a model trained solely on antibodies may be more powerful. Antibodies are one of the few protein types where the volume of sequence data needed for such language models is available, e.g. in the Observed Antibody Space (OAS) database.
Results
Here, we introduce AbLang, a language model trained on the antibody sequences in the OAS database. We demonstrate the power of AbLang by using it to restore missing residues in antibody sequence data, a key issue with B-cell receptor repertoire sequencing, e.g. over 40% of OAS sequences are missing the first 15 amino acids. AbLang restores the missing residues of antibody sequences better than using IMGT germlines or the general protein language model ESM-1b. Further, AbLang does not require knowledge of the germline of the antibody and is seven times faster than ESM-1b.",Confident,United Kingdom of Great Britain and Northern Ireland,Academia,,,,,,,,,58.0,,,,,,
data2vec (vision),Vision,Meta AI,"Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli",2022-01-20,"Data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language",https://ai.facebook.com/research/data2vec-a-general-framework-for-self-supervised-learning-in-speech-vision-and-language/,SOTA improvement,"""Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches""",705134592.0,"Section 4: ""We experiment with two model sizes: data2vec Base and data2vec Large, containing either L = 12 or L = 24 Transformer blocks with H = 768 or H = 1024 hidden dimension (with 4 × H feed-forward inner-dimension)""
",,,ImageNet,,1281167.0,"Section 5.1:
""we pretrain data2vec on the images of the ImageNet-1K training
set""",,,United States of America,Industry,,,,,,,,,583.0,,,,,,
data2vec (speech),Speech,Meta AI,"Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli",2022-01-20,"Data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language",https://ai.facebook.com/research/data2vec-a-general-framework-for-self-supervised-learning-in-speech-vision-and-language/,SOTA improvement,"""Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches""",705134592.0,"Section 4: ""We experiment with two model sizes: data2vec Base and data2vec Large, containing either L = 12 or L = 24 Transformer blocks with H = 768 or H = 1024 hidden dimension (with 4 × H feed-forward inner-dimension)""
",,,LS-960,,13132800.0,"Section 5.2:
""we pre-train data2vec on the 960
hours of speech audio data from Librispeech (LS-960)""
13,680 words per hour",,,United States of America,Industry,,,,,,,,,583.0,,,,,,
data2vec (language),Language,Meta AI,"Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli",2022-01-20,"Data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language",https://ai.facebook.com/research/data2vec-a-general-framework-for-self-supervised-learning-in-speech-vision-and-language/,SOTA improvement,"""Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches""",705134592.0,"Section 4: ""We experiment with two model sizes: data2vec Base and data2vec Large, containing either L = 12 or L = 24 Transformer blocks with H = 768 or H = 1024 hidden dimension (with 4 × H feed-forward inner-dimension)""
",,,"BookCorpus (BooksCorpus, Toronto Book Corpus),English Wikipedia",,3300000000.0,"Section 5.3: ""we
adopt the same training setup as BERT (Devlin et al., 2019)
by pre-training on the Books Corpus (Zhu et al., 2015) and
English Wikipedia data over 1M updates and a batch size
of 256 sequences.""",,,United States of America,Industry,,,,,,,,,583.0,,,,,,
Detic,Vision,"Meta AI,University of Texas at Austin",Detecting Twenty-thousand Classes using Image-level Supervision,2022-01-07,Detecting Twenty-thousand Classes using Image-level Supervision,https://arxiv.org/abs/2201.02605,SOTA improvement,"""On open-vocabulary COCO, our method outperforms the previous state-of-the-art OVR-CNN [ 72 ] by 5 point with the same detector and data""",88000000.0,"from https://github.com/microsoft/Swin-Transformer Swin-B have 88M,
from page 8 : 'Training our ResNet50 model takes ∼ 22 hours on 8 V100 GPUs. The large 21K Swin-B model trains in ∼ 24 hours on 32 GPUs.'",2.34399744e+19,"28.26e12* 32 * 24*3600*0.3 =2.34e19 = peak flops * num gpus * num seconds * assumed utilization rate
for Swin-B model from page 8 : 'Training our ResNet50 model takes ∼ 22 hours on 8 V100 GPUs. The large 21K Swin-B model trains in ∼ 24 hours on 32 GPUs.'","ImageNet21k,Conceptual Captions (CC3M)","table above section 5.1
""We evaluate Detic on the large-vocabulary object detection dataset LVIS [18 ]""
""Image-supervised data. We use two sources of image-supervised data: ImageNet-
21K [10] and Conceptual Captions """,16900000.0,"14M + 1.5M + 1.2M + 100K + 100K = 16900000.0
table above section 5.1"," Current object detectors are limited in vocabulary size due to the small scale of detection datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as their datasets are larger and easier to collect. We propose Detic, which simply trains the classifiers of a detector on image classification data and thus expands the vocabulary of detectors to tens of thousands of concepts. Unlike prior work, Detic does not need complex assignment schemes to assign image labels to boxes based on model predictions, making it much easier to implement and compatible with a range of detection architectures and backbones. Our results show that Detic yields excellent detectors even for classes without box annotations. It outperforms prior work on both open-vocabulary and long-tail detection benchmarks. Detic provides a gain of 2.4 mAP for all classes and 8.3 mAP for novel classes on the open-vocabulary LVIS benchmark. On the standard LVIS benchmark, Detic obtains 41.7 mAP when evaluated on all classes, or only rare classes, hence closing the gap in performance for object categories with few samples. For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without finetuning. Code is available at \url{this https URL}. ",Speculative,"United States of America,United States of America","Industry,Academia",24.0,"from page 8 : 'Training our ResNet50 model takes ∼ 22 hours on 8 V100
GPUs. The large 21K Swin-B model trains in ∼ 24 hours on 32 GPUs.'",NVIDIA V100,Open source,,32.0,,,382.0,,,,191.44581825616132,,
ERNIE-ViLG,"Multimodal,Image generation,Vision",Baidu,"Han Zhang, Weichong Yin, Yewei Fang, Lanxin Li, Boqiang Duan, Zhihua Wu, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang",2021-12-31,ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation,https://arxiv.org/abs/2112.15283,SOTA improvement,"""we train a 10-billion parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs which achieves state-of-the-art performance for both text-to-image and image-to-text tasks""",10000000000.0,"""To explore the landscape of large-scale pre-training for bidirectional text-image generation, we pre-train a 10-billion parameter model on a large-scale dataset of 145 million high-quality Chinese image-text pairs.""",,,,,145000000.0,"To explore the landscape of large-scale pre-training for bidirectional text-image generation,
we pre-train a 10-billion parameter model on a large-scale dataset of 145 million high-quality Chinese image-text pairs.",,,China,Industry,,,,,,,,,45.0,,,,,,
ERNIE 3.0 Titan,Language,"Baidu,Peng Cheng Laboratory","Shuohuan Wang, Yu Sun, Yang Xiang, Zhihua Wu, Siyu Ding, Weibao Gong, Shikun Feng",2021-12-23,ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation,https://arxiv.org/abs/2112.12731,SOTA improvement,"""Empirical results show that the ERNIE 3.0 Titan outperforms the state-of-the-art models on 68 NLP datasets.""",260000000000.0,"""[We] developed... distributed training technology, including fine-grained parallelism, heterogeneous hardware-aware training, and fault tolerance mechanism to train the 260B model on both Nvidia V100 GPU and Ascend 910 NPU clusters.""
See also:
https://twitter.com/BaiduResearch/status/1468633977242243078?t=6q4zuLNdTSc4GUBe9OM5Aw&s=19",1.0421e+24,"The paper suggests that ERNIE 3.0 Titan uses more compute than GPT-3. This is consistent with the 6ND approximation.
C = 6ND = 6 (FLOP/param/token) * (260B params) * (668B tokens) = 1.0421*10^24 FLOP",ERNIE 3.0 Corpus,,668000000000.0,"""To ensure the success of the pre-training of ERNIE 3.0 Titan, we utilize the ERNIE 3.0 Corpus [ 2 ], a large-scale, wide-variety, and high-quality Chinese text corpora amounting to 4TB""
Assuming 167M words/tokens per GB","Pre-trained language models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. GPT-3 has shown that scaling up pre-trained language models can further exploit their enormous potential. A unified framework named ERNIE 3.0 was recently proposed for pre-training large-scale knowledge enhanced models and trained a model with 10 billion parameters. ERNIE 3.0 outperformed the state-of-the-art models on various NLP tasks. In order to explore the performance of scaling up ERNIE 3.0, we train a hundred-billion-parameter model called ERNIE 3.0 Titan with up to 260 billion parameters on the PaddlePaddle platform. Furthermore, we design a self-supervised adversarial loss and a controllable language modeling loss to make ERNIE 3.0 Titan generate credible and controllable texts. To reduce the computation overhead and carbon emission, we propose an online distillation framework for ERNIE 3.0 Titan, where the teacher model will teach students and train itself simultaneously. ERNIE 3.0 Titan is the largest Chinese dense pre-trained model so far. Empirical results show that the ERNIE 3.0 Titan outperforms the state-of-the-art models on 68 NLP datasets.",Likely,"China,China","Industry,Academia",,,"Huawei Ascend 910,NVIDIA Tesla V100 DGXS 32 GB",Hosted access (no API),,1920.0,,,56.0,,,,,1048576.0,"""The maximum sequence length of context and
the memory length of language generation is 512 and 128, respectively""
In table 1, they use a global batch size of 512 when data parallelism is ""1"" and 2048 when DP is ""4"". Not sure I fully understand this part but I guess they'd use parallelism as much as possible given how they talk about it.
2048 * 512 = 1048576."
XGLM-7.5B,Language,"Meta AI,Facebook AI Research","Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li",2021-12-20,Few-shot Learning with Multilingual Language Models,https://arxiv.org/abs/2112.10668,SOTA improvement,"""Our largest model
with 7.5 billion parameters sets new state of
the art in few-shot learning in more than 20
representative languages""",7500000000.0,"""Our largest model
with 7.5 billion parameters sets new state of
the art in few-shot learning in more than 20
representative languages""",3.1276322648e+22,""" The XGLM 7.5B model was trained on 256 A100 GPUs for about 3 weeks, at a speed of 311.6k words per second""
312e12 * 256 * 3*7*24*3600 *0.3 = 4.347592704e+22
alternative:
6ND = 6*7.5e9*500e9 = 2.25e22 - we have 7.5B params and 500B tokens from ""All models are trained for up to 500B tokens, with context length of 2048 tokens""
geom mean: sqrt(4.35e22 * 2.25e22) = 3.13e22","CC100-XL,Common Crawl","""We extend the pipeline used for mining the CC100
corpus (Conneau et al., 2020; Wenzek et al., 2020)
to generate CC100-XL, a significantly larger multilingual dataset covering 68 Common Crawl (CC)
snapshots (from Summer 2013 to March/April
2020) and 134 languages.""
",500000000000.0,"Training Data. Our models are trained on a static multilingual corpus extracted from CommonCrawl, with English text comprising 32.6% of the total number of tokens corresponding to 163B tokens.
163B / 0.326 = 500B total
Note that this dataset is sampled from the much larger CC100-XL, outlined in Appendix F and here: https://huggingface.co/facebook/xglm-7.5B#training-data-statistics
The huggingface link sums to 1.64T tokens, while the Data Card in the appendix claims 1.9T tokens.
","Large-scale generative language models such as GPT-3 are competitive few-shot learners. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual generative language models on a corpus covering a diverse set of languages, and study their few- and zero-shot learning capabilities in a wide range of tasks. Our largest model with 7.5 billion parameters sets new state of the art in few-shot learning in more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning (with +7.4% absolute accuracy improvement in 0-shot settings and +9.4% in 4-shot settings) and natural language inference (+5.4% in each of 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model outperforms GPT-3 on 171 out of 182 directions with 32 training examples, while surpassing the official supervised baseline in 45 directions. We conduct an in-depth analysis of different multilingual prompting approaches, showing in particular that strong few-shot learning performance across languages can be achieved via cross-lingual transfer through both templates and demonstration examples. Finally, we evaluate our models in social value tasks such as hate speech detection in five languages and find it has limitations similar to comparable sized GPT-3 models. ",Confident,"United States of America,United States of America","Industry,Industry",504.0,"appendix A : ""The XGLM 7.5B model was trained on 256 A100 GPUs for about 3 weeks, at a speed of 311.6k words per second""",NVIDIA A100,Open access (non-commercial),,256.0,,,176.0,,,1.0,104152.22590136188,,
XGLM,Language,"Meta AI,Facebook AI Research","Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li",2021-12-20,Few-shot Learning with Multilingual Language Models,https://arxiv.org/abs/2112.10668,SOTA improvement,"""Our largest model (XGLM7.5B) sets a new state of the art performance for few-shot learning in more than 20 representative languages (including medium- and low-resource languages) for the tasks of commonsense reasoning, natural language inference and machine translation.""",7500000000.0,7.5B,2.25e+22,"""The XGLM 7.5B model was trained on 256 A100 GPUs for about 3 weeks, at a speed of 311.6k words per second""
256 * 312 teraFLOP/s * 21 * 24 * 3600 * 0.3 utilization assumption ~= 4.3e22
also, it was trained for 500B tokens. Using Compute = 6ND, we have
6 * 500B * 7.5B = 2.25e22
311k tokens per second * 7.5B params * 6 is 1.35e16 FLOP/s. divide that by 312 teraFLOP/s, which is A100 peak compute, gets 43, suggesting low utilization (17%) of the 256-GPU cluster, or somewhat higher if there's more than one token per word. So I'll use the 6ND number.","Subset of CC100-XL,CC100-XL,Common Crawl","*they built a closed dataset based on open Common Crawl
""We extend the pipeline used for mining the CC100 corpus (Conneau et al., 2020; Wenzek et al., 2020) to generate CC100-XL, a significantly larger multilingual dataset covering 68 Common Crawl (CC) snapshots (from Summer 2013 to March/April 2020) and 134 languages.",1740000000.0,,,Likely,"United States of America,United States of America","Industry,Industry",,,,Open access (non-commercial),,,,,176.0,,,,,,
LDM-1.45B,Image generation,"Heidelberg University,Runway","Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer",2021-12-20,High-Resolution Image Synthesis with Latent Diffusion Models,https://arxiv.org/abs/2112.10752,Highly cited,,1450000000.0,1.45B,,,LAION-400M,,400000000.0,400M image-text pairs,"By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at this https URL.",Confident,"Germany,United States of America","Academia,Industry",,,NVIDIA A100,Open source,,,,,7283.0,,,0.66,,,
GLIDE,Image generation,OpenAI,"Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen",2021-12-20,GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models,https://arxiv.org/abs/2112.10741,Highly cited,,3500000000.0,"""Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators to those from DALL-E, even when the latter uses expensive CLIP reranking""",,"""Note that GLIDE was
trained with roughly the same training compute as DALL-E
but with a much smaller model (3.5 billion vs. 12 billion
parameters)""",,,250000000.0,"Section 4:
""We train our model on the same dataset as DALL-E (Ramesh
et al., 2021)""
This paper used 250M image-text pairs
https://arxiv.org/pdf/2102.12092.pdf",,,United States of America,Industry,,,,,,,,,2164.0,,,,,,
Contriever,Language,"Meta AI,University College London (UCL),PSL University,Université Grenoble Alpes","Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, Edouard Grave",2021-12-16,Unsupervised Dense Information Retrieval with Contrastive Learning,https://arxiv.org/abs/2112.09118,SOTA improvement,"""We observe that when
used as pre-training, contrastive learning leads to strong performance: contriever obtains the best results
among dense bi-encoder methods for the nDCG@10, and is state-of-the-art for the recall@100 (improving the
average recall@100 from 65.0 to 67.1). This strong recall@100 performance can be further exploited by using
a cross-encoder to re-rank the retrieved documents: this leads to the state-of-the-art on 8 datasets of the
BEIR benchmark for the nDCG@10, as well as on average""",110000000.0,"Based on BERT base, which had 110m params.
""We initialize the network with the publicly available BERT base uncased model.""",1.57e+20,"Pre-training:
""We use the random cropping data augmentation, with documents of 256 tokens... batch size of 2,048 and 500,000 steps""
256 tokens * 2048 batch size * 500k steps * 100M params (BERT base's 110M rounded) * 6 = 1.57e20
Fine-tuning looks unlikely to move final sum much beyond this.","Wikipedia,CCNet","""Documents are simply random piece of text sampled from a mix between Wikipedia and CCNet data """,,,"Recently, information retrieval has seen the emergence of dense retrievers, using neural networks, as an alternative to classical sparse methods based on term-frequency. These models have obtained state-of-the-art results on datasets and tasks where large training sets are available. However, they do not transfer well to new applications with no training data, and are outperformed by unsupervised term-frequency methods such as BM25. In this work, we explore the limits of contrastive learning as a way to train unsupervised dense retrievers and show that it leads to strong performance in various retrieval settings. On the BEIR benchmark our unsupervised model outperforms BM25 on 11 out of 15 datasets for the Recall@100. When used as pre-training before fine-tuning, either on a few thousands in-domain examples or on the large MS~MARCO dataset, our contrastive model leads to improvements on the BEIR benchmark. Finally, we evaluate our approach for multi-lingual retrieval, where training data is even scarcer than for English, and show that our approach leads to strong unsupervised performance. Our model also exhibits strong cross-lingual transfer when fine-tuned on supervised English data only and evaluated on low resources language such as Swahili. We show that our unsupervised models can perform cross-lingual retrieval between different scripts, such as retrieving English documents from Arabic queries, which would not be possible with term matching methods.",Likely,"United States of America,United Kingdom of Great Britain and Northern Ireland,France,France","Industry,Academia,Academia,Academia",,,,Open access (non-commercial),actually BERT base,,,BERT-Large,368.0,,,,,,
LongT5,Language,Google Research,"Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang",2021-12-15,LongT5: Efficient Text-To-Text Transformer for Long Sequences,https://arxiv.org/abs/2112.07916,SOTA improvement,"from abstract: ""We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.""",3000000000.0,3B from section 4.1,,"architecture is sparse so we cannot use 6ND method,
from 3.1.1 ""we simply replace the encoder
self-attention operation in T5 with a sparse sliding-
window local attention operation following the im-
plementation in ETC ""
at the end of section 3.1.2 there is information about
complexity O(l(r + l/k)) of local attention
from 4.1.1 ""We pre-train LongT5 models for 1M steps on
4096 input sequence length and 910 output se-
quence length.
batch size is 128 (from 4.1 configurations section)
so with l = 4096, k = 16, r = 127,
so l(r+l/k) = 1568768, but we are not sure about constant.
if normal attention have complexity O(l^2), and l^2 = 16777216
16777216/1568768 = 10.7
We can try to estimate that LongT5 would have 10 times less compute that normal architecture.",C4,"from 4.1.1 ""The same as
T5.1.1, we pre-train LongT5 only on the C4 dataset
(Raffel et al., 2019b), and we do not apply dropout
during pre-training.""",200000000000.0,"size of C4, from https://huggingface.co/datasets/c4 , C4 dataset is a collection of about 750GB of English-language text
200M word/GB * 4/3 token/word * 750GB = 200000000000 tokens
Actual tokens seen:
1M steps * (4096 input len + 910 output len) * 128 batch size = 641B tokens, so around 3.2 epochs.","Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the performance of Transformer-based neural models. In this paper, we present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time. Specifically, we integrated attention ideas from long-input transformers (ETC), and adopted pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call {\em Transient Global} (TGlobal), which mimics ETC's local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.",Confident,Multinational,Industry,,,Google TPU v3,Open source,,128.0,,,211.0,,,3.2,,,
GLaM,Language,Google,"Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui",2021-12-13,GLaM: Efficient Scaling of Language Models with Mixture-of-Experts,https://arxiv.org/abs/2112.06905,SOTA improvement,"""As shown in Table 5, GLaM (64B/64E) is better than the dense model and outperforms the previous finetuned state-of-the-art (SOTA) on this dataset in the open-domain setting""",1200000000000.0,1.2 trillion parameters,3.74e+23,"from paper: ""GLaM (64B/64E) training after 600B tokens consumes 456 MWh, about 1/3 of the energy cost of 1287 MWh used by GPT-3. Moreover, to reach similar (and slightly exceeded) scores as GPT-3, we train using 1,024 TPU-v4 chips for 574 hours (with 280B tokens). This consumes 213 MWh or 1/6 of the GPT-3 energy cost""
600/280 is almost exactly 456/213 (2.14) so the later tokens have the same per-token energy cost.
2.14*574*1024 = 1,257,840 TPU-v4 hours
TPU-v4s are 275 teraFLOP/s.
Using our usual 0.3 utilization assumption, 275 trillion * 1,257,840 * 3600 * 0.3 = 3.74e23
Later they say they measured 326W power usage per chip, which could maybe be used to estimate utilization.",Wikipedia,"""To train our model, we build a high-quality dataset of 1.6 trillion tokens that are representative of a wide range of natural language use cases. Web pages constitute the vast quantity of data in our unlabeled dataset. However, their quality ranges from professional writing to low-quality comment and forum pages.""",600000000000.0,"The dataset is made of 1.6 trillion tokens, but later in the paper they say they only train the largest model for 600b tokens. 600b tokens * 0.75 words/token = 450b words.
""The complete GLaM training using 600B tokens consumes only 456 MWh and emits 40.2 net tCO2e.""","Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.",Confident,United States of America,Industry,1366.0,"Note that they give several energy estimates. Use the complete training figures for 600B tokens, not the GPT-3 comparison values with 280B tokens.
""326W measured system power per TPU-v4 chip""
""The complete GLaM training using 600B tokens consumes only
456 MWh""
1024 TPU v4 chips
(456 MWh) / (326W/chip * 1024 chips) = 1366 hours",Google TPU v4,Unreleased,,1024.0,,,460.0,,,,541437.4162400038,1000000.0,"""We use a maximum sequence
length of 1024 tokens, and pack each input example to have
up to 1 million tokens per batch."""
Gopher (280B),Language,DeepMind,"Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d'Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, Geoffrey Irving",2021-12-08,"""Scaling Language Models: Methods, Analysis & Insights from Training Gopher""",https://arxiv.org/abs/2112.11446,SOTA improvement,"""These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority""",280000000000.0,"Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.",6.31e+23,"Table A26
6.31E+08 Train PFLOPs",MassiveText,,225000000000.0,"""We train all models for 300 billion tokens with a 2048 token context window, using the Adam (Kingma and Ba, 2014) optimiser.""
1 token ~ 0.75 words","Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.",Confident,United Kingdom of Great Britain and Northern Ireland,Industry,920.0,"""We trained Gopher for 920 hours in November and December 2020 in Google’s Georgia datacentre. The PUE of the datacenter at this time was 1.08; the net tCO2e per MWh in October 2020 was 0.33. Using an estimate of 283W drawn per chip, this leads to a total of 380 net tCO2e""",Google TPU v3,Unreleased,,4096.0,0.378,,954.0,,,1.0,616611.1391817601,6000000.0,"Table 1. ""Furthermore, we increase Gopher’s batch size from three to six million tokens per batch during training"""
Student of Games,Games,DeepMind,"Martin Schmid, Matej Moravcik, Neil Burch, Rudolf Kadlec, Josh Davidson, Kevin Waugh, Nolan Bard, Finbarr Timbers, Marc Lanctot, Zach Holland, Elnaz Davoodi, Alden Christianson, Michael Bowling",2021-12-06,Player of Games,https://arxiv.org/abs/2112.03178,SOTA improvement,"""Player of Games reaches strong performance in chess and Go, beats the strongest openly available agent in heads-up no-limit Texas hold'em poker (Slumbot), and defeats the state-of-the-art agent in Scotland Yard""",,,3.667927300468287e+22,"""We trained a version of AlphaZero using its original settings in chess and Go, e.g. , using 800 MCTS simulations during training, with 3500 concurrent actors each on a single TPUv4, for a total of 800k training steps. SOG was trained using a similar amount of TPU resources.""",,,,,,Speculative,United Kingdom of Great Britain and Northern Ireland,Industry,,,,Unreleased,,,,,9.0,,,,,,
DeBERTaV3-large + KEAR,Language,Microsoft,"Yichong Xu, Chenguang Zhu, Shuohang Wang, Siqi Sun, Hao Cheng, Xiaodong Liu, Jianfeng Gao, Pengcheng He, Michael Zeng, Xuedong Huang",2021-12-06,Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention,https://arxiv.org/abs/2112.03254v3,SOTA improvement,"""The proposed system, Knowledgeable External Attention for commonsense Reasoning (KEAR), reaches human parity on the open CommonsenseQA research benchmark with an accuracy of 89.4\% in comparison to the human accuracy of 88.9\%.""
SOTA per https://paperswithcode.com/sota/common-sense-reasoning-on-commonsenseqa",418000000.0,"DeBERTaV3-large had 418M params, per Table 2",,this is a fine-tuned version of DeBERTaV3-large,,"""We present details of the 17 datasets that we use for training
data retrieval in Table 1. All the datasets are multiple-choice
or classification datasets related to commonsense reasoning,
and we include dataset details in the appendix.""",,,"Most of today's AI systems focus on using self-attention mechanisms and transformer architectures on large amounts of diverse data to achieve impressive performance gains. In this paper, we propose to augment the transformer architecture with an external attention mechanism to bring external knowledge and context to bear. By integrating external information into the prediction process, we hope to reduce the need for ever-larger models and increase the democratization of AI systems. We find that the proposed external attention mechanism can significantly improve the performance of existing AI systems, allowing practitioners to easily customize foundation AI models to many diverse downstream applications. In particular, we focus on the task of Commonsense Reasoning, demonstrating that the proposed external attention mechanism can augment existing transformer models and significantly improve the model's reasoning capabilities. The proposed system, Knowledgeable External Attention for commonsense Reasoning (KEAR), reaches human parity on the open CommonsenseQA research benchmark with an accuracy of 89.4\% in comparison to the human accuracy of 88.9\%.",Likely,United States of America,Industry,,,,,,,,,47.0,,,,,,
NÜWA,"Multimodal,Vision,Image generation,Video,Language","Microsoft Research,Peking University","Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, Nan Duan",2021-11-24,NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion,https://arxiv.org/abs/2111.12417,SOTA improvement,"""NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, etc""",870000000.0,Section 4.1,4.8384e+21,"From AI Tracker:
""Compute cost: End of Sec 4.1: ""We pre-train on 64 A100 GPUs for two weeks"". Info sheet from NVIDIA (https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet.pdf) gives single precision TensorFloat 32 performance of 156 TFLOPs/s. So we get 64 x 14 x 156 = 140,000 TFLOPs/s x days.""
Multiply by seconds/day and 40% utilization","Conceptual Captions (CC3M),Moments in Time,VATEX",,,"we first pre-train NÜWA on three
datasets: Conceptual Captions [22] for text-to-image (T2I) generation, which includes 2.9M text-image pairs, Moments in Time [26] for video prediction (V2V), which includes 727K videos, and VATEX dataset [43] for text-to-video (T2V) generation, which includes 241K text-video
pairs.","This paper presents a unified multimodal pre-trained model called NÜWA that can generate new or manipulate existing visual data (i.e., images and videos) for various visual synthesis tasks. To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively. A 3D Nearby Attention (3DNA) mechanism is also proposed to consider the nature of the visual data and reduce the computational complexity. We evaluate NÜWA on 8 downstream tasks. Compared to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, etc. Furthermore, it also shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks. Project repo is this https URL.",,"United States of America,China","Industry,Academia",,,,Unreleased,,,,,221.0,,,,,,
Florence,Vision,Microsoft,"Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, JianFeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang",2021-11-22,Florence: A New Foundation Model for Computer Vision,https://arxiv.org/abs/2111.11432v1,"Historical significance,SOTA improvement",,893000000.0,"""Our Florence pretrained model has in total 893M parameters, including the language transformer with 256M parameters and the CoSwin-H transformer with 637M parameters.""",4.831e+22,"""The model takes 10 days to train on 512 NVIDIA A100 GPUs with 40GB memory per GPU.""
512 * 312 teraFLOPS * 10 days * 35% utilization = 4.831e22 FLOP",FLD-900M,"900 million image-text pairs curated from internet images and descriptions
""We leverage large quantities of image-text data available
publicly on the internet. Specifically, we construct a 900
million image-text-pair dataset, called FLD-900M (FLD
stands for FLorenceDataset), using a programmatic data
curation pipeline that processes around 3 billion Internet
images and their raw descriptions in parallel. <..>The final form of the FLD-900M dataset consists of 900M images with 900M free-form texts (ranging from one word, phase to sentences), 9.7M unique queries, and 7.5B tokens in total.
",900000000.0,,"Automated visual understanding of our diverse and open world demands computer vision models to generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale dataset and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted for various computer vision tasks, such as classification, retrieval, object detection, VQA, image caption, video retrieval and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer and zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general purpose vision tasks. Florence achieves new state-of-the-art results in majority of 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with top-1 accuracy of 83.74 and the top-5 accuracy of 97.18, 62.4 mAP on COCO fine tuning, 80.36 on VQA, and 87.8 on Kinetics-600.",Confident,United States of America,Industry,240.0,10 days on 512 A100 40GB,NVIDIA A100 SXM4 40 GB,Unreleased,,512.0,,,665.0,,,,106950.61569328008,,
BASIC-L,Vision,Google,"Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, Mingxing Tan, Quoc V. Le",2021-11-19,Combined Scaling for Zero-shot Transfer Learning,https://arxiv.org/abs/2111.10050,SOTA improvement,"SOTA on ImageNet for a model that was not trained on ImageNet images:
""We present a combined scaling method – named BASIC – that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set without learning from any labeled ImageNet example. This accuracy
surpasses best-published similar models – CLIP and ALIGN – by 9.3%""",3070000000.0,2.4B image model + 670M text model,4.12e+22,"6.9k + 1k + 0.8k = 8.7k TPUv4 core-days for BASIC-L, per Table 8
Two cores per chip, and 275 teraflop/s per chip
(https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#tpu_v4)
275 teraflops * 8700/2 * 24 * 3600 * 0.4 (assumed utilization) = 4.1e22","JFT,ALIGN","For pretraining (Section 8), we use the JFT dataset. This dataset has been
used in previous publications (Zhai et al., 2021; Dosovitskiy et al., 2021; Kolesnikov et al., 2020), but it has been constantly expanded. The JFT version used in our experiments has 5B images, each of which can be associated to one or multiple labels out of 29K possible classes.
""Starting from the ALIGN dataset, which contains 1.7B weakly-aligned image-text pairs (Jia et al., 2021), we collect 5B more image-text pairs, hence expanding the dataset size by roughly 4 times. We acquire
these 5B image-text pairs from the JFT dataset""",6700000000.0,6.7B image-text pairs,"We present a combined scaling method - named BASIC - that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set without learning from any labeled ImageNet example. This accuracy surpasses best published similar models - CLIP and ALIGN - by 9.3%. Our BASIC model also shows significant improvements in robustness benchmarks. For instance, on 5 test sets with natural distribution shifts such as ImageNet-{A,R,V2,Sketch} and ObjectNet, our model achieves 84.3% top-1 average accuracy, only a small drop from its original ImageNet accuracy. To achieve these results, we scale up the contrastive learning framework of CLIP and ALIGN in three dimensions: data size, model size, and batch size. Our dataset has 6.6B noisy image-text pairs, which is 4x larger than ALIGN, and 16x larger than CLIP. Our largest model has 3B weights, which is 3.75x larger in parameters and 8x larger in FLOPs than ALIGN and CLIP. Finally, our batch size is 65536 which is 2x more than CLIP and 4x more than ALIGN. We encountered two main challenges with the scaling rules of BASIC. First, the main challenge with implementing the combined scaling rules of BASIC is the limited memory of accelerators, such as GPUs and TPUs. To overcome the memory limit, we propose two simple methods which make use of gradient checkpointing and model parallelism. Second, while increasing the dataset size and the model size has been the defacto method to improve the performance of deep learning models like BASIC, the effect of a large contrastive batch size on such contrastive-trained image-text models is not well-understood. To shed light on the benefits of large contrastive batch sizes, we develop a theoretical framework which shows that larger contrastive batch sizes lead to smaller generalization gaps for image-text models such as BASIC.",Likely,United States of America,Industry,,,Google TPU v4,Unreleased,,,,,141.0,,,3.0,1684.770712126102,,"65536, but these are image-text pairs not tokens
""For the batch size, we use 65536 contrastive
learning examples per minibatch"""
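A minimal sketch (not part of the dataset) of the hardware-time estimation convention used in the BASIC-L compute notes above: accelerator-days × peak FLOP/s × an assumed utilization. The 275 TFLOP/s per TPU v4 chip and 0.4 utilization follow the notes; the helper name is illustrative.

```python
# Hardware-time FLOP estimate, as sketched in the BASIC-L compute notes above.
# Assumptions from the notes: TPU v4 peak of 275e12 FLOP/s per chip (2 cores per
# chip) and 40% utilization. The helper name is illustrative, not from the source.

def hardware_flop_estimate(core_days, peak_flops_per_chip=275e12,
                           cores_per_chip=2, utilization=0.4):
    """FLOP ~= chip-days * 86400 s/day * peak FLOP/s per chip * utilization."""
    chip_days = core_days / cores_per_chip
    return chip_days * 24 * 3600 * peak_flops_per_chip * utilization

# BASIC-L: 6.9k + 1k + 0.8k = 8.7k TPU v4 core-days (Table 8 of the paper)
print(f"{hardware_flop_estimate(8700):.2e}")  # ~4.1e+22 FLOP
```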
Swin Transformer V2,"Vision,Video",Microsoft Research Asia,"Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo",2021-11-18,Swin Transformer V2: Scaling Up Capacity and Resolution,https://arxiv.org/abs/2111.09883v2,"SOTA improvement,Highly cited","""It set new performance records on 4 representative vision tasks, including ImageNet-V2
image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification.""",3000000000.0,,1.1e+21,"trained on ""<0.5k"" TPUv3 core-days per Table 2 (not trained on TPUs, this is a comparison with other papers)
A core is 123/2 teraflops
500 core-days
= 500 * 123/2 trillion * 24 * 3600 * 0.4 utilization
~= 1.1e21","ImageNet,COCO,ADE20K","""We conduct experiments on ImageNet-1K image classification (V1 and V2) [18, 55], COCO object detection [44], and ADE20K semantic segmentation [85]. For the 3B model experiments, we also report the accuracy on
Kinetics-400 video action recognition [37].""
• Image classification. ImageNet-1K V1 and V2 val are
used [18,55] for evaluation. ImageNet-22K [18] which
has 14M images and 22K categories is optionally employed for pre-training. For the pre-training our largest
model SwinV2-G, a privately collected ImageNet22K-ext dataset with 70 million images is used.",,,"Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and application of large vision models, including training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) A log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) A self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images. Through these techniques, this paper successfully trained a 3 billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date, and makes it capable of training with images of up to 1,536×1,536 resolution. It set new performance records on 4 representative vision tasks, including ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Also note our training is much more efficient than that in Google's billion-level visual models, which consumes 40 times less labelled data and 40 times less training time. Code is available at \url{this https URL}.",Confident,China,Industry,,,NVIDIA A100 SXM4 40 GB,Open source,,,,,1083.0,,,,2326.6636503781665,,
ViT-G/14 (LiT),Vision,Google,"Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, Lucas Beyer",2021-11-15,Zero-Shot Transfer with Locked-image Text Tuning,https://arxiv.org/abs/2111.07991v3,SOTA improvement,"""For example, it achieves 82.5% accuracy on the challenging ObjectNet test set [1], outperforming the previous state-of-the-art
method [46] by 10.2%.""",3005000000.0,Table 7,,"They start with the ViT-G/14 image model and train their own text model. ViT-G/14 is 3.4e21.
They also say ""We use 128 TPU cores by default for the above experiments, and 256 TPU cores for our best run with 18 billion seen image-text pairs"" which may be relevant.",,"CC12M, YFCC100m, and their novel dataset:
""Our dataset. We collect 4 billion image and alt-text
pairs following the same process as ALIGN [31], with the
same image-based filtering but simpler text-based filtering.
Appendix L shows that reducing text filtering does not harm
performance. To avoid misleading evaluation results, we
remove from our dataset near-duplicate images of all splits
from all datasets we evaluate on. We do not consider the
creation of our dataset a main contribution of this paper; we
just simplify the data collection process in ALIGN [31] to
demonstrate the efficacy of our methods at scale.""",4000000000.0,"Largest dataset is ""4 billion image and alt-text pairs"". This is rounded down slightly; the other datasets are much smaller.","This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study we find that locked pre-trained image models with unlocked text models work best. We call this instance of contrastive-tuning ""Locked-image Tuning"" (LiT), which just teaches a text model to read out good representations from a pre-trained image model for new tasks. A LiT model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. The proposed LiT is widely applicable; it works reliably with multiple pre-training methods (supervised and unsupervised) and across diverse architectures (ResNet, Vision Transformers and MLP-Mixer) using three different image-text datasets. With the transformer-based pre-trained ViT-g/14 model, the LiT model achieves 85.2% zero-shot transfer accuracy on the ImageNet test set, and 82.5% on the challenging out-of-distribution ObjectNet test set.",Likely,United States of America,Industry,,,Google TPU v3,,,,,,389.0,,,4.5,,,
Masked Autoencoders,Vision,Facebook AI Research,"Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick",2021-11-11,Masked Autoencoders Are Scalable Vision Learners,https://arxiv.org/abs/2111.06377,"Highly cited,SOTA improvement","""By fine-tuning with a 448 size, we achieve 87.8% accuracy, using only IN1K data. The previous best accuracy, among all methods using only IN1K data, is 87.1% (512 size)... We improve over the state-of-the-art by a nontrivial margin in the highly competitive benchmark of IN1K (no external data). Our result is based on vanilla ViT, and we expect advanced networks will perform better.""
See Table 3",632000000.0,"Three models:
ViT-B (86M), ViT-L (304M), ViT-H (632M)",4.6e+20,"128 TPU-v3 cores trained for 1600 epochs. Times are given for 800 epochs in Table 2; largest model (ViT-H) took 34.5 hrs for 800.
128 TPU-v3 cores * 0.5 chips/core * 34.5 hours * 2 * 1.23E+14 FLOP/sec / chip * 3600 sec/hour * 40% utilization = 7.84e20 FLOP
Note that the operations-counting method disagrees, likely because it treats each image as a single datapoint rather than hundreds of patch tokens:
2 × 632000000 connections × 3 × 1281167 training examples × 1600 epochs = 7.8e18 FLOP
Manual calculation with `calflops` package roughly agrees with hardware-time calculation:
286.21 GFLOPS/observation * 1281167 observations * 1600 epochs = 5.86e20 FLOP
See reproduction here: https://colab.research.google.com/drive/1KCsmrfPzT9BgGO_YQthnz4oP3QRqbw5o?usp=sharing
Weighting three estimates equally:
(7.84e20 + 7.8e18 + 5.86e20)/3 = 4.6e20",ImageNet-1k,,1281167.0,,"This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.",Speculative,United States of America,Industry,69.0,"Table 2 gives wall times for training ViT-L and ViT-H to 800 epochs; later it is stated that the systems are each trained for 1600 epochs.
(34.5 hours / 800 epochs) * 1600 epochs = 69 hours",,Open access (non-commercial),"UNCERTAIN
128 TPU-v3 cores trained for 1600 epochs. Times are given for 800 epochs in Table 2; largest model (ViT-H) took 34.5 hrs for 800.
128 TPU-v3 cores * 0.5 chips/core * 34.5 hours * 2 * 1.23E+14 FLOP/sec / chip * 3600 sec/hour * 40% utilization = 7.84e20 FLOP
Note that the operations counting method disagrees:
2 × 632000000 connections × 3 × 1281167 training examples × 1600 epochs = 7.8e18 FLOP
",,,ViT-Huge/14,5077.0,"$0.62 / 32 cores of TPU-v3 * (128 cores / 32 cores) = $2.65/hour
CPI conversion to 2020: $2.25
$2.25/hour * 69 hours = $155.25",,1600.0,,,
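A short sketch (figures and assumptions taken from the Masked Autoencoders compute notes above) of how the three independent estimates are combined with a simple mean; nothing here comes from the MAE paper itself.

```python
# Three compute estimates for MAE from the notes above, averaged equally.

chips = 128 // 2                     # 128 TPU v3 cores = 64 chips
hours = 34.5 * 2                     # 34.5 h per 800 epochs, 1600 epochs total
hardware = chips * hours * 3600 * 1.23e14 * 0.40        # ~7.8e20 FLOP (hardware-time)

params, examples, epochs = 632e6, 1_281_167, 1600
op_counting = 2 * params * 3 * examples * epochs        # ~7.8e18 FLOP (underestimate)

profiled = 286.21e9 * examples * epochs                 # ~5.9e20 FLOP (calflops profile)

print(f"{(hardware + op_counting + profiled) / 3:.2e}") # ~4.6e+20 FLOP
```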
Projected GAN,Image generation,Heidelberg University,"Axel Sauer, Kashyap Chitta, Jens Müller, Andreas Geiger",2021-11-01,Projected GANs Converge Faster,https://proceedings.neurips.cc/paper/2021/hash/9219adc5c42107c4911e249155320648-Abstract.html,SOTA improvement,"""It is further compatible with resolutions of up to one Megapixel and advances the state-of-the-art Fréchet Inception Distance (FID) on twenty-two benchmark datasets""",,Possibly calculable from Appendix Table 8,1.05e+19,"""With this setting, each experiment takes roughly 100-200 GPU hours on a NVIDIA V100,
for more details we refer to the appendix.""
""We conduct our experiments on an internal cluster with several nodes, each with up to 8 Quadro RTX
6000 or NVIDIA V100 using PyTorch 1.7.1 and CUDA 11.0.""
In appendix table 7, takes 10.1 seconds per 1k images on 8 Quadro RTX 6000s. Longest training run for Projected GAN appears to be in Figure 4 (left), at 14M images, though this is overtrained and the largest checkpoint used for evaluations was 10M.
10M images * 10.1 s/1000 images * 8 * 3.26e13 FLOP/s * 0.4 = 1.05e19",,They experiment with 22 image datasets. Largest appears to be LSUN-Bedroom at 3M images.,3000000.0,They experiment with 22 image datasets. Largest appears to be LSUN-Bedroom at 3M images.,"Generative Adversarial Networks (GANs) produce high-quality images but are challenging to train. They need careful regularization, vast amounts of compute, and expensive hyper-parameter sweeps. We make significant headway on these issues by projecting generated and real samples into a fixed, pretrained feature space. Motivated by the finding that the discriminator cannot fully exploit features from deeper layers of the pretrained model, we propose a more effective strategy that mixes features across channels and resolutions. Our Projected GAN improves image quality, sample efficiency, and convergence speed. It is further compatible with resolutions of up to one Megapixel and advances the state-of-the-art Fréchet Inception Distance (FID) on twenty-two benchmark datasets. Importantly, Projected GANs match the previously lowest FIDs up to 40 times faster, cutting the wall-clock time from 5 days to less than 3 hours given the same computational resources.",Confident,Germany,Academia,,,"NVIDIA V100,NVIDIA Quadro RTX 6000",,,,,,175.0,,,,,,
CodeT5-base,Language,"Salesforce,Nanyang Technological University","Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi",2021-11-01,CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation,https://aclanthology.org/2021.emnlp-main.685/,SOTA improvement,"""Extensive experiments show that CodeT5 yields state-of-the-art results on the fourteen sub-tasks in CodeXGLUE.""",220000000.0,"""We build CodeT5 based on Huggingface’s T5 (Raffel et al., 2020) PyTorch implementation and employ two sizes of CodeT5-small (60M) and CodeT5-base (220M)""",1.56e+21,"""We pre-train the model with the denoising objective for 100 epochs and bimodal dual training for further 50 epochs on a cluster of 16 NVIDIA A100 GPUs with 40G memory. The total training time for CodeT5-small and CodeT5- base is 5 and 12 days, respectively""
16 * 312 teraFLOP/s * 12 * 24 * 3600 * 0.3 (utilization assumption) = 1.56e21","CodeSearchNet,BigQuery","""We follow Feng et al. (2020) to employ CodeSearchNet (Husain et al., 2019) to pre-train CodeT5, which consists of six PLs with both unimodal and bimodal data. Apart from that, we additionally collect two datasets of C/CSharp from
BigQuery1 to ensure that all downstream tasks have overlapped PLs with the pre-training data. In total, we employ around 8.35 million instances for pretraining""",,"""In total, we employ around 8.35 million instances for pretraining""
Instances meaning code snippets/examples, not tokens.","""Pre-trained models for Natural Languages (NL) like BERT and GPT have been recently shown to transfer well to Programming Languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (resp. understanding) tasks or process the code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning. Besides, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. Furthermore, we propose to exploit the user-written code comments with a bimodal dual generation task for better NL-PL alignment. Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better capture semantic information from code. Our code and pre-trained models are released at https://github.com/salesforce/CodeT5.""",Likely,"United States of America,Singapore","Industry,Academia",288.0,"""The total training time for CodeT5-small and CodeT5- base is 5 and 12 days, respectively""",NVIDIA A100,Open source,,,,,928.0,,,150.0,3114.8690946174474,,
S4,Language,Stanford University,"Albert Gu, Karan Goel, Christopher Ré",2021-10-31,Efficiently Modeling Long Sequences with Structured State Spaces,https://arxiv.org/abs/2111.00396,SOTA improvement,"""S4 achieves strong empirical results across a diverse range of established benchmarks, including... SoTA on every task from the Long Range Arena benchmark""",249000000.0,,6e+20,,WikiText-103,,,,,,United States of America,Academia,,,,Open source,,,,,589.0,,,509.02,,,
EfficientZero,Games,"Tsinghua University,UC Berkeley,Shanghai Qi Zhi institute","Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, Yang Gao",2021-10-30,Mastering Atari Games with Limited Data,https://arxiv.org/abs/2111.00210,SOTA improvement,"""Our method is 176% and 163% better
than the previous SoTA performance, in mean and median human normalized score respectively""",,,,"""Our implementation is computationally friendly. To train an Atari agent for 100k steps, it only needs 4 GPUs to train 7 hours.""",,,,,,Unknown,"China,United States of America,China","Academia,Academia",,,,,,,,,143.0,,,,,,
base LM+GNN+kNN,Language,"Shannon.AI,Nanjing University,Nanyang Technological University,Zhejiang University","Yuxian Meng, Shi Zong, Xiaoya Li, Xiaofei Sun, Tianwei Zhang, Fei Wu, Jiwei Li",2021-10-17,GNN-LM: Language Modeling based on Global Contexts via GNN,https://arxiv.org/abs/2110.08743,SOTA improvement,,274000000.0,,7.3e+18,,WikiText-103,,,,,,"China,China,Singapore,China","Industry,Academia,Academia,Academia",,,,Open source,,,,,33.0,,,,,,
T0-XXL,Language,"Hugging Face,Brown University","Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng-Xin Yong, Harshit Pandey, Michael McKenna, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Tali Bers, Stella Biderman, Leo Gao, Thomas Wolf, Alexander M. Rush",2021-10-15,Multitask Prompted Training Enables Zero-Shot Task Generalization,https://arxiv.org/abs/2110.08207,Highly cited,"""we compare T0 to the zero-shot performance of the largest language models available as of writing, i.e., various GPT-3 models up to 175B parameters...
We find that T0 matches or exceeds the performance
of all GPT-3 models on 9 out of 11 held-out datasets""",11000000000.0,"""Unless specified otherwise, we use the XXL version which
has 11B parameters.""",9.1819e+20,"From Table 1 and section B.1, a single run uses 27 hours of a 512 core slice of a TPU-v3 pod.
512 * 0.5 * 1.23e14 * 3600 * 27 * 0.3 = 9.18e20
(cores) * (chip/core) * (FLOP/chip-sec) * (sec/hour) * (hours) * (utilization assumption)",P3 (Public Pool of Prompts),,,"Multitask - 12 tasks, 62 datasets. See fig 2 for details.
TODO: determine the sizes of each of these 62 datasets.
All datasets from here: https://arxiv.org/pdf/2109.02846.pdf
From B.2: ""across all of our training runs (including preliminary test experiments not described in this paper) we trained for 250 billion tokens""","Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models' pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely held-out tasks. We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size. All trained models are available at this https URL and all prompts are available at this https URL.",Confident,"Multinational,United States of America","Industry,Academia",27.0,"For main model, 27 hours (Table 1)
Total time taken to train for all experiments was 270 hours ""These training runs corresponded to about 270 total hours of training on a v3-512 Cloud TPU device.""",Google TPU v3,Open source,,256.0,,,1283.0,,,,11671.840843653788,,
Yuan 1.0,Language,Inspur,"Shaohua Wu, Xudong Zhao, Tong Yu, Rongguo Zhang, Chong Shen, Hongli Liu, Feng Li, Hong Zhu, Jiangang Luo, Liang Xu, Xuanwei Zhang, Jun Liu",2021-10-12,Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning,https://arxiv.org/abs/2110.04725,SOTA improvement,"""The zero-shot average scores of both LM and PLM are superior to the SOTA one. On Csldcp, Tnews and Iflytek tasks, we surpass the zero-shot SOTA by a large margin""",245730000000.0,"Table 2: Parameters of Yuan models.
""Parameters (billion)""",3.5380000000001e+23,"Table 9: 4095 petaFLOPS-days which equals 3.538*10^23 FLOP
https://www.wolframalpha.com/input?i=4095+petaFLOPS+*+1+day
","Common Crawl,Wikipedia,Sogue News","""A Chinese corpus with 5TB high-quality text is built, which is sufficient to train Yuan 245B model without sampling the dataset twice.""
In order to obtain the high-quality dataset, we develop a Massive Data Filtering System (MDFS) built on Spark to clean and filter the raw data, and train a Bert-based model to select high quality
samples. MDFS is consisted of three parts, data collection, coarse filtering and fine filtering (Fig. 5). The raw data is
collected from Common Crawl, Sogou News, SogouT, Encyclopedia, and Books (Table 3). To process these raw data,
we run MDFS system on a high performance cluster with 36 nodes.",1000000000000.0,"""Yuan 1.0 was trained on a new Chinese dataset of 5TB high-quality text that was built on 850TB raw data from Internet.""
1 GB ~ 167M words in English or 333M words in Chinese. For a mixed dataset of mostly Chinese, 5TB may be equivalent to around 1T words.","Recent work like GPT-3 has demonstrated excellent performance of Zero-Shot and Few-Shot learning on many natural language processing (NLP) tasks by scaling up model size, dataset size and the amount of computation. However, training a model like GPT-3 requires huge amount of computational resources which makes it challengeable to researchers. In this work, we propose a method that incorporates large-scale distributed training performance into model architecture design. With this method, Yuan 1.0, the current largest singleton language model with 245B parameters, achieves excellent performance on thousands GPUs during training, and the state-of-the-art results on NLP tasks. A data processing method is designed to efficiently filter massive amount of raw data. The current largest high-quality Chinese corpus with 5TB high quality texts is built based on this method. In addition, a calibration and label expansion method is proposed to improve the Zero-Shot and Few-Shot performance, and steady improvement is observed on the accuracy of various tasks. Yuan 1.0 presents strong capacity of natural language generation, and the generated articles are difficult to distinguish from the human-written ones.",Confident,China,Industry,,,,API access,,2128.0,0.45,,45.0,,,0.22,,6881280.0,"Table 2. Batch size 3360, sequence length 2048. 3360*2048 = 6881280"
Megatron-Turing NLG 530B,Language,"Microsoft,NVIDIA","Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro",2021-10-11,"Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model",https://arxiv.org/abs/2201.11990,"SOTA improvement,Training cost","The 105-layer, transformer-based MT-NLG improved upon the prior state-of-the-art models in zero-, one-, and few-shot settings",530000000000.0,,1.17e+24,"https://www.lesswrong.com/posts/bGuMrzhJdENCo8BxX/nvidia-and-microsoft-releases-530b-parameter-transformer?commentId=HSJSNspKp94tFcSCx
source: https://lair.lighton.ai/akronomicon/
9938 PF-days * 3600 * 24 * 10^15 = 8.586432e+23","Common Crawl,The Pile,CC-Stories,Realnews"," In addition to Common Crawl data, we leveraged a number of other previously generated datasets. From The Pile, we selected Books3, OpenWebText2, Stack Exchange, PubMed Abstracts,
Wikipedia, Gutenberg (PG-19), BookCorpus2, NIH ExPorter, and Pile-CC datasets. We also included the
CC-Stories and RealNews datasets used to train Megatron",202500000000.0,"""Our training dataset consists of 339 billion tokens and we
trained MT-NLG on 270 billions tokens by blending the 15 training datasets as described above. We also set aside 2% of our data for validation.""
1 token ~ 0.75 words","Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used to train this model using DeepSpeed and Megatron. Next, we detail the training process, the design of our training corpus, and our data curation techniques, which we believe is a key ingredient to the success of the model. Finally, we discuss various evaluation results, as well as other interesting observations and new properties exhibited by MT-NLG. We demonstrate that MT-NLG achieves superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and establishes new state-of-the-art results. We believe that our contributions will help further the development of large-scale training infrastructures, large-scale language models, and natural language generations.",,"United States of America,United States of America","Industry,Industry",770.0,"Total compute was 1.17*10^24 FLOP.
They don't directly report the utilization and training speed when using the full Selene supercomputer with 560 DGX * 8 A100/DGX = 4480 GPUs. See section 2.3 Hardware Setup.
At 280 DGX, the utilization is 126/312 = 40% and a batch takes 60 seconds; at 350, it is 39% for 50 seconds; at 420, it is 36% for 44 seconds.
The overall utilization was 30.2% and the full cluster has 560 DGX. Dividing the total compute by the total performance of 4480 A100 at 30.2% utilization gives 770 hours.",NVIDIA A100 SXM4 80 GB,Unreleased,,4480.0,0.302,,566.0,,,,3704291.3087597536,3932160.0,"""The sequence length is 2048 and the global batch size is 1920. We used 8-way tensor and 35-way pipeline parallelism. The learning rate is 5.0e −5 . We used one billion tokens for linear learning rate warmup. We used cosine decay for the learning rate targeting to reach 10% of its value over 340 billion tokens. Over the first 12 billion tokens, we started at a batch size of 32 and gradually increased the batch size in increments of 32, until we reach the final batch size of 1920""
Final batch size is 1920 * 2048 = 3932160"
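A worked version (figures assumed straight from the Megatron-Turing NLG notes above) of the training-time back-out: hours ≈ total FLOP / (GPU count × peak FLOP/s × utilization).

```python
# Back out wall-clock training time from total compute, per the MT-NLG notes above.
total_flop = 1.17e24                       # reported total training compute
gpus, peak, util = 4480, 312e12, 0.302     # A100 count, peak FLOP/s, overall utilization
hours = total_flop / (gpus * peak * util) / 3600
print(round(hours))                        # ~770 hours
```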
AlphaFold-Multimer,Biology,"Google DeepMind,DeepMind","Richard Evans, Michael O’Neill, Alexander Pritzel, Natasha Antropova, Andrew Senior, Tim Green, Augustin Žídek, Russ Bates, Sam Blackwell, Jason Yim, Olaf Ronneberger, Sebastian Bodenstein, Michal Zielinski, Alex Bridgland, Anna Potapenko, Andrew Cowie, Kathryn Tunyasuvunakool, Rishub Jain, Ellen Clancy, Pushmeet Kohli, John Jumper and Demis Hassabis",2021-10-04,Protein complex prediction with AlphaFold-Multimer,https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1,"Highly cited,SOTA improvement","""On a benchmark dataset of 17 heterodimer proteins without templates (introduced in [2]) we achieve at least medium accuracy (DockQ [3] ≥ 0.49) on 14 targets and high accuracy (DockQ ≥ 0.8) on 6 targets, compared to 9 targets of at least medium accuracy and 4 of high accuracy for the previous state of the art system (an AlphaFold-based system from [2])""
""For heteromeric interfaces we successfully predict the interface (DockQ ≥ 0.23) in 67% of cases, and produce high accuracy predictions (DockQ ≥ 0.8) in 23% of cases, an improvement of +25 and +11 percentage points over the flexible linker modification of AlphaFold [4] respectively""
""For homomeric interfaces we successfully predict the interface in 69% of cases, and produce high accuracy predictions in 34% of cases, an improvement of +5 percentage points in both instances""",,"""Multiple changes to the AlphaFold system were made to adapt it to training on protein complexes, which are detailed below. Summarizing briefly, we [...] make various small adjustments to the structure losses and the model architecture."" [2. Methods]
Hence, this will have approximately the same amount of parameters as AlphaFold2",4.35e+21,"Section: 2.5. Training Regimen
""We train the model to convergence (approximately 10M samples, for 2 weeks) across 128 TPUv3 cores [...]. Then we [...] run two separate fine-tuning stages (one further day of training each)""
Assuming: FP16 and utilization 0.4
Calculation: (14+2) days * 24 hours/day * 60 min/hour * 60 sec/min * (128 TPU cores/2 cores per chip) * 1.23e14 FLOP/s per chip * 0.4 utilization = 4.35e21 FLOPs",PDB (Protein Data Bank),"""The training dataset comprised structures from the Protein Data Bank (PDB) [13] with a maximum release date of 2018-04-30"" [2.5. Training Regimen]",147328.0,See: https://www.rcsb.org/stats/growth/growth-released-structures for 2018,"While the vast majority of well-structured single protein chains can now be predicted to high accuracy due to the recent AlphaFold [1] model, the prediction of multi-chain protein complexes remains a challenge in many cases. In this work, we demonstrate that an AlphaFold model trained specifically for multimeric inputs of known stoichiometry, which we call AlphaFold-Multimer, significantly increases accuracy of predicted multimeric interfaces over input-adapted single-chain AlphaFold while maintaining high intra-chain accuracy. On a benchmark dataset of 17 heterodimer proteins without templates (introduced in [2]) we achieve at least medium accuracy (DockQ [3] ≥ 0.49) on 14 targets and high accuracy (DockQ ≥ 0.8) on 6 targets, compared to 9 targets of at least medium accuracy and 4 of high accuracy for the previous state of the art system (an AlphaFold-based system from [2]). We also predict structures for a large dataset of 4,433 recent protein complexes, from which we score all non-redundant interfaces with low template identity. For heteromeric interfaces we successfully predict the interface (DockQ ≥ 0.23) in 67% of cases, and produce high accuracy predictions (DockQ ≥ 0.8) in 23% of cases, an improvement of +25 and +11 percentage points over the flexible linker modification of AlphaFold [4] respectively. For homomeric interfaces we successfully predict the interface in 69% of cases, and produce high accuracy predictions in 34% of cases, an improvement of +5 percentage points in both instances.",Confident,"Multinational,United Kingdom of Great Britain and Northern Ireland","Industry,Industry",384.0,"Section: 2.5. Training Regimen
""We train the model to convergence (approximately 10M samples, for 2 weeks) across 128 TPUv3 cores [...]. Then we [...] run two separate fine-tuning stages (one further day of training each)""",Google TPU v3,Open source,,64.0,,AlphaFold 2,1430.0,,,,7966.2330131223,,
TrOCR,Vision,"Beihang University,Microsoft Research Asia","Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei",2021-09-21,TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models,https://arxiv.org/abs/2109.10282,SOTA improvement,"from conclusion ""Experiment results show that TrOCR achieves state-of-the-art results on printed, handwritten and scene text recognition with just a simple encoder-decoder model, without any post-processing steps""",558000000.0,558M table 5,,may be computed from github and datasets details,,"""To build a large-scale high-quality dataset, we sample two million document pages from the publicly available PDF files on the Internet.""
From the Experiment section: ""In total, the first-stage pre-training dataset contains 684M textlines."" ""In total, the printed dataset consists of 3.3M textlines.""
and from MJSynth, SynthText datasets there is ""about 16M text images.""",703300000.0,"The input data to the model are images.
684M + 3.3M + 16M
from Experiment section: ""In total, the first-stage pre-training dataset contains 684M textlines."" ""In total, the printed dataset consists of 3.3M textlines.""
and from MJSynth, SynthText datasets there is ""about 16M text images.""","Text recognition is a long-standing research problem for document digitalization. Existing approaches are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on the printed, handwritten and scene text recognition tasks. The TrOCR models and code are publicly available at https://aka.ms/trocr",Confident,"China,China","Academia,Industry",,,NVIDIA Tesla V100 DGXS 32 GB,Open source,,32.0,,,177.0,,,,,,
PLATO-XL,Language,Baidu,"Siqi Bao, Huang He, Fan Wang, Hua Wu, Haifeng Wang, Wenquan Wu, Zhihua Wu, Zhen Guo, Hua Lu, Xinxian Huang, Xin Tian, Xinchao Xu, Yingzhan Lin, Zheng-Yu Niu",2021-09-20,PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation,https://arxiv.org/abs/2109.09519,SOTA improvement,,11000000000.0,,9.9e+21,"""In PLATO-XL, each model was trained for a total of 150B tokens, with
a batch size of 2M tokens.""
150B * 11B * 6 = 9.9e21",,,150000000000.0,"""In PLATO-XL, each model was trained for a total of 150B tokens, with
a batch size of 2M tokens.""","To explore the limit of dialogue generation pre-training, we present the models of PLATO-XL with up to 11 billion parameters, trained on both Chinese and English social media conversations. To train such large models, we adopt the architecture of unified transformer with high computation and parameter efficiency. In addition, we carry out multi-party aware pre-training to better distinguish the characteristic information in social media conversations. With such designs, PLATO-XL successfully achieves superior performances as compared to other approaches in both Chinese and English chitchat. We further explore the capacity of PLATO-XL on other conversational tasks, such as knowledge grounded dialogue and task-oriented conversation. The experimental results indicate that PLATO-XL obtains state-of-the-art results across multiple conversational tasks, verifying its potential as a foundation model of conversational AI.",Confident,China,Industry,,,NVIDIA Tesla V100 DGXS 32 GB,Open source,,256.0,,,53.0,,,,,,
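A minimal sketch of the 6 × parameters × tokens approximation used in the PLATO-XL compute note above ("150B * 11B * 6 = 9.9e21"); it assumes a dense transformer, and the function name is illustrative.

```python
# Approximate dense-transformer training compute: 6 * N parameters * D tokens.
def six_nd(params, tokens):
    return 6 * params * tokens

print(f"{six_nd(11e9, 150e9):.1e}")  # 9.9e+21 FLOP for PLATO-XL
```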
HyperCLOVA 204B,Language,NAVER,,2021-09-10,,,SOTA improvement,"""HyperCLOVA with our training configuration shows state-of-the-art in-context zero-shot and few-shot learning performances on various downstream tasks in Korean""",204000000000.0,https://www.navercorp.com/navercorp_/ir/announce/2023/NAVER_CEO%20letter%20to%20shareholders_Aug%202023_Eng.pdf,,"""For experiments in Section 4, the model trained with 150B is used for fair comparison, because not all models are finished training at the same iteration. However, experiments in Section 5.2 use the model trained with 300B tokens, as HyperCLOVA Studio provided the 39B and 82B models trained with 300B tokens.""
82e9 connections * 2 FLOP/connection * 300e9 tokens * 3 backward pass = 1.476e23 FLOP
Calculation using GPU time corroborates this:
- ""Our model is based on megatron-LM (Shoeybi et al., 2019) and trained on the NVIDIA Superpod, which includes 128 strongly clustered DGX servers with 1,024 A100 GPUs.""
- ""It takes 13.4 days to train a model with 82B parameters with 150B tokens."" Assume 300B tokens takes twice as long, 26.8 days.
- Assume the default of 30% utilization rate for large language models.
1024 A100 GPUs * 312e12 FLOP/second * 0.3 utilization * 26.8 days * 24 * 60 * 60 seconds/day = 2.219e+23 FLOP",Unspecified unreleased,,,,,Speculative,Korea (Republic of),Industry,,,NVIDIA A100,,,,,,92.0,,,,,,
PermuteFormer,Language,Peking University,Peng Chen,2021-09-06,PermuteFormer: Efficient Relative Position Encoding for Long Sequences,https://arxiv.org/abs/2109.02377,SOTA improvement,"""Results show that
PermuteFormer uniformly improves the performance of Performer, accelerates convergence, and
achieves state-of-the-art on some tasks.""",33000000.0,,3.1e+18,,WikiText-103,,,,,,China,Academia,,,,Unreleased,,,,,19.0,,,30.0,,,
MEB,Search,Microsoft,"W Liu, Z Wang, X Liu, N Zeng, Y Liu, FE Alsaadi",2021-09-04,Make Every feature Binary: A 135B parameter sparse neural network for massively improved search relevance,https://www.microsoft.com/en-us/research/blog/make-every-feature-binary-a-135b-parameter-sparse-neural-network-for-massively-improved-search-relevance/,Significant use,"""MEB is running in production for 100 percent of Bing searches, in all regions and languages.""",135000000000.0,See paper title,,,,,,"""MEB uses three years of search logs from Bing as training data."" TODO convert",,,United States of America,Industry,,,,,,,,,26.0,,,,,,
FLAN 137B,Language,Google Research,"Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le",2021-09-03,Finetuned Language Models Are Zero-Shot Learners,https://arxiv.org/abs/2109.01652,"Highly cited,SOTA improvement","Abstract:
""FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 datasets that we evaluate.""",137000000000.0,"Abstract:
""We take a 137B parameter pretrained language model and instruction tune it on
over 60 NLP datasets verbalized via natural language instruction templates. We
evaluate this instruction-tuned model, which we call FLAN, on unseen task types.""
Many models seem to be using the same 137B base transformer model?",4.896e+22,"From section 2.4: ""60 hours on a TPUv3 with 128 cores."" I assume that ""128 cores"" = 128 TPUv3s. Which took less than 2% of total time (see environmental considerations section)","Wikipedia,Unspecified unreleased","Abstract: ""We take a 137B parameter pretrained language model and instruction tune it on over 60 NLP datasets""",1870000000000.0,"""Model architecture and pretraining. In our experiments, we use LaMDA-PT, a dense left-to-right, decoder-only transformer language model of 137B parameters (Thoppilan et al., 2022). This model is pretrained on a collection of web documents (including those with computer code), dialog data, and Wikipedia, tokenized into 2.49T BPE tokens with a 32k vocabulary using the SentencePiece library (Kudo & Richardson, 2018). Around 10% of the pretraining data was non-English. Note that LaMDA-PT only has language model pretraining (c.f. LaMDA, which was finetuned for dialog).""
2.49e12 tokens ~= 1.87e12 words","This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning—finetuning language models on a collection of datasets described via instructions—substantially improves zeroshot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction tune it on over 60 NLP datasets verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 datasets that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.",,Multinational,Industry,60.0,,Google TPU v3,Unreleased,"""In our experiments, we use LaMDA-PT, a dense left-to-right,
decoder-only transformer language model of 137B parameters (Thoppilan et al., 2022) [...] Note that
LaMDA-PT only has language model pretraining (c.f. LaMDA, which was finetuned for dialog)."" In our entry for LaMDA we only measured pre-training compute, so we just specify LaMDA as the base model of FLAN 137B.",64.0,,LaMDA,2242.0,,,,230526.76439336664,,
XLMR-XXL,Language,Facebook AI Research,"Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau",2021-08-17,Larger-Scale Transformers for Multilingual Masked Language Modeling,https://arxiv.org/abs/2105.00572,SOTA improvement,"Abstract:
""Our model also outperforms
the RoBERTa-Large model on several English tasks of the GLUE benchmark by 0.3% on average while handling 99 more languages.""",10700000000.0,"Section 2.1:
"" ...XLM-RXXL (L= 48, H = 4096, A = 32, 10.7B params)""",,,CC100,,125250000000.0,"""We pretrain the models on the CC100 dataset, which corresponds to 167B tokens in 100 languages.""
1 token ~ 0.75 words",,,United States of America,Industry,,,,,,,82.0,,,,,,
DNABERT,Biology,Northwestern University,"Yanrong Ji, Zhihan Zhou, Han Liu, Ramana V Davuluri",2021-08-15,DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome,https://academic.oup.com/bioinformatics/article/37/15/2112/6128680,SOTA improvement,"""We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data."" [Abstract] - SOTA improvement on very specific task",110000000.0,"""We used the same model architecture as the BERT base, which consists of 12 Transformer layers with 768 hidden units and 12 attention heads in each layer, and the same parameter setting across all the four DNABERT models during pre-training"""
Known to have 110 million parameters as reported in: https://arxiv.org/pdf/1810.04805v2.pdf
""We primarily report results on two model sizes: BERTBASE (L=12, H=768, A=12, Total Parameters=110M) [...]""",1.07e+20,"""Since the pre-training of DNABERT model is resource-intensive (about 25 days on 8 NVIDIA 2080Ti GPUs)""
Assuming FP16 and 30% utilization
Calculation = (25 * 24 *3600) s * 2.7e13 FLOP/s per GPU * 8 GPUs * 0.3 utilization = 1.4e20 FLOP
Alternatively:
""DNABERT takes a sequence with a max length of 512 as input... We pre-trained DNABERT for 120k steps with a batch size of 2000""
6 * 512 * 2000 * 120k * 110M = 8.11e19
Geometric mean: 1.07e20",Human genome,"""We generated training data from human genome [...]"" [2.2.2 Pre-training]. ",3000000000.0,"The human genome is around 3 billion base pairs (https://useast.ensembl.org/Homo_sapiens/Info/Annotation).
The authors use both non-overlapping sampling and random sampling from a human genome, though the source is unspecified.","Motivation
Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture especially in data-scarce scenarios.
Results
To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferrable understanding of genomic DNA sequences based on up and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that pre-trained DNABERT with human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fined tuned to many other sequence analyses tasks.",Confident,United States of America,Academia,600.0,"""Since the pre-training of DNABERT model is resource-intensive (about 25 days on 8 NVIDIA 2080Ti GPUs)""",NVIDIA GeForce RTX 2080 Ti,Open source,,,,,365.0,,,,,,
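A sketch of the reconciliation in the DNABERT compute notes above: the hardware-time estimate and the operation-count estimate are combined via a geometric mean. The 2.7e13 FLOP/s FP16 peak for an RTX 2080 Ti and the 30% utilization are the notes' assumptions.

```python
from math import sqrt

# DNABERT compute, reconciling two estimates from the notes above.
hardware = 25 * 24 * 3600 * 2.7e13 * 8 * 0.3   # 25 days on 8 GPUs -> ~1.4e20 FLOP
tokens = 512 * 2000 * 120_000                  # seq length * batch size * steps
op_count = 6 * tokens * 110e6                  # 6ND with 110M params -> ~8.1e19 FLOP
print(f"{sqrt(hardware * op_count):.2e}")      # geometric mean ~1.07e+20 FLOP
```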
Zidong Taichu,"Multimodal,Speech,Vision,Language",Chinese Academy of Sciences,,2021-08-11,Zidong Ancestral multi-modal large model,https://gitee.com/zidongtaichu/multi-modal-models,Historical significance,"The world’s first image, language, and audio trimodal pre-trained model.",3200000000.0,共32亿参数 translated as A total of 3.2 billion parameters ,,,,,,,,Confident,China,Academia,,,,,,,,,0.0,,,,,,
W2v-BERT,Speech,"Google Brain,Massachusetts Institute of Technology (MIT)","Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, Yonghui Wu",2021-08-07,W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training,https://arxiv.org/abs/2108.06209v2,SOTA improvement,"""Our experiments show that w2v-BERT achieves competitive results
compared to current state-of-the-art pre-trained models on the LibriSpeech benchmarks when using the Libri-Light 60k corpus as the
unsupervised data. In particular, when compared to published models such as conformer-based wav2vec 2.0 and HuBERT, our model
shows 5% to 10% relative WER reduction on the test-clean and
test-other subsets""",1000000000.0,1B for XXL model,,,LibriLight,"""We use the Libri-Light unlab-60k subset [34], which contains
about 60,000 hours of unannotated speech audio, for pre-training
w2v-BERT models. For our main results, we use the LibriSpeech
960hr subset [35] as the supervised data, and use the 100hr subset
for ablation studies""",,,"Motivated by the success of masked language modeling~(MLM) in pre-training natural language processing models, we propose w2v-BERT that explores MLM for self-supervised speech representation learning. w2v-BERT is a framework that combines contrastive learning and MLM, where the former trains the model to discretize input continuous speech signals into a finite set of discriminative speech tokens, and the latter trains the model to learn contextualized speech representations via solving a masked prediction task consuming the discretized tokens. In contrast to existing MLM-based speech pre-training frameworks such as HuBERT, which relies on an iterative re-clustering and re-training process, or vq-wav2vec, which concatenates two separately trained modules, w2v-BERT can be optimized in an end-to-end fashion by solving the two self-supervised tasks~(the contrastive task and MLM) simultaneously. Our experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models on the LibriSpeech benchmarks when using the Libri-Light~60k corpus as the unsupervised data. In particular, when compared to published models such as conformer-based wav2vec~2.0 and HuBERT, our model shows~5\% to~10\% relative WER reduction on the test-clean and test-other subsets. When applied to the Google's Voice Search traffic dataset, w2v-BERT outperforms our internal conformer-based wav2vec~2.0 by more than~30\% relatively.",Confident,"United States of America,United States of America","Industry,Academia",,,,,,,,,268.0,,,,,,
YOLOX-X,Vision,Megvii Inc,"Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, Jian Sun",2021-08-06,YOLOX: Exceeding YOLO Series in 2021,https://arxiv.org/abs/2107.08430,"Highly cited,SOTA improvement",Table 6,99100000.0,"99.1M, table 3",,"""We train the models for a total of 300 epochs with 5 epochs warmup on COCO train2017 [17]. We use stochastic gradient descent (SGD) for training. We use a learning rate of
lr×BatchSize/64 (linear scaling [8]), with a initial lr =
0.01 and the cosine lr schedule. The weight decay is 0.0005
and the SGD momentum is 0.9. The batch size is 128 by
default to typical 8-GPU devices""",COCO 2017,"""We train the models for a total of 300 epochs with 5 epochs warmup on COCO train2017""",2500000.0,"2.5 million image-label pairs, per Coco paper https://arxiv.org/abs/1405.0312","In this report, we present some experienced improvements to YOLO series, forming a new high-performance detector -- YOLOX. We switch the YOLO detector to an anchor-free manner and conduct other advanced detection techniques, i.e., a decoupled head and the leading label assignment strategy SimOTA to achieve state-of-the-art results across a large scale range of models: For YOLO-Nano with only 0.91M parameters and 1.08G FLOPs, we get 25.3% AP on COCO, surpassing NanoDet by 1.8% AP; for YOLOv3, one of the most widely used detectors in industry, we boost it to 47.3% AP on COCO, outperforming the current best practice by 3.0% AP; for YOLOX-L with roughly the same amount of parameters as YOLOv4-CSP, YOLOv5-L, we achieve 50.0% AP on COCO at a speed of 68.9 FPS on Tesla V100, exceeding YOLOv5-L by 1.8% AP. Further, we won the 1st Place on Streaming Perception Challenge (Workshop on Autonomous Driving at CVPR 2021) using a single YOLOX-L model. We hope this report can provide useful experience for developers and researchers in practical scenes, and we also provide deploy versions with ONNX, TensorRT, NCNN, and Openvino supported. Source code is at this https URL.",Likely,China,Industry,,,NVIDIA V100,Open source,,,,,3207.0,,,300.0,,,
6-Act Tether,Robotics,"Facebook AI Research,Georgia Institute of Technology","Joel Ye, Dhruv Batra, Abhishek Das, Erik Wijmans",2021-08-03,Auxiliary Tasks and Exploration Enable ObjectGoal Navigation,https://openaccess.thecvf.com/content/ICCV2021/html/Ye_Auxiliary_Tasks_and_Exploration_Enable_ObjectGoal_Navigation_ICCV_2021_paper.html,SOTA improvement,"""Our agents achieve 24.5% success and 8.1% SPL, a 37% and 8% relative improvement over prior state-of-the-art, respectively, on the Habitat ObjectNav Challenge""",5000000.0,"""Agent parameter counts were all 5 − 6 million parameters, excluding parameters in auxiliary modules""",,"""In our experiments, we train each of our agents for 8 GPU-weeks (192 GPU-hours)"". No GPU specified.",Matterport,"""We experiment on the Matterport dataset (MP3D [4]), which has 90 scenes and 40 labeled semantic object categories.""",,,,Confident,"United States of America,United States of America","Industry,Academia",,,,,,,,,60.0,,,,,,
SEER,Vision,"Facebook AI Research,INRIA","Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, Piotr Bojanowski",2021-07-29,Self-supervised Pretraining of Visual Features in the Wild,https://arxiv.org/abs/2103.01988,SOTA improvement,"SOTA for self-supervised models on ImageNet, which seems fair to consider a different benchmark than ImageNet for supervised models.
""Our final SElf-supERvised (SEER) model,
a RegNetY with 1.3B parameters trained on 1B random
images with 512 GPUs achieves 84.2% top-1 accuracy,
surpassing the best self-supervised pretrained model by 1%""",1300000000.0,"From abstract:
"" Our final SElf-supERvised (SEER) model, a RegNetY with 1.3B parameters...""",4.42e+21,"Numbers from section 3.2
512 GPUs * 0.1 * 8days * 24h/day * 3600s/h * 125 TFLOP/s",Instagram,"Section 3.3:
""For our billion scale pretraining, we consider a dataloader that directly samples random, public, and non-EU images from Instagram""
Note the dataset is not static - it is refreshed every 90 days",1000000000.0,"""Overall, we train
on 1B images for a total of 122K iterations.""","Recently, self-supervised learning methods like MoCo, SimCLR, BYOL and SwAV have reduced the gap with supervised methods. These results have been achieved in a control environment, that is the highly curated ImageNet dataset. However, the premise of self-supervised learning is that it can learn from any random image and from any unbounded dataset. In this work, we explore if self-supervision lives to its expectation by training large models on random, uncurated images with no supervision. Our final SElf-supERvised (SEER) model, a RegNetY with 1.3B parameters trained on 1B random images with 512 GPUs achieves 84.2% top-1 accuracy, surpassing the best self-supervised pretrained model by 1% and confirming that self-supervised learning works in a real world setting. Interestingly, we also observe that self-supervised models are good few-shot learners achieving 77.9% top-1 with access to only 10% of ImageNet. Code: this https URL",,"United States of America,France","Industry,Academia",192.0,8 days,NVIDIA Tesla V100 DGXS 32 GB,Open access (non-commercial),,512.0,,,227.0,,,,34114.252463690456,,
HuBERT,Speech,Facebook AI Research,"Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed",2021-07-27,HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,https://arxiv.org/abs/2106.07447,"Highly cited,SOTA improvement","Abstract:
"" the
HuBERT model either matches or improves upon the state-ofthe-art wav2vec 2.0 performance on the Librispeech (960h) and
Libri-light (60,000h) benchmarks with 10min, 1h, 10h, 100h, and
960h fine-tuning subsets.""",1000000000.0,"From abstract:
""Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets""",5.54e+21,"GPU NOT SPECIFIED - for the sake of argument I assume something on the order of 1 TFLOP/s
Numbers from Section IV part C
0.1 * (960h * 32GPUs + 60000h * 256 GPUs) * 3600s/h * 1 TFLOP/s/GPU","LibriSpeech,LibriLight",,820800000.0,"""When the HuBERT model is pre-trained on either the standard Librispeech 960h [24] or the Libri-Light 60k hours [25], it either matches or improves upon the state-of-theart wav2vec 2.0 [6] performance on all fine-tuning subsets of 10mins, 1h, 10h, 100h, and 960h.""
1h ~ 13,680 words
13,680 * 60,000 = 820800000","Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with 10min, 1h, 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.",,United States of America,Industry,,,,Open source,,,,,1611.0,,,,,,
GOAT,Games,DeepMind,"Open-Ended Learning Team*, Adam Stooke, Anuj Mahajan, Catarina Barros, Charlie Deck, Jakob Bauer, Jakub Sygnowski, Maja Trebacz, Max Jaderberg, Michael Mathieu, Nat McAleese, Nathalie Bradley-Schmieg, Nathaniel Wong, Nicolas Porcel, Roberta Raileanu, Steph Hughes-Fitt, Valentin Dalibard and Wojciech Marian Czarnecki",2021-07-27,Open-Ended Learning Leads to Generally Capable Agents,"https://deepmind.com/blog/article/generally-capable-agents-emerge-from-open-ended-play
https://arxiv.org/abs/2107.12808",SOTA improvement,likely qualitatively SOTA,3500000.0,estimate described here: https://docs.google.com/document/d/1S9xZyCeITDOs-P1W_-liNW0WgVN-OLsSudVrPXMaLqw/edit?usp=sharing,7.8e+22,"[Final calculation]
(8 TPUs)(4.20e14 FLOP/s)(0.1 utilisation rate)(32 agents)(7.3e6 s/agent) = 7.8e22 FLOPs
==========================
NOTES BELOW
[Hardware]
- ""Each agent is trained using 8 TPUv3s and consumes approximately 50,000 agent steps (observations) per second.""
- TPUv3 (half precision): 4.2e14 FLOP/s
- Number of TPUs: 8
- Utilisation rate: 0.1
[Timesteps]
- Figure 16 shows steps per generation and agent. In total there are 1.5e10 + 4.0e10 + 2.5e10 + 1.1e11 + 2e11 = 3.9e11 steps per agent.
- 3.9e11 / 5e4 = 7.8e6 s → ~90 days
- 100 million steps is equivalent to 30 minutes of wall-clock time in our setup. (pg 29, fig 27)
- 1e8 steps → 0.5h
- 3.9e11 steps → 1950h → 7.0e6 s → ~82 days
- Both of these seem like overestimates, because:
“Finally, on the largest timescale (days), generational training iteratively improves population performance by bootstrapping off previous generations, whilst also iteratively updating the validation normalised percentile metric itself.” (pg 16)
- Suggests that the above is an overestimate of the number of days needed, else they would have said (months) or (weeks)?
- Final choice (guesstimate): 85 days = 7.3e6 s
[Population size]
- 8 agents? (pg 21) → this is describing the case where they’re not using PBT, so ignore this number
- The original PBT paper uses 32 agents for one task https://arxiv.org/pdf/1711.09846.pdf (in general it uses between 10 and 80)
- (Guesstimate) Average population size: 32",XLand,,390000000000.0,Figure 16 shows steps per generation and agent. In total there are 1.5e10 + 4.0e10 + 2.5e10 + 1.1e11 + 2e11 = 3.9e11 steps per agent.,"In this work we create agents that can perform well beyond a single, individual task, that exhibit much wider generalisation of behaviour to a massive, rich space of challenges. We define a universe of tasks within an environment domain and demonstrate the ability to train agents that are generally capable across this vast space and beyond. The environment is natively multi-agent, spanning the continuum of competitive, cooperative, and independent games, which are situated within procedurally generated physical 3D worlds. The resulting space is exceptionally diverse in terms of the challenges posed to agents, and as such, even measuring the learning progress of an agent is an open research problem. We propose an iterative notion of improvement between successive generations of agents, rather than seeking to maximise a singular objective, allowing us to quantify progress despite tasks being incomparable in terms of achievable rewards. We show that through constructing an open-ended learning process, which dynamically changes the training task distributions and training objectives such that the agent never stops learning, we achieve consistent learning of new behaviours. The resulting agent is able to score reward in every one of our humanly solvable evaluation levels, with behaviour generalising to many held-out points in the universe of tasks. Examples of this zero-shot generalisation include good performance on Hide and Seek, Capture the Flag, and Tag. Through analysis and hand-authored probe tasks we characterise the behaviour of our agent, and find interesting emergent heuristic behaviours such as trial-and-error experimentation, simple tool use, option switching, and cooperation. Finally, we demonstrate that the general capabilities of this agent could unlock larger scale transfer of behaviour through cheap finetuning.",,United Kingdom of Great Britain and Northern Ireland,Industry,,"see other notes
",Google TPU v3,Unreleased,,,,,147.0,,,,84799.78517435073,,
Codex,Language,OpenAI,"Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba",2021-07-07,Evaluating Large Language Models Trained on Code,https://openai.com/blog/openai-codex/,"Significant use,Highly cited",,12000000000.0,"""With just a single sample, a 12B parameter Codex solves 28.8% of these problems, and a 300M parameter Codex solves 13.2% of these problems""",,"""The original training of GPT-3-12B consumed hundreds of petaflop/s-days of compute, while fine-tuning it to create Codex-12B
consumed a similar amount of compute.""
",,,31800000000.0,"""Our training dataset was collected in May 2020 from 54 million public software repositories hosted on GitHub, containing 179 GB of unique Python files under 1 MB. We filtered out files which were likely auto-generated, had average line
length greater than 100, had maximum line length greater
than 1000, or contained a small percentage of alphanumeric
characters. After filtering, our final dataset totaled 159 GB.""
1 GB ~ 200M words",,,United States of America,Industry,,,,,,,,,2736.0,,,,,,
ERNIE 3.0,Language,Baidu,"Y Sun, S Wang, S Feng, S Ding, C Pang",2021-07-05,ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation,http://research.baidu.com/Blog/index-view?id=160,SOTA improvement,"""ERNIE 3.0 achieved new state-of-the-art results across 54 Chinese NLP tasks""",10000000000.0,"""We trained the model with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph.""",2.25e+22,"Section 3.3.3:
""""The model is trained for
a total of 375 billion tokens""
Total compute approximated as 6*N*D",,,668000000000.0,"""To ensure the success of the pre-training of ERNIE 3.0, we construct a large-scale, wide-variety and high-quality Chinese text corpora amounting to 4TB storage size in 11 different categories.""
1 GB ~ 167M Chinese words",,,China,Industry,,,NVIDIA V100,Open source,,384.0,,,278.0,,,,39104.26710360624,,
Adaptive Input Transformer + RD,Language,"Microsoft Research Asia,Soochow University","Xiaobo Liang, Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, Min Zhang, Tie-Yan Liu",2021-06-28,R-Drop: Regularized Dropout for Neural Networks,https://arxiv.org/abs/2106.14448,SOTA improvement,"""In particular, it yields substantial
improvements when applied to fine-tune large-scale pre-trained models, e.g., ViT, RoBERTa-large, and BART, and achieves state-of-the-art (SOTA) performances with the vanilla Transformer model""",247000000.0,,8.2e+19,,WMT14,,,,,,"China,Taiwan","Industry,Academia",,,,Unreleased,,,,,307.0,,,,,,
EfficientNetV2,Vision,"Google,Google Brain","Mingxing Tan, Quoc V. Le",2021-06-23,EfficientNetV2: Smaller Models and Faster Training,https://arxiv.org/abs/2104.00298,"Highly cited,SOTA improvement","""EfficientNetV2 achieves 87.3% top-1 accuracy on ImageNet ILSVRC2012, outperforming the recent ViT by 2.0% accuracy while
training 5x-11x faster using the same computing resources.""",208000000.0,"Table 7, page 7",9.56e+19,"Table 7, page 7: 45 hours on 32 TPUv3 cores.
""Each v3 TPU chip contains two TensorCores.""
TPU performance per chip = 123e12 FLOP/s
32 cores = 16 chips
123e12 FLOP/s per chip * (32 cores / 2 cores per chip) * 45 hours * 3600 seconds/hour * 0.30 utilization = 9.56e19 FLOP
https://www.wolframalpha.com/input?i=123+terahertz+*+16+*+45+hours+*+0.3",ImageNet21k,,14197122.0,,"This paper introduces EfficientNetV2, a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models. To develop this family of models, we use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency. The models were searched from the search space enriched with new ops such as Fused-MBConv. Our experiments show that EfficientNetV2 models train much faster than state-of-the-art models while being up to 6.8x smaller.
Our training can be further sped up by progressively increasing the image size during training, but it often causes a drop in accuracy. To compensate for this accuracy drop, we propose to adaptively adjust regularization (e.g., dropout and data augmentation) as well, such that we can achieve both fast training and good accuracy.
With progressive learning, our EfficientNetV2 significantly outperforms previous models on ImageNet and CIFAR/Cars/Flowers datasets. By pretraining on the same ImageNet21k, our EfficientNetV2 achieves 87.3% top-1 accuracy on ImageNet ILSVRC2012, outperforming the recent ViT by 2.0% accuracy while training 5x-11x faster using the same computing resources. Code will be available at this https URL.",Confident,"United States of America,United States of America","Industry,Industry",45.0,Table 7,Google TPU v3,Open source,,,,,1609.0,,,,104.34013401561587,,
Denoising Diffusion Probabilistic Models (LSUN Bedroom),Vision,UC Berkeley,"Jonathan Ho, Ajay Jain, Pieter Abbeel",2021-06-11,Denoising Diffusion Probabilistic Models,https://arxiv.org/abs/2006.11239,"Highly cited,SOTA improvement","Novel approach to image synthesis that yields SOTA results on datasets like CIFAR-10
Abstract:
""On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. """,256000000.0,"Appendix B:
"" Our CIFAR10 model has 35.7 million parameters, and our LSUN and
CelebA-HQ models have 114 million parameters. We also trained a larger variant of the LSUN Bedroom model with approximately 256 million parameters by increasing filter count.""",3.8e+20,"Numbers in Appendix B
10.6h for the CIFAR model (batch size 128, 21 step/s)
2.2 step/s for the LSUN model, 1.15M steps so 702.8 hours
This is for TPUv3-8's, which seems to mean 8 cores (standard chip is 125 teraflop/s for 2 cores)
https://cloud.google.com/tpu/docs/regions-zones
1.25E14 FLOP/s * (8 cores / 2 cores/chip) * 702.8h * 3600s/h * 0.3 = 3.8e20",LSUN Bedroom,,3033042.0,"""We trained on CelebA-HQ for 0.5M steps, LSUN Bedroom for 2.4M steps, LSUN Cat for 1.8M steps, and LSUN Church for 1.2M steps.""
""The CelebA-HQ dataset is a high-quality version of CelebA that consists of 30,000 images at 1024×1024 resolution.""
https://paperswithcode.com/dataset/celeba-hq
LSUN bedroom has 3,033,042 examples. LSUN cat has 1,657,266 examples. LSUN church has 126,227 examples.
https://www.tensorflow.org/datasets/catalog/lsun
",,,United States of America,Academia,,,Google TPU v3,Open source,,,,,8071.0,,,,436.308484475363,,
ALIGN,"Multimodal,Vision,Language",Google Research,"Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig",2021-06-11,Scaling up visual and vision-language representation learning with noisy text supervision,https://arxiv.org/abs/2102.05918,"Highly cited,SOTA improvement","""The aligned visual and language representations... set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks""",820000000.0,"From author communication
480M (image tower) + 340M (text tower)",2.59867e+22,"From author communication
14.82K TPUv3 core-days
Precision: bfloat16
Estimation
TPUv3 at float16: 123 TFLOPS/chip
123*10^12 FLOP/s per chip * (1 chip / 2 cores) * 14820 TPU core-days * 86400 s/day * 33% utilization = 2.599*10^22 FLOP
https://www.wolframalpha.com/input?i=14820+days+*+123+teraFLOPS+%2F+2+*+0.33",Conceptual Captions (CC3M),,1600000000.0,"Dataset contains 1.8B image-text pairs, then some duplicates are removed.","Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enables zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.",Confident,Multinational,Industry,347.3,14820 TPU core-days * 24 h/day / 1024 TPU cores = 347.3 hours,Google TPU v3,Unreleased,,512.0,,,2432.0,,,,32852.91660437908,,
DeBERTa,Language,Microsoft,"Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen",2021-06-10,DeBERTa: Decoding-enhanced BERT with Disentangled Attention,https://arxiv.org/abs/2006.03654,"Highly cited,SOTA improvement","""DeBERTa significantly outperforms all existing PLMs of similar size on MNLI and creates a new state of the art""",1500000000.0,"""...we scale up DeBERTa by training a larger version that consists of 48 Transform layers with 1.5 billion parameters""
Other versions are smaller and use a smaller pre-training dataset. These are distinguished in the paper (e.g. DeBERTa1.5B is the version of DeBERTa with 1.5 billion parameters).",6e+21,"From section 5.1.1: ""We use 6 DGX-2 machines (96 V100 GPUs) to train the models. A single model trained with 2K batch size and 1M steps takes about 20 days.""
This specifically refers to the largest models referred to in the paper, and smaller models are described elsewhere, but I'm assuming the large models are what we care about here.
Apparently there are multiple types of GPUs referred to as V100s. I'm guessing these are NVIDIA Tesla SXM2s.","Wikipedia,CC-Stories,OPENWEBTEXT,BookCorpus (BooksCorpus, Toronto Book Corpus)","We pre-train our large models following the setting of BERT (Devlin et al., 2019), except that we use the BPE vocabulary of Radford et al. (2019); Liu et al. (2019c). For training data, we use Wikipedia (English Wikipedia dump3; 12GB), BookCorpus (Zhu et al., 2015) (6GB), OPENWEBTEXT (public Reddit content (Gokaslan & Cohen, 2019); 38GB), and STORIES (a subset of CommonCrawl (Trinh & Le, 2018); 31GB). The total data size after data deduplication (Shoeybi et al., 2019) is about 78G",15600000000.0,""" DeBERTa is pretrained on 78G training data""
1GB ~ 200M words","Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural langauge generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transform layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, out performing the human baseline by a decent margin (90.3 versus 89.8).",,United States of America,Industry,240.0,20 days,NVIDIA V100,Open source,,96.0,,,1817.0,,,,6682.2289986716,,
EMDR,Language,"Mila - Quebec AI (originally Montreal Institute for Learning Algorithms),McGill University,DeepMind","Devendra Singh Sachan, Siva Reddy, William Hamilton, Chris Dyer, Dani Yogatama",2021-06-09,End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering,https://arxiv.org/abs/2106.05346v2,SOTA improvement,"""Experiments on three benchmark datasets demonstrate that our proposed method outperforms all existing approaches of comparable size by 2-3% absolute exact match points, achieving new state-of-the-art results.""",440000000.0,Table 2,1.04e+21,"""We run all of our experiments on a machine with 96 CPUs, 1.3TB physical memory, and 16 A100 GPUs. We use PyTorch (Paszke et al., 2019) to implement our proposed model. With this hardware setup, our experiments on NQ and TriviaQA took approximately 25 hours to complete,
while experiments on WebQ took roughly 8 hours to complete. Before supervised training, we also
perform a one-time unsupervised MSS pre-training for 82,000 steps that took roughly 1 week.""
1 week + 25 hours * 16 A100s
= ~193 * 16 A100-hours
= 193 * 16 * 3600 * 312 trillion * 0.3 = 1.04e21","Wikipedia,NQ (Natural Questions),TriviaQA","pre-train on Wikipedia (Table 6), then training on the QA datasets",,,"We present an end-to-end differentiable training method for retrieval-augmented open-domain question answering systems that combine information from multiple retrieved documents when generating answers. We model retrieval decisions as latent variables over sets of relevant documents. Since marginalizing over sets of retrieved documents is computationally hard, we approximate this using an expectation-maximization algorithm. We iteratively estimate the value of our latent variable (the set of relevant documents for a given question) and then use this estimate to update the retriever and reader parameters. We hypothesize that such end-to-end training allows training signals to flow to the reader and then to the retriever better than staged-wise training. This results in a retriever that is able to select more relevant documents for a question and a reader that is trained on more accurate documents to generate an answer. Experiments on three benchmark datasets demonstrate that our proposed method outperforms all existing approaches of comparable size by 2-3% absolute exact match points, achieving new state-of-the-art results. Our results also demonstrate the feasibility of learning to retrieve to improve answer generation without explicit supervision of retrieval decisions.",Confident,"Canada,Canada,United Kingdom of Great Britain and Northern Ireland","Academia,Academia,Industry",230.0,,NVIDIA A100,Open source,,,,,112.0,,,,2773.3124307816734,,
CoAtNet,Vision,"Google,Google Research,Google Brain","Zihang Dai, Hanxiao Liu, Quoc V. Le, Mingxing Tan",2021-06-09,"CoAtNet: Marrying Convolution and Attention for All Data Sizes",https://arxiv.org/abs/2106.04803v2,SOTA improvement,"""Notably, when we further scale up CoAtNet with JFT-3B, it achieves 90.88% top-1 accuracy on ImageNet, establishing a new state-of-the-art result.""",2440000000.0,,4.27e+22,"20.1K TPU-v3 core-days
TPUs have two cores per chip, and a chip is 123 teraflop/s
https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#tpu_v3
123 teraflop/s * 20100/2 * 24 * 3600 * 0.4 (utilization assumption for non-language models) = 4.27e22",JFT-3B,"When only ImageNet-1K is used for training, CoAtNet achieves 86.0% top-1 accuracy, matching the prior art NFNet [20] under similar computation resource and training conditions. Further, when pre-trained on ImageNet-21K with about 10M images, CoAtNet reaches 88.56% top-1 accuracy when finetuned on ImageNet-1K, matching the ViT-Huge pre-trained on JFT-300M, a 23× larger dataset. Finally, when JFT-3B is used for pre-training, CoAtNet exhibits better efficiency compared to ViT, and pushes the ImageNet-1K top-1 accuracy to 90.88% while using 1.5x less computation of the prior art set by ViT-G/14 [26].",,,"Transformers have attracted increasing interests in computer vision, but they still fall behind state-of-the-art convolutional networks. In this work, we show that while Transformers tend to have larger model capacity, their generalization can be worse than convolutional networks due to the lack of the right inductive bias. To effectively combine the strengths from both architectures, we present CoAtNets (pronounced “coat” nets), a family of hybrid models built from two key insights: (1) depthwise Convolution and self-Attention can be naturally unified via simple relative attention; (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity and efficiency. Experiments show that our CoAtNets achieve state-of-the-art performance under different resource constraints across various datasets: Without extra data, CoAtNet achieves 86.0% ImageNet top-1 accuracy; When pre-trained with 13M images from ImageNet-21K, our CoAtNet achieves 88.56% top-1 accuracy, matching ViT-huge pre-trained with 300M images from JFT-300M while using 23x less data; Notably, when we further scale up CoAtNet with JFT-3B, it achieves 90.88% top-1 accuracy on ImageNet, establishing a new state-of-the-art result.",Likely,"United States of America,Multinational,United States of America","Industry,Industry,Industry",,,Google TPU v3,Unreleased,,,,,857.0,,,,1887.1635925610951,,
ViT-G/14,Vision,"Google Brain,Google Research","X Zhai, A Kolesnikov, N Houlsby, L Beyer",2021-06-08,Scaling Vision Transformers,https://arxiv.org/abs/2106.04560,SOTA improvement,"""we successfully train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy""",1843000000.0,Table 2 of paper,3.4e+21,"source: https://lair.lighton.ai/akronomicon/
archived: https://github.com/lightonai/akronomicon/tree/main/akrodb
Alternatively: per paper, ViT-G required between 20-30k TPUv3 core-days to train (from eyeballing the tick marks in Figure 9).
TPUv3 is 123 teraflop/s per chip, 2 cores per chip
123 trillion * (1/2) * 25,000 * 3600 * 0.4 = 2.2e21
","JFT-3B,ImageNet","We trained a large Vision Transformer, ViT-G/14, which
contains nearly two billion parameters. Section 3.6 details
the architecture’s shape. We evaluate the ViT-G/14 model on
a range of downstream tasks, and compare it to recent state-of-the-art results. We fine-tune on ImageNet",3000000000.0,"""For this study, we use the proprietary JFT-3B dataset, a larger version of the JFT-300M dataset used
in many previous works on large-scale computer vision models [31, 18, 11]. This dataset consists of
nearly 3 billion images, annotated with a class-hierarchy of around 30k labels via a semi-automatic
pipeline""","Attention-based neural networks such as the Vision Transformer (ViT) have recently attained state-of-the-art results on many computer vision benchmarks. Scale is a primary ingredient in attaining excellent results, therefore, understanding a model's scaling properties is a key to designing future generations effectively. While the laws for scaling Transformer language models have been studied, it is unknown how Vision Transformers scale. To address this, we scale ViT models and data, both up and down, and characterize the relationships between error rate, data, and compute. Along the way, we refine the architecture and training of ViT, reducing memory consumption and increasing accuracy of the resulting models. As a result, we successfully train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy. The model also performs well for few-shot transfer, for example, reaching 84.86% top-1 accuracy on ImageNet with only 10 examples per class.",Confident,"United States of America,Multinational","Industry,Industry",,,Google TPU v3,Unreleased,,,,,773.0,,,,3847.8613922131217,,
ByT5-XXL,Language,"Google,Google Research","Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel",2021-05-28,ByT5: Towards a token-free future with pre-trained byte-to-byte models,https://arxiv.org/abs/2105.13626,SOTA improvement,"""On the most realistic in-language setting, where some gold training data is available in all languages, ByT5 surpasses the previous state-of-art mT5 on all tasks and model sizes""",12900000000.0,"12.9B, from Table 1",8.1e+22,"""Like mT5, we set our sequence length to 1024 (bytes rather than tokens), and train for 1 million steps over batches of 2^20 tokens.""
12.9 billion * 1 million * 2^20 * 6 = ~8.1e22",mC4,,,,"Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.",Likely,"United States of America,Multinational","Industry,Industry",,,Google TPU v3,Open source,,,,,319.0,,,,92453.37635669748,1048576.0,"""Like mT5, we set our sequence length to 1024 (bytes rather than tokens), and train for 1 million steps over batches of 2^20 tokens"""
Transformer local-attention (NesT-B),Vision,"Google Cloud,Google Research","Zizhao Zhang, Han Zhang, Long Zhao, Ting Chen, Sercan Arık, Tomas Pfister",2021-05-26,"Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding",https://arxiv.org/abs/2105.12723v4,Highly cited,,90100000.0,"Table A2, NesT-B is the largest size.",2.40576e+19,"17.9 GFLOPS per forward pass
300 epochs
1.28M training examples
3.5 f_to_b pass ratio
(From Imagenet paper-data, Besiroglu et al., forthcoming) ",ImageNet-1k,,1280000.0,,"Hierarchical structures are popular in recent vision transformers, however, they require sophisticated designs and massive datasets to work well. In this paper, we explore the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical way. We find that the block aggregation function plays a critical role in enabling cross-block non-local information communication. This observation leads us to design a simplified architecture that requires minor code changes upon the original vision transformer. The benefits of the proposed judiciously-selected design are threefold: (1) NesT converges faster and requires much less training data to achieve good generalization on both ImageNet and small datasets like CIFAR; (2) when extending our key ideas to image generation, NesT leads to a strong decoder that is 8× faster than previous transformer-based generators; and (3) we show that decoupling the feature learning and abstraction processes via this nested hierarchy in our design enables constructing a novel method (named GradCAT) for visually interpreting the learned model. Source code is available this https URL.",,"Multinational,Multinational","Industry,Industry",,,,Open source,,,,,5734.0,,,,,,
CogView,Image generation,"Tsinghua University,Alibaba DAMO Academy","M Ding, Z Yang, W Hong, W Zheng, C Zhou",2021-05-26,CogView: Mastering Text-to-Image Generation via Transformers,https://arxiv.org/abs/2105.13290,SOTA improvement,"""CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work DALL-E""",4000000000.0,"""We propose CogView, a 4-billion-parameter Transformer with VQ-VAE tokenizer to advance this problem.""",2.68e+22,"source: https://lair.lighton.ai/akronomicon/
archived: https://github.com/lightonai/akronomicon/tree/main/akrodb",WuDao Corpora,"""We collected about 30 million text-image pairs from multiple channels, and built a 2.5TB new dataset (after tokenization, the size becomes about 250GB).""",50000000000.0,"""We collected about 30 million text-image pairs from multiple channels, and built a 2.5TB new dataset (after tokenization, the size becomes about 250GB).""
250GB * (1 word / 5 bytes) = 50 billion words or 67 billion tokens
So 30M text-image pairs and 50 billion words","Text-to-Image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with VQ-VAE tokenizer to advance this problem. We also demonstrate the finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, e.g. eliminating NaN losses. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work DALL-E.",Likely,"China,China","Academia,Industry",,,NVIDIA Tesla V100 DGXS 16 GB,Open source,,512.0,,,521.0,,,,60071.706664791694,,
ConSERT,Language,"Meituan University,Beijing University of Posts and Telecommunications","Yuanmeng Yan, Rumei Li, Sirui Wang, Fuzheng Zhang, Wei Wu, Weiran Xu",2021-05-25,ConSERT: A contrastive framework for self-supervised sentence representation transfer,https://arxiv.org/abs/2105.11741,SOTA improvement,Trains an effective BERT model on small sample sizes and achieves an 8% improvement over previous SOTA on STS datasets.,340000000.0,,2.8e+20,"Fine-tuning was done using a single Nvidia V100 GPU for a few minutes -> 1.0E+15 to 5.0E+15 (2 to 10 min)
Foundation model is BERT with 2.8e+20 FLOP.
So total compute is 2.8e+20.",Chinese STS,,,,"Learning high-quality sentence representations benefits a wide range of natural language processing tasks. Though BERT-based pre-trained language models achieve high performance on many downstream tasks, the native derived sentence representations are proved to be collapsed and thus produce a poor performance on the semantic textual similarity (STS) tasks. In this paper, we present ConSERT, a Contrastive Framework for Self-Supervised Sentence Representation Transfer, that adopts contrastive learning to fine-tune BERT in an unsupervised and effective way. By making use of unlabeled texts, ConSERT solves the collapse issue of BERT-derived sentence representations and make them more applicable for downstream tasks. Experiments on STS datasets demonstrate that ConSERT achieves an 8\% relative improvement over the previous state-of-the-art, even comparable to the supervised SBERT-NLI. And when further incorporating NLI supervision, we achieve new state-of-the-art performance on STS tasks. Moreover, ConSERT obtains comparable results with only 1000 samples available, showing its robustness in data scarcity scenarios.",Confident,"China,China","Academia,Academia",0.1,,NVIDIA Tesla V100S PCIe 32 GB,Open source,,,,,428.0,,,,957.274379475876,,
MedBERT,Medicine,"Peng Cheng Laboratory,University of Texas at Houston","Laila Rasmy, Yang Xiang, Ziqian Xie, Cui Tao, Degui Zhi",2021-05-20,Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction,https://www.nature.com/articles/s41746-021-00455-y,SOTA improvement,"""This work is the first demonstration of significantly boosted
performance over state-of-the-art methods on multiple
clinical tasks with phenotyped cohorts.""",17000000.0,"17M from ""This is possibly due to the fact that the untrained Med-BERT is an over-parameterized model (around 17 million parameters) with a huge
number of configurations, so it might overfit to the training data""",9.47e+18,"flops = (1) * (3.13e13) * (24*7 * 3600) * (0.5) = 9.47e18
(num gpu) * (peak flops) * (time in seconds) * (assumed utilization rate)
I assume higher utilization rate, because only 1 GPU is used.
Citation from the text:
""We used a single Nvidia Tesla V100GPU of 32 GB graphics memory capacity, and we trained the model for a week for more than 45 million steps, for which each step consists of 32 patients (batch size)."" - page 11
Note that public code appears not to make use of the tensor core speed up, thus I use 3.13e13 FLOP/sec",Cerner Health Facts,"page 3 data source
""We extracted our cohorts from two databases: Cerner Health
Facts® (version 2017) (Cerner) and Truven Health MarketScan®
(Truven)""
""Our pretraining cohort for Med-BERT is consisting of 28 million
patients extracted from Cerner""",,"data about 28M patients
""Our pretraining cohort for Med-BERT is consisting of 28 million
patients extracted from Cerner""","Deep learning (DL)-based predictive models from electronic health records (EHRs) deliver impressive performance in many clinical tasks. Large training cohorts, however, are often required by these models to achieve high accuracy, hindering the adoption of DL-based models in scenarios with limited training data. Recently, bidirectional encoder representations from transformers (BERT) and related models have achieved tremendous successes in the natural language processing domain. The pretraining of BERT on a very large training corpus generates contextualized embeddings that can boost the performance of models trained on smaller datasets. Inspired by BERT, we propose Med-BERT, which adapts the BERT framework originally developed for the text domain to the structured EHR domain. Med-BERT is a contextualized embedding model pretrained on a structured EHR dataset of 28,490,650 patients. Fine-tuning experiments showed that Med-BERT substantially improves the prediction accuracy, boosting the area under the receiver operating characteristics curve (AUC) by 1.21–6.14% in two disease prediction tasks from two clinical databases. In particular, pretrained Med-BERT obtains promising performances on tasks with small fine-tuning training sets and can boost the AUC by more than 20% or obtain an AUC as high as a model trained on a training set ten times larger, compared with deep learning models without Med-BERT. We believe that Med-BERT will benefit disease prediction studies with small local training datasets, reduce data collection expenses, and accelerate the pace of artificial intelligence aided healthcare.",Likely,"China,United States of America","Academia,Academia",168.0,"""We used a single Nvidia Tesla V100GPU of 32 GB graphics memory capacity, and we trained the model for a week for more than 45 million steps, for which each step consists of 32 patients (batch size)."" - page 11",NVIDIA Tesla V100 DGXS 32 GB,Unreleased,,1.0,,,424.0,,,,62.48645493610712,,
ADM,Image generation,OpenAI,"Prafulla Dhariwal, Alex Nichol",2021-05-11,Diffusion Models Beat GANs on Image Synthesis,https://arxiv.org/abs/2105.05233,"Highly cited,SOTA improvement","""We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models""",559000000.0,"Largest model is denoted ImageNet 512, has 559M parameters",6.2e+21,"Largest run with their architecture improvements is the ImageNet 512 variant. Table 7 suggests utilization is around 30% for largest models (though we only see 256 x 256 and 128 -> 512)
Table 10: ImageNet 512 variant took 1914 V100-days of training
125e12 FLOP/sec * 1914 days * 24 h/day * 3600 sec/h * 0.3 = 6.2e21","LSUN,ILSVRC 2012 subset of ImageNet","""To evaluate our improved model architecture on unconditional image generation, we train separate diffusion models on three LSUN [71] classes: bedroom, horse, cat""",1281167.0,"Biggest models are trained on ImageNet 512x512. ImageNet ILSVRC has 1,281,167 images in the training set, but it is possible some were filtered due to size.
Note that a smaller model was trained on LSUN {bedroom, horse, cat}, which forms a larger dataset:
3,033,042 + 2,000,340 + 1,657,266 = 6,690,648 images","We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. We achieve this on unconditional image synthesis by finding a better architecture through a series of ablations. For conditional image synthesis, we further improve sample quality with classifier guidance: a simple, compute-efficient method for trading off diversity for fidelity using gradients from a classifier. We achieve an FID of 2.97 on ImageNet 128×128, 4.59 on ImageNet 256×256, and 7.72 on ImageNet 512×512, and we match BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution. Finally, we find that classifier guidance combines well with upsampling diffusion models, further improving FID to 3.94 on ImageNet 256×256 and 3.85 on ImageNet 512×512.",Confident,United States of America,Industry,,,NVIDIA V100,Open access (non-commercial),,,,,4052.0,,,,11274.484326547095,,
ProtT5-XXL-BFD,Biology,"Technical University of Munich,Med AI Technology,NVIDIA,Oak Ridge National Laboratory,Google,Seoul National University","Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, Burkhard Rost",2021-05-04,ProtTrans:Towards Cracking the Language of Life's Code Through Self-Supervised Learning,"https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3 or
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9477085",SOTA improvement,"""For the per-residue predictions the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information thereby bypassing expensive database searches.""",11000000000.0,Table 2,3.7e+22,"FLOP = 11B*2*(920k*512*4096) + 11B*4*(920k*512*4096), 920k steps using seq length 512 batch size 4096, ",BFD (Big Fantastic Dataset),"First, T5-XL and T5-XXL were trained on BFD for 1.2M and 920k steps respectively (ProtT5-XL-BFD, ProtT5-XXL-BFD). In a second step, ProtT5-XL-BFD and ProtT5-XXL-BFD were fine-tuned on
UniRef50 for 991k and 343k steps respectively (ProtT5-XLU50, ProtT5-XXL-U50).",,"Table 1: 2122M proteins, 393B amino acids, 572 GB","Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models taken from NLP. These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up-to 1024 cores.
Dimensionality reduction revealed that the raw protein LM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks. The first was a per-residue prediction of protein secondary structure (3-state accuracy Q3=81%-87%); the second were per-protein predictions of protein sub-cellular localization (ten-state accuracy: Q10=81%) and membrane vs. water-soluble (2-state accuracy Q2=91%). For the per-residue predictions the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information thereby bypassing expensive database searches. Taken together, the results implied that protein LMs learned some of the grammar of the language of life. To facilitate future work, we released our models at https://github.com/agemagician/ProtTrans.",Confident,"Germany,China,United States of America,United States of America,United States of America,Korea (Republic of)","Academia,Industry,Government,Industry,Academia",,,Google TPU v3,Open source,,512.0,,,,,,,43025.05718973151,,
ProtT5-XXL,Biology,"Technical University of Munich,Med AI Technology,NVIDIA,Oak Ridge National Laboratory,Google,Seoul National University","A Elnaggar, M Heinzinger, C Dallago, G Rihawi",2021-05-04,ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Learning,"https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3 or
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9477085",SOTA improvement,"""For the per-residue predictions the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art
without using evolutionary information""",11000000000.0,"source: https://lair.lighton.ai/akronomicon/
archived: https://github.com/lightonai/akronomicon/tree/main/akrodb",7.37e+22,"source: https://lair.lighton.ai/akronomicon/
archived: https://github.com/lightonai/akronomicon/tree/main/akrodb
3.7E+22 from Table 9 https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1","BFD (Big Fantastic Dataset),UniRef50","First, T5-XL and T5-XXL were trained on BFD for 1.2M and 920k steps respectively (ProtT5-XL-BFD, ProtT5-XXL-BFD). In a second step, ProtT5-XL-BFD and ProtT5-XXL-BFD were fine-tuned on
UniRef50 for 991k and 343k steps respectively (ProtT5-XLU50, ProtT5-XXL-U50).",393000000000.0,"""Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids.""","Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models taken from NLP. These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up-to 1024 cores. Dimensionality reduction revealed that the raw protein LM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks. The first was a per-residue prediction of protein secondary structure (3-state accuracy Q3=81%-87%); the second were per-protein predictions of protein sub-cellular localization (ten-state accuracy: Q10=81%) and membrane vs. water-soluble (2-state accuracy Q2=91%). For the per-residue predictions the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information thereby bypassing expensive database searches. Taken together, the results implied that protein LMs learned some of the grammar of the language of life. To facilitate future work, we released our models at https://github.com/agemagician/ProtTrans.",Confident,"Germany,China,United States of America,United States of America,United States of America,Korea (Republic of)","Academia,Industry,Government,Industry,Academia",,,Google TPU v3,Open source,,512.0,,,396.0,,,,85701.26256441118,,
ProtBERT-BFD,Biology,"Technical University of Munich,NVIDIA,Seoul National University,Google,Oak Ridge National Laboratory,Med AI Technology","Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, Burkhard Rost",2021-05-04,ProtTrans:Towards Cracking the Language of Life's Code Through Self-Supervised Learning,"https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3 or
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9477085",SOTA improvement,"""For the per-residue predictions the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information thereby bypassing expensive database searches.""",420000000.0,Table 2,3.9e+22,"FLOP = 420M * 6 * (800k*512*32k + 200k*2048*6k)
1M steps total split into two phases, (1) 800k steps, seq length 512 (batch size 32k) and (2) 200k steps, seq length 2048 (batch size 6k)
single TPU Pod V3-1024 (64 nodes and 1024 TPUs) info from paper and https://huggingface.co/Rostlab/prot_bert_bfd",BFD (Big Fantastic Dataset),"ProtBert: BERT2 was trained using both UniRef100
and BFD-100 datasets (referred to as ProtBert and ProtBert-BFD, respectively; Table 2)",8900000000000.0,"""ProtBERT-BFD (420M parameters) saw around 27B proteins during pre-training""
Table 1: BFD has 2122M proteins, 393B amino acids, 572 GB
Suggests average amino acid length of 185
Implies 27B * 185 = 5T amino acids seen in training
However, Table 2 suggests number of tokens (amino acids) seen in training was:
(512*32768*800k) + (2048*6144*200k) = 15.9T amino acids in training
Geometric mean = 8.9T","Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models taken from NLP. These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up-to 1024 cores.
Dimensionality reduction revealed that the raw protein LM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks. The first was a per-residue prediction of protein secondary structure (3-state accuracy Q3=81%-87%); the second were per-protein predictions of protein sub-cellular localization (ten-state accuracy: Q10=81%) and membrane vs. water-soluble (2-state accuracy Q2=91%). For the per-residue predictions the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information thereby bypassing expensive database searches. Taken together, the results implied that protein LMs learned some of the grammar of the language of life. To facilitate future work, we released our models at https://github.com/agemagician/ProtTrans.",Confident,"Germany,United States of America,Korea (Republic of),United States of America,United States of America,China","Academia,Industry,Academia,Industry,Government",,"figure 3 shows 19 hours per epoch, though this was on a different GPU setup than the one used for training.",Google TPU v3,Open source,,1024.0,,,,,,,45350.73595674404,,
ViT + DINO,Vision,"INRIA,Facebook AI Research","Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin",2021-04-29,Emerging Properties in Self-Supervised Vision Transformers,https://arxiv.org/abs/2104.14294,Highly cited,,85000000.0,"85M, table 1",2.1e+20,"""Overall, training DINO with Vision Transformers
achieves 76.1 top-1 accuracy using two 8-GPU servers for 3
days""
GPU is V100
16 * 125 teraflops * 3 days * 0.4 utilization
= 2.1e20
However, this isn't the best result in the paper (which is 80.1% with ViT-B/8). 76.1% is the result from ViT-B/16 per Table 2, which may be 5x cheaper than ViT-B/8 based on Table 1?
upd:
""Table 8: Time and memory requirements. We show total running
time and peak memory per GPU (“mem.”) when running ViT-S/16
DINO models on two 8-GPU machines.""
2*8*125 teraflops*72.6h*3600*0.4=2.09088e+20",ImageNet,"""We pretrain the models on the ImageNet dataset [60] without labels""",,,"In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of momentum encoder, multi-crop training, and the use of small patches with ViTs. We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.",Confident,"France,United States of America","Academia,Industry",,,NVIDIA V100,Open source,,,,,3536.0,,,300.0,380.2849049080349,,
PLUG,Language,Alibaba,,2021-04-19,,https://mp.weixin.qq.com/s/DAQomIkDa52Sef-ruyH5qg,SOTA improvement,Was SOTA on CLUE 1.0 https://www.cluebenchmarks.com/classification10.html,27000000000.0,,3.5997696e+22,128 Nvidia A100 for 35 days,,,,,,,China,Industry,840.0,35 days,NVIDIA A100,Hosted access (no API),,128.0,,,0.0,,,,108672.69136370042,,
M6-T,"Multimodal,Language,Vision",Alibaba,"An Yang, Junyang Lin, Rui Men, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Jiamang Wang, Yong Li, Di Zhang, Wei Lin, Lin Qu, Jingren Zhou, Hongxia Yang",2021-03-05,M6-T: Exploring Sparse Expert Models and Beyond,https://arxiv.org/abs/2105.15082,SOTA improvement,"Improves on hardware SOTA for similar problems
Abstract:
""We push the model
scale to over 1 trillion parameters and implement it on solely 480 NVIDIA V100-32GB GPUs, in comparison with the recent SOTAs [11; 6] on 2048 TPU cores.""",1002700000000.0,Table 5. Note model is sparse MoE with 960 experts; not all parameters are activated on the forward pass.,5.5e+21,Estimate taken from https://www.governance.ai/research-paper/recent-trends-chinas-llm-landscape,M6-Corpus,M6-Corpus is a Chinese language multimodal dataset with 60.5B images and 111.8B tokens of text,1900000000000.0,60.5B images and 111.8B tokens of text,"Mixture-of-Experts (MoE) models can achieve promising results with outrageous large amount of parameters but constant computation cost, and thus it has become a trend in model scaling. Still it is a mystery how MoE layers bring quality gains by leveraging the parameters with sparse activation. In this work, we investigate several key factors in sparse expert models. We observe that load imbalance may not be a significant problem affecting model quality, contrary to the perspectives of recent studies, while the number of sparsely activated experts k and expert capacity C in top-k routing can significantly make a difference in this context. Furthermore, we take a step forward to propose a simple method called expert prototyping that splits experts into different prototypes and applies k top-1 routing. This strategy improves the model quality but maintains constant computational costs, and our further exploration on extremely large-scale models reflects that it is more effective in training larger models. We push the model scale to over 1 trillion parameters and implement it on solely 480 NVIDIA V100-32GB GPUs, in comparison with the recent SOTAs on 2048 TPU cores. The proposed giant model achieves substantial speedup in convergence over the same-size baseline.",Likely,China,Industry,,,NVIDIA Tesla V100 DGXS 32 GB,Unreleased,,480.0,,,76.0,,,,13156.86123531961,,
Generative BST,Language,Facebook AI Research,"Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston",2021-03-05,Recipes for building an open-domain chatbot,https://arxiv.org/abs/2004.13637,SOTA improvement,"Abstract:
""Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.""",9400000000.0,"Abstract:
""We build variants of these recipes with 90M, 2.7B and 9.4B parameter models""",,"Unclear - no mention of GPUs used, or training time, and the architecture is terribly complicated",,"Section 6:
Pushshift.io Reddit, ConvAI 2, Wizard of Wikipedia",,,,,United States of America,Industry,,,,,,,,,862.0,,,,,,
Meta Pseudo Labels,Vision,"Google Brain,Google AI","Hieu Pham, Zihang Dai, Qizhe Xie, Minh-Thang Luong, and Quoc V. Le",2021-03-01,Meta pseudo labels,https://arxiv.org/abs/2003.10580,SOTA improvement,,480000000.0,"Table 4
480M",4.79e+22,"From communication with author:
22671 TPU days on specific hardware.
Which hardware did you use and in which configuration?
2048 cores of TPU v3.
Precision: Mixed. bfloat16 for activations, float32 for weights and optimizer slots.
2048 TPUv3 cores means 1024 TPUv3 chips, and the spec is 123e12 FLOP/second per chip with bfloat16 precision (Source: https://cloud.google.com/tpu/docs/system-architecture-tpu-vm)
So the compute estimate is:
1024 chips * 123e12 FLOP/second * 0.4 utilization * 11 days * 24 * 60 * 60 = 4.788191232e+22 FLOP","ImageNet,JFT-300M",,130000000.0,"Section 4
Datasets. For this experiment, we use the entire ImageNet
training set as labeled data, and use the JFT dataset as unlabeled data. The JFT dataset has 300 million images, and
then is filtered down to 130 million images by Noisy Student
using confidence thresholds and up-sampling [77]. We use
the same 130 million images as Noisy Student","We present Meta Pseudo Labels, a semi-supervised learning method that achieves a new state-of-the-art top-1 accuracy of 90.2% on ImageNet, which is 1.6% better than the existing state-of-the-art. Like Pseudo Labels, Meta Pseudo Labels has a teacher network to generate pseudo labels on unlabeled data to teach a student network. However, unlike Pseudo Labels where the teacher is fixed, the teacher in Meta Pseudo Labels is constantly adapted by the feedback of the student's performance on the labeled dataset. As a result, the teacher generates better pseudo labels to teach the student. Our code will be available at this https URL.",,"United States of America,Multinational","Industry,Industry",264.0,"11 days from section 4:
""We train the model for 1 million steps in total,
which takes about 11 days for EfficientNet-L2 and 10 days
for EfficientNet-B6-Wide. ""
""Specifically, our training process runs on a cluster of 2,048
TPUv3 cores. ""
",Google TPU v3,Unreleased,,1024.0,,,548.0,,,,53844.28059190435,,
SRU++ Large,Language,ASAPP,Tao Lei,2021-02-24,When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute,https://arxiv.org/abs/2102.12459,SOTA improvement,"""our model achieves a state-of-the-art result on the ENWIK8 dataset using 1.6 days of training on an 8-GPU machine. """,234000000.0,,1.1e+19,,"WikiText-103,enwik8,One Billion Word benchmark",,,,,,United States of America,Industry,,,,Open source,,,,,40.0,,,34.08,,,
Rational DQN Average,Games,TU Darmstadt,"Q Delfosse, P Schramowski, A Molina",2021-02-18,Recurrent Rational Networks,https://openreview.net/forum?id=gnRmI8TatHV,SOTA improvement,,1683456.0,See figure 7,,,,,,,,,Germany,Academia,,,,,,,,,6.0,,,,,,
MSA Transformer,Biology,"Facebook AI Research,UC Berkeley,New York University (NYU)","Roshan Rao, Jason Liu, Robert Verkuil, Joshua Meier, John F. Canny, Pieter Abbeel, Tom Sercu, Alexander Rives",2021-02-13,MSA Transformer,https://proceedings.mlr.press/v139/rao21a/rao21a.pdf,SOTA improvement,"""The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models""",100000000.0,"""We train an MSA Transformer model with 100M parameters..."" ",5.49e+21,"Based on: https://docs.google.com/spreadsheets/d/1enan21dFx03TkwufHgOwTVNBtuYlqNY9uurjIK6YS-8/edit#gid=0
Number of steps 4.5e5, batch size (tokens) 6.1e7, parameters 1e8
Calculation = 4e8 FLOP/bp * 4.5e5 bp + 2e8 FLOP/fp * 2.75e13 fp
Batch size: 512
Seq length: 100 * 1192 tokens
All models are trained on 32 V100 GPUs for 100k updates. The four models with best contact precision are then further trained to 150k updates. Finally, the best model at 150k updates is trained to 450k updates.
450k * 512 * 100 * 1192 * 100M * 6 = 1.65e22","UniRef50,UniRef30 (FKA UniClust30)","""Models are trained on a dataset of 26 million MSAs. An MSA is generated for each UniRef50 sequence by searching UniClust30 with HHblits.""",6198400000000.0,"""We train an MSA Transformer model with 100M parameters on a large dataset (4.3 TB) of 26 million MSAs, with an average of 1192 sequences per MSA.""
Average sequence is ~200 amino acids/tokens long per https://epochai.org/blog/biological-sequence-models-in-the-context-of-the-ai-directives#fn:4
26 million * 1192 * 200 =6.2T tokens","Unsupervised protein language models trained across millions of diverse sequences learn structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models. ",Likely,"United States of America,United States of America,United States of America","Industry,Academia,Academia",,,NVIDIA Tesla V100 DGXS 32 GB,Open source,,32.0,,,366.0,,,,13256.937301517895,,
top-down frozen classifier,Language,"University of Edinburgh,Toshiba Cambridge Research Laboratory","Shucong Zhang, Cong-Thanh Do, Rama Doddipatla, Erfan Loweimi, Peter Bell, Steve Renals",2021-02-09,Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers,https://arxiv.org/abs/2102.04697,SOTA improvement,"""Table 2 demonstrates that, to the best of our knowledge, top-down training results in state-of-the-art character error rates for LSTM-based end-to-end models on WSJ""",,,,,,,,,,Unknown,"United Kingdom of Great Britain and Northern Ireland,United Kingdom of Great Britain and Northern Ireland","Academia,Industry",,,,Unreleased,,,,,2.0,,,,,,
DeiT-B,Vision,"Meta AI,Sorbonne University","Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou",2021-01-15,Training data-efficient image transformers & distillation through attention,https://arxiv.org/abs/2012.12877,Highly cited,,86000000.0,(DeiT-B),7.884e+19,"2*86000000 parameters*3*1280000 training examples*300 epochs=1.98144e+17 FLOPs
compute [FLOP] = training time [s] × # of GPUs/TPUs × peak FLOP/s × utilization rate
(53h+20h)*3600*8*125000000000000 peak FLOP/s*0.3=7.884e+19
",ImageNet,,1280000.0,,"Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption.
In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data.
More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.",Confident,"United States of America,France","Industry,Academia",53.0,"A typical training of 300 epochs takes 37 hours with 2 nodes or 53 hours on a single node for the DeiT-B.
In this paper, we train a vision transformer on a single 8-GPU node in two
to three days (53 hours of pre-training, and optionally 20 hours of fine-tuning) that is competitive with convnets having a similar number of parameters and efficiency. It uses Imagenet as the sole training set.",NVIDIA V100,Open source,,,,,5493.0,,,300.0,,,
Switch,Language,Google,"William Fedus, Barret Zoph, Noam Shazeer",2021-01-11,Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity,https://arxiv.org/abs/2101.03961,"Highly cited,SOTA improvement",""" On ANLI (Nie et al., 2019), Switch XXL improves over the prior state-of-the-art to get a 65.7
accuracy versus the prior best of 49.4 (Yang et al., 2020)... Finally, we also conduct an early examination of the model’s knowledge with three closed-book knowledge-based tasks: Natural
Questions, WebQuestions and TriviaQA, without additional pre-training using Salient Span
Masking (Guu et al., 2020). In all three cases, we observe improvements over the prior state-of-the-art T5-XXL model (without SSM)",1600000000000.0,"""Combining expert, model and data parallelism, we design two large Switch Transformer models, one
with 395 billion and 1.6 trillion parameters""",8.22e+22,"Table 4
https://arxiv.org/ftp/arxiv/papers/2104/2104.10350.pdf",C4,,432000000000.0,"""In our protocol we pre-train with 2^20 (1,048,576) tokens
per batch for 550k steps amounting to 576B total tokens.""
1 token ~ 0.75 words","In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the ""Colossal Clean Crawled Corpus"" and achieve a 4x speedup over the T5-XXL model.
",,United States of America,Industry,648.0,"see table 4 in https://arxiv.org/ftp/arxiv/papers/2104/2104.10350.pdf
",Google TPU v3,Open source,,1024.0,0.28,,1291.0,,,,139663.55942731188,,
BigSSL,Speech,"Google,Apple","Yu Zhang, Daniel S. Park, Wei Han,James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang and Yonghui Wu",2021-01-10,BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition,https://arxiv.org/abs/2109.13226,SOTA improvement,"""In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set""",8000000000.0,"""... we study the utility of large models, with the parameter count ranging from 600M to 8B...""",,,,,42626880000.0,"Sum all values in Table VII, and add 34k for English VAD, and 926k for English Youtube = 3116k hours
Note this involves significant self-training: ""Noisy student training (NST) [23], [41] is a self-training
method where a teacher model generates pseudo-labels for a
large unlabeled dataset, which is in turn used to train a student
model with augmentation.""
1 hour ~ 13,680 words
13680 * 3116000 = 42626880000",,,"United States of America,United States of America","Industry,Industry",,,,,,,,,129.0,,,,,,
DALL-E,Image generation,OpenAI,"Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever",2021-01-05,Zero-Shot Text-to-Image Generation,"https://openai.com/blog/dall-e/
https://arxiv.org/abs/2102.12092","Significant use,Highly cited",,12000000000.0,DALL·E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions,4.7e+22,"source: https://lair.lighton.ai/akronomicon/
archived: https://github.com/lightonai/akronomicon/tree/main/akrodb",,"To scale up to 12-billion parameters, we created a dataset of
a similar scale to JFT-300M (Sun et al., 2017) by collecting
250 million text-images pairs from the internet. This dataset
does not include MS-COCO, but does include Conceptual
Captions and a filtered subset of YFCC100M (Thomee et al.,
2016). As MS-COCO was created from the latter, our training data includes a fraction of the MS-COCO validation images (but none of the captions).",250000000.0,"""To scale up to 12-billion parameters, we created a dataset of a similar scale to JFT-300M (Sun et al., 2017) by collecting
250 million text-images pairs from the internet. ""","Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.",,United States of America,Industry,,"""We trained the model using 1024, 16 GB NVIDIA V100 GPUs and a total batch size of 1024, for a total of 430,000 updates.
At the start of training, we use a linear schedule to ramp up the step size to 4.5 · 10−4 over 5000 updates, and halved the
step size each time the training loss appeared to plateau. We did this a total of five times, ending training with a final step
size that was 32 times smaller than the initial one. """,NVIDIA Tesla V100 DGXS 16 GB,API access,,1024.0,,,3242.0,,,,118437.35864214256,,
CLIP (ViT L/14@336px),"Multimodal,Vision,Language,Video",OpenAI,"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever",2021-01-05,Learning Transferable Visual Models From Natural Language Supervision,https://arxiv.org/abs/2103.00020,"Highly cited,SOTA improvement",,370000000.0,"Image encoder
Vision Transformer
Table 1 in https://arxiv.org/pdf/2010.11929.pdf
Authors fine-tuned ViT L/14 at additional 336px resolution, hence the @336 (See ViT)
307M params
Text encoder
~Transformer (from paper)
63M params",1.05e+22,https://docs.google.com/document/d/156miAJkFN9DDX06C3s03UDsretCtymCKiGDddLBCgQE/edit?usp=sharing,Unspecified unreleased,"Custom image-text pairs from the internet
we constructed a new dataset of 400 million (image,
text) pairs collected from a variety of publicly available
sources on the Internet. To attempt to cover as broad a set
of visual concepts as possible, we search for (image, text)
pairs as part of the construction process whose text includes
one of a set of 500,000 queries",400000000.0,,"State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.",,United States of America,Industry,288.0,"“The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs”",NVIDIA V100,Open source,,256.0,,,14671.0,"https://www.kdnuggets.com/2021/03/beginners-guide-clip-model.html
",,,24638.518413409514,,
CLIP (ResNet-50),"Multimodal,Vision,Language,Video",OpenAI,"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever",2021-01-05,Learning Transferable Visual Models From Natural Language Supervision,https://arxiv.org/abs/2103.00020,"Highly cited,SOTA improvement",,88600000.0,"Image encoder
~ResNet-50 (from paper)
25.6M params
Text encoder
~Transformer (from paper)
63M params",,,,Custom image-text pairs from the internet,400000000.0,,"State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.",,United States of America,Industry,,,,,,,,,14671.0,,,,,,
ERNIE-Doc (247M),Language,Baidu,"Siyu Ding, Junyuan Shang, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang",2020-12-31,ERNIE-Doc: A Retrospective Long-Document Modeling Transformer,https://arxiv.org/abs/2012.15688,SOTA improvement,"""ERNIE-DOC improved the state-of-the-art language modeling result of perplexity to 16.8 on WikiText-103""",247000000.0,,2.91e+19,,"Wikipedia,CC-News,CC-Stories,BookCorpus (BooksCorpus, Toronto Book Corpus)",,,,,,China,Industry,,,,Open source,,,,,40.0,,,190.88,,,
CT-MoS (WT2),Language,"Google,National Tsing Hua University","Pei-Hsin Wang, Sheng-Iou Hsieh, Shih-Chieh Chang, Yu-Ting Chen, Jia-Yu Pan, Wei Wei, Da-Chang Juan",2020-12-25,Contextual Temperature for Language Modeling,https://arxiv.org/abs/2012.13575,SOTA improvement,"""Experimental results confirm that the
proposed method significantly improves state-of-the-art language models, achieving a perplexity of 55.31 and 62.89 on
the test set of Penn Treebank and WikiText-2""",45000000.0,,5.62e+17,,WikiText-2,,,,,,"United States of America,Taiwan","Industry,Academia",,,,Unreleased,,,,,9.0,,,1000.0,,,
DensePhrases,Language,"Korea University,Princeton University","Jinhyuk Lee, Mujeen Sung, Jaewoo Kang, Danqi Chen",2020-12-23,Learning Dense Representations of Phrases at Scale,https://arxiv.org/abs/2012.12624v3,SOTA improvement,"from abstract ""our model DensePhrases improves over previous phrase retrieval models by 15%-25% absolute accuracy and matches the performance of state-of-the-art retriever-reader models. """,,may be possible to estimate from batch size (8) and maximum memory of GPUs (96GB),2.09952e+18," flops = (8) * (1215 * 10**10) * (20 * 3600) * 3 // 10 = 2099520000000000000
(num gpu) * (peak flops) * (time in seconds) * (assumed utilization rate)
model of GPU from appendix B (Titan Xp)
number of GPUs from table in appendix A
flops from https://www.techpowerup.com/gpu-specs/titan-xp.c2948","SQuAD,NQ (Natural Questions)","from appendix D ""The number of generated questions is 327,302 and 1,126,354 for SQuAD and Natural Questions, respectively.""",58000000.0,"from appendix D ""The number of generated questions is 327,302 and 1,126,354 for SQuAD and Natural Questions, respectively.""
assuming 40 words per question we get around ~ 58M","Open-domain question answering can be reformulated as a phrase retrieval problem, without the need for processing documents on-demand during inference (Seo et al., 2019). However, current phrase retrieval models heavily depend on sparse representations and still underperform retriever-reader approaches. In this work, we show for the first time that we can learn dense representations of phrases alone that achieve much stronger performance in open-domain QA. We present an effective method to learn phrase representations from the supervision of reading comprehension tasks, coupled with novel negative sampling methods. We also propose a query-side fine-tuning strategy, which can support transfer learning and reduce the discrepancy between training and inference. On five popular open-domain QA datasets, our model DensePhrases improves over previous phrase retrieval models by 15%-25% absolute accuracy and matches the performance of state-of-the-art retriever-reader models. Our model is easy to parallelize due to pure dense representations and processes more than 10 questions per second on CPUs. Finally, we directly use our pre-indexed dense phrase representations for two slot filling tasks, showing the promise of utilizing DensePhrases as a dense knowledge base for downstream tasks. ",Speculative,"Korea (Republic of),United States of America","Academia,Academia",20.0,appendix A row 3,NVIDIA TITAN Xp,Open source,,8.0,,,101.0,,,4.0,,,
VQGAN + CLIP,Image generation,Heidelberg University,"Patrick Esser, Robin Rombach, Björn Ommer",2020-12-17,Taming Transformers for High-Resolution Image Synthesis,https://arxiv.org/abs/2012.09841,"Highly cited,SOTA improvement",,,,,,,,,Unclear: the paper appears to pretrain on several different datasets and the model is also used zero-shot,,Unknown,Germany,Academia,,,,,,,,,1721.0,,,,,,
CPM-Large,Language,"Tsinghua University,Beijing Academy of Artificial Intelligence / BAAI","Z Zhang, X Han, H Zhou, P Ke, Y Gu, D Ye, Y Qin, Y Su",2020-12-01,CPM: A Large-scale Generative Chinese Pre-trained Language Model,https://arxiv.org/abs/2012.00413,SOTA improvement,"""CPM outperforms CDial-GPT with a large margin in the few-shot experiment, showing the generalization ability of our model.""",2600000000.0,"""To the best of our knowledge, CPM, with 2.6 billion parameters and 100GB Chinese training data, is the largest Chinese pre-trained language mode""",1.8e+21,"source: https://lair.lighton.ai/akronomicon/
archived: https://github.com/lightonai/akronomicon/tree/main/akrodb",Unspecified unreleased,"we construct a new sub-word vocabulary, containing both words and characters.",16700000000.0,"""language model, with 2.6 billion parameters and 100GB Chinese training data.""
We use the conversion factor 1GB ~ 167M words","Pre-trained Language Models (PLMs) have proven to be beneficial for various downstream NLP tasks. Recently, GPT-3, with 175 billion parameters and 570GB training data, drew a lot of attention due to the capacity of few-shot (even zero-shot) learning. However, applying GPT-3 to address Chinese NLP tasks is still challenging, as the training corpus of GPT-3 is primarily English, and the parameters are not publicly available. In this technical report, we release the Chinese Pre-trained Language Model (CPM) with generative pre-training on large-scale Chinese training data. To the best of our knowledge, CPM, with 2.6 billion parameters and 100GB Chinese training data, is the largest Chinese pre-trained language model, which could facilitate several downstream Chinese NLP tasks, such as conversation, essay generation, cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many NLP tasks in the settings of few-shot (even zero-shot) learning. The code and parameters are available at this https URL.",,"China,China","Academia,Academia",336.0,"""It takes two weeks to train our largest model using 64 NVIDIA V100.""",NVIDIA V100,Open source,,64.0,,,97.0,"https://towardsdatascience.com/the-future-of-ai-is-decentralized-848d4931a29a#:~:text=Training%20GPT%2D3%20reportedly%20cost,a%20single%20training%20run%C2%B9.",,,7340.099044719796,,
AlphaFold 2,Biology,DeepMind,"John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Kathryn Tunyasuvunakool, Olaf Ronneberger, Russ Bates, Augustin Žídek, Alex Bridgland, Clemens Meyer, Simon A A Kohl, Anna Potapenko, Andrew J Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Martin Steinegger, Michalina Pacholska, David Silver, Oriol Vinyals, Andrew W Senior, Koray Kavukcuoglu, Pushmeet Kohli, Demis Hassabis.",2020-11-30,Highly accurate protein structure prediction with AlphaFold,https://www.nature.com/articles/s41586-021-03819-2,"Historical significance,Highly cited","""Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known"" [Abstract]
>17790 citations",93000000.0,"https://arxiv.org/abs/2207.05477 reimplements AlphaFold 2 in a more efficient way, and states there are 93M parameters in the original version (Table 1)",2.99e+21,"123 teraFLOPS / TPU v3 chip * 128 cores * (1 chip / 2 cores) * 11 days * 40% utilization = 2.99e21 FLOP
https://www.wolframalpha.com/input?i=123+teraFLOPS+*+128+*+11+days+*+0.4
""Training regimen"" section:
""We train the model on Tensor Processing Unit (TPU) v3 with a batch size of 1 per TPU
core, hence the model uses 128 TPU v3 cores. [...] The initial training stage takes approximately 1 week, and the fine-tuning stage takes approximately 4 additional days.""","PDB (Protein Data Bank),UniRef30 (FKA UniClust30),UniRef90,MGnify,BFD (Big Fantastic Dataset),UniProtKB","""Inputs and data sources"" section:
""The following versions of public datasets were used in this study. Our models were trained on a copy of the PDB downloaded on 28 August 2019. For finding template structures at prediction time, we used a copy of the PDB downloaded on 14 May 2020, and the PDB70 clustering database downloaded on 13 May 2020. For MSA lookup at both training and prediction time, we used Uniref90 v.2020_01, BFD, Uniclust30 v.2018_08 and MGnify v.2018_12. For sequence distillation, we used Uniclust30 v.2018_08 to construct a distillation structure dataset. Full details are provided in Supplementary Methods 1.2.""
AlphaFold needs multiple genetic (sequence) databases to run:
BFD,
MGnify,
PDB70,
PDB (structures in the mmCIF format),
PDB seqres – only for AlphaFold-Multimer,
UniRef30 (FKA UniClust30),
UniProt – only for AlphaFold-Multimer,
UniRef90",530000.0,"3 different types of input data to the network:
(1) Amino acid sequence
(2) Multiple sequence alignments (MSA) to sequences from evolutionarily related proteins
(3) Template structures (3D atom coordinates of homologous structures), where available
Training data is processed into the following two datasets that are sampled with different probabilities.
Supplementary Material, Section 1.2.4. Training data:
""With 75% probability a training example comes from the self-distillation set (see subsection 1.3) and with 25% probability the training example is a known structure from the Protein Data Bank""
Supplementary Material, Section 1.3 Self-distillation dataset:
""This gives a final dataset of 355,993 sequences"". An initial model was used to predict structures for these sequences.
PDB dataset size in 2020: https://www.rcsb.org/stats/growth/growth-released-structures
172788
Therefore, estimate for number of protein structures available for training (for which amino acid sequence, MSA and homologue template info is also available as input to network): 528781 [~530k]","Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort, the structures of around 100,000 unique proteins have been determined, but this represents a small fraction of the billions of known protein sequences. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the ‘protein folding problem’—has been an important open research problem for more than 50 years. Despite recent progress, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14), demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.",Likely,United Kingdom of Great Britain and Northern Ireland,Industry,264.0,7 days pretrain and 4 days finetune,Google TPU v3,Open source,,,,,17790.0,,,,3841.772612266474,,
KEPLER,Language,"Tsinghua University,Mila - Quebec AI (originally Montreal Institute for Learning Algorithms),HEC,CIFAR AI Research,Princeton University,University of Montreal / Université de Montréal","Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhiyuan Liu, Juanzi Li, and Jian Tang.",2020-11-23,KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation.,https://arxiv.org/abs/1911.06136,SOTA improvement,"""Experimental results show that KEPLER achieves state-of-the-art performances
on various NLP tasks""",125000000.0,,1.24e+20,"From author communication
""About 128 GPU-days using Nvidia V100 (16GB). ""
precision: float16
V100 GPU for float16: 28000000000000 (2.8E+13)
0.4 * 28TFLOP/s * 128 GPU-days * 24h/day * 3600s/h
= 1.24E+20
","Wikipedia,BookCorpus (BooksCorpus, Toronto Book Corpus),Wikidata5M","From author communication
For the language modeling objective, we use Wikipedia+BookCorpus datasets (about 13GB). For the knowledge embedding objective, we use Wikidata5m (about 1GB).",3300000000.0,"For BookCorpus + English Wikipedia: 800M + 2500M
For Wikidata5M: 20614279
See table 1. Contains ""entities"", ""relations"", and ""triplets""","Pre-trained language representation models (PLMs) cannot well capture factual knowledge from text. In contrast, knowledge embedding (KE) methods can effectively represent the relational facts in knowledge graphs (KGs) with informative entity embeddings, but conventional KE models cannot take full advantage of the abundant textual information. In this paper, we propose a unified model for Knowledge Embedding and Pre-trained LanguagE Representation (KEPLER), which can not only better integrate factual knowledge into PLMs but also produce effective text-enhanced KE with the strong PLMs. In KEPLER, we encode textual entity descriptions with a PLM as their embeddings, and then jointly optimize the KE and language modeling objectives. Experimental results show that KEPLER achieves state-of-the-art performances on various NLP tasks, and also works remarkably well as an inductive KE model on KG link prediction. Furthermore, for pre-training and evaluating KEPLER, we construct Wikidata5M, a large-scale KG dataset with aligned entity descriptions, and benchmark state-of-the-art KE methods on it. It shall serve as a new KE benchmark and facilitate the research on large KG, inductive KE, and KG with text. The source code can be obtained from this https URL.",,"China,Canada,France,Canada,United States of America,Canada","Academia,Academia,Academia,Research collective,Academia,Academia",,,,Unreleased,,,,,497.0,,,,,,
SimCLRv2,Vision,Google Brain,"Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton",2020-10-26,Big self-supervised models are strong semi-supervised learners.,https://arxiv.org/abs/2006.10029,Highly cited,,795000000.0,"From author communication
We trained different model sizes (from 24M to 795M), and they're summarized in Table 1 of the paper (https://arxiv.org/pdf/2006.10029.pdf).",,,,,1280000.0,,,,United States of America,Industry,,,,,,,,,1872.0,,,,,,
wav2vec 2.0 LARGE,Speech,Facebook,"Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli",2020-10-22,wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,https://arxiv.org/abs/2006.11477,"Highly cited,SOTA improvement","Arguably an ""important"" paper?
Abstract:
""We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.""",317000000.0,"Section 5.1:
""We consider two model sizes: BASE (95m parameters) and LARGE (317m parameters)
",1.9e+21,"From surveying the authors:
We trained the base model on 64 V100 GPUs for 400k updates. This takes about 3 days to complete. The large model is trained on 128 V100 GPUs for 1 million updates, and this takes about 7 days to complete.
V100 GPU peak: 125TFLOP/s (https://www.nvidia.com/en-gb/data-center/tesla-v100/)
Assume 40% utilization based on default for non-Language domain (https://epochai.org/blog/estimating-training-compute)
64 GPUs * 40% * 125TFLOP/s * 7 days * 24h/day * 3600s/h
~= 1.9E+21 FLOP","LibriSpeech,LibriLight",,727776000.0,"pg 4, section 4.1
""As unlabeled data we consider the Librispeech corpus [40] without transcriptions containing 960 hours of audio (LS-960) or the audio data from LibriVox (LV-60k). For the latter we follow the preprocessing of [27] resulting in 53.2k hours of audio.""
53.2k h * 13,680 words/h = 727776000 words","We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.",,United States of America,Industry,,,NVIDIA Tesla V100 DGXS 32 GB,Open source,,,,,3696.0,,,,5021.241075362514,,
ViT-Huge/14,Vision,"Google Brain,Google Research","Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby",2020-10-22,An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,https://arxiv.org/abs/2010.11929,Highly cited,,632000000.0,Table 1 https://arxiv.org/pdf/2010.11929.pdf,4.262e+21,from Table 6,"ImageNet-1k,ImageNet21k,JFT-300M","To explore model scalability, we use the ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images (we refer to it as ImageNet in what follows), its superset ImageNet-21k with 21k classes and 14M images (Deng et al., 2009), and JFT (Sun et al., 2017) with 18k classes and 303M high-resolution images. ",1280000.0,,"While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.",Confident,"United States of America,Multinational","Industry,Industry",,,Google TPU v3,Open source,,,0.32,,21522.0,,,,6724.023241261178,,
ViT-Base/32,Vision,Google Brain,"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby",2020-10-22,An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,https://arxiv.org/abs/2010.11929,Highly cited,,86000000.0,Table 1 https://arxiv.org/pdf/2010.11929.pdf,,,,,,,,,United States of America,Industry,,,,,,,,,21522.0,,,,,,
German ELECTRA Large,Language,"deepset,Bayerische Staatsbibliothek Muenchen","Branden Chan, Stefan Schweter, Timo Möller",2020-10-21,German's Next Language Model,https://arxiv.org/abs/2010.10906,SOTA improvement,'we were able to attain SoTA performance across a set of document classification and named entity recognition (NER) tasks for both models of base and large size.',335000000.0,335M from Table 5,1.42829568e+21,"flops = (64) * (123* 10**12) * (7 * 24 * 3600) * (0.3) = 1.4e21
(num gpu) * (peak flops) * (time in seconds) * (assumed utilization rate)
'large models were trained on pods of 16 TPUs v3 (128 cores).' - from section 4.1; it was trained for 7 days per Table 2
Agrees with 6CN:
Tokens seen: 512 (seq len) * 1024 (batch size) * 1 million (steps) = 5.24e11
FLOPs: 6 * 335M * 5.24e11 = 1.05e21","Wikipedia,OPUS,OSCAR,OpenLegalData",Table 1 in the paper,36383733333.333336,"163.4GB from Table 1 in the paper
assuming 167M words per GB (German Language) we have 163.4 * 167M * 4/3 tokens per word = 36,383,733,333","In this work we present the experiments which lead to the creation of our BERT and ELECTRA based German language models, GBERT and GELECTRA. By varying the input training data, model size, and the presence of Whole Word Masking (WWM) we were able to attain SoTA performance across a set of document classification and named entity recognition (NER) tasks for both models of base and large size. We adopt an evaluation driven approach in training these models and our results indicate that both adding more data and utilizing WWM improve model performance. By benchmarking against existing German models, we show that these models are the best German models to date. Our trained models will be made publicly available to the research community. ",Confident,"Germany,Germany","Industry,Government",168.0,7 days from Table 2,Google TPU v3,Open source,,64.0,,,215.0,,,,2392.2883349806307,,
GBERT-Large,Language,"deepset,Bayerische Staatsbibliothek Muenchen","Branden Chan, Stefan Schweter, Timo Möller",2020-10-21,German's Next Language Model,https://arxiv.org/abs/2010.10906,SOTA improvement,'we were able to attain SoTA performance across a set of document classification and named entity recognition (NER) tasks for both models of base and large size.',335000000.0,335M from Table 5,2.2444646e+21,"flops = (64) * (123* 10**12) * (11 * 24 * 3600) * (0.3) = 2.24e21
(num gpu) * (peak flops) * (time in seconds) * (assumed utilization rate)
'large models were trained on pods of 16 TPUs v3 (128 cores).' - from section 4.1; it was trained for 11 days per Table 2","Wikipedia,OPUS,OSCAR,OpenLegalData",Table 1 in the paper,27287800000.0,"163.4GB from Table 1 in the paper
assuming 167M words per GB (German Language) we have 163.4 * 167M = 27287800000.0","In this work we present the experiments which lead to the creation of our BERT and ELECTRA based German language models, GBERT and GELECTRA. By varying the input training data, model size, and the presence of Whole Word Masking (WWM) we were able to attain SoTA performance across a set of document classification and named entity recognition (NER) tasks for both models of base and large size. We adopt an evaluation driven approach in training these models and our results indicate that both adding more data and utilizing WWM improve model performance. By benchmarking against existing German models, we show that these models are the best German models to date. Our trained models will be made publicly available to the research community. ",Likely,"Germany,Germany","Industry,Government",264.0,11 days from Table 2,Google TPU v3,Open source,,64.0,,,215.0,,,,3771.13338489269,2048.0,
mT5-XXL,Language,"Google,Google Research","Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel",2020-10-20,mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer,https://aclanthology.org/2021.naacl-main.41/,"Highly cited,SOTA improvement","""Table 2 presents our main results, with perlanguage breakdowns for each task given in Appendix B. Our largest model mT5-XXL exceeds state-of-the-art on all classification and QA tasks and is near SOTA on NER (69.2 vs. 70.1).""",13000000000.0,13 billion,8.2e+22,"""We pre-train our mT5 model variants for 1 million steps on batches of 1024 length-1024 input sequences, corresponding to roughly 1 trillion input tokens total.""
1 million steps * 1024 batchsize * 1024 length * 13 billion params * 6 = 8.2e22
Ignores fine-tuning compute; this is likely a small fraction of pre-training compute.",mC4,"""The C4 dataset was explicitly designed to be English only: any page that was not given a probability of at least 99% of being English by langdetect was discarded. In contrast, for mC4 we use cld3 to identify over 100 languages.
Since some of these languages are relatively scarce on the internet, we make use of all of the 71 monthly web scrapes released so far by Common Crawl. This is dramatically more source data than was used for C4, for which the April 2019 web scrape alone was enough to provide plenty of English-language data.""",1000000000000.0,"The model was trained on a subset of 1 trillion tokens.
Full mC4 corpus has data ""totaling 6.6B pages and 6.3T tokens""
Distribution by language is in Appendix A.","The recent “Text-to-Text Transfer Transformer” (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We detail the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. We also describe a simple technique to prevent “accidental translation” in the zero-shot setting, where a generative model chooses to (partially) translate its prediction into the wrong language. All of the code and model checkpoints used in this work are publicly available.",Confident,"United States of America,Multinational","Industry,Industry",,,,Open source,,,,,1765.0,,,1.0,,1048576.0,"""We pre-train our mT5 model variants for 1 million steps on batches of 1024 length-1024 input sequences, corresponding to roughly 1 trillion input tokens total."""
Conformer + Wav2vec 2.0 + Noisy Student,Speech,"Google,Google Research,Google Brain","Yu Zhang, James Qin, Daniel S. Park, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Quoc V. Le, Yonghui Wu",2020-10-20,Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition,https://arxiv.org/abs/2010.10504v2,SOTA improvement,"""By doing so, we are able to achieve
word-error-rates (WERs) 1.4%/2.6% on the LibriSpeech test/test-other sets against
the current state-of-the-art WERs 1.7%/3.3%.""",1000000000.0,1B for XXL model,7.6e+21,"""We train with global batch size 2048 on 256/512 Google TPU V3 cores for 3-4 days for the XL/XXL models respectively...
We fine-tune the pre-trained checkpoints (400k steps) with global batch
size 1024/512 on 256/512 Google TPU v3 cores for 1-3 days for the XL/XXL models""
TPU v3 chips are 123 teraflop/s. 2 cores per chip
512 cores * 7 days * 24 * 3600 * 123 tflops * (1 chip/2 cores) * 0.4 (assumed utilization) = 7.6e21",LibriLight,"""We pre-train the Conformer encoder akin to wav2vec 2.0 pre-training [6] with 60k hours of unlabeled audio from the ""unlab-60k"" subset of Libri-Light. Unlike in the original work which takes raw waveforms as input, we use log-mel spectrograms...
The 960h of transcribed audio of the LibriSpeech dataset is used as the supervised data""",,,"We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech utilizing the unlabeled audio of the Libri-Light dataset. More precisely, we carry out noisy student training with SpecAugment using giant Conformer models pre-trained using wav2vec 2.0 pre-training. By doing so, we are able to achieve word-error-rates (WERs) 1.4%/2.6% on the LibriSpeech test/test-other sets against the current state-of-the-art WERs 1.7%/3.3%.",Confident,"United States of America,Multinational,United States of America","Industry,Industry,Industry",168.0,7 days,Google TPU v3,Unreleased,,256.0,,,280.0,,,,9449.541661137231,,
LUKE,Language,"University of Washington,National Institute of Informatics","Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto",2020-10-02,LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention,https://arxiv.org/abs/2010.01057v1,SOTA improvement,"from abstract ""In particular, it obtains state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question answering).""",484000000.0,"from https://github.com/studio-ousia/luke section - 484 M (LUKE model)
(mLUKE is model from different paper)",1.75799808e+20,"(16) * (1413 * 10**10) * (30 * 24 * 3600) * (0.3) = 175799808000000000000
(num gpu) * (peak flops) * (time in seconds) * (assumed utilization rate)
from appendix A: ""We run the pretraining on NVIDIA’s PyTorch Docker
container 19.02 hosted on a server with two Intel
Xeon Platinum 8168 CPUs and 16 NVIDIA Tesla V100 GPUs. The training takes approximately 30
days.""
peak flops for fp32 from https://www.techpowerup.com/gpu-specs/tesla-v100-pcie-16-gb.c2957",Wikipedia,"""As input corpus for pretraining, we use the December 2018 version of Wikipedia, comprising approximately 3.5 billion words and 11 million entity annotations. """,3500000000.0,"""As input corpus for pretraining, we use the December 2018 version of Wikipedia, comprising approximately 3.5 billion words and 11 million entity annotations. ""","Entity representations are useful in natural language tasks involving entities. In this paper, we propose new pretrained contextualized representations of words and entities based on the bidirectional transformer. The proposed model treats words and entities in a given text as independent tokens, and outputs contextualized representations of them. Our model is trained using a new pretraining task based on the masked language model of BERT. The task involves predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia. We also propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the transformer, and considers the types of tokens (words or entities) when computing attention scores. The proposed model achieves impressive empirical performance on a wide range of entity-related tasks. In particular, it obtains state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question answering). Our source code and pretrained representations are available at this https://github.com/studio-ousia/luke",Likely,"United States of America,Japan",Academia,720.0,see compute notes,NVIDIA V100,Open source,,16.0,,,599.0,,,,4186.383565732907,2048.0,table in appendix A
ProBERTa,Biology,"University of Illinois Urbana-Champaign (UIUC),Reed College","Ananthan Nambiar, Maeve Heflin, Simon Liu, Sergei Maslov, Mark Hopkins, Anna Ritz",2020-09-01,"Transforming the Language of Life: Transformer Neural Networks for Protein Prediction Tasks",https://dl.acm.org/doi/10.1145/3388440.3412467,SOTA improvement,"""Furthermore, we used embeddings from PRoBERTa for a fundamentally different problem, PPI prediction, using two different
datasets generated from the HIPPIE database and found that with
sufficient data, it substantially outperforms the current state-of-the-art method in the conservative scenario.""",44000000.0,"""In total, our model has approximately 44M trainable parameters.""",9.72e+18,"""we pre-train PRoBERTa on 4 NVIDIA V100 GPUs in 18 hours""
4 * 125 tFLOP/s * 18 * 3600 * 0.3 (assumed utilization) = 9.72e18",UniProtKB/Swiss-Prot,"""Pre-training data: We use UniProtKB/Swiss-Prot (450K unique sequences with a mean tokenized length of 129.6 tokens), a collection of experimentally annotated and reviewed amino acid sequences""
Fine tuning uses a subset of 313,214 sequences which have annotated labels.",58320000.0,"450k sequences * 129.6 tokens per sequence = 58,320,000 tokens","The scientific community is rapidly generating protein sequence information, but only a fraction of these proteins can be experimentally characterized. While promising deep learning approaches for protein prediction tasks have emerged, they have computational limitations or are designed to solve a specific task. We present a Transformer neural network that pre-trains task-agnostic sequence representations. This model is fine-tuned to solve two different protein prediction tasks: protein family classification and protein interaction prediction. Our method is comparable to existing state-of-the-art approaches for protein family classification while being much more general than other architectures. Further, our method outperforms all other approaches for protein interaction prediction. These results offer a promising framework for fine-tuning the pre-trained sequence representations for other protein prediction tasks.",Confident,"United States of America,United States of America","Academia,Academia",18.0,,NVIDIA V100,,,4.0,,,97.0,,,,26.206992435694065,,
ESM1-670M (UR50/D),Biology,"Facebook AI Research,New York University (NYU)","Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus",2020-08-31,Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences,https://www.pnas.org/doi/abs/10.1073/pnas.2016239118,"Highly cited,SOTA improvement","""We apply the representations to a range of prediction tasks and find that they improve state-of-art features across the applications.""",669200000.0,See Table 1,4.8e+20,"Information:
128 NVIDIA V100 GPUs [Pre-training details]
906k steps [See Table S2: Hyperparameters]
131,072 tokens per batch [""We trained with 131,072 tokens per batch (128 gpus x 1024 tokens)."" - Pre-training details]
Estimate: 906e3 updates * 3 * 131072 tokens/update * 2 * 669.2e6 parameters = 4.8e20 FLOP",UniRef50,"""the high-diversity dense dataset (UR50/D) samples the UniRef100 sequences evenly across the UniRef50 clusters.""",,,"In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization
reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning
produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.",Confident,"United States of America,United States of America","Industry,Academia",,,NVIDIA V100,Open source,,128.0,,,1376.0,,,,1054.8194995883428,,
ERNIE-GEN (large),Language,Baidu,"Dongling Xiao, Han Zhang, Yukun Li, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang",2020-08-06,ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation,https://arxiv.org/abs/2001.11314,SOTA improvement,"""Empirically, ERNIE-GEN is particularly effective and
achieves state-of-the-art results on a range of NLG tasks
including abstractive summarization (Gigaword and CNN/DailyMail), question generation (SQuAD), dialogue response generation (Persona-Chat) and generative question answering (CoQA)""",340000000.0,"""We train a base model ERNIE-GEN-BASE (L=12, H=768, A=12, Total Parameters=110M)
and a large model ERNIE-GEN-LARGE (L=24, H=1024,
A=16, Total Parameters=340M) with parameters initialized
by BERT-BASE and BERT-LARGE respectively""",2e+20,"430GB text for 1 epoch
approx 430 * 200 million words = 86B words, or 100B tokens per https://docs.google.com/document/d/1G3vvQkn4x_W71MKg0GmHVtzfd9m0y3_Ofcoew0v902Q/edit#heading=h.ieihc08p8dn0
6 * 340 million params * 100 billion tokens ~= 2e20","CC-News,BookCorpus (BooksCorpus, Toronto Book Corpus),WebText2,Wikipedia,C4","""Recent works for pre-training verify that larger scaled pretraining corpora can improve the performances on downstream tasks. We pre-train ERNIE-GEN-LARGE model on
the 430GB text corpora with 1 epoch and 1M training steps.
Our 430GB text corpora is extracted from the corpus used by
RoBERTa [Liu et al., 2019], T5 [Raffel et al., 2019] and ALBERT [Lan et al., 2020]. We fine-tune ERNIE-GEN-LARGE
on two abstractive summarization datasets including Gigaword and CNN/Daily Mail, the evaluation results are reported
in Table 9""
RoBERTa and T5 datasets are CC-News, BookCorpus, Wikipedia, WebText2, and C4",86000000000.0,"approx 430 * 200 million words = ~86B words, per https://docs.google.com/document/d/1G3vvQkn4x_W71MKg0GmHVtzfd9m0y3_Ofcoew0v902Q/edit#heading=h.ieihc08p8dn0",,Speculative,China,Industry,,,,Open access (non-commercial),,,,,109.0,,,,,,
DeLight,Language,"University of Washington,Allen Institute for AI,Facebook AI Research","Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, Hannaneh Hajishirzi",2020-08-03,DeLighT: Deep and Light-weight Transformer,https://arxiv.org/abs/2008.00623,SOTA improvement,"""Comparison with state-of-the-art methods on machine translation corpora. DeLighT delivers
similar or better performance than state-of-the-art models with fewer parameters.""",99000000.0,,2.4e+19,,WikiText-103,,,,,,"United States of America,United States of America,United States of America","Academia,Research collective,Industry",,,,Unreleased,,,,,98.0,,,62.14,,,
EfficientDet,Vision,Google Brain,"Mingxing Tan, Ruoming Pang, Quoc V. Le",2020-07-27,EfficientDet: Scalable and Efficient Object Detection,https://openaccess.thecvf.com/content_CVPR_2020/html/Tan_EfficientDet_Scalable_and_Efficient_Object_Detection_CVPR_2020_paper.html,"Highly cited,SOTA improvement","""EfficientDet-D7 achieves stateof-the-art 55.1 AP on COCO test-dev with 77M parameters and 410B FLOPs""",77000000.0,"""EfficientDet-D7 achieves stateof-the-art 55.1 AP on COCO test-dev with 77M parameters and 410B FLOPs""",,,COCO 2017,,,,,,United States of America,Industry,,,,Open source,,,,,4701.0,,,,,,
Hopfield Networks (2020),"Biology,Vision,Language,Medicine","Johannes Kepler University Linz,Institute of Advanced Research in Artificial Intelligence,University of Oslo","Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Thomas Adler, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, Victor Greiff, David Kreil, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter",2020-07-16,Hopfield Networks is All You Need,https://arxiv.org/abs/2008.02217,SOTA improvement,"""Hopfield layers yielded a new state-of-the-art when compared to different machine learning methods. Finally, Hopfield
layers achieved state-of-the-art on two drug design datasets""",,,,,"BACE,SIDER","""We test the Hopfield layer HopfieldLayer, on four drug
design datasets. These datasets represent four main areas of modeling tasks in drug design, concretely
to develop accurate models for predicting a) new anti-virals (HIV) by the Drug Therapeutics Program
(DTP) AIDS Antiviral Screen, b) new protein inhibitors, concretely human β-secretase (BACE) inhibitors by Subramanian et al. (2016), c) metabolic effects as blood-brain barrier permeability (BBBP)
(Martins et al., 2012) and d) side effects of a chemical compound from the Side Effect Resource
(SIDER) Kuhn et al. (2016). """,,,,Unknown,"Austria,Austria,Norway","Academia,Academia,Academia",,,,Open source,,,,,287.0,,,,,,
SemExp,Robotics,"Carnegie Mellon University (CMU),Facebook AI Research","Devendra Singh Chaplot, Dhiraj Gandhi, Abhinav Gupta, Ruslan Salakhutdinov",2020-07-02,Object Goal Navigation using Goal-Oriented Semantic Exploration,https://proceedings.neurips.cc/paper/2020/file/2c75cf2681788adaca63aa95ae028b22-Paper.pdf,SOTA improvement,"""Our method achieves state-of-the-art performance on the object goal navigation task and won the CVPR2020 Habitat ObjectNav challenge""",,,,,"Gibson,Matterport3D (MP3D)","""We use the Gibson [46] and Matterport3D (MP3D) [6] datasets""",,"""Our training and test set consists of a total of 86 scenes (25 Gibson tiny and 61 MP3D) and 16 scenes (5 Gibson tiny and 11 MP3D), respectively""","This work studies the problem of object goal navigation which involves navigating to an instance of the given object category in unseen environments. End-to-end learning-based navigation methods struggle at this task as they are ineffective at exploration and long-term planning. We propose a modular system called, ‘GoalOriented Semantic Exploration’ which builds an episodic semantic map and uses it to explore the environment efficiently based on the goal object category. Empirical results in visually realistic simulation environments show that the proposed model outperforms a wide range of baselines including end-to-end learning-based methods as well as modular map-based methods and led to the winning entry of the CVPR2020 Habitat ObjectNav Challenge. Ablation analysis indicates that the proposed model learns semantic priors of the relative arrangement of objects in a scene, and uses them to explore efficiently. Domain-agnostic module design allows us to transfer our model to a mobile robot platform and achieve similar performance for object goal navigation in the real-world.",Unknown,"United States of America,United States of America","Academia,Industry",,,,Open source,,,,,358.0,,,,,,
GShard (dense),Language,Google,"Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen",2020-06-30,GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding,https://arxiv.org/abs/2006.16668,SOTA improvement,"""such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art""",2300000000.0,"""Our best quality dense single Transformer model (2.3B parameters) achieving ∆BLEU of 6.1, was trained with GPipe [15] on 2048 TPU v3 cores for 6 weeks or total of 235.5 TPU v3 core-years.""",1.3702e+23,"""Our best quality dense single Transformer model (2.3B parameters) achieving ∆BLEU of 6.1, was trained with GPipe [15] on 2048 TPU v3 cores for 6 weeks or total of 235.5 TPU v3 core-years.""
Assume 30% utilization. 2 TPU v3 cores = 1 TPU v3 chip.
TPU v3 performance is 123 teraFLOPS per chip
Best dense model was trained on 235.5 TPU v3 core-years or 1.3702e23 FLOP
https://www.wolframalpha.com/input?i=123+teraFLOPS+%2F+2+*+235.5+years+*+0.30
Effective model FLOPs utilization could have been lower since this model has very high training compute compared to parameter count (2.3B). (Compare to Chinchilla-optimal?)",,,346666666666.6667,"""We focus on improving the translation quality (measured in terms of BLEU score [48]) from all 100 languages to English. This resulted in approximately 13 billion training examples to be used for model training""
Each example is a sentence pair. Assuming 20 words per sentence and 4/3 tokens per word, that is 13*20*4/3 billion tokens","Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.",Confident,United States of America,Industry,1008.0,6 weeks = 1008 hours,Google TPU v3,Unreleased,,1024.0,,,714.0,,,,256224.76861369325,4000000.0,"Table 3, bolded row is best model"
iGPT-XL,"Vision,Image generation",OpenAI,"Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, Ilya Sutskever",2020-06-17,Generative Pretraining from Pixels,https://openai.com/research/image-gpt,Highly cited,,6801000000.0,source: https://openai.com/blog/image-gpt/#rfref53,3.3e+22,"Taken from here
https://www.lesswrong.com/posts/wfpdejMWog4vEDLDg/ai-and-compute-trend-isn-t-predictive-of-what-is-happening",ILSVRC 2012 subset of ImageNet,,9600000.0,"""We use the ImageNet ILSVRC 2012 training dataset, splitting off 4% as our experimental validation set and report results on the ILSVRC 2012 validation set as our test set.""
https://image-net.org/challenges/LSVRC/2012/
""The goal of this competition is to estimate the content of photographs for the purpose of retrieval and automatic annotation using a subset of the large hand-labeled ImageNet dataset (10,000,000 labeled images depicting 10,000+ object categories) as training.""
","Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full finetuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.",,United States of America,Industry,,,NVIDIA Tesla V100 DGXS 32 GB,Open source,,,,,1206.0,,,,98082.33822642289,,
iGPT-L,"Image generation,Vision",OpenAI,"Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, Ilya Sutskever",2020-06-17,Generative Pretraining from Pixels,https://openai.com/blog/image-gpt/,Highly cited,,1362000000.0,source: https://openai.com/blog/image-gpt/#rfref53,8.91e+21,"We have that ""iGPT-L was trained for roughly 2500 V100-days"" [1]
I assume this is the NVIDIA Tesla V100 GPU. In the specifications, the NVIDIA Tesla V100 has 7 to 8.2 TFLOPS of peak double precision performance, 14 to 16.4 TFLOPS of peak single precision performance, and 112 to 130 TFLOPS of peak tensor performance [2].
I suppose the one that makes sense to use is peak tensor performance, i.e. roughly 125 TFLOPS.
Following OpenAI's AI and Compute, we apply a 0.33 utilization factor [3].
In total we get 2500 V100-days * (24*60*60) seconds/day * 125 TFLOPS * 0.33 = 8.91e+21 FLOP ≈ 103 PF-days.
[1] https://openai.com/blog/image-gpt/
[2] https://images.nvidia.com/content/technologies/volta/pdf/volta-v100-datasheet-update-us-1165301-r5.pdf
[3] https://openai.com/blog/ai-and-compute/",ILSVRC 2012 subset of ImageNet,,9600000.0,"""We use the ImageNet ILSVRC 2012 training dataset, splitting off 4% as our experimental validation set and report results on the ILSVRC 2012 validation set as our test set.""
https://image-net.org/challenges/LSVRC/2012/
""The goal of this competition is to estimate the content of photographs for the purpose of retrieval and automatic annotation using a subset of the large hand-labeled ImageNet dataset (10,000,000 labeled images depicting 10,000+ object categories) as training.""
","Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full finetuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.",,United States of America,Industry,,,NVIDIA Tesla V100 DGXS 32 GB,Open source,,,,,1206.0,,,,30093.444683107013,,
GPT-3 175B (davinci),Language,OpenAI,"Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei",2020-05-28,Language Models are Few-Shot Learners,https://arxiv.org/abs/2005.14165,"Highly cited,Training cost",,175000000000.0,"""we train GPT-3, an autoregressive language model with 175 billion parameters""",3.14e+23,"Table D.1
https://arxiv.org/abs/2005.14165","Common Crawl,WebText2,Wikipedia,Books1,Books2",Table 2.2 (other datasets also used),374000000000.0,"From table 2.2, we determine that there are 410 + 19 + 12 + 55 + 3 = 499 billion tokens.
We multiply this by 0.75 to give 374B words.
3.74e11
========================
[Anson: I think the calculation below doesn't account for all the data; the CommonCrawl data only constitutes 60% of the total. Multiplying by 5/3 gives 4.75e11]
""The CommonCrawl data was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB after filtering, roughly equivalent to 400 billion byte-pair-encoded tokens. ""
Converted to words using
http://extraconversion.com/data-storage/gigabits/gigabits-to-words.html
2.85e11","Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.",Confident,United States of America,Industry,355.2,14.8 days according to https://arxiv.org/ftp/arxiv/papers/2104/2104.10350.pdf,NVIDIA Tesla V100 DGXS 32 GB,API access,,10000.0,0.2196,,25572.0,,,0.6,2056969.3385324872,3200000.0,"3.2M, per table 2.1"
DETR,Vision,Facebook,"Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko",2020-05-26,End-to-End Object Detection with Transformers,https://arxiv.org/abs/2005.12872,Highly cited,,60000000.0,60M per Table 1,4e+20,"""Training the baseline model for 300 epochs on 16 V100 GPUs takes 3 days, with 4 images per GPU (hence a total batch size of 64). For the longer schedule used to compare with Faster R-CNN we train for 500 epochs with learning rate drop after 400 epochs. This schedule adds 1.5 AP compared to the shorter schedule.""
48 V100-days for baseline DETR model. Larger model had 1.5x the params and 5/3 as many epochs, so required ~2.5x as much training compute.
125 teraflop/s * 2.5 * 48 * 24 * 3600 * 0.3 (assumed utilization) ~ 4e20",COCO 2017,"""We perform experiments on COCO 2017 detection and panoptic segmentation datasets [24,18], containing 118k training images and 5k validation images""",123000.0,,"We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task. The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster R-CNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive baselines. Training code and pretrained models are available at https://github.com/facebookresearch/detr.",Confident,United States of America,Industry,,,NVIDIA V100,Open source,,,,,8840.0,,,500.0,959.9097627206426,64.0,
Retrieval-Augmented Generator,Language,"Facebook,New York University (NYU),University College London (UCL)","Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela",2020-05-22,Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,https://arxiv.org/abs/2005.11401v4,"Highly cited,SOTA improvement","""Our RAG models achieve state-of-the-art results on open Natural Questions [29], WebQuestions [3] and CuratedTrec [2] """,626000000.0,"""Our RAG models contain the trainable parameters for the BERT-base query and document encoder of DPR, with 110M parameters each (although we do not train the document encoder ourselves) and 406M trainable parameters from BART-large, 406M parameters, making a total of 626M trainable parameters""",,"not enough info, e.g. no training time reported:
""We train with mixed precision floating point arithmetic [40], distributing training across 8, 32GB NVIDIA V100 GPUs, though training and inference can be run on one GPU""","Natural Questions,Wikipedia",,,,"Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, the other can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.",Confident,"United States of America,United States of America,United Kingdom of Great Britain and Northern Ireland","Industry,Academia,Academia",,,NVIDIA Tesla V100 PCIe 32 GB,Open source,,,,,2047.0,,,,,,
Conformer,Speech,Google,"Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang",2020-05-16,Conformer: Convolution-augmented Transformer for Speech Recognition,https://arxiv.org/abs/2005.08100v1,"Highly cited,SOTA improvement","""Conformer significantly outperforms the previous Transformer and CNN based models achieving state-of-the-art accuracies. On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/testother""",118800000.0,118.8M for Conformer(L),,,LibriSpeech,,,,"Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. In this work, we achieve the best of both worlds by studying how to combine convolution neural networks and transformers to model both local and global dependencies of an audio sequence in a parameter-efficient way. To this regard, we propose the convolution-augmented transformer for speech recognition, named Conformer. Conformer significantly outperforms the previous Transformer and CNN based models achieving state-of-the-art accuracies. On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/testother. We also observe competitive performance of 2.7%/6.3% with a small model of only 10M parameters.",Confident,United States of America,Industry,,,,Unreleased,,,,,2202.0,,,,,,
ContextNet,Speech,Google,"Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, Yonghui Wu",2020-05-07,ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context,https://arxiv.org/abs/2005.03191v3,SOTA improvement,"""We demonstrate that on the widely used Librispeech
benchmark, ContextNet achieves a word error rate (WER) of
2.1%/4.6% without external language model (LM), 1.9%/4.1%
with LM and 2.9%/7.0% with only 10M parameters on the
clean/noisy LibriSpeech test sets. This compares to the
best previously published model of 2.0%/4.6% with LM and
3.9%/11.3% with 20M parameters""",112700000.0,Table 5,,,LibriSpeech,,,970 hours of speech,"Convolutional neural networks (CNN) have shown promising results for end-to-end speech recognition, albeit still behind other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the widths of ContextNet that achieves good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without external language model (LM), 1.9%/4.1% with LM and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the previous best published system of 2.0%/4.6% with LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset.",Likely,United States of America,Industry,,,,Unreleased,,,,,233.0,,,,,,
NAS+ESS (156M),Language,"Northeastern University (China),Chinese Academy of Sciences,NiuTrans Research,Kingsoft","Yinqiao Li, Chi Hu, Yuhao Zhang, Nuo Xu, Yufan Jiang, Tong Xiao, Jingbo Zhu, Tongran Liu, Changliang Li",2020-05-06,Learning Architectures from an Extended Search Space for Language Modeling,https://arxiv.org/abs/2005.02593,SOTA improvement,"""Our ESS method
achieves state-of-the-art result on the PTB task""",156000000.0,,2.89e+18,,Penn TreeBank,,,,,,"China,China,China,China","Academia,Academia,Industry",,,,Unreleased,,,,,12.0,,,30.0,,,
UnifiedQA,Language,"Allen Institute for AI,University of Washington","Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, Hannaneh Hajishirzi",2020-05-02,UnifiedQA: Crossing Format Boundaries With a Single QA System,https://arxiv.org/abs/2005.00700v3,SOTA improvement,"""We then introduce UNIFIEDQA (§3.2) that is a QA system
trained on datasets in multiple formats, indicating
new state-of-the-art results on 10 datasets and generalization to unseen datasets.""",11000000000.0,11B (based on T5-11B),3.5e+19,"A.2: ""In the experiments, we use v3-8 TPUs for T5 models... pretraining UNIFIEDQA approximately takes about 36 hours on T5(11B)""
8 * 1.23e14 * 36 * 3600 * 0.3 = 3.83e19
Alternatively, input (output) size of 512 (100) tokens, batch size of 8, trained for 100k steps. 6ND:
6 * 11B * 612 * 8 * 100k = 3.23e19
Took geometric mean of these estimates:
sqrt(3.23e19*3.83e19) = 3.5e19",,"""We empirically chose the following 8 seed datasets for training UNIFIEDQA, 3 based on their effectiveness in our pilot study (details deferred to Section 5) assessing which datasets are most valuable for out-of-format training:
• EX: SQuAD 1.1, SQuAD 2.0
• AB: NarrativeQA
• MC: RACE, ARC, OBQA, MCTest
• YN: BoolQ""",97309860.0,"Table 2:
SQuAD 1.1: 87k examples, avg total length of 136.2 + 3.0
SQuAD 2.0: 130k examples, avg total length of 139.9 + 2.6
NarrativeQA: 65k examples, avg total length of 563.6 + 6.2
RACE: 87k examples, avg total length of 317.9 + 6.9
ARC (easy): 2k examples, avg total length of 39.4 + 3.7
ARC (hard): 1k examples, avg total length of 47.4 + 5.0
OBQA: 4k examples, avg total length of 28.7 + 3.6
MCTest: 1.4k examples, avg total length of 245.4 + 4.0
BoolQ: 9k examples, avg total length of 105.1 + 1.0
Total tokens: 97,309,860","Question answering (QA) tasks have been posed using a variety of formats, such as extractive span selection, multiple choice, etc. This has led to format-specialized models, and even to an implicit division in the QA community. We argue that such boundaries are artificial and perhaps unnecessary, given the reasoning abilities we seek to teach are not governed by the format. As evidence, we use the latest advances in language modeling to build a single pre-trained QA model, UnifiedQA, that performs surprisingly well across 17 QA datasets spanning 4 diverse formats. UnifiedQA performs on par with 9 different models that were trained on individual datasets themselves. Even when faced with 12 unseen datasets of observed formats, UnifiedQA performs surprisingly well, showing strong generalization from its out-of-format training data. Finally, simply fine-tuning this pre-trained QA model into specialized models results in a new state of the art on 6 datasets, establishing UnifiedQA as a strong starting point for building QA systems.",Confident,"United States of America,United States of America","Research collective,Academia",36.0,"""pretraining UNIFIEDQA approximately takes about 36 and 55 hours, on T5(11B) and BART models, respectively.""",Google TPU v3,Unreleased,"""• Infrastructure: In the experiments, we use v3-8 TPUs for T5 models, and eight 32GB GPUs for
BART models.
• Time spent to build UNIFIEDQA: pretraining UNIFIEDQA approximately takes about 36 and 55
hours, on T5(11B) and BART models, respectively.""
8 * 123 TFLOPS * 36 * 3600 * 0.3 (utilization assumption) = 3.8e19",8.0,,T5-11B,613.0,,3.8e+19,,,,
ATLAS,Language,"Allen Institute for AI,University of Washington","Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, Hannaneh Hajishirzi",2020-05-02,UnifiedQA: Crossing Format Boundaries With a Single QA System,https://arxiv.org/abs/2005.00700,SOTA improvement,"from abstract: ""Finally, simply fine-tuning this pre-trained QA model into specialized models results in a new state of the art on 6 datasets""",11000000000.0,"11B from appendix A.2 : Model sizes: ""Most of the experiments are done on T5(11B) which has 11 billion parameters. We also report experiments with BART (large) with 440 million parameters.""",3.825792e+19,"flops = (8) * (123 * 10**12) * (36 * 3600) * (0.3)
(num gpu) * (peak flops) * (time in seconds) * (assumed utilization rate)
from Appendix A.2: ""Time spent to build UNIFIEDQA: pretraining UNIFIEDQA approximately takes about 36 and 55 hours, on T5(11B) and BART models, respectively.""
so 36h for T5
""Infrastructure: In the experiments, we use v3-8 TPUs for T5 models, and eight 32GB GPUs for BART models.""
from https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#tpu_chip
tpu chip have peak flops 123 teraflops
so 8 chips have peak flops 123 * 8",SQuAD 1.1,"from appendix A.1 - multiple QA datasets, In section 3 there is description how batches are created from multiple datasets.",,"from appendix A.1 - multiple QA datasets - it may be possible to estimate by summing sizes of all datasets
I am not sure if all the data is used, as the system is trained for 100K steps (from appendix A.2)
with batch size 8 (appendix A.1)","Question answering (QA) tasks have been posed using a variety of formats, such as extractive span selection, multiple choice, etc. This has led to format-specialized models, and even to an implicit division in the QA community. We argue that such boundaries are artificial and perhaps unnecessary, given the reasoning abilities we seek to teach are not governed by the format. As evidence, we use the latest advances in language modeling to build a single pre-trained QA model, UnifiedQA, that performs surprisingly well across 17 QA datasets spanning 4 diverse formats. UnifiedQA performs on par with 9 different models that were trained on individual datasets themselves. Even when faced with 12 unseen datasets of observed formats, UnifiedQA performs surprisingly well, showing strong generalization from its out-of-format training data. Finally, simply fine-tuning this pre-trained QA model into specialized models results in a new state of the art on 6 datasets, establishing UnifiedQA as a strong starting point for building QA systems.",Confident,"United States of America,United States of America","Research collective,Academia",36.0,"Appendix A.2: Time spent to build UNIFIEDQA: pretraining UNIFIEDQA approximately takes about 36 and 55 hours, on T5(11B) and BART models, respectively.",Google TPU v3,Open source,,,,,613.0,,,,59.12208690735248,,
Once for All,Vision,"MIT-IBM Watson AI Lab,Massachusetts Institute of Technology (MIT),IBM","Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han",2020-04-29,Once for all: Train one network and specialize it for efficient deployment.,https://arxiv.org/abs/1908.09791,SOTA improvement,"""In particular, OFA achieves a new SOTA 80.0% ImageNet top-1 accuracy under the mobile setting""",7700000.0,,1.78428096e+21,"4.2k V100-hours (table 1)
0.33 utilization rate
",ImageNet,,,,"We address the challenging problem of efficient inference across many devices and resource constraints, especially on edge devices. Conventional approaches either manually design or use neural architecture search (NAS) to find a specialized neural network and train it from scratch for each case, which is computationally prohibitive (causing CO2 emission as much as 5 cars' lifetime) thus unscalable. In this work, we propose to train a once-for-all (OFA) network that supports diverse architectural settings by decoupling training and search, to reduce the cost. We can quickly get a specialized sub-network by selecting from the OFA network without additional training. To efficiently train OFA networks, we also propose a novel progressive shrinking algorithm, a generalized pruning method that reduces the model size across many more dimensions than pruning (depth, width, kernel size, and resolution). It can obtain a surprisingly large number of sub-networks (>1019) that can fit different hardware platforms and latency constraints while maintaining the same level of accuracy as training independently. On diverse edge devices, OFA consistently outperforms state-of-the-art (SOTA) NAS methods (up to 4.0% ImageNet top1 accuracy improvement over MobileNetV3, or same accuracy but 1.5x faster than MobileNetV3, 2.6x faster than EfficientNet w.r.t measured latency) while reducing many orders of magnitude GPU hours and CO2 emission. In particular, OFA achieves a new SOTA 80.0% ImageNet top-1 accuracy under the mobile setting (<600M MACs). OFA is the winning solution for the 3rd Low Power Computer Vision Challenge (LPCVC), DSP classification track and the 4th LPCVC, both classification track and detection track. Code and 50 pre-trained models (for many devices & many latency constraints) are released at this https URL.",,"United States of America,United States of America,United States of America","Academia,Academia,Industry",,,NVIDIA V100,Open source,,,,,1036.0,from Table 1,,,1753.9255676777682,,
Go-explore,Games,"Uber AI,OpenAI","A Ecoffet, J Huizinga, J Lehman, KO Stanley, J Clune",2020-04-27,"First return, then explore",https://arxiv.org/abs/2004.12919,SOTA improvement,"""GoExplore solves all heretofore unsolved Atari games (meaning those for which algorithms could not previously
outperform humans when evaluated following current community standards for Atari3) and surpasses the state
of the art on all hard-exploration games""",,,,,,,,,,Unknown,"United States of America,United States of America","Industry,Industry",,,,Unreleased,,,,,280.0,,,,,,
CURL,Games,UC Berkeley,"A Srinivas, M Laskin, P Abbeel",2020-04-08,CURL: Contrastive Unsupervised Representations for Reinforcement Learning,https://arxiv.org/abs/2004.04136v4,SOTA improvement,,907264.0,,,,,"RL on Atari:
""We measure the data-efficiency and performance of our
method and baselines at 100k and 500k environment steps
on DMControl and 100k interaction steps (400k environment steps with action repeat of 4) on Atari, which we will
henceforth refer to as DMControl100k, DMControl500k
and Atari100k for clarity. While Atari100k benchmark has been common practice when investigating data-efficiency
on Atari (Kaiser et al., 2019; van Hasselt et al., 2019; Kielak,
2020), the DMControl benchmark was set at 500k environment steps because state-based RL approaches asymptotic
performance on many environments at this point, and 100k
steps to measure the speed of initial learning. A broader
motivation is that while RL algorithms can achieve superhuman performance on Atari games, they are still far less
efficient than a human learner. Training for 100-500k environment steps corresponds to a few hours of human time.""",,,,,United States of America,Academia,,,,Open source,,,,,866.0,,,,,,
Agent57,Games,DeepMind,"AP Badia, B Piot, S Kapturowski",2020-03-30,Agent57: Outperforming the Atari Human Benchmark,https://arxiv.org/abs/2003.13350,SOTA improvement,"""We propose Agent57, the first deep RL agent that outperforms the standard human benchmark on all 57 Atari games""",,,,,,,,,,Unknown,United Kingdom of Great Britain and Northern Ireland,Industry,,,,Unreleased,,,,,445.0,,,,,,
MetNet,Earth science,Google,"Casper Kaae Sønderby, Lasse Espeholt, Jonathan Heek, Mostafa Dehghani, Avital Oliver, Tim Salimans, Shreya Agrawal, Jason Hickey, Nal Kalchbrenner",2020-03-24,MetNet: A Neural Weather Model for Precipitation Forecasting,https://arxiv.org/abs/2003.12140,SOTA improvement,"""MetNet improves upon the current operational NWP system HRRR for up to 8 hours of lead time""
...
""Numerical Weather Prediction is the most successful framework to perform medium- and long-range (up to 6 days with high confidence) forecast to date (Bauer et al., 2015).""",,,,,"""Precipitation provides a benchmark for a highly varying and densely measured target (Agrawal
et al.). We cast precipitation forecasting as a structured prediction problem where the output comes
in the form of a three-dimensional tensor. Each value of the tensor corresponds to a time and a
location and indicates the corresponding rate of precipitation measured in mm/h. Target precipitation rates are estimated by the Multi Radar Multi Sensor (MRMS) ground based radars as a
function of the returned radar echoes (Zhang et al., 2016). The spatial size obtained from MRMS
is 7000 × 2500 covering the continental United States. Each pixel covers 0.01◦ of longitude and
latitude corresponding to approximately 1 km2
. In addition to MRMS frames, the available input
data include the 16 spectral bands of the optical Geostationary Operational Environmental Satellite
16 (GOES-16). Figure 1 contains examples of MRMS and GOES-16 frames.""",,,,Unknown,United States of America,Industry,,,,Unreleased,,,,,228.0,,,,,,
ELECTRA,Language,"Stanford University,Google,Google Brain","Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning",2020-03-23,ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators,https://arxiv.org/abs/2003.10555v1,Highly cited,,335000000.0,https://github.com/google-research/electra,3.1e+21,"Table 8: ""ELECTRA-1.75M"" used 3.1e21 train FLOPs. Note that the actual parameter count is 335M. The 1.75M refers to the number of training steps.","BookCorpus (BooksCorpus, Toronto Book Corpus),Wikipedia,ClueWeb,Gigaword","""For most experiments we pre-train on the same data as BERT, which consists
of 3.3 Billion tokens from Wikipedia and BooksCorpus (Zhu et al., 2015). However, for our Large
model we pre-trained on the data used for XLNet (Yang et al., 2019), which extends the BERT
dataset to 33B tokens by including data from ClueWeb (Callan et al., 2009), CommonCrawl, and
Gigaword (Parker et al., 2011).""",25000000000.0,"33B tokens or ~25B words
""For most experiments we pre-train on the same data as BERT, which consists
of 3.3 Billion tokens from Wikipedia and BooksCorpus (Zhu et al., 2015). However, for our Large
model we pre-trained on the data used for XLNet (Yang et al., 2019), which extends the BERT
dataset to 33B tokens by including data from ClueWeb (Callan et al., 2009), CommonCrawl, and
Gigaword (Parker et al., 2011).""","Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.",,"United States of America,United States of America,United States of America","Academia,Industry,Industry",,table 1,,Open source,,,,,2968.0,,,,,,
Tensor-Transformer(1core)+PN (WT103),Language,UC Berkeley,"Sheng Shen, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer",2020-03-17,PowerNorm: Rethinking Batch Normalization in Transformers,https://arxiv.org/abs/2003.07845,SOTA improvement,"""The results are reported in Table 1. In the first section of
rows, we report state-of-the-art results for these two tasks with comparable model sizes""",85300000.0,,1.58e+18,,WikiText-103,,,,,,United States of America,Academia,,,,Open source,,,,,60.0,,,30.0,,,
Routing Transformer,Language,Google Research,"Aurko Roy, Mohammad Saffar, Ashish Vaswani, David Grangier",2020-03-12,Efficient Content-Based Sparse Attention with Routing Transformers,https://arxiv.org/abs/2003.05997,SOTA improvement,"""Additionally, we set a new state-of-the-art on the newly released PG-19 data-set, obtaining a test perplexity of 33.2 with a 22 layer Routing Transformer model trained on sequences of length 8192""",79500000.0,,,,WikiText-103,,,,,,Multinational,Industry,,,,Open source,,,,,447.0,,,,,,
TransformerXL + spectrum control,Language,"University of California Los Angeles (UCLA),JD.com","Lingxiao Wang, Jing Huang, Kevin Huang, Ziniu Hu, Guangtao Wang, Quanquan Gu",2020-03-11,Improving Neural Language Generation with Spectrum Control,https://openreview.net/forum?id=ByxY8CNtvr,SOTA improvement,"""We demonstrate that our spectrum control method outperforms the state-of-the-art Transformer-XL modeling for language model""",151000000.0,,4.6e+17,,WikiText-103,,,,,,"United States of America,China","Academia,Industry",,,,Unreleased,,,,,70.0,,,250.0,,,
Temporal Convolutional Attention-based Network(TCAN) (WT2),Language,"Nanjing University,Ant Group","Hongyan Hao, Yan Wang, Yudi Xia, Jian Zhao, Furao Shen",2020-02-28,Temporal Convolutional Attention-based Network For Sequence Modeling,https://arxiv.org/abs/2002.12530,SOTA improvement,"""We improve the state-of-the-art results of ... 9.20 on WikiText-2""",33000000.0,,,,WikiText-2,,,,,,"China,China","Academia,Industry",,,,Unreleased,,,,,33.0,,,,,,
Feedback Transformer,Language,"LORIA,University of Lorraine,Facebook AI Research","Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, Sainbayar Sukhbaatar",2020-02-21,Addressing Some Limitations of Transformers with Feedback Memory,https://arxiv.org/abs/2002.09402,SOTA improvement,"""As shown in Table 4, the Feedback
Transformer model achieves a new SOTA performance (on Enwiki8) of 0.96 bit-per-byte despite its small size.""",126000000.0,,4.41e+19,,WikiText-103,,,,,,"France,France,United States of America","Academia,Academia,Industry",,,,Unreleased,,,,,41.0,,,267.23,,,
Turing-NLG,Language,Microsoft,Corby Rosset,2020-02-13,Turing-NLG: A 17-billion-parameter language model by Microsoft,https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/,SOTA improvement,"from paper: ""Turing Natural Language Generation (T-NLG) is a 17 billion parameter language model by Microsoft that outperforms the state of the art on many downstream NLP tasks""",17000000000.0,,1.57e+22,"source: https://lair.lighton.ai/akronomicon/
157 PF-days * 3600 * 24 * 10^15 = 1.35648e+22
archived: https://github.com/lightonai/akronomicon/tree/main/akrodb
6ND=6*17000000000*46400000000=4.7328e+21 (confidence regarding dataset size - likely)",,,46400000000.0,"Authors say they pretrain on the same data as for Megatron-LM.
From the Megatron-LM paper: https://arxiv.org/pdf/1909.08053.pdf
""The resulting aggregate corpus contains 174 GB of deduplicated text.""
174GB * 2e8 words/GB = 3.48e10 words
3.48e10 words (if English) * 4/3 = 46400000000 tokens
confidence - likely","Turing Natural Language Generation (T-NLG) is a 17 billion parameter language model by Microsoft that outperforms the state of the art on many downstream NLP tasks. We present a demo of the model, including its freeform generation, question answering, and summarization capabilities, to academics for feedback and research purposes. <|endoftext|>",Likely,United States of America,Industry,,,NVIDIA Tesla V100 DGXS 32 GB,Unreleased,,256.0,,,114.0,,,3.39,51659.713290894986,,
SimCLR,Vision,Google Brain,"Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton",2020-02-13,A Simple Framework for Contrastive Learning of Visual Representations,https://arxiv.org/abs/2002.05709,Highly cited,,375000000.0,source: https://openai.com/blog/image-gpt/,,,ImageNet,"""Dataset and Metrics. Most of our study for unsupervised
pretraining (learning encoder network f without labels)
is done using the ImageNet ILSVRC-2012 dataset (Russakovsky et al., 2015). Some additional pretraining experiments on CIFAR-10 (Krizhevsky & Hinton, 2009) can be
found in Appendix B.9.""",,,,,United States of America,Industry,,,Google TPU v3,Open source,,,,,13835.0,,,1000.0,,,
ALBERT-xxlarge,Language,"Toyota Technological Institute at Chicago,Google","Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut",2020-02-09,ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.,https://arxiv.org/abs/1909.11942,Highly cited,,235000000.0,,2.39e+21,"32 hours of training
512 TPU V3s
0.33 utilization rate
","Wikipedia,BookCorpus (BooksCorpus, Toronto Book Corpus)","""To keep the comparison as meaningful as possible, we follow the BERT (Devlin et al., 2019) setup in using the BOOKCORPUS (Zhu et al., 2015) and English Wikipedia (Devlin et al., 2019) for pretraining baseline models. These two corpora consist of around 16GB of uncompressed text. W""",3300000000.0,"Pretraining same as for BERT - Wikipedia and BookCorpus
""For the pre-training corpus we
use the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words)""","Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and \squad benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at this https URL.",,"United States of America,United States of America","Academia,Industry",32.0,,Google TPU v3,Open source,,512.0,,,5403.0,,,,4439.921298510215,,
TaLK Convolution,Language,Carleton University,"Vasileios Lioutas, Yuhong Guo",2020-02-08,Time-aware Large Kernel Convolutions,https://arxiv.org/abs/2002.03184,SOTA improvement,"""[We] set a new state-of-the-art result on the
IWSLT De-En and CNN-DailyMail datasets""",240000000.0,Table 5,2.78e+19,,WikiText-103,,,,,,Canada,Academia,,,,Unreleased,,,,,28.0,,,187.43,,,
Perceiver IO (optical flow),"Multimodal,Language,Vision",DeepMind,"Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira",2020-02-08,Perceiver IO: A General Architecture for Structured Inputs & Outputs,https://arxiv.org/abs/2107.14795,SOTA improvement,"""Perceiver IO... achieves state-of-the-art performance on Sintel optical flow estimation""",27900000.0,"Optical flow model (SOTA) was 27.9M params. There are other, larger models described in this paper, e.g. for language.
""For the pixel- and patch-based models, total computational
complexity for a forward pass on a 368 × 496 image is roughly 987 billion FLOPs, and there are
roughly 27.9 million parameters.""",,,AutoFlow,"""In all cases, we train on the AutoFlow dataset (Sun et al., 2021), which consists of 400, 000 image
pairs, for 480 epochs using a cosine learning rate schedule which starts at a learning rate of 4e-4.
We use a batch size of 512. We use the LAMB (You et al., 2021) optimizer.""",,,"A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible. Current architectures, however, cannot be applied beyond a small set of stereotyped settings, as they bake in domain & task assumptions or scale poorly to large inputs or outputs. In this work, we propose Perceiver IO, a general-purpose architecture that handles data from arbitrary settings while scaling linearly with the size of inputs and outputs. Our model augments the Perceiver with a flexible querying mechanism that enables outputs of various sizes and semantics, doing away with the need for task-specific architecture engineering. The same architecture achieves strong results on tasks spanning natural language and visual understanding, multi-task and multi-modal reasoning, and StarCraft II. As highlights, Perceiver IO outperforms a Transformer-based BERT baseline on the GLUE language benchmark despite removing input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation with no explicit mechanisms for multiscale correspondence.",,United Kingdom of Great Britain and Northern Ireland,Industry,,,,Unreleased,,,,,416.0,,,,,,
Theseus 6/768,Language,"UC San Diego,Beihang University,Microsoft","Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, Ming Zhou",2020-02-07,BERT-of-Theseus: Compressing BERT by Progressive Module Replacing,https://arxiv.org/abs/2002.02925,SOTA improvement,"""Our approach outperforms existing knowledge distillation approaches on GLUE benchmark""",66000000.0,"66M, Table 1",,,GLUE,"fine-tuned on training sets from GLUE benchmark:
""We test our approach under a task-specific compression setting (Sun et al., 2019; Turc et al., 2019)
instead of a pretraining compression setting (Sanh
et al., 2019; Sun et al., 2020). That is to say, we use
no external unlabeled corpus but only the training set of each task in GLUE to compress the
model. """,,,,,"United States of America,China,United States of America","Academia,Academia,Industry",,,NVIDIA V100,Open source,"Actually BERT-base, 110M params. Up to 20 V100-hours depending on task.
125 trillion * 20 * 3600 * 0.3 (utilization assumption) = 2.7e18",,,BERT-Large,179.0,,2.7e+18,,,,
Meena,Language,Google Brain,"Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, Quoc V. Le",2020-01-28,Towards a Human-like Open-Domain Chatbot,https://arxiv.org/abs/2001.09977,SOTA improvement,"""We also propose a human evaluation metric called Sensibleness and
Specificity Average (SSA)... the full version of Meena (with a filtering mechanism and tuned decoding) scores 79% SSA, 23% higher in absolute SSA than the existing chatbots we evaluated""",2600000000.0,"""We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token.""",1.12e+23,"https://arxiv.org/ftp/arxiv/papers/2104/2104.10350.pdf
Table 4",,,40000000000.0,"""The final Meena dataset contains 341GB of text
(40B words)""
Converting from GB to words yields 6.8e10, which is in the same OOM","We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. We also propose a human evaluation metric called Sensibleness and Specificity Average (SSA), which captures key elements of a human-like multi-turn conversation. Our experiments show strong correlation between perplexity and SSA. The fact that the best perplexity end-to-end trained Meena scores high on SSA (72% on multi-turn evaluation) suggests that a human-level SSA of 86% is potentially within reach if we can better optimize perplexity. Additionally, the full version of Meena (with a filtering mechanism and tuned decoding) scores 79% SSA, 23% higher in absolute SSA than the existing chatbots we evaluated.",,United States of America,Industry,720.0,"We trained our best model for 30 days on a TPUv3 Pod (2,048 TPU cores)",Google TPU v3,Unreleased,,1024.0,0.3439,,816.0,,,,206760.3812904988,82655.0,"61B tokens over 738k training steps, or 82655 tokens per batch on average. Not certain about warmup, etc"
ContextNet + Noisy Student,Speech,Google,"Daniel S. Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, Quoc V. Le",2020-01-19,Improved Noisy Student Training for Automatic Speech Recognition,https://arxiv.org/abs/2005.09629v2,SOTA improvement,"""We are thus able to improve upon the previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h (4.74%/12.20%) and LibriSpeech (1.9%/4.1%)""",,,8.16e+21,"""We train 6 generations of models numbered 0 to 5, where
we count the baseline model trained with the supervised set
as the zeroth generation. Each generation is trained ... on 32 Google
Cloud TPU chips for 10 days.""
The TPU version is likely v3 given this is a 2020 paper.
we get 6 * 10 * 24 * 3600 * 32 * 123 tflops * 0.4 (assumed utilization) = 8.16e21","LibriSpeech,LibriLight","""LibriSpeech 100-860 is a semi-supervised task where the clean 100h subset of LibriSpeech [6] is taken to be the supervised set, while the remaining 860h of audio is taken to be the unlabeled set. The unlabeled audio consists of 360h of clean data and 500h of noisy data. We tokenize the transcripts using a WPM model [37] with vocabulary size 16k constructed from the clean 100h subset transcripts.""
Inputs are mel-spectrograms, but unclear the duration of each.",,,"Recently, a semi-supervised learning method known as ""noisy student training"" has been shown to improve image classification performance of deep networks significantly. Noisy student training is an iterative self-training method that leverages augmentation to improve network performance. In this work, we adapt and improve noisy student training for automatic speech recognition, employing (adaptive) SpecAugment as the augmentation method. We find effective methods to filter, balance and augment the data generated in between self-training iterations. By doing so, we are able to obtain word error rates (WERs) 4.2%/8.6% on the clean/noisy LibriSpeech test sets by only using the clean 100h subset of LibriSpeech as the supervised set and the rest (860h) as the unlabeled set. Furthermore, we are able to achieve WERs 1.7%/3.4% on the clean/noisy LibriSpeech test sets by using the unlab-60k subset of LibriLight as the unlabeled set for LibriSpeech 960h. We are thus able to improve upon the previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h (4.74%/12.20%) and LibriSpeech (1.9%/4.1%).",Confident,United States of America,Industry,1440.0,roughly 10 days,Google TPU v3,Unreleased,,,,,217.0,,,,14226.054462994534,,
AlphaFold,Biology,DeepMind,"Andrew W. Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander W. R. Nelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, David T. Jones, David Silver, Koray Kavukcuoglu, Demis Hassabis",2020-01-15,Improved protein structure prediction using potentials from deep learning,https://www.nature.com/articles/s41586-019-1923-7,"SOTA improvement,Highly cited","""AlphaFold represents a considerable advance
in protein-structure prediction."" [Abstract]",16340840.0,"""Neural network hyperparameters"" section of https://www.nature.com/articles/s41586-019-1923-7:
“7 × 4 Blocks with 256 channels, cycling through dilations 1, 2, 4, 8”
“48 × 4 Blocks with 128 channels, cycling through dilations 1, 2, 4, 8”
""Distogram prediction"" section:
""For the final layer, a position-specific bias was used""
Extended Data Fig.1 (b):
Shows that each block consists of 9 layers:
(1) Batch norm
(2) Elu
(3) Project down (halves number of dimensions)
(4) Batch norm
(5) Elu
(6) 3x3 kernel with dilation
(7) Batch norm
(8) Elu
(9) Project up (doubles number of dimensions)
Dilations don't change the number of parameters in each filter
Assuming that projection layers are convolutional layers with 1x1 kernels
Parameter estimate for each layer in a 256 channel block:
(1) 256*2 = 512
(2) 0
(3) 1*1*256*128 = 32768
(4) 128*2 = 256
(5) 0
(6) 3*3*128*128 = 147456
(7) 128*2 = 256
(8) 0
(9) 1*1*128*256 + 256 = 33024
Total = 214272
Parameter estimate for each layer in a 128 channel block:
(1) 128*2 = 256
(2) 0
(3) 1*1*128*64 = 8192
(4) 64*2 = 128
(5) 0
(6) 3*3*64*64 = 36864
(7) 64*2 = 128
(8) 0
(9) 1*1*64*128 + 128 = 8320
Total = 53897
Estimate total network = 7*4*214272 + 48*4*53897 = 5992616 + 10348224
= 16340840
~ 16e6
Within a factor of 2 of the estimate of 21M parameters stated in: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7305407/
[Previous approximation: 7 * 4 * 256 * 3 * 3 * 256 + 48 * 4 * 128 * 3 * 3 * 128 = 44826624]",1e+20,"Estimated in the blogpost below
https://www.lesswrong.com/posts/wfpdejMWog4vEDLDg/ai-and-compute-trend-isn-t-predictive-of-what-is-happening
""AlphaFold: they say they trained on GPU and not TPU. Assuming V100 GPU, it's 5 days * 24 hours/day * 3600 sec/hour * 8 V100 GPU * 100*10^12 FLOP/s * 33% actual GPU utilization = 10^20 FLOP.""","PDB (Protein Data Bank),UniRef30 (FKA UniClust30)","""Our models are trained on structures extracted from the PDB"" [""Data"" section]
""For each training sequence, we searched for and aligned to the training sequence similar protein sequences in the Uniclust3035 dataset"" [""Data"" section]",,Multiple tasks! Different units,"Protein structure prediction can be used to determine the three-dimensional shape of a protein from its amino acid sequence. This problem is of fundamental importance as the structure of a protein largely determines its function; however, protein structures can be difficult to determine experimentally. Considerable progress has recently been made by leveraging genetic information. It is possible to infer which amino acid residues are in contact by analysing covariation in homologous sequences, which aids in the prediction of protein structures. Here we show that we can train a neural network to make accurate predictions of the distances between pairs of residues, which convey more information about the structure than contact predictions. Using this information, we construct a potential of mean force that can accurately describe the shape of a protein. We find that the resulting potential can be optimized by a simple gradient descent algorithm to generate structures without complex sampling procedures. The resulting system, named AlphaFold, achieves high accuracy, even for sequences with fewer homologous sequences. In the recent Critical Assessment of Protein Structure Prediction (CASP13)—a blind assessment of the state of the field—AlphaFold created high-accuracy structures (with template modelling (TM) scores of 0.7 or higher) for 24 out of 43 free modelling domains, whereas the next best method, which used sampling and contact information, achieved such accuracy for only 14 out of 43 domains. AlphaFold represents a considerable advance in protein-structure prediction. We expect this increased accuracy to enable insights into the function and malfunction of proteins, especially in cases for which no structures for homologous proteins have been experimentally determined.",Speculative,United Kingdom of Great Britain and Northern Ireland,Industry,120.0,"""Training time: about 5 days for 600,000 steps""",,Unreleased,,,,,2773.0,,,,,,
Big Transfer (BiT-L),Vision,Google Brain,"A Kolesnikov, L Beyer, X Zhai, J Puigcerver, J Yung",2019-12-24,Large scale learning of general visual representations for transfer,https://arxiv.org/abs/1912.11370,SOTA improvement,"""We transfer BiT to many diverse tasks... These tasks include ImageNet’s ILSVRC-2012 [10], CIFAR-10/100 [27], Oxford-IIIT Pet [41], Oxford
Flowers-102 [39] (including few-shot variants), and the 1000-sample VTAB-1k benchmark [66], which consists of 19 diverse datasets. BiT-L attains state-of-the-art performance on many of these tasks",928000000.0,,,,JFT-300M,"""We train networks on three different scales of datasets. The largest, BiT-L
is trained on the JFT-300M dataset [51], which contains 300 M noisily labelled images""",,,,,United States of America,Industry,,,Google TPU v3,Unreleased,,,,,1024.0,,,40.0,,,
DD-PPO,Robotics,"Georgia Institute of Technology,Facebook AI Research,Oregon State University,Simon Fraser University","Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, Dhruv Batra",2019-12-19,DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames,https://openreview.net/forum?id=H1gX8C4YPr,SOTA improvement,"""This agent achieves state-of-art on the Habitat Challenge 2019 RGB track (rank 2 entry has 0.89 SPL).""",,"no parameter count but some architecture details: ""The policy is parameterized by a 2-layer LSTM with a 512-dimensional hidden state. It takes three inputs: the previous action, the target relative to the current state, and the output of the visual encoder. The LSTM’s output is used to produce a softmax distribution over the action space and an estimate of the value function. See Appendix C for full details.""",7.8e+20,"""Using DD-PPO, we train agents for 2.5 Billion steps of experience with 64 Tesla V100 GPUs in 2.75 days – 180 GPU-days of training""
125 teraFLOP/s (exact V100 model not specified) * 180 * 24 * 3600 * 0.4 (assumed utilization) = 7.8e20",,"""We experiment with several different sources of data. First, we utilize the training data released
as part of the Habitat Challenge 2019, consisting of 72 scenes from the Gibson dataset (Xia et al.,
2018). We then augment this with all 90 scenes in the Matterport3D dataset (Chang et al., 2017) to
create a larger training set (note that Matterport3D meshes tend to be larger and of better quality).2
Furthermore, Savva et al. (2019) curated the Gibson dataset by rating every mesh reconstruction on
a quality scale of 0 to 5 and then filtered all splits such that each only contains scenes with a rating of
4 or above (Gibson-4+), leaving all scenes with a lower rating previously unexplored. We examine
training on the 332 scenes from the original train split with a rating of 2 or above (Gibson-2+).""",,,"We present Decentralized Distributed Proximal Policy Optimization (DD-PPO), a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever ""stale""), making it conceptually simple and easy to implement. In our experiments on training virtual robots to navigate in Habitat-Sim, DD-PPO exhibits near-linear scaling -- achieving a speedup of 107x on 128 GPUs over a serial implementation. We leverage this scaling to train an agent for 2.5 Billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs. ",Likely,"United States of America,United States of America,United States of America,Canada","Academia,Industry,Academia,Academia",66.0,2.75 days,NVIDIA V100,Unreleased,,64.0,,,372.0,,,,1926.8992900057376,,
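A short sketch of the GPU-days calculation in the training compute notes for DD-PPO; the 125 TFLOP/s figure and the 40% utilization are the assumptions stated in those notes, not values from the paper.
# DD-PPO compute estimate: 180 GPU-days on V100s at an assumed 40% utilization.
gpu_days = 180            # 64 GPUs * 2.75 days
peak_flops = 125e12       # assumed V100 peak (exact V100 model not specified)
utilization = 0.4         # assumed
flop = gpu_days * 24 * 3600 * peak_flops * utilization
print(f"{flop:.1e}")      # ~7.8e20 FLOP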
OpenAI Five Rerun,Games,OpenAI,"Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung,
Przemysław “Psyho"" Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique Pondé de Oliveira Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, Susan Zhang",2019-12-13,Dota 2 with Large Scale Deep Reinforcement Learning,https://cdn.openai.com/dota-2.pdf,"Highly cited,SOTA improvement","""On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game.""",159000000.0,"""We define a policy (π) as a function from the history of observations to a probability distribution
over actions, which we parameterize as a recurrent neural network with approximately 159 million
parameters (θ)."" pg. 3 of paper
source: https://docs.google.com/spreadsheets/d/1Kj4Q5WADcDXtUJLIOfGTCE3tGvxNczEMwyy8QtgSkHk/edit#gid=54587040&fvid=1361937389",1.3e+22,"THIS CALCULATION IS FOR RERUN
source: https://docs.google.com/spreadsheets/d/1Kj4Q5WADcDXtUJLIOfGTCE3tGvxNczEMwyy8QtgSkHk/edit#gid=54587040&fvid=1361937389",,,53084160000.0,"54k iterations (Fig 7)
with a batch size of 983040 (Table 2)","On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as long time horizons, imperfect information, and complex, continuous state-action spaces, all challenges which will become increasingly central to more capable AI systems. OpenAI Five leveraged existing reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. We developed a distributed training system and tools for continual training which allowed us to train OpenAI Five for 10 months. By defeating the Dota 2 world champion (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.",,United States of America,Industry,,,,Unreleased,,512.0,,,1474.0,,,,,,
OpenAI Five,Games,OpenAI,"Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique P. d.O. Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, Susan Zhang",2019-12-13,Dota 2 with Large Scale Deep Reinforcement Learning,https://arxiv.org/abs/1912.06680,"Highly cited,SOTA improvement","""On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game.""",159000000.0,"""We define a policy (π) as a function from the history of observations to a probability distribution over actions, which we parameterize as a recurrent neural network with approximately 159 million parameters (θ)."" pg. 3 of paper
source: https://docs.google.com/spreadsheets/d/1Kj4Q5WADcDXtUJLIOfGTCE3tGvxNczEMwyy8QtgSkHk/edit#gid=54587040&fvid=1361937389",6.7e+22,"""770±50 PFlops/s·days of compute"" for the model that played against world champions. They did a single training run that took 10 months.
While the model was playing against world champions, they continued training for a few days, so that the resulting model used even more training compute: 820±50 PFlops/s·days.
Finally, they also trained a Rerun model with 150±5 PFlops/s·days of compute.
Source: Dota 2 with Large Scale Deep Reinforcement Learning
https://arxiv.org/abs/1912.06680
You cannot multiply the hardware quantity by training time to get the quantity of GPU-hours! Page 5: "" the number of GPUs (up to 1536 at the peak)""",,,454321373184.0,"""Although the Dota 2 engine runs at 30 frames per second, OpenAI Five only acts on every 4th
frame which we call a timestep""
--> 7.5 timesteps/s
""OpenAI Five is a single training run that ran from June 30th, 2018 to April 22nd, 2019. "" --> 296 days
296 * 24*3600 * 7.5 = 1.92e8
This number seems a little low: the DQN paper used 1e7 timesteps. The difference might be due to sample efficiency.
EDIT 14/06/2022
Multiple copies of OpenAI Five were trained in parallel, so the total training time is much higher than 296 days.
Table 1 shows 220,000 GPU iterations, each iteration has a batch size of between 1M and 3M timesteps (Table 2), so the total number of episodes is on the order of 2e11","On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as long time horizons, imperfect information, and complex, continuous state-action spaces, all challenges which will become increasingly central to more capable AI systems. OpenAI Five leveraged existing reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. We developed a distributed training system and tools for continual training which allowed us to train OpenAI Five for 10 months. By defeating the Dota 2 world champion (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.
",Confident,United States of America,Industry,7104.0,"""OpenAI Five is a single training run that ran from June 30th, 2018 to April 22nd, 2019. "" --> 296 days",,Unreleased,,1536.0,,,1474.0,"Cannot multiply the hardware quantity by training time to get the quantity of GPU-hours! Page 5: "" the number of GPUs (up to 1536 at the peak)""",,,,,
MMLSTM,Language,"Beijing University of Posts and Telecommunications,University of West London","Kai Shuang, Rui Li, Mengyu Gu, Jonathan Loo, Sen Su",2019-12-05,Major–Minor Long Short-Term Memory for Word-Level Language Model,http://repository.uwl.ac.uk/id/eprint/6490/1/Loo_etal_IEEE_TNNLS_2019_Major-minor_long_short-term_memory_for_word-level_language_model.pdf,SOTA improvement,"""In experiments, we demonstrate the language model with MMLSTMs surpasses the existing state-of-the-art model on Penn Treebank (PTB) and WikiText-2 (WT2) datasets""",75000000.0,,2.32e+18,,WikiText-103,,,,,,"China,United Kingdom of Great Britain and Northern Ireland","Academia,Academia",,,,Unreleased,,,,,14.0,,,50.0,,,
StarGAN v2,Vision,"NAVER,Yonsei University,Swiss Federal Institute of Technology","Yunjey Choi, Youngjung Uh, Jaejun Yoo, Jung-Woo Ha",2019-12-04,StarGAN v2: Diverse Image Synthesis for Multiple Domains,https://arxiv.org/abs/1912.01865,"Highly cited,SOTA improvement","""Votes from AMT workers for the most preferred method
regarding visual quality and style reflection (%). StarGAN v2 outperforms the baselines with remarkable margins in all aspects.""",,,,,"CelebA,AFHQ","""Datasets. We evaluate StarGAN v2 on CelebA-HQ [21] and
our new AFHQ dataset (Appendix A)""",,,,Unknown,"Korea (Republic of),Korea (Republic of),Switzerland","Industry,Academia,Academia",,,,Open access (non-commercial),,,,,1376.0,,,,,,
Transformer-XL DeFINE (141M),Language,"University of Washington,Allen Institute for AI","Sachin Mehta, Rik Koncel-Kedziorski, Mohammad Rastegari, Hannaneh Hajishirzi",2019-11-27,DeFINE: DEep Factorized INput Token Embeddings for Neural Sequence Modeling,https://arxiv.org/abs/1911.12385,SOTA improvement,"""Compared to state-of-the-art methods including adaptive input representations,
this technique results in a 6% to 20% drop in perplexity""",141000000.0,,6.2e+18,,"WikiText-103,Penn TreeBank",,,,,,"United States of America,United States of America","Academia,Research collective",,,,Unreleased,,,,,21.0,,,20.0,,,
Photo-Geometric Autoencoder,3D modeling,University of Oxford,"Shangzhe Wu, Christian Rupprecht, Andrea Vedaldi",2019-11-25,Unsupervised Learning of Probably Symmetric Deformable 3D Objects From Images in the Wild,https://arxiv.org/abs/1911.11130,SOTA improvement,"""Our model outperforms a
current state-of-the-art 3D reconstruction method that uses 2D keypoint supervision""",,,,,"CelebA,3DFAW,BFM","""We test our method on three human face
datasets: CelebA [35], 3DFAW [21, 27, 73, 69] and
BFM [47]""",,,,Unknown,United Kingdom of Great Britain and Northern Ireland,Academia,,,,Open source,,,,,270.0,,,,,,
Transformer - LibriVox + Decoding/Rescoring,Speech,Facebook,"Gabriel Synnaeve, Qiantong Xu, Jacob Kahn, Tatiana Likhomanenko, Edouard Grave, Vineel Pratap, Anuroop Sriram, Vitaliy Liptchinsky, Ronan Collobert",2019-11-19,End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures,https://arxiv.org/abs/1911.08460v3,SOTA improvement,"""Results with decoding/rescoring are shown in Table 2, where we reach 2.09% and 4.11% on test-clean and test-other , respectively, and are further improvements on the state-of-the-art.""",296000000.0,Table 2,,"""Models are trained on 64 GPUs each with an overall batch size of 256 for ResNet and TDS and 320 for Transformer. With only LIBRISPEECH, all models converged in under a week; with pseudo-labels from LIBRIVOX, training required 2-3 weeks""
GPU not specified","LibriSpeech,LibriVox","""LIBRIVOX2
is a large collection of freely-available audio books. Using tools provided with the LIBRILIGHT dataset [26], we select 72K hours of read speech from English book listings and run several preprocessing
steps. After filtering samples to remove readings of duplicate text and corrupted audio, we remove all audio for which
the speaker has overlap with a sample in LIBRISPEECH... the resulting audio corpus contains 53.8K hours of read speech.""",,"""the resulting audio corpus contains 53.8K hours of read speech""","We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard LibriSpeech dataset, and leverage additional unlabeled data from LibriVox through pseudo-labeling. We show that while Transformer-based acoustic models have superior performance with the supervised dataset alone, semi-supervision improves all models across architectures and loss functions and bridges much of the performance gaps between them. In doing so, we reach a new state-of-the-art for end-to-end acoustic models decoded with an external language model in the standard supervised learning setting, and a new absolute state-of-the-art with semi-supervised training. Finally, we study the effect of leveraging different amounts of unlabeled audio, propose several ways of evaluating the characteristics of unlabeled audio which improve acoustic modeling, and show that acoustic models trained with more audio rely less on external language models.",Confident,United States of America,Industry,,,,Open source,,,,,233.0,,,,,,
MuZero,Games,DeepMind,"J Schrittwieser, I Antonoglou, T Hubert, K Simonyan",2019-11-19,Mastering Atari Go Chess and Shogi by Planning with a Learned Model,https://arxiv.org/abs/1911.08265v2,"Highly cited,SOTA improvement",,36864000.0,"Both the representation and dynamics function use the same architecture asAlphaZero, but with 16 instead of20 residual blocks [15]. We use 3x3 kernels and 256 hidden planes for each convolution.
Previous downsampling:
• 1 convolution with stride 2 and 128 output planes, output resolution 48x48.• 2 residual blocks with 128 planes• 1 convolution with stride 2 and 256 output planes, output resolution 24x24.• 3 residual blocks with 256 planes.• Average pooling with stride 2, output resolution 12x12.• 3 residual blocks with 256 planes.• Average pooling with stride 2, output resolution 6x6.",4.8e+19,"third-generation Google Cloud TPU
(For each board game, we used 16 TPUs for training and 1000 TPUs for self-play)
For each game in Atari, we used 8 TPUs for training and 32 TPUs for self-play
Training for 12 hours (for Atari)
Data from Parameter, Compute and Data Trends in Machine Learning
Google v3 TPU: 1.23E+14 FLOP/s (with the caveat that it might not be applicable)
Utilization rate
In LaMDA: Language Models for Dialog Applications, they report for TPU V3: 56.5%
Calculations for Atari:
12 hours → 43200 seconds
(8 TPUs for training) * (1.23*10^14 FLOP/s) * (43.2 *10^3 s) * (0.565 utilization rate) = 2.4017472 * 10^19 FLOP
Training time is not reported for the board games; assuming it is also 12 hours gives another 2.4017472 * 10^19 FLOP.
Total cost ≈ 4.8 * 10^19 FLOP",,,20000000000.0,"Table 1
https://arxiv.org/pdf/1911.08265.pdf","Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. When evaluated on 57 different Atari games - the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled - our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.",,United Kingdom of Great Britain and Northern Ireland,Industry,,,,Unreleased,,,,,1567.0,,,,,,
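The Atari compute estimate in the MuZero notes, written out; the TPU v3 throughput, the 56.5% utilization (borrowed from the LaMDA paper), and the assumption that the board games contribute roughly the same amount again are all taken from the notes, not the paper.
# MuZero (Atari) training compute, per the notes above.
tpus = 8                      # TPUs used for training (self-play TPUs not counted)
tpu_flops = 1.23e14           # Google TPU v3
hours = 12
utilization = 0.565           # assumed, borrowed from LaMDA's reported TPU v3 utilization
atari_flop = tpus * tpu_flops * hours * 3600 * utilization
print(f"{atari_flop:.2e}")    # ~2.4e19 FLOP

# The notes add roughly the same amount again for the board-game runs.
print(f"{2 * atari_flop:.1e}")  # ~4.8e19 FLOP total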
MoCo,"Vision,Image generation",Facebook AI,"Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xe, Ross Girshick",2019-11-13,Momentum Contrast for Unsupervised Visual Representation Learning,https://arxiv.org/abs/1911.05722,Highly cited,,375000000.0,https://openai.com/blog/image-gpt/#rfref53,,,"ImageNet,Instagram-1B","""We study unsupervised training performed in:
ImageNet-1M (IN-1M): This is the ImageNet [11] training set that has ∼1.28 million images in 1000 classes (often
called ImageNet-1K; we count the image number instead,
as classes are not exploited by unsupervised learning). This
dataset is well-balanced in its class distribution, and its images generally contain iconic view of objects.
Instagram-1B (IG-1B): Following [44], this is a dataset
of ∼1 billion (940M) public images from Instagram. The
images are from ∼1500 hashtags [44] that are related to the
ImageNet categories. This dataset is relatively uncurated
comparing to IN-1M, and has a long-tailed, unbalanced
distribution of real-world data. This dataset contains both
iconic objects and scene-level images.""",,,,,United States of America,Industry,,,,Open access (non-commercial),,,,,9387.0,,,,,,
Noisy Student (L2),Vision,"Carnegie Mellon University (CMU),Google","Q Xie, MT Luong, E Hovy, QV Le",2019-11-11,Self-training with Noisy Student improves ImageNet classification,https://arxiv.org/abs/1911.04252v4,"Highly cited,SOTA improvement","""Noisy Student Training achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model""",480000000.0,,8.4934656e+20,"""Our largest model, EfficientNet-L2, needs to be trained for 6 days on a Cloud TPU v3 Pod, which has 2048 cores, if the unlabeled batch size is 14x the labeled batch size""
2048*4.00E+12*60**2*24*4*0.3 = 8.5e20","ImageNet,JFT",,81000000.0,"""Due to duplications, there are only 81M unique images among these 130M images.""",,,"United States of America,United States of America","Academia,Industry",144.0,6 days,Google TPU v3,Unreleased,,1024.0,,,2033.0,,,,43900.60644295845,,
Sandwich Transformer,Language,"Allen Institute for AI,Facebook AI Research","Ofir Press, Noah A. Smith, Omer Levy",2019-11-10,Improving Transformer Models by Reordering their Sublayers,https://arxiv.org/abs/1911.03864,SOTA improvement,"""Sandwich transformers achieve state-of-the-art results on the enwik8 character-level language modeling dataset and on an additional word-level corpus,
but have no significant effect on machine translation""",209000000.0,209M,1.58e+20,,"BookCorpus (BooksCorpus, Toronto Book Corpus),enwik8,text8",,,,,,"United States of America,United States of America","Research collective,Industry",,,,Unreleased,,,,,71.0,,,180.0,,,
CamemBERT,Language,"Facebook,INRIA,Sorbonne University","Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, Benoît Sagot",2019-11-10,CamemBERT: a Tasty French Language Model,https://arxiv.org/abs/1911.03894,SOTA improvement,"""Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks."" (part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks)",335000000.0,"CamemBERT Large, Table 4",8.3e+20,"""Unless otherwise specified, our models use the BASE architecture, and are pretrained for 100k backpropagation steps on 256 Nvidia V100 GPUs (32GB each) for a day""
256 V100-days
256 * 125 teraflops * 24 * 3600 * 0.3 (assumed utilization)
= 8.3e20
""Following (Liu et al., 2019), we optimize the model using Adam (Kingma and Ba, 2014) (β1 = 0.9, β2 = 0.98) for 100k steps with large batch sizes of 8192 sequences, each sequence containing at most 512 tokens""
Using compute = 6*N*D, that's 6 * (100k * 8192 * 512) * 335M= 8.43e20",CCNet,"""we train another model with the LARGE architecture, referred to as CamemBERTLARGE, for a fair comparison with XLM-RLARGE. This model is trained with the CCNet corpus, described in Sec. 6, for 100k steps""
Other models in paper are trained with the French portion of OSCAR. See footnote 12.",31900000000.0," 31.9B tokens, Table 6.","Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models --in all languages except English-- very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks.",Confident,"United States of America,France,France","Industry,Academia,Academia",24.0,1 day for each model (may not have been a full 24 hours),NVIDIA V100,Open source,,,,,990.0,,,13.0,2319.5419478533995,,
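The two independent estimates in the CamemBERT compute notes, hardware-time-based and 6ND-based, agree to within a few percent; a small sketch using the figures stated there (the 30% utilization is the notes' assumption).
# Hardware-based estimate: 256 V100s for one day at an assumed 30% utilization.
hw_flop = 256 * 125e12 * 24 * 3600 * 0.3
print(f"{hw_flop:.2e}")     # ~8.3e20 FLOP

# 6ND estimate: 100k steps * 8192 sequences * 512 tokens, 335M parameters.
tokens = 100_000 * 8192 * 512
param_flop = 6 * tokens * 335e6
print(f"{param_flop:.2e}")  # ~8.4e20 FLOP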
XLM-RoBERTa,Language,Facebook AI,"Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov",2019-11-05,Unsupervised Cross-lingual Representation Learning at Scale,https://arxiv.org/abs/1911.02116,"Highly cited,SOTA improvement","""which obtains state-of-the-art performance on cross-lingual classification, sequence labeling and question answering""",550000000.0,"The number of parameters in the model is specified as ""550M params"" for XLM-R.",,"""We use the multilingual MLM loss and train our XLM-R model for
1.5 Million updates on five-hundred 32GB Nvidia
V100 GPUs with a batch size of 8192. ""
We may try the 6ND approximation. It gives around 6 * 550e6 params * 1.5e6 updates * 8192 sequences per batch ≈ 4.06e19 FLOP,
but number of tokens is speculative",CC100,"The training dataset and size are mentioned as ""using more than two terabytes of filtered CommonCrawl data"" and the model being trained on ""100 languages"".",125250000000.0,size of CC100 - copied from other rows,"This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code, data and models publicly available.",Likely,United States of America,Industry,,,NVIDIA Tesla V100 DGXS 32 GB,Open access (non-commercial),,500.0,,,4958.0,,,,,,
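A sketch of why the 6ND figure in the compute notes above is left out of the compute field: the product counts sequences rather than tokens. If each of the 1.5M updates covered 8192 sequences of, say, 512 tokens (an assumed sequence length, not stated in the paper), the estimate would be three orders of magnitude larger.
params = 550e6
updates = 1.5e6
batch_sequences = 8192

# As written in the notes (treats each sequence as a single token):
print(f"{6 * params * updates * batch_sequences:.1e}")            # ~4.1e19 FLOP

# With a hypothetical sequence length of 512 tokens (assumption, not from the paper):
seq_len = 512
print(f"{6 * params * updates * batch_sequences * seq_len:.1e}")  # ~2.1e22 FLOP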
Base LM + kNN LM + Continuous Cache,Language,"Stanford University,Facebook AI Research","Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis",2019-11-01,Generalization through Memorization: Nearest Neighbor Language Models,https://arxiv.org/abs/1911.00172,SOTA improvement,"""GNN-LM achieves a new state-of-the-art perplexity of 14.8 on WikiText-103""",247000000.00000003,,7.3e+18,,WikiText-103,,,,,,"United States of America,United States of America","Academia,Industry",,,,Unreleased,,,,,580.0,,,200.0,,,
AlphaStar,Games,DeepMind,"Oriol Vinyals,Igor Babuschkin,Wojciech M. Czarnecki,Michaël Mathieu,Andrew Dudzik,Junyoung Chung,David H. Choi,Richard Powell,Timo Ewalds,Petko Georgiev,Junhyuk Oh,Dan Horgan,Manuel Kroiss,Ivo Danihelka,Aja Huang,Laurent Sifre,Trevor Cai,John P. Agapiou,Max Jaderberg,Alexander S. Vezhnevets,Rémi Leblond,Tobias Pohlen,Valentin Dalibard,David Budden,Yury Sulsky,James Molloy,Tom L. Paine,Caglar Gulcehre,Ziyu Wang,Tobias Pfaff,Yuhuai Wu,Roman Ring,Dani Yogatama,Dario Wünsch,Katrina McKinney,Oliver Smith,Tom Schaul,Timothy Lillicrap,Koray Kavukcuoglu,Demis Hassabis,Chris Apps,David Silver",2019-10-30,Grandmaster level in StarCraft II using multi-agent reinforcement learning,https://www.deepmind.com/blog/alphastar-grandmaster-level-in-starcraft-ii-using-multi-agent-reinforcement-learning,Highly cited,,139000000.0,"AlphaStar has 139 million weights, but only 55 million weights are required during inference.",5.9250000000001e+22,"384 TPUv3 chips for 44 days. Assume 33% utilization.
https://www.wolframalpha.com/input?i=123+teraFLOPS+*+384+*+0.33+*+44+days",,,,"Multiple data types. First supervised learning, then other stuff","Many real-world applications require artificial agents to compete and coordinate with other agents in complex environments. As a stepping stone to this goal, the domain of StarCraft has emerged as an important challenge for artificial intelligence research, owing to its iconic and enduring status among the most difficult professional esports and its relevance to the real world in terms of its raw complexity and multi-agent challenges. Over the course of a decade and numerous competitions1–3, the strongest agents have simplified important aspects of the game, utilized superhuman capabilities, or employed hand-crafted sub-systems4. Despite these advantages, no previous agent has come close to matching the overall skill of top StarCraft players. We chose to address the challenge of StarCraft using generalpurpose learning methods that are in principle applicable to other complex domains: a multi-agent reinforcement learning algorithm that uses data from both human and agent games within a diverse league of continually adapting strategies and counter-strategies, each represented by deep neural networks5,6. We evaluated our agent, AlphaStar, in the full game of StarCraft II, through a series of online games against human players. AlphaStar was rated at Grandmaster level for all three StarCraft races and above 99.8% of officially ranked human players.",,United Kingdom of Great Britain and Northern Ireland,Industry,1056.0,"""Each agent was trained using 32 third-generation tensor
processing units (TPUs) over 44 days""",Google TPU v3,Unreleased,,384.0,,,2994.0,,,,125758.09814850632,,
BART-large,Language,Facebook AI,"Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer",2019-10-29,"BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension",https://arxiv.org/abs/1910.13461,Highly cited,,406291456.0,"""In total, BART contains roughly 10% more parameters than the equivalently sized BERT model.""
I counted the parameters in the huggingface model
https://huggingface.co/facebook/bart-large/tree/main
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained(""facebook/bart-large"")
model = AutoModel.from_pretrained(""facebook/bart-large"")
sum(p.numel() for p in model.parameters() if p.requires_grad)",,,Wikipedia,"""All models are of comparable size and are trained for 1M steps
on a combination of books and Wikipedia data""",,,,,United States of America,Industry,,,,Open source,,,,,8296.0,,,,,,
T5-11B,Language,Google,"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu",2019-10-23,Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,https://arxiv.org/abs/1910.10683,Highly cited,,11000000000.0,The full 11-billion parameter model,3.3e+22,"https://arxiv.org/ftp/arxiv/papers/2104/2104.10350.pdf
Table 4, 4.05e22
update: 3.3e22 per FLAN paper from Google
https://arxiv.org/pdf/2210.11416.pdf",C4,,150000000000.0,"""This produces a collection of text that is not only
orders of magnitude larger than most data sets used for pre-training (about 750 GB) but also
comprises reasonably clean and natural English text. We dub this data set the “Colossal
Clean Crawled Corpus” (or C4 for short) and release it as part of TensorFlow Datasets""
750GB * 200M word/GB = 1.5e11","Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.",Confident,United States of America,Industry,481.9,"4.05*10^22 FLOP at 37.073% utilization on 512 TPU v3 chips (123 TFLOPS) -> 482 hours
https://www.wolframalpha.com/input?i=4.05*10%5E22+seconds+%2F+%28512*123*10%5E12%29+*%28123%2F45.6%29",Google TPU v3,Open source,,512.0,0.3707,,13979.0,,,,75524.39074218823,65536.0,"""We use a maximum sequence length of 512 and a batch size of 128 sequences. Whenever possible, we “pack” multiple sequences into each entry of the batch10 so that our batches contain roughly 2^16 = 65,536 tokens"""
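A sketch of the batch-token packing noted in the T5-11B batch-size field and of how the training-time figure follows from the compute estimate, chip count, and utilization; the 123 TFLOP/s peak is the TPU v3 figure used in the notes.
# Tokens per batch: 128 sequences of up to 512 tokens, packed.
print(128 * 512)                     # 65,536 = 2**16 tokens per batch

# Training time back-calculated from compute, chips, and utilization (per the notes).
compute = 4.05e22                    # FLOP figure used for the time estimate
chips, peak, util = 512, 123e12, 0.37073
seconds = compute / (chips * peak * util)
print(seconds / 3600)                # ~482 hours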
T5-3B,Language,Google,"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu",2019-10-23,Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,https://arxiv.org/abs/1910.10683,Highly cited,,2800000000.0,"page 37, 3B and 11B. ""To further explore what kind of performance is possible when using larger models, we consider two additional variants. In both cases, we use d_model = 1024, a 24 layer encoder and decoder, and dkv = 128. For the “3B” variant, we use dff = 16,384 with 32-headed attention, which results in around 2.8 billion parameters; for “11B” we use dff = 65,536 with 128-headed attention producing a model with about 11 billion parameters""",8.658654068736e+20,"Akronomicon states 1.04e+22 FLOP. Archived source: https://github.com/lightonai/akronomicon/tree/main/akrodb
However, this seems dubiously high.
""We pre-train each model for 2^19 = 524,288 steps on C4 before fine-tuning.""
""In total, this batch size and number of steps corresponds to pre-training on 2^35 ≈ 34B tokens.""
""To compare these mixing strategies on equal footing with our baseline pre-train-then-fine-tune results, we train multi-task models for the same total number of steps: 2^19 + 2^18 = 786,432""
Using the 6DN approximation gives: 6 FLOP/token/param * 2^35 pretrain tokens * (1+1/2 finetune tokens per pretrain token) * 1 iteration of training data* 2.8 billion parameters = 8.659e20 FLOP
https://www.wolframalpha.com/input?i=6+*+2%5E35+*+2.8+billion+*+1.5",C4,,25500000000.0,"""This produces a collection of text that is not only orders of magnitude larger than most data sets used for pre-training (about 750 GB) but also
comprises reasonably clean and natural English text. We dub this data set the “Colossal Clean Crawled Corpus” (or C4 for short) and release it as part of TensorFlow Datasets""
750GB * 200M word/GB = 1.5e11
""In total, this batch size and number of steps corresponds to pre-training on 2^35 ≈ 34B tokens.""
""Note that 2^35 tokens only covers a fraction of the entire C4 data set, so we never repeat any data during pre-training.""
The fraction is 25.5 billion / 150 billion = 0.17 epochs.","Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.",,United States of America,Industry,,,Google TPU v3,Open source,,,,,13979.0,,,0.17,1613.1944624436098,,
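The 6ND-style estimate from the T5-3B compute notes above, including the extra half-epoch of multi-task training, plus the epoch fraction for C4 using the word counts in the dataset notes.
params = 2.8e9
pretrain_tokens = 2**35                      # ~34B tokens
multitask_factor = 1.5                       # 2^19 + 2^18 steps vs 2^19 pre-training steps
flop = 6 * params * pretrain_tokens * multitask_factor
print(f"{flop:.3e}")                         # ~8.66e20 FLOP

# Fraction of C4 seen during pre-training.
print(25.5e9 / 150e9)                        # ~0.17 epochs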
M4-50B,Language,Google,"Ankur Bapna, Orhan Firat",2019-10-11,"Exploring Massively Multilingual, Massive Neural Machine Translation",https://blog.research.google/2019/10/exploring-massively-multilingual.html,SOTA improvement,,50000000000.0,"(sparse architecture)
""By modifying the Transformer architecture through the substitution of the vanilla feed-forward layers with sparsely-gated mixture of experts, we drastically scale up the model capacity, allowing us to successfully train and pass 50 billion parameters, which further improved translation quality across the board.""",,"Sparse architecture, so training compute is uncertain",,"""we push the limits of research on multilingual NMT by training a single NMT model on 25+ billion sentence pairs, from 100+ languages to and from English, with 50+ billion parameters.""",,25+ billion sentence pairs,"Over the last few years there has been enormous progress in the quality of machine translation (MT) systems, breaking language barriers around the world thanks to the developments in neural machine translation (NMT). The success of NMT however, owes largely to the great amounts of supervised training data. But what about languages where data is scarce, or even absent? Multilingual NMT, with the inductive bias that “the learning signal from one language should benefit the quality of translation to other languages”, is a potential remedy.",Confident,United States of America,Industry,,,,Unreleased,,,,,,,,,,,
DistilBERT,Language,Hugging Face,"Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf",2019-10-02,"DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter",https://arxiv.org/abs/1910.01108,Highly cited,,66000000.0,Table 3,1.24416e+19,"Section 3: DistilBERT was trained on 8 16GB V100 GPUs for approximately 90 hours.
1.6e13*8*60**2*90*0.3 = 1.2e19","Wikipedia,BookCorpus (BooksCorpus, Toronto Book Corpus)","Section 3: We train DistilBERT on the same corpus as the original BERT model: a concatenation of English Wikipedia and Toronto Book Corpus [Zhu et al., 2015].",,,,,Multinational,Industry,,,,Open source,,,,,5556.0,,,,,,
AlphaX-1,Vision,"Facebook AI Research,Brown University","Linnan Wang, Yiyang Zhao, Yuu Jinnai, Yuandong Tian, Rodrigo Fonseca1",2019-10-02,AlphaX: eXploring Neural Architectures with Deep Neural Networks and Monte Carlo Tree Search,https://arxiv.org/abs/1903.11059,SOTA improvement,"""In 12 GPU days and 1000 samples, AlphaX found an architecture that reaches 97.84\% top-1 accuracy on CIFAR-10, and 75.5\% top-1 accuracy on ImageNet, exceeding SOTA NAS methods in both the accuracy and sampling efficiency""",579000000.0,"Table 3: multiadds for AlphaX-1 579M, parameters 5.4M",7.6e+18,,"ImageNet,COCO","12800000 + 200000=1480000
I assume they used 1,281,167 training images when referred to Imagenet and 200 000 when referred to MS COCO
""We set up the ImageNet training using
the standard mobile configuration with the input image size
of (224 × 224)[45]. More details are available in the appendix. AlphaX sampled 1000 networks, and we selected
the top 20 networks in the pre-training to fine-tune another
530 epochs.
We use AlphaX-1 model pre-trained on ImageNet
dataset. The training dataset is MSCOCO for object
detection[15] which contains 90 classes of objects. Each
image is scaled to 300 × 300 in RGB channels. We trained
the model with 200k iterations with 0.04 initial learning rate
and the batch size is set to 24. We applied the exponential learning rate decay schedule with the 0.95 decay factor. Our
model uses momentum optimizer with momentum rate set
to 0.9. We also use the L2 weight decay for training. We
process each image with random horizontal flip and random
crop[22]. We set the matched threshold to 0.5, which means
only the probability of an object over 0.5 is effective to appear on the image. We use 8000 subsets of validation images in MSCOCO validation set and report the mean average precision (mAP) as computed with the standard COCO metric library[16].""
",1480000.0,,,,"United States of America,United States of America","Industry,Academia",,,NVIDIA Geforce GTX 1080 Ti,Unreleased,,,,,84.0,,,,,,
ALBERT,Language,"Toyota Technological Institute at Chicago,Google Research","Z Lan, M Chen, S Goodman, K Gimpel",2019-09-26,ALBERT: A Lite BERT for Self-supervised Learning of Language Representations,https://arxiv.org/abs/1909.11942,Highly cited,,18000000.0,Section 3.2 of paper,,,"BookCorpus (BooksCorpus, Toronto Book Corpus),Wikipedia",,3300000000.0,"Pretraining same as for BERT - Wikipedia and BookCorpus
""For the pre-training corpus we
use the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words)""",,,"United States of America,Multinational","Academia,Industry",,,Google TPU v3,Open source,,,,,5403.0,,,,,,
Adaptive Inputs + LayerDrop,Language,"Facebook AI Research,LORIA","Angela Fan, Edouard Grave, Armand Joulin",2019-09-25,Reducing Transformer Depth on Demand with Structured Dropout,https://arxiv.org/abs/1909.11556,SOTA improvement,"""In neural machine translation on newstest2014, our 12 encoder layer Transformer model with LayerDrop further improves the state of the art, reaching 30.2 BLEU""",423000000.00000006,,,,WikiText-103,,,,,,"United States of America,France","Industry,Academia",,,,Open source,,,,,494.0,,,,,,
Megatron-LM (8.3B),Language,NVIDIA,"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro",2019-09-17,Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism,https://arxiv.org/abs/1909.08053,"Highly cited,SOTA improvement","""Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA""
""GPT-2 model"" here means a model similar to GPT-2",8300000000.0,"Source: https://lair.lighton.ai/akronomicon/
Archived source: https://web.archive.org/web/20211220142906/https://lair.lighton.ai/akronomicon/
Data also available on GitHub: https://github.com/lightonai/akronomicon/blob/main/akrodb/NVIDIA/Megatron-LM.json",9.1e+21,"source: https://lair.lighton.ai/akronomicon/
archived: https://github.com/lightonai/akronomicon/tree/main/akrodb
other estimates:
8.3B is a GPT-2-based model (Table 2). ""For GPT-2 models, all training is performed with sequences of 1024 subword units at a batch size of 512 for 300k iterations""
I interpret the above as 1024*512*300k = 157B training tokens
6 * 157 billion * 8.3 billion = 7.8e21
Also, their training setup achieved 15.1 petaFLOPS or 1.5e16 FLOPS.
(512 V100s is 512 * 125 teraflops = 64 petaFLOPS so they had ~25% utilization)
2.1 days per epoch, ~4.4 epochs
2.1 * 4.4 * 24 * 3600 * 1.5e16 = 1.197e22
These are both close to the akronomicon estimate",,"""we aggregate several of the largest language
modeling datasets. We create an aggregate dataset consisting of Wikipedia (Devlin et al., 2018), CC-Stories (Trinh &
Le, 2018), RealNews (Zellers et al., 2019), and OpenWebtext (Radford et al., 2019). To avoid training set leakage
into our downstream tasks we remove the Wikipedia articles
present in the WikiText103 test set (Merity et al., 2016).""",34800000000.0,"""The resulting aggregate
corpus contains 174 GB of deduplicated text.""","Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complimentary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows. Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).",Likely,United States of America,Industry,327.0,"Reported throughput is 15.1 teraFLOPS per GPU on 512 GPUs
Assume total compute is 9.1e21 FLOP.
Then training time is 327 hours.
https://www.wolframalpha.com/input?i=9.1*10%5E21+FLOP+%2F+%28512*15.1+teraFLOPS%29",NVIDIA Tesla V100 DGXS 32 GB,Unreleased,,512.0,0.2269,,1242.0,"327 hours * 512 GPUs * $0.55/V100 GPU-hour = $92,083
Convert to 2020 dollars: $78,689",,4.4,106142.2892017932,,
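A sketch of the token-count, compute, training-time, and cost arithmetic spread across the notes for Megatron-LM (8.3B); the akronomicon figure of 9.1e21 FLOP and the $0.55 per V100-hour rate are taken as given from those notes.
# Tokens seen: 300k iterations * 512 sequences * 1024 tokens.
tokens = 300_000 * 512 * 1024
print(f"{tokens:.2e}")                       # ~1.57e11 tokens

# 6ND estimate.
print(f"{6 * tokens * 8.3e9:.1e}")           # ~7.8e21 FLOP

# Reported throughput implies roughly 24% utilization of 512 V100s.
print(15.1e15 / (512 * 125e12))              # ~0.236

# Training time and rough cost, taking the 9.1e21 FLOP figure as given.
hours = 9.1e21 / (512 * 15.1e12) / 3600
print(hours)                                 # ~327 hours
print(hours * 512 * 0.55)                    # ~$92,000 at $0.55 per V100 GPU-hour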
Megatron-BERT,Language,NVIDIA,"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro",2019-09-17,Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism,https://arxiv.org/abs/1909.08053,"Highly cited,SOTA improvement","""Our BERT model achieves SOTA results on the RACE dataset""",3900000000.0,"Source: https://lair.lighton.ai/akronomicon/
Archive on GitHub: https://github.com/lightonai/akronomicon/tree/main/akrodb",6.027e+22,"A source: https://lair.lighton.ai/akronomicon/ claims 5.7e22
The authors report experimenting on 1 V100 GPU and achieving throughput of 39 TFLOPS which is 30% of the peak throughput. Therefore the GPU has a peak throughput of 130 TFLOPS so it is specifically the NVIDIA V100S PCIe.
https://images.nvidia.com/content/technologies/volta/pdf/volta-v100-datasheet-update-us-1165301-r5.pdf
Param-based calculation:
6ND = 6*3.9e9*2e6*1024*1024 = 4.8e22 FLOP
Time-based calculation:
The 8.3B GPT-like arch took 2.1 days per epoch on 512 GPUs, batch size 512. An epoch was 68.5k iterations.
BERT: batch size 1024, 2e6 iterations total.
So we should expect 4B => 1.0 days per epoch (69e3*512 examples)
=> 2e6*1024/(69e3*512) = 58 days training
On 512 GPUs they achieve a peak throughput of 15.1 PFLOPS.
C=15.1 PFLOPS * 58 days = 7.6e22 FLOP.
The param and time calculations seem more trustworthy. Geometric mean is 6.027e22 FLOP",,,34800000000.0,"""The resulting aggregate corpus contains 174 GB of deduplicated text.""","Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complimentary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows. Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).",Confident,United States of America,Industry,1392.0,"The 8.3B GPT-like arch took 2.1 days per epoch on 512 GPUs, batch size 512. An epoch was 68.5k iterations.
BERT: batch size 1024, 2e6 iterations total.
So we should expect 4B => 1.0 days per epoch (69e3*512 examples)
=> 2e6*1024/(69e3*512) = 58 days training",NVIDIA Tesla V100S PCIe 32 GB,Unreleased,,512.0,0.2269,,1242.0,,,,615532.4058320685,524288.0,"""we set the batch size to 1024 and use a learning rate of 1.0e4 warmed up over 10,000 iterations and decayed linearly
over 2 million iterations. Other training parameters are kept
the same as (Devlin et al., 2018).""
in Devlin et al (BERT), sequences are 512 tokens"
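The parameter-based and time-based estimates from the Megatron-BERT compute notes, and their geometric mean, which is the value recorded in the compute field.
import math

# Parameter-based: 6 * N * D with 2M iterations at batch 1024 and 1024 tokens per sequence
# (this is the token count used in the notes; the batch-size note quotes 512-token sequences).
param_based = 6 * 3.9e9 * 2e6 * 1024 * 1024
print(f"{param_based:.1e}")                  # ~4.9e22 FLOP (the notes round this to 4.8e22)

# Time-based: ~58 days at a sustained 15.1 PFLOP/s on 512 GPUs.
time_based = 15.1e15 * 58 * 24 * 3600
print(f"{time_based:.1e}")                   # ~7.6e22 FLOP

print(f"{math.sqrt(param_based * time_based):.2e}")  # ~6.1e22; with 4.8e22 this gives the recorded 6.03e22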
ResNet-152 + ObjectNet,Vision,Massachusetts Institute of Technology (MIT),"Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, Boris Katz",2019-09-06,Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models,https://papers.nips.cc/paper/2019/file/97af07a14cacba681feacf3012730892-Paper.pdf,Highly cited,,38000000.0,,1.94e+19,"3-5 days of training (say 4.5 days), 50 teraFLOP/second at 50% utilization rate (reported): 4.5 * 24 * 3600 * 5e13 = 1.94e19 FLOP",ObjectNet,,50000.0,"In total, 95,824 images were collected from 5,982 workers out of which 50,000 images were retained
after validation and included in the dataset","We collect a large real-world test set, ObjectNet, for object recognition with controls where object backgrounds, rotations, and imaging viewpoints are random. Most scientific experiments have controls, confounds which are removed from the data, to ensure that subjects cannot perform a task by exploiting trivial correlations in the data. Historically, large machine learning and computer vision datasets have lacked such controls. This has resulted in models that must be fine-tuned for new datasets and perform better on datasets than in real-world applications. When tested on ObjectNet, object detectors show a 40-45% drop in performance, with respect to their performance on other benchmarks, due to the controls for biases. Controls make ObjectNet robust to fine-tuning showing only small performance increases. We develop a highly automated platform that enables gathering datasets with controls by crowdsourcing image capturing and annotation. ObjectNet is the same size as the ImageNet test set (50,000 images), and by design does not come paired with a training set in order to encourage generalization. The dataset is both easier than ImageNet – objects are largely centered and unoccluded – and harder, due to the controls. Although we focus on object recognition here, data with controls can be gathered at scale using automated tools throughout machine learning to generate datasets that exercise models in new ways thus providing valuable feedback to researchers. This work opens up new avenues for research in generalizable, robust, and more human-like computer vision and in creating datasets where results are predictive of real-world performance.",,United States of America,Academia,,,,Unreleased,,,,,2393.0,,,,,,
UDSMProt,Biology,Fraunhofer Heinrich Hertz Institute,"Nils Strodthoff, Patrick Wagner, Markus Wenzel, and Wojciech Samek",2019-09-04,UDSMProt: Universal Deep Sequence Models for Protein Classification,https://www.biorxiv.org/content/10.1101/704874v2.full.pdf,SOTA improvement,"""The proposed method performs on par with state-of-the-art algorithms that were tailored to these specific tasks or, for two out of three tasks, even outperforms them.""",28303800.0,"Python code:
# Given LSTM parameters
emb_sz = 400 # embedding size, typically equal to the input size for the first layer
nh = 1150 # number of hidden units
nl = 3 # number of layers
# The formula for a single LSTM layer parameters is:
# P = 4 * ((input_dim + hidden_dim) * hidden_dim + hidden_dim)
# First layer parameters (input_dim is the embedding size)
first_layer_params = 4 * ((emb_sz + nh) * nh + nh)
# For subsequent layers, input_dim is equal to hidden_dim (nh)
subsequent_layer_params = 4 * ((nh + nh) * nh + nh)
# Total parameters for all layers
total_params = first_layer_params + (nl - 1) * subsequent_layer_params
print(total_params)",6.37e+17,"Pretraining:
Table 7 gives max of 499k sequences each at (seemingly) L=1024:
499k * 1024 * 28.3M * 6 = 8.7e16
Finetuning:
Largest downstream task has 104940 sequences (Table 5), each sequence has L=1024 residues, 28.3M parameters, and 30 epochs.
105k * 1024 * 30 * 28.3 * 6 = 5.5e17.","SwissProt,a subset of UniProtKB",,,560K proteins,"Motivation: Inferring the properties of a protein from its amino acid sequence is one of the key problems in bioinformatics. Most state-of-the-art approaches for protein classification tasks are tailored to single classi- fication tasks and rely on handcrafted features such as position-specific-scoring matrices from expensive database searches. We argue that this level of performance can be reached or even be surpassed by learning a task-agnostic representation once, using self-supervised language modeling, and transferring it to specific tasks by a simple finetuning step.
Results: We put forward a universal deep sequence model that is pretrained on unlabeled protein se- quences from Swiss-Prot and finetuned on protein classification tasks. We apply it to three prototypical tasks, namely enzyme class prediction, gene ontology prediction and remote homology and fold detection. The proposed method performs on par with state-of-the-art algorithms that were tailored to these specific tasks or, for two out of three tasks, even outperforms them. These results stress the possibility of inferring protein properties from the sequence alone and, on more general grounds, the prospects of modern natural language processing methods in omics.",Likely,Germany,Research collective,,,,Open source,,,,,,,,30.0,,,
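Continuing the Python snippet in the UDSMProt parameter notes, a sketch of the pretraining and finetuning compute figures from the compute notes (sequence counts, lengths, and epoch count as stated there).
# Compute estimate via 6 * params * residues processed.
params = 28_303_800                              # from the LSTM parameter count above

pretrain = 6 * params * 499_000 * 1024           # 499k Swiss-Prot sequences, L=1024 (Table 7)
finetune = 6 * params * 104_940 * 1024 * 30      # largest downstream task (Table 5), 30 epochs
print(f"{pretrain:.1e}")                         # ~8.7e16 FLOP
print(f"{finetune:.1e}")                         # ~5.5e17 FLOP
print(f"{pretrain + finetune:.2e}")              # ~6.4e17 FLOP (the notes round to 8.7e16 + 5.5e17 = 6.37e17)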
"Mogrifier (d2, MoS2, MC) + dynamic eval",Language,"DeepMind,University of Oxford","Gábor Melis, Tomáš Kočiský, Phil Blunsom",2019-09-04,Mogrifier LSTM,https://arxiv.org/abs/1909.01792,SOTA improvement,"""We establish a new state of the art on all datasets with the exception of Enwik8""",35000000.0,,,,WikiText-2,,,,,,"United Kingdom of Great Britain and Northern Ireland,United Kingdom of Great Britain and Northern Ireland","Industry,Academia",,,,Unreleased,,,,,109.0,,,145.0,,,
EN^2AS with performance reward,Language,"Beijing Institute of Technology,University of Technology Sydney,Monash University","Miao Zhang, Huiqi Li, Shirui Pan, Taoping Liu, Steven Su",2019-07-22,Efficient Novelty-Driven Neural Architecture Search,https://arxiv.org/abs/1907.09109,SOTA improvement,"""The best architecture obtained by our algorithm with
the same search space achieves the state-of-the-art test error rate of 2.51% on CIFAR-10""",23000000.0,,,,,,,,,,"China,Australia,Australia","Academia,Academia,Academia",,,,Unreleased,,,,,1.0,,,,,,
Pluribus,Games,Facebook AI Research,"Noam Brown, Tuomas Sandholm",2019-07-11,Superhuman AI for multiplayer poker,https://www.science.org/cms/asset/910714a7-ee2a-486e-9970-42fb893b08d9/pap.pdf,SOTA improvement,"first to beat humans at multiplayer poker: ""Developing a superhuman AI for multiplayer poker was the widely,recognized main remaining milestone. In this paper we describe Pluribus, an AI capable of defeating elite human professionals in six-player no-limit Texas hold’em poker, the most commonly played poker format in the world.""",,,6.6e+16,"Trained in 8 days on a 64 core CPU
https://ai.facebook.com/blog/pluribus-first-ai-to-beat-pros-in-6-player-poker/
""We trained the blueprint strategy for Pluribus in eight days on a 64-core server and required less than 512 GB of RAM. No GPUs were used. At typical cloud computing instance rates, it would cost less than $150 to train.""
Guess: trained on i7 Intel CPU, approx 5e9 FLOP/s for each core.
https://epochai.org/blog/estimating-training-compute
8 days, 64 cores, 5e9 FLOP/s, 30% utilization",,,,,,,United States of America,Industry,,,,Unreleased,,,,,594.0,,,,,,
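The CPU-only estimate from the Pluribus compute notes, with the per-core throughput and utilization explicitly flagged as guesses there.
# Pluribus blueprint training: 8 days on a 64-core server, no GPUs.
days = 8
cores = 64
flops_per_core = 5e9      # guess from the notes (roughly an Intel i7-class core)
utilization = 0.3         # assumed
flop = days * 24 * 3600 * cores * flops_per_core * utilization
print(f"{flop:.1e}")      # ~6.6e16 FLOP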
BigBiGAN,"Vision,Image generation",Google,"Spyros Gidaris, Praveer Singh, Nikos Komodakis",2019-07-04,Large Scale Adversarial Representation Learning,https://arxiv.org/abs/1907.02544,SOTA improvement,"""BigBiGAN, an unsupervised learning approach based purely on generative models, achieves state-of-the-art results in image representation learning on ImageNet""",86000000.0,https://openai.com/blog/image-gpt/#rfref53,,,ImageNet,"""We train a BigBiGAN on unlabeled ImageNet, freeze its learned
representation, and then train a linear classifier on its outputs, fully supervised using all of the training
set labels""",,,,,United States of America,Industry,,,,Open source,,,,,501.0,,,,,,
RoBERTa Large,Language,"Facebook,University of Washington","Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov",2019-07-01,RoBERTa: A Robustly Optimized BERT Pretraining Approach,https://arxiv.org/abs/1907.11692,Highly cited,,355000000.0,,4.15383552e+21,"Section 5: We pretrain our model using 1024 V100 GPUs for approximately one day.
Note this is the base pretraining comparable to BERT, 100k steps. Subsequently they do more: ""increasing the number of pretraining steps
from 100K to 300K, and then further to 500K"".
So assume 5x the estimate of 1024 V100 GPUs for 1 day. Mixed precision.
C=5*1024*3.13E+13*60**2*24*0.3 = 4.2e21","CC-News,BookCorpus (BooksCorpus, Toronto Book Corpus),WebText2,Wikipedia","""We consider five English-language corpora of
varying sizes and domains, totaling over 160GB
of uncompressed text. We use the following text
corpora:
• BOOKCORPUS (Zhu et al., 2015) plus English
WIKIPEDIA. This is the original data used to
train BERT. (16GB).
• CC-NEWS, which we collected from the English portion of the CommonCrawl News
dataset (Nagel, 2016). The data contains 63
million English news articles crawled between
September 2016 and February 2019. (76GB after filtering).4
• OPENWEBTEXT (Gokaslan and Cohen, 2019),
an open-source recreation of the WebText corpus described in Radford et al. (2019). The text
is web content extracted from URLs shared on
Reddit with at least three upvotes. (38GB).5
• STORIES, a dataset introduced in Trinh and Le
(2018) containing a subset of CommonCrawl
data filtered to match the story-like style of
Winograd schemas. (31GB).""",32000000000.0,160GB*200M words/GB = 3.2e10 words,,Likely,"United States of America,United States of America","Industry,Academia",120.0,"First the model is pretrained for 100k steps on 1024 GPUs for 1 day, then pretraining is increased to 500k steps, so assuming they used the same number of GPUs, this would have taken 5 days.",NVIDIA Tesla V100 DGXS 32 GB,Open source,,1024.0,,,19065.0,,,,82771.07593250347,,
Tensorized Transformer (257M),Language,"Tianjin University,Microsoft Research Asia,Beijing Institute of Technology","Xindian Ma, Peng Zhang, Shuai Zhang, Nan Duan, Yuexian Hou, Ming Zhou, Dawei Song",2019-06-24,A Tensorized Transformer for Language Modeling,https://arxiv.org/abs/1906.09777,SOTA improvement,"""Table 2: Results and compression with state-of-the-art results on PTB and WikiText-103""",257000000.0,,4.76e+18,,WikiText-103,,,,,,"China,China,China","Academia,Industry,Academia",,,,Unreleased,,,,,135.0,,,30.0,,,
Walking Minitaur robot,Robotics,"UC Berkeley,Google Brain","Tuomas Haarnoja, Sehoon Ha, Aurick Zhou, Jie Tan, George Tucker, Sergey Levine",2019-06-19,Learning to Walk via Deep Reinforcement Learning,https://arxiv.org/abs/1812.11103,SOTA improvement,,,,,,,,,,"Deep reinforcement learning (deep RL) holds the promise of automating the acquisition of complex controllers that can map sensory inputs directly to low-level actions. In the domain of robotic locomotion, deep RL could enable learning locomotion skills with minimal engineering and without an explicit model of the robot dynamics. Unfortunately, applying deep RL to real-world robotic tasks is exceptionally difficult, primarily due to poor sample complexity and sensitivity to hyperparameters. While hyperparameters can be easily tuned in simulated domains, tuning may be prohibitively expensive on physical systems, such as legged robots, that can be damaged through extensive trial-and-error learning. In this paper, we propose a sample-efficient deep RL algorithm based on maximum entropy RL that requires minimal per-task tuning and only a modest number of trials to learn neural network policies. We apply this method to learning walking gaits on a real-world Minitaur robot. Our method can acquire a stable gait from scratch directly in the real world in about two hours, without relying on any model or simulation, and the resulting policy is robust to moderate variations in the environment. We further show that our algorithm achieves state-of-the-art performance on simulated benchmarks with a single set of hyperparameters. Videos of training and the learned policy can be found on the project website.",Unknown,"United States of America,United States of America","Academia,Industry",,,,Unreleased,,,,,378.0,,,,,,
LaNet-L (CIFAR-10),Vision,"Brown University,Facebook","Linnan Wang, Saining Xie, Teng Li, Rodrigo Fonseca, Yuandong Tian",2019-06-17,Sample-Efficient Neural Architecture Search by Learning Action Space,https://arxiv.org/abs/1906.06832,SOTA improvement,"""In practice, LaNAS finds a network that achieves SOTA 99.0% accuracy on CIFAR-10""",44100000.0,44.1M,,"LaNet-L was trained on 150 GPU-days, however the GPU was not specified",CIFAR-10,,,,"Neural Architecture Search (NAS) has emerged as a promising technique for automatic neural network design. However, existing MCTS based NAS approaches often utilize manually designed action space, which is not directly related to the performance metric to be optimized (e.g., accuracy), leading to sample-inefficient explorations of architectures. To improve the sample efficiency, this paper proposes Latent Action Neural Architecture Search (LaNAS), which learns actions to recursively partition the search space into good or bad regions that contain networks with similar performance metrics. During the search phase, as different action sequences lead to regions with different performance, the search efficiency can be significantly improved by biasing towards the good regions. On three NAS tasks, empirical results demonstrate that LaNAS is at least an order more sample efficient than baseline methods including evolutionary algorithms, Bayesian optimizations, and random search. When applied in practice, both one-shot and regular LaNAS consistently outperform existing results. Particularly, LaNAS achieves 99.0% accuracy on CIFAR-10 and 80.8% top1 accuracy at 600 MFLOPS on ImageNet in only 800 samples, significantly outperforming AmoebaNet with 33x fewer samples. Our code is publicly available at this https URL.",Likely,"United States of America,United States of America","Academia,Industry",,,,Open access (non-commercial),,,,,42.0,,,600.0,,,
PG-SWGAN,Image generation,ETH Zurich,"Jiqing Wu, Zhiwu Huang, Dinesh Acharya, Wen Li, Janine Thoma, Danda Pani Paudel, Luc Van Gool",2019-06-15,Sliced Wasserstein Generative Models,https://openaccess.thecvf.com/content_CVPR_2019/html/Wu_Sliced_Wasserstein_Generative_Models_CVPR_2019_paper.html,SOTA improvement,"""For fair comparison, we equip the same progressive growing architecture with our proposed SWGAN objective and its dual
SWD blocks (PG-SWGAN). As shown in Fig. 3 (Right)
and Fig. 5, our PG-SWGAN can outperform PG-WGAN in
terms of both qualitative and quantitative comparison on the
CelebA-HQ and LSUN datasets""",,,,,"CIFAR-10,LSUN,CelebA",,,,"In generative modeling, the Wasserstein distance (WD) has emerged as a useful metric to measure the discrepancy between generated and real data distributions. Unfortunately, it is challenging to approximate the WD of high-dimensional distributions. In contrast, the sliced Wasserstein distance (SWD) factorizes high-dimensional distributions into their multiple one-dimensional marginal distributions and is thus easier to approximate. In this paper, we introduce novel approximations of the primal and dual SWD. Instead of using a large number of random projections, as it is done by conventional SWD approximation methods, we propose to approximate SWDs with a small number of parameterized orthogonal projections in an end-to-end deep learning fashion. As concrete applications of our SWD approximations, we design two types of differentiable SWD blocks to equip modern generative frameworks---Auto-Encoders (AE) and Generative Adversarial Networks (GAN). In the experiments, we not only show the superiority of the proposed generative models on standard image synthesis benchmarks, but also demonstrate the state-of-the-art performance on challenging high resolution image and video generation in an unsupervised manner.",Unknown,Switzerland,Academia,,,,Unreleased,,,,,115.0,,,,,,
FixRes ResNeXt-101 WSL,Vision,Facebook AI,"H Touvron, A Vedaldi, M Douze, H Jégou",2019-06-14,Fixing the train-test resolution discrepancy,https://arxiv.org/abs/1906.06423,SOTA improvement,"""To the best of our knowledge our ResNeXt-101 32x48d surpasses all other models available in the literature""",829000000.0,,,,ImageNet,,940000000.0,"""Conversely, when training a ResNeXt-101 32x48d pre-trained in weakly-supervised fashion on 940 million public images at resolution 224x224 and further optimizing for test resolution 320x320, we obtain a test top-1 accuracy of 86.4% (top-5: 98.0%) (single-crop)""",,,United States of America,Industry,,,,Open access (non-commercial),,,,,405.0,"https://medium.com/swlh/deepmind-achieved-starcraft-ii-grandmaster-level-but-at-what-cost-32891dd990e4#:~:text=According%20to%20the%20analysis%20by,Source%3A%20DeepMind.",,,,,
Char-CNN-BiLSTM,Language,Capital One,"Chris Larson, Tarek Lahlou, Diana Mingels, Zachary Kulis, Erik Mueller",2019-06-13,Telephonetic: Making Neural Language Models Robust to ASR and Semantic Noise,https://arxiv.org/abs/1906.05678,SOTA improvement,"""Notably, our language model achieves a test perplexity of 37.49 on PTB, which to our knowledge is state-of-the-art among models trained only on PTB.""",,,,,,,,,,Unknown,United States of America,Industry,,,,Unreleased,,,,,2.0,,,,,,
AWD-LSTM + MoS + Partial Shuffled,Language,University of Texas at Austin,"Dilin Wang, Chengyue Gong, Qiang Liu",2019-06-10,Improving Neural Language Modeling via Adversarial Training,https://arxiv.org/abs/1906.03805,SOTA improvement,"""our method improves on the single model state-of-the-art results for language modeling on Penn Treebank (PTB) and Wikitext-2, achieving test perplexity scores of 46.01 and 38.07, respectively""",35000000.0,,3.28e+17,,WikiText-2,,,,,,United States of America,Academia,,,,Open access (non-commercial),,,,,104.0,,,750.0,,,
Transformer-XL Large + Phrase Induction,Language,"Massachusetts Institute of Technology (MIT),University of Illinois Urbana-Champaign (UIUC)","Hongyin Luo, Lan Jiang, Yonatan Belinkov, James Glass",2019-06-04,"""Improving Neural Language Models by Segmenting, Attending, and Predicting the Future""",https://arxiv.org/abs/1906.01702,SOTA improvement,"""We achieved a new state-of-the-art performance of 17.4 perplexity on the Wikitext-103 dataset""",257000000.0,,7.3e+18,,WikiText-103,,,,,,"United States of America,United States of America","Academia,Academia",,,,Unreleased,,,,,12.0,,,1.0,,,
AMDIM,"Vision,Image generation",Microsoft Research,"Philip Bachman, R Devon Hjelm, William Buchwalter",2019-06-03,Learning Representations by Maximizing Mutual Information Across Views,https://arxiv.org/abs/1906.00910,Highly cited,,626000000.0,source: https://openai.com/blog/image-gpt/#rfref13e,,,"ImageNet,CIFAR-10","""We evaluate our model using standard datasets: CIFAR10, CIFAR100, STL10 [Coates et al., 2011], ImageNet1 [Russakovsky et al., 2015], and Places205 [Zhou et al., 2014].""",,,,,United States of America,Industry,,,,Open source,,,,,1307.0,,,,,,
XLNet,Language,"Carnegie Mellon University (CMU),Google Brain","Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le",2019-06-01,XLNet: Generalized Autoregressive Pretraining for Language Understanding,https://arxiv.org/abs/1906.08237,Highly cited,,340000000.0,"Same size as BERT-Large, which was 340M",8.9e+21,"""Specifically, we train on 512 TPU v3 chips for 500K steps with an Adam weight decay optimizer, linear learning rate decay, and a batch size of 8192, which takes about 5.5 days.""
123 teraFLOP/s * 5.5 days * 24 * 3600 * 512 chips * 0.3 utilization (assumption) ~= 8.9*10^21 FLOP
Alternatively, 500k steps * batch size 8192 * sequence length 512 = 2.1T training tokens. 340 million * 6 * 2.1 trillion = 4.3e21 FLOP. ","Wikipedia,BookCorpus (BooksCorpus, Toronto Book Corpus)","""Following BERT [10], we use the BooksCorpus [40] and English Wikipedia as part of our pretraining
data, which have 13GB plain text combined. In addition, we include Giga5 (16GB text) [26],
ClueWeb 2012-B (extended from [5]), and Common Crawl [6] for pretraining. We use heuristics
to aggressively filter out short or low-quality articles for ClueWeb 2012-B and Common Crawl,
which results in 19GB and 110GB text respectively. After tokenization with SentencePiece [17], we
obtain 2.78B, 1.09B, 4.75B, 4.30B, and 19.97B subword pieces for Wikipedia, BooksCorpus, Giga5,
ClueWeb, and Common Crawl respectively, which are 32.89B in total.""",,,"With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.",Confident,"United States of America,United States of America","Academia,Industry",,,Google TPU v3,Open source,,,,,7267.0,,,,,,
XLM,Language,Facebook,"G Lample, A Conneau",2019-06-01,Cross-lingual Language Model Pretraining,https://arxiv.org/abs/1901.07291,"Highly cited,SOTA improvement","""On supervised machine translation, we obtain a new state of the art of 38.5 BLEU on WMT’16 Romanian-English, outperforming the previous best approach by more
than 4 BLEU""",665000000.0,,,,,"subset of Wikipedia: ""We use WikiExtractor2
to extract raw sentences
from Wikipedia dumps and use them as monolingual data for the CLM and MLM objectives.""",,,,,United States of America,Industry,,,,Open access (non-commercial),,,,,2430.0,,,,,,
DLRM-2020,Recommendation,Facebook AI,"M Naumov, D Mudigere, HJM Shi, J Huang",2019-05-31,Deep Learning Recommendation Model for Personalization and Recommendation Systems,https://arxiv.org/abs/1906.00091,SOTA improvement,"""In this paper, we develop a state-of-the-art deep learning recommendation model
(DLRM)""",100000000000.0,"Figure 1
https://arxiv.org/abs/2104.05158",4e+18,"Figure 1
https://arxiv.org/abs/2104.05158",,,,,,,United States of America,Industry,,,,Unreleased,,,,,567.0,,,,,,
MnasNet-A3,Vision,Google,"Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, Quoc V. Le",2019-05-29,MnasNet: Platform-Aware Neural Architecture Search for Mobile,https://arxiv.org/abs/1807.11626,Highly cited,,5200000.0,From https://arxiv.org/pdf/1807.11626.pdf,1.5e+21,"""each architecture search takes 4.5 days on 64 TPUv2 devices""
This seems to be referring to a TPUv2 pod, consisting of 64 four-chip modules. The total performance is 11.5 petaFLOPS.
https://en.wikipedia.org/wiki/Tensor_Processing_Unit#Second_generation_TPU
Assuming a 33% utilization rate:
4.5 days * 64 * 180 teraFLOPS * 0.33 = 1.48*10^21 FLOP
However, it is unclear if ""64 TPUv2 devices"" refers to chips or modules, so the true compute might be 1/4 of this amount.",ImageNet,,1280000.0,"""In this paper, we directly perform our architecture search on the ImageNet training set but with fewer training steps (5 epochs). As a common practice, we reserve randomly selected 50K images from the training set as the fixed validation set. ""","Designing convolutional neural networks (CNN) for mobile devices is challenging because mobile models need to be small and fast, yet still accurate. Although significant efforts have been dedicated to design and improve mobile CNNs on all dimensions, it is very difficult to manually balance these trade-offs when there are so many architectural possibilities to consider. In this paper, we propose an automated mobile neural architecture search (MNAS) approach, which explicitly incorporate model latency into the main objective so that the search can identify a model that achieves a good trade-off between accuracy and latency. Unlike previous work, where latency is considered via another, often inaccurate proxy (e.g., FLOPS), our approach directly measures real-world inference latency by executing the model on mobile phones. To further strike the right balance between flexibility and search space size, we propose a novel factorized hierarchical search space that encourages layer diversity throughout the network. Experimental results show that our approach consistently outperforms state-of-the-art mobile CNN models across multiple vision tasks. On the ImageNet classification task, our MnasNet achieves 75.2% top-1 accuracy with 78ms latency on a Pixel phone, which is 1.8x faster than MobileNetV2 [29] with 0.5% higher accuracy and 2.3x faster than NASNet [36] with 1.2% higher accuracy. Our MnasNet also achieves better mAP quality than MobileNets for COCO object detection. Code is at this https URL",Speculative,United States of America,Industry,108.0,,Google TPU v3,Open source,,256.0,,,2607.0,,,,9551.591619865148,,
MnasNet-A1 + SSDLite,Vision,Google,"Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, Quoc V. Le",2019-05-29,MnasNet: Platform-Aware Neural Architecture Search for Mobile,https://arxiv.org/abs/1807.11626,Highly cited,,4900000.0,From https://arxiv.org/pdf/1807.11626.pdf,1.5e+21,"""each architecture search takes 4.5 days on 64 TPUv2 devices""
This seems to be referring to a TPUv2 pod, consisting of 64 four-chip modules. The total performance is 11.5 petaFLOPS.
https://en.wikipedia.org/wiki/Tensor_Processing_Unit#Second_generation_TPU
Assuming a 33% utilization rate:
4.5 days * 64 * 180 teraFLOPS * 0.33 = 1.48*10^21 FLOP
However, it is unclear if ""64 TPUv2 devices"" refers to chips or modules, so the true compute might be 1/4 of this amount.",COCO,,118000.0,,"Designing convolutional neural networks (CNN) for mobile devices is challenging because mobile models need to be small and fast, yet still accurate. Although significant efforts have been dedicated to design and improve mobile CNNs on all dimensions, it is very difficult to manually balance these trade-offs when there are so many architectural possibilities to consider. In this paper, we propose an automated mobile neural architecture search (MNAS) approach, which explicitly incorporate model latency into the main objective so that the search can identify a model that achieves a good trade-off between accuracy and latency. Unlike previous work, where latency is considered via another, often inaccurate proxy (e.g., FLOPS), our approach directly measures real-world inference latency by executing the model on mobile phones. To further strike the right balance between flexibility and search space size, we propose a novel factorized hierarchical search space that encourages layer diversity throughout the network. Experimental results show that our approach consistently outperforms state-of-the-art mobile CNN models across multiple vision tasks. On the ImageNet classification task, our MnasNet achieves 75.2% top-1 accuracy with 78ms latency on a Pixel phone, which is 1.8x faster than MobileNetV2 [29] with 0.5% higher accuracy and 2.3x faster than NASNet [36] with 1.2% higher accuracy. Our MnasNet also achieves better mAP quality than MobileNets for COCO object detection. Code is at this https URL",Speculative,United States of America,Industry,108.0,,Google TPU v3,Open source,,256.0,,,2607.0,,,,9551.591619865148,,
EfficientNet-L2,Vision,Google,"M Tan, Q Le",2019-05-28,EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,https://arxiv.org/abs/1905.11946,Highly cited,,480000000.0,,,,ImageNet,,,,,,United States of America,Industry,,,,Open source,,,,,13444.0,,,,,,
CPC v2,"Vision,Image generation","DeepMind,UC Berkeley",,2019-05-22,Data-Efficient Image Recognition with Contrastive Predictive Coding,https://arxiv.org/abs/1905.09272,SOTA improvement,"""this unsupervised representation substantially improves transfer learning to object detection on the
PASCAL VOC dataset, surpassing fully supervised pre-trained ImageNet classifiers""",303000000.0,source: https://openai.com/blog/image-gpt/#rfref25d,,,ImageNet,"""In all cases, the dataset of unlabeled images Du we pre-train
on is the full ImageNet ILSVRC 2012 training set""",,,,,"United Kingdom of Great Britain and Northern Ireland,United States of America","Industry,Academia",,,,Unreleased,,,,,491.0,,,,,,
AWD-LSTM-DRILL + dynamic evaluation† (WT2),Language,IDIAP,"Nikolaos Pappas, James Henderson",2019-05-14,Deep Residual Output Layers for Neural Language Generation,https://arxiv.org/abs/1905.05513,SOTA improvement,"""our models improve over the state-of-the-art by +1.6 perplexity on PennTreebank and by +3.9 perplexity on
Wikitext-2""",34000000.0,,4.24e+17,,WikiText-2,,,,,,Switzerland,Academia,,,,Open source,,,,,7.0,,,1000.0,,,
ResNeXt-101 Billion-scale,Vision,Facebook AI,"IZ Yalniz, H Jégou, K Chen, M Paluri",2019-05-02,Billion-scale semi-supervised learning for image classification,https://arxiv.org/abs/1905.00546,SOTA improvement,"""We demonstrate the performance of our method on popular classification benchmarks for both images and videos and significantly outperforms the state of the art.""",193000000.0,,,,YFCC-100M,,,,,,United States of America,Industry,,,,Open access (non-commercial),,,,,415.0,,,,,,
ResNet-50 Billion-scale,Vision,Facebook AI,"I. Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, Dhruv Mahajan",2019-05-02,Billion-scale semi-supervised learning for image classification,https://arxiv.org/abs/1905.00546,Highly cited,,25000000.0,25M parameters vanilla ResNet50,,,YFCC-100M,"""The following web-scale datasets are used for
semi-supervised learning experiments involving an unlabeled dataset U.
• YFCC-100M [38] is a publicly available dataset of about
90 million images from Flickr website with associated
tags. After removing duplicates, we use this data for
most experiments and ablations.
• IG-1B-Targeted: Following [27], we collected a dataset
of 1B public images with associated hashtags from a
social media website. We consider images tagged with
at least one of the 1500 hashtags associated with one of
the 1000 ImageNet-1k classes.""",1090000000.0,"1 billion + 90 million, per above",,,United States of America,Industry,,,,Open access (non-commercial),,,,,415.0,,,,,,
Neuro-Symbolic Concept Learner,"Vision,Language","Massachusetts Institute of Technology (MIT),Tsinghua University,MIT-IBM Watson AI Lab,DeepMind","Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, Jiajun Wu",2019-04-26,"The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision",https://arxiv.org/abs/1904.12584,SOTA improvement,"""NS-CL’s modularized design enables interpretable, robust, and accurate visual reasoning: it achieves state-of-the-art performance on the CLEVR datase""",,,,,"CLEVR,VQS,ImageNet","CLEVR, ImageNet, VQS
5000 in CLEVR
64509 in VQS
and whole ImageNet for pretraining
""We train NS-CL on 5K images (<10% of CLEVR’s 70K training images). We generate 20 questions for each image for the entire curriculum learning process""
section 4.3 ""All models use a pre-trained semantic parser on the full CLEVR dataset""
""The only extra supervision of the visual perception module comes from the pre-training of the perception modules on ImageNet (Deng et al., 2009). To quantify the influence of this pre-training""
In appendix G.2 (VQS Dataset):""All models are trained on the first 63,509 images of the training set, and tested on the test split. For hyper-parameter tuning and model selection, the rest 5,000 images from the training set are used for validation.",,"CLEVR, ImageNet, VQS
5000 in CLEVR
64509 in VQS
and whole ImageNet for pretraining
""We train NS-CL on 5K images (<10% of CLEVR’s 70K training images). We generate 20 questions for each image for the entire curriculum learning process""
section 4.3 ""All models use a pre-trained semantic parser on the full CLEVR dataset""
""The only extra supervision of the visual perception module comes from the pre-training of the perception modules on ImageNet (Deng et al., 2009). To quantify the influence of this pre-training""
In appendix G.2 (VQS Dataset):
""All models are trained on the first 63,509 images of the training set, and tested on the test split. For hyper-parameter tuning and model selection, the rest 5,000 images from the training set are used for validation.","We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns visual concepts, words, and semantic parsing of sentences without explicit supervision on any of them; instead, our model learns by simply looking at images and reading paired questions and answers. Our model builds an object-based scene representation and translates sentences into executable, symbolic programs. To bridge the learning of two modules, we use a neuro-symbolic reasoning module that executes these programs on the latent scene representation. Analogical to human concept learning, the perception module learns visual concepts based on the language description of the object being referred to. Meanwhile, the learned visual concepts facilitate learning new words and parsing new sentences. We use curriculum learning to guide the searching over the large compositional space of images and language. Extensive experiments demonstrate the accuracy and efficiency of our model on learning visual concepts, word representations, and semantic parsing of sentences. Further, our method allows easy generalization to new object attributes, compositions, language concepts, scenes and questions, and even new program domains. It also empowers applications including visual question answering and bidirectional image-text retrieval.",Unknown,"United States of America,China,United States of America,United Kingdom of Great Britain and Northern Ireland","Academia,Academia,Academia,Industry",,,,Unreleased,,,,,695.0,,,,,,
DANet,Vision,Chinese Academy of Sciences,"Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, Hanqing Lu",2019-04-21,Dual Attention Network for Scene Segmentation,https://openaccess.thecvf.com/content_CVPR_2019/html/Fu_Dual_Attention_Network_for_Scene_Segmentation_CVPR_2019_paper.html,Highly cited,,,,,,"Cityscapes,COCO-Stuff,PASCAL-Context",,,,,Unknown,China,Academia,,,,Open source,,,,,4252.0,,,,,,
BERT-Large-CAS (PTB+WT2+WT103),Language,Amazon,"Chenguang Wang, Mu Li, Alexander J. Smola",2019-04-20,Language Models with Transformers,https://arxiv.org/abs/1904.09408,SOTA improvement,"""CAS achieves perplexities between 20.42 and 34.11 on all problems, i.e. on average an improvement of 12.0 perplexity units compared to state-of-the-art LSTMs""",395000000.0,,5.21e+20,,"Penn TreeBank,WikiText-2,WikiText-103",,,,,,United States of America,Industry,,,,Unreleased,,,,,110.0,,,50.0,,,
SpecAugment,Language,Google Brain,"Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, Quoc V. Le",2019-04-18,SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,https://arxiv.org/abs/1904.08779,Highly cited,,,,,,"LibriSpeech,Switchboard,Fisher",,,,,Unknown,United States of America,Industry,,,,Unreleased,,,,,2946.0,,,,,,
Transformer-XL + RMS dynamic eval,Language,University of Edinburgh,"Ben Krause, Emmanuel Kahembwe, Iain Murray, Steve Renals",2019-04-17,Dynamic Evaluation of Transformer Language Models,https://arxiv.org/abs/1904.08378,SOTA improvement,"""By applying dynamic evaluation to Transformer-XL models, we improve the state of the art on enwik8 from 0.99 to 0.94 bits/char, text8 from 1.08 to 1.04 bits/char, and WikiText-103 from 18.3 to 16.4 perplexity points.""",257000000.0,,,,WikiText-103,,,,,,United Kingdom of Great Britain and Northern Ireland,Academia,,,,Unreleased,,,,,40.0,,,,,,
WeNet (Penn Treebank),Language,Amazon,"Zhiheng Huang, Bing Xiang",2019-04-08,WeNet: Weighted Networks for Recurrent Network Architecture Search,https://arxiv.org/abs/1904.03819,SOTA improvement,"""We show that an architecture found by WeNets achieves state-of-the-art results on the Penn Treebank language dataset""",23000000.0,Table 1,,,Penn TreeBank,,,,,Confident,United States of America,Industry,,,,Unreleased,,,,,5.0,,,6000.0,,,
True-Regularization+Finetune+Dynamic-Eval,Language,"Mobvoi,Williams College","Yangyang Shi, Mei-Yuh Hwang, Xin Lei, Haoyu Sheng",2019-04-08,Knowledge Distillation For Recurrent Neural Network Language Modeling With Trust Regularization,https://arxiv.org/abs/1904.04163,SOTA improvement,"""In the first experiment, the student model achieves state-of-the-art perplexity results on the Penn Treebank dataset [1] with a model size one third of that of the
previously published best model""",7000000.0,,,,Penn TreeBank,,,,,,"China,United States of America","Industry,Academia",,,,Unreleased,,,,,24.0,,,,,,
Cross-lingual alignment,Language,"Tel Aviv University,Massachusetts Institute of Technology (MIT)","Tal Schuster, Ori Ram, Regina Barzilay, Amir Globerson",2019-04-04,"Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing",https://arxiv.org/abs/1902.09492,SOTA improvement,"""our method consistently outperforms the previous state-of-the-art on 6 tested languages""",,,2.56e+18,"From author communication:
Precision: float32
Hardware: 4 GPU NVIDIA 1080Ti
NVIDIA 1080Ti: 1.06E+13
Compute: 7 GPU-days
0.4 * 1.06E+13 FLOP/s * 7 days * 24h/day * 3600s/h
= 2.56E+18","Wikipedia,CoNLL2017",,,,,,"Israel,United States of America","Academia,Academia",,,NVIDIA GeForce GTX 1080 Ti,Open source,,,,ELMo,190.0,,,,,,
FAIRSEQ Adaptive Inputs,Language,"Facebook AI Research,Google Brain","Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli",2019-04-01,"""fairseq: A Fast, Extensible Toolkit for Sequence Modeling""",https://arxiv.org/abs/1904.01038,Highly cited,,247000000.0,,7.3e+18,,WikiText-103,,,,,,"United States of America,United States of America","Industry,Industry",,,,Unreleased,,,,,2791.0,,,,,,
SciBERT,Language,Allen Institute for AI,"Iz Beltagy, Kyle Lo, Arman Cohan",2019-03-26,SciBERT: A Pretrained Language Model for Scientific Text,https://arxiv.org/abs/1903.10676,"Highly cited,SOTA improvement","""We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks""",110000000.0,"110M
size of bert base from https://huggingface.co/google-bert/bert-base-uncased
relevant citation:
""We use the original BERT code to
train SCIBERT on our corpus with the same con-
figuration and size as BERT-Base. We train 4
different versions of SCIBERT: (i) cased or un-
cased and (ii) BASEVOCAB or SCIVOCAB. The
two models that use BASEVOCAB are finetuned
from the corresponding BERT-Base models. The
other two models that use the new SCIVOCAB are
trained from scratch.""",8.926848e+19,"4*123e12*0.3*(7*24*3600) = 8.926848e+19
(num gpu) * (peak compute) * (assumed utilization rate) * (time in seconds)
We have:
4 TPU v3 chips, 123 teraFLOPs per chip.
7 days of training
""We use a single TPU v3 with 8 cores. Training the SCIVOCAB models from scratch on our corpus takes 1 week (5 days with max length 128, then 2 days with max length 512). """,,"""We train SCIBERT on a random
sample of 1.14M papers from Semantic
Scholar (Ammar et al., 2018). """,3300000000.0,"""The average paper length is 154 sentences (2,769 tokens) resulting in a corpus size of 3.17B tokens, similar to the 3.3B tokens
on which BERT was trained.""","Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at this https://github.com/allenai/scibert/",Confident,United States of America,Research collective,168.0,1 week,Google TPU v3,Open source,,4.0,,,2808.0,,,,247.26289010271603,,
NMT Transformer 437M,Language,"Google,Bar-Ilan University","Roee Aharoni, Melvin Johnson, Orhan Firat",2019-02-28,Massively Multilingual Neural Machine Translation,https://arxiv.org/abs/1903.00089,SOTA improvement,"""We report results on the publicly available TED talks multilingual corpus where we show that massively multilingual many-to-many models are effective in low resource settings, outperforming the previous state-of-the-art while supporting up to 59 languages.""",437700000.0,"""Regarding the model, for these experiments we
use a larger Transformer model with 6 layers in
both the encoder and the decoder, model dimension set to 1024, hidden dimension size of 8192,
and 16 attention heads. This results in a model
with approximately 473.7M parameters.""",,,,"""Since we are not aware of a publicly available resource for this purpose, we construct an in-house
dataset. This dataset includes 102 language pairs
which we “mirror” to-and-from English, with up
to one million examples per language pair. This
results in 103 languages in total, and 204 translation directions which we train simultaneously.""
96M total examples, per Table 4",,"96M total examples, per Table 4. One sentence per example?","Multilingual neural machine translation (NMT) enables training a single model that supports translation from multiple source languages into multiple target languages. In this paper, we push the limits of multilingual NMT in terms of number of languages being used. We perform extensive experiments in training massively multilingual NMT models, translating up to 102 languages to and from English within a single model. We explore different setups for training such models and analyze the trade-offs between translation quality and various modeling decisions. We report results on the publicly available TED talks multilingual corpus where we show that massively multilingual many-to-many models are effective in low resource settings, outperforming the previous state-of-the-art while supporting up to 59 languages. Our experiments on a large-scale dataset with 102 languages to and from English and up to one million examples per direction also show promising results, surpassing strong bilingual baselines and encouraging future work on massively multilingual NMT.",Confident,"United States of America,Israel","Industry,Academia",,,,Unreleased,,,,,520.0,,,,,,
KataGo,Games,Jane Street,David J. Wu,2019-02-27,Accelerating Self-Play Learning in Go,https://arxiv.org/abs/1902.10565,SOTA improvement,Better than ELF OpenGo while using 1/50th the compute.,2500000.0,https://arxiv.org/abs/2210.00849 gives parameter count for AlphaZero in Fig 1b.,2.32e+19,"""[KataGo] surpasses the strength of ELF OpenGo after training on about 27 V100 GPUs for 19 days""
14.13 teraFLOP/s * 19 days = 2.32e+19 FLOP",,"Self-play: ""In total, KataGo’s main run lasted for 19 days using a maximum of 28 V100 GPUs at any time (averaging 26-27) and generated about 241 million training samples across 4.2 million games.""",241000000.0,241 million training samples across 4.2 million games,"By introducing several improvements to the AlphaZero process and architecture, we greatly accelerate self-play learning in Go, achieving a 50x reduction in computation over comparable methods. Like AlphaZero and replications such as ELF OpenGo and Leela Zero, our bot KataGo only learns from neural-net-guided Monte Carlo tree search self-play. But whereas AlphaZero required thousands of TPUs over several days and ELF required thousands of GPUs over two weeks, KataGo surpasses ELF's final model after only 19 days on fewer than 30 GPUs. Much of the speedup involves non-domain-specific improvements that might directly transfer to other problems. Further gains from domain-specific techniques reveal the remaining efficiency gap between the best methods and purely general methods such as AlphaZero. Our work is a step towards making learning in state spaces as large as Go possible without large-scale computational resources.",Speculative,Multinational,Industry,456.0,27 processors for 19 days,NVIDIA Tesla V100 DGXS 16 GB,Open source,,,,,76.0,,,,104.91425851608678,,
ProxylessNAS,Vision,Massachusetts Institute of Technology (MIT),"Han Cai, Ligeng Zhu, and Song Han",2019-02-23,ProxylessNAS: Direct neural architecture search on target task and hardware,https://arxiv.org/abs/1812.00332,Highly cited,,,,3.70656e+19,"For their searched Imagenet models, they used 200 GPU hours on a V100 GPU.
At FP32, a V100 GPU has a peak performance of 1.56E+14 FLOPS.
Utilization rate of 0.33.",ImageNet,,1280000.0,,"Neural architecture search (NAS) has a great impact by automatically designing effective neural network architectures. However, the prohibitive computational demand of conventional NAS algorithms (e.g. 104 GPU hours) makes it difficult to \emph{directly} search the architectures on large-scale tasks (e.g. ImageNet). Differentiable NAS can reduce the cost of GPU hours via a continuous representation of network architecture but suffers from the high GPU memory consumption issue (grow linearly w.r.t. candidate set size). As a result, they need to utilize~\emph{proxy} tasks, such as training on a smaller dataset, or learning with only a few blocks, or training just for a few epochs. These architectures optimized on proxy tasks are not guaranteed to be optimal on the target task. In this paper, we present \emph{ProxylessNAS} that can \emph{directly} learn the architectures for large-scale target tasks and target hardware platforms. We address the high memory consumption issue of differentiable NAS and reduce the computational cost (GPU hours and GPU memory) to the same level of regular training while still allowing a large candidate set. Experiments on CIFAR-10 and ImageNet demonstrate the effectiveness of directness and specialization. On CIFAR-10, our model achieves 2.08\% test error with only 5.7M parameters, better than the previous state-of-the-art architecture AmoebaNet-B, while using 6× fewer parameters. On ImageNet, our model achieves 3.1\% better top-1 accuracy than MobileNetV2, while being 1.2× faster with measured GPU latency. We also apply ProxylessNAS to specialize neural architectures for hardware with direct hardware metrics (e.g. latency) and provide insights for efficient CNN architecture design.",,United States of America,Academia,,,NVIDIA V100,Open source,,,,,1806.0,,,,122.74110657965193,,
GPT-2 (1.5B),Language,OpenAI,"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever",2019-02-14,Language Models are Unsupervised Multitask Learners,https://openai.com/blog/better-language-models/,Highly cited,,1500000000.0,"""GPT-2 is a large transformer-based language model with 1.5 billion parameters""",4.3e+21,"We use COMPUTE = FORWARD COMPUTE PER TOKEN * 3 BACKWARD FORWARD ADJUSTMENT* N EPOCHS * N TOKENS IN TRAINING DATASET
The number of epochs is not reported, but this other paper [1] claims in table 1 that it is 20 or 100 epochs. 100 epochs is consistent with the original GPT paper.
40GB dataset is 8B words, or 1/0.75 * 8B = 10.66B tokens.
6 * (40 * 200 million * 1/0.75 * 20) * 1.5 billion parameters = 1.92e21
6 * (40 * 200 million * 1/0.75 * 100) * 1.5 billion parameters = 9.6e21
Geometric mean is 4.29e21
[1] https://arxiv.org/abs/1906.06669",WebText,,3000000000.0,"“All results presented in this paper use a preliminary version of WebText which does not include links created after Dec 2017 and which after de-duplication and some heuristic based cleaning contains slightly over 8 million documents for a total of 40 GB of text.”
40GB is approximately 8e9 words.
","Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on taskspecific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.",,United States of America,Industry,,,,Open source,,,,,16305.0,,,20.0,,,
Hanabi 4 player,Games,"DeepMind,University of Oxford,Carnegie Mellon University (CMU),Google Brain",,2019-02-01,The Hanabi Challenge: A New Frontier for AI Research,https://arxiv.org/abs/1902.00506,Historical significance,Adapted some SOTA RL algorithms to a new task that posed research challenges,764000.0,source: https://docs.google.com/spreadsheets/d/1Kj4Q5WADcDXtUJLIOfGTCE3tGvxNczEMwyy8QtgSkHk/edit#gid=54587040&fvid=1361937389,4.3e+18,14.13e+12 FLOP/s * 7 days * 86400 s/day * 0.50 utilization = 4.3e+18 FLOP,,,,,,,"United Kingdom of Great Britain and Northern Ireland,United Kingdom of Great Britain and Northern Ireland,United States of America,United States of America","Industry,Academia,Academia,Industry",,,,Unreleased,,,,,229.0,"7 days on V100 –> 7 * 24 * $0.55 = $92.40
Adjust to 2020 dollars: $78.32",,,,,
MT-DNN,Language,Microsoft,"X Liu, P He, W Chen, J Gao",2019-01-31,Multi-Task Deep Neural Networks for Natural Language Understanding,https://arxiv.org/abs/1901.11504,"Highly cited,SOTA improvement","""MT-DNN extends the model proposed in Liu et al. (2015) by incorporating a pre-trained bidirectional transformer language model, known as BERT (Devlin et al., 2018). MT-DNN obtains new state-of-the-art results on ten NLU tasks, including SNLI, SciTail, and eight out of nine GLUE tasks, pushing the GLUE benchmark to 82.7% (2.2% absolute improvement)""",330000000.0,,,,"GLUE,SciTail","GLUE, SNLI, and SciTail ",,,,,United States of America,Industry,,,,Open source,,,,,1217.0,,,,,,
Transformer-XL (257M),Language,"Carnegie Mellon University (CMU),Google Brain","Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov",2019-01-09,Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context,https://arxiv.org/abs/1901.02860,Highly cited,,257000000.0,"Transformer-XL Large, Table 1",1.09e+19,,WikiText-103,,,,,,"United States of America,United States of America","Academia,Industry",,,,Open source,,,,,3155.0,,,,,,
Decoupled weight decay regularization,Vision,University of Freiburg,Ilya Loshchilov and Frank Hutter,2019-01-04,Decoupled weight decay regularization.,https://arxiv.org/abs/1711.05101,Highly cited,,36500000.0,"From author communication
WideResNet 28-10 models with 36.5 million parameters (3.65E+07)",2.47e+18,"From author communication
Per image: 5.24 billion FLOPs (5.24E+09) Per training run: 50k times 5.24E+09 times 1800 epochs = 2.47E+18 FLOPs",CIFAR-10,,50000.0,,,,Germany,Academia,,,,Open source,,,,,2061.0,,,,,,
Transformer ELMo,Language,"Allen Institute for AI,University of Washington","ME Peters, M Neumann, L Zettlemoyer, W Yih",2019-01-01,Dissecting Contextual Word Embeddings: Architecture and Representation,https://www.semanticscholar.org/paper/Dissecting-Contextual-Word-Embeddings%3A-Architecture-Peters-Neumann/ac11062f1f368d97f4c826c317bf50dcc13fdb59,SOTA improvement,"""Our model is the Reconciled Span Parser (RSP; Joshi et al., 2018), which, using ELMo representations, achieved state of the art performance for this
task. As shown in Table 2, the LSTM based models demonstrate the best performance with a 0.2% and 1.0% improvement over the Transformer and CNN models, respectively""",56000000.0,,,,,More info on this is extractable with some time,,,,,"United States of America,United States of America","Research collective,Academia",,,,Unreleased,,,,,373.0,,,,,,
GPipe (Transformer),Language,Google,"Y Huang, Y Cheng, A Bapna, O Firat",2018-11-16,GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism,https://arxiv.org/abs/1811.06965,"Highly cited,SOTA improvement","""We train a single 6-billion-parameter,
128-layer Transformer model on a corpus spanning over 100 languages and achieve better quality than all bilingual models.""",6000000000.0,Section 5: ,,,,,20000000000.0,"[WORDS]
Section 5: ""We use a
corpus of parallel documents over 102 languages and English, containing a total of 25 billion training examples, ranging from 10^4 to 10^9 per language""
10^9 sentences * 20 words per sentence",,,United States of America,Industry,,,,,,,,,1218.0,,,,,,
GPipe (Amoeba),Vision,Google,"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen",2018-11-16,GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism,https://arxiv.org/abs/1811.06965,Highly cited,,557000000.0,Section 4,,,ImageNet,,1281167.0,Table 4,,,United States of America,Industry,,,,,,,,,1218.0,,,,,,
Multi-cell LSTM,Language,University of Hyderabad,"Thomas Cherian, Akshay Badola, Vineet Padmanabhan",2018-11-15,Multi-cell LSTM Based Neural Language Model,https://arxiv.org/abs/1811.06477,SOTA improvement,"""The proposed multi-cell LSTM language models outperform the state-of-the-art results on well-known Penn Treebank (PTB) setup""",7200000.0,,2010000000000000.0,,,,,,,,India,Academia,,,,Unreleased,,,,,6.0,,,50.0,,,
Fine-tuned-AWD-LSTM-DOC(fin),Language,Samsung R&D Institute Russia,"Vadim Popov, Mikhail Kudinov",2018-11-12,Fine-tuning of Language Models with Discriminator,https://arxiv.org/abs/1811.04623,SOTA improvement,"""The novel approach that we propose allows us to reach state-of-theart quality on Penn Treebank: perplexity decreases from 52.4 to 52.1.""",35000000.0,"Model used for language experiments comes from https://aclanthology.org/D18-1489/ and https://arxiv.org/abs/1711.03953
See Table 7 (Proposed method) in first link for model used in Penn Treebank experiment, and Table 2 (Ours) in second link for model used on WikiText-2 experiment.
Parameter count for ""large scale experiment"" in section 4.3, a single-layer LSTM with 500 hidden units, seems difficult to accurately estimate. Output dimensionality uses differentiated softmax, varying from 16-150.",1920000000000000.0,"From section 4.3, 90 epochs total on ~1.07B tokens.
2 * 23M parameters * 3 * 1.067e9 tokens * (60+30) epochs = 7.6e19 FLOP",,,1066666667.0,"Large scale experiment: 4GB of text * 200M words/GB * (0.75 words/token)^-1 = 1,066,666,667 tokens",,Speculative,Russia,Industry,,,,Unreleased,,,,,2.0,,,15.0,,,
Mesh-TensorFlow Transformer 4.9B (language modelling),Language,Google Brain,"Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, Blake Hechtman",2018-11-05,Mesh-TensorFlow: Deep Learning for Supercomputers,https://arxiv.org/abs/1811.02084,SOTA improvement,"'Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing state of the art results on WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark.'",4900000000.0,4.9B from section 9.1 : ''The largest model (4.9B parameters) took 13 hours to train on a 512-core TPUv2 cluster.',1.617408e+20,"flops = (256) * ( 45 * 10**12) * (13 * 3600) * (0.3) = 1.6e20
(num gpu) * (peak flops) * (time in seconds) * (assumed utilization rate)
from section 9.1 : ''The largest model (4.9B parameters) took 13 hours to train on a 512-core TPUv2 cluster.'
from https://en.wikipedia.org/wiki/Tensor_Processing_Unit
45TFLOPs per chips","Wikipedia,One Billion Word benchmark",from section 9.1 Wikipedia and one-billion-word language modeling benchmark.,6333333333.333333,"from section 9.1. Experiments done on a ""billion word benchmark"" and a 5B token wikipedia dataset. At 4/3 tokens per word, 1.3B tokens in the first.","Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and to implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the ""batch"" dimension, in Mesh-TensorFlow, the user can specify any tensor-dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-TensorFlow graph compiles into a SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing state of the art results on WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark. Mesh-Tensorflow is available at this https URL .",Confident,United States of America,Industry,13.0,"from section 9.1 ""For the billion-word language modeling benchmark, we trained the models for 10 epochs. The largest model (4.9B parameters) took 13 hours to train on a 512-core TPUv2 cluster.""",Google TPU v2,,,256.0,,,357.0,,,10.0,935.3300509163912,,
Mesh-TensorFlow Transformer 2.9B (translation),Language,Google Brain,"Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, Blake Hechtman",2018-11-05,Mesh-TensorFlow: Deep Learning for Supercomputers,https://arxiv.org/abs/1811.02084,SOTA improvement,"'Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing state of the art results on WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark.'",2900000000.0,"2.9B from section 9.1 : ""On the WMT14 En-Fr translation tasks (3), we trained the models for 3 epochs. The largest model
(2.9B parameters) was trained for 22 hours on a 128-core TPUv2 cluster.""",6.84288e+19,"flops = (64) * ( 45 * 10**12) * (22 * 3600) * (0.3) = 6.8e19
(num gpu) * (peak flops) * (time in seconds) * (assumed utilization rate)
from section 9.1 : ""On the WMT14 En-Fr translation tasks (3), we trained the models for 3 epochs. The largest model
(2.9B parameters) was trained for 22 hours on a 128-core TPUv2 cluster.""
from https://en.wikipedia.org/wiki/Tensor_Processing_Unit
45TFLOPs per chips",WMT14,"from section 9.1 ""On the WMT14 En-Fr translation tasks (3), we trained the models for 3 epochs. The largest model
(2.9B parameters) was trained for 22 hours on a 128-core TPUv2 cluster.""",1800000000.0,"Per Attention is All You Need, WMT 2014 En-Fr is ~36 million sentence pairs. If the average sentence is ~25 tokens (ballpark), dataset size is
36M * 25 * 2 = 1.8B tokens","Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and to implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the ""batch"" dimension, in Mesh-TensorFlow, the user can specify any tensor-dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-TensorFlow graph compiles into a SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing state of the art results on WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark. Mesh-Tensorflow is available at this https URL .",Likely,United States of America,Industry,22.0,"from section 9.1 ""On the WMT14 En-Fr translation tasks (3), we trained the models for 3 epochs. The largest model
(2.9B parameters) was trained for 22 hours on a 128-core TPUv2 cluster.""",Google TPU v2,,,64.0,,,357.0,,,3.0,395.71656000308855,,
MemoReader,Language,"Samsung,Korea University","Seohyun Back, Seunghak Yu, Sathish Indurthi, Jihie Kim, Jaegul Choo",2018-10-31,"MemoReader: Large-Scale Reading Comprehension through Neural Memory Controller",https://aclanthology.org/D18-1237/,SOTA improvement,"""TriviaQA. As shown in Table 2, our model,
even without DEBS, outperforms the existing
state-of-the-art method such as ‘BiDAF + SA +
SN’ by a large margin in all the cases""",,,,"""Our model does require more memory than existing methods, but a single GPU (e.g., M40 with 12GB memory) was enough to train model within a reasonable amount of time""
""Reasonable"" could mean anything, maybe hours to a few days.",TriviaQA,,,,"Machine reading comprehension helps machines learn to utilize most of the human
knowledge written in the form of text. Existing approaches made a significant progress comparable to human-level performance, but they
are still limited in understanding, up to a few paragraphs, failing to properly comprehend
lengthy document. In this paper, we propose a novel deep neural network architecture to handle a long-range dependency in RC tasks. In
detail, our method has two novel aspects: (1) an advanced memory-augmented architecture
and (2) an expanded gated recurrent unit with dense connections that mitigate potential information distortion occurring in the memory.
Our proposed architecture is widely applicable
to other models. We have performed extensive experiments with well-known benchmark
datasets such as TriviaQA, QUASAR-T, and
SQuAD. The experimental results demonstrate
that the proposed method outperforms existing
methods, especially for lengthy documents.",Unknown,"Korea (Republic of),Korea (Republic of)","Industry,Academia",,"""reasonable amount of time"" with a single GPU",NVIDIA M40,Unreleased,,,,,17.0,,,,,,
TrellisNet,Language,"Carnegie Mellon University (CMU),Bosch Center for Artificial Intelligence,Intel Labs","Shaojie Bai, J. Zico Kolter, Vladlen Koltun",2018-10-15,Trellis Networks for Sequence Modeling,https://arxiv.org/abs/1810.06682,SOTA improvement,"""Experiments demonstrate that trellis networks outperform the current state of the art methods on a variety of challenging benchmarks, including word-level language modeling and character-level language modeling
tasks""",180000000.0,"180M, Table 2",2.78e+18,,WikiText-103,,,,,,"United States of America,Germany,Multinational","Academia,Industry,Industry",,,,Unreleased,,,,,132.0,,,25.0,,,
MetaMimic,Games,Google,"Tom Le Paine, Sergio Gomez",2018-10-11,One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL,https://arxiv.org/abs/1810.05017,SOTA improvement,"""By retaining and taking advantage of all its experiences,
MetaMimic also substantially outperforms the state-of-the-art D4PG RL agent, when D4PG
uses only the current task experiences.""",22000000.0,"""This representational demand motivates the introduction of high-capacity deep neural networks. We found the architecture, shown in Figure 3, with residual connections, 20 convolution layers with 512 channels
for a total of 22 million parameters, and instance normalization to drastically improve performance, as shown in Figure 6 of the Experiments section.""",,,,,,,,,United States of America,Industry,,,,,,,,,26.0,,,,,,
BERT-Large,Language,Google,"J Devlin, MW Chang, K Lee, K Toutanova",2018-10-11,BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,https://arxiv.org/abs/1810.04805,Highly cited,,340000000.0,,2.85e+20,more info here https://docs.google.com/document/d/1B8x6XYcmB1u6Tmq3VcbAtj5bzhDaj2TcIPyK6Wpupx4/edit?usp=sharing,,,3300000000.0,"""For the pre-training corpus we
use the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words)""","We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).",,United States of America,Industry,96.0,"from appendix A.2: ""Training of BERTLARGE was performed
on 16 Cloud TPUs (64 TPU chips total). Each pre-
training took 4 days to complete.""",Google TPU v2,Open source,,64.0,0.29,,74818.0,,,,1751.4770087736404,,
Transformer (Adaptive Input Embeddings) WT103,Language,Facebook AI Research,"Alexei Baevski, Michael Auli",2018-09-28,Adaptive Input Representations for Neural Language Modeling,https://arxiv.org/abs/1809.10853,SOTA improvement,"""On the WikiText-103 benchmark we achieve 18.7 perplexity, an improvement of 10.5 perplexity compared to the previously best published result""",247000000.0,Table 2,7.2e+19,"8 V100s * 67 hours per Table 2.
Table 1 shows their biggest adaptive input embeddings run: 145 hours with 64 V100 GPUs
125e12 FLOP/sec * 8 * 67 * 3600 * 0.3 (utilization assumption) = 7.2e19 FLOP
",WikiText-103,"The training data of WIKITEXT-103 comprises about 100M tokens""",100000000.0,"""The training data of WIKITEXT-103 comprises about 100M tokens""
Datasets are not combined but used to train separate models","We introduce adaptive input representations for neural language modeling which extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity. There are several choices on how to factorize the input and output layers, and whether to model words, characters or sub-word units. We perform a systematic comparison of popular choices for a self-attentional architecture. Our experiments show that models equipped with adaptive embeddings are more than twice as fast to train than the popular character input CNN while having a lower number of parameters. On the WIKITEXT-103 benchmark we achieve 18.7 perplexity, an improvement of 10.5 perplexity compared to the previously best published result and on the BILLION WORD benchmark, we achieve 23.02 perplexity.",Confident,United States of America,Industry,67.0,,NVIDIA V100,Open source,,64.0,,,347.0,,,,2880.917278699733,,
BigGAN-deep 512x512,Image generation,"Heriot-Watt University,DeepMind","A Brock, J Donahue, K Simonyan",2018-09-28,Large Scale GAN Training for High Fidelity Natural Image Synthesis,https://arxiv.org/abs/1809.11096,Highly cited,,112694781.0,"I used the publicly available implementation available at [1]
There I loaded the biggan-deep512/1 model, and ran script [2] to compute the number of parameters
[1] https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/biggan_generation_with_tf_hub.ipynb
[2]
import numpy as np

# `module` is the biggan-deep512/1 generator loaded from TF Hub as in [1]
n_params = 0
for var in module.variables:
    n_params += np.prod(var.shape.as_list())
print(n_params)",1.8e+21,"3e21, estimate taken from:
https://www.lesswrong.com/posts/wfpdejMWog4vEDLDg/ai-and-compute-trend-isn-t-predictive-of-what-is-happening",JFT-300M,,292000000.0,"""To confirm that our design choices are effective for even larger and more complex and diverse datasets, we also present results of our system on a subset of JFT-300M (Sun et al., 2017). The full JFT-300M dataset contains 300M real-world images labeled with 18K categories. Since the category distribution is heavily long-tailed, we subsample the dataset to keep only images with the 8.5K most common labels. The resulting dataset contains 292M images – two orders of magnitude larger than ImageNet. ""","Despite recent progress in generative image modeling, successfully generating high-resolution, diverse samples from complex datasets such as ImageNet remains an elusive goal. To this end, we train Generative Adversarial Networks at the largest scale yet attempted, and study the instabilities specific to such scale. We find that applying orthogonal regularization to the generator renders it amenable to a simple ""truncation trick,"" allowing fine control over the trade-off between sample fidelity and variety by reducing the variance of the Generator's input. Our modifications lead to models which set the new state of the art in class-conditional image synthesis. When trained on ImageNet at 128x128 resolution, our models (BigGANs) achieve an Inception Score (IS) of 166.5 and Frechet Inception Distance (FID) of 7.4, improving over the previous best IS of 52.52 and FID of 18.6.",Likely,"United Kingdom of Great Britain and Northern Ireland,United Kingdom of Great Britain and Northern Ireland","Academia,Industry",48.0,"""We train on a Google TPU v3 Pod, with the number of cores proportional to the resolution: 128 for 128×128, 256 for 256×256, and 512 for 512×512. Training takes between 24 and 48 hours for most models""",Google TPU v3,Open source,,256.0,,,4541.0,,,,5170.456705747183,,
LSTM+NeuralCache,Language,"KU Leuven,ESAT - PSI,Apple","Lyan Verwimp, Joris Pelemans, Hugo Van hamme, Patrick Wambacq",2018-09-24,Information-Weighted Neural Cache Language Models for ASR,https://arxiv.org/abs/1809.08826,SOTA improvement,"""We obtain a 29.9%/32.1% (validation/test set) relative improvement in perplexity with respect to a baseline LSTM LM on the WikiText-2 dataset, outperforming previous work on neural cache LMs""
...
""we observe that neural cache models
consistently outperform regular cache models on this dataset.""",2100000.0,,1020000000000000.0,,,,,,,,"Belgium,Belgium,United States of America","Academia,Academia,Industry",,,,Unreleased,,,,,3.0,,,39.0,,,
"AWD-LSTM-MoS + dynamic evaluation (WT2, 2018)",Language,"Peking University,Microsoft Research Asia","Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, Tie-Yan Liu",2018-09-18,FRAGE: Frequency-Agnostic Word Representation,https://arxiv.org/abs/1809.06858,SOTA improvement,"""Specifically, in language modeling and machine translation, we achieve better performance than the state-of-the-art results on PTB, WT2
and WMT14 English-German datasets.""",35000000.0,,,,WikiText-2,,,,,,"China,China","Academia,Industry",,,,Unreleased,,,,,152.0,,,,,,
Transformer + Simple Recurrent Unit,Language,"ASAPP,Cornell University,Google,Princeton University","Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, Yoav Artzi",2018-09-17,Simple Recurrent Units for Highly Parallelizable Recurrence,https://arxiv.org/abs/1709.02755v5,SOTA improvement,"""We use the state-of-the-art Transformer
model of Vaswani et al. (2017) as our base architecture... When SRU is incorporated into the architecture,
both the 4-layer and 5-layer model outperform the
Transformer base model""",90000000.0,"5-layer model, Table 3",1.1e+19,"""We use a single NVIDIA Tesla V100 GPU for each model. The published results were obtained
using 8 GPUs in parallel, which provide a large effective batch size during training. To approximate
the setup, we update the model parameters every 5×5120 tokens and use 16,000 warm-up steps
following OpenNMT suggestions. We train each
model for 40 epochs (250,000 steps), and perform
3 independent trials for each model configuration.
A single run takes about 3.5 days with a Tesla V100 GPU.""
125 trillion * 3.5 * 24 * 3600 * 0.3 = 1.1e19",WMT English-German,"""We train translation models on the WMT English→German dataset, a standard
benchmark for translation systems (Peitz et al.,
2014; Li et al., 2014; Jean et al., 2015). The
dataset consists of 4.5 million sentence pairs""",,,"Common recurrent neural architectures scale poorly due to the intrinsic difficulty in parallelizing their state computations. In this work, we propose the Simple Recurrent Unit (SRU), a light recurrent unit that balances model capacity and scalability. SRU is designed to provide expressive recurrence, enable highly parallelized implementation, and comes with careful initialization to facilitate training of deep models. We demonstrate the effectiveness of SRU on multiple NLP tasks. SRU achieves 5--9x speed-up over cuDNN-optimized LSTM on classification and question answering datasets, and delivers stronger results than LSTM and convolutional models. We also obtain an average of 0.7 BLEU improvement over the Transformer model on translation by incorporating SRU into the architecture.",Confident,"United States of America,United States of America,United States of America,United States of America","Industry,Academia,Industry,Academia",,,NVIDIA V100,,,8.0,,,293.0,,,40.0,45.37321924485877,,
ESRGAN,"Vision,Image generation","Chinese University of Hong Kong (CUHK),Chinese Academy of Sciences,Nanyang Technological University","Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Chen Change Loy, Yu Qiao, Xiaoou Tang",2018-09-01,ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks,https://arxiv.org/abs/1809.00219,Highly cited,,,,,,,,,,,Unknown,"Hong Kong,China,Singapore","Academia,Academia,Academia",,,,,,,,,2929.0,,,,,,
(ensemble): AWD-LSTM-DOC (fin) × 5 (WT2),Language,"NTT Communication Science Laboratories,Tohoku University","Sho Takase, Jun Suzuki, Masaaki Nagata",2018-08-30,Direct Output Connection for a High-Rank Language Model,https://arxiv.org/abs/1808.10143,SOTA improvement,"""The proposed method improves the current state-of-the-art language model and achieves the best score on the Penn Treebank and WikiText-2, which are the standard benchmark datasets""",185000000.0,,6.93e+17,,WikiText-2,,,,,,"Japan,Japan","Industry,Academia",,,,Open source,,,,,36.0,,,300.0,,,
Big Transformer for Back-Translation,Language,"Facebook AI Research,Google Brain","Sergey Edunov, Myle Ott, Michael Auli, David Grangier",2018-08-28,Understanding Back-Translation at Scale,https://arxiv.org/abs/1808.09381,"Highly cited,SOTA improvement","""Finally, we scale to hundreds of millions of monolingual sentences and achieve a new state of the art of 35 BLEU on the WMT'14 English-German test set. """,,"""We re-implemented the Transformer model in py-
torch using the fairseq toolkit.1 All experiments
are based on the Big Transformer architecture with
6 blocks in the encoder and decoder. We use the
same hyper-parameters for all experiments, i.e.,
word representations of size 1024, feed-forward
layers with inner dimension 4096. ""
I am not sure what authors mean by 'Big Transformer architecture'",1.080843264e+20,"(128) * (28.26 * 10**12) * (27*3600 + 40*60) * (0.3) = 108084326400000000000
(number of gpus) * (peak flops) * (seconds) * (assumed utilization rate)
""We run experiments on DGX-1 machines with 8Nvidia V100 GPUs and machines are intercon-
nected by Infiniband. Experiments are run on 16
machines and we perform 30K synchronous up-
dates. ""
""We train models with 16-bit floating point
operations""
per https://www.techpowerup.com/gpu-specs/tesla-v100-pcie-16-gb.c2957, the V100 has 28.26 TFLOPS (FP16)
in section 5.6 we have
""train this system we perform 300K training up-
dates in 27h 40min on 128 GPUs;""",WMT English-German,"""Finally, for WMT English-German we train
on all 226M available monolingual training sen-
tences and perform 250K updates in 22.5 hours on 128 GPUs. """,3390000000.0,"""Finally, for WMT English-German we train on all 226M available monolingual training sentences and perform 250K updates in 22.5 hours on 128 GPUs.""
We assume that each sentence has about 15 words","An effective method to improve neural machine translation with monolingual data is to augment the parallel training corpus with back-translations of target language sentences. This work broadens the understanding of back-translation and investigates a number of methods to generate synthetic source sentences. We find that in all but resource poor settings back-translations obtained via sampling or noised beam outputs are most effective. Our analysis shows that sampling or noisy synthetic data gives a much stronger training signal than data generated by beam or greedy search. We also compare how synthetic data compares to genuine bitext and study various domain effects. Finally, we scale to hundreds of millions of monolingual sentences and achieve a new state of the art of 35 BLEU on the WMT'14 English-German test set. ",Likely,"United States of America,United States of America","Industry,Industry",27.666,"""training updates in 27h 40min on 128 GPUs""",NVIDIA V100,,,128.0,,,1155.0,,,,2442.1618775733145,,
AWD-LSTM-MoS+PDR + dynamic evaluation (WT2),Language,IBM,Siddhartha Brahma,2018-08-14,Improved Language Modeling by Decoding the Past,https://arxiv.org/abs/1808.05908,SOTA improvement,"""our Past Decode Regularization (PDR) method achieves a word level perplexity of 55.6 on the Penn Treebank and 63.5 on the WikiText-2 datasets using a single softmax. We also show gains by using PDR in combination with a mixture-of-softmaxes, achieving a word level perplexity of 53.8 and 60.5 on these datasets. In addition, our method achieves 1.169 bits-per-character on the Penn Treebank Character dataset for character level language modeling. These results constitute a new state-of-the-art in their respective settings.""",35000000.0,,,,WikiText-2,,,,,,United States of America,Industry,,,,Unreleased,,,,,6.0,,,,,,
Big-Little Net (speech),Speech,IBM,"Chun-Fu (Richard) Chen, Quanfu Fan, Neil Mallinar, Tom Sercu, Rogerio Feris",2018-07-10,Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition,https://arxiv.org/abs/1807.03848,SOTA improvement,"""Furthermore, our model surpasses state-of-the-art CNN acceleration approaches by a large margin in accuracy and FLOPs reduction. On the task of speech recognition, our proposed multi-scale CNNs save 30% FLOPs with slightly better word error rates, showing good generalization across domains.""",3320000.0,table 3,4.290048e+17,980000000 (number of FLOPs from table 3) * 27360000 (dataset size) * 16 (number of epochs from appendix B.1) = 429004800000000000,"Switchboard,Fisher","""We train ResNet style acoustic models in the hybrid framework on Switchboard+Fisher (2000h) and provide results on Hub5 (Switchboard and Call Home portions). Switchboard is a large dataset with 2000 hours of transcribed speech from 28, 000 speakers""",27360000.0,"""We train ResNet style acoustic models in the hybrid framework on Switchboard+Fisher (2000h) and provide results on Hub5 (Switchboard and Call Home portions). Switchboard is a large dataset with 2000 hours of transcribed speech from 28, 000 speakers""
2000h * 13680 words per hour = 27360000
https://docs.google.com/document/d/1G3vvQkn4x_W71MKg0GmHVtzfd9m0y3_Ofcoew0v902Q/edit#heading=h.3pbt0hfgv7pq","In this paper, we propose a novel Convolutional Neural Network (CNN) architecture for learning multi-scale feature representations with good tradeoffs between speed and accuracy. This is achieved by using a multi-branch network, which has different computational complexity at different branches. Through frequent merging of features from branches at distinct scales, our model obtains multi-scale features while using less computation. The proposed approach demonstrates improvement of model efficiency and performance on both object recognition and speech recognition tasks,using popular architectures including ResNet and ResNeXt. For object recognition, our approach reduces computation by 33% on object recognition while improving accuracy with 0.9%. Furthermore, our model surpasses state-of-the-art CNN acceleration approaches by a large margin in accuracy and FLOPs reduction. On the task of speech recognition, our proposed multi-scale CNNs save 30% FLOPs with slightly better word error rates, showing good generalization across domains. The codes are available at https://github.com/IBM/BigLittleNet.",Speculative,United States of America,Industry,,,,Open source,,,,,83.0,,,16.0,,,
Big-Little Net,Vision,IBM,"Chun-Fu Chen, Quanfu Fan, Neil Mallinar, Tom Sercu, and Rogerio Feris",2018-07-10,Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition,https://arxiv.org/abs/1807.03848,SOTA improvement,"""On object recognition task, we demonstrated that our approach provides approximately 2× speedup over baselines while
improving accuracy, and the result significantly outperforms the state-of-the-art networks by a large margin in terms of accuracy and FLOPs reduction""",77360000.0,Table 2,2.46048e+17,"Using the 6ND formula:
6×number of training examples×number of parameters×number of epochs
6×1.28×10^6×77360000×110=6.5353728e+16 FLOPs
9.32*10^9 (flops per inference)*1.28×10^6(dataset size)/16 (batch size) * 110 epochs * 3 (to account for backpropagation)= 2.46048e+17 FLOPs",ImageNet,,1280000.0,,"In this paper, we propose a novel Convolutional Neural Network (CNN) architecture for learning multi-scale feature representations with good tradeoffs between speed and accuracy. This is achieved by using a multi-branch network, which has different computational complexity at different branches. Through frequent merging of features from branches at distinct scales, our model obtains multi-scale features while using less computation. The proposed approach demonstrates improvement of model efficiency and performance on both object recognition and speech recognition tasks,using popular architectures including ResNet and ResNeXt. For object recognition, our approach reduces computation by 33% on object recognition while improving accuracy with 0.9%. Furthermore, our model surpasses state-of-the-art CNN acceleration approaches by a large margin in accuracy and FLOPs reduction. On the task of speech recognition, our proposed multi-scale CNNs save 30% FLOPs with slightly better word error rates, showing good generalization across domains.",Likely,United States of America,Industry,,,NVIDIA Tesla K80,Open source,,,,,83.0,,,110.0,,256.0,"""All the models were trained with 110 epochs, batch size 256"""
RCAN,"Image generation,Vision",Northeastern University," Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, Yun Fu",2018-07-08,Image Super-Resolution Using Very Deep Residual Channel Attention Networks,https://openaccess.thecvf.com/content_ECCV_2018/html/Yulun_Zhang_Image_Super-Resolution_Using_ECCV_2018_paper.html,Highly cited,,,,,,,,,,,Unknown,United States of America,Academia,,,,,,,,,3516.0,,,,,,
Population-based DRL,Games,DeepMind,"Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, Thore Graepel",2018-07-03,Human-level performance in first-person multiplayer games with population-based deep reinforcement learning,https://arxiv.org/abs/1807.01281,SOTA improvement,"Qualitatively clearly SOTA: ""In this work, we demonstrate for the first time that an agent can achieve human-level in a popular 3D multiplayer first-person video game, Quake III Arena Capture the Flag (28), using only pixels and game points as input... proved far stronger than existing state-of-the-art agents""",122000000.0,"Calculated from the architecture schematic in Figure S11 on pg 55 of the Capture the Flag supplementary materials. This is dominated by the size of the vision module, which is 116 million parameters, followed by the temporal processors which is 4.3 million parameters. The RL policy itself is only 0.79 million parameters. Also, I'm pretty uncertain if I'm right about how I calculated these parameters.
Source:
https://docs.google.com/spreadsheets/d/1Kj4Q5WADcDXtUJLIOfGTCE3tGvxNczEMwyy8QtgSkHk/edit#gid=54587040&fvid=1361937389",3.49e+19,"Source:
https://docs.google.com/spreadsheets/d/1Kj4Q5WADcDXtUJLIOfGTCE3tGvxNczEMwyy8QtgSkHk/edit#gid=54587040&fvid=1361937389",,,,,"Recent progress in artificial intelligence through reinforcement learning (RL) has shown great success on increasingly complex single-agent environments and two-player turn-based games. However, the real-world contains multiple agents, each learning and acting independently to cooperate and compete with other agents, and environments reflecting this degree of complexity remain an open challenge. In this work, we demonstrate for the first time that an agent can achieve human-level in a popular 3D multiplayer first-person video game, Quake III Arena Capture the Flag, using only pixels and game points as input. These results were achieved by a novel two-tier optimisation process in which a population of independent RL agents are trained concurrently from thousands of parallel matches with agents playing in teams together and against each other on randomly generated environments. Each agent in the population learns its own internal reward signal to complement the sparse delayed reward from winning, and selects actions using a novel temporally hierarchical representation that enables the agent to reason at multiple timescales. During game-play, these agents display human-like behaviours such as navigating, following, and defending based on a rich learned representation that is shown to encode high-level game knowledge. In an extensive tournament-style evaluation the trained agents exceeded the win-rate of strong human players both as teammates and opponents, and proved far stronger than existing state-of-the-art agents. These results demonstrate a significant jump in the capabilities of artificial agents, bringing us closer to the goal of human-level intelligence.",,United Kingdom of Great Britain and Northern Ireland,Industry,,,,,,,,,636.0,,,,,,
FTW,Games,DeepMind,"Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, Thore Graepel",2018-07-03,Human-level performance in first-person multiplayer games with population-based deep reinforcement learning,https://arxiv.org/abs/1807.01281,SOTA improvement,"""In this work, we demonstrate for the first time that an agent can achieve human-level in a popular 3D multiplayer first-person video game, Quake III Arena Capture the Flag (28), using only pixels and game points as input.""",126001330.0,"Architecture described in figure S11 of the supplement
The architecture includes modules for visual embedding, reward prediction, recurrent processing, policy, baseline and pixel control.
Input is 84x84x3 pixels as seen in figure S10 of the supplement
""We elected to use a resolution of 84x84 pixels as in previous related work in this environment. Each pixel is represented by a triple of three bytes""
Visual embedding (84x84x3 -> 256)
32*(8*8*3+1)+64*(4*4*32+1)+64*(3*3*64+1)+64*(3*3*64+1) + (84/(S^4)*84/(S^4)*64+1)*256
Note there is no information about the stride S used in the convolutions; we assume S = 1
Reward prediction (256 -> 3)
(256+1)*128 + (128+1)*3
Recurrent processing (n-> 512)
VU1 (256 -> 512)
4*(799+2*32)*((512+(32*2) + 3*32 + 5*2 + 3)+(799+2*32)+1) + 2*(256+1)*256
VU2 (512 -> 512)
4*(512+2*32)*((512+(32*2) + 3*32 + 5*2 + 3)+(512+2*32)+1) + 2*(256+1)*256
LSTMs usually have 4*(n*m+n*n+n) parameters, where n=input size and m=output size.
This DNS + LSTM takes as input the concatenation of the previous layer of size n and R read vectors of size W=32; and outputs m units plus an interface vector of size (W*R) + 3*W + 5*R + 3, for a total of about 4*(n+R*W)*((m+(W*R) + 3*W + 5*R + 3)+(n+R*32)+1) parameters
I assume R=2 since that seems implied by the previous paper (?)
The first VU has as input the visual embedding (size 256), the previous action (size 540) and the previous reward (size 3), for a total size of 256+540+3 = 799. The output is size 512.
The second VU has input size 512 and output size 512
The DNC memory architecture is described in https://www.nature.com/articles/nature20101.epdf
Policy (512 -> 5x3x3x3x2x2)
6*(512+1)*256 + (256+1)*5 + 3*(256+1)*3 + 2*(256+1)*2
Baseline
(512+1)*256 + (256+1)*1
Pixel control
(512+1)*32*7*7 + 32*(9*9+1) + 5*(4*4+1) + 3*2*(4*4+1) + 2*2*(4*4+1) + 1*(4*4+1)
""we trained independent pixel control policies for each of the six action groups""",7.26e+21,"We assume that most operations happen in the visual embedding.
2* 84^2*84^2 * 32 * 3 / 1^2 = 9.5 *10^9
new image size: 76 x 76 x 32
ignore ReLU/additions because they probably have very little influence
2 * 76^2 * 76^2 * 10* 64 = 4 *10^10
new image size: 72 x 72 x 64
2 * 72^2 *72^2 * 64 * 64 * 3= 6.6 * 10^11
new image size: 69 x 69 x 64
2 * 69^2 *69^2 * 64 * 64 * 3= 5.5 * 10^11
new image size: 66 x 66 x 64
Linear layer: 2* ( 66*66*64)*256 = 1.4*10^8
Total approx: 1.21e+12 FLOP/forward pass
",,,,,,,United Kingdom of Great Britain and Northern Ireland,Industry,,,,,,,,,636.0,,,,,,
ShuffleNet v2,Vision,"Tsinghua University,Megvii Inc","N Ma, X Zhang, HT Zheng",2018-06-30,ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design,https://arxiv.org/abs/1807.11164,Highly cited,,2280000.0,,,,,,,,,,"China,China","Academia,Industry",,,,,,,,,3903.0,,,,,,
QT-Opt,"Robotics,Vision","Google Brain,UC Berkeley","Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, Sergey Levine",2018-06-27,QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation,https://arxiv.org/abs/1806.10293,Highly cited,,1200000.0,"""The Q-function Qθ(s, a) is represented in our system by a large convolutional neural network with 1.2M parameters""",3.4875e+19,"""We distribute training across 10 GPUs, using asynchronous SGD with momentum... This system allows us to train the Q-function at 40 steps per second with a batch size of 32 across 10 NVIDIA P100 GPUs.""
""We found empirically that a large number of gradient steps (up to 15M) were needed to train an effective Q-function...""
15M steps * 0.025 seconds/step * 9.30E+12 FLOP/sec/GPU * 10 GPU = 3.4875E+19",,"""... we collected over 580k grasps over the course of several weeks across 7 robots""
",5984870.0,"Observations take up 4TB of disk space, and the input space is a 472x472 RGB image.
Assuming 24-bit color depth (8 bits per channel), that suggests 472 * 472 * 3 bytes = 668.352 kB per image (this could be off by a factor of 2 depending on the actual bit depth)
4 TB / 668.352 kB = 5,984,870 images; around 10 per grasp attempt.
15M gradient steps with batchsize 32 implies:
15M steps * 32 images/step * 1/5984870 images ~= each image seen 80 times","In this paper, we study the problem of learning vision-based dynamic manipulation skills using a scalable reinforcement learning approach. We study this problem in the context of grasping, a longstanding challenge in robotic manipulation. In contrast to static learning behaviors that choose a grasp point and then execute the desired grasp, our method enables closed-loop vision-based control, whereby the robot continuously updates its grasp strategy based on the most recent observations to optimize long-horizon grasp success. To that end, we introduce QT-Opt, a scalable self-supervised vision-based reinforcement learning framework that can leverage over 580k real-world grasp attempts to train a deep neural network Q-function with over 1.2M parameters to perform closed-loop, real-world grasping that generalizes to 96% grasp success on unseen objects. Aside from attaining a very high success rate, our method exhibits behaviors that are quite distinct from more standard grasping systems: using only RGB vision-based perception from an over-the-shoulder camera, our method automatically learns regrasping strategies, probes objects to find the most effective grasps, learns to reposition objects and perform other non-prehensile pre-grasp manipulations, and responds dynamically to disturbances and perturbations.",Likely,"United States of America,United States of America","Industry,Academia",104.2,"""We distribute training across 10 GPUs, using asynchronous SGD with momentum... This system allows us to train the Q-function at 40 steps per second with a batch size of 32 across 10 NVIDIA P100 GPUs.""
""We found empirically that a large number of gradient steps (up to 15M) were needed to train an effective Q-function...""
15M steps * 0.025 seconds/step * 1/3600 hours/second = 104.2 hours",NVIDIA P100,Unreleased,,,,,1442.0,"Using cost from ML Hardware Data spreadsheet,
$0.919/hr/GPU * 104.2 hours * 10 GPUs = $957.60
Likely an underestimate, as the cloud pricing comes from 2023 and incorporates 5 additional years of depreciation on the P100.",,80.0,1317.8035861002786,,
DARTS,Language,"DeepMind,Carnegie Mellon University (CMU)","Hanxiao Liu, Karen Simonyan, Yiming Yang",2018-06-24,DARTS: Differentiable Architecture Search,https://arxiv.org/abs/1806.09055,Highly cited,,33000000.0,,1.1e+16,,WikiText-2,,,,,,"United Kingdom of Great Britain and Northern Ireland,United States of America","Industry,Academia",,,,Unreleased,,,,,3990.0,,,300.0,,,
MobileNetV2,Vision,Google,"M Sandler, A Howard, M Zhu",2018-06-18,MobileNetV2: Inverted Residuals and Linear Bottlenecks,https://ieeexplore.ieee.org/document/8578572,Highly cited,,3400000.0,Rados,,,,,,,,,United States of America,Industry,,,,,,,,,15118.0,,,,,,
Relational Memory Core,Language,"DeepMind,University College London (UCL)","Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Theophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, Timothy Lillicrap",2018-06-05,Relational recurrent neural networks,https://arxiv.org/abs/1806.01822,SOTA improvement,"""Finally, we test the RMC on a suite of tasks that may profit from more capable relational reasoning across sequential information, and show large gains in RL domains (e.g. Mini PacMan), program evaluation, and language modeling, achieving state-of-the-art results on the WikiText-103, Project Gutenberg, and GigaWord datasets.""",,,,,WikiText-103,,,,,Unknown,"United Kingdom of Great Britain and Northern Ireland,United Kingdom of Great Britain and Northern Ireland","Industry,Academia",,,,Unreleased,,,,,235.0,,,,,,
GPT,Language,OpenAI,"A Radford, K Narasimhan, T Salimans, I Sutskever",2018-06-01,Improving Language Understanding by Generative Pre-Training,https://openai.com/blog/language-unsupervised/,Highly cited,,117000000.0,"""The model had 117M parameters in total.""
source: https://medium.com/walmartglobaltech/the-journey-of-open-ai-gpt-models-32d95b7b7fb2",1.7578125e+19,"COMPUTE = FORWARD COMPUTE PER TOKEN * 3 BACKWARD FORWARD ADJUSTMENT * EPOCHS * DATASET SIZE
""We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens.""
","BookCorpus (BooksCorpus, Toronto Book Corpus)","""We use the BooksCorpus dataset [71] for training the language model""",1000000000.0,"""BookCorpus is a large collection of free novel books written by unpublished authors, which contains 11,038 books (around 74M sentences and 1G words) of 16 different sub-genres (e.g., Romance, Historical, Adventure, etc.).""
https://paperswithcode.com/dataset/bookcorpus
BookCorpus seems to have about 5000MB of content
source: https://huggingface.co/datasets/bookcorpusopen
Assuming a byte-pair encoder similar to GPT-2, there are 8 bytes / token.
So approximately 5000MB / 8 bytes / token = 5e9 / 8 tokens","Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).",,United States of America,Industry,720.0,"""1 month on 8 GPUs."" from the reference link",NVIDIA Quadro P600,Open source,,8.0,,,8616.0,,,,,,
aLSTM(depth-2)+RecurrentPolicy (WT2),Language,"University of Manchester,Alan Turing Institute","Sebastian Flennerhag, Hujun Yin, John Keane, Mark Elliot",2018-05-22,Breaking the Activation Function Bottleneck through Adaptive Parameterization,https://arxiv.org/abs/1805.08574,SOTA improvement,"""Without tuning for WT2, both outperform previously published results in 150 epochs (table 3) and converge to new state of the art performance in 190 epochs""",32000000.0,,7.59e+16,,,,,,,,"United Kingdom of Great Britain and Northern Ireland,United Kingdom of Great Britain and Northern Ireland","Academia,Government",,,,Unreleased,,,,,12.0,,,190.0,,,
Dropout-LSTM+Noise(Bernoulli) (WT2),Language,"Columbia University,New York University (NYU),Princeton University","Adji B. Dieng, Rajesh Ranganath, Jaan Altosaar, David M. Blei",2018-05-03,Noisin: Unbiased Regularization for Recurrent Neural Networks,https://arxiv.org/abs/1805.01500,SOTA improvement,"this is the best model in this paper per Table 4
""On language modeling benchmarks, Noisin improves over dropout by as much as 12.2% on the Penn Treebank and 9.4% on the Wikitext-2 dataset""",51000000.0,,1.27e+17,,,,,,,,"United States of America,United States of America,United States of America","Academia,Academia,Academia",,,,Unreleased,,,,,26.0,,,200.0,,,
ResNeXt-101 32x48d,Vision,Facebook,"Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, Laurens van der Maaten",2018-05-02,Exploring the Limits of Weakly Supervised Pretraining,https://arxiv.org/abs/1805.00932,"Highly cited,SOTA improvement","""We show improvements on several image classification and object detection tasks, and report the highest ImageNet-1k single-crop, top-1 accuracy to date: 85.4%",829000000.0,"Table 6
",8.74395e+21,"Table 6: 153e9 mult-adds.
Section 2.4: ""minibatches of 8,064 images"".
Compute = 2 * 3 * mult-adds * dataset size = 2 * 3 * 153e9 * 9525e6 = 8.74e21 FLOP","ImageNet,Instagram","Instagram images, captioned with hashtags",9525000000.0,Table 3: (300+1925+300+7000) million images,,Confident,United States of America,Industry,,"""Mahajan et al. (2018) required 19
GPU years to train their ResNeXt101-32x48d""
https://arxiv.org/abs/2103.00020",,,,336.0,,,1241.0,,,,,,
Diffractive Deep Neural Network,Vision,University of California Los Angeles (UCLA),"Xing Lin, Yair Rivenson, Nezih T Yardimci, Muhammed Veli, Yi Luo, Mona Jarrahi, and Aydogan Ozcan",2018-04-14,All-Optical Machine Learning Using Diffractive Deep Neural Networks,https://arxiv.org/abs/1804.08711,Highly cited,,8000000000.0,"""For example, using five 3D-printed transmission layers, containing a total of 0.2 million neurons and ~8.0 billion connections that are trained using deep learning, we experimentally demonstrated the function of a handwritten digit classifier.""
My understanding is that every connection corresponds to a learnable parameter.",,,MNIST,"""For this task, phase-only transmission masks were designed by training a 5-layer D2NN with ~55,000 images from MNIST handwritten digit database (14). """,55000.0,"size of MNIST
""For this task, phase-only transmission masks were designed by training a 5-layer D2NN with ~55,000 images from MNIST handwritten digit database (14). ""","We introduce an all-optical Diffractive Deep Neural Network (D2NN) architecture that can learn to implement various functions after deep learning-based design of passive diffractive layers that work collectively. We experimentally demonstrated the success of this framework by creating 3D-printed D2NNs that learned to implement handwritten digit classification and the function of an imaging lens at terahertz spectrum. With the existing plethora of 3D-printing and other lithographic fabrication methods as well as spatial-light-modulators, this all-optical deep learning framework can perform, at the speed of light, various complex functions that computer-based neural networks can implement, and will find applications in all-optical image analysis, feature detection and object classification, also enabling new camera designs and optical components that can learn to perform unique tasks using D2NNs.",Likely,United States of America,Academia,,,,,,,,,1464.0,,,,,,
YOLOv3,Vision,University of Washington,"Joseph Redmon, Ali Farhadi",2018-04-08,YOLOv3: An Incremental Improvement,https://arxiv.org/abs/1804.02767,Highly cited,,56933216.0,"Feature extractor (ignoring biases)
32*3*3*3 +
64*3*3*32 +
32*1*1*64 +
64*3*3*32 +
128*3*3*64 +
2*(64*1*1*128 +
128*3*3*64) +
256*3*3*128 +
8*(128*1*1*256 +
256*3*3*128) +
512*3*3*256 +
8*(256*1*1*512 +
512*3*3*256) +
1024*3*3*512 +
4*(512*1*1*1024 +
1024*3*3*512) +
4*4*1024*1000
source: table 1
This is assuming the average pooling step changes the output size from 8x8 to 4x4.
The weights file is 237MB. If the weights are saved as float32, 4 bytes per weight, then there are approximately 237M/4=59M parameters, consistent with the calculation above.",5.093919992e+19,"We use the formula training_compute = ops_per_forward_pass * 3.5 * n_epochs * n_examples
Assuming 160 epochs of training as in https://arxiv.org/pdf/1612.08242.pdf",ImageNet,,1281167.0,Source: https://image-net.org/download.php,"We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320x320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 mAP@50 in 51 ms on a Titan X, compared to 57.5 mAP@50 in 198 ms by RetinaNet, similar performance but 3.8x faster. As always, all the code is online at this https URL",,United States of America,Academia,,,"NVIDIA M40,NVIDIA GTX Titan X",,,,,,17623.0,,,,,,
"LSTM (Hebbian, Cache, MbPA)",Language,"DeepMind,University College London (UCL)","Jack W Rae, Chris Dyer, Peter Dayan, Timothy P Lillicrap",2018-03-27,Fast Parametric Learning with Activation Memorization,https://arxiv.org/abs/1803.10049,SOTA improvement,"""We also show improved performance for word-based language models on news reports (GigaWord), books (Project Gutenberg) and Wikipedia articles (WikiText-103) --- the latter achieving a state-of-the-art perplexity of 29.2.""",45199999.99999999,,2.4e+19,,,,4300000000.0,"Omniglot: 32k images
Wikitext-103: ""Over 100 million tokens""
Gutenberg: 175,181,505 tokens
GigaWord v5: 4B tokens",,,"United Kingdom of Great Britain and Northern Ireland,United Kingdom of Great Britain and Northern Ireland","Industry,Academia",144.0,6 days,NVIDIA P100,Unreleased,,8.0,,,46.0,,,90.0,590.5320553042133,,
4 layer QRNN (h=2500),Language,Salesforce Research,"Stephen Merity, Nitish Shirish Keskar, Richard Socher",2018-03-22,An Analysis of Neural Language Modeling at Multiple Scales,https://arxiv.org/abs/1803.08240,SOTA improvement,"""QRNNs achieve stateof-the-art results on character-level (Penn Treebank, enwik8) and word-level (WikiText-103)
datasets, respectively""",26000000.0,,2.4e+17,,WikiText-103,,,,,,United States of America,Industry,,,,Unreleased,,,,,183.0,,,14.0,,,
Rotation,"Image generation,Vision",École des Ponts ParisTech,"Spyros Gidaris, Praveer Singh, Nikos Komodakis",2018-03-21,Unsupervised Representation Learning by Predicting Image Rotations,https://arxiv.org/abs/1803.07728,Highly cited,,86000000.0,https://openai.com/blog/image-gpt/#rfref53,,,,,,,,,France,Academia,,,,,,,,,2917.0,,,,,,
LSTM (2018),Language,"Intel Labs,Carnegie Mellon University (CMU)","Shaojie Bai, J. Zico Kolter, Vladlen Koltun",2018-03-04,An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling,https://arxiv.org/abs/1803.01271,Highly cited,,13000000.0,,,,Penn TreeBank,,,,,,"Multinational,United States of America","Industry,Academia",,,,Open source,,,,,4024.0,,,,,,
Chinese - English translation,Language,Microsoft,"H Hassan, A Aue, C Chen, V Chowdhary",2018-03-01,Achieving Human Parity on Automatic Chinese to English News Translation,https://www.microsoft.com/en-us/research/publication/achieving-human-parity-on-automatic-chinese-to-english-news-translation/,SOTA improvement,"""We find that our latest neural machine translation system has reached a new state-of-the-art, and that the translation quality is at human parity when compared to professional human translations""",,,,,,,,,,Unknown,United States of America,Industry,,,,,,,,,575.0,,,,,,
Residual Dense Network,"Vision,Image generation","Northeastern University,University of Rochester"," Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, Yun Fu",2018-02-24,Residual Dense Network for Image Super-Resolution,https://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_Residual_Dense_Network_CVPR_2018_paper.html,Highly cited,,,,,,,,,,,Unknown,"United States of America,United States of America","Academia,Academia",,,,,,,,,2822.0,,,,,,
Spectrally Normalized GAN,Image generation,"Preferred Networks Inc,Ritsumeikan University,National Institute of Informatics","Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida",2018-02-16,Spectral Normalization for Generative Adversarial Networks,https://arxiv.org/abs/1802.05957,Highly cited,,,,,,,,,,,Unknown,"Japan,Japan,Japan","Industry,Academia",,,,,,,,,3966.0,,,,,,
TCN (P-MNIST),Language,"Carnegie Mellon University (CMU),Intel Labs","Shaojie Bai, J. Zico Kolter, Vladlen Koltun",2018-02-15,Convolutional Sequence Modeling Revisited,https://openreview.net/forum?id=rk8wKk-R-,SOTA improvement,"""For the permuted sequential MNIST, TCNs outperform state of the art results using recurrent nets (95.9%) with Zoneout+Recurrent BatchNorm (Cooijmans et al., 2016; Krueger et al., 2017), a highly optimized method for regularizing RNNs""",42000.0,,,,P-MNIST,,,,,Confident,"United States of America,Multinational","Academia,Industry",,,,,,,,,64.0,,,,,,
ENAS,Language,"Google Brain,Carnegie Mellon University (CMU),Stanford University","Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, Jeff Dean",2018-02-09,Efficient Neural Architecture Search via Parameter Sharing,https://arxiv.org/abs/1802.03268,Highly cited,,24000000.0,,2.01e+16,,Penn TreeBank,,,,,,"United States of America,United States of America,United States of America","Industry,Academia,Academia",,,,Unreleased,,,,,2760.0,,,150.0,,,
DeepLabV3+,Vision,Google,"Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam",2018-02-07,Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation,https://arxiv.org/abs/1802.02611v3,Highly cited,,,,,,,,,,,Unknown,United States of America,Industry,,,,,,,,,10396.0,,,,,,
IMPALA,Games,DeepMind,"Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, Koray Kavukcuoglu",2018-02-05,IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures,https://arxiv.org/abs/1802.01561,"Highly cited,SOTA improvement","""IMPALA is able to achieve better performance than previous agents with less data""",1600000.0,"""Figure 3 in the paper states that the large architecture has 1.6 million parameters. I am using the large model because it was the only one trained on all the Atari games at once, which seems like the most impressive task in the suite.""
Source: https://docs.google.com/spreadsheets/d/1Kj4Q5WADcDXtUJLIOfGTCE3tGvxNczEMwyy8QtgSkHk/edit#gid=54587040&fvid=1361937389",1.68e+20,"Source: Ajeya Cotra and Tom Davidson, https://docs.google.com/spreadsheets/d/1Kj4Q5WADcDXtUJLIOfGTCE3tGvxNczEMwyy8QtgSkHk/edit#gid=54587040&fvid=1361937389",,,240000000000.0,"From fig 6, there were 1e10 environment frames, and 24 agents. Thus we note down 2.4e11 for the ""dataset size""","In this work we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters. A key challenge is to handle the increased amount of data and extended training time. We have developed a new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation. We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace. We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari-57 (all available Atari games in Arcade Learning Environment (Bellemare et al., 2013a)). Our results show that IMPALA is able to achieve better performance than previous agents with less data, and crucially exhibits positive transfer between tasks as a result of its multi-task approach.",,United Kingdom of Great Britain and Northern Ireland,Industry,100.0,Maximum training time for IMPALA is 100 hours according to Figure 6. This seems to refer to the 1 GPU model. The 8 GPU model looks to have been trained about 1/8 as long.,NVIDIA P100,,,1.0,,,1384.0,,,,53.42846804494412,,
AmoebaNet-A (F=448),Vision,Google Brain,"Esteban Real, Alok Aggarwal, Yanping Huang, Quoc V Le",2018-02-05,Regularized Evolution for Image Classifier Architecture Search,https://arxiv.org/abs/1802.01548,Highly cited,,469000000.0,Table 2,3.85296912e+20,"450 K40 GPUs for 20k models (approx. 7 days).
(From Imagenet paper-data, Besiroglu et al., forthcoming) ",ImageNet-1k,,1280000.0,,"The effort devoted to hand-crafting neural network image classifiers has motivated the use of architecture search to discover them automatically. Although evolutionary algorithms have been repeatedly applied to neural network topologies, the image classifiers thus discovered have remained inferior to human-crafted ones. Here, we evolve an image classifier---AmoebaNet-A---that surpasses hand-designs for the first time. To do this, we modify the tournament selection evolutionary algorithm by introducing an age property to favor the younger genotypes. Matching size, AmoebaNet-A has comparable accuracy to current state-of-the-art ImageNet models discovered with more complex architecture-search methods. Scaled to larger size, AmoebaNet-A sets a new state-of-the-art 83.9% / 96.6% top-5 ImageNet accuracy. In a controlled comparison against a well known reinforcement learning algorithm, we give evidence that evolution can obtain results faster with the same hardware, especially at the earlier stages of the search. This is relevant when fewer compute resources are available. Evolution is, thus, a simple method to effectively discover high-quality architectures.",,United States of America,Industry,168.0,"""Each experiment ran on 450 K40 GPUs for 20k models (approx. 7 days).""",NVIDIA Tesla K40s,,,450.0,,,2659.0,,,,11766.339677271537,,
AmoebaNet-A (F=190),Vision,Google Brain,"E Real, A Aggarwal, Y Huang, QV Le",2018-02-05,Regularized Evolution for Image Classifier Architecture Search,https://arxiv.org/abs/1802.01548,Highly cited,,87000000.0,Table 2,,,,,,,,,United States of America,Industry,,,,,,,,,2659.0,,,,,,
QRNN,Language,Salesforce Research,"Stephen Merity, Nitish Shirish Keskar, James Bradbury, Richard Socher",2018-02-01,Scalable Language Modeling: WikiText-103 on a Single GPU in 12 hours,https://mlsys.org/Conferences/doc/2018/50.pdf,SOTA improvement,"""we reduce our per-epoch time substantially and achieve a new state-of-the-art on WikiText-103 despite training for 14 epochs""",135000000.0,,3.6e+17,,WikiText-103,,,,,,United States of America,Industry,,,,Unreleased,,,,,4.0,,,14.0,,,
ELMo,Language,"University of Washington,Allen Institute for AI","ME Peters, M Neumann, M Iyyer, M Gardner",2018-02-01,Deep contextualized word representations,https://arxiv.org/abs/1802.05365,Highly cited,,94000000.0,,,3300e12 - https://github.com/amirgholami/ai_and_memory_wall,,,,,,,"United States of America,United States of America","Academia,Research collective",,,,,,,,,10768.0,,,,,,
ULM-FiT,Language,"University of San Francisco,Insight Centre NUI Galway,Fast.ai","Jeremy Howard, Sebastian Ruder",2018-01-18,Universal Language Model Fine-tuning for Text Classification,https://arxiv.org/abs/1801.06146,Highly cited,,441000000.0,https://files.fast.ai/models/wt103/?C=S;O=D,2.72538e+17,=103000000*441000000*6=2.72538e+17,"IMDb,Yelp,Trec-6,DBpedia,AG news,WikiText-103",,103000000.0,"We pretrain the language model on Wikitext-103
(Merity et al., 2017b) consisting of 28,595 preprocessed Wikipedia articles and 103 million words.
Fine-tuning datasets:
TREC-6 Question 5.5k
IMDb Sentiment 25k
Yelp-bi Sentiment 560k
Yelp-full Sentiment 650k
AG Topic 120k
DBpedia Topic 560k
560+120+650+560+25+5.5=1920.5k = 1920500","Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100x more data. We open-source our pretrained models and code.",Speculative,"United States of America,Ireland","Academia,Academia",,,,Open source,,,,AWD-LSTM,1940.0,,,,,,
Refined Part Pooling,Vision,"Tsinghua University,University of Technology Sydney,University of Texas at San Antonio","Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, Shengjin Wang",2018-01-09,Beyond Part Models: Person Retrieval with Refined Part Pooling (and a Strong Convolutional Baseline),https://arxiv.org/abs/1711.09349,Highly cited,,,,,,,,,,,Unknown,"China,Australia,United States of America","Academia,Academia,Academia",,,,,,,,,1927.0,,,,,,
Tacotron 2,Speech,"Google,UC Berkeley","Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu",2017-12-19,Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Prediction,https://arxiv.org/abs/1712.05884,Highly cited,,,"some architecture details:
""Input characters are represented using a learned 512-dimensional
character embedding, which are passed through a stack of 3 convolutional layers each containing 512 filters with shape 5 × 1, i.e., where
each filter spans 5 characters, followed by batch normalization [18]
and ReLU activations. As in Tacotron, these convolutional layers
model longer-term context (e.g., N-grams) in the input character
sequence. The output of the final convolutional layer is passed into a
single bi-directional [19] LSTM [20] layer containing 512 units (256
in each direction) to generate the encoded features.""",,,,"""We train all models on an internal US English dataset[12], which
contains 24.6 hours of speech from a single professional female
speaker.""",340000.0,"""We train all models on an internal US English dataset[12], which contains 24.6 hours of speech from a single professional female speaker.""
13,680 words/hour * 24.6 = 336528 words","This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize timedomain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.",Confident,"United States of America,United States of America","Industry,Academia",,,,,,,,,2886.0,,,,,,
AlphaZero,Games,DeepMind,"D Silver, T Hubert, J Schrittwieser, I Antonoglou",2017-12-05,Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm,https://arxiv.org/abs/1712.01815,Highly cited,,,,3.667927300468287e+22,Extracted from AI and Compute (https://openai.com/blog/ai-and-compute/) charts by using https://automeris.io/WebPlotDigitizer/.,,,700000.0,"""We trained a separate instance of AlphaZero for each game. Training proceeded
for 700,000 steps""","The game of chess is the most widely-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. In contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go, by tabula rasa reinforcement learning from games of self-play. In this paper, we generalise this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains. Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved within 24 hours a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and convincingly defeated a world-champion program in each case.",,United Kingdom of Great Britain and Northern Ireland,Industry,,,Google TPU v2,,,64.0,,,1464.0,,,,229918.6146969874,,
2-layer-LSTM+Deep-Gradient-Compression,Language,"Tsinghua University,Stanford University,NVIDIA","Yujun Lin, Song Han, Huizi Mao, Yu Wang, William J. Dally",2017-12-05,Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training,https://arxiv.org/abs/1712.01887,Highly cited,,6020000.0,,1340000000000000.0,,,,,,,,"China,United States of America,United States of America","Academia,Academia,Industry",,,,Unreleased,,,,,1270.0,,,40.0,,,
PNASNet-5,Vision,"Johns Hopkins University,Google AI,Stanford University","C Liu, B Zoph, M Neumann, J Shlens",2017-12-02,Progressive Neural Architecture Search,https://arxiv.org/abs/1712.00559,Highly cited,,,,6.62904e+19,"8 times less compute than Zoph (2018), which used 500 p100s for 4 days.
(From Imagenet paper-data, Besiroglu et al., forthcoming) ",ImageNet-1k,,1280000.0,,"We propose a new method for learning the structure of convolutional neural networks (CNNs) that is more efficient than recent state-of-the-art methods based on reinforcement learning and evolutionary algorithms. Our approach uses a sequential model-based optimization (SMBO) strategy, in which we search for structures in order of increasing complexity, while simultaneously learning a surrogate model to guide the search through structure space. Direct comparison under the same search space shows that our method is up to 5 times more efficient than the RL method of Zoph et al. (2018) in terms of number of models evaluated, and 8 times faster in terms of total compute. The structures we discover in this way achieve state of the art classification accuracies on CIFAR-10 and ImageNet.",,"United States of America,Multinational,United States of America","Academia,Industry,Academia",,,,,,,,,1843.0,,,,,,
PNAS-net,Vision,"Johns Hopkins University,Google AI,Stanford University","C Liu, B Zoph, M Neumann, J Shlens",2017-12-02,Progressive Neural Architecture Search,https://arxiv.org/abs/1712.00559,Highly cited,,86000000.0,,,,,,,,,,"United States of America,Multinational,United States of America","Academia,Industry,Academia",,,,,,,,,1843.0,,,,,,
TriNet,Video,"Visual Computing Institute,RWTH Aachen University","Alexander Hermans, Lucas Beyer, Bastian Leibe",2017-11-21,In Defense of the Triplet Loss for Person Re-Identification,https://arxiv.org/abs/1703.07737,Highly cited,,,,,,,,,,,Unknown,"Germany,Germany",Academia,,,,,,,,,2887.0,,,,,,
"AWD-LSTM-MoS + dynamic evaluation (WT2, 2017)",Language,Carnegie Mellon University (CMU),"Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, William W. Cohen",2017-11-10,Breaking the Softmax Bottleneck: A High-Rank RNN Language Model,https://arxiv.org/abs/1711.03953,SOTA improvement,"""Experimental results confirm that the
proposed method significantly improves state-of-the-art language models, achieving a perplexity of 55.31 and 62.89 on
the test set of Penn Treebank and WikiText-2""",35000000.0,,4.37e+17,,,,,,,,United States of America,Academia,,,,Unreleased,,,,,358.0,,,1000.0,,,
Fraternal dropout + AWD-LSTM 3-layer (WT2),Language,"Jagiellonian University,Mila - Quebec AI (originally Montreal Institute for Learning Algorithms),University of Montreal / Université de Montréal","Konrad Zolna, Devansh Arpit, Dendi Suhubdy, Yoshua Bengio",2017-10-31,Fraternal Dropout,https://arxiv.org/abs/1711.00066,SOTA improvement,"""We evaluate our model and achieve state-of-the-art results in sequence