The LLM360 Research Suite is a comprehensive set of large language model (LLM) artifacts from each of our models, released so that academic and industry researchers can explore LLM training dynamics.
Analysis360 serves as the single source of truth for all evaluation metrics and provides in-depth analysis from many different angles.
We note a trend of progressively less disclosure of important pretraining details over time: (1) availability of pretraining code, (2) disclosure of training configurations and hyperparameters, (3) intermediate checkpoints of model weights, (4) intermediate checkpoints of optimizer states, (5) disclosure of data mixture and sources, (6) reproducibility of pretraining data sequence, and (7) availability (or reconstruction scripts) of the pretraining data.
| LLM Name | Release Date | Pretraining Code | Training Config | Model Checkpoints | Optimizer Checkpoints | Data Mixture | Data Ordering | Dataset Available | Tokens (T) |
|---|---|---|---|---|---|---|---|---|---|
| K2 | May’24 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 1.4 |
| OLMo-7B | May’24 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 2.5 |
| Arctic | Apr’24 | ✓ | ✓ | | | | | | 1.5 |
| CrystalCoder | Dec’23 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 1.4 |
| Amber | Dec’23 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 1.3 |
| Yi | Nov’23 | | | | | | | | ? |
| Mistral | Sep’23 | | | | | | | | ? |
| Qwen | Aug’23 | ✓ | | | | | | | 2.4 |
| Llama 2 | Jul’23 | ✓ | | | | | | | 2.0 |
| Falcon | May’23 | ✓ | ✓ | | | | | | 1.5 |
| MPT | May’23 | ✓ | ✓ | ✓ | | | | | 1.0 |
| INCITE | May’23 | ✓ | ✓ | ✓ | ✓ | ✓ | | | 1.0 |
| OpenLLaMA | May’23 | ✓ | ✓ | ✓ | ✓ | ✓ | | | 1.0 |
| LLaMA | Feb’23 | ✓ | ✓ | | | | | | 1.0 |
| Pythia | Feb’23 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 0.30 |
| BLOOM | Nov’22 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 0.34 |
| OPT | May’22 | ✓ | ✓ | ✓ | ✓ | | | | 0.18 |
| GPT-NeoX | Apr’22 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 0.40 |
| GPT-J | May’21 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 0.40 |
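For the rows above with intermediate model checkpoints available, the weights can typically be loaded directly from the Hugging Face Hub by revision. Below is a minimal sketch using our Amber repository; the revision tag `ckpt_100` is an illustrative example, and the exact tag names are listed on the corresponding model card.

```python
# Sketch: loading an intermediate checkpoint of Amber by revision from Hugging Face.
# "LLM360/Amber" is the released repository; the revision tag below is illustrative --
# see the Amber model card for the full list of checkpoint revisions.
from transformers import AutoModelForCausalLM, AutoTokenizer

revision = "ckpt_100"  # illustrative intermediate-checkpoint tag
tokenizer = AutoTokenizer.from_pretrained("LLM360/Amber", revision=revision)
model = AutoModelForCausalLM.from_pretrained("LLM360/Amber", revision=revision)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```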
We run evaluations on a variety of benchmarks, including conventional benchmarks such as MMLU, HellaSwag, and ARC; user-preference-aligned benchmarks such as MT-Bench; long-context evaluations such as LongEval; and additional safety benchmarks covering truthfulness, toxicity, and bias. Moreover, we report results on intermediate checkpoints preselected from each model suite; because all checkpoints in a suite were trained on the same data in the exact same order, this makes it easier to observe and understand how our models develop and evolve over the course of training. We also provide public access to all checkpoints, all code, and all wandb dashboards with detailed training and evaluation curves.
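As a concrete illustration, the open-source lm-evaluation-harness can produce this style of benchmark number for any released checkpoint. The sketch below is illustrative only and is not the exact evaluation script behind our reported results (each model currently uses its own scripts, as noted below).

```python
# Sketch: scoring one checkpoint on one benchmark with EleutherAI's lm-evaluation-harness.
# Illustrative only -- not the exact evaluation scripts used for the reported numbers.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                            # Hugging Face causal-LM backend
    model_args="pretrained=LLM360/Amber",  # add ,revision=... to score an intermediate checkpoint
    tasks=["hellaswag"],                   # 10-shot HellaSwag mirrors the Amber column below
    num_fewshot=10,
    batch_size=8,
)
print(results["results"]["hellaswag"])
```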
Here is the full list of analyses and metrics we have collected so far. For each model released to date (Amber, CrystalCoder, and K2), we link the specific wandb report wherever the evaluation has been completed. Amber, CrystalCoder, and K2 currently use their own evaluation scripts; we are working on consolidating them, and more details can be found in later sections. Please refer to the model cards (Amber, CrystalCoder, and K2) for any terms or technology you find unfamiliar. We will keep updating and expanding the list as our study proceeds, so please stay tuned for upcoming changes!
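Each non-empty cell in the table below links to a wandb report. For programmatic access, the public wandb API can pull the underlying curves; in the sketch below, the entity/project path and the metric key are placeholders, so substitute the values from the report you are interested in.

```python
# Sketch: pulling evaluation curves from Weights & Biases programmatically.
# The project path and metric key are placeholders; take the real values from the
# wandb report links in the table that follows.
import wandb

api = wandb.Api()
runs = api.runs("my-entity/my-amber-eval-project")      # placeholder entity/project
for run in runs:
    history = run.history(keys=["hellaswag/acc_norm"])  # placeholder metric key
    print(run.name, history.tail(1))
```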
| Metrics/Analysis | Description | Amber | CrystalCoder |
|---|---|---|---|
| mmlu | A test to measure a text model's multitask accuracy, covering 57 tasks including elementary mathematics, US history, computer science, law, and more | 5 shot | 0 shot, 5 shot |
| race | A test to measure reading comprehension ability | 0 shot | 0 shot |
| arc_challenge | A set of grade-school science questions | 25 shot | 0 shot, 25 shot |
| boolq | A question-answering dataset for yes/no questions containing 15,942 examples | 0 shot | 0 shot |
| hellaswag | A test of commonsense inference | 10 shot | 0 shot, 10 shot |
| openbookqa | A question-answering dataset modeled after open-book exams for assessing human understanding of a subject | 0 shot | 0 shot |
| piqa | A test to measure physical commonsense and reasoning | 0 shot | 0 shot |
| siqa | A test to measure commonsense reasoning about social interactions | 0 shot | |
| winogrande | An adversarial and difficult Winograd benchmark at scale, for commonsense reasoning | 0 shot | 0 shot, 5 shot |
| crowspairs | A challenge set for evaluating language models (LMs) on their tendency to generate biased outputs | 0 shot | |
| truthfulqa | A test to measure a model's propensity to reproduce falsehoods commonly found online | 0 shot | 0 shot |
| pile | A test to measure the model's perplexity; we cover 18 of the 22 sub-datasets | perplexity | |
| drop | A reading comprehension benchmark requiring discrete reasoning over paragraphs | | 3 shot |
| mbpp | Around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers | | pass 1, pass 10 |
| humaneval | A test to measure functional correctness for synthesizing programs from docstrings | | pass 1, pass 10 |
| gsm8k | Diverse grade-school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems | | 5 shot |
| copa | A test to assess progress in open-domain commonsense causal reasoning | | 0 shot |
| toxigen | A test to measure the model's toxicity in text generation | toxigen | |
| toxicity identification | A test to measure the model's capability to identify toxic text | toxicity identification | |
| bold | A test to evaluate fairness in open-ended language generation in English | bold | |
| memorization and token orders analysis | An analysis to understand the model's memorization abilities | memorization | |
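To make the last row concrete, here is a minimal sketch of a per-sequence memorization score in the spirit of that analysis: prompt the model with the first `k` tokens of a training sequence and measure the fraction of the next `l` ground-truth tokens that greedy decoding reproduces. The 32/32 split and the checkpoint name are illustrative choices, not necessarily the exact settings behind our reported memorization results.

```python
# Sketch: a per-sequence memorization score (illustrative settings, see lead-in above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def memorization_score(model, token_ids, k=32, l=32):
    """token_ids: 1-D LongTensor with at least k + l token ids from the training data."""
    prompt = token_ids[:k].unsqueeze(0)          # shape (1, k)
    reference = token_ids[k:k + l]               # true continuation from the training data
    with torch.no_grad():
        generated = model.generate(
            prompt, max_new_tokens=l, min_new_tokens=l, do_sample=False
        )
    continuation = generated[0, k:k + l]         # the model's greedy continuation
    return (continuation == reference).float().mean().item()

# Usage sketch (checkpoint and text are placeholders):
tokenizer = AutoTokenizer.from_pretrained("LLM360/Amber")
model = AutoModelForCausalLM.from_pretrained("LLM360/Amber")
ids = tokenizer("a sufficiently long passage drawn from the pretraining data ...",
                return_tensors="pt").input_ids[0]
# score = memorization_score(model, ids)  # requires len(ids) >= 64
```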