Performance and Evaluation
Performance and Evaluation is a broad large language model evaluation suite consisting of general and domain-specific benchmarks used to assess model knowledge and function.
*We run these evaluations ourselves, and some results may be sensitive to the exact evaluation settings. We will share details of our evaluation methods shortly.*
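Because results depend on prompting, few-shot counts, and answer extraction, standard benchmarks such as those on the OpenLLM Leaderboard are commonly scored with EleutherAI's lm-evaluation-harness. The sketch below shows one such setup; the harness choice, task names, model identifier, and settings are illustrative assumptions, not a description of our evaluation pipeline.

```python
# Illustrative sketch only: scoring OpenLLM-style benchmarks with EleutherAI's
# lm-evaluation-harness (pip install lm-eval). The model id, task list, and
# settings here are assumptions for illustration; our reported numbers come from
# our own configuration and may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face transformers backend
    model_args="pretrained=LLM360/K2,dtype=bfloat16",
    tasks=["mmlu", "hellaswag", "arc_challenge",
           "winogrande", "truthfulqa_mc2", "gsm8k"],
    batch_size=8,
    # The OpenLLM Leaderboard uses task-specific few-shot counts (e.g. 5-shot MMLU,
    # 25-shot ARC-Challenge); pass num_fewshot per task run to match them exactly.
)

for task, metrics in results["results"].items():
    print(task, metrics)
```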
Average of 22 Evaluations
| Base Models | Overall Eval |
|---|---|
| Llama 3-70B | 63.0 |
| K2 | 58.1 |
| K2-Stage 1 | 56.6 |
| Llama 2-70B | 56.5 |
| Llama-65B | 53.3 |
| Chat & Instruction Models | Overall Eval |
|---|---|
| Llama 3-70B-Instruct | 63.4 |
| K2 Chat | 59.5 |
| Llama 2-70B-Chat | 55.3 |
OpenLLM Leaderboard Metrics
| Models | Average |
|---|---|
| Llama 3-70B | 73.3 |
| Llama 2-70B | 65.8 |
| K2 | 64.3 |
| K2-Stage 1 | 63.9 |
| Llama-65B | 62.6 |
| Models | MMLU | HellaSwag | ARC-C | Winogrande | TruthfulQA | GSM8K |
|---|---|---|---|---|---|---|
| Llama 3-70B | 75 | 87.9 | 69.8 | 81.1 | 45.6 | 80.4 |
| Llama 2-70B | 65.4 | 86.9 | 67.2 | 77.7 | 44.9 | 52.6 |
| K2-Stage 1 | 64.8 | 85.5 | 64.8 | 77 | 40.8 | 50.2 |
| K2 | 62.6 | 83.2 | 61.9 | 79.5 | 40.4 | 58.3 |
| Llama-65B | 59.7 | 85.9 | 63.2 | 77.2 | 42.6 | 47 |
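For reference, the Average column above matches the unweighted mean of the six per-benchmark scores; a minimal check, with the values copied from the table:

```python
# Quick check: the "Average" column above equals the unweighted mean of the six
# benchmark scores (values copied from the table).
scores = {
    "Llama 3-70B": [75, 87.9, 69.8, 81.1, 45.6, 80.4],    # reported: 73.3
    "Llama 2-70B": [65.4, 86.9, 67.2, 77.7, 44.9, 52.6],  # reported: 65.8
    "K2":          [62.6, 83.2, 61.9, 79.5, 40.4, 58.3],  # reported: 64.3
    "K2-Stage 1":  [64.8, 85.5, 64.8, 77, 40.8, 50.2],    # reported: 63.9
    "Llama-65B":   [59.7, 85.9, 63.2, 77.2, 42.6, 47],    # reported: 62.6
}
for model, vals in scores.items():
    print(f"{model}: {sum(vals) / len(vals):.1f}")
```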
OpenLLM Leaderboard Metrics: Chat & Instruction Models
| Models | Average |
|---|---|
| Llama 3-70B-Instruct | 77.6 |
| K2 Chat | 65.2 |
| Llama 2-70B-Chat | 64.8 |
| Models | MMLU | HellaSwag | ARC-C | Winogrande | TruthfulQA | GSM8K |
|---|---|---|---|---|---|---|
| Llama 3-70B-Instruct | 78.6 | 85.6 | 72 | 76.1 | 61.9 | 91.2 |
| K2 Chat | 63.5 | 81.7 | 61.3 | 79.5 | 44.7 | 60.7 |
| Llama 2-70B-Chat | 61.1 | 85.9 | 65.3 | 75.1 | 52.8 | 48.4 |
Math
| Models | Average |
|---|---|
| Llama 3-70B | 67.5 |
| K2 | 51.3 |
| Llama 2-70B | 46.1 |
| K2-Stage 1 | 44.6 |
| Llama-65B | 42.5 |
| Models | MathQA | GSM8K |
|---|---|---|
| Llama 3-70B | 54.5 | 80.4 |
| K2 | 44.2 | 58.3 |
| Llama 2-70B | 39.5 | 52.6 |
| K2-Stage 1 | 39 | 50.2 |
| Llama-65B | 38 | 47 |
Math: Chat & Instruction Models
| Models | Average |
|---|---|
| Llama 3-70B-Instruct | 79.3 |
| K2 Chat | 52.8 |
| Llama 2-70B-Chat | 43.2 |
| Models | MathQA | GSM8K |
|---|---|---|
| Llama 3-70B-Instruct | 67.4 | 91.2 |
| K2 Chat | 44.8 | 60.7 |
| Llama 2-70B-Chat | 38 | 48.4 |
Medical
| Models | Average |
|---|---|
| Llama 3-70B | 75.5 |
| K2-Stage 1 | 62.8 |
| Llama 2-70B | 60.8 |
| K2-Stage 2 | 59.6 |
| Llama-65B | 56.5 |
| Models | MedQA | MedMCQA | PubMedQA |
|---|---|---|---|
| Llama 3-70B | 78.3 | 70.8 | 77.4 |
| Llama 2-70B | 56.2 | 51.8 | 74.4 |
| K2-Stage 1 | 53.7 | 56 | 78.6 |
| K2-Stage 2 | 51.7 | 53.5 | 73.6 |
| Llama-65B | 46.2 | 46.9 | 76.4 |
Medical: Chat & Instruction Models
| Models | Average |
|---|---|
| Llama 3-70B-Instruct | 75.7 |
| K2 Chat | 60 |
| Llama 2-70B-Chat | 57.2 |
| Models | MedQA | MedMCQA | PubMedQA |
|---|---|---|---|
| Llama 3-70B-Instruct | 76.4 | 71 | 79.6 |
| K2 Chat | 53.6 | 51.3 | 75 |
| Llama 2-70B-Chat | 50 | 44.8 | 76.8 |
Multiple Choice
| Models | Average |
|---|---|
| Llama 3-70B | 67.7 |
| Llama 2-70B | 61.1 |
| K2-Stage 1 | 60.2 |
| K2 | 59.9 |
| Llama-65B | 58.7 |
Multiple Choice: Chat & Instruction Models
| Models | Average |
|---|---|
| Llama 3-70B-Instruct | 70.3 |
| K2 Chat | 60.1 |
| Llama 2-70B-Chat | 59.4 |
| Models | RACE | PIQA | ARC-E | OpenBookQA | CrowS-Pairs | ToxiGen | LogiQA | MMLU | HellaSwag | ARC-C | Winogrande | TruthfulQA | MedMCQA | PubMedQA | MathQA | GSM8K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3-70B-Instruct | 47 | 85 | 89.8 | 55.2 | 71.1 | 45.6 | 41.5 | 78.6 | 85.6 | 72 | 76.1 | 61.9 | 76.4 | 71 | 67.4 | 91.2 |
| K2 Chat | 46.1 | 82.3 | 84.6 | 48 | 64.2 | 43.2 | 38 | 64.9 | 81.7 | 61.3 | 79.5 | 44.7 | 53.6 | 51.3 | 44.8 | 60.7 |
| Llama 2-70B-Chat | 44 | 81.8 | 85.5 | 47.2 | 71.9 | 43.9 | 37.7 | 63 | 85.9 | 65.3 | 75.1 | 52.8 | 44.8 | 76.8 | 38 | 48.4 |