from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

model_name_or_path = "TheBloke/OpenHermes-2.5-Mistral-7B-AWQ"

tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")
# model = AutoModelForCausalLM.from_pretrained(
#     model_name_or_path, torch_dtype=torch.float16, device_map="auto", attn_implementation="flash_attention_2"
# )
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto")

article = '''

== BEGIN ARTICLE ==

Llama 2 : Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗Louis Martin†Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang
Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang
Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic
Sergey Edunov Thomas Scialom∗
GenAI, Meta
Abstract
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned
large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
Our fine-tuned LLMs, called Llama 2-Chat , are optimized for dialogue use cases. Our
models outperform open-source chat models on most benchmarks we tested, and based on
ourhumanevaluationsforhelpfulnessandsafety,maybeasuitablesubstituteforclosed-
source models. We provide a detailed description of our approach to fine-tuning and safety
improvements of Llama 2-Chat in order to enable the community to build on our work and
contribute to the responsible development of LLMs.
∗Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com
†Second author
2
Figure 1: Helpfulness human evaluation results for Llama
2-Chatcomparedtootheropen-sourceandclosed-source
models. Human raters compared model generations on ~4k
promptsconsistingofbothsingleandmulti-turnprompts.
The95%confidenceintervalsforthisevaluationarebetween
1%and2%. MoredetailsinSection3.4.2. Whilereviewing
these results, it is important to note that human evaluations
canbenoisyduetolimitationsofthepromptset,subjectivity
of the review guidelines, subjectivity of individual raters,
and the inherent difficulty of comparing generations.
Figure 2: Win-rate % for helpfulness and
safety between commercial-licensed base-
lines and Llama 2-Chat , according to GPT-
4. Tocomplementthehumanevaluation,we
used a more capable model, not subject to
ourownguidance. Greenareaindicatesour
modelisbetteraccordingtoGPT-4. Toremove
ties, we used win/ (win+loss). The orders in
whichthemodelresponsesarepresentedto
GPT-4arerandomlyswappedtoalleviatebias.
1 Introduction
Large Language Models (LLMs) have shown great promise as highly capable AI assistants that excel in
complex reasoning tasks requiring expert knowledge across a wide range of fields, including in specialized
domains such as programming and creative writing. They enable interaction with humans through intuitive
chat interfaces, which has led to rapid and widespread adoption among the general public.
ThecapabilitiesofLLMsareremarkableconsideringtheseeminglystraightforwardnatureofthetraining
methodology. Auto-regressivetransformersarepretrainedonanextensivecorpusofself-superviseddata,
followed by alignment with human preferences via techniques such as Reinforcement Learning with Human
Feedback(RLHF).Althoughthetrainingmethodologyissimple,highcomputationalrequirementshave
limited the development of LLMs to a few players. There have been public releases of pretrained LLMs
(such as BLOOM (Scao et al., 2022), LLaMa-1 (Touvron et al., 2023), and Falcon (Penedo et al., 2023)) that
match the performance of closed pretrained competitors like GPT-3 (Brown et al., 2020) and Chinchilla
(Hoffmann et al., 2022), but none of these models are suitable substitutes for closed “product” LLMs, such
asChatGPT,BARD,andClaude. TheseclosedproductLLMsareheavilyfine-tunedtoalignwithhuman
preferences, which greatly enhances their usability and safety. This step can require significant costs in
computeandhumanannotation,andisoftennottransparentoreasilyreproducible,limitingprogresswithin
the community to advance AI alignment research.
In this work, we develop and release Llama 2, a family of pretrained and fine-tuned LLMs, Llama 2 and
Llama 2-Chat , at scales up to 70B parameters. On the series of helpfulness and safety benchmarks we tested,
Llama 2-Chat models generally perform better than existing open-source models. They also appear to
be on par with some of the closed-source models, at least on the human evaluations we performed (see
Figures1and3). Wehavetakenmeasurestoincreasethesafetyofthesemodels,usingsafety-specificdata
annotation and tuning, as well as conducting red-teaming and employing iterative evaluations. Additionally,
thispapercontributesathoroughdescriptionofourfine-tuningmethodologyandapproachtoimproving
LLM safety. We hope that this openness will enable the community to reproduce fine-tuned LLMs and
continue to improve the safety of those models, paving the way for more responsible development of LLMs.
Wealsosharenovelobservationswemadeduringthedevelopmentof Llama 2 andLlama 2-Chat ,suchas
the emergence of tool usage and temporal organization of knowledge.
3
Figure 3: Safety human evaluation results for Llama 2-Chat compared to other open-source and closed-
source models. Human raters judged model generations for safety violations across ~2,000 adversarial
prompts consisting of both single and multi-turn prompts. More details can be found in Section 4.4. It is
importanttocaveatthesesafetyresultswiththeinherentbiasofLLMevaluationsduetolimitationsofthe
promptset,subjectivityofthereviewguidelines,andsubjectivityofindividualraters. Additionally,these
safety evaluations are performed using content standards that are likely to be biased towards the Llama
2-Chatmodels.
We are releasing the following models to the general public for research and commercial use‡:
1.Llama 2 ,anupdatedversionof Llama 1,trainedonanewmixofpubliclyavailabledata. Wealso
increasedthesizeofthepretrainingcorpusby40%,doubledthecontextlengthofthemodel,and
adoptedgrouped-queryattention(Ainslieetal.,2023). Wearereleasingvariantsof Llama 2 with
7B,13B,and70Bparameters. Wehavealsotrained34Bvariants,whichwereportoninthispaper
but are not releasing.§
2.Llama 2-Chat , a fine-tuned version of Llama 2 that is optimized for dialogue use cases. We release
variants of this model with 7B, 13B, and 70B parameters as well.
WebelievethattheopenreleaseofLLMs,whendonesafely,willbeanetbenefittosociety. LikeallLLMs,
Llama 2 is a new technology that carries potential risks with use (Bender et al., 2021b; Weidinger et al., 2021;
Solaimanet al.,2023). Testingconductedtodate hasbeeninEnglish andhasnot— andcouldnot— cover
all scenarios. Therefore, before deploying any applications of Llama 2-Chat , developers should perform
safetytestingand tuningtailoredtotheirspecificapplicationsofthemodel. Weprovidearesponsibleuse
guide¶and code examples‖to facilitate the safe deployment of Llama 2 andLlama 2-Chat . More details of
our responsible release strategy can be found in Section 5.3.
Theremainderofthispaperdescribesourpretrainingmethodology(Section2),fine-tuningmethodology
(Section 3), approach to model safety (Section 4), key observations and insights (Section 5), relevant related
work (Section 6), and conclusions (Section 7).
‡https://ai.meta.com/resources/models-and-libraries/llama/
§We are delaying the release of the 34B model due to a lack of time to sufficiently red team.
¶https://ai.meta.com/llama
‖https://github.com/facebookresearch/llama
4
Figure4: Trainingof Llama 2-Chat : Thisprocessbeginswiththe pretraining ofLlama 2 usingpublicly
availableonlinesources. Followingthis,wecreateaninitialversionof Llama 2-Chat throughtheapplication
ofsupervised fine-tuning . Subsequently, the model is iteratively refined using Reinforcement Learning
with Human Feedback (RLHF) methodologies, specifically through rejection sampling and Proximal Policy
Optimization(PPO).ThroughouttheRLHFstage,theaccumulationof iterativerewardmodelingdata in
parallel with model enhancements is crucial to ensure the reward models remain within distribution.
2 Pretraining
Tocreatethenewfamilyof Llama 2models,webeganwiththepretrainingapproachdescribedinTouvronetal.
(2023), using an optimized auto-regressive transformer, but made several changes to improve performance.
Specifically,weperformedmorerobustdatacleaning,updatedourdatamixes,trainedon40%moretotal
tokens,doubledthecontextlength,andusedgrouped-queryattention(GQA)toimproveinferencescalability
for our larger models. Table 1 compares the attributes of the new Llama 2 models with the Llama 1 models.
2.1 Pretraining Data
Our training corpus includes a new mix of data from publicly available sources, which does not include data
fromMeta’sproductsorservices. Wemadeanefforttoremovedatafromcertainsitesknowntocontaina
highvolumeofpersonalinformationaboutprivateindividuals. Wetrainedon2trilliontokensofdataasthis
providesagoodperformance–costtrade-off,up-samplingthemostfactualsourcesinanefforttoincrease
knowledge and dampen hallucinations.
Weperformedavarietyofpretrainingdatainvestigationssothatuserscanbetterunderstandthepotential
capabilities and limitations of our models; results can be found in Section 4.1.
2.2 Training Details
We adopt most of the pretraining setting and model architecture from Llama 1 . We use the standard
transformer architecture (Vaswani et al., 2017), apply pre-normalization using RMSNorm (Zhang and
Sennrich, 2019), use the SwiGLU activation function (Shazeer, 2020), and rotary positional embeddings
(RoPE, Su et al. 2022). The primary architectural differences from Llama 1 include increased context length
andgrouped-queryattention(GQA).WedetailinAppendixSectionA.2.1eachofthesedifferenceswith
ablation experiments to demonstrate their importance.
Hyperparameters. We trained using the AdamW optimizer (Loshchilov and Hutter, 2017), with β1=
0.9, β2= 0.95,eps= 10−5. We use a cosine learning rate schedule, with warmup of 2000 steps, and decay
finallearningratedownto10%ofthepeaklearningrate. Weuseaweightdecayof 0.1andgradientclipping
of1.0. Figure 5 (a) shows the training loss for Llama 2 with these hyperparameters.
5
Training Data Params Context
LengthGQA Tokens LR
Llama 1See Touvron et al.
(2023)7B 2k ✗ 1.0T 3.0×10−4
13B 2k ✗ 1.0T 3.0×10−4
33B 2k ✗ 1.4T 1.5×10−4
65B 2k ✗ 1.4T 1.5×10−4
Llama 2A new mix of publicly
available online data7B 4k ✗ 2.0T 3.0×10−4
13B 4k ✗ 2.0T 3.0×10−4
34B 4k ✓ 2.0T 1.5×10−4
70B 4k ✓ 2.0T 1.5×10−4
Table 1: Llama 2 family of models. Token counts refer to pretraining data only. All models are trained with
a global batch-size of 4M tokens. Bigger models — 34B and 70B — use Grouped-Query Attention (GQA) for
improved inference scalability.
0 250 500 750 1000 1250 1500 1750 2000
Processed Tokens (Billions)1.41.51.61.71.81.92.02.12.2Train PPLLlama-2
7B
13B
34B
70B
Figure 5: Training Loss for Llama 2 models. We compare the training loss of the Llama 2 family of models.
We observe that after pretraining on 2T Tokens, the models still did not show any sign of saturation.
Tokenizer. Weusethesametokenizeras Llama 1;itemploysabytepairencoding(BPE)algorithm(Sennrich
etal.,2016)usingtheimplementationfromSentencePiece(KudoandRichardson,2018). Aswith Llama 1,
we split all numbers into individual digits and use bytes to decompose unknown UTF-8 characters. The total
vocabulary size is 32k tokens.
2.2.1 Training Hardware & Carbon Footprint
TrainingHardware. WepretrainedourmodelsonMeta’sResearchSuperCluster(RSC)(LeeandSengupta,
2022)aswellasinternalproductionclusters. BothclustersuseNVIDIAA100s. Therearetwokeydifferences
between the two clusters, with the first being the type of interconnect available: RSC uses NVIDIA Quantum
InfiniBandwhileourproductionclusterisequippedwithaRoCE(RDMAoverconvergedEthernet)solution
based on commodity ethernet Switches. Both of these solutions interconnect 200 Gbps end-points. The
seconddifferenceistheper-GPUpowerconsumptioncap—RSCuses400Wwhileourproductioncluster
uses350W.Withthistwo-clustersetup,wewereabletocomparethesuitabilityofthesedifferenttypesof
interconnectforlargescaletraining. RoCE(whichisamoreaffordable,commercialinterconnectnetwork)
6
Time
(GPU hours)Power
Consumption (W)Carbon Emitted
(tCO 2eq)
Llama 27B 184320 400 31.22
13B 368640 400 62.44
34B 1038336 350 153.90
70B 1720320 400 291.42
Total 3311616 539.00
Table 2: CO2emissions during pretraining. Time: total GPU time required for training each model. Power
Consumption: peak power capacity per GPU device for the GPUs used adjusted for power usage efficiency.
100%oftheemissionsaredirectlyoffsetbyMeta’ssustainabilityprogram,andbecauseweareopenlyreleasing
these models, the pretraining costs do not need to be incurred by others.
can scale almost as well as expensive Infiniband up to 2000 GPUs, which makes pretraining even more
democratizable. On A100s with RoCE and GPU power capped at 350W, our optimized codebase reached up
to 90% of the performance of RSC using IB interconnect and 400W GPU power.
Carbon Footprint of Pretraining. Following preceding research (Bender et al., 2021a; Patterson et al., 2021;
Wu et al., 2022; Dodge et al., 2022) and using power consumption estimates of GPU devices and carbon
efficiency, we aim tocalculate thecarbon emissions resultingfrom the pretrainingof Llama 2 models. The
actualpowerusageofaGPUisdependentonitsutilizationandislikelytovaryfromtheThermalDesign
Power(TDP)thatweemployasanestimationforGPUpower. Itisimportanttonotethatourcalculations
do not account for further power demands, such as those from interconnect or non-GPU server power
consumption,norfromdatacentercoolingsystems. Additionally,thecarbonoutputrelatedtotheproduction
of AI hardware, like GPUs, could add to the overall carbon footprint as suggested by Gupta et al. (2022b,a).
Table 2 summarizes the carbon emission for pretraining the Llama 2 family of models. A cumulative of
3.3M GPUhours ofcomputation wasperformed onhardware oftype A100-80GB (TDPof 400Wor 350W).
We estimate the total emissions for training to be 539 tCO 2eq, of which 100% were directly offset by Meta’s
sustainability program.∗∗Our open release strategy also means that these pretraining costs will not need to
be incurred by other companies, saving more global resources.
2.3 Llama 2 Pretrained Model Evaluation
In this section, we report the results for the Llama 1 andLlama 2 base models, MosaicML Pretrained
Transformer(MPT)††models,andFalcon(Almazroueietal.,2023)modelsonstandardacademicbenchmarks.
For all the evaluations, we use our internal evaluations library. We reproduce results for the MPT and Falcon
modelsinternally. Forthesemodels,wealwayspickthebestscorebetweenourevaluationframeworkand
any publicly reported results.
InTable3,wesummarizetheoverallperformanceacrossasuiteofpopularbenchmarks. Notethatsafety
benchmarks are shared in Section 4.1. The benchmarks are grouped into the categories listed below. The
results for all the individual benchmarks are available in Section A.2.2.
•Code.Wereporttheaveragepass@1scoresofourmodelsonHumanEval(Chenetal.,2021)and
MBPP (Austin et al., 2021).
•CommonsenseReasoning. WereporttheaverageofPIQA(Bisketal.,2020),SIQA(Sapetal.,2019),
HellaSwag (Zellers et al., 2019a), WinoGrande (Sakaguchi et al., 2021), ARC easy and challenge
(Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), and CommonsenseQA (Talmor et al.,
2018). We report 7-shot results for CommonSenseQA and 0-shot results for all other benchmarks.
•World Knowledge. We evaluate the 5-shot performance on NaturalQuestions (Kwiatkowski et al.,
2019) and TriviaQA (Joshi et al., 2017) and report the average.
•Reading Comprehension. For reading comprehension, we report the 0-shot average on SQuAD
(Rajpurkar et al., 2018), QuAC (Choi et al., 2018), and BoolQ (Clark et al., 2019).
∗∗https://sustainability.fb.com/2021-sustainability-report/
††https://www.mosaicml.com/blog/mpt-7b
7
Model Size CodeCommonsense
ReasoningWorld
KnowledgeReading
ComprehensionMath MMLU BBH AGI Eval
MPT7B 20.5 57.4 41.0 57.5 4.9 26.8 31.0 23.5
30B 28.9 64.9 50.0 64.7 9.1 46.9 38.0 33.8
Falcon7B 5.6 56.1 42.8 36.0 4.6 26.2 28.0 21.2
40B 15.2 69.2 56.7 65.7 12.6 55.4 37.1 37.0
Llama 17B 14.1 60.8 46.2 58.5 6.95 35.1 30.3 23.9
13B 18.9 66.1 52.6 62.3 10.9 46.9 37.0 33.9
33B 26.0 70.0 58.4 67.6 21.4 57.8 39.8 41.7
65B 30.7 70.7 60.5 68.6 30.8 63.4 43.5 47.6
Llama 27B 16.8 63.9 48.9 61.3 14.6 45.3 32.6 29.3
13B 24.5 66.9 55.4 65.8 28.7 54.8 39.4 39.1
34B 27.8 69.9 58.7 68.0 24.2 62.6 44.1 43.4
70B37.5 71.9 63.6 69.4 35.2 68.9 51.2 54.2
Table3: Overallperformanceongroupedacademicbenchmarkscomparedtoopen-sourcebasemodels.
•MATH. We report the average of the GSM8K (8 shot) (Cobbe et al., 2021) and MATH (4 shot)
(Hendrycks et al., 2021) benchmarks at top 1.
•Popular Aggregated Benchmarks . We report the overall results for MMLU (5 shot) (Hendrycks
et al., 2020), Big Bench Hard (BBH) (3 shot) (Suzgun et al., 2022), and AGI Eval (3–5 shot) (Zhong
et al., 2023). For AGI Eval, we only evaluate on the English tasks and report the average.
As shown in Table 3, Llama 2 models outperform Llama 1 models. In particular, Llama 2 70B improves the
resultsonMMLUandBBHby ≈5and≈8points,respectively,comparedto Llama 1 65B.Llama 2 7Band30B
modelsoutperformMPTmodelsofthecorrespondingsizeonallcategoriesbesidescodebenchmarks. Forthe
Falcon models, Llama 2 7B and 34B outperform Falcon 7B and 40B models on all categories of benchmarks.
Additionally, Llama 2 70B model outperforms all open-source models.
In addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown
in Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant
gaponcodingbenchmarks. Llama 2 70BresultsareonparorbetterthanPaLM(540B)(Chowdheryetal.,
2022)onalmostallbenchmarks. Thereisstillalargegapinperformancebetween Llama 2 70BandGPT-4
and PaLM-2-L.
We also analysed the potential data contamination and share the details in Section A.6.
Benchmark (shots) GPT-3.5 GPT-4 PaLM PaLM-2-L Llama 2
MMLU (5-shot) 70.0 86.4 69.3 78.3 68.9
TriviaQA (1-shot) – – 81.4 86.1 85.0
Natural Questions (1-shot) – – 29.3 37.5 33.0
GSM8K (8-shot) 57.1 92.0 56.5 80.7 56.8
HumanEval (0-shot) 48.1 67.0 26.2 – 29.9
BIG-Bench Hard (3-shot) – – 52.3 65.7 51.2
Table 4: Comparison to closed-source models on academic benchmarks. Results for GPT-3.5 and GPT-4
are from OpenAI (2023). Results for the PaLM model are from Chowdhery et al. (2022). Results for the
PaLM-2-L are from Anil et al. (2023).

== END ARTICLE ==

'''

chat = [
    {
        "role": "system",
        "content": "You are Hermes 2, an unbiased AI assistant, and you always answer to the best of your ability."
    },
    {
        "role": "user",
        "content": (
            "You are given a partial and unparsed scientific article, please read it carefully and complete the "
            f"request below.{article}Please summarize the article in 5 sentences."
        )
    },
]
processed_chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer(processed_chat, return_tensors='pt').to(model.device)

streamer = TextStreamer(tokenizer)

# Run and measure generate
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
torch.cuda.reset_peak_memory_stats(model.device)
torch.cuda.empty_cache()
torch.cuda.synchronize()
start_event.record()
generation_output = model.generate(**input_ids, do_sample=False, max_new_tokens=512, streamer=streamer)
# generation_output = model.generate(**input_ids, do_sample=False, max_new_tokens=512, streamer=streamer, prompt_lookup_num_tokens=10)
end_event.record()
torch.cuda.synchronize()
max_memory = torch.cuda.max_memory_allocated(model.device)
print("Max memory (MB): ", max_memory * 1e-6)
new_tokens = generation_output.shape[1] - input_ids.input_ids.shape[1]
print("Throughput (tokens/sec): ", new_tokens / (start_event.elapsed_time(end_event) * 1.0e-3))