Llama 3 is reshaping the landscape of generative AI with an unprecedented scale of openly released model weights. GPT-4 Turbo, celebrated for its exceptional quality, boasts an MMLU (Massive Multitask Language Understanding) score of 86. However, Llama 3 is closing the gap: its 8-billion-parameter version scores 66.6 MMLU, while the 70-billion-parameter variant reaches 79.5. The upcoming Llama 3 model, with over 400 billion parameters, achieves an MMLU of 86.1, positioning it as a game-changer in the realm of generative AI.

Enhanced Capabilities with Llama 3

Llama 3 is available in two sizes, 8B and 70B parameters, each in both pre-trained and instruction-fine-tuned versions.

Meta frames its roadmap this way: "The text-based models we are releasing today are the first in the Llama 3 collection of models. Our goal in the near future is to make Llama 3 multilingual and multimodal, have longer context, and continue to improve overall performance across core LLM capabilities such as reasoning and coding."

Moreover, Meta is currently developing a 400+ billion parameter model set to launch soon, purportedly poised to rival advanced closed LLMs such as GPT-4.

Model Architecture

Llama 3 is a decoder-only model, which simplifies its architecture by focusing on generating output conditioned on the input it receives; the Llama 2 family used a decoder-only architecture as well. For clarity, there are three main types of architectures in language models:

1. Decoder Only: These models generate text based on the context they receive. They are optimized for tasks like text completion and generation, where the focus is on producing coherent and contextually relevant output. (This is essentially what we understand LLMs to do anyway).

2. Encoder Only: These models are primarily used for tasks that involve understanding input, such as text classification or sentiment analysis, where the model assesses and processes input without the need to generate new text. (This sounds so much like classic AI).

3. Encoder-Decoder: This architecture combines both encoding and decoding capabilities, enabling the model to understand input (encode) and generate output (decode). It's versatile for tasks like translation or summarizing, where both understanding and generating text are necessary. (This sounds like combining RAG with LLMs as we know them).

In the current landscape, many leading large language models (LLMs) are indeed decoder-only, emphasizing their role in generating extensive, coherent text from prompts. These models typically do not convert external data into embeddings directly; instead, they work with embeddings that have been pre-processed by other mechanisms or during initial training stages. This approach allows them to focus on generating high-quality text based on those embeddings.
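To make the decoder-only idea concrete, here is a toy sketch (plain Python, not Llama 3's actual code) of the causal attention mask that makes such models autoregressive: position i may only attend to positions 0 through i, so text is produced left to right.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix: 1 = may attend, 0 = masked."""
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]

# Each token sees only itself and what came before it.
for row in causal_mask(4):
    print(row)
```

An encoder-only model would instead use an all-ones mask, letting every token attend to the full input in both directions.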


Training Data

Llama 3 is trained on over 15 trillion tokens, with content gleaned from publicly available sources. As readers may know, in pre-training more data does not automatically equate to more intelligence. To address this, Meta employed a series of data-filtering pipelines, including heuristic filters, NSFW filters, semantic deduplication approaches, and text classifiers to predict data quality. In common parlance, Llama sniffs out its food before ingesting it.
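A filtering pipeline of this kind can be sketched as a chain of passes over the corpus. The sketch below is purely illustrative: the function names, thresholds, and the trivial quality score are my own stand-ins, not Meta's actual filters (which include learned classifiers and semantic, not exact, deduplication).

```python
def heuristic_filter(doc):
    # Drop very short documents and ones dominated by symbols.
    alpha = sum(c.isalpha() for c in doc)
    return len(doc) >= 20 and alpha / len(doc) > 0.5

def deduplicate(docs):
    # Crude exact-match dedup; semantic dedup is far more subtle.
    seen, kept = set(), []
    for doc in docs:
        key = doc.strip().lower()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

def quality_score(doc):
    # Stand-in for a learned text classifier: word-rich docs score higher.
    return min(1.0, len(doc.split()) / 50)

def filter_corpus(docs, min_quality=0.2):
    docs = [d for d in docs if heuristic_filter(d)]
    docs = deduplicate(docs)
    return [d for d in docs if quality_score(d) >= min_quality]
```

Chaining cheap heuristics before the expensive classifier is the usual design: each stage shrinks the corpus the next one must score.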

Another interesting point Meta mentioned is that the model's performance continued to improve even beyond 15 trillion tokens (one might wonder what would happen if the number of tokens exceeded the US GDP). Obviously, larger models are less efficient at inference time, so models of multiple sizes are needed to meet various needs.

Three Parallel Tracks

To maximize parallelism when training its largest models, Meta used three tracks:

  1. Data parallelization
  2. Model parallelization
  3. Pipeline parallelization
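The three tracks above partition different things: the batch, the weights, and the layers. The pure-Python sketch below is conceptual only (real training distributes work across GPUs via frameworks); it just shows what gets split in two of the schemes.

```python
def data_parallel_shards(batch, n_workers):
    """Data parallelism: each worker gets a slice of the batch
    and runs a full copy of the model on it."""
    k = len(batch) // n_workers
    return [batch[i * k:(i + 1) * k] for i in range(n_workers)]

def pipeline_stages(layers, n_stages):
    """Pipeline parallelism: each device owns a contiguous run of
    layers, and activations flow stage to stage."""
    k = len(layers) // n_stages
    return [layers[i * k:(i + 1) * k] for i in range(n_stages)]

# Model (tensor) parallelism would further split the weight matrices
# inside a single layer across devices; it is omitted here.
shards = data_parallel_shards(list(range(8)), 4)
stages = pipeline_stages([f"layer{i}" for i in range(12)], 3)
```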

In addition, to maximize uptime, Meta introduced automated error detection, handling, and maintenance for its training infrastructure.

Instruction Fine-Tuning

Instruction fine-tuning, as outlined by Meta, represents a sophisticated phase in model training where traditional methods like Supervised Fine-Tuning (SFT) are augmented with reinforcement strategies such as Rejection Sampling, Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO). Essentially, this approach is poised to supersede the combination of SFT and the infamous Reinforcement Learning with Human Feedback (RLHF), aiming to refine the model's adeptness at following detailed user commands.
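Of these, DPO is the easiest to show in miniature: it turns preference data directly into a loss, with no separate reward model. Below is a minimal sketch of the published DPO objective for a single preference pair; the log-probabilities in the test are made up, not taken from any real model.

```python
import math

def dpo_loss(policy_chosen, policy_rejected,
             ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))),
    where the inputs are log-probs of the chosen (w) and rejected (l)
    responses under the policy and a frozen reference model."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))
```

The loss shrinks as the policy, relative to the reference, raises the chosen response's likelihood above the rejected one's; beta controls how hard it is pushed away from the reference.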


One key takeaway about benchmarks: they should be taken seriously, not literally. Vendors naturally showcase their models in the best possible light, which explains the variance in metrics like MMLU 5-shot across different sources. Ultimately, it is the overarching trend that counts, and here Meta has significantly reshaped the landscape by setting an exceptionally high standard.

The revelation that Llama 3's 400-billion-parameter model matches the prowess of GPT-4 has profound implications. This breakthrough is not only a wake-up call but also a catalyst that will undoubtedly keep industry leaders, including those at OpenAI, on high alert as they navigate this evolving terrain.
