Introduction
In recent years, Large Language Models (LLMs) have revolutionized the field of AI, becoming an essential tool for many tasks. The main component in these models’ architecture is a large Transformer. But before the Transformer can process our prompts, LLMs rely on another crucial component, the tokenizer, which converts the prompt into tokens from the model’s vocabulary.
In the image above, we see how GPT-4 tokenizes the sentence “will tokenization eventually be dead?”: it mostly assigns one token per word, except for the word “tokenization”, which is split into two tokens. The LLM then processes this tokenized input and generates a response using the same token vocabulary.
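You can reproduce this yourself with the open-source tiktoken library, which exposes the tokenizer GPT-4 uses (the exact split may vary slightly across tokenizer versions):

```python
import tiktoken

# Load the tokenizer used by GPT-4 (the cl100k_base encoding).
enc = tiktoken.encoding_for_model("gpt-4")
ids = enc.encode("will tokenization eventually be dead?")

# Decode each token id individually to see the word pieces.
print([enc.decode([i]) for i in ids])
# e.g. ['will', ' token', 'ization', ' eventually', ' be', ' dead', '?']
```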
However, this method differs significantly from how humans analyze information and generate creative content, as humans operate at multiple levels of abstraction, far beyond individual words.
Introducing Large Concept Models (LCMs)
A recent research paper from Meta aims to bridge this gap. The paper is titled Large Concept Models: Language Modeling in a Sentence Representation Space, and it introduces a new architecture called Large Concept Models (LCMs). Unlike traditional LLMs that process tokens, LCMs work with concepts.
Understanding Concepts vs. Tokens
Concepts represent the semantics of higher-level ideas or actions and are not tied to specific single words. Furthermore, concepts are not restricted to language alone and can be derived from multiple modalities. For instance, the concept behind a particular sentence remains consistent whether it is in English, another language, or conveyed through text or voice.
Better Long Context Handling
One significant advantage of processing concepts is better handling of long context inputs. Since the concept sequence is much shorter than the token sequence for the same input, the quadratic cost of attention over long sequences becomes far less of a bottleneck.
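For a rough sense of scale (illustrative numbers, not from the paper): a 2,000-token document averaging 20 tokens per sentence becomes a sequence of only about 100 concepts. With self-attention cost growing quadratically in sequence length, that is on the order of a 400× reduction in attention compute.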
Hierarchical Reasoning
A key observation is that processing concepts rather than subword tokens allows for better hierarchical reasoning. To illustrate this, the following example from the paper may help. Imagine a researcher giving a 15-minute talk. The researcher typically wouldn’t prepare a detailed speech by writing out every single word. Instead, the researcher would outline a flow of higher-level ideas to communicate during the speech.
Should the researcher give the same talk multiple times, the actual words spoken may differ, and the talk could even be given in different languages, but the flow of higher-level abstract ideas will remain the same.
Example of Concept-Based Reasoning
The above figure from the paper shows an example of reasoning in an embedding space of concepts for a summarization task. On the left, we have the embeddings of five sentences, which are our concepts. To create a summary of these sentences, the five concepts are mapped to two new concept embeddings that represent the summary. To understand how this works, let’s review the high-level architecture of LCMs.
High-Level Architecture of Large Concept Models (LCMs)
Let’s walk through the high-level architecture of LCMs. The process begins with an input sequence of words divided into sentences, which are assumed to be the basic building blocks representing concepts.
Concept Encoder (SONAR)
These sentences are first passed through a concept encoder, which encodes them into concept embeddings. The concept encoder used in this paper is an existing encoder and decoder component called SONAR. SONAR supports 200 languages as text input and output—more than double the number of languages supported by most LLMs today. It also accepts 76 languages as speech input.
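For intuition, here is roughly how sentences are encoded into concept embeddings with the open-source SONAR package (the pipeline names below follow the SONAR repo; consult it for the exact API):

```python
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

# Build the text-to-embedding pipeline from the pretrained SONAR encoder.
t2vec = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
)

sentences = [
    "Concepts are not tied to specific words.",
    "The same idea can be expressed in many languages.",
]
# Each sentence becomes one fixed-size concept embedding (here a 2 x 1024 tensor).
embeddings = t2vec.predict(sentences, source_lang="eng_Latn")
```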
Large Concept Model (LCM)
Next, the sequence of concepts is processed by a Large Concept Model to generate a new sequence of concepts at the output. The main component, the LCM itself, operates solely in the embedding space, making it independent of any specific language or modality. This approach could therefore be extended beyond the text and speech modalities explored in the paper.
Concept Decoder (SONAR)
Finally, the generated concepts are decoded back into language using SONAR. The decoder can convert the output of the LCM into more than one language or even more than one modality.
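As a sketch of this flexibility, using SONAR’s decoding pipeline (`lcm_output` below is a random stand-in for real LCM output):

```python
import torch
from sonar.inference_pipelines.text import EmbeddingToTextModelPipeline

# Build the embedding-to-text pipeline from the pretrained SONAR decoder.
vec2text = EmbeddingToTextModelPipeline(
    decoder="text_sonar_basic_decoder",
    tokenizer="text_sonar_basic_encoder",
)

lcm_output = torch.randn(3, 1024)  # stand-in for concepts generated by an LCM

# The same concept embeddings can be decoded into several languages
# without running the LCM again.
english = vec2text.predict(lcm_output, target_lang="eng_Latn")
french = vec2text.predict(lcm_output, target_lang="fra_Latn")
```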
Hierarchical Structure
The hierarchical structure is explicit in the architecture: first extract concepts, then reason over these concepts, and finally decode the output, possibly decoding multiple times (for example, into several languages) without needing to run the LCM again.
Resembles JEPA
It is worth noting that the concept of predicting information in an abstract representation space is not entirely new to Meta AI. This idea is somewhat similar to the Joint Embedding Predictive Architecture (JEPA) from previous work by Meta, which aligns with Yann LeCun’s vision for a more human-like AI. We have covered the JEPA models for images (I-JEPA) and videos (V-JEPA) extensively in previous reviews.
Inner Architecture of Large Concept Models (LCMs)
We’re now ready to delve into the different architectures of Large Concept Models. Below we explore Base-LCM, the first attempt at building a Large Concept Model, and afterwards we review diffusion-based LCMs, an improved LCM architecture.
Base-LCM: Large Concept Model Naive Architecture
Let’s now delve into the first approach to defining a Large Concept Model (LCM). This method is analogous to training a large language model to predict the next token; however, instead of predicting the next token, the model is trained to predict the next concept within the concept embedding space. This version is referred to as Base-LCM.
In the above figure from the paper, we see the high-level architecture of Base-LCM. At the bottom left, we have a sequence of concepts. This sequence, excluding the last concept, is fed into the model to predict the next concept. The output is then compared to the actual next concept, which was not included in the model input, and a mean squared error (MSE) loss is used to train the model.
The model comprises a main Transformer decoder component, along with smaller components before and after the Transformer, referred to as PreNet and PostNet. The PreNet component normalizes the concept embeddings received from SONAR and maps them into the Transformer’s dimension. The PostNet component projects the model output back to SONAR’s dimension.
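To make the structure concrete, here is a minimal PyTorch sketch of the Base-LCM shape and its training objective. Dimensions, layer counts, and the exact PreNet/PostNet details are illustrative, not the paper’s configuration:

```python
import torch
import torch.nn as nn

class BaseLCM(nn.Module):
    def __init__(self, sonar_dim=1024, d_model=2048, n_layers=8, n_heads=16):
        super().__init__()
        self.prenet = nn.Linear(sonar_dim, d_model)    # map SONAR embeddings into model dim
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)  # run causally, decoder-style
        self.postnet = nn.Linear(d_model, sonar_dim)   # project back to SONAR dim

    def forward(self, concepts):                       # (batch, seq, sonar_dim)
        h = self.prenet(concepts)
        mask = nn.Transformer.generate_square_subsequent_mask(h.size(1))
        h = self.decoder(h, mask=mask, is_causal=True)
        return self.postnet(h)

# One training step: each position predicts the following concept, trained with MSE.
model = BaseLCM()
seq = torch.randn(4, 8, 1024)                      # toy batch of concept sequences
pred = model(seq[:, :-1])                          # inputs exclude the last concept
loss = nn.functional.mse_loss(pred, seq[:, 1:])    # targets exclude the first
```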
Base-LCM Limitation
Unlike large language models, which learn a probability distribution over the next token, this model is trained to regress one specific target concept. However, for a given context there are usually many other concepts that would make sense, and a model trained with MSE tends to predict something like their average, which may not itself be a meaningful concept.
This leads us to the next version of the LCM architecture. The challenge of having many plausible outputs for a given input has already been tackled in the image generation domain: if we ask an image generation model to generate a cute cat, we will likely be satisfied with many different generated cute cat images. A widely used architecture for such models is the diffusion model, and inspired by this, a diffusion-based architecture is also explored for Large Concept Models.
Understanding Diffusion Models
Diffusion models take a prompt as input, such as “A cat is sitting on a laptop”, and learn to gradually remove noise from an image until a clear picture emerges. The process starts from a random noise image, and at each step the model removes some of the noise, conditioned on the input prompt so that the result matches it. The three dots in the figure indicate skipped steps; at the end we get a clear image of a cat, the final output of the diffusion model for the given prompt. The denoising process usually takes from tens to thousands of steps, which can be a latency drawback. During training, noise is gradually added to a clear image so the model can learn to remove it—this adding of noise is the diffusion process.
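As a refresher, here is a toy sketch of the noising (diffusion) direction in a standard DDPM-style setup; the schedule values are illustrative:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

def add_noise(x0, t):
    """Forward diffusion: sample x_t from q(x_t | x_0) at timestep t."""
    eps = torch.randn_like(x0)
    xt = alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * eps
    return xt, eps  # the model is trained to recover eps from (xt, t)
```

Sampling reverses this: starting from pure noise, the model’s noise estimate, conditioned on the prompt, is subtracted a little at each of the many steps.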
Diffusion-Based LCMs: Improved Large Concept Model Architecture
Now that we’ve recalled what diffusion models are, we can explore the two types of diffusion-based Large Concept Models depicted in the figure below from the paper.
One-Tower Large Concept Model
On the left, we see a version called the One-Tower LCM. At the bottom, there is an input sequence of concepts, each paired with a number representing its noising timestep. A timestep of zero indicates a clean concept embedding; only the last concept is noisy, marked with timestep t, and must be denoised to obtain the next-concept prediction. The model is built similarly to the Base-LCM but runs multiple times: at each step it removes some noise from the noisy next concept, feeding its output back in as the noisy concept for the next step, for a fixed number of steps.
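Here is a hypothetical sketch of that inference loop; the `model(concepts, timesteps)` signature and the step count are assumptions for illustration, not the paper’s code:

```python
import torch

def one_tower_generate(model, context, n_steps=40):
    """Hypothetical One-Tower inference loop.

    `model(concepts, timesteps)` is assumed to take the concept sequence plus
    per-position diffusion timesteps and return a denoised estimate per position.
    `context`: (seq_len, dim) clean concept embeddings.
    """
    x = torch.randn(1, context.size(-1))               # next concept starts as pure noise
    for step in reversed(range(n_steps)):
        timesteps = torch.zeros(context.size(0) + 1, dtype=torch.long)
        timesteps[-1] = step                           # only the last position is noisy
        seq = torch.cat([context, x], dim=0)
        x = model(seq, timesteps)[-1:]                 # keep the partially denoised last position
    return x
```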
Two-Tower Large Concept Model
On the right, we see another version called the Two-Tower LCM. The main difference from the One-Tower version is that it separates the encoding of the preceding context from the diffusion of the next concept embedding. The clean concept embeddings are first encoded using a decoder-only Transformer. The outputs are then fed to a second model, the denoiser, which also receives the noisy next concept and iteratively denoises it to predict the clean next concept. The denoiser consists of Transformer layers, with a cross-attention block to attend to the encoded previous concepts.
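A minimal PyTorch sketch of this split is below; dimensions and layer counts are illustrative, and timestep conditioning is omitted for brevity:

```python
import torch
import torch.nn as nn

class TwoTowerLCM(nn.Module):
    def __init__(self, d=1024, heads=8, ctx_layers=4, den_layers=4):
        super().__init__()
        enc = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.contextualizer = nn.TransformerEncoder(enc, ctx_layers)   # encodes clean context
        dec = nn.TransformerDecoderLayer(d, heads, batch_first=True)
        self.denoiser = nn.TransformerDecoder(dec, den_layers)         # cross-attends to context

    def forward(self, context, noisy_next):
        # Encode the preceding clean concepts once, causally.
        mask = nn.Transformer.generate_square_subsequent_mask(context.size(1))
        memory = self.contextualizer(context, mask=mask, is_causal=True)
        # One denoising step: the noisy next concept attends to the encoded context.
        return self.denoiser(noisy_next, memory)

ctx = torch.randn(2, 5, 1024)     # batch of 5 clean context concepts
noisy = torch.randn(2, 1, 1024)   # current noisy next-concept estimate
denoised = TwoTowerLCM()(ctx, noisy)
```

The design benefit of the split is that the context is encoded once, while only the lightweight denoiser needs to run at every diffusion step.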
Results
Comparing Different Versions of Large Concept Models (LCMs)
In the above table from the paper, we see instruction-tuning evaluation results for the various models. The diffusion-based versions significantly outperform the other versions for the two reported metrics: ROUGE-L, which evaluates the quality of generated summaries by measuring the longest common subsequence between the generated text and the reference text, and the coherence metric, which evaluates how logically consistent and smoothly flowing the generated text is.
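For reference, ROUGE-L can be computed with Google’s rouge-score package (pip install rouge-score); this is an illustration, not necessarily the paper’s exact evaluation harness:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(
    "the cat sat on the mat",         # reference summary
    "a cat was sitting on the mat",   # generated summary
)
print(scores["rougeL"].fmeasure)  # F1 over the longest common subsequence
```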
The Quant models are additional large concept model versions that we did not cover in this post. At the bottom of the table, we see that smaLlama achieves slightly better results than the diffusion-based large concept model versions.
Higher-Scale Evaluation of Large Concept Models (LCMs)
To validate the method at a larger scale, the Two-Tower LCM was scaled up to 7B parameters. In the table below, we can see how it performs on summarization tasks compared with the following baselines:
- Encoder-Decoder Transformer Models: T5
- Decoder-Only LLMs: Gemma-7B, Llama-3.1-8B, and Mistral-7B-v0.3
The results show that the LCM produces competitive ROUGE-L scores, comparable to the specifically tuned T5-3B model, and surpasses the instruction-finetuned LLMs. Key findings include:
- Abstractive Summaries: LCMs tend to generate more abstractive summaries rather than extractive ones, indicated by lower OVL-3 scores.
- Repetition Rate: LCMs produce fewer repetitions compared to LLMs, with repetition rates closer to the ground truth.
- Fluency: According to the CoLA classifier, LCMs generate less fluent summaries than LLMs, though even human-generated summaries scored lower than LLM outputs.
- Source Attribution and Semantic Coverage: Similar trends are observed in source attribution (SH-4) and semantic coverage (SH-5), potentially due to biases in model-based metrics favoring LLM-generated content.
Conclusion
Meta AI has introduced Large Concept Models (LCMs), an innovative architecture that processes higher-level concepts instead of individual tokens, moving closer to how humans reason at multiple levels of abstraction. LCMs demonstrate competitive performance on summarization tasks, outperforming instruction-finetuned LLMs in several key areas. By processing concepts, LCMs handle long context inputs more effectively and enable better hierarchical reasoning. The incorporation of diffusion enhances LCM performance by iteratively refining concept predictions.
Limitations and Future Directions
While Large Concept Models (LCMs) show promise, several areas may be interesting for future research, including:
- Embedding Space Selection: Although SONAR provided sufficient capabilities for a proof of concept, future work may focus on developing improved embedding spaces tailored specifically for LCMs. Incorporating the Byte Latent Transformer as a byte-level encoder and decoder could be an intriguing direction.
- Joint Encoder & Decoder Training: During LCM training, SONAR was used as a frozen component. Training the encoder and decoder together with the LCM could yield interesting results and enhance overall performance.
- Concept Granularity: In this work, a concept was defined at the sentence level. Future research may investigate techniques for splitting sentences into meaningful conceptual units that balance data encoding and abstraction, addressing the challenge of defining appropriate concept granularity.
Links & Resources
- Paper
- Code
- Video
All credit for the research goes to the researchers who wrote the paper we covered in this post.