In recent years, large language models have driven remarkable advances in AI, with closed-source models such as GPT-3 and GPT-4, open-source models such as LLaMA 2 and 3, and many more. However, as these models have grown larger and larger, it has become increasingly important to find ways to improve their efficiency. Mixture-of-Experts (MoE) to the rescue.
Mixture-of-Experts (MoE) High-Level Idea
One well-known method that has been adopted with impressive success is called Mixture-of-Experts, or MoE for short, which allows model capacity to be increased without a proportional increase in computational cost. The idea is that different parts of the model, called experts, learn to handle certain types of inputs, and the model learns when to use each expert. For a given input, only a portion of the experts is used, which makes the model more compute-efficient.
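To make the compute saving concrete, here is a toy back-of-the-envelope calculation. It uses the same 2-out-of-4 routing as the example later in this post, and a made-up per-expert parameter count:

```python
# Toy numbers: 4 experts of 10M parameters each, with each token routed to 2 of them.
num_experts = 4
params_per_expert = 10_000_000
experts_per_token = 2

total_expert_params = num_experts * params_per_expert         # capacity the layer stores
active_expert_params = experts_per_token * params_per_expert  # work actually done per token

print(f"stored expert parameters: {total_expert_params:,}")   # 40,000,000
print(f"active per token:         {active_expert_params:,}")  # 20,000,000
```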
Before diving in, if you prefer a video format, check out the following video:
A bit of “History”
If someone asked which was invented first, MoE or Transformers, what would you say? Transformers were introduced in the famous “Attention Is All You Need” paper from Google in June 2017, while the Mixture-of-Experts layer, in a form very similar to what is used in LLMs today, was actually introduced earlier that same year, also by Google, in a paper titled “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”, whose authors include none other than Geoffrey Hinton, also known as the Godfather of AI.
The Sparse Mixture-of-Experts Layer
As mentioned above, the idea behind Mixture-of-Experts is that instead of having one large model that handles the entire input space, we divide the problem so that different inputs are handled by different segments of the model. These model segments are called the experts. What does this mean in practice?
Mixture-of-Experts High-Level Architecture
The MoE layer has a router component, also called a gating component, and an experts component, which consists of multiple distinct experts, each with its own weights.
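To make this concrete, here is a rough PyTorch skeleton of those two components. It is a simplified toy layer, not the paper's implementation, and the forward pass is sketched in the sections below:

```python
import torch.nn as nn

class MoELayer(nn.Module):
    """Toy sparse MoE layer: a router (gating network) plus several experts."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, top_k: int = 2):
        super().__init__()
        # Router / gating component: produces one score per expert for each token.
        self.router = nn.Linear(d_model, num_experts)
        # Experts component: multiple distinct feed-forward networks, each with its own weights.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        self.top_k = top_k  # how many experts each token is routed to
```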
The Routing Component
Given input tokens, such as the 8 tokens on the left, each token passes through the router, which decides which expert should handle that token and routes it to be processed by that expert. More commonly, the router chooses more than one expert per token; in this example, it chooses 2 experts out of 4.
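Here is a minimal sketch of that routing decision for a single token, assuming a simple linear gate followed by a softmax and a top-2 selection (the paper's actual gate adds noise and other details this sketch ignores):

```python
import torch

torch.manual_seed(0)
d_model, num_experts, top_k = 16, 4, 2

token = torch.randn(d_model)                    # embedding of a single input token
router = torch.nn.Linear(d_model, num_experts)  # the gating network

scores = router(token)                          # one score per expert
probs = torch.softmax(scores, dim=-1)           # distribution over the 4 experts
gate_vals, expert_ids = torch.topk(probs, k=top_k)  # keep only the 2 best experts

print("chosen experts:", expert_ids.tolist())
print("gate weights:  ", gate_vals.tolist())
```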
The Experts Component
The chosen experts produce outputs for the input, which we then combine. Each expert can be smaller than a single large model that processes all tokens, the experts can run in parallel, and not all of them need to run for each input, which is why the computational cost is reduced.
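Continuing the sketch, only the chosen experts run on the token, and their outputs are combined. Re-normalizing the two gate weights so they sum to one is one common choice here, not necessarily the paper's exact recipe:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_hidden, num_experts, top_k = 16, 32, 4, 2

router = nn.Linear(d_model, num_experts)
experts = nn.ModuleList([
    nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
    for _ in range(num_experts)
])

token = torch.randn(d_model)
probs = torch.softmax(router(token), dim=-1)
gate_vals, expert_ids = torch.topk(probs, k=top_k)
gate_vals = gate_vals / gate_vals.sum()  # re-normalize over the 2 chosen experts

# Only the selected experts do any work; their outputs are summed with the gate weights.
output = sum(w * experts[int(i)](token) for w, i in zip(gate_vals, expert_ids))
print(output.shape)  # torch.Size([16]) -- same shape as the input token
```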
Repeating The Process For All Tokens
This flow is repeated for every input token, so the second token also passes through the router, which may choose different experts to activate. In practice, the tokens of an input prompt are handled together rather than one after the other as shown in this example.
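In code, that usually means running the routing for the whole sequence in one batched operation rather than token by token; a rough sketch with assumed shapes:

```python
import torch

torch.manual_seed(0)
seq_len, d_model, num_experts, top_k = 8, 16, 4, 2

tokens = torch.randn(seq_len, d_model)          # e.g. the 8 tokens of the prompt
router = torch.nn.Linear(d_model, num_experts)

probs = torch.softmax(router(tokens), dim=-1)   # shape: (seq_len, num_experts)
gate_vals, expert_ids = torch.topk(probs, k=top_k, dim=-1)

# One row per token: each token gets its own pair of experts,
# and different tokens can be routed to different experts.
print(expert_ids)
```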
Multiple MoE Layers
We’ve discussed a single MoE layer, but in practice there is more than one, so the outputs of one MoE layer are propagated to the next layer in the model.
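A minimal sketch of that stacking, using a self-contained toy MoE layer and ignoring the other layers (attention, normalization, residual connections, etc.) that a real model would interleave between them:

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Minimal sparse MoE layer, kept simple just to illustrate stacking."""
    def __init__(self, d_model=16, d_hidden=32, num_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                                   # x: (seq_len, d_model)
        probs = torch.softmax(self.router(x), dim=-1)
        gate, ids = torch.topk(probs, self.top_k, dim=-1)
        gate = gate / gate.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):                         # naive per-token loop for clarity
            for k in range(self.top_k):
                out[t] += gate[t, k] * self.experts[int(ids[t, k])](x[t])
        return out

# The output of one MoE layer is simply the input of the next one.
model = nn.Sequential(TinyMoE(), TinyMoE(), TinyMoE())
tokens = torch.randn(8, 16)
print(model(tokens).shape)  # torch.Size([8, 16])
```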
MoE Layer – Paper Diagram
Let’s now review a figure from the paper that presents the MoE layer. Since the paper predates Transformers, the model used here is a recurrent language model.
At the bottom we see the input to the MoE layer, which comes from the previous layer. The input first passes through the gating network, which decides which experts should process it. We see that there are n experts and that two of them were chosen.
Weighted Sum Of The Experts Output
The gating network's role is not only to decide which experts to use, but also to determine the weight of each expert's output. Finally, we can see that the outputs of the chosen experts are combined, based on the gating network's output, into a weighted sum that is forwarded to the next model layer.
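In other words, if the gating network assigns weights G_1 and G_2 to the two chosen experts, and those experts produce outputs E_1(x) and E_2(x), the layer output is roughly G_1 · E_1(x) + G_2 · E_2(x). A tiny numeric sketch with made-up values:

```python
# Made-up gate weights and expert outputs for one token, just to show the combination.
gate_weights = [0.7, 0.3]                  # weights from the gating network for the 2 chosen experts
expert_outputs = [[1.0, 2.0], [3.0, 4.0]]  # toy 2-dimensional outputs of the two chosen experts

weighted_sum = [
    sum(w * out[d] for w, out in zip(gate_weights, expert_outputs))
    for d in range(len(expert_outputs[0]))
]
print(weighted_sum)  # [1.6, 2.6] = [0.7*1.0 + 0.3*3.0, 0.7*2.0 + 0.3*4.0]
```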
Final Note About Training
The different MoE layers, including both the gating components and the experts, are all trained together as part of the same model.
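A hedged sketch of what that looks like in practice: a single optimizer updates the router and the experts from the same loss (a toy regression setup, not the paper's training recipe):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Same toy layer as before: gating network plus experts in one module."""
    def __init__(self, d_model=16, num_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x):                                   # x: (seq_len, d_model)
        gate, ids = torch.topk(torch.softmax(self.router(x), dim=-1), self.top_k, dim=-1)
        gate = gate / gate.sum(dim=-1, keepdim=True)
        return torch.stack([
            sum(gate[t, k] * self.experts[int(ids[t, k])](x[t]) for k in range(self.top_k))
            for t in range(x.shape[0])
        ])

model = TinyMoE()
# One optimizer over *all* parameters: the gating network and the experts learn jointly.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, target = torch.randn(8, 16), torch.randn(8, 16)
loss = F.mse_loss(model(x), target)
loss.backward()    # gradients reach both the router and the experts that were selected
optimizer.step()
```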
Links & References
- Paper – https://arxiv.org/abs/1701.06538
- Video – https://youtu.be/kb6eH0zCnl8
- Join our newsletter to receive concise 1 minute read summaries of the papers we review – https://aipapersacademy.com/newsletter/
All credit for the research goes to the researchers who wrote the paper we covered in this post.