In this post we dive into Microsoft’s Orca research paper, which presents a language model trained with a novel imitation-learning approach that achieves very impressive results. For example, in the chart above, which measures performance on the Vicuna evaluation set using GPT-4 as the judge, with results shown relative to ChatGPT, Orca achieves better results than many famous large language models, including ChatGPT, while being about 7% of its size! We’ll explain how they did that and why it is so interesting.
If you prefer a video format then check out the following video:
Various top large language models such as Vicuna, Alpaca and WizardLM take a base large language model, usually LLaMA, which is much smaller than the huge ChatGPT or GPT-4, and enhance the base model’s capabilities by fine-tuning it on a dataset created from ChatGPT or GPT-4 responses. This process of learning from the outputs of a different model is called imitation learning.
Orca Results – Was Imitation Learning Up Until Now Done Right?
According to the Orca paper, with imitation learning as done so far, the models learn to imitate the style rather than the reasoning process of the huge models. The researchers claim that common evaluation methods overestimate the smaller models’ capabilities. In the chart at the top of the post we saw that Vicuna reaches 92% of ChatGPT’s quality when GPT-4 is the judge, but this is problematic because GPT-4 may prefer responses from models that were fine-tuned on GPT responses. Indeed, when the researchers ran evaluations on complex datasets, such as the professional and academic exams from AGIEval shown in the above chart, Vicuna lags behind ChatGPT much more significantly than the 92% figure suggests, while Orca’s performance is significantly closer to ChatGPT’s.
In the above chart we can see possibly the most impressive result in this paper, comparing Orca to ChatGPT and Vicuna on the BigBench-Hard (BBH) dataset, which includes tasks on which humans are still dominant over large language models, for example boolean expressions like “not true and true”, which should evaluate to “false”. Orca outperforms even ChatGPT here, and improves over Vicuna by more than 100%!
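To make the BBH example concrete, here is a minimal Python sketch that evaluates such boolean expressions directly; the helper name is hypothetical and just illustrates the kind of task the model is asked to reason through.

```python
# A BBH-style boolean-expressions task asks the model to evaluate
# expressions such as "not true and true". Python can check the
# expected answers directly.

def eval_bool_expr(expr: str) -> bool:
    """Evaluate a lowercase boolean expression like 'not true and true'."""
    # Map the task's lowercase literals onto Python's True/False.
    normalized = expr.replace("true", "True").replace("false", "False")
    # Restrict builtins; the expression vocabulary is only and/or/not/literals.
    return eval(normalized, {"__builtins__": {}})

print(eval_bool_expr("not true and true"))      # False
print(eval_bool_expr("not ( true or false )"))  # False
```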
The Orca model also starts from LLaMA, with 13 billion parameters, and is fine-tuned on ChatGPT and GPT-4 outputs, but the key idea lies in how they build their training dataset.
Let’s review the key factors for their success.
The first factor behind this improvement is called explanation tuning, a new approach to imitation learning presented in the paper that is meant to let the model learn the thought process of a teacher model such as ChatGPT. The idea is that current imitation-learning-based models fail to reach higher quality because the responses they are fine-tuned on are mostly simple and short. For the Orca dataset, the researchers used detailed responses from GPT-4 and ChatGPT that explain the teacher’s reasoning process as it generates the response. For example, before Orca, to generate a training sample, GPT-4 would get a query consisting of an instruction and an input as the prompt, and would generate a response. As we see in the following example, the output is simple and short.
When generating a sample for Orca training, the same query as above is used, but with an added system instruction in the prompt that provides guidelines for GPT-4 on how it should generate the response. For example, here it says: “You are an AI assistant. Provide a detailed answer so user don’t need to search outside to understand the answer”, and we get a very detailed response that includes the reasoning behind the answer. Overall, 16 hand-crafted system instructions were used across the Orca training samples, which helped Orca learn the reasoning process of ChatGPT and GPT-4 rather than imitate their style.
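As a rough sketch of what assembling such an explanation-tuning sample could look like (the record layout and function name here are my assumptions, not the paper’s exact schema):

```python
# Hypothetical sketch of assembling one explanation-tuning sample.
# The field names are illustrative; the paper does not publish an exact schema.

# One of the 16 hand-crafted system instructions (this one is quoted in the paper);
# the remaining 15 are elided here.
SYSTEM_INSTRUCTION = (
    "You are an AI assistant. Provide a detailed answer so user don't "
    "need to search outside to understand the answer."
)

def build_training_sample(system_instruction: str, instruction: str,
                          task_input: str, teacher_response: str) -> dict:
    """Pack a (system, user query, teacher response) triple into one record."""
    return {
        "system": system_instruction,  # steers the teacher toward detailed reasoning
        "prompt": f"{instruction}\n\n{task_input}",
        "response": teacher_response,  # detailed, step-by-step answer from GPT-4/ChatGPT
    }

sample = build_training_sample(
    SYSTEM_INSTRUCTION,
    "Answer the following question.",
    "Why does ice float on water?",
    "Ice floats because water expands as it freezes... (detailed teacher reasoning)",
)
```

The important design point is that the system instruction is part of the stored sample, so the student model is trained on (system instruction, query, detailed response) triples rather than on short answers alone.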
Task Diversity and Data Scale
The second key factor that helps Orca stand out is task diversity and data scale. In the above table from the paper we see a comparison of the dataset sizes used by models from the same family, which shows an order-of-magnitude increase in scale with Orca: it is trained on 5 million samples, while the largest of the other datasets, WizardLM’s, has 250k samples. To create their dataset, the researchers used the Flan collection from Google, which includes an extensive diversity of tasks and instructions. They selectively sampled a collection of 5 million instructions from Flan, focusing on complex instructions from diverse tasks. They then collected ChatGPT responses for all 5 million samples, and GPT-4 responses for 1 million of them.
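The collection process described above can be sketched as follows; the function names and the stub “teacher” callables are hypothetical stand-ins for real API calls, not the paper’s actual tooling.

```python
# Illustrative sketch of the two-stage response collection:
# ChatGPT responses for all sampled instructions, GPT-4 for a subset.
import random

def collect_orca_dataset(flan_instructions, ask_chatgpt, ask_gpt4,
                         n_total=5_000_000, n_gpt4=1_000_000, seed=0):
    """Sample instructions, collect ChatGPT responses for all of them,
    and GPT-4 responses for a smaller subset."""
    rng = random.Random(seed)
    sampled = rng.sample(flan_instructions, min(n_total, len(flan_instructions)))
    chatgpt_data = [(q, ask_chatgpt(q)) for q in sampled]
    gpt4_subset = rng.sample(sampled, min(n_gpt4, len(sampled)))
    gpt4_data = [(q, ask_gpt4(q)) for q in gpt4_subset]
    return chatgpt_data, gpt4_data

# Toy usage with stub "teachers" instead of real API calls:
instructions = [f"task {i}" for i in range(10)]
cg, g4 = collect_orca_dataset(instructions, lambda q: "chatgpt: " + q,
                              lambda q: "gpt4: " + q, n_total=10, n_gpt4=2)
print(len(cg), len(g4))  # 10 2
```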
Key Takeaways from Orca
We have two key takeaways from this paper.
- Foundation language models such as GPT-4 or ChatGPT can be used as teachers for smaller models. This is essentially the opposite of what was claimed in a very recent paper about the false promise of model imitation; what Microsoft shows with Orca is that model imitation is in fact not a false promise, since they do exactly that and achieve remarkable results.
- The explanation tuning results show that learning from step-by-step explanations is a promising direction that can improve model capabilities.
References & Links
- Paper page – https://arxiv.org/abs/2306.02707
- Video – https://youtu.be/D8eZugu63vI
- We use ChatPDF to analyze research papers – https://www.chatpdf.com/?via=ai-papers (affiliate)
All credit for the research goes to the researchers who wrote the paper we covered in this post.