CODEFUSION is a new code generation model which was introduced in a research paper from Microsoft, titled: “CODEFUSION: A Pre-trained Diffusion Model for Code Generation”.
Recently, we’ve observed significant progress in code generation using AI, mostly based on large language models (LLMs), which we refer to as code LLMs. A code LLM gets a prompt as input, which can be the beginning of an implementation, such as a method name, or, in some models, an instruction that explains the problem to solve. In response, the code LLM generates the complete implementation.
As humans, when we write code, we often reach a point where we decide to throw away a piece of code and rewrite it from scratch. A code LLM, however, has only one chance to get the implementation right while generating it. In other words, the model has no easy way to reconsider tokens it has already generated. This is where CODEFUSION comes in: the code generation model presented in the paper tackles this limitation by letting the model revise its implementation over multiple iterations.
CODEFUSION is able to do so because it is a diffusion model for code generation. In this post, we’ll provide a super quick reminder of what diffusion models are, and then explain how CODEFUSION works. Finally, we’ll review how CODEFUSION performs based on the results from the paper, which include an interesting possible leak regarding the size of ChatGPT.
If you prefer a video format, check out our video, linked in the references at the end of this post.
Diffusion Models – Quick Recap
Diffusion models serve as the backbone of the top text-to-image generation models such as DALL-E, Stable Diffusion, Imagen and more. A diffusion model gets a prompt as input, such as “A cat is sitting on a laptop”, and learns to gradually remove noise from an image in order to generate a clear one. The model starts with a random noise image, like the one on the left in the example above, and in each step it removes some of the noise. The noise removal is conditioned on the input prompt, so we end up with an image that matches the prompt. The 3 dots imply that we skip steps in this example. Finally, we get a nice clear image of a cat, which we take as the final output of the diffusion model for the provided prompt. The noise removal process usually takes anywhere from tens to thousands of steps, so it comes with a latency drawback.
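To make the denoising loop concrete, here is a minimal sketch in plain Python. It is a toy under loud assumptions, not a real diffusion model: the "image" is a short list of numbers, `target` stands in for whatever clean image the prompt describes, and `denoise_step` is a hypothetical stand-in for a trained network that removes part of the noise in each step.

```python
import random

def denoise_step(x, target, strength=0.2):
    """Hypothetical stand-in for a trained denoiser: moves every value a
    fraction of the way toward the prompt-conditioned target."""
    return [xi + strength * (ti - xi) for xi, ti in zip(x, target)]

def generate(target, num_steps=50, seed=0):
    rng = random.Random(seed)
    # Start from pure Gaussian noise, like the noisy image on the left.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for _ in range(num_steps):
        x = denoise_step(x, target)  # each step removes some of the noise
    return x

def distance(a, b):
    """Largest per-value gap between two toy images."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))

target = [0.1, 0.9, 0.5, 0.3]  # toy "clean image" implied by the prompt
few = generate(target, num_steps=5)
many = generate(target, num_steps=50)
print("error after 5 steps: ", distance(few, target))
print("error after 50 steps:", distance(many, target))
```

Running more steps brings the result closer to the target, which also illustrates the latency drawback: quality is paid for with extra iterations.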
CODEFUSION is a diffusion model for code generation. The idea is to allow the model to reconsider its solution in each denoising step. This mitigates the limitation explained at the beginning of the post, where code LLMs cannot easily reconsider tokens that were already generated. In the above diagram from the paper, we can see the architecture of CODEFUSION. On the bottom left we see an example prompt. In this example the researchers use Microsoft Excel conditions, and the model is asked to generate conditions that highlight students who match certain criteria.
The prompt is first passed through a transformer-based encoder, which transforms the text tokens into a vector representation, called embeddings, denoted by the letter ‘e’ (collectively E). The encoder here is not trained from scratch; it is the pre-trained encoder from the CodeT5 model. The embeddings are passed to another transformer-based component called the denoiser. The denoiser gets the embeddings and random noise in a latent space, and performs the multiple iterations of gradually removing the noise, conditioned on the embeddings. Note that the denoising happens in a latent space and not on code tokens.
After enough iterations, the denoiser ends with x0, which is a representation of the denoiser’s final solution in the latent space. Before the denoised embeddings x0 are projected back to discrete code tokens, x0 is passed together with the prompt embeddings E to a transformer-based decoder that creates the final representation of the solution, denoted by the letter ‘d’, which is still not code tokens.
Finally, to generate the code tokens, we pass the decoder output through a classification head that yields the most likely code tokens. In this example, it yields Excel conditions based on the original prompt.
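The data flow described above (encoder, denoiser, decoder, classification head) can be sketched in plain Python. This is only an illustration of how the pieces connect, not the paper's implementation: every function is a hypothetical stand-in with random, untrained weights, and the vocabulary, `DIM` and `CODE_LEN` are made-up values, so the printed tokens are meaningless.

```python
import random

VOCAB = ["=", "AND", "(", ")", "A1", ">", "50", ","]  # hypothetical toy vocabulary
DIM = 4       # embedding size (assumed for illustration)
CODE_LEN = 5  # number of output token positions (assumed)

def mean_vector(E):
    """Average the prompt embeddings into a single conditioning vector."""
    return [sum(col) / len(E) for col in zip(*E)]

def encoder(prompt, rng):
    """Stand-in for the pre-trained encoder: one vector 'e' per prompt token."""
    return [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in prompt.split()]

def denoiser(x_t, E):
    """Stand-in for one denoising iteration in latent space, conditioned on E."""
    ctx = mean_vector(E)
    return [[0.5 * xi + 0.5 * ci for xi, ci in zip(row, ctx)] for row in x_t]

def decoder(x0, E):
    """Stand-in for the decoder: combines x0 with E into the vectors 'd'."""
    ctx = mean_vector(E)
    return [[xi + ci for xi, ci in zip(row, ctx)] for row in x0]

def classification_head(D, W):
    """Pick the highest-scoring vocabulary token for each position."""
    tokens = []
    for d in D:
        logits = [sum(w * v for w, v in zip(w_row, d)) for w_row in W]
        tokens.append(VOCAB[logits.index(max(logits))])
    return tokens

def generate(prompt, num_steps=10, seed=0):
    rng = random.Random(seed)
    E = encoder(prompt, rng)
    W = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]  # untrained head
    x = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(CODE_LEN)]  # pure noise
    for _ in range(num_steps):
        x = denoiser(x, E)  # iterate on latent vectors, not on code tokens
    D = decoder(x, E)       # x is now x0
    return classification_head(D, W)

tokens = generate("highlight students with grade above 50")
print(tokens)
```

The point of the sketch is the ordering: tokens only appear at the very end, after all denoising iterations have run on continuous vectors.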
CODEFUSION Training Process
The CODEFUSION model is trained in two phases.
- The first training phase is unsupervised pre-training, where the data contains code snippets only, without prompts. In this phase we train only the denoiser and the decoder, and the missing prompt embedding is replaced with random noise.
- The second training phase is supervised fine-tuning, where the data consists of both prompts and code snippets. In this phase all components are fine-tuned, including the encoder.
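The two phases above can be summarized as a small configuration sketch. The class and field names here are hypothetical, chosen only for illustration; the component list is limited to the three components the post names for training, and the classification head is left out because the post does not say how it is handled.

```python
from dataclasses import dataclass

# Components named in the training description (assumed list for this sketch).
COMPONENTS = frozenset({"encoder", "denoiser", "decoder"})

@dataclass(frozen=True)
class TrainingPhase:
    name: str
    data: str
    trainable: frozenset   # components whose weights are updated
    prompt_embedding: str  # what the denoiser and decoder are conditioned on

PHASES = [
    TrainingPhase(
        name="unsupervised pre-training",
        data="code snippets only",
        trainable=frozenset({"denoiser", "decoder"}),
        prompt_embedding="random noise",
    ),
    TrainingPhase(
        name="supervised fine-tuning",
        data="prompt and code snippet pairs",
        trainable=COMPONENTS,
        prompt_embedding="encoder output",
    ),
]

def frozen_components(phase):
    """Components that are not updated during the given phase."""
    return COMPONENTS - phase.trainable

for phase in PHASES:
    print(phase.name, "| trains:", sorted(phase.trainable),
          "| frozen:", sorted(frozen_components(phase)))
```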
So how does CODEFUSION perform compared to other models? We can see the answer in the above table from the paper. Each row represents a different evaluated model, and the CODEFUSION results are reported in the bottom row. On the Python benchmark, only GPT-3 was able to get better results than CODEFUSION in the top-1 accuracy metric. But GPT-3 is orders of magnitude larger than CODEFUSION, so this is still very impressive. We can also see that CODEFUSION is on par with ChatGPT. An interesting side note here is a possible leak from Microsoft regarding the size of ChatGPT, which is listed in the table as 20 billion parameters. Looking at the rest of the CODEFUSION results, we can see that on the top-3 and top-5 metrics it was able to beat all other tested models in Python, Bash and CF rules (the Excel conditional formatting rules). Overall, given that CODEFUSION is much smaller than other strong code models, the results are very impressive and show great potential for this research direction.
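For readers unfamiliar with the top-1, top-3 and top-5 metrics, here is a minimal sketch of how top-k accuracy can be computed: a task counts as solved if any of the model's first k candidates is correct. The exact-string comparison and the sample data below are assumptions made for illustration; the paper's actual correctness criteria may differ.

```python
def top_k_accuracy(candidates_per_task, references, k):
    """Fraction of tasks where one of the first k candidates matches the reference.

    Correctness is assumed to be an exact string match after stripping
    whitespace, which is a simplification for this sketch.
    """
    hits = 0
    for candidates, reference in zip(candidates_per_task, references):
        if any(c.strip() == reference.strip() for c in candidates[:k]):
            hits += 1
    return hits / len(references)

# Made-up model outputs, ranked from most to least likely.
candidates = [
    ["=A1>50", "=A1>=50", "=A1<50"],
    ["ls -a", "ls -l", "ls"],
    ["x + 1", "x - 1", "x * 2"],
]
references = ["=A1>50", "ls -l", "x ** 2"]

top1 = top_k_accuracy(candidates, references, k=1)
top3 = top_k_accuracy(candidates, references, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

In this made-up data, only the first task is solved by the top candidate, while the second is solved within the first three, so top-3 accuracy is higher than top-1. Top-k accuracy can never decrease as k grows.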
References & Links
- CODEFUSION paper page – https://arxiv.org/abs/2310.17680
- Video – https://youtu.be/wKSdzL67TqY
All credit for the research goes to the researchers who wrote the paper we covered in this post.