DINOv2 is a computer vision model from Meta AI that aims to finally provide a foundational model for computer vision, closing some of the gap with natural language processing, where such models have been common for a while now. In this post, we’ll explain what it means to be a foundational model in computer vision and why DINOv2 qualifies as one. DINOv2 is a huge model (relative to computer vision) with one billion parameters, which raises hard challenges both for training the model and for using it. We’ll review these challenges and what the researchers at Meta AI did to overcome them using self-supervision and distillation. Don’t worry if you are not familiar with these terms; we’ll explain them when we get there. Let’s start by understanding what DINOv2 provides that makes it a foundational model in computer vision.
If you prefer a video format, then a lot of what we cover here is also covered in this video:
What is a Foundational Model?
In the life before foundational models, one would need to find or create a dataset, choose an architecture for a model, and train the model on that dataset. The model you need may be complex and may require long or difficult training.
So here comes DINOv2, a huge pretrained vision transformer (ViT), a well-known architecture in the field of computer vision, whose promise is that you may not need a complex dedicated model at all.
Say, for example, that we have a cat image (the one on the left in the picture below). We can provide this image as input to DINOv2, which will yield a vector of numbers, often called embeddings or visual features. These embeddings contain a deep understanding of the input cat image, and once we have them, we can use them in smaller, simpler models that handle specific tasks. For example, one model could handle semantic segmentation, meaning categorizing related parts of the image, and another could estimate the depth of the objects in the picture. The output examples here are taken from Meta AI’s demo for DINOv2.
Another very important attribute of DINOv2 is that it can stay frozen while training these task-specific models; in other words, no fine-tuning is needed. This further simplifies both the training and the usage of the smaller models, since DINOv2 can be executed on an image once and its output reused by multiple models. If it were fine-tuned, we would need to run a different fine-tuned DINOv2 version for every task-specific model we have. Moreover, fine-tuning such a huge model is not trivial and requires hardware that is not accessible to everyone.
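To make this concrete, here is a minimal sketch of how frozen DINOv2 embeddings could feed a small task-specific model. The dimensions and the 10-class head are illustrative assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn

# Hypothetical embeddings produced once by a frozen DINOv2 for a batch of
# 4 images; 384 is the embedding size of the ViT-S/14 variant
embeddings = torch.randn(4, 384)

# A small task-specific head trained on top of the frozen features,
# e.g. a linear classifier over 10 illustrative classes
head = nn.Linear(384, 10)
logits = head(embeddings)  # shape: (4, 10)
```

Since the backbone is frozen, the same `embeddings` tensor could feed a segmentation head, a depth head, and this classifier without ever re-running DINOv2.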
How to use DINOv2?
We won’t dive deep into code here, but if you want to use DINOv2, you can simply load it with PyTorch as in the following code, taken from the DINOv2 GitHub page. There are a few versions of different model sizes to load, so you can decide which one to use based on your needs and resources. Accuracy does not drop significantly when using a smaller version, which is cool, especially with one of the middle-sized versions.
import torch

dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')
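Once loaded, extracting embeddings is a single forward pass. A minimal sketch, assuming network access to download the weights; note that the input height and width must be multiples of the patch size, 14:

```python
import torch

# Load the smallest variant (downloads weights on first use)
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2_vits14.eval()

# A dummy batch standing in for a preprocessed image; 224 = 16 * 14
img = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    emb = dinov2_vits14(img)  # a (1, 384) embedding vector for ViT-S/14
```

In a real application, `img` would be an actual image normalized the same way as during training.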
This brings us to how the different model versions were generated, and the answer is distillation.
Distillation means transferring knowledge from a large trained model into a new, smaller model. Interestingly, in DINOv2 this yielded better results than training the smaller models directly. The way it works is to take the large pretrained DINOv2 model and use it to teach new smaller models. For example, given a cat image (see picture below), DINOv2 yields some embeddings. We then feed the same cat image to a smaller model, which also yields embeddings. The distillation process tries to minimize the difference between the embeddings coming from the new model and those coming from DINOv2. Remember, DINOv2 stays frozen here, so only the smaller model on the right side changes.
This method is often called teacher-student distillation, since the left side here acts as a teacher while the right one acts as a student.
In practice, to get better results from the distillation process, multiple students are trained rather than just one; each simultaneously receives the same inputs and produces its own embeddings. During training, the student models’ weights are averaged, and this average ends up being the final distilled model.
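The whole procedure can be sketched with stand-in models. Simple linear layers play the roles here; all names and sizes are illustrative, not DINOv2’s actual architectures:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in "teacher" playing the role of the frozen pretrained DINOv2
teacher = nn.Linear(8, 4)
for p in teacher.parameters():
    p.requires_grad_(False)  # the teacher stays frozen

# Several smaller "students" distilled simultaneously
students = [nn.Linear(8, 4) for _ in range(2)]
opt = torch.optim.SGD([p for s in students for p in s.parameters()], lr=0.1)

data = torch.randn(64, 8)  # stand-in for batches of images
losses = []
for _ in range(200):
    target = teacher(data)  # teacher embeddings (no gradient flows here)
    # Each student tries to match the teacher's embeddings
    loss = sum(F.mse_loss(s(data), target) for s in students)
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

# Average the students' weights into the final distilled model
final = nn.Linear(8, 4)
with torch.no_grad():
    for name, p in final.named_parameters():
        p.copy_(sum(dict(s.named_parameters())[name] for s in students) / len(students))
```

The loss steadily shrinks as the students’ embeddings converge toward the teacher’s.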
With DINOv2, the model size increased dramatically from the previous DINO version, which raised the need for more training data. This brings us to self-supervised learning with large curated data.
Self Supervised Learning with Large Curated Data
First, what is self-supervised learning? In short, it means our training data has no labels and the model learns solely from the images themselves. The first DINO version also used self-supervised learning techniques. So, without data labeling, it should be easier to increase the training data size, right? However, previous attempts to scale up uncurated data with self-supervised learning caused a drop in quality.
With DINOv2, the researchers built an automated pipeline to create a curated dataset, which helped them reach state-of-the-art results compared to other self-supervised learning models. They started with 25 data sources that together contained 1.2 billion images (!) and extracted 142 million images from them.
This pipeline has multiple filtering steps. For example, the original uncurated dataset likely contains a lot of cat images alongside everything else. Training on such data as-is may produce a model that is very good at understanding cats but does not generalize well to other domains.
So one of the steps in this pipeline was clustering, which basically means grouping images based on similarity. They could then sample a similar number of images from each group and create a smaller but more diverse dataset.
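As a toy illustration of the idea (the data, dimensions, and cluster counts are made up; DINOv2’s actual pipeline works on image embeddings at a much larger scale):

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced toy "embeddings": 900 points of one kind, 100 of another
majority = rng.normal(loc=0.0, scale=0.5, size=(900, 2))
minority = rng.normal(loc=5.0, scale=0.5, size=(100, 2))
data = np.vstack([majority, minority])

# A few iterations of plain k-means (initialized with one point
# from each blob to keep this sketch deterministic)
centers = data[[0, -1]].copy()
for _ in range(5):
    dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = dists.argmin(axis=1)
    centers = np.array([data[labels == k].mean(axis=0) for k in range(2)])

# Sample the same number of points from each cluster -> balanced subset
per_cluster = np.bincount(labels).min()
balanced = np.vstack([data[labels == k][:per_cluster] for k in range(2)])
```

The balanced subset is smaller but represents both groups equally, which is the spirit of the curation step.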
Better Pixel Level Understanding
Another benefit of using self-supervised learning is better pixel-level understanding. A common approach in computer vision nowadays is text-guided pretraining. For example, the following cat image would come with a description such as “a white kitten in a field of grass”.
Both the image and the text are provided as input to such models. However, the description may miss information, such as the fact that the cat is walking or the small white flowers, which can limit what the model learns.
With DINOv2 and self-supervised learning, the model shows an impressive ability to learn pixel-level information. As an example, in the picture below we can see multiple horse images; when visualizing DINOv2’s features on them, horses in different pictures get similar colors for the same body parts, even when there are multiple horses in a picture and even when they are tiny.
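Visualizations like that are commonly produced by projecting each patch embedding onto its first three principal components and mapping them to RGB. A minimal sketch with random stand-in features (a real run would use DINOv2’s per-patch outputs instead):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in patch features: a ViT-S/14 on a 224x224 image yields
# a 16x16 grid of 384-dimensional patch embeddings
patches = rng.normal(size=(16 * 16, 384))

# PCA via SVD: project onto the top 3 components and map them to RGB
centered = patches - patches.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
rgb = centered @ vt[:3].T                              # (256, 3)
rgb = (rgb - rgb.min(axis=0)) / (rgb.max(axis=0) - rgb.min(axis=0))
image = rgb.reshape(16, 16, 3)  # one pseudo-color per patch
```

With real features, patches belonging to the same body part end up with similar principal-component values, hence similar colors across images.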
- Paper – https://arxiv.org/abs/2304.07193
- Code – https://github.com/facebookresearch/dinov2
- Video – https://youtu.be/csEgtSh7jV4
- Demo – https://dinov2.metademolab.com/
A more recent computer vision advance from Meta AI is the human-like I-JEPA model, which we covered here.