In this post we dive into YOLO-NAS, an improved model in the YOLO family for object detection, which was presented earlier this year by Deci.
YOLO models have been around for a while now, first presented in 2015 in the paper You Only Look Once, which is what the acronym YOLO stands for, and over the years we saw various improved versions, up to YOLOv8 by Ultralytics. In May 2023, a company called Deci released YOLO-NAS and showed that it achieves great results with very low latency, yielding the best accuracy-latency tradeoff to date. In this post we will explain how they were able to do that.
If you prefer a video format then check out the following video:
Neural Architecture Search
Let’s start with Neural Architecture Search, which is what the NAS in the YOLO-NAS name stands for. Most of the time, model architectures are designed by human experts. Since the number of potential architectures is huge, even when we reach great results it is unlikely that we landed on the very best architecture out there, and a different architecture could still yield better results. Neural Architecture Search was invented to address this, and it includes three main components.
- A search space which defines the set of valid possible architectures to choose from.
- A search algorithm which is in charge of how to sample candidate architectures from the search space, since trying them all one by one is infeasible.
- An evaluation strategy which is used to compare between candidate architectures.
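The three components above can be illustrated with a toy random-search loop. Everything here is a simplified sketch: the search space, the random sampling and the scoring function are made up for illustration, whereas a real NAS system (like Deci's AutoNAC) would evaluate candidates by training or estimating them and measuring latency on the target hardware.

```python
import random

# Toy search space (illustrative): each architecture is a choice of depth,
# width, and kernel size. A real YOLO search space is vastly larger (~10^14).
SEARCH_SPACE = {
    "depth": [2, 4, 6, 8],
    "width": [16, 32, 64],
    "kernel": [1, 3, 5],
}

def sample_architecture(rng):
    """Search algorithm: here, plain random sampling from the space."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}

def evaluate(arch):
    """Evaluation strategy: a made-up score trading an accuracy proxy
    against a latency proxy. A real NAS would measure both empirically."""
    accuracy_proxy = arch["depth"] * arch["width"]
    latency_proxy = arch["depth"] * arch["width"] * arch["kernel"] ** 2
    return accuracy_proxy - 0.01 * latency_proxy

def search(n_trials=50, seed=0):
    """Sample candidates and keep the best-scoring one."""
    rng = random.Random(seed)
    candidates = [sample_architecture(rng) for _ in range(n_trials)]
    return max(candidates, key=evaluate)

best = search()
```

Real search algorithms are far smarter than random sampling (evolutionary methods, reinforcement learning, differentiable search), but the structure — sample from a space, score candidates, keep the best — is the same.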
So, in order to come up with a model architecture for YOLO-NAS, Deci’s researchers used their own neural architecture search implementation called AutoNAC, which stands for Automated Neural Architecture Construction. They provided AutoNAC with the details needed to search over possible YOLO architectures, which created an initial search space of roughly 10^14 possible architectures, and AutoNAC, which is hardware aware, found an optimal architecture for YOLO-NAS, optimized for the Nvidia T4, in a process that took 3,800 GPU hours.
Quantization Aware Architecture
Another attribute that helped YOLO-NAS reach great results at super low latency is its quantization-aware architecture. First, what does that even mean and why is it important?
Well, real-time object detection is critical for various applications, such as the safe operation of autonomous cars, so we want to deploy object detection models on cars, phones and more, instead of running the model in the cloud. However, edge devices have limited resources, which makes it hard to deploy large models on them due to their size and long inference times.
Quantization in machine learning usually refers to the process of reducing the precision of the model weights so they consume less memory and run faster. However, this often comes with a decrease in model accuracy. The quantization technique used in YOLO-NAS is INT8 quantization, which converts the model weights from float32 to int8, so each weight takes one byte in memory instead of four. They were able to do that thanks to a new building block called QARepVGG, which they instructed the neural architecture search algorithm to include. QARepVGG was recently introduced in a research paper from Meituan. It is an improved version of the RepVGG block, commonly used in object detection models, that significantly reduces the loss of accuracy after quantization. We won’t dive into its internals here. They also used a hybrid quantization technique, applying quantization only to specific layers in the model to balance information loss and latency.
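To make the float32-to-int8 conversion concrete, here is a minimal sketch of symmetric post-training INT8 quantization of a single weight tensor. The function names and the example weights are illustrative, not YOLO-NAS's actual implementation, which additionally relies on the QARepVGG block and per-layer decisions described above.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] using one scale factor.
    Symmetric quantization: scale is chosen so the largest-magnitude weight
    maps to +/-127."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

# Illustrative weight values.
weights = [0.52, -1.27, 0.03, 0.88, -0.41]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each quantized value fits in 1 byte instead of the 4 bytes a float32 needs,
# a 4x memory saving, at the cost of a small rounding error per weight
# (at most scale / 2).
```

The rounding error per weight is bounded by half the scale, which is why quantization-friendly blocks like QARepVGG matter: they keep weight distributions in a range where this error barely affects accuracy.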
YOLO-NAS as a Foundation Model
Let’s move on to talk about YOLO-NAS being a foundation model. So what does a foundation model mean? Say we have two different tasks at hand: one is to detect bone fractures in an image and another is to detect fish in an aquarium. Instead of training each model from scratch, we can start with two instances of YOLO-NAS, fine-tune the first on a bone scans dataset and fine-tune the second on an aquarium dataset. With this approach we enjoy the strength of YOLO-NAS’s pre-training, done on very large datasets with advanced techniques, while still adapting to the specific use cases thanks to the final fine-tuning.
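The core idea of fine-tuning can be sketched with a toy example: keep a pretrained "backbone" frozen and train only a small task-specific "head" on the new dataset. Everything here is illustrative (the features, the data, and the training loop); in practice you would fine-tune YOLO-NAS itself with a full detection training pipeline.

```python
def backbone(x):
    """Stand-in for a pretrained feature extractor: frozen during fine-tuning."""
    return [x, x * x]  # two fixed features

def head(features, w):
    """Small task-specific head: a linear layer over the backbone features."""
    return sum(f * wi for f, wi in zip(features, w))

def fine_tune(data, lr=0.01, epochs=500):
    """Fit only the head weights on the task-specific dataset (plain SGD).
    The backbone is never updated, which is what makes this fine-tuning
    rather than training from scratch."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            feats = backbone(x)
            err = head(feats, w) - y
            w = [wi - lr * err * f for wi, f in zip(w, feats)]
    return w

# Tiny made-up "task dataset": targets generated by y = 2*x + 3*x^2.
data = [(x, 2 * x + 3 * x * x) for x in [-1.0, -0.5, 0.5, 1.0]]
w = fine_tune(data)  # head weights converge toward (2, 3)
```

Because the frozen backbone already produces useful features, the head needs far less data and compute to reach good performance — which is exactly why starting from a pretrained YOLO-NAS beats training each detector from scratch.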
If you’re interested in learning more about foundation models in computer vision, check out our post on DINOv2 – https://aipapersacademy.com/dinov2-from-meta-ai-finally-a-foundational-model-in-computer-vision/