Multimodality Papers


  • LLaMA-Mesh by Nvidia: LLM for 3D Mesh Generation

    Large language models (LLMs) have become ubiquitous by this point. Their native data modality is text. However, given their power, an active research domain aims to harness their strong capabilities for other data modalities, and we already see LLMs that can understand images. Today we’re…

  • TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

    In this post we dive into TinyGPT-V, a new multimodal large language model, which was introduced in a research paper titled “TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones”. Before diving in, if you prefer a video format, check out our video review for this paper: Motivation In recent years we’ve seen a…

  • NExT-GPT: Any-to-Any Multimodal LLM

    NExT-GPT is a multimodal large language model (MM-LLM) developed by the NExT++ lab at the National University of Singapore, and presented in a research paper titled “NExT-GPT: Any-to-Any Multimodal LLM”. With the remarkable progress of large language models, we can provide an LLM with a text prompt to get a meaningful answer in response.…

  • ImageBind: One Embedding Space To Bind Them All

    ImageBind is a model by Meta AI which can make sense of six different types of data. This is exciting because it brings AI a step closer to how humans observe the environment using multiple senses. In this post, we will explain what this model is and why we should care about it by…

  • Meta-Transformer: A Unified Framework for Multimodal Learning

    In this post we dive into Meta-Transformer, a multimodal learning method, which was presented in a research paper titled “Meta-Transformer: A Unified Framework for Multimodal Learning”. In the paper, the researchers show they were able to process information from 12(!) different modalities, including image, text, audio, infrared,…
