Introducing LLaVA-Phi: A Compact Vision-Language Assistant Powered By a Small Language Model

by @textmodels

Abstract and 1 Introduction

2. Related Work

3. LLaVA-Phi and 3.1. Training

3.2. Qualitative Results

4. Experiments

5. Conclusion, Limitation, and Future Works and References

Abstract

In this paper, we introduce LLaVA-ϕ (LLaVA-Phi), an efficient multi-modal assistant that harnesses the power of the recently advanced small language model, Phi-2, to facilitate multi-modal dialogues. LLaVA-Phi marks a notable advancement in the realm of compact multi-modal models. It demonstrates that even smaller language models, with as few as 2.7B parameters, can effectively engage in intricate dialogues that integrate both textual and visual elements, provided they are trained with high-quality corpora. Our model delivers commendable performance on publicly available benchmarks that encompass visual comprehension, reasoning, and knowledge-based perception. Beyond its remarkable performance in multi-modal dialogue tasks, our model opens new avenues for applications in time-sensitive environments and systems that require real-time interaction, such as embodied agents. It highlights the potential of smaller language models to achieve sophisticated levels of understanding and interaction, while maintaining greater resource efficiency. The project is available at https://github.com/zhuyiche/llava-phi.

1. Introduction

Large vision language models, including Flamingo [1], GPT-4V [30], and Gemini [33], have exhibited remarkable proficiency in executing instructions, engaging in multi-turn dialogues, and handling image-based question-answering tasks. The progression of open-source vision language models has been significantly propelled by the rapid advancement of open-source Large Language Models like LLaMA [34] and Vicuna [5]. These developments primarily focus on leveraging language models with a minimum of 7B parameters, integrated with a vision encoder to enhance visual comprehension. However, this approach often results in increased test time and reduced inference speed, which are less than ideal for time-sensitive or real-time interactive applications, such as autonomous driving and robotics. This leads to an important inquiry: How effectively can small vision-language assistants perform in comparison?


Gemini [33] has blazed a trail for multi-modal models in mobile technology. Its streamlined variant, Gemini-Nano, comes in 1.8-billion and 3.25-billion-parameter versions and is deployable on mobile devices. However, details like the model architecture, training data, and training methodologies remain proprietary and inaccessible to the public. In the realm of small language models, there have been notable advancements: TinyGSM [23], with 2.6 billion parameters, achieves over 80% accuracy on the GSM8k [7] benchmark. Additionally, models such as Phi [13] have demonstrated capabilities in language understanding, commonsense reasoning, and code generation, rivaling larger language models like LLaMA2-7B. This progress underscores the significant strides being made in the efficiency and effectiveness of smaller-scale language models.


In this paper, we introduce LLaVA-Phi, a compact vision-language assistant powered by a small language model. Our work combines the powerful open-sourced multi-modal model, LLaVA-1.5 [24], with the best-performing open-sourced small language model, Phi-2 [21]. We follow a two-stage training pipeline and leverage high-quality visual instruction tuning data from LLaVA. LLaVA-Phi was evaluated across eight diverse benchmarks. Despite possessing only 3 billion parameters, it achieves performance comparable to, or even surpassing, some multi-modal models that are three times larger.
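
To give a rough sense of what the gap between a 3-billion-parameter model and the 7B-class models discussed earlier could mean for deployment, the following back-of-the-envelope sketch estimates the memory needed for model weights alone. It assumes 16-bit weights (2 bytes per parameter), which is an illustrative assumption and not a figure from the paper, and it ignores activations, the key-value cache, and runtime overhead. The function name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GiB.

    bytes_per_param=2 assumes 16-bit weights (an assumption for
    illustration; the paper does not state a deployment precision).
    """
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    # Parameter counts are the round figures quoted in the text.
    for label, num_params in [("3B (LLaVA-Phi scale)", 3e9), ("7B", 7e9)]:
        print(f"{label}: ~{weight_memory_gib(num_params):.1f} GiB of weights")
```

Under these assumptions the weight footprint scales linearly with parameter count, so a 7B model needs a little over twice the weight memory of a 3B one; real-world latency and memory use depend on many other factors not modeled here.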


Notably, LLaVA-Phi-3B demonstrates exceptional proficiency in ScienceQA [28], outperforming existing large multi-modal models. Additionally, we qualitatively demonstrate LLaVA-Phi’s strong generalization ability in handling challenging questions, generating code based on instructions, and solving mathematical problems.


This paper is available on arXiv under the CC BY 4.0 DEED license.

(1) Yichen Zhu, Midea Group;

(2) Minjie Zhu, Midea Group and East China Normal University;

(3) Ning Liu, Midea Group;

(4) Zhicai Ou, Midea Group;

(5) Xiaofeng Mou, Midea Group.