
Professor YongMan Ro’s research team develops a multimodal large language model that surpasses the performance of GPT-4V


 


<(From left) Professor YongMan Ro, Ph.D. candidate ByungKwan Lee, integrated M.S.-Ph.D. candidate Beomchan Park, Ph.D. candidate Chae Won Kim>
 

On June 20, 2024, Professor YongMan Ro’s research team announced that it had developed and released an open-source multimodal large language model that surpasses the visual performance of closed commercial models such as OpenAI’s ChatGPT/GPT-4V and Google’s Gemini-Pro. A multimodal large language model is a large language model capable of processing not only text but also image data.

 

The recent advancement of large language models (LLMs) and the emergence of visual instruction tuning have drawn significant attention to multimodal large language models. Backed by abundant computing resources, however, large overseas corporations are building extremely large models whose parameter counts approach the number of neurons in the human brain.

Because these models are developed privately, the performance and technology gap relative to large language models developed in academia keeps widening. In other words, the open-source large language models released so far not only fail to match closed models such as ChatGPT/GPT-4V and Gemini-Pro, but trail them by a wide margin.

 

To improve the performance of multimodal large language models, existing open-source efforts have either increased model size to boost learning capacity or expanded and refined the visual instruction tuning datasets that cover various vision-language tasks. However, these approaches demand vast computational resources or intensive human labor, highlighting the need for new, efficient methods to enhance the performance of multimodal large language models.

 

Professor YongMan Ro’s research team has announced two technologies that substantially enhance the visual performance of multimodal large language models without significantly increasing model size or constructing high-quality visual instruction tuning datasets by hand.

 

With their first technology, CoLLaVO, the research team verified that the main reason existing open-source multimodal large language models perform markedly worse than closed models is their much weaker object-level image understanding. They further showed that this object-level understanding is strongly and decisively correlated with a model’s ability to handle vision-language tasks.

 


[Figure – Crayon Prompt Training Methodology]
 

To efficiently strengthen this capability and improve performance on vision-language tasks, the team introduced a new visual prompt called the Crayon Prompt. It uses a computer vision model for panoptic segmentation to partition image information into background and object units, and each segment’s information is fed directly into the multimodal large language model as input.
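The following is a minimal, illustrative sketch of this idea: an off-the-shelf panoptic segmentation model splits an image into background and object segments, and the segments are turned into a crayon-style prompt. The segmentation model, the dictionary format, and the final text prompt are assumptions for illustration only; the released CoLLaVO code (https://github.com/ByungKwanLee/CoLLaVO) implements the actual Crayon Prompt.

```python
# Illustrative sketch of a crayon-style prompt built from panoptic segmentation.
# Helper names and the prompt format below are hypothetical, not CoLLaVO's API.
import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

processor = AutoImageProcessor.from_pretrained(
    "facebook/mask2former-swin-large-coco-panoptic")
seg_model = Mask2FormerForUniversalSegmentation.from_pretrained(
    "facebook/mask2former-swin-large-coco-panoptic")

image = Image.open("scene.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = seg_model(**inputs)

# One panoptic map: every pixel belongs to a background ("stuff")
# or object ("thing") segment.
panoptic = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[image.size[::-1]])[0]
seg_map, segments = panoptic["segmentation"], panoptic["segments_info"]

# Build the crayon-style inputs: each segment contributes its class label
# and a binary mask, which the multimodal LLM would receive alongside the
# usual image tokens.
crayons = []
for seg in segments:
    label = seg_model.config.id2label[seg["label_id"]]
    mask = (seg_map == seg["id"])          # boolean mask for this segment
    crayons.append({"label": label, "mask": mask})

prompt_text = "Objects and background in the image: " + ", ".join(
    c["label"] for c in crayons)
# `crayons` and `prompt_text` stand in for the visual-prompt inputs that the
# multimodal LLM consumes together with the raw image.
```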

 

Additionally, to ensure that the information learned through the Crayon Prompt is not lost during the visual instruction tuning phase, the team proposed an innovative training strategy called Dual QLoRA.

This strategy trains the object-level image understanding ability and the vision-language task processing capability with different parameters, preventing information learned for one from being lost while training the other.

As a result, the CoLLaVO multimodal large language model shows a superior ability to distinguish between background and objects within images, significantly strengthening its fundamental visual discrimination ability.
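Below is a conceptual sketch, using Hugging Face transformers and peft, of the Dual QLoRA idea described above: a 4-bit-quantized backbone carries two separate LoRA adapter sets, one updated on object-level (crayon) batches and the other on visual instruction batches, so the two abilities are stored in different parameters. The backbone, adapter names, and alternation schedule are assumptions for illustration, not CoLLaVO’s released training recipe.

```python
# Conceptual sketch of two LoRA adapter sets on one quantized backbone.
# Model name, adapter names, and schedule are placeholders, not CoLLaVO's.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit (QLoRA-style) quantization of the frozen language backbone.
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
backbone = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # placeholder backbone
    quantization_config=bnb)
backbone = prepare_model_for_kbit_training(backbone)

lora = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")

# Two independent adapter sets on the same frozen, quantized weights:
# one for object-level (crayon) understanding, one for VL instruction tuning.
model = get_peft_model(backbone, lora, adapter_name="crayon")
model.add_adapter("instruction", lora)

def training_step(batch, adapter_name):
    """Route this batch's gradient updates to a single adapter."""
    model.set_adapter(adapter_name)      # only this adapter stays trainable
    loss = model(**batch).loss
    loss.backward()
    return loss

# Alternating batches keeps instruction tuning from overwriting what the
# crayon adapter learned about objects and background, e.g.:
#   training_step(crayon_batch, "crayon")
#   training_step(instruction_batch, "instruction")
```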

 


[Figure – CoLLaVO Multimodal LLM Performance Evaluation]
 
 
Following the development of CoLLaVO, Professor YongMan Ro’s research team developed and released their second large language model, MoAI. This model is inspired by cognitive-science elements that humans use to judge objects, such as understanding the presence, state, and interactions of objects, as well as background comprehension and text interpretation.

The team pointed out that existing multimodal large language models use vision encoders that are semantically aligned with text, leading to a lack of detailed and comprehensive real-world scene understanding at the pixel level.

 
To incorporate these cognitive science elements into a multimodal large language model, MoAI employs four computer vision models: panoptic segmentation, open-world object detection (which has no limits on detectable objects), scene graph generation, and optical character recognition (OCR).
 
The results from these four computer vision models are then translated into human-understandable language and directly used as input for the multimodal large language model.
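As a rough illustration of that verbalization step, the sketch below renders hypothetical outputs of the four auxiliary vision models as plain sentences that can be prepended to the user’s question. The data structures and wording are assumptions, not MoAI’s actual interfaces; the released code is at https://github.com/ByungKwanLee/MoAI.

```python
# Illustrative sketch: verbalize outputs of four auxiliary vision models
# (panoptic segmentation, open-world detection, scene graph, OCR) into text.
# All structures and example values below are hypothetical placeholders.
from typing import Dict, List, Tuple

def verbalize_aux_outputs(panoptic: List[str],
                          detections: List[Dict],
                          scene_graph: List[Tuple[str, str, str]],
                          ocr_text: List[str]) -> str:
    """Turn computer-vision outputs into a human-readable auxiliary prompt."""
    lines = []
    lines.append("Segments in the scene: " + ", ".join(panoptic) + ".")
    lines.append("Detected objects: " + ", ".join(
        f"{d['label']} at {d['box']}" for d in detections) + ".")
    lines.append("Relations: " + "; ".join(
        f"{s} {rel} {o}" for s, rel, o in scene_graph) + ".")
    if ocr_text:
        lines.append("Text read in the image: " + " ".join(ocr_text) + ".")
    return "\n".join(lines)

# Example with made-up detections for a street scene.
aux_prompt = verbalize_aux_outputs(
    panoptic=["sky", "road", "person", "bicycle"],
    detections=[{"label": "person", "box": (120, 40, 260, 300)},
                {"label": "bicycle", "box": (100, 180, 290, 340)}],
    scene_graph=[("person", "riding", "bicycle")],
    ocr_text=["STOP"],
)
question = "What is the person doing?"
# The verbalized outputs plus the question are given to the multimodal LLM
# together with the image itself.
llm_input = aux_prompt + "\n" + question
```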

By combining CoLLaVO’s simple and efficient Crayon Prompt + Dual QLoRA approach with MoAI’s suite of computer vision models, the research team verified that their models outperform closed commercial models such as OpenAI’s ChatGPT/GPT-4V and Google’s Gemini-Pro.

 
 
[Figure – MoAI Multimodal LLM Performance Evaluation]
 
 
The two consecutive multimodal large language models, CoLLaVO and MoAI, were developed with ByungKwan Lee (Ph.D. student) as first author, and Beomchan Park (integrated master’s and Ph.D. student) and Chae Won Kim (Ph.D. student) contributed as co-authors.
The open-source large language model CoLLaVO was accepted on May 16, 2024, to Findings of the Association for Computational Linguistics (ACL Findings) 2024, a prestigious international venue in natural language processing (NLP). MoAI is currently under review at the European Conference on Computer Vision (ECCV) 2024, a top international conference in computer vision.

Professor YongMan Ro stated, “The open-source multimodal large language models developed by our research team, CoLLaVO and MoAI, have been featured in Hugging Face Daily Papers and are being recognized by researchers worldwide through various social media platforms. Since all of the models have been released as open source, they will contribute to the advancement of multimodal large language models.”

This research was conducted at the Future Defense Artificial Intelligence Specialization Research Center and the School of Electrical Engineering of the Korea Advanced Institute of Science and Technology (KAIST).

 

[1] CoLLaVO Demo GIF Video Clip https://github.com/ByungKwanLee/CoLLaVO

 


< CoLLaVO Demo GIF >

 

[2] MoAI Demo GIF Video Clip https://github.com/ByungKwanLee/MoAI


< MoAI Demo GIF >