Professor Changick Kim’s Research Team Develops ‘VideoMamba,’ a High-Efficiency Model Opening a New Paradigm in Video Recognition



<(From left) Professor Changick Kim, Jinyoung Park (integrated Ph.D. candidate), Hee-Seon Kim (Ph.D. candidate), Kangwook Ko (Ph.D. candidate), and Minbeom Kim (Ph.D. candidate)>

 

On the 9th, Professor Changick Kim’s research team announced the development of a high-efficiency video recognition model named ‘VideoMamba.’ VideoMamba demonstrates superior efficiency and competitive performance compared to existing video models built on transformers, like those underpinning large language models such as ChatGPT. This breakthrough is seen as pioneering a new paradigm in the field of video utilization.

 


 

Figure 1: Comparison of VideoMamba’s memory usage and inference speed with transformer-based video recognition models. 

 

VideoMamba is designed to address the high computational complexity associated with traditional transformer-based models.

These models typically rely on the self-attention mechanism, whose computational complexity scales quadratically with sequence length. VideoMamba instead uses a Selective State Space Model (SSM) mechanism that processes sequences with linear complexity. This allows VideoMamba to effectively capture the spatio-temporal information in videos and to handle long-range dependencies within video data efficiently.
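For intuition, the following is a minimal numpy sketch of how a selective state space scan processes a token sequence in linear time. The dimensions, the parameterization of the input-dependent terms, and the weight matrices are illustrative assumptions, not VideoMamba's actual configuration.

```python
import numpy as np

def selective_ssm_scan(x, W_A, W_B, W_C):
    """Minimal selective SSM: the parameters (a_t, b_t, c_t) depend on the
    input token x_t, and the hidden state is updated once per token, so the
    cost grows linearly with sequence length T (vs. quadratic attention)."""
    T, _ = x.shape
    d_state = W_B.shape[1]
    h = np.zeros(d_state)
    ys = []
    for t in range(T):                      # single pass over the tokens: O(T)
        a_t = np.exp(-np.abs(x[t] @ W_A))   # input-dependent decay in (0, 1]
        b_t = x[t] @ W_B                    # input-dependent input projection
        c_t = x[t] @ W_C                    # input-dependent output projection
        h = a_t * h + b_t                   # recurrent state update
        ys.append(c_t * h)                  # per-token output
    return np.stack(ys)

# Toy usage: 16 "spatio-temporal tokens" with 8 features each.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
W_A, W_B, W_C = (rng.normal(size=(8, 4)) for _ in range(3))
print(selective_ssm_scan(x, W_A, W_B, W_C).shape)  # (16, 4)
```

Because the state is carried forward step by step, memory and compute grow with the number of tokens rather than with the square of the number of tokens, which is the source of the efficiency gains shown in Figure 1.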

 


 

Figure 2: Detailed structure of the spatio-temporal forward and backward Selective State Space Model in VideoMamba. 

 

To maximize the efficiency of the video recognition model, Professor Kim’s team incorporated spatio-temporal forward and backward SSMs into VideoMamba. This model integrates non-sequential spatial information and sequential temporal information effectively, enhancing video recognition performance.
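As a rough illustration of the forward and backward scans, the short sketch below reuses the selective_ssm_scan function from the previous sketch. Treating the flattened spatio-temporal tokens as a single sequence and merging the two directions by summation are assumptions made for illustration, not VideoMamba's exact design.

```python
def bidirectional_ssm(x, params_fwd, params_bwd):
    """Scan the flattened spatio-temporal token sequence in both directions
    with selective_ssm_scan (defined above), then merge the two views.
    Merging by summation is used here purely for illustration."""
    y_fwd = selective_ssm_scan(x, *params_fwd)
    y_bwd = selective_ssm_scan(x[::-1], *params_bwd)[::-1]  # reverse, scan, realign
    return y_fwd + y_bwd
```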

The research team validated VideoMamba’s performance across various video recognition benchmarks. As a result, VideoMamba achieved high accuracy with low GFLOPs (Giga Floating Point Operations) and memory usage, and it demonstrated very fast inference speed.

 

VideoMamba offers an efficient and practical solution for applications that require video analysis. In autonomous driving, for example, it can analyze driving footage to accurately assess road conditions and recognize pedestrians and obstacles in real time, helping to prevent accidents.

In the medical field, it can analyze surgical videos to monitor the patient’s condition in real-time and respond swiftly to emergencies. In sports, it can analyze players’ movements and tactics during games to improve strategies and detect fatigue or potential injuries during training to prevent them. VideoMamba’s fast processing speed, low memory usage, and high performance provide significant advantages in these diverse video utilization fields.

 

The research team includes Jinyoung Park (integrated Ph.D. candidate), Hee-Seon Kim (Ph.D. candidate), and Kangwook Ko (Ph.D. candidate) as co-first authors, Minbeom Kim (Ph.D. candidate) as a co-author, and Professor Changick Kim as the corresponding author, all from the Department of Electrical and Electronic Engineering at KAIST.

The research findings will be presented at the European Conference on Computer Vision (ECCV) 2024, one of the top international conferences in the field of computer vision, to be held in Milan, Italy, in September this year. (Paper title: VideoMamba: Spatio-Temporal Selective State Space Model).

 

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00153, Penetration Security Testing of ML Model Vulnerabilities and Defense).

Professor Seungwon Shin’s Research Team Publishes Paper at Top Conference in Computer Science (USENIX Security)



<Professor Seungwon Shin>

 

Professor Seungwon Shin's research team in our department has announced findings showing that the data prefetching feature in Apple's new M-series processors can be exploited to make traditional cache attacks more effective. Data prefetching is a key processor optimization that reduces memory access time by preloading data the program is expected to need into the cache.
 
Processors typically provide hardware-based prefetching along with a set of instructions that support software-based prefetching. Professor Seungwon Shin's team conducted a comparative analysis of the x86 and ARM Instruction Set Architectures (ISAs), showing that the data prefetching feature of ARM-based processors can be exploited more effectively for cache attacks.
Based on this analysis, the team devised three new cache-based attacks and demonstrated that a covert channel implemented on Apple's M-series processors could transmit data at more than three times the speed of traditional cache attacks. They also demonstrated an approximately eight-fold performance improvement over previous work in side-channel attacks that extract encryption keys.
 
Professor Seungwon Shin's team emphasized the significance of initiating research on security vulnerabilities in ARM-based processors, particularly now that Apple has begun manufacturing desktop processors based on the ARM architecture.
This study will be presented at USENIX Security, one of the top conferences in computer security, in August 2024 and can be found on the conference’s website. 
(https://www.usenix.org/conference/usenixsecurity24/presentation/choi)
 


Professor Chan-Hyun Youn’s Research Team Developed a Technique to Prevent Abnormal Data Generation in Diffusion Models



<(From left) Professor Chan-Hyun Youn, Jinhyeok Jang (Ph.D. candidate), Changha Lee (Ph.D. candidate), and Minsu Jeon (Ph.D.)>

 

Professor Chan-Hyun Youn’s research team from the EE department has developed a momentum-based generation technique to address the issue of abnormal data generation frequently encountered by diffusion model-based generative AI.

While diffusion model-based generative AI, which has recently garnered significant attention, generally produces realistic images, it often generates abnormal details, such as unnaturally bent joints or horses with only three legs.

 


Figure 1: Images generated by Stable Diffusion with the proposed technique

 

To address this problem, the research team reformulated the generative process of diffusion models as an optimization problem analogous to gradient descent. Both the generative process of diffusion models and gradient descent can be expressed as a generalized Expectation-Maximization procedure, and visualization revealed numerous local minima and saddle points in the generative process.

This observation suggested that inappropriate outcomes correspond to local minima or saddle points. Based on this insight, the team introduced the momentum technique, widely used in optimization, into the generative process.
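The following is a schematic numpy sketch of adding heavy-ball momentum to a generic iterative denoising loop. The placeholder denoiser, the step count, and the exact way momentum enters the update are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def sample_with_momentum(denoise_step, x_T, num_steps=50, beta=0.9):
    """Treat each reverse-diffusion update as one optimization step and smooth
    it with heavy-ball momentum, by analogy with gradient descent.
    `denoise_step(x, t)` is a placeholder returning the update direction the
    sampler would normally apply at step t."""
    x = x_T
    velocity = np.zeros_like(x)
    for t in reversed(range(num_steps)):
        update = denoise_step(x, t)            # ordinary per-step update direction
        velocity = beta * velocity + update    # accumulate momentum across steps
        x = x + velocity                       # move with the smoothed direction
    return x

# Toy usage with a dummy "denoiser" that pulls x toward the origin.
x0 = sample_with_momentum(lambda x, t: -0.05 * x, np.ones((4, 4)))
print(np.abs(x0).max() < 1.0)  # True: the iterates contract instead of oscillating
```

The intuition is that momentum carries the iterate through shallow local minima and saddle points of the generative trajectory, which, by the team's analysis, is where abnormal images tend to appear.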

 

Various experiments confirmed that the generation of inappropriate images decreased significantly without additional training, and that the quality of generated images improved even at reduced computational cost. These results offer a new view of the generative process of diffusion models as a progressive optimization problem and show that introducing momentum into this process reduces inappropriate outcomes.

 

This new research outcome is expected to not only improve generation results but also provide a new interpretation of generative AI and inspire various follow-up studies. The research findings were presented in February at the 38th Annual AAAI Conference on Artificial Intelligence (AAAI 2024) in Vancouver, Canada, one of the leading international conferences in the AI field, under the title “Rethinking Peculiar Images by Diffusion Models: Revealing Local Minima’s Role.”

Professor Chan-Hyun Youn’s Research Team Developed a Dataset Watermarking Technique for Dataset Copyright Protection



<Professor Chan-Hyun Youn, and Jinhyeok Jang Ph.D. candidate>

 

Professor Chan-Hyun Youn's research team from the EE department has developed a dataset copyright protection technique named "Undercover Bias." Undercover Bias builds on the premise that every dataset contains bias and that bias itself is discriminative. By embedding artificially generated biases into a dataset, it becomes possible to verify whether an AI model was trained on the watermarked data without adequate permission.

 

This technique addresses the issues of data copyright and privacy protection, which have become significant societal concerns with the rise of AI. It embeds a very subtle watermark into the target dataset. Unlike prior methods, the watermark is nearly imperceptible and clean-labeled. However, AI models trained on the watermarked dataset unintentionally acquire the ability to classify the watermark. The presence or absence of this property allows for the verification of unauthorized use of the dataset.
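The verification idea can be illustrated with a toy numpy sketch: a suspect model's accuracy on watermark-carrying probe samples is compared against chance level. The probe construction, the margin, and the decision rule are illustrative stand-ins, not the paper's actual protocol.

```python
import numpy as np

def verify_unauthorized_use(suspect_model, probes, watermark_labels, chance_level, margin=0.2):
    """Flag a model as trained on the watermarked dataset if it classifies
    watermark-carrying probes far above chance. `suspect_model` maps a batch
    of probes to predicted labels; everything here is a toy stand-in."""
    preds = suspect_model(probes)
    acc = float(np.mean(preds == watermark_labels))
    return acc, acc > chance_level + margin

# Toy usage: 10-class watermark, a model that "knows" the watermark vs. one that guesses.
rng = np.random.default_rng(0)
probes = rng.normal(size=(200, 32))
labels = rng.integers(0, 10, size=200)
knows = lambda x: labels                                   # always correct on the probes
guesses = lambda x: rng.integers(0, 10, size=len(x))       # chance-level behaviour
print(verify_unauthorized_use(knows, probes, labels, chance_level=0.1))    # (1.0, True)
print(verify_unauthorized_use(guesses, probes, labels, chance_level=0.1))  # (~0.1, False)
```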

 


Figure 1: Schematic of verification based on Undercover Bias

 

The research team demonstrated that the proposed method can verify models trained using the watermarked dataset with 100% accuracy across various benchmarks.

Further, they showed that models trained with adequate permission are misidentified as unauthorized with a probability of less than 3e-5%, proving the high reliability of the proposed watermark. The study will be presented at one of the leading international conferences in the field of computer vision, the European Conference on Computer Vision (ECCV) 2024, to be held in Milan, Italy, in October this year.

 

ECCV is renowned in the field of computer vision, alongside conferences like CVPR and ICCV, as one of the top-tier international academic conferences. The paper will be titled “Rethinking Data Bias: Dataset Copyright Protection via Embedding Class-wise Hidden Bias.”

Professor Chan-Hyun Youn’s Research Team Developed a Network Calibration Technique to Improve the Reliability of Artificial Neural Networks



<(From left) Professor Chan-Hyun Youn and Gyusang Cho (Ph.D. candidate)>

 

Professor Chan-Hyun Youn's research team from the EE department has developed a network calibration algorithm called "Tilt and Average" (TNA) to improve the reliability of neural networks. Unlike existing methods based on calibration maps, the TNA technique transforms the weights of the classifier's last layer, offering the significant advantage that it can be seamlessly combined with existing methods. This research is regarded as an outstanding contribution to improving the reliability of artificial intelligence.


 

The research proposes a new algorithm to address the overconfident prediction problem inherent in artificial neural networks. Exploiting the high-dimensional geometry of the last linear layer, the algorithm focuses on the angles between the row vectors of the weight matrix and proposes a mechanism that adjusts (tilts) their directions and then averages them.
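As a rough numpy sketch of what such an angular adjustment could look like, the snippet below tilts each row vector of the last-layer weight matrix slightly toward the average row direction while preserving its norm. The interpolation rule and the tilt strength are assumptions made for illustration, not the algorithm from the paper.

```python
import numpy as np

def tilt_rows_toward_mean(W, alpha=0.1):
    """Rotate each row vector of the last-layer weight matrix W slightly
    toward the average row direction, keeping each row's norm.
    alpha controls how far the directions are tilted (illustrative only)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    dirs = W / norms                                  # unit row directions
    mean_dir = dirs.mean(axis=0)
    mean_dir = mean_dir / np.linalg.norm(mean_dir)    # average direction
    tilted = (1 - alpha) * dirs + alpha * mean_dir    # tilt toward the mean
    tilted = tilted / np.linalg.norm(tilted, axis=1, keepdims=True)
    return tilted * norms                             # restore the original norms

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 64))       # 10 classes, 64-dim features
W_cal = tilt_rows_toward_mean(W)
print(np.allclose(np.linalg.norm(W, axis=1), np.linalg.norm(W_cal, axis=1)))  # True
```

Because only the last layer's weights are transformed, such an adjustment can be applied on top of any calibration-map-based method, which is the compatibility advantage the team highlights.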

 

The research team confirmed that the proposed method can reduce calibration error by up to 20%, and the algorithm's ability to integrate with existing calibration map-based techniques is a significant advantage. The results of this study are scheduled to be presented at ICML (International Conference on Machine Learning, https://icml.cc), one of the premier international conferences in the field of artificial intelligence, to be held in Vienna, Austria, this July. Now in its 41st year, ICML is renowned as one of the most prestigious and long-standing international conferences in machine learning, alongside other top conferences such as CVPR, ICLR, and NeurIPS.

 

In addition, this research was conducted with support from the Korea Coast Guard (RS-2023-00238652) and the Defense Acquisition Program Administration (DAPA) (KRIT-CT-23-020). The paper is: Gyusang Cho and Chan-Hyun Youn, "Tilt and Average: Geometric Adjustment of the Last Layer for Recalibration," ICML 2024.

 
Professor Myoungsoo Jung’s Research Team Pioneers the ‘CXL-GPU’ Market: KAIST Develops a High-Capacity, High-Performance GPU
 
<Professor Myoungsoo Jung’s Research Team>
 

Recently, big tech companies at the forefront of large-scale AI services have been competitively increasing the size of their models and data to deliver better performance to users. The latest large language models require tens of terabytes (TB, 10^12 bytes) of memory for training. A domestic research team has developed a high-capacity, high-performance AI accelerator, enabled by next-generation interface technology, that can compete with NVIDIA, which currently dominates the AI accelerator market.

 

Professor Myoungsoo Jung's research team announced on the 8th that it has developed a technology for optimizing the memory read/write performance of high-capacity GPU devices using the next-generation interface technology Compute Express Link (CXL).

 

The internal memory capacity of the latest GPUs is only a few tens of gigabytes (GB, 10^9 bytes), making it impossible to train and infer models with a single GPU. To provide the memory capacity required by large-scale AI models, the industry generally adopts the method of connecting multiple GPUs. However, this method significantly increases the total cost of ownership (TCO) due to the high prices of the latest GPUs.

 


< Representative Image of the CXL-GPU >

 

Therefore, the ‘CXL-GPU’ structure technology, which directly connects large-capacity memory to GPU devices using the next-generation connection technology, CXL, is being actively reviewed in various industries. However, the high-capacity feature of CXL-GPU alone is not sufficient for practical AI service use. Since large-scale AI services require fast inference and training performance, the memory read/write performance to the memory expansion device directly connected to the GPU must be comparable to that of the local memory of the existing GPU for actual service utilization.

 

*CXL-GPU: It supports high capacity by integrating the memory space of memory expansion devices connected via CXL into the GPU memory space. The CXL controller automatically handles operations needed for managing the integrated memory space, allowing the GPU to access the expanded memory space in the same manner as accessing its local memory. Unlike the traditional method of purchasing additional expensive GPUs to increase memory capacity, CXL-GPU can selectively add memory resources to the GPU, significantly reducing system construction costs.

 

Our research team has developed technologies that address the causes of degraded memory read/write performance in CXL-GPU devices. By allowing the memory expansion device to determine its own memory write timing, the GPU can issue writes to the memory expansion device and to its local memory simultaneously. The GPU therefore does not have to wait for the write to complete, which resolves the write performance degradation issue.

 


< Proposed CXL-GPU Architecture > 

 

Furthermore, the research team developed a technology that provides hints from the GPU device side to enable the memory expansion device to perform memory reads in advance.

With this technology, the memory expansion device can start its memory reads earlier, so that when the GPU actually needs the data it can be read from the cache (a small but fast temporary data store), improving memory read performance.
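The two ideas above (not stalling on writes, and hinting reads ahead of time) can be illustrated with a purely conceptual Python simulation. None of this corresponds to the team's actual hardware design or to any real CXL or GPU API; the class, latencies, and addresses are invented for illustration.

```python
import threading, time

class ExpanderSim:
    """Toy stand-in for a CXL memory expansion device: accesses take a fixed
    'device latency', and prefetched data lands in a small cache."""
    def __init__(self, latency=0.05):
        self.latency = latency
        self.store, self.cache = {}, {}

    def write_async(self, addr, value):
        # The device handles the write on its own timing; the caller does not wait.
        def do_write():
            time.sleep(self.latency)
            self.store[addr] = value
        t = threading.Thread(target=do_write)
        t.start()
        return t

    def prefetch_hint(self, addr):
        # Hint from the "GPU" side: start the slow read now, cache the result.
        def do_read():
            time.sleep(self.latency)
            self.cache[addr] = self.store.get(addr)
        threading.Thread(target=do_read).start()

    def read(self, addr):
        if addr in self.cache:          # fast path: data was prefetched
            return self.cache[addr]
        time.sleep(self.latency)        # slow path: full device access
        return self.store.get(addr)

dev = ExpanderSim()
w = dev.write_async(0x10, "weights")   # the "GPU" continues instead of stalling on the write
# ... other work would overlap with the write here ...
w.join()
dev.prefetch_hint(0x10)                # issue the read early
time.sleep(0.06)                       # by the time the data is needed...
print(dev.read(0x10))                  # ...it is served from the cache: "weights"
```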


< CXL-GPU Hardware Prototype >

 

This research was conducted using the ultra-fast CXL controller and CXL-GPU prototype of Panmnesia*, a semiconductor fabless startup. Through technology efficacy verification with Panmnesia's CXL-GPU prototype, the research team confirmed that AI services could be executed 2.36 times faster than with existing GPU memory expansion technology. The results will be presented this July at the HotStorage research venue, held together with the USENIX federated conferences in Santa Clara.

 

*Panmnesia possesses a proprietary CXL controller, built with purely domestic technology, that was the first in the industry to reduce the round-trip latency of CXL memory management operations to double-digit nanoseconds (one nanosecond is 10^-9 seconds). This is more than three times faster than the latest CXL controllers worldwide. Panmnesia has used its high-speed CXL controller to connect multiple memory expansion devices directly to the GPU, enabling a single GPU to form a large memory space in the terabyte range.

Professor Jung stated, “Accelerating the market adoption of CXL-GPU can significantly reduce the memory expansion costs for big tech companies operating large-scale AI services.”

 


< Evaluation Results of CXL-GPU Execution Time > 

 

Professor Minsoo Rhu has been inducted into the Hall of Fame of the IEEE/ACM International Symposium on Computer Architecture (ISCA) 2024



<Professor Minsoo Rhu>
 
Professor Minsoo Rhu has been inducted into the Hall of Fame of the IEEE/ACM International Symposium on Computer Architecture (ISCA) this year.
 
ISCA (https://www.iscaconf.org/isca2024/) is an international conference with a long history (now in its 51st year) and the highest authority in the field of computer architecture. Along with MICRO (IEEE/ACM International Symposium on Microarchitecture) and HPCA (IEEE International Symposium on High-Performance Computer Architecture), it is considered one of the top three international conferences in the computer architecture field.
 
Professor Minsoo Rhu is a leading researcher in South Korea in the field of computer architecture, working on AI semiconductors and GPU-based high-performance computing systems. Following his induction into the HPCA Hall of Fame in 2021 and the MICRO Hall of Fame in 2022, he has now published more than eight papers at ISCA and has been inducted into the ISCA Hall of Fame in 2024.
 
This year, the ISCA conference will be held from June 29 to July 3 in Buenos Aires, Argentina, where Professor Rhu’s research team will present a total of three papers (see below).
 
[Information on Professor Minsoo Rhu’s Research Team’s ISCA Presentations]
 
1. Yujeong Choi, Jiin Kim, and Minsoo Rhu, “ElasticRec: A Microservice-based Model Serving Architecture Enabling Elastic Resource Scaling for Recommendation Models,” ISCA-51     
arXiv paper link
 
2. Yunjae Lee, Hyeseong Kim, and Minsoo Rhu, “PreSto: An In-Storage Data Preprocessing System for Training Recommendation Models,” ISCA-51
arXiv paper link
 
3. Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, and Mao Yang, “Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference,” ISCA-51 
arXiv paper link

 

Professor Kim Lee-Sup Lab’s Master’s Graduate Park Jun-Young Wins Best Paper Award at the International Design Automation Conference


 


<(From left to right) Professor Kim Lee-Sup, Master’s Graduate Park Jun-Young, Ph.D. Graduate Kang Myeong-Goo, Master’s Graduate Kim Yang-Gon, Ph.D. Graduate Shin Jae-Kang, Ph.D. Candidate Han Yunki>

 

Master's graduate Park Jun-Young from Professor Kim Lee-Sup's lab in our department won the Best Paper Award at the International Design Automation Conference (DAC), held in San Francisco, USA, from June 23 to June 27. Established in 1964, DAC is an international academic conference now in its 61st year, covering semiconductor design automation, AI algorithms, and chip design. It is considered the highest authority in the field, with only about 20 percent of submitted papers selected for presentation.

The awarded research is based on Park Jun-Young's master's thesis and proposes an algorithmic approximation technique and hardware architecture that reduce the memory transfer required by the key-value (KV) cache, a bottleneck in large language model inference (see the sketch after the details below). The excellence of this research was recognized by the Best Paper Award selection committee, and it was chosen as the final Best Paper Award winner from among the four candidate papers (out of 337 presented and 1,545 submitted papers).

The details are as follows:

 

  • Conference Name: 2024 61st IEEE/ACM Design Automation Conference (DAC)
  • Date: June 23-27, 2024
  • Award: Best Paper Award
  • Authors: Park Jun-Young, Kang Myeong-Goo, Han Yunki, Kim Yang-Gon, Shin Jae-Kang, Kim Lee-Sup (Advisor)
  • Paper Title: Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation
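The sketch below illustrates the general idea of skipping KV-cache fetches for tokens whose attention probability is estimated to be negligible. The cheap low-precision estimator, the threshold, and the renormalization are illustrative guesses, not the Token-Picker design.

```python
import numpy as np

def attention_with_token_skipping(q, K, V, threshold=1e-3):
    """Estimate each cached token's attention probability from a cheap
    low-precision dot product, and only 'fetch' (use) the tokens whose
    estimated probability exceeds the threshold. Everything here is a toy."""
    d = q.shape[-1]
    # Cheap estimate: low-precision keys stand in for data already on-chip.
    K_cheap = K.astype(np.float16).astype(np.float32)
    est = np.exp(q @ K_cheap.T / np.sqrt(d))
    est = est / est.sum()
    keep = est >= threshold                      # tokens worth transferring
    # "Full-precision" attention over the kept tokens only.
    scores = np.exp(q @ K[keep].T / np.sqrt(d))
    probs = scores / scores.sum()
    out = probs @ V[keep]
    return out, keep.sum(), len(K)

rng = np.random.default_rng(0)
q = rng.normal(size=(64,))
K = rng.normal(size=(512, 64)); V = rng.normal(size=(512, 64))
out, kept, total = attention_with_token_skipping(q, K, V)
print(f"fetched {kept}/{total} cached tokens")
```

Because softmax weights for most cached tokens are tiny during text generation, pruning them before the expensive memory transfer is where the bandwidth savings come from.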

 


Professor Shinhyun Choi’s Research Team Solves the Reliability Issues of Next-Generation Neuromorphic Computing



<(From left) Professor Shinhyun Choi, Master’s student Jongmin Bae, Postdoc Cho-ah Kwon (Hanyang University), and Professor Sang-Tae Kim (Hanyang University)>
 

Neuromorphic computing, which implements AI computation in hardware by mimicking the human brain, has recently garnered significant attention. Memristors (conductance-changing devices), used as unit elements in neuromorphic computing, boast advantages such as low power consumption, high integration, and efficiency.

However, issues with irregular device characteristics have posed reliability problems for large-scale neuromorphic computing systems.

Our research team has developed a technology to enhance reliability, potentially accelerating the commercialization of neuromorphic computing.

 

On June 21, Professor Shinhyun Choi's research team announced a collaborative study with researchers at Hanyang University. The study developed a doping method using aliovalent ions* to improve the reliability and performance of next-generation memory devices.

*Aliovalent ion: An ion with a different valence (a measure of its ability to bond) compared to the original atom.

 

Through experiments and atomic-level simulations, the joint research team confirmed that doping with aliovalent ions enhances the uniformity and performance of devices by addressing the primary issue of irregular changes in device characteristics in next-generation memory devices.

 


Figure 1. Results of aliovalent ion doping developed in this study, demonstrating the improvement effects and the material principles underpinning them

 

The team reported that the appropriate injection of aliovalent halide ions into the oxide layer could solve the irregular device reliability problem, thereby improving device performance. This method was experimentally confirmed to enhance the uniformity, speed, and performance of device operation.

 

Furthermore, atomic-level simulation analysis showed that the performance improvement effect of the device was consistent with the experimental results observed in both crystalline and amorphous environments. The study revealed that doped aliovalent ions attract nearby oxygen vacancies, enabling stable device operation, and expand the space near the ions, allowing faster device operation.

 

Professor Shinhyun Choi stated, “The aliovalent ion doping method we developed significantly enhances the reliability and performance of neuromorphic devices. It can contribute to the commercialization of next-generation memristor-based neuromorphic computing, and the principles we uncovered can be applied to various semiconductor devices.”

 

This research, with Master’s student Jongmin Bae and Postdoctoral researcher Choa Kwon from Hanyang University as co-first authors, was published in the June issue of the international journal ‘Science Advances’ (Paper title: Tunable ion energy barrier modulation through aliovalent halide doping for reliable and dynamic memristive neuromorphic systems).

 

The study was supported by the National Research Foundation of Korea’s Advanced Device Source Proprietary Technology Development Program, the Advanced Materials PIM Device Program, the Young Researcher Program, the Nano Convergence Technology Institute Semiconductor Process-based Nano-Medical Device Development Project, and the Innovation Support Program of the National Supercomputing Center.

Professor YongMan Ro’s research team develops a multimodal large language model that surpasses the performance of GPT-4V


 


<(From left) Professor YongMan Ro, Ph.D. candidate ByungKwan Lee, Ph.D. candidate Beomchan Park (integrated), and Ph.D. candidate Chae Won Kim>
 

On June 20, 2024, Professor YongMan Ro’s research team announced that they have developed and released an open-source multimodal large language model that surpasses the visual performance of closed commercial models like OpenAI’s ChatGPT/GPT-4V and Google’s Gemini-Pro. A multimodal large language model refers to a massive language model capable of processing not only text but also image data types.

 

The recent advancement of large language models (LLMs) and the emergence of visual instruction tuning have brought significant attention to multimodal large language models. However, backed by the abundant computing resources of large overseas corporations, very large models with parameter counts comparable to the number of neurons in the human brain are being created.

These models are all developed privately, so the performance and technology gap relative to large language models developed in academia keeps widening. In other words, the open-source large language models released so far have not come close to matching the performance of closed large language models like ChatGPT/GPT-4V and Gemini-Pro, and the gap remains significant.

 

To improve the performance of multimodal large language models, existing open-source large language models have either increased the model size to enhance learning capacity or expanded the quality of visual instruction tuning datasets that handle various vision-language tasks. However, these methods require vast computational resources or are labor-intensive, highlighting the need for new, efficient methods to enhance the performance of multimodal large language models.

 

Professor YongMan Ro's research team announced the development of two technologies that substantially enhance the visual performance of multimodal large language models without significantly increasing the model size or creating new high-quality visual instruction tuning datasets.

 

With the first technology, CoLLaVO, the research team verified that the primary reason existing open-source multimodal large language models perform far worse than closed models is a markedly lower capability in object-level image understanding. Furthermore, they revealed that a model's object-level image understanding ability has a decisive and significant correlation with its ability to handle visual-language tasks.

 


[Figure – Crayon Prompt Training Methodology]
 

To efficiently enhance this capability and improve performance on visual-language tasks, the team introduced a new visual prompt called the Crayon Prompt. This method leverages a computer vision technique known as panoptic segmentation to partition the image into background and object regions. The information from each region is then fed directly into the multimodal large language model as input.
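As a toy illustration of this recipe, the sketch below turns a panoptic segmentation map into per-pixel prompt features that could accompany the usual image features. The embedding table, shapes, and the exact way the per-region information is injected into the model are illustrative placeholders, not CoLLaVO's implementation.

```python
import numpy as np

def build_crayon_prompt(panoptic_ids, id_to_class, num_classes, embed_dim=16):
    """Toy version of turning a panoptic segmentation map into per-pixel
    prompt features: each pixel receives the embedding of its region's class
    ('background' vs. specific objects), which is then handed to the
    multimodal LLM together with the usual image features."""
    rng = np.random.default_rng(0)
    class_embed = rng.normal(size=(num_classes, embed_dim))        # stand-in learned table
    class_map = np.array([[id_to_class[i] for i in row]            # region id -> class id
                          for row in panoptic_ids])
    return class_embed[class_map]                                  # (H, W, embed_dim)

# Toy usage: a 4x4 "image" with one object (region 1) on background (region 0).
panoptic = np.array([[0, 0, 0, 0],
                     [0, 1, 1, 0],
                     [0, 1, 1, 0],
                     [0, 0, 0, 0]])
crayon = build_crayon_prompt(panoptic, id_to_class={0: 0, 1: 7}, num_classes=10)
print(crayon.shape)  # (4, 4, 16)
```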

 

Additionally, to ensure that the information learned through the Crayon Prompt is not lost during the visual instruction tuning phase, the team proposed an innovative training strategy called Dual QLoRA.

This strategy trains object-level image understanding ability and visual-language task processing capability with different parameters, preventing the loss of information between them.

Consequently, the CoLLaVO multimodal large language model exhibits a superior ability to distinguish between background and objects within images, significantly enhancing its basic visual discrimination ability.

 


[Figure – CoLLaVO Multimodal LLM Performance Evaluation]
 
 
Following the development of CoLLaVO, Professor YongMan Ro’s research team developed and released their second large language model, MoAI. This model is inspired by the cognitive-science elements that humans use to judge objects, such as understanding the presence, state, and interactions of objects, as well as background comprehension and text interpretation.

The team pointed out that existing multimodal large language models use vision encoders that are semantically aligned with text, leading to a lack of detailed and comprehensive real-world scene understanding at the pixel level.

 
To incorporate these cognitive science elements into a multimodal large language model, MoAI employs four computer vision models: panoptic segmentation, open-world object detection (which has no limits on detectable objects), scene graph generation, and optical character recognition (OCR).
 
The results from these four computer vision models are then translated into human-understandable language and directly used as input for the multimodal large language model.
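The snippet below is a toy illustration of this step: the outputs of auxiliary computer vision models are verbalized into plain-language context that can be given to the multimodal LLM alongside the user's question. The input formats and the phrasing are assumptions made for illustration, not MoAI's actual interface.

```python
def verbalize_cv_outputs(detections, scene_graph, ocr_texts):
    """Toy illustration of turning the outputs of auxiliary computer vision
    models into plain-language auxiliary context for the multimodal LLM."""
    lines = []
    if detections:
        lines.append("Objects in the image: " + ", ".join(detections) + ".")
    for subj, rel, obj in scene_graph:              # scene-graph triples
        lines.append(f"{subj} {rel} {obj}.")
    if ocr_texts:
        lines.append("Text visible in the image: " + "; ".join(ocr_texts) + ".")
    return "\n".join(lines)

aux = verbalize_cv_outputs(
    detections=["person", "bicycle"],
    scene_graph=[("person", "is riding", "bicycle")],
    ocr_texts=["BIKE LANE"],
)
print(aux)  # auxiliary context supplied to the model with the user's question
```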

By combining CoLLaVO’s simple and efficient Crayon Prompt + Dual QLoRA approach with MoAI’s array of computer vision models, the research team verified that their models outperformed closed commercial models like OpenAI’s ChatGPT/GPT-4V and Google’s Gemini-Pro.

 
 
[Figure – MoAI Multimodal LLM Performance Evaluation]
 
 
The two consecutive multimodal large language models, CoLLaVO and MoAI, were developed with ByungKwan Lee (Ph.D. student) as the first author, and Beomchan Park (integrated master’s and Ph.D. student) and Chae Won Kim (Ph.D. student) as co-authors.
The open-source large language model CoLLaVO was accepted on May 16, 2024, to the prestigious international conference in the field of natural language processing (NLP), ‘Findings of the Association for Computational Linguistics (ACL Findings) 2024’. MoAI is currently under review at the top international conference in computer vision, the ‘European Conference on Computer Vision (ECCV) 2024’.

Accordingly, Professor YongMan Ro stated, “The open-source multimodal large language models developed by our research team, CoLLaVO and MoAI, have been recommended on Huggingface Daily Papers and are being recognized by researchers worldwide through various social media platforms. Since all the models have been released as open-source large language models, these research models will contribute to the advancement of multimodal large language models.”

This research was conducted at the Future Defense Artificial Intelligence Specialization Research Center and the School of Electrical Engineering of the Korea Advanced Institute of Science and Technology (KAIST).

 

[1] CoLLaVO Demo GIF Video Clip https://github.com/ByungKwanLee/CoLLaVO

 


< CoLLaVO Demo GIF >

 

[2] MoAI Demo GIF Video Clip https://github.com/ByungKwanLee/MoAI


< MoAI Demo GIF >