
Just as people’s attention is often drawn to images before text when both appear together, multimodal artificial intelligence, which takes in multiple types of input simultaneously, tends to depend more heavily on certain data types than others.
A KAIST research team has developed a multimodal AI training technique that makes the model attend to images and text evenly, enabling much more accurate predictions in these situations.
Professor Steven Euijong Whang’s team from the School of Electrical Engineering has developed a new data augmentation technique that helps multimodal artificial intelligence—responsible for processing diverse data types simultaneously—utilize all data sources evenly.
Multimodal artificial intelligence processes multiple types of information, such as text and video, simultaneously. However, such models often make judgments biased toward one type of data, which reduces prediction performance.
To solve this problem, the research team intentionally mixed mismatched (misaligned) pairs of data into training.
By doing so, the model learns to draw on text, images, and sound in a balanced manner instead of relying on a single type of data.
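As an illustration only (the function below and its parameters are hypothetical sketches of the general idea, not the paper's actual MIDAS implementation), one way to inject misaligned pairs is to swap a fraction of one modality between samples:

```python
import random

def misalign_batch(images, texts, misalign_ratio=0.3, seed=0):
    """Replace a fraction of image-text pairs with deliberately
    mismatched ones. Returns (images, texts, aligned) where
    aligned[i] is 1 for an original pair and 0 for a mismatched one."""
    rng = random.Random(seed)
    n = len(images)
    original = list(texts)  # keep originals so a swap never
    texts = list(texts)     # accidentally restores a matching pair
    aligned = [1] * n
    for i in rng.sample(range(n), k=int(n * misalign_ratio)):
        j = rng.choice([k for k in range(n) if k != i])
        texts[i] = original[j]  # pair image i with another sample's text
        aligned[i] = 0
    return images, texts, aligned
```

Training on such deliberately mismatched pairs forces the model to check whether the modalities actually agree, rather than leaning on whichever modality dominates.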
In addition, the researchers applied a training strategy that down-weights low-quality samples and places more emphasis on difficult ones, and showed that this stably improves performance across a variety of settings.
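The idea of emphasizing difficult samples while compensating for low-quality ones can be sketched with a per-sample weighted loss. The formula below follows a common focal-loss-style pattern and uses a hypothetical `quality` score; it is a minimal illustration, not the team's exact loss:

```python
import math

def weighted_loss(probs, labels, quality, gamma=2.0):
    """Cross-entropy with two per-sample weights:
    - (1 - p) ** gamma up-weights hard samples, i.e. those whose
      predicted probability p for the true class is low;
    - quality in [0, 1] down-weights noisy / low-quality samples.
    Illustrative only; the paper's formulation may differ."""
    total = 0.0
    for p_vec, y, q in zip(probs, labels, quality):
        p = p_vec[y]  # predicted probability of the true class
        total += q * (1 - p) ** gamma * -math.log(max(p, 1e-12))
    return total / len(labels)
```

With this kind of weighting, a confidently correct sample contributes almost nothing to the loss, while a hard sample with low true-class probability dominates the gradient unless its quality score marks it as unreliable.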
This method is not bound to any specific model architecture and can be easily applied to various types of data, making it highly scalable and practical.

Professor Whang said, “To improve AI performance, how and what data are used for learning is much more important than merely changing the model structure (algorithm).
This study demonstrates that designing and processing the data itself can be an effective approach for enabling multimodal AI to utilize information in a balanced way, without being biased toward specific data such as images or text.”
This research was conducted by Ph.D. candidate Seong-Hyeon Hwang and Master’s student So Young Choi as co–first authors, with Professor Steven Euijong Whang serving as the corresponding author.
The research results will be presented at NeurIPS (the Conference on Neural Information Processing Systems), one of the world’s most prestigious AI conferences, to be held in San Diego, USA, and Mexico City, Mexico this December.
※ Paper Title: MIDAS: Misalignment-based Data Augmentation Strategy for Imbalanced Multimodal Learning
(Original Paper: https://arxiv.org/pdf/2509.25831)
Meanwhile, this research was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the following projects: “Robust, Fair, Extensible Data-Centric Continual Learning” (RS-2022-II220157) and “Non-invasive near-infrared based AI technology for the diagnosis and treatment of brain diseases” (RS-2024-00444862).