Research

Research Highlights

KAIST Researchers First in the World to Identify Security Threat Exploiting Google Gemini’s “Malicious Expert AI” Structure

images 000117 photo1.jpg 3
< (From left) Ph.D. candidates Mingyoo Song and Jaehan Kim, Professor Sooel Son, (Top right) Professor Seungwon Shin, Lead Researcher Seung Ho Na >

Most major commercial Large Language Models (LLMs), such as Google’s Gemini, utilize a Mixture-of-Experts (MoE) structure. This architecture enhances efficiency by dynamically selecting and using multiple “small AI models (Expert AIs)” depending on input queries. However, the EE research team has revealed for the first time in the world that this very structure can actually become a new security threat.

 

A joint research team led by Professor Seungwon Shin (School of Electrical Engineering) and Professor Sooel Son (School of Computing) has identified an attack technique that can seriously compromise the safety of LLMs by exploiting the MoE structure. For this research, they received the Distinguished Paper Award at ACSAC 2025, one of the most prestigious international conferences in the field of information security.

 

ACSAC (Annual Computer Security Applications Conference) is among the most influential international academic conferences in security. This year, only two papers out of all submissions were selected as Distinguished Papers. It is highly unusual for a domestic Korean research team to achieve such a feat in the field of AI security.

 

In this study, the team systematically analyzed the fundamental security vulnerabilities of the MoE structure. In particular, they demonstrated that even if an attacker does not have direct access to the internal structure of a commercial LLM, the entire model can be induced to generate dangerous responses if just one maliciously manipulated “Expert Model” is distributed through open-source channels and integrated into the system.

 

images 000117 image1.jpg 1
< Conceptual diagram of the attack technology proposed by the research team.>

 

To put it simply: even if there is only one “malicious expert” mixed among normal AI experts, that specific expert may be repeatedly selected for processing harmful queries, causing the overall safety of the AI to collapse. A particularly dangerous factor highlighted was that this process causes almost no degradation in model performance, making the problem extremely difficult to detect in advance.

 

Experimental results showed that the attack technique proposed by the research team could increase the harmful response rate from 0% to up to 80%. They confirmed that the safety of the entire model significantly deteriorates even if only one out of many experts is “infected.”

 

This research is highly significant as it presents the first new security threat that can occur in the rapidly expanding global open-source-based LLM development environment. Simultaneously, it suggests that verifying the “source and safety of individual expert models” is now essential—not just performance—during the AI model development process.

 

Professors Seungwon Shin and Sooel Son stated, “Through this study, we have empirically confirmed that the MoE structure, which is spreading rapidly for the sake of efficiency, can become a new security threat. This award is a meaningful achievement that recognizes the importance of AI security on an international level.”

 

The study involved Ph.D. candidates Jaehan Kim and Mingyoo Song, Dr. Seung Ho Na (currently at Samsung Electronics), Professor Seungwon Shin, and Professor Sooel Son. The results were presented at ACSAC in Hawaii, USA, on December 12, 2025.

 

images 000117 image2.jpg
<Photo of the Distinguished Paper Award certificate>

 

Paper Title: MoEvil: Poisoning Experts to Compromise the Safety of Mixture-of-Experts LLMs

GitHub (Open Source): https://github.com/jaehanwork/MoEvil

This research was supported by the Korea Internet & Security Agency (KISA) and the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the Ministry of Science and ICT.