
What if an AI, when asked about a minister appointed last month, returned the name of a predecessor from a year ago? This example illustrates a critical limitation of current AI systems: their inability to reliably reflect up-to-date information. Our university’s research team has developed a new evaluation framework that automatically incorporates changes in real-world information and detects “temporal errors” that can appear plausible on the surface. The study is expected to enhance AI reliability by providing a systematic benchmarking framework.
A research team led by Professor Steven Euijong Whang from the School of Electrical Engineering, in joint research with Microsoft Research, has developed a system that automatically evaluates and diagnoses the temporal reasoning capabilities of Large Language Models (LLMs) using temporal database technology.
For AI to earn users’ trust, it must be able to accurately understand real-world information that changes over time. However, existing evaluation methods have largely focused on whether answers are simply right or wrong, or have examined only a narrow set of temporal relations, making them insufficient for evaluating the wide range of question scenarios that arise in real-world environments.
To overcome this challenge, the research team integrated “Temporal Database” design theory—an approach refined and validated over the past 40 years—into AI evaluation for the first time. By leveraging the temporal dependencies and relational structure of data, the technology can automatically generate 13 types of complex time-sensitive questions directly from the database, eliminating the need for researchers to manually create evaluation questions.

The technology replaces the conventional practice of writing evaluation questions by hand with a data-driven method that generates them automatically. Because the entire process, from question generation to answer derivation and verification, is driven by the database, evaluation items no longer need to be revised manually, which reduces the maintenance burden.
When real-world information changes, the evaluation questions, answers, and verification criteria are automatically updated simply by revising the relevant data in the database. Although the latest information must still be provided by external data sources or administrators, the framework is designed to automatically conduct the evaluation once the data has been updated.
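As a rough illustration of this idea, the Python sketch below stores facts with validity periods and derives question-answer pairs from them, so that recording a change in the data regenerates the questions and answers. It is a minimal sketch under stated assumptions, not the research team's implementation: the names `Fact`, `generate_qa`, and `record_change`, the three question templates, and the placeholder data ("Person A", "Minister of Science") are all hypothetical, and the framework's 13 question types are not reproduced here.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Fact:
    """One row of a hypothetical temporal table: who held a post, and when."""
    post: str
    holder: str
    start: date                 # first day the fact is valid
    end: Optional[date] = None  # exclusive end; None means still valid


def generate_qa(facts: List[Fact]) -> List[Tuple[str, str]]:
    """Derive (question, answer) pairs directly from the stored facts."""
    qa = []
    for post in sorted({f.post for f in facts}):
        history = sorted((f for f in facts if f.post == post),
                         key=lambda f: f.start)
        # Relational question: who came immediately before each holder?
        for prev, cur in zip(history, history[1:]):
            qa.append((f"Who held the post of {post} immediately before "
                       f"{cur.holder}?", prev.holder))
        for f in history:
            # Point-in-time question anchored at the start of each period.
            qa.append((f"Who held the post of {post} on "
                       f"{f.start.isoformat()}?", f.holder))
            # "Current" question for the fact that is still valid.
            if f.end is None:
                qa.append((f"Who currently holds the post of {post}?",
                           f.holder))
    return qa


def record_change(facts: List[Fact], post: str, new_holder: str,
                  day: date) -> List[Fact]:
    """Close the currently valid fact for a post and append the new one."""
    updated = [replace(f, end=day) if f.post == post and f.end is None else f
               for f in facts]
    updated.append(Fact(post, new_holder, day))
    return updated


# Placeholder data (invented for illustration only).
facts = [
    Fact("Minister of Science", "Person A", date(2023, 5, 1), date(2024, 9, 1)),
    Fact("Minister of Science", "Person B", date(2024, 9, 1)),
]
before = dict(generate_qa(facts))

# A new appointment is recorded; nothing else is edited by hand.
facts = record_change(facts, "Minister of Science", "Person C",
                      date(2025, 8, 1))
after = dict(generate_qa(facts))

current_q = "Who currently holds the post of Minister of Science?"
print(before[current_q])  # Person B
print(after[current_q])   # Person C
```

In this sketch the only manual step is the call to `record_change`; the question about the current holder and the new "immediately before Person C" question both follow from the updated rows.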
Additionally, going beyond traditional methods that assess only whether a final answer is correct or incorrect, the research team introduced a new metric that evaluates the factual validity of the dates or time periods used during the answering process. Using this metric, the team achieved a 21.7% improvement in detecting “Temporal Hallucinations”—cases in which an answer appears correct on the surface but is based on faulty temporal reasoning—compared with previous methods.
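The difference between grading only the final answer and also checking the dates behind it can be sketched as follows. This is an illustrative assumption about how such a check might look, not the metric defined in the paper: the `ModelResponse` structure, the `grade` function, the three labels, and the placeholder data are hypothetical, and the sketch assumes the model's cited start date has already been extracted from its explanation.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    """Hypothetical temporal-table row with a half-open validity period."""
    post: str
    holder: str
    start: date
    end: Optional[date] = None  # exclusive end; None means still valid


@dataclass(frozen=True)
class ModelResponse:
    """Assumed shape of a parsed model output."""
    answer: str        # the final answer given by the model
    cited_start: date  # the start date the model cited in its reasoning


def grade(resp: ModelResponse, facts: List[Fact], post: str,
          as_of: date) -> str:
    """Label a response using both the answer and the date it relied on."""
    truth = next((f for f in facts
                  if f.post == post and f.start <= as_of
                  and (f.end is None or as_of < f.end)), None)
    if truth is None:
        raise ValueError(f"no fact valid for {post!r} on {as_of}")
    answer_ok = resp.answer == truth.holder
    dates_ok = resp.cited_start == truth.start
    if answer_ok and dates_ok:
        return "correct"
    if answer_ok:
        # Right name, wrong supporting date: an answer-only check misses this.
        return "temporal hallucination"
    return "incorrect"


# Placeholder data (invented for illustration only).
facts = [
    Fact("Minister of Science", "Person A", date(2023, 5, 1), date(2024, 9, 1)),
    Fact("Minister of Science", "Person B", date(2024, 9, 1)),
]
response = ModelResponse(answer="Person B", cited_start=date(2023, 5, 1))
label = grade(response, facts, "Minister of Science", date(2025, 1, 15))
print(label)  # temporal hallucination
```

An answer-only check would mark this response as correct, whereas comparing the cited date against the stored validity period exposes the faulty temporal reasoning.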
The database-based approach also improved evaluation efficiency. By eliminating the reliance on unnecessary data, the research team reduced the amount of input data required by an average of 51% compared with previous methods and demonstrated its effectiveness in reducing evaluation maintenance costs.

Professor Steven Euijong Whang stated, “This research shows that classical database design theory can play a crucial role in addressing the reliability challenges of today’s AI systems. By transforming large amounts of domain-specific data into evaluation resources, we expect this work to provide a practical foundation for verifying AI performance in various fields such as medicine and law.”
Soyeon Kim, a PhD student at KAIST, is the lead author of the study, with Jindong Wang (Microsoft Research, currently at William & Mary) and Xing Xie (Microsoft Research) as co-authors. The results will be presented this April at ICLR 2026, one of the most prestigious academic conferences in the field of artificial intelligence.
※ Paper Title: Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models
※ Paper Link: https://arxiv.org/abs/2508.02045
This research was supported by Microsoft Research, the National Research Foundation of Korea, and the Institute for Information & Communications Technology Planning & Evaluation (IITP) Global AI Frontier Lab projects (RS-2024-00469482, RS-2024-00509258).