Big data analytics has become an important and challenging problem in disciplines such as computer science, biology, and medicine. As massive amounts of data become available for analysis, scalable integration techniques and knowledge bases are increasingly important. For example, the Google search engine integrates knowledge from various sources in order to provide direct answers to users. At the same time, as an adverse effect of integration, new privacy issues arise: one’s sensitive information can easily be inferred from large amounts of data. More recently, Big data analytics has also become necessary in large-scale machine learning systems in order to manage data and train models at scale.

In my talk, I will first focus on the problem of entity resolution (ER), which identifies database records that refer to the same real-world entity. In practice, ER is not a one-time process, but is constantly improved as the data, schema, and application are better understood. I will address the problem of keeping the ER result up-to-date when the ER logic “evolves” frequently. A naive approach that re-runs ER from scratch may be too expensive for large datasets. I will show when and how we can instead exploit previously “materialized” ER results to avoid redundant work under the evolved logic. I will also explain how I used crowdsourcing techniques to enhance ER, and I will briefly introduce my work on managing information leakage, where one must prevent an adversary from “connecting the dots” using ER and piecing together sensitive information, which leads to privacy loss.

Second, I will talk about how knowledge bases help search engines understand data, and I will present a new ontology called Biperpedia, whose development I led as the technical lead at Google Research and which specializes in search applications.
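To make the ER setting concrete, here is a minimal sketch of pairwise entity resolution: a match rule decides whether two records refer to the same entity, and matching records are clustered with union-find. The record fields and the `match` rule are illustrative assumptions, not the actual logic discussed in the talk.

```python
from itertools import combinations

def match(r1, r2):
    # Hypothetical ER rule: records match if they share an email,
    # or share both a name and a phone number.
    return (r1["email"] == r2["email"]
            or (r1["name"] == r2["name"] and r1["phone"] == r2["phone"]))

def resolve(records, match):
    # Union-find over record indices; each final cluster represents
    # one real-world entity.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if match(records[i], records[j]):
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(records[i])
    return list(clusters.values())

records = [
    {"name": "J. Smith",   "email": "js@x.com", "phone": "555"},
    {"name": "John Smith", "email": "js@x.com", "phone": "555"},
    {"name": "Jane Doe",   "email": "jd@y.com", "phone": "777"},
]
# The first two records share an email, so three records resolve
# into two entity clusters.
clusters = resolve(records, match)
```

When the match rule later evolves (say, the email condition is tightened), the question the talk addresses is when clusters like these can be reused rather than recomputed from scratch.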
For example, given the query “brazil coffee production 2017,” a search engine can use Biperpedia to understand that the user is asking for a numeric attribute (coffee production) of the country Brazil in the year 2017. While the attributes of existing knowledge bases like Freebase are manually curated, Biperpedia automatically extracts long-tail attributes (thousands per class) from search queries and Web text using machine learning and natural language processing techniques.

Third, I will briefly introduce ongoing research at Google, where I am developing Big data management infrastructure for large-scale machine learning systems. Unlike conventional software, the success of a machine learning system depends heavily on the quality of the data used to train its models. In fact, most significant outages in production-scale machine learning systems involve some problem in the data. Hence, in large-scale machine learning, data management (including analytics) is just as important as model training and needs to be done in a principled fashion.
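The query-understanding step above can be sketched as follows. This is a toy illustration of the idea, not Biperpedia's actual interpretation pipeline: the tiny entity and attribute dictionaries are assumptions standing in for a real ontology.

```python
import re

# Toy ontology: class -> known numeric attributes (illustrative only).
ATTRIBUTES = {"country": ["coffee production", "population", "gdp"]}
# Toy entity dictionary: entity name -> its class.
ENTITIES = {"brazil": "country", "france": "country"}

def interpret(query):
    # Hypothetical interpretation: find a known entity in the query,
    # then a known attribute of that entity's class, plus an optional year.
    q = query.lower()
    year_match = re.search(r"\b(19|20)\d{2}\b", q)
    year = year_match.group(0) if year_match else None
    for entity, cls in ENTITIES.items():
        if entity in q:
            for attr in ATTRIBUTES[cls]:
                if attr in q:
                    return {"entity": entity, "attribute": attr, "year": year}
    return None

result = interpret("brazil coffee production 2017")
# -> {'entity': 'brazil', 'attribute': 'coffee production', 'year': '2017'}
```

The point of Biperpedia is that the attribute dictionary is not hand-built as above but mined automatically, at a scale of thousands of attributes per class.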
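One simple form of principled data management for ML is validating training data against an expected schema before any model sees it. The sketch below is illustrative (the schema and field names are assumptions, not Google's actual infrastructure).

```python
# Expected schema for training examples: field name -> required type.
# Purely illustrative; a real system would also check ranges,
# distributions, and feature drift.
EXPECTED_SCHEMA = {"age": int, "income": float, "label": int}

def validate(rows, schema):
    # Return a list of human-readable schema violations; an empty
    # list means the batch is safe to feed into training.
    errors = []
    for i, row in enumerate(rows):
        for field, ftype in schema.items():
            if field not in row:
                errors.append(f"row {i}: missing field '{field}'")
            elif not isinstance(row[field], ftype):
                errors.append(
                    f"row {i}: field '{field}' has type "
                    f"{type(row[field]).__name__}, expected {ftype.__name__}")
    return errors
```

Running such checks on every incoming batch catches the data problems that would otherwise surface only as degraded model quality or an outage.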
Steven Euijong Whang is a Research Scientist at Google Research. His research interests include Big Data Analytics, Big Data Systems, DB-AI Integration, Information Integration, Knowledge Systems, and Machine Learning. The goal of his work is to understand and solve various data management challenges in the Big Data era. Dr. Whang received his Ph.D. in computer science from Stanford University in 2012, working with Prof. Hector Garcia-Molina. He received his M.S. in computer science from Stanford in 2007 and his B.S. in computer science from KAIST in 2003, graduating summa cum laude (ranked 1st out of 96 in computer science). For his dissertation, he made significant contributions to information integration (in particular, entity resolution), proposing a general framework for entity resolution and comprehensively working out a series of related issues. At Google Research, he served as the technical lead of the Biperpedia project (a scalable, state-of-the-art knowledge base) and is currently working on Big data management challenges in large-scale machine learning systems.