While “Big Data” is now in vogue, many DOE science facilities in the US have been producing vast amounts of experimental and simulation data for many years. They are projected to generate in excess of 1 exabyte per year by 2018. To accommodate these growing volumes of data, organizations will continue to deploy larger, well-provisioned storage infrastructures. These data sets, however, do not exist in isolation; they often need to be consolidated at a single data center for subsequent scientific discovery. To support the continued growth of data and the desire to move it between organizations, network operators are increasing the capabilities of the network. DOE’s ESnet, for example, has upgraded its network to 100 Gb/s between many DOE facilities, and future deployments will most likely support 400 Gb/s, followed by 1 Tb/s, throughput. While future terabit networks hold the promise of significantly improving big-data motion among geographically distributed data facilities, significant challenges must be overcome, even on today’s 100-gigabit networks, to realize full end-to-end performance. In this talk, I first identify the issues that lead to congestion on the path of an end-to-end data transfer in a terabit network environment, and present solutions for optimizing data transfers between data facilities. Second, I discuss the challenges of building a virtual data facility that combines different data sets dispersed across data facilities, and present our approach to realizing such a virtual data facility.
Prof. Youngjae Kim is an assistant professor in the Department of Computer Science and Engineering at Sogang University. Before joining Sogang University, he was a research staff member at the US Department of Energy’s Oak Ridge National Laboratory from 2009 to 2015. Dr. Kim earned a PhD in computer science and engineering from The Pennsylvania State University, an MS in computer science from KAIST, and a BS in computer science and engineering from Sogang University. His research interests focus on software infrastructure for data storage, data movement, data management, and data analysis, spanning from research and development to integration, deployment, operation, and service for big data and cloud computing environments.