Video Panoptic Segmentation (Prof. In-So Kweon)

Conference/Journal, Year: CVPR, 2020

Panoptic segmentation has become a new standard visual recognition task, unifying the previously separate semantic segmentation and instance segmentation tasks. In this paper, we propose and explore a new video extension of this task, called video panoptic segmentation. The task requires generating consistent panoptic segmentation as well as an association of instance ids across video frames. To invigorate research on this new task, we present two types of video panoptic datasets. The first is a re-organization of the synthetic VIPER dataset into the video panoptic format to exploit its large-scale pixel annotations. The second is a temporal extension of the Cityscapes val. set, providing new video panoptic annotations (Cityscapes-VPS). Moreover, we propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames. To provide appropriate metrics for this task, we propose a video panoptic quality (VPQ) metric and evaluate our method and several other baselines. Experimental results demonstrate the effectiveness of the two presented datasets. We achieve state-of-the-art results in image PQ on Cityscapes and in VPQ on the Cityscapes-VPS and VIPER datasets. The datasets and code are available at https://github.com/mcahny/vps.
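For intuition, the video extension of panoptic quality can be pictured as ordinary PQ computed over spatio-temporal "tubes" (segments of the same instance id stacked across a window of frames) rather than per-frame segments. The sketch below illustrates only that tube-matching intuition under simplifying assumptions (boolean mask tubes keyed by id, a greedy IoU > 0.5 match, no per-class averaging); it is not the released evaluation code from the repository above.

```python
# A simplified tube-matching illustration of a PQ-style score, assuming each
# tube is a boolean (T, H, W) mask keyed by instance id. Not the released
# VPQ evaluation code.
import numpy as np

def tube_iou(pred_tube: np.ndarray, gt_tube: np.ndarray) -> float:
    """IoU of two spatio-temporal boolean masks of shape (T, H, W)."""
    inter = np.logical_and(pred_tube, gt_tube).sum()
    union = np.logical_or(pred_tube, gt_tube).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def pq_over_window(pred_tubes: dict, gt_tubes: dict) -> float:
    """PQ-style score over one window: pairs with tube IoU > 0.5 are matches."""
    tp_ious, matched_pred = [], set()
    for gt in gt_tubes.values():
        for pid, pred in pred_tubes.items():
            if pid in matched_pred:
                continue
            iou = tube_iou(pred, gt)
            if iou > 0.5:                  # > 0.5 guarantees the match is unique
                tp_ious.append(iou)
                matched_pred.add(pid)
                break
    tp = len(tp_ious)
    fp = len(pred_tubes) - tp              # unmatched predictions
    fn = len(gt_tubes) - tp                # unmatched ground-truth tubes
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(tp_ious) / denom if denom > 0 else 0.0
```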


Salient View Selection for Visual Recognition of Industrial Components (Prof. In-So Kweon)

Conference/Journal, Year: ICRA, 2020

We introduce a new method to find a salient viewpoint with a deep representation, ranked by how easy the view makes semantic segmentation. The main idea of our segmentation network is to utilize a multipath network with two informative views. To collect training samples, we assume that all the information about the designed components, including error tolerances, is available. Before installing the actual camera layout, we simulate different model descriptions in a physically correct way and determine the best viewing parameters for retrieving the correct instance model from an established database. By selecting the salient viewpoint, we better capture fine-grained shape variations of specular materials. From the fixed top view, our system first predicts a 3-DoF pose of a test object in a data-driven way and then precisely aligns the model with a refined semantic mask. The presented method is experimentally validated under various conditions of our system setup. A robotic assembly task with our vision solution is also successfully demonstrated.
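As a rough illustration of the viewpoint-selection idea (not the paper's actual pipeline), one can render a component from candidate views in simulation and keep the view whose segmentation the network is most confident about. The renderer, segmentation network, and scoring rule below are hypothetical placeholders.

```python
# A hypothetical sketch of viewpoint selection by segmentation confidence.
# render_view and segment are placeholder callables standing in for the
# physically-based renderer and the two-view segmentation network.
import numpy as np

def view_score(prob_map: np.ndarray) -> float:
    """Mean foreground confidence of an (H, W, C) softmax map; channel 0 is
    assumed to be background (an illustrative scoring rule)."""
    return float(prob_map[..., 1:].max(axis=-1).mean())

def select_salient_view(model_id, candidate_views, render_view, segment):
    """Return the candidate viewpoint with the highest segmentation score."""
    best_view, best_score = None, -np.inf
    for view in candidate_views:            # e.g. (azimuth, elevation, distance)
        image = render_view(model_id, view)  # simulated rendering of the component
        prob = segment(image)                # per-pixel class probabilities
        score = view_score(prob)
        if score > best_score:
            best_view, best_score = view, score
    return best_view, best_score
```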


Linear RGB-D SLAM for Atlanta World (Prof. In-So Kweon)

Conference/Journal, Year: ICRA, 2020

We present a new linear method for RGB-D simultaneous localization and mapping (SLAM). Compared to existing techniques that rely on the Manhattan world assumption, defined by three orthogonal directions, our approach is designed for the more general Atlanta world. The Atlanta world consists of a vertical direction and a set of horizontal directions orthogonal to it, and can thus represent a wider range of scenes. Our approach leverages the structural regularity of the Atlanta world to decouple the non-linearity of camera pose estimation. This allows us to estimate the camera rotation and then the translation separately, bypassing the inherent non-linearity of traditional SLAM techniques. To this end, we introduce a novel tracking-by-detection scheme to estimate the underlying scene structure in the Atlanta representation. We then propose an Atlanta frame-aware linear SLAM framework which jointly estimates the camera motion and a planar map supporting the Atlanta structure through a linear Kalman filter. Evaluations on both synthetic and real datasets demonstrate that our approach provides favorable performance compared to existing state-of-the-art methods while extending their working range to the Atlanta world.
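The decoupling can be illustrated with a generic two-step scheme: once the Atlanta directions are tracked, the rotation follows from aligning the observed directions to the world axes, and the translation (here, the camera position) becomes a linear problem given plane-distance observations. This is a hedged sketch under assumed conventions, not the paper's Kalman-filter formulation.

```python
# A generic decoupled rotation/translation sketch under assumed conventions;
# not the paper's linear Kalman-filter formulation.
import numpy as np

def rotation_from_directions(cam_dirs: np.ndarray, world_dirs: np.ndarray) -> np.ndarray:
    """Kabsch/SVD alignment of unit directions observed in the camera frame
    (N, 3) to their Atlanta-world counterparts (N, 3): world ≈ R @ cam."""
    H = cam_dirs.T @ world_dirs
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

def camera_position_from_planes(normals_w: np.ndarray, d_map: np.ndarray,
                                d_obs: np.ndarray) -> np.ndarray:
    """Linear least squares for the camera position c from mapped planes
    n·x + d_map = 0 observed at signed distance d_obs = n·c + d_map
    (sign conventions are illustrative)."""
    A = normals_w                           # (M, 3) plane normals in world frame
    b = d_obs - d_map                       # (M,)
    c, *_ = np.linalg.lstsq(A, b, rcond=None)
    return c
```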


Globally Optimal Relative Pose Estimation for Camera on a Selfie Stick (Prof. In-So Kweon)

Conference/Journal, Year: ICRA, 2020

Taking selfies has become a photographic trend. We envision the emergence of the “video selfie,” which captures a short continuous video clip (or burst photography) of the users themselves. A selfie stick, on which the camera is mounted, is usually used to take such photos. In this scenario, we observe that the camera typically follows a special trajectory along a sphere surface. Motivated by this observation, we propose an efficient and globally optimal relative camera pose estimation between a pair of images captured by a camera mounted on a selfie stick. We exploit the special geometric structure of the camera motion constrained by the selfie stick and model it as spherical joint motion. With the new parametrization and calibration scheme, we show that the pose estimation problem can be reduced to a 3-DoF (degrees of freedom) search problem instead of a generic 6-DoF problem. This allows us to derive a fast branch-and-bound global optimization that guarantees a global optimum. We thereby achieve efficient and robust estimation even in the presence of outliers. Experiments on both synthetic and real-world data validate the performance as well as the guaranteed optimality of the proposed method.
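The 3-DoF reduction can be made concrete as follows: assuming the camera is rigidly offset from the spherical joint by a calibrated lever arm v (expressed here in the first camera frame), the relative translation is determined up to sign and scale by the relative rotation, t ∝ (R − I)v, so only the rotation needs to be searched. The sketch below scores candidate rotations with a robust epipolar cost over a coarse grid purely for illustration; the paper instead uses a branch-and-bound search with certified bounds.

```python
# Illustration only: with a calibrated lever arm v (assumed known), the
# translation implied by a candidate rotation R is t ∝ (R - I) v, so the
# relative pose can be scored as a function of R alone. A coarse grid search
# stands in for the paper's branch-and-bound.
import numpy as np
from scipy.spatial.transform import Rotation

def essential_from_rotation(R: np.ndarray, v: np.ndarray) -> np.ndarray:
    t = (R - np.eye(3)) @ v                 # translation direction from joint geometry
    t = t / (np.linalg.norm(t) + 1e-12)
    tx = np.array([[0, -t[2], t[1]],
                   [t[2], 0, -t[0]],
                   [-t[1], t[0], 0]])
    return tx @ R                           # E = [t]_x R (sign of t is irrelevant)

def epipolar_cost(R, v, x1, x2, trunc=0.01):
    """Truncated |x2^T E x1| summed over normalized correspondences (N, 3)."""
    E = essential_from_rotation(R, v)
    res = np.abs(np.einsum('ni,ij,nj->n', x2, E, x1))
    return np.minimum(res, trunc).sum()     # robust to outliers

def grid_search_rotation(v, x1, x2, n=20, max_angle=np.pi / 3):
    """Exhaustive rotation-vector grid; the paper uses branch-and-bound instead."""
    best_R, best_cost = None, np.inf
    grid = np.linspace(-max_angle, max_angle, n)
    for a in grid:
        for b in grid:
            for c in grid:
                R = Rotation.from_rotvec([a, b, c]).as_matrix()
                cost = epipolar_cost(R, v, x1, x2)
                if cost < best_cost:
                    best_R, best_cost = R, cost
    return best_R, best_cost
```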


EE PhD candidates Soo Ye Kim and Sanghyun Woo selected as 2021 Google PhD Fellows.

KAIST PhD candidates Soo Ye Kim from the School of Electrical Engineering (advisor: Prof. Munchurl Kim), Sanghyun Woo from the School of Electrical Engineering (advisor: Prof. In So Kweon), and Hae Beom Lee from the Kim Jaechul Graduate School of AI (advisor: Prof. Sung Ju Hwang) were selected as recipients of the 2021 Google PhD Fellowship. 


< The 2021 Google PhD Fellows Soo Ye Kim, Sanghyun Woo, and Hae Beom Lee (from left) >

The Google PhD Fellowship is a scholarship program that recognizes outstanding graduate students for exceptional and innovative research in computer science and related fields. This year, 75 students from around the world received the fellowship. Selected fellows receive a $10,000 scholarship and an opportunity to discuss their research with, and receive feedback from, experts at Google.

Soo Ye Kim and Sanghyun Woo were named fellows in the field of “Machine Perception, Speech Technology and Computer Vision”. Soo Ye Kim was selected for her outstanding achievements in deep learning based super-resolution, and Sanghyun Woo was selected for his outstanding achievements in the field of computer vision. Hae Beom Lee was named a fellow in the field of “Machine Learning” for his outstanding achievements in meta-learning.

 

Soo Ye Kim’s research achievements include the formulation of novel methods for super-resolution and HDR video restoration, as well as deep joint frame interpolation and super-resolution. Many of her works have been presented at leading conferences in computer vision and AI such as CVPR, ICCV, and AAAI. In addition, she has been collaborating as a research intern with the vision group at Adobe Research to study depth map refinement techniques.

 

Sanghyun Woo’s research achievements include effective deep learning model designs based on the attention mechanism and learning methods based on self-learning and simulators. His work has also been presented at leading conferences such as CVPR, ECCV, and NeurIPS. In particular, his work on the Convolutional Block Attention Module (CBAM), presented at ECCV 2018, has surpassed 2,700 citations on Google Scholar, having been referenced in many computer vision applications. He was also a recipient of the Microsoft Research PhD Fellowship in 2020.

 

Prof. Joonhyuk Kang, the head of KAIST EE, congratulated and encouraged the recipients: Sanghyun Woo for earning the fellowship while serving as research personnel for his military service, and Soo Ye Kim for her great achievements through active industry collaboration.

 

Due to the COVID-19 pandemic, the award ceremony was held virtually at the Google PhD Fellow Summit from August 31st to September 1st. The list of fellowship recipients is displayed on the Google webpage.

(Link: https://research.google/outreach/phd-fellowship/recipients/ )

 


[Research achievements of Soo Ye Kim: Deep learning based joint super-resolution and inverse tone-mapping framework for HDR videos]


 

 

[Research achievements of Sanghyun Woo: Attention mechanism based deep learning models]

Hyunjun Lim, Yeeun Kim, Kwangik Jung, Sumin Hu, and Hyun Myung, "Avoiding Degeneracy for Monocular Visual-Inertial System with Point and Line Features," in Proc. IEEE Int'l Conf. on Robotics and Automation (ICRA)

In this paper, a degeneracy avoidance method for a point-and-line-based visual simultaneous localization and mapping (SLAM) algorithm is proposed. Visual SLAM predominantly uses point features. However, point features lack robustness in low-texture and illuminance-variant environments, so line features are used to compensate for their weaknesses. In addition, point features poorly represent structures discernible to the naked eye, meaning mapped point features cannot be easily recognized. To overcome these limitations, line features were actively employed in previous studies. However, degeneracy arises in the process of using line features, and this paper attempts to solve that problem. First, a simple method to identify degenerate lines is presented. In addition, a novel structural constraint is proposed to avoid the degeneracy problem. Finally, a point-and-line-based monocular SLAM system using a robust optical-flow-based line tracking method is implemented. The results are verified in experiments on the EuRoC dataset and compared with other state-of-the-art algorithms, showing that our method yields more accurate localization as well as mapping results.
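One common way to flag a degenerate line during triangulation, used here purely as an illustration and not necessarily the paper's exact criterion, is to check the angle between the two back-projection planes: each view back-projects its observed 2D line into a plane through the camera center, and when those planes are nearly parallel the 3D line is ill-conditioned.

```python
# Illustrative degeneracy check for two-view line triangulation; not
# necessarily the exact criterion proposed in the paper.
import numpy as np

def backprojection_plane_normal(line_2d: np.ndarray, R_cw: np.ndarray) -> np.ndarray:
    """World-frame normal of the plane spanned by the camera center and a 2D
    line l = (a, b, c) in normalized image coordinates; R_cw rotates camera
    coordinates into the world frame."""
    n_cam = line_2d / np.linalg.norm(line_2d)
    return R_cw @ n_cam

def is_degenerate_line(l1, R_cw1, l2, R_cw2, min_angle_deg: float = 1.0) -> bool:
    """Flag the line when the two back-projection planes are nearly parallel."""
    n1 = backprojection_plane_normal(l1, R_cw1)
    n2 = backprojection_plane_normal(l2, R_cw2)
    cos_angle = np.clip(abs(float(n1 @ n2)), 0.0, 1.0)
    return np.degrees(np.arccos(cos_angle)) < min_angle_deg
```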

 


Hyungtae Lim, Sungwon Hwang, Hyun Myung, "ERASOR: Egocentric Ratio of Pseudo Occupancy-based Dynamic Object Removal for Static 3D Point Cloud Map Building," in Proc. IEEE Int'l Conf. on Robotics and Automation (ICRA)

Scan data of urban environments often include representations of dynamic objects, such as vehicles and pedestrians. However, when a 3D point cloud map is constructed by sequentially accumulating scan data, dynamic objects often leave unwanted traces in the map. These traces act as obstacles and thus impede mobile vehicles from achieving good localization and navigation performance. To tackle this problem, this paper presents a novel static map building method called ERASOR (Egocentric RAtio of pSeudo Occupancy-based dynamic object Removal), which is fast and robust to motion ambiguity. Our approach exploits the observation that most dynamic objects in urban environments are inevitably in contact with the ground. Accordingly, we propose the novel concept of pseudo occupancy to express the occupancy of unit space and then discriminate spaces of varying occupancy. Finally, Region-wise Ground Plane Fitting (R-GPF) is adopted to distinguish static points from dynamic points within the candidate bins that potentially contain dynamic points. As experimentally verified on SemanticKITTI, our proposed method yields promising performance compared with state-of-the-art methods, overcoming the limitations of existing ray-tracing-based and visibility-based methods.
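A hedged sketch of the pseudo-occupancy idea: describe each bin of an egocentric grid by the height span of its points and compare the span in the current scan against the accumulated map; bins whose map span greatly exceeds the scan span are candidates for containing dynamic traces, to be refined by ground fitting (R-GPF in the paper). The bin layout and threshold below are illustrative assumptions, not the released implementation.

```python
# Illustrative pseudo-occupancy comparison; bin layout (dict of bin index to
# z-values) and the ratio threshold are assumptions, not the released code.
import numpy as np

def pseudo_occupancy(points_z: np.ndarray) -> float:
    """Height span (max z - min z) of the points falling into one bin."""
    return float(points_z.max() - points_z.min()) if points_z.size else 0.0

def dynamic_candidate_bins(scan_bins: dict, map_bins: dict, ratio_thresh: float = 0.2):
    """Return map bins whose scan-to-map occupancy ratio is suspiciously low,
    i.e. the map holds much taller structure than the current scan sees."""
    candidates = []
    for idx, map_z in map_bins.items():
        scan_z = scan_bins.get(idx, np.empty(0))
        occ_map, occ_scan = pseudo_occupancy(map_z), pseudo_occupancy(scan_z)
        if occ_map > 0 and occ_scan / occ_map < ratio_thresh:
            candidates.append(idx)          # candidate for R-GPF-style refinement
    return candidates
```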

 


Jinwoo Jeon, Sungwook Jung, Eungchang Lee, Duckyu Choi, Hyun Myung, "Run Your Visual-Inertial Odometry on NVIDIA Jetson: Benchmark Tests on a Micro Aerial Vehicle," in Proc. IEEE Int'l Conf. on Robotics and Automation (ICRA)

This paper presents benchmark tests of various visual(-inertial) odometry algorithms on NVIDIA Jetson platforms. The compared algorithms include mono and stereo methods, covering visual odometry (VO) and visual-inertial odometry (VIO): VINS-Mono, VINS-Fusion, Kimera, ALVIO, Stereo-MSCKF, ORB-SLAM2 stereo, and ROVIO. As these methods are mainly used for unmanned aerial vehicles (UAVs), they must perform well in situations where the size and weight of the processing board are limited. Jetson boards released by NVIDIA satisfy these constraints, as they have a sufficiently powerful central processing unit (CPU) and graphics processing unit (GPU) for image processing. However, existing studies have not extensively compared the performance of Jetson boards as a processing platform for executing VO/VIO in terms of computing-resource usage and accuracy. Therefore, this study compares representative VO/VIO algorithms on several NVIDIA Jetson platforms, namely the NVIDIA Jetson TX2, Xavier NX, and AGX Xavier, and introduces a novel dataset, the ‘KAIST VIO dataset’, for UAVs. The dataset includes pure rotations and several geometric trajectories that are harsh for visual(-inertial) state estimation. The evaluation is performed in terms of the accuracy of the estimated odometry, CPU usage, and memory usage across the various Jetson boards, algorithms, and trajectories. We present the results of the comprehensive benchmark test and release the dataset for computer vision and robotics applications.
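For context, resource-usage benchmarking of this kind typically boils down to sampling the CPU and memory consumption of the running VO/VIO process at a fixed rate. The snippet below is a minimal sketch of such logging with psutil, not the paper's tooling; the process name filter and sampling period are placeholders.

```python
# Minimal resource-logging sketch with psutil; the process name filter and
# sampling period are placeholders, and this is not the paper's tooling.
import time
import psutil

def monitor_process(name_substring: str, period_s: float = 0.5, duration_s: float = 60.0):
    """Sample (timestamp, cpu_percent, rss_mb) of the first matching process."""
    proc = next(p for p in psutil.process_iter(['name'])
                if name_substring in (p.info['name'] or ''))
    proc.cpu_percent(None)                  # prime the per-interval CPU counter
    samples, t_end = [], time.time() + duration_s
    while time.time() < t_end:
        time.sleep(period_s)
        samples.append((time.time(),
                        proc.cpu_percent(None),          # % of one core
                        proc.memory_info().rss / 2**20)) # resident set size in MiB
    return samples
```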

Video Prediction Recalling Long-term Motion Context via Memory Alignment Learning (Prof. Yong Man Ro)

Authors: Sangmin Lee, Hak Gu Kim, Dae Hwi Choi, Hyung-Il Kim, Yong Man Ro

Our work addresses the issue of long-term motion context for predicting future frames. To predict the future precisely, it is necessary to capture which long-term motion context (e.g., walking or running) the input motion (e.g., leg movement) belongs to. The bottlenecks in dealing with long-term motion context are: (i) how to predict the long-term motion context that naturally matches input sequences with limited dynamics, and (ii) how to predict long-term motion context with high dimensionality (e.g., complex motion). To address these issues, we propose a novel motion context-aware video prediction method. To solve bottleneck (i), we introduce a long-term motion context memory (LMC-Memory) with memory alignment learning. The proposed memory alignment learning makes it possible to store long-term motion contexts in the memory and to match them with sequences containing only limited dynamics. As a result, the long-term context can be recalled from the limited input sequence. In addition, to resolve bottleneck (ii), we propose memory query decomposition to store local motion contexts (i.e., low-dimensional dynamics) and to recall the suitable local context for each local part of the input individually, which boosts the alignment effect of the memory. Experimental results show that the proposed method outperforms other sophisticated RNN-based methods, especially under long-term conditions. Further, we validate the effectiveness of the proposed network designs by conducting ablation studies and memory feature analysis. The source code of this work is available.
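The recall step of such a memory can be pictured as attention-style addressing: each local query is matched against learned memory slots by cosine similarity and the slot values are combined with softmax weights. The sketch below illustrates only this generic addressing scheme; the actual LMC-Memory architecture, query decomposition, and training losses differ, and all shapes are illustrative.

```python
# Generic attention-style memory addressing, for illustration only; shapes,
# temperature, and the absence of any learned projections are assumptions.
import numpy as np

def recall_from_memory(local_queries: np.ndarray, mem_keys: np.ndarray,
                       mem_values: np.ndarray, temperature: float = 0.1) -> np.ndarray:
    """local_queries: (P, D), one per local part; mem_keys/mem_values: (S, D).
    Returns (P, D) recalled context features."""
    q = local_queries / (np.linalg.norm(local_queries, axis=1, keepdims=True) + 1e-8)
    k = mem_keys / (np.linalg.norm(mem_keys, axis=1, keepdims=True) + 1e-8)
    sim = q @ k.T / temperature             # (P, S) scaled cosine similarities
    w = np.exp(sim - sim.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)    # softmax addressing weights
    return w @ mem_values                   # weighted recall of memory slots
```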

 


Figure 1. Memory alignment learning with long-term motion context memory