Sungwon Hwang, Hyungtae Lim, and Hyun Myung†, “Equivariance-bridged SO(2)-Invariant Representation Learning using Graph Convolutional Network,” in Proc. of the British Machine Vision Conference (BMVC), Nov. 2021

  • Abstract

Training a Convolutional Neural Network (CNN) to be robust against rotation has mostly been done with data augmentation. In this paper, an alternative research direction is highlighted to encourage less dependence on data augmentation by achieving structural rotational invariance of a network. The deep equivariance-bridged SO(2)-invariant network is proposed to realize this vision. First, the Self-Weighted Nearest Neighbors Graph Convolutional Network (SWN-GCN) is proposed to implement a Graph Convolutional Network (GCN) on the graph representation of an image and acquire rotationally equivariant representations, as GCN is more suitable for constructing deeper networks than spectral graph convolution-based approaches. Then, an invariant representation is obtained with Global Average Pooling (GAP), a permutation-invariant operation suitable for aggregating high-dimensional representations, over the equivariant set of vertices retrieved from SWN-GCN. Our method achieves state-of-the-art image classification performance on rotated MNIST and CIFAR-10 images, where the models are trained only on non-augmented datasets. Quantitative validations also demonstrate strong invariance of the deep representations of SWN-GCN over rotations.
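
As a rough illustration of the mechanism described above, the sketch below (plain PyTorch, not the authors' SWN-GCN) applies a simple mean-aggregation graph convolution to pixel-vertex features and then takes a global average pool; since GAP is permutation-invariant, any reordering of the vertices, such as one induced by an image rotation, leaves the pooled representation unchanged.

# Minimal illustration (not the authors' SWN-GCN): a plain graph convolution
# over pixel vertices followed by global average pooling. Because GAP is
# permutation-invariant, any vertex reordering leaves the pooled feature unchanged.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (N, in_dim) vertex features, adj: (N, N) row-normalized adjacency
        return torch.relu(self.lin(adj @ x))

def invariant_representation(x, adj, layers):
    for layer in layers:
        x = layer(x, adj)
    return x.mean(dim=0)  # global average pooling over vertices (permutation-invariant)

# toy usage: 16 vertices with 3-dim features and a normalized adjacency
N = 16
x = torch.randn(N, 3)
adj = torch.softmax(torch.randn(N, N), dim=1)
layers = nn.ModuleList([SimpleGCNLayer(3, 32), SimpleGCNLayer(32, 64)])
perm = torch.randperm(N)
z1 = invariant_representation(x, adj, layers)
z2 = invariant_representation(x[perm], adj[perm][:, perm], layers)
print(torch.allclose(z1, z2, atol=1e-5))  # True: the pooled feature is permutation-invariant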

 


Hyungyu Lee, Myeongwoo Jeong, Chanyoung Kim, Hyungtae Lim, Changgue Park, Sungwon Hwang, and Hyun Myung†, “Low-level Pose Control of Tilting Multirotor for Wall Perching Tasks Using Reinforcement Learning,” in Proc. of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, Sep. 2021

  • Abstract

Recently, the need for unmanned aerial vehicles (UAVs) that can attach to walls has been highlighted. To address this need, research on various tilting multirotors that can increase maneuverability has been conducted. Unfortunately, existing studies on tilting multirotors require considerable amounts of prior information on the complex dynamic model. Meanwhile, reinforcement learning on quadrotors has been studied to mitigate this issue. Yet, it has only been applied to standard quadrotors, whose systems are less complex than those of tilting multirotors. In this paper, a novel reinforcement learning-based method is proposed to control a tilting multirotor in real-world applications, which is the first attempt to apply reinforcement learning to a tilting multirotor. To do so, we propose a novel reward function for a neural network model that takes power efficiency into account. The model is initially trained in a simulated environment and then fine-tuned using real-world data to overcome the sim-to-real gap. Furthermore, a novel, efficient state representation with respect to the goal frame is proposed, which helps the network learn the optimal policy better. As verified in real-world experiments, our proposed method shows robust controllability by overcoming the complex dynamics of tilting multirotors.
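
The sketch below is purely illustrative of the kind of design the abstract alludes to: a state expressed relative to the goal frame and a reward that trades off pose tracking against a power-efficiency penalty. The terms, weights, and power model are hypothetical, not the paper's.

# Hypothetical sketch (terms and weights are illustrative, not the paper's exact
# reward): a pose-tracking reward with a power-efficiency penalty, where the
# state is expressed relative to the goal frame as the abstract describes.
import numpy as np

def goal_frame_state(p_world, R_world, p_goal, R_goal):
    """Express position/orientation relative to the goal (e.g., wall-perching) frame."""
    p_rel = R_goal.T @ (p_world - p_goal)
    R_rel = R_goal.T @ R_world
    return p_rel, R_rel

def reward(p_rel, R_rel, rotor_thrusts, w_pos=1.0, w_att=0.5, w_power=0.01):
    pos_err = np.linalg.norm(p_rel)
    att_err = np.arccos(np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0))  # rotation angle
    power = np.sum(rotor_thrusts ** 1.5)   # crude aerodynamic power proxy
    return -(w_pos * pos_err + w_att * att_err + w_power * power)

# toy usage
p_rel, R_rel = goal_frame_state(np.array([0.1, 0.0, 1.0]), np.eye(3),
                                np.array([0.0, 0.0, 1.0]), np.eye(3))
print(reward(p_rel, R_rel, rotor_thrusts=np.array([2.0, 2.0, 2.0, 2.0])))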

 


Wonkeun Youn, Hyungtae Lim, Hyoung Sik Choi, Matthew Rhudy, Hyeok Ryu, Sungyug Kim, and Hyun Myung†, “State Estimation of HALE UAV with Deep-learning-aided Virtual AOA/SSA Sensor for Analytical Redundancy,” IEEE Robotics and Automation Letters (RA-L), vol. 6, no. 3, pp. 5276-5283, Jul. 2021

  • Abstract

High-altitude long-endurance (HALE) unmanned aerial vehicles (UAVs) are employed in a variety of fields because of their ability to fly for a long time at high altitudes, even in the stratosphere. Two paramount concerns exist: enhancing their safety during long-term flight and reducing their weight as much as possible to increase their energy efficiency, based on analytical redundancy approaches. In this letter, a novel deep-learning-aided navigation filter is proposed, which consists of two parts: an end-to-end mapping-based synthetic sensor measurement model that utilizes long short-term memory (LSTM) networks to estimate the angle of attack (AOA) and sideslip angle (SSA), and an unscented Kalman filter for state estimation. Our proposed method can not only reduce the weight of HALE UAVs but also ensure their safety by means of an analytical redundancy approach. In contrast to conventional approaches, our LSTM-based method achieves better estimation by virtue of its nonlinear mapping capability.
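
A minimal sketch of the virtual-sensor idea, assuming a generic window of flight-sensor inputs (the exact input set and network size in the paper differ): an LSTM regresses AOA and SSA, which would then be fed to the unscented Kalman filter as synthetic measurements.

# Minimal sketch (assumed input set and dimensions): an LSTM maps a window of
# available flight-sensor readings (e.g., IMU, airspeed, attitude) to the two
# synthetic measurements, AOA and SSA. This illustrates the end-to-end mapping
# idea, not the paper's exact network.
import torch
import torch.nn as nn

class VirtualAOASSASensor(nn.Module):
    def __init__(self, n_inputs=9, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)   # outputs: [AOA, SSA]

    def forward(self, x):
        # x: (batch, time, n_inputs) window of sensor readings
        h, _ = self.lstm(x)
        return self.head(h[:, -1])         # estimate at the last time step

model = VirtualAOASSASensor()
window = torch.randn(4, 50, 9)             # 4 samples, 50-step windows
print(model(window).shape)                 # torch.Size([4, 2])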

 


Layered Depth Refinement with Mask Guidance

Title: Layered Depth Refinement with Mask Guidance 

Authors: Soo Ye Kim, members of Adobe Research, and Munchurl Kim 

Abstract 

Depth maps are used in a wide range of applications, from 3D rendering to 2D image effects such as Bokeh. However, those predicted by single image depth estimation (SIDE) models often fail to capture isolated holes in objects and/or have inaccurate boundary regions. Meanwhile, high-quality masks are much easier to obtain, using commercial auto-masking tools, off-the-shelf segmentation and matting methods, or even manual editing. Hence, in this paper, we formulate a novel problem of mask-guided depth map refinement that utilizes a generic mask to refine the depth prediction of SIDE models. Our framework performs layered refinement and inpainting/outpainting, decomposing the depth map into two separate layers signified by the mask and the inverse mask. As datasets with both depth and mask annotations are scarce, we propose a self-supervised learning scheme that uses arbitrary masks and RGB-D datasets. We empirically show that our method is robust to different types of masks and initial depth predictions, accurately refining depth values in inner and outer mask boundary regions. We further analyze our model with an ablation study and demonstrate results on real applications. 
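
The layered formulation can be sketched as follows, with simple convolutions standing in for the actual inpainting/outpainting networks; this is an illustration of the decomposition, not the paper's model.

# Minimal sketch of the layered formulation described above: the initial depth is
# split by the mask and its inverse, each layer is refined/inpainted by a network,
# and the layers are recomposed with the mask.
import torch
import torch.nn as nn

refine_fg = nn.Conv2d(2, 1, 3, padding=1)   # stand-in for the inpainting network
refine_bg = nn.Conv2d(2, 1, 3, padding=1)   # stand-in for the outpainting network

def layered_refine(depth, mask):
    # depth, mask: (B, 1, H, W); mask marks the object region in [0, 1]
    fg = refine_fg(torch.cat([depth * mask, mask], dim=1))            # refine inside the mask
    bg = refine_bg(torch.cat([depth * (1 - mask), 1 - mask], dim=1))  # refine outside the mask
    return mask * fg + (1 - mask) * bg                                # recompose the two layers

depth = torch.rand(1, 1, 64, 64)
mask = (torch.rand(1, 1, 64, 64) > 0.5).float()
print(layered_refine(depth, mask).shape)    # torch.Size([1, 1, 64, 64])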

Self-Supervised Deep Monocular Depth Estimation with Ambiguity Boosting

Title: Self-Supervised Deep Monocular Depth Estimation with Ambiguity Boosting 

Authors: Juan Luis Gonzalez Bello and Munchurl Kim 

Abstract 

We propose a novel two-stage training strategy with ambiguity boosting for the self-supervised learning of single-view depths from stereo images. Our proposed two-stage learning strategy first aims to obtain a coarse depth prior by training an auto-encoder network for a stereoscopic view synthesis task. This prior knowledge is then boosted and used to self-supervise the model in the second stage of training via our novel ambiguity boosting loss. Our ambiguity boosting loss is a confidence-guided data augmentation loss that improves the accuracy and consistency of the generated depth maps under several transformations of the single-image input. To show the benefits of the proposed two-stage training strategy with boosting, our two previous depth estimation (DE) networks, one with t-shaped adaptive kernels and the other with exponential disparity volumes, are extended with our new learning strategy, referred to as DBoosterNet-t and DBoosterNet-e, respectively. Our self-supervised DBoosterNets are competitive with, and in some cases even better than, the most recent supervised state-of-the-art methods, and are remarkably superior to previous self-supervised methods for monocular DE on the challenging KITTI dataset. We present intensive experimental results, showing the efficacy of our method for the self-supervised monocular DE task. 
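
A hedged sketch of a confidence-guided consistency term in the spirit of ambiguity boosting (the paper's exact formulation differs): depths predicted from a transformed input are compared with the first-stage prior, with low-confidence pixels down-weighted.

# Hedged sketch, not the paper's exact loss: a confidence-weighted consistency
# between depths predicted under an input transformation and the coarse prior
# obtained in the first training stage.
import torch

def ambiguity_boosting_loss(depth_aug, depth_prior, confidence):
    # depth_aug:   depth predicted from a transformed (e.g., flipped/scaled) input,
    #              mapped back to the original view
    # depth_prior: coarse depth prior from the first training stage
    # confidence:  per-pixel confidence in [0, 1] for the prior
    return (confidence * (depth_aug - depth_prior).abs()).mean()

d_aug = torch.rand(2, 1, 96, 320)
d_prior = torch.rand(2, 1, 96, 320)
conf = torch.rand(2, 1, 96, 320)
print(ambiguity_boosting_loss(d_aug, d_prior, conf))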

 


XVFI: eXtreme Video Frame Interpolation (oral)

Title: XVFI: eXtreme Video Frame Interpolation (oral) 

Authors: Hyeonjun Sim*, Jihyong Oh* and Munchurl Kim (*: equal contributions) 

Abstract 

In this paper, we first present a dataset (X4K1000FPS) of 4K videos at 1000 fps with extreme motion to the research community for video frame interpolation (VFI), and propose an extreme VFI network, called XVFI-Net, that is the first to handle VFI for 4K videos with large motion. The XVFI-Net is based on a recursive multi-scale shared structure that consists of two cascaded modules: one for bidirectional optical flow learning between the two input frames (BiOF-I) and one for bidirectional optical flow learning from the target to the input frames (BiOF-T). The optical flows are stably approximated by a complementary flow reversal (CFR) proposed in the BiOF-T module. During inference, the BiOF-I module can start at any scale of input, while the BiOF-T module only operates at the original input scale, so that inference can be accelerated while maintaining highly accurate VFI performance. Extensive experimental results show that our XVFI-Net can successfully capture the essential information of objects with extremely large motions and complex textures, while the state-of-the-art methods exhibit poor performance. Furthermore, our XVFI-Net framework also performs comparably on the previous lower-resolution benchmark dataset, which shows the robustness of our algorithm as well. All source codes, pre-trained models, and the proposed X4K1000FPS dataset are publicly available at https://github.com/JihyongOh/XVFI. 
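
For intuition, the sketch below shows the simple linear scaling of the bidirectional flows that yields coarse target-to-input flows; the paper's complementary flow reversal (CFR) produces more stable approximations, so the code is only a baseline illustration.

# Simplified sketch: coarse target-to-input flows obtained by linearly scaling the
# bidirectional flows between the two input frames. The paper's CFR refines such
# approximations; this is only the naive baseline.
import torch

def approx_target_flows(flow_0to1, flow_1to0, t):
    # flow_*: (B, 2, H, W) optical flows between the two input frames,
    # t in (0, 1): temporal position of the target frame
    flow_t_to_0 = t * flow_1to0          # coarse approximation of F_{t->0}
    flow_t_to_1 = (1 - t) * flow_0to1    # coarse approximation of F_{t->1}
    return flow_t_to_0, flow_t_to_1

f01 = torch.randn(1, 2, 128, 128)
f10 = torch.randn(1, 2, 128, 128)
ft0, ft1 = approx_target_flows(f01, f10, t=0.5)
print(ft0.shape, ft1.shape)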

 


Juan Luis Gonzalez Bello and Munchurl Kim, “PLADE-Net: Towards Pixel-Level Accuracy for Self-Supervised Single-View Depth Estimation with Neural Positional Encoding and Distilled Matting Loss,” Conference on Computer Vision and Pattern Recognition (CVPR), June 19-25, 2021.


Abstract 

In this paper, we propose a self-supervised single-view pixel-level accurate depth estimation network, called PLADE-Net. The PLADE-Net is the first work that shows unprecedented accuracy levels, exceeding 95% in terms of the δ1 metric on the challenging KITTI dataset. Our PLADE-Net is based on a new network architecture with neural positional encoding and a novel loss function that borrows from the closed-form solution of the matting Laplacian to learn pixel-level accurate depth estimation from stereo images. Neural positional encoding allows our PLADE-Net to obtain more consistent depth estimates by letting the network reason about location-specific image properties such as lens and projection distortions. Our novel distilled matting Laplacian loss allows our network to predict sharp depths at object boundaries and more consistent depths in highly homogeneous regions. Our proposed method outperforms all previous self-supervised single-view depth estimation methods by a large margin on the challenging KITTI dataset, with unprecedented levels of accuracy. Furthermore, our PLADE-Net, naively extended for stereo inputs, outperforms the most recent self-supervised stereo methods, even without any advanced blocks like 1D correlations, 3D convolutions, or spatial pyramid pooling. We present extensive ablation studies and experiments that support our method’s effectiveness on the KITTI, CityScapes, and Make3D datasets. 
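
The sketch below illustrates one plausible form of neural positional encoding: a small learned encoder over normalized pixel coordinates whose output is concatenated to the image features (a CoordConv-style construction); the exact PLADE-Net module may differ.

# Illustrative sketch (not the exact PLADE-Net module): normalized pixel
# coordinates are passed through a small learned encoder and concatenated to the
# features, letting the network reason about location-specific effects such as
# lens and projection distortions.
import torch
import torch.nn as nn

class NeuralPositionalEncoding(nn.Module):
    def __init__(self, out_channels=16):
        super().__init__()
        self.encode = nn.Sequential(nn.Conv2d(2, out_channels, 1), nn.ReLU(),
                                    nn.Conv2d(out_channels, out_channels, 1))

    def forward(self, feat):
        b, _, h, w = feat.shape
        ys = torch.linspace(-1, 1, h, device=feat.device)
        xs = torch.linspace(-1, 1, w, device=feat.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([gx, gy]).unsqueeze(0).expand(b, -1, -1, -1)
        return torch.cat([feat, self.encode(coords)], dim=1)

feat = torch.randn(2, 32, 48, 160)
print(NeuralPositionalEncoding()(feat).shape)   # torch.Size([2, 48, 48, 160])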

 


Soo Ye Kim*, Hyeonjun Sim* and Munchurl Kim, “KOALAnet: Blind Super-Resolution using Kernel-Oriented Adaptive Local Adjustment,” Conference on Computer Vision and Pattern Recognition (CVPR), June 19-25, 2021 (*: equal contribution).


Abstract 

Blind super-resolution (SR) methods aim to generate a high-quality, high-resolution image from a low-resolution (LR) image containing unknown degradations. However, natural images contain various types and amounts of blur: some may be due to the inherent degradation characteristics of the camera, but some may even be intentional, for aesthetic purposes (e.g., the Bokeh effect). In the latter case, it becomes highly difficult for SR methods to disentangle the blur to be removed from the blur to be left as is. In this paper, we propose a novel blind SR framework based on kernel-oriented adaptive local adjustment (KOALA) of SR features, called KOALAnet, which jointly learns spatially-variant degradation and restoration kernels in order to adapt to the spatially-variant blur characteristics in real images. Our KOALAnet outperforms recent blind SR methods for synthesized LR images obtained with randomized degradations, and we further show that the proposed KOALAnet produces the most natural results for artistic photographs with intentional blur, which are not over-sharpened, by effectively handling images mixed with in-focus and out-of-focus areas.
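
The core operation behind kernel-oriented local adjustment, filtering each pixel with its own kernel, can be sketched as follows; here the per-pixel kernels are given as input, whereas KOALAnet predicts them jointly from the estimated degradation and the image.

# Minimal sketch of spatially-variant (per-pixel) filtering, the basic operation
# behind the kernel-oriented local adjustment described above.
import torch
import torch.nn.functional as F

def local_filtering(img, kernels, k=5):
    # img:     (B, C, H, W)
    # kernels: (B, k*k, H, W) -- one kxk kernel per spatial location
    b, c, h, w = img.shape
    patches = F.unfold(img, k, padding=k // 2)          # (B, C*k*k, H*W)
    patches = patches.view(b, c, k * k, h, w)
    return (patches * kernels.unsqueeze(1)).sum(dim=2)  # (B, C, H, W)

img = torch.rand(1, 3, 32, 32)
kernels = torch.softmax(torch.randn(1, 25, 32, 32), dim=1)  # normalized 5x5 kernels
print(local_filtering(img, kernels).shape)                  # torch.Size([1, 3, 32, 32])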

 

Jaehyup Lee, Soomin Seo and Munchurl Kim, “SIPSA-Net: Shift-Invariant Pan Sharpening with Moving Object Alignment for Satellite Imagery,” Conference on Computer Vision and Pattern Recognition (CVPR), June 19-25, 2021. (Oral Paper)


Abstract 

Pan-sharpening is a process of merging a high-resolution (HR) panchromatic (PAN) image and its corresponding low-resolution (LR) multi-spectral (MS) image to create an HR-MS, pan-sharpened image. However, due to the different sensors’ locations, characteristics, and acquisition times, PAN and MS image pairs often have various amounts of misalignment. Conventional deep-learning-based methods trained with such misaligned PAN-MS image pairs suffer from diverse artifacts, such as double-edge and blur artifacts, in the resultant pan-sharpened images. In this paper, we propose a novel framework called shift-invariant pan-sharpening with moving object alignment (SIPSA-Net), which is the first method to take into account such large misalignment of moving object regions for pan-sharpening. The SIPSA-Net has a feature alignment module (FAM) that can adjust one feature to be aligned to another feature, even between the two different PAN and MS domains. For better alignment in pan-sharpened images, a shift-invariant spectral loss is newly designed, which ignores the inherent misalignment in the original MS input, thereby having the same effect as optimizing the spectral loss with a well-aligned MS image. Extensive experimental results show that our SIPSA-Net can generate pan-sharpened images with remarkable improvements in terms of visual quality and alignment, compared to state-of-the-art methods. 
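
One way to realize a shift-invariant spectral term is sketched below, assuming the loss is taken as the minimum over small integer shifts of the reference MS image; the paper's actual loss may be formulated differently.

# Hedged sketch of a shift-invariant spectral term: the spectral error is the
# minimum over small integer shifts of the reference MS image, so an inherent
# misalignment of a few pixels is not penalized.
import torch

def shift_invariant_spectral_loss(pred, ms_ref, max_shift=2):
    # pred, ms_ref: (B, C, H, W); compare against all shifts within +/- max_shift
    losses = []
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = torch.roll(ms_ref, shifts=(dy, dx), dims=(2, 3))
            losses.append((pred - shifted).abs().mean())
    return torch.stack(losses).min()

pred = torch.rand(1, 4, 64, 64)
ms_ref = torch.roll(pred, shifts=(1, -1), dims=(2, 3))  # misaligned reference
print(shift_invariant_spectral_loss(pred, ms_ref))      # ~0: the small shift is forgiven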

 


Learning-driven exploration for reinforcement learning

[Title]

Learning-driven exploration for reinforcement learning

 

[Authors]

Muhammad Usama, Dong Eui Chang

 

[Abstract]

Effective and intelligent exploration remains an unresolved problem for reinforcement learning. Most contemporary reinforcement learning methods rely on simple heuristic strategies that are unable to intelligently distinguish the well-explored from the unexplored regions of the state space, which can lead to inefficient use of training time. We introduce entropy-based exploration (EBE), which enables an agent to efficiently explore the unexplored regions of the state space. EBE quantifies the agent’s learning in a state using the state-dependent action values and adaptively explores the state space, i.e., it performs more exploration in the less-explored regions of the state space. We perform experiments on a diverse set of environments and demonstrate that EBE enables efficient exploration that ultimately results in faster learning without having to tune any hyperparameters. The code to reproduce the experiments is available at https://github.com/Usama1002/EBE-Exploration and the supplementary video at https://youtu.be/nJggIjjzKic.
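
A minimal sketch of the entropy-based idea (the softmax temperature and the way the probability is used are illustrative details): the entropy of a softmax over a state's action values measures how much remains to be learned there, and its normalized value serves as the probability of taking a random action.

# Minimal sketch: entropy of a softmax over the state's action values, normalized
# to [0, 1], used as the exploration probability for that state.
import numpy as np

def exploration_probability(q_values, temperature=1.0):
    z = q_values / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    return entropy / np.log(len(q_values))    # normalized entropy in [0, 1]

def select_action(q_values, rng=np.random):
    if rng.rand() < exploration_probability(q_values):
        return rng.randint(len(q_values))     # explore poorly-learned states more often
    return int(np.argmax(q_values))           # exploit well-learned states

print(exploration_probability(np.array([0.0, 0.0, 0.0, 0.0])))  # ~1.0: nothing learned yet
print(exploration_probability(np.array([5.0, 0.1, 0.2, 0.0])))  # much lower: well-learned state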

 
