Seong Min Kye, Joon Son Chung, Hoirin Kim, “Supervised Attention for Speaker Recognition,” in IEEE Spoken Language Technology Workshop (SLT), pp. 286-293, Jan. 2021.

The recently proposed self-attentive pooling (SAP) has shown good performance in several speaker recognition systems. In SAP systems, the context vector is trained end-to-end together with the feature extractor, where the role of the context vector is to select the most discriminative frames for speaker recognition. However, SAP underperforms the temporal average pooling (TAP) baseline in some settings, which implies that the attention is not learnt effectively in end-to-end training. To tackle this problem, we introduce strategies for training the attention mechanism in a supervised manner, which learn the context vector using classified samples. With our proposed methods, the context vector can be guided to select the most informative frames. We show that our method outperforms existing methods in various experimental settings, including short-utterance speaker recognition, and achieves competitive performance over the existing baselines on the VoxCeleb datasets.
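To make the comparison in the abstract concrete, below is a minimal PyTorch sketch of self-attentive pooling against the temporal average pooling baseline over frame-level speaker features. The tensor shapes, layer sizes, and class names are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch of self-attentive pooling (SAP) vs. temporal average pooling (TAP)
# over frame-level speaker features. Shapes and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePooling(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        self.linear = nn.Linear(feat_dim, feat_dim)
        # Learnable context vector that scores the relevance of each frame.
        self.context = nn.Parameter(torch.randn(feat_dim))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim) frame-level features from the encoder
        scores = torch.tanh(self.linear(frames)) @ self.context   # (batch, time)
        weights = F.softmax(scores, dim=1).unsqueeze(-1)          # (batch, time, 1)
        return (weights * frames).sum(dim=1)                      # (batch, feat_dim)

def temporal_average_pooling(frames: torch.Tensor) -> torch.Tensor:
    # TAP baseline: uniform weights over all frames.
    return frames.mean(dim=1)

if __name__ == "__main__":
    x = torch.randn(4, 100, 512)   # 4 utterances, 100 frames, 512-dim features
    print(SelfAttentivePooling(512)(x).shape, temporal_average_pooling(x).shape)
```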

 


Hee Seung Wang, Seong Kwang Hong, Jae Hyun Han, Young Hoon Jung, Hyun Kyu Jeong, Tae Hong Im, Chang Kyu Jeong, Bo-Yeon Lee, Gwangsu Kim, Chang D. Yoo, Keon Jae Lee, “Biomimetic and flexible piezoelectric mobile acoustic sensors with multiresonant ultrathin structures for machine learning biometrics,” Science Advances, vol. 7, no. 7, eabe5683, 2021.

Flexible resonant acoustic sensors have attracted substantial attention as an essential component for intuitive human-machine interaction (HMI) in the future voice user interface (VUI). Several studies have mimicked the basilar membrane, but the resulting devices still have dimensional drawbacks owing to the difficulty of controlling a multifrequency band and broadening the resonant spectrum to cover the full phonetic frequency range. Here, a highly sensitive piezoelectric mobile acoustic sensor (PMAS) is demonstrated by exploiting an ultrathin membrane for biomimetic frequency band control. Simulation results prove that the resonant bandwidth of a piezoelectric film can be broadened by adopting a lead-zirconate-titanate (PZT) membrane on the ultrathin polymer to cover the entire voice spectrum. Machine learning-based biometric authentication is demonstrated by the integrated acoustic sensor module with an algorithm processor and a customized Android app. Last, an exceptional reduction in the speaker identification error rate is achieved by the PMAS module with a small amount of training data, compared to a conventional microelectromechanical system (MEMS) microphone.

 


FIG. 4. Machine learning-based mobile biometric authentication of the PMAS module.

(A) Schematic diagram of machine learning (ML)-based mobile biometric authentication using the PMAS module. The multichannel signals of the PMAS were wirelessly transferred to an algorithm database for access control to a smartphone. (B) Comparison of voice features between the original sound and the PMAS module signal. The graphs include the time-domain voltage signal, the FFT response, and the STFT spectrogram. (C) Flowchart of the GMM algorithm for the speaker training and testing procedures, composed of signal averaging, feature extraction, and layer formation. The speaker decision was performed by comparing the input voice information with the pretrained dataset. (D) Speaker identification error rate of the PMAS module, which outperformed a commercial MEMS microphone with 150 training data, 150 test data, and seven mixtures. (E) Real-time mobile biometric authentication demonstrated by the PMAS module and a customized smartphone app for access permission and prohibition with five training words and one test word. Photo credit: Hee Seung Wang, Korea Advanced Institute of Science and Technology.
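As a rough illustration of the GMM speaker identification pipeline outlined in Fig. 4C, the following Python sketch fits one Gaussian mixture per speaker on extracted features and decides by maximum average log-likelihood. The use of MFCC features via librosa, the seven-component mixtures, and the helper function names are assumptions for illustration only; the signal averaging and layer formation steps of the figure are omitted.

```python
# Hedged sketch of GMM-based speaker identification: one mixture per speaker,
# decision by maximum average log-likelihood of the test features.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    # MFCCs are an assumed feature choice, not necessarily the paper's.
    audio, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
    return mfcc.T  # (frames, 20)

def train_speaker_models(train_files: dict) -> dict:
    # train_files: {speaker_id: [list of wav paths]}
    models = {}
    for speaker, files in train_files.items():
        feats = np.vstack([extract_features(f) for f in files])
        gmm = GaussianMixture(n_components=7, covariance_type="diag")
        models[speaker] = gmm.fit(feats)
    return models

def identify(wav_path: str, models: dict) -> str:
    feats = extract_features(wav_path)
    # score() returns the average per-frame log-likelihood under each model.
    return max(models, key=lambda s: models[s].score(feats))
```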

Tung Luu, Chang D. Yoo, “Hindsight Goal Ranking on Replay Buffer for Sparse Reward Environment,” IEEE Access, 2021.

This paper proposes a method for prioritizing replay experience, referred to as Hindsight Goal Ranking (HGR), to overcome the limitation of Hindsight Experience Replay (HER), which generates hindsight goals by uniform sampling. HGR samples with higher probability those states visited in an episode that have larger temporal difference (TD) error, which is regarded as a proxy for how much the RL agent can learn from an experience. Sampling for large TD error is performed in two steps: first, an episode is sampled from the replay buffer according to the average TD error of its experiences; then, for the sampled episode, a hindsight goal is drawn from the states visited later in the episode, with higher probability given to states with larger TD error. Combined with Deep Deterministic Policy Gradient (DDPG), an off-policy model-free actor-critic algorithm, the proposed method learns significantly faster than the unprioritized counterpart on four challenging simulated robotic manipulation tasks. The empirical results show that HGR uses samples more efficiently than previous methods across all tasks.
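A hedged sketch of the two-step prioritized sampling described above: an episode is drawn in proportion to its average TD error, and a hindsight goal is then drawn from the states visited later in that episode in proportion to per-state TD error. The data structures below are illustrative assumptions, not the authors' implementation.

```python
# Two-step prioritized sampling sketch for hindsight goal ranking.
import numpy as np

def sample_episode_and_goal(episode_td_errors, rng=None):
    """episode_td_errors: list of 1-D numpy arrays, one per episode, holding
    the TD error of each state visited in that episode (length >= 2)."""
    rng = np.random.default_rng() if rng is None else rng

    # Step 1: pick an episode with probability proportional to its mean TD error.
    episode_scores = np.array([e.mean() for e in episode_td_errors]) + 1e-6
    ep = rng.choice(len(episode_td_errors), p=episode_scores / episode_scores.sum())

    # Step 2: pick a transition in that episode, then a hindsight goal among the
    # states visited after it, weighted by their TD error.
    td = episode_td_errors[ep]
    t = rng.integers(0, len(td) - 1)
    future = td[t + 1:] + 1e-6
    goal_idx = t + 1 + rng.choice(len(future), p=future / future.sum())
    return ep, t, goal_idx
```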

 


Leda Sari, Mark Hasegawa-Johnson, and Chang D. Yoo, “Counterfactually Fair Automatic Speech Recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.

Widely used automatic speech recognition (ASR) systems have been empirically demonstrated in various studies to be unfair, having higher error rates for some groups of users than for others. One way to define fairness in ASR is to require that changing the demographic group affiliation of any individual (e.g., changing their gender, age, education, or race) should not change the probability distribution over possible speech-to-text transcriptions. In the paradigm of counterfactual fairness, all variables independent of group affiliation (e.g., the text being read by the speaker) remain unchanged, while variables dependent on group affiliation (e.g., the speaker’s voice) are counterfactually modified. Hence, we approach the fairness of ASR by training the ASR system to minimize the change in its outcome probabilities despite a counterfactual change in the individual’s demographic attributes. Starting from the individualized counterfactual equal odds criterion, we provide relaxations of it and compare their performance for connectionist temporal classification (CTC) based end-to-end ASR systems. We perform our experiments on the Corpus of Regional African American Language (CORAAL) and the LibriSpeech dataset to account for differences due to gender, age, education, and race. We show that with counterfactual training, we can reduce average character error rates while achieving a lower performance gap between demographic groups and a lower error standard deviation among individuals.
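The training idea can be pictured as a standard CTC loss plus a penalty on how much the model's output distribution changes when the same utterance is counterfactually modified. The sketch below is one possible instantiation under stated assumptions (a symmetric KL penalty on time-averaged posteriors and a weighting factor lam); the paper's actual relaxations of the individualized counterfactual equal odds criterion differ.

```python
# Hedged sketch: CTC loss plus a counterfactual-consistency penalty.
import torch
import torch.nn.functional as F

def counterfactual_ctc_loss(model, x, x_cf, targets, in_lens, tgt_lens, lam=1.0):
    # x and x_cf hold the same utterances before and after a counterfactual
    # modification of demographic-dependent factors (e.g., the speaker's voice).
    log_probs = model(x)        # (time, batch, vocab) log-probabilities
    log_probs_cf = model(x_cf)  # outputs for the counterfactual inputs

    ctc = F.ctc_loss(log_probs, targets, in_lens, tgt_lens)

    # Consistency penalty: the time-averaged posterior distribution should not
    # change under the counterfactual intervention (symmetric KL, assumed).
    p = log_probs.exp().mean(dim=0).clamp_min(1e-8)      # (batch, vocab)
    q = log_probs_cf.exp().mean(dim=0).clamp_min(1e-8)
    sym_kl = 0.5 * ((p * (p / q).log()).sum(dim=1)
                    + (q * (q / p).log()).sum(dim=1)).mean()
    return ctc + lam * sym_kl
```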

H. Gao, X. Wang, Chang D. Yoo, et al., “Seamless equal accuracy ratio for inclusive CTC speech recognition,” Speech Communication, 2021.

Concerns have been raised regarding performance disparity in automatic speech recognition (ASR) systems, as they provide unequal transcription accuracy for user groups defined by different attributes, including gender, dialect, and race. In this paper, we propose the “equal accuracy ratio,” a novel inclusiveness measure for ASR systems that can be seamlessly integrated into the standard connectionist temporal classification (CTC) training pipeline of an end-to-end neural speech recognizer to increase the recognizer’s inclusiveness. We also create a novel multi-dialect benchmark dataset to study the inclusiveness of ASR by combining data from existing corpora in seven dialects of English (African American, General American, Latino English, British English, Indian English, Afrikaner English, and Xhosa English). Experiments on this multi-dialect corpus show that using the equal accuracy ratio as a regularization term along with the CTC loss lowers the accuracy gap between user groups and reduces the recognition error rate compared with a non-regularized baseline. Experiments on additional speech corpora with different user groups confirm our findings.
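As an illustration of regularizing CTC training toward group inclusiveness, the following sketch penalizes the worst-to-best ratio of per-group CTC losses within a batch. This is only a stand-in for the paper's equal accuracy ratio, whose exact definition is given in the paper; the group encoding, normalization, and weighting below are assumptions.

```python
# Hedged sketch: CTC loss with a group-loss-ratio regularizer (stand-in for
# the paper's equal accuracy ratio).
import torch
import torch.nn.functional as F

def inclusive_ctc_loss(log_probs, targets, in_lens, tgt_lens, groups, lam=0.1):
    # groups: (batch,) integer id of the dialect/demographic group per utterance
    per_utt = F.ctc_loss(log_probs, targets, in_lens, tgt_lens,
                         reduction="none") / tgt_lens.float()
    group_losses = torch.stack([per_utt[groups == g].mean()
                                for g in torch.unique(groups)])
    # Stand-in regularizer: push the worst-to-best group loss ratio toward 1.
    ratio_penalty = group_losses.max() / group_losses.min().clamp_min(1e-8) - 1.0
    return per_utt.mean() + lam * ratio_penalty
```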

Thanh Nguyen, Tung Luu, Thang Vu, Chang D. Yoo, “Sample-efficient Reinforcement Learning Representation Learning with Curiosity Contrastive Forward Dynamics Model,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021.

Developing a reinforcement learning (RL) agent that is capable of performing complex control tasks directly from high-dimensional observations such as raw pixels remains a challenge, as efforts are made towards improving sample efficiency and generalization. This paper considers a learning framework, the Curiosity Contrastive Forward Dynamics Model (CCFDM), for achieving more sample-efficient RL directly from raw pixels. CCFDM incorporates a forward dynamics model (FDM) and performs contrastive learning to train its deep convolutional neural network-based image encoder (IE) to extract spatial and temporal information conducive to sample-efficient RL. In addition, during training, CCFDM provides intrinsic rewards, computed from the FDM prediction error, that encourage the curiosity of the RL agent and improve exploration. The diverse and less-repetitive observations provided by both our exploration strategy and the data augmentation available in contrastive learning improve not only sample efficiency but also generalization. An existing model-free RL method, Soft Actor-Critic, built on top of CCFDM outperforms prior state-of-the-art pixel-based RL methods on the DeepMind Control Suite benchmark.
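A minimal sketch of the curiosity signal described above: a forward dynamics model predicts the next latent observation from the current latent and the action, and its prediction error is added to the environment reward as an intrinsic bonus. The network sizes, encoder interface, and reward scale are assumptions; the contrastive objective used to train the image encoder is not reproduced here.

```python
# Hedged sketch of a forward-dynamics-model curiosity bonus.
import torch
import torch.nn as nn

class ForwardDynamicsModel(nn.Module):
    def __init__(self, latent_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z, a):
        # Predict the next latent observation from the current latent and action.
        return self.net(torch.cat([z, a], dim=-1))

def intrinsic_reward(fdm, encoder, obs, action, next_obs, scale=0.1):
    with torch.no_grad():
        z, z_next = encoder(obs), encoder(next_obs)
        z_pred = fdm(z, action)
        # Curiosity: the harder a transition is to predict, the larger the bonus.
        return scale * (z_pred - z_next).pow(2).mean(dim=-1)
```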

 


Thang Vu, Haeyong Kang, Chang D. Yoo, “SCNet: Training Inference Sample Consistency for Instance Segmentation,” in AAAI Conference on Artificial Intelligence (AAAI), 2021.

Cascaded architectures have brought significant performance improvements in object detection and instance segmentation. However, there are lingering issues regarding the disparity between the Intersection-over-Union (IoU) distribution of the samples at training time and that at inference time. This disparity can degrade detection accuracy. This paper proposes an architecture referred to as the Sample Consistency Network (SCNet) to ensure that the IoU distribution of the samples at training time is close to that at inference time. Furthermore, SCNet incorporates feature relay and utilizes global contextual information to further reinforce the reciprocal relationships among the classification, detection, and segmentation subtasks. Extensive experiments on the standard COCO dataset show the effectiveness of the proposed method over multiple evaluation metrics, including box AP, mask AP, and inference speed. In particular, while running 38% faster, the proposed SCNet improves box and mask AP by 1.3 and 2.3 points, respectively, compared to the strong Cascade Mask R-CNN baseline.

 


Hobin Ryu, Sunghun Kang, Haeyong Kang, Chang D. Yoo, “Semantic Grouping Network for Video Captioning,” in AAAI Conference on Artificial Intelligence (AAAI), 2021.

This paper considers a video caption generating network referred to as the Semantic Grouping Network (SGN), which attempts (1) to group video frames with the discriminating word phrases of the partially decoded caption and then (2) to decode those semantically aligned groups in predicting the next word. As consecutive frames are not likely to provide unique information, prior methods have focused on discarding or merging repetitive information based only on the input video. The SGN learns an algorithm to capture the most discriminating word phrases of the partially decoded caption and a mapping that associates each phrase with the relevant video frames; establishing this mapping allows semantically related frames to be clustered, which reduces redundancy. In contrast to prior methods, the continuous feedback from decoded words enables the SGN to dynamically update the video representation so that it adapts to the partially decoded caption. Furthermore, a contrastive attention loss is proposed to facilitate accurate alignment between a word phrase and video frames without manual annotations. The SGN achieves state-of-the-art performance, outperforming the runner-up methods by margins of 2.1%p and 2.4%p in CIDEr-D score on the MSVD and MSR-VTT datasets, respectively. Extensive experiments demonstrate the effectiveness and interpretability of the SGN.
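An illustrative sketch of the phrase-to-frame grouping idea: each partially decoded phrase embedding attends over the frame features, and the attention weights define a soft group of semantically related frames that can be fed to the decoder. The dimensions and the scaled dot-product scoring are assumptions; the paper's contrastive attention loss is not reproduced here.

```python
# Hedged sketch of grouping video frames by decoded word phrases via attention.
import torch
import torch.nn.functional as F

def group_frames_by_phrases(phrase_emb, frame_feats):
    # phrase_emb:  (num_phrases, dim) embeddings of decoded word phrases
    # frame_feats: (num_frames,  dim) visual features of the video frames
    scores = phrase_emb @ frame_feats.t() / phrase_emb.size(-1) ** 0.5
    # Soft assignment of frames to each phrase; related frames share high weight.
    weights = F.softmax(scores, dim=-1)          # (num_phrases, num_frames)
    grouped = weights @ frame_feats              # (num_phrases, dim) group features
    return grouped, weights

if __name__ == "__main__":
    phrases = torch.randn(3, 256)    # e.g., 3 decoded phrases
    frames = torch.randn(40, 256)    # e.g., 40 video frames
    groups, attn = group_frames_by_phrases(phrases, frames)
    print(groups.shape, attn.shape)  # torch.Size([3, 256]) torch.Size([3, 40])
```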

 


Junyeong Kim, Sunjae Yoon, DaHyun Kim, Chang D. Yoo, “Structured Co-reference Graph Attention for Video-grounded Dialogue,” in AAAI Conference on Artificial Intelligence (AAAI), 2021.

A video-grounded dialogue system referred to as Structured Co-reference Graph Attention (SCGA) is presented for decoding the answer sequence to a question regarding a given video while keeping track of the dialogue context. Although recent efforts have made great strides in improving the quality of the response, performance is still far from satisfactory. The two main challenges are as follows: (1) how to deduce co-reference among multiple modalities and (2) how to reason over the rich underlying semantic structure of video with complex spatial and temporal dynamics. To this end, SCGA is based on (1) a Structured Co-reference Resolver that performs dereferencing by building a structured graph over multiple modalities and (2) a Spatio-temporal Video Reasoner that captures local-to-global dynamics of the video via gradually neighboring graph attention. SCGA makes use of a pointer network to dynamically replicate parts of the question when decoding the answer sequence. The validity of the proposed SCGA is demonstrated on the AVSD@DSTC7 and AVSD@DSTC8 datasets, two challenging video-grounded dialogue benchmarks, and on the TVQA dataset, a large-scale video QA benchmark. Our empirical results show that SCGA outperforms other state-of-the-art dialogue systems on these benchmarks, while an extensive ablation study and qualitative analysis reveal the performance gains and improved interpretability.


Kiran Ramnath, Leda Sari, Mark Hasegawa-Johnson, and Chang D. Yoo, “Worldly Wise (WoW) - Cross-Lingual Knowledge Fusion for Fact-based Visual Spoken-Question Answering,” in Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2021.

Although question answering has long been of research interest, its accessibility to users through a speech interface and its support for multiple languages have not been addressed in prior studies. Towards these ends, we present a new task and a synthetically generated dataset for Fact-based Visual Spoken-Question Answering (FVSQA). FVSQA is based on the FVQA dataset, which requires a system to retrieve an entity from a Knowledge Graph (KG) to answer a question about an image. In FVSQA, the question is spoken rather than typed. Three sub-tasks are proposed: (1) speech-to-text based, (2) end-to-end, without speech-to-text as an intermediate component, and (3) cross-lingual, in which the question is spoken in a language different from that in which the KG is recorded. The end-to-end and cross-lingual tasks are the first to require world knowledge from a multi-relational KG as a differentiable layer in an end-to-end spoken language understanding task; hence the proposed reference implementation is called Worldly Wise (WoW). WoW is shown to perform end-to-end cross-lingual FVSQA at the same levels of accuracy across three languages: English, Hindi, and Turkish.

 
