In this talk, we raise a challenging question: Can we find the “elementary particles” of a person’s speech in one language and use them to render his/her voice in a different language? A “yes” answer, together with the “elementary particles” so found, would have many useful applications, e.g., mixed-code text-to-speech (TTS), language learning, and speech-to-speech translation. We try to answer the question by first limiting ourselves to how to train a TTS system in a target language using a speech corpus collected from a speaker in his/her own mother tongue, which is different from the target language, along with a speech corpus recorded by a reference speaker in the target language. We then use the “trajectory tiling algorithm,” which we invented for synthesizing high-quality unit-selection TTS, to “tile” the trajectories of all sentences in the reference speaker’s corpus with the most appropriate speech segments in the monolingual speaker’s data. To make the tiling work properly across the two different (reference and source) speakers, the difference between their speech signals needs to be equalized with appropriate vocal tract length normalization, e.g., a bilinear warping function or formant mapping. All tiled sentences are then used to train a new HMM-based TTS of the monolingual speaker, but in the reference speaker’s language. Units of different lengths have been tried as the “elementary particles,” and label-less, frame-length (10 ms) segments have been found to yield the best TTS quality. Some preliminary results also show that training a speech recognizer with speech data from different languages tends to improve automatic speech recognition (ASR) performance in each individual language. In addition to the finding that audio “elementary particles” of human speech in different languages can be discovered as frame-level speech segments, the mouth shapes of a monolingual speaker have also been found adequate for rendering the lip movements of talking heads in different languages.
Various demos will be shown to illustrate our findings.
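As a rough illustration of two ideas mentioned in the abstract, the sketch below shows a first-order all-pass (bilinear) frequency warp of the kind commonly used for vocal tract length normalization, and a naive frame-level selection that picks, for each reference frame, the closest source frame. This is not the trajectory tiling algorithm itself: the function names, the warping factor `alpha`, the plain Euclidean distance, and the toy feature vectors are all assumptions made for the example.

```python
import math


def bilinear_warp(omega, alpha):
    """Warp a normalized frequency omega (radians, 0..pi) with a
    first-order all-pass map. alpha (|alpha| < 1) is a hypothetical
    per-speaker warping factor; alpha = 0 leaves omega unchanged."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


def tile_frames(reference, source):
    """For each reference frame, return the index of the nearest source
    frame under squared Euclidean distance. A simplified stand-in for
    frame-level segment selection, assumed for illustration only."""
    chosen = []
    for ref in reference:
        best_index = min(
            range(len(source)),
            key=lambda i: sum((r - s) ** 2 for r, s in zip(ref, source[i])),
        )
        chosen.append(best_index)
    return chosen


# Toy usage: two reference frames tiled with three source frames.
reference_frames = [[0.0, 0.0], [1.0, 1.0]]
source_frames = [[0.9, 1.1], [0.1, -0.1], [5.0, 5.0]]
tiling = tile_frames(reference_frames, source_frames)
```

In a real system the frames would be spectral feature vectors taken every 10 ms, the warp would be applied before the distance computation, and the selection would also account for continuity between neighboring frames.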
Frank K. Soong is a Principal Researcher and Research Manager of the Speech Group at Microsoft Research Asia (MSRA), Beijing, China, where he works on fundamental research on speech and its practical applications. His professional research career spans over 30 years, first with Bell Labs, US, then with ATR, Japan, before joining MSRA in 2004. At Bell Labs, he worked on stochastic modeling of speech signals, optimal decoder algorithms, speech analysis and coding, and speech and speaker recognition. He was responsible for developing the recognition algorithm that was developed into voice-activated mobile phone products rated by Mobile Office Magazine (Apr. 1993) as “outstandingly the best”. He is a co-recipient of the Bell Labs President Gold Award for developing the Bell Labs Automatic Speech Recognition (BLASR) software package. He has served as a member of the Speech and Language Technical Committee of the IEEE Signal Processing Society and in other society functions, including serving as Associate Editor of the IEEE Speech and Audio Transactions and chairing an IEEE Workshop. He has published extensively, with more than 200 papers, and co-edited a widely used reference book, Automatic Speech and Speaker Recognition: Advanced Topics, Kluwer, 1996. He is a visiting professor at the Chinese University of Hong Kong (CUHK) and a few other top-rated universities in China. He is also the co-Director of the National MSRA-CUHK Joint Research Lab. He received his BS, MS, and PhD degrees from National Taiwan Univ., Univ. of Rhode Island, and Stanford Univ., respectively, all in Electrical Engineering. He is an IEEE Fellow “for contributions to digital processing of speech”.