Yotaro Kubo

Welcome to my personal website introducing my research activities in machine learning and speech recognition. I'm currently a Research Scientist at Sakana AI, focusing on generative models, probabilistic systems, and deep learning applications.

Please feel free to contact me by e-mail or on LinkedIn.

Short Bio

Yotaro Kubo received the B.E., M.E., and Dr.Eng. degrees from Waseda University, Tokyo, Japan, in 2007, 2008, and 2010, respectively. He was a visiting scientist at RWTH Aachen University for six months in 2010. He then joined Nippon Telegraph and Telephone Corporation (NTT), where he worked at NTT Communication Science Laboratories. From 2014 to 2019, he was with Amazon (in Aachen, Germany), where he developed and investigated speech recognition for voice search and personal assistants. From 2019 to 2025, he was a research scientist at Google (in Tokyo, Japan). Since 2025, he has been a research scientist at Sakana AI. His research interests include generative/discriminative hybrid modeling, kernel-based probabilistic models, and integration of probabilistic systems. He is a member of the IEEE and the Acoustical Society of Japan (ASJ).

Research Interests

Machine Learning for Speech Signal and Spoken Language Processing

Generative/discriminative-hybrid training of hidden Markov models
Flat-direct classifiers for automatic speech recognition enhanced by using nonlinear feature transformation
Deep learning with discrete parameters/variables for network structure estimation

Software architecture for efficient research

Publications

Refereed Journal Papers

Y. Kubo, S. Watanabe, T. Hori, A. Nakamura, "Structural Classification Methods based on Weighted Finite-State Transducers for Automatic Speech Recognition," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 20, No. 8, pp. 2240-2251, Oct 2012. (IEEExplorer)
Y. Kubo, S. Okawa, A. Kurematsu, K. Shirai, "Temporal AM-FM Combination for Robust Speech Recognition," Speech Communication, Vol. 54, No. 5, pp. 716-725, May 2011. (Science Direct)
Y. Kubo, S. Watanabe, A. Nakamura, E. McDermott, T. Kobayashi, "A Sequential Pattern Classifier Based on Hidden Markov Kernel Machine and Its Application to Phoneme Classification," IEEE Journal of Selected Topics in Signal Processing, Vol. 4, No. 6, pp. 974-984, December 2010. (IEEExplorer)
Y. Kubo, S. Okawa, A. Kurematsu, K. Shirai, "Recognizing Reverberant Speech Based on Amplitude and Frequency Modulation," IEICE Trans. on Inf. and Syst., Vol. E-61-D, No., pp. 448-456, March 2008.
Y. Kubo, M. Honda, K. Shirai, T. Komori, S. Nobumasa, T. Takagi, "An Improved High-quality MPEG-2/4 Advanced Audio Coding Encoder," Acoustical Science & Technology, Vol. 29, No. 6, pp. 362-371, December 2008. (Full Text via JStage)
M. Delcroix, K. Kinoshita, T. Nakatani, S. Araki, A. Ogawa, T. Hori, S. Watanabe, M. Fujimoto, T. Yoshioka, T. Oba, Y. Kubo, M. Souden, S.-J. Hahm, A. Nakamura, "Speech recognition in living rooms: Integrated speech enhancement and recognition system based on spatial, spectral & temporal modeling of sounds," Computer Speech and Language, Vol. 27, No. 3, pp. 851-873, Elsevier. (Science Direct)
M. Delcroix, T. Yoshioka, A. Ogawa, Y. Kubo, M. Fujimoto, N. Ito, K. Kinoshita, M. Espi, S. Araki, T. Hori, T. Nakatani, "Strategies for distant speech recognition in reverberant environments," EURASIP Journal on Advances in Signal Processing, (2015) 2015: 60. https://doi.org/10.1186/s13634-015-0245-7.

Refereed Conference/Workshop Papers

Y. Kubo, R. Sproat, C. Taguchi, L. Jones, "Building Tailored Speech Recognizers for Japanese Speaking Assessment," Proc. INTERSPEECH-2026, Sydney, Sept 2026. [preprint]
Y. Kubo, S. Karita, M. Bacchiani, "Knowledge Transfer from Large-Scale Pretrained Language Models to End-to-end Speech Recognizers," Proc. ICASSP-2022, Singapore, May 2022.
Y. Kubo, M. Bacchiani, "Joint Phoneme-Grapheme Model for End-to-end Speech Recognition," Proc. ICASSP-2020, Barcelona, Spain, May 2020.
Y. Kubo, G. Tucker, S. Wiesler, "Compacting Neural Network Classifiers via Dropout Training," Proc. NIPS Workshop on Efficient Methods for Deep Neural Networks, Barcelona, Spain, Dec 2016. (ArXiv)
Y. Kubo, J. Suzuki, T. Hori, A. Nakamura, "Restructuring Output Layers of Deep Neural Networks Using Minimum Risk Parameter Clustering," Proc. Interspeech 2014, Singapore, Sept 2014. [pdf]
Y. Kubo, T. Hori, A. Nakamura, "A Method for Structure Estimation of Weighted Finite-State Transducers and Its Application To Grapheme-to-Phoneme Conversion," Proc. Interspeech 2013, Lyon, France, August 2013.
Y. Kubo, T. Hori, A. Nakamura, "Large Vocabulary Continuous Speech Recognition Based on WFST Structured Classifiers and Deep Bottleneck Features," Proc. ICASSP 2013, Vancouver, Canada, May 2013. [pdf]
Y. Kubo, T. Hori, A. Nakamura, "Integrating Deep Neural Networks into Structured Classification Approach based on Weighted Finite-State Transducers," Proc. INTERSPEECH 2012, Portland, Oregon, U.S., September 2012. [pdf]
Y. Kubo, S. Watanabe, A. Nakamura, "Decoding Network Optimization Using Minimum Transition Error Training," Proc. ICASSP 2012, Kyoto, Japan, pp. 4197-4200, March 2012. [pdf]
Y. Kubo, S. Watanabe, A. Nakamura, S. Wiesler, R. Schlueter, H. Ney, "Basis Vector Orthogonalization for an Improved Kernel Gradient Matching Pursuit Method," Proc. ICASSP 2012, Kyoto, Japan, pp. 1909-1912, March 2012. [pdf]
Y. Kubo, S. Wiesler, R. Schlueter, H. Ney, S. Watanabe, A. Nakamura, T. Kobayashi, "Subspace Pursuit Method for Kernel-Log-Linear Models," Proc. ICASSP 2011, Prague, Czech, May 2011. [pdf]
Y. Kubo, S. Watanabe, A. Nakamura, T. Kobayashi, "A Regularized Discriminative Training Method of Acoustic Models Derived by Minimum Relative Entropy Discrimination," Proc. INTERSPEECH-2010, Makuhari, Japan, September 2010. [pdf]
Y. Kubo, S. Okawa, A. Kurematsu, K. Shirai, "A Comparative Study on AM and FM Features," Proc. Interspeech-2008, Brisbane, September 2008. [pdf]
Y. Kubo, S. Okawa, A. Kurematsu, K. Shirai, "Independent Feature Selection Algorithms for the Creation of Multistream Speech Recognizers," Proc. ITRW on Speech Analysis and Processing for Knowledge Discovery, Aalborg, June 2008.
Y. Kubo, S. Okawa, A. Kurematsu, K. Shirai, "Noisy Speech Recognition Using Temporal AM-FM Combination," Proc. ICASSP-2008, Las Vegas, pp. 4709-4712, April 2008. [pdf]
Y. Kubo, S. Okawa, A. Kurematsu, K. Shirai, "A Study on Temporal Features Derived by Analytic Signal," Proc. Interspeech-2007, Antwerpen, pp. 1130-1133, September 2007. [pdf]
S. Kuroki, Y. Kubo, T. Akiba, Y. Tang, "KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI," Proc. ICASSP-2026, Barcelona, Spain, May 2026. [preprint]
S. Karita, Y. Kubo, M. Bacchiani, L. Jones, "A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition," Proc. INTERSPEECH-2021, Brno, Czech (online presentation), Sept 2021.
M. Espi, M. Fujimoto, Y. Kubo, T. Nakatani, "Spectrogram Patch Based Acoustic Event Detection and Classification in Speech Overlapping Conditions," Proc. HSCMA, Nancy, France, May 2014.
M. Delcroix, T. Yoshioka, A. Ogawa, Y. Kubo, M. Fujimoto, N. Ito, K. Kinoshita, M. Espi, T. Hori, T. Nakatani, A. Nakamura, "Linear Prediction-Based Dereverberation With Advanced Speech Enhancement and Recognition Technologies for The Reverb Challenge," Proc. REVERB Workshop, Florence, Italy, May 2014.
M. Fujimoto, Y. Kubo, T. Nakatani, "Unsupervised non-parametric Bayesian modeling of non-stationary noise for model-based noise suppression," Proc. ICASSP 2014, Florence, Italy, May 2014.
T. Hori, Y. Kubo, A. Nakamura, "Real-time one-pass decoding with recurrent neural network language model for speech recognition," Proc. ICASSP 2014, Florence, Italy, May 2014.
M. Blondel, Y. Kubo, N. Ueda, "Online Passive-Aggressive Algorithms for Non-Negative Matrix Factorization and Completion," Proc. AISTATS 2014, Reykjavik, Iceland, April 2014.
M. Delcroix, Y. Kubo, T. Nakatani, A. Nakamura, "Is Speech Enhancement Pre-Processing Still Relevant When Using Deep Neural Networks for Acoustic Modeling?" Proc. Interspeech 2013, Lyon, France, August 2013.
S. Watanabe, Y. Kubo, T. Oba, T. Hori, A. Nakamura, "Bag of Arcs: New Representation of Speech Segment Features Based on Finite State Machines," Proc. ICASSP 2012, Kyoto, Japan, pp. 4201-4204, March 2012.
M. Delcroix, K. Kinoshita, T. Nakatani, S. Araki, A. Ogawa, T. Hori, S. Watanabe, M. Fujimoto, T. Yoshioka, T. Oba, Y. Kubo, M. Souden, S.-J. Hahm, A. Nakamura, "Speech recognition in the presence of highly non-stationary noise based on spatial, spectral and temporal speech/noise modeling combined with dynamic variance adaptation," Proc. CHiME (Computational Hearing in Multisource Environments) 2011, September 2011.
S. Wiesler, A. Richard, Y. Kubo, R. Schlueter, H. Ney, "Feature Selection for Log-Linear Acoustic Models," Proc. ICASSP 2011, Prague, Czech, May 2011.

Theses

Y. Kubo, "Automatic Speech Recognition Based on Temporal Analysis of Amplitude and Frequency Modulation," Master of Informatics and Computer Science, Waseda University, 2008 (written in Japanese).
Y. Kubo, "Regularized Discrimination of High-Dimensional Signal Representations for Automatic Speech Recognition," Doctor of Engineering, Waseda University, 2010 (pdf) (HTML)

Tutorial Articles (in Japanese)

Y. Kubo, "深層学習が支える音声認識技術 (Automatic Speech Recognition Technologies Boosted by Deep Learning)," The Transactions of IEICE, May 2022 (to appear).
Y. Kubo, "音声認識のための深層学習 (Deep Learning for Speech Recognition)," Journal of the Japanese Society for Artificial Intelligence, Vol. 29, No. 1, pp. 62-71, Jan 2014.
T. Hori, S. Araki, Y. Kubo, A. Ogawa, T. Oba, A. Nakamura, "自然な会話を聞き取る音声認識技術 (Speech Recognition Technologies for Natural Conversation Scenes)," Nikkei Electronics, 2013.10.24, Oct 2013.
Y. Kubo, A. Ogawa, T. Hori, A. Nakamura, "Speech Recognition Based on Unified Model of Acoustic and Language Aspects of Speech," NTT Technical Review, Vol.11, No.12. (English Version in "NTT Technical Review") or (Japanese Version in "NTT技術ジャーナル")
Y. Kubo, "ディープラーニングによるパターン認識 (Deep Learning for Pattern Recognition)," IPSJ Magazine, Vol. 54, No. 5, pp. 500-508, May 2013 (IPSJ Digital Library).
K. Shirai, T. Kobayashi, M. Abe, K. Iwata, R. Imai, H. Kikuchi, K. Ohtsuki, H. Fujisawa, M. Honda, Y. Hayashi, K. Mano, T. Takezawa, S. Takahashi, S. Okawa, K. Hoashi, N. Masaki, N. Osaka, "音声言語処理の潮流 (The Tide of Spoken Language Processing)," Corona Publishing, Mar 2010 (I wrote explanations about tandem approach and neural networks).
Y. Kubo, "Mac OS Xのアプリケーション開発 (Developing Applications for Mac OS X)," UNIX Magazine 2006.03, Mar 2006.

Book

S. Asoh, M. Yasuda, S. Maeda, D. Okanohara, T. Okatani, Y. Kubo, D. Bollegala (Ed: T. Kamishima), "深層学習 --Deep Learning--," Kindai Kagakusya, Nov 2015. (written in Japanese; Amazon; also available in Korean http://jpub.tistory.com/m/779)
Ed: Acoustical Society of Japan, Ed: Y. Haneda, Ed: S. Okawa, Ed: S. Kiya, "音響学入門ペディア -- Acousticpedia for Beginners --," CORONA Publishing, Mar 2017. (written in Japanese; Amazon)
Y. Kubo (Ed: Acoustical Society of Japan), "機械学習による音声認識 -- Machine Learning in Automatic Speech Recognition --," CORONA Publishing, Apr 2021. (written in Japanese; Amazon)

Domestic Workshop Papers (Not refereed; excerpt)

Y. Kubo, T. Hori, A. Nakamura, "An initial attempt to estimate a structure of weighted finite-state transducers and its application to grapheme-to-phoneme conversion," Proc. Fall Meeting of ASJ, 1-8-11, 2013. (重み付き有限状態トランスデューサの構造推定とそのGrapheme-To-Phoneme変換への応用)
Y. Kubo, T. Hori, A. Nakamura, "Large Vocabulary Continuous Speech Recognition based on Deep Neural Networks and WFST-based Structured Classifiers," Proc. Spring Meeting of ASJ, 2-9-13, 2013. (Deep Neural Network と WFST 型構造識別器を用いた大語彙連続音声認識)
Y. Kubo, T. Hori, A. Nakamura, "An Evaluation of English Lecture Recognizers Based on Deep Neural Networks," Proc. Fall Meeting of ASJ, 2-1-8, 2012. (ディープニューラルネットワークを用いた音声認識器の英語講義音声認識による評価)
Y. Kubo, T. Hori, A. Nakamura, "WFST-based Structured Classification of Features Extracted by Using Deep Neural Networks," Technical Report of IEICE, SP-2012-57 (2012-07), 2012. (Deep Learningに基づく音声特徴量の有限状態トランスデューサ型識別モデルによる識別; 電子情報通信学会音声研究会奨励賞対象論文)
Y. Kubo, S. Watanabe, T. Hori, A. Nakamura, "Minimum transition error learning of a structural classification model based on weighted finite-state transducers," Proc. Spring Meeting of ASJ, 1-P-21, 2012. (重み付き有限状態トランスデューサに基づく構造識別モデルの最小状態遷移エラー学習)
Y. Kubo, S. Watanabe, A. Nakamura, S. Wiesler, R. Schlueter, H. Ney, "Approaches towards accurate acoustic modeling based on kernel log-linear models," Proc. Fall Meeting of ASJ, 2-Q-16, 2011. (カーネル対数線形モデルによる音響モデルの高精度化に向けた検討)
Y. Kubo, S. Watanabe, A. Nakamura, S. Wiesler, R. Schlueter, H. Ney, T. Kobayashi, "A subspace pursuit model to speed up kernel-based acoustic models," Proc. Spring Meeting of ASJ, 1-5-10, 2011. (カーネルマシンを内包する音響モデルの高速化に向けた部分空間追跡法)
Y. Kubo, S. Watanabe, A. Nakamura, T. Kobayashi, "Parallelizable optimization methods and lattice-based representations for minimum relative entropy discrimination training," IPSJ SIG Technical Report, 2009-SLP-80, 2010. (最小相対エントロピー識別学習へのラティスによる仮説表現と並列化可能な最適化手法の導入; 情報処理学会山下記念研究賞対象論文)
Y. Kubo, S. Watanabe, A. Nakamura, E. McDermott, T. Kobayashi, "Sequence classification using hidden Markov kernel machines and its application to phoneme recognition task," Collection of Preview Slides of the 12th Workshop on Information-Based Induction Sciences (IBIS 2009), 2009. (隠れマルコフカーネルマシンを用いた系列データの識別とその音素認識タスクへの適用)
Y. Kubo, S. Watanabe, A. Nakamura, E. McDermott, T. Kobayashi, "Hidden Markov kernel machines derived by minimum relative entropy discrimination training of hidden Markov models for automatic speech recognition," Proc. Fall Meeting of ASJ, 1-1-4, pp. 11-14, 2009. (隠れマルコフモデルの最小相対エントロピー識別学習則より導出されるカーネルマシンを用いた音声認識)
Y. Kubo, S. Watanabe, A. Nakamura, E. McDermott, T. Kobayashi, "A kernel machine derived by minimum relative entropy," IPSJ SIG Technical Report, 2009-SLP-77, no. 6, 2009. (最小相対エントロピー識別学習に基づくカーネルマシンを利用した音声認識)
Y. Kubo, S. Watanabe, A. Nakamura, T. Kobayashi, "A regularized discriminative training method for continuous density hidden Markov models based on minimum relative entropy discriminative formulation," Proc. Spring Meeting of ASJ, 2-5-16, 2009. (最小相対エントロピー基準によるパラメタ分布の正則化を用いた連続分布HMMの識別学習; 日本音響学会粟屋潔学術奨励賞受賞対象論文)
Y. Kubo, S. Okawa, A. Kurematsu, K. Shirai, "Multi-stream speech recognizers using independent feature decomposition," Proc. Spring Meeting of ASJ, 2-10-4, pp. 73-74, 2008. (変調特徴量の独立性基準による分解を用いたマルチストリーム音声認識)
Y. Kubo, S. Okawa, A. Kurematsu, K. Shirai, "A study on speech recognizer based on temporal AM-FM analysis," Technical Report of IEICE, SP-2007-356, pp. 31-36, 2007. (AMとFMの長時間分析に基づく音声認識)
Y. Kubo, S. Okawa, A. Kurematsu, K. Shirai, "Speech recognition using narrow-band analytic signal and non-linear discriminant analysis," Technical Report of IEICE, SP-2007-116, pp. 85-90, 2007. (狭帯域解析信号と非線形識別分析を用いた音声認識)
Y. Kubo, S. Okawa, A. Kurematsu, K. Shirai, "Long-term instantaneous phase analysis for automatic speech recognition of Japanese spontaneous speech," Proc. Spring Meeting of ASJ, 3-10-9, pp. 121-122, 2007. (長時間瞬時位相分析による話し言葉音声認識)
Y. Kubo, M. Honda, K. Shirai, T. Komori, N. Seiyama, T. Takagi, "Improvement of MPEG-2/4 AAC coder using optimal bit allocation aiming for broadcasting," Proc. Spring Meeting of ASJ, 3-P-13, pp. 623-624, 2007. (放送配信に向けたオーディオ符号化AACのビット配分手法による音質改善)
Y. Kubo, M. Honda, K. Shirai, "Improvement of MPEG-2/4 AAC coder aiming for broadcasting," Proc. Fall Meeting of ASJ, 1-Q-21, pp. 279-280, 2006. (放送配信に向けた音声符号化AACの音質改善)

Academic Activities

Member of IEEE (The Institute of Electrical and Electronics Engineers)
Member of ASJ (The Acoustical Society of Japan)
Reviewer of the following scientific journals

IEEE Transactions on Signal Processing
IEEE Transactions on Audio, Speech, and Language Processing
Speech Communication
IEICE Transactions on Information and Systems

Talks/Lectures

"A Speech Recognition Toolkit based on Python", EuroSciPy-2010, Paris, France, July 2010.
"An application method of minimum relative entropy discrimination for hidden Markov models," InterACT Talk (Karlsruhe University), Karlsruhe, Germany, Sep. 2010. (Host: Dr. Sebastian Stueker)
"Subspace Pursuit Methods for Kernel-Log-Linear Models," in National Institute of Information and Communications Technology (NICT), Kyoto, Japan, Nov. 2011.
"High-Dimensional Log-Linear Models for Automatic Speech Recognition," in Microsoft Research Asia, Beijing, China, Jan. 2012.
"Python in Automatic Speech Recognition Research (音声認識研究におけるPython)," Tokyo.SciPy #004 (aka Kan.SciPy #001), Jun. 2012.
"Automatic Recognition of Conversational Speech," in Microsoft Research Redmond, WA, USA, Sep. 2012. (with Dr. Seong-Jun Hahm)
"Basics and Outlooks of Deep Learning (ディープラーニングの基礎と展望)," in ALAGIN Young Researchers' Workshop (Tokyo University), Tokyo, Japan, Dec. 2012.
"Recent Developments in Deep Neural Networks for Automatic Speech Recognition: Methods and Applications," in Theme Workshop of FIRST Aihara Project (Tokyo University), Tokyo, Japan, Mar. 2013. (Host: Dr. Takaki Makino)
"Recent developments in speech recognition technologies (音声認識技術の現在と最先端)," in Nara Institute of Science and Technology (NAIST), Nara, Japan, May 2013. (in Japanese; Invited Lecture; Host: Dr. Graham Neubig)
"Practical kernel methods for automatic speech recognition," in Mitsubishi Electric Research Laboratories, MA, USA, May 2013. (Host: Dr. Shinji Watanabe)
"WFST-based structured classification for meeting recognition," in SLS Seminar at Massachusetts Institute of Technology, MA, USA, May 2013. (Host: Prof. James Glass)
"Integration of structured classification and deep neural networks for automatic speech recognition," in Midwest Speech and Language Days (Toyota Technological Institute Chicago), IL, USA, May 2013. (Invited Talk; Host: Prof. Sadaoki Furui)
"Deep Learning and its application to Automatic Speech Recognition (Deep Learningとその音声認識への応用)" in Waseda University, Tokyo, Japan, Jun. 2013. (in Japanese; Host: Prof. Tetsunori Kobayashi)
"Basics of Deep Learning and Speech Recognition (深層学習と音声認識の基本)" in ALAGIN Speech Processing Seminar, Tokyo, Japan, Oct. 2013. (in Japanese)
"Recent studies on Deep Learning for Speech Recognition (音声認識分野における深層学習技術の研究動向)" The 16th Information-Based Induction Sciences Workshop, Tokyo, Japan, Nov. 2013. (in Japanese; Invited Talk)
"Deep Learning (ディープラーニング技術)" in NHK Science & Technology Research Laboratories, Tokyo, Japan, Nov. 2013. (in Japanese; Host: Dr. Shoei Sato)
"Deep Learning and Its Application to Pattern Recognition Problems (深層学習とそのパターン認識への応用)" in CS Colloquium at Tsukuba University, Ibaraki, Japan, Dec. 2013. (in Japanese; Host: Dr. Hideitsu Hino)
"Applications of Deep Learning in Speech Recognition (音声認識における深層学習の活用とその進展)," invited talk in Ongaku-Symposium 2014
"Advances in Speech Recognition for Digital Assistants," invited talk in Industry Forum in IEEE ICNC-2020.
"Neural speech recognition," Tutorial in ISCA Speaker Odyssey Workshop 2020 (Co-organized with Shigeki Karita).

Degrees

Doctor of Engineering from Waseda University, Tokyo, Japan (2010)
Master of Informatics and Computer Science from Waseda University, Tokyo, Japan (2008)
Bachelor of Informatics and Computer Science from Waseda University, Tokyo, Japan (2007)

Awards

IEICE ISS Young Researcher's Award in Speech Field, 2013.
The Itakura Award from the Acoustical Society of Japan (ASJ), 2013.
IEEE SPS Japan Chapter Student Paper Award, 2011.
The Yamashita SIG Research Award from the Information Processing Society of Japan (IPSJ), 2011.
The Awaya Award from the Acoustical Society of Japan (ASJ), 2010.