Joint speech and text machine translation for up to 100 languages

Output type:
Article
Authors:
Barrault, Loic; Chung, Yu-An; Meglioli, Mariano Coria; Dale, David; Dong, Ning; Duquenne, Paul-Ambroise; Elsahar, Hady; Gong, Hongyu; Heffernan, Kevin; Hoffman, John; Klaiber, Christopher; Li, Pengwei; Licht, Daniel; Maillard, Jean; Rakotoarison, Alice; Sadagopan, Kaushik Ram; Wenzek, Guillaume; Ye, Ethan; Akula, Bapi; Chen, Peng-Jen; El Hachem, Naji; Ellis, Brian; Gonzalez, Gabriel Mejia; Haaheim, Justin; Hansanti, Prangthip; Howes, Russ; Huang, Bernie; Hwang, Min-Jae; Inaguma, Hirofumi; Jain, Somya; Kalbassi, Elahe; Kallet, Amanda; Kulikov, Ilia; Lam, Janice; Li, Daniel; Ma, Xutai; Mavlyutov, Ruslan; Peloquin, Benjamin; Ramadan, Mohamed; Ramakrishnan, Abinesh; Sun, Anna; Tran, Kevin; Tran, Tuan; Tufanov, Igor; Vogeti, Vish; Wood, Carleigh; Yang, Yilin; Yu, Bokai; Andrews, Pierre; Balioglu, Can; Costa-jussa, Marta R.; Celebi, Onur; Elbayad, Maha; Gao, Cynthia; Guzman, Francisco; Kao, Justine; Lee, Ann; Mourachko, Alexandre; Pino, Juan; Popuri, Sravya; Ropers, Christophe; Saleem, Safiyyah; Schwenk, Holger; Tomasello, Paden; Wang, Changhan; Wang, Jeff; Wang, Skyler
Affiliations:
Inria; University of California System; University of California Berkeley
Journal:
Nature
ISSN/ISBN:
0028-0836
DOI:
10.1038/s41586-024-08359-z
Publication date:
2025-01-16
Keywords:
Abstract:
Creating the Babel Fish, a tool that helps individuals translate speech between any two languages, requires advanced technological innovation and linguistic expertise. Although conventional speech-to-speech translation systems composed of multiple subsystems performing translation in a cascaded fashion exist [1-3], scalable and high-performing unified systems [4,5] remain underexplored. To address this gap, here we introduce SEAMLESSM4T (Massively Multilingual and Multimodal Machine Translation), a single model that supports speech-to-speech translation (101 to 36 languages), speech-to-text translation (from 101 to 96 languages), text-to-speech translation (from 96 to 36 languages), text-to-text translation (96 languages) and automatic speech recognition (96 languages). Built using a new multimodal corpus of automatically aligned speech translations and other publicly available data, SEAMLESSM4T is one of the first multilingual systems that can translate from and into English for both speech and text. Moreover, it outperforms the existing state-of-the-art cascaded systems, achieving up to 8% and 23% higher BLEU (Bilingual Evaluation Understudy) scores in speech-to-text and speech-to-speech tasks, respectively. Beyond quality, when tested for robustness, our system is, on average, approximately 50% more resilient against background noise and speaker variations in speech-to-text tasks than the previous state-of-the-art systems. We evaluated SEAMLESSM4T on added toxicity and gender bias to assess translation safety. For the former, we included two strategies for added toxicity mitigation working at either training or inference time. Finally, all contributions in this work are publicly available for non-commercial use to propel further research on inclusive speech translation technologies.
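
The abstract reports quality gains in BLEU without describing the metric. As a rough illustration only, the sketch below computes a plain corpus-level BLEU score. It assumes whitespace tokenisation, a single reference per sentence, uniform weights over 1- to 4-grams and no smoothing; the function names `ngrams` and `corpus_bleu` are hypothetical and are not taken from the paper or its released code. The authors' actual evaluation setup (tokenisation, text normalisation, tooling) is not stated in this record and may differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0..100 scale.

    Assumptions (illustrative, not the paper's setup): whitespace tokens,
    one reference per hypothesis, uniform n-gram weights, no smoothing.
    """
    matches = [0] * max_n   # clipped n-gram matches, per n-gram order
    totals = [0] * max_n    # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection clips each count to its reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalise output that is shorter than the references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)
```

Calling `corpus_bleu(system_outputs, reference_translations)` with two equal-length lists of strings returns a single score for the whole test set. For speech-to-speech output, a score of this kind is typically computed on a transcription of the generated audio, which is a further assumption about the evaluation rather than something this record states.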