The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans

arXiv - CS - Sound, Pub Date: 2020-12-23, DOI: arxiv-2012.13006
Shinji Watanabe, Florian Boyer, Xuankai Chang, Pengcheng Guo, Tomoki Hayashi, Yosuke Higuchi, Takaaki Hori, Wen-Chin Huang, Hirofumi Inaguma, Naoyuki Kamo, Shigeki Karita, Chenda Li, Jing Shi, Aswin Shanmugam Subramanian, Wangyou Zhang

This paper describes recent developments in ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. The project was initiated in December 2017, mainly to support end-to-end speech recognition experiments based on sequence-to-sequence modeling. It has grown rapidly and now covers a wide range of speech processing applications: ESPnet also includes text-to-speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE), with support for beamforming, speech separation, denoising, and dereverberation. Thanks to the generic properties of sequence-to-sequence modeling, all applications are trained in an end-to-end manner, and they can be further integrated and jointly optimized. ESPnet also provides reproducible all-in-one recipes for these applications, achieving state-of-the-art performance on various benchmarks by incorporating the Transformer, advanced data augmentation, and the Conformer. The project aims to provide an up-to-date speech processing experience to the community so that researchers in academia and in industry at various scales can develop their technologies collaboratively.

Updated: 2020-12-25