WenetSpeech: A 10000+ Hours Multi-domain Chinese Corpus for Speech Recognition
Please visit the official website, read the license, and follow the instructions to download the data.
First, we collect all the data from YouTube and Podcast. Then, OCR is used to label the YouTube data, and automatic transcription is used to label the Podcast data. Finally, a novel end-to-end label error detection method is used to further validate and filter the data.
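One way to picture the final validation step: score each candidate transcript (from OCR or automatic transcription) against an independent recognizer's hypothesis, and keep only utterances whose agreement clears a confidence threshold. The sketch below illustrates this idea; it is not the authors' actual end-to-end method, and `edit_distance` and `label_confidence` are hypothetical helpers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def label_confidence(candidate, hypothesis):
    """Agreement in [0, 1]; 1.0 means the transcripts match exactly."""
    if not candidate:
        return 0.0
    dist = edit_distance(list(candidate), list(hypothesis))
    return max(0.0, 1.0 - dist / len(candidate))

# Keep an utterance in the high-label pool only if its agreement >= 0.95.
```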
In summary, WenetSpeech groups all data into 3 categories, as the following table shows:
| Category | Hours | Confidence | Usage |
|---|---|---|---|
| High Label | 10005 | >=0.95 | Supervised training |
| Weak Label | 2478 | [0.6, 0.95] | Semi-supervised or noisy training |
| Unlabel | 9952 | / | Unsupervised training or pre-training |
| In Total | 22435 | / | All of the above |
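To make the thresholds concrete, here is a minimal Python sketch of partitioning utterances into the three categories by confidence score. The `confidence` field name and the treatment of sub-0.6 transcripts (dropping the label, keeping the audio as unlabeled) are illustrative assumptions, not the actual WenetSpeech metadata schema.

```python
def categorize(utterances):
    """Group utterances by per-utterance label confidence."""
    groups = {"High Label": [], "Weak Label": [], "Unlabel": []}
    for utt in utterances:
        conf = utt.get("confidence")
        if conf is None:
            groups["Unlabel"].append(utt)     # no transcript at all
        elif conf >= 0.95:
            groups["High Label"].append(utt)  # supervised training
        elif conf >= 0.6:
            groups["Weak Label"].append(utt)  # semi-supervised / noisy training
        else:
            groups["Unlabel"].append(utt)     # label deemed unreliable; keep audio only
    return groups
```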
High Label Data
All of the data is from YouTube and Podcast, and we tag every utterance with its source and domain. We classify the data into 10 groups according to its domain, speaking style, or scenario.
We provide 3 training subsets, namely S, M, and L, for building ASR systems on different data scales.
The S and M subsets are sampled from the high-label data that has an oracle confidence of 1.0.
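As an illustration of how such a subset could be drawn, the sketch below samples utterances with oracle confidence 1.0 until a target duration budget (e.g. 1000 hours for M) is met. The field names, the seconds-based `duration` field, and the budget loop are assumptions for illustration, not the release tooling.

```python
import random

def sample_subset(utterances, target_hours, seed=0):
    """Randomly draw utterances with oracle confidence 1.0 until the
    duration budget (in hours) is reached."""
    pool = [u for u in utterances if u["confidence"] == 1.0]
    random.Random(seed).shuffle(pool)
    subset, total_sec = [], 0.0
    for utt in pool:
        if total_sec >= target_hours * 3600:
            break
        subset.append(utt)
        total_sec += utt["duration"]  # duration assumed to be in seconds
    return subset

# e.g. m_subset = sample_subset(high_label_utts, target_hours=1000)
```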
| Evaluation Set | Hours | Source | Description |
|---|---|---|---|
| DEV | 20 | Internet | Specially designed for speech tools that require a cross-validation set during training |
| TEST_MEETING | 15 | Real meeting | Mismatched test of far-field, conversational, and spontaneous meeting speech |
- WenetSpeech borrows heavily from GigaSpeech, including its metadata design, license design, data encryption, downloading pipeline, and so on. The authors would like to thank Jiayu Du and Guoguo Chen for their suggestions on this work.
- The authors would like to thank their colleagues Lianhui Zhang and Yu Mao for collecting some of the YouTube data.