The Gradient

New Datasets to Democratize Speech Recognition Technology

Presenting The People’s Speech, a massive English-language dataset of audio transcriptions, and the Multilingual Spoken Words Corpus (MSWC), a 50-language, 6000-hour dataset of individual words

Andrey Kurenkov
Dec 14, 2021

Over the last year, we at MLCommons.org set out to create public datasets to ease two pressing bottlenecks for open-source speech recognition resources. We created The People’s Speech (TPS), a massive English-language dataset of audio transcriptions of full sentences, and the Multilingual Spoken Words Corpus (MSWC), a 50-language, 6000-hour dataset of individual words. Together, these datasets greatly improve upon the depth (TPS) and breadth (MSWC) of speech recognition resources licensed for researchers and entrepreneurs to share and adapt.

Read the article for audio samples!
