tinyML Talks: The Multilingual Spoken Words Corpus, a Massive Keyword Spotting Dataset

Mark Mazumder, PhD Student, Harvard University

This talk will present the Multilingual Spoken Words Corpus (MSWC), a speech dataset of over 340,000 spoken words in 50 languages, with over 23 million audio examples. MSWC has many use cases, ranging from voice-enabled consumer devices to call center automation. The dataset is CC-BY licensed and free for academic research and commercial use. We will introduce applications of MSWC for few-shot keyword spotting and spoken term search tasks in low-resource languages, and share a brief tutorial on getting started with the dataset. We will also discuss how we automated the construction of our dataset and our self-supervised approach for detecting outlier samples.
