One Voice Detector to Rule Them All
VAD is among the most important and fundamental algorithms in any production or data preparation pipeline related to speech
In our work we are often surprised by the fact that most people know about Automatic Speech Recognition (ASR), but know very little about Voice Activity Detection (VAD). It is baffling, because VAD is among the most important and fundamental algorithms in any production or data preparation pipeline related to speech – though it remains mostly “hidden” if it works properly.
Another problem arises when you try to find a high-quality VAD with a permissive license. Academic solutions are typically poorly supported, slow, and may not support streaming. Google’s formidable WebRTC VAD is an established and well-known solution, but it has started to show its age. Despite its stellar performance (30 ms chunks, << 1 ms of CPU time per chunk), it often fails to properly distinguish speech from noise. Commercial solutions typically have strings attached: they send telemetry in one form or another, or are not “free” in other ways.
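To make the WebRTC VAD comparison concrete, here is a minimal sketch of feeding it a single frame via the `webrtcvad` Python package (`py-webrtcvad`). The import is guarded so the frame-preparation arithmetic stands on its own; the all-zero frame is a synthetic stand-in for real 16-bit mono PCM audio.

```python
# Sketch: classifying one frame with WebRTC VAD (py-webrtcvad package).
# WebRTC VAD accepts 8/16/32/48 kHz audio in 10/20/30 ms frames of
# 16-bit mono PCM.
SAMPLE_RATE = 16000
FRAME_MS = 30
n_samples = SAMPLE_RATE * FRAME_MS // 1000   # 480 samples per 30 ms frame
frame = b"\x00\x00" * n_samples              # synthetic silent frame (16-bit PCM)

try:
    import webrtcvad
    vad = webrtcvad.Vad(2)                   # aggressiveness mode 0..3
    is_speech = vad.is_speech(frame, SAMPLE_RATE)
    print("speech detected:", is_speech)
except ImportError:
    # Package not installed; the frame layout above is still illustrative.
    print("webrtcvad not installed; prepared a", len(frame), "byte frame")
```

Even with the most aggressive mode, frames of noise are what WebRTC VAD tends to misclassify as speech, which is the weakness discussed above.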
So we decided to fix this and publish (under a permissive license) our internal VAD satisfying the following criteria:
High quality;
Highly portable;
No strings attached;
Supports 8 kHz and 16 kHz;
Supports 30, 60 and 100 ms chunks;
Trained on 100+ languages, generalizes well;
One chunk takes ~1 ms on a single CPU thread; ONNX may be up to 2-3x faster.
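As a back-of-the-envelope illustration of the supported configurations above, the sketch below computes how many audio samples each chunk length corresponds to at each sample rate (these sample counts are simple arithmetic, not values taken from the VAD's API):

```python
# Samples per chunk for the sample rates and chunk lengths listed above.
RATES_HZ = (8000, 16000)
CHUNKS_MS = (30, 60, 100)

samples_per_chunk = {
    (rate, ms): rate * ms // 1000
    for rate in RATES_HZ
    for ms in CHUNKS_MS
}

for (rate, ms), n in sorted(samples_per_chunk.items()):
    print(f"{rate} Hz, {ms} ms chunk -> {n} samples")
```

So a 30 ms chunk at 16 kHz is 480 samples, and a 100 ms chunk at 8 kHz is 800 samples; with ~1 ms of CPU time per chunk, the VAD runs far faster than real time.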
In this article we will tell you about Voice Activity Detection in general, describe our approach to VAD metrics, and show how to use our VAD and test it on your own voice.