Algorithm
The VAD algorithm works as follows:
- Input audio is resampled so that the processed audio has a sample rate of 16000 Hz.
- The resampled samples are batched into "frames" whose size is determined automatically by the model (512 samples for the Silero v6 model used here).
- The Silero VAD model is run on each frame and produces a number between 0 and 1 indicating the probability that the frame contains speech.
- If the algorithm has not detected speech lately, it is in a state of `not speaking`. Once it encounters a frame with speech probability greater than `positiveSpeechThreshold`, it switches to a state of `speaking`. When it then encounters `redemptionMs` milliseconds of frames with speech probability less than `negativeSpeechThreshold`, without encountering a frame with speech probability greater than `positiveSpeechThreshold` in the meantime, the speech audio segment is considered to have ended and the algorithm returns to a state of `not speaking`. Frames with speech probability between `negativeSpeechThreshold` and `positiveSpeechThreshold` are effectively ignored.
- When the algorithm detects the end of a speech audio segment (i.e. goes from the state of `speaking` to `not speaking`), it counts the number of frames in the segment with speech probability greater than `positiveSpeechThreshold`. If the duration of those frames is less than `minSpeechMs`, the audio segment is considered a false positive. Otherwise, `preSpeechPadMs` milliseconds of audio are prepended to the audio segment and the segment is made accessible through the higher-level API.
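The state machine described above can be sketched roughly as follows. This is an illustrative reading of the steps, not the library's actual implementation: `segmentSpeech`, `Segment`, and `VadOptions` are hypothetical names, the input is assumed to be an array of per-frame speech probabilities already produced by the model, and a segment still open when the input ends is simply dropped (the text above does not say how that case is handled).

```typescript
// Illustrative sketch only; all names here are hypothetical, not the library's API.
interface VadOptions {
  positiveSpeechThreshold: number;
  negativeSpeechThreshold: number;
  redemptionMs: number;
  preSpeechPadMs: number;
  minSpeechMs: number;
}

// Frame indices into the probability array; endFrame is exclusive.
interface Segment {
  startFrame: number;
  endFrame: number;
}

const SAMPLE_RATE = 16000; // Hz, after resampling
const FRAME_SAMPLES = 512; // frame size stated for the Silero v6 model
const MS_PER_FRAME = (FRAME_SAMPLES * 1000) / SAMPLE_RATE; // 32 ms

const defaults: VadOptions = {
  positiveSpeechThreshold: 0.5,
  negativeSpeechThreshold: 0.35,
  redemptionMs: 768,
  preSpeechPadMs: 96,
  minSpeechMs: 288,
};

function segmentSpeech(probs: number[], opts: VadOptions): Segment[] {
  // Assumption: millisecond settings are converted to whole frames like this.
  const redemptionFrames = Math.ceil(opts.redemptionMs / MS_PER_FRAME);
  const padFrames = Math.round(opts.preSpeechPadMs / MS_PER_FRAME);

  const segments: Segment[] = [];
  let speaking = false;
  let start = 0;          // frame that triggered the "speaking" state
  let speechFrames = 0;   // frames above the positive threshold
  let negativeFrames = 0; // frames below the negative threshold since the last positive frame

  probs.forEach((p, i) => {
    if (!speaking) {
      if (p > opts.positiveSpeechThreshold) {
        speaking = true;
        start = i;
        speechFrames = 1;
        negativeFrames = 0;
      }
      return;
    }
    if (p > opts.positiveSpeechThreshold) {
      speechFrames += 1;
      negativeFrames = 0; // a positive frame resets the redemption count
    } else if (p < opts.negativeSpeechThreshold) {
      negativeFrames += 1;
      if (negativeFrames >= redemptionFrames) {
        // End of segment: keep it only if it contains enough speech.
        if (speechFrames * MS_PER_FRAME >= opts.minSpeechMs) {
          segments.push({
            startFrame: Math.max(0, start - padFrames),
            endFrame: i + 1,
          });
        }
        speaking = false;
      }
    }
    // Probabilities between the two thresholds change nothing.
  });

  return segments;
}
```

With the default settings, ten consecutive high-probability frames (320 ms) followed by 24 low-probability frames yield one segment whose start is moved back by three frames of padding, while a burst of only three high-probability frames (96 ms) is discarded as a false positive.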
Configuration
All of the main APIs accept certain common configuration parameters that modify the VAD algorithm.
- `positiveSpeechThreshold: number` - the threshold above which a probability is considered to indicate the presence of speech. Default: `0.5`
- `negativeSpeechThreshold: number` - the threshold below which a probability is considered to indicate the absence of speech. Default: `0.35`
- `redemptionMs: number` - the number of milliseconds of speech-negative frames to wait before ending a speech segment. Default: `768`
- `preSpeechPadMs: number` - the number of milliseconds of audio to prepend to a speech segment. Default: `96`
- `minSpeechMs: number` - the minimum duration in milliseconds for a speech segment. Default: `288`
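Because the model works on fixed-size frames, it can help to see how the millisecond settings relate to frame counts. The sketch below assumes the 16000 Hz sample rate and 512-sample frames described in the Algorithm section, which makes each frame 32 ms long. `defaultOptions` and `msToFrames` are illustrative names rather than part of the library, and how the library rounds values that are not a multiple of the frame duration is not covered here.

```typescript
// Default values as listed above; the object name is illustrative.
const defaultOptions = {
  positiveSpeechThreshold: 0.5,
  negativeSpeechThreshold: 0.35,
  redemptionMs: 768,
  preSpeechPadMs: 96,
  minSpeechMs: 288,
};

const SAMPLE_RATE_HZ = 16000; // sample rate after resampling
const FRAME_SAMPLES = 512;    // frame size stated for the Silero v6 model

// Hypothetical helper: how many frames a duration in milliseconds spans.
// The result may be fractional; no rounding policy is assumed.
function msToFrames(ms: number): number {
  return (ms * SAMPLE_RATE_HZ) / (1000 * FRAME_SAMPLES);
}

const redemptionFrames = msToFrames(defaultOptions.redemptionMs);     // 24
const preSpeechPadFrames = msToFrames(defaultOptions.preSpeechPadMs); // 3
const minSpeechFrames = msToFrames(defaultOptions.minSpeechMs);       // 9
```

Each default is a whole number of 32 ms frames, so when overriding a value it is probably simplest to choose a multiple of 32.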