Automatic Speech Recognition is Made Easy with OpenAI’s Whisper

OpenAI's Whisper is a new open-source ML model designed for multilingual automatic speech recognition

OpenAI's Whisper achieves human-level robustness and accuracy in ASR with an off-the-shelf transformer trained on 680,000 hours of weakly supervised, multilingual audio data, with no fine-tuning required. The model is open-source, and several weight sizes have been released to the public. The transformer is a standard encoder-decoder model. First, audio recordings from different speech recognition tasks are transformed into log-Mel spectrograms: audio representations in the time-frequency-amplitude domain, with frequencies measured in Mels, a logarithmic scale designed to approximate human pitch perception. The spectrograms are then passed through a stem of one-dimensional convolution layers with GELU activations, which reduces their dimensionality along the time axis.
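
To make the front end concrete, here is a minimal PyTorch sketch of a Whisper-style convolutional stem. The class name and layer widths are illustrative assumptions, not the released code:

```python
import torch
import torch.nn as nn

class AudioConvStem(nn.Module):
    """Whisper-style stem: two 1-D convolutions with GELU activations.

    The second convolution uses stride 2, halving the time resolution of the
    log-Mel spectrogram before it enters the transformer encoder.
    (Illustrative sketch; dimensions are assumptions.)
    """

    def __init__(self, n_mels: int = 80, d_model: int = 512):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1)
        self.gelu = nn.GELU()

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -- a log-Mel spectrogram
        x = self.gelu(self.conv1(mel))
        x = self.gelu(self.conv2(x))   # stride 2 reduces dimensionality along time
        return x.transpose(1, 2)       # (batch, frames // 2, d_model) for the encoder

stem = AudioConvStem()
print(stem(torch.randn(1, 80, 3000)).shape)  # torch.Size([1, 1500, 512])
```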

To ensure that different features are scaled equally and to smooth the loss landscape, inputs are standardized to zero mean and unit variance. Whereas ReLU deterministically zeroes every input below zero, GELU acts like a stochastic form of dropout in expectation, with the probability of an input being dropped increasing as x decreases. The input is positionally encoded and passed through the transformer encoder stack, and the resulting representation conditions the autoregressive decoder. Unique tokens at the beginning of the decoding sequence indicate the task type, whether the input contains speech at all, timestamp information, a task's start and end, and other details.
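
The ReLU/GELU difference is easy to see numerically (a quick PyTorch check, nothing Whisper-specific):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(F.relu(x))  # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000]) -- hard zero below 0
print(F.gelu(x))  # ~[-0.0455, -0.1543, 0.0000, 0.3457, 1.9545] -- x scaled by P(N(0,1) <= x)
```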

Outputs are sampled with greedy decoding, and the authors employ several strategies to prevent repetition loops, such as starting from temperature 0 and progressively increasing it whenever the entropy of the generated tokens is too low (someone should tell them about typical sampling).
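
A sketch of what such a fallback loop could look like; `decode` and the `min_entropy` threshold are hypothetical stand-ins for the release's internals, not the authors' code:

```python
import math
from collections import Counter

def token_entropy(tokens) -> float:
    """Shannon entropy of the empirical token distribution (low => repetitive output)."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def decode_with_fallback(decode, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                         min_entropy=2.0):
    """Greedy first (T=0); retry at a higher temperature if the output looks like a loop.

    `decode` is a hypothetical callable returning a token sequence;
    `min_entropy` is an assumed threshold.
    """
    for t in temperatures:
        tokens = decode(temperature=t)
        if token_entropy(tokens) >= min_entropy:
            return tokens
    return tokens  # keep the warmest attempt if every check fails
```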

Because human-validated, supervised speech recognition and translation data are difficult to come by, the authors scavenged for any ASR data they could find and concentrated on data pre-processing methods. These included heuristics to spot and exclude machine-generated transcripts, such as the absence of punctuation or the use of all caps. To ensure a match between the transcript and the audio language, they also deployed a language detector. They then trained a preliminary model on the data to identify data points with a high error rate, reviewed those manually, and omitted likely outliers. At 680,000 hours in total, the dataset was two orders of magnitude larger than earlier supervised ASR datasets. The model weights and code were published; this dataset, however, was not.
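
As an illustration of the kind of heuristic filter described, a sketch of the idea (not the authors' pipeline, whose rules are more extensive):

```python
import re

def looks_machine_generated(transcript: str) -> bool:
    """Flag transcripts that resemble raw ASR output rather than human writing.

    Illustrative heuristics only, per the paper's description.
    """
    no_punctuation = not re.search(r"[.,;:!?]", transcript)
    all_caps = transcript.isupper()
    return no_punctuation or all_caps

print(looks_machine_generated("THIS LOOKS LIKE MACHINE OUTPUT"))    # True
print(looks_machine_generated("A human probably wrote this one."))  # False
```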

The authors criticize the word-error rate (WER) metric, which penalizes any discrepancy between model output and ground truth: we are interested in semantic flaws, not stylistic differences. To standardize word usage and thereby reduce WER, they built a set of text normalization rules. Effective robustness is another metric used to gauge the model's performance: robustness measures how well a model generalizes to out-of-distribution datasets, and effective robustness is that robustness measured relative to another model. Putting Whisper and wav2vec 2.0 side by side, Whisper shows higher effective robustness and, on average, commits 55% fewer errors.
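
To see why the authors object, here is a standard WER computation (word-level edit distance; my sketch, not the paper's code) applied to two transcripts that mean the same thing:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between word prefixes.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

# "it's 10 AM" means the same as "it is ten a m", yet WER scores it 1.0:
print(word_error_rate("it is ten a m", "it's 10 AM"))
```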

According to the authors' scaling analysis, WER halves for every 16-fold increase in training data. If this holds, we should anticipate superhuman ASR performance from the upcoming generation of models. The trend does not apply to all languages, however: non-Indo-European languages typically perform worse, and Welsh (CY) is an outlier despite reportedly having 9,000 hours of translation data in the training set. And as WER approaches human levels, scaling model parameters yields diminishing returns.
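
Halving per 16-fold increase corresponds to a power law with exponent log 2 / log 16 = 0.25. A quick back-of-the-envelope projection (the 10% starting WER is an assumed example, not a reported number):

```python
import math

# "WER halves per 16x data" implies WER ~ data**(-alpha) with alpha = 0.25
alpha = math.log(2) / math.log(16)

def projected_wer(base_wer: float, data_multiplier: float) -> float:
    return base_wer * data_multiplier ** -alpha

print(projected_wer(10.0, 16))   # ~5.0 -- one halving
print(projected_wer(10.0, 256))  # ~2.5 -- two halvings
```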

The Whisper speech translator from OpenAI combines a ton of data with careful decoding techniques to achieve human-level speech translation and recognition. Whether future ASR models will outperform humans in the next few years remains an open question.
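
For readers who want to try the released weights, the open-source package exposes a short transcription API (the model size and file path below are arbitrary choices):

```python
# pip install openai-whisper
import whisper

model = whisper.load_model("base")      # one of the published weight sizes
result = model.transcribe("audio.mp3")  # path to your own audio file
print(result["text"])
```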
