Lowering the barriers to speech recognition

image


Automatic speech recognition (STT or ASR) has come a long way in improvement and has a rather extensive history. The conventional wisdom is that only large corporations are capable of creating more or less working "general" solutions that will show sane quality metrics regardless of the data source (different voices, accents, domains). Here are a few main reasons for this misconception:


  • High requirements for computing power;
  • A large amount of data needed for training;
  • Publications usually write only about the so-called state-of-the-art solutions, which have high quality indicators, but are absolutely impractical.

In this article, we will dispel some misconceptions and try to slightly approximate the point of "singularity" for speech recognition. Namely:


  • , , NVIDIA GeForce 1080 Ti;
  • Open STT 20 000 ;
  • , STT .

3 β€” , .





PyTorch, β€” Deep Speech 2.


:


  • GPU;
  • . Python PyTorch , ;
  • . ;

, PyTorch "" , , (, C++ JAVA).


Open STT


20 000 . (~90%), .



, , , β€œβ€ (. Google, Baidu, Facebook). , STT β€œβ€ β€œβ€.


, , STT, :


  • , ;
  • ;
  • ;
  • .

.


,


Deep Speech 2 (2015) :


, %,WER, %WER, %
112029,2350,97
10120013,8022,99
20240011,6520,41
5060009,5115,90
100120008,4613,59

WER (word error rate, ) . : 9- 2 2D- 7 68 . , Deep Speech 2.

: , . , . STT LibriSpeech ASR (LibriSpeech) .


, Google, Facebook, Baidu 10 000 β€” 100 000 . , , Facebook, , , , .


. 1 2 10 ( , , STT ).


, (LibriSpeech), , - . open-source , Google, . , , STT-. , , Common Voice, .



/
Wav2Letter++25621C++
FairSeq956111PyTorch
OpenNMT2 401138PyTorch
EspNet5 44151PyTorch
ML300-5001 β€” 10PyTorch

( ) β€” . , STT, /, PyTorch TensorFlow. , , .


/ ( ), , :


  • ( );
  • (end-to-end , , ) ;
  • ( β€” 10GB- );
  • LibriSpeech, , ;
  • STT , , , , ;
  • , PR, β€œ ” β€œβ€. , , , , , ( , , , );
  • - , , , , , ;

, FairSeq EspNet, , . , ?



LibriSpeech, 8 GPU US $10 000 .


β€” . , .


, - "" Common Voice Mozilla.



ML: - (state-of-the-art, SOTA) , .


, , , , .


, c β€œ ” β€œ, ” .



, :


  • - , (. Goodhart's Law);
  • β€œβ€ , ( , );
  • , ;
  • , ;
  • , 95% , . . β€œ ” (β€œpublish or perish”), , , , ;
  • , , , , . , , , . , .


    , ML :


    • / / ;
    • ;
    • .



, :


  • -;
  • semi-supervised unsupervised (wav2vec, STT-TTS) , , ;
  • end-to-end (LibriSpeech), , 1000 ( LibriSpeech);
  • MFCC . . , STFT. , - SincNet.


, , , . :


  • , ;
  • open-source ( , ).

STT


STT :


  • ;
  • ;
  • ;
  • , 2-4 1080Ti.


β€” "" . , ( ). , β€” .


, , β€” . β€” . .


, AWS NVIDIA Tesla GPU, , 5-10 GPU.

:


  • , [ ] x [ ]. , , : 1) 2) ? , , ;


    , .


    l_curve


    , "L-"

  • β€” . , , "". ;


  • . ) ; ) , ;


  • , , . , , Mobilenet/EfficientNet/FBNet ;


  • , ML : 1) : , , ; 2) Ceteris paribus: , , .. , ;


  • , , ( ) , . 10 20 , , , "" .



( ):




models


β€” . β€” . "" β€” Wav2Letter. DeepSpeech , 2-3 . GPU β€” , . , .

Deep Speech 2 Pytorch. LSTM GRU , . , . , , :


  • ~3-5 ;
  • 5-10 ;
  • 1080Ti .

β„–1: .


( ) .


β„–2: .


, β€” . , : , separable convolutions.


, , . , . , 3-4 , 3-4 .


β„–3: Byte-Pair-Encoding .


. BPE , , WER ( ) . , : BPE . , BPE , .
.


β„–4: .


encoder-decoder. , , state-of-the-art .


, , GPU . , 500-1000 GPU , 3-4 CPU ( , ). , 2-4 , , .


β„–5: .


, , , 1080Ti , , , , 4 8 GPU ( GPU). , .


β„–6: .


, , β€” . , , .


curriculum learning. , , .


β„–7. .


, β€” . :


  • Sequence-to-sequence ;
  • Beam search β€” AM.

beam search KenLM 25 CPU .



:


  • , ;
  • . ;
  • / . "" ;
  • .

, ( ) , , . , , .



, :


  • . β€” . , ;
  • (). . , ;
  • . , , ;
  • . , , ;
  • . , , ;
  • . , , ;
  • YouTube. , , . β€” ;
  • (e-commerce). , ;
  • "Yellow pages". . , , ;
  • . - . , "" . , ;
  • (). , , , .


:


  • Tinkoff ( , );
  • (, , , , );
  • Yandex SpeechKit;
  • Google;
  • Kaldi 0.6 / Kaldi 0.7 ( , vosk-api);
  • wit.ai;
  • stt.ai;
  • Azure;
  • Speechmatics;
  • Voisi;

β€” Word error rate (WER).


. ("" -> "1-"), . , WER ~1 .


2019 2020 . , . WER ~1 . , , .

WERWERWER
21410%3%29%
()01713%13%86%
01615%15%60%
01618%18%70%
Court hearings0721%21%53%
Audio books41427%22%70%
YouTube11731%thirty%73%
Calls (e-commerce)2thirteen32%29%76%
Yellow pages1633%31%72%
Medical terms1640%39%72%
Calls (pranks)31441%38%85%

The article is already quite huge. If you are interested in a more detailed methodology and the positions of each system on each domain, then you will find an extended version of system comparison here , and a description of the comparison methodology here .


All Articles