Automatic speech recognition (ASR, also called speech-to-text or STT) has a long history and has improved substantially over the years. The conventional wisdom is that only large corporations can build reasonably robust "general" solutions, ones that show sane quality metrics regardless of the data source (different voices, accents, domains). There are a few main reasons for this misconception:

High compute requirements;
The large amount of data needed for training;
Publications usually cover only the so-called state-of-the-art solutions, which score well on quality metrics but are entirely impractical.

In this article, we will dispel some of these misconceptions and try to bring the "singularity" point for speech recognition a little closer. Namely:

, , NVIDIA GeForce 1080 Ti;
Open STT 20 000 ;
, STT .

3 — , .

Open STT
STT

PyTorch, — Deep Speech 2.

GPU;
. Python PyTorch , ;
. ;

, PyTorch "" , , (, C++ JAVA).

Open STT

20 000 . (~90%), .

, , , “” (. Google, Baidu, Facebook). , STT “” “”.

, , STT, :

, ;
;
;
.

,

Deep Speech 2 (2015) :

Fraction of data, %	Audio, hours	WER (regular dev), %	WER (noisy dev), %
1	120	29.23	50.97
10	1200	13.80	22.99
20	2400	11.65	20.41
50	6000	9.51	15.90
100	12000	8.46	13.59

WER (word error rate, ) . : 9- 2 2D- 7 68 . , Deep Speech 2.

: , . , . STT LibriSpeech ASR (LibriSpeech) .

, Google, Facebook, Baidu 10 000 — 100 000 . , , Facebook, , , , .

. 1 2 10 ( , , STT ).

, (LibriSpeech), , - . open-source , Google, . , , STT-. , , Common Voice, .

			/
Wav2Letter++	256	21	C++
FairSeq	956	111	PyTorch
OpenNMT	2 401	138	PyTorch
EspNet	5 441	51	PyTorch
ML	300-500	1 — 10	PyTorch

( ) — . , STT, /, PyTorch TensorFlow. , , .

/ ( ), , :

( );
(end-to-end , , ) ;
( — 10GB- );
LibriSpeech, , ;
STT , , , , ;
, PR, “ ” “”. , , , , , ( , , , );
- , , , , , ;

, FairSeq EspNet, , . , ?

LibriSpeech, 8 GPU US $10 000 .

— . , .

, - "" Common Voice Mozilla.

ML: - (state-of-the-art, SOTA) , .

, , , , .

, c “ ” “, ” .

, :

- , (. Goodhart's Law);
“” , ( , );
, ;
, ;
, 95% , . . “ ” (“publish or perish”), , , , ;
, , , , . , , , . , .

, ML :
- / / ;
- ;
- .

, :

-;
semi-supervised unsupervised (wav2vec, STT-TTS) , , ;
end-to-end (LibriSpeech), , 1000 ( LibriSpeech);
MFCC . . , STFT. , - SincNet.

, , , . :

, ;
open-source ( , ).

STT

STT :

;
;
;
, 2-4 1080Ti.

— "" . , ( ). , — .

, , — . — . .

, AWS NVIDIA Tesla GPU, , 5-10 GPU.

, [ ] x [ ]. , , : 1) 2) ? , , ;

, .

, "L-"
— . , , "". ;
. ) ; ) , ;
, , . , , Mobilenet/EfficientNet/FBNet ;
, ML : 1) : , , ; 2) Ceteris paribus: , , .. , ;
, , ( ) , . 10 20 , , , "" .

( ):

Ceteris paribus: , , . , , , ;
Open STT v0.5-beta;
, ( " " I/O, , ).

models

— . — . "" — Wav2Letter. DeepSpeech , 2-3 . GPU — , . , .

Deep Speech 2 Pytorch. LSTM GRU , . , . , , :

~3-5 ;
5-10 ;
1080Ti .

№1: .

( ) .

№2: .

, — . , : , separable convolutions.

, , . , . , 3-4 , 3-4 .

№3: Byte-Pair-Encoding .

. BPE , , WER ( ) . , : BPE . , BPE , .
.
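As a reminder of what byte-pair encoding does, here is a toy sketch of the classic merge-learning procedure: start from characters and repeatedly merge the most frequent adjacent pair. It is only an illustration under stated assumptions. The function names (`learn_bpe_merges`, `merge_pair`, `encode`), the tie-breaking rule, and the tiny word list are hypothetical, and this is not the tokenizer used for the models discussed here.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Every word starts as a tuple of single characters; Counter keeps word frequencies.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        # Most frequent pair wins; ties are broken by the pair itself (an assumption,
        # chosen only to make the result deterministic).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split a word into sub-word tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
tokens = encode("lowest", merges)
```

With more merges the vocabulary moves from characters toward whole words, which is the knob that trades sequence length against vocabulary size.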

№4: .

encoder-decoder. , , state-of-the-art .

, , GPU . , 500-1000 GPU , 3-4 CPU ( , ). , 2-4 , , .

№5: .

, , , 1080Ti , , , , 4 8 GPU ( GPU). , .

№6: .

, , — . , , .

curriculum learning. , , .

№7. .

, — . :

Sequence-to-sequence ;
Beam search — AM.

beam search KenLM 25 CPU .

, ;
. ;
/ . "" ;
.

, ( ) , , . , , .

, :

. — . , ;
(). . , ;
. , , ;
. , , ;
. , , ;
. , , ;
YouTube. , , . — ;
(e-commerce). , ;
"Yellow pages". . , , ;
. - . , "" . , ;
(). , , , .

Tinkoff ( , );
(, , , , );
Yandex SpeechKit;
Google;
Kaldi 0.6 / Kaldi 0.7 ( , vosk-api);
wit.ai;
stt.ai;
Azure;
Speechmatics;
Voisi;

— Word error rate (WER).
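For readers who have not computed WER by hand, the sketch below shows the usual definition: the word-level edit distance (substitutions, deletions, insertions) between the reference and the hypothesis, divided by the number of reference words. The function names and the normalization step (lowercasing and whitespace splitting) are assumptions made for illustration and are not necessarily the normalization used in this comparison.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (dynamic programming, two rows)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
score = wer("the cat sat on the mat", "the cat sit on mat")
```

Because insertions are counted against the reference length, WER can exceed 100% when the hypothesis is much longer than the reference, and differences in text normalization alone can shift the score.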

. ("" -> "1-"), . , WER ~1 .

2019 2020 . , . WER ~1 . , , .

			WER	WER	WER
	2	14	10%	3%	29%
()	0	17	13%	13%	86%
	0	16	15%	15%	60%
	0	16	18%	18%	70%
Court hearings	0	7	21%	21%	53%
Audio books	4	14	27%	22%	70%
YouTube	1	17	31%	30%	73%
Calls (e-commerce)	2	13	32%	29%	76%
Yellow pages	1	6	33%	31%	72%
Medical terms	1	6	40%	39%	72%
Calls (pranks)	3	14	41%	38%	85%

This article is already quite long. If you are interested in a more detailed methodology and in how each system performs on each domain, an extended version of the system comparison and a description of the comparison methodology are available separately.

Lowering the barriers to speech recognition

Open STT

,

STT
