In the previous article, we examined the attention mechanism, a very common method in modern deep learning models that improves the performance of neural machine translation applications. In this article, we look at the Transformer, a model that uses attention to speed up training. On a number of tasks, the Transformer also outperforms Google's neural machine translation model. Its biggest advantage, however, is how well it lends itself to parallelization. Google Cloud even recommends the Transformer as a model to use when working with Cloud TPU. Let's try to work out what the model consists of and what each part does.

The Transformer was first proposed in the paper Attention is All You Need. A TensorFlow implementation is available as part of the Tensor2Tensor package, and a group of NLP researchers at Harvard published an annotated guide to the paper with a PyTorch implementation. In this guide, we try to lay out the main ideas and concepts as simply and consistently as we can, in the hope that this helps readers without deep background in the subject area understand the model.

A High-Level Look

Let's start by looking at the model as a kind of black box. In a machine translation application, it takes a sentence in one language as input and outputs a sentence in another language.

the_transformer_3

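Opening up the black box, we see an encoding component, a decoding component, and the connections between them.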

The_transformer_encoders_decoders

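The encoding component is a stack of encoders; the paper stacks 6 of them (there is nothing magical about the number 6, and other arrangements are worth trying). The decoding component is a stack of decoders, with the same number of them.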

The_transformer_encoder_decoder_stack

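The encoders are all identical in structure, although they do not share weights. Each one is broken down into two sub-layers: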

Transformer_encoder

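The encoder's input first flows through a self-attention layer, which helps the encoder look at the other words in the input sentence as it encodes a specific word. We look at self-attention more closely later in the article.

The output of the self-attention layer is passed to a feed-forward neural network. The exact same network is applied independently to each position.

The decoder has both of these layers, but between them sits an attention layer that helps the decoder focus on the relevant parts of the input sentence (similar to what attention does in seq2seq models).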

Transformer_decoder

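Now that we have seen the main components of the model, let's look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.

As in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm (word embeddings).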

embeddings

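Each word is embedded into a vector of size 512. We represent these vectors with simple boxes.

The embedding happens only in the bottom-most encoder. What all encoders have in common is that they receive a list of vectors, each of size 512 (in the bottom encoder these are the word embeddings; in the other encoders they are the output of the encoder directly below). The size of this list is a hyperparameter we can set, and it is essentially the length of the longest sentence in the training dataset.

After the words of the input sequence are embedded, each of them flows through both layers of the encoder.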

encoder_with_tensors

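Here we begin to see a key property of the Transformer: the word at each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer, but the feed-forward layer has no such dependencies, so the paths can be executed in parallel as they pass through it.

Next, we switch to a shorter example sentence and look at what happens in each sub-layer of the encoder.

Now We're Encoding!

As already mentioned, an encoder receives a list of vectors as input. It processes this list by passing the vectors into a self-attention layer, then into a feed-forward neural network, and then sends the output upward to the next encoder.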

encoder_with_tensors_2

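The word at each position passes through a self-attention process. Then each one passes through a feed-forward neural network, the exact same network, with each vector flowing through it separately.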
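As a rough illustration of "the same network applied to each position separately", here is a minimal NumPy sketch. The ReLU nonlinearity and the inner size `d_ff = 2048` are assumptions taken from the paper rather than from this article, and the weights are random stand-ins for trained parameters.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Two linear maps with a ReLU in between (assumed form, see lead-in).
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048          # d_ff is an assumed inner size

# Random stand-ins for trained weights.
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

X = rng.normal(size=(3, d_model))  # three positions, one vector each

# Applying the network to the whole matrix at once ...
all_at_once = feed_forward(X, W1, b1, W2, b2)
# ... gives the same result as applying it to each position on its own.
one_by_one = np.stack([feed_forward(row, W1, b1, W2, b2) for row in X])
```

Because no position depends on any other inside this layer, the per-position computations can run in parallel.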

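Don't be misled by the way the word "self-attention" is thrown around here as if it were a concept everyone knows. The author had not come across it before reading the paper "Attention is All You Need". Let's break down how it works.

Suppose the following sentence is the input we want to translate:

"The animal didn't cross the street because it was too tired"

What does "it" refer to in this sentence? The street or the animal? The question is simple for a human, but not for an algorithm.

When the model processes the word "it", self-attention allows it to associate "it" with "animal".

As the model processes each word (each position in the input sequence), self-attention lets it look at other positions in the input sequence for clues that lead to a better encoding of that word.

If you are familiar with recurrent neural networks (RNN), think of how maintaining a hidden state lets an RNN combine its representation of the previous words/vectors with the one it is currently processing. Self-attention is the method the Transformer uses to bake its "understanding" of other relevant words into the word currently being processed.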

transformer_self-attention_visualization

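As we encode the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism focuses on "The animal" and bakes part of its representation into the encoding of "it".

Be sure to look at the Tensor2Tensor notebook, where you can load a Transformer model and examine it with this interactive visualization.

Let's first look at how self-attention is calculated using vectors, and then at how it is actually implemented, using matrices.

The first step in calculating self-attention is to create three vectors from each of the encoder's input vectors (in this case, the embedding of each word): a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that are learned during training.

Note that these new vectors are smaller than the embedding vector. Their dimensionality is 64, while the embeddings and the encoder input/output vectors have dimensionality 512. They do not have to be smaller; this is an architecture choice that keeps the cost of multi-head attention (mostly) constant.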

transformer_self_attention_vectors

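Multiplying x1 by the weight matrix WQ produces q1, the "query" vector associated with that word. In the end we create a "query", a "key", and a "value" projection for each word in the input sentence.

What are the "query", "key", and "value" vectors?

They are abstractions that are useful for calculating and thinking about attention. Once you have read how attention is calculated below, you will know nearly everything you need to know about the role each of these vectors plays.

The second step in calculating self-attention is to compute a score. Say we are calculating self-attention for the first word in this example, "Thinking". We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode the word at a given position.

The score is the dot product of the query vector with the key vector of the word being scored. So if we are processing self-attention for the word in position #1, the first score is the dot product of q1 and k1, and the second is the dot product of q1 and k2.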

transformer_self_attention_score

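The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper, which is 64; this gives more stable gradients, and other values are possible, but this is the default), and then pass the result through a softmax operation. Softmax normalizes the scores so that they are all positive and add up to 1.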

self-attention_softmax

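This softmax score determines how much each word is expressed at this position. The word at this position will clearly have the highest softmax score, but sometimes it is useful to attend to another word that is relevant to the current one.

The fifth step is to multiply each value vector by its softmax score (in preparation for summing them). The intuition is to keep intact the values of the words we want to focus on and to drown out irrelevant words (by multiplying them by tiny numbers such as 0.001).

The sixth step is to sum the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).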

self-attention-output

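That concludes the self-attention calculation. The resulting vector is the one we pass on to the feed-forward neural network. In the actual implementation, however, this calculation is done in matrix form for speed. Now that we have seen the intuition at the level of individual words, let's look at the matrix version.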
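The six steps can be traced for a single word with a short NumPy sketch. The two input vectors and the matrices `W_Q`, `W_K`, `W_V` below are random placeholders standing in for real embeddings and trained weights; only the sizes (512 and 64) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64

# Placeholder embeddings for two words (e.g. "Thinking", "Machines").
x = rng.normal(size=(2, d_model))

# Placeholder projection matrices (trained in a real model).
W_Q = rng.normal(size=(d_model, d_k)) * 0.05
W_K = rng.normal(size=(d_model, d_k)) * 0.05
W_V = rng.normal(size=(d_model, d_k)) * 0.05

# Step 1: query, key and value vectors for every word.
q, k, v = x @ W_Q, x @ W_K, x @ W_V

# Step 2: score every word against the first word (q1.k1, q1.k2).
scores = np.array([q[0] @ k[0], q[0] @ k[1]])

# Step 3: divide by sqrt(d_k), which is 8 for d_k = 64.
scaled = scores / np.sqrt(d_k)

# Step 4: softmax, so the weights are positive and sum to 1.
exp = np.exp(scaled - scaled.max())
weights = exp / exp.sum()

# Steps 5 and 6: weight the value vectors and sum them.
z1 = weights[0] * v[0] + weights[1] * v[1]
```

Here `z1` is the self-attention output for the first word; repeating the same steps with `q[1]` would give the output for the second word.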

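The first step is to calculate the Query, Key, and Value matrices. We do this by packing the embeddings into a matrix X and multiplying it by the trained weight matrices (WQ, WK, WV).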

self-attention-matrix-calculation

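Each row of the matrix X corresponds to a word in the input sentence. We again see the difference in size between the embedding vectors (512, or 4 boxes in the figure) and the q/k/v vectors (64, or 3 boxes in the figure).

Finally, since we are dealing with matrices, steps 2-6 can be condensed into a single formula that computes the output of the self-attention layer.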

self-attention-matrix-calculation-2
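In matrix form, steps 2-6 collapse into softmax(Q Kᵀ / √d_k) V. A minimal NumPy sketch follows, with random matrices standing in for real embeddings and trained weights; the function names are illustrative only.

```python
import numpy as np

def softmax(a):
    # Row-wise softmax with the usual max-subtraction for stability.
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # one row of weights per word
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 512))                  # two words, one row each
W_Q, W_K, W_V = (rng.normal(size=(512, 64)) * 0.05 for _ in range(3))

Z, weights = self_attention(X, W_Q, W_K, W_V)
```

Each row of `Z` is the self-attention output for the word in the corresponding row of `X`.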

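The paper then refines the self-attention layer by adding a mechanism called multi-head attention. This improves the attention layer in two ways:

It expands the model's ability to focus on different positions. In the example above, z1 contains a little of every other encoding, but it can be dominated by the word itself. When translating a sentence like "The animal didn't cross the street because it was too tired", we want to know which word "it" refers to.
It gives the attention layer multiple "representation subspaces". As we will see, multi-head attention has not one but several sets of query/key/value weight matrices (the Transformer uses 8 attention "heads", so we end up with 8 sets for each encoder/decoder). Each set is initialized randomly. After training, each set projects the input embeddings (or the vectors from lower encoders/decoders) into a different representation subspace.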

transformer_attention_heads_qkv

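With multi-head attention, we maintain separate WQ/WK/WV weight matrices for each "head", which results in different Q/K/V matrices. As before, we multiply X by the WQ/WK/WV matrices to produce the Q/K/V matrices.

If we run the same self-attention calculation described above 8 times with different weight matrices, we end up with 8 different Z matrices.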

transformer_attention_heads_z

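This leaves us with a small problem. The feed-forward layer does not expect 8 matrices; it expects a single matrix (one vector per word), so we need a way to condense the 8 into one Z matrix.

How do we do that? We concatenate the matrices and multiply the result by an additional weight matrix WO.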

transformer_attention_heads_weight_matrix_o

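That is, in essence, all there is to multi-head self-attention. It involves quite a few matrices, so the next figure puts them all in one place.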

transformer_multi-headed_self-attention-recap
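The recap above can be written out as a small NumPy sketch: run attention once per head, concatenate the resulting Z matrices, and multiply by WO. All weights are random placeholders, and the function name `multi_head_attention` is illustrative rather than taken from any library.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) tuples, one tuple per head."""
    Zs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        Zs.append(softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V)
    # Concatenate the per-head outputs side by side, then project with W_O.
    return np.concatenate(Zs, axis=-1) @ W_O

rng = np.random.default_rng(0)
n_heads, d_model, d_k = 8, 512, 64

heads = [tuple(rng.normal(size=(d_model, d_k)) * 0.05 for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_k, d_model)) * 0.05

X = rng.normal(size=(2, d_model))
Z = multi_head_attention(X, heads, W_O)
```

With 8 heads of width 64, the concatenated matrix is 512 wide, so `W_O` maps it back to one 512-dimensional vector per word, which is what the feed-forward layer expects.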

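Now that we have touched on attention "heads", let's return to the earlier example and see where the different heads focus as we encode the word "it" in the example sentence: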

transformer_self-attention_visualization_2

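As we encode the word "it", one attention "head" focuses most on "the animal", while another focuses on "tired". In a sense, the model's representation of "it" combines some of the representation of both "animal" and "tired".

If we add all the attention "heads" to the picture, however, things become harder to interpret.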

transformer_self-attention_visualization_3

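One thing still missing from the model as described so far is a way to account for the order of the words in the input sequence.

To address this, the Transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word or the distance between words in the sequence. The intuition is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they are projected into Q/K/V vectors and used in dot-product attention.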

transformer_positional_encoding_vectors

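To give the model a sense of word order, we add positional encoding vectors, whose values follow a specific pattern.

If we assume the embedding has a dimensionality of 4, the actual positional encodings would look like this: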

transformer_positional_encoding_example

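In the following figure, each row corresponds to the positional encoding of one vector: the first row is the vector added to the embedding of the first word in the input sequence, the second row to the second word, and so on. Each row contains 512 values between -1 and 1. They are color-coded so that the pattern is visible.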

transformer_positional_encoding_large_example

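A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). It appears split in half down the center: the values in the left half are generated by one function (which uses sine), and those in the right half by another (which uses cosine). The two halves are then concatenated to form each positional encoding vector.

The formula for positional encoding is given in the paper (section 3.5). The code that generates the encodings can be found in get_timing_signal_1d(). This is not the only possible method of positional encoding, but it has the advantage of scaling to sequence lengths not seen in training (for example, if the trained model is asked to translate a sentence longer than any in the training set).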
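A minimal NumPy sketch of the concatenated sine/cosine scheme described above. It is loosely modeled on the idea behind get_timing_signal_1d() but is not that function's code; the parameter names `min_timescale` and `max_timescale` and their defaults are assumptions made for this illustration.

```python
import numpy as np

def positional_encoding(length, d_model, min_timescale=1.0, max_timescale=1.0e4):
    # Half of the columns use sine, the other half cosine.
    num_timescales = d_model // 2
    # Wavelengths form a geometric progression between the two timescales.
    log_increment = np.log(max_timescale / min_timescale) / max(num_timescales - 1, 1)
    inv_timescales = min_timescale * np.exp(-log_increment * np.arange(num_timescales))
    # One row per position, one column per timescale.
    scaled_time = np.arange(length)[:, None] * inv_timescales[None, :]
    # Left half: sine values. Right half: cosine values.
    return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)

pe = positional_encoding(20, 512)      # 20 positions, embedding size 512

# The encoding is simply added to the word embeddings.
embeddings = np.zeros((20, 512))       # placeholder embeddings
encoder_input = embeddings + pe
```

Since the values depend only on the position index, the same function can produce encodings for sequences longer than any seen during training.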

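One detail of the encoder architecture that needs mentioning before we move on: each sub-layer (self-attention, feed-forward network) in each encoder has a residual connection around it, and is followed by a layer-normalization step.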

transformer_resideual_layer_norm

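If we visualize the vectors and the layer-normalization operation associated with self-attention, it looks like this: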

transformer_resideual_layer_norm_2
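The "add and normalize" step can be sketched in NumPy as follows. This simplified version omits the learned gain and bias that layer normalization usually carries, and the stand-in sub-layer is a hypothetical placeholder for self-attention or the feed-forward network.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each row (each position) to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection around the sub-layer, followed by layer norm.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 512))

# Hypothetical stand-in for a real sub-layer.
out = add_and_norm(X, lambda t: 0.5 * t)
```

The same wrapper would be applied around every sub-layer, in the encoders and in the decoders.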

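The same applies to the sub-layers of the decoder. A Transformer made of 2 stacked encoders and decoders would look something like this: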

transformer_resideual_layer_norm_3

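Now that we have covered most of the concepts on the encoder side, we largely know how the components of the decoder work as well. Let's look at how they work together.

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. Each decoder uses these in its "encoder-decoder attention" layer, which helps the decoder focus on the appropriate places in the input sequence: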

transformer_decoding_1

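After the encoding phase comes the decoding phase. Each step of the decoding phase outputs one element of the output sequence (in this case, a word of the translated sentence).

The following steps repeat this process until a special symbol is reached, indicating that the decoder has finished its output. The output of each step is fed to the bottom decoder at the next time step, and the decoders pass their results upward just as the encoders did. As with the encoder inputs, we embed the decoder inputs and add positional encoding to indicate the position of each word.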

transformer_decoding_2

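The self-attention layers in the decoder work slightly differently from those in the encoder. In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step of the self-attention calculation.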
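A minimal NumPy sketch of this masking, using a random score matrix as a placeholder for the real Q Kᵀ / √d_k scores; the function name `masked_softmax` is illustrative.

```python
import numpy as np

def masked_softmax(scores):
    n = scores.shape[0]
    # future[i, j] is True when position j comes after position i.
    future = np.arange(n)[None, :] > np.arange(n)[:, None]
    # Future positions get -inf, so softmax assigns them zero weight.
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)
    return e / e.sum(axis=-1, keepdims=True)

# Placeholder attention scores for a 4-word output sequence.
scores = np.random.default_rng(0).normal(size=(4, 4))
weights = masked_softmax(scores)
```

Row i of `weights` is non-zero only in columns 0 to i, so each position attends only to itself and to earlier positions.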

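The "encoder-decoder attention" layer works just like multi-head self-attention, except that it builds its query matrix from the layer below it, and takes the key and value matrices from the output of the encoder stack.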
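A single-head NumPy sketch of that difference: the queries come from the decoder side, while the keys and values come from the encoder output. The inputs and weights are random placeholders, and the function name is illustrative.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def encoder_decoder_attention(dec_x, enc_out, W_Q, W_K, W_V):
    Q = dec_x @ W_Q      # queries from the decoder layer below
    K = enc_out @ W_K    # keys from the encoder stack output
    V = enc_out @ W_V    # values from the encoder stack output
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
enc_out = rng.normal(size=(2, 512))   # 2 input words
dec_x = rng.normal(size=(3, 512))     # 3 output words produced so far
W_Q, W_K, W_V = (rng.normal(size=(512, 64)) * 0.05 for _ in range(3))

Z, weights = encoder_decoder_attention(dec_x, enc_out, W_Q, W_K, W_V)
```

The weight matrix has one row per output position and one column per input position, which is how the decoder decides where in the input sentence to look.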

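The decoder stack outputs a vector of floating-point numbers. How do we turn that into a word? That is the job of the final linear layer, followed by a softmax layer.

The linear layer is a simple fully connected neural network that projects the vector produced by the decoder stack into a much larger vector called the logits vector.

Suppose the model knows 10 thousand unique words (its "output vocabulary") learned from the training dataset. The logits vector would then be 10 000 cells wide, with each cell holding the score of one unique word. This is how we interpret the output of the linear layer.

The softmax layer then turns these scores into probabilities (all positive, all adding up to 1). The cell with the highest probability is chosen, and the word associated with it becomes the output for this time step.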

transformer_decoder_output_softmax

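The figure starts at the bottom with the vector produced by the decoder stack, which is then turned into an output word.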
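A toy NumPy sketch of this final stage. The six-word vocabulary, the random decoder output, and the random weights `W` and `b` are all placeholders chosen for illustration; a real model would use a far larger vocabulary and trained weights.

```python
import numpy as np

# Toy output vocabulary (a real one has thousands of entries).
vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

rng = np.random.default_rng(0)
d_model = 512

decoder_output = rng.normal(size=d_model)            # placeholder vector
W = rng.normal(size=(d_model, len(vocab))) * 0.05    # placeholder linear layer
b = np.zeros(len(vocab))

# Linear layer: one score (logit) per vocabulary word.
logits = decoder_output @ W + b

# Softmax: turn the scores into probabilities that sum to 1.
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

# Pick the word whose cell has the highest probability.
word = vocab[int(np.argmax(probs))]
```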

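Now that we have covered the entire forward pass through a trained Transformer, it is useful to look at the intuition behind training the model.

During training, an untrained model goes through exactly the same forward pass. But because we train it on a labeled dataset, we can compare its output with the actual correct output.

To visualize this, suppose the output vocabulary contains only 6 words ("a", "am", "i", "thanks", "student" and "<eos>", short for "end of sentence").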

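vocabulary

Once the output vocabulary is defined, we can use a vector of the same width to represent each word in it (this is known as one-hot encoding). For example, the word "am" is represented by the following vector: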

one-hot-vocabulary-example

Example: one-hot encoding of the output vocabulary.
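The one-hot encoding of the six-word vocabulary can be written in a few lines of Python. The helper name `one_hot` is chosen only for this illustration.

```python
vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

def one_hot(word, vocab):
    # A vector as wide as the vocabulary, with a single 1 at the word's index.
    vec = [0.0] * len(vocab)
    vec[vocab.index(word)] = 1.0
    return vec

am_vector = one_hot("am", vocab)   # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```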

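Next, let's discuss the model's loss function, the quantity we optimize during training in order to arrive at a trained and, we hope, very accurate model.

Suppose we are training the model, and this is the very first training step. We train on a simple example: translating "merci" into "thanks".

This means we want the output to be a probability distribution that points to the word "thanks". But since the model is not yet trained, that is unlikely to happen at this stage.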

transformer_logits_output_and_label

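Since the model's parameters (weights) are initialized randomly, the untrained model produces a probability distribution with arbitrary values for each cell/word. We can compare it with the correct output and then adjust all the model's weights using backpropagation so that the output moves closer to the desired one.

How do we compare two probability distributions? We simply subtract one from the other. For more detail, see cross-entropy and Kullback-Leibler divergence.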
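As a small worked example of the two measures named above, here is a pure-Python sketch that compares a target distribution for "thanks" with two predictions. The probability values are made up for illustration, and `eps` is a small constant added only to avoid taking the logarithm of zero.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def kl_divergence(target, predicted, eps=1e-12):
    # Terms with t == 0 contribute nothing, so they are skipped.
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted) if t > 0)

# Vocabulary order: a, am, i, thanks, student, <eos>
target = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]             # one-hot for "thanks"

# Made-up model outputs, before and after training.
untrained = [0.20, 0.20, 0.10, 0.20, 0.20, 0.10]
trained = [0.01, 0.02, 0.01, 0.93, 0.02, 0.01]

loss_untrained = cross_entropy(target, untrained)
loss_trained = cross_entropy(target, trained)
```

The loss falls as the predicted distribution moves toward the target. For a one-hot target the cross-entropy and the Kullback-Leibler divergence are numerically the same, because the target's own entropy is zero.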

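Note, however, that this is an oversimplified example. More realistically, we would use a sentence longer than one word, for example the input "je suis étudiant" with the expected output "I am a student". What this really means is that we want the model to output a sequence of probability distributions in which:

each probability distribution is a vector whose width is the vocabulary size (6 in our example, but more realistically something like 3000 or 10000);
the first probability distribution has its highest probability at the cell associated with the word "i";
the second probability distribution has its highest probability at the cell associated with the word "am";
and so on, until the last distribution points to the end-of-sentence symbol, which also has a cell of its own in the vocabulary.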

output_target_probability_distributions

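After training the model for long enough on a large enough dataset, we would hope that the resulting probability distributions look like this: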

output_trained_model_probability_distributions

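We hope that after training the model outputs the translation we expect. Of course, this proves little if the phrase was part of the training dataset (see: cross-validation). Note that every position receives a little probability even when it is unlikely to be the output at that time step; this is a very useful property of softmax that helps the training process.

Since the model produces its outputs one at a time, we can assume that it picks the word with the highest probability from each distribution and discards the rest. This is one way to do it, called greedy decoding. Another way is to keep, say, the 2 most probable words (for example, "I" and "a") and then run the model twice at the next step: once assuming the first output position was "I", and once assuming it was "a". Whichever version produces less error, taking both positions into account, is kept. We repeat this for positions #2, #3 and so on. This method is called beam search. In our example, the beam size (beam_size) was two (that is, two partial hypotheses are kept in memory at any time, for instance after positions #1 and #2), and the number of returned hypotheses (top_beams) was also two (that is, we return two translations). Both are hyperparameters you can experiment with.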
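The difference between the two strategies can be seen on a toy example in plain Python. The probability table below is entirely made up and stands in for a real model's next-word distribution; it is arranged so that the greedy choice at the first step leads to a less probable sentence overall. This sketch returns only the single best hypothesis rather than the top few.

```python
import math

# Made-up next-word probabilities, keyed by the words produced so far.
TABLE = {
    (): {"I": 0.6, "a": 0.4},
    ("I",): {"am": 0.5, "student": 0.5},
    ("a",): {"student": 0.9, "am": 0.1},
}

def next_probs(prefix):
    # Any prefix not listed in the table ends the sentence.
    return TABLE.get(tuple(prefix), {"<eos>": 1.0})

def greedy_decode(max_len=5):
    out = []
    while len(out) < max_len:
        probs = next_probs(out)
        word = max(probs, key=probs.get)   # keep only the single best word
        out.append(word)
        if word == "<eos>":
            break
    return out

def beam_search(beam_size=2, max_len=5):
    beams = [([], 0.0)]                    # (words so far, log-probability)
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            if words and words[-1] == "<eos>":
                candidates.append((words, score))   # finished hypothesis
                continue
            for word, p in next_probs(words).items():
                candidates.append((words + [word], score + math.log(p)))
        # Keep only the beam_size most probable partial hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(words[-1] == "<eos>" for words, _ in beams):
            break
    return beams[0][0]

greedy_result = greedy_decode()
beam_result = beam_search(beam_size=2)
```

In this toy table, greedy decoding commits to "I" (probability 0.6) and ends with an overall probability of 0.3, while beam search keeps "a" alive and finds "a student" with probability 0.36.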

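We hope this article has been a useful starting point for getting to know the main concepts of the Transformer.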

Authors

Original by Jay Alammar
Translation: Ekaterina Smirnova
Editing and layout: Shkarin Sergey
