Transformer in pictures

In a previous article, we examined the attention mechanism, an extremely common technique in modern deep learning models that helped improve the quality of neural machine translation applications. In this article, we will look at the Transformer, a model that uses attention to speed up training. Moreover, on a number of tasks, the Transformer outperforms Google's neural machine translation model. Its biggest advantage, however, is how well it lends itself to parallelization. Google Cloud even recommends the Transformer as a reference model for working with Cloud TPUs. Let's try to figure out what the model consists of and how it works.


The Transformer model was first proposed in the paper Attention is All You Need. A TensorFlow implementation is available as part of the Tensor2Tensor package; in addition, a group of NLP researchers from Harvard created an annotated guide to the paper with a PyTorch implementation. In this article, we will try to lay out the main ideas and concepts as simply and consistently as possible, which, we hope, will help people without deep knowledge of the subject area understand this model.


A High-Level Overview


Let's look at the model as a kind of black box. In a machine translation application, it takes a sentence in one language as input and outputs its translation in another language.


the_transformer_3


Looking inside this black box, we see an encoding component, a decoding component, and connections between them.


The_transformer_encoders_decoders


The encoding component is a stack of encoders; the paper stacks six of them on top of each other (there is nothing magical about the number six, you can experiment with other configurations). The decoding component is a stack of the same number of decoders.


The_transformer_encoder_decoder_stack


All the encoders are identical in structure, although they do not share weights. Each one consists of two sub-layers:


Transformer_encoder


The encoder's input first flows through a self-attention layer, a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We will look at self-attention in more detail later in the article.


The output of the self-attention layer is fed to a feed-forward neural network. The exact same feed-forward network is applied independently to each position.


The decoder has both of these layers, but between them sits an attention layer that helps the decoder focus on the relevant parts of the input sentence (similar to the role attention plays in seq2seq models).


Transformer_decoder



, , /, , .


As is generally the case in NLP applications, we begin by turning each input word into a vector using an embedding algorithm (word embeddings).


embeddings


Each word is embedded into a vector of size 512. We will represent these vectors with simple boxes.


The embedding happens only in the bottom-most encoder. The abstraction common to all encoders is that they receive a list of vectors, each of size 512: in the bottom encoder these are the word embeddings, while in the other encoders it is the output of the encoder directly below. The size of this list is a hyperparameter we can set; essentially it is the length of the longest sentence in our training dataset.
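A minimal sketch of how a sentence might be turned into such a list of vectors. Only the embedding size of 512 comes from the article; the toy vocabulary, the random embedding table, and all names here are illustrative.

```python
import numpy as np

# Toy vocabulary and a randomly initialized embedding table (illustrative only).
vocab = {"je": 0, "suis": 1, "etudiant": 2}
d_model = 512  # embedding size used in the article
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

sentence = ["je", "suis", "etudiant"]
# Each word becomes one row: a list of vectors of size 512,
# which is exactly what the bottom-most encoder receives.
x = np.stack([embedding_table[vocab[w]] for w in sentence])
print(x.shape)  # (3, 512)
```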


After embedding the words of our input sequence, each of them flows through each of the two layers of the encoder.


encoder_with_tensors


Here we begin to see one key property of the Transformer: the word at each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer, but the feed-forward layer has no such dependencies, so the various paths can be executed in parallel as they pass through the feed-forward layer.


Next, we will switch to a shorter example sentence and look at what happens in each sub-layer of the encoder.


Now We're Encoding!


As we have already mentioned, an encoder receives a list of vectors as input. It processes this list by passing the vectors into a self-attention layer, then into a feed-forward neural network, and then sends the output upward to the next encoder.


encoder_with_tensors_2


The word at each position goes through the self-attention process. Then each of them passes through a feed-forward neural network: the exact same network, with each vector flowing through it separately.
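The article does not go into the internals of this feed-forward network; in the original paper it is two linear transformations with a ReLU between them, applied to each position independently. A minimal NumPy sketch, assuming the paper's default sizes (512 for the model dimension, 2048 for the inner layer); the weights here are random placeholders.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # The same two-layer network applied to every position's vector independently:
    # FFN(x) = max(0, x W1 + b1) W2 + b2 (a ReLU between two linear transformations).
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(512, 2048)) * 0.01, np.zeros(2048)
W2, b2 = rng.normal(size=(2048, 512)) * 0.01, np.zeros(512)

x = rng.normal(size=(3, 512))  # 3 positions, one 512-dimensional vector each
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (3, 512)
```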



Self-Attention at a High Level

Don't be fooled by the term "self-attention" being thrown around as if it were a concept everyone should already be familiar with; it only became widely known with the paper "Attention is All You Need". Let's break down how it works.


Suppose the following sentence is the input sentence we want to translate:


ā€The animal didn't cross the street because it was too tiredā€

What does "it" refer to in this sentence? To the street or to the animal? For a human this is a simple question, but not for an algorithm.


When the model processes the word "it", self-attention allows it to associate "it" with "animal".


As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that help produce a better encoding of the current word.


If you are familiar with recurrent neural networks (RNN), think of how maintaining a hidden state allows an RNN to combine its representation of previously processed words/vectors with the one it is currently processing. Self-attention is the method the Transformer uses to bake an "understanding" of other relevant words into the word currently being processed.


transformer_self-attention_visualization


As we encode the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism focuses on "The animal" and bakes a part of its representation into the encoding of "it".


Be sure to check out the Tensor2Tensor notebook, where you can load a Transformer model and examine it using an interactive visualization.



Self-Attention in Detail

Let's first look at how self-attention is calculated using vectors, and then at how it is actually implemented, using matrices.


The first step in calculating self-attention is to create three vectors from each of the encoder's input vectors (in this case, from the embedding of each word): a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that are learned during training.


Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and the encoder input/output vectors have a dimensionality of 512. They do not have to be smaller; this is an architectural choice that keeps the computation of multi-head attention (mostly) constant.


transformer_self_attention_vectors


Multiplying x1 by the weight matrix WQ produces q1, the "query" vector associated with that word. In the end we obtain a "query", a "key", and a "value" projection for each word of the input sentence.


What are these "query", "key", and "value" vectors?


They are abstractions that are useful for calculating and thinking about attention. Once you read below how attention is actually computed, you will know practically everything you need to know about the role each of these vectors plays.


The second step in calculating self-attention is to compute a score. Suppose we are calculating self-attention for the first word in our example, "Thinking". We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode the word at a given position.


The score is calculated as the dot product of the query vector with the key vector of the word being scored. So, if we are computing self-attention for the word in position #1, the first score will be the dot product of q1 and k1, and the second score the dot product of q1 and k2.


transformer_self_attention_score


The third and fourth steps are to divide the scores by 8 (the square root of the key vector dimensionality used in the paper, 64; this leads to more stable gradients, and while other values are possible, this is the default) and then pass the result through a softmax operation. Softmax normalizes the scores so that they are all positive and sum to 1.


self-attention_softmax


The softmax score determines how strongly each word will be expressed at this position. Obviously, the word at this position itself will have the highest softmax score, but sometimes it is useful to attend to another word that is relevant to the current one.


The fifth step is to multiply each value vector by its softmax score (in preparation for summing them up). The intuition here is to keep intact the values of the words we want to focus on and to drown out irrelevant words (by multiplying them by tiny numbers such as 0.001).


The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).


self-attention-output


This concludes the self-attention calculation. The resulting vector is the one we pass on to the feed-forward neural network. In the actual implementation, however, this calculation is done in matrix form for faster processing. Now that we have seen the intuition of the calculation at the level of individual words, let's look at the matrix version.



Matrix Calculation of Self-Attention

The first step is to compute the Query, Key, and Value matrices. We do this by packing our embeddings into a matrix X and multiplying it by the weight matrices we have trained (WQ, WK, WV).


self-attention-matrix-calculation


Each row of the matrix X corresponds to a word of the input sentence. Here we again see the difference in size between the embedding vector (512, or 4 boxes in the figure) and the q/k/v vectors (64, or 3 boxes in the figure).


Finally, since we are dealing with matrices, we can condense steps 2 through 6 into a single formula to compute the output of the self-attention layer.


self-attention-matrix-calculation-2


The self-attention calculation in matrix form.
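The same condensed calculation, softmax(QK^T / sqrt(d_k)) V, written as a minimal NumPy sketch. The sizes (512 for the embeddings, 64 for q/k/v) come from the article; the random matrices and names are illustrative rather than an actual trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Condensed matrix form of steps 2-6: softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # step 1: project into Q/K/V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # steps 2-3: dot products, scaled
    weights = softmax(scores, axis=-1)           # step 4: softmax over each row
    return weights @ V                           # steps 5-6: weighted sum of values

# Toy sizes: 2 words, embedding size 512, q/k/v size 64 (as in the article).
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 512))
W_q, W_k, W_v = (rng.normal(size=(512, 64)) * 0.01 for _ in range(3))
Z = self_attention(X, W_q, W_k, W_v)
print(Z.shape)  # (2, 64): one output vector per word
```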



Multi-Head Attention

The paper further refined the self-attention layer by adding a mechanism called multi-head attention. It improves the performance of the attention layer in two ways:


  1. It expands the model's ability to focus on different positions. Yes, in the example above z1 contains a little bit of every other encoding, but it can be dominated by the word itself. When translating a sentence like "The animal didn't cross the street because it was too tired", it is useful to know which word "it" refers to.
  2. It gives the attention layer several "representation subspaces". As we will see below, with multi-head attention we have not one but several sets of Query/Key/Value weight matrices (the Transformer uses 8 attention "heads", so we end up with 8 sets for each encoder/decoder). Each of these sets is initialized randomly. After training, each set is used to project the input embeddings (or the vectors coming from the lower encoders/decoders) into a different representation subspace.

transformer_attention_heads_qkv


With multi-head attention we maintain separate WQ/WK/WV weight matrices for each "head", which gives different Q/K/V matrices. As before, we multiply X by the WQ/WK/WV matrices to obtain the Q/K/V matrices.


If we perform the same self-attention calculation described above 8 times with different weight matrices, we end up with 8 different Z matrices.


transformer_attention_heads_z


This leaves us with a small problem. The feed-forward layer does not expect 8 matrices; it expects a single matrix (one vector per word). So we need a way to condense these 8 matrices into one.


How do we do that? We concatenate the matrices and multiply the result by an additional weight matrix WO.


transformer_attention_heads_weight_matrix_o


That is essentially all there is to multi-head self-attention. There are quite a few matrices involved, so let's try to put them all in one picture and look at them in one place.
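A minimal sketch of the whole multi-head procedure: run attention once per head with that head's own W_q/W_k/W_v, concatenate the resulting Z matrices, and project with W_o. The head count (8), the q/k/v size (64), and the W_o projection come from the article; the random weights and function names are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, heads, W_o):
    # One Z matrix per head, each computed with that head's own W_q/W_k/W_v...
    Z_per_head = [attention(X, *head) for head in heads]
    # ...then concatenated and projected back to the model size with W_o.
    return np.concatenate(Z_per_head, axis=-1) @ W_o

rng = np.random.default_rng(0)
heads = [tuple(rng.normal(size=(512, 64)) * 0.01 for _ in range(3)) for _ in range(8)]
W_o = rng.normal(size=(8 * 64, 512)) * 0.01
X = rng.normal(size=(2, 512))
print(multi_head_attention(X, heads, W_o).shape)  # (2, 512)
```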


transformer_multi-headed_self-attention-recap


Now that we have touched on attention heads, let's revisit our earlier example to see where the different attention "heads" focus when encoding the word "it" in our example sentence:


transformer_self-attention_visualization_2


When we encode the word "it", one attention "head" focuses mostly on "the animal", while another focuses on "tired". In a sense, the model's representation of the word "it" incorporates parts of the representations of both "animal" and "tired".


If we add all the attention "heads" to the picture, however, it becomes harder to interpret:


transformer_self-attention_visualization_3



Representing the Order of the Sequence Using Positional Encoding

One thing that is missing from the model as described so far is a way to account for the order of the words in the input sequence.


To address this, the Transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns and that helps it determine the position of each word, or the distance between different words in the sequence. The intuition is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they are projected into Q/K/V vectors and during dot-product attention.


transformer_positional_encoding_vectors


To give the model a sense of word order, we add positional encoding vectors whose values follow a specific pattern.


If we assume that the embedding has a dimensionality of 4, the actual positional encodings would look like this:


transformer_positional_encoding_example


What might this pattern look like?


In the following figure, each row corresponds to the positional encoding of one vector: the first row is the vector we would add to the embedding of the first word in the input sequence, the second row to the second word, and so on. Each row contains 512 values, each between -1 and 1. We have color-coded them so the pattern is visible.


transformer_positional_encoding_large_example


A real example of positional encodings for 20 words (rows) with an embedding size of 512 (columns). You can see that the figure appears split in half down the middle: the values of the left half are generated by one function (using sine), and the right half by another (using cosine). They are then concatenated to form each positional encoding vector.


The formula for positional encoding is described in the paper (section 3.5). The code for generating positional encodings can be found in get_timing_signal_1d(). This is not the only possible method of positional encoding, but it has the advantage of scaling to unseen sequence lengths (for example, if the trained model is asked to translate a sentence longer than any in its training set).
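A minimal sketch of the formula from section 3.5 of the paper: even indices use sine, odd indices use cosine, each at a different frequency. Note that the Tensor2Tensor get_timing_signal_1d() variant shown in the figure concatenates the sine half and the cosine half instead of interleaving them; the pattern is otherwise the same.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # Sinusoidal positional encodings (assumes an even d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]                     # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2.0 * i / d_model)   # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(max_len=20, d_model=512)
print(pe.shape)  # (20, 512): one row per position, added to the word embedding
```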



Residual Connections

One detail of the encoder architecture worth mentioning before moving on is that each sub-layer in each encoder (self-attention and the feed-forward network) has a residual connection around it and is followed by a layer-normalization step.


transformer_resideual_layer_norm


If we visualize the vectors and the layer-normalization operation associated with self-attention, it looks like this:


transformer_resideual_layer_norm_2


The same applies to the sub-layers of the decoder. If we imagine a Transformer consisting of 2 stacked encoders and decoders, it would look something like this:


transformer_resideual_layer_norm_3
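A minimal sketch of the "residual connection plus layer normalization" wrapper around a sub-layer, i.e. LayerNorm(x + Sublayer(x)). The trainable scale and shift parameters of layer normalization are omitted for brevity, and the dummy sub-layer is purely illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance
    # (trainable gain/bias parameters omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection around the sub-layer, followed by layer normalization.
    # Applied both to the self-attention sub-layer and to the feed-forward one.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 512))
W = rng.normal(size=(512, 512)) * 0.01
out = add_and_norm(x, lambda v: v @ W)  # dummy sub-layer: a single linear map
print(out.shape)  # (3, 512)
```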



The Decoder Side

Now that we have covered most of the concepts on the encoder side, we basically know how the components of the decoder work as well. But let's look at how they work together.


The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. They are used by each decoder in its "encoder-decoder attention" layer, which helps the decoder focus on the appropriate places in the input sequence:


transformer_decoding_1


After the encoding phase is finished, the decoding phase begins. Each step of the decoding phase outputs one element of the output sequence (in this case, a word of the English translation).


The following steps repeat the process until a special symbol is produced, indicating that the Transformer decoder has finished its output. The output of each step is fed to the bottom decoder at the next time step, and the decoders propagate their results upward just as the encoders did. And just as with the encoder inputs, we embed the decoder inputs and add positional encodings to them to indicate the position of each word.


transformer_decoding_2


The self-attention layers in the decoder work somewhat differently than in the encoder:


In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
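A minimal sketch of that masking step, applied to the score matrix before the softmax; the surrounding attention code is the same as in the earlier sketches.

```python
import numpy as np

def masked_scores(scores):
    # Set every score that corresponds to a future position to -inf, so that
    # after the softmax those positions receive zero attention weight.
    seq_len = scores.shape[-1]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(future, -np.inf, scores)

scores = np.arange(9, dtype=float).reshape(3, 3)
print(masked_scores(scores))
# [[ 0. -inf -inf]
#  [ 3.   4. -inf]
#  [ 6.   7.   8.]]
```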


The "encoder-decoder attention" layer works just like multi-head self-attention, except that it creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack.



The Final Linear and Softmax Layer

The decoder stack outputs a vector of floating-point numbers. How do we turn it into a word? That is the job of the final linear layer, followed by a softmax layer.


The linear layer is a simple fully connected neural network that projects the vector produced by the decoder stack into a much, much larger vector called the logits vector.


Suppose our model knows 10,000 unique English words (its "output vocabulary") learned from the training dataset. Then the logits vector will be 10,000 cells wide, each cell corresponding to the score of one unique word. This is how we interpret the output of the model after the linear layer.


The softmax layer then turns these scores into probabilities (all positive, summing to 1). The cell with the highest probability is selected, and the word associated with it is produced as the output for this time step.
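A minimal sketch of this last step, with the 10,000-word vocabulary size from the article; the random decoder output and weight matrix stand in for a real trained model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
decoder_output = rng.normal(size=512)              # vector produced by the decoder stack
W_linear = rng.normal(size=(512, 10_000)) * 0.01   # final linear projection

logits = decoder_output @ W_linear                 # logits vector: one score per vocabulary word
probs = softmax(logits)                            # all positive, summing to 1
predicted_word_id = int(np.argmax(probs))          # the most probable word for this time step
print(logits.shape, predicted_word_id)
```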


transformer_decoder_output_softmax


The figure starts at the bottom with the vector produced as the output of the decoder stack, which is then turned into an output word.



Recap of Training

Now that we have covered the entire forward pass through a trained Transformer, it is useful to glance at the intuition behind training the model.


During training, an untrained model goes through exactly the same forward pass. But since we train it on a labeled dataset, we can compare its output with the actual correct output.


To visualize this, let's assume our output vocabulary contains only 6 words: "a", "am", "i", "thanks", "student", and "<eos>" (short for "end of sentence").


vocabulary


The model's output vocabulary is created during the preprocessing phase, before training even begins.


Once we have defined the output vocabulary, we can use a vector of the same width to denote each word in the vocabulary (this is known as one-hot encoding). For example, the word "am" can be denoted with the following vector:


one-hot-vocabulary-example


Example: one-hot encoding of our output vocabulary.


After this recap, let's discuss the model's loss function, the metric we optimize during training in order to end up with a trained and, hopefully, remarkably accurate model.



The Loss Function

Suppose we are training our model, and it is the very first step of training, on a simple example: translating "merci" into "thanks".


This means that we want the output to be a probability distribution pointing to the word "thanks". But since the model is not yet trained, this is unlikely to happen just yet.


transformer_logits_output_and_label


Since the model's parameters (weights) are initialized randomly, the untrained model produces a probability distribution with arbitrary values for each cell/word. We can compare it with the actual output and then adjust all of the model's weights using backpropagation to bring the output closer to the desired one.


How do we compare two probability distributions? We simply subtract one from the other; for more details, see cross-entropy and the Kullback-Leibler divergence.
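A minimal sketch of that comparison using cross-entropy, with the 6-word toy vocabulary and the one-hot target for "thanks" from this example. The "predicted" distribution is made up to stand in for what an untrained model might output.

```python
import numpy as np

# Toy vocabulary from the article and a one-hot target for the word "thanks".
vocab = ["a", "am", "i", "thanks", "student", "<eos>"]
target = np.zeros(len(vocab))
target[vocab.index("thanks")] = 1.0

# An untrained model produces an essentially arbitrary distribution (illustrative values).
predicted = np.array([0.2, 0.2, 0.15, 0.1, 0.15, 0.2])

# Cross-entropy between the target and the predicted distribution;
# training adjusts the weights so that this number goes down.
cross_entropy = -np.sum(target * np.log(predicted))
print(cross_entropy)  # -log(0.1), about 2.3
```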


Note, however, that this is an oversimplified example. More realistically, we will use a sentence longer than one word; for example, the input "je suis étudiant" with the expected output "I am a student". What this really means is that we want the model to successively output probability distributions in which:


  • each probability distribution is represented by a vector whose width equals the size of the vocabulary (6 in our toy example, but more realistically a number like 3,000 or 10,000);
  • the first probability distribution has its highest probability in the cell associated with the word "i";
  • the second probability distribution has its highest probability in the cell associated with the word "am";
  • and so on, until the fifth distribution indicates the "<end of sentence>" symbol, which also has its own cell in the vocabulary.

output_target_probability_distributions


After training the model long enough on a large enough dataset, we would like the produced probability distributions to look like this:


output_trained_model_probability_distributions


We hope that after training the model will output the correct translation we expect. Of course, this is no real indication if the phrase was part of the training dataset (see: cross-validation). Note that every position receives a little probability even if it is unlikely to be the output at that time step; this is a very useful property of softmax that helps the training process.


Since the model produces its outputs one at a time, we can assume that it simply selects the word with the highest probability from the distribution and discards the rest. That is one way of doing it (called greedy decoding). Another way is to hold on to, say, the top 2 words (for example, "I" and "a"), then at the next step run the model twice: once assuming the first output position was the word "I", and once assuming it was the word "a"; whichever version produces the smaller error over positions #1 and #2 is kept. We then repeat this for positions #2 and #3, and so on. This method is called "beam search"; in our example the beam size (beam_size) was 2 (meaning that two partial hypotheses, i.e. unfinished translations, are kept in memory at all times), and the number of returned beams (top_beams) was also 2 (meaning that two translations will be returned). Both are hyperparameters you can experiment with.
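A minimal sketch of the simpler of the two strategies, greedy decoding. The step_fn callable and the fake model here are hypothetical placeholders for a real trained Transformer; beam search would instead keep beam_size partial hypotheses alive at every step.

```python
import numpy as np

def greedy_decode(step_fn, max_len, eos_id):
    # At every step take the single most probable word and feed it back in.
    output = []
    for _ in range(max_len):
        probs = step_fn(output)          # distribution over the vocabulary for this step
        next_id = int(np.argmax(probs))  # keep only the most probable word
        output.append(next_id)
        if next_id == eos_id:            # stop once the end-of-sentence symbol appears
            break
    return output

# Illustrative stand-in for a trained model: it always prefers word id 5 (<eos>).
fake_model = lambda prefix: np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])
print(greedy_decode(fake_model, max_len=10, eos_id=5))  # [5]
```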



We hope this has been a useful starting point for getting acquainted with the main concepts of the Transformer. If you want to dig deeper, read the Attention is All You Need paper itself and experiment with the Tensor2Tensor and annotated PyTorch implementations mentioned above.


