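In the previous post we looked at attention, a method that is now ubiquitous in modern deep learning models and that improved the performance of neural machine translation applications. In this post we look at the Transformer, a model that uses attention to speed up training. The Transformer also outperforms the Google Neural Machine Translation model on a number of tasks. Its biggest advantage, though, is how well it lends itself to parallelization. Google Cloud even recommends the Transformer as a model to use on Cloud TPU. Let's try to work out what the model consists of and what each part does.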

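The Transformer was first proposed in the paper "Attention is All You Need". A TensorFlow implementation is available in the Tensor2Tensor package, and a group of NLP researchers at Harvard wrote an annotated guide to the paper with a PyTorch implementation. In this guide we try to lay out the main ideas and concepts as simply and consistently as possible, to help readers without deep knowledge of the subject understand the model.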

A High-Level Look

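Let's start by treating the model as a black box. In a machine translation application, it takes a sentence in one language as input and outputs its translation in another language.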

the_transformer_3

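Opening the box, we find an encoding component, a decoding component, and connections between them.

The_transformer_encoders_decoders

The encoding component is a stack of encoders; the paper stacks 6 of them (nothing is special about the number 6, and other arrangements can be tried). The decoding component is a stack of the same number of decoders.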

The_transformer_encoder_decoder_stack

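The encoders are all identical in structure. Each one is broken down into two sub-layers:

Transformer_encoder

The encoder's input first flows through a self-attention layer, which lets the encoder look at the other words in the input sentence while it encodes a specific word.

The output of the self-attention layer is fed to a feed-forward neural network.

The decoder has both of those layers, and between them sits an attention layer that helps the decoder focus on the relevant parts of the input sentence (similar to what attention does in seq2seq models).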

Transformer_decoder

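Now that we have seen the main components of the model, let's look at the vectors/tensors that flow between them.

As in NLP applications generally, we begin by turning each input word into a vector using an embedding algorithm (word embeddings).

Each word is embedded into a vector of size 512.

The embedding happens only in the bottom-most encoder. What all encoders have in common is that they receive a list of vectors, each of size 512 (for the bottom encoder these are the word embeddings; for the others, the output of the encoder directly below). The length of this list is a hyperparameter, typically the length of the longest sentence in the training set.

After embedding, each word flows through both sub-layers of the encoder.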

encoder_with_tensors


encoder_with_tensors_2

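The term "self-attention" may not be familiar to everyone; it is the central concept of the "Attention is All You Need" paper. Let's look at how it works.

Say the following sentence is the input we want to translate:

"The animal didn't cross the street because it was too tired"

What does "it" refer to in this sentence? The street or the animal? The question is simple for a human, but not for an algorithm.

When the model processes the word "it", self-attention lets it associate "it" with "animal".

As the model processes each word (each position in the input sequence), self-attention lets it look at other positions in the sequence for clues that lead to a better encoding of that word.

If you are familiar with recurrent neural networks (RNNs), think of how an RNN's hidden state lets it combine its representation of previous words/vectors with the one it is currently processing. Self-attention is the method the Transformer uses to bake its "understanding" of other relevant words into the word it is currently processing.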

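transformer_self-attention_visualization

As we encode the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism focuses on "The animal" and bakes part of its representation into the encoding of "it".

The Tensor2Tensor notebook lets you load a Transformer model and explore it with this interactive visualization.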

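The first step in calculating self-attention is to create three vectors from each of the encoder's input vectors (in this case, the embedding of each word): a Query vector, a Key vector, and a Value vector. They are produced by multiplying the embedding by three matrices learned during training.

Note that these new vectors are smaller than the embedding vector. Their dimensionality is 64, while the embeddings and the encoder input/output vectors have dimensionality 512. They do not have to be smaller; this is an architectural choice that keeps the cost of multi-head attention roughly constant.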

transformer_self_attention_vectors

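Multiplying x1 by the WQ weight matrix produces q1, the "query" vector for that word. In the same way we end up with a "query", a "key", and a "value" projection of every word in the input sentence.

What are the "query", "key", and "value" vectors?

They are abstractions that are useful for calculating and thinking about attention. Their roles become clear once you see how attention is computed.

The second step is to calculate a score. Say we are calculating self-attention for the first word in this example, "Thinking". We need to score every word of the input sentence against this word. The score determines how much focus to place on other parts of the sentence as we encode the word at a given position.

The score is the dot product of the query vector with the key vector of the word being scored. So when we process the word in position #1, the first score is the dot product of q1 and k1, and the second is the dot product of q1 and k2.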

transformer_self_attention_score

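The third and fourth steps are to divide the scores by 8 (the square root of the key vector dimension used in the paper, 64; this gives more stable gradients, and other values are possible), then pass the result through a softmax operation. Softmax normalizes the scores so that they are all positive and add up to 1.

self-attention_softmax

The softmax score determines how strongly each word is expressed at this position. The word at this position will usually have the highest softmax score, but it is sometimes useful to attend to another word that is relevant to the current one.

The fifth step is to multiply each value vector by its softmax score (in preparation for summing them). The intuition: keep the values of the words we want to focus on intact, and drown out irrelevant words (by multiplying them by tiny numbers such as 0.001).

The sixth step is to sum the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).

self-attention-output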
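The six steps for a single word can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the embeddings and the weight matrices are random stand-ins (a trained model would supply real ones), and the variable names simply mirror the x1, q1, k1, k2, z1 notation used in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64            # sizes quoted in the text

# Two input embeddings (x1, x2) and random stand-ins for the trained matrices.
X = rng.normal(size=(2, d_model))
WQ, WK, WV = (rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(3))

q1 = X[0] @ WQ                    # step 1: query for the word in position #1
K = X @ WK                        # step 1: k1, k2 stacked as rows
V = X @ WV                        # step 1: v1, v2 stacked as rows

scores = K @ q1                   # step 2: q1.k1 and q1.k2
scaled = scores / np.sqrt(d_k)    # step 3: divide by 8 = sqrt(64)
weights = np.exp(scaled - scaled.max())
weights /= weights.sum()          # step 4: softmax, weights sum to 1
z1 = weights @ V                  # steps 5-6: weight the value vectors and sum them
```

Here `z1` is the self-attention output for the first word; repeating the same computation with q2 would give the output for the second word.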

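That concludes the self-attention calculation. The resulting vector is what we send on to the feed-forward neural network. In the actual implementation, the calculation is done in matrix form for speed, so let's look at that next.

The first step is to calculate the Query, Key, and Value matrices. We pack the embeddings into a matrix X and multiply it by the trained weight matrices (WQ, WK, WV).

self-attention-matrix-calculation

Each row of X corresponds to a word in the input sentence. Again the embedding vector (512, shown as 4 boxes in the figure) is larger than the q/k/v vectors (64, shown as 3 boxes).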

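Finally, since we are working with matrices, steps 2-6 can be condensed into a single formula for the output of the self-attention layer.

self-attention-matrix-calculation-2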
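For reference, the matrix form of self-attention is usually written as below. The symbol d_k (the key dimension, 64 in this article) follows the paper's notation and does not appear in the text above, so treat it as an assumption of this note.

```latex
Q = X W^{Q}, \qquad K = X W^{K}, \qquad V = X W^{V}

Z = \operatorname{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V
```

The softmax is applied row by row, so row i of Z is the output for the word in position i.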

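The paper refines the self-attention layer further with a mechanism called multi-head attention. It improves the attention layer in two ways:

It expands the model's ability to focus on different positions. In the example above, z1 contains a little of every other encoding, but it can be dominated by the word itself. When translating a sentence like "The animal didn't cross the street because it was too tired", it helps to know which word "it" refers to.
It gives the attention layer multiple "representation subspaces". With multi-head attention we have not one but several sets of Query/Key/Value weight matrices (the Transformer uses 8 attention "heads", so each encoder/decoder ends up with 8 sets). Each set is randomly initialized. After training, each set projects the input embeddings (or the vectors from lower encoders/decoders) into a different representation subspace.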

transformer_attention_heads_qkv

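With multi-head attention we maintain separate WQ/WK/WV weight matrices for each "head", which yields different Q/K/V matrices. As before, we multiply X by the WQ/WK/WV matrices to obtain the Q/K/V matrices.

Running the same self-attention calculation described above 8 times with different weight matrices gives us 8 different Z matrices.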

transformer_attention_heads_z

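This leaves us with a small problem. The feed-forward layer does not expect 8 matrices; it expects a single matrix (one vector per word), so the 8 Z matrices have to be condensed into one.

How? We concatenate the matrices and multiply the result by an additional weight matrix WO.

transformer_attention_heads_weight_matrix_o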
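The concatenate-then-project step can be sketched in NumPy like this. It is a minimal illustration, not the Tensor2Tensor implementation: the function name `multi_head_self_attention`, the stacked `(heads, d_model, d_k)` weight layout, and the random weights are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, WQ, WK, WV, WO):
    """X: (n, d_model); WQ/WK/WV: (heads, d_model, d_k); WO: (heads*d_k, d_model)."""
    d_k = WQ.shape[-1]
    Q, K, V = X @ WQ, X @ WK, X @ WV                    # each (heads, n, d_k)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # (heads, n, n)
    Z = softmax(scores) @ V                             # one Z matrix per head
    Z_concat = np.concatenate(list(Z), axis=-1)         # (n, heads*d_k)
    return Z_concat @ WO                                # single (n, d_model) matrix

rng = np.random.default_rng(0)
heads, n, d_model, d_k = 8, 2, 512, 64
X = rng.normal(size=(n, d_model))
WQ, WK, WV = (rng.normal(size=(heads, d_model, d_k)) * 0.02 for _ in range(3))
WO = rng.normal(size=(heads * d_k, d_model)) * 0.02
out = multi_head_self_attention(X, WQ, WK, WV, WO)
```

With 8 heads of width 64, the concatenated matrix is 512 wide, so `out` has one 512-dimensional vector per word, which is the shape the feed-forward layer expects.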

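That is essentially all there is to multi-head self-attention. There are quite a few matrices involved, so here they are in a single picture.

transformer_multi-headed_self-attention-recap

Now that we have covered attention "heads", let's revisit the earlier example and see where the different heads focus as we encode the word "it":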

transformer_self-attention_visualization_2

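As we encode the word "it", one attention head focuses most on "the animal" while another focuses on "tired". In a sense, the model's representation of "it" bakes in some of the representation of both "animal" and "tired".

If we add all the attention heads to the picture, however, things become harder to interpret.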

transformer_self-attention_visualization_3

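One thing missing from the model as described so far is a way to account for the order of the words in the input sequence. To address this, the Transformer adds a vector to each input embedding. Adding these vectors gives the embeddings meaningful distances from one another once they are projected into Q/K/V vectors and used in dot-product attention.

transformer_positional_encoding_vectors

To give the model a sense of word order, we add positional encoding vectors whose values follow a specific pattern.

If we assume an embedding dimensionality of 4, the positional encodings would look like this: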

transformer_positional_encoding_example

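In the figure below, each row is the positional encoding of one vector: the first row is the vector added to the embedding of the first word in the input sequence, and so on. Each row contains 512 values, each between -1 and 1. They are color-coded so the pattern is visible.

transformer_positional_encoding_large_example

A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). It appears split down the middle because the values in the left half are generated by one function (sine) and those in the right half by another (cosine). The two halves are concatenated to form each positional encoding vector.

The formula for positional encoding is given in the paper (section 3.5), and the code that generates it is in get_timing_signal_1d(). This is not the only possible method of positional encoding, but it has the advantage of scaling to sequence lengths not seen in training (for example, when the trained model is asked to translate a sentence longer than any in the training set).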
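The sine-left, cosine-right layout described above can be reproduced with a short NumPy sketch. It is written in the spirit of get_timing_signal_1d() but is not a copy of that function: the name `positional_encoding`, the default timescales of 1 and 10000, and the assumption of an even number of channels are choices made for this example.

```python
import numpy as np

def positional_encoding(length, channels, min_timescale=1.0, max_timescale=1.0e4):
    """Return a (length, channels) array: sines in the left half, cosines in the right."""
    positions = np.arange(length, dtype=np.float64)
    num_timescales = channels // 2                      # assumes channels is even
    log_increment = np.log(max_timescale / min_timescale) / max(num_timescales - 1, 1)
    inv_timescales = min_timescale * np.exp(-log_increment * np.arange(num_timescales))
    scaled_time = positions[:, None] * inv_timescales[None, :]
    return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)

pe = positional_encoding(20, 512)   # 20 words (rows), embedding size 512 (columns)
```

Each row of `pe` would be added to the embedding of the word at that position. Because the values are computed from the position rather than learned per position, the same function can be called with a larger `length` than was seen in training.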

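One detail of the encoder architecture worth mentioning before we move on: each sub-layer (self-attention, feed-forward network) in each encoder has a residual connection around it and is followed by a layer-normalization step.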

transformer_resideual_layer_norm
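The residual connection followed by layer normalization can be sketched as below. This is a simplified illustration: the helper names `layer_norm` and `add_and_norm` are made up for the example, the learned gain and bias that layer normalization normally carries are omitted, and the sub-layer is a placeholder function rather than real self-attention.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each vector (row) to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection around the sub-layer, then layer normalization.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 512))                     # 3 word vectors of size 512
out = add_and_norm(x, lambda v: 0.5 * v)          # placeholder sub-layer
```

In an encoder, `sublayer` would be the self-attention layer for the first application and the feed-forward network for the second.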

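Visualizing the vectors and the layer-normalization operation associated with self-attention, it looks like this:

transformer_resideual_layer_norm_2

The same applies to the sub-layers of the decoder. A Transformer with stacked encoders and decoders would look something like this: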

transformer_resideual_layer_norm_3

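Now that we have covered most of the concepts on the encoder side, we largely know how the components of the decoder work too. Let's look at how they work together.

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. Each decoder uses these in its "encoder-decoder attention" layer, which helps the decoder focus on the appropriate places in the input sequence: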

transformer_decoding_1

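After the encoding phase comes the decoding phase. Each decoding step outputs one element of the output sequence (in this case, the English translation).

The steps repeat until a special symbol signals that the decoder has finished its output. The output of each step is fed to the bottom decoder at the next time step, and the decoders pass their results upward just as the encoders did. As with the encoder inputs, we embed the decoder inputs and add positional encoding to indicate the position of each word.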

transformer_decoding_2

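The self-attention layers in the decoder work slightly differently from those in the encoder. In the decoder, the self-attention layer may only attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step of the self-attention calculation.

The "encoder-decoder attention" layer works just like multi-head self-attention, except that it builds its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack.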
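Masking future positions with -inf before the softmax can be illustrated with a small NumPy sketch. The function name `masked_softmax_weights` is hypothetical, and the input is assumed to be a square matrix of already scaled attention scores, with row i holding the scores of output position i against every position.

```python
import numpy as np

def masked_softmax_weights(scores):
    """Row-wise softmax in which position i cannot attend to positions after i."""
    n = scores.shape[-1]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
    masked = np.where(future, -np.inf, scores)           # future positions set to -inf
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)                                   # exp(-inf) is exactly 0
    return e / e.sum(axis=-1, keepdims=True)

w = masked_softmax_weights(np.zeros((3, 3)))
```

With equal scores, the first position attends only to itself, the second splits its attention between the first two positions, and the third spreads it over all three, so no weight ever lands on a later position.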

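The decoder stack outputs a vector of floats. How do we turn that into a word? That is the job of the final linear layer, followed by a softmax layer.

The linear layer is a simple fully connected neural network that projects the vector produced by the decoder stack into a much larger vector called a logits vector.

Suppose our model knows 10,000 unique English words (its "output vocabulary") learned from the training dataset. The logits vector is then 10,000 cells wide, each cell holding the score of one word. That is how we interpret the output of the linear layer.

The softmax layer then turns those scores into probabilities (all positive, adding up to 1). The cell with the highest probability is chosen, and the word associated with it is the output for this time step.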

transformer_decoder_output_softmax
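The path from the decoder output to a word index can be sketched as follows. The weights `W` and bias `b` are random stand-ins for a trained linear layer, and the decoder output is a random vector, so the chosen index is meaningless; the sketch only shows the shapes and the order of operations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 10_000

W = rng.normal(size=(d_model, vocab_size)) * 0.02   # stand-in for the linear layer weights
b = np.zeros(vocab_size)                            # stand-in for the linear layer bias

decoder_output = rng.normal(size=d_model)           # vector produced by the decoder stack
logits = decoder_output @ W + b                     # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                # softmax: positive, sums to 1
next_word_id = int(np.argmax(probs))                # cell with the highest probability
```

A vocabulary lookup table (not shown) would then map `next_word_id` back to the word it stands for.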

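The figure starts at the bottom with the vector produced by the decoder stack, which is then turned into an output word.

Now that we have covered the entire forward pass through a trained Transformer, it is useful to look at the intuition behind training the model.

During training, an untrained model goes through exactly the same forward pass. Because we train it on a labeled dataset, we can compare its output with the actual correct output.

To visualize this, assume our output vocabulary contains only 6 words ("a", "am", "i", "thanks", "student", and "<eos>", short for "end of sentence").

Once the output vocabulary is defined, we can use a vector of the same width to represent each word in it (this is known as one-hot encoding). For example, the word "am" is represented by the following vector: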

one-hot-vocabulary-example

Example: one-hot encoding of our output vocabulary.

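Next, let's discuss the model's loss function: the metric we optimize during training in order to arrive at a trained and, we hope, highly accurate model.

Say we are training the model, and this is the first step of the training phase. We train on a simple example: translating "merci" into "thanks".

This means we want the output to be a probability distribution that points to the word "thanks". Since the model is not yet trained, that is unlikely to happen just yet.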

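transformer_logits_output_and_label

Because the model's parameters (weights) are initialized randomly, the untrained model produces a probability distribution with arbitrary values for each cell/word. We can compare it with the actual output and then adjust all the model's weights using backpropagation to bring the output closer to the desired one.

How do you compare two probability distributions? We simply subtract one from the other. For more detail, look at cross-entropy and Kullback-Leibler divergence.

Note that this is an oversimplified example. More realistically, we would use a sentence longer than one word. For example, the input is "je suis étudiant" and the expected output is "I am a student". What this really means is that we want the model to output a sequence of probability distributions where:

each probability distribution is a vector whose width equals the vocabulary size (6 in our toy example, more realistically a number like 3000 or 10000);
the first probability distribution has its highest probability at the cell associated with the word "i";
the second probability distribution has its highest probability at the cell associated with the word "am";
and so on, until the final output distribution indicates the end-of-sentence symbol, which also has a cell associated with it.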

output_target_probability_distributions
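Comparing a predicted distribution with a one-hot target can be sketched with the toy six-word vocabulary. Cross-entropy is used here as one common way to measure the gap; that choice, the helper names `one_hot` and `cross_entropy`, and the two example distributions are assumptions made for illustration, not values taken from a real model.

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

def one_hot(word):
    # Vector as wide as the vocabulary, 1 at the word's cell and 0 elsewhere.
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

def cross_entropy(target, predicted, eps=1e-12):
    # Lower is better; eps guards against log(0).
    return float(-np.sum(target * np.log(predicted + eps)))

target = one_hot("thanks")
untrained = np.full(len(vocab), 1 / len(vocab))              # made-up uniform guess
trained = np.array([0.01, 0.01, 0.01, 0.95, 0.01, 0.01])     # made-up confident guess

loss_untrained = cross_entropy(target, untrained)
loss_trained = cross_entropy(target, trained)
```

Training adjusts the weights so that the loss moves from something like `loss_untrained` toward something like `loss_trained`. For a full sentence, the same comparison is made at every output position and the per-position losses are combined.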

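After training the model long enough on a large enough dataset, we would hope the produced probability distributions look like this: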

output_trained_model_probability_distributions

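Hopefully, after training, the model outputs the translation we expect. Of course, this is no real indication if the phrase was part of the training dataset (see: cross-validation). Note that every position receives a little probability even when it is unlikely to be the output of that time step; this is a useful property of softmax that helps training.

Because the model produces its outputs one at a time, we can assume it picks the word with the highest probability from each distribution and discards the rest. That is one way to do it (called greedy decoding). Another is to hold on to, say, the top 2 words (for example, "I" and "a") and, at the next step, run the model twice: once assuming the first output position was "I", and once assuming it was "a". Whichever version produces less error, considering both positions, is kept. We repeat this for positions #2 and #3, and so on. This method is called "beam search". In our example the beam size (beam_size) was 2 (two partial hypotheses, i.e. for positions #1 and #2, are kept at any time), and the number of returned beams (top_beams) was also 2 (we return two translations). Both are hyperparameters you can experiment with.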
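The difference between greedy decoding and beam search can be seen on a toy example. Everything here is hypothetical: `toy_model` is a hand-written probability table standing in for a real Transformer, and hypotheses are ranked by cumulative log-probability, which is one common way to decide which partial translation "produces less error".

```python
import math

def beam_search(step_probs, beam_size, max_len, eos="<eos>"):
    """step_probs(prefix) -> {word: probability}. Returns beams sorted best first."""
    beams = [(0.0, [])]                          # (log-probability, words so far)
    for _ in range(max_len):
        candidates = []
        for logp, words in beams:
            if words and words[-1] == eos:       # finished hypotheses carry over unchanged
                candidates.append((logp, words))
                continue
            for word, p in step_probs(tuple(words)).items():
                candidates.append((logp + math.log(p), words + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]           # keep only the best partial hypotheses
        if all(words[-1] == eos for _, words in beams):
            break
    return beams

# Made-up next-word probabilities; any unlisted prefix ends the sentence.
TABLE = {
    (): {"I": 0.6, "a": 0.4},
    ("I",): {"am": 0.55, "student": 0.45},
    ("a",): {"student": 0.9, "am": 0.1},
}

def toy_model(prefix):
    return TABLE.get(prefix, {"<eos>": 1.0})

greedy = beam_search(toy_model, beam_size=1, max_len=4)[0][1]   # beam_size=1 is greedy decoding
best = beam_search(toy_model, beam_size=2, max_len=4)[0][1]
```

In this table, greedy decoding commits to "I" (probability 0.6) and ends with a sequence of probability 0.33, while a beam of 2 keeps "a" alive long enough to find a sequence of probability 0.36. Returning the whole list from `beam_search` rather than only its first entry corresponds to asking for several translations.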
