GPT-2 in pictures (visualization of Transformer language models)

openAI-GPT-2-3


In 2019 we witnessed a dazzling application of machine learning. The OpenAI GPT-2 model demonstrated an impressive ability to write coherent and passionate essays, exceeding what we expected modern language models to be able to produce. GPT-2 is not a particularly novel architecture: it is very similar to the decoder-only Transformer. What sets GPT-2 apart is that it is a truly huge Transformer-based language model, trained on an impressive dataset. In this article we will look at the architecture that makes such results possible: we will examine the self-attention layer in detail, and then look at applications of the decoder-only Transformer beyond language modeling.


Contents


  • Part 1: GPT-2 and Language Modeling
    • What Is a Language Model?
    • Transformers for Language Modeling
    • One Difference from BERT
    • The Evolution of the Transformer Block
    • Crash Course in Brain Surgery: Looking Inside GPT-2
    • A Deeper Look Inside
    • End of Part 1: GPT-2, Ladies and Gentlemen
  • Part 2: The Illustrated Self-Attention
    • Self-Attention (Without Masking)
    • Step 1 – Create the Query, Key, and Value Vectors
    • Step 2 – Score
    • Step 3 – Sum
    • The Illustrated Masked Self-Attention
    • GPT-2's Masked Self-Attention
    • You've Made It!
  • Part 3: Beyond Language Modeling
    • Machine Translation
    • Summarization
    • Transfer Learning
    • Music Generation

Part 1: GPT-2 and Language Modeling


What Is a Language Model?



In The Illustrated Word2vec we looked at what a language model is: essentially, a machine learning model that can look at part of a sentence and predict the next word. The most famous language models are the smartphone keyboards that suggest the next word based on what you have typed so far.


swiftkey-keyboard


In this sense, we can say that GPT-2 is essentially the next-word-prediction feature of a keyboard app, but much larger and more sophisticated than anything your phone has. GPT-2 was trained on a massive 40 GB dataset called WebText, which OpenAI researchers crawled from the internet as part of the research effort. For comparison in terms of size: the SwiftKey keyboard app I use takes up 78 MB, while the smallest trained variant of GPT-2 needs 500 MB just to store all of its parameters, and the largest GPT-2 variant is 13 times bigger (more than 6.5 GB).


gpt2-sizes


An excellent way to experiment with GPT-2 is the AllenAI GPT-2 Explorer. It uses GPT-2 to show the ten most likely next words (along with their probability scores); you can pick a word, then see the next list of predictions, and so the passage continues.



Transformers for Language Modeling


As we saw in The Illustrated Transformer, the original Transformer model consists of an encoder and a decoder, each of which is a stack of what we can call Transformer blocks. That architecture was appropriate because the model tackled machine translation, a problem where encoder-decoder architectures had been successful in the past.


transformer-encoder-decoder


Much of the subsequent research work dropped either the encoder or the decoder and used just a single stack of Transformer blocks, stacking them as high as practically possible, feeding them enormous amounts of training text, and throwing vast amounts of compute at them (hundreds of thousands of dollars to train some of these language models, and probably millions in the case of AlphaStar).


gpt-2-transformer-xl-bert-3


How high can we stack these blocks? It turns out this is one of the main factors distinguishing the different sizes of GPT-2:


gpt2-sizes-hyperparameters-3


One Difference from BERT


First Law of Robotics:
A robot may not injure a human being or, through inaction, allow a human being to come to harm.

GPT-2 is built from Transformer decoder blocks. BERT, in contrast, uses Transformer encoder blocks. We will examine the difference in a later section. But one key difference between the two is that GPT-2, like traditional language models, outputs one token at a time. Let's, for example, prompt a well-trained GPT-2 to recite the First Law of Robotics:


gpt-2-output


These models actually work by adding each produced token to the sequence of inputs after it is generated, and that new sequence becomes the model's input at the next step. This idea is called "auto-regression" (auto-regression), and it is one of the ideas that made RNNs so unreasonably effective.


gpt-2-autoregression-2


GPT-2, and some later models such as TransformerXL and XLNet, are auto-regressive by nature. BERT is not. That is a trade-off: by giving up auto-regression, BERT gained the ability to incorporate context on both sides of a word and thereby achieve better results. XLNet brings auto-regression back while finding an alternative way to incorporate context from both sides.



The Evolution of the Transformer Block



The original Transformer paper introduced two types of Transformer blocks. First, the encoder block:


transformer-encoder-block-2


An encoder block from the original paper can accept inputs up to a certain maximum sequence length (e.g., 512 tokens). It is fine for an input sequence to be shorter than this limit; we simply pad out the rest of the sequence.



Second, the decoder block, which has a small architectural variation compared to the encoder block: an extra layer that allows it to attend to specific segments of the encoder's output:


transformer-decoder-block-2


One key difference in the self-attention layer here is that it masks future tokens: not by replacing the word with [mask] as BERT does, but by interfering in the self-attention computation and blocking information from tokens that lie to the right of the position being computed.


If, for example, we highlight the path of position #4, we can see that it is only allowed to attend to the current and previous tokens:


transformer-decoder-block-self-attention-2


It is important to be clear about the distinction between self-attention (what BERT uses) and masked self-attention (what GPT-2 uses). A normal self-attention block allows a position to peek at tokens to its right. Masked self-attention prevents that from happening:


self-attention-and-masked-self-attention



After the original paper, "Generating Wikipedia by Summarizing Long Sequences" proposed another arrangement of the Transformer block capable of language modeling: it threw away the Transformer encoder altogether. For that reason, let's call this model the "Transformer-Decoder". This early Transformer-based language model consisted of a stack of six decoder blocks:


transformer-decoder-intro


The blocks are identical. I have expanded the first one so you can see that its self-attention layer is the masked variant. Notice that the model can now address up to 4,000 tokens in a given segment — a massive upgrade from the 512 of the original Transformer.


These blocks were very similar to the original decoder blocks, except that they dropped the second self-attention layer. A similar architecture was examined in "Character-Level Language Modeling with Deeper Self-Attention", which built a language model that predicts one letter/character at a time.


OpenAI's GPT-2 model uses exactly these decoder-only blocks.


Crash Course in Brain Surgery: Looking Inside GPT-2


Look inside and you will see, the words are cutting deep inside my brain. Thunder burning, quickly burning, knife of words is driving me insane, insane, yeah. (Budgie)

Let's lay a trained GPT-2 on our surgery table and look at how it works.


gpt-2-layers-2


GPT-2 can process 1024 tokens. Each token flows through all the decoder blocks along its own path.


The simplest way to run a trained GPT-2 is to let it ramble on its own (technically called generating unconditional samples); alternatively, we can give it a prompt so that it speaks about a certain topic (i.e., generating interactive conditional samples). In the rambling case, we can simply hand it the start token and let it begin generating words (the trained model uses <|endoftext|> as its start token; we will call it <|s|> instead).


gpt2-simple-output-2


The model has only one input token, so only that path is active. The token is processed successively through all the layers, and then a vector is produced along that path. That vector can be scored against the model's vocabulary (all the words the model knows — 50,000 in the case of GPT-2). In this example we select the token with the highest probability, "the". But we can certainly mix things up: you know how, if you keep clicking the suggested word in your keyboard app, it can get stuck in repetitive loops from which the only way out is to tap the second or third suggested word. The same can happen here. GPT-2 has a parameter called top-k that makes the model consider sampling words other than the top word (the latter being the case when top-k = 1).
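To make the top-k idea concrete, here is a minimal NumPy sketch; the five-word vocabulary and its probabilities are invented for illustration and are not real model outputs (real GPT-2 scores ~50,000 tokens at every step):

```python
import numpy as np

# A toy sketch of top-k sampling over a made-up five-word vocabulary.
vocab = ["the", "a", "robot", "it", "must"]
probs = np.array([0.40, 0.25, 0.20, 0.10, 0.05])

def sample_top_k(probs, k, rng):
    # Keep only the k highest-probability tokens and renormalize.
    top_idx = np.argsort(probs)[-k:]
    top_probs = probs[top_idx] / probs[top_idx].sum()
    return int(rng.choice(top_idx, p=top_probs))

rng = np.random.default_rng(0)
print(vocab[sample_top_k(probs, 1, rng)])  # top-k = 1: always "the"
print(vocab[sample_top_k(probs, 3, rng)])  # sample among the 3 best words
```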


In the next step, we add the output of the first step to our input sequence and have the model make its next prediction:


gpt-2-simple-output-3


Note that the second path is the only one active in this computation. Each layer of GPT-2 has retained its own interpretation of the first token and will use it when processing the second token (we will go into more detail in the following section on self-attention). GPT-2 does not re-interpret the first token in light of the second one.




A Deeper Look Inside


Let's look at more details to get to know the model more intimately, starting from the input. As with other NLP models we have discussed before, the model looks up the embedding of the input word in its embedding matrix — one of the components we get as part of a trained model.


gpt2-token-embeddings-wte-2


Each row is a word embedding: a list of numbers that represents a word and captures some of its meaning. The size of this list differs across GPT-2 model sizes. The smallest model uses an embedding size of 768 per word/token.


So at the start we look up the embedding of the start token <|s|> in the embedding matrix. Before handing it to the first block of the model, we need to mix in the positional encoding: a signal that tells the Transformer blocks the order of the words in the sequence. Part of the trained model is a matrix that contains a positional encoding vector for each of the 1024 positions in the input.


gpt2-positional-encoding


With this, we have covered how input words are processed before being handed to the first Transformer block. We also now know two of the weight matrices that make up the trained GPT-2.


gpt2-input-embedding-positional-encoding-3
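Here is a small sketch of that input assembly in NumPy. The sizes match GPT-2 small (50,257-token vocabulary, 1024 positions, width 768), but both matrices are random stand-ins rather than trained weights, and the token ids are hypothetical:

```python
import numpy as np

# Input to the first block = token embedding + positional encoding.
vocab_size, n_positions, d_model = 50257, 1024, 768
rng = np.random.default_rng(0)
wte = rng.normal(size=(vocab_size, d_model))   # token embedding matrix
wpe = rng.normal(size=(n_positions, d_model))  # positional encoding matrix

token_ids = np.array([31, 2])                  # hypothetical ids for two tokens
x = wte[token_ids] + wpe[np.arange(len(token_ids))]
print(x.shape)  # (2, 768) – one input vector per token
```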


Sending a word to the first Transformer block means looking up its embedding and adding the positional encoding vector for position #1.



A Journey up the Stack


The first block can now process the token: it first passes it through the self-attention process, then through its neural network layer. Once the first Transformer block has processed the token, it sends its resulting vector up the stack, to be processed by the next block. The process is identical in each block, but each block has its own weights in both its self-attention and neural network sublayers.


gpt2-transformer-block-vectors-2



Self-Attention Recap


Language depends heavily on context. For example, look at the Second Law:

Second Law of Robotics: A robot must obey the orders given it by human beings, except where such orders would conflict with the First Law.

I have highlighted three places in the sentence where words refer to other words. There is no way to understand or process these words without incorporating the context they refer to. When a model processes this sentence, it has to know that:


  • "it" refers to the robot;
  • "such orders" refers to the earlier part of the law ("the orders given it by human beings");
  • "the First Law" refers to the entire First Law.

This is what self-attention does. It bakes the model's understanding of relevant and associated words into the representation of a given word before that word is processed (passed through a neural network). It does so by assigning a score to how relevant each word in the segment is, and summing up their weighted vector representations.


As an example, this self-attention layer in the top block is paying attention to "a robot" while processing the word "it". The vector it passes on to its neural network is a sum of the three words' vectors multiplied by their scores.


gpt2-self-attention-example-2



The Self-Attention Process


Self-attention is applied along the path of each token in the segment. The significant components are three vectors:


  • Query: the query is a representation of the current word, used to score it against all the other words (using their keys). We only care about the query of the token currently being processed;
  • Key: key vectors are like labels for all the words in the segment. They are what we match against when searching for relevant words;
  • Value: value vectors are the actual word representations; once we have scored how relevant each word is, these are the values we sum up to represent the current word.

self-attention-example-folders-3


A crude analogy is searching through a filing cabinet. The query is like a sticky note with the topic you are researching. The keys are like the labels of the folders inside the cabinet. When you match the note against a label, you pull out the contents of that folder: these contents are the value vector. Except that you are looking not for a single value, but for a blend of values from a blend of folders.


Multiplying the query vector by each key vector produces a score for each folder (technically, a dot product followed by a softmax).


self-attention-example-folders-scores-3


We multiply each value by its score and sum them up, obtaining the output of self-attention.


gpt2-value-vector-sum


This weighted blend of value vectors gives us a vector that paid 50% of its "attention" to the word "robot", 30% to the word "a", and 19% to the word "it". Later in the post we will dig deeper into self-attention; for now, let's continue our journey up the stack toward the model's output.
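As a tiny sketch, here is that blending operation with the percentages from the example above; the value vectors themselves are random stand-ins, since only the weighted sum is the point:

```python
import numpy as np

# Blend value vectors by attention weight (50% / 30% / 19%).
rng = np.random.default_rng(0)
values = {w: rng.normal(size=4) for w in ["robot", "a", "it"]}
weights = {"robot": 0.50, "a": 0.30, "it": 0.19}

blended = sum(weights[w] * values[w] for w in values)
print(blended)  # the context-infused vector passed on to the neural network
```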



Model Output


When the top block of the model produces its output vector (the result of its own self-attention followed by its own neural network), the model multiplies that vector by the embedding matrix.


gpt2-output-projection-2


Recall that each row of the embedding matrix corresponds to the embedding of one word in the model's vocabulary, so the result of this multiplication is interpreted as a score for each word in the vocabulary.


gpt2-output-scores-2


We could simply select the token with the highest score (top_k = 1). But the model achieves better results if it considers other words as well. A better strategy is to sample a word from the whole list, using the score as the probability of selecting that word (so words with a higher score have a higher chance of being chosen). A common middle ground is setting top_k to 40: the model considers the 40 words with the highest scores.


gpt2-output
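A minimal sketch of this output step, with the embedding matrix and the block's output vector as random stand-ins (only the shapes and operations reflect the description above):

```python
import numpy as np

# Multiply the top block's output by the token embedding matrix to get one
# score (logit) per vocabulary word, then softmax and pick a token.
d_model, vocab_size = 768, 50257
rng = np.random.default_rng(0)
wte = rng.normal(size=(vocab_size, d_model))  # token embedding matrix
h = rng.normal(size=d_model)                  # output vector of the top block

logits = wte @ h                              # a score for every token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax over the vocabulary
print(int(np.argmax(probs)))                  # top_k = 1: the single best token
```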


With that, the model has completed an iteration that outputs a single word. The model keeps iterating until the whole context has been generated (1024 tokens) or an end-of-sequence token is produced.


End of Part 1: GPT-2, Ladies and Gentlemen


We made it! That is pretty much the basics of how GPT-2 works. If you are curious to know exactly what happens inside the self-attention layer, the following bonus section is for you. I created it to introduce a more visual language for describing self-attention, so that later Transformer models (looking at you, TransformerXL and XLNet) are easier to examine and describe.


I would like to note a few oversimplifications made in this post:


  • I used the terms "words" and "tokens" interchangeably; in reality, GPT-2 uses Byte Pair Encoding to create the tokens of its vocabulary, which means tokens are usually parts of words.
  • The example we showed runs GPT-2 in inference/evaluation mode. That is why it processes only one word at a time. At training time, the model is trained on longer sequences of text and processes multiple tokens at once. Also at training time, the model uses a larger batch size (512) than the batch size of 1 used during evaluation.
  • I took liberties in rotating and transposing vectors to better lay out the images. At implementation time one has to be more precise.
  • Transformers use a great deal of layer normalization, which is quite important. We noted a few of these in The Illustrated Transformer, but in this post we focus on self-attention.
  • At times I needed to show more boxes to represent a vector; I indicate those as "zooming in". For example:

zoom-in


Part 2: The Illustrated Self-Attention


Earlier in the post we showed this image illustrating self-attention applied in a layer that is processing the word "it":


gpt2-self-attention-1-2


In this section we will look at how that is done in detail. Note that we will look at it in a way that tries to make sense of what happens to individual words; that is why we show so many single vectors, whereas real implementations multiply giant matrices together. Here I want to focus on the intuition of what happens at the level of a word.


Self-Attention (Without Masking)


Let's start by looking at the original self-attention as it is computed in an encoder block. For illustration, consider a toy Transformer block that can process only four tokens at a time.


Self-attention is applied in three main steps (a code sketch follows the figure below):


  1. Create the query, key, and value vectors for each path;
  2. For each input token, use its query vector to score it against all the other key vectors;
  3. Sum up the value vectors after multiplying them by their scores.

self-attention-summary
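Here are the three steps as a runnable NumPy sketch for a toy four-token segment. The dimensions are tiny for readability, and all the weights are random stand-ins, not trained parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Unmasked self-attention over a toy 4-token segment.
seq_len, d_model, d_head = 4, 8, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))            # one input vector per token
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

q, k, v = x @ W_q, x @ W_k, x @ W_v                # step 1: create Q, K, V
scores = softmax(q @ k.T / np.sqrt(d_head))        # step 2: score (then softmax)
out = scores @ v                                   # step 3: weighted sum of values
print(out.shape)  # (4, 4) – one context-blended vector per token
```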


Step 1 – Create the Query, Key, and Value Vectors


Let's focus on the first path. We will take its query and compare it against all the keys, which produces a score for each key. The first step of self-attention is to compute the three vectors for each token path (let's ignore attention heads for now):


self-attention-1


Multiplying the input vector by the weight matrices WQ, WK, and WV gives us the query, key, and value vectors


Step 2 – Score


Now that we have the vectors, we use only the queries and keys in step #2. Since we are focused on the first token, we multiply its query by all the key vectors, obtaining a score for each of the four tokens.


self-attention-2


The dot product of the query with each key (followed by a softmax) produces the attention scores


Step 3 – Sum


We can now multiply the scores by the value vectors. A value with a high score will make up a large portion of the resulting vector once we sum them up.


self-attention-3-2


The lower the score, the more transparent the value vector is drawn: this indicates how multiplying by a small number dilutes the values of a vector


If we perform the same operation for each path, we end up with a vector representing each token that contains the appropriate context for that token. These vectors are then presented to the next sublayer of the Transformer block (the feed-forward neural network).



The Illustrated Masked Self-Attention


Now that we have looked inside the self-attention step of a Transformer, let's move on to masked self-attention. Masked self-attention is identical to self-attention, except for step #2. Suppose the model has only two tokens as input and we are observing the second one. In this case the future tokens are masked, and the model interferes in the scoring step: it always scores future tokens as 0, so the model cannot peek at the words ahead:


masked-self-attention-2


This masking is often implemented as a matrix called an attention mask. Imagine a sequence of four words ("robot must obey orders", for example). In a language modeling scenario, this sequence is absorbed in four steps: one per word (let's assume for now that every word is a token). Since these models operate in batches, we can assume a batch size of 4 for this toy model, which then processes the entire sequence (with its four steps) as a single batch.


transformer-decoder-attention-mask-dataset


In matrix form, we compute the scores by multiplying a queries matrix by a keys matrix. Let's visualize it as follows, keeping in mind that instead of a word, each cell would hold the query (or key) vector associated with that word:


queries-keys-attention-mask


After the multiplication we slap on our attention-mask triangle. It sets the cells we want masked to minus infinity (-inf) or a very large negative number (e.g., -1 billion in GPT-2):


transformer-attention-mask


Then applying softmax to each row produces the actual attention scores we use for self-attention:


transformer-attention-masked-scores-softmax


Here is what this table of scores means (a code sketch follows the list below):


  • When the model processes the first example in the dataset (row #1), which contains a single word ("robot"), 100% of its attention is on that word.
  • When the model processes the second example (row #2), which contains the words "robot must", then while processing the word "must", 48% of its attention is on "robot" and 52% on "must".
  • And so on.
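Here is the masking mechanics in NumPy. The raw scores are made up for illustration; the point is that cells above the diagonal get -inf, so softmax turns them into exact zeros:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Causal masking for the 4-step "robot must obey orders" example.
scores = np.array([[0.11, 0.00, 0.81, 0.79],
                   [0.19, 0.50, 0.30, 0.48],
                   [0.53, 0.98, 0.95, 0.14],
                   [0.81, 0.86, 0.38, 0.90]])
mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal
masked_scores = np.where(mask, -np.inf, scores)
print(softmax(masked_scores).round(2))
# Row 1 puts 100% of its attention on token 1; row 2 splits its attention
# between tokens 1 and 2; and so on down the triangle.
```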

GPT-2's Masked Self-Attention


Let's look at GPT-2's masked attention in more detail.


Evaluation Time: Processing One Token at a Time


We could make GPT-2 operate exactly as masked self-attention does. But during evaluation, when the model merely adds one new word after each iteration, it would be inefficient to recompute self-attention along earlier paths for tokens that have already been processed.


In this case we process the first token (ignoring <|s|> for now).


gpt2-self-attention-qkv-1-2


GPT-2 holds on to the key and value vectors of the token "a". Every self-attention layer keeps its own key and value vectors for that token:


gpt2-self-attention-qkv-2-2


Now, in the next iteration, when the model processes the word "robot", it does not need to create query, key, and value vectors for the token "a" again: it simply reuses the ones saved from the first iteration:


gpt2-self-attention-qkv-3-2
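A sketch of this key/value reuse: at each step, only the new token's q, k, v are computed, while keys and values of earlier tokens come from a cache. The weights and embeddings are random stand-ins:

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
k_cache, v_cache = [], []

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_to_new_token(x):
    q, k, v = x @ W_q, x @ W_k, x @ W_v      # computed once per token, ever
    k_cache.append(k)
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    weights = softmax(q @ K.T / np.sqrt(d))  # score only against cached keys
    return weights @ V

for embedding in rng.normal(size=(3, d)):    # e.g. "a", "robot", "must"
    out = attend_to_new_token(embedding)
print(out.shape)  # (8,)
```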


GPT-2 Self-Attention: 1 – Creating Queries, Keys, and Values


Suppose the model is processing the word "it". For the bottom block, the input for that token is the embedding of "it" plus the positional encoding for slot #9:


gpt2-self-attention-1


Every block of a Transformer has its own weights (we will break them down later in the post). The first one we encounter is the weight matrix used to create the queries, keys, and values.


gpt2-self-attention-2


Self-attention multiplies its input by this weight matrix (and adds a bias vector, not drawn here)


The multiplication yields a vector that is essentially a concatenation of the query, key, and value vectors for the word "it".


gpt2-self-attention-3


Multiplying the input vector by the attention weight matrix (and adding a bias vector) yields the query, key, and value vectors for this token
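In code, this fused projection is a single matrix multiply with a (768 × 2304) weight matrix plus a bias, whose output is then sliced in three. The weights below are random stand-ins:

```python
import numpy as np

d_model = 768
rng = np.random.default_rng(0)
W_attn = rng.normal(size=(d_model, 3 * d_model))
b_attn = rng.normal(size=3 * d_model)

x = rng.normal(size=(1, d_model))        # the input vector for "it"
qkv = x @ W_attn + b_attn                # one long (1, 2304) vector
q, k, v = np.split(qkv, 3, axis=-1)      # slice it into the three vectors
print(q.shape, k.shape, v.shape)         # (1, 768) each
```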


GPT-2 Self-Attention: 1.5 – Splitting into Attention Heads


In the previous examples we dove straight into self-attention, ignoring the "multi-head" part. It will be useful to shed some light on that concept now. Self-attention is carried out several times, on different parts of the query (Q), key (K), and value (V) vectors. "Splitting" into attention heads is simply reshaping a long vector into a matrix. The small GPT-2 has 12 attention heads, so that is the first dimension of the reshaped matrix:


gpt2-self-attention-split-attention-heads-1


In the previous examples we looked at what happens inside one attention head. One way to think about all the attention heads is like this (if we visualize only three of the twelve heads):


gpt2-self-attention-split-attention-heads-2
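The "split" really is just a reshape, as the following sketch shows; no arithmetic on the numbers themselves takes place (the vector here is a stand-in):

```python
import numpy as np

# A 768-dim query becomes 12 heads of 64 dims each (12 * 64 = 768).
n_head, d_head = 12, 64
seq_len = 1
q = np.arange(768, dtype=float).reshape(seq_len, 768)

q_heads = q.reshape(seq_len, n_head, d_head).swapaxes(0, 1)
print(q_heads.shape)  # (12, 1, 64): one 64-dim query per head
```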


GPT-2 Self-Attention: 2 – Scoring


We can now proceed to scoring, keeping in mind that we are looking at just one attention head (and that all the others perform a similar operation):


gpt2-self-attention-scoring


Now the token can be scored against all the keys of the other tokens (which were computed by attention head #1 in previous iterations):


gpt2-self-attention-scoring-2


GPT-2 Self-Attention: 3 – Sum


As we saw before, we now multiply each value by its score and sum them up, producing the self-attention result for attention head #1:


gpt2-self-attention-multihead-sum-1


GPT-2 Self-Attention: 3.5 – Merging Attention Heads


The way we deal with the different attention heads is to first concatenate them into a single vector:


gpt2-self-attention-merge-heads-1


But this vector is not yet ready to be sent to the next sublayer: we first need to turn this Frankenstein's monster of hidden states into a homogeneous representation.
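The merge itself is just the reverse reshape, as in this sketch with stand-in values:

```python
import numpy as np

# 12 per-head results of 64 dims each are laid side by side
# into one 768-dim vector per token.
n_head, d_head, seq_len = 12, 64, 1
z_heads = np.zeros((n_head, seq_len, d_head))   # outputs of the 12 heads

z = z_heads.swapaxes(0, 1).reshape(seq_len, n_head * d_head)
print(z.shape)  # (1, 768): ready for the projection that follows
```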


GPT-2 Self-Attention: 4 – Projecting


We let the model learn how best to map the concatenated self-attention results into a vector the feed-forward neural network can work with. Here comes our second large weight matrix, which projects the results of the attention heads into the output vector of the self-attention sublayer:


gpt2-self-attention-project-1


And with that, we have produced the vector that we can send along to the next layer:


gpt2-self-attention-project-2


GPT-2 Fully Connected Neural Network: Layer #1


The fully connected neural network is where the block processes its input token after self-attention has baked the appropriate context into its representation. It consists of two layers. The first layer is four times the size of the model (768 in GPT-2 small, so 768*4 = 3072 units). Why four times? That is simply the size the original Transformer used (its model dimension was 512, and layer #1 in that model was 2048). This seems to give Transformer models enough representational capacity for the tasks thrown at them so far.


gpt2-mlp1


(bias vectors not shown)


GPT-2 Fully Connected Neural Network: Layer #2 – Projecting Back to Model Dimension


The second layer projects the result of the first layer back to the model dimension (768 in GPT-2 small). The result of this multiplication is the output of the Transformer block for this token.


gpt2-mlp-2


(bias vectors not shown)
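Here is the whole fully connected sublayer as a sketch for GPT-2 small: expand 768 → 3072, apply the activation (GPT-2 uses GELU), project back 3072 → 768. The weights are random stand-ins, and the biases are omitted as in the figures:

```python
import numpy as np

def gelu(x):
    # GPT-2's GELU activation (tanh approximation).
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

d_model, d_ff = 768, 4 * 768
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))

x = rng.normal(size=(1, d_model))   # a token's vector after self-attention
h = gelu(x @ W1)                    # layer #1: 3072 units
y = h @ W2                          # layer #2: back to model dimension
print(y.shape)  # (1, 768)
```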


You've Made It!


This is the most detailed version of the Transformer block we will cover! You now have pretty much the full picture of what happens inside a Transformer language model. To recap, our brave input vector meets the following weight matrices:


gpt2-transformer-block-weights-2


Each block has its own set of these weights. On the other hand, the model has only one token embedding matrix and one positional encoding matrix:


gpt2-weights-2


If you want to see all the parameters of the model, I have tallied them here:


gpt2-117-parameters


For some reason they add up to 124M parameters rather than the stated 117M. I am not sure why, but that is how many there appear to be in the published code (please correct me if I am wrong).
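As a back-of-the-envelope check, the weight matrices we met above account for almost all of those parameters (layer norms and bias vectors are left out for brevity, which is why the total lands slightly under the exact figure):

```python
# Tally GPT-2 small's parameters from the matrices discussed in this post.
vocab_size, n_positions, d_model, n_blocks = 50257, 1024, 768, 12

embeddings = vocab_size * d_model + n_positions * d_model   # wte + wpe
attention  = d_model * 3 * d_model + d_model * d_model      # QKV + projection
mlp        = d_model * 4 * d_model + 4 * d_model * d_model  # layer #1 + layer #2
total = embeddings + n_blocks * (attention + mlp)
print(f"{total:,}")  # 124,318,464 – roughly the 124M discussed above
```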


Part 3: Beyond Language Modeling


The decoder-only Transformer keeps showing promise beyond language modeling. There are plenty of applications where it has been successful, and they can be described with visuals similar to the ones above. Let's close this post by looking at some of these applications.



Machine Translation

An encoder is not required for translation. The same task can be tackled by a decoder-only Transformer:


decoder-only-transformer-translation



Summarization

This is the task the very first decoder-only Transformer was trained on. It was trained to read a Wikipedia article (without the opening section before the table of contents) and summarize it. The actual opening sections of the articles served as labels in the training data:


wikipedia-summarization


Since the model was trained on Wikipedia articles, it learned to summarize articles:


decoder-only-summarization



Transfer Learning

In "Sample Efficient Text Summarization Using a Single Pre-Trained Transformer", a decoder-only Transformer is first pre-trained on language modeling and then fine-tuned for summarization. It turns out to achieve better results than a pre-trained encoder-decoder Transformer in limited-data settings.


The GPT-2 paper also reports summarization results after pre-training the model with a language modeling objective.



Music Generation

The Music Transformer uses a decoder-only Transformer to generate music with expressive timing and dynamics. "Music modeling" is just like language modeling: we let the model learn music in an unsupervised way and then have it sample outputs (what we earlier called "rambling").


You might wonder how music is represented in this scenario. Remember that language modeling can be done with vector representations of characters, words, or tokens that are parts of words. In a musical performance (let's think of piano for now), we have to represent not only the notes but also velocity: a measure of how hard the piano key is pressed.


music-transformer-performance-encoding-3


A performance is just a series of these one-hot vectors. A MIDI file can be converted into this format. The paper gives the following example input sequence:


music-representation-example


The one-hot vector representation of this input sequence looks like this:


music-transformer-input-representation-2


I love the visual in the paper that showcases self-attention in the Music Transformer; I have added some annotations to it here:


music-transformer-self-attention-2


"This piece has a recurring triangular contour. The query sits at one of the latter peaks, and it attends to all the high notes on the previous peaks, all the way back to the beginning of the piece."



This concludes our journey into GPT-2 and its parent model, the decoder-only Transformer. I hope you come away from this post with a better understanding of self-attention, and feel more comfortable with what happens inside a Transformer.



