Transformers as Graph Neural Networks

TL;DR: a translation of Chaitanya Joshi's post "Transformers are Graph Neural Networks": diagrams, formulas, ideas, important links. Published with the author's kind permission.

Friends in data science often ask the same question: Graph Neural Networks are a great idea, but have they had any real success stories? Do they have practical applications?

One can point to the well-known examples: the recommendation systems at Pinterest, Alibaba and Twitter. But there is a subtler success story: the Transformer architecture, which has taken industrial natural language processing by storm.

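In this post I want to draw the links between Graph Neural Networks (GNNs) and Transformers. I will discuss the intuitions behind model architectures in the NLP and GNN communities, connect them through equations and figures, and discuss how the two communities could work together to make progress.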

Let us start with the purpose of model architectures: representation learning.

NLP

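At a high level, all neural network architectures build representations of the input data as vectors (embeddings) that encode useful statistical and semantic information about the data. These latent (hidden) representations can then be used for something useful, such as classifying an image or translating a sentence. The network learns to build better and better representations by receiving feedback, usually through error/loss functions.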

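In natural language processing (NLP), recurrent neural networks (RNNs) conventionally build the representation of each word in a sentence sequentially, that is, one word at a time. Intuitively, an RNN layer works like a conveyor belt on which the words are processed from left to right. At the end we obtain a hidden feature for each word in the sentence, which is passed to the next RNN layer or used for the task at hand.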

, , RNN ( ) .

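Transformers have gradually replaced RNNs in mainstream NLP. The architecture takes a fresh approach to representation learning: doing away with recurrence entirely, it builds the features of each word with an attention mechanism that works out how important every other word in the sentence is to that word. Given these importances, the updated feature of a word is simply the sum of linear transformations of the features of all the words, weighted by their importance.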

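Back in 2017 this idea sounded very radical, because the NLP community was so used to processing text sequentially, one word at a time, with RNNs. The title of the paper probably added fuel to the fire!
For a recap, see the video overview by Yannic Kilcher.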

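Let us build intuition by writing the architecture in mathematical notation. The hidden feature $h$ of the $i$-th word in a sentence $\mathcal{S}$ is updated from layer $\ell$ to layer $\ell+1$ as follows: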

$h_{i}^{\ell+1} = \text{Attention} \left( Q^{\ell} h_{i}^{\ell} \ , K^{\ell} h_{j}^{\ell} \ , V^{\ell} h_{j}^{\ell} \right),$

$\text{i.e.,}\ h_{i}^{\ell+1} = \sum_{j \in \mathcal{S}} w_{ij} \left( V^{\ell} h_{j}^{\ell} \right),$

$\text{where}\ w_{ij} = \text{softmax}_j \left( Q^{\ell} h_{i}^{\ell} \cdot K^{\ell} h_{j}^{\ell} \right),$

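where $j \in \mathcal{S}$ ranges over the words in the sentence and $Q^{\ell}, K^{\ell}, V^{\ell}$ are learnable linear weights (the Query, Key and Value of the attention computation, respectively). Attention is computed for every word in the sentence in parallel, so all updated features are obtained in one shot. This is another advantage of Transformers over RNNs, which update features word by word.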

, :

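Taking the feature $h_{i}^{\ell}$ of a word and the set of features of the words in the sentence, $\{ h_{j}^{\ell} ;\ \forall j \in \mathcal{S} \}$, we compute the attention weight $w_{ij}$ for each pair $(i,j)$ through a dot product, followed by a softmax over all $j$. We then obtain the updated feature $h_{i}^{\ell+1}$ of word $i$ by summing the (linearly transformed) $\{ h_{j}^{\ell} \}$, weighted by their corresponding $w_{ij}$. Every word in the sentence goes through the same pipeline in parallel.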
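The update above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`matvec`, `softmax`, `attention_layer`), the toy features and the identity weight matrices are invented here and do not come from the original post. No scaling or multiple heads are applied, matching the single-head formula.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_layer(H, Q, K, V):
    """Update every word feature h_i from all words j in the sentence S."""
    queries = [matvec(Q, h) for h in H]
    keys = [matvec(K, h) for h in H]
    values = [matvec(V, h) for h in H]
    H_next, weights = [], []
    for q in queries:
        # w_ij = softmax_j(Q h_i . K h_j)
        w = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
        # h_i^{l+1} = sum_j w_ij (V h_j)
        h_new = [sum(w_j * v[d] for w_j, v in zip(w, values))
                 for d in range(len(values[0]))]
        H_next.append(h_new)
        weights.append(w)
    return H_next, weights

# Toy sentence of three words with 2-dimensional features (made-up numbers).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, purely for illustration
H_next, W = attention_layer(H, I2, I2, I2)
```

Each row of `W` holds the weights $w_{ij}$ for one word $i$ and sums to one, because the softmax is taken over $j$.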

Multi-Head Attention

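Getting this dot product attention to work turns out to be tricky: a bad random initialization can destabilize learning. This can be overcome by computing several attention "heads" in parallel and concatenating the results (each "head" has its own learnable weights):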

$h_{i}^{\ell+1} = \text{Concat} \left( \text{head}_1, \ldots, \text{head}_K \right) O^{\ell},$

$\text{head}_k = \text{Attention} \left( Q^{k,\ell} h_{i}^{\ell} \ , K^{k, \ell} h_{j}^{\ell} \ , V^{k, \ell} h_{j}^{\ell} \right),$

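where $Q^{k,\ell}, K^{k,\ell}, V^{k,\ell}$ are the learnable weights of the $k$-th attention "head" and $O^{\ell}$ is a projection that makes the dimensions of $h_i^{\ell+1}$ and $h_i^{\ell}$ match across layers.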
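A minimal sketch of the multi-head formula, again in plain Python. The function names, the two toy heads and the identity output projection are assumptions made for illustration. The projection $O^{\ell}$ is applied here as a matrix-vector product on the concatenated heads, which is one possible convention for the right-multiplication in the formula.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(H, Q, K, V):
    """One attention head: returns the head output for every word."""
    queries = [matvec(Q, h) for h in H]
    keys = [matvec(K, h) for h in H]
    values = [matvec(V, h) for h in H]
    out = []
    for q in queries:
        w = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(w_j * v[d] for w_j, v in zip(w, values))
                    for d in range(len(values[0]))])
    return out

def multi_head_attention(H, heads, O):
    """Concatenate the outputs of all heads per word, then project with O."""
    per_head = [attention_head(H, Q, K, V) for Q, K, V in heads]
    updated = []
    for i in range(len(H)):
        concat = [x for head in per_head for x in head[i]]
        updated.append(matvec(O, concat))
    return updated

# Two toy heads with 1-dimensional outputs; each looks at one input coordinate.
heads = [
    ([[1.0, 0.0]], [[1.0, 0.0]], [[1.0, 0.0]]),
    ([[0.0, 1.0]], [[0.0, 1.0]], [[0.0, 1.0]]),
]
O = [[1.0, 0.0], [0.0, 1.0]]  # identity projection, for illustration only
H = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
H_next = multi_head_attention(H, heads, O)
```

Because all three toy words are identical, every head averages identical values, so each updated feature stays close to `[1.0, 2.0]`.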

, "" " ", . .

""

Scale issues and the Feed-forward sub-layer

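A key issue motivating the final Transformer architecture is that the word features produced by the attention mechanism may be at different scales. First (1), some words may have very sharp or very spread-out attention weights $w_{ij}$ when the features of the other words are summed. Second (2), at the level of individual vector entries, concatenating several attention "heads", each of which may output values at a different scale, can give the entries of the final vector $h_{i}^{\ell+1}$ a wide range of values. Following conventional machine learning practice, it seems reasonable to add a normalization layer to the pipeline.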

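Transformers address issue (2) with LayerNorm, which normalizes at the feature level and learns an affine transformation. In addition, scaling the dot product attention by the square root of the feature dimension helps with issue (1).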

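Finally, the authors propose one more trick to control the scale issue: a position-wise two-layer MLP with a special structure. After multi-head attention, $h_i^{\ell+1}$ is projected by learnable weights to a (much) higher dimension, passed through a ReLU non-linearity, and projected back to its original dimension, followed by another normalization: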

$h_i^{\ell+1} = \text{LN} \left( \text{MLP} \left( \text{LN} \left( h_i^{\ell+1} \right) \right) \right)$
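The feed-forward sub-layer can be sketched as follows. This is an illustrative sketch only: `layer_norm` omits the learnable affine parameters of LayerNorm, and the weight matrices `W1` and `W2` are made-up values. Residual connections are left out, as in the formula.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def layer_norm(h, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance
    (assumption: no learnable gain or bias)."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]

def mlp(h, W1, W2):
    """Project up with W1, apply ReLU, project back down with W2."""
    hidden = [max(0.0, s) for s in matvec(W1, h)]
    return matvec(W2, hidden)

def feed_forward_sublayer(h, W1, W2):
    """h <- LN(MLP(LN(h))), applied to one word independently of the others."""
    return layer_norm(mlp(layer_norm(h), W1, W2))

# Toy weights: 2 -> 4 -> 2 dimensions (hypothetical values).
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
W2 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, -1.0]]
out = feed_forward_sublayer([3.0, 1.0], W1, W2)
```

The sub-layer is position-wise: it is applied to each word's vector separately, so no information is exchanged between words at this step.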

, , , . ! , LayerNorm . — , .
, !

"" , NLP- , . , ("") , (residual connections) "" "". .

GNN

NLP.

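Graph Neural Networks (GNNs) or Graph Convolutional Networks (GCNs) build representations of the nodes and edges of graph data. They do so through neighbourhood aggregation (also called message passing): each node gathers features from its neighbours to update its representation of the local graph structure around it. Stacking several GNN layers lets the model propagate each node's features across the whole graph: from its neighbours to the neighbours of its neighbours, and so on.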

, , : , GNN, : "" .

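In its simplest form, a GNN updates the hidden feature $h$ of node $i$ at layer $\ell$ by applying a non-linear transformation to the node's own feature $h_i^{\ell}$ plus the aggregated features $h_j^{\ell}$ of each neighbouring node $j \in \mathcal{N}(i)$: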

$h_{i}^{\ell+1} = \sigma \Big( U^{\ell} h_{i}^{\ell} + \sum_{j \in \mathcal{N}(i)} \left( V^{\ell} h_{j}^{\ell} \right) \Big),$

where $U^{\ell}, V^{\ell}$ are the learnable weight matrices of the GNN layer and $\sigma$ is a non-linearity (for example, ReLU).

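The sum over the neighbouring nodes $j \in \mathcal{N}(i)$ can be replaced by another aggregation function that does not depend on the input size, such as a simple mean/max or something more powerful: a weighted sum computed with an attention mechanism.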
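The GNN update with sum aggregation can be sketched as below. The graph, features and identity weights are toy values chosen for illustration, and `gnn_layer` is a hypothetical helper name, not an API of any library.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gnn_layer(H, neighbours, U, V):
    """h_i <- ReLU(U h_i + sum over j in N(i) of V h_j)."""
    H_next = []
    for i, h_i in enumerate(H):
        total = matvec(U, h_i)
        for j in neighbours[i]:
            message = matvec(V, H[j])
            total = [t + m for t, m in zip(total, message)]
        H_next.append([max(0.0, t) for t in total])  # sigma = ReLU
    return H_next

# Path graph 0 - 1 - 2 with 2-dimensional node features (made-up numbers).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbours = {0: [1], 1: [0, 2], 2: [1]}
I2 = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, purely for illustration
H_next = gnn_layer(H, neighbours, I2, I2)
# H_next == [[1.0, 1.0], [2.0, 2.0], [1.0, 2.0]]
```

Replacing the inner sum with a mean, a max, or an attention-weighted sum changes only the aggregation step; the rest of the layer stays the same.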

, ?

, :

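If we run several neighbourhood aggregations in parallel ("heads") and replace the sum over the neighbours $j$ with an attention mechanism, that is, a weighted sum, we obtain the Graph Attention Network (GAT). Add normalization and a feed-forward MLP, and we have a "Graph Transformer"!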

— , —

, , — , . GNN, (.. ) (.. ) , .

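The difference is that standard GNNs aggregate features from the local neighbourhood $j \in \mathcal{N}(i)$, whereas Transformers for NLP treat the entire sentence $\mathcal{S}$ as the local neighbourhood and aggregate features from every word $j \in \mathcal{S}$ at each layer.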
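To make the correspondence concrete, here is a sketch in which attention is restricted to a neighbourhood $\mathcal{N}(i)$. Passing a fully connected graph gives attention over the whole sentence $\mathcal{S}$; passing a sparser graph gives a GAT-style layer. All names and values are illustrative assumptions. Each neighbourhood is assumed to be non-empty and, for the sentence case, to include the word itself.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def graph_attention_layer(H, neighbours, Q, K, V):
    """Attention-weighted aggregation: node i attends only to neighbours[i]."""
    H_next = []
    for i, h_i in enumerate(H):
        q = matvec(Q, h_i)
        keys = [matvec(K, H[j]) for j in neighbours[i]]
        values = [matvec(V, H[j]) for j in neighbours[i]]
        w = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
        H_next.append([sum(w_j * v[d] for w_j, v in zip(w, values))
                       for d in range(len(values[0]))])
    return H_next

def sentence_as_graph(n):
    """Fully connected graph: every word's neighbourhood is the whole sentence."""
    return {i: list(range(n)) for i in range(n)}

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, purely for illustration
full = graph_attention_layer(H, sentence_as_graph(len(H)), I2, I2, I2)
local = graph_attention_layer(H, {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}, I2, I2, I2)
```

Only the `neighbours` argument differs between the two calls, which is the sense in which a sentence can be viewed as a fully connected graph of words.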

, , , , , , — . GNN-. , GNN, .

?

, , , .

— , NLP?

( , ) : , . TreeLSTM, , , /GNN NLP?

(long-term dependencies)?

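Fully connected graphs have another problem: they make it hard to learn very long-term dependencies between words. The number of edges grows quadratically with the number of nodes, so for a sentence of $n$ words a Transformer/GNN computes over $n^2$ pairs of words. This gets out of hand for very large $n$.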
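A quick count illustrates the quadratic growth. The fixed-window variant below is a hypothetical sparse alternative, included only for comparison; it is not a method described in this post.

```python
def fully_connected_pairs(n):
    """Ordered word pairs (i, j) scored by full attention over n words."""
    return n * n

def windowed_pairs(n, window):
    """Pairs (i, j) with |i - j| <= window: a hypothetical sparse alternative."""
    return sum(min(n - 1, i + window) - max(0, i - window) + 1
               for i in range(n))

for n in (10, 100, 1000):
    print(n, fully_connected_pairs(n), windowed_pairs(n, window=2))
```

For a fixed window the pair count grows linearly in $n$, while the fully connected count grows as $n^2$.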

NLP-, . , "" LSH (Locality-Sensitive Hashing) . .

, , GNN. , (Binary Partitioning) " ".

" "?

NLP- , , , . , — , , — - " ".

, , " ".

"" — GNN , ( , ) , GNN ? .

? ?

. , , - "" . , , .

"" GNN, GAT , MoNet (Gaussian kernels) . . "" ?

, GNN (, ) . " " !

- , ? Yann Dauphin (ConvNet). , , !

?

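Reading recent Transformer papers, one gets the impression that training these models takes something close to black magic when it comes to choosing the learning rate schedule, the warmup strategy and the decay settings. This may simply be because the models are so large and the NLP tasks so demanding.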

, , , .

DeepMind- , - ? " 16 000 "" (warmup), 500 000 "" (decay), 9 000 ".

, , , : , " " ?

" "?

, , (inductive bias), ?

To dig deeper into the Transformer architecture from the NLP side, see these blog posts: The Illustrated Transformer and The Annotated Transformer.

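To get started with GNNs: Arthur Szlam's talk on the history of, and connections between, Attention/Memory Networks, GNNs and Transformers is a good entry point. Similarly, the position paper from DeepMind introduces the Graph Networks framework, which unifies these ideas. For a code walkthrough, the DGL team has a tutorial on treating seq2seq as a graph problem and building Transformers as GNNs.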

In the next post we will do the reverse: use GNN architectures as Transformers for NLP (based on the HuggingFace Transformers library).

Finally, we recently wrote a paper in which we apply Transformers to the QuickDraw sketch dataset. Check it out!

Addendum

The post has also been translated into Chinese. Join the discussion on Reddit and on Twitter!

Translated from English by Anton Alekseev
(Artificial Intelligence Laboratory, St. Petersburg Department of the Steklov Mathematical Institute, PDMI RAS)

The translator thanks Denis Kiryanov (kirdin) and Mikhail Evtikhiev (aspr_spb) for their valuable comments.
