The error backpropagation algorithm using Word2Vec as an example

Since I had significant difficulty finding an explanation of error backpropagation that I liked, I decided to write my own post, using the Word2Vec algorithm as the example. My goal is to explain the essence of the algorithm on a simple but non-trivial neural network. Besides, word2vec has become so popular in the NLP community that it is useful to focus on it.


This post is connected with another, more practical one that I recommend reading: it covers a direct implementation of word2vec in Python. Here we will focus mainly on the theoretical part.


Let's start with the things that are necessary for a true understanding of backpropagation. Besides machine learning concepts such as the loss function and gradient descent, two ingredients from mathematics come in handy:

- partial derivatives of multivariable functions;
- the chain rule.
If you are familiar with these concepts, the rest will be straightforward. If you have not mastered them yet, you will still be able to follow the basics of backpropagation.


First, I want to define the concept of backpropagation; if its meaning is not entirely clear from the definition, it will be explained in more detail in the following paragraphs.


1. What is the backpropagation algorithm?


Within a neural network, the only parameters involved in training, that is, in minimizing the loss function, are the weights (here I mean weights in the broad sense, including biases). The weights change at each iteration until we reach the minimum of the loss function.


Backpropagation is the tool that tells us how the weights should change at each iteration.
To see how it works, it is enough to look at gradient descent itself.
For simplicity, suppose the loss function depends on only two weights, w1 and w2.



Figure 1. Gradient descent over a loss function of two weights, w1 and w2.


The goal of training is to find the values of w1 and w2 that minimize the loss function.


This is done with gradient descent: at each iteration we compute the partial derivatives ∂L/∂w1 and ∂L/∂w2 and shift the weights in the direction opposite to the gradient. The step size is controlled by the learning rate η, a positive parameter chosen in advance.
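To make this concrete, here is a minimal Python sketch of gradient descent for a toy loss that depends on two weights w1 and w2 (the particular loss function is just an illustration, not anything from word2vec yet):

```python
import numpy as np

# A toy loss L(w1, w2) = (w1 - 2)^2 + (w2 + 1)^2 with its minimum at (2, -1)
def loss(w):
    return (w[0] - 2) ** 2 + (w[1] + 1) ** 2

def grad(w):
    # Partial derivatives dL/dw1 and dL/dw2
    return np.array([2 * (w[0] - 2), 2 * (w[1] + 1)])

w = np.array([0.0, 0.0])   # initial weights
eta = 0.1                  # learning rate

for _ in range(100):
    w -= eta * grad(w)     # move against the gradient

print(w, loss(w))          # w ends up near (2, -1), the loss near 0
```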


2. Word2Vec


The network I will use to explain backpropagation is the one behind word2vec, the algorithm that turns words into vectors; as already mentioned, word2vec is extremely popular in the NLP community.


In a nutshell, word2vec represents words as vectors: it learns a table of shape [N, 3], where N is the number of words in the vocabulary and, in this small example, each word is mapped to a 3-dimensional vector. The vectors are arranged so that words used in similar contexts, and therefore with related meanings, lie close to each other, while unrelated words lie far apart. This property is what makes word2vec embeddings useful.


There are two flavors of the word2vec algorithm: Continuous Bag-of-Words (CBOW) and skip-gram. In the CBOW model the context words are used to predict the central word, while in skip-gram the central word is used to predict its context.
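To make the difference between the two flavors concrete, here is a small sketch (the helper function and window size are my own choices for illustration) that builds the training pairs for both models from a toy sentence:

```python
def training_pairs(tokens, window=1):
    """Build (context, target) pairs for CBOW and (center, context) pairs for skip-gram."""
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        cbow.append((context, center))                  # context words -> central word
        skipgram.extend((center, c) for c in context)   # central word -> each context word
    return cbow, skipgram

cbow, skipgram = training_pairs("I like playing football".split())
# cbow     -> [(['like'], 'I'), (['I', 'playing'], 'like'), ...]
# skipgram -> [('I', 'like'), ('like', 'I'), ('like', 'playing'), ...]
```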


Below we go through both models. We start with the simplest version of word2vec and then build up from it.


3. The CBOW model with one context word


Let us start with the simplest form of the CBOW model, in which a single context word is used to predict a single target word. The network looks as follows:

Figure 2. The Continuous Bag-of-Words model with one context word.


- the input x and the output are word vectors in one-hot encoding;
- the activation function of the hidden layer is the identity, a = 1 (in other words, no nonlinearity is applied there);
- the activation function of the output layer is the Softmax.


Recall that with one-hot encoding a word is represented by a vector whose length equals the size of the vocabulary; the vector consists of zeros except for a single 1 at the position corresponding to that word.


For example, for a vocabulary of six words [w1, w2, w3, w4, w5, w6]:
OneHot(w4) = [0, 0, 0, 1, 0, 0]
OneHot([w1, w4]) = [1, 0, 0, 1, 0, 0]
OneHot([w1, w5, w6]) = [1, 0, 0, 0, 1, 1]
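A minimal sketch of such an encoding (the vocabulary of placeholder words and the helper function are only for illustration):

```python
def one_hot(words, vocabulary):
    """Encode one word (or several words at once) over a fixed vocabulary."""
    vec = [0] * len(vocabulary)
    for w in words:
        vec[vocabulary.index(w)] = 1
    return vec

vocab = ["w1", "w2", "w3", "w4", "w5", "w6"]
print(one_hot(["w4"], vocab))              # [0, 0, 0, 1, 0, 0]
print(one_hot(["w1", "w4"], vocab))        # [1, 0, 0, 1, 0, 0]
print(one_hot(["w1", "w5", "w6"], vocab))  # [1, 0, 0, 0, 1, 1]
```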


The weights between the input and the hidden layer form a matrix W of size V×N, and the weights between the hidden and the output layer form a matrix W′ of size N×V, where V is the size of the vocabulary and N is the size of the hidden layer (that is, the dimension of the word vectors that word2vec will produce).


The output y of the network is compared with the expected output t, the one-hot encoding of the word we want the network to predict; the mismatch between the two is what training will reduce.


That is all we need to know about the architecture; let us look at a concrete example.


Suppose we want to train word2vec on the following toy corpus:
"I like playing football"
We will use the simplified CBOW model with one context word, shown in Figure 2.
The corpus contains 4 distinct words, so V = 4; we also choose a hidden layer of size N = 2.



The vocabulary is:


Vocabulary = ["I", "like", "playing", "football"]


Both the word fed to the network and the word it has to predict are taken from this vocabulary, and both are represented with one-hot encoding.



Note that the only trainable parameters of this network are the two matrices W and W′; after training, the rows of W are exactly the word vectors produced by word2vec.


3.1 The loss function


Given a one-hot encoded input vector x, the forward pass through the network of Figure 2 is:


$$\begin{aligned} h &= W^T x \\ u &= W'^T h = W'^T W^T x \\ y &= \mathrm{Softmax}(u) = \mathrm{Softmax}(W'^T W^T x) \end{aligned}$$


where h is the hidden layer, u is usually called the score vector, and y is the output of the network, a probability distribution over the vocabulary.


The training data consist of pairs (wt, wc) of a target word wt and a context word wc. The input of the network is the one-hot encoding of the context word wc.


The expected output t is the one-hot encoding of the target word wt (this is the vector the network should ideally produce).


Since the output layer is a softmax, a natural loss function is the cross-entropy, that is, minus the log-probability assigned to the target word:


$$L = -\log P(w_t \mid w_c) = -\log y_{j^*} = -\log[\mathrm{Softmax}(u_{j^*})] = -\log\left(\frac{\exp u_{j^*}}{\sum_i \exp u_i}\right),$$


where j* is the index of the target word wt in the vocabulary.
Expanding the logarithm of the fraction, we obtain Eq. (1):


$$L = -u_{j^*} + \log \sum_i \exp(u_i). \qquad (1)$$


Let us now compute this loss in a concrete numerical example.


"I like play football", , "I" "like", , x=(1,0,0,0) β€” "I", Λ†y=(0,1,0,0), "like".


As is usual when training word2vec, we initialize the weights with random values. Suppose the 4×2 matrix W is


$$W = \begin{pmatrix} -1.38118728 & 0.54849373 \\ 0.39389902 & -1.1501331 \\ -1.16967628 & 0.36078022 \\ 0.06676289 & -0.14292845 \end{pmatrix}$$


and the 2×4 matrix W′ is


$$W' = \begin{pmatrix} 1.39420129 & -0.89441757 & 0.99869667 & 0.44447037 \\ 0.69671796 & -0.23364341 & 0.21975196 & -0.0022673 \end{pmatrix}$$


"I like" :


$$h = W^T x = \begin{pmatrix} -1.38118728 \\ 0.54849373 \end{pmatrix}$$



$$u = W'^T h = \begin{pmatrix} -1.54350765 \\ 1.10720623 \\ -1.25885456 \\ -0.61514042 \end{pmatrix}$$



$$y = \mathrm{Softmax}(u) = \begin{pmatrix} 0.05256567 \\ 0.7445479 \\ 0.06987559 \\ 0.13301083 \end{pmatrix}$$


The second component of y is the probability that the network assigns to the word "like", so the loss is


$$L = -\log P(\text{"like"} \mid \text{"I"}) = -\log y_2 = -\log(0.7445479) = 0.2949781.$$


The same value follows directly from Eq. (1):


$$L = -u_2 + \log \sum_{i=1}^{4} \exp(u_i) = -1.10720623 + \log[\exp(-1.54350765) + \exp(1.10720623) + \exp(-1.25885456) + \exp(-0.61514042)] = 0.2949781.$$


, "like play", , .


3.2 Backpropagation for the CBOW model


So far the network has only made a prediction; to train it, we need a rule that tells us how to change W and W′ so that the loss decreases. Backpropagation provides exactly this rule.


The loss in Eq. (1) is a function of the weights W and W′, and gradient descent requires the derivatives ∂L/∂W and ∂L/∂W′.


Note that Eq. (1) depends on W and W′ only through the components of the score vector u = [u1, ..., uV], that is,


$$L = L(u(W, W')) = L(u_1(W,W'), u_2(W,W'), \dots, u_V(W,W')).$$


Applying the chain rule for functions of several variables, we get:


$$\frac{\partial L}{\partial W'_{ij}} = \sum_{k=1}^{V} \frac{\partial L}{\partial u_k}\frac{\partial u_k}{\partial W'_{ij}} \qquad (2)$$



$$\frac{\partial L}{\partial W_{ij}} = \sum_{k=1}^{V} \frac{\partial L}{\partial u_k}\frac{\partial u_k}{\partial W_{ij}}. \qquad (3)$$


Let us now evaluate the two sums in Eqs. (2) and (3).


Start with Eq. (2). The element W′_ij of the matrix W′ connects neuron i of the hidden layer with neuron j of the output layer, and therefore it affects only the score u_j (and hence only y_j), as illustrated in Figure 3(a).



Figure 3. (a) The output y_j is connected to the hidden neuron h_i through a single element W′_ij of the matrix W′. (b) In contrast, the input x_k is connected to all N hidden neurons through the elements W_k1, …, W_kN of the matrix W.


Consequently, among the derivatives ∂u_k/∂W′_ij only the one with k = j is non-zero; all the others vanish.


Equation (2) therefore collapses to Eq. (4):


$$\frac{\partial L}{\partial W'_{ij}} = \frac{\partial L}{\partial u_j}\frac{\partial u_j}{\partial W'_{ij}} \qquad (4)$$


The first factor, ∂L/∂u_j, is obtained by differentiating Eq. (1); the result is Eq. (5):


$$\frac{\partial L}{\partial u_j} = -\delta_{jj^*} + y_j := e_j \qquad (5)$$


where δ_{jj*} is the Kronecker delta: it equals 1 when j = j*, that is, for the output neuron of the target word, and 0 otherwise.


Equation (5) defines the prediction error e_j of the j-th output neuron; collecting these errors into a vector e (of the same size V as the output layer) will let us write the final results in a compact form.


The second factor in Eq. (4) is given by Eq. (6):


$$\frac{\partial u_j}{\partial W'_{ij}} = \sum_{k=1}^{V} W_{ki}\, x_k \qquad (6)$$


Substituting Eqs. (5) and (6) into Eq. (4), we obtain Eq. (7):


$$\frac{\partial L}{\partial W'_{ij}} = (-\delta_{jj^*} + y_j)\left(\sum_{k=1}^{V} W_{ki}\, x_k\right) \qquad (7)$$


The derivative ∂L/∂W_ij takes a little more work: the input x_k reaches every output y_j through all the elements of W, as shown in Figure 3(b), so the sum over k in Eq. (3) does not reduce to a single term. To find ∂u_k/∂W_ij, first write u_k explicitly in terms of the elements of W and W′:


$$u_k = \sum_{m=1}^{N}\sum_{l=1}^{V} W'_{mk}\, W_{lm}\, x_l\,.$$


When we differentiate with respect to W_ij, only the term with l = i and m = j survives, which gives Eq. (8):


$$\frac{\partial u_k}{\partial W_{ij}} = W'_{jk}\, x_i\,. \qquad (8)$$


Substituting Eqs. (5) and (8) into Eq. (3), we obtain Eq. (9):


$$\frac{\partial L}{\partial W_{ij}} = \sum_{k=1}^{V} (-\delta_{kk^*} + y_k)\, W'_{jk}\, x_i \qquad (9)$$


Equations (7) and (9) can be written in a compact vector form. Equation (7) becomes


$$\frac{\partial L}{\partial W'} = (W^T x) \otimes e \qquad (10)$$


where ⊗ denotes the outer product of two vectors.


and Eq. (9) becomes:


$$\frac{\partial L}{\partial W} = x \otimes (W' e) \qquad (11)$$


3.3 Updating the weights


Having the gradients (7) and (9), or equivalently their vector forms (10) and (11), we can update the weights by a step of gradient descent. Choosing a learning rate η > 0, we set:


$$W_{\text{new}} = W_{\text{old}} - \eta\,\frac{\partial L}{\partial W}, \qquad W'_{\text{new}} = W'_{\text{old}} - \eta\,\frac{\partial L}{\partial W'}$$
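As a sanity check, here is a minimal numpy sketch of one such training step for the single-context-word CBOW model, putting together the forward pass, the error (5), the gradients (10) and (11), and the update rule (the learning rate value here is arbitrary):

```python
import numpy as np

def train_step(W, W_prime, x, j_star, eta=0.1):
    """One gradient-descent step for CBOW with a single context word."""
    h = W.T @ x                        # hidden layer
    u = W_prime.T @ h                  # scores
    y = np.exp(u) / np.exp(u).sum()    # softmax output

    e = y.copy()                       # prediction error, Eq. (5): e_j = y_j - delta_{j j*}
    e[j_star] -= 1.0

    dW_prime = np.outer(h, e)          # Eq. (10): (W^T x) outer e
    dW = np.outer(x, W_prime @ e)      # Eq. (11): x outer (W' e)

    W -= eta * dW                      # gradient-descent update
    W_prime -= eta * dW_prime
    return W, W_prime
```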


3.4 The training loop


Let us put the pieces together. Training proceeds pair by pair: we take a (context, target) pair from the corpus, run the forward pass, compute the prediction error, obtain the gradients from Eqs. (10) and (11), and update W and W′. Going through all the pairs of the corpus several times gradually drives the loss down, and the rows of W become the word vectors we were after.


4. CBOW with a multi-word context


The multi-word CBOW model is a direct generalization of the network we have just analyzed: instead of a single context word, the input now consists of C context words, each one-hot encoded, as shown in Figure 4. This is the form of CBOW actually used in word2vec, and its analysis repeats the previous section almost verbatim.



Figure 4. The CBOW model with a multi-word context.


The equations of the multi-word CBOW model differ from those of the single-word CBOW only in the definition of the hidden layer, which is now an average over the context words:


$$\begin{aligned} h &= \frac{1}{C} W^T \sum_{c=1}^{C} x^{(c)} = W^T \bar{x} \\ u &= W'^T h = \frac{1}{C}\sum_{c=1}^{C} W'^T W^T x^{(c)} = W'^T W^T \bar{x} \\ y &= \mathrm{Softmax}(u) = \mathrm{Softmax}(W'^T W^T \bar{x}) \end{aligned}$$


where we introduced the 'averaged' input vector x̄ = Σ_{c=1}^{C} x^{(c)} / C.


The loss function keeps the same form as before; it is simply conditioned on all C context words:


$$L = -\log P(w_o \mid w_{c,1}, w_{c,2}, \dots, w_{c,C}) = -u_{j^*} + \log\sum_i \exp(u_i). \qquad (12)$$


As before, the chain rule gives:


$$\frac{\partial L}{\partial W'_{ij}} = \sum_{k=1}^{V}\frac{\partial L}{\partial u_k}\frac{\partial u_k}{\partial W'_{ij}} \qquad (13)$$



$$\frac{\partial L}{\partial W_{ij}} = \sum_{k=1}^{V}\frac{\partial L}{\partial u_k}\frac{\partial u_k}{\partial W_{ij}}. \qquad (14)$$


These derivatives are computed exactly as in the single-word CBOW case, with x replaced by x̄. For W′_ij we get


$$\frac{\partial L}{\partial W'_{ij}} = \sum_{k=1}^{V}\frac{\partial L}{\partial u_k}\frac{\partial u_k}{\partial W'_{ij}} = \frac{\partial L}{\partial u_j}\frac{\partial u_j}{\partial W'_{ij}} = (-\delta_{jj^*}+y_j)\left(\sum_{k=1}^{V} W_{ki}\,\bar{x}_k\right) \qquad (15)$$


and for W_ij:


$$\frac{\partial L}{\partial W_{ij}} = \sum_{k=1}^{V}\frac{\partial L}{\partial u_k}\frac{\partial}{\partial W_{ij}}\left(\frac{1}{C}\sum_{m=1}^{N}\sum_{l=1}^{V} W'_{mk}\sum_{c=1}^{C} W_{lm}\, x^{(c)}_l\right) = \frac{1}{C}\sum_{k=1}^{V}\sum_{c=1}^{C}(-\delta_{kk^*}+y_k)\, W'_{jk}\, x^{(c)}_i. \qquad (16)$$


In terms of the averaged input x̄ the results read:


$$\frac{\partial L}{\partial W'_{ij}} = (-\delta_{jj^*}+y_j)\left(\sum_{k=1}^{V} W_{ki}\,\bar{x}_k\right) \qquad (17)$$



$$\frac{\partial L}{\partial W_{ij}} = \sum_{k=1}^{V}(-\delta_{kk^*}+y_k)\, W'_{jk}\,\bar{x}_i. \qquad (18)$$


As before, Eqs. (17) and (18) can be put in vector form. Equation (17) becomes:


$$\frac{\partial L}{\partial W'} = (W^T \bar{x}) \otimes e \qquad (19)$$


and Eq. (18):


$$\frac{\partial L}{\partial W} = \bar{x} \otimes (W' e) \qquad (20)$$


These are the same expressions as for the single-word CBOW model, with x replaced by the averaged input x̄; ⊗ again denotes the outer product.
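The only change with respect to the single-word code sketched earlier is the averaging of the context vectors; a possible version (again just a sketch, with illustrative names):

```python
import numpy as np

def cbow_gradients(W, W_prime, context_xs, j_star):
    """Gradients (19)-(20) for CBOW with C one-hot context vectors in context_xs."""
    x_bar = np.mean(context_xs, axis=0)   # averaged input vector
    h = W.T @ x_bar
    u = W_prime.T @ h
    y = np.exp(u) / np.exp(u).sum()

    e = y.copy()                          # prediction error e = y - t
    e[j_star] -= 1.0

    dW_prime = np.outer(h, e)             # Eq. (19)
    dW = np.outer(x_bar, W_prime @ e)     # Eq. (20)
    return dW, dW_prime
```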


5. Skip-gram


The skip-gram model is the mirror image of CBOW: the input is the central word, and the network has to predict its C context words. The architecture is shown in Figure 5.



Figure 5. The skip-gram model.


The equations of the skip-gram model are:


$$\begin{aligned} h &= W^T x \\ u_c &= W'^T h = W'^T W^T x, \qquad c = 1, \dots, C \\ y_c &= \mathrm{Softmax}(u_c) = \mathrm{Softmax}(W'^T W^T x), \qquad c = 1, \dots, C \end{aligned}$$


Since all the score vectors u_c (and hence all the outputs) are computed from the same h and the same W′, they are identical: y_1 = y_2 = ⋯ = y_C. Each of them, however, is compared with a different context word, and the loss function becomes:


$$L = -\log P(w_{c,1}, w_{c,2}, \dots, w_{c,C} \mid w_o) = -\log \prod_{c=1}^{C} \frac{\exp(u_{c,j_c^*})}{\sum_{j=1}^{V} \exp(u_{c,j})} = -\sum_{c=1}^{C} u_{c,j_c^*} + \sum_{c=1}^{C} \log \sum_{j=1}^{V} \exp(u_{c,j})$$


where j_c^* is the index of the c-th true context word in the vocabulary. In the skip-gram model the loss therefore depends on C×V score components:


$$L = L(u_1(W,W'), u_2(W,W'), \dots, u_C(W,W')) = L(u_{1,1}(W,W'), u_{1,2}(W,W'), \dots, u_{C,V}(W,W'))$$


and the chain rule now contains a sum over the context index as well:


$$\frac{\partial L}{\partial W'_{ij}} = \sum_{k=1}^{V}\sum_{c=1}^{C}\frac{\partial L}{\partial u_{c,k}}\frac{\partial u_{c,k}}{\partial W'_{ij}}$$



$$\frac{\partial L}{\partial W_{ij}} = \sum_{k=1}^{V}\sum_{c=1}^{C}\frac{\partial L}{\partial u_{c,k}}\frac{\partial u_{c,k}}{\partial W_{ij}}.$$


The derivative ∂L/∂u_{c,j} is computed exactly as before and gives the prediction error of the c-th output:


$$\frac{\partial L}{\partial u_{c,j}} = -\delta_{jj_c^*} + y_{c,j} := e_{c,j}$$


Repeating the same steps as for CBOW, for W′_ij we obtain:


$$\frac{\partial L}{\partial W'_{ij}} = \sum_{k=1}^{V}\sum_{c=1}^{C}\frac{\partial L}{\partial u_{c,k}}\frac{\partial u_{c,k}}{\partial W'_{ij}} = \sum_{c=1}^{C}\frac{\partial L}{\partial u_{c,j}}\frac{\partial u_{c,j}}{\partial W'_{ij}} = \sum_{c=1}^{C}(-\delta_{jj_c^*}+y_{c,j})\left(\sum_{k=1}^{V} W_{ki}\, x_k\right)$$


and for W_ij, in the same way:


$$\frac{\partial L}{\partial W_{ij}} = \sum_{k=1}^{V}\sum_{c=1}^{C}\frac{\partial L}{\partial u_{c,k}}\frac{\partial}{\partial W_{ij}}\left(\sum_{m=1}^{N}\sum_{l=1}^{V} W'_{mk}\, W_{lm}\, x_l\right) = \sum_{k=1}^{V}\sum_{c=1}^{C}(-\delta_{kk_c^*}+y_{c,k})\, W'_{jk}\, x_i.$$


Thus, the final expressions for the skip-gram gradients are:


$$\frac{\partial L}{\partial W'_{ij}} = \sum_{c=1}^{C}(-\delta_{jj_c^*}+y_{c,j})\left(\sum_{k=1}^{V} W_{ki}\, x_k\right) \qquad (21)$$



$$\frac{\partial L}{\partial W_{ij}} = \sum_{k=1}^{V}\sum_{c=1}^{C}(-\delta_{kk_c^*}+y_{c,k})\, W'_{jk}\, x_i. \qquad (22)$$


In vector form, Eq. (21) reads:


$$\frac{\partial L}{\partial W'} = (W^T x) \otimes \sum_{c=1}^{C} e_c \qquad (23)$$


and Eq. (22):


$$\frac{\partial L}{\partial W} = x \otimes \left(W' \sum_{c=1}^{C} e_c\right) \qquad (24)$$
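In code the only difference from CBOW is that the prediction errors of the C outputs are summed before forming the outer products; a sketch under the same assumptions as before:

```python
import numpy as np

def skipgram_gradients(W, W_prime, x, context_indices):
    """Gradients (23)-(24) for skip-gram; context_indices holds the positions j*_c of the true context words."""
    h = W.T @ x
    u = W_prime.T @ h                     # every output shares the same scores
    y = np.exp(u) / np.exp(u).sum()

    e_sum = len(context_indices) * y      # sum over c of the errors e_c = y - t_c
    for j_star in context_indices:
        e_sum[j_star] -= 1.0

    dW_prime = np.outer(h, e_sum)         # Eq. (23)
    dW = np.outer(x, W_prime @ e_sum)     # Eq. (24)
    return dW, dW_prime
```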


6. Conclusions


This concludes the derivation of backpropagation for the word2vec models. I have left out the optimizations described in [2] (such as hierarchical softmax and negative sampling) that make training on large corpora practical; for those, and for a more detailed treatment of the equations derived here, I recommend the excellent paper [1].


I hope this post has helped you understand how backpropagation works in word2vec.


The next step is to implement these equations in your favorite programming language. If you like Python, I have already implemented them in my next post.


Hope to see you there!


References


[1] X. Rong, word2vec Parameter Learning Explained, arXiv:1411.2738 (2014).
[2] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space, arXiv:1301.3781 (2013).

