How we searched for candidates using machine learning

To find real talent, companies have to come up with unusual ways to search, and EPAM likes looking for new ways to solve familiar problems. This experiment began when our recruiters asked colleagues from the Data practice to think about how to build a system for finding candidates for the company's open vacancies: one that would reduce the time spent finding a relevant candidate in open sources*, and increase both the quality and the number of good candidates. Our Data Science team took up the task together with students from the EPAM Training Center. Below, I describe the main approaches that can solve this problem, our solution, and the results. The post turned out to be more of a reference, but seen through the prism of a specific business case. I have also tried to leave links where they seemed relevant, so that you can learn more about a particular technology or approach.

* Open sources are sites and resources where candidates (users) post information about themselves. Access to these resources is not restricted, including by their licenses and terms of service.



Task


Typically, automation means process optimization. In our case, the goal was formulated as increasing the efficiency of the candidate search, where efficiency means finding the most suitable candidates for vacancies with minimal resources.


, (). , . ( ) , , . : , , , . , .


, : , , , .


β€” . , . , , .


:


. , . : , , , β€” . , () .


Method #1. One-Hot Encoding (OHE)


Each candidate is represented as a vector with one coordinate per skill: 1 if the candidate has the skill, 0 if not.


This approach is simple, but it has several disadvantages. Perhaps the main one is that the skills in the resulting space are orthogonal to each other, so we cannot measure how similar they are. Most likely, it is not so important for us to distinguish between skills such as Java7 and Java8, but it would be useful to distinguish both from skills that are completely unrelated to the position of a Java developer. With this approach, Java7 is as far from Java8 as it is from Python.
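The orthogonality problem can be seen in a few lines. The following is a minimal sketch with a hypothetical three-skill vocabulary, not the code used in the project: under one-hot encoding, the cosine similarity between any two distinct skills is zero, whether or not they are related.

```python
import math

# Hypothetical skill vocabulary, chosen only for illustration.
SKILLS = ["Java7", "Java8", "Python"]


def one_hot(skill, vocabulary):
    """Return a vector with 1.0 at the skill's position and 0.0 elsewhere."""
    return [1.0 if s == skill else 0.0 for s in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


java7, java8, python = (one_hot(s, SKILLS) for s in SKILLS)

# Related and unrelated skills look equally dissimilar.
sim_java = cosine(java7, java8)
sim_python = cosine(java7, python)
print(sim_java, sim_python)
```

Both similarities come out equal, which is exactly why a one-hot space cannot tell a neighbouring skill from an unrelated one.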

Another disadvantage is that we cannot distinguish specific skills from popular ones that are common throughout the sample. Popular skills add noise to the search and make it harder to tell candidates apart and to find similar ones.


An easy way to reduce the influence of popular skills on the search is to replace binary values with weights based on how often a skill occurs in the sample as a whole and in individual documents. The TF-IDF method does exactly this. Even so, we still cannot assess how similar the skills are to each other.
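As an illustration of the weighting, here is a minimal sketch of plain TF-IDF (term frequency multiplied by log(N / document frequency)) over hypothetical candidate profiles. The profile names and skills are invented for the example, and library implementations typically use smoothed variants of this formula.

```python
import math
from collections import Counter

# Hypothetical candidate profiles: each is a list of skill mentions.
profiles = {
    "alice": ["git", "java", "java", "spring"],
    "bob": ["git", "python", "pandas"],
    "carol": ["git", "java", "kotlin"],
}


def tf_idf(docs):
    """Return {doc: {term: weight}} with weight = tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many profiles each skill appears.
    df = Counter(term for terms in docs.values() for term in set(terms))
    weights = {}
    for name, terms in docs.items():
        counts = Counter(terms)
        weights[name] = {
            term: (count / len(terms)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


weights = tf_idf(profiles)
for name, row in weights.items():
    print(name, {term: round(w, 3) for term, w in row.items()})
```

A skill that every profile mentions ("git" here) gets weight zero, while a skill that appears in only one profile is weighted highest.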


Method #2. Matrix Factorization


Representing candidates in a space where each skill is a separate coordinate is redundant, since some skills are almost identical. Similar skills can therefore be collapsed into a smaller number of factors (also called components or latent features). One way to find such components is the family of matrix factorization methods.
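To make the idea of collapsing skills into latent factors concrete, here is a minimal sketch under stated assumptions: a hypothetical candidate-by-skill matrix, and a rank-2 factorization obtained from the top eigenvectors of A^T A by power iteration with deflation (equivalent to a truncated SVD). This is one of several possible factorization methods, chosen because it fits in pure Python; it is not necessarily the one used in the project.

```python
import math

SKILLS = ["Java7", "Java8", "Spring", "Python", "Pandas"]
# Hypothetical candidate-by-skill matrix (rows: candidates, columns: SKILLS).
A = [
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]


def matvec(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


def top_eigenpair(m, iterations=200):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    v = [1.0] * len(m)
    for _ in range(iterations):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(m, v)))
    return eigenvalue, v


def skill_embeddings(a, rank):
    """Return one `rank`-dimensional vector per skill (column of `a`)."""
    n = len(a[0])
    # Skill co-occurrence (Gram) matrix A^T A.
    gram = [[sum(row[i] * row[j] for row in a) for j in range(n)] for i in range(n)]
    factors = []
    for _ in range(rank):
        value, vec = top_eigenpair(gram)
        # Scale the eigenvector by the corresponding singular value.
        factors.append([math.sqrt(max(value, 0.0)) * x for x in vec])
        # Deflate: remove the found component before searching for the next one.
        gram = [[gram[i][j] - value * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return [[f[i] for f in factors] for i in range(n)]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


emb = dict(zip(SKILLS, skill_embeddings(A, rank=2)))
print(cosine(emb["Java7"], emb["Java8"]), cosine(emb["Java7"], emb["Python"]))
```

Unlike the one-hot space, the factorized space places Java7 and Java8 close together and keeps both far from Python, because the factors are driven by which skills co-occur in the same profiles.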


User-Skills , . . β€” (skills’ embedding). , β€” , , .


, , . , . β€” , . , .


, , .


#3.





, . , β€” . , , , , ( supervised ), , , , , , , (unsupervised ). .


, . , , , .


β€” , .


,

β€” StarSpace. «», . , , , , , .

, , . , .

#4.


, β€” .


, . , , β€” , β€” . , β€” . β€” - β€” , , . , , .



, β€” .


Nodes (for example, candidates) can be similar to one another: they may belong to the same community, share interests, work at the same company, or have other characteristics in common. This property is called homophily (uniformity). On the other hand, nodes from different groups can be alike because they play the same role within their groups: leaders, assistant leaders, information keepers, communicators, outsiders. If we compared two graphs, we could see that leaders in one graph play the same role as leaders in the other. This is called structural similarity.


One way or another, graph representation methods try to construct a space that takes into account both the homophily and the structural equivalence of nodes in a graph.


Graph factorization


First of all, we consider a method based on graph factorization.

, : , .. β€” 1, β€” 0. , .


, .


Word2vec-like approaches


( , ) . , , , . . , , . , β€” w2v(skip-gram), doc2vec. ( word2vec).


You can read more about graph representation methods of this kind in, for example, DeepWalk, Node2vec, and Graph2vec.
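The common first step of DeepWalk-style methods is to turn the graph into "sentences" by random walks, which can then be fed to a skip-gram model such as word2vec. The sketch below shows only that first step with uniform random walks; the graph of candidates and repositories is hypothetical, and Node2vec would additionally bias the choice of the next node.

```python
import random

# Hypothetical undirected graph: node -> list of neighbours.
GRAPH = {
    "alice": ["repo_a", "repo_b"],
    "bob": ["repo_b"],
    "carol": ["repo_a"],
    "repo_a": ["alice", "carol"],
    "repo_b": ["alice", "bob"],
}


def random_walk(graph, start, length, rng):
    """Uniform random walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph[walk[-1]]
        if not neighbours:
            break  # dead end: stop the walk early
        walk.append(rng.choice(neighbours))
    return walk


def build_corpus(graph, walks_per_node, length, seed=0):
    """Generate `walks_per_node` walks from every node, in shuffled order."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        nodes = list(graph)
        rng.shuffle(nodes)
        for node in nodes:
            corpus.append(random_walk(graph, node, length, rng))
    return corpus


corpus = build_corpus(GRAPH, walks_per_node=2, length=5)
for walk in corpus[:3]:
    print(" -> ".join(walk))
```

Each walk plays the role of a sentence and each node the role of a word, so nodes that often appear in the same walks end up with similar vectors after skip-gram training.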



Convolutional Networks on Graphs


The idea here is similar to the previous method: we traverse the graph and use information about a node's neighbors to represent that node. In addition, information about the overall structure of the graph and the node's own features is used when training the representation. The main innovation of these methods is that the model normalizes the values of each node so that two nodes end up closer in the latent space the more similar their structural roles in the subgraph are.


This procedure is called graph convolution.
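A single aggregation step of this kind can be sketched as follows. This is a simplified illustration, assuming the symmetric normalization commonly used in graph convolutional networks (each neighbor's contribution is divided by the square root of the product of the two node degrees, with self-loops added) and omitting the learned weight matrix and nonlinearity. The three-node graph and its features are hypothetical.

```python
import math

# Hypothetical path graph a - b - c with identical input features.
GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
FEATURES = {"a": [1.0], "b": [1.0], "c": [1.0]}


def propagate(graph, features):
    """One step of symmetric-normalised neighbourhood aggregation with self-loops."""
    # Degree counts the node itself, as if a self-loop had been added.
    degree = {node: len(neigh) + 1 for node, neigh in graph.items()}
    updated = {}
    for node, neigh in graph.items():
        dim = len(features[node])
        acc = [0.0] * dim
        for other in [node] + list(neigh):
            weight = 1.0 / math.sqrt(degree[node] * degree[other])
            for k in range(dim):
                acc[k] += weight * features[other][k]
        updated[node] = acc
    return updated


hidden = propagate(GRAPH, FEATURES)
print(hidden)
```

The two end nodes, which play the same structural role, receive identical representations, while the middle node receives a different one even though all three started with the same features.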



More details can be found here:


,

PyTorch BigGraph β€” Facebook Research. , . , , .

:


: β€” , . , , , .


, β€” IT-. , , IT-, (.. ), , .


GitHub (github.com, Terms of Service), . , GitHub API GitHub Archive, GitHub , .

GitHub . : ( , ), , , , , (), , , , , , .


GitHub , , . - , ; (), , . , , .




GitHub, embedding, .. . , .




.




, embedding.




embedding , , β€” .




. .


, GitHub , . , , , .




4 , , 5 . , , , , : Java, JavaScript, Python, DevOps, Data Science. 3500 . , , 35% , 65% β€” . , . , , Java Developer β€” 60%, , , . , DevOps, , . , β€” 25,5% .


What we achieved


  • The share of relevant candidates recommended by the model is comparable with that of other systems, including job search resources.
  • We increased the internal candidate base by several hundred by adding a source that had not been used before.
  • The time taken to find one candidate fell by 29% compared with other "cold" search sources (that is, sources that are not used for direct job search).
  • We were able to handle requests involving rare skills more efficiently.
  • We hired a few senior engineers who were not actively looking for a job.

What I would like to improve


The resulting solution has shortcomings that we have not yet been able to resolve:
  • There is still no good way to assess candidates' level of proficiency.
  • GitHub , .
  • , , GitHub.
  • , .


, , , , .


, , , .

