Key Python Programmer Skills

In these fast-moving times, a programmer needs to keep up and constantly learn new skills in order to remain an in-demand specialist.

I have been programming in Python for about two years, and it felt like the right moment to approach learning new skills deliberately. To do that, I decided to analyze job vacancies and present the required skills as a graph. I expected the skills to form clusters corresponding to different specialties: backend development, data science, and so on. But what does reality look like? First things first.

Data collection


First I had to choose a data source. I considered several options: Habr Career, Yandex Work, HeadHunter and others. HeadHunter seemed the most convenient, because its vacancies include a list of key skills and it has a convenient open API.

Having studied the HeadHunter API, I decided to first collect the list of vacancy ids for a given keyword (in this case, “python”), and then fetch the list of corresponding tags for each vacancy.

Search results are returned page by page, with a maximum of 100 vacancies per page. At first I saved the full results as a list of page responses.

For this, the requests module was used. In the user-agent field, as required by the API, I entered the name of a browser so that HH would treat the script as a regular client. I also added a slight delay between requests so as not to overload the server.

import time
import requests

ses = requests.Session()
ses.headers = {'HH-User-Agent': "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0"}

phrase_to_search = 'python'
url = f'https://api.hh.ru/vacancies?text={phrase_to_search}&per_page=100'
res = ses.get(url)

# getting a list of all responses (one JSON dict per search results page)
res_all = []
for p in range(res.json()['pages']):
    print(f'scraping page {p}')
    url_p = url + f'&page={p}'
    res = ses.get(url_p)
    res_all.append(res.json())
    time.sleep(0.2)

As a result, I got a list of response dictionaries, where each dictionary corresponds to one page of search results.

As it turned out, the hh.ru API limits the number of retrievable vacancies to two thousand, that is, with 100 vacancies per page, there can be at most 20 pages. For the “python” keyword, all 20 pages were returned, which means the real number of Python vacancies is most likely even higher.
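As a quick sanity check, the first page's response can be compared against what paging will actually return. This is a minimal sketch, assuming the search response also carries a found field with the total number of matching vacancies:

total_found = res_all[0]['found']        # total vacancies matching the query (assumed field)
retrievable = res_all[0]['pages'] * 100  # at most 20 pages of 100 vacancies each
print(f'vacancies found: {total_found}, retrievable via paging: {retrievable}')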

To get a list of tags, I did the following:
  • iterated over each page of search results,
  • iterated over each vacancy on the page and got its id,
  • requested the details of that vacancy through the API,
  • if at least one tag was specified in the vacancy, added its list of tags to the overall list.

# parsing vacancy ids, requesting each vacancy page and scraping its tags
tags_list = []
for page_res_json in res_all:
    for item in page_res_json['items']:
        vac_id = item['id']
        vac_res = ses.get(f'https://api.hh.ru/vacancies/{vac_id}')
        vac_json = vac_res.json()
        if len(vac_json["key_skills"]) > 0:  # at least one skill present
            print(vac_id)
            # key_skills is a list of dicts like {"name": "Django"}
            tags = [v for v_dict in vac_json["key_skills"] for _, v in v_dict.items()]
            print(' '.join(tags))
            tags_list.append(tags)
            print()
        time.sleep(0.1)

The tag lists were then saved as a dictionary:

import json

res = {'phrase': phrase_to_search, 'items_number': len(tags_list), 'items': tags_list}
with open(f'./data/raw-tags_{phrase_to_search}.json', 'w') as fp:  # serializing
    json.dump(res, fp)

Interestingly, out of the 2000 vacancies viewed, only 1579 vacancies had tags.

Data formatting


Now the tags need to be processed and translated into a format convenient for displaying as a graph, namely:
  • convert all tags to lowercase, so that “machine learning”, “Machine learning” and “Machine Learning” mean the same thing,
  • calculate the value of each node as the frequency of occurrence of that tag,
  • calculate the value of each link as the frequency with which two tags occur together.

Converting to lowercase, counting the occurrences of each tag, and filtering by node size were done as follows.

# tags_list here is the dictionary loaded from the raw-tags JSON file saved above
tags_list['items'] = [[i.lower() for i in line] for line in tags_list['items']]

# counting word occurrences
flattened_list = [i for line in tags_list['items'] for i in line]
nodes_dict_all = {i: flattened_list.count(i) for i in set(flattened_list)}

# keep only tags that occur more often than the threshold
del_nodes_count = 4  # i.e. drop nodes smaller than 5
nodes_dict = {k: v for k, v in nodes_dict_all.items() if v > del_nodes_count}

Pairwise co-occurrence was calculated as follows. First I created a dictionary whose keys were all possible pairs of tags as tuples and whose values were zero. Then I went through the list of tags and incremented the counter for each pair encountered. Finally, I deleted all the elements whose values were still zero.

import itertools

# tags connection dict initialization: every ordered pair of kept tags starts at zero
formatted_tags = {(tag1, tag2): 0 for tag1, tag2 in itertools.permutations(set(nodes_dict.keys()), 2)}

# count tag co-occurrences within each vacancy
for line in tags_list['items']:
    for tag1, tag2 in itertools.permutations(line, 2):
        if (tag1, tag2) in formatted_tags.keys():
            formatted_tags[(tag1, tag2)] += 1

# filtering out pairs with zero count
for k, v in formatted_tags.copy().items():
    if v == 0:
        del formatted_tags[k]

At the output, I formed a dictionary of the following form:

{
  'phrase': phrase searched,
  'items_number': number of vacancies parsed,
  'items': {
    "nodes": [
      {
        "id": tag name,
        "group": group id,
        "popularity": tag count
      },
    ],
    "links": [
      {
        "source": pair[0],
        "target": pair[1],
        "value": pair count
      },
    ]
  }
}

# in_json is the original raw-tags dictionary (with 'phrase', 'items_number' and 'items')
nodes = []
links = []
for pair, count in formatted_tags.items():
    links.append({"source": pair[0], "target": pair[1], "value": count})

# split nodes into groups by popularity (used later for coloring)
max_count = max(list(nodes_dict.values()))
count_step = max_count // 7
for node, count in nodes_dict.items():
    nodes.append({"id": node, "group": count // count_step, "popularity": count})

data_to_dump = in_json.copy()
data_to_dump['items'] = {"nodes": nodes, "links": links}
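The article does not show the final save step, but the resulting dictionary presumably gets written to a JSON file that the visualization can load later; a minimal sketch of that step (the output file name is my own choice for illustration):

# serialize the graph-ready dictionary; the path below is illustrative
with open(f'./data/formatted-tags_{phrase_to_search}.json', 'w') as fp:
    json.dump(data_to_dump, fp, ensure_ascii=False)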

Python visualization


To visualize the graph, I used the networkx module. This is what I got on the first attempt, without filtering the nodes.
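The plotting code itself is not shown in the article, so here is a minimal sketch of how such a graph can be drawn with networkx and matplotlib from the nodes and links lists built above; the layout and styling parameters are my own assumptions:

import networkx as nx
import matplotlib.pyplot as plt

# build an undirected graph from the prepared node and link lists
G = nx.Graph()
for node in nodes:
    G.add_node(node['id'], popularity=node['popularity'])
for link in links:
    G.add_edge(link['source'], link['target'], weight=link['value'])

# a spring layout pulls frequently co-occurring skills closer together
pos = nx.spring_layout(G, k=0.5, seed=42)
sizes = [G.nodes[n]['popularity'] * 10 for n in G.nodes]
nx.draw_networkx_nodes(G, pos, node_size=sizes)
nx.draw_networkx_edges(G, pos, alpha=0.3)
nx.draw_networkx_labels(G, pos, font_size=6)
plt.axis('off')
plt.show()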



This visualization looks more like a ball of tangled threads than a skills graph. The links crisscross the graph so densely that it is impossible to make out the nodes. In addition, there are too many nodes on the graph, some so small that they carry no statistical significance.

Therefore, I filtered out the smallest nodes (those smaller than 5) and made the links gray. In this picture, I had not yet converted the tags to lowercase, but I did try removing the largest node, python, to declutter the links.
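A sketch of the same drawing step with those adjustments, assuming the graph G from the snippet above; the threshold and the removal of the python node follow the description here, while the colors and layout parameters remain my own assumptions:

# keep only nodes of size 5 or more and drop the dominant 'python' node
min_size = 5
keep = [n for n in G.nodes if G.nodes[n]['popularity'] >= min_size and n != 'python']
G_small = G.subgraph(keep)

pos = nx.spring_layout(G_small, k=0.5, seed=42)
sizes = [G_small.nodes[n]['popularity'] * 10 for n in G_small.nodes]
nx.draw_networkx_nodes(G_small, pos, node_size=sizes)
nx.draw_networkx_edges(G_small, pos, edge_color='gray', alpha=0.3)  # gray links
nx.draw_networkx_labels(G_small, pos, font_size=6)
plt.axis('off')
plt.show()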



It has become much better. Now the nodes are separated and the links no longer clutter the visualization. The basic skills have become visible: they sit as large balls in the center of the graph, with the smaller nodes around them. But this graph still has much room for improvement.

JavaScript visualization


I would probably have kept tinkering with this code if my brother had not come to help at that point. He got actively involved in the work and built a beautiful dynamic visualization based on the JavaScript library D3.

It turned out like this.


The dynamic visualization is available here. Note that the nodes can be dragged.

Results Analysis


As we can see, the graph turned out to be quite intertwined, and clearly defined clusters cannot be detected at first glance. Several large nodes that are most in demand stand out immediately: linux, sql, git, postgresql and django. There are also skills of medium popularity and rarely encountered skills.

In addition, it is noticeable that the skills still form clusters by profession, located on different sides of the center:

  • bottom left - data analysis,
  • bottom - databases,
  • bottom right - front-end development,
  • right - testing,
  • top right - web development,
  • top left - machine learning.

This description of the clusters is based on my knowledge and may contain errors, but the idea itself, I hope, is clear.

Based on the results obtained, the following conclusions can be drawn:
  • it is worth mastering the skills that correspond to the large nodes - they will always be useful,
  • it is worth mastering the skills of the cluster that matches your interests.

I hope you enjoyed it and this analysis will be useful to you.

You can take a look at the code or take part in its development via the links: GitHub project, Observable notebook with the visualization.

Good luck mastering new horizons!
