Graphs for Dummies: A Step-by-Step Guide

Earlier, we published a post in which we used graphs to analyze the communities that formed around Boiling Point sites in different Russian cities. Now we want to explain how to build such graphs and analyze them.



Below is a step-by-step guide for those who have long wanted to get into graph visualization and were waiting for the right occasion.


1. Choice of hypothesis


If you mindlessly load data into a graphing program and try to visualize whatever comes out, the result will not please you. So first formulate what you want to learn from the graph, and come up with a viable hypothesis.

To do this, figure out what data you already have, which parts of it can be represented as "objects", and what the "connections" between them are. There are usually far fewer objects than connections, which gives you a quick sanity check.

We prepared our test case together with the Boiling Point team from Tomsk, so all the data on events and their participants comes from there. We wondered whether the participants in these events had formed a community, and what it looked like in terms of the participants' affiliation with business, universities and government.

We assumed that people who attended the same event are connected to each other, and that the more often they attended events together, the stronger the connection.

In the second case, we decided to find out how the participants' affiliation with one of the NTI markets (our key areas) relates to the end-to-end technologies that interest them. Is the distribution even? Are there any hot topics? For this analysis, we took data on event participants from 200 Tomsk technology companies.

In principle, even such initial formulations of hypotheses are enough to proceed to the second step.

2. Data preparation


Now that you have decided what you want to find out, take the entire data set, see what information about the "objects" it stores, throw out everything unnecessary and add what is missing. If the data is spread across several sources, first merge everything into one place and remove duplicates.

Here is an example. We had data on the participants of 650 events. Roughly speaking, that is 650 Excel tables with about 23,000 entries in them, containing the fields “Leader ID”, “Position” and “Organization”. To build a graph, we need just one unique identifier (fortunately, there is one here, the Leader ID) and an attribute that ties each participant to one of the three areas under consideration: government, business or universities. We do not have that attribute yet.

To get it, you can take the head-on approach: in each of the 650 files, remove the extra columns, add a new field and fill it in for every row, for example “1” for government, “2” for business and “3” for education and science. Or you can first combine all 650 files into one large list, remove duplicates, and only then add the new values. In the first case, the work will take 1-2 months. In the second, 1-2 weeks.

In general, when adding new attributes, try to group the data first. For example, you can sort the participants by company / organization and set the attribute in bulk.
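
To make the "merge first, tag later" order concrete, here is a minimal sketch. It assumes the per-event tables have already been read into lists of rows; the organization names, the `SECTOR_BY_ORGANIZATION` mapping and the `build_participants` helper are made up for illustration and are not part of our actual data or tooling.

```python
# Hypothetical rows, as they might look after reading the per-event tables.
event_tables = [
    [  # event A
        {"Leader ID": 1, "Position": "Director", "Organization": "Example LLC"},
        {"Leader ID": 2, "Position": "Professor", "Organization": "Example University"},
    ],
    [  # event B
        {"Leader ID": 1, "Position": "Director", "Organization": "Example LLC"},
        {"Leader ID": 5, "Position": "Advisor", "Organization": "Example City Administration"},
    ],
]

# Coding from the text: 1 = government, 2 = business, 3 = education and science.
# The mapping itself is invented; in practice it is filled in by hand, once per organization.
SECTOR_BY_ORGANIZATION = {
    "Example City Administration": 1,
    "Example LLC": 2,
    "Example University": 3,
}

def build_participants(tables, sector_by_org):
    """Merge all tables, keep one row per Leader ID, then tag sectors in bulk."""
    organization_of = {}
    for table in tables:
        for row in table:
            # setdefault keeps the first occurrence and drops later duplicates
            organization_of.setdefault(row["Leader ID"], row["Organization"])
    return {
        leader_id: sector_by_org.get(org)  # None marks an organization still to classify
        for leader_id, org in organization_of.items()
    }

participants = build_participants(event_tables, SECTOR_BY_ORGANIZATION)
```

The point of the design is that the manual work scales with the number of distinct organizations rather than with the number of rows.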

Let's continue preparing the data. To load it into most visualization programs, you will need to create two files: one with a list of vertices and one with a list of edges.



The vertex file in our case contained two columns: Id (the vertex number) and Label (its type). The edge file also contained two columns: Source (the id of the initial vertex) and Target (the id of the final vertex).

How do you turn the fact that participants 1, 2, 5 and 23 attended the same event into edges? You need to create six rows, connecting each participant with every other (four participants give 4 × 3 / 2 = 6 pairs): 1 and 2, 1 and 5, 1 and 23, 2 and 5, 2 and 23, 5 and 23.
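
The pairing rule above can be sketched in a few lines. This is only an illustration: the function name `co_attendance_edges` and the second event are made up, and counting shared events into a weight is one possible way to record how often two people met.

```python
from collections import Counter
from itertools import combinations

def co_attendance_edges(events):
    """Count, for every pair of participants, how many events they shared."""
    weights = Counter()
    for participants in events:
        # set() drops repeated ids, sorted() makes (1, 2) and (2, 1) the same edge
        for source, target in combinations(sorted(set(participants)), 2):
            weights[(source, target)] += 1
    return weights

events = [
    [1, 2, 5, 23],  # the event from the text
    [2, 5],         # a second, made-up event
]
edges = co_attendance_edges(events)
```

Each key of `edges` is one (Source, Target) row of the edge table, and the count tells how many events the pair attended together.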

In our second example, the tables looked like this:



Here the vertices are the markets and the end-to-end technologies. If, say, a representative of a company belonging to the Technet market (ID = 4) attended an event on the topic “Big Data and AI” (ID = 17), we add a row to the edge table connecting these vertices (Source = 4, Target = 17).

Data preparation is the most time-consuming part of the process, so be patient.
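
As a rough sketch of what the two files for this second example could contain, the snippet below renders a vertex table and an edge table as CSV text. The column names follow the ones described earlier (Id, Label, Source, Target); the helper `to_csv`, the use of names as labels and the two attendance records are assumptions made for illustration.

```python
import csv
import io

# Hypothetical vertex table: markets and technologies share one Id space.
vertices = {4: "Technet", 17: "Big Data and AI"}

# Hypothetical attendance records: (market Id of the company, topic Id of the event).
attendance = [(4, 17), (4, 17)]

def to_csv(header, rows):
    """Render rows as CSV text with the given header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

nodes_csv = to_csv(["Id", "Label"], sorted(vertices.items()))
edges_csv = to_csv(["Source", "Target"], attendance)
```

Here every attendance becomes its own row, so a market and a technology that meet often simply appear in many rows.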

3. Graph visualization


So, the data tables are ready, and you can look for a tool to present them as a graph. For visualization, we used Gephi, a powerful open-source tool that can process graphs with hundreds of thousands of vertices and edges. You can download it from the official site.

I will take screenshots from the second project, which had a small number of vertices and edges, so that everything is as clear as possible.

First of all, we need to load the tables with vertices and edges. To do this, select the CSV import item ("Import Spreadsheet" in the English interface) in the "Data Laboratory" tab.



First, load the file with the vertices. On the first screen of the form, indicate that we are importing vertices (nodes), and check that the program has detected the text encoding of the labels correctly.



On the third screen, “Import Report”, it is important to specify the graph type. Ours is undirected.



Load the edges in the same way. On the first screen, indicate that this is a file with edges, and again check the encoding.



An important step awaits us on the third screen, “Import Report”. Here you need not only to indicate that the graph is undirected, but also to load the edges into the same workspace as the vertices. So select "Append to existing workspace".



As a result, we will see the graph in roughly this form (the “Overview” tab):



The edges have different thicknesses depending on the number of connections between the vertices. You can see the weight each edge received on the "Data Laboratory" tab, in the Weight column of the edge table.

What is bad here: all the vertices are the same size and are placed completely at random. We will fix that on the “Overview” tab. First, select Nodes in the upper-left panel and click the icon with circles (“Size”). Next, select Ranking, which lets you set the size of a vertex depending on some parameter. We have only one parameter to choose from, Degree, which shows how many edges are attached to the vertex. Choose the minimum and maximum circle size and click "Apply". The other icons in the same panel let you adjust the color of the vertices and the color of the edges. Now the graph is already easier to read.
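
If it helps to see what sizing by Degree amounts to, here is a small sketch. It assumes a plain linear mapping from degree to circle size, which may differ from what Gephi does internally; the function names and the sample edges are made up.

```python
from collections import Counter

def degrees(edges):
    """Degree of each vertex: the number of edge endpoints that touch it."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree

def rank_sizes(degree, min_size, max_size):
    """Linearly map degrees onto the range [min_size, max_size]."""
    low, high = min(degree.values()), max(degree.values())
    span = high - low or 1  # avoid division by zero when all degrees are equal
    return {
        vertex: min_size + (value - low) * (max_size - min_size) / span
        for vertex, value in degree.items()
    }

edges = [(1, 2), (1, 5), (1, 23), (2, 5)]
sizes = rank_sizes(degrees(edges), min_size=10, max_size=50)
```

In this toy graph vertex 1 has three edges and gets the largest circle, while vertex 23 has one edge and gets the smallest.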



The next thing to do is untangle the graph. You can do this manually by dragging vertices, or you can use the layout algorithms implemented in Gephi.

What do we achieve with a proper layout? Maximum readability. The fewer overlapping vertices and edges and the fewer edge crossings, the better. It is also good when adjacent vertices are placed closer to each other and non-adjacent ones farther apart. And everything should be spread over the visible area rather than squeezed into one heap.

How do you do this in Gephi? The lower-left panel, “Layout”, contains the most popular layout algorithms, which are built on force analogies. Imagine that the vertices are charged balls that repel each other, but some are held together by something like springs. If you set suitable forces and "release" the graph, the vertices will fly apart to the maximum distances the springs allow.

The most uniform picture comes from the Fruchterman and Reingold algorithm. Select Fruchterman Reingold from the drop-down menu and set the size of the layout area. Click the Run button. You will get something like this:
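
The balls-and-springs picture can be tried out in a toy form. The sketch below is not Gephi's implementation of any of its algorithms; it is a simplified force-directed loop in which every pair of vertices repels with force k²/d and every edge pulls its ends together with force d²/k. The function names, the parameter values and the three-vertex chain are all assumptions chosen for illustration.

```python
import math
import random

def force_layout(vertices, edges, iterations=300, k=1.0, step=0.05, seed=0):
    """Toy force-directed layout: every pair repels, linked pairs attract."""
    rng = random.Random(seed)
    pos = {v: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for v in vertices}

    def apply(force, a, b, magnitude_of):
        # Add a force along the line a-b; a positive magnitude pushes them apart.
        dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
        dist = math.hypot(dx, dy) or 1e-9
        f = magnitude_of(dist)
        force[a][0] += dx / dist * f
        force[a][1] += dy / dist * f
        force[b][0] -= dx / dist * f
        force[b][1] -= dy / dist * f

    for _ in range(iterations):
        force = {v: [0.0, 0.0] for v in vertices}
        for i, a in enumerate(vertices):
            for b in vertices[i + 1:]:
                apply(force, a, b, lambda d: k * k / d)  # "charged balls"
        for a, b in edges:
            apply(force, a, b, lambda d: -d * d / k)     # "springs"
        for v in vertices:
            fx, fy = force[v]
            length = math.hypot(fx, fy) or 1e-9
            move = min(length, 1.0) * step               # cap the step size
            pos[v][0] += fx / length * move
            pos[v][1] += fy / length * move
    return pos

def distance(pos, a, b):
    """Euclidean distance between two laid-out vertices."""
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])

# A three-vertex chain 0 - 1 - 2: the two ends share no edge.
layout = force_layout([0, 1, 2], [(0, 1), (1, 2)])
```

After the loop, the two linked pairs should sit closer together than the two unlinked ends of the chain, which is the "adjacent vertices closer, non-adjacent farther apart" effect described above.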



You can help the algorithm by dragging some vertices without stopping it, trying to untangle the graph. But remember that there is no “Undo” button, so you will not be able to return to the previous arrangement of the vertices. Therefore, save a new version of the project before each risky change.

Another useful algorithm is Force Atlas 2. It treats the graph as a set of metal rings connected by springs. The deformed springs set the system in motion; it oscillates and finally settles into a stable position. This algorithm is good for visualizations that emphasize group structure and highlight subsets with a high degree of interaction.

This algorithm has a large number of settings. Let's look at the most important ones. The overlap prevention option ("Prevent Overlap") keeps vertices from overlapping each other. The sparseness setting increases the distance between the vertices, making the graph more readable. The graph also becomes airier if you reduce the influence of edge weights on the relative positions of the vertices.

After playing with the settings, we get the following graph:



Once the graph looks the way you want, proceed to the final touches on the “Preview” tab. Here you can, for example, draw the graph with curved edges, which reduces the overlap between edges and vertices. You can turn on vertex labels and set their font size and color. Finally, you can change the background color. For example, like this:



To save the resulting image, click “Export SVG / PDF / PNG” in the lower-left corner of the window. Also, do not forget to save the project itself through the top menu: “File”, then “Save Project”.

In our case, it was important to highlight the relationship between the end-to-end technologies and the NTI markets, so we manually lined up all the markets in one row in the center and placed everything else above and below. The result is the graph below. For specific tasks like this, there is no way around aligning vertices by hand.



You are probably wondering how we managed to color the vertices in different colors. There is a trick. Go to the “Data Laboratory” tab, create a new column for the vertices and name it “Market”. Fill it in for each vertex: 1 if it is an NTI market, 0 if it is an end-to-end technology. Then go to “Overview”, select the palette icon, then Nodes, then Partition, and choose our new Market attribute as the partition attribute.



For more complex cases, when you need to identify clusters and paint them in different colors, Gephi offers a rich set of statistical calculations whose results can be used for coloring. These calculations are in the right-hand column of the “Overview” tab.



For example, click the “Run” button next to the “Modularity” calculation to get an estimate of how clustered your graph is. If you then color the vertices by Modularity Class, you will get a nice picture like this:
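
To give a feel for what a modularity score measures, here is a sketch that computes the standard modularity value Q for a partition that is already given. Gephi additionally searches for a good partition, which this sketch does not attempt. The graph (two triangles joined by one edge), the community labels and the function name are made up for illustration.

```python
from collections import Counter

def modularity(edges, community):
    """Modularity Q of an undirected, unweighted graph for a given partition."""
    m = len(edges)
    internal = Counter()    # edges with both ends in the same community
    degree_sum = Counter()  # total degree of each community
    for source, target in edges:
        degree_sum[community[source]] += 1
        degree_sum[community[target]] += 1
        if community[source] == community[target]:
            internal[community[source]] += 1
    # Q = sum over communities of (share of internal edges - expected share)
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

# Two triangles joined by a single bridge edge (made-up data).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
community = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
q = modularity(edges, community)
```

Splitting this graph into its two triangles gives a clearly positive Q, while putting all six vertices into one community gives zero, so a higher value means the partition captures more of the cluster structure.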



If you want to learn more about the capabilities of Gephi, read Martin Grandjean's introduction to the program: http://www.martingrandjean.ch/gephi-introduction/

4. Analysis of the result


So, you have the final visualization of the graph. What does it give you? First, it is beautiful: you can put it in a presentation, show it to friends or set it as your desktop wallpaper. Second, it shows how complex and how clustered the structure of your subject area is. Third, pay attention to the largest vertices and the thickest edges. These are the special elements on which everything rests.

For example, having built the graph of the expert community attending events at the Boiling Point, we immediately found the participants who most likely act as superconnectors. They were the vertices through which the clusters were joined into a single whole. And in the second case, we saw how specialists from Tomsk companies are concentrated across the markets they belong to and the end-to-end digital technologies they rely on. This indirectly indicates the region's level of technological competence and expertise.

Graphs really are a great help in understanding the world around us, so do not be lazy and try creating your own data visualization. It is not difficult at all, though it can be labor-intensive.
