What can be done in 48 hours? Interview with the BioHack 2019 bioinformatics hackathon winner

The fourth bioinformatics hackathon, BioHack 2020, starts on March 27 in St. Petersburg. Over the hackathon's history, more than 300 young specialists from different countries have taken part and 58 solutions have been developed. Leading research organizations have submitted projects to work on at the hackathon: the I.P. Pavlov Institute of Physiology, the Institute of Cytology RAS, St. Petersburg State University, the Federal Research and Clinical Center of Physical-Chemical Medicine, JetBrains BioLabs, the Institute of Protein Research RAS, Genotek, MIPT, iBinom and others.

In 2019, the Garlic team took the main prize of 150,000 rubles. In the 48 hours allotted, the team created a tool for searching for genomic rearrangements of a given structure. We asked the project's curator, Dmitry Konanov, to talk about the project, the hackathon and the life of a bioinformatician in general.



- Tell me, what were you doing at the time you took part in the hackathon?
- At the time I was working in the bioinformatics laboratory of the Federal Research and Clinical Center of Physical-Chemical Medicine of the FMBA of Russia (the Federal Medical and Biological Agency), where I was writing my diploma thesis. I still work at the center.

- Why did you decide to participate in BioHack?
- It happened rather spontaneously. The deadline was approaching, it was one of the last days for submitting projects, and the laboratory asked me whether I wanted to participate in the hackathon: all I had to do was send in a project. I wrote something up in about 15 minutes and submitted the application.



- So it was a project that you were already working on in the laboratory?
- I wanted to work on it and had started, but it was far from finished. At the hackathon we brought it to the state I had been aiming for: the algorithm became more automated.

- Tell me how the idea for the project came about.
- The initial idea is not mine but Alexander Manolov's, a Ph.D. at the bioinformatics laboratory. He was my diploma supervisor at the time.

It is known that bacterial genomes are very plastic. Many events can occur in them: transfer of genes from one bacterium to another, changes in gene sequence, and insertion and deletion of genome fragments. The idea is this: take 4 bacterial genomes, each consisting of 5 genes. The first genome is XYZTF, the second XRLAF, the third XYKTF, and the fourth XYLTF (figure “Rearrangements in the graphs”). In our example, identical letters correspond to homologous (one might say, the same) genes, and the order of the letters shows the order of the genes in the genome.

We define each gene from the set of genomes as a node of the graph and draw an edge between two gene nodes if they are adjacent in at least one genome of the set.



Thus, we obtain a graph that contains information on all the gene-order variants found in the selected genomes. And with this graph structure you can do whatever your heart desires.
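As an editorial illustration, the construction described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the team's code: the function name `build_gene_graph` is hypothetical, edges are stored as directed (following gene order), and plain dictionaries are used instead of NetworkX so the example stays self-contained.

```python
from collections import defaultdict


def build_gene_graph(genomes):
    """Return {gene: set of genes that directly follow it in at least one genome}.

    Hypothetical helper: each genome is a string, each letter is one gene.
    """
    graph = defaultdict(set)
    for genome in genomes:
        for gene in genome:
            graph[gene]  # touch the key so genes without successors are still nodes
        for left, right in zip(genome, genome[1:]):
            graph[left].add(right)
    return dict(graph)


# The four example genomes from the interview.
genomes = ["XYZTF", "XRLAF", "XYKTF", "XYLTF"]
graph = build_gene_graph(genomes)
edges = sorted((u, v) for u, successors in graph.items() for v in successors)

print(len(graph), "genes,", len(edges), "edges")  # 9 genes, 12 edges
print(sorted(graph["Y"]))  # ['K', 'L', 'Z']
```

Gene Y is followed by Z, K or L depending on the genome, which is exactly the kind of branching that marks a variable region in the graph.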

Our first task was to look for regions of high local entanglement in the graph, which arise at the so-called hot spots of genomic rearrangements: places where, for reasons that are not always understood, the genome changes intensively from strain to strain. We called the measure of entanglement in the vicinity of a node "genome complexity". In essence, this value shows numerically how often rearrangements occur in a given region.

- And what is the essence of the problem that you solved with the team at the hackathon?
- For the hackathon we came up with a task that was, shall we say, more mathematically beautiful.
Any genome rearrangement, be it a deletion (ed.: loss of a chromosome region), an insertion or an inversion (ed.: reversal of the gene order in a chromosome region), leads to the formation of subgraphs of a certain topology in our large rearrangement graph. I thought it would be good if we could search for specific subgraphs whose structure corresponds to the rearrangements we are interested in. This would make it possible to efficiently find points in the genome where such events occur more often and to compare their frequency across different species and genera of bacteria. It is known, for example, that some parts of the genome are off-limits to inversions, while in other areas inversions occur most often.
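To make the idea of "a rearrangement leaves a subgraph of a certain topology" concrete, here is an editorial sketch for the simplest case, an insertion or deletion of a single gene. It assumes a directed gene-order graph and looks for a direct edge u -> v that coexists with a detour u -> w -> v. The function names and the toy genomes are hypothetical; inversions would need strand information, which this sketch does not model.

```python
from collections import defaultdict


def build_gene_graph(genomes):
    """Directed graph: gene -> set of genes that directly follow it."""
    graph = defaultdict(set)
    for genome in genomes:
        for left, right in zip(genome, genome[1:]):
            graph[left].add(right)
    return graph


def find_indel_patterns(graph):
    """Return sorted (u, w, v) triples where u->v exists alongside u->w->v.

    Such a triangle is what a single-gene insertion/deletion of w between
    u and v looks like in the graph.
    """
    hits = []
    for u in list(graph):
        for v in graph[u]:
            for w in graph[u]:
                # graph.get avoids creating new keys while we iterate
                if w != v and v in graph.get(w, ()):
                    hits.append((u, w, v))
    return sorted(hits)


# Hypothetical toy genomes: gene C is missing in the second one,
# and gene X sits in its place in the third.
toy_genomes = ["ABCDE", "ABDE", "ABXDE"]
hits = find_indel_patterns(build_gene_graph(toy_genomes))
print(hits)  # [('B', 'C', 'D'), ('B', 'X', 'D')]
```

Counting such hits per position along a reference genome would give a frequency profile of the kind discussed in the interview, although how the team actually computed theirs is not described here.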

There was also a purely fundamental interest in looking at the frequency distribution profiles along the genome for other types of rearrangements as well. As for practical significance, this is directly related to biotechnology: we think that, knowing how susceptible different genome fragments are to insertions, one can predict in advance where a random fragment of foreign DNA is more likely to be inserted. But we have not tested this.

Even before the hackathon, I did some things by hand and wrote a rather clumsy algorithm that looked for one specific pattern (code-named Smile because of its characteristic appearance). I computed its frequency and distribution along the genome for many species, and some amusing things turned up. For example, in bacteria with many smileys, rearrangements of any type were equally likely along the entire genome, while in bacteria with few smileys they occurred only in a limited number of hot spots (at a similar genome-wide total frequency). Of course, I wanted something more universal, so that any possible subgraph could be specified for the search. I brought this idea to the hackathon.

After two days of work, we had GARLIC-Finder, a tool for studying genomic rearrangements of a given structure. We wrote a universal language for specifying the subgraphs to search for, but since this task is NP-hard, brute-force search proved feasible only for small static subgraphs. Therefore, we let users add custom algorithms optimized for specific patterns. At the hackathon, we settled on three patterns: a pair of genes between which insertions often occur (Garlic), transpositions of a genome fragment (Penguin) and a gene with a very rich neighborhood (Spider) (fig. “Search for subgraph rearrangements”). Garlic came first and therefore gave its name to our tool. It became an acronym: Genome reARrangements Learning InterfaCe.
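The idea of pluggable, pattern-specific search algorithms can be illustrated with a small editorial sketch. It is not GARLIC-Finder's actual interface: the registry `PATTERN_FINDERS`, the `register` decorator and the `min_neighbours` threshold are all hypothetical, and "Spider" is approximated here simply as a gene with many distinct neighbors.

```python
from collections import defaultdict

# Hypothetical registry: pattern name -> function that searches for it.
PATTERN_FINDERS = {}


def register(name):
    """Decorator that adds a custom finder to the registry."""
    def decorator(func):
        PATTERN_FINDERS[name] = func
        return func
    return decorator


def neighbours(genomes):
    """Undirected adjacency: gene -> set of genes next to it in some genome."""
    adj = defaultdict(set)
    for genome in genomes:
        for a, b in zip(genome, genome[1:]):
            adj[a].add(b)
            adj[b].add(a)
    return adj


@register("spider")
def find_spiders(adj, min_neighbours=4):
    """Genes with a 'rich neighborhood': at least min_neighbours distinct neighbors."""
    return sorted(gene for gene, nb in adj.items() if len(nb) >= min_neighbours)


# The four example genomes from the interview.
genomes = ["XYZTF", "XRLAF", "XYKTF", "XYLTF"]
adj = neighbours(genomes)
results = {name: finder(adj) for name, finder in PATTERN_FINDERS.items()}
print(results)  # {'spider': ['L', 'T', 'Y']}
```

A specialized finder like this runs in time linear in the size of the graph, which is the point of sidestepping a general subgraph search for patterns that are known in advance.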



I even used it a bit later on.

- A bit? So the project went no further than the hackathon?
- Right now the issue is that we are still at the stage of publishing a large paper on graphs and genome complexity. The person writing a dissertation on this topic is handling it. We submitted the first version in the summer, but unfortunately it was rejected. The other day we submitted it again, to a different journal. If all goes well, perhaps we will keep digging in that direction.

- What did the hackathon give you?
- The project became a big part of my thesis. New optimization ideas came up. And in general, I learned a lot of new things myself.

- What did you spend the money on, if it's not a secret?
- It's no secret: a good audio player with headphones :).

- What programming language was used to solve the problem?
- Python, exclusively Python, plus various libraries: NetworkX for working with graphs, Graphviz and its Python binding for visualization, and the classic Matplotlib and Pandas for working with data. And one home-grown library, gene-graph-lib.

- And who was on your team?
- Two programmers and one biologist. Everyone turned out to be very helpful.

- What was your global goal? Why did you decide to submit the project to a hackathon?
- I wanted to solve a problem and solve it efficiently. I had planned to do it myself, but a unique opportunity came up and I decided to use it. And I simply wanted to see what a hackathon is like.

- Did you like it?
- Wonderful, just wonderful! The organization, the food, the venue, the people: all wonderful. There was nothing to complain about at all.

It would have been very good if we had been allowed to use the on-site monitors and staff equipment (as I understand it, the hackathon was held in the EPAM office), but of course we were not.

- How did you prepare for the hackathon? What needs to be done, besides bringing your own equipment?
- The leaders (ed.: curators) were required to prepare a 1.5-2 minute presentation about the project. Participants should read the project descriptions carefully to see what requirements the leaders have for team members. It may happen that someone's laptop is fully set up with a Python 2 environment while the curator uses Python 3. This is not critical, but reinstalling the environment can take extra time, when all you had to do was read carefully that Python 3 was needed.

As for what knowledge to prepare, things are less clear. Naturally, you need to be able to program in the required languages and have some grasp of the context of the problem proposed as a project. That said, we had a team member who did not know biology at all but was very useful: it was he who wrote the parser for the subgraph definition language, and that task fell entirely on his shoulders.

- You already spoke about the organization, the premises, the food. Where did you sleep? And did you sleep at all?
- In 48 hours I slept about four hours, I guess. I was on site the whole time and went to the hotel only on the last night.

- So participants need to be mentally prepared for this.
- Both mentally and, especially, physically. If you have experience preparing for some dreadful exam sessions, when you do not sleep for two nights, that is good preparation. I had such cases during my studies, so I was ready.

- What is your global goal? Why do you do bioinformatics?
- I came into bioinformatics by accident. I studied at the faculty of the Academy of Agricultural Sciences of RAS. There, starting from the second year, students are required to do research one day a week at one of the institutes of the Russian Academy of Sciences. I responded to an offer from the IBCh RAS without any idea what I would be doing. I came there and it turned out that I would be analyzing NGS and proteomics data. That is when I started to learn Python and get into bioinformatics. I worked there for two years, the project seemed to stall a bit, and I moved to where I work now.
I like it. I have always loved both mathematics and biology; it just worked out that way.

- What books, courses, lectures or films would you recommend?
- There is a course on bioinformatics algorithms on Coursera from the University of California San Diego, which Pavel Pevzner helped create; it is also on Stepic. I solved some problems there, and they are quite useful. They let you level up in both molecular biology and coding. Most tasks boil down to programming some simple sequence analysis or the like. I know that the Institute of Bioinformatics holds guest lectures that can be watched on YouTube, and they also have courses on Stepic. For Python, I diligently read roughly the first 500 pages of Learning Python by Mark Lutz, and after that it was just reading documentation and changelogs, and practice.

The most important thing is to solve problems. Just reading theory is of little use; it is by working through problems that you learn to solve real ones.

- Do you plan to participate in the hackathon this year?
- Yes, I think so.

- With what? Or is it still a secret?
- There are two options; the ideas are still taking shape, so I won't reveal them yet. I still have a whole month. I'll probably submit at the last moment, as always :)

- And what is now being discussed in the world of bioinformatics?
- People often love hyped topics. I have a student from the Russian Chemical Technical University who is writing a diploma on a graph topic, and he decided to build a graph on the recently published genome of the coronavirus and its relatives.

- Intriguing. We look forward to new discoveries and new interesting projects from you and your colleagues!

You can submit a project until February 28 and register as a participant until March 5 at biohack.ru.
