ChIP-seq data analysis: from histones to computational problems

Every year, the Institute of Bioinformatics in St. Petersburg and Moscow recruits biologists, mathematicians and programmers to immerse themselves in the world of bioinformatics. Biologists learn to program and to implement their ideas in code, while computer scientists study biology and apply algorithmic approaches to biological and medical problems. The most important part of the training is real research projects. In this article, we describe the work and results of the Institute's students, carried out in 2019 under the supervision of Oleg Shpynov from JetBrains Research. The project is devoted to studying changes in human chromatin using machine learning.


Students of the Institute of Bioinformatics, 2019

What is sequencing and why is it needed


The desire to satisfy curiosity and understand oneself, which began with a description of human anatomy, gradually deepened and moved to a more detailed level. Blood cells and their interaction with parasites, the mechanisms of transmission of hereditary information and the formation of metastases by cancer cells were studied.

The advent of sequencing technologies has allowed us to go one level deeper and look directly at the carrier of genetic information: DNA. Deoxyribonucleic acid, which is located in the nucleus of almost every cell in our body, is responsible for how we look, how tall we are, what our voice sounds like and whether we can catch malaria. However, sequencing technology, like biochemical methods, does not stand still. Combining the two has made it possible to bring to light more complex mechanisms of the body. Let's look at this in more detail.

How do we sequence organisms


Sequencing technologies have advanced: depending on the goal, it is now possible to sequence individual cells, to track changes in them over time, or simply to obtain the complete sequence of the carrier of hereditary information, DNA. In effect, sequencing translates a biological molecule into a text file, which can then be processed as plain text. Modern sequencing methods use the “shotgun” approach and yield a huge number of short fragments (reads). In some analyses, these short fragments are “tried on” against an existing genome, that is, aligned to it, and the differences in the “text” are examined.
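To make the idea of "trying on" short fragments against a genome concrete, here is a minimal sketch in Python. It assumes exact string matching on a toy reference; the function names `map_reads` and `mismatches` are invented for this illustration. Real aligners rely on indexed data structures and tolerate mismatches, so this shows only the principle.

```python
def map_reads(reference, reads):
    """Return {read: [start positions]} for every exact match in the reference."""
    hits = {}
    for read in reads:
        positions = []
        start = reference.find(read)
        while start != -1:
            positions.append(start)
            # Continue searching one base further to catch overlapping matches.
            start = reference.find(read, start + 1)
        hits[read] = positions
    return hits


def mismatches(reference, read, position):
    """Count the bases where a read differs from the reference at a given offset."""
    window = reference[position:position + len(read)]
    return sum(1 for ref_base, read_base in zip(window, read) if ref_base != read_base)


# Toy data (hypothetical): a 22-base "genome" and three short reads.
reference = "ACGTACGTTAGCCGATAGGCTA"
reads = ["ACGT", "TAGC", "GGGG"]

hits = map_reads(reference, reads)
for read, positions in hits.items():
    print(read, positions)

# A read with one substitution relative to the reference at offset 8.
print(mismatches(reference, "TAGA", 8))
```

A read that matches in several places, like the first one here, is ambiguous, and a read that matches nowhere may contain a sequencing error or a real difference from the reference.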

What are histones and what do they affect


A DNA strand is very long and cannot stay permanently unwound: that would be inconvenient and dangerous, because the likelihood of a break somewhere is greater. Therefore, the molecule is coiled very tightly and compactly packed by being wound around special protein complexes, like hair on curlers. These complexes are called nucleosomes and are composed of histone proteins. Histone modification is one example of a more general mechanism, epigenetic regulation. An organism is alive and needs to respond to changes in its surroundings, and one part of that response is a change in gene expression. If the DNA fragment that carries a gene is tightly packed and wound around a nucleosome, it is impossible to reach it and read the information. Therefore, special chemical groups, such as phosphoryl or acetyl groups, are attached to histones; this is called phosphorylation or acetylation, respectively. This causes the histone to “move” and give access to the desired DNA fragment. The nucleosome nevertheless remains bound to the DNA, and this can be used in studies of regulation.


The mechanism of acetylation and methylation of histones (source)

Chromatin-immunoprecipitation sequencing (ChIP-seq) and its use


To study the DNA fragments that remain bound to a protein, there is a special method: chromatin immunoprecipitation (ChIP). The procedure is as follows:

  • reversible crosslinking of DNA to the proteins interacting with it (usually by formaldehyde treatment)
  • isolation of the DNA and its fragmentation by ultrasound or endonucleases
  • precipitation with an antibody specific to the protein of interest
  • reversal of the cross-links between protein and DNA, followed by DNA purification

In short, we pull the protein out of solution together with the DNA bound to it, and then make it “let go” of the DNA. From a biological point of view, the range of applications is clear: studying gene expression, closed and open regions of chromatin, and so on. Below, we discuss what programmers can contribute to this task.

In the case of ChIP sequencing (ChIP-seq), the resulting DNA fragments are amplified (artificially copied) and sequenced. The resulting set of sequences of short DNA pieces is what bioinformaticians study.

The resulting data undergo quality control, are filtered, aligned to a reference DNA sequence and processed by specialized programs.


DNA Preparation Scheme for Analysis

The task of finding the sites where a protein binds DNA is often called peak calling, and the tools that solve it are called peak callers. There are currently many computational approaches and tools for analyzing such data; however, the algorithms are not ideal and have a number of limitations. Many computational problems in this area remain unsolved and await programmers and computer scientists.
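As an illustration of what a peak caller does, here is a deliberately naive sketch. It assumes fixed-width windows, a single genome-wide Poisson background and a hypothetical p-value cutoff; the names `poisson_sf` and `call_peaks` are invented for this example. It is not the algorithm of any tool mentioned in this article.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 minus the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_peaks(read_starts, genome_length, window=100, p_cutoff=1e-3):
    """Return merged (start, end) intervals whose read count is unlikely under the background."""
    n_windows = math.ceil(genome_length / window)
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window] += 1
    # Assumption: the background rate is the genome-wide average reads per window.
    lam = len(read_starts) / n_windows
    peaks = []
    for i, count in enumerate(counts):
        if poisson_sf(count, lam) >= p_cutoff:
            continue
        start, end = i * window, min((i + 1) * window, genome_length)
        if peaks and peaks[-1][1] == start:
            peaks[-1] = (peaks[-1][0], end)  # merge with the adjacent enriched window
        else:
            peaks.append((start, end))
    return peaks


# Toy data (hypothetical): sparse background plus a pile-up between 300 and 420.
background = list(range(50, 1000, 100))
signal = list(range(300, 320)) + list(range(400, 420))
peaks = call_peaks(background + signal, genome_length=1000)
print(peaks)
```

Even this toy version shows where the difficulties come from: the result depends on the window size, on the cutoff and, above all, on how the background is modelled.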

Here are some of the problems that students with mathematical and technical backgrounds are currently working on:

  • Uneven fragmentation and control

Chromatin accessibility during fragmentation differs across the genome. Chromatin is more accessible in actively transcribed regions, so the corresponding DNA fragments are overrepresented in the sample, which can lead to false-positive results. Tightly packed regions, in contrast, may fragment less readily and be underrepresented in the sample, which can lead to false-negative results.

  • Number of cells

The classical technique has a number of limitations. ChIP-seq usually requires a significant number of cells (about 10 million), which complicates applying the method to small organisms (such as fungi or protozoa) and limits the number of experiments that can be performed with a valuable sample.

  • Data noise

In a ChIP-seq experiment, the final library may contain not only DNA fragments that were bound to the protein, but also other, non-specifically bound fragments. This may be caused by imperfect antibody specificity, problems with washing away free DNA fragments, and so on. Such fragments form the so-called noise in the data. The problem lies not only in the existence of noise, but also in the difficulty of measuring it. To assess the noise level, there is the signal-to-noise ratio (SNR) metric, which is determined by the number and strength of the peaks obtained for each sample. However, a high SNR does not guarantee that binding sites are identified correctly; it merely reflects the presence of a large number of genome regions to which many reads (short DNA fragments whose sequence matches the chromosome at that location) are aligned.
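For the first problem in the list, uneven fragmentation, a common remedy is to compare the ChIP sample with a control sample prepared without the antibody step. The sketch below assumes per-window read counts are already available and uses a simple library-size scaling with a pseudocount; the function name `fold_enrichment` and all numbers are hypothetical.

```python
def fold_enrichment(chip_counts, control_counts, pseudocount=1.0):
    """Per-window ratio of ChIP reads to control reads after library-size scaling."""
    chip_total = sum(chip_counts)
    control_total = sum(control_counts)
    if control_total == 0:
        raise ValueError("control library is empty")
    # Scale the control so that both libraries have the same total read count.
    scale = chip_total / control_total
    return [
        (chip + pseudocount) / (control * scale + pseudocount)
        for chip, control in zip(chip_counts, control_counts)
    ]


# Toy counts (hypothetical) for four windows.
# The last window has many ChIP reads, but even more control reads:
# it is simply easy to fragment, not enriched for the protein.
chip = [10, 40, 10, 20]
control = [10, 10, 10, 50]

for window, ratio in enumerate(fold_enrichment(chip, control)):
    print(window, round(ratio, 2))
```

The pseudocount keeps the ratio finite in windows where the control has no reads; its value is a design choice, not a standard.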

Problem Solving Options


Some of these problems were tackled by students of the Institute of Bioinformatics under the supervision of Oleg Shpynov from JetBrains Research as part of semester research projects.

Noisy peak calling
student: Daria Chaplygina



In the article “Impact of sequencing depth in ChIP-seq experiments” (1), the authors studied the effect of library size (the number of initial reads) on the results of peak-calling algorithms. They created artificial datasets for different types of histone modifications by random sampling from real experiments. As expected, the smaller the library, the harder it is for the algorithms to find peaks, and the results become inconsistent between different methods. They also noticed that, even with the same tool, consistency between biological replicates is lost. In our semester project, we investigated the effect of noise in the source data.

A dataset with a controlled noise level was built from publicly available ChIP-seq experiments from the ENCODE project. Two noise models were used for this:

  1. Additive model. Fragments from random sections of DNA were added to the source file with “clean data”. The proportion of random fragments ranged from 0% to 90%.
  2. Probabilistic model. For each experiment, a mathematical model was built using the Tulip tool. It was then used to generate a completely new experiment in which one parameter, the percentage of fragments located inside DNA-protein binding sites, varied from 10% down to 0.5%.
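The additive model can be sketched in a few lines. This is an illustration rather than the project's code: it assumes that reads are represented by their start positions, that noise reads are drawn uniformly over the genome, and that the "proportion of random fragments" is measured relative to the final mixed library. The function name `add_noise` is invented for this example.

```python
import random


def add_noise(clean_reads, genome_length, noise_fraction, seed=0):
    """Mix uniformly random reads into a clean library.

    noise_fraction is the share of random reads in the resulting library,
    so 0.9 means nine noise reads for every clean one.
    """
    if not 0 <= noise_fraction < 1:
        raise ValueError("noise_fraction must be in [0, 1)")
    rng = random.Random(seed)  # fixed seed keeps the experiment reproducible
    n_noise = round(len(clean_reads) * noise_fraction / (1 - noise_fraction))
    noise = [rng.randrange(genome_length) for _ in range(n_noise)]
    mixed = list(clean_reads) + noise
    rng.shuffle(mixed)
    return mixed


# Toy library (hypothetical): 100 reads piled up in one binding site.
clean = list(range(400, 500))
for fraction in (0.0, 0.5, 0.9):
    mixed = add_noise(clean, genome_length=10_000, noise_fraction=fraction)
    print(fraction, len(mixed))
```

Fixing the random seed makes each noise level reproducible, which matters when several peak callers are compared on the same generated data.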



Visualization of data changes when applying a probabilistic noise model

On the resulting dataset, we analyzed three algorithms: MACS2 (2), SICER (3) and SPAN, an algorithm developed by JetBrains Research and based on a semi-supervised machine learning method. As it turned out, for a fixed SNR one can predict the expected precision and recall of the set of peaks that an algorithm will find. At a high noise level (that is, a low SNR), MACS2 and SICER find almost no peaks, while SPAN shows the most stable results across the combination of metrics.
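The quality of a called peak set against known binding sites is usually summarised by precision and recall. The sketch below assumes a simple peak-level definition, in which a called peak counts as correct if it overlaps any true site; other definitions, for example by base pairs, are equally possible. The names `overlaps` and `precision_recall` and the intervals are hypothetical.

```python
def overlaps(a, b):
    """True if half-open intervals a = (start, end) and b = (start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def precision_recall(called, truth):
    """Peak-level precision and recall of called peaks against true binding sites."""
    correct_calls = sum(1 for peak in called if any(overlaps(peak, site) for site in truth))
    recovered_sites = sum(1 for site in truth if any(overlaps(site, peak) for peak in called))
    precision = correct_calls / len(called) if called else 0.0
    recall = recovered_sites / len(truth) if truth else 0.0
    return precision, recall


# Toy example (hypothetical): three true sites, four called peaks.
truth = [(100, 200), (500, 600), (900, 1000)]
called = [(150, 250), (300, 350), (550, 580), (590, 650)]

precision, recall = precision_recall(called, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here three of the four called peaks hit a true site, while one true site is missed entirely, so the two numbers capture different kinds of error.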



Precision and recall of peak-calling algorithms at a controlled noise level

We studied how two data quality metrics change as noise is added: SNR and the fraction of reads in peaks (FRiP). The measurements showed that, for the same SNR, the fraction of fragments falling into regions of DNA-protein interaction can vary significantly (in some cases, the difference was up to 50%). Existing standards and recommendations for assessing the quality of ChIP-seq experiments are incomplete, and new integrated approaches are required.
As part of the work, we also developed pipelines for running such experiments semi-automatically.
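The fraction of reads in peaks mentioned above is easy to compute once the peaks are known. A minimal sketch, assuming reads are represented by a single position and peaks by half-open intervals (the function name `frip` and the data are made up for illustration):

```python
def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any peak interval [start, end)."""
    if not read_positions:
        return 0.0
    in_peaks = sum(
        1 for pos in read_positions
        if any(start <= pos < end for start, end in peaks)
    )
    return in_peaks / len(read_positions)


# Toy data (hypothetical): ten reads, two peaks.
reads = [10, 120, 130, 180, 400, 520, 560, 700, 910, 990]
peaks = [(100, 200), (500, 600)]
print(frip(reads, peaks))
```

Note that FRiP depends on the peak set, and therefore on the peak caller, which is one reason why it and SNR can disagree.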

Implementation of approaches and source code:

github.com/DaryaChaplygina/NoisyPeakCalling

github.com/DaryaChaplygina/NoisyPeakCalling2

Deep learning to the rescue!
student: Daria Balashova

One of the limitations of the classical ChIP-seq method is the large amount of cellular material it requires, which rules out the experiment, for example, for rare cell populations or when several measurements are needed for one biological sample. The newer Ultra-Low-Input (ULI) ChIP-seq method (4) requires significantly less material (100,000 cells are sufficient) but has greater variability and a higher noise level in the data.

Deep learning methods are gaining popularity in bioinformatics and show excellent results in tasks such as biomedical image processing. In the work “Denoising genome-wide histone ChIP-seq with convolutional neural networks” (5), the authors proposed Coda, a method for improving the quality of ChIP-seq data based on convolutional neural networks. They built and trained a deep neural network not only to improve poor-quality data, but also to find peaks in it.
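To give a feeling for what a convolutional network does with a coverage track, here is the basic building block: a single one-dimensional convolution followed by a ReLU. This is only a sketch. The kernel here is a fixed averaging filter chosen by hand, whereas in a trained network the kernel weights are learned from pairs of noisy and clean data and many such layers are stacked. The function name `conv1d_relu` and the signal are hypothetical.

```python
def conv1d_relu(signal, kernel):
    """Same-length 1D convolution with zero padding, followed by ReLU.

    The kernel is applied without flipping (cross-correlation),
    as is conventional in neural network libraries.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    output = []
    for i in range(len(signal)):
        value = sum(weight * padded[i + j] for j, weight in enumerate(kernel))
        output.append(max(0.0, value))  # ReLU: negative responses are set to zero
    return output


# Toy coverage track (hypothetical): one real peak plus two isolated spikes.
noisy = [0, 3, 0, 0, 9, 12, 9, 0, 0, 3, 0]
kernel = [1 / 3, 1 / 3, 1 / 3]  # hand-picked averaging filter, not learned weights

smoothed = conv1d_relu(noisy, kernel)
print([round(value, 2) for value in smoothed])
```

The averaging filter flattens the isolated spikes while the broad peak survives; a learned kernel can pick up far more specific shapes than this.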

In this project, the original algorithm was adapted for ULI ChIP-seq data. Using the findings from the previous project and the ULI ChIP-seq data from the article “Epigenetic changes in aging human monocytes” (6), we analyzed important characteristics of the algorithm, such as the improvement in quality metrics like SNR. The result is DCNN, a convolutional neural network that automatically improves data quality based on the signal-to-noise ratio when biological replicates are available. While signal improvement and denoising work quite well, finding protein-DNA binding sites with deep learning methods remains an unsolved problem, since existing approaches require a large, high-quality training set.


Schematic representation of the application of the convolutional neural network DCNN

Implementation of the approach and source code: github.com/dashabalashova/Denoising_CNN

Instead of an afterword


Bioinformatics lets you apply programmers' approaches to biological data and gain new knowledge that will help biologists and doctors study humans. Applications are now open for the 2020 summer school, which will be held in St. Petersburg from July 27 to August 1. It is ideal for getting to know bioinformatics.

For those who have decided on more serious training, there is still a chance to jump on the last car: apply for the bioinformatics retraining program in St. Petersburg and Moscow until February 22, or for the retreat seminar on systems biology until March 1.

For those who like to read and discover new things, we have a list of books and textbooks on algorithms, programming, genetics and biology.

Bibliography:


  1. Jung, Y. L., Luquette, L. J., Ho, J. W., Ferrari, F., Tolstorukov, M., Minoda, A.,… & Park, P. J. (2014). Impact of sequencing depth in ChIP-seq experiments. Nucleic acids research, 42(9), e74-e74.
  2. Zhang, Y., Liu, T., Meyer, C. A., Eeckhoute, J., Johnson, D. S., Bernstein, B. E.,… & Liu, X. S. (2008). Model-based analysis of ChIP-Seq (MACS). Genome biology, 9(9), R137.
  3. Xu, S., Grullon, S., Ge, K., & Peng, W. (2014). Spatial clustering for identification of ChIP-enriched regions (SICER) to map regions of histone methylation patterns in embryonic stem cells. In Stem Cell Transcriptional Networks (pp. 97-111). Humana Press, New York, NY.
  4. Brind'Amour, J., Liu, S., Hudson, M., Chen, C., Karimi, M. M., & Lorincz, M. C. (2015). An ultra-low-input native ChIP-seq protocol for genome-wide profiling of rare cell populations. Nature communications, 6(1), 1-8.
  5. Koh, P. W., Pierson, E., & Kundaje, A. (2017). Denoising genome-wide histone ChIP-seq with convolutional neural networks. Bioinformatics, 33(14), i225-i233.
  6. Schukina, Bagaitkar, Shpynov et al., In review, artyomovlab.wustl.edu/aging


Authors of the article:
Olga Bondareva, Institute of Bioinformatics
Oleg Shpynov, JetBrains Research
Ekaterina Vyakhhi, Institute of Bioinformatics
