Parsing a movie theater site in Go

Hello everyone. In this article we will look at a small example of code written to collect data from a website in Go, one of today's popular programming languages.

This article should be especially interesting to those who have heard a lot about Go but have not yet tried it themselves.

For one of our internal projects, we needed to collect data about the films currently showing and their screening schedules. Here we will look at the first (and simplest) version of the parser, the one it all started with.

For those too lazy to read through the article, here is a link to the repository right away.

In its first version, the parser could collect only the following information:

  • showtimes at a single cinema,
  • the detailed description,
  • the title,
  • the studio,

and save it to JSON. The first step is to pick a suitable parsing library.

Google returns plenty of options for the query `golang web scraping`, and many of them appear in this list; I recommend looking through it. I chose geziyor because it supports JS rendering (which, by the way, we will not use in this example, although the feature is very handy when scraping sites) and is quite simple to use.

So, the library is chosen; the next step is to install it and start using it in our code.
Installing it is extremely simple:

go get -u github.com/geziyor/geziyor

Now let's move on to writing code.

Inside the body of the main function, we create the parser, give it the URL of the page from which data collection will start, and specify that we want to export the results to a JSON file:


func main() { 
    geziyor.NewGeziyor(&geziyor.Options{ 
        StartURLs: []string{"https://kinoteatr.ru/raspisanie-kinoteatrov/city/#"}, 
        ParseFunc: parseMovies, 
        Exporters: []export.Exporter{&export.JSON{}}, 
    }).Start() 
}

A start has been made, but the data-collection logic is still missing; for that we need to implement the parseMovies function.

The collection logic will be as follows:

  • find the block containing the movie information,
  • collect information about all showtimes inside this block,
  • collect the movie title and studio,
  • collect the link to the page with detailed information about the movie,

Let's move on to the implementation of this function.


Here, all blocks containing information about the film are selected for further processing.

func parseMovies(g *geziyor.Geziyor, r *client.Response) {
    r.HTMLDoc.Find("div.shedule_movie").Each(func(i int, s *goquery.Selection) { 

This is how the session information is collected and then tidied up into a human-readable form (extra spaces and newlines are stripped).


var sessions = strings.Split(s.Find(".shedule_session_time").Text(), " \n ") 
sessions = sessions[:len(sessions)-1] 

for i := 0; i < len(sessions); i++ { 
    sessions[i] = strings.Trim(sessions[i], "\n ") 
}

This block of code fetches the page with detailed information about the movie and extracts its description.


if href, ok := s.Find("a.gtm-ec-list-item-movie").Attr("href"); ok {
    g.Get(r.JoinURL(href), func(_g *geziyor.Geziyor, _r *client.Response) {
        description := _r.HTMLDoc.Find("span.announce p.movie_card_description_inform").Text() 
        description = strings.ReplaceAll(description, "\t", "") 
        description = strings.ReplaceAll(description, "\n", "") 
        description = strings.TrimSpace(description) 

Finally, the collected data is sent to the exporter, which writes the results to a JSON file.


g.Exports <- map[string]interface{}{ 
    "title":       strings.TrimSpace(s.Find("span.movie_card_header.title").Text()), 
    "subtitle":    strings.TrimSpace(s.Find("span.sub_title.shedule_movie_text").Text()), 
    "sessions":    sessions, 
    "description": description, 
}

Hooray, we're done! All that remains is to combine the code blocks above and start the parser.

This is how the parser runs: in the terminal we see messages about pages being fetched successfully, which is convenient.



And this is what the parsing result looks like.



Thanks for reading the article, and happy programming in Go.
