Comparing Russian rap scenes with R and text mining: Noize MC and Kasta vs Pharaoh and Morgenshtern

R. Text Mining. Rap


The popularity of many contemporary rap artists remains a mystery to me and to other followers of the “old school”. Endless debates about who is better, whose lyrics are more interesting and whose music is more diverse occupy the minds of many Internet users. To settle these disputes with facts rather than words, I analyzed the lyrics of four Russian rap acts using the R programming language.

Two of them were insanely popular in the early 2000s. They still attract listeners today, but unfortunately fewer and fewer of them. The other two are now at the peak of their popularity and draw a large and mostly young audience. As my analysis will show, given the vocabulary they use, this is no cause for joy. Telling who is who is easy, because the artists whose songs I used are Kasta, Noize MC, Pharaoh and Morgenshtern. I think everyone understands that I assign Kasta and Noize MC to the "old school", and Pharaoh and Morgenshtern to the "new".

Albums analyzed


For the analysis, I selected all the official studio albums released by the artists (information about the albums was taken from www.wikipedia.com; all references are listed at the end):

  1. Kasta: “Louder than water, higher than grass”; "A flash in the eye"; "Four-headed Yelling"; “It's clear about the flaw” - 74 tracks.
  2. Noize MC: "The Greatest Hits Vol. 1"; "Last album"; "New album"; "Protivo Gunz"; "Confusion"; "Hard Reboot 3.0"; "King of the hill"; "Hiphopera: Orpheus & Eurydice" - 160 tracks.
  3. Pharaoh: "The Wadget"; "Phlora"; "Dolor"; "Phosphor"; "Pink Phloyd"; "Phuneral"; "Rule" - 95 tracks.
  4. Morgenshtern: “Before It Becomes Known”; “Smile, you fool!”; “Legendary Dust” - 30 tracks.

I chose these artists deliberately: anyone even slightly familiar with their work will agree that their lyrics are very different (Kasta + Noize MC vs Pharaoh + Morgenshtern), so comparing them should be interesting. A logical question arises: how can discographies of such different sizes, from three albums to eight, be compared objectively and fairly? It is quite simple: after some manipulations, which I will discuss later, the volumes of words become more or less comparable. After all, as everyone knows, quantity does not equal quality.
To collect the lyrics themselves, I used the genius.com website and its API. Fortunately, the developers of the service provide an open application programming interface (API) that makes it fairly easy to extract lyrics (by artist or album) from the database for subsequent analysis.

The whole analysis was performed in the R programming language, except for stemming (the process of finding the base form of a given word), for which I used Python, because I could not get R and the mystem program to cope with the encoding (Windows 10 does not get along with UTF-8 in R; reportedly such problems do not arise on Apple's OS or on Linux).

Before processing. Overview of the texts. Word count


To download the lyrics I used the "genius" package. Its “genius_album” function makes it very easy to download all the lyrics of an album at once. Be careful and double-check the result: not all lyrics are available for all artists, and some of them had to be added manually. After the download, I wanted to know how many words the songs contain in total (including pronouns, prepositions, particles, etc.). Later we will compare these figures with the results after stemming and stop word removal. To make it easier to relate the number of albums and tracks to the number of words, here is that information once again:

  1. Noize MC - 8 albums, 160 tracks.
  2. Kasta - 4 albums, 74 tracks.
  3. Pharaoh - 7 albums, 95 tracks.
  4. Morgenshtern - 3 albums, 30 tracks.
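The total word count per artist can be sketched as follows. This is a minimal Python illustration, not the R code used for the article; the `lyrics` dictionary and its contents are hypothetical toy data, not real lyrics.

```python
import re

def count_words(text: str) -> int:
    """Count word tokens; in Python 3 the word class also matches Cyrillic letters."""
    return len(re.findall(r"\w+", text.lower()))

# Hypothetical toy input: artist -> list of song texts.
lyrics = {
    "artist_a": ["one two three", "two three four five"],
    "artist_b": ["раз два", "два три"],
}

# Sum the token counts of all songs for each artist.
totals = {artist: sum(count_words(song) for song in songs)
          for artist, songs in lyrics.items()}
print(totals)
```

At this stage every token counts, including prepositions and pronouns, which is why the raw totals are so large.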

image

Interestingly, Pharaoh and Noize MC have almost the same number of albums (seven and eight, respectively), but, as the graph shows, their albums differ greatly both in the number of songs and in the total number of words (24184 for Pharaoh vs 57962 for Noize MC).

To reduce this difference and make the comparison fairer, I calculated how many words each artist uses on average in one song:

  1. Noize MC - 362 words.
  2. Kasta - 388 words.
  3. Pharaoh - 254 words.
  4. Morgenshtern - 273 words.

Such a comparison is of course rough and approximate, but the figures speak for themselves.

And this is what each artist's top 10 words look like, with the number of occurrences of each:
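The per-song averages follow from dividing the total word count by the number of tracks. The sketch below uses only the two totals quoted in the text (the totals for the other two artists are not given). Integer division is an assumption on my part; it happens to reproduce the quoted figures.

```python
# Totals and track counts quoted in the article (only these two totals are given).
total_words = {"Noize MC": 57962, "Pharaoh": 24184}
tracks = {"Noize MC": 160, "Pharaoh": 95}

# Integer division drops the fractional part, which matches the article's figures.
words_per_track = {artist: total_words[artist] // tracks[artist]
                   for artist in total_words}
print(words_per_track)  # {'Noize MC': 362, 'Pharaoh': 254}
```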

image

image

As one would expect, without processing the “top words” are prepositions, pronouns and conjunctions, which reveal nothing and carry no particular meaning. So at this stage nothing interesting or unexpected turned up.

The next step was processing and preparing the texts for analysis. Stemming was performed in Python with Yandex's mystem program, which is freely available. This step was needed to understand how many unique words the artists use and how widely they draw on the Russian language in their lyrics. After all, it would be a mistake to count the same word in different grammatical cases several times: that would measure the artist's ability to inflect words rather than the breadth of his vocabulary.

Also, to get a more representative result, it is necessary to get rid of stop words that carry no emotional or semantic load (prepositions, pronouns, particles, etc.). Unfortunately, I found no good R packages with stop words for the Russian language. Note that it is up to the author to decide whether a given word is a stop word and whether it should be deleted. Always review such dictionaries carefully so as not to weed out a word that is useful to you. The stopwords package supports quite a few languages, but I preferred to use words from an external resource with my own refinements.
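A top-N list like the one above can be obtained with a plain frequency count. This is a minimal Python sketch on a made-up English sentence; the function name `top_words` is hypothetical.

```python
from collections import Counter
import re

def top_words(text: str, n: int = 10):
    """Return the n most frequent word tokens with their counts."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common(n)

# Made-up sample text: the function word "and" dominates, as in the raw lyrics.
sample = "and I go and you go and we stay"
print(top_words(sample, 2))  # [('and', 3), ('go', 2)]
```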
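Why reducing words to a base form matters for counting unique words can be shown with a toy example. The `LEMMAS` mapping below is a hypothetical stand-in written for this illustration; in the actual analysis the base forms came from mystem.

```python
# Hypothetical mini-dictionary: word form -> base form.
# A real run would obtain base forms from mystem instead.
LEMMAS = {"cities": "city", "city": "city", "streets": "street", "street": "street"}

def unique_lemmas(tokens):
    """Map each token to its base form (unknown tokens stay as they are)."""
    return {LEMMAS.get(token, token) for token in tokens}

tokens = ["city", "cities", "street", "streets", "night"]
# Five distinct surface forms collapse into three distinct words.
print(len(set(tokens)), len(unique_lemmas(tokens)))  # 5 3
```

For a heavily inflected language such as Russian the gap between the two counts is much larger than in this English toy example.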
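The idea of an external stop word list with manual refinement can be sketched like this. The helper names and the tiny word sets are hypothetical; in practice `base` would be loaded from the downloaded list.

```python
def build_stop_words(base, extra=(), keep=()):
    """Start from an external list, add your own words, protect useful ones."""
    return (set(base) | set(extra)) - set(keep)

def remove_stop_words(tokens, stop_words):
    """Drop every token that is in the stop word set, preserving order."""
    return [token for token in tokens if token not in stop_words]

base = {"and", "the", "in", "not", "i"}  # stand-in for the downloaded list
stop_words = build_stop_words(base, extra={"yeah"}, keep={"not"})

tokens = ["yeah", "i", "am", "not", "in", "the", "city"]
print(remove_stop_words(tokens, stop_words))  # ['am', 'not', 'city']
```

The `keep` argument reflects the advice above: a word that matters for your analysis (here "not") should be reviewed and protected rather than removed blindly.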

After processing


As you can see from the graph, the number of words decreased significantly after stemming and removing stop words. This is not surprising, given that almost all of the original most popular words turned out to be stop words.

image

In general, the share of words remaining after stemming and stop word removal, as a percentage of the initial number, is practically the same within each group: 55-58% for the "old school" and 46-50% for the "new".
A very important and interesting figure is the number of unique words each artist uses: 8891 for Noize MC, 5307 for Kasta, 3899 for Pharaoh and 1242 for Morgenshtern. Anyone who wants to expand their vocabulary a bit but does not want to read books can listen to Noize MC and Kasta.

Of course, many will want to know which words lead now, after processing. Here are the charts with the top 10 words for each artist:

image

image

Surely many readers noticed the words with asterisks. Pharaoh and Morgenshtern really do have a lot of profanity in their lyrics, which, in my personal opinion, has a rather negative effect on the structure of the text and on how it is perceived. These two performers have the same word in second position, a word that perfectly demonstrates the spirit and culture of their music. A little later, I will show which emotional tone dominates in the performers' lyrics.

Common words. Word comparison


To make the information more visual, I placed the words of all the performers on one chart using the “comparison.cloud” function from the “wordcloud” package, which makes them easier to compare and take in (and again the profanity stands out). Showing words with bar plots can be problematic, since a large number of words requires a lot of space. Another good function is "wordcloud2" from the package of the same name: when you hover over a word, a tooltip shows how often it is used.

image

Since the artists write their songs in the same language, it is interesting to see which words they use most often in common, without splitting by artist. The commonality.cloud function from the wordcloud package was used for this graph. The font size corresponds to how often the word is mentioned in the texts.
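The idea behind such a "common words" chart can be sketched in Python: keep only the words that every artist uses and add up their frequencies. This is my reading of the idea, not the internals of commonality.cloud; the function name and the counts are hypothetical.

```python
from collections import Counter

def common_words(counts_by_artist):
    """Words used by every artist, with frequencies summed across artists."""
    counters = list(counts_by_artist.values())
    shared = set(counters[0])
    for counter in counters[1:]:
        shared &= set(counter)
    return Counter({word: sum(c[word] for c in counters) for word in shared})

# Hypothetical per-artist word frequencies.
counts = {
    "artist_a": Counter({"life": 5, "city": 2, "money": 1}),
    "artist_b": Counter({"life": 3, "money": 4}),
    "artist_c": Counter({"life": 1, "money": 2, "night": 7}),
}
print(common_words(counts).most_common())  # [('life', 9), ('money', 7)]
```

The summed frequency is what would drive the font size in a word cloud.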

image

Sentiment analysis of the texts


Every film, book or song has its own mood, which is passed on to viewers or listeners and affects them. It is interesting to see what mood the performers of the old and new schools convey to their listeners. This can be found out by analyzing which category of words ("Negative", "Positive" or "Neutral") prevails in the musicians' songs. As expected, there is no reasonably high-quality sentiment dictionary for the Russian language in R (if anyone knows of one, please share). Therefore, I had to use an external one with my own additions (the link to the dictionary is at the end of the text).

Not every word had a match in the dictionary, which is of course a little sad; with English such problems practically do not arise. I therefore decided to show the emotional coloring of the most frequently repeated words. These are the words the listener hears most often, so they have the strongest effect on him and shape the perception of the whole song. In general, readers who are even slightly familiar with the work of these artists are unlikely to be surprised. And if the names are new to you, welcome, and get acquainted with their work. The graphs below show the most frequently used words of each artist.

Morgenshtern. Words repeated more than 10 times. The abundance of red columns stands out strongly, and once you look at what these words are, the message this artist carries to his audience becomes doubly sad.
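Tagging only the frequent words that have a dictionary match can be sketched as below. The `SENTIMENT` mapping, its labels and the counts are hypothetical placeholders, not entries from the dictionary used in the article.

```python
from collections import Counter

# Hypothetical sentiment dictionary: word -> category.
SENTIMENT = {"love": "positive", "hate": "negative", "street": "neutral"}

def tag_frequent_words(counts, min_freq):
    """Keep words at or above the threshold that exist in the dictionary."""
    return [(word, n, SENTIMENT[word])
            for word, n in counts.most_common()
            if n >= min_freq and word in SENTIMENT]

counts = Counter({"hate": 12, "love": 11, "street": 4, "yo": 30})
print(tag_frequent_words(counts, min_freq=10))
# [('hate', 12, 'negative'), ('love', 11, 'positive')]
```

Note that "yo" is dropped despite being the most frequent word, because it has no match in the dictionary; this is exactly the coverage limitation described above.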

image

Pharaoh. The vocabulary also leaves much to be desired. Words repeated more than 20 times.

image

Next come the old-timers of the Russian rap scene: those who give no cause for shame and can be recommended for listening.

Kasta. A clear predominance of words with a positive connotation, and the negative words do not shock with their immorality. Frequency >= 25.

image

And finally, the master of rhyme and words, Noize MC (frequency >= 30).

image

The abundance of negatively colored vocabulary that Morgenshtern and Pharaoh use in their songs affects how the songs are perceived and the mood they convey. It is hard to get pleasant emotions from music that does its best to evoke the opposite.
Since the sentiment dictionary does not contain all the words, it is difficult to draw a fully confident conclusion about the mood of the artists' songs, and a lot also depends on context. However, I will show how many words of each kind the artists use (among those that could be matched against the dictionary).
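A per-category tally that also tracks dictionary coverage might look like this in Python. The dictionary and counts are hypothetical, and weighting each word by its number of occurrences is an assumption; the article does not say whether occurrences or distinct words were counted.

```python
from collections import Counter

def tally_sentiment(counts, dictionary):
    """Sum word occurrences per sentiment category; unknown words go to 'unmatched'."""
    tally = Counter()
    for word, n in counts.items():
        tally[dictionary.get(word, "unmatched")] += n
    return tally

# Hypothetical dictionary and word frequencies.
dictionary = {"love": "positive", "hate": "negative", "street": "neutral"}
counts = Counter({"hate": 3, "love": 1, "street": 5, "yo": 2})

tally = tally_sentiment(counts, dictionary)
matched = sum(n for tag, n in tally.items() if tag != "unmatched")
coverage = matched / sum(tally.values())
print(dict(tally), round(coverage, 2))
```

Reporting the coverage next to the tally makes it clear how much of the text the conclusion actually rests on.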

image

Unsurprisingly, most of the words of all the artists have a neutral tone, which has practically no effect on the listener. Interestingly, though, Pharaoh and Morgenshtern use more words with a negative connotation than with a positive one. And this is despite the incompleteness of the dictionary, which lacks many obscene words and their variations (the dictionary contains 28,248 words, and I had to add some of them manually).

image

For Kasta and Noize MC, neutral words also lead, but positive words, which do not evoke negative emotions, come second.

Yes, of course, this type of analysis cannot take context into account: the word “love”, for example, can be used with the particle “not” and take on a negative meaning. But you must admit that the phrase "I do not love you" is more pleasant than the phrase "I hate you". And the negativity of the latter would not be repaired even by adding the particle “not”: we would still hear the word "hate".

Musical taste is an individual matter and everyone decides what to listen to. But take another look at the charts and think about how you want to fill your everyday life. Music accompanies us everywhere and often very much affects our mood, so why consciously make it worse every day?

More broadly, this article is also about the fact that programming can be interesting and can be applied in various fields. It can show familiar information from a new angle and make you think about what seemed obvious or insignificant. What is hidden behind the lines of code, and what interesting things they tell, depends only on you.

Learn programming languages, keep developing, and listen to quality music, the kind that takes more than seven days of live streaming on YouTube to write. For those who don’t know, Morgenshtern's album “Legendary Dust” was recorded in six days during live broadcasts on YouTube and became the most successful of his career, reaching one million plays on VKontakte in the first half hour after release and five million plays in eleven hours. In the first two days after release, the album was played on VKontakte more than 21 million times, which is a record for the social network.

List of used literature:

1. ru.wikipedia.org/wiki/Noize_MC
2. ru.wikipedia.org/wiki/Pharaoh
3. ru.wikipedia.org/wiki/Casta_(group)
4. ru.wikipedia.org/wiki/Morgenstern_(musician)
5. github.com/stopwords-iso/stopwords-ru/blob/master/stopwords-ru.txt (stop words)
6. github.com/dkulagin/kartaslov/tree/master/dataset/emo_dict (sentiment dictionary; license: creativecommons.org/licenses/by-nc-sa/4.0)
7. ru.wikipedia.org/wiki/Legendary_Dust
