Frequency analysis: we analyze one of the favorite tasks of the participants of the online stage of NeoQUEST-2020


Today we are publishing the Wright-Up assignment NeoQUEST-2020 , about which our participants wrote: “4 Task mega cool” and “The best task of the Olympiad”.
Despite the apparent simplicity, only 8 people completed this task! And now you can get to know him better, for this - welcome under cat!


So, participants are invited to download the audio recording . A minute of listening is enough to understand that Morse code doesn't even smell here, just like meaningful words in an audio recording are also not observed. Therefore, we end the torture by listening and begin to look at the sound in more detail. For example, using the good old (and free) Audacity . We open the audio recording and pay attention to its structure: between keystrokes there is absolute silence (someone was lucky with working conditions, you have no phone calls and chatting colleagues forever!), Which means that it will not be difficult to extract every single keystroke by filtering sections with zero signal amplitude.



Okay, let's do it. Now we have two things: an “array” in which there are 100,500 clicks, and the hope that among these sounds there are repetitives - only then can it be said that “the X key was pressed N times, and the Y key was M times.” So, all the initial data has been analyzed, it remains to answer the question “what will such a set of statements lead us to, provided that the audio recording lasts 10 minutes?”. Yes, that's right, it’s time to uncover the classical frequency analysis . Well, we test our theory using python, numpy and scikits.audiolab (for processing sound, you can choose any other package and even language with the same success).
To set up the environment, you can run the following commands (it definitely works on Ubuntu 19.10):

$ sudo apt-get install -y python python-dev python-pip python-setuptools libsndfile-dev python-numpy
$ pip install --upgrade pip
$ pip install scikits.audiolab


Now we implement the test of our assumption in several lines of code that uses the most primitive way to identify the sound of clicking - by the sum of the values ​​of the corresponding numpy array:

get list of separated ‘clicks’
data, fs, enc = scikits.audiolab.wavread(filename)
data = data.ravel()
clicks = [list(g) for k, g in groupby(data, lambda x: x != 0) if k]

# calculate number of unique ‘clicks’ and their frequences
freqs = dict() 
cypher_test = list()
for click in clicks:
    value = np.sum(np.array(click), dtype = np.float32)
    cypher_test.append(value)
    if value not in freqs:
        freqs[value] = 1
    else:
        freqs[value] += 1


The result of the code execution, shown on the diagram - like the heat of a bonfire in the middle of a nuclear winter - cannot but warm the soul: at least 27 unique clicks, and even the appearance of the distribution inspires confidence that the frequency analysis will not be futile.



According to Wikipedia and other more reliable sources, 26 English letters, sorted in decreasing order of occurrence, form the etaoinshrdlcumwfgypbvkjxqz series . Plus, of course, a gap, which occupies a leading position. You know what to do! Again, we use several lines of primitive code to compare each sound by clicking its letter (in frequency of occurrence) and carry out the initial replacement:

COMMON_FREQ_ALPHABET = { 0:' ', 1:'e', 2:'t', 3:'a', 4:'o', 5:'i', 6:'n', 7:'s', 8:'h', 9:'r', 10:'d', 11:'l', 12:'c', 13:'u', 14:'m',15:'w', 16:'f', 17:'g', 18:'y', 19:'p', 20:'b', 21:'v', 22:'k', 23:'j', 24:'x', 25:'q', 26:'z' }

# calculate key
key = dict()
index = 0
for freq in sorted(freqs.items(), key=operator.itemgetter(1),reverse=True):
    key[freq[0]] = COMMON_FREQ_ALPHABET[index]
    index +=1

# replace sounds according the key
partly_plain_text = ""
for elem in cypher_test:
	partly_plain_text += key[elem]


After replacing the frequency analysis, we get the following text:

onlineayvinf saty stw vltn selg kird moaher csctl ncmertl we tnswer utn metscre gtll saory retl enocfh worb ltrfe tvvetr tftin vltue scn satra zero gtua t gew rotd sveuitl vroklem etsa ketcay idet who lef ipere saone retuh vrodcua uca fipe reuord scre aoob jcesaion oca dcrinf vtss dry ahen show low lisa kltub okxeuas low lifha old frow wtauh aowtrd suhool my frow well heta norah mile mcsa resa tdd tmonf yocnf kesa kird beev kird gooa smtll wts bnow oca utme uca vltna some when bnow awo how dripe tftin vca while oaher some ocr letrn satr mincae digger men scrgtues infor retd hetrd lisa rtin utll vora yes whta idet ftme bind new him old deev eta keftn keftn or socah ahe some mtv tkle sonf i metn saory yetr fold mile dtrb eqtmvle world gish tfe geel ahocstnd one mile his ky ahe tkope ahose uhildren rtin gcll gtll tnimtl ury vocnd vltue ahose shora jciub where ahocfh saood fopern sorry i gorfoa i vromised ao lend yoc my uode fea ia grom my fiahck neoscveruoder htnd ahese binf wind utrb new vossikle tdd grom saood lifha down lipe mincae rotd foa satra gire ahocfh ahose some scn aia iaeai aet iaeai aia iaeai aet iaeai aia iaeai aet iaeai aia iaeai aet iaeai aia iaeai aet iaeai aia iaeai aet iaeai ucn shoa ucn shoa ucn shoa ucn shoa ucn shoa ucn shoa ucn shoa ucn shoa ucn shoa ucn shoa ucn shoa ucn shoa ucn shoa shtbe wtpe ltke nibe shtbe wtpe ltke nibe shtbe wtpe ltke nibe shtbe wtpe ltke nibe shtbe wtpe ltke nibe shtbe wtpe htn mtd dou htn mtd dou htn mtd dou htn mtd dou htn mtd dou htn mtd dou htn mtd dou htn mtd dou htn mtd dou htn dou gipe yet gipe yet gipe yet gipe yet gipe yet gipe yet gipe yet gipe yet gipe yet gipe yet gipe yet gipe yet gipe

It looks a little like something meaningful, but individual words are definitely guessed. Sequentially look for words in which it is obvious which replacement should be made:
  1. moaher -> mother and tnswer -> answer, i.e. a and t must be interchanged;
  2. larfe → large and selg → self, i.e. f and g must be interchanged;
  3. kird → bird, i.e. k is replaced by b and onlinetyving → onlinetyping, i.e. v replace with p;
  4. meascre → measure and enocgh → enough, i.e. u and c must be interchanged.

As a result of these manipulations, the text is dramatically transformed, and his first word explains the lack of connectivity: the audio recording was recorded while practicing typography on a keyboard simulator. Oh, the owner of the key was clearly not preparing for the apocalypse! .. We can be happy for him, probably now he is fingering very quickly, raking the snow that has blocked the exit from his refuge. But what is this to us? We’ll go over the text, maybe there’s something useful.

onlinetyping stay saw plan self bird mother usual numeral we answer can measure fall story real enough worb large appear again place sun start zero fact a few road special problem east beauty idea who leg ipere stone reach product cut gipe record sure toob juestion out during pass dry then show low list blacb obxects low light old grow watch toward school my grow well heat north mile must rest add among young best bird beep bird foot small was bnow out came cut plant some when bnow two how dripe again put while other some our learn star minute differ men surfaces ingor read heard list rain call port yes what idea game bind new him old deep eat began began or south the some map able song i mean story year gold mile darb eqample world fish age feel thousand one mile his by the abope those children rain full fall animal cry pound place those short juicb where though stood gopern sorry i forgot i promised to lend you my code get it from my github neosupercoder hand these bing wind carb new possible add from stood light down lipe minute road got start fire though those some sun tit iteti tea iteti tit iteti tea iteti tit iteti tea iteti tit iteti tea iteti tit iteti tea iteti tit iteti tea iteti cun shot cun shot cun shot cun shot cun shot cun shot cun shot cun shot cun shot cun shot cun shot cun shot cun shot shabe wape labe nibe shabe wape labe nibe shabe wape labe nibe shabe wape labe nibe shabe wape labe nibe shabe wape han mad doc han mad doc han mad doc han mad doc han mad doc han mad doc han mad doc han mad doc han mad doc han doc fipe yea fipe yea fipe yea fipe yea fipe yea fipe yea fipe yea fipe yea fipe yea fipe yea fipe yea fipe yea fipe

It seems we were lucky: during the training, someone was distracted and wrote a salvage message to his friend, probably in some messenger: sorry i forgot i promised to lend you my code get it from my github neosupercoder. Reading someone else’s correspondence, of course, is not good, but there is a one hundred percent exception, we paid more than enough for this information. We go to github and look for the repository by username:



Success! It remains only to read the only file from the repository "neo-medicine":



Hurray! Now we are the happy owners of the key!
Ahead - the analysis of other tasks NeoQUEST-2020 - we will not let our readers get bored with self-isolation!

All Articles