Hello, Habr!

Once, having studied another book on the notorious Data Science, I came to the conclusion that it would be time to put the accumulated knowledge into practice and see the life of the analytics department with my own eyes. Fortunately, Yandex launched a selection for a six-month internship in the appropriate direction, and I could not pass by. The acceptance of applications 2020 has already ended, so in this article I will, with a clear conscience, analyze the tasks that Yandex proposed to solve for applicants in the first stage. There will be Python code. Spoiler: difficult, but interesting.

Task 1. Deadline

The task

Novice analyst is trying to solve the problem. If the problem could not be solved, then he loses motivation, and the probability of success on the next attempt falls. One attempt takes a day, and the task deadline is 90 days. The probability that the analyst will solve the problem from the i-th attempt is:

$\frac{1}{(i+1)}$
$\frac{1}{(i+1)^2}$

How likely is the analyst to solve the problem before the deadline?

Decision

You may have already reached for typing: "@nice_one, you said it would be difficult, but what is this?" Patience, friends, this is a simple task to warm up, but there is something to miss if you do not think about the condition. Let us examine the example of the first paragraph. It is necessary to calculate the total probability that the analyst will solve the problem in any of the 90 days available in the reserve, while the probability of success on each i-th day is given. A tempting option may seem to substitute in the expression a number from 1 to 90 instead of i and add, but this is not true. This expression indicates the likelihood of success on a particular i-day, but in order to get to that i-day, the analyst must fail in the past (i - 1) days. If the probability of success on the i-th day is

\frac{1}{(i + 1)}

$\frac{1}{(i+1)}$ , then the probability of failure on this day is therefore equal to

1 - \frac{1}{(i + 1)} = \frac{i}{i + 1}

$1 - \frac{1}{(i+1)} = \frac{i}{i+1}$ . As you know, to find the probability of the simultaneous occurrence of several events, it is necessary to multiply the probability of each occurrence. Thus, the probability that the analyst can cope in exactly n days is

(\prod_{k = 1}^{n - 1} \frac{k}{k + 1}) \cdot \frac{1}{n + 1}

$\biggl(\prod\limits_{k=1}^{n-1}\frac{k}{k+1}\biggr)\cdot\frac{1}{n+1}$ .

Members standing under the sign of the work are responsible for failure in each of the first

(n - 1)

$(n - 1)$ days, then you need to multiply the product by the probability of success on the n-th day.
So, for any number of days, we know the probability of success for exactly this period. We are interested in the total probability of success for each possible period of up to 90 days inclusive. Now you can substitute numbers from 1 to 90, but already in the resulting formula. The easiest way is to write a loop in some python that will calculate and add probabilities, which I did.

The code

import numpy as np

n = 90

probs = []

for i in range(1, n+1): #   

    prob_now = 1/(i+1) #      

    prob_not_before = []
    
    for k in range(1, i): #      
        prob_not_before.append(k/(k+1))
        
    prob_not_before = np.array(prob_not_before).prod() # 

    probs.append(prob_not_before * prob_now)

s = sum(probs) #   

print(s)

The decision of the second paragraph is completely similar to the first, only the formula differs. I will leave the code solving the second point - I think everything will be clear.

Point 2

import numpy as np

n = 90

probs = []

for i in range(1, n+1): #   

    prob_now = 1/((i+1)**2) #      

    prob_not_before = []
    
    for k in range(1, i): #      
        prob_not_before.append(1 - (1/((k+1)**2)))
        
    prob_not_before = np.array(prob_not_before).prod() 

    probs.append(prob_not_before * prob_now)

s = sum(probs) #   

print(s)

Task 2. The fate of the hamster

The task

In order to survive in winter, the greedy hungry hamster decided to rob a nut factory located 1000 meters from its hole. There were 3,000 nuts left at the factory. A maximum of 1000 nuts are placed on the cheeks of the hamster. Everywhere and no matter what the hamster goes with, every meter he needs to be reinforced with 1 nut. The hamster is already at the factory and is dangerous. What is the maximum number of nuts he can stock? The answer must be rounded to the nearest integer.

Decision

Strongly reminiscent of the task of a jeep, is not it? So it is, before us is its next variety. In the general case, a vehicle (in this case, a hamster) appears in the jeep problem, which needs to travel a certain distance in conditions of limited capacity of the fuel container (hamster cheeks). The idea underlying the solution to any problem of this class - along the way, you can leave fuel supplies, as well as go back for a new one. Otherwise, a single solution algorithm does not exist, since the initial conditions and goals can be very different. The option proposed here is interesting because it is necessary not only to go the distance from the factory to the hole (which is elementary, because the hamster can hold exactly 1000 nuts, which is enough for 1000 meters), but transfer as many nuts as possible to it. It’s best to draw a diagram,depicting a length of 1000 m and a supply of nuts at the factory, and think about how to act the hamster if he wants to transport 3000 nuts into the hole, eating as little as possible, i.e., having passed as little as possible the total distance. Let's try to move at the smallest steps, 1 m each, transferring all 3000 nuts with you on several trips.

Obviously, in order to transfer 3000 nuts to any point, the hamster will need to return to the previous one at least 3 times. When 2000 nuts remain and the rest are eaten along the way, the hamster will need 2 trips to the previous point in order to move them to a new one. When the fuel is less than 1000 units, you don’t have to go back, it all fits in the hamster’s cheeks. Thus, the process of transferring nuts can be divided into three corresponding stages. Let's see what the hamster will have “fuel consumption” on each. When there are more than 2,000 nuts, to move 1 meter the hamster will have to:

Pick up the full cheeks of nuts and walk 1 m
Unload 998 nuts (1 ate on the way, 1 left to go back)
Go back 1 m again to the nut stock
Repeat steps 1-3 for the second thousand nuts
Take the last thousand and go with it 1 m forward

Thus, 1 m displacement with all prey costs a hamster 5 nuts. When the nuts become <2000, and this happens after 200 m of movement, the algorithm will be as follows:

Pick up the full cheeks of nuts and walk 1 m
Unload 998 nuts (1 ate on the way, 1 left to go back)
Go back 1 m again to the nut stock
Take the last thousand and go with it 1 m forward

1 m displacement costs the hamster 3 nuts. When he reaches the point of 534 m, a total of 2001 nuts will be eaten, and the hamster will have to take the last 999 nuts and calmly go the remaining 466 meters into its hole. When he gets there, 533 nuts will remain in the cheeks - this will be the answer to the problem.

I want to note that tasks of this class are quite popular in the theory of algorithms, as well as in interviews in large companies. The best way to learn how to solve them is practice. There is no single mechanism for solving it (well, or he swam past me), but it’s quite possible to get a hand on them and develop creative thinking.

Task 3. Analytical distribution

The task

Yandex wants to create

M

$M$ teams of analysts. When hiring, each analyst randomly chooses a group for himself where he will work. The team leader wants to figure out which minimum number of thousands of analysts is enough to hire so that his group is more likely

P

$P$ was no less

N

$N$ person?

You must write a Python program that accepts

N

$N$ ,

M

$M$ and

P

$P$ in one line, and the output gives the number of thousands of analysts.

1 \leq N \leq 100

$1 \leq N \leq 100$ ,

1 \leq M \leq 100000

$1 \leq M \leq 100000$ ,

0 \leq P \leq 1

$0 \leq P \leq 1$

Decision

Well, knowledge of statistics, namely the binomial distribution , came in handy . We denote the number of analysts hired by Yandex for

X

$X$ . Each of the hired analysts chooses a team. From the point of view of our team leader, hiring an analyst for work is an experiment with two outcomes: either a newcomer falls into our team or not. Hit probability equals

\frac{1}{M}

$\frac{1}{M}$ , the probability of the analyst choosing another group, respectively, is

\frac{M - 1}{M}

$\frac{M-1}{M}$ . In total, such experiments with the choice of the team will be

X

$X$ . The number of hits in our team

n

$n$ of

X

$X$ the election of analysts is distributed binomially, the distribution function is equal to:

P (n ⩽ N) = \sum_{k = 0}^{N} (\binom{X}{k}) (\frac{1}{M})^{k} (\frac{M - 1}{M})^{X - k}

$\mathbb{P}(n \leqslant N) = \sum\limits_{k=0}^{ N} \binom{X}{k}\, \biggl(\frac{1}{M}\biggr)^k \biggl(\frac{M-1}{M}\biggr)^{X-k}$

This function shows the probability that the number of hits will be less than or equal to the specified

N

$N$ . We are interested in the likelihood that the number of hits will be greater than or equal to the given one, so the task looks like this:

X : 1 - P_{x} (n ⩽ N) = P; X - ?

$X: 1 - \mathbb{P_x}(n \leqslant N) = P; X-?$

That is, you need to find the number of analysts hired

X

$X$ at which the team gets at least

N

$N$ person for a given probability

P

$P$ .

Well, we figured out the math - how to find it now

X

$X$ ? Busting. You can write a cycle that will sort through the number of hired analysts, increasing it until the probability of getting at least

N

$N$ analysts will not be satisfactory.

The code

def c(n, k): #   ,    
    if 0 <= k <= n:
        nn = 1
        kk = 1
        for t in range(1, min(k, n - k) + 1):
            nn *= n
            kk *= t
            n -= 1
        return nn // kk
    else:
        return 0

def bin_prob(trials, k, m): #      

    return c(trials, k) * ((1/m)**k) * ((1 - 1/m)**(trials - k))

def cdf(maximum, trials, m): #   
    value = 0
    for i in range(maximum + 1):
        value += bin_prob(trials, i, m)
    return value

n, m, p = [(float(i)) for i in input().split()] #       
n = int(n)
m = int(m)


x = 1000 
while (1 - cdf(n, x, m)) < p: #      
    x += 1000 #   

print(int(x / 1000)) #

Task 4. Gift research

The task

Santa Claus brought Anastasia 100 gifts and placed them under the Christmas tree. The tree is big and fluffy, so Anastasia is difficult to navigate under it. Anastasia examines gifts this way: accidentally reaches out from the random side of the tree to a random distance, takes a gift, examines it and puts it back. It turns out that every time Anastasia can equally likely take any gift from those lying under the tree. Find the expectation of the share of gifts that Anastasia will consider for 100 random stretches?

Decision

At first glance, the task seems very simple, even confidence appears that a solution can be found by some elementary formula, but not everything is so simple. Not so simple. I spent indecently a lot of time on this task, trying to paint options and derive a formula, but I did not succeed. Then I went to Google and, to my surprise, I had to dig deep into the forums, before I found a solution for the general case . So, if we randomly select elements from a set with a return, the probability is

n

$n$ selections from

m

$m$ elements of the set pull out exactly

k

$k$ different equals:

P (m, k, n) = \frac{(\binom{m}{k}) \cdot k! \cdot S_{2} (n, k)}{m^{n}}

$P(m, k, n) = \frac{\binom{m}{k}\cdot k! \cdot S_2(n, k)}{m^n}$

Where

S_{2}

$S_2$ there is a Stirling number of the second kind - the number of unordered partitions of the set from

n

$n$ items on

k

$k$ nonempty subsets. Well, in order to find the expectation, it is necessary to add up the probabilities calculated by this formula for each possible fraction of the unique gifts examined - from one hundredth to one whole. This can be done using a loop in Python.

The code

import math
import numpy as np
import sys
import sympy #     -   

sys.setrecursionlimit(10**9)

def c(n, k): # C

    return (math.factorial(n))/(math.factorial(k) * math.factorial(n-k))

def s(n, k): #      

    return sympy.functions.combinatorial.numbers.stirling(n, k)

    
def p(m, k, n): #    k

    return c(m, k) * math.factorial(k) * s(n, k) / (m**n)


pr = []
#      ,    ...
for j in range(1, 101): 
    pr.append(p(100, j, 100))
    
pr = np.array(pr)
#...    100
frac = np.array([i for i in range(1, 101)]) / 100


print(sum(pr*frac)) #

Task 5. Equiprobable traveler

The task

The traveler begins to move along the edges of a two-dimensional grid with whole nodes strictly to the right or up. He moves from a point

(0, 0)

$(0, 0)$ exactly

(100, 100)

$(100, 100)$ . How likely is it to cross a river in a straight line connecting the start and end points, if we assume that all possible routes are equally probable? It is believed that the traveler crossed the river if he was in the same route strictly above and below the river. A river entry is not considered an intersection.

Decision

We find the probability of crossing the classical approach - we divide the number of routes with intersection by the total number of possible routes. Let be

n

$n$ - the length of the edges of the square grid. Then the total number of possible routes:

N = \frac{(2 n!)}{(n!)^{2}}

$N = \frac{(2n!)}{(n!)^2}$

The derivation of the formula is described here . But how to find out the number of river crossing routes for each

n

$n$ ? Having been puzzled by this question, I decided to take a few smaller grid lengths, draw fields and manually calculate how many routes cross the river, hoping to trace the dependence (I highly recommend that you also take a piece of paper and a pen now and experiment with drawing small grids and paths).

Let there be a grid of size 3 by 3 cells. The side diagonal of the grid is occupied by the river, the traveler is in the lower left corner.

Picture

The drawing is not perfect, but I tried, honestly

As soon as I made the drawing, I realized that it would be much easier to track the routes that the river does not cross, namely the routes below the river. Then it will be possible to multiply their number by 2, thus taking into account the mirror paths above the river. Since we also know the total number of routes, we find the number of people crossing the river. But back to the main task - we need a relationship between

n

$n$ and the number of river crossing paths.

In the figure above for the case of 3x3, I marked in blue some “land” routes accessible to the traveler: the marked routes pass along the edges of the cells with a horizontal coordinate of 2, the traveler does not enter the left and upper edges of the cells before. There are 3 such routes, i.e.

n

$n$ . Now let's figure out the routes that go through the cell in column 1.

Picture

I marked the new paths in red. So, it is clear that if a traveler turns onto the left and then the upper edge of the cell (1, 0), then only 2 of the three paths through the cells with a horizontal coordinate of 2 will be accessible to him, because you can only move up and to the right - the third path lies lower . Thus, adding a new cell from column 1 to the route, we increased the total number of paths by the number of routes that passes through the cells of column 2 that are not lower than our new cell.

Take a 4 by 4 grid and continue to unravel the tangle. It became clear that adding a new cell to a column increases the number of paths by the number of routes that passes through the next column no lower than the top edge of the added cell. I will not mark the routes with color, I will limit myself to a text description, but if you think it is necessary, draw - in the process of solving I drew a dozen different grids before I was able to confidently feel the dependence.

Picture

The rightmost column again gives us

n

$n$ routes. The top edge of the cell (2, 0) will add to us

n - 1

$n-1$ route. The top edge of the cell (2, 1) will add

n - 2

$n-2$ route. The top edge of the cell (1, 0) will add as many routes as the cells (2, 0) and (2, 1) have added together. If you wish, you can draw a larger grid and continue to consider the routes with the same algorithm. Our task is to calculate the routes for a 100x100 grid. To do this, you can write a program that will accept the input

n

$n$ and build a matrix

n \times n

$n\times n$ starting from column

n

$n$ and then for each cell of the previous columns, counting the number of paths added by the cell based on the data of the previous column. Thus, the number of non-river crossing paths will be found.

The code

import numpy as np
import math

def routes_total(n): #   
    return math.factorial(2*n) / (math.factorial(n)**2)

def fill_matrix(n): #  ,       
    net = np.zeros((n, n)) 
    net[0, 0] = n #    n 
    for i in range(n-2):
        net[1, i] = n - i - 1 

    for i in range(2, n):
        for j in range(n - i - 1): 
            net[i, j] = 0
            for g in range(j, n - i + 1):
                net[i, j] += net[i - 1, g]
    
    #      2,     
    return (2 * sum(sum(net))) 

#      -    1
print(1  - fill_matrix(100) / routes_total(100))

Task 6. State of Linear Distribution

The task

The Linear Distribution State is a multitude of cities, some of which are connected by roads.

Once the King of the State became aware that the People of Breakpoints were about to invade its borders. Since the State was not ready for defense, the king made a difficult decision - to divide the State into many small ones, each of which will independently defend its borders.

It was decided that two cities can and should be left in one state if it is possible to get to the second from one city, even if the People of the Breakpoints seize one road between any two cities of the Linear Distribution State. In all other cases, cities must be in different states.

On each road that will cross the border of any two new states, it is necessary to put a bastion. This is necessary in case one of these states is captured by the People of Breakpoints. Then the second will be able to continue to defend its borders. In other words, the bastion will be put on the road that connects cities from different states.

The king asked you to give him a list of roads on which you need to put bastions.

Program input and output format

Input format

n

$n$

m

$m$ — .

(1 \leq n \leq 20000, 1 \leq m \leq 200000)

$(1 \leq n \leq 20000, 1 \leq m \leq 200000)$ . m . i

b_{i}, e_{i}

$b_i, e_i$ — ,

(1 \leq b_{i}, e_{i} \leq n)

$(1 \leq b_i, e_i \leq n)$

b — , . b — , , . , .

, , , , — .

Decision

And here is the graph theory problem. For long stories about the fate of the State of Linear Distribution, the drafters concealed the rather interesting task of finding bridges in a graph whose nodes are cities and the edges are roads. Briefly, a bridge is such an edge of a graph, the removal of which will cut off a certain part of this graph from other vertices. This is the idea of capturing the road - if the bridge is captured, the communication between some cities will be broken, otherwise there will always be an alternative road between the cities, therefore it is the bridges that the states divide, it is necessary to put bastions on the bridges.

Bridge Search Algorithm Based on Depth Search(Depth-first search, DFS) - a graph traversal method in which all edges coming from the initial vertex are examined, and if the edge leads to a vertex that has not yet been considered, the algorithm immediately recursively starts from this vertex. The following fact will help in the search for bridges:

Let us search in depth, looking now for all the edges from the vertex V. Then if the current edge (V, U) is such that from the vertex U and from any of its descendants in the traversal tree, there is no inverse path to the peak of V or any of its ancestors, the considered edge is a bridge.

In order to learn how to verify this fact for the vertex V, we introduce the time of entry into the vertex disc [V](from the English. discovered). In this variable, the step of the algorithm at which the vertex has been processed will be recorded. Also, with each vertex V, the lowest [V] variable will be associated , into which we write the time of occurrence of the earliest vertex U, which can be reached from vertex V. During the initial processing of the vertex lowest [V] = disc [V] (To the vertex earlier than you don’t get yourself), but later in the process of searching in depth we can find some son V, one of whose edges leads to the ancestor of V (Let's call him S). In this case, we will update lowest [V]: lowest [V] = disc [S]. And when can we hook the bridge? Then, when we search in depth, we reach the top, which has no sons that have not yet been considered (Again, we call it U). In this case, we will check which earliest vertex can be reached from U, and if this earliest vertex occurs later than the immediate parent of U (This is possible, for example, when U has no sons, then lowest [U] = disc [U ] ), then the connection of U with the parent is a bridge.

The code of the implemented algorithm with comments is attached below. It is convenient not to create separate variables for disc and lowest of each vertex, but to place arrays for each value, where the index is the number of the vertex to which the value belongs.

The code

import sys
from collections import Counter
import numpy as np
sys.setrecursionlimit(10**6) 

n, m = [int(i) for i in input().split()]
roads = [None] #    -    
graph = {}  #      ,    
for i in range(1, n+1):
    graph[i] = []
for i in range(1, m+1):
    twns = [int(j) for j in input().split()]
    graph[twns[0]].append(twns[1])
    graph[twns[1]].append(twns[0])
    roads.append(frozenset([j for j in twns]))
    
disc = [0] * (n+1) #  discovered
lowest = disc.copy() #  lowest
used = disc.copy() #  used. ,    
c = Counter(roads)

timer = 0 #   
nbridges = 0 #  
bridges = [] #  

def dfs(v, parent): #    ,    
    
    global timer
    global nbridges
    global bridges
    
    timer += 1 #   
    disc[v] = timer 
    lowest[v] = timer
    used[v] = True #     
    for u in graph[v]: #      
        if u == parent:
            continue #      ,    ,    
        if used[u]: #  ,    ,  
            lowest[v] = min(lowest[v], disc[u]) # ,       ;  lowest 
        else: #   
            dfs(u, v) #      
            #           cc  U:
            lowest[v] = min(lowest[v], lowest[u])  
            if lowest[u] > disc[v]: #   u    v   ,   
                twns = [] # ,  
                twns.append(u)
                twns.append(v)
                if c[frozenset(twns)] > 1: #     ,  ,    
                    continue
                nbridges += 1
                bridges.append(roads.index(set(twns)))

dfs(1, 0) #      

print(nbridges)
bridges = np.sort(bridges)
for bridge in bridges:
    print(bridge)

The following source helped me deal with the problem in many ways , so I consider it necessary to leave a link to it. It’s worth watching and here is this video - there is a good animation of the algorithm.

Conclusion

These are the tasks that a specialist applying for an internship in Yandex should confidently solve. The above set of tasks was given 5 hours - quite a short time, in my opinion, but everyone works at their own pace.

My decisions have been tested, and they give the correct answers, but I have no doubt that there are more effective ways to cope with the proposed tasks. If you are ready to offer a faster or more understandable solution or if you find a mistake with me, do not hesitate to write about it.

I wish everyone to find a position for themselves!

Analyst internship in Yandex: analysis of test tasks

Task 1. Deadline

The task

Decision

Task 2. The fate of the hamster

The task

Decision

Task 3. Analytical distribution

The task

Decision

Task 4. Gift research

The task

Decision

Task 5. Equiprobable traveler

The task

Decision

Task 6. State of Linear Distribution

The task

Input format

Decision

Conclusion

More articles: