Stratification, or how to learn to trust data

Look at these two sets of points and ask yourself: which one looks more “random” to you? The distribution in the left figure is clearly uneven. In some places the points cluster together, and in others there are almost none; because of this, the left plot may even look darker. The right figure also has local clusters and gaps, but they are less conspicuous.




Yet it is the left plot that was produced by an “honest” random number generator. The right plot also consists of completely random points, but they are generated so that every small square contains the same number of points.


Stratification is a method of selecting a subset of objects from a general population that is divided into subsets (strata). The objects are selected so that the final sample preserves the ratios of the strata sizes (or deviates from those ratios in a controlled way; see section 3). In the example above, the general population is the set of points inside the unit square, and the strata are the sets of points inside the smaller squares.
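The definition above can be illustrated with a minimal sketch of proportional stratified sampling. The function name `stratified_sample`, the `stratum_of` callback and the toy population are assumptions made for this illustration only; they do not come from the original text.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, fraction, seed=0):
    """Draw the same fraction of objects from every stratum (hypothetical helper)."""
    rng = random.Random(seed)

    # Split the general population into strata.
    strata = defaultdict(list)
    for obj in population:
        strata[stratum_of(obj)].append(obj)

    # Sample from each stratum separately, so the ratio of the
    # strata sizes is preserved in the final sample (up to rounding).
    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = round(fraction * len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Toy population: three strata of sizes 700, 200 and 100.
population = (
    [("a", i) for i in range(700)]
    + [("b", i) for i in range(200)]
    + [("c", i) for i in range(100)]
)
sample = stratified_sample(population, stratum_of=lambda obj: obj[0], fraction=0.1)
counts = {key: sum(1 for obj in sample if obj[0] == key) for key in "abc"}
print(counts)
```

With a 10% fraction the sample contains 70, 20 and 10 objects from the three strata, which is the same 7:2:1 ratio as in the population. A plain random sample of 100 objects would match that ratio only on average.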


. , . , - .


1. :



, , โ€” , 0.4. . -.



() :


import random

random.seed(100)

# 500 independent points, uniformly distributed in the unit square
for i in range(500):
    x, y = random.random(), random.random()
    print(x, y)

, : , ; . , , , .


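import random

random.seed(100)

cellsCount = 10  # the unit square is split into cellsCount x cellsCount cells
cellId = 0

for i in range(500):
    # Walk through the cells in order: with 500 points and 100 cells,
    # every cell receives exactly 5 points.
    # Integer division (//) keeps the indices integral in Python 3.
    cellVerticalIdx = (cellId // cellsCount) % cellsCount
    cellHorizontalIdx = cellId % cellsCount
    cellId += 1

    # Boundaries of the current cell
    left = float(cellVerticalIdx + 0) / cellsCount
    right = float(cellVerticalIdx + 1) / cellsCount

    top = float(cellHorizontalIdx + 1) / cellsCount
    bottom = float(cellHorizontalIdx + 0) / cellsCount

    # A uniformly random point, rescaled into the current cell
    x, y = random.random(), random.random()
    x = left + x * (right - left)
    y = bottom + y * (top - bottom)

    print(x, y)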
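
One way to check the claim that every small square receives the same number of points is to bin the points of both generators into a 10 x 10 grid and compare the counts. The sketch below re-implements the two generators as functions; the names `uniform_points`, `stratified_points` and `counts_per_cell` are assumptions introduced for this illustration.

```python
import random
from collections import Counter


def uniform_points(n, seed=100):
    """n independent points, uniform in the unit square."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n)]


def stratified_points(n, cells=10, seed=100):
    """n random points, placed into the grid cells in turn."""
    rng = random.Random(seed)
    points = []
    for cell_id in range(n):
        col = (cell_id // cells) % cells
        row = cell_id % cells
        # A uniform point inside the cell (col, row)
        x = (col + rng.random()) / cells
        y = (row + rng.random()) / cells
        points.append((x, y))
    return points


def counts_per_cell(points, cells=10):
    """Number of points that fall into each grid cell."""
    counts = Counter()
    for x, y in points:
        col = min(int(x * cells), cells - 1)
        row = min(int(y * cells), cells - 1)
        counts[(col, row)] += 1
    return [counts[(c, r)] for c in range(cells) for r in range(cells)]


uniform_counts = counts_per_cell(uniform_points(500))
stratified_counts = counts_per_cell(stratified_points(500))
print("uniform:    min", min(uniform_counts), "max", max(uniform_counts))
print("stratified: min", min(stratified_counts), "max", max(stratified_counts))
```

For the stratified generator the minimum and the maximum coincide (5 points per cell), while for the plain generator they differ, which is what the eye perceives as clusters and gaps.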

โ€” . , โ€” .



, , , .


, . ! , , , .


2. -


.


: , . , , . , .


: , .. . , , . , . , , โ€” .


. :



. , . , ยซยป , , . , , !


, , -, .. , . ( ), :



, , , . , , , , .


3.


-, -: , , , . A/B- , , , 0.5% , .


( , , ..), , .


See the paper “Online Stratified Sampling: Evaluating Classifiers at Web-Scale” from Microsoft Research.


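Notation: $N$ is the size of the general population, $n$ is the size of the sample, and $p$ is the share of objects of class $C$ that we want to estimate.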


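$K$ is the number of strata. The $k$-th stratum has size $N_k$, and $\hat{p}_k$ is the estimate of the share of class $C$ within it.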


$$\hat{p} = \sum_{k=1}^{K} \frac{N_k}{N}\,\hat{p}_k$$


:


$$\operatorname{var}(\hat{p}) = \sum_{k=1}^{K} \left(\frac{N_k}{N}\right)^2 \operatorname{var}(\hat{p}_k)$$
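
The two formulas above (the weighted estimate and its variance) can be tried on toy numbers. This is a sketch under a stated assumption that is not in the original text: within each stratum the objects are drawn by simple random sampling, so the variance of the per-stratum estimate is approximated by p_k (1 - p_k) / n_k, ignoring the finite population correction. The function name `stratified_estimate` and the input numbers are made up for the illustration.

```python
def stratified_estimate(strata):
    """strata: list of (N_k, n_k, positives_k) tuples.

    Returns the stratified estimate of the share and its estimated variance.
    """
    total_size = sum(size for size, _, _ in strata)
    p_hat = 0.0
    var_hat = 0.0
    for size, sampled, positives in strata:
        weight = size / total_size          # N_k / N
        p_k = positives / sampled           # estimate within the stratum
        var_k = p_k * (1 - p_k) / sampled   # assumed binomial variance
        p_hat += weight * p_k
        var_hat += weight ** 2 * var_k
    return p_hat, var_hat


# Two strata: a large one with a low share and a small one with a high share.
strata = [(9000, 90, 9), (1000, 10, 8)]
p_hat, var_hat = stratified_estimate(strata)

# For comparison: variance of a plain random sample of the same total size.
n_total = sum(sampled for _, sampled, _ in strata)
srs_var = p_hat * (1 - p_hat) / n_total

print(p_hat, var_hat, srs_var)
```

On these made-up numbers the stratified variance comes out smaller than the variance of a plain random sample of the same size, because the strata are more homogeneous inside than the population as a whole.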


, !


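For a fixed total sample size $n$, the number of objects $n_k$ drawn from the $k$-th stratum is chosen as follows: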


$$n_k \propto N_k \cdot \sqrt{\operatorname{var}(p_k)}$$
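
The allocation rule can be sketched in a few lines. The sketch assumes the standard variance-minimizing (Neyman) allocation, in which the share of a stratum is proportional to its size multiplied by the standard deviation inside it, that is, the square root of the variance. The function name `neyman_allocation` and the numbers are hypothetical.

```python
import math


def neyman_allocation(sizes, variances, n):
    """Split a total sample size n between strata.

    sizes:     N_k, the stratum sizes
    variances: var(p_k), the variance inside each stratum
    Returns fractional sample sizes; rounding is left to the caller.
    """
    weights = [size * math.sqrt(var) for size, var in zip(sizes, variances)]
    total = sum(weights)
    if total == 0:
        raise ValueError("all strata have zero variance")
    return [n * weight / total for weight in weights]


# Two strata with shares 0.1 and 0.8, so variances p(1 - p) are 0.09 and 0.16.
allocation = neyman_allocation([9000, 1000], [0.09, 0.16], n=100)
print(allocation)
```

Compared with the proportional split (90 and 10 objects), this rule moves part of the sample toward the small stratum, because that stratum is noisier.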


, . , .


, : . , - , , .




, , . - SimilarWeb Alexa - , . , . , , .


: ? ? ?


If the answers are missing or unsatisfactory, the data may well be deceiving you.

