Selecting feature importances for k-nearest neighbors (or, really, any other hyperparameters) with a gradient-like descent

True nonsense can not only accomplish the impossible, but also serve as a cautionary example

Experimenting with the simplest machine learning task, I decided it would be interesting to tune 18 hyperparameters simultaneously over a fairly wide range. In my case everything looked so simple that it seemed the job could be handled with brute computing power.

When learning something, it can be very interesting to reinvent a wheel or two. Sometimes you actually manage to come up with something new. Sometimes it turns out everything was invented before you. But even if I merely retrace a path traveled long before me, I am often rewarded with an understanding of the underlying mechanisms of the algorithms, their capabilities and their built-in limitations. I invite you to join me.

In Python and data science I am, to put it mildly, a beginner, and out of old programming habit I do many things that could be written in a single statement, for which Python punishes me with slowdowns of not just several times but orders of magnitude. So I have uploaded all my code to a repository. If you know how to implement something much more efficiently, don't be shy: edit it there or write in the comments. https://github.com/kraidiky/GDforHyperparameters

Those of you who are already hardened data scientists and have tried everything in this life should still, I believe, find the visualization of the learning process interesting; it applies to much more than this particular task.

Formulation of the problem


There is a good data science course from ODS.ai, and its third lecture, Classification, Decision Trees and the Nearest Neighbors Method, shows on extremely simple and probably synthetic data how the simplest decision tree reaches 94.5% accuracy, while the equally simple k-nearest-neighbors method gives only 89% without any preprocessing.

Import and load data
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('data/telecom_churn.csv')
df['Voice mail plan'] = pd.factorize(df['Voice mail plan'])[0]
df['International plan'] = pd.factorize(df['International plan'])[0]
df['Churn'] = df['Churn'].astype('int32')
states = df['State']
y = df['Churn']
df.drop(['State','Churn'], axis = 1, inplace=True)
df.head()

Compare the tree with knn
%%time
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score

X_train, X_holdout, y_train, y_holdout = train_test_split(df.values, y, test_size=0.3,
random_state=17)

tree = DecisionTreeClassifier(random_state=17, max_depth=5)
knn = KNeighborsClassifier(n_neighbors=10)

tree_params = {'max_depth': range(1,11), 'max_features': range(4,19)}
tree_grid = GridSearchCV(tree, tree_params, cv=10, n_jobs=-1, verbose=False)
tree_grid.fit(X_train, y_train)
tree_grid.best_params_, tree_grid.best_score_, accuracy_score(y_holdout, tree_grid.predict(X_holdout))

({'max_depth': 6, 'max_features': 16}, 0.944706386626661, 0.945)

same for knn
%%time
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

knn_pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_jobs=-1))])
knn_params = {'knn__n_neighbors': range(1, 10)}
knn_grid = GridSearchCV(knn_pipe, knn_params, cv=10, n_jobs=-1, verbose=False)

knn_grid.fit(X_train, y_train)
knn_grid.best_params_, knn_grid.best_score_, accuracy_score(y_holdout, knn_grid.predict(X_holdout))

({'knn__n_neighbors': 9}, 0.8868409772824689, 0.891)
At this point I felt sorry for knn: the comparison was clearly unfair, because we had not done any work on the metric. So, without thinking too hard, I took feature_importances_ from the tree and scaled the input by it, so that the more important a feature is, the more it contributes to the distance between points.

We feed in data scaled by feature importance
%%time
feature_importances = pd.DataFrame({'features': df.columns, 'importance':tree_grid.best_estimator_.feature_importances_})
print(feature_importances.sort_values(by=['importance'], inplace=False, ascending=False))

scaler = StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train)
X_train_transformed = X_train_transformed * np.array(feature_importances['importance'])

X_holdout_transformed = scaler.transform(X_holdout)
X_holdout_transformed = X_holdout_transformed * np.array(feature_importances['importance'])

knn_grid = GridSearchCV(KNeighborsClassifier(n_jobs=-1), {'n_neighbors': range(1, 11, 2)}, cv=5, n_jobs=-1, verbose=False)
knn_grid.fit(X_train_transformed, y_train)
print (knn_grid.best_params_, knn_grid.best_score_, accuracy_score(y_holdout, knn_grid.predict(X_holdout_transformed)))

    features                importance
5   Total day minutes         0.270386
17  Customer service calls    0.147185
8   Total eve minutes         0.135475
2   International plan        0.097249
16  Total intl charge         0.091671
15  Total intl calls          0.090008
4   Number vmail messages     0.050646
10  Total eve charge          0.038593
7   Total day charge          0.026422
3   Voice mail plan           0.017068
11  Total night minutes       0.014185
13  Total night charge        0.005742
12  Total night calls         0.005502
9   Total eve calls           0.003614
6   Total day calls           0.002246
14  Total intl minutes        0.002009
0   Account length            0.001998
1   Area code                 0.000000

{'n_neighbors': 5} 0.909129875696528 0.913

The tree shared just a little of its knowledge with knn, and we already see 91%, not so far from the 94.5% of the vanilla tree. And then an idea struck me: how, in fact, should we scale the input so that knn gives the best possible result?

First, a back-of-the-envelope estimate of how long a head-on search would take. 18 parameters; for each, say, 10 possible values of the scaling factor on a logarithmic grid. That gives 10^18 combinations. One combination, evaluated with all odd neighbor counts below 10 and 10-fold cross-validation, takes about 1.5 seconds. That works out to roughly 42 billion years. Perhaps the idea of leaving the computation running overnight will have to be abandoned. :) And somewhere around here I thought: "Hey! I'm going to build a bicycle that will fly!"
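A quick sanity check of that estimate (a rough sketch; the 1.5 seconds per combination is just my ballpark timing from the grid searches above):

# Rough cost of a head-on grid search: 10 candidate factors per feature, 18 features.
combinations = 10 ** 18               # every weight combination on a 10-point log grid
seconds_per_combination = 1.5         # ballpark: fewer than 10 odd k values x 10-fold CV
seconds_per_year = 60 * 60 * 24 * 365
print(f"{combinations * seconds_per_combination / seconds_per_year:.1e} years")  # tens of billions of years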

Gradient Search


In fact, this task most likely has only one reachable maximum. Well, not exactly one, of course, but rather a whole region of good results that are all pretty much alike. Therefore we can simply walk uphill along the gradient and find the most suitable point. My first thought was to adapt a genetic algorithm, but the fitness landscape here does not look particularly rugged, and that would be somewhat overkill.
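The plan, in schematic form, is a coordinate-wise search: for each feature weight in turn, try several values inside a window around the current best and keep the winner, shrinking the window as we go. A minimal sketch under those assumptions; score_with_weights is a hypothetical stand-in for the cross-validated knn evaluation implemented later:

import numpy as np

def coordinate_search(weights, score_with_weights, n_cycles=40, n_points=7, shrink=0.75):
    # weights: 1-D float array of per-feature factors; score_with_weights(weights) -> accuracy
    low, high = np.zeros_like(weights), np.ones_like(weights)
    for _ in range(n_cycles):
        for j in range(len(weights)):
            candidates = np.linspace(low[j], high[j], n_points)
            scores = []
            for c in candidates:
                trial = weights.copy()
                trial[j] = c
                scores.append(score_with_weights(trial))
            weights[j] = candidates[int(np.argmax(scores))]
            # shrink the search window around the newly found best value
            low[j] = weights[j] - shrink * (weights[j] - low[j])
            high[j] = weights[j] + shrink * (high[j] - weights[j])
    return weights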

I will try doing it by hand for a start. To pass the scaling factors around as hyperparameters, I need to deal with scalers. In the previous example, as in the lesson, I used StandardScaler, which centers the training sample and scales it to sigma = 1. To apply my factors nicely inside a pipeline, the transformer has to be made a little trickier. I looked for something suitable for my case among the transformers in sklearn.preprocessing but found nothing, so I tried inheriting from StandardScaler and hanging an extra vector of factors on it.

A class that standardizes the data and then multiplies it by per-feature factors, (barely) compatible with the sklearn pipeline
from sklearn.base import TransformerMixin
class StandardAndPoorScaler(StandardScaler, TransformerMixin):
    #normalization = None
    def __init__(self, copy=True, with_mean=True, with_std=True, normalization = None):
        #print("new StandardAndPoorScaler(normalization=", normalization.shape if normalization is not None else normalization, ") // ", type(self))
        self.normalization = normalization
        super().__init__(copy=copy, with_mean=with_mean, with_std=with_std)
    def fit(self, X, y=None):
        #print(type(self),".fit(",X.shape, ",", y.shape if y is not None else "<null>",")")
        super().fit(X, y)
        return self
    def partial_fit(self, X, y=None):
        #print(type(self),".partial_fit(",X.shape, ",", y.shape if y is not None else "<null>)")
        super().partial_fit(X, y)
        if self.normalization is None:
            self.normalization = np.ones((X.shape[1]))
        elif type(self.normalization) != np.ndarray:
            self.normalization = np.array(self.normalization)
        if X.shape[1] != self.normalization.shape[0]:
            raise ValueError(f"X.shape[1]={X.shape[1]} does not equal self.normalization.shape[0]={self.normalization.shape[0]}")
        return self
    def transform(self, X, copy=None):
        #print(type(self),".transform(",X.shape,",",copy,").self.normalization", self.normalization)
        Xresult = super().transform(X, copy)
        Xresult *= self.normalization
        return Xresult
    def _reset(self):
        #print(type(self),"._reset()")
        super()._reset()
    
scaler = StandardAndPoorScaler(normalization = feature_importances['importance'])
scaler.fit(X = X_train, y = None)
print(scaler.normalization)

Trying to apply this class
%%time
knn_pipe = Pipeline([('scaler', StandardAndPoorScaler()), ('knn', KNeighborsClassifier(n_jobs=-1))])

knn_params = {'knn__n_neighbors': range(1, 11, 4), 'scaler__normalization': [feature_importances['importance']]}
knn_grid = GridSearchCV(knn_pipe, knn_params, cv=5, n_jobs=-1, verbose=False)

knn_grid.fit(X_train, y_train)
knn_grid.best_params_, knn_grid.best_score_, accuracy_score(y_holdout, knn_grid.predict(X_holdout))

({'knn__n_neighbors': 5, 'scaler__normalization': Name: importance, dtype: float64}, 0.909558508358337, 0.913)

The result is slightly different from what I expected, but in principle everything works. To understand why, though, I had to rebuild this class from scratch with all its guts over three hours, and only then did I realize that print doesn't print not because sklearn is somehow broken, but because GridSearchCV creates the clones in the main process and configures and trains them in parallel workers, and everything you print in those workers disappears into oblivion. But if you set n_jobs=1, all the calls to the overridden methods show up nicely. That knowledge came at a high price; now you have it too, and you paid for it by reading a tedious article.
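A minimal way to actually see those debug prints, assuming you are willing to temporarily give up parallelism (the prints in the scaler above are commented out, so uncomment them first):

# Run the grid search single-threaded so that prints from the cloned
# transformers appear in the notebook output instead of vanishing in workers.
debug_grid = GridSearchCV(knn_pipe, knn_params, cv=5, n_jobs=1, verbose=False)
debug_grid.fit(X_train, y_train)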

Okay, let's move on. Now I want to try a certain spread of values around each parameter, then a slightly narrower spread around the best value found, and so on until the result resembles reality. This will be the first crude baseline for what should eventually become the algorithm of my dreams.

I will form several reweighting variants that differ in a few parameters
feature_base = feature_importances['importance']
searchArea = np.array([feature_base - .05, feature_base, feature_base + .05])
searchArea[searchArea < 0] = 0
searchArea[searchArea > 1] = 1
print(searchArea[2,:] - searchArea[0,:])

import itertools

affected_props = [2,3,4]
parametrs_ranges = np.concatenate([
    np.linspace(searchArea[0,affected_props], searchArea[1,affected_props], 2, endpoint=False),
    np.linspace(searchArea[1,affected_props], searchArea[2,affected_props], 3, endpoint=True)]).transpose()

print(parametrs_ranges) # 5 candidate values per affected feature, 125 combinations in total
recombinations = itertools.product(parametrs_ranges[0], parametrs_ranges[1], parametrs_ranges[2])

variances = []
for item in recombinations: # assemble full weight vectors; a plain Python loop like this is slow
    varince = feature_base.copy()
    varince[affected_props] = item
    variances.append(varince)
print(variances[0])
print(len(variances))
# knn's n_neighbors is kept fixed below, so only the weight vectors are varied here.

Well, the data set for the first experiment is ready. Now I'll try experimenting with the data, for a start by exhaustively searching the resulting 125 variants.

We make a trial selection of parameters as in the article
%%time
#scale = np.ones([18])
knn_pipe = Pipeline([('scaler', StandardAndPoorScaler()), ('knn', KNeighborsClassifier(n_neighbors = 7 , n_jobs=-1))])

knn_params = {'scaler__normalization': variances} # 'knn__n_neighbors': range(3, 9, 2), 
knn_grid = GridSearchCV(knn_pipe, knn_params, cv=10, n_jobs=-1, verbose=False)

knn_grid.fit(X_train, y_train)
knn_grid.best_params_, knn_grid.best_score_, accuracy_score(y_holdout, knn_grid.predict(X_holdout))

Well, everything is bad: a ton of time was spent, and the result is very unstable. The X_holdout check shows it too; the result dances around like a kaleidoscope at the slightest change in the input data. I'll try a different approach: I will change only one parameter at a time, but on a much finer grid.

Changing a single property (number 4)
%%time
affected_property = 4
parametrs_range = np.concatenate([
    np.linspace(searchArea[0,affected_property], searchArea[1,affected_property], 29, endpoint=False),
    np.linspace(searchArea[1,affected_property], searchArea[2,affected_property], 30, endpoint=True)]).transpose()

print(searchArea[1,affected_property])
print(parametrs_range) # this time a one-dimensional sweep over a single feature


variances = []
for item in parametrs_range: # assemble full weight vectors; a plain Python loop like this is slow
    varince = feature_base.copy()
    varince[affected_property] = item
    variances.append(varince)
print(variances[0])
print(len(variances))
# knn's n_neighbors stays fixed; only the weight of one feature is varied.

knn_pipe = Pipeline([('scaler', StandardAndPoorScaler()), ('knn', KNeighborsClassifier(n_neighbors = 7 , n_jobs=-1))])

knn_params = {'scaler__normalization': variances} # 'knn__n_neighbors': range(3, 9, 2), 
knn_grid = GridSearchCV(knn_pipe, knn_params, cv=10, n_jobs=-1, verbose=False)

knn_grid.fit(X_train, y_train)
knn_grid.best_params_, knn_grid.best_score_, accuracy_score(y_holdout, knn_grid.predict(X_holdout))

({'scaler__normalization': 4 0.079957 Name: importance, dtype: float64}, 0.9099871410201458, 0.913)

So, what have we got? Shifts of one to two tenths of a percent on cross-validation, and a half-percent jump on X_holdout depending on which affected_property you look at. Apparently the situation can be improved noticeably and cheaply if we start from the weights the tree gives us on such data. But let's suppose we do not have a known initial weight distribution and try to do the same thing from an arbitrary starting point, in a loop, with tiny steps. It will be very interesting to see where we end up.

Initial values
searchArea = np.array([np.zeros((18,)), np.ones((18,)) /18, np.ones((18,))])
print(searchArea[:,0])

history_parametrs = [searchArea[1,:].copy()]
scaler = StandardAndPoorScaler(normalization=searchArea[1,:])
scaler.fit(X_train)
knn = KNeighborsClassifier(n_neighbors = 7 , n_jobs=-1)
knn.fit(scaler.transform(X_train), y_train)
history_holdout_score = [accuracy_score(y_holdout, knn.predict(scaler.transform(X_holdout)))]

A function that slightly changes one parameter (with debug logging)
%%time
def changePropertyNormalization(affected_property, points_count = 15):
    test_range = np.concatenate([
        np.linspace(searchArea[0,affected_property], searchArea[1,affected_property], points_count//2, endpoint=False),
        np.linspace(searchArea[1,affected_property], searchArea[2,affected_property], points_count//2 + 1, endpoint=True)]).transpose()
    variances = [searchArea[1,:].copy() for i in range(test_range.shape[0])]
    for row in range(len(variances)):
        variances[row][affected_property] = test_range[row]
    
    knn_pipe = Pipeline([('scaler', StandardAndPoorScaler()), ('knn', KNeighborsClassifier(n_neighbors = 7 , n_jobs=-1))])
    knn_params = {'scaler__normalization': variances} # 'knn__n_neighbors': range(3, 9, 2), 
    knn_grid = GridSearchCV(knn_pipe, knn_params, cv=10, n_jobs=-1, verbose=False)

    knn_grid.fit(X_train, y_train)
    holdout_score = accuracy_score(y_holdout, knn_grid.predict(X_holdout))
    best_param = knn_grid.best_params_['scaler__normalization'][affected_property]
    print(affected_property,
          'property:', searchArea[1, affected_property], "=>", best_param,
          'holdout:', history_holdout_score[-1], "=>", holdout_score, '(', knn_grid.best_score_, ')')
    # Now shrink or shift the search window depending on where the best value landed.
    before = searchArea[:, affected_property]
    propertySearchArea = searchArea[:, affected_property].copy()
    if best_param == propertySearchArea[0]:
        print('|<<')
        searchArea[0, affected_property] = best_param/2 if best_param > 0.01 else 0
        searchArea[2, affected_property] = (best_param + searchArea[2, affected_property])/2
        searchArea[1, affected_property] = best_param
    elif best_param == propertySearchArea[2]:
        print('>>|')
        searchArea[2, affected_property] = (best_param + 1)/2 if best_param < 0.99 else 1
        searchArea[0, affected_property] = (best_param + searchArea[0, affected_property])/2
        searchArea[1, affected_property] = best_param
    elif best_param < (propertySearchArea[0] + propertySearchArea[1])/2:
        print('<<')
        searchArea[0, affected_property] = max(propertySearchArea[0]*1.1 - .1*propertySearchArea[1], 0)
        searchArea[2, affected_property] = (best_param + propertySearchArea[2])/2
        searchArea[1, affected_property] = best_param
    elif best_param > (propertySearchArea[1] + propertySearchArea[2])/2:
        print('>>')
        searchArea[0, affected_property] = (best_param + propertySearchArea[0])/2
        searchArea[2, affected_property] = min(propertySearchArea[2]*1.1 - .1*propertySearchArea[1], 1)
        searchArea[1, affected_property] = best_param
    elif best_param < propertySearchArea[1]:
        print('<')
        searchArea[2, affected_property] = searchArea[1, affected_property]*.25 + .75*searchArea[2, affected_property]
        searchArea[1, affected_property] = best_param
    elif best_param > propertySearchArea[1]:
        print('>')
        searchArea[0, affected_property] = searchArea[1, affected_property]*.25 + .75*searchArea[0, affected_property]
        searchArea[1, affected_property] = best_param
    else:
        print('=')
        searchArea[0, affected_property] = searchArea[1, affected_property]*.25 + .75*searchArea[0, affected_property]
        searchArea[2, affected_property] = searchArea[1, affected_property]*.25 + .75*searchArea[2, affected_property]
    normalization = searchArea[1,:].sum() # renormalize so that the current weights always sum to 1
    searchArea[:,:] /= normalization
    print(before, "=>",searchArea[:, affected_property])
    history_parametrs.append(searchArea[1,:].copy())
    history_holdout_score.append(holdout_score)
    
changePropertyNormalization(1, 9)
changePropertyNormalization(1, 9)

I didn't optimize anything anywhere, and as a result taking the next decisive step took me almost half an hour:

Hidden text
40 .
%%time
# start again from uniform weights
searchArea = np.array([np.zeros((18,)), np.ones((18,)) /18, np.ones((18,))])
print(searchArea[:,0])

history_parametrs = [searchArea[1,:].copy()]
scaler = StandardAndPoorScaler(normalization=searchArea[1,:])
scaler.fit(X_train)
knn = KNeighborsClassifier(n_neighbors = 7 , n_jobs=-1)
knn.fit(scaler.transform(X_train), y_train)
history_holdout_score = [accuracy_score(y_holdout, knn.predict(scaler.transform(X_holdout)))]

for tick in range(40):
    for p in range(searchArea.shape[1]):
        changePropertyNormalization(p, 7)
    
print(searchArea[1,:])
print(history_holdout_score)

The resulting knn accuracy: 91.9%. Better than when we borrowed the weights from the tree, and much, much better than the original version. Let's compare what we got with the feature importances according to the decision tree:

Visualization of the importance of features according to knn
feature_importances['knn_importance'] = history_parametrs[-1]
diagramma = feature_importances.copy()
indexes = diagramma.index
diagramma.index = diagramma['features']
diagramma.drop('features', axis=1, inplace=True)
diagramma.plot(kind='bar');
plt.savefig("images/pic1.png", format = 'png')
plt.show()
feature_importances





Similar? Yes, similar, but far from identical. An interesting observation: the data set contains several features that essentially duplicate each other, for example 'Total night minutes' and 'Total night charge'. Note that knn pruned away a significant part of these duplicated features on its own.
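A quick way to check that these pairs really are near-duplicates (a small sketch; the column names are the ones in the dataset loaded above):

# Correlation between the suspected duplicate pairs: values close to 1.0
# mean one column is essentially a rescaled copy of the other.
for minutes_col, charge_col in [('Total day minutes', 'Total day charge'),
                                ('Total eve minutes', 'Total eve charge'),
                                ('Total night minutes', 'Total night charge'),
                                ('Total intl minutes', 'Total intl charge')]:
    print(minutes_col, '<->', charge_col, df[minutes_col].corr(df[charge_col]))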

Let's save the results to a file, otherwise it is rather inconvenient to come back to this work later.
parametrs_df = pd.DataFrame(history_parametrs)
parametrs_df['scores'] = history_holdout_score
parametrs_df.index.name = 'index'
parametrs_df.to_csv('parametrs_and_scores.csv')

Findings


Well, the result of 0.919 is not bad for knn in itself: 1.5 times fewer errors than in the vanilla version, and 7% fewer than when we fed it the tree's feature_importances_. But the most interesting thing is that we now have feature importances according to knn itself, and they differ somewhat from what the tree told us. For example, the tree and knn disagree about which features are not important to us at all.

And in the end, we got something relatively new and unusual, armed with nothing more than the knowledge of three mlcourse.ai lectures from ODS and Google for answering simple Python questions. Not bad, in my opinion.

Now, the slides


A byproduct of the algorithm's work is the path it has traveled. True, the path is 18-dimensional, which makes it a little hard to grasp, and it is not very convenient to follow in real time whether the algorithm is still learning or just churning garbage. The error curve does not always show this: the error may not change noticeably for a long time while the algorithm is busily crawling along a long narrow valley of the fitness landscape. So, for starters, I will apply the simplest but still quite informative approach: randomly project the 18-dimensional space onto a plane so that every parameter, regardless of its significance, contributes with unit weight. In fact, the 18-dimensional path is not that large; in my other article, "Spying on the Throws of a Neural Network", I admired the space of all the network's synapse weights in the same way, and it was pretty and informative.

Read the data back from the file (useful when returning to the work later and skipping the training stage)
parametrs_df = pd.read_csv('parametrs_and_scores.csv', index_col = 'index')
history_holdout_score = np.array(parametrs_df['scores'])
history_parametrs = np.array(parametrs_df.drop('scores', axis=1))

At some point the error on validation stops changing. Here one could wire in an automatic stop of the training and then use the resulting function happily ever after, but I was already running short of time. :(
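A minimal sketch of what such an automatic stop could look like, assuming we keep tracking history_holdout_score as above; the patience and tolerance values here are made up for illustration:

def should_stop(history_holdout_score, patience=18, tolerance=1e-6):
    # Stop when the holdout score has not improved over the last `patience` steps (one full cycle).
    if len(history_holdout_score) <= patience:
        return False
    recent_best = max(history_holdout_score[-patience:])
    earlier_best = max(history_holdout_score[:-patience])
    return recent_best <= earlier_best + tolerance

# Inside the training loop this would become:
# for tick in range(40):
#     for p in range(searchArea.shape[1]):
#         changePropertyNormalization(p, 7)
#     if should_stop(history_holdout_score):
#         break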

We determine how long the training actually went on.
last = history_holdout_score[-1]
steps = np.arange(0, history_holdout_score.shape[0])[history_holdout_score != last].max()
print(steps/18)

35.5555555555555556
We changed one parameter at a time, so one optimization cycle consists of 18 steps. It turns out we had about 36 meaningful cycles. Now let's try to visualize the trajectory along which the method traveled while training.


Hidden text
%%time
# random projection of the 18-dimensional trajectory onto the plane:
import matplotlib.pyplot as plt
%matplotlib inline
import random
import math
random.seed(17)
property_projection = np.array([[math.sin(a), math.cos(a)] for a in [random.uniform(-math.pi, math.pi) for i in range(history_parametrs[0].shape[0])]]).transpose()
history = np.array(history_parametrs[::18]) # one optimization cycle = 18 steps, so take every 18th point
# project each saved weight vector onto the plane; a plain Python loop again, slow but tolerable. :(
points = np.array([(history[i] * property_projection).sum(axis=1) for i in range(history.shape[0])])
plt.plot(points[:36,0],points[0:36,1]);
plt.savefig("images/pic2.png", format = 'png')
plt.show()



It can be seen that a significant part of the journey was completed in the first four steps. Let's look at the rest of the way at a higher magnification.

Without the first 4 points
plt.plot(points[4:36,0],points[4:36,1]);
plt.savefig("images/pic3.png", format = 'png')



Let's take a closer look at the final part of the path and see what the training did after reaching its destination.

getting closer
plt.plot(points[14:36,0],points[14:36,1]);
plt.savefig("images/pic4.png", format = 'png')
plt.show()
plt.plot(points[24:36,0],points[24:36,1]);
plt.plot(points[35:,0],points[35:,1], color = 'red');
plt.savefig("images/pic5.png", format = 'png')
plt.show()





You can see that the algorithm keeps training doggedly until it settles at its destination. The exact point, of course, depends on the randomization in cross-validation, but regardless of the specific point, the general picture of what is happening is clear.

By the way, I used to use this kind of plot to demonstrate the learning process: not the whole trajectory is shown, only the last few steps, with the scale smoothed by a sliding window; there is a sketch of it below. An example can be found in my other article, "Spying on the Throws of a Neural Network". And yes, of course, everyone who encounters such a visualization immediately asks why all the factors get the same weight when their importances differ. Last time, in that article, I tried re-weighting by the importance of the synapses, and it turned out less informative.
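A rough sketch of that kind of "last few steps on a sliding scale" view, using the points array computed above; the window size is an arbitrary choice:

# Plot only the trailing window of the trajectory and rescale the axes to it,
# so that the most recent movement is always visible at full size.
window = 8
tail = points[-window:]
dx = (tail[:, 0].max() - tail[:, 0].min()) or 1.0
dy = (tail[:, 1].max() - tail[:, 1].min()) or 1.0
plt.plot(tail[:, 0], tail[:, 1], marker='o')
plt.xlim(tail[:, 0].min() - 0.05 * dx, tail[:, 0].max() + 0.05 * dx)
plt.ylim(tail[:, 1].min() - 0.05 * dy, tail[:, 1].max() + 0.05 * dy)
plt.show()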

This time, armed with new knowledge, I will try using t-SNE to unfold the multidimensional space into a projection in which, perhaps, everything will look better.

t-SNE
%%time
import sklearn.manifold as manifold
tsne = manifold.TSNE(random_state=19)
tsne_representation = tsne.fit_transform(history)
plt.plot(tsne_representation[:, 0], tsne_representation[:, 1])
plt.savefig("images/pic6.png", format = 'png')
plt.show();



t-SNE seems to have unfolded the space in a way that completely swallowed the scale of the changes in the features that stopped changing early, which made the picture entirely uninformative. Conclusion: do not try to shove algorithms into places they were not designed for. :\

You may stop reading here


I also tried injecting code into t-SNE itself to visualize its intermediate optimization states, hoping something beautiful would come out. It did not; it came out as garbage. If you are interested, here is how it is done anyway. The Internet is littered with examples of such injection code, but simply copy-pasting them does not work, because they monkey-patch the internal function _gradient_descent in sklearn.manifold.t_sne, which, depending on the version, can differ substantially both in its signature and in how it handles its internal variables. So find the sources of your own installed version, pick your version of the function out of it, and insert just one line that appends intermediate dumps to a variable of your own: positions.append(p.copy()) # We save the current position.

And then we visualize what comes out of it:

Injection code
from time import time
from scipy import linalg
# This list will contain the positions of the map points at every iteration.
positions = []
def _gradient_descent(objective, p0, it, n_iter,
                      n_iter_check=1, n_iter_without_progress=300,
                      momentum=0.8, learning_rate=200.0, min_gain=0.01,
                      min_grad_norm=1e-7, verbose=0, args=None, kwargs=None):
    # The documentation of this function can be found in scikit-learn's code.
    if args is None:
        args = []
    if kwargs is None:
        kwargs = {}

    p = p0.copy().ravel()
    update = np.zeros_like(p)
    gains = np.ones_like(p)
    error = np.finfo(float).max
    best_error = np.finfo(float).max
    best_iter = i = it

    tic = time()
    for i in range(it, n_iter):
        positions.append(p.copy()) # We save the current position.
        
        check_convergence = (i + 1) % n_iter_check == 0
        # only compute the error when needed
        kwargs['compute_error'] = check_convergence or i == n_iter - 1

        error, grad = objective(p, *args, **kwargs)
        grad_norm = linalg.norm(grad)

        inc = update * grad < 0.0
        dec = np.invert(inc)
        gains[inc] += 0.2
        gains[dec] *= 0.8
        np.clip(gains, min_gain, np.inf, out=gains)
        grad *= gains
        update = momentum * update - learning_rate * grad
        p += update

        if check_convergence:
            toc = time()
            duration = toc - tic
            tic = toc

            if verbose >= 2:
                print("[t-SNE] Iteration %d: error = %.7f,"
                      " gradient norm = %.7f"
                      " (%s iterations in %0.3fs)"
                      % (i + 1, error, grad_norm, n_iter_check, duration))

            if error < best_error:
                best_error = error
                best_iter = i
            elif i - best_iter > n_iter_without_progress:
                if verbose >= 2:
                    print("[t-SNE] Iteration %d: did not make any progress "
                          "during the last %d episodes. Finished."
                          % (i + 1, n_iter_without_progress))
                break
            if grad_norm <= min_grad_norm:
                if verbose >= 2:
                    print("[t-SNE] Iteration %d: gradient norm %f. Finished."
                          % (i + 1, grad_norm))
                break

    return p, error, i

manifold.t_sne._gradient_descent = _gradient_descent

Apply the "fixed" t-SNE
tsne_representation = manifold.TSNE(random_state=17).fit_transform(history)
X_iter = np.dstack([position.reshape(-1, 2) for position in positions])
position_reshape = [position.reshape(-1, 2) for position in positions]
print(position_reshape[0].shape)
print('[0] min', position_reshape[0][:,0].min(),'max', position_reshape[0][:,0].max())
print('[1] min', position_reshape[1][:,0].min(),'max', position_reshape[1][:,0].max())
print('[2] min', position_reshape[2][:,0].min(),'max', position_reshape[2][:,0].max())

(41, 2)
[0] min -0.00018188123 max 0.00027207955
[1] min -0.05136269 max 0.032607622
[2] min -4.392309 max 7.9074526
The values dance over a very wide range, so I will rescale them before drawing. And doing all this with Python loops is painfully slow. :(

I scale
%%time
from sklearn.preprocessing import MinMaxScaler
minMaxScaler = MinMaxScaler()
minMaxScaler.fit_transform(position_reshape[0])
position_reshape = [minMaxScaler.fit_transform(frame) for frame in position_reshape]
position_reshape[0].min(), position_reshape[0].max()

Animate
%%time

from matplotlib.animation import FuncAnimation, PillowWriter
#plt.style.use('seaborn-pastel')

fig = plt.figure()

ax = plt.axes(xlim=(0, 1), ylim=(0, 1))
line, = ax.plot([], [], lw=3)

def init():
    line.set_data([], [])
    return line,
def animate(i):
    x = position_reshape[i][:,0]
    y = position_reshape[i][:,1]
    line.set_data(x, y)
    return line,

anim = FuncAnimation(fig, animate, init_func=init, frames=36, interval=20, blit=True, repeat_delay = 1000)
anim.save('images/animate_tsne_learning.gif', writer=PillowWriter(fps=5))



Instructive in terms of skills, but absolutely useless for this task, and ugly.

And with that I say goodbye. I hope the idea that even knn can yield something new and interesting, along with the pieces of code, will serve you well and help you have fun with your data at this intellectual feast in a time of plague.
