Your first step in Data Science: Titanic

A small introduction


I believe we could accomplish more if we were given step-by-step instructions telling us what to do and how to do it. I can recall moments in my own life when something never got off the ground simply because it was hard to understand where to start. Perhaps you once saw the words "Data Science" on the Internet and decided that it was far beyond you, and that the people doing it live somewhere out there, in another world. They don't: they are right here. And perhaps it is thanks to people from this field that this article has appeared in your feed. There are many courses that will help you get comfortable with this craft; here I will help you take the first step.

Well, are you ready? I should say right away that you will need to know Python 3, since that is what I use here. I also advise you to install Jupyter Notebook beforehand, or to see how to use Google Colab.

Step one


Kaggle will be your main assistant in this matter. In principle, you can do without it, but I will talk about that in another article. It is a platform that hosts Data Science competitions. In each such competition, even at the early stages, you will gain a huge amount of experience in solving various problems, in development, and in teamwork, which matters a lot these days.

We will take our task from there. It is called "Titanic", and the goal is to predict whether each individual passenger survives. Generally speaking, the job of someone working in DS consists of collecting data, processing it, training a model, making forecasts, and so on. On Kaggle we are allowed to skip the data-collection stage: the data is already provided on the platform. We just need to download it, and we can get started!

You can do this as follows: the files containing the data are on the Data tab of the competition page.

Data downloaded, Jupyter notebook prepared, and ...

Step two


Now, how do we load this data?

First, we import the necessary libraries:

import pandas as pd
import numpy as np

Pandas will allow us to load .csv files for further processing.

NumPy is needed to represent our data table as a numeric matrix.
Moving on. Let's take the train.csv file and load it:

dataset = pd.read_csv('train.csv')

We will refer to our training data from train.csv through the dataset variable. Let's take a look at what's in it:

dataset.head()

The head() function allows us to view the first few rows of the data frame.

The Survived column holds the outcomes, which are already known for this data frame. According to the problem statement, we need to predict the Survived column for the test.csv data. That file stores information about other Titanic passengers whose outcomes we, the ones solving the problem, do not know.

So, we will split our table into dependent and independent data. It is simple: dependent data is the outcome, the thing that depends on everything else; independent data is the data that influences that outcome.

For example, suppose we have this data set:

"Did Vova study computer science? No.
Vova got a 2 in computer science."

The computer science grade depends on the answer to the question of whether Vova studied. Clear? Moving on, we are getting closer to the goal!

The traditional variable name for independent data is X; for dependent data, y.

We do the following:

X = dataset.iloc[:, 2:]    # everything from column 2 onward (the passenger features)
y = dataset.iloc[:, 1:2]   # column 1 only (Survived)

What does this mean? With iloc[:, 2:] we tell Python: I want variable X to contain the data starting from column 2 (inclusive, and counting from zero). In the second line we say that we want y to contain the data from column 1.

[a:b, c:d] is the construction we use inside the square brackets. If you do not specify a value, it keeps its default. That is, we could write [:, :d] and get all the columns of the data frame except those from index d onward. The values a and b select rows, but we need all of them, so we leave those at their defaults.
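
To make the slicing clearer, here is a small illustration on a made-up three-column data frame (the column names and values are just an example):

toy = pd.DataFrame({'PassengerId': [1, 2, 3],
                    'Survived':    [0, 1, 1],
                    'Pclass':      [3, 1, 2]})

toy.iloc[:, 2:]    # all rows, columns from index 2 onward -> only Pclass
toy.iloc[:, 1:2]   # all rows, column 1 only -> Survived (still a data frame)
toy.iloc[0:2, :]   # rows 0 and 1, all columns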

Let's see what happened:

X.head()

y.head()

To keep this small lesson simple, we will remove the columns that require special "care" or do not affect survival at all. They contain data of type str.

count = ['Name', 'Ticket', 'Cabin', 'Embarked']
X.drop(count, inplace=True, axis=1)   # remaining columns: Pclass, Sex, Age, SibSp, Parch, Fare

Super! On to the next step.

Step three


Here we need to encode our data so that the machine better understands how it affects the result. But we will not encode everything, only the str-type data we have left: the "Sex" column. How do we want to encode it? Let's represent a person's gender as a vector: 10 for male, 01 for female.

To start, let's convert our tables into NumPy matrices:

X = np.array(X)
y = np.array(y)

And now we look:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))

The sklearn library is a wonderful library that lets us do a large part of the work in Data Science. It contains a large number of interesting machine learning models and also lets us handle data preparation.

OneHotEncoder will let us encode a person's gender in the representation we described. Two classes will be created: male and female. If the person is a man, a 1 is written in the "male" column and a 0 in the "female" column, and vice versa for a woman.

After OneHotEncoder() there is a [1]: this means that we want to encode column number 1 (counting from zero), which is Sex.
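
If you are curious what the encoder actually produces, here is a quick standalone check (a toy example, separate from our pipeline):

enc = OneHotEncoder()
enc.fit_transform([['male'], ['female'], ['male']]).toarray()
# array([[0., 1.],
#        [1., 0.],
#        [0., 1.]])
# The columns are ordered alphabetically, so the first one is "female" and the second is "male".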

Super. Let's move even further!

As a rule, it happens that some data is missing (that is, NaN, not a number). For example, there is information about a person: their name, their gender, but no data on their age. In this case we will use the following method: for each column we find the arithmetic mean and, if a value is missing in that column, we fill the gap with the mean.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X)
X = imputer.transform(X)
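
A quick toy check of how mean imputation behaves (separate from our pipeline):

toy = np.array([[1.0, 20.0],
                [2.0, np.nan],
                [3.0, 40.0]])
SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(toy)
# array([[ 1., 20.],
#        [ 2., 30.],   # the NaN is replaced by the column mean: (20 + 40) / 2 = 30
#        [ 3., 40.]])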

Now let's take into account that situations arise when the data is widely scattered. Some values lie in the interval [0, 1], while others run into the hundreds and thousands. To remove this spread and help the computer be more accurate in its calculations, we will scale the data so that, after standardization, most values end up roughly within [-3, 3]. To do this, we use StandardScaler.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X[:, 2:] = sc.fit_transform(X[:, 2:])
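
Standardization subtracts each column's mean and divides by its standard deviation, i.e. it maps x to (x - mean) / std. A quick toy check (separate from our pipeline):

toy = np.array([[10.0], [20.0], [30.0]])
StandardScaler().fit_transform(toy)
# array([[-1.22474487],
#        [ 0.        ],
#        [ 1.22474487]])   # mean 20, std ~8.165, so (10 - 20) / 8.165 ≈ -1.22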

Our data is now a fully numeric, scaled matrix.

Great! We are already close to our goal!

Step four


Time to train our first model! In the sklearn library we can find a huge number of interesting things. For this task I applied the GradientBoostingClassifier model. We use a classifier because our task is a classification problem: each prediction must be assigned to 1 (survived) or 0 (did not survive).

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(learning_rate=0.5, max_depth=5, n_estimators=150)
gbc.fit(X, y.ravel())   # ravel() flattens y from an (n, 1) column into the 1-D array sklearn expects

The fit function tells Python: let the model look for dependencies between X and y.

Less than a second and the model is ready.
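
Before moving on, it can be useful to estimate how the model does on data it has not seen. This step is not part of the original walkthrough, just a suggestion: cross-validation splits the training data into folds and averages the accuracy across them.

from sklearn.model_selection import cross_val_score

# Optional sanity check: 5-fold cross-validation accuracy on the training data.
scores = cross_val_score(GradientBoostingClassifier(learning_rate=0.5, max_depth=5, n_estimators=150),
                         X, y.ravel(), cv=5)
print(scores.mean())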

How do we apply it? Let's see right now!

Step five. Conclusion


Now we need to load the table with the test data for which we must make a forecast. With this table we will perform all the same actions that we did with X.

X_test = pd.read_csv('test.csv', index_col=0)

count = ['Name', 'Ticket', 'Cabin', 'Embarked']
X_test.drop(count, inplace=True, axis=1)

X_test = np.array(X_test)

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])],
                       remainder='passthrough')
X_test = np.array(ct.fit_transform(X_test))

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X_test)
X_test = imputer.transform(X_test)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_test[:, 2:] = sc.fit_transform(X_test[:, 2:])
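
One remark: strictly speaking, the cleaner approach is to reuse the transformers that were fitted on the training data rather than fitting new ones on test.csv, so the test set is encoded, imputed, and scaled in exactly the same way. A sketch of that variant (it assumes you kept the ct, imputer, and sc objects fitted on X instead of creating new ones here):

X_test = np.array(ct.transform(X_test))       # same encoding as for X
X_test = imputer.transform(X_test)            # fill gaps with the training means
X_test[:, 2:] = sc.transform(X_test[:, 2:])   # scale with the training mean and std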

Now let's apply our model!

gbc_predict = gbc.predict(X_test)

That's it. We have made a forecast. Now it needs to be written to a csv file and submitted to the site.

# fmt='%d' writes the labels as integers; comments='' keeps the header line from being prefixed with '#'
np.savetxt('my_gbc_predict.csv', gbc_predict, fmt='%d', delimiter=",", header='Survived', comments='')
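
Note that the Kaggle Titanic competition expects a submission with two columns, PassengerId and Survived. The savetxt call above writes only the predictions, so before uploading you may need to attach the passenger IDs, for example with pandas (a sketch; it re-reads test.csv just to get the IDs):

# A sketch of building the submission in the PassengerId,Survived format Kaggle expects.
submission = pd.DataFrame({'PassengerId': pd.read_csv('test.csv')['PassengerId'],
                           'Survived': gbc_predict.astype(int)})
submission.to_csv('my_gbc_predict.csv', index=False)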

Done. We have a file containing a prediction for each passenger. All that remains is to upload it to the site and get a score for the forecast. Even such a simple solution gives not only about 74% correct answers on the public leaderboard, but also some momentum in Data Science. The most curious are welcome to write to me in private messages at any time and ask questions. Thanks, everyone!
