My cheat sheet by pandas

One teacher once told me that if you look for an analogue of a programmer in the world of books, it turns out that programmers are not like textbooks, but rather table of contents: they don’t remember everything, but they know how to quickly find what they need.

The ability to quickly find descriptions of functions allows programmers to work productively without losing the state of the stream. That's why I created the cheat sheet presented here pandasand included in it what I use every day, creating web applications and machine learning models. This is not an exhaustive list of features , but it includes the functions that I use most often, examples and my explanations about situations in which these functions are especially useful.



pandas

1. Preparation for work


If you want to independently test what is going to be discussed here, download the Anime Recommendations Database dataset from Kaggle. Unzip it and place it in the same folder where your Jupyter Notebook is located (hereinafter - notepad).

Now run the following commands.

import pandas as pd
import numpy as np
anime = pd.read_csv('anime-recommendations-database/anime.csv')
rating = pd.read_csv('anime-recommendations-database/rating.csv')
anime_modified = anime.set_index('name')

After that, you should be able to reproduce what I will show in the following sections of this material.

2. Import data


▍ Download CSV data


Here I want to talk about converting CSV data directly into data frames (into Dataframes). Sometimes when downloading data in CSV format, you need to specify their encoding (for example, it may look like encoding='ISO-8859–1'). This is the first thing that you should try to do if it turns out that after loading the data the data frame contains unreadable characters.

anime = pd.read_csv('anime-recommendations-database/anime.csv')


Downloaded CSV data

There is a similar function for loading data from Excel filespd.read_excel.

▍Create a data frame from manually entered data


This can be useful when you need to manually enter simple data into the program. For example, if you need to evaluate the changes undergoing data passing through the data processing pipeline.

df = pd.DataFrame([[1,'Bob', 'Builder'],
                  [2,'Sally', 'Baker'],
                  [3,'Scott', 'Candle Stick Maker']], 
columns=['id','name', 'occupation'])


Manual Entry

▍ Copying a data frame


Copying data frames can be useful in situations where you need to make changes to these data, but you also need to save the original. If data frames need to be copied, it is recommended to do this immediately after downloading them.

anime_copy = anime.copy(deep=True)


Copy of the data frame

3. Data export


▍ Export to CSV format


When exporting data, they are saved in the same folder as the notepad. Below is an example of saving the first 10 rows of a data frame, but what exactly to save depends on the specific task.

rating[:10].to_csv('saved_ratings.csv', index=False)

You can export data as Excel files using the function df.to_excel.

4. View and research data


▍Getting n records from the beginning or end of the data frame


First, nlet's talk about the output of the first elements of the data frame. I often infer a number of elements from the beginning of a data frame somewhere in a notebook. This allows me to conveniently access this data in the event that I forget about what exactly is in the data frame. The conclusion of the last few elements plays a similar role.

anime.head(3)
rating.tail(1)


Data from the beginning of the data frame


Data from the end of the data frame

▍ Counting the number of rows in a data frame


The function len(),that I will show here is not included pandas. But it is well suited for counting the number of rows of data frames. The results of its work can be stored in a variable and used where they are needed.

len(df)
#=> 3

▍Counting the number of unique values ​​in a column


To calculate the number of unique values ​​in a column, you can use this construction:

len(ratings['user_id'].unique())

▍Getting data frame information


Information about the data frame includes general information about it, such as the header, the number of values, data types of the columns.

anime.info()


Data frame information

There is another function similar todf.info-df.dtypes. It only displays information about the data types of the columns.

▍Displaying statistics about the data frame


Knowing the statistical information about the data frame is very useful in situations where it contains a lot of numerical values. For example, knowing the average, minimum, and maximum values ​​of a column ratinggives us some insight into how the data frame as a whole looks. Here is the corresponding command:

anime.describe()


Data frame statistics

▍Count calculation


In order to calculate the number of values ​​in a particular column, you can use the following construction:

anime.type.value_counts()


Count the number of items in a column

5. Extracting information from data frames


▍Creating a list or Series object based on column values


This can be useful in cases where you want to extract the values ​​of the columns into variables xand yfor training the model. The following commands apply here:

anime['genre'].tolist()
anime['genre']


The results of the anime ['genre']. Tolist () command


Anime ['genre'] Team Results

▍Getting a list of values ​​from an index


Let's talk about getting lists of values ​​from an index. Please note that I used the data frame here anime_modified, as its index values ​​look more interesting.

anime_modified.index.tolist()


Team Results

▍Getting a list of column values


Here is a command that allows you to get a list of column values:

anime.columns.tolist()


Team Results

6. Adding data to the data frame and removing them from it


▍Connecting a new column with a specified value to the data frame


Sometimes I have to add new columns to dataframes. For example - in cases where I have test and training sets in two different data frames, and before combining them, I need to mark them so that they can be distinguished later. For this, the following construction is used:

anime['train set'] = True

▍Create a new data frame from a subset of columns


This can be useful if you want to save several columns of a huge data frame in the new data frame, but you do not want to write down the names of the columns that you want to delete.

anime[['name','episodes']]


Command Result

▍ Delete specified columns


This technique may be useful if you need to remove only a few columns from the data frame. If you need to delete many columns, then this task can be quite tedious, so here I prefer to use the opportunity described in the previous section.

anime.drop(['anime_id', 'genre', 'members'], axis=1).head()


Team Results

▍Adding a row with the sum of values ​​from other rows to the data frame


To demonstrate this example, we will independently create a small data frame that is convenient to work with. The most interesting thing here is the design df.sum(axis=0), which allows you to get the sum of the values ​​from different rows. 

df = pd.DataFrame([[1,'Bob', 8000],
                  [2,'Sally', 9000],
                  [3,'Scott', 20]], columns=['id','name', 'power level'])
df.append(df.sum(axis=0), ignore_index=True)


The result of the command

The command of the viewdf.sum(axis=1)allows you to summarize the values ​​in the columns.

A similar mechanism is also applicable for calculating average values. For example -df.mean(axis=0).

7. Combining data frames


▍ Concatenation of two data frames


This technique is applicable in situations where there are two data frames with the same columns that need to be combined.

In this example, we first divide the data frame into two parts, and then combine these parts again:

df1 = anime[0:2]
df2 = anime[2:4]
pd.concat([df1, df2], ignore_index=True)


Dataframe df1


Datframe df2


Dataframe combining df1 and df2

▍ Merge data frames


The function df.mergethat we will look at here is similar to the left SQL join. It is used when two data frames need to be combined in a certain column.

rating.merge(anime, left_on=’anime_id’, right_on=’anime_id’, suffixes=(‘_left’, ‘_right’))


Team Results

8. Filtering


▍Getting rows with the desired index values


The index values ​​of the data frame anime_modifiedare the names of the anime. Notice how we use these names to select specific columns.

anime_modified.loc[['Haikyuu!! Second Season','Gintama']]


Team Results

▍Getting strings by numerical indices


This technique is different from that described in the previous section. When using the function, the df.ilocfirst line is assigned an index 0, the second is an index 1, and so on. Such indexes are assigned to rows even if the data frame has been modified and string values ​​are used in its index column.

The following construction allows you to select the first three rows of the data frame:

anime_modified.iloc[0:3]


Team Results

▍Getting rows by given column values


To get rows of a data frame in a situation where there is a list of column values, you can use the following command:

anime[anime['type'].isin(['TV', 'Movie'])]


The results of the command

If we are interested in a single meaning - you can use this design:

anime[anime[‘type’] == 'TV']

▍Getting a data frame slice


This technique is similar to getting a slice of a list. Namely, we are talking about getting a fragment of a data frame containing rows corresponding to a given index configuration.

anime[1:3]


Team Results

▍ Filtering by value


From data frames, you can select the rows that match the specified condition. Please note that using this method preserves existing index values.

anime[anime['rating'] > 8]


Team Results

9. Sort


To sort data frames by column values, you can use the function df.sort_values:

anime.sort_values('rating', ascending=False)


Team Results

10. Aggregation


▍Df.groupby function and counting the number of records


Here's how to count the number of records with different values ​​in columns:

anime.groupby('type').count()


Team Results

▍ df.groupby function and column aggregation in various ways


Pay attention to what is used here reset_index(). Otherwise, the column typebecomes an index column. In most cases, I recommend doing the same.

anime.groupby(["type"]).agg({
  "rating": "sum",
  "episodes": "count",
  "name": "last"
}).reset_index()

▍Create a pivot table


In order to extract some data from the data frame, there is nothing better than a pivot table. Please note that here I seriously filtered the data frame, which accelerated the creation of the pivot table.

tmp_df = rating.copy()
tmp_df.sort_values('user_id', ascending=True, inplace=True)
tmp_df = tmp_df[tmp_df.user_id < 10
tmp_df = tmp_df[tmp_df.anime_id < 30]
tmp_df = tmp_df[tmp_df.rating != -1]
pd.pivot_table(tmp_df, values='rating', index=['user_id'], columns=['anime_id'], aggfunc=np.sum, fill_value=0)


Team Results

11. Data cleaning


▍ Write to cells containing the NaN value of some other value


Here we talk about writing values 0to cells containing the value NaN. In this example, we create the same pivot table as before, but without use fill_value=0. And then use the function fillna(0)to replace the values NaNwith 0.

pivot = pd.pivot_table(tmp_df, values='rating', index=['user_id'], columns=['anime_id'], aggfunc=np.sum)
pivot.fillna(0)


Table containing NaN values


Results of replacing NaN values ​​with 0

12. Other useful features


▍ Sampling random samples from a dataset


I use the function df.sampleevery time I need to get a small random set of rows from a large data frame. If a parameter is used frac=1, then the function allows you to get an analog of the original data frame, the rows of which will be mixed.

anime.sample(frac=0.25)


Team Results

▍Selecting data frame rows


The following construction allows you to iterate over the rows of a data frame:

for idx,row in anime[:2].iterrows():
    print(idx, row)


Team Results

▍ Fighting IOPub data rate exceeded error


If you encounter an error IOPub data rate exceeded, try using the following command when starting Jupyter Notebook:

jupyter notebook — NotebookApp.iopub_data_rate_limit=1.0e10

Summary


Here I talked about some useful pandastricks for using the Jupyter Notebook environment. I hope my cheat sheet comes in handy.

Dear readers! Are there any opportunities pandaswithout which you cannot imagine your daily work?


All Articles