🧔🏼 👨🏽‍🏭 🧑🏼 My cheat sheet by pandas 👩🏻‍🤝‍👨🏾 👶🏽 🆔

One teacher once told me that if you look for an analogue of a programmer in the world of books, it turns out that programmers are not like textbooks, but rather table of contents: they don’t remember everything, but they know how to quickly find what they need.

The ability to quickly find descriptions of functions allows programmers to work productively without losing the state of the stream. That's why I created the cheat sheet presented here pandasand included in it what I use every day, creating web applications and machine learning models. This is not an exhaustive list of features , but it includes the functions that I use most often, examples and my explanations about situations in which these functions are especially useful.

pandas

1. Preparation for work

If you want to independently test what is going to be discussed here, download the Anime Recommendations Database dataset from Kaggle. Unzip it and place it in the same folder where your Jupyter Notebook is located (hereinafter - notepad).

Now run the following commands.

import pandas as pd
import numpy as np
anime = pd.read_csv('anime-recommendations-database/anime.csv')
rating = pd.read_csv('anime-recommendations-database/rating.csv')
anime_modified = anime.set_index('name')

After that, you should be able to reproduce what I will show in the following sections of this material.

2. Import data

▍ Download CSV data

Here I want to talk about converting CSV data directly into data frames (into Dataframes). Sometimes when downloading data in CSV format, you need to specify their encoding (for example, it may look like encoding='ISO-8859–1'). This is the first thing that you should try to do if it turns out that after loading the data the data frame contains unreadable characters.

anime = pd.read_csv('anime-recommendations-database/anime.csv')

Downloaded CSV data

There is a similar function for loading data from Excel filespd.read_excel.

▍Create a data frame from manually entered data

This can be useful when you need to manually enter simple data into the program. For example, if you need to evaluate the changes undergoing data passing through the data processing pipeline.

df = pd.DataFrame([[1,'Bob', 'Builder'],
                  [2,'Sally', 'Baker'],
                  [3,'Scott', 'Candle Stick Maker']], 
columns=['id','name', 'occupation'])

Manual Entry

▍ Copying a data frame

Copying data frames can be useful in situations where you need to make changes to these data, but you also need to save the original. If data frames need to be copied, it is recommended to do this immediately after downloading them.

anime_copy = anime.copy(deep=True)

Copy of the data frame

3. Data export

▍ Export to CSV format

When exporting data, they are saved in the same folder as the notepad. Below is an example of saving the first 10 rows of a data frame, but what exactly to save depends on the specific task.

rating[:10].to_csv('saved_ratings.csv', index=False)

You can export data as Excel files using the function df.to_excel.

4. View and research data

▍Getting n records from the beginning or end of the data frame

First, nlet's talk about the output of the first elements of the data frame. I often infer a number of elements from the beginning of a data frame somewhere in a notebook. This allows me to conveniently access this data in the event that I forget about what exactly is in the data frame. The conclusion of the last few elements plays a similar role.

anime.head(3)
rating.tail(1)

Data from the beginning of the data frame

Data from the end of the data frame

▍ Counting the number of rows in a data frame

The function len(),that I will show here is not included pandas. But it is well suited for counting the number of rows of data frames. The results of its work can be stored in a variable and used where they are needed.

len(df)
#=> 3

▍Counting the number of unique values in a column

To calculate the number of unique values in a column, you can use this construction:

len(ratings['user_id'].unique())

▍Getting data frame information

Information about the data frame includes general information about it, such as the header, the number of values, data types of the columns.

anime.info()

Data frame information

There is another function similar todf.info-df.dtypes. It only displays information about the data types of the columns.

▍Displaying statistics about the data frame

Knowing the statistical information about the data frame is very useful in situations where it contains a lot of numerical values. For example, knowing the average, minimum, and maximum values of a column ratinggives us some insight into how the data frame as a whole looks. Here is the corresponding command:

anime.describe()

Data frame statistics

▍Count calculation

In order to calculate the number of values in a particular column, you can use the following construction:

anime.type.value_counts()

Count the number of items in a column

5. Extracting information from data frames

▍Creating a list or Series object based on column values

This can be useful in cases where you want to extract the values of the columns into variables xand yfor training the model. The following commands apply here:

anime['genre'].tolist()
anime['genre']

The results of the anime ['genre']. Tolist () command

Anime ['genre'] Team Results

▍Getting a list of values from an index

Let's talk about getting lists of values from an index. Please note that I used the data frame here anime_modified, as its index values look more interesting.

anime_modified.index.tolist()

Team Results

▍Getting a list of column values

Here is a command that allows you to get a list of column values:

anime.columns.tolist()

Team Results

6. Adding data to the data frame and removing them from it

▍Connecting a new column with a specified value to the data frame

Sometimes I have to add new columns to dataframes. For example - in cases where I have test and training sets in two different data frames, and before combining them, I need to mark them so that they can be distinguished later. For this, the following construction is used:

anime['train set'] = True

▍Create a new data frame from a subset of columns

This can be useful if you want to save several columns of a huge data frame in the new data frame, but you do not want to write down the names of the columns that you want to delete.

anime[['name','episodes']]

Command Result

▍ Delete specified columns

This technique may be useful if you need to remove only a few columns from the data frame. If you need to delete many columns, then this task can be quite tedious, so here I prefer to use the opportunity described in the previous section.

anime.drop(['anime_id', 'genre', 'members'], axis=1).head()

Team Results

▍Adding a row with the sum of values from other rows to the data frame

To demonstrate this example, we will independently create a small data frame that is convenient to work with. The most interesting thing here is the design df.sum(axis=0), which allows you to get the sum of the values from different rows.

df = pd.DataFrame([[1,'Bob', 8000],
                  [2,'Sally', 9000],
                  [3,'Scott', 20]], columns=['id','name', 'power level'])
df.append(df.sum(axis=0), ignore_index=True)

The result of the command

The command of the viewdf.sum(axis=1)allows you to summarize the values in the columns.

A similar mechanism is also applicable for calculating average values. For example -df.mean(axis=0).

7. Combining data frames

▍ Concatenation of two data frames

This technique is applicable in situations where there are two data frames with the same columns that need to be combined.

In this example, we first divide the data frame into two parts, and then combine these parts again:

df1 = anime[0:2]
df2 = anime[2:4]
pd.concat([df1, df2], ignore_index=True)

Dataframe df1

Datframe df2

Dataframe combining df1 and df2

▍ Merge data frames

The function df.mergethat we will look at here is similar to the left SQL join. It is used when two data frames need to be combined in a certain column.

rating.merge(anime, left_on=’anime_id’, right_on=’anime_id’, suffixes=(‘_left’, ‘_right’))

Team Results

8. Filtering

▍Getting rows with the desired index values

The index values of the data frame anime_modifiedare the names of the anime. Notice how we use these names to select specific columns.

anime_modified.loc[['Haikyuu!! Second Season','Gintama']]

Team Results

▍Getting strings by numerical indices

This technique is different from that described in the previous section. When using the function, the df.ilocfirst line is assigned an index 0, the second is an index 1, and so on. Such indexes are assigned to rows even if the data frame has been modified and string values are used in its index column.

The following construction allows you to select the first three rows of the data frame:

anime_modified.iloc[0:3]

Team Results

▍Getting rows by given column values

To get rows of a data frame in a situation where there is a list of column values, you can use the following command:

anime[anime['type'].isin(['TV', 'Movie'])]

The results of the command

If we are interested in a single meaning - you can use this design:

anime[anime[‘type’] == 'TV']

▍Getting a data frame slice

This technique is similar to getting a slice of a list. Namely, we are talking about getting a fragment of a data frame containing rows corresponding to a given index configuration.

anime[1:3]

Team Results

▍ Filtering by value

From data frames, you can select the rows that match the specified condition. Please note that using this method preserves existing index values.

anime[anime['rating'] > 8]

Team Results

9. Sort

To sort data frames by column values, you can use the function df.sort_values:

anime.sort_values('rating', ascending=False)

Team Results

10. Aggregation

▍Df.groupby function and counting the number of records

Here's how to count the number of records with different values in columns:

anime.groupby('type').count()

Team Results

▍ df.groupby function and column aggregation in various ways

Pay attention to what is used here reset_index(). Otherwise, the column typebecomes an index column. In most cases, I recommend doing the same.

anime.groupby(["type"]).agg({
  "rating": "sum",
  "episodes": "count",
  "name": "last"
}).reset_index()

▍Create a pivot table

In order to extract some data from the data frame, there is nothing better than a pivot table. Please note that here I seriously filtered the data frame, which accelerated the creation of the pivot table.

tmp_df = rating.copy()
tmp_df.sort_values('user_id', ascending=True, inplace=True)
tmp_df = tmp_df[tmp_df.user_id < 10] 
tmp_df = tmp_df[tmp_df.anime_id < 30]
tmp_df = tmp_df[tmp_df.rating != -1]
pd.pivot_table(tmp_df, values='rating', index=['user_id'], columns=['anime_id'], aggfunc=np.sum, fill_value=0)

Team Results

11. Data cleaning

▍ Write to cells containing the NaN value of some other value

Here we talk about writing values 0to cells containing the value NaN. In this example, we create the same pivot table as before, but without use fill_value=0. And then use the function fillna(0)to replace the values NaNwith 0.

pivot = pd.pivot_table(tmp_df, values='rating', index=['user_id'], columns=['anime_id'], aggfunc=np.sum)
pivot.fillna(0)

Table containing NaN values

Results of replacing NaN values with 0

12. Other useful features

▍ Sampling random samples from a dataset

I use the function df.sampleevery time I need to get a small random set of rows from a large data frame. If a parameter is used frac=1, then the function allows you to get an analog of the original data frame, the rows of which will be mixed.

anime.sample(frac=0.25)

Team Results

▍Selecting data frame rows

The following construction allows you to iterate over the rows of a data frame:

for idx,row in anime[:2].iterrows():
    print(idx, row)

Team Results

▍ Fighting IOPub data rate exceeded error

If you encounter an error IOPub data rate exceeded, try using the following command when starting Jupyter Notebook:

jupyter notebook — NotebookApp.iopub_data_rate_limit=1.0e10

Summary

Here I talked about some useful pandastricks for using the Jupyter Notebook environment. I hope my cheat sheet comes in handy.

Dear readers! Are there any opportunities pandaswithout which you cannot imagine your daily work?

My cheat sheet by pandas

1. Preparation for work

2. Import data

▍ Download CSV data

▍Create a data frame from manually entered data

▍ Copying a data frame

3. Data export

▍ Export to CSV format

4. View and research data

▍Getting n records from the beginning or end of the data frame

▍ Counting the number of rows in a data frame

▍Counting the number of unique values ​​in a column

▍Getting data frame information

▍Displaying statistics about the data frame

▍Count calculation

5. Extracting information from data frames

▍Creating a list or Series object based on column values

▍Getting a list of values ​​from an index

▍Getting a list of column values

6. Adding data to the data frame and removing them from it

▍Connecting a new column with a specified value to the data frame

▍Create a new data frame from a subset of columns

▍ Delete specified columns

▍Adding a row with the sum of values ​​from other rows to the data frame

7. Combining data frames

▍ Concatenation of two data frames

▍ Merge data frames

8. Filtering

▍Getting rows with the desired index values

▍Getting strings by numerical indices

▍Getting rows by given column values

▍Getting a data frame slice

▍ Filtering by value

9. Sort

10. Aggregation

▍Df.groupby function and counting the number of records

▍ df.groupby function and column aggregation in various ways

▍Create a pivot table

11. Data cleaning

▍ Write to cells containing the NaN value of some other value

12. Other useful features

▍ Sampling random samples from a dataset

▍Selecting data frame rows

▍ Fighting IOPub data rate exceeded error

Summary

More articles:

▍Counting the number of unique values in a column

▍Getting a list of values from an index

▍Adding a row with the sum of values from other rows to the data frame