How to help pandas process large amounts of data?

The pandas library is one of the best tools for exploratory data analysis. But that does not mean pandas is a universal tool suited to every problem, and processing large amounts of data is a case in point. I have spent a very, very long time waiting for pandas to read piles of files, or to process them and compute the metrics I was interested in from the data they contain. The problem is that pandas has no mechanism for parallel data processing, so it cannot take full advantage of modern multi-core processors. Large datasets are processed slowly in pandas.



Recently I set out to find something that would help me process big data. I found what I was looking for and embedded the tool in my data processing pipeline. I use it to work with large amounts of data: for example, to read files containing 10 gigabytes of data and to filter and aggregate them. Once such a task is solved, I save the result to a smaller CSV file that pandas can handle and then continue working with the data in pandas.
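Here is a minimal sketch of that pipeline using Dask (the tool introduced below); the file names and the column names 'category' and 'value' are made up purely for illustration:

import dask.dataframe as dd
import pandas as pd

# Read many large CSV files lazily; nothing is loaded into memory yet.
df = dd.read_csv('big_data_*.csv')

# Filter and aggregate with the familiar pandas-style API.
summary = df[df['value'] > 0].groupby('category')['value'].agg(['mean', 'count'])

# compute() runs the whole task graph in parallel and returns a pandas DataFrame.
result = summary.compute()

# Save the (now small) result and continue the analysis with plain pandas.
result.to_csv('summary.csv')
small_df = pd.read_csv('summary.csv')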

Here is a Jupyter notebook with the examples from this article that you can experiment with.

Dask


The tool I use to process large amounts of data is the Dask library. It supports parallel data processing, which lets it speed up existing tools, including numpy, pandas and sklearn. Dask is a free, open-source project. It builds on familiar Python APIs and data structures, which makes it easy to integrate into existing projects. Put briefly, Dask makes ordinary problems easier and makes problems of enormous size tractable.
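To give a feel for how Dask wraps ordinary Python code, here is a small sketch using dask.delayed; process_file is a made-up stand-in for any function you already have:

from dask import delayed, compute

def process_file(path):
    # Stand-in for any ordinary Python function: read a file,
    # compute a statistic, transform data, and so on.
    return len(path)

# Wrapping the calls only records a task graph; nothing runs yet.
tasks = [delayed(process_file)(p) for p in ['a.csv', 'b.csv', 'c.csv']]

# compute() executes all the tasks, potentially in parallel across CPU cores.
results = compute(*tasks)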

Comparing pandas and Dask


I could describe everything Dask can do here, since the library has plenty of interesting features, but instead I will just walk through one practical example. In my work I usually run into sets of large files whose data needs to be analyzed. Let's reproduce one of my typical tasks and create 10 files, each containing 100,000 records. Each file is 196 MB in size.

from sklearn.datasets import make_classification
import pandas as pd

# Generate 10 synthetic classification datasets and save each as its own CSV file.
for i in range(1, 11):
    print('Generating trainset %d' % i)
    x, y = make_classification(n_samples=100_000, n_features=100)
    df = pd.DataFrame(data=x)
    df['y'] = y
    df.to_csv('trainset_%d.csv' % i, index=False)

Now let's read these files with pandas and measure how long it takes. pandas has no built-in glob support, so we have to read the files in a loop:

%%time
import glob
df_list = []
for filename in glob.glob('trainset_*.csv'):
    df_ = pd.read_csv(filename)
    df_list.append(df_)
df = pd.concat(df_list)
df.shape

It took 16 seconds for pandas to read these files:

CPU times: user 14.6 s, sys: 1.29 s, total: 15.9 s
Wall time: 16 s

Dask, by contrast, can process files that do not fit in memory: it breaks them into blocks and builds chains of tasks. Let's measure the time Dask needs to read these files:

import dask.dataframe as dd

%%time
df = dd.read_csv('trainset_*.csv')

CPU times: user 154 ms, sys: 58.6 ms, total: 212 ms
Wall time: 212 ms

Dask took just 212 ms! How is that even possible? In fact, it isn't: Dask follows a delayed (lazy) execution paradigm, and computations are performed only when their results are needed. We describe an execution graph, which gives Dask the opportunity to optimize how the tasks are run. Let's repeat the experiment; note that Dask's read_csv function has built-in support for glob patterns:

%%time
df = dd.read_csv('trainset_*.csv').compute()
CPU times: user 39.5 s, sys: 5.3 s, total: 44.8 s
Wall time: 8.21 s

Calling compute() forces Dask to return the result, and for that the files actually have to be read. The outcome is that Dask reads the files twice as fast as pandas.
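To make the lazy-evaluation idea more tangible, here is a small sketch (the column names '0' and 'y' come from the files generated above); none of the intermediate lines read anything from disk:

import dask.dataframe as dd

# Each of these lines returns almost instantly: Dask only records the operations.
df = dd.read_csv('trainset_*.csv')
filtered = df[df['y'] == 1]
mean_of_first_feature = filtered['0'].mean()

# Only this call actually reads the files and executes the whole graph.
print(mean_of_first_feature.compute())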

One could say that Dask equips Python projects with the tools to scale.

Comparing CPU usage in pandas and Dask


Does Dask use all of the processor cores on the system? Let's compare how pandas and Dask use CPU resources when reading files, using the same code we looked at above.


Using processor resources when reading files with pandas


Using processor resources when reading files using Dask

The two animated images above make it easy to see how pandas and Dask use processor resources when reading files: pandas keeps a single core busy, while Dask spreads the work across all of them.

What happens under the hood in Dask?


A Dask dataframe is made up of several pandas dataframes, split along the index. When we call Dask's read_csv function, each file is read in blocks by multiple workers in parallel.
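A quick way to see those underlying pandas pieces, assuming the same trainset files as above:

import dask.dataframe as dd

df = dd.read_csv('trainset_*.csv')

# The Dask dataframe is split into partitions, one or more per file.
print(df.npartitions)

# map_partitions applies a function to each underlying pandas DataFrame.
print(df.map_partitions(len).compute())

# Each partition really is an ordinary pandas DataFrame.
first_partition = df.get_partition(0).compute()
print(type(first_partition))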

We can even visualize a graph of this task.

exec_graph = dd.read_csv('trainset_*.csv')
exec_graph.visualize()


The Dask execution graph for reading multiple files
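Note that visualize() relies on the graphviz package (and the Graphviz system binaries) to render the image, so you may need to install those separately for this call to produce a picture.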

Dask disadvantages


Perhaps you now have the thought: β€œIf the Dask library is so good, why not just use it instead of pandas everywhere?” But it is not that simple. Only some of the pandas functionality is ported to Dask, because certain tasks are hard to parallelize, for example sorting data or setting an index on an unsorted column. Dask is not a tool that solves every data analysis and processing problem; it is recommended only for working with datasets that do not fit into memory entirely. And since Dask is built on top of pandas, whatever is slow in pandas stays slow in Dask. As I said earlier, Dask is a useful tool to embed in a data processing pipeline, but it does not replace other libraries.
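As a rough illustration of the kind of operation that stays expensive, here is a sketch: setting an index on an unsorted column forces Dask to sort the data and shuffle it between partitions.

import dask.dataframe as dd

df = dd.read_csv('trainset_*.csv')

# Setting an index on an unsorted column requires sorting the data and
# shuffling it across partitions, exactly the kind of work that is hard
# to parallelize efficiently.
df = df.set_index('y')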

Install Dask


In order to install Dask, you can use the following command:

python -m pip install "dask[complete]"
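After installation, a quick way to check that everything is in place; the dask[complete] extra also includes the distributed scheduler, so a local Client can be started to get the diagnostic dashboard:

import dask
from dask.distributed import Client

print(dask.__version__)

# Start a local scheduler and workers; the printed dashboard link shows
# per-core CPU usage and task progress while computations run.
client = Client()
print(client.dashboard_link)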

Summary


In this article I only scratched the surface of what Dask can do. If the library interests you, take a look at these great Dask tutorials and the Dask dataframe documentation. And if you want to know which functions Dask dataframes support, read the DataFrame API description.

Would you use the Dask library?


