Adding Parallel Computing to Pandas

You may have needed to run computations on pandas data frames in parallel. This can be done with native Python alone or with the help of a handy library, pandarallel. In this article I will show how this library lets you process your data using all available CPU cores.



The library spares you from choosing the number of workers and creating processes by hand, and it provides an interactive interface for monitoring progress.
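For comparison, the native-Python route mentioned above could look roughly like the sketch below. It is only an illustration and not how pandarallel works internally: it uses a thread pool from the standard library, which suits I/O-bound functions, and the names threaded_apply and square are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def square(x):
    """Hypothetical stand-in for a slow, I/O-bound call."""
    return x * x


def threaded_apply(series, func, max_workers=8):
    """Apply func to every element of series using a thread pool.

    Executor.map returns results in input order, so they can be
    reattached to the original index.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(func, series))
    return pd.Series(results, index=series.index, name=series.name)


df = pd.DataFrame({"sample_column": range(10)})
df["squared"] = threaded_apply(df["sample_column"], square)
```

Even in this small form you have to pick the number of workers and manage the pool yourself, which is the bookkeeping the library takes over.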


Installation


pip install pandas jupyter pandarallel requests tqdm

As you can see, I also install tqdm. I will use it to show the difference in execution speed between the sequential and parallel approaches.


Setup


import pandas as pd
import requests
from pandarallel import pandarallel
from tqdm import tqdm

# Register progress_apply on pandas objects
tqdm.pandas()

# Start pandarallel with its progress bars enabled
pandarallel.initialize(progress_bar=True)

You can find the full list of settings in the pandarallel documentation.


Create a data frame


For the experiments, create a simple data frame with 100 rows and 1 column.


df = pd.DataFrame(
    range(100),
    columns=["sample_column"]
)


Example of a task suitable for parallelization


Not every task can be parallelized. A simple example of a suitable one is a call to an external source, such as an API or a database. In the function below, I call an API that returns one random word. My goal is to add a column of words obtained from this API to the data frame.


def function_to_apply(i):
    # The argument is ignored; each call just fetches one random word
    r = requests.get('https://random-word-api.herokuapp.com/word').json()

    return r[0]


df["sample-word"] = df.sample_column.progress_apply(function_to_apply)

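As you may have noticed, thanks to tqdm I call progress_apply instead of apply. This way I get an interactive progress bar.



The sequential approach took 35 seconds.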



Now let's do the same with parallel_apply:


df["sample-word"] = df.sample_column.parallel_apply(function_to_apply)


The parallel approach took 5 seconds.
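Run times like the ones quoted in this article can also be measured directly instead of being read off a progress bar. Below is a minimal sketch using only the standard library; the helper name timed is hypothetical, and the workload is a stand-in so the snippet runs without pandas or network access.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return a (result, elapsed_seconds) pair."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload: summing a range instead of calling an API
total, seconds = timed(sum, range(1_000_000))
print(f"sum finished in {seconds:.3f} s")
```

With the data frame from this article, the same helper could be called as timed(df.sample_column.apply, function_to_apply) and timed(df.sample_column.parallel_apply, function_to_apply) to compare the two approaches.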



For the other pandas operations supported by pandarallel, see the project's GitHub page.




