6 ways to significantly speed up pandas with a couple of lines of code. Part 1

In this article I will talk about six tools that can significantly speed up your pandas code. I selected the tools by one criterion: ease of integration into an existing code base. For most of them, you just need to install the module and add a couple of lines of code.

  • Numba
  • Multiprocessing
  • Pandarallel


  • Swifter
  • Modin
  • Dask


import numpy as np
import pandas as pd
import numba

# Create a table with 100,000 rows and 4 columns filled with random integers in [0, 100)
df = pd.DataFrame(np.random.randint(0,100,size=(100000, 4)),columns=['a', 'b', 'c', 'd'])

def multiply(x):
    return x * 5
# Function for numba: the decorator compiles it into a NumPy-style ufunc
@numba.vectorize
def multiply_numba(x):
    return x * 5

In [1]: %timeit df['new_col'] = df['a'].apply(multiply)
23.9 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Built-in pandas vectorization
In [2]: %timeit df['new_col'] = df['a'] * 5
545 µs ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# Vectorization with numba
# We pass a NumPy array, because numba does not work with pandas objects
In [3]: %timeit df['new_col'] = multiply_numba(df['a'].to_numpy())
329 µs ± 2.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
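Outside IPython, where `%timeit` is not available, a comparison like the one above can be approximated with the standard `timeit` module. The sketch below is only an illustration: the helper `best_time` is a hypothetical name, numba is treated as optional, and the absolute numbers will differ from machine to machine. Note the warm-up call, since the first call to a numba function includes compilation time.

```python
import timeit

import numpy as np
import pandas as pd

try:
    import numba
except ImportError:  # assumption: numba is optional for this sketch
    numba = None

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(100_000, 4)), columns=list("abcd"))


def multiply(x):
    return x * 5


def best_time(func, repeat=3, number=5):
    """Return the best observed time per call, in seconds (hypothetical helper)."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


timings = {
    "apply": best_time(lambda: df["a"].apply(multiply)),
    "pandas": best_time(lambda: df["a"] * 5),
}

if numba is not None:
    multiply_numba = numba.vectorize(multiply)
    values = df["a"].to_numpy()
    multiply_numba(values)  # warm-up call: triggers JIT compilation
    timings["numba"] = best_time(lambda: multiply_numba(values))

for name, seconds in timings.items():
    print(f"{name:>6}: {seconds * 1e3:.3f} ms per call")
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; it is a design choice of this sketch, not something `%timeit` does (it reports mean and standard deviation).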

def square_mean(row):
    row = np.power(row, 2)
    return np.mean(row)
# Usage:
# df['new_col'] = df.apply(square_mean, axis=1)

# numba does not work with pandas objects (DataFrame, Series, etc.),
# so we rewrite the function to operate on a NumPy array
@numba.njit
def square_mean_numba(arr):
    res = np.empty(arr.shape[0])
    arr = np.power(arr, 2)
    for i in range(arr.shape[0]):
        res[i] = np.mean(arr[i])
    return res
# Usage:
# df['new_col'] = square_mean_numba(df.to_numpy())
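Before timing the two versions, it is worth checking that the loop-based rewrite returns the same values as the row-wise `apply`. The sketch below does this on a tiny frame without numba (JIT compilation should not change the results). The names `square_mean_loop` and `square_mean_vectorized` are illustrative, and the last function is a plain NumPy alternative that is not part of the article's benchmark.

```python
import numpy as np
import pandas as pd


def square_mean(row):
    # Row-wise version used with df.apply(..., axis=1)
    return np.mean(np.power(row, 2))


def square_mean_loop(arr):
    # Same body as the numba version, shown here without the decorator
    res = np.empty(arr.shape[0])
    arr = np.power(arr, 2)
    for i in range(arr.shape[0]):
        res[i] = np.mean(arr[i])
    return res


def square_mean_vectorized(arr):
    # Alternative (not from the article): let NumPy reduce along axis 1
    return np.mean(np.power(arr, 2), axis=1)


small = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("abcd"))

expected = small.apply(square_mean, axis=1).to_numpy()
looped = square_mean_loop(small.to_numpy())
vectorized = square_mean_vectorized(small.to_numpy())

print(expected)  # mean of squares for each of the 3 rows
print(np.allclose(looped, expected), np.allclose(vectorized, expected))
```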



df = pd.read_csv('abcnews-date-text.csv', header=0)
# Repeat the data 10 times to get a larger dataset
df = pd.concat([df] * 10)

   publish_date  headline_text
0  20030219      aba decides against community broadcasting lic...
1  20030219      act fire witnesses must be aware of defamation
2  20030219      a g calls for infrastructure protection summit
3  20030219      air nz staff in aust strike for pay rise
4  20030219      air nz strike to affect australian travellers

def mean_word_len(line):
    # The same computation is repeated 6 times, apparently to make the function heavier
    for _ in range(6):
        words = [len(word) for word in line.split()]
        res = sum(words) / len(words)
    return res

def compute_avg_word(df):
    return df['headline_text'].apply(mean_word_len)
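To make the metric concrete, here is a minimal sketch of what the function computes for a single headline. It is a simplified variant without the repeat loop, and the guard for empty lines is an assumption added for illustration; the article's version would raise `ZeroDivisionError` on an empty string.

```python
def mean_word_len(line):
    """Average word length of a whitespace-separated line."""
    words = line.split()
    if not words:
        # Assumption for this sketch: treat an empty line as length 0.0
        return 0.0
    return sum(len(word) for word in words) / len(words)


headline = "air nz staff in aust strike for pay rise"
lengths = [len(word) for word in headline.split()]

print(lengths)                   # word lengths, one per word
print(mean_word_len(headline))   # their arithmetic mean
```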


from multiprocessing import Pool

# Use 4 cores (one worker process per core)
n_cores = 4
pool = Pool(n_cores)

def apply_parallel(df, func):
    df_split = np.array_split(df, n_cores)
    df = pd.concat(pool.map(func, df_split))
    return df
# df['new_col'] = apply_parallel(df, compute_avg_word)
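The idea behind `apply_parallel` is split, apply, concatenate. The sketch below isolates that pattern so it can be checked serially first: `apply_in_chunks` and its `map_func` parameter are hypothetical names, and the frame is split by row position with `iloc` rather than by passing the DataFrame to `np.array_split` directly, which is an assumption made to avoid depending on how NumPy handles pandas objects.

```python
import numpy as np
import pandas as pd


def mean_word_len(line):
    words = [len(word) for word in line.split()]
    return sum(words) / len(words)


def compute_avg_word(chunk):
    return chunk["headline_text"].apply(mean_word_len)


def apply_in_chunks(df, func, n_chunks, map_func=map):
    """Split df into row chunks, apply func to each, concatenate the results.

    map_func is the built-in map by default; a process pool's map
    can be passed instead to run the chunks in parallel.
    """
    positions = np.array_split(np.arange(len(df)), n_chunks)
    chunks = [df.iloc[pos] for pos in positions]
    return pd.concat(list(map_func(func, chunks)))


demo = pd.DataFrame({
    "headline_text": [
        "act fire witnesses must be aware of defamation",
        "a g calls for infrastructure protection summit",
        "air nz staff in aust strike for pay rise",
        "air nz strike to affect australian travellers",
    ]
})

serial = compute_avg_word(demo)
chunked = apply_in_chunks(demo, compute_avg_word, n_chunks=3)
print(chunked.equals(serial))
```

To run the chunks in separate processes, `pool.map` from a `multiprocessing.Pool` can be passed as `map_func`. On platforms that start workers by spawning a fresh interpreter (Windows, and macOS by default), the pool should be created under an `if __name__ == "__main__":` guard.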


  • Speedup of about x2-3


from pandarallel import pandarallel

To be continued

In this part, we looked at two fairly simple approaches to pandas optimization: JIT compilation and task parallelization. In the next part I will cover more interesting and complex tools, but for now I suggest you test these tools yourself to make sure they are effective.

> Part 2


PS: Trust, but verify. All the code used in the article (benchmarks and plotting) is posted on GitHub.

