Incredible parallel Apache Parquet I/O performance in Python

In anticipation of the start of the Data Engineer course, we have prepared a translation of a short but interesting article.




In this article, I will talk about how Parquet compresses large datasets into a small-footprint file, and how we can achieve read bandwidth far exceeding that of the underlying I/O stream using concurrency (multithreading).

Apache Parquet: Best for Low-Entropy Data


As you can see from the Apache Parquet format specification, it contains several layers of encoding that can achieve a significant reduction in file size, among them:

  • Dictionary encoding (compression), similar to the pandas.Categorical way of representing data, although the concepts themselves are different;
  • Compression of data pages (Snappy, Gzip, LZO or Brotli);
  • Run-length encoding (for null indicators and dictionary indices) and integer bit packing.

To show you how this works, let's look at a dataset:

['banana', 'banana', 'banana', 'banana', 'banana', 'banana',
 'banana', 'banana', 'apple', 'apple', 'apple']

Almost all Parquet implementations use dictionary compression by default. The encoded data then looks like this:

dictionary: ['banana', 'apple']
indices: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

The dictionary indices are further compressed with run-length encoding:

dictionary: ['banana', 'apple']
indices (RLE): [(8, 0), (3, 1)]

Reversing these steps, you can easily reconstruct the original array of strings.
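To make this concrete, here is a toy sketch in plain Python (not the actual Parquet encoder, which also bit-packs the indices) of these two steps applied to the example above, including the reverse path:

from itertools import groupby

values = ['banana'] * 8 + ['apple'] * 3

# Dictionary encoding: store each distinct value once, keep integer indices.
dictionary = list(dict.fromkeys(values))         # ['banana', 'apple']
indices = [dictionary.index(v) for v in values]  # [0, 0, ..., 1, 1, 1]

# Run-length encode the indices as (run length, index) pairs.
rle = [(len(list(run)), idx) for idx, run in groupby(indices)]
print(dictionary, rle)                           # ['banana', 'apple'] [(8, 0), (3, 1)]

# Reversing the steps restores the original array of strings.
decoded = [dictionary[idx] for count, idx in rle for _ in range(count)]
assert decoded == values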

In my previous article, I created a dataset that compresses very well in this way. When working with pyarrow, we can enable or disable dictionary encoding (which is enabled by default) to see how it affects the file size:

import pyarrow.parquet as pq

pq.write_table(dataset, out_path, use_dictionary=True,
               compression='snappy')

A dataset that occupies 1 GB (1024 MB) as a pandas.DataFrame takes only 1.436 MB with Snappy compression and dictionary encoding, small enough to fit on a floppy disk. Without dictionary encoding it occupies 44.4 MB.
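If you want to reproduce this kind of comparison on your own data, a minimal sketch might look like the following; here df stands for whatever low-entropy pandas DataFrame you already have, and the output file names are just placeholders:

import os

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(df)  # df: your low-entropy pandas DataFrame

# Write the same table once with dictionary encoding (the default) and once without.
pq.write_table(table, 'with_dict.parquet', use_dictionary=True,
               compression='snappy')
pq.write_table(table, 'no_dict.parquet', use_dictionary=False,
               compression='snappy')

for path in ('with_dict.parquet', 'no_dict.parquet'):
    print(path, round(os.path.getsize(path) / 2**20, 3), 'MB')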

Concurrent reading in parquet-cpp using PyArrow


In parquet-cpp, the C++ implementation of Apache Parquet, which we have made available to Python in PyArrow, the ability to read columns in parallel was added.

To try this feature, install PyArrow from conda-forge:

conda install pyarrow -c conda-forge

Now, when reading a Parquet file, you can use the nthreads argument:

import pyarrow.parquet as pq

table = pq.read_table(file_path, nthreads=4)

For low-entropy data, decompression and decoding are strongly CPU-bound. Since C++ does all the work for us, there are no GIL contention problems, and we can achieve a significant speedup. Here is what I was able to achieve reading a 1 GB dataset into a pandas DataFrame on a quad-core laptop (Xeon E3-1505M, NVMe SSD):



You can see the full benchmarking script here.
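As a rough illustration only (this is not the full benchmark script), you could time single-threaded versus multithreaded reads yourself; file_path below is a placeholder for the Parquet file written earlier, and note that newer PyArrow versions have replaced nthreads with a use_threads flag:

import time

import pyarrow.parquet as pq

# Compare read times with 1 and 4 threads.
for n in (1, 4):
    start = time.time()
    df = pq.read_table(file_path, nthreads=n).to_pandas()
    print('nthreads=%d: %.2f s' % (n, time.time() - start))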

I have included performance figures for both the dictionary-encoded and non-dictionary cases. For low-entropy data, even though all the files are small (~1.5 MB with dictionaries and ~45 MB without), dictionary encoding significantly affects performance. With 4 threads, read performance into pandas reaches 4 GB/s. This is much faster than the Feather format or any other I know of.

Conclusion


With the 1.0 release of parquet-cpp (Apache Parquet in C++), you can see for yourself the increased I/O performance now available to Python users.

Since all of the underlying machinery is implemented in C++, other languages (for example, R) can build interfaces to Apache Arrow (columnar data structures) and parquet-cpp. The Python binding is a lightweight shell around the core libarrow and libparquet C++ libraries.

That's all. If you want to learn more about our course, sign up for an open day, which will be held today!
