Speed up numpy, scikit and pandas 100x with Rust and LLVM: an interview with a Weld developer

Hello, Habr! Here is my translation of the article "Interview with Weld's main contributor: accelerating numpy, scikit and pandas as much as 100x with Rust and LLVM".

After working for several weeks with data science tools in Python and R, I began to wonder whether there is an intermediate representation (IR), similar to CUDA, that could be shared across languages. There must be something better than reimplementing and optimizing the same methods in every language. On top of that, it would be nice to have a common runtime that optimizes the whole program rather than each function in isolation.

After several days of researching and testing various projects, I found Weld (you can also read the academic paper about it).

To my surprise, one of Weld's authors is Matei Zaharia, the creator of Spark.

So I contacted Shoumik Palkar, Weld's main contributor, and interviewed him. Shoumik is a PhD student in the Department of Computer Science at Stanford University, advised by Matei Zaharia.

Weld is not yet ready for production use, but it is very promising. If you are interested in the future of data science, and of Rust in particular, you will love this interview.

A note from the author of the original article
If you liked this interview and want to read more like it, follow «Not a Monad Tutorial». If you have questions or suggestions, write to mail@fcarrone.com or reach me at @unbalancedparen.

What was Weld designed for, and what problems does it solve?


Weld's goal is to increase performance for applications that use high-level APIs such as NumPy and Pandas. The main problem it solves is enabling cross-function and cross-library optimizations that other libraries do not provide today. In particular, many widely used libraries ship state-of-the-art implementations of algorithms for individual functions (for example, the fast join algorithm implemented in C in Pandas, or fast matrix multiplication in NumPy), but they provide no way to optimize combinations of those functions, for example by preventing unnecessary memory scans when a matrix multiplication is followed by an aggregation.
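As a minimal illustration of the pattern described above (plain NumPy, not Weld code):

```python
import numpy as np

# Matrix multiplication followed by an aggregation: NumPy materializes
# the full 2000x2000 product in memory, then scans it again to sum it.
# A cross-function optimizer can fuse the two calls and fold the
# summation into the multiply, avoiding the extra memory traffic.
A = np.random.rand(2000, 2000)
B = np.random.rand(2000, 2000)
total = (A @ B).sum()   # two separate library calls, two passes over the data
```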

Weld provides a common runtime that allows libraries to express computations in a common intermediate representation (IR). This IR can then be optimized by a compiler optimizer and JIT-compiled into parallel machine code with optimizations such as loop fusion and vectorization. Weld's IR is parallel by design, so programs expressed in it can always be trivially parallelized.
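To give a flavor of the IR, here is a minimal sketch modeled on the examples in the Weld paper (shown as a Python string, since the bindings hand such fragments to the JIT at runtime). It computes sum(v * 2.0) over a vector as a single parallel loop merging into a builder, so the map and the reduction are fused and no temporary array is materialized:

```python
# A minimal Weld IR program, based on the examples in the Weld paper:
# each scaled element is merged into a `merger[f64, +]` builder inside
# one parallel `for` loop, fusing the map and the reduction.
weld_program = """
|v: vec[f64]|
    result(for(v, merger[f64, +], |b, i, e| merge(b, e * 2.0)))
"""
```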

We also have a new project called Split annotations, which will be integrated with Weld and is designed to lower the barrier to enabling such optimizations in existing libraries.

Wouldn't it be easier to simply optimize numpy, pandas and scikit themselves? How much faster does Weld make them?


Weld optimizes combinations of functions across these libraries, whereas optimizing the libraries themselves can only speed up individual function calls. Many of these libraries are already very well optimized function by function, yet they deliver performance below the limits of modern hardware because they do not exploit parallelism or do not use the memory hierarchy efficiently.

For example, many NumPy functions on multidimensional arrays (ndarray) are implemented in low-level C, but each function call requires a full scan of its input. If the arrays do not fit in the CPU caches, most of the execution time may be spent loading data from main memory rather than performing computations. Weld can look across individual function calls and perform optimizations such as loop fusion that keep data in CPU caches or registers. Such optimizations can improve performance by more than an order of magnitude on multi-core systems, since they also provide better scalability.
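A sketch of the difference (the fused version is written as a plain Python loop purely for clarity; Weld emits vectorized, parallel machine code instead):

```python
import numpy as np

x = np.random.rand(1_000_000)

# Unfused, as NumPy executes it today: each ufunc makes a full pass
# over the array and writes a temporary back to memory (three passes,
# two temporaries for this expression).
y = np.sqrt(x * 2.0 + 1.0)

# What loop fusion produces, in spirit: one pass, with each element
# staying in a register between operations.
out = np.empty_like(x)
for i in range(x.size):
    out[i] = (x[i] * 2.0 + 1.0) ** 0.5
```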


Prototype integrations of Weld with Spark (top left), NumPy (top right) and TensorFlow (bottom left) show up to a 30x improvement over each framework's own implementation, with no changes to the user's application code. Cross-library optimization of Pandas and NumPy (bottom right) can improve performance by two orders of magnitude.

What is Baloo?


Baloo is a library that implements a subset of the Pandas API using Weld. It was developed by Radu Jica, a Master's student at CWI (Centrum Wiskunde & Informatica, Amsterdam). The goal of Baloo is to apply the optimizations described above to Pandas in order to improve single-threaded performance, reduce memory usage, and provide parallelism.
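A sketch of what using Baloo looks like, based on its pandas-like, lazily evaluated API; the exact method names (such as evaluate()) are an assumption here and may differ between versions, so check the project's README:

```python
import numpy as np
import baloo as bl

# Operations build up a lazy Weld computation rather than executing
# eagerly, pandas-style calls included.
df = bl.DataFrame({'a': np.arange(5, dtype=np.float64),
                   'b': np.arange(5, dtype=np.float64) * 2})
df = df[df['a'] < 3.0]
df['c'] = df['a'] + df['b']

# The whole pipeline is JIT-compiled and run in one go here
# (assumed API: an explicit evaluate() step forces execution).
print(df.evaluate())
```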

Do Weld / Baloo support out-of-core operation (like, say, Dask) for processing data that does not fit in memory?


Weld and Baloo currently do not support out-of-core (external memory) operation, although we would be happy to see open-source contributions in this direction!

Why did you choose Rust and LLVM to implement Weld? Did you come to Rust right away?


We chose Rust because:

  • It has a minimal runtime (essentially just bounds checks on arrays) and is easy to embed into other languages such as Java and Python
  • It offers functional programming constructs, such as pattern matching, which make code like a pattern-matching compiler optimizer easier to write (see the toy example after this list)
  • It has a great community and high-quality packages (called "crates" in Rust) that made developing our system easier
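To show why pattern matching matters for an optimizer, here is a toy constant-folding pass over a made-up mini-IR; rewrite rules read almost like the algebra they encode. (Weld's real optimizer does this in Rust over Weld IR; this sketch uses Python 3.10+ structural pattern matching.)

```python
from dataclasses import dataclass

# A made-up three-node expression IR, just for illustration.
@dataclass
class Lit:
    val: float

@dataclass
class Add:
    l: object
    r: object

@dataclass
class Mul:
    l: object
    r: object

def fold(e):
    """Constant-fold an expression tree bottom-up."""
    match e:
        case Add(l, r):
            l, r = fold(l), fold(r)
            match (l, r):
                case (Lit(a), Lit(b)):
                    return Lit(a + b)
            return Add(l, r)
        case Mul(l, r):
            l, r = fold(l), fold(r)
            match (l, r):
                case (Lit(a), Lit(b)):
                    return Lit(a * b)
                case (x, Lit(1.0)):
                    return x  # algebraic rule: x * 1 => x
            return Mul(l, r)
        case _:
            return e

print(fold(Mul(Add(Lit(1.0), Lit(2.0)), Lit(1.0))))  # -> Lit(val=3.0)
```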

We chose LLVM because it is an open-source compiler framework that is widely used and supported. We generate LLVM IR directly instead of C/C++, so we do not need a C compiler. This also reduces compilation time, since we do not parse or optimize C/C++ code.
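Weld does this from Rust; purely as an illustration of what "generating LLVM directly" means (and not code Weld itself uses), here is a tiny sketch using the llvmlite Python bindings:

```python
import llvmlite.ir as ir

# Build a function computing x * 2.0 + y directly as LLVM IR --
# no C/C++ source is ever produced or parsed.
module = ir.Module(name="weld_demo")
fnty = ir.FunctionType(ir.DoubleType(), [ir.DoubleType(), ir.DoubleType()])
fn = ir.Function(module, fnty, name="fused_mul_add")
builder = ir.IRBuilder(fn.append_basic_block(name="entry"))
x, y = fn.args
two = ir.Constant(ir.DoubleType(), 2.0)
builder.ret(builder.fadd(builder.fmul(x, two), y))

print(module)  # textual LLVM IR, ready to hand to LLVM's JIT
```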

Rust was not the first language Weld was implemented in. The first implementation was in Scala, which was chosen for its algebraic data types and its powerful pattern matching. This made it easier to write the optimizer, which is the main part of the compiler. Our original optimizer was modeled on Catalyst, the extensible optimizer in Spark SQL.

We moved away from Scala because it was too difficult to embed a JVM-based language into other runtimes and languages.

Weld targets both the CPU and the GPU. How does it differ from RAPIDS, which implements Python data science libraries for the GPU?


The main difference between Weld and systems such as RAPIDS is that Weld aims to optimize applications spanning different kernels (functions, in CUDA terms) by compiling them on the fly, rather than to optimize the implementations of individual functions. For example, Weld's GPU backend would compile one single CUDA kernel optimized for the whole application, instead of invoking separate kernels.

In addition, Weld's IR is hardware-independent, so it can target GPUs as well as CPUs, and even non-standard hardware such as vector processors. Of course, Weld overlaps substantially with other projects in this space, including RAPIDS, and was shaped by their influence.

Runtimes such as Bohrium (which implements lazy evaluation for NumPy) and Numba (a Python library with a JIT compiler) share Weld's high-level goals, while optimizer systems such as Spark SQL directly influenced Weld's design.

Does Weld have other uses besides optimizing data science libraries?


One of the most exciting aspects of Weld's IR is that it natively supports data parallelism. This means that loops in Weld IR are always safe to parallelize, which makes the IR attractive for new kinds of hardware.

For example, NEC engineers used Weld to run Python workloads on a custom high-bandwidth vector processor simply by adding a new backend to the existing Weld IR.

The IR can also be used to implement the physical operator layer in a database. And we plan to add features that will allow compiling a subset of Python into Weld code.

Are the libraries ready for use in real projects? And if not, when can we expect a finished version?


Many of the examples and benchmarks on which we tested these libraries are taken from real workloads. Therefore, we would really like users to try the current version in their own applications, give us feedback, and, even better, submit patches.

That said, you should not currently expect everything to work out of the box in real applications.

Our next releases over the coming months will focus exclusively on the usability and reliability of the Python libraries. Our goal is to make the libraries good enough to be included in real projects, and to make it possible to fall back to non-Weld versions of the libraries where support has not yet been added.

As I noted in the first answer, the Split annotations project (source code and academic paper) should simplify this transition.

Split annotations is a system that lets you annotate existing code to define how to split, pipeline, and parallelize it. It provides the optimization that we consider most impactful in Weld (keeping chunks of data in CPU caches between function calls instead of scanning the whole data set), but Split annotations are much easier to integrate than Weld, because they reuse existing library code rather than relying on an IR compiler. This also makes them easier to maintain and debug, which in turn improves reliability.
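To make the idea concrete, here is a purely hypothetical sketch; the real API lives in the split-annotations repository and differs. The decorator name, its arguments, and the chunking runtime below are all invented for illustration:

```python
import numpy as np

def splittable(*arg_splits):
    """Hypothetical decorator: records how each argument may be split,
    so a runtime could pipeline cache-sized chunks through a sequence
    of annotated calls instead of scanning whole arrays per call."""
    def wrap(f):
        f.split_info = arg_splits  # metadata a real runtime would consume
        return f
    return wrap

@splittable("elements", "elements")   # both inputs split element-wise
def scale_add(x, y):
    return x * 2.0 + y

def run_chunked(f, x, y, chunk=1 << 16):
    """Toy runtime: process cache-sized chunks so data stays hot."""
    out = np.empty_like(x)
    for i in range(0, x.size, chunk):
        out[i:i + chunk] = f(x[i:i + chunk], y[i:i + chunk])
    return out

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
result = run_chunked(scale_add, a, b)
```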

Libraries that do not yet have full Weld support can use Split annotations. This will let us add Weld support gradually, guided by user feedback, while delivering new optimizations in the meantime.
