Loading NumPy Arrays from Disk: A Comparison of memmap() and Zarr/HDF5

If your NumPy array is too large to fit in RAM, you can process it by breaking it into fragments and loading them from disk either transparently or explicitly, one at a time. In this situation, you can turn to two classes of tools:

  • NumPy's memmap() method, a transparent mechanism that lets you treat a file on disk as if it were entirely in memory.
  • The Zarr and HDF5 data storage formats, which are similar to each other and allow compressed fragments of an array to be loaded from and saved to disk as needed.

Each of these methods has its own strengths and weaknesses. 

The material, a translation of which we publish today, analyzes the features of these approaches to working with data and describes the situations in which each may come in handy. Particular attention is paid to data formats that are optimized for performing computations and are not necessarily intended for handing data off to other programmers.

What happens when data is read from or written to disk?


When a file is read from disk for the first time, the operating system doesn't just copy the data into the process's memory. First, it copies the data into its own memory, keeping a copy in the so-called “buffer cache”.

What is the point of this?

The operating system keeps data in the cache in case you need to read the same data from the same file again.


If the data is read again, it enters the program's memory from RAM rather than from disk, which is orders of magnitude faster.


If the memory occupied by the cache is needed for something else, the cache will be automatically cleared.

When data is written to disk, it moves in the opposite direction: at first it is written only to the buffer cache. This means that write operations are usually very fast, since the program does not have to wait on a slow disk; during the write it only needs to work with RAM.

Eventually, the data is flushed from the cache to the disk.
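You can observe this from Python: a plain write() returns as soon as the data is in the buffer cache, while os.fsync() blocks until the kernel has actually flushed it to disk. A minimal sketch (the file path is illustrative):

import os

with open("mydata/example.bin", "wb") as f:
    f.write(b"\x00" * 4096)  # fast: the data lands in the OS buffer cache
    f.flush()                # push Python's own userspace buffer down to the OS
    os.fsync(f.fileno())     # slow: wait until the data actually reaches the disk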


Working with an array using memmap()


In our case, memmap() lets us treat a file on disk as if it were an array stored in memory. The operating system, transparently to the program, performs read/write operations against either the buffer cache or the hard disk, depending on whether the requested data is cached in memory. Roughly the following algorithm applies:

  • Is the data in the cache? If so, great: it can be accessed directly.
  • Is the data on disk? Access will be slower, but you don't have to worry about it: the data is loaded transparently.

An additional plus of memmap() is that in most cases the file's buffer cache is mapped straight into the program's memory. This means the system does not have to maintain an extra copy of the data in the program's memory outside the buffer cache.


The memmap() method is built into NumPy:

import numpy as np

# Map an existing file of 1024x1024 int16 values into memory, read-only.
# No data is actually read from disk until elements are accessed.
array = np.memmap("mydata/myarray.arr", mode="r",
                  dtype=np.int16, shape=(1024, 1024))

Run this code, and you will have an array at your disposal whose use is completely transparent to the program, regardless of whether a given access is served from the buffer cache or from the hard disk.
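Writing works the same way. A minimal sketch, assuming you want to create the file from scratch (mode "w+" creates or overwrites it; the path and values are illustrative):

import numpy as np

# mode="w+" creates the file on disk and maps it read-write.
array = np.memmap("mydata/myarray.arr", mode="w+",
                  dtype=np.int16, shape=(1024, 1024))
array[:] = 42    # writes go through the OS buffer cache
array.flush()    # ask the OS to write the dirty pages out to disk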

memmap() limitations


Although memmap() can perform quite well in certain situations, the method also has limitations:

  • The data must be stored in the file system; it cannot be loaded from blob storage such as AWS S3.
  • The data on disk is stored uncompressed; if your data would benefit from compression, memmap() gives you no way to take advantage of it.
  • With an N-dimensional array, the layout of the data in the file favors a single axis: slices along that axis are read efficiently, while slices along the other axes are not.

Let us explain the last point. Imagine we have a two-dimensional array of 32-bit (4-byte) integers, and that the disk reads data in 4096-byte blocks. If you read data that is laid out sequentially in the file (say, along the array's rows), each read operation yields 1024 integers. But if you read data whose location in the file does not match its location in the array (say, along the columns), each read operation yields only 1 of the required numbers. As a result, getting the same amount of data takes roughly a thousand times more read operations.
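The effect is easy to demonstrate with memmap() itself: row access touches contiguous bytes on disk, while column access touches one useful value per 4096-byte block. A sketch, with an illustrative file and shape:

import numpy as np

arr = np.memmap("mydata/big.arr", mode="r",
                dtype=np.int32, shape=(4096, 1024))

row = np.asarray(arr[0, :])   # contiguous on disk: a few large reads
col = np.asarray(arr[:, 0])   # strided on disk: ~1 useful integer per block read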

Zarr and HDF5


To overcome the above limitations, you can use the Zarr or HDF5 data storage formats, which are very similar:

  • You can work with HDF5 files in Python using pytables or h5py. The format is older than Zarr and more restrictive, but its plus is that it can be used from programs written in different languages.
  • Zarr is a format implemented in a Python package of the same name. It is much more modern and flexible than HDF5, but (at least for now) it can only be used from Python. My feeling is that in most situations, if you don't need HDF5's multi-language support, Zarr is the better choice; it has better multithreading support, for example.

Below we will discuss only Zarr, but if you are interested in the HDF5 format and a deeper comparison with Zarr, you can watch this video.

Using Zarr


Zarr lets you store fragments of an array and load them into memory as arrays, as well as write arrays back to disk as fragments.

Here's how to load an array using Zarr:

>>> import zarr, numpy as np
>>> z = zarr.open('example.zarr', mode='a',
...               shape=(1024, 1024),
...               chunks=(512, 512), dtype=np.int16)
>>> type(z)
<class 'zarr.core.Array'>
>>> type(z[100:200])
<class 'numpy.ndarray'>

Note that until we take a slice of the object, we do not have a numpy.ndarray at our disposal: a zarr.core.Array entity is just metadata. Only the data covered by the slice is loaded from disk.
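Writing is symmetrical: assigning to a slice compresses and stores only the affected fragments. A minimal sketch, continuing the session above:

>>> z[0:512, 0:512] = 42   # compresses and writes only the touched fragment
>>> int(z[100, 100])
42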

Why did I choose Zarr?


Zarr circumvents the memmap() limitations discussed above:

  • Data fragments can be stored on disk, in AWS S3 storage, or in any storage system that offers key/value records (a sketch of reading from S3 follows this list).
  • The size and structure of the fragments is determined by the programmer. For example, data can be organized so that information along different axes of a multidimensional array can be read efficiently. The same is true of HDF5.
  • Fragments can be compressed. The same is true of HDF5.
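For instance, a Zarr array can be opened directly from S3 via the s3fs package. A minimal sketch, assuming s3fs is installed and using a hypothetical bucket name:

import s3fs, zarr

# Expose a (hypothetical) bucket path as a key/value store and open it with Zarr.
s3 = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(root="my-bucket/example.zarr", s3=s3)
z = zarr.open(store, mode="r")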

Let us dwell on the last two points in more detail.

Fragment Dimensions


Suppose we are working with an array of 30,000 x 3,000 elements. If you need to read this array both moving along its X axis and moving along its Y axis, you can store its data in fragments as shown below (in practice, you would most likely need more than 9 fragments):


[Figure: the array split into a 3 x 3 grid of fragments, indexed (row, column).]

Now data located along both the X axis and the Y axis can be loaded efficiently. Depending on what data the program needs, you can load, for example, fragments (1, 0), (1, 1), (1, 2), or fragments (0, 1), (1, 1), (2, 1).
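Here is a minimal sketch of how such a layout might be declared with Zarr (the shape and fragment sizes are illustrative):

import zarr, numpy as np

# A 30,000 x 3,000 array split into a 3 x 3 grid of 10,000 x 1,000 fragments.
z = zarr.open('banded.zarr', mode='a',
              shape=(30_000, 3_000),
              chunks=(10_000, 1_000), dtype=np.int16)

row_band = z[10_000:20_000, :]   # loads fragments (1, 0), (1, 1), (1, 2)
col_band = z[:, 1_000:2_000]     # loads fragments (0, 1), (1, 1), (2, 1)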

Data compression


Each fragment can be compressed. This means data can enter the program faster than the disk can read uncompressed information: if the data compresses 3:1, it can be loaded from disk roughly 3 times faster than uncompressed data, minus the CPU time spent decompressing it.
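Zarr compresses fragments with the Blosc meta-compressor by default; the codec and compression level can also be chosen explicitly. A minimal sketch (the settings are illustrative):

import zarr, numpy as np
from numcodecs import Blosc

# Store each 512 x 512 fragment compressed with zstd via Blosc.
z = zarr.open('compressed.zarr', mode='a',
              shape=(1024, 1024), chunks=(512, 512), dtype=np.int16,
              compressor=Blosc(cname='zstd', clevel=3, shuffle=Blosc.SHUFFLE))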


After fragments have been loaded and used, they can be removed from the program's memory.

Summary: memmap() or Zarr?


So which is better to use: memmap() or Zarr?

memmap() looks attractive in the following cases:

  • Many processes read parts of the same file. Thanks to memmap(), these processes all share the same buffer cache, so only one copy of the data needs to be kept in memory no matter how many processes are running (see the sketch after this list).
  • You have no desire to manage memory manually and would rather rely on the operating system to solve all memory management issues automatically and invisibly.
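A minimal sketch of the first case: several processes map the same (hypothetical) file, and the operating system serves all of them from a single buffer cache:

import numpy as np
from multiprocessing import Process

def worker(start, stop):
    # Each process maps the same file; the OS keeps one cached copy of the data.
    arr = np.memmap("mydata/myarray.arr", mode="r",
                    dtype=np.int16, shape=(1024, 1024))
    print(arr[start:stop].sum())

if __name__ == "__main__":
    procs = [Process(target=worker, args=(i * 256, (i + 1) * 256))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()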

Zarr is especially useful in the following situations (in some of them, as will be noted, HDF5 format is also applicable):

  • Data is loaded from remote sources rather than from the local file system.
  • Reading from disk is very likely to be the system's bottleneck, and compression lets you make more efficient use of the hardware. This also applies to HDF5.
  • You need to take slices of multidimensional arrays along different axes, and Zarr lets you optimize such operations by choosing an appropriate fragment size and structure. The same is true of HDF5.

Choosing between memmap() and Zarr, I would try Zarr first, because of the flexibility of the package and of the data storage format it implements.

Dear readers! How do you solve the problem of working with large NumPy arrays?

