Linux code performance testing with examples

When I started learning Java, one of the first tasks I tried to solve was determining whether a number is even or odd. I knew several ways to do this, but decided to look for the "right" way on the Internet. Every link I found described a single correct solution of the form x % 2, taking the remainder of division by two. If the remainder is 0, the number is even; if the remainder is 1, it is odd.

From my ZX Spectrum days I remembered another way, related to the representation of numbers in the binary system. Any decimal number can be written as a sum of powers of two. For example, any number that fits in one byte (8 bits) can be represented as a sum taken from the numbers 128 + 64 + 32 + 16 + 8 + 4 + 2 + 1.

This is just the sequence of powers of two. When converting a number to binary, each bit is 1 if the corresponding power of two is included in the sum, and 0 if it is not.

For example:

10 = 1010 (8 + 0 + 2 + 0)
13 = 1101 (8 + 4 + 0 + 1)
200 = 11001000 (128 + 64 + 0 + 0 + 8 + 0 + 0 + 0)
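
These decompositions are easy to verify with a few lines of Java (a throwaway sketch of my own, not part of the benchmarks below):

public class BinaryDemo {
        public static void main(String[] args) {
                // Integer.toBinaryString shows the powers-of-two decomposition directly.
                System.out.println(Integer.toBinaryString(10));  // 1010
                System.out.println(Integer.toBinaryString(13));  // 1101
                System.out.println(Integer.toBinaryString(200)); // 11001000
        }
}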

You can immediately notice that only the zeroth power of two can contribute an odd value, 1; all other powers of two are even by definition. This automatically means that, from the point of view of the binary number system, we do not need to examine a whole number to check its parity, no matter how big it is. We only need to check the first (rightmost) bit. If it is 0, the number is even, since all the other bits contribute even values; conversely, if the rightmost bit is 1, the number is guaranteed to be odd, because all the other bits can only contribute an even value.

There are several ways to check only the rightmost bit of a number. One of them is binary AND.

AND


Binary AND works by the following rule: ANDing any bit of the original number with 0 always yields 0, so you can zero out the bits you don't need. ANDing a bit with 1 leaves the original bit unchanged.

In the binary system, it is easy to write this:

0 AND 0 = 0 (mask bit 0 zeroes the original)
1 AND 0 = 0 (mask bit 0 zeroes the original)
0 AND 1 = 0 (mask bit 1 keeps the original)
1 AND 1 = 1 (mask bit 1 keeps the original)

From this follow some simple rules.

If we AND any number with a mask of all ones (all bits set), we get the original number back.

If we AND any number with a mask of all zeros (all bits cleared), we get 0.

For example:

If we apply AND 0 to the byte 13, we get 0. In decimal: 13 AND 0 = 0.

If we apply AND 0 to the byte 200, we also get 0, or briefly: 200 AND 0 = 0.

The opposite also holds: apply to 13 a mask with all bits set (for a byte that is eight ones) and we get the original back. In binary, 00001101 AND 11111111 = 00001101, or in decimal, 13 AND 255 = 13.

For 200, accordingly, 11001000 AND 11111111 = 11001000, or in decimal, 200 AND 255 = 200.
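
A tiny Java sketch (purely illustrative) confirms these identities:

public class AndMaskDemo {
        public static void main(String[] args) {
                // AND with an all-zero mask clears every bit.
                System.out.println(13 & 0);    // 0
                System.out.println(200 & 0);   // 0
                // AND with an all-one mask (255 for a byte) keeps the original value.
                System.out.println(13 & 255);  // 13
                System.out.println(200 & 255); // 200
        }
}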

Binary parity check


To check a number for parity, we only need to check the rightmost bit. If it is 0, the number is even; if it is 1, the number is odd. Knowing that with AND we can keep some bits and clear others, we can simply clear all bits except the rightmost one. For example:

13 in binary is 1101. Apply AND 0001 to it (clear all bits, keep the last one as it is). In 1101 every bit except the last becomes 0, and we get 0001: only the last bit of the original number remains. In decimal this looks like 13 AND 1 = 1.

The same with the number 200, which in binary is 11001000. Apply AND 00000001 to it: by the same scheme we zero the seven leftmost bits and keep the last one, getting 00000000, since the last bit of the original number is 0. In decimal it looks like 200 AND 1 = 0.

Thus, applying AND 1 to any number yields either 0 or 1. If the result is 0, the number is even; if it is 1, the number is odd.

In Java, the binary AND is written as &. Accordingly, 200 & 1 = 0 (even) and 13 & 1 = 1 (odd).
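
Wrapped in a method, the check might look like this (a minimal sketch of my own; the benchmark programs below inline the expression instead):

public class ParityDemo {
        // Even numbers have a 0 in the lowest bit, so x & 1 is 0 for them.
        static boolean isEven(long x) {
                return (x & 1) == 0;
        }

        public static void main(String[] args) {
                System.out.println(isEven(200)); // true:  200 & 1 == 0
                System.out.println(isEven(13));  // false: 13  & 1 == 1
        }
}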

This gives us at least two methods for determining whether a number is even.

x % 2 - via the remainder of division by two
x & 1 - via binary AND

Binary operations such as OR, AND, and XOR are executed by the processor in minimal time. Division, however, is a non-trivial task: to perform it, the processor has to execute many more internal operations. There are also binary left and right shift operations that allow, for example, quickly dividing a number by 2. The question is whether compilers apply such optimizations, and whether there is a practical difference between these two checks, which do the same thing.
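
For illustration, a small Java sketch of my own (assuming non-negative values, where a right shift matches integer division) shows shift versus division:

public class ShiftDemo {
        public static void main(String[] args) {
                long x = 200;
                // For non-negative values, >> 1 gives the same result as / 2.
                System.out.println(x >> 1); // 100
                System.out.println(x / 2);  // 100
                // And << 1 multiplies by two.
                System.out.println(x << 1); // 400
        }
}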

Coding


We will write a program that processes 9,000,000,000 numbers in a loop, in order, and classifies each as even or odd by taking the remainder of division:

public class OddEvenViaMod {
        public static void main(String[] args) {
                long i = 0;
                long odds = 0;
                long evens = 0;
                do {
                        if ((i % 2) == 0) {
                                evens++;
                        } else {
                                odds++;
                        }
                        i++;
                } while (i < 9000000000L);
                System.out.println("Odd " + odds);
                System.out.println("Even " + evens);
        }
}

And we will write exactly the same program, changing literally two characters, to perform the same check via binary AND:

public class OddEvenViaAnd {
        public static void main(String[] args) {
                long i = 0;
                long odds = 0;
                long evens = 0;
                do {
                        if ((i & 1) == 0) {
                                evens++;
                        } else {
                                odds++;
                        }
                        i++;
                } while (i < 9000000000L);
                System.out.println("Odd " + odds);
                System.out.println("Even " + evens);
        }
}

Now we need a way to compare these two programs fairly.

Resources on Linux. CPU


An enormous number of hours went into creating any operating system, in particular into the fair distribution of resources between programs. On the one hand this is good: running two programs, you can be sure they will work in parallel. On the other hand, when you need to measure the performance of one program, it is essential to limit, or at least reduce, the external influence of other programs and of the operating system itself.

The first thing to sort out is the processor. For each process, Linux stores a bitmask indicating which CPU cores the process may use and which it may not. You can view and change this mask with the taskset command.

For example, let's see the number of cores in my processor:

[user@localhost]# grep -c processor /proc/cpuinfo
4

My computer has a processor with 4 cores. This is good, because I am going to allocate one of them to my needs.

Let's see if all of them are currently in use with the top command:

[user@localhost]# top

Press "1" to view information on each core separately:

top - 13:44:11 up 1 day, 23:26,  7 users,  load average: 1.48, 2.21, 2.02
Tasks: 321 total,   1 running, 320 sleeping,   0 stopped,   0 zombie
%Cpu0  :  7.7 us,  6.8 sy,  0.0 ni, 85.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  9.2 us,  4.2 sy,  0.0 ni, 86.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  7.6 us,  3.4 sy,  0.0 ni, 89.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  8.4 us,  4.2 sy,  0.0 ni, 87.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16210820 total,   296972 free, 10072092 used,  5841756 buff/cache
KiB Swap: 16777212 total, 16777212 free,        0 used.  5480568 avail Mem
....

Here we see that all cores are used approximately equally (the us, sy, and id indicators are roughly the same for each core).

Now let's try to see the same with the taskset command.

[user@localhost]# taskset -p 1
pid 1's current affinity mask: f

The bitmask "f" in hexadecimal means 15 in decimal, or 1111 in binary (8 + 4 + 2 + 1). All bits are set, which means the process with PID 1 may use all cores.

On Linux, when one process spawns another via the clone system call, the bitmask is copied from the parent at the moment of cloning. This means that if we change this mask for our init process (in my case, systemd), then any new process started through systemd will already be launched with the new mask.

You can change the mask for a process with the same command by listing the numbers of the CPU cores we want the process to keep using. Suppose we want to leave cores 0, 2, and 3 to our systemd process and forbid it core 1. To do this, we run:

[user@localhost]#  taskset -pc 0,2,3 1
pid 1's current affinity list: 0-3
pid 1's new affinity list: 0,2,3

We check:

[user@localhost]# taskset -p 1
pid 1's current affinity mask: d

The mask changed to "d" in hexadecimal, which is 13 in decimal and 1101 in binary (8 + 4 + 0 + 1).
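
The same hex-to-binary conversions can be double-checked with a one-liner in Java (purely illustrative):

public class MaskDemo {
        public static void main(String[] args) {
                // Affinity masks printed as binary: 0xf -> 1111, 0xd -> 1101.
                System.out.println(Integer.toBinaryString(0xf)); // 1111
                System.out.println(Integer.toBinaryString(0xd)); // 1101
        }
}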

From now on, any process cloned by systemd will automatically get the CPU mask 1101, which means core number 1 will not be used.

Forbidding all processes to use one core


Forbidding the main Linux process to use one core only affects new processes it creates. But my system is already running not one process but a whole multitude: crond, sshd, bash, and others. If I want to forbid all processes from using one core, I have to run taskset for every running process.

To get the list of all processes, we will use the API the kernel gives us, namely the /proc file system.

Then, in a loop, we take the PID of each running process and change the mask for it and all of its threads:

[user@localhost]# cd /proc; for i in `ls -d [0-9]*`; do taskset -a -pc 0,2,3 $i; done
pid 1's current affinity list: 0,2,3
pid 1's new affinity list: 0,2,3
...

Since some processes may have spawned others while the command was running, it is better to run it several times.

Check the result of our work with the top command:

[user@localhost]# top
top - 14:20:46 up 2 days, 3 min,  7 users,  load average: 0.19, 0.27, 0.57
Tasks: 324 total,   4 running, 320 sleeping,   0 stopped,   0 zombie
%Cpu0  :  8.9 us,  7.7 sy,  0.0 ni, 83.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  9.5 us,  6.0 sy,  0.0 ni, 84.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  8.4 us,  6.6 sy,  0.0 ni, 85.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16210820 total,   285724 free, 10142548 used,  5782548 buff/cache
KiB Swap: 16777212 total, 16777212 free,        0 used.  5399648 avail Mem

As you can see, the picture has changed a bit: for cores 0, 2, and 3 the us, sy, and id parameters are now roughly the same, while core 1 shows zero consumption in userspace and sys and is 100% idle. Core 1 is no longer used by our applications; only a very small percentage is still used by the OS kernel itself.

Now the task of performance testing reduces to starting our process on a free core.

Memory


Physical memory allocated to a process can easily be taken away from it, as from any other process. This mechanism is called swapping, and if Linux has somewhere to swap to, it will do so sooner or later. The only way to prevent the OS from taking memory away from our process is to disable the swap partition entirely, which is what we will do:

[user@localhost]$ sudo swapoff -a
[user@localhost]$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15830        7294        1894         360        6641        7746
Swap:             0           0           0

We have dedicated one otherwise unused processor core and removed the Linux kernel's ability to swap our memory out.

Disk


To reduce the influence of the disk on the launch of our process, let's create a disk in memory and copy all the necessary files to it.

Create a directory and mount the file system:

[user@localhost]$ sudo mkdir /mnt/ramdisk;
[user@localhost]$ mount -t tmpfs -o size=512m tmpfs /mnt/ramdisk
[user@localhost]$ chown user: /mnt/ramdisk

Now we need to figure out what exactly we plan to launch and how. To run our program, we first have to compile the code:

[user@localhost]$ javac OddEvenViaMod.java

Then you need to run it:

[user@localhost]$ java OddEvenViaMod

But in our case we want to run the process on the CPU core that no other process is using, so we launch it through taskset:

[user@localhost]# taskset -c 1 java OddEvenViaMod

Since we also need to measure the time, our launch line becomes:

taskset -c 1 time java OddEvenViaMod

Linux supports several formats of executable files, the most common of which is ELF. This format makes it possible to tell the OS not to run your file, but to run a different file instead. At first glance that sounds neither logical nor safe: imagine launching Minesweeper and having Mario start instead; it looks like a virus. But there is logic to it. If my program requires some dynamic library, for example libc or any other, the OS must first load that library into memory and only then load and run my program. It might seem logical to place this functionality in the operating system itself, but the OS works in a protected area of memory and should contain as little functionality as possible. So the ELF format provides a way to tell the OS that we want to launch some other program, and that "other" program will load all the necessary libraries plus our program and start the whole thing.

So we need to run three files: taskset, time, and java.

Check the first of them:

[user@localhost]$ whereis taskset
taskset: /usr/bin/taskset /usr/share/man/man1/taskset.1.gz

Bash will run the file /usr/bin/taskset; let's check what is inside:

[user@localhost]$ file /usr/bin/taskset
/usr/bin/taskset: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=7a2fd0779f64aa9047faa00f498042f0f0c5dc60, stripped

This is the ELF file I wrote about above. Besides the program itself, an ELF file contains various headers. When launching such a file, the OS checks its headers, and if the "Requesting program interpreter" header is present, the OS launches the file named in that header instead, passing the originally launched file to it as an argument.

Check if this header exists in our ELF file:

[user@localhost]$ readelf -a /usr/bin/taskset  | grep -i interpreter
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]

The header exists, which means that by launching /usr/bin/taskset we actually run /lib64/ld-linux-x86-64.so.2.

Check what this file is:

[user@localhost]$ ls -lah /lib64/ld-linux-x86-64.so.2
lrwxrwxrwx 1 root root 10 May 21  2019 /lib64/ld-linux-x86-64.so.2 -> ld-2.17.so

This is a symlink to /lib64/ld-2.17.so. Check that file:

[user@localhost]$ file /lib64/ld-2.17.so
/lib64/ld-2.17.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=a527fe72908703c5972ae384e78d1850d1881ee7, not stripped

As you can see, this is another ELF file that the OS will run. We look at the headers:

[user@localhost]$ readelf -a /lib64/ld-2.17.so  | grep -i interpreter
[user@localhost]$

We see that this ELF file has no such header, so the OS will run this file itself and transfer control to it. This file will then open our /usr/bin/taskset and read from it the information about all the required libraries. The list of required libraries is also stored in the headers of the ELF file. We can view this list with the ldd command or with readelf, which show the same thing:

[user@localhost]$ ldd /usr/bin/taskset
	linux-vdso.so.1 =>  (0x00007ffc4c1df000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f4a24c4e000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f4a2501b000)

[user@localhost]$ readelf -a /usr/bin/taskset  | grep NEEDED
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]

vdso is a region of memory mapped in by the kernel rather than a library on disk, which is why it does not appear in the ELF file's list of required libraries.

From this it is clear that /lib64/ld-2.17.so is responsible for launching every program that requests it, and that means every program with dynamically linked libraries.

Running /usr/bin/taskset is exactly the same as running /lib64/ld-2.17.so with /usr/bin/taskset as its argument.

Let's return to the problem of the disk's influence on our tests. We now know that if we want to load our program from memory, we need to copy not one file but several:

[user@localhost]$ cp /lib64/libc-2.17.so /mnt/ramdisk/
[user@localhost]$ cp /lib64/ld-2.17.so /mnt/ramdisk/
[user@localhost]$ cp /usr/bin/taskset /mnt/ramdisk/

We do the same for time, whose library requirements are exactly the same (we have already copied ld and libc):

[user@localhost]$ cp /usr/bin/time /mnt/ramdisk/

With java, things are a little more complicated, since java requires many different libraries that would take a long time to copy one by one. To simplify my life a little, I will copy the entire directory of my OpenJDK installation to the RAM disk and create a symlink. Some disk accesses will of course remain, but there will be fewer of them.

[user@localhost]$ cp -R /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.222.b10-0.el7_6.x86_64 /mnt/ramdisk/

Rename the old directory by adding the suffix .default to it:

[user@localhost]$ sudo mv /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.222.b10-0.el7_6.x86_64{,.default}

And create a symlink:

[user@localhost]$ sudo ln -s /mnt/ramdisk/java-1.8.0-openjdk-1.8.0.222.b10-0.el7_6.x86_64 /usr/lib/jvm/

We already know how to run a binary through an argument to /lib64/ld-2.17.so, which is what actually starts. But how do we make /lib64/ld-2.17.so load the libraries from the directory we specify? man ld.so comes to the rescue: if we set the environment variable LD_LIBRARY_PATH, the loader will take libraries from the directories we list there. Now we have everything we need to assemble the launch line for the Java application.

We run it several times in a row and check:

[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so java OddEvenViaMod
Odd 4500000000
Even 4500000000
10.66user 0.01system 0:10.67elapsed 99%CPU (0avgtext+0avgdata 20344maxresident)k
0inputs+64outputs (0major+5229minor)pagefaults 0swaps
[user@localhost ramdisk]$
[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so java OddEvenViaMod
Odd 4500000000
Even 4500000000
10.65user 0.01system 0:10.67elapsed 99%CPU (0avgtext+0avgdata 20348maxresident)k
0inputs+64outputs (0major+5229minor)pagefaults 0swaps
[user@localhost ramdisk]$
[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so java OddEvenViaMod
Odd 4500000000
Even 4500000000
10.66user 0.01system 0:10.68elapsed 99%CPU (0avgtext+0avgdata 20348maxresident)k
0inputs+64outputs (0major+5229minor)pagefaults 0swaps
[user@localhost ramdisk]$
[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so java OddEvenViaMod
Odd 4500000000
Even 4500000000
10.65user 0.01system 0:10.67elapsed 99%CPU (0avgtext+0avgdata 20348maxresident)k
0inputs+96outputs (0major+5234minor)pagefaults 0swaps
[user@localhost ramdisk]$

While the program is running, we can run top and make sure it is executing on the correct CPU core.

[user@localhost ramdisk]$ top
...
%Cpu0  : 19.7 us, 11.7 sy,  0.0 ni, 68.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  9.8 us,  9.1 sy,  0.0 ni, 81.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 14.0 us,  9.0 sy,  0.0 ni, 77.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
...

As you can see, the results are similar across runs. Unfortunately, we cannot completely remove the OS's influence on the CPU core, so the result still depends on what the Linux kernel happens to be doing at the moment of launch. It is therefore better to take the median of several runs.
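
As an in-process cross-check, one could also time the loop directly and take the median of several runs. Here is a rough sketch of my own (JIT warm-up and dead-code elimination make such in-process timing approximate, which is one reason the article measures whole-process time instead):

import java.util.Arrays;

public class TimedRuns {
        // Times one full pass of the parity loop, in nanoseconds.
        static long runOnce() {
                long start = System.nanoTime();
                long odds = 0, evens = 0;
                for (long i = 0; i < 9000000000L; i++) {
                        if ((i & 1) == 0) evens++; else odds++;
                }
                long elapsed = System.nanoTime() - start;
                // Use the counters so the JIT cannot discard the loop entirely.
                if (odds + evens < 0) System.out.println("impossible");
                return elapsed;
        }

        public static void main(String[] args) {
                long[] times = new long[5];
                for (int r = 0; r < times.length; r++) times[r] = runOnce();
                Arrays.sort(times);
                // The median of five runs smooths out OS-induced jitter.
                System.out.println("median, s: " + times[2] / 1e9);
        }
}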

In our case, we see that the java program classifies 9,000,000,000 numbers by parity via the remainder of division in about 10.65 seconds on one CPU core.

Let's run the same test with our second program, which does the same thing through binary AND.

[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so java OddEvenViaAnd
Odd 4500000000
Even 4500000000
4.02user 0.00system 0:04.03elapsed 99%CPU (0avgtext+0avgdata 20224maxresident)k
0inputs+64outputs (0major+5197minor)pagefaults 0swaps
[user@localhost ramdisk]$
[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so java OddEvenViaAnd
Odd 4500000000
Even 4500000000
4.01user 0.00system 0:04.03elapsed 99%CPU (0avgtext+0avgdata 20228maxresident)k
0inputs+64outputs (0major+5199minor)pagefaults 0swaps
[user@localhost ramdisk]$
[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so java OddEvenViaAnd
Odd 4500000000
Even 4500000000
4.01user 0.01system 0:04.03elapsed 99%CPU (0avgtext+0avgdata 20224maxresident)k
0inputs+64outputs (0major+5198minor)pagefaults 0swaps
[user@localhost ramdisk]$
[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so java OddEvenViaAnd
Odd 4500000000
Even 4500000000
4.02user 0.00system 0:04.03elapsed 99%CPU (0avgtext+0avgdata 20224maxresident)k
0inputs+64outputs (0major+5198minor)pagefaults 0swaps

Now we can say with confidence that the parity check via binary AND takes about 4.02 seconds, which means that, compared with the check via the remainder of division, it works about 2.6 times faster, at least on OpenJDK 1.8.0.

Oracle JDK vs OpenJDK


I downloaded the Java JDK from the Oracle website and unpacked it into the /mnt/ramdisk/jdk-13.0.2 directory.

Compile:

[user@localhost ramdisk]$ /mnt/ramdisk/jdk-13.0.2/bin/javac OddEvenViaAnd.java

We launch:

[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/jdk-13.0.2/bin/java OddEvenViaAnd
Odd 4500000000
Even 4500000000
10.39user 0.01system 0:10.41elapsed 99%CPU (0avgtext+0avgdata 24260maxresident)k
0inputs+64outputs (0major+6979minor)pagefaults 0swaps
[user@localhost ramdisk]$
[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/jdk-13.0.2/bin/java OddEvenViaAnd
Odd 4500000000
Even 4500000000
10.40user 0.01system 0:10.42elapsed 99%CPU (0avgtext+0avgdata 24268maxresident)k
0inputs+64outputs (0major+6985minor)pagefaults 0swaps

We compile the second program:

[user@localhost ramdisk]$ /mnt/ramdisk/jdk-13.0.2/bin/javac OddEvenViaMod.java

We launch:

[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/jdk-13.0.2/bin/java OddEvenViaMod
Odd 4500000000
Even 4500000000
10.39user 0.01system 0:10.40elapsed 99%CPU (0avgtext+0avgdata 24324maxresident)k
0inputs+96outputs (0major+7003minor)pagefaults 0swaps
[user@localhost ramdisk]$
[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/jdk-13.0.2/bin/java OddEvenViaMod
Odd 4500000000
Even 4500000000
10.40user 0.00system 0:10.41elapsed 99%CPU (0avgtext+0avgdata 24316maxresident)k
0inputs+64outputs (0major+6992minor)pagefaults 0swaps

With the Oracle JDK, the execution time of the same sources is identical for the remainder-of-division and binary-AND variants, which looks consistent; unfortunately, that time is uniformly as slow as the one OpenJDK showed for the remainder of division.

Python


Let's try the same comparison in Python. First, the variant with the remainder of division by 2:

odd=0
even=0
for i in xrange(100000000):
	if i % 2 == 0:
		even += 1
	else:
		odd += 1
print "even", even
print "odd", odd

We launch:

[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/python2.7 odd_mod.py
even 50000000
odd 50000000
11.69user 0.00system 0:11.69elapsed 99%CPU (0avgtext+0avgdata 4584maxresident)k
0inputs+0outputs (0major+1220minor)pagefaults 0swaps
[user@localhost ramdisk]$
[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/python2.7 odd_mod.py
even 50000000
odd 50000000
11.67user 0.00system 0:11.67elapsed 99%CPU (0avgtext+0avgdata 4584maxresident)k
0inputs+0outputs (0major+1220minor)pagefaults 0swaps
[user@localhost ramdisk]$
[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/python2.7 odd_mod.py
even 50000000
odd 50000000
11.66user 0.00system 0:11.66elapsed 99%CPU (0avgtext+0avgdata 4584maxresident)k
0inputs+0outputs (0major+1220minor)pagefaults 0swaps

Now the same thing with binary AND:

odd=0
even=0
for i in xrange(100000000):
	if i & 1 == 0:
		even += 1
	else:
		odd += 1
print "even", even
print "odd", odd

We launch:

[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/python2.7 odd_and.py
even 50000000
odd 50000000
10.41user 0.00system 0:10.41elapsed 99%CPU (0avgtext+0avgdata 4588maxresident)k
0inputs+0outputs (0major+1221minor)pagefaults 0swaps
[user@localhost ramdisk]$
[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/python2.7 odd_and.py
even 50000000
odd 50000000
10.43user 0.00system 0:10.43elapsed 99%CPU (0avgtext+0avgdata 4588maxresident)k
0inputs+0outputs (0major+1222minor)pagefaults 0swaps
[user@localhost ramdisk]$
[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/python2.7 odd_and.py
even 50000000
odd 50000000
10.43user 0.00system 0:10.43elapsed 99%CPU (0avgtext+0avgdata 4584maxresident)k
0inputs+0outputs (0major+1222minor)pagefaults 0swaps

The results show that the AND variant is faster (about 10.4 seconds versus about 11.7).

It has been written many times on the Internet that global variables in Python are slow. I decided to compare the execution time of the last program (with AND) against exactly the same code wrapped in a function:

def main():
	odd=0
	even=0
	for i in xrange(100000000):
		if i & 1 == 0:
			even += 1
		else:
			odd += 1
	print "even", even
	print "odd", odd

main()

Run the version with the function:

[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/python2.7 odd_and_func.py
even 50000000
odd 50000000
5.08user 0.00system 0:05.08elapsed 99%CPU (0avgtext+0avgdata 4592maxresident)k
0inputs+0outputs (0major+1222minor)pagefaults 0swaps
[user@localhost ramdisk]$
[user@localhost ramdisk]$ export LD_LIBRARY_PATH=/mnt/ramdisk ; /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/taskset -c 1 /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/time /mnt/ramdisk/ld-2.17.so /mnt/ramdisk/python2.7 odd_and_func.py
even 50000000
odd 50000000
5.08user 0.00system 0:05.09elapsed 99%CPU (0avgtext+0avgdata 4592maxresident)k
0inputs+0outputs (0major+1223minor)pagefaults 0swaps

As you can see, the same parity comparison in Python via binary AND processes 100,000,000 numbers on a single CPU core in about 5 seconds when wrapped in a function, in about 10 seconds without a function, and in about 11 seconds without a function via the remainder of division.

Why a Python program inside a function runs faster than the same code at module level has been described more than once; it comes down to the scope of variables.

Python can disassemble a program into the bytecode instructions the interpreter actually executes. Let's see what bytecode Python generates for the variant with the function, odd_and_func.py:

[user@localhost ramdisk]# python
Python 2.7.5 (default, Jun 20 2019, 20:27:34)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> def main():
...     odd=0
...     even=0
...     for i in xrange(100000000):
...             if i & 1 == 0:
...                     even += 1
...             else:
...                     odd += 1
...     print "even", even
...     print "odd", odd
...
>>> import dis
>>> dis.dis(main)
  2           0 LOAD_CONST               1 (0)
              3 STORE_FAST               0 (odd)

  3           6 LOAD_CONST               1 (0)
              9 STORE_FAST               1 (even)

  4          12 SETUP_LOOP              59 (to 74)
             15 LOAD_GLOBAL              0 (xrange)
             18 LOAD_CONST               2 (100000000)
             21 CALL_FUNCTION            1
             24 GET_ITER
        >>   25 FOR_ITER                45 (to 73)
             28 STORE_FAST               2 (i)

  5          31 LOAD_FAST                2 (i)
             34 LOAD_CONST               3 (1)
             37 BINARY_AND
             38 LOAD_CONST               1 (0)
             41 COMPARE_OP               2 (==)
             44 POP_JUMP_IF_FALSE       60

  6          47 LOAD_FAST                1 (even)
             50 LOAD_CONST               3 (1)
             53 INPLACE_ADD
             54 STORE_FAST               1 (even)
             57 JUMP_ABSOLUTE           25

  8     >>   60 LOAD_FAST                0 (odd)
             63 LOAD_CONST               3 (1)
             66 INPLACE_ADD
             67 STORE_FAST               0 (odd)
             70 JUMP_ABSOLUTE           25
        >>   73 POP_BLOCK

  9     >>   74 LOAD_CONST               4 ('even')
             77 PRINT_ITEM
             78 LOAD_FAST                1 (even)
             81 PRINT_ITEM
             82 PRINT_NEWLINE

 10          83 LOAD_CONST               5 ('odd')
             86 PRINT_ITEM
             87 LOAD_FAST                0 (odd)
             90 PRINT_ITEM
             91 PRINT_NEWLINE
             92 LOAD_CONST               0 (None)
             95 RETURN_VALUE

Now let's check the same code without the function:

>>> f=open("odd_and.py","r")
>>> l=f.read()
>>>
>>> l
'odd=0\neven=0\nfor i in xrange(100000000):\n\tif i & 1 == 0:\n\t\teven += 1\n\telse:\n\t\todd += 1\nprint "even", even\nprint "odd", odd\n'
>>> k=compile(l,'l','exec')
>>> k
<code object <module> at 0x7f2bdf39ecb0, file "l", line 1>
>>> dis.dis(k)
  1           0 LOAD_CONST               0 (0)
              3 STORE_NAME               0 (odd)

  2           6 LOAD_CONST               0 (0)
              9 STORE_NAME               1 (even)

  3          12 SETUP_LOOP              59 (to 74)
             15 LOAD_NAME                2 (xrange)
             18 LOAD_CONST               1 (100000000)
             21 CALL_FUNCTION            1
             24 GET_ITER
        >>   25 FOR_ITER                45 (to 73)
             28 STORE_NAME               3 (i)

  4          31 LOAD_NAME                3 (i)
             34 LOAD_CONST               2 (1)
             37 BINARY_AND
             38 LOAD_CONST               0 (0)
             41 COMPARE_OP               2 (==)
             44 POP_JUMP_IF_FALSE       60

  5          47 LOAD_NAME                1 (even)
             50 LOAD_CONST               2 (1)
             53 INPLACE_ADD
             54 STORE_NAME               1 (even)
             57 JUMP_ABSOLUTE           25

  7     >>   60 LOAD_NAME                0 (odd)
             63 LOAD_CONST               2 (1)
             66 INPLACE_ADD
             67 STORE_NAME               0 (odd)
             70 JUMP_ABSOLUTE           25
        >>   73 POP_BLOCK

  8     >>   74 LOAD_CONST               3 ('even')
             77 PRINT_ITEM
             78 LOAD_NAME                1 (even)
             81 PRINT_ITEM
             82 PRINT_NEWLINE

  9          83 LOAD_CONST               4 ('odd')
             86 PRINT_ITEM
             87 LOAD_NAME                0 (odd)
             90 PRINT_ITEM
             91 PRINT_NEWLINE
             92 LOAD_CONST               5 (None)
             95 RETURN_VALUE

As you can see, in the variant with a declared function Python uses opcodes with the FAST suffix, such as STORE_FAST and LOAD_FAST, while in the variant without a function it uses STORE_NAME and LOAD_NAME. The FAST opcodes access local variables by index in a fixed-size array, whereas the NAME opcodes perform a dictionary lookup in the namespace, which is noticeably slower.

This article has little direct practical value; it is aimed more at understanding some features of Linux and of compilers.

All the best!
