Inside the Python virtual machine. Part 1



Hello everyone. I finally decided to figure out how the Python interpreter works. To do this, I began studying an article-length book and decided to translate it into Russian at the same time. Translating does not let you skip over a sentence you do not understand, so the material is absorbed better. I apologize in advance for possible inaccuracies. I always try to translate as accurately as possible, but one of the main problems is that some terms simply have no established Russian equivalent.


Introduction


The Python programming language has been around for quite some time. Guido van Rossum started developing the first version in 1989, and since then the language has grown into one of the most popular. Python is used in a wide range of applications, from graphical interfaces to data analysis.

The purpose of this article is to go behind the scenes of the interpreter and provide a conceptual overview of how a program written in Python is executed. The article considers CPython because, at the time of writing, it is the most popular implementation and the reference implementation of Python.

Python and CPython are used interchangeably in this text: any mention of Python means CPython (the Python implementation written in C). Other implementations include PyPy (Python implemented in a restricted subset of Python), Jython (an implementation on the Java Virtual Machine), and others.

I like to divide the execution of a Python program into two or three main steps (listed below), depending on how the interpreter is called. These steps will be covered to varying degrees in this article:

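  1. Initialization - this step involves setting up the various data structures required by the Python process. It most likely applies when the program is executed in non-interactive mode through the interpreter shell.
  2. Compilation - this step turns source code into code objects: building parse trees and abstract syntax trees, constructing symbol tables, and generating code objects.
  3. Interpretation - this step executes the generated code objects.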

The mechanism for generating parse trees, as well as abstract syntax trees (ASTs), is language independent. We will therefore not cover this topic in much depth, because the methods used in Python are similar to those of other programming languages. The process of constructing symbol tables and code objects from an AST, on the other hand, is more specific to Python and deserves special attention. We will also discuss the interpretation of compiled code objects and all the other data structures involved. The topics covered include, but are not limited to: the construction of symbol tables and code objects, Python objects, frame objects, code objects, function objects, opcodes, the interpreter loop, generators, and user-defined classes.
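As a rough illustration of these compilation stages, the standard library exposes each of them from Python itself. The sketch below uses the ast and symtable modules and the built-in compile function. The sample source string and the variable names are made up for this example, and the interpreter's internal pipeline is of course implemented in C rather than through these modules.

```python
import ast
import symtable

# Hypothetical sample module source, used only for illustration.
source = "def square(x):\n    return x * x\n"

# Stage 1: parse the source into an abstract syntax tree.
tree = ast.parse(source, filename="<example>")
function_node = tree.body[0]

# Stage 2: build the symbol table for the same source.
table = symtable.symtable(source, "<example>", "exec")
function_table = table.get_children()[0]

# Stage 3: compile the AST into a code object.
module_code = compile(tree, "<example>", "exec")

print(type(function_node).__name__)                # FunctionDef
print(function_table.get_name())                   # square
print(function_table.lookup("x").is_parameter())   # True
print(type(module_code).__name__)                  # code
```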

This material is intended for anyone interested in learning how the CPython virtual machine works. It is assumed that the reader is already familiar with Python and understands the basics of the language. While studying the structure of the virtual machine we will encounter a significant amount of C code, so a reader with an elementary understanding of C will find the material easier to follow. Essentially, all you need to work through this material is the desire to learn more about the CPython virtual machine.

This article is an extended version of personal notes made while studying the internals of the interpreter. There is a lot of quality material in PyCon videos, school lectures, and this blog. My work would not have been completed without these fantastic sources of knowledge.

By the end of this book, the reader will be able to understand the intricacies of how the Python interpreter executes a program. This includes the various stages of program execution and the data structures that are critical to it. To begin with, we will take a bird's eye view of what happens when a trivial program is executed by passing the name of a module to the interpreter on the command line. The CPython executable can be built from source by following the Python Developer's Guide.

This book uses Python 3.

30,000-foot view


This chapter describes how the interpreter executes a Python program. In the following chapters we will examine the various pieces of this "puzzle" and describe each of them in more detail. Regardless of the complexity of a program written in Python, the process is always the same. The excellent explanation given by Yaniv Aknin in his Python's Innards series of articles sets the direction for our discussion.

The source module test.py can be executed from the command line by passing it as an argument to the Python interpreter program, in the form $ python test.py. This is just one way to invoke the Python executable. We can also launch an interactive interpreter, execute a string as code, and so on, but these other methods do not interest us here. Passing the module as a command line argument to the executable (Figure 2.1) best reflects the flow of actions involved in the actual execution of the code.


Figure 2.1: Execution flow at runtime.

The python executable is an ordinary C program, just like the Linux kernel or a simple hello world program, so much the same process takes place when it is invoked. Take a minute to let this sink in: the python executable is just another program that runs your program. The same kind of relationship exists between C and assembly (or LLVM). The standard, platform-dependent initialization process starts when the python executable is called with the module name as an argument.

This article assumes you are using a Unix-based operating system, so some features may vary on Windows.

At startup, the C runtime performs all of its initialization "magic": it loads libraries, checks and sets environment variables, and then runs the main function of the python executable just as it would for any other C program. The main function of the python executable is located in ./Programs/python.c and performs some initialization of its own (such as making copies of the command line arguments that were passed to the program). It then calls the Py_Main function located in ./Modules/main.c, which handles the initialization of the interpreter: it parses the command line arguments, sets flags, reads environment variables, runs hooks, sets up hash randomization, and so on. As part of this, Py_Initialize from pylifecycle.c is called. It initializes the interpreter state and thread state data structures, two very important data structures.

Examining the declarations of the interpreter state and thread state data structures makes it clear why they are needed. Both are simply structures whose fields point to the information necessary to execute the program. The interpreter state is declared with typedef (think of this C keyword as defining a type, although that is not entirely accurate: it creates an alias for an existing type). The code for this structure is shown in Listing 2.1.
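Some of the effects of this initialization can be observed from the outside. The following sketch starts a fresh interpreter twice with the same PYTHONHASHSEED environment variable and a couple of made-up command line arguments, then checks that both runs report the same arguments and the same string hash. The helper name run_child and the argument values are assumptions chosen for illustration.

```python
import os
import subprocess
import sys

# Child program: report the arguments it received and a string hash.
child_program = "import sys; print(sys.argv[1:], hash('abc'))"

def run_child(seed):
    """Start a fresh interpreter with a fixed hash seed and return its output."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", child_program, "first", "second"],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()

first_run = run_child(42)
second_run = run_child(42)
print(first_run)
print(first_run == second_run)   # True: the seed fixes the string hash
```

Without a fixed seed, the hash of a string normally differs from one interpreter process to the next, which is the hash randomization that is set up during initialization.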

    typedef struct _is {

        struct _is *next;               /* next interpreter in the process */
        struct _ts *tstate_head;        /* thread states of this interpreter */

        PyObject *modules;              /* loaded modules (sys.modules) */
        PyObject *modules_by_index;
        PyObject *sysdict;              /* namespace of the sys module */
        PyObject *builtins;             /* namespace of the builtins module */
        PyObject *importlib;

        /* codec lookup state */
        PyObject *codec_search_path;
        PyObject *codec_search_cache;
        PyObject *codec_error_registry;
        int codecs_initialized;
        int fscodec_initialized;

        PyObject *builtins_copy;
    } PyInterpreterState;

Code Listing 2.1: Interpreter state data structure

Anyone who has used the Python programming language for a long time can recognize several fields mentioned in this structure (sysdict, builtins, codec).

  1. The *next field is a reference to another interpreter instance, since several Python interpreters can exist within the same process.
  2. The *tstate_head field points to the main thread of execution (if the program is multi-threaded, the interpreter is shared by all threads created by the program). We will discuss this in more detail shortly.
  3. modules, modules_by_index, sysdict, builtins and importlib speak for themselves. All of them are declared as pointers to PyObject, which is the root type for all objects in the Python virtual machine. Python objects will be discussed in more detail in the following chapters.
  4. The codec* fields contain information that helps with locating and loading encodings. This is very important for decoding bytes.
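Several of these fields have counterparts that are visible from Python code. The sketch below pokes at them through the sys, builtins and codecs modules. The mapping between each Python-level object and a specific struct field is stated loosely here and should be read as an illustration rather than a precise description of the C implementation.

```python
import builtins
import codecs
import sys

# sys.modules is the dictionary of loaded modules (compare the modules field).
print("sys" in sys.modules)            # True

# The namespace of the sys module (compare the sysdict field).
print("path" in vars(sys))             # True

# The builtins namespace holds names such as len and print.
print(vars(builtins)["len"] is len)    # True

# Looking up an encoding goes through the codec search machinery.
codec_info = codecs.lookup("utf-8")
print(codec_info.name)                 # utf-8
print(b"caf\xc3\xa9".decode("utf-8"))  # café
```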

Program execution must take place within a thread. The thread state structure contains all the information that a thread needs to execute a code object. Part of the thread state data structure is shown in Listing 2.2.

    typedef struct _ts {
        struct _ts *prev;
        struct _ts *next;
        PyInterpreterState *interp;     /* interpreter this thread belongs to */

        struct _frame *frame;           /* frame currently being executed */
        int recursion_depth;
        char overflowed;

        char recursion_critical;
        int tracing;
        int use_tracing;

        Py_tracefunc c_profilefunc;
        Py_tracefunc c_tracefunc;
        PyObject *c_profileobj;
        PyObject *c_traceobj;

        /* exception currently being raised */
        PyObject *curexc_type;
        PyObject *curexc_value;
        PyObject *curexc_traceback;

        /* exception currently being handled */
        PyObject *exc_type;
        PyObject *exc_value;
        PyObject *exc_traceback;

        PyObject *dict;  /* Stores per-thread state */
        int gilstate_counter;

        ...
    } PyThreadState;

Code Listing 2.2: Part of the thread state data structure

The interpreter state and thread state data structures are discussed in more detail in the following chapters. The initialization process also sets up the import mechanisms as well as elementary stdio.
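The per-thread nature of this state can be observed from Python. In the sketch below, a worker thread walks its own chain of frames and handles its own exception, while the main thread's exception state stays empty. The function names are invented for the example, and sys._getframe is a CPython-specific helper that may be missing in other implementations.

```python
import sys
import threading

def depth_of_current_thread():
    """Count the frames on the calling thread's stack, innermost first."""
    frame = sys._getframe()
    depth = 0
    while frame is not None:
        depth += 1
        frame = frame.f_back
    return depth

def nested(levels):
    """Recurse a few levels and report the stack depth at the bottom."""
    if levels == 0:
        return depth_of_current_thread()
    return nested(levels - 1)

results = {}

def worker():
    # Each thread has its own frame stack and its own exception state.
    results["shallow"] = nested(0)
    results["deep"] = nested(5)
    try:
        raise ValueError("only visible in this thread")
    except ValueError:
        results["exception"] = sys.exc_info()[0].__name__

thread = threading.Thread(target=worker)
thread.start()
thread.join()

print(results["deep"] - results["shallow"])   # 5
print(results["exception"])                   # ValueError
```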

After completing all initialization, Py_Main calls the run_file function (also located in main.c). A series of function calls follows: PyRun_AnyFileExFlags -> PyRun_SimpleFileExFlags -> PyRun_FileExFlags -> PyParser_ASTFromFileObject. PyRun_SimpleFileExFlags creates the __main__ namespace in which the contents of the file will be executed. It also checks whether the file is a pyc file (a pyc file is simply a file containing an already compiled version of the source code). If it is, an attempt is made to read it as a binary file and then run it. Otherwise PyRun_FileExFlags is called, and so on. The PyParser_ASTFromFileObject function calls PyParser_ParseFileObject, which reads the contents of the module and builds a parse tree from it. The resulting tree is then passed to PyParser_ASTFromNodeObject, which creates an abstract syntax tree from it.
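To make the idea of a pyc file more concrete, here is a small sketch that compiles a throwaway module to a pyc file with py_compile, checks the magic number at the start of the file, and then loads and runs the stored code object. The file names and the module contents are made up for the example, and the loader from importlib is used only as a convenient way to read the file back. It is not the code path that the interpreter itself follows at startup.

```python
import importlib.machinery
import importlib.util
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical module, written only for this demonstration.
    source_path = os.path.join(workdir, "test.py")
    pyc_path = os.path.join(workdir, "test.pyc")
    with open(source_path, "w") as handle:
        handle.write("answer = 6 * 7\n")

    # Compile the source and store the result as a pyc file.
    py_compile.compile(source_path, cfile=pyc_path, doraise=True)

    # A pyc file starts with a magic number identifying the bytecode version.
    with open(pyc_path, "rb") as handle:
        has_magic = handle.read(4) == importlib.util.MAGIC_NUMBER

    # Load the stored code object back and run it in a fresh namespace.
    loader = importlib.machinery.SourcelessFileLoader("test", pyc_path)
    code = loader.get_code("test")
    namespace = {}
    exec(code, namespace)

print(has_magic)             # True
print(namespace["answer"])   # 42
```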

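Note: when reading the CPython source code you will often come across Py_INCREF and Py_DECREF. These are memory management macros. CPython manages the lifetime of objects with reference counting: whenever a new reference to an object is created, its reference count is incremented with Py_INCREF. When a reference goes out of scope, the count is decremented with Py_DECREF.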

The generated AST is then passed to run_mod. This function calls PyAST_CompileObject, which creates code objects from the AST. Note that the bytecode generated during the PyAST_CompileObject call is passed through a simple peephole optimizer, which performs low-level optimization of the generated bytecode before the code objects are created. The run_mod function then applies the PyEval_EvalCode function from ceval.c to the code object. This leads to another series of function calls: PyEval_EvalCode -> PyEval_EvalCodeEx -> _PyEval_EvalCodeWithName -> _PyEval_EvalFrameEx. The code object is passed as an argument to most of these functions in one form or another. _PyEval_EvalFrameEx is the actual interpreter loop that handles the execution of code objects. However, it is not called with just the code object as an argument, but with a frame object, one of whose fields refers to the code object. This frame provides the context for the execution of the code object. In simple terms, the interpreter loop continuously reads the next instruction, indicated by the instruction counter, from the array of instructions. It then executes that instruction, adding objects to or removing them from the value stack in the process, until there are no more instructions left to execute (or until something exceptional happens that breaks the loop).

Python provides a set of functions that you can use to examine real code objects. For example, a simple program can be compiled into a code object and disassembled to obtain the opcodes that are executed by the Python virtual machine. This is shown in Listing 2.3.
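The relationship between a frame and the code object it executes can be seen from Python as well. In this sketch a function inspects its own frame, and a code object is then compiled and executed directly in a fresh namespace. The function names are invented for the example, and sys._getframe is a CPython-specific helper.

```python
import sys

def show_context():
    """Return details about the frame that is executing this function."""
    frame = sys._getframe()          # CPython-specific helper
    return {
        "same_code": frame.f_code is show_context.__code__,
        "local_names": sorted(frame.f_locals),
        "caller": frame.f_back.f_code.co_name,
    }

def caller():
    return show_context()

info = caller()
print(info["same_code"])     # True
print(info["local_names"])   # ['frame']
print(info["caller"])        # caller

# A code object can also be built and executed directly.
code = compile("result = 2 + 3", "<example>", "exec")
namespace = {}
exec(code, namespace)
print(namespace["result"])   # 5
```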

    >>> from dis import dis
    >>> def square(x):
    ...     return x*x
    ...

    >>> dis(square)
    2           0 LOAD_FAST                0 (x)
                2 LOAD_FAST                0 (x)
                4 BINARY_MULTIPLY
                6 RETURN_VALUE
Code Listing 2.3: Disassembling a function in Python

The header file ./Include/opcode.h contains a complete list of all the instructions (opcodes) of the Python virtual machine. Opcodes are pretty simple. Take our example in Listing 2.3, which consists of four instructions. LOAD_FAST loads the value of its argument (in this case x) onto the value stack. The Python virtual machine is stack-based, so the operands of an opcode are popped from the stack and the results of the calculation are pushed back onto the stack for further use by other opcodes. BINARY_MULTIPLY then pops two items from the stack, multiplies the two values, and pushes the result back onto the stack. The RETURN_VALUE instruction pops a value from the stack, sets it as the return value, and exits the interpreter loop. Looking back at Listing 2.3, it is clear that this description is a considerable simplification.
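The stack-based behavior described above can be imitated with a few lines of Python. The following is a toy sketch, not CPython's implementation: it represents instructions as (opcode, argument) pairs and looks local variables up by name in a dictionary, whereas the real LOAD_FAST takes an index into an array of local variables. The function name run and the data layout are assumptions made for the example.

```python
def run(instructions, local_values):
    """Execute a list of (opcode, argument) pairs on a value stack."""
    stack = []
    counter = 0                      # index of the next instruction
    while counter < len(instructions):
        opcode, argument = instructions[counter]
        counter += 1
        if opcode == "LOAD_FAST":
            # Push the value of a local variable onto the stack.
            stack.append(local_values[argument])
        elif opcode == "BINARY_MULTIPLY":
            # Pop two operands and push their product.
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        elif opcode == "RETURN_VALUE":
            # Pop the result and leave the loop.
            return stack.pop()
        else:
            raise ValueError("unknown opcode: " + opcode)
    raise RuntimeError("ran out of instructions without RETURN_VALUE")

# The four instructions from Listing 2.3, written out by hand.
square_instructions = [
    ("LOAD_FAST", "x"),
    ("LOAD_FAST", "x"),
    ("BINARY_MULTIPLY", None),
    ("RETURN_VALUE", None),
]

print(run(square_instructions, {"x": 7}))   # 49
```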

The current explanation of the interpreter loop does not take into account a number of details, which will be discussed in subsequent chapters. For example, here are the questions to which we did not receive an answer:

  • Where did the values loaded by the LOAD_FAST instruction come from?
  • Where do the arguments that are used as part of the instructions come from?
  • How are nested function and method calls handled?
  • How does the interpreter loop handle exceptions?

After all the instructions have been executed, the Py_Main function resumes, this time to start the cleanup process. Just as Py_Initialize is called to perform initialization when the interpreter starts, Py_FinalizeEx is called to perform the cleanup. This process includes waiting for threads to exit, calling any exit handlers, and freeing any memory allocated by the interpreter that is still in use.

So, we have looked at a high-level description of what happens in the Python executable when a script is run. As noted earlier, many questions remain to be answered. From here on we will dig deeper into the interpreter and look at each of the stages in detail, starting with the compilation process in the next chapter.
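Exit handlers are the part of this cleanup that is easiest to observe. The sketch below runs a small child program in a separate interpreter. The program registers a handler with atexit, and the captured output shows that the handler runs after the main program has finished. The child program and the handler name goodbye are made up for the example.

```python
import subprocess
import sys

# Program for a child interpreter: register an exit handler, then finish.
child_program = """
import atexit

def goodbye():
    print("exit handler called")

atexit.register(goodbye)
print("main program finished")
"""

result = subprocess.run(
    [sys.executable, "-c", child_program],
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)
lines = result.stdout.splitlines()
print(lines)   # ['main program finished', 'exit handler called']
```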
