What is inside a .wasm file? Introducing wasm-decompile

There are many compilers and other tools for creating and working with .wasm files, and their number keeps growing. Sometimes you need to look inside a .wasm file and figure out what it contains. Maybe you develop one of these Wasm tools, or maybe you write code that targets Wasm and want to see what your code turns into, for example for performance reasons.



The problem is that .wasm files contain fairly low-level code that looks a lot like real assembly code. In particular, unlike on the JVM, for example, all data structures are compiled down to load/store operations rather than something with clear class and field names. Compilers such as LLVM can also transform the input code so heavily that the output looks nothing like it.

So what can you do if you have a .wasm file and want to find out what is going on inside it?

Disassembling or ... decompiling?


To convert .wasm files into .wat files, which contain the standard text representation of Wasm code, you can use tools like wasm2wat (part of the WABT toolkit). The result of this conversion is very faithful to the binary, but the resulting code is not particularly pleasant to read.

Here, for example, is a simple function written in C:

typedef struct { float x, y, z; } vec3;

float dot(const vec3 *a, const vec3 *b) {
    return a->x * b->x +
           a->y * b->y +
           a->z * b->z;
}

The code is stored in the file dot.c.

We compile it with the following command:

clang dot.c -c -target wasm32 -O2

Next, to convert the resulting object file to .wat text, we run the following command:

wasm2wat -f dot.o

Here is what it gives us:

(func $dot (type 0) (param i32 i32) (result f32)
  (f32.add
    (f32.add
      (f32.mul
        (f32.load
          (local.get 0))
        (f32.load
          (local.get 1)))
      (f32.mul
        (f32.load offset=4
          (local.get 0))
        (f32.load offset=4
          (local.get 1))))
    (f32.mul
      (f32.load offset=8
        (local.get 0))
      (f32.load offset=8
        (local.get 1))))))

The code is small, but it is hard to read for several reasons. Besides the lack of an expression-based syntax and the general verbosity, it is not easy to understand data structures when they appear only as memory loads. Now imagine analyzing a much larger piece of code like this; it quickly becomes a very difficult task.

Instead of wasm2wat, let's try running the following command:

wasm-decompile dot.o

Here is what it gives us:

function dot(a:{ a:float, b:float, c:float },
             b:{ a:float, b:float, c:float }):float {
  return a.a * b.a + a.b * b.b + a.c * b.c
}

This already looks much better. In addition to an expression-based syntax that resembles programming languages you already know, the decompiler looks at all the loads and stores in a function and tries to infer the data structures they represent. It then annotates each variable that is used as a pointer with an “inline” struct declaration. The decompiler does not create named struct declarations, since it cannot know whether different uses of 3 floats represent the same concept.
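To see where the inferred struct comes from, it can help to reproduce the loads by hand. The sketch below is only an illustration: load_f32 and dot_raw are hypothetical helpers, not something wasm-decompile emits, and the code assumes 4-byte floats with no padding, as on wasm32. It reads the three fields at byte offsets 0, 4 and 8, the same offsets that appear in the f32.load instructions of the disassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { float x, y, z; } vec3;

/* Hypothetical helper: mimics `f32.load offset=N` by copying 4 bytes
   from a base address plus a constant byte offset. */
static float load_f32(const void *base, size_t offset) {
    float value;
    memcpy(&value, (const unsigned char *)base + offset, sizeof value);
    return value;
}

/* The same dot product, written the way the Wasm code sees it:
   two untyped pointers and six loads at offsets 0, 4 and 8. */
float dot_raw(const void *a, const void *b) {
    assert(a != NULL && b != NULL);
    return load_f32(a, 0) * load_f32(b, 0) +
           load_f32(a, 4) * load_f32(b, 4) +
           load_f32(a, 8) * load_f32(b, 8);
}
```

Because the accesses through each pointer cover offsets 0, 4 and 8 without gaps, a tool that only sees the loads can reasonably guess a struct of three floats, but it has no way to recover the names x, y and z.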

As you can see, the results of decompilation turned out to be more understandable than the results of disassembly.

What language does the decompiler output?


The wasm-decompile tool produces output that tries to look like a “very average” programming language, while still staying close to the Wasm it represents.

The first goal of wasm-decompile is readability: the output should help the reader easily understand what is happening in the decompiled .wasm file. The second goal is to represent the .wasm file as accurately as possible, generating code that fully reflects what is in the source file. Obviously, these two goals do not always agree with each other.

The output of wasm-decompile was not originally designed to be a real programming language, and there is currently no way to compile it back into Wasm.

Loads and stores


As shown above, wasm-decompile looks at all loads and stores made through a particular pointer. If they form a contiguous set of accesses, the decompiler outputs one of the “inline” struct declarations.

If not all “fields” are accessed, the decompiler cannot reliably tell a struct apart from some other series of unrelated memory accesses. In that case, wasm-decompile falls back to simpler types like float_ptr (if the types are the same) or, in the worst case, outputs an array access like o[2]:int. This says that o points to values of type int and that we are accessing the third one.

This last case happens much more often than you might think, since Wasm locals behave more like registers than like variables. As a result, optimized code may use the same pointer to work with completely unrelated objects.

The decompiler tries to be smart about indexing and detects patterns like (base + (index << 2))[0]:int. These come from ordinary C array indexing operations such as base[index], where base points to a 4-byte type. They are very common in code, because Wasm load and store instructions only support constant offsets. In the wasm-decompile output, such constructs are converted back to base[index]:int.
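The address arithmetic behind this pattern can be sketched in C. This is a simplified model under stated assumptions, not decompiler output: linear memory is represented as a plain byte array, and load_i32 and load_indexed are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of Wasm linear memory: a flat byte array
   addressed by 32-bit offsets. */
static int32_t load_i32(const unsigned char *memory, uint32_t address) {
    int32_t value;
    memcpy(&value, memory + address, sizeof value);
    return value;
}

/* What a compiler produces for base[index] when elements are 4 bytes:
   the index is scaled with a shift and added to the base address,
   then a load with constant offset 0 reads the element. */
int32_t load_indexed(const unsigned char *memory, uint32_t base, uint32_t index) {
    assert(memory != NULL);
    return load_i32(memory, base + (index << 2));
}
```

Since the index is only known at run time, it cannot be folded into the constant offset of the load instruction, which is why the shift and add show up explicitly in the Wasm code.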

In addition, the decompiler knows when absolute addresses point to a data section.

Program flow control


Among the control flow constructs, the most familiar is Wasm's if-then, which translates to if (cond) { A } else { B }. In Wasm this construct can also return a value, so it can represent a ternary operator like cond ? A : B, which exists in some languages.

The rest of Wasm's control flow is based on the block and loop blocks, and the br, br_if and br_table jumps. The decompiler stays close to these constructs and does not try to infer the while / for / switch constructs they may have come from, since this approach tends to work better with optimized code. For example, a typical loop in the wasm-decompile output might look like this:

loop A {
  // body goes here.
  if (cond) continue A;
}

Here A is a label that allows several of these loops to be nested. Using if and continue to control the loop may look somewhat foreign compared to a while loop, but it corresponds directly to Wasm's br_if.
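For readers more used to C, the closest everyday shape is a do-while loop. The sketch below is a hand-written analogy, not decompiler output, and count_digits is a hypothetical example function. The body runs first, and the condition at the bottom decides whether to jump back to the top, which is the role played by if (cond) continue A.

```c
/* Hypothetical example: count the decimal digits of n. */
unsigned count_digits(unsigned n) {
    unsigned digits = 0;
    do {                 /* loop A {                     */
        digits++;        /*   body                       */
        n /= 10;
    } while (n != 0);    /*   if (n != 0) continue A; }  */
    return digits;
}
```

A call such as count_digits(12345) goes through the body five times; falling out of the bottom of the loop without taking the backward branch is what ends it.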

Blocks are rendered in a similar way, but here the branch jumps forward out of the block rather than backward, so the condition comes at the start instead of the end:

block {
  if (cond) break;
  // body goes here.
}

This code actually implements an if-then. Future versions of the decompiler may emit a more familiar if-then construct instead of such code where possible.
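The equivalence between the block form and an if-then can be illustrated in C, where a do { ... } while (0) plays the role of the block and break plays the role of the forward branch. Both functions below are hypothetical examples written for this comparison, not something the decompiler produces.

```c
/* Block style: leave the block early when the condition holds. */
int clamp_block_style(int x) {
    do {                    /* block {              */
        if (x >= 0) break;  /*   if (cond) break;   */
        x = 0;              /*   body               */
    } while (0);            /* }                    */
    return x;
}

/* The same logic as a plain if-then with the condition inverted. */
int clamp_if_style(int x) {
    if (!(x >= 0)) {
        x = 0;
    }
    return x;
}
```

Note that the condition is inverted between the two forms: the block is skipped when cond is true, while the if-then body runs when cond is false.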

The most surprising Wasm control flow construct is br_table. It implements something like a switch statement, except that it uses nested blocks, which makes the code hard to read. The decompiler flattens these constructs to make them a little easier to follow:

br_table[A, B, C, ..D](a);
label A:
return 0;
label B:
return 1;
label C:
return 2;
label D:

This is similar to a switch on a, with D being the default case.
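A possible C counterpart of this listing is sketched below. The function name dispatch is hypothetical, and the value returned after label D is an assumption, since the listing does not show what follows that label.

```c
/* Hypothetical C counterpart of the br_table listing. */
int dispatch(unsigned a) {
    switch (a) {
    case 0:  return 0;   /* label A */
    case 1:  return 1;   /* label B */
    case 2:  return 2;   /* label C */
    default: break;      /* label D: any other index lands here */
    }
    return -1;           /* assumed: the code after label D is not shown */
}
```

The index is treated as unsigned, so every value outside the listed targets takes the default branch, just as the last entry of br_table does.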

Other interesting features


Here are some more features of wasm-decompile:

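  • It can take names from debug or linking information, or generate names itself. When using existing names, it simplifies mangled C++ symbol names.
  • It already supports the multi-value proposal, which makes turning code into expressions and statements somewhat harder. Additional variables are used when multiple values are returned.
  • It can even generate names from the contents of data sections.
  • It outputs readable declarations for all Wasm section types, not only for code. For example, wasm-decompile tries to make data sections readable by printing them as text where possible.
  • It supports operator precedence (the usual one for most C-style languages), which reduces the number of parentheses in common expressions.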

Limitations


Decompiling Wasm code is fundamentally harder than, for example, decompiling JVM bytecode.

JVM bytecode is largely unoptimized, so it reproduces the structure of the source code quite faithfully. And even though the original names may be missing, the bytecode refers to unique classes rather than just memory locations.

In contrast, most code that ends up in .wasm files has been heavily optimized by LLVM and has often lost most of its original structure. The output is very different from what a programmer would write. This makes it much harder to decompile Wasm code into something genuinely useful to programmers. However, that does not mean we should not try!

Summary


If you are interested in Wasm decompilation, the best way to learn more is to decompile your own .wasm project! A more detailed guide to wasm-decompile is also available in the WABT repository. The decompiler's implementation is in the source files of that repository whose names begin with decompile (contributions are welcome). The repository also contains tests that show further examples of the differences between .wat files and the decompiler output.

What tools do you use to explore .wasm files?
