How we automated porting products from C# to C++

Hello, Habr. In this post I will describe how we set up monthly releases of C++ libraries whose source code is developed in C#. This is not about managed C++, nor about building a bridge between unmanaged C++ and the CLR. It is about automatically generating C++ code that reproduces the API and functionality of the original C# code.

We wrote the infrastructure for translating code between the languages and for emulating the functions of the .NET library ourselves, thus solving a problem that is usually considered academic. This allowed us to start shipping monthly releases of our .NET products for C++ as well, generating the code for each release from the corresponding version of the C# code. The tests that cover the original code are ported along with it, so together with tests written specifically in C++ they let us verify that the resulting solution works.

In this article I will briefly describe the history of our project and the technologies used in it. I will touch on the issues of economic justification only in passing, since the technical side is much more interesting to me. In the following articles of the series, I plan to dwell in more detail on topics such as code generation and memory management, as well as on some others, if the community has relevant questions.

Background


Initially, our company released libraries for the .NET platform. These libraries mainly provide APIs for working with various file formats (documents, spreadsheets, slides, graphics) and protocols (email), occupying a certain niche in the market for such solutions. All development was done in C#.

At the end of the 2000s, the company decided to enter a new market by releasing similar products for Java. Development from scratch would obviously have required resources comparable to the initial development of all the affected products. The option of wrapping the .NET code in a layer that translates calls and data from Java to .NET and back was also rejected, for a number of reasons. Instead, the question was whether the existing code could somehow be migrated to the new platform in its entirety. This was all the more relevant because it was not a one-off effort: new versions of each product had to be released monthly, synchronized between the two languages.

It was decided to split the solution into two parts. The first, the so-called Porter, would convert C# source code syntax to Java, replacing .NET types and methods with their counterparts from Java libraries along the way. The second, the Library, would emulate those parts of the .NET library for which a direct correspondence with Java is difficult or impossible to establish, drawing on available third-party components to do so.

Several considerations suggested that such a plan was feasible in principle:

  1. Ideologically, the C# and Java languages are quite similar, at least in their type structure and memory management model;
  2. It was about porting libraries; there was no need to port the GUI;
  3. , , - , System.Net System.Drawing;
  4. , .Net ( Framework, Standard Xamarin), .

I will not go into details, since they deserve a separate article (and more than one). I will only say that it took about two years from the start of development to the release of the first Java product, and since then releasing Java products has become regular practice for the company. Over the course of the project, the porter evolved from a simple utility that converts text according to fixed rules into a complex code generator that works with the AST representation of the source code. The Library has grown considerably as well.

The success of the Java direction led the company to want to expand further into new markets, and in 2013 the question of releasing products for C++ in a similar way was raised.

Problem statement


To release C++ versions of the products, we needed a framework that could take arbitrary C# code, produce C++ code from it, compile it, test it and deliver it to the customer. The libraries in question ranged from several hundred thousand to several million lines of code (excluding dependencies).

Experience with the Java porter was taken into account here. Early on, when it was still a simple syntax conversion tool, the practice of manually finishing the ported code arose naturally. In the short term, with the focus on shipping products quickly, this made sense because it sped up development. In the long term, however, it significantly increased the cost of preparing each version for release, because every translation error had to be corrected again each time it occurred.

Of course, this complexity was manageable, at least by applying to the resulting Java code only patches computed as the difference between the porter output for two consecutive revisions of the C# code. This approach made it possible to correct each ported line only once and to reuse the already finished code wherever nothing had changed. However, when developing the C++ porter, the goal was to get rid of the stage of fixing ported code altogether and to fix the framework itself instead. That way each translation error, however rare, would be corrected once, in the porter code, and the fix would apply to all future releases of all ported products.

In addition to the porter itself, we also needed to develop a C++ library that would solve the following problems:

  1. Emulating the .NET environment to the extent necessary for the ported code to work;
  2. Adapting ported C# code to the realities of C++ (type structure, memory management, other service code);
  3. Smoothing over the differences between “rewritten C#” and C++ proper, to make the ported code easier to use for programmers unfamiliar with .NET paradigms.

For obvious reasons, no attempt was made to map .NET types directly to types from the C++ standard library. Instead, it was decided always to use types from our own library as replacements for the .NET types.

Many readers will immediately ask why we did not use an existing implementation such as Mono. There were reasons for that.

  1. By adopting such a ready-made library, we could have satisfied only the first requirement, but neither the second nor the third.
  2. Mono C# , , , .
  3. (API, , , C++, ) , .
  4. , .Net, . , , .

Theoretically, such a library could be translated into C++ in its entirety using the porter. However, this would require a fully functional porter at the very beginning of development, since without a system library it is impossible in principle to debug any ported code. In addition, optimizing the translated code of the system library would be an even more pressing issue than for the code of the ported products, since calls to the system library tend to become a bottleneck.

As a result, it was decided to develop the library as a set of adapters that provide access to functionality already implemented in third-party libraries, but through a .NET-like API (as had been done for Java). This would reduce the amount of work and let us use ready-made, already optimized C++ components.
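
The adapter idea can be illustrated with a minimal sketch. It is not code from our Library: the class name and namespace are invented for this example, and std::regex stands in for whichever third-party engine actually does the work. The point is only that the public surface follows .NET naming while the implementation is delegated.

```cpp
#include <cassert>
#include <regex>
#include <string>

namespace sketch {

// Hypothetical adapter: a .NET-style Regex API on the outside,
// an existing C++ engine (std::regex as a stand-in) on the inside.
class Regex {
public:
    explicit Regex(const std::string& pattern) : engine_(pattern) {}

    // Shaped after Regex.IsMatch(string) from .NET.
    bool IsMatch(const std::string& input) const {
        return std::regex_search(input, engine_);
    }

    // Shaped after Regex.Replace(string, string): replaces every match.
    std::string Replace(const std::string& input,
                        const std::string& replacement) const {
        return std::regex_replace(input, engine_, replacement);
    }

private:
    std::regex engine_;  // the component that does the real work
};

// Usage example: ported code calls the .NET-like methods only.
inline void demo() {
    Regex digits("[0-9]+");
    assert(digits.IsMatch("abc123"));
    assert(!digits.IsMatch("abc"));
    assert(digits.Replace("a1b22", "#") == "a#b#");
}

}  // namespace sketch
```

Ported code sees only the .NET-shaped methods, so the engine behind them can be swapped without touching the porter output.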

An important requirement for the framework was that the ported code had to be able to work as part of user applications (where libraries are concerned). This meant that the memory management model had to be understandable to C++ programmers, since we cannot force arbitrary client code to run in a garbage-collected environment. Smart pointers were chosen as a compromise model. How we managed to make this transition (in particular, how we solved the problem of circular references) will be covered in a separate article.
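
To show why circular references are a problem for smart pointers at all, here is a small sketch using the standard std::shared_ptr and std::weak_ptr. These are stand-ins: the Library defines its own pointer types, and the Node type and function names below are invented for the example. A parent owns its children through strong pointers, while the back reference is weak, so the pair can be destroyed without a garbage collector.

```cpp
#include <cassert>
#include <memory>
#include <vector>

namespace sketch {

// Illustrative tree node that counts live instances through an external counter.
struct Node {
    explicit Node(int* live) : live_counter(live) { ++*live_counter; }
    ~Node() { --*live_counter; }

    std::vector<std::shared_ptr<Node>> children;  // owning (strong) references
    std::weak_ptr<Node> parent;                   // non-owning back reference
    int* live_counter;
};

// Builds a parent/child pair that refer to each other, releases it,
// and returns how many nodes are still alive afterwards.
inline int live_nodes_after_release() {
    int live = 0;
    {
        auto root = std::make_shared<Node>(&live);
        auto child = std::make_shared<Node>(&live);
        child->parent = root;             // weak: does not keep root alive
        root->children.push_back(child);  // strong: root keeps child alive
    }
    return live;
}

inline void demo() {
    // With the weak back reference, both nodes are destroyed.
    assert(live_nodes_after_release() == 0);
}

}  // namespace sketch
```

If the back reference were a strong pointer as well, the two nodes would keep each other alive and the counter would stay at 2, which is the kind of leak a garbage collector would have handled in C#.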

Another requirement was the ability to port not only the libraries but also their tests. The company has a strong culture of test coverage for its products, and being able to run in C++ the same tests that were written for the original code would greatly simplify the search for problems after translation.

The remaining requirements (launch format, test coverage, technology, etc.) concerned mainly the methods of working with the project and on the project. I will not dwell on them.

History


Before continuing, I have to say a few words about the structure of the company. The company works remotely, all the teams in it are distributed. The development of a certain product is usually the responsibility of a team, united by language (almost always) and geography (mainly).

Active work on the project began in the fall of 2013. Because of the distributed structure of the company, and because of some doubts about whether the development would succeed, three versions of the framework were started at once: two of them served one product each, and the third covered three products. The idea was that this would later allow us to stop development of the less effective solutions and reallocate resources if necessary.

Later, four more teams joined the work on the “common” framework, two of which subsequently reconsidered and decided not to release products for C++. At the beginning of 2017, a decision was made to stop the development of one of the “individual” solutions and move the corresponding team over to the “common” framework. The discontinued solution used the Boehm GC for memory management and contained a much richer implementation of some parts of the system library, which was then transferred to the “common” solution.

Thus, two developments reached the finish line, that is, the release of ported products: one “individual” and one “collective”. The first releases based on our (“common”) framework took place in February 2018. Subsequently, the releases of all six teams using this solution became monthly, and the framework itself was released as a separate product of the company. The question of making it open source was even raised, but that discussion has not progressed so far.

The team that continued to work independently on a similar framework also released its first C++ release in 2018.

The first releases contained reduced versions of the original products, which made it possible to postpone translating the less important parts for as long as possible. In subsequent releases, functionality has been (and still is being) added piece by piece.

Organization of work on the project


The organization of joint work on the project by several teams has undergone significant changes. Initially, it was decided that one large “central” team would be responsible for developing, supporting and fixing the framework, while the small “product” teams releasing the final C++ products would mainly try to port their code and provide feedback (information about porting, compilation and execution errors). This scheme turned out to be unproductive, however: the central team was overloaded with requests from all the “product” teams, and those teams could not move on until the problems they had encountered were resolved.

For reasons largely unrelated to the state of this particular project, it was decided to disband the “central” team and move its people to the “product” teams, which were now responsible for fixing the framework to suit their needs. Each team would decide for itself whether to keep using the shared codebase or to create its own fork of the project. Such an arrangement made sense for the Java framework, whose code was stable by then, but filling out the C++ library as quickly as possible required a consolidated effort, so the teams continued to work together.

This form of work also had its drawbacks, so another reform was carried out later. The “central” team was restored, smaller than before and with different functions: it was now responsible not for the actual development of the project but for organizing joint work on it. This included supporting the CI environment, establishing Merge Request practices, holding regular meetings with the developers involved, maintaining documentation, extending test coverage, helping with architectural decisions and troubleshooting, and so on. In addition, the team took on the work of eliminating technical debt and other resource-intensive tasks. Development continues in this mode to this day.

Thus, the project was started by a handful of developers (about five) and at its peak numbered about twenty people. In recent years the headcount has been stable at some ten to fifteen people, who are responsible for developing and supporting the framework and releasing six ported products.

The author of these lines joined the company in mid-2016, starting out in one of the teams translating their code with the “common” solution. In the winter of the same year, when it was decided to recreate the “central” team, I became its team lead. Thus, my experience on the project today is more than three and a half years.

The autonomy of the teams responsible for releasing ported products has meant that in some cases developers found it easier to add another operating mode to the porter than to agree on how it should behave by default. This explains why the porter has more configuration options than one might expect.

Technologies


It is time to talk about the technologies used in the project. The porter is a console application written in C#, because in this form it is easier to embed in scripts that perform tasks such as “port, compile, run tests”. There is also a GUI component that lets you achieve the same goals by clicking buttons.

The old NRefactory library is responsible for parsing the code and resolving semantics. Unfortunately, Roslyn was not yet available when the project started, although migrating to it is, of course, in our plans.

The porter uses AST tree walkers to collect information and generate the C++ output code. No AST representation is built for the generated C++ code; it is all stored as plain text.

In many cases, the porter needs additional information for fine tuning. This information is passed to it in the form of options and attributes. Options apply to the whole project at once and let you set, for example, the names of the macros used to export class members or the C# preprocessor definitions used during code analysis. Attributes are attached to types and entities and define processing specific to them (for example, the need to generate the “const” or “mutable” keywords for class members, or to exclude them from porting).
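
As a rough illustration of what such attributes buy you, consider the sketch below. The attribute names in the C# comment and the C++ class are invented for this example and are not the porter's real attribute set. C# has no notion of const methods, so without a hint the porter cannot know that a method is safe to call on a const object, or that a field it modifies should be mutable.

```cpp
#include <cassert>

namespace sketch {

// Hypothetical C# input (attribute names are assumptions made for this sketch):
//
//   class Counter {
//       [Mutable] private int reads;
//       private int value = 42;
//       [Const] public int Read()  { reads++; return value; }
//       [Const] public int Reads() { return reads; }
//   }
//
// One plausible C++ result once the hints are applied:
class Counter {
public:
    int Read() const { ++reads_; return value_; }  // const thanks to the hint
    int Reads() const { return reads_; }

private:
    mutable int reads_ = 0;  // may change even inside const methods
    int value_ = 42;
};

inline void demo() {
    const Counter c{};  // usable as a const object only because of the hints
    assert(c.Read() == 42);
    assert(c.Reads() == 1);
}

}  // namespace sketch
```

Without the hints both methods would be non-const, and C++ client code holding a const reference to the object could not call them.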

C# classes and structures are translated into C++ classes, and their members and executable code into the nearest equivalents. Generic types and methods map to C++ templates. C# references are translated into smart pointers (strong or weak) defined in the Library. The principles of the porter's operation will be discussed in more detail in a separate article.
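
A hand-written sketch of this mapping follows. It is not actual porter output: the types are invented, and std::shared_ptr stands in for the Library's own strong pointer type. It shows a generic class becoming a template and a reference-typed field becoming a smart pointer, which preserves C# reference semantics (two holders can share one object).

```cpp
#include <cassert>
#include <memory>

namespace sketch {

// Illustrative C# input:
//
//   class Box<T> { public T Value; public Box(T v) { Value = v; } }
//   class Holder {
//       public Box<int> Item;
//       public int Get() { return Item.Value; }
//   }

// Generic class -> class template.
template <typename T>
class Box {
public:
    explicit Box(T v) : Value(v) {}
    T Value;
};

// Reference-typed field -> smart pointer (std::shared_ptr as a stand-in).
class Holder {
public:
    std::shared_ptr<Box<int>> Item;
    int Get() const { return Item->Value; }
};

inline void demo() {
    Holder a;
    Holder b;
    a.Item = std::make_shared<Box<int>>(7);
    b.Item = a.Item;       // copies the reference, not the object
    b.Item->Value = 9;     // visible through both holders
    assert(a.Get() == 9);
}

}  // namespace sketch
```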

Thus, the original C# assembly is converted into a C++ project that depends on our shared library instead of the .NET libraries. This is shown in the following diagram:



CMake is used to build the library and the ported projects. The currently supported compilers are Visual Studio 2017 and 2019 (Windows), GCC and Clang (Linux).

As mentioned above, most of our implementations of .NET functionality are thin layers over third-party libraries that do the bulk of the work. These include:

  • Skia - for working with graphics;
  • Botan - to support encryption functions;
  • ICU - for working with strings, encodings and cultures;
  • Libxml2 - for working with XML;
  • PCRE2 - for working with regular expressions;
  • zlib - to implement compression functions;
  • Boost - for various purposes;
  • several other libraries.

Both the porter and the library are covered by numerous tests. Library tests use the gtest framework. Porter tests are written mainly in NUnit / xUnit and fall into several categories, verifying that:

  • the porter output for given input files matches the expected output;
  • the output of the ported programs after compilation and launch matches the expected output;
  • NUnit tests from the input projects are successfully converted to gtest tests in the ported projects and pass;
  • the API of the ported projects works correctly in C++;
  • individual options and attributes affect the translation process as expected.

We use GitLab to store the source code. Jenkins was chosen as the CI environment. Ported products are available as NuGet packages and as downloadable archives.

Problems


While working on the project, we had to face a lot of problems. Some of them were expected, while others only surfaced along the way. Here is a brief list of the main ones.

  1. .Net C++.
    , C++ Object, RTTI. .Net STL.
  2. .
    , , . , C# , C++ — .
  3. .
    — . , . , .
  4. .
    C++ , , .
  5. C#.
    C# , C++. , :

    • , ;
    • , (, yield);
    • , (, , , C#);
    • , C++ (, C# foreground-).
  6. .
    , .Net , .
  7. .
    - , , «» , . , , , , using, -. . , .
  8. .
    , , , , , / - .
  9. .
    . , . , , , .
  10. Difficulties with the protection of intellectual property.
    While C# code is fairly easily obfuscated by off-the-shelf tools, C++ requires additional effort, since many class members cannot be removed from header files without consequences. Translating generic classes and methods into templates also creates vulnerabilities by exposing algorithms.

Despite this, the project is very interesting from a technical point of view. Working on it lets you learn a great deal. The academic nature of the task contributes to this as well.

Summary


As part of the project, we were able to implement a system that solves an interesting academic problem for the sake of its direct practical application. We set up monthly releases of the company's libraries in a language for which they were not originally intended. It turned out that most of the problems are entirely solvable, and the resulting solution is reliable and practical.

Two more articles are planned for the near future. One will describe in detail, with examples, how the porter works and how C# constructs are mapped to C++. The other will cover how we managed to make the memory models of the two languages compatible.

I will try to answer questions in the comments. If readers show interest in other aspects of our development and the answers start to outgrow the comment thread, we will consider publishing further articles.
