Completing SQL. Part 1. The complexity of parsing. Stories about fine-tuning ANTLR with a file

I published the original article on Habr; this translation is posted on the Codingsight blog.

What will happen in this article?


For more than five years, I have been working at a company that develops a line of IDEs for working with databases. When I started this article, I could not imagine how many interesting stories I would remember; by the time I finished, I had more than 30 pages of text. After some thought, I grouped the stories by topic and split the article into several parts.

As the parts are published, I will add links to them here:

Part 1. The complexity of parsing. Stories about fine-tuning ANTLR with a file
Part 2. Optimizing string handling and file opening
Part 3. The life of Visual Studio extensions. Working with IO. Unusual uses of SQL
Part 4. Working with exceptions and the impact of data on the development process. Using ML.NET

A lot of interesting things happened along the way: we found several bugs in .NET, optimized some functions many times over and others by mere percents, did some things very well on the first try, and failed at others even after several attempts. My team develops and supports the IDE's language features, the main one being code completion; hence the name of this series. In each part I will tell several stories: some about successes, some about failures.

In this part, I will focus on the problems of parsing SQL, our fight against those problems, and the mistakes made along the way.




Who am I?


I came to this job as a junior with very little experience. Like many, I came to programming because I wanted to make games, and I even finished several quite successfully. I wrote about the development of one of them here; recently, by the way, I resurrected it on another server. There were another dozen games finished or abandoned at various stages of readiness. Besides games, before getting this job I managed to complete several freelance projects. The largest of these was a social application: a football portal with tournament tables, player data, and the ability to notify users about final scores or goals via SMS. It was built almost 10 years ago. Back then, not everyone carried a smartphone; many people went without mobile Internet or had only the thrice-cursed EDGE, over which opening even a text-only site was not always possible. So the idea seemed good to me.

Somehow it turned out that, in addition to games, I was also drawn to building various tools for myself or other developers. At times they came close to what I would later do at work. For example, one of the projects I did while studying the Win32 API was a program that highlighted XML code in a Rich Edit control. It could also export the highlighted code to BMP, HTML, or the BB codes that were fashionable on forums at the time.

Another project that turned out to be remarkably close to my future job was a program that analyzed C/C++ code and built a flowchart from it. The flowchart conformed strictly to the requirements of one university teacher. It was made clumsily, head-on, with zero knowledge of parsing theory, and worked purely out of my sheer stubbornness. A few years later, while cleaning my computer of old junk, I stumbled upon it and could not bring myself to delete it, so I posted it on GitHub for history's sake.

These experiments, together with freelancing, gave me decent experience and made it possible to get this job. Over time, after a couple dozen code reviews soaked in blood and tears, I even began to be of use to the company and the product. Looking back, it is rather funny to realize that, as a result of several accidents, I ended up doing exactly what I was drawn to.

What are the difficulties?


This section is here to immerse the reader in the context of what we actually do.



Desktop development


Perhaps this is not even a difficulty, since desktop development is a very mature field with little left unknown: the libraries are stable and the best practices are well known. I mention this feature of our project because, like many other programmers, I am drawn to novelty, and novelty has all gone to the web. There were rainy days when I climbed onto the windowsill wrapped in a blanket, with a mug of cocoa, and thought about Redis, React, highload, and the distributed systems being developed somewhere without me right now. The other side of this problem is that it is not easy to describe the complexity of our project to developer friends over a beer, and when your applications operate under fundamentally different laws than theirs, it becomes even harder.


Parsing SQL and Dialects


Before this job, I had written small parsers for various languages. For a while I taught .NET courses, and for some groups, as an extra task in the "strings" topic, I suggested writing a simple JSON parser. But SQL and its dialects are nothing like XML or JSON, which were designed to be equally understandable to parsers and people. Moreover, SQL is syntactically more complex than C/C++, with all the features it has accumulated over its long history. SQL is structured differently: its authors tried to make it resemble natural language, quite successfully, by the way. The language has several thousand keywords (depending on the dialect). Often, to tell one statement from another, you need to peek two or more words (tokens) ahead; this approach is called lookahead. Parsers are classified by how far ahead they can peek: LA(1), LA(2), or LA(*), the last meaning the parser can look as far forward as it takes to determine the correct branch. Sometimes the optional ending of one clause inside a SQL statement coincides with the optional beginning of another; such situations make parsers much more complicated. T-SQL adds fuel to the fire: its semicolon is optional, so the possible but not mandatory ending of some statements may conflict with the beginning of others.

Still not convinced? There is a mechanism for describing formal languages: grammars. A grammar is code in one language that describes another. From a grammar, you can often generate a parser using a tool. The best-known grammar description tools and languages are YACC and ANTLR. YACC-generated parsers are used directly in the MySQL, MariaDB, and PostgreSQL engines. One could try to take them straight from the open source code and develop code completion and other SQL-analysis features on top of them. Moreover, such an implementation would receive free updates as the engines evolve, and the parser would be guaranteed to behave identically to the database engine. So why do we use ANTLR? It has solid C#/.NET support, there are good tools for working with it, and its syntax is much easier to read and write. The ANTLR syntax turned out to be so convenient that Microsoft has recently used it in the official C# documentation.
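To make the notion of a grammar concrete, here is a minimal sketch in ANTLR4 notation; the language it describes is invented for illustration, a toy language of integer assignments such as `x = 42;`.

```antlr
// Toy.g4: a grammar is code in one language (ANTLR) that
// describes another (here, a made-up assignment language).
grammar Toy;

stat : ID '=' INT ';' ;       // parser rule: one assignment statement

ID   : [a-zA-Z]+ ;            // lexer rule: identifiers
INT  : [0-9]+ ;               // lexer rule: integer literals
WS   : [ \t\r\n]+ -> skip ;   // whitespace is discarded
```

From a file like this, the ANTLR tool generates a lexer and a parser. Real SQL grammars follow the same idea, only with thousands of keywords and rules.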

Let us return to my argument about the complexity of SQL, compared with other languages, in terms of parsing. To do this, I want to compare the sizes of publicly available grammars for different languages. In dbForge we use our own grammars; they are more complete than the public ones but, unfortunately, heavily cluttered with C# code inserts supporting various features (more about this in the "Treeless Parsing" story in the "Mistakes" section).

Below are the grammar sizes for different languages:

JS: 475 lines of parser + 273 of lexer = 748 lines
Java: 615 lines of parser + 211 of lexer = 826 lines
C#: 1159 lines of parser + 433 of lexer = 1592 lines
C++: 1933 lines

MySQL: 2515 lines of parser + 1189 of lexer = 3704 lines
T-SQL: 4035 lines of parser + 896 of lexer = 4931 lines
PL/SQL: 6719 lines of parser + 2366 of lexer = 9085 lines

Some lexers end with an enumeration of the Unicode characters available in the language; this is useless for assessing the complexity of the language, so I counted the lines before such enumerations begin.

One may question whether the number of lines in a grammar reflects the complexity of parsing the language, and whether open grammars fully cover the actual syntax of those languages. Even so, I consider these numbers worth giving, since the spread is too large to ignore.


And that is not all, because we need more than just parsing a few SQL files. We are building an IDE, which means we must work on incomplete or invalid scripts. Even if you write your scripts without errors, you probably write them out of order, so a script is unlikely to be valid throughout its development. For example, I first write "SELECT FROM" and expect to see the list of available tables. When I pick a table, I move the caret back after SELECT, press the spacebar, and wait for the list of columns from that particular table. This is a very simple example, but it shows that a parser providing code completion in an IDE cannot crash with an exception upon encountering an invalid script. We had to invent many tricks to make suggestions work correctly in such scenarios, and users keep sending new scenarios of working with unfinished scripts, which means we keep inventing new tricks.

Predicate wars


When parsing code, there are situations when the next word does not make it clear which of two alternatives to choose. Sometimes it is enough to peek a strictly defined number of words ahead to pick an alternative with certainty. The mechanism for resolving this kind of ambiguity in ANTLR is lookahead: the parser method is built as a nested chain of ifs, each looking one word further ahead. Below is an example of a grammar producing an ambiguity of this kind.

rule1:
    'a' rule2 | rule3
    ;

rule2:
    'b' 'c' 'd'
    ;

rule3:
    'b' 'c' 'e'
    ;

In the middle of rule1, after consuming the token 'a', the parser must look two tokens ahead to choose which rule to follow. This check will be done once more inside the chosen rule. The grammar can be rewritten so that this lookahead disappears; unfortunately, the structure often suffers from such optimizations, and the performance gain is relatively small.
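For illustration, here is one way the grammar above can be left-factored so that the two-token lookahead disappears:

```antlr
// Left-factored version of the grammar above: the common prefix
// 'b' 'c' is pulled up into rule1, so after 'a' the parser chooses
// between 'd' and 'e' with a single token of lookahead.
// The separate rule2 and rule3 no longer exist.
rule1:
    'a' 'b' 'c' ('d' | 'e')
    ;
```

The lookahead is gone, but so is the structure that gave rule2 and rule3 their own meaning.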

More complex mechanisms exist to resolve more complex ambiguities. One of them is the syntactic predicate (synpred) mechanism in ANTLR3.

It comes to the rescue when, for example, the optional ending of one clause intersects with the optional beginning of the next. A predicate, in ANTLR3 terms, is a generated method that performs a virtual pass over the text according to one of the alternatives and returns true if it succeeds; such a completion of the predicate is called successful. A virtual pass is also called a backtracking pass. If the predicate succeeds, a real pass is made. This becomes a problem when a predicate starts inside another predicate: then the same section may be traversed hundreds or thousands of times.

Consider a simplified example. There are 3 points of uncertainty (A, B, C).

  1. The parser enters A, remembers the position in the text, and begins a level-1 virtual pass.
  2. Inside it, the parser enters B, remembers the position, and begins a level-2 virtual pass.
  3. Inside it, the parser enters C, remembers the position, and begins a level-3 virtual pass.
  4. The level-3 pass completes successfully; the parser returns to B and traverses C again within the level-2 pass.
  5. The level-2 pass completes successfully; the parser returns to A and traverses B and C again within the level-1 pass.
  6. The level-1 pass completes successfully; the parser makes the real pass through A, B, and C.

Thus, all checks inside C will be performed 4 times, inside B 3 times, and inside A 2 times. And what if the suitable alternative is the second or third in the list? Then at some stage one of the predicates will fail, the position in the text will roll back, and the next predicate will start executing.
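In grammar code, such a point of uncertainty looks roughly like this (a hedged sketch in ANTLR3 terms; the rule names are invented, not taken from our real grammars):

```antlr
// ANTLR3 sketch: both alternatives may start with an arbitrarily
// long select_statement, so no fixed lookahead is enough.
// The (...)=> syntactic predicate makes the parser traverse the
// first alternative virtually (a backtracking pass) and commit
// to it only if that pass succeeds.
statement:
    (select_statement into_clause)=> select_statement into_clause
    | select_statement
    ;
```

With the ANTLR3 option `backtrack = true`, such predicates are inserted automatically, which is one reason their appearance in the generated code is easy to miss.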

While repeatedly analyzing the causes of application freezes, we came across cases when the "tail" of a synpred was executed several thousand times. Synpreds are especially problematic in recursive rules, and SQL is recursive by its very nature, starting with the ability to use a subquery almost anywhere. Sometimes, with the help of various tricks, a rule can be rearranged so that the predicate disappears.

Obviously, synpreds hurt performance, and at some point we had to put their population under strict control. The problem is that, when writing grammar code, the appearance of a synpred may not be obvious; moreover, a change in one rule can produce a predicate in another. This cannot be controlled manually. To control the number of predicates, we wrote a simple regular expression invoked by a special MSBuild task. If the number of predicates differed from the number specified in a separate file, the task failed the build and reported an error. Seeing such an error, the developer had to rewrite the rule several times, trying to get rid of the unnecessary predicates, possibly involving other developers in the problem. If a predicate is unavoidable, the developer updates the count in that file, and the change to this file draws extra attention during review.

In rare cases, we even wrote our own predicates in C# to bypass the ones ANTLR generates. Fortunately, such a mechanism also exists.

Cool home-grown solutions




Grammar Inheritance


After any changes are announced in each of the DBMSs we support, our work to support them begins. Supporting the new syntactic constructions in the grammar is almost always the starting point. For each SQL dialect, we write our own grammar; this produces some code repetition, but ultimately it is simpler than trying to factor out what the dialects have in common. Just a couple of years ago, MySQL and MariaDB were so similar that writing separate grammars was impractical, so the few constructions present in MariaDB but not in MySQL we supported in the MySQL grammar. This had an unpleasant side: to a MySQL user, we could suggest constructions that would not be valid. In recent years, MariaDB and MySQL have diverged greatly in syntax and beyond, and it became apparent that supporting two different languages within one grammar is wrong.

One possible solution would be to fully copy the grammar and then support each copy separately. After a lengthy discussion, we made a bolder decision. I am very glad we did not copy the code; every cell in me resisted that option. We decided to write our own ANTLR grammar preprocessor implementing a grammar inheritance mechanism. Some time ago, I came across an ANTLR3 grammar in the official repository of ANTLR4 grammars. I think that sentence needs to be read several times to appreciate the depth of the madness.

While discussing the idea of inheritance, we quickly realized that we also wanted a polymorphism mechanism: the ability, in the descendant grammar, not only to override a rule but also to call the base one, controlling where the base rule is invoked, at the beginning, the middle, or the end. We decided that all rules can be overridden and that nothing extra needs to be specified for this: to override a rule, it is enough to declare a rule with the same name in the descendant grammar. After that, the rule from the parent grammar becomes available under a different name.

ANTLR is pleasant to develop with thanks to its tooling: there is an extension for Visual Studio, and there is ANTLRWorks. Introducing the inheritance mechanism, we did not want to lose these tools. But then how do we indicate the base grammar? We could invent some file-naming convention, but that is not at all obvious. Another option would be to put this extra information in a separate file, but even now, typing these lines, I can smell the stink of that solution. The answer was to indicate the base grammar inside the descendant grammar, in the form of an ANTLR comment. All the tools simply ignore this text, while we can easily extract the code we care about.

With the requirements formed, it was time to implement them. We wrote an MSBuild task that was plugged into the general build system as a pre-build action. The task did the job of the ANTLR grammar preprocessor, generating the resulting grammar from the base and inherited ones; the resulting grammar was then processed by ANTLR itself. If the descendant grammar contained a rule with the same name as one in the parent, the base rule was renamed: the parent grammar's name was appended to the rule name after an underscore, and under this name the rule could be referenced in the descendant.
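As an illustration of the mechanism (the grammars, rule names, and marker-comment format here are hypothetical, chosen only to show the idea):

```antlr
// MySql.g (base grammar)
select_statement:
    query_expression
    ;

// MariaDb.g (descendant grammar; the base grammar is named in an
// ordinary ANTLR comment that our preprocessor reads and every
// other tool simply ignores)
// base grammar: MySql

// Overriding: a rule with the same name as in the parent.
// The preprocessor renames the base rule to select_statement_MySql,
// so the descendant can call it and append its own clause.
select_statement:
    select_statement_MySql returning_clause?
    ;
```

This is how a dialect-specific construction can be bolted onto an inherited rule without copying the base grammar.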

The preprocessor mechanism itself did not take much time to build, but together with parser generation it slowed down every rebuild of the project by 10-20 seconds. At some point, this stopped suiting us, and we thought about how to optimize it. The solution was to add a hash of all the grammars the parser depends on as a comment in the header of the generated C# parser file. Before doing anything, the preprocessor compared these hashes with the hashes of the files on disk; if they did not differ, the parser file was considered up to date. In the early stages of development, we stepped on the rake of invalid parsers and grammars built by an outdated version of the preprocessor more than once; as a result, a hash of the preprocessor assembly itself also appeared in the header comment.

Another ANTLR post-processing step


In many programming languages, if a word is a keyword, it can no longer be used as an object name. SQL, depending on the dialect, has from 800 to 3000 keywords. Most of them are closely tied to the subject area, and not all of them were introduced at once, so banning them all as object names would be met with a flurry of indignation. SQL therefore distinguishes reserved and non-reserved keywords. You cannot name an object the same as a reserved keyword (SELECT, FROM, etc.) without quoting it, but you can with a non-reserved one (CONVERSATION, AVAILABILITY, etc.). This behavior complicates parser development: at the time of lexical analysis the context is unknown, yet the parser requires different token numbers for an identifier and a keyword. To solve this problem, we added one more post-processing step to the ANTLR parser. It replaces all explicit checks for the identifier token with calls to a special method implementing a trickier check. If the input is an identifier token and an identifier is expected, all is fine. If the input is a non-reserved keyword, an additional check is needed: the method examines the current context, looking for branches where this non-reserved keyword can be used precisely as a keyword; if there are no such branches, it can be used as an identifier.

Strictly speaking, this problem could be solved with ANTLR alone, but such a solution would not be optimal. The classic approach is to create a rule that lists the identifier token and all non-reserved keywords, and then to use that rule instead of the identifier token wherever an identifier is permitted. Such a solution not only forces you to remember to add each new keyword both where it is used and to this special rule, it also works much more slowly.
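The classic pure-grammar solution mentioned above looks roughly like this (a sketch; in a real dialect the keyword list runs into hundreds of entries):

```antlr
// Every position that accepts an identifier must reference this
// rule instead of the ID token, and every new non-reserved keyword
// must be added both where it is used and here. This is easy to
// forget and noticeably slower at parse time.
identifier:
    ID
    | non_reserved_keyword
    ;

non_reserved_keyword:
    CONVERSATION
    | AVAILABILITY
    // ...hundreds more non-reserved keywords
    ;
```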

Mistakes



Treeless Parsing




The result of a parser is, as a rule, a syntax tree. A syntax tree (abstract or concrete) is a data structure that reflects the program text through the prism of a formal grammar. If you want to implement a code editor with autocompletion for a language you recently invented, then, having studied the question, you will most likely implement roughly the following algorithm:

  1. Parse the text in the editor and get a syntax tree.
  2. Find the node under the caret and match it against the grammar. Find out which keywords and object types are available at that point.

In this case, the grammar is conveniently represented as a graph or a finite state machine.

Unfortunately, when development of the IDE started, only the third version of ANTLR existed. The fourth version was rewritten from scratch and is fundamentally different: given the code, the parser builds a parse tree automatically, without a single extra line. In the third version, there was a mechanism to tell ANTLR how to build a tree, but using it was not very pleasant. Moreover, many examples and articles on the topic suggested using the actions mechanism, which executes code at the moment the parser passes a rule. The mechanism is incredibly convenient and delivers quick results. Unfortunately, this solution led to major architectural problems as the product grew and raised the cost of supporting new functionality: in a single grammar file, actions kept accumulating that were associated with a large number of different features, which really belonged in different assemblies. We later managed to distribute the action handlers across assemblies by implementing a rather tricky variant of the subscriber-notifier pattern, but the calls themselves, with the passing of the necessary information, still clutter our grammar, complicate the support of new functionality, and impose serious and unpleasant restrictions on the architecture.

But everything is not as clear-cut as it might seem: ANTLR3 is much faster than ANTLR4. By our measurements, the difference is about 6x. In addition, the syntax tree for large scripts can take up a lot of RAM, and as long as we have to survive in the 32-bit address space of Visual Studio and SQL Server Management Studio, this can be critical.

Conclusion


The interim takeaways are as follows:

  • ANTLR is a powerful tool for building parsers
  • Its advantages over the alternatives are its tooling, convenient syntax, and the large number of supported languages
  • ANTLR4 was rewritten from scratch and implies a way of working with the parser that differs from the third version
  • There is always a way to get a little more out of third-party libraries than they offer
  • SQL is a peculiar language, and building parsers for it is not an easy task
  • Parsing code for IDE-related tasks has its own specifics: you have to handle scripts that are unfinished or invalid

See you in the next part!
