Perfect SAST. Parser

This series of articles describes an optimal approach to implementing static code analysis tools, aimed at specialists. The goal of the work is to find approaches, models and techniques that give maximum tool performance while minimizing the complexity of developing and maintaining the tool. This article discusses a way to speed up the parser and reduce memory consumption. The article assumes that the reader has already read the ANTLR manual written by Terence Parr.

Using ANTLR to the fullest


ANTLR works best on files between roughly 1 kB and 10 kB; larger files will not be parsed by this library at maximum speed, and debugging on large files becomes almost impossible. The main reason to keep the parser input within these limits is that ANTLR's prediction algorithms make heavy use of the stack, so you need to eliminate, by any means available, loops in the grammar in which one rule calls itself through many other rules, including the conditions that indirectly activate such a cycle. Consider an example:

var f = fun a(b) { b = 10; return b; }

This code can be described with the following grammar:

start: stmt* EOF;
stmt: varDecl | funDecl | expr ';' | 'return' expr ';';
varDecl: 'var' id '=' expr;
expr: id ('=' expr)? | number;
funDecl: 'fun' id '(' id* ')' '{' stmt* '}';

It is easy to see that stmt is called inside the funDecl rule, which forms a loop; in real applications such a loop can mean that a 100 kB file requires a 4 GB heap just to be parsed at full speed, or even to be parsed at all, which is a completely unacceptable situation.

One way to overcome this drawback is to add a special step to token creation, namely folding {} blocks into a single token, with these tokens processed later in a new parser that receives the folded tokens as its input.

To do this in ANTLR, you can override the lexer's org.antlr.v4.runtime.Lexer#nextToken method, where the folding functionality can be implemented:

  override def nextToken(): Token = super.nextToken() match {
      case x: RScanToken if x.getType == foldOpen => buildFoldToken(x)
      case x: Token => x
  }

  def buildFoldToken(start: RScanToken): RScanToken = {
    val (token, list) = createFoldBlockBody(
      mutable.MutableList[RScanToken]()
        ++ List(start.asInstanceOf[RScanToken])
    )
    val result = new RScanToken(path, _tokenFactorySourcePair, foldBlock, Lexer.DEFAULT_TOKEN_CHANNEL, start.getStartIndex, token.getStopIndex)
    result.setLine(start.getLine)
    result.setCharPositionInLine(start.getCharPositionInLine)
    result.setFoldedTokens(list)
    result
  }

  private def createFoldBlockBody(list: mutable.MutableList[RScanToken]): (RScanToken, mutable.MutableList[RScanToken]) = {
    var current: RScanToken = null
    while (current == null || current.getType != Token.EOF && current.getType != foldClose) {
      current = nextToken().asInstanceOf[RScanToken]
      list += current
    }
    if (current.getType == Token.EOF)
      throw new IllegalStateException()
    (current, list)
  }
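
For context, RScanToken here is the project's own token class; its real implementation (including the path-aware constructor used above) lives in the project sources. A rough sketch of what such a fold-aware token can look like, with a hypothetical name and a simplified constructor, is simply a CommonToken that additionally remembers the tokens it replaced:

  import org.antlr.v4.runtime.{CharStream, CommonToken, Token, TokenSource}
  import org.antlr.v4.runtime.misc.Pair
  import scala.collection.mutable

  // Rough sketch only: a token that carries the tokens it folded away.
  class FoldedBlockToken(
      source: Pair[TokenSource, CharStream],
      tokenType: Int,
      channel: Int,
      start: Int,
      stop: Int
  ) extends CommonToken(source, tokenType, channel, start, stop) {
    private var folded = mutable.MutableList[Token]()
    def setFoldedTokens(tokens: mutable.MutableList[Token]): Unit = folded = tokens
    def getFoldedTokens: mutable.MutableList[Token] = folded
  }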

The grammar should be supplemented with the following rules:

startJavaScript 
    :   HashBangLine? jsFileBody? EOF
    ;

startFoldBlock
   :   block EOF
   ;

block
    :   LBRACE stmtList? RBRACE
    |   FoldBlock   // see startFoldBlock
    ;

With this approach, the contents of a folded block are not processed a second time: the recursive folding algorithm absorbs the tokens and stores them inside the fold token.
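
When the contents of a folded block do need to be analyzed later (for example, a function body), the stored tokens can be handed to a second parser instance that starts from the startFoldBlock rule. A minimal sketch of this step is shown below; the RScanParser class name, the getFoldedTokens accessor and the collection conversion are assumptions for illustration rather than the project's exact API:

  import scala.collection.JavaConverters._
  import org.antlr.v4.runtime.{CommonTokenStream, ListTokenSource}

  // Re-parse a single folded block on demand from the tokens the lexer captured for it.
  def parseFoldedBlock(fold: RScanToken): RScanParser.StartFoldBlockContext = {
    val source = new ListTokenSource(fold.getFoldedTokens.toList.asJava)
    val tokens = new CommonTokenStream(source)
    val parser = new RScanParser(tokens)
    parser.startFoldBlock() // the entry rule added to the grammar above
  }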

Processing speed increases by a factor of 2 to 10; the difference is especially noticeable on minified js files.

The impact of the proposed optimization on ANTLR, measured on several javascript files:
File                            Size, kB    Before optimization, ms    After optimization, ms
normalize-and-load-metadata.js  10          1839                       1372
rewrite-live-references.js      8           899                        528
dom.umd.min.js                  130         43852                      6607
react.pure.umd.min.js           139         51668                      6495
tsserver.js                     8151        362857                     117787

Without the optimization, processing runs in a single thread; with it, folded blocks are processed in parallel where possible.
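
Since top-level folded blocks do not depend on each other, they can be distributed across worker threads. One possible scheme is sketched below with Scala Futures, reusing the hypothetical parseFoldedBlock helper from above; each task builds its own parser instance, because ANTLR parser objects must not be shared between threads:

  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration.Duration

  // Parse independent folded blocks concurrently; no ANTLR state is shared between
  // threads, because parseFoldedBlock builds a fresh parser for every block.
  def parseAllFolds(folds: Seq[RScanToken]): Seq[RScanParser.StartFoldBlockContext] = {
    val work = folds.map(fold => Future(parseFoldedBlock(fold)))
    Await.result(Future.sequence(work), Duration.Inf)
  }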

The runs were performed on a machine with 64 GB of RAM, an Intel® Core™ i7-9700K CPU @ 3.60GHz, and OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.4+11).

JVM startup flags: -Xmx4G -XX:+UseG1GC -XX:MaxHeapFreeRatio=30 -XX:MinHeapFreeRatio=10
For the tsserver.js file the flag was raised to -Xmx32G, and memory consumption reached 22 GB! Meanwhile, minified files up to 200 kB are analyzed with the optimization under a 128 MB heap limit, in parallel mode, without the garbage collector putting any noticeable load on the processor: at most 2%!

Profiling


ANTLR supports profiling the parser/lexer at runtime. To enable it, call the org.antlr.v4.runtime.Parser#setProfile method before any rules are executed. The collected information can then be retrieved, for example, like this:

    private def printProfileInfo(parser: AbstractRScanParser, tokenStream: TokenStream): Unit = {
      // do the actual parsing
      val parseInfo = parser.getParseInfo
      val atn = parser.getATN
      for (di <- parseInfo.getDecisionInfo if di.ambiguities.size() > 0) {
        val ds = atn.decisionToState.get(di.decision)
        val ruleName = parser.ruleName(ds.ruleIndex)
        log.debug("Ambiguity in rule '" + ruleName + "' -> {}", di)
        log.debug("=========================")
        log.debug(tokenStream.getText)
        log.debug("=========================")
      }
    }
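
Profiling has to be switched on before the entry rule runs. The following usage sketch shows the intended order of calls, again under the assumed RScanLexer/RScanParser class names and an illustrative input path:

    import java.nio.file.Paths
    import org.antlr.v4.runtime.{CharStreams, CommonTokenStream}

    // Enable profiling, do the actual parsing, then inspect the collected decisions.
    val lexer = new RScanLexer(CharStreams.fromPath(Paths.get("input.js")))
    val tokenStream = new CommonTokenStream(lexer)
    val parser = new RScanParser(tokenStream)
    parser.setProfile(true)   // must be called before any rule is executed
    parser.startJavaScript()  // the parsing referred to by the comment above
    printProfileInfo(parser, tokenStream)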

An obvious advantage of profiling combined with this method is the reduction in the amount of data that has to be analyzed: folded blocks can be replaced with empty {} blocks and you always work within a local fold, which greatly speeds up the process of optimizing the grammar.

Conclusion


This article proposes a method for optimizing the operation of formal grammars implemented with the ANTLR library, although the approach itself can probably be applied to other libraries as well. The essence of the optimization is reducing the number of forward transitions needed to validate the input token stream. It also makes it possible to analyze large files (more than 1 MB) at a memory cost of 512 MB for single-threaded processing, whereas without the optimization the same grammar would need about 50 times more memory to work quickly.

The source code is available on github. The project was created to demonstrate how the optimization works and is simplified as much as possible. It is recommended to start exploring it from the GrammarTester class; a launch configuration is provided for it, and you only need to specify the path to the folder with the input data. Pay attention to the rscan.tester.cache.dir key: temporary files are written to this folder, and a successful analysis of a file is recorded there so the parser will not analyze it a second time, which is convenient when fixing errors. At the moment the grammar has been tested on 17 thousand files from the node_modules created for a basic react application; only nested node_modules were ignored.

In the next article, we will consider options for organizing data and the R (Reflection) model for optimally representing the source data as analysis objects: minimal memory consumption, element lookup algorithms with O(1) complexity, and the ability to save all objects to disk in serialized form, as well as unit testing of parts of the tool.

Source: https://habr.com/ru/post/undefined/

