Viktoria Pelipenko , Kirill Epifanov , Konstantin Volohovsky

Aug 14 2024

Tags:

#Knowledge #StaticAnalysis #Java

How to develop code analyzer in 48 hours

Aug 14 2024

Author: Viktoria Pelipenko , Kirill Epifanov , Konstantin Volohovsky

Small foreword
What's an analyzer?
Core
Analysis
- Preparing for writing diagnostic rules
- Writing diagnostic rules
Viewing warnings
Ad Astra
Epilogue

Static analysis is a robust tool that helps developers to control code quality. Let's try to develop a simple analyzer for Lua using Java and see what's under the static analyzer hood.

Small foreword

We'll write the static analyzer for Lua in Java.

Why Lua? Its syntax is simple, and there's no need to get bogged down in details. It's also a good language, ~~especially compared to JavaScript~~. Why Java? It has an all-round tech stack for us, and Java is also a user-friendly language for development.

Is this a new PVS-Studio analyzer? open icon

No :) However, the article goes over how we developed a genuine analyzer pilot at the in-house hackathon. So, the title is not just a clickbait—we really had 48 hours. Our team had three devs, so if you'd like to try this solo, be prepared to spend a bit more time.

Since this is just a pilot, we'll refer to our analyzer as "Mun" in the article.

What's an analyzer?

Before we get started, it's better to realize what an analyzer is and map out the scope of work. All in all, it's clear: grab the code and grumble if there's something wrong here. What exactly do we need? We're interested in the following aspects of static analysis:

Lexer and parser. We take the source code and turn it into an easy-to-use tree (AST).
AST (or abstract syntax tree) is a way of representing the data structure of a program as a tree. It contains data about the program syntax.
Semantic data. Only syntax data isn't enough for analysis. So, we need an extra mechanism that aggregates semantic (meaning) data in the tree. For instance, it could be the variable scope.
Data-flow analysis. If we'd like to do some in-depth analysis, we can try to predict the variable values in the program nodes. For example, it helps catch errors related to zero division.

It sounds tediously, but this is only the analyzer core. Btw, a compiler has the same thing in the frontend. However, the trickiest part (code generation) just lurks in the backend.

What's the plan for the core? The scope is as follows:

write diagnostic rules to detect errors in code;
collect detected errors and issue a report;
create our own plugin to view warnings (optional).

Indeed, too much for 48 hours:) Some things have to be given up, some streamlined, and some reused. As we move through the article, we'll consider such cases.

To gain a better understanding, let's schematize it:

That's enough to get started. You can learn more about static analysis on our website.

Core

Lexer and parser

Theory

So, lexer and parser are the analyzer foundation, we can't go further without them. If we want anything beyond regular expressions.

It's quite simple with a lexer: it's cumbersome to handle the source code as text. So, we need to translate it into an intermediate representation, i.e. split it into tokens. That said, the lexer have to be not smart. This is a must-have for it because all the hardest and messiest stuff goes to a parser.

Then let's move to a parser. It takes incoming tokens, figures them out, and builds AST. Here's a quick background.

Languages have grammars, and there are different types of parsers to work with them. If we wrote it ourselves, it'd be better to write the simple LL(1) parser for context-free grammar. As we said before, Lua has an intuitive grammar so this would be enough.

Definition of LL open icon

LL means that the input string is read from left to right, and the left-to-right output is constructed for it. In general, it's not enough just look at the current token, (LL(0)) , so we may need to look at k tokens ahead. Such a parser is called LL(k).

However, we don't need it, because we're not going to write anything. Why? We have only 48 hours and no time to create the parser—especially if you've never developed it.

Chosen approach

What alternative do we have to write our lexer and parser? Generators! This is a whole class of utilities that only need us to give a specially described grammar file as input to generate the parser.

We've chosen ANTLR v4. The tool is also written in Java, which makes it really easy to use. Over many years of development, it has started to fare very well.

The issue lurks here too—we need to know how to write the grammar file for the parser. Fortunately, it's not hard to find ready-made options on GitHub, so we'll just take it from here.

Once we've configured the project with ANTLR, let's move on to building the abstract syntax tree.

Abstract Syntax Tree

Speaking of trees: ANTLR is well-designed to visualize code analysis. For example, here's the factorial calculation:

function fact (n)
  if n == 0 then
    return 1
  else
    return n * fact(n-1)
  end
end

print("enter a number:")
a = io.read("*number")
print(fact(a))

We can get the following tree:

We think it's probably closer to the parse tree regarding of classification. We can stop here and start working with it.

But we won't do that and convert it to AST. Why? It'd be easier for us, as the Java team, to work with a tree that's similar to the one in our analyzer. You can read more about Java development here. The sample class hierarchy from Spoon looks like this:

Well, enough for postponing manual work, it's time to write code. We don't fully show the whole code—it makes no sense because it's large and unsightly. To give you a little insight into the way of our thinking, I'll leave a few code snippets under a spoiler for interested folks.

Converting one tree to another open icon

We start handling from the tree top:

public void enterStart_(LuaParser.Start_Context ctx) {
    _file = new CtFileImpl();
    _file.setBlocks(new ArrayList<>());
    for (var chunk : ctx.children) {
        if (chunk instanceof LuaParser.ChunkContext) {
            CtBlock block = getChunk((LuaParser.ChunkContext)chunk);
            if (block == null)
                continue;

            block.setParent(_file);
            _file.getBlocks().add(block);
        }
    }
}

private CtBlock getChunk(LuaParser.ChunkContext ctx) {
    for (var block : ctx.children) {
        if (block instanceof LuaParser.BlockContext) {
            return getBlock((LuaParser.BlockContext)block);
        }
    }
    return  null;
}

private CtBlock getBlock(LuaParser.BlockContext ctx) {
    CtBlock block = new CtBlockImpl();
    block.setLine(ctx.start.getLine());
    block.setColumn(ctx.start.getCharPositionInLine());
    block.setStatements(new ArrayList<>());

    for (var statement : ctx.children) {
        if (statement instanceof LuaParser.StatContext) {
            var statements = getStatement((LuaParser.StatContext)statement);
            for (var ctStatement : statements) {
                ctStatement.setParent(block);
                block.getStatements().add(ctStatement);
            }
        }
    }
    return block;
}

We simply go through the tree from top to bottom, building our tree along the way. Sooner or later, we'll reach the terminal nodes. Here are function parameters:

private List<CtParameter> parseParameters(
    LuaParser.NamelistContext ctx,
    CtElement parent
) {
    var parameters = new ArrayList<CtParameter>();
    for (var child : ctx.children) {
        if (Objects.equals(child.toString(), ","))
            continue;

        var parameter = new CtParameterImpl();
        parameter.setParameterName(child.toString());
        parameter.setParent(parent);
        parameters.add(parameter);
    }
    return parameters;
}

It also doesn't seem to be very challenging—we just turn one object into another. Let's wrap the code listing up here as well. We hope that the way it works is clear.

Frankly, this approach doesn't lead to big gains in the long run. Instead of converting text to the tree, we convert one tree to another. Both tasks are rather tedious. What options do we have?

We can write our lexer and parser from scratch. That's good, but not when we're limited by deadlines and skills.
We can configure the ANTLR to output the desired AST immediately. Sounds too good to be true, but we still need to study ANTLR, which would also be a significant waste of time.
Quick time to market solution. We have to work with what we have and convert the resulting tree into the target tree. Not good, but it's endurable.
We won't convert at all. If there weren't any Java analyzer developers on the team, we'd have done so.

The section would clearly be incomplete if we didn't bring the example of code analysis for our AST. You can examine the pretty print of the tree for the same factorial calculation under the spoiler.

AST to calculate the factorial open icon

CtGlobal:
  CtFile:
    CtFunction: func
      Parameters:
        CtParameter: n
      CtBlock:
        CtIf:
          Condition:
            CtBinaryOperator: Equals
              Left:
                CtVariableRead: n
              Right:
                CtLiteral: 0
          Then block
            CtBlock:
              CtReturn:
                CtLiteral: 1    
          Else block
            CtBlock:
              CtReturn:
                CtBinaryOperator: Multiplication
                  Left:
                    CtVariableRead: n
                  Right:
                    CtInvocation: fact
                      Arguments:
                        CtBinaryOperator: Minus
                          Left:
                            CtVariableRead: n
                          Right:
                            CtLiteral: 1
    CtInvocation: print
      Arguments:
        CtLiteral: "enter a number:"
    CtAssignment:
      Left:
        CtVariableWrite: a
      Right:
        CtInvocation:
          CtFieldRead: 
            Target: io
            Field: read
          Arguments:
            CtParameter: "*number"
    CtInvocation: print
      Arguments:
        CtInvocation: fact
          Arguments:
            CtVariableRead: a

Let's finish here with a quick comment about the terminology. Although it'd be better to call our entity as a tree translator, let's just call it a parser for now.

Visitor

Next, we'll develop the feature that performs the analysis and uses to build diagnostic rules. It's a tree traversal. Overall, it's easy to grasp how to implement tree iteration. However, we need to do some useful stuff with the tree nodes. The Visitor pattern is on the stage.

Do you remember that riveting pattern from the classic GoF book that has a cool implementation but a rather vague usage scenario? So, we have a benchmark case of how it's used in real circumstances. To keep the article concise, I won't show how it's implemented in the book, but I'll show how we do it in the analyzer.

Let's start simple, the tree traversal. We define the CtScanner class and add two scan methods for a single item and their collection.

public <T extends CtElement> void scan(T element) {
    if (element != null) {
        element.accept(this);
    }
}

public <T extends CtElement> void scan(List<T> elements) {
    if (elements != null) {
        for (var element : elements) {
            scan(element);
        }
    }
}

Do you see this accept from CtElement? In our case, any class that inherits the CtVisitable interface is to implement the void accept(CtAbstractVisitor visitor) method. We'll talk about CtAbstractVisitor a bit later. Now, it's enough to know that CtScanner is inherited from it.

This is how accept looks like in CtAssignment:

@Override
public void accept(CtAbstractVisitor visitor){
    visitor.visitCtAssignment(this);
}

Yep, it's a piece of cake. Nodes call their method in the visitor. In our CtScanner there should be the method for each class of the tree node:

@Override
public void visitCtIf(CtIf ctIf) {
    scan(ctIf.getCondition());
    scan((CtStatement) ctIf.getThenStatement());
    scan((CtStatement) ctIf.getElseStatement());
}

@Override
public <T> void visitCtLiteral(CtLiteral<T> literal) {
}
@Override
public void visitCtStatement(CtStatement statement) {
}

// ....

Now let's get back to CtAbstractVisitor, an interface extracted from our CtScanner. It includes methods for visiting tree nodes—but only these methods, no scan methods. In the implementation of the visitor methods, we either leave a placeholder for future overloads—if it's the terminal node—or we continue to unfold the tree nodes by performing recursive descent along it. That's all we need to know to continue.

Semantic parsing

Introduction

It'd seem that's all with the core. For example, now our analyzer can catch simple errors like the variable self-assignment. We call it the signature analysis. To provide more advanced analysis, we'd like to obtain more data about what's going on in the code. The parser has done its job, it's time to create new entities.

A quick note on the compiler design open icon

So far, we've completely followed the way of compilers regarding of the analyzer structure, but now our path diverges. If Lua had a compiler, it'd start analysis to ensure that the code is syntactically correct. Although our tools will continue to overlap in features.

For our needs, let's create SemanticResolver for two purposes:

define the variable scope;
evaluate the variable type based on the duck typing.

However, we'll call it from the parser along with its traversal. No decorator this time.

To keep the app structure seamless, we'll contain all the semantic data in the same AST nodes. First, let's define the necessary properties in the CtVariableAccess interface:

CtBlock getScope();
void setScope(CtBlock block);
    
TypeKind getTypeKind();
void setTypeKind(TypeKind type);

Variable scope

Let's start with scope; this will be our key tool for defining the variable along with its name. First, we define the variable entity inside SemanticResolver. To keep it short, we'll show you only the interface, but the gist should be clear:

public static class Variable {
    public Variable(String identifier);
    public String getIdentifier();
    public CtBlock getScope();
    public void setScope(CtBlock block);
    public void setType(TypeKind type);
    public TypeKind getType();
    // Methods use only identifier
    @Override
    public boolean equals(Object o);
    @Override
    public int hashCode();
}

Let's also further define the stack for variable scopes:

private final Stack<Pair<CtBlock, HashSet<Variable>>> stack = new Stack<>();
private final CtGlobal global;
public SemanticResolver(CtGlobal global) {
    pushStack(global);
    this.global = stack.peek();
}

public void pushStack(CtBlock block) {
    stack.push(Pair.of(block, new HashSet<>()));
}
public void popStack() {
    stack.pop();
}

The stack entry consists of a tuple of scope and variables set registered there. The stack operation is mundane. Here's how it works in the parser:

private CtBlock getBlock(LuaParser.BlockContext ctx) {
    CtBlock block = new CtBlockImpl();
    resolver.pushStack(block);

    // ....
    
    resolver.popStack();
    return block;
}

We have to register the variables somehow. If a variable is local, it's simple—let's take the current scope and pass the variable there:

public CtBlock registerLocal(Variable variable) {
    var scope = stack.pop();
    variable.setScope(scope.getLeft());
    scope.getRight().add(variable);
    stack.push(scope);
    return scope.getLeft();
}

If the local keyword isn't used, the variable will be either global or be declared somewhere above. So, we first go through the stack and double-check if it exists:

public CtBlock registerUndefined(Variable variable) {
    var pair = lookupPair(variable);
    pair.getRight().add(variable);
    return pair.getLeft();
}

public Pair<CtBlock, HashSet<Variable>> lookupPair(Variable variable) {
    var buf = new Stack<Pair<CtBlock, HashSet<Variable>>>();
    Pair<CtBlock, HashSet<Variable>> res = null;
    while (!stack.isEmpty()) {
        var scope = stack.pop();
        buf.push(scope);
        if (scope.getRight().contains(variable)) {
            res = scope;
            break;
        }
    }

    while (!buf.isEmpty()) {
        stack.push(buf.pop());
    }

    if (res == null) {
        return global;
    }

    return res;
}

We can set the scope to variables when we have the entry:

private CtVariableWrite getVariableWriteInternal(
        ParseTree ctx,
        boolean isLocal
) {
    var node = new CtVariableWriteImpl();
    node.setVariableName(ctx.getChild(0).toString());
    CtBlock scope;
    if (isLocal) {
        scope = resolver.registerLocal(
             new SemanticResolver.Variable(node.getVariableName()));
    } else {
        scope = resolver.registerUndefined(
            new SemanticResolver.Variable(node.getVariableName()));
    }
    node.setScope(scope);
    return node;
}

And defining it for readings:

private CtExpression getExpression(LuaParser.ExpContext ctx) {
    // ....
    if (child instanceof LuaParser.PrefixexpContext) {
        // ....
        var scope = resolver.lookupScope(
            new SemanticResolver.Variable(variableRead.getVariableName())
        );
        variableRead.setScope(scope);

        return variableRead;
    }
    // ....
}

We won't show the lookupScope code. It's a single-line wrapper over lookupPair, which we can see above. Here, we can end with the variable scope. We'll check the mechanism in the diagnostic rule on a separate section. Now we continue working on the semantic parsing. Now, let's move to the variable type determination.

Duck typing

How to determine the variable type? Indeed, we obtain them from the literals. Let's further define the type and the enumeration for them:

public interface CtLiteral<T> extends CtExpression, CtVisitable {
  // ....
  void setTypeKind(TypeKind kind);
  TypeKind getTypeKind();
}

public enum TypeKind {
    Undefined,
    Number,
    String,
    Boolean,
    Nil
}

Thus, the data type can be numeric, string, logical, and nil. However, it'll be undefined by default. The split of undefined and nil may seem far-fetched, but it's okay for the pilot.

We store the literal type only in the tree node, doing it from the parser:

private <T> CtLiteralImpl<T> createLiteral(
    // ....
    TypeKind type,
) {
    // ....

    literal.setTypeKind(type);
    return literal;
}

However, the variable type will be both in the tree and in SemanticResolver. So, we can request it during further traversal and the AST building:

private ArrayList<CtAssignment> parseAssignments(LuaParser.StatContext ctx) {
    // ....
    for (int i = 0; i < variables.size(); i++) {
        var assignment = new CtAssignmentImpl();
        var variable = variables.get(i);
        
        // ....

        
        variable.setTypeKind(resolver.lookupType(variable.getVariableName()));
        resolver.setType(
            variable.getVariableName(),
            variable.getScope(),
            SemanticResolver.evaluateExpressionType(expression)
        );
    }
    return assignments;
}

There is no error in the operator precedence. Let the stored variable have the type from its past assignment. It'll facilitate our work in the future. As for the methods used here, there's nothing incredible about the lookupType implementation—it's basically the same as lookupPair. There's nothing complex about setType:

public void setType(String variable, CtBlock scope, TypeKind type) {
    var opt = stack.stream()
             .filter(x -> Objects.equals(x.getLeft(), scope))
             .findFirst();
    if (opt.isPresent()) {
        var pair = opt.get();
        var newVar = new Variable(variable);
        var meta = pair.getRight()
                       .stream()
                       .filter(x -> x.equals(newVar))
                       .findFirst();
        meta.ifPresent(value -> value.setType(type));
    }
}

However, evaluateExpressionType is trickier. Computing variable types in dynamic languages can be a bit of a hassle. Just look at the jokes about JavaScript and string concatenation. However, firstly, Lua has a separate operator '..', and secondly, we're trying to keep the process simple. So, we'll only determine whether all the operands are the same type. We'll use the familiar CtScanner.

public static TypeKind evaluateExpressionType(CtExpression expression) {
    Mutable<TypeKind> type = new MutableObject<>(null);
    var typeEvaluator = new CtScanner() {
        private boolean stop = false;
        @Override
        public void scan(CtElement el) {
            if (stop) { return; }

            if (el instanceof CtVariableRead || el instanceof CtLiteral<?>) {
                var newType = el instanceof CtVariableRead
                              ? ((CtVariableRead) el).getTypeKind()
                              : ((CtLiteral<?>) el).getTypeKind();

                if (newType.equals(TypeKind.Undefined)) {
                    type.setValue(TypeKind.Undefined);
                    stop = true;
                    return;
                } else if (type.getValue() == null) {
                    type.setValue(newType);
                } else if (!type.getValue().equals(newType)) {
                    type.setValue(TypeKind.Undefined);
                    stop = true;
                    return;
                }
            }
            super.scan(el);
        }
    };
    typeEvaluator.scan(expression);
    return type.getValue();
}

In parseAssignments, we've set the assignment type to the (CtVariableWrite) variable but forgot about reading (CtVariableRead). Let's fix it:

private CtExpression getExpression(LuaParser.ExpContext ctx) {
    // ....
    if (child instanceof LuaParser.PrefixexpContext) {
        // ....
        variableRead.setTypeKind(
            resolver.lookupType(variableRead.getVariableName())
        );
        var scope = resolver.lookupScope(
            new SemanticResolver.Variable(variableRead.getVariableName()));
        variableRead.setScope(scope);

        return variableRead;
    }
    // ....
}

We've completed the semantic analysis and almost ready to start searching for bugs.

Data-flow analysis

Inside structure

First, we make two quick stops. While the topic of the data-flow analysis deserves a series of articles, it'd be wrong to skip over it. Here, we won't dive deep into theory but just try to memorize the values set by literals.

First, let's fall into the sin of self-copying and define the variable entity for DataFlow again but in a simpler way. Again, we'll only show you the interface:

private static class Variable {
    private Variable(String identifier, CtBlock scope);
    // Methods use identifier and scope
    @Override
    public boolean equals(Object o);
    @Override
    public int hashCode();
}

Here's the rest of the class content:

public class DataFlow {
    private static class Variable {
        // ....
    }
    Map<Variable, Object> variableCache = new HashMap<>();

    public void scanDataFlow(CtElement element) {
        if (element instanceof CtAssignment) {
            CtAssignment variableWrite = (CtAssignment) element;
            if (variableWrite.getAssignment() instanceof CtLiteral<?>) {
                var assigned = variableWrite.getAssigned();
                var variable = new Variable(
                    assigned.getVariableName(),
                    assigned.getScope()
                );
                variableCache.put(
                    variable, 
                    getValue(variableWrite.getAssignment())
                );
            }
        }
    }

    public Object getValue(CtExpression expression) {
        if (expression instanceof CtVariableRead) {
            CtVariableRead variableRead = (CtVariableRead) expression;
            var variable = new Variable(
                variableRead.getVariableName(),
                variableRead.getScope()
            );
            return variableCache.getOrDefault(variable, null);
        } else if (expression instanceof CtLiteral<?>) {
            return ((CtLiteral<?>) expression).getValue();
        }
        return null;
    }
}

It's quite simple: in scanDataFlow, we associate the value with the variable, and in getValue, we extract that value for the set node. Everything is simple because we don't factor in branching, loops, or even expressions. Why don't we factor them in? Branching is the very topic that deserves its own series of articles. What about expressions? Well, we didn't make it in two days. Given what we've achieved in the last two days, we think this task is feasible. We just postpone it as a homework.

That's all. It's clear that such a solution is far from a real product, but we have laid some foundation. Then we can either try to enhance the code—and we'll introduce data-flow analysis via AST. Or we can redo everything properly and build the control-flow graph.

We've done the class implementation, but we haven't discussed how to use it. We've written the class, but what shall we do with it? Let's talk about it in the following section. Now we just say that DataFlow works right before calling the diagnostic rules. Those are called when traversing the finished AST. Thus, the rules will have access to the current variable values. This is called Environment; you can see it in your debugger.

Walker

Welcome to the last section regarding the core. We've already had the AST that is full of semantic data, as well as the data-flow analysis that is just waiting to be run. It's a good time to put it all together and set the stage for our diagnostic rules.

How is the analysis performed? It's simple: we get on the topmost tree node, and then we start recursive traversal. That is, we need something that will traverse the tree. We have CtScanner for that. Based on it, we define MunInvoker:

public class MunInvoker extends CtScanner {
    private final List<MunRule> rules = new ArrayList<>();
    private final Analyzer analyzer;

    public MunInvoker(Analyzer analyzer) {
        this.analyzer = analyzer;
        rules.add(new M6005(analyzer));
        rules.add(new M6020(analyzer));
        rules.add(new M7000(analyzer));
        rules.add(new M7001(analyzer));
    }

    @Override
    public <T extends CtElement> void scan(T element) {
        if (element != null) {
            analyzer.getDataFlow().scanDataFlow(element);
            rules.forEach(element::accept);
            super.scan(element);
        }
    }
}

You can notice a few unknown things in the code:

the Analyzer class. It encapsulates the whole analysis process and contains shared resources that need to be accessed within the rules. In our case, this is the DataFlow instance. We'll get back to Analyzer later;
four confusing classes that are added to rules. We'll talk about them in the next section, so don't panic. A little spoiler for you :)

Otherwise, the class operation shouldn't raise any questions. Each time we enter any tree node, the analyzer rule is called, and the variable values are evaluated right before it. Next, the traversal continues in line with to the CtScanner algorithm.

Analysis

Preparing for writing diagnostic rules

Rule class

So, we have the analyzer core prototype—it's a good time to start analyzing something.

The base for our rules is ready—it's the CtAbstractVisitor class. The analysis goes as follows: the rule overloads a few visitors and analyzes the data contained in the AST nodes. Let's extend CtAbstractVisitor with the abstract MunRule class, which we use to create rules. In this class, we also define the addRule method that generates warnings.

Speaking of warnings: what data do they need? First is a warning message to demonstrate users what they may have been mistaken about. Second, the user needs to know where and what the analyzer warns. So, let's add data about the file where the analyzer has detected the troublesome code block and the location of this code fragment.

Here's what the MunRule class looks like:

public abstract class MunRule extends CtAbstractVisitor {
    private Analyzer analyzer;
    public void MunRule(Analyzer analyzer) {
        this.analyzer = analyzer;
    }
    protected Analyzer getAnalyzer() {
        return analyzer;
    }

    protected void addRule(String message, CtElement element) {
        var warning = new Warning();
        warning.message = message;

        WarningPosition pos = new WarningPosition(
            Analyzer.getFile(), 
            element.getLine(), 
            element.getColumn() + 1
        );
        warning.positions.add(pos);
        analyzer.addWarning(warning);
    }

    public DataFlow getDataFlow() {
        return analyzer.getDataFlow();
    }
}

The WarningPosition and Warning classes are just data stores, so we won't list them. We'll talk about addWarning now.

Merging it together

The last thing to prepare is how we're going to view the diagnostic rules. To do this, we combine all our features together, using the already mentioned Analyzer class. Actually, here it is:

public class Analyzer {
    private DataFlow dataFlow = new DataFlow();

    public DataFlow getDataFlow() {
        return dataFlow;
    }

    public CtElement getAst(String pathToFile) throws IOException {
        InputStream inputStream = new FileInputStream(pathToFile);
        Lexer lexer = new LuaLexer(CharStreams.fromStream(inputStream));

        ParseTreeWalker walker = new ParseTreeWalker();
        var listener = new LuaAstParser();
        walker.walk(listener, new LuaParser(
            new CommonTokenStream(lexer)
        ).start_());
        return listener.getFile();
    }

    protected void addWarning(Warning warning) {
        Main.logger.info(
            "WARNING: " + warning.code + " "
            + warning.message + " ("
            + warning.positions.get(0).line + ", "
            + warning.positions.get(0).column + ")");
      }      

    public MunInvoker getMunInvoker() {
        return new MunInvoker(this);
    }

    public void analyze(String pathToFile) {
        try {
            var top = getAst(pathToFile);
            var invoker = getMunInvoker();
            invoker.scan(top);
        }
        catch (IOException ex) {
            Main.logger.error("IO error: " + ex.getMessage());
        }
    }
}

To explain how it works, we'll give you an example of the whole analysis process:

In the getAst method, we build our AST using the scheme lexer —> parser —> tree translator;
Then we call MunIvoker, which traverses the tree and calls our diagnostic rules along with the data-flow analysis;
If necessary, the rules call the Analyzer class to get the DataFlow instance;
They call addWarning when the analyzer spots suspicious code fragment. That one just outputs the message to the log.

We've got the prep work—it's time to start writing diagnostic rules.

Writing diagnostic rules

Assigning a variable to itself

We've decided to start writing the rules with a simple one: PVS-Studio has the Java diagnostic rule, V6005, where the analyzer check if a variable is assigned to itself. It can simply be copied and slightly adapted to our tree. Since our analyzer is called Mun, we start the numbers of our diagnostic rules with M. Let's create the M6005 class, extending MunRule, and override the visitCtAssignment method in it. The following check will be located in the method:

public class M6005 extends MunRule {
    private void addRule(CtVariableAccess variable) {
        addRule("The variable is assigned to itself.", variable);
    }
    @Override
    public void visitCtAssignment(CtAssignment assignment) {
        if (RulesUtils.equals(assignment.getAssigned(), 
            assignment.getAssignment())) {
            addRule(assignment.getAssigned());
        }
    }
}

The used RulesUtils.equals method is a wrapper and overload for another equals method that checks the name and scope:

public static boolean equals(CtVariableAccess left, CtVariableAccess right) {
    return left.getVariableName().equals(right.getVariableName())
           && left.getScope().equals(right.getScope());
}

We need to check the scope, because the next code fragment isn't an assignment of the variable to itself:

local a = 5;
begin
    local a = a;
end

Now, we can test the diagnostic rule on some simple code and see if it works. The following example will issue the warning on the marked line: "M6005 The variable is assigned to itself":

local a = 5;
local b = 3;

if (b > a) then
    a = a;      <=
end

Zero division

Well, we've warmed up, so let's move on. The analyzer has already had the primitive data-flow analysis (DataFlow) that we can and should use. Again, let's look at one of our existing diagnostic rules, V6020, where the analyzer checks for the zero division. We try to adapt it for our analyzer. The divisor can be either the variable as zero or a literal as zero. We need to access the cache of variables to check their value.

Here's the simple implementation of such a diagnostic rule:

public class M6020 extends MunRule {
  private void addWarning(CtElement expression, String opText) {
    addRule(String.format(
            "%s by zero. Denominator '%s' == 0.",
            opText, expression instanceof CtLiteral
                ? ((CtLiteral) expression).getValue()
                : ((CtVariableRead) expression).getVariableName()
        ),
        expression
    );
  }

  @Override
  public void visitCtBinaryOperator(CtBinaryOperator operator) {
    BinaryOperatorKind opKind = operator.getKind();
    if (opKind != BinaryOperatorKind.DIV && opKind != BinaryOperatorKind.MOD) {
      return;
    }

    apply(operator.getRightHandOperand(), opKind == BinaryOperatorKind.MOD);
  }


  private void apply(CtExpression expr, boolean isMod) {
    Object variable = getDataFlow().getValue(expr);
    if (variable instanceof Integer) {
      if ((Integer) variable == 0) {
        String opText = isMod ? "Mod" : "Divide";
        addWarning(expr, opText);
      }
    }
  }
}

We can see that the diagnostic rule works on simple examples and issues the following warning: "M6020 Divide by zero. Denominator 'b' == 0" on these lines:

local a = 5;
local b = 0;
local c = a / b;   <=
local d = a / 0;   <=

If you have made the expression evaluator as your homework, you may try the diagnostic rule on this code:

local b = 7;
local b = b - 7;
local c = a / b;

Overwritten types

Since we're writing the Lua analyzer, we need to write diagnostic rules for this language. Let's start simple.

Lua is a dynamic scripting language. Let's use this feature and write the diagnostic rule that allows us to catch overwritten types.

We'll also need to pick a new number for the diagnostic rule. Earlier we just copied them from the Java analyzer. Now, it seems like it's time to start a new thousand—the seventh. Who knows what diagnostic rule it'll be in PVS-Studio, but at the time of writing this article, it will be Lua.

Knowing about types, we facilitate our case: we need to check that the left and right assignments are of different types. Note that we ignore Undefined for the left part and Nil for the right part. The code looks like this:

public class M7000 extends MunRule {
    @Override
    public void visitCtAssignment(CtAssignment assignment) {
        var assigned = assignment.getAssigned();
        var exprType = SemanticResolver.evaluateExpressionType(
            assignment.getAssignment());
        if (assigned.getTypeKind().equals(TypeKind.Undefined)
            || exprType.equals(TypeKind.Nil)
        ) {
            return;
        }

        if (!assigned.getTypeKind().equals(exprType)) {
            addRule(
                String.format(
                    "Type of the variable %s is overridden from %s to %s.",
                    assigned.getTypeKind().toString(),
                    exprType.toString()
                assigned,
            );
        }
    }
}

It's time to check the diagnostic rule in real cases. In the following example, our analyzer issues the warning only on the last line:

local a = "string";

if (true) then
  local a = 5;
end

a = 5;     <=

The analyzer warning: "M7000 Type of the variable a is overridden from Integer to String".

Lost local

Let's finish smoothly. The Lua plugin for VS Code has a diagnostic rule that detects global lowercase variables. This check can help detect the forgotten local identifiers. Let's implement the same diagnostic rule in our analyzer.

Here, just as before, we'll need to use the variable scope data that is obtained via the semantic analysis. We just find where the variable was declared. That's also where the variable is assigned a value. Then check its scope and name. If the variable is global and starts with a lowercase, the analyzer will warn. Easy-peasy.

Let's create a new class and override the visitCtAssignment method in it again. That way, we can look for the problematic global variables:

public class M7001 extends MunRule {
    @Override
    public void visitCtAssignment(CtAssignment assignment) {
        var variable = assignment.getAssigned();
        var firstLetter = variable.getVariableName().substring(0, 1);
    
        if (variable.getScope() instanceof CtGlobal && 
            !firstLetter.equals(firstLetter.toUpperCase())) {
            addRule("Global variable in lowercase initial.", variable);
        }
    }
}

Let's check the diagnostic rule work. It issues the warning: "M7001 Global variable in lowercase initial." It's issued on the second line in the code snippet below:

function sum_numbers(b, c)
    a = b + c;  <=
    return a;
end

local a = sum_numbers(10, 5);

Alright, we've written the diagnostic rules, we've done with the analyzer (and it even works). Breathe out. Now we can enjoy the fruits of our labors.

Viewing warnings

Above we've already shown the code that allows us to view warnings in a console or file. Here is how the process of working with the analyzer looks like in the console. We run the analyzer using the command:

java -jar mun-analyzer.jar -input "C:\munproject\test.lua"

And we get:

However, first, the analyzer is a tool, and it should be user-friendly. Instead of working with the console, it'd be much more convenient to work with a plugin.

It's time to borrow again. PVS-Studio has already had plugins for many different IDEs, for example, for Visual Studio Code and IntelliJ IDEA. You can view warnings and navigate through them. Since analyzer reports have standardized formats, we can simply borrow the JSON report generation algorithm from our Java analyzer. The algorithm is extensive and dull, so we won't show it. We still have to run it from the command line but with the argument -output "D:git/MunProject/report.json".

Then, we can open the report in IntelliJ IDEA or VS Code and look at the Lua analyzer warnings:

Sweet! Now we can use the analyzer for its full-intended purpose without sacrificing usability.

Ad Astra

So, have we written the full-fledged analyzer? Uh, not exactly. At the end of this long journey, we have the most real pilot, going through all the stages. However, the scope for growth is enormous:

Core enhancements:
- enhance the data-flow analysis;
- consider the control flows;
- add interprocedural and intermodular analysis. You can read how we did it in C++ here;
- add the annotation mechanism to help enhance the data-flow analysis and the duck typing;
- provide more semantic data;
- fine-tune the existing mechanisms;
- and don't forget about a better parser.
Enhancements with diagnostic rules:
- enhance the existing ones;
- write more new ones;
- write more complex diagnostic rules beyond the couple dozen lines.
Analyzer usability enhancements:
- create a proper plugin support;
- integrate into CI/CD.
Unit tests and regression tests to check the diagnostic rule performance in their development and modification.

And much, much more. In other words, the path from pilot to full-fledged tool is quite thorny. So, PVS-Studio focuses on the existing directions: C#, C, C++, and Java instead of new ones. If you write in any of these languages, you may try out our analyzer.

Epilogue

The article ended up being a lot longer than we thought it would be, so please leave a comment if you got to the end :) We'd love to get your feedback.

If you're interested in the topic, you can read about developing analyzers for our languages:

If you enjoyed this article, you can subscribe to the personal blog of one of the article authors as well as our corporate digest to get the newsletter.

#Knowledge #StaticAnalysis #Java

379

Tags:

#Knowledge #StaticAnalysis #Java

How to develop code analyzer in 48 hours

Small foreword

What's an analyzer?

Core

Lexer and parser

Theory

Chosen approach

Abstract Syntax Tree

Visitor

Semantic parsing

Introduction

Variable scope

Duck typing

Data-flow analysis

Inside structure

Walker

Analysis

Preparing for writing diagnostic rules

Rule class

Merging it together

Writing diagnostic rules

Assigning a variable to itself

Zero division

Overwritten types

Lost local

Viewing warnings

Ad Astra

Epilogue

Posts: articles

Poll:

Comments (0)

Fill out the form in 2 simple steps below:

Your contact information:

Desired license type:

Want to try PVS‑Studio for free?

Achievements

PVS-Studio

Licensing

Company