Build Your Own (Simple) Static Code Analyzer



Stefanie Molin

Bio

  • 👩🏻‍💻 Software engineer at Bloomberg in NYC
  • ✨ Core developer of numpydoc and creator of numpydoc's pre-commit hook, which uses static code analysis
  • ✍ Author of "Hands-On Data Analysis with Pandas"
  • 🎓 Bachelor's degree in operations research from Columbia University
  • 🎓 Master's degree in computer science from Georgia Tech

What makes a tool a static code analyzer?

It analyzes source code without running it.

What are the main benefits of static code analysis?

  • Speed – can be much faster than dynamic code analysis
  • Portable – no need to install the codebase being analyzed or its dependencies

How do you build a static code analyzer?

It depends...

Abstract Syntax Trees (ASTs) are a good place to start.

Abstract Syntax Tree (AST)

  • Represents the structure of the source code as a tree
  • Nodes in the tree are language constructs (e.g., module, class, function)
  • Each node has a single parent (e.g., a class is a child of a single module)
  • Parent nodes can have multiple children (e.g., a class can have several methods)

Let's see what this code snippet (greet.py) looks like when represented as an AST:

class Greeter:
    def __init__(self, enthusiasm: int = 1) -> None:
        self.enthusiasm = enthusiasm

    def greet(self, name: str = 'World') -> str:
        return f'Hello, {name}{"!" * self.enthusiasm}'
The AST for greet.py visualized with Graphviz.

ASTs in Python

  • Represent syntactically correct Python code (cannot be generated in the presence of syntax errors)
  • Created by the parser as an intermediary step when compiling source code into bytecode (necessary to run it)
  • Available in the standard library via the ast module
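The compiler connection is easy to see for yourself: ast.parse() is documented as equivalent to calling the compile() builtin with the ast.PyCF_ONLY_AST flag, which stops after the parsing step and returns the AST instead of a code object:

```python
import ast

source = 'print("hi")'

# compile() with ast.PyCF_ONLY_AST stops after parsing and hands back the AST
# rather than continuing on to byte code.
tree = compile(source, '<string>', mode='exec', flags=ast.PyCF_ONLY_AST)

print(type(tree))  # <class 'ast.Module'>
print(ast.dump(tree) == ast.dump(ast.parse(source)))  # True
```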

Parsing Python source code into an AST

1. Read in the source code

>>> from pathlib import Path
>>> source_code = Path('greet.py').read_text()

2. Parse it with the ast module

If the code is syntactically correct, we get an AST back:

>>> import ast
>>> tree = ast.parse(source_code)
>>> print(type(tree))
<class 'ast.Module'>

Inspecting the AST

Use ast.dump() to display the AST:

The root node is an ast.Module node:

It contains everything else in its body attribute:

The greet.py file first defines a class, named Greeter:

The ast.ClassDef node also contains the body of the Greeter class:

The first entry is the Greeter.__init__() method:

The ast.FunctionDef node includes information about the arguments:

Its body contains the AST representation of the function's code:

The return annotation is stored in the returns attribute:

The final entry is the Greeter.greet() method:

    
>>> print(ast.dump(tree, indent=2))
Module(
  body=[
    ClassDef(
      name='Greeter',
      body=[
        FunctionDef(
          name='__init__',
          args=arguments(
            args=[
              arg(arg='self'),
              arg(
                arg='enthusiasm',
                annotation=Name(id='int', ctx=Load()))],
            defaults=[
              Constant(value=1)]),
          body=[
            Assign(
              targets=[
                Attribute(
                  value=Name(id='self', ctx=Load()),
                  attr='enthusiasm',
                  ctx=Store())],
              value=Name(id='enthusiasm', ctx=Load()))],
          returns=Constant(value=None)),
        FunctionDef(
          name='greet',
          args=arguments(
            args=[
              arg(arg='self'),
              arg(
                arg='name',
                annotation=Name(id='str', ctx=Load()))],
            defaults=[
              Constant(value='World')]),
          body=[
            Return(
              value=JoinedStr(
                values=[
                  Constant(value='Hello, '),
                  FormattedValue(
                    value=Name(id='name', ctx=Load()),
                    conversion=-1),
                  FormattedValue(
                    value=BinOp(
                      left=Constant(value='!'),
                      op=Mult(),
                      right=Attribute(
                        value=Name(id='self', ctx=Load()),
                        attr='enthusiasm',
                        ctx=Load())),
                    conversion=-1)]))],
          returns=Name(id='str', ctx=Load()))])])

Popular open source tools that use ASTs

  • Linters and formatters, like ruff (Rust) and black (Python)
  • Documentation tools, like sphinx and the numpydoc-validation pre-commit hook
  • Automatic Python syntax upgrade tools, like pyupgrade
  • Type checkers, like mypy
  • Code security tools, like bandit
  • Dead code and test coverage tools, like vulture and coverage.py
  • Testing frameworks that instrument your code or generate tests based on it, like hypothesis and pytest

Let's build a simple static code analyzer

To learn how to use the AST, we will build a tool that does the following:

  • Finds missing docstrings and suggests templates based on the code itself
  • Uses only the Python standard library

Tools may exist that do this already, but the point is to learn how to use the AST.

The input

We will analyze a single file, greet.py, for time and space considerations:

class Greeter:
    def __init__(self, enthusiasm: int = 1) -> None:
        self.enthusiasm = enthusiasm

    def greet(self, name: str = 'World') -> str:
        return f'Hello, {name}{"!" * self.enthusiasm}'

Is static code analysis really necessary here?

While we are working with one file here, most codebases will be much larger. How could we approach this scalably?

  1. Manually (open each file and edit) – tedious and error prone
  2. Regular expressions – messy and hard to get right (edge cases, context, etc.)
  3. Script to import everything and check docstrings – must be able to install codebase and its dependencies; slow
  4. Static code analysis – analyzing code without executing it means we can use this on any of our codebases
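To see why option 2 gets messy, here is a quick sketch (the pattern and example functions are hypothetical) of a naive regex that flags a def whose body doesn't open with a string literal. It happens to work here, but it already misses async defs, multi-line signatures, comments between the def line and the docstring, and more:

```python
import re

source = '''
def documented():
    """I have a docstring."""

def undocumented():
    pass
'''

# Flag any `def` where the next line does not start with a quote character.
# Handling async defs, decorators, multi-line signatures, intervening
# comments, etc. would require ever-hairier patterns -- hence "messy".
pattern = re.compile(r'def (\w+)\([^)]*\):\n(?!\s*["\'])')
print(pattern.findall(source))  # ['undocumented']
```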

Important disclaimer before we dive in

Docstrings have been omitted from all code snippets for space 😂

Fear not – we are building a tool to fix that!

Detecting missing docstrings using the Python AST

We need to traverse the full AST (to account for nested functions and classes) and inspect each node's docstring:

Only ast.Module, ast.ClassDef, ast.FunctionDef, and ast.AsyncFunctionDef nodes can have docstrings:

If there is one, ast.get_docstring(node) returns the docstring of node; otherwise, it returns None:

    
def detect_missing_docstring(
    node: ast.AsyncFunctionDef
    | ast.ClassDef
    | ast.FunctionDef
    | ast.Module
) -> None:
    if ast.get_docstring(node) is None:
        entity = getattr(node, 'name', 'module')
        print(f'{entity} is missing a docstring')
    

In greet.py, we want to call this function on these nodes only:

The nodes in greet.py that can have docstrings, visualized with Graphviz.

Traversing the AST

File structures vary, so we will create a NodeVisitor to ensure we find all missing docstrings:

  1. Subclass ast.NodeVisitor
  2. Create visit_<NodeType>() methods for nodes we are interested in
  3. Instantiate the visitor and call its visit() method

1. Subclass ast.NodeVisitor

class DocstringVisitor(ast.NodeVisitor):
    pass

2. Create visit_<NodeType>() methods for nodes we are interested in

class DocstringVisitor(ast.NodeVisitor):

    def visit_AsyncFunctionDef(
        self, node: ast.AsyncFunctionDef
    ) -> None:
        detect_missing_docstring(node)

    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
        detect_missing_docstring(node)

    def visit_ClassDef(self, node: ast.ClassDef) -> None:
        detect_missing_docstring(node)

    def visit_Module(self, node: ast.Module) -> None:
        detect_missing_docstring(node)

3. Instantiate the visitor and call its visit() method

>>> visitor = DocstringVisitor()
>>> visitor.visit(tree)
module is missing a docstring

What about the missing docstrings for the Greeter class and its methods?

Complete traversal means visiting all fields

We aren't visiting the list of AST nodes in the ast.Module node's body field, so traversal starts and stops there:

The nodes in greet.py that can have docstrings, with their fields and types, visualized with Graphviz.

The generic_visit() method

  • Defined on base class ast.NodeVisitor
  • Visits child nodes by calling visit() on any nodes returned from ast.iter_fields()
  • Called automatically for node types for which we didn't create methods
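The role of generic_visit() is easiest to see with a tiny visitor (a hypothetical NameCollector, for illustration): we only define visit_Name(), so ast.Module, ast.FunctionDef, and every other node type fall through to generic_visit(), which is what carries the traversal down into the function body where the ast.Name nodes live:

```python
import ast

class NameCollector(ast.NodeVisitor):
    """Collect every variable name referenced in the code."""

    def __init__(self) -> None:
        self.names: list[str] = []

    def visit_Name(self, node: ast.Name) -> None:
        self.names.append(node.id)
        self.generic_visit(node)

# Note: the parameter x is an ast.arg node, not an ast.Name node, so only
# the names used inside the function body are collected.
tree = ast.parse('def f(x):\n    return x + y')
collector = NameCollector()
collector.visit(tree)
print(collector.names)  # ['x', 'y']
```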

Modifying the DocstringVisitor

We add the _visit_helper() method, which checks the docstring and then continues the traversal:

Calling generic_visit() on each node for which we check docstrings for ensures we continue the traversal:

Now, we switch to calling _visit_helper() whenever we visit module, class, or function nodes:

    
class DocstringVisitor(ast.NodeVisitor):

    def _visit_helper(
        self,
        node: ast.AsyncFunctionDef
        | ast.ClassDef
        | ast.FunctionDef
        | ast.Module
    ) -> None:
        detect_missing_docstring(node)
        self.generic_visit(node)

    def visit_AsyncFunctionDef(
        self, node: ast.AsyncFunctionDef
    ) -> None:
        self._visit_helper(node)

    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
        self._visit_helper(node)

    def visit_ClassDef(self, node: ast.ClassDef) -> None:
        self._visit_helper(node)

    def visit_Module(self, node: ast.Module) -> None:
        self._visit_helper(node)

Complete traversal achieved 🎉

>>> visitor = DocstringVisitor()
>>> visitor.visit(tree)
module is missing a docstring
Greeter is missing a docstring
__init__ is missing a docstring
greet is missing a docstring

Disambiguating docstring paths

greet could be the greet() method or the greet module, but greet.Greeter.greet can only be one:

greet is missing a docstring

Tracking node ancestry with a stack

From a node, we can access its children, but not its parent.

We can track lineage with a stack:

We internalize the missing docstring check as _detect_missing_docstring():

It uses the stack to print the unambiguous path to the missing docstring:

The _visit_helper() method takes care of pushing onto and popping off of the stack:

We push (append) a node onto the stack before we actually visit it:

We pop the node off the stack after we have visited it and all of its descendants:

    
class DocstringVisitor(ast.NodeVisitor):

    def __init__(self, module_name: str) -> None:
        super().__init__()
        self.stack: list[str] = []
        self.module_name: str = module_name

    def _detect_missing_docstring(
        self,
        node: ast.AsyncFunctionDef
        | ast.ClassDef
        | ast.FunctionDef
        | ast.Module
    ) -> None:
        if ast.get_docstring(node) is None:
            entity = '.'.join(self.stack)
            print(f'{entity} is missing a docstring')

    def _visit_helper(
        self,
        node: ast.AsyncFunctionDef
        | ast.ClassDef
        | ast.FunctionDef
        | ast.Module
    ) -> None:
        self.stack.append(
            getattr(node, 'name', self.module_name)
        )
        self._detect_missing_docstring(node)
        self.generic_visit(node)
        self.stack.pop()

    def visit_AsyncFunctionDef(
        self, node: ast.AsyncFunctionDef
    ) -> None:
        self._visit_helper(node)

    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
        self._visit_helper(node)

    def visit_ClassDef(self, node: ast.ClassDef) -> None:
        self._visit_helper(node)

    def visit_Module(self, node: ast.Module) -> None:
        self._visit_helper(node)

Now, we know exactly where the docstrings are missing:

>>> visitor = DocstringVisitor('greet')
>>> visitor.visit(tree)
greet is missing a docstring
greet.Greeter is missing a docstring
greet.Greeter.__init__ is missing a docstring
greet.Greeter.greet is missing a docstring

Suggesting docstring templates

ast.FunctionDef and ast.AsyncFunctionDef nodes have information that often ends up in the docstring:

  • args: Argument names, types, and defaults
  • returns: Return type annotation (if present)
  • body: AST of the function body, which can be used to infer return types, as well as whether the function raises any exceptions (out of scope)
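A quick look at those fields on a small (hypothetical) function shows what we have to work with:

```python
import ast

# Grab the ast.FunctionDef node for a fully-typed example function.
func = ast.parse(
    'def shout(text: str, times: int = 2) -> str:\n'
    '    return text * times'
).body[0]

print([arg.arg for arg in func.args.args])  # ['text', 'times']
print(func.args.defaults[0].value)          # 2
print(func.returns.id)                      # 'str'
print(type(func.body[0]))                   # <class 'ast.Return'>
```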

For this keynote, we will focus on fully-typed code.

An example using the Greeter.greet() method

class Greeter:
    def __init__(self, enthusiasm: int = 1) -> None:
        self.enthusiasm = enthusiasm

    def greet(self, name: str = 'World') -> str:
        return f'Hello, {name}{"!" * self.enthusiasm}'
The AST of the Greeter.greet() method, with fields, visualized with Graphviz.
The arguments are on the left branch, the function body is in the middle, and the return annotation is on the right branch.

ast.arguments

  • posonlyargs (list[ast.arg]) – positional-only arguments
  • args (list[ast.arg]) – arguments that can be passed positionally or by keyword
  • vararg (ast.arg | None) – *args
  • kwonlyargs (list[ast.arg]) – keyword-only arguments
  • kw_defaults (list[ast.AST | None]) – default values for keyword-only arguments, where None means the argument is required
  • kwarg (ast.arg | None) – **kwargs
  • defaults (list[ast.AST]) – default values for the last n positional arguments
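A (hypothetical) function definition exercising every group shows where each argument lands:

```python
import ast

# a is positional-only (before /); b can be passed either way; c and d are
# keyword-only (after *args); d has a default, so c's slot in kw_defaults
# is None.
arguments = ast.parse(
    'def func(a, /, b, *args, c, d=4, **kwargs): pass'
).body[0].args

print([arg.arg for arg in arguments.posonlyargs])  # ['a']
print([arg.arg for arg in arguments.args])         # ['b']
print(arguments.vararg.arg)                        # 'args'
print([arg.arg for arg in arguments.kwonlyargs])   # ['c', 'd']
print([d if d is None else d.value
       for d in arguments.kw_defaults])            # [None, 4]
print(arguments.kwarg.arg)                         # 'kwargs'
```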

The Greeter.greet() method has two positional arguments, self and name, with the latter having a type of str and a default value of 'World':

arguments(
  args=[
    arg(arg='self'),
    arg(
      arg='name',
      annotation=Name(id='str', ctx=Load()))],
  defaults=[
    Constant(value='World')])

Extracting argument information in a docstring-friendly format

We need argument names, types, and default values for three groups of arguments:

  • positional: posonlyargs, args, and defaults
  • starred: vararg and kwarg
  • keyword-only: kwonlyargs and kw_defaults
Positional arguments

Using a list comprehension, we will process posonlyargs and args together since both of their defaults (if they have them) are stored in defaults:

None can be a default value, so we create a sentinel value to indicate when something has no default:

We use zip_longest to loop over the values because defaults is at most the combined length of posonlyargs and args:

For now, we can exclude any self and cls arguments like this, but it would be more accurate to revisit our stack to check if the function is actually a method:

For each argument, we create a dictionary to store the name, type, and default value for later use:

Due to the structure of defaults, we created the list in reverse, so we flip it before returning it:

    
from itertools import zip_longest


NO_DEFAULT = object()

def _extract_positional_args(
    arguments: ast.arguments
) -> list[dict]:
    return [
        {
            'name': arg.arg,
            'type': getattr(arg.annotation, 'id', '__type__'),
            'default': (
                default.value
                if default is not NO_DEFAULT
                else default
            ),
        }
        for arg, default in zip_longest(
            reversed([*arguments.posonlyargs, *arguments.args]),
            reversed(arguments.defaults),
            fillvalue=NO_DEFAULT,
        )
        if arg.arg not in ['self', 'cls']
    ][::-1]
Example

Including a / in the function definition requires that the arguments preceding it (a, here) be passed by position (e.g., func('?') would work, but func(a='?') would raise an exception):

>>> _extract_positional_args(
...     ast.parse(
...         'def func(a: str, /, b: int = 3): pass'
...     ).body[0].args
... )
[{'name': 'a', 'type': 'str',
  'default': <object at 0x107c5e620>},
 {'name': 'b', 'type': 'int', 'default': 3}]
Starred arguments

We will process vararg and kwarg together, prefixing their names with the appropriate number of * characters:

Unlike the other arguments, these are either None or a single ast.arg node, so we don't need to loop over the values:

If that argument is present in the function definition, we will capture the details we need for the docstring:

Otherwise, we will record it as None, so that we can filter this out when we make the docstring:

    
def _extract_star_args(
    arguments: ast.arguments
) -> list[dict | None]:
    return [
        {
            'name': (
                f'*{arg.arg}'
                if arg_type == 'vararg'
                else f'**{arg.arg}'
            ),
            'type': getattr(arg.annotation, 'id', '__type__'),
            'default': NO_DEFAULT,
        }
        if arg
        else None
        for arg_type in ['vararg', 'kwarg']
        for arg in [getattr(arguments, arg_type)]
    ]
Example

Note that, while it is convention, there is no requirement that we name these arguments *args and **kwargs, so our code needs to correctly extract the name:

>>> _extract_star_args(
...     ast.parse(
...         'def func(*extra_args, **extra_kwargs): pass'
...     ).body[0].args
... )
[{'name': '*extra_args', 'type': '__type__',
  'default': <object at 0x107c5e630>},
 {'name': '**extra_kwargs', 'type': '__type__',
  'default': <object at 0x107c5e630>}]
Keyword-only arguments

Finally, we process kwonlyargs and kw_defaults together:

Both lists are of the same size this time, so we can use zip():

We gather the same information on the arguments:

However, a default value of None here means that there is no default:

    
def _extract_keyword_args(
    arguments: ast.arguments
) -> list[dict]:
    return [
        {
            'name': arg.arg,
            'type': getattr(arg.annotation, 'id', '__type__'),
            'default': (
                NO_DEFAULT if default is None
                else default.value
            ),
        }
        for arg, default in zip(
            arguments.kwonlyargs, arguments.kw_defaults
        )
    ]
Example

Including a * in the function definition requires that the arguments following it (a and b, here) be passed by name (e.g., func(a='?') would work, but func('?') would raise an exception):

>>> _extract_keyword_args(
...     ast.parse(
...         'def func(*, a: str, b: int = 3): pass'
...     ).body[0].args
... )
[{'name': 'a', 'type': 'str',
  'default': <object at 0x107c5e620>},
 {'name': 'b', 'type': 'int', 'default': 3}]
Putting all the arguments together

We ensure that the order of the arguments in the docstring matches their order in the function definition:

First, we include the positional arguments, with the positional-only ones preceding the ones that can be passed by position or name:

Next, we process the starred arguments. However, we only check whether varargs is present at this time, because it belongs in the positional arguments group:

Finally, we include the keyword-only arguments, with kwargs coming last (if present):

With all of the arguments extracted and ordered properly, we convert to a tuple to make it immutable and return:

    
def extract_arguments(arguments: ast.arguments) -> tuple[dict, ...]:
    args = _extract_positional_args(arguments)

    varargs, kwargs = _extract_star_args(arguments)

    if varargs:
        args.append(varargs)

    args.extend(_extract_keyword_args(arguments))

    if kwargs:
        args.append(kwargs)

    return tuple(args)

Running this on the Greeter.greet() method extracts the name argument (ignoring self):

({'name': 'name', 'type': 'str', 'default': 'World'},)

Now, we need the return type.

returns

The return annotation for Greeter.greet() is str:

returns=Name(id='str', ctx=Load())

Extracting returns information in a docstring-friendly format

Here, we simplify by assuming that the return type annotation is provided and only handling the cases of ast.Constant and ast.Name nodes:

def extract_return_annotation(node: ast.AST) -> str:

    if isinstance(node, ast.Constant):
        return str(node.value)

    if isinstance(node, ast.Name):
        return str(node.id)

    return '__return_type__'
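Exercising all three branches (the function is redefined here so the snippet stands alone; the example annotations are arbitrary):

```python
import ast

def extract_return_annotation(node: ast.AST) -> str:
    if isinstance(node, ast.Constant):
        return str(node.value)
    if isinstance(node, ast.Name):
        return str(node.id)
    return '__return_type__'

def returns_of(source: str) -> ast.AST:
    # Grab the returns field of the first (and only) function in the source.
    return ast.parse(source).body[0].returns

print(extract_return_annotation(returns_of('def f() -> None: ...')))  # None
print(extract_return_annotation(returns_of('def f() -> str: ...')))   # str
# list[str] is an ast.Subscript node, so we fall back to the placeholder:
print(extract_return_annotation(returns_of('def f() -> list[str]: ...')))
```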

Combining arguments and return type into a docstring

We will suggest Numpydoc-style docstrings for functions and methods that don't have them. The user then fills in any placeholder values (__description__, __type__, and __return_type__) that our tool can't infer from the type annotations:

"""
__description__

Parameters
----------
name : __type__
    __description__

Returns
-------
__return_type__
    __description__
"""

The suggest_docstring() function will construct docstrings based on function nodes in the AST:

It formats the output from the extract_arguments() function into a parameters section (if the function has parameters):

Next, it uses the output from the extract_return_annotation() function to make a returns section:

Everything is then combined with some placeholders and triple quotes to become a docstring template:

    
def suggest_docstring(
    node: ast.AsyncFunctionDef | ast.FunctionDef
) -> str:
    if args := extract_arguments(node.args):
        args = [
            f'{arg["name"]} : {arg["type"]}'
            + (
                f', default {arg["default"]}'
                if arg["default"] is not NO_DEFAULT else ''
            )
            + '\n    __description__'
            for arg in args
        ]
        args = ['', 'Parameters', '----------', *args]
    else:
        args = []

    returns = (
        extract_return_annotation(node.returns)
        + '\n    __description__'
    )

    return '\n'.join(
        [
            '"""',
            '___description___',
            *args,
            '',
            'Returns',
            '-------',
            returns,
            '"""',
        ]
    )

Updating the DocstringVisitor

In the _detect_missing_docstring() method, we now suggest a docstring for functions that are missing one:

class DocstringVisitor(ast.NodeVisitor):

    def __init__(self, module_name: str) -> None:
        super().__init__()
        self.stack: list[str] = []
        self.module_name: str = module_name

    def _detect_missing_docstring(
        self,
        node: ast.AsyncFunctionDef
        | ast.ClassDef
        | ast.FunctionDef
        | ast.Module
    ) -> None:
        if ast.get_docstring(node) is None:
            entity = '.'.join(self.stack)
            print(f'{entity} is missing a docstring')

            if isinstance(
                node, ast.AsyncFunctionDef | ast.FunctionDef
            ):
                print(
                    'Hint:',
                    suggest_docstring(node),
                    '',
                    sep='\n',
                )

    def _visit_helper(
        self,
        node: ast.AsyncFunctionDef
        | ast.ClassDef
        | ast.FunctionDef
        | ast.Module
    ) -> None:
        self.stack.append(getattr(node, 'name', self.module_name))
        self._detect_missing_docstring(node)
        self.generic_visit(node)
        self.stack.pop()

    def visit_AsyncFunctionDef(
        self, node: ast.AsyncFunctionDef
    ) -> None:
        self._visit_helper(node)

    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
        self._visit_helper(node)

    def visit_ClassDef(self, node: ast.ClassDef) -> None:
        self._visit_helper(node)

    def visit_Module(self, node: ast.Module) -> None:
        self._visit_helper(node)

Docstrings are now suggested based on the AST

>>> visitor = DocstringVisitor('greet')
>>> visitor.visit(tree)
greet is missing a docstring
greet.Greeter is missing a docstring
greet.Greeter.__init__ is missing a docstring
Hint:
"""
___description___

Parameters
----------
enthusiasm : int, default 1
    __description__

Returns
-------
None
    __description__
"""

greet.Greeter.greet is missing a docstring
Hint:
"""
___description___

Parameters
----------
name : str, default World
    __description__

Returns
-------
str
    __description__
"""

Injecting docstring templates into source code

Suggestions are great, but we can do better.

Modifying the AST to inject docstrings

Instead of suggesting docstrings where they are missing, we will add docstring templates to the code:

To edit the AST, we need to subclass ast.NodeTransformer, which inherits from ast.NodeVisitor, this time:

In the _handle_missing_docstring() method, we will add a node to the function body with the docstring:

In order to properly indent the docstring, we need to add one additional level of indentation beyond what the function definition has (col_offset):

The AST node we inject will be an ast.Expr node, with an ast.Constant node inside containing the docstring itself:

The suggest_docstring() function includes the surrounding """, so we need to remove them (suggestion[3:-3]):

We also need to add the indent on the final line (+ prefix) since textwrap.indent() won't indent a blank line with nothing after it:

The docstring AST node will be the first entry in the ast.FunctionDef node's body field:

Since AST nodes store references to their line numbers in the source code, we fix them for all nodes in this subtree after making the insertion:

The _visit_helper() method will call _handle_missing_docstring() and make sure we perform a complete traversal:

Note that we now return the node. If we don't return it, the node will be removed from the AST:

For this example, we will just visit function nodes since we only have docstring suggestions for those:

    
from textwrap import indent

class DocstringTransformer(ast.NodeTransformer):

    def _handle_missing_docstring(
        self, node: ast.AsyncFunctionDef | ast.FunctionDef
    ) -> ast.AsyncFunctionDef | ast.FunctionDef:
        if ast.get_docstring(node) is None:
            suggestion = suggest_docstring(node)
            prefix = ' ' * (node.col_offset + 4)
            docstring_node = ast.Expr(
                ast.Constant(
                    indent(
                        suggestion[3:-3] + prefix,
                        prefix,
                    )
                )
            )

            node.body.insert(0, docstring_node)
            node = ast.fix_missing_locations(node)

        return node

    def _visit_helper(
        self,
        node: ast.AsyncFunctionDef | ast.FunctionDef
    ) -> ast.AsyncFunctionDef | ast.FunctionDef:
        node = self._handle_missing_docstring(node)
        self.generic_visit(node)
        return node

    def visit_AsyncFunctionDef(
        self, node: ast.AsyncFunctionDef
    ) -> ast.AsyncFunctionDef:
        return self._visit_helper(node)

    def visit_FunctionDef(
        self, node: ast.FunctionDef
    ) -> ast.FunctionDef:
        return self._visit_helper(node)

We can use the ast.unparse() function to convert from an AST back into source code, but note that formatting may be a little different and, since comments are not represented in the AST, they will not be preserved:

>>> transformer = DocstringTransformer()
>>> tree = transformer.visit(tree)
>>> print(ast.unparse(tree))
class Greeter:

    def __init__(self, enthusiasm: int=1) -> None:
        """
        ___description___

        Parameters
        ----------
        enthusiasm : int, default 1
            __description__

        Returns
        -------
        None
            __description__
        """
        self.enthusiasm = enthusiasm

    def greet(self, name: str='World') -> str:
        """
        ___description___

        Parameters
        ----------
        name : str, default World
            __description__

        Returns
        -------
        str
            __description__
        """
        return f"Hello, {name}{'!' * self.enthusiasm}"

Potential next steps

  • Have DocstringVisitor and DocstringTransformer read in the file and generate the AST
  • Infer whether a function has a return statement in the absence of a return type annotation
  • Have DocstringTransformer convert the modified AST back to source code and save it to a file
  • Support configuration via pyproject.toml and inline comments (tokenize)
  • Create a CLI and a way to run on multiple files
  • Turn this into a pre-commit hook and/or CI tool

Reference implementation

All examples herein were based on my open source project, Docstringify.

Thank you!

I hope you enjoyed the session. You can follow my work on these platforms: