Build Your Own (Simple) Static Code Analyzer
Stefanie Molin
Bio
- 👩🏻💻 Software engineer at Bloomberg in NYC
- ✨ Core developer of numpydoc and creator of numpydoc's pre-commit hook, which uses static code analysis
- ✍ Author of "Hands-On Data Analysis with Pandas"
- 🎓 Bachelor's degree in operations research from Columbia University
- 🎓 Master's degree in computer science from Georgia Tech
What makes a tool a static code analyzer?
It analyzes source code without running it.
What are the main benefits of static code analysis?
- Speed – can be much faster than dynamic code analysis
- Portability – no need to install the codebase being analyzed or its dependencies
How do you build a static code analyzer?
It depends...
Abstract Syntax Trees (ASTs) are a good place to start.
Abstract Syntax Tree (AST)
- Represents the structure of the source code as a tree
- Nodes in the tree are language constructs (e.g., module, class, function)
- Each node has a single parent (e.g., a class is a child of a single module)
- Parent nodes can have multiple children (e.g., a class can have several methods)
Let's see what this code snippet (greet.py) looks like when represented as an AST:
class Greeter:
    def __init__(self, enthusiasm: int = 1) -> None:
        self.enthusiasm = enthusiasm

    def greet(self, name: str = 'World') -> str:
        return f'Hello, {name}{"!" * self.enthusiasm}'
Figure: The AST for greet.py, visualized with Graphviz.
ASTs in Python
- Represent syntactically-correct Python code (cannot be generated in the presence of syntax errors)
- Created by the parser as an intermediary step when compiling source code into byte code (necessary to run it)
- Available in the standard library via the ast module
Parsing Python source code into an AST
1. Read in the source code
>>> from pathlib import Path
>>> source_code = Path('greet.py').read_text()
2. Parse it with the ast module
If the code is syntactically-correct, we get an AST back:
>>> import ast
>>> tree = ast.parse(source_code)
>>> print(type(tree))
<class 'ast.Module'>
Inspecting the AST
Use ast.dump() to display the AST. Reading the output from the top down:
- The root node is an ast.Module node.
- It contains everything else in its body attribute.
- The greet.py file first defines a class, named Greeter.
- The ast.ClassDef node also contains the body of the Greeter class.
- The first entry is the Greeter.__init__() method.
- The ast.FunctionDef node includes information about the arguments.
- Its body contains the AST representation of the function's code.
- The return annotation is stored in the returns attribute.
- The final entry is the Greeter.greet() method.
>>> print(ast.dump(tree, indent=2))
Module(
  body=[
    ClassDef(
      name='Greeter',
      body=[
        FunctionDef(
          name='__init__',
          args=arguments(
            args=[
              arg(arg='self'),
              arg(
                arg='enthusiasm',
                annotation=Name(id='int', ctx=Load()))],
            defaults=[
              Constant(value=1)]),
          body=[
            Assign(
              targets=[
                Attribute(
                  value=Name(id='self', ctx=Load()),
                  attr='enthusiasm',
                  ctx=Store())],
              value=Name(id='enthusiasm', ctx=Load()))],
          returns=Constant(value=None)),
        FunctionDef(
          name='greet',
          args=arguments(
            args=[
              arg(arg='self'),
              arg(
                arg='name',
                annotation=Name(id='str', ctx=Load()))],
            defaults=[
              Constant(value='World')]),
          body=[
            Return(
              value=JoinedStr(
                values=[
                  Constant(value='Hello, '),
                  FormattedValue(
                    value=Name(id='name', ctx=Load()),
                    conversion=-1),
                  FormattedValue(
                    value=BinOp(
                      left=Constant(value='!'),
                      op=Mult(),
                      right=Attribute(
                        value=Name(id='self', ctx=Load()),
                        attr='enthusiasm',
                        ctx=Load())),
                    conversion=-1)]))],
          returns=Name(id='str', ctx=Load()))])])
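The same information shown in the dump can be reached with ordinary attribute access. The sketch below re-creates the greet.py source inline (so that it runs on its own) and navigates from the module to the class and its methods.

```python
import ast

# Inline copy of greet.py so the example is self-contained
source_code = '''
class Greeter:
    def __init__(self, enthusiasm: int = 1) -> None:
        self.enthusiasm = enthusiasm

    def greet(self, name: str = 'World') -> str:
        return f'Hello, {name}{"!" * self.enthusiasm}'
'''
tree = ast.parse(source_code)

# Module.body[0] is the ClassDef node for Greeter
class_node = tree.body[0]

# ClassDef.body holds the FunctionDef nodes for the methods
method_names = [
    node.name for node in class_node.body
    if isinstance(node, ast.FunctionDef)
]

# FunctionDef.returns holds the return annotation (a Name node here)
greet_node = class_node.body[1]
return_annotation = greet_node.returns.id
```

Indexing into body like this only works when the file layout is known in advance, which is why the rest of the walkthrough relies on a visitor instead.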
Popular open source tools that use ASTs
- Linters and formatters, like ruff (Rust) and black (Python)
- Documentation tools, like sphinx and the numpydoc-validation pre-commit hook
- Automatic Python syntax upgrade tools, like pyupgrade
- Type checkers, like mypy
- Code security tools, like bandit
- Code and testing coverage tools, like vulture and coverage.py
- Testing frameworks that instrument your code or generate tests based on it, like hypothesis and pytest
Let's build a simple static code analyzer
To learn how to use the AST, we will build a tool that does the following:
- Finds missing docstrings and suggests templates based on the code itself
- Uses only the Python standard library
Tools may exist that do this already, but the point is to learn how to use the AST.
The input
We will analyze a single file, greet.py, for time and space considerations:
class Greeter:
    def __init__(self, enthusiasm: int = 1) -> None:
        self.enthusiasm = enthusiasm

    def greet(self, name: str = 'World') -> str:
        return f'Hello, {name}{"!" * self.enthusiasm}'
Is static code analysis really necessary here?
While we are working with one file here, most codebases will be much larger. How could we approach this scalably?
- Manually (open each file and edit) – tedious and error prone
- Regular expressions – messy and hard to get right (edge cases, context, etc.)
- Script to import everything and check docstrings – must be able to install codebase and its dependencies; slow
- Static code analysis – analyzing code without executing it means we can use this on any of our codebases ✅
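To make the last bullet concrete, here is a minimal sketch in which the analyzed source imports a package that is assumed not to be installed (the package name is made up). Because the code is only parsed and never executed, the import is never resolved.

```python
import ast

# The imported package is hypothetical and does not need to exist
source = '''
import some_package_that_is_not_installed

def helper():
    return some_package_that_is_not_installed.run()
'''

tree = ast.parse(source)  # parsing succeeds; nothing is imported or run

function_names = [
    node.name for node in ast.walk(tree)
    if isinstance(node, ast.FunctionDef)
]
imported_modules = [
    alias.name
    for node in ast.walk(tree)
    if isinstance(node, ast.Import)
    for alias in node.names
]
```

An import-based script would fail on the first line of this file, while the AST still gives access to every definition in it.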
Important disclaimer before we dive in
Docstrings have been omitted from all code snippets for space 😂
Fear not – we are building a tool to fix that!
Detecting missing docstrings using the Python AST
We need to traverse the full AST (to account for nested functions and classes) and inspect each node's docstring:
Only ast.Module, ast.ClassDef, ast.FunctionDef, and ast.AsyncFunctionDef nodes can have docstrings.
If there is one, ast.get_docstring(node) returns the docstring of node; otherwise, it returns None:
def detect_missing_docstring(
    node: ast.AsyncFunctionDef
    | ast.ClassDef
    | ast.FunctionDef
    | ast.Module
) -> None:
    if ast.get_docstring(node) is None:
        # Module nodes have no name attribute, so fall back to 'module'
        entity = getattr(node, 'name', 'module')
        print(f'{entity} is missing a docstring')
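The behavior of ast.get_docstring() can be checked in isolation. This sketch parses two throwaway functions (names chosen for illustration) and also shows what happens when a node type that cannot carry a docstring is passed in.

```python
import ast

tree = ast.parse('''
def documented():
    """Say hello."""

def undocumented():
    pass
''')

# Maps each function name to its docstring, or None if it has none
docstrings = {node.name: ast.get_docstring(node) for node in tree.body}

# Nodes other than modules, classes, and functions raise a TypeError
error_message = None
try:
    ast.get_docstring(ast.parse('x = 1').body[0])
except TypeError as exc:
    error_message = str(exc)
```

The TypeError is the reason the visitor only calls the check from the four node types listed above.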
In greet.py, we want to call this function on these nodes only: the module itself, the Greeter class, and its __init__() and greet() methods.
Traversing the AST
File structures vary, so we will create a NodeVisitor to ensure we find all missing docstrings:
1. Subclass ast.NodeVisitor
2. Create visit_<NodeType>() methods for nodes we are interested in
3. Instantiate the visitor and call its visit() method
1. Subclass ast.NodeVisitor
class DocstringVisitor(ast.NodeVisitor):
    pass
2. Create visit_<NodeType>() methods for nodes we are interested in
class DocstringVisitor(ast.NodeVisitor):
    def visit_AsyncFunctionDef(
        self, node: ast.AsyncFunctionDef
    ) -> None:
        detect_missing_docstring(node)

    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
        detect_missing_docstring(node)

    def visit_ClassDef(self, node: ast.ClassDef) -> None:
        detect_missing_docstring(node)

    def visit_Module(self, node: ast.Module) -> None:
        detect_missing_docstring(node)
3. Instantiate the visitor and call its visit() method
>>> visitor = DocstringVisitor()
>>> visitor.visit(tree)
module is missing a docstring
What about the missing docstrings for the Greeter class and its methods?
Complete traversal means visiting all fields
We aren't visiting the list of AST nodes in the ast.Module node's body field, so traversal starts and stops there.
The generic_visit() method
- Defined on base class ast.NodeVisitor
- Visits child nodes by calling visit() on any nodes returned from ast.iter_fields()
- Called automatically for node types for which we didn't create methods
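To show roughly what generic_visit() does, the sketch below overrides it with a simplified re-implementation built on ast.iter_fields(). SketchVisitor and its seen attribute are hypothetical names for this example, and the standard library's version should be preferred in real code.

```python
import ast


class SketchVisitor(ast.NodeVisitor):
    """Records function names; generic_visit() is re-implemented for illustration."""

    def __init__(self) -> None:
        self.seen: list[str] = []

    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
        self.seen.append(node.name)
        self.generic_visit(node)  # without this call, nested nodes are skipped

    def generic_visit(self, node: ast.AST) -> None:
        # Fields hold plain values, a single node, or a list of nodes
        for _field, value in ast.iter_fields(node):
            if isinstance(value, list):
                for item in value:
                    if isinstance(item, ast.AST):
                        self.visit(item)
            elif isinstance(value, ast.AST):
                self.visit(value)


visitor = SketchVisitor()
visitor.visit(ast.parse('def outer():\n    def inner():\n        pass\n'))
```

Because no visit_Module() method is defined here, visit() falls back to generic_visit() for the module, which is how the traversal reaches outer() and then inner().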
Modifying the DocstringVisitor
We add the _visit_helper() method, which checks the docstring and then continues the traversal.
Calling generic_visit() on each node whose docstring we check ensures we continue the traversal.
Now, we switch to calling _visit_helper() whenever we visit module, class, or function nodes:
class DocstringVisitor(ast.NodeVisitor):
    def _visit_helper(
        self,
        node: ast.AsyncFunctionDef
        | ast.ClassDef
        | ast.FunctionDef
        | ast.Module
    ) -> None:
        detect_missing_docstring(node)
        self.generic_visit(node)

    def visit_AsyncFunctionDef(
        self, node: ast.AsyncFunctionDef
    ) -> None:
        self._visit_helper(node)

    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
        self._visit_helper(node)

    def visit_ClassDef(self, node: ast.ClassDef) -> None:
        self._visit_helper(node)

    def visit_Module(self, node: ast.Module) -> None:
        self._visit_helper(node)
Complete traversal achieved 🎉
>>> visitor = DocstringVisitor()
>>> visitor.visit(tree)
module is missing a docstring
Greeter is missing a docstring
__init__ is missing a docstring
greet is missing a docstring
Disambiguating docstring paths
greet could be the greet() method or the greet module, but greet.Greeter.greet can only be one:
greet is missing a docstring
Tracking node ancestry with a stack
From a node, we can access its children, but not its parent.
We can track lineage with a stack:
- We internalize the missing docstring check as _detect_missing_docstring().
- It uses the stack to print the unambiguous path to the missing docstring.
- The _visit_helper() method takes care of pushing onto and popping off of the stack.
- We push (append) a node's name onto the stack before we actually visit it.
- We pop the name off the stack after we have visited the node and all of its descendants.
class DocstringVisitor(ast.NodeVisitor):
    def __init__(self, module_name: str) -> None:
        super().__init__()
        self.stack: list[str] = []
        self.module_name: str = module_name

    def _detect_missing_docstring(
        self,
        node: ast.AsyncFunctionDef
        | ast.ClassDef
        | ast.FunctionDef
        | ast.Module
    ) -> None:
        if ast.get_docstring(node) is None:
            entity = '.'.join(self.stack)
            print(f'{entity} is missing a docstring')

    def _visit_helper(
        self,
        node: ast.AsyncFunctionDef
        | ast.ClassDef
        | ast.FunctionDef
        | ast.Module
    ) -> None:
        self.stack.append(
            getattr(node, 'name', self.module_name)
        )
        self._detect_missing_docstring(node)
        self.generic_visit(node)
        self.stack.pop()

    def visit_AsyncFunctionDef(
        self, node: ast.AsyncFunctionDef
    ) -> None:
        self._visit_helper(node)

    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
        self._visit_helper(node)

    def visit_ClassDef(self, node: ast.ClassDef) -> None:
        self._visit_helper(node)

    def visit_Module(self, node: ast.Module) -> None:
        self._visit_helper(node)
Now, we know exactly where the docstrings are missing:
>>> visitor = DocstringVisitor('greet')
>>> visitor.visit(tree)
greet is missing a docstring
greet.Greeter is missing a docstring
greet.Greeter.__init__ is missing a docstring
greet.Greeter.greet is missing a docstring
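A stack is one way to track ancestry. As a point of comparison only (this is not the approach used in this talk), another common pattern is to annotate every node with a reference to its parent up front. The parent attribute and both helper functions below are hypothetical additions for this sketch.

```python
import ast


def attach_parents(tree: ast.AST) -> None:
    """Store each node's parent on a custom `parent` attribute."""
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            child.parent = parent  # not a standard AST attribute


def qualified_name(node: ast.AST, module_name: str) -> str:
    """Walk up the parent links to build a dotted path."""
    parts = []
    current = node
    while current is not None:
        if isinstance(current, ast.Module):
            parts.append(module_name)
        elif isinstance(
            current, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
        ):
            parts.append(current.name)
        current = getattr(current, 'parent', None)
    return '.'.join(reversed(parts))


tree = ast.parse('class Greeter:\n    def greet(self):\n        pass\n')
attach_parents(tree)
path = qualified_name(tree.body[0].body[0], 'greet')
```

The stack avoids mutating the tree and needs no extra pass, while parent links make it possible to look up ancestry from any node after the traversal is over.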
Suggesting docstring templates
ast.FunctionDef and ast.AsyncFunctionDef nodes have information that often ends up in the docstring:
- args: Argument names, types, and defaults
- returns: Return type annotation (if present)
- body: AST of the function body, which can be used to infer return types, as well as whether the function raises any exceptions (out of scope)
For this keynote, we will focus on fully-typed code.
An example using the Greeter.greet() method
class Greeter:
    def __init__(self, enthusiasm: int = 1) -> None:
        self.enthusiasm = enthusiasm

    def greet(self, name: str = 'World') -> str:
        return f'Hello, {name}{"!" * self.enthusiasm}'
The arguments are on the left branch, the function body is in the middle, and the return annotation is on the right branch.
ast.arguments
field | type | description |
---|---|---|
posonlyargs | list[ast.arg] | positional-only arguments |
args | list[ast.arg] | arguments that can be passed positionally or by keyword |
vararg | ast.arg \| None | *args |
kwonlyargs | list[ast.arg] | keyword-only arguments |
kw_defaults | list[ast.AST \| None] | default values for keyword-only arguments, where None means the argument is required |
kwarg | ast.arg \| None | **kwargs |
defaults | list[ast.AST] | default values for last n positional arguments |
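To see every field from the table populated at once, this sketch parses a throwaway function signature (invented for illustration) that uses each kind of argument and reads the fields off its ast.arguments node.

```python
import ast

func = ast.parse(
    'def func(a, /, b, c=1, *args, d, e=2, **kwargs): pass'
).body[0]
arguments = func.args

posonly = [arg.arg for arg in arguments.posonlyargs]   # before the /
regular = [arg.arg for arg in arguments.args]          # positional or keyword
vararg = arguments.vararg.arg                          # name after the single *
kwonly = [arg.arg for arg in arguments.kwonlyargs]     # after *args
kwarg = arguments.kwarg.arg                            # name after the **

# defaults only covers the last n positional arguments (just c here)
defaults = [default.value for default in arguments.defaults]

# kw_defaults lines up with kwonlyargs; None marks a required argument (d)
kw_defaults = [
    None if default is None else default.value
    for default in arguments.kw_defaults
]
```

Note the asymmetry: defaults is shorter than the positional lists, while kw_defaults always has one entry per keyword-only argument.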
The Greeter.greet() method has two positional arguments, self and name, with the latter having a type of str and a default value of 'World':
arguments(
  args=[
    arg(arg='self'),
    arg(
      arg='name',
      annotation=Name(id='str', ctx=Load()))],
  defaults=[
    Constant(value='World')])
Extracting argument information in a docstring-friendly format
We need argument names, types, and default values for three groups of arguments:
- positional: posonlyargs, args, and defaults
- starred: vararg and kwarg
- keyword-only: kwonlyargs and kw_defaults
Positional arguments
Using a list comprehension, we will process posonlyargs and args together since both of their defaults (if they have them) are stored in defaults:
- None can be a default value, so we create a sentinel value to indicate when something has no default.
- We use zip_longest() to loop over the values because defaults is at most the combined length of posonlyargs and args.
- For now, we can exclude any self and cls arguments like this, but it would be more accurate to revisit our stack to check if the function is actually a method.
- For each argument, we create a dictionary to store the name, type, and default value for later use.
- Because defaults lines up with the last n arguments, we pair the two lists in reverse, so we flip the resulting list before returning it.
from itertools import zip_longest

# Sentinel, since None is itself a valid default value
NO_DEFAULT = object()

def _extract_positional_args(
    arguments: ast.arguments
) -> list[dict]:
    return [
        {
            'name': arg.arg,
            'type': getattr(arg.annotation, 'id', '__type__'),
            'default': (
                default.value
                if default is not NO_DEFAULT
                else default
            ),
        }
        for arg, default in zip_longest(
            reversed([*arguments.posonlyargs, *arguments.args]),
            reversed(arguments.defaults),
            fillvalue=NO_DEFAULT,
        )
        if arg.arg not in ['self', 'cls']
    ][::-1]
Example
Including a / in the function definition requires that the arguments preceding it (a, here) be passed by position (e.g., func('?') would work, but func(a='?') would raise an exception):
>>> _extract_positional_args(
...     ast.parse(
...         'def func(a: str, /, b: int = 3): pass'
...     ).body[0].args
... )
[{'name': 'a', 'type': 'str',
  'default': <object at 0x107c5e620>},
 {'name': 'b', 'type': 'int', 'default': 3}]
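The reverse-then-flip trick is easier to follow with plain lists standing in for the AST nodes. In this sketch the names and defaults are made-up values; only the pairing logic mirrors the function above.

```python
from itertools import zip_longest

NO_DEFAULT = object()  # sentinel for "no default"

names = ['a', 'b', 'c']   # stand-in for posonlyargs + args
defaults = [2, 3]         # defaults apply to the LAST two names

# Pair from the right so the defaults line up with b and c, pad the
# shorter list with the sentinel, then restore the original order
pairs = list(
    zip_longest(
        reversed(names), reversed(defaults), fillvalue=NO_DEFAULT
    )
)[::-1]
```

Zipping the lists in their original order would have wrongly paired the defaults with a and b.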
Starred arguments
We will process vararg and kwarg together, prefixing their names with the appropriate number of * characters:
- Unlike the other arguments, these are either None or a single ast.arg node, so we don't need to loop over the values.
- If that argument is present in the function definition, we will capture the details we need for the docstring.
- Otherwise, we will record it as None, so that we can filter this out when we make the docstring.
def _extract_star_args(
    arguments: ast.arguments
) -> list[dict | None]:
    return [
        {
            'name': (
                f'*{arg.arg}'
                if arg_type == 'vararg'
                else f'**{arg.arg}'
            ),
            'type': getattr(arg.annotation, 'id', '__type__'),
            'default': NO_DEFAULT,
        }
        if arg
        else None
        for arg_type in ['vararg', 'kwarg']
        for arg in [getattr(arguments, arg_type)]
    ]
Example
Note that, while it is convention, there is no requirement that we name these arguments *args and **kwargs, so our code needs to correctly extract the name:
>>> _extract_star_args(
...     ast.parse(
...         'def func(*extra_args, **extra_kwargs): pass'
...     ).body[0].args
... )
[{'name': '*extra_args', 'type': '__type__',
  'default': <object at 0x107c5e630>},
 {'name': '**extra_kwargs', 'type': '__type__',
  'default': <object at 0x107c5e630>}]
Keyword-only arguments
Finally, we process kwonlyargs and kw_defaults together:
- Both lists are of the same size this time, so we can use zip().
- We gather the same information on the arguments.
- However, a default value of None here means that there is no default.
def _extract_keyword_args(
    arguments: ast.arguments
) -> list[dict]:
    return [
        {
            'name': arg.arg,
            'type': getattr(arg.annotation, 'id', '__type__'),
            'default': (
                NO_DEFAULT if default is None
                else default.value
            ),
        }
        for arg, default in zip(
            arguments.kwonlyargs, arguments.kw_defaults
        )
    ]
Example
Including a * in the function definition requires that the arguments following it (a and b, here) be passed by name (e.g., func(a='?') would work, but func('?') would raise an exception):
>>> _extract_keyword_args(
...     ast.parse(
...         'def func(*, a: str, b: int = 3): pass'
...     ).body[0].args
... )
[{'name': 'a', 'type': 'str',
  'default': <object at 0x107c5e620>},
 {'name': 'b', 'type': 'int', 'default': 3}]
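The extraction functions read default.value, which assumes that every default is an ast.Constant node. One possible extension, sketched here with a hypothetical render_default() helper, is to fall back to ast.unparse() for any other expression so that defaults such as -1, a named constant, or [] can still be shown as text.

```python
import ast


def render_default(default: ast.expr) -> object:
    """Return the literal value for constants, source text otherwise."""
    if isinstance(default, ast.Constant):
        return default.value
    # Non-constant expressions have no .value attribute,
    # so render them back into source code instead
    return ast.unparse(default)


args = ast.parse(
    'def func(a=3, b=-1, c=DEFAULT_NAME, d=[]): pass'
).body[0].args
rendered = [render_default(default) for default in args.defaults]
```

Note that -1 is not a constant in the AST: it is a unary minus applied to the constant 1, which is why it takes the fallback path.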
Putting all the arguments together
We ensure that the order of the arguments in the docstring matches their order in the function definition:
First, we include the positional arguments, with the positional-only ones preceding the ones that can be passed by position or name:
Next, we process the starred arguments. However, we only check whether varargs is present at this time, because it belongs in the positional arguments group:
Finally, we include the keyword-only arguments, with kwargs coming last (if present):
With all of the arguments extracted and ordered properly, we convert to a tuple to make it immutable and return:
def extract_arguments(arguments: ast.arguments) -> tuple[dict, ...]:
    args = _extract_positional_args(arguments)
    varargs, kwargs = _extract_star_args(arguments)
    if varargs:
        args.append(varargs)
    args.extend(_extract_keyword_args(arguments))
    if kwargs:
        args.append(kwargs)
    return tuple(args)
Running this on the Greeter.greet() method extracts the name argument (ignoring self):
({'name': 'name', 'type': 'str', 'default': 'World'},)
Now, we need the return type.
returns
The return annotation for Greeter.greet() is str:
returns=Name(id='str', ctx=Load())
Extracting returns information in a docstring-friendly format
Here, we simplify by assuming that the return type annotation is provided and only handling the cases of ast.Constant and ast.Name nodes:
def extract_return_annotation(node: ast.AST) -> str:
    if isinstance(node, ast.Constant):
        return str(node.value)
    if isinstance(node, ast.Name):
        return str(node.id)
    return '__return_type__'
Combining arguments and return type into a docstring
We will suggest numpydoc-style docstrings for functions and methods that don't have docstrings, and the user will be required to fill in any placeholder values (__description__, __type__, and __return_type__) that our tool can't infer from the type annotations:
"""
__description__

Parameters
----------
name : __type__
    __description__

Returns
-------
__return_type__
    __description__
"""
The suggest_docstring() function will construct docstrings based on function nodes in the AST:
- It formats the output from the extract_arguments() function into a parameters section (if the function has parameters).
- Next, it uses the output from the extract_return_annotation() function to make a returns section.
- Everything is then combined with some placeholders and triple quotes to become a docstring template.
def suggest_docstring(
    node: ast.AsyncFunctionDef | ast.FunctionDef
) -> str:
    if args := extract_arguments(node.args):
        args = [
            f'{arg["name"]} : {arg["type"]}'
            + (
                f', default {arg["default"]}'
                if arg["default"] is not NO_DEFAULT else ''
            )
            + '\n    __description__'
            for arg in args
        ]
        args = ['', 'Parameters', '----------', *args]
    else:
        args = []
    returns = (
        extract_return_annotation(node.returns)
        + '\n    __description__'
    )
    return '\n'.join(
        [
            '"""',
            '___description___',
            *args,
            '',
            'Returns',
            '-------',
            returns,
            '"""',
        ]
    )
Updating the DocstringVisitor
In the _detect_missing_docstring() method, we now suggest a docstring for functions that are missing one:
class DocstringVisitor(ast.NodeVisitor):
    def __init__(self, module_name: str) -> None:
        super().__init__()
        self.stack: list[str] = []
        self.module_name: str = module_name

    def _detect_missing_docstring(
        self,
        node: ast.AsyncFunctionDef
        | ast.ClassDef
        | ast.FunctionDef
        | ast.Module
    ) -> None:
        if ast.get_docstring(node) is None:
            entity = '.'.join(self.stack)
            print(f'{entity} is missing a docstring')
            # Templates are only available for functions and methods
            if isinstance(
                node, ast.AsyncFunctionDef | ast.FunctionDef
            ):
                print(
                    'Hint:',
                    suggest_docstring(node),
                    '',
                    sep='\n',
                )

    def _visit_helper(
        self,
        node: ast.AsyncFunctionDef
        | ast.ClassDef
        | ast.FunctionDef
        | ast.Module
    ) -> None:
        self.stack.append(getattr(node, 'name', self.module_name))
        self._detect_missing_docstring(node)
        self.generic_visit(node)
        self.stack.pop()

    def visit_AsyncFunctionDef(
        self, node: ast.AsyncFunctionDef
    ) -> None:
        self._visit_helper(node)

    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
        self._visit_helper(node)

    def visit_ClassDef(self, node: ast.ClassDef) -> None:
        self._visit_helper(node)

    def visit_Module(self, node: ast.Module) -> None:
        self._visit_helper(node)
Docstrings are now suggested based on the AST
>>> visitor = DocstringVisitor('greet')
>>> visitor.visit(tree)
greet is missing a docstring
greet.Greeter is missing a docstring
greet.Greeter.__init__ is missing a docstring
Hint:
"""
___description___

Parameters
----------
enthusiasm : int, default 1
    __description__

Returns
-------
None
    __description__
"""

greet.Greeter.greet is missing a docstring
Hint:
"""
___description___

Parameters
----------
name : str, default World
    __description__

Returns
-------
str
    __description__
"""
Injecting docstring templates into source code
Suggestions are great, but we can do better.
Modifying the AST to inject docstrings
Instead of suggesting docstrings where they are missing, we will add docstring templates to the code:
- To edit the AST, we need to subclass ast.NodeTransformer, which inherits from ast.NodeVisitor, this time.
- In the _handle_missing_docstring() method, we will add a node to the function body with the docstring.
- In order to properly indent the docstring, we need to add one additional level of indentation beyond what the function definition has (col_offset).
- The AST node we inject will be an ast.Expr node, with an ast.Constant node inside containing the docstring itself.
- The suggest_docstring() function includes the surrounding """, so we need to remove them (suggestion[3:-3]).
- We also need to add the indent on the final line (+ prefix) since textwrap.indent() won't indent a blank line with nothing after it.
- The docstring AST node will be the first entry in the ast.FunctionDef node's body field.
- Since AST nodes store references to their line numbers in the source code, we fix them for all nodes in this subtree after making the insertion.
- The _visit_helper() method will call _handle_missing_docstring() and make sure we perform a complete traversal.
- Note that we now return the node. If we don't return it, the node will be removed from the AST.
- For this example, we will just visit function nodes since we only have docstring suggestions for those.
from textwrap import indent

class DocstringTransformer(ast.NodeTransformer):
    def _handle_missing_docstring(
        self, node: ast.AsyncFunctionDef | ast.FunctionDef
    ) -> ast.AsyncFunctionDef | ast.FunctionDef:
        if ast.get_docstring(node) is None:
            suggestion = suggest_docstring(node)
            # One indentation level deeper than the function definition
            prefix = ' ' * (node.col_offset + 4)
            docstring_node = ast.Expr(
                ast.Constant(
                    indent(
                        suggestion[3:-3] + prefix,
                        prefix,
                    )
                )
            )
            node.body.insert(0, docstring_node)
            node = ast.fix_missing_locations(node)
        # Always return the node, whether or not it was modified
        return node

    def _visit_helper(
        self,
        node: ast.AsyncFunctionDef | ast.FunctionDef
    ) -> ast.AsyncFunctionDef | ast.FunctionDef:
        node = self._handle_missing_docstring(node)
        self.generic_visit(node)
        return node

    def visit_AsyncFunctionDef(
        self, node: ast.AsyncFunctionDef
    ) -> ast.AsyncFunctionDef:
        return self._visit_helper(node)

    def visit_FunctionDef(
        self, node: ast.FunctionDef
    ) -> ast.FunctionDef:
        return self._visit_helper(node)
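The importance of returning the node can be demonstrated with two tiny transformers that differ only in whether their visit method returns. Both class names are hypothetical and exist only for this sketch.

```python
import ast


class ForgetfulTransformer(ast.NodeTransformer):
    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
        self.generic_visit(node)
        # No return statement: the implicit None removes the node


class CarefulTransformer(ast.NodeTransformer):
    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        self.generic_visit(node)
        return node  # keeps the node in the tree


source = 'x = 1\ndef func():\n    pass\n'
without_return = ast.unparse(
    ForgetfulTransformer().visit(ast.parse(source))
)
with_return = ast.unparse(
    CarefulTransformer().visit(ast.parse(source))
)
```

Here, without_return contains only the assignment, while with_return still includes the function definition.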
We can use the ast.unparse() function to convert from an AST back into source code, but note that formatting may be a little different and, since comments are not represented in the AST, they will not be preserved:
>>> transformer = DocstringTransformer()
>>> tree = transformer.visit(tree)
>>> print(ast.unparse(tree))
class Greeter:

    def __init__(self, enthusiasm: int=1) -> None:
        """
        ___description___

        Parameters
        ----------
        enthusiasm : int, default 1
            __description__

        Returns
        -------
        None
            __description__
        """
        self.enthusiasm = enthusiasm

    def greet(self, name: str='World') -> str:
        """
        ___description___

        Parameters
        ----------
        name : str, default World
            __description__

        Returns
        -------
        str
            __description__
        """
        return f"Hello, {name}{'!' * self.enthusiasm}"
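The loss of comments is easy to confirm with a round trip through ast.parse() and ast.unparse(). The two-line source below is made up for this sketch.

```python
import ast

source = "x = 1  # the answer\ny = 'two'\n"

# Parse and immediately unparse: comments never make it into the AST,
# so they cannot appear in the regenerated source
round_tripped = ast.unparse(ast.parse(source))
```

Any tool that writes unparsed code back to disk should account for this, for example by editing the original text at the reported line numbers instead of regenerating the whole file.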
Potential next steps
- Have DocstringVisitor and DocstringTransformer read in the file and generate the AST
- Infer whether a function has a return statement in the absence of a return type annotation
- Have DocstringTransformer convert the modified AST back to source code and save it to a file
- Support configuration via pyproject.toml and inline comments (tokenize)
- Create a CLI and a way to run on multiple files
- Turn this into a pre-commit hook and/or CI tool
Reference implementation
All examples herein were based on my open source project, Docstringify:
- Repository: https://github.com/stefmolin/docstringify
- PyPI: python -m pip install docstringify
Thank you!
I hope you enjoyed the session. You can follow my work on these platforms: