C compiler

This page describes the design of the C compiler. If you want to use the compiler, please see ppci-cc or ppci.api.cc().


digraph x {
codegen [label="code generator" ]
source -> lexer [label="text"]
lexer -> preprocessor [label="tokens"]
preprocessor -> parser [label="tokens"]
preprocessor -> source [label="include"]
parser -> semantics [label="callback functions"]
semantics -> scope
semantics -> codegen [label="parsed program (AST)"]

The C compilers task is to take C code, and produce intermediate code for the backend. The first step is pre processing, during which a sequence of tokens is transformed into another sequence of tokens. Next comes parsing and code generation. There are two options here: split the parser and code generation passes, such that there is a clean seperation, or do it all at once. Maybe the best choice is to do it all in one pass, and transform a sequence of tokens into IR-code in a single go.


The preprocessor is a strange thing. It must handle trigraphs, backslash newlines and macros.

The top level design of the preprocessor is the following:

  • Context: Contains state that would be global otherwise.
  • Lexer: processes a raw file into a sequence of tokens
  • Preprocessor: takes the token sequence a does macro expansion, resulting in another stream of tokens.
  • Output: The token stream maybe outputted to file.
  • Feed to compiler: The token stream might be fed into the rest of the compiler.

C compiler

The c compiler consists of the classical stages: parsing and codegeneration. Code generation is done to ir-code.


The C parsing is done by two classes CParser and CSemantics. CParser is a recursive descent parser. It dances a tight dance with the CSemantics class. This idea is taken from the Clang project. The CParser takes a token sequence from the preprocessor and matches the C syntax. Whenever a valid C construct is found, it calls the corresponding function on the CSemantics class. The semantics class keeps track of the current scope and records the global declarations. It also check types and lvalues of expressions. At the end of parsing and this type checking, an abstract syntax tree (AST) is build up. This AST is type checked and variables are resolved. This AST can be used for several purposes, for example static code analysis or style checking. It is also possible to generate C code again from this AST.


The semantics class determines for each expression node whether it is an lvalue or not. The lvalue (short for location value) indicates whether the expression has a memory address, or is a value in a register. Basically it boils down to: can we take the address of this expression with the ‘&’ operator. Numeric literals and the results of addition are not lvalues. Variables are lvalues. The lvalue property is used during code generation to check whether the value must be loaded from memory or not.


Hosted vs freestanding

A C compiler can be hosted or freestanding. The difference between those two is that a hosted C compiler also provides the standard C library. A freestanding compiler only contains a few really required header files. As a result a hosted compiler is really a combination of a C compiler and a C standard library implementation. Also, the standard library depends on the operating system which is used, where as a freestanding C compiler can be used independent of operating system. Writing an application using a hosted compiler is easier since the standard library is available.

digraph x {
  hosted [label="Hosted C application"]
  freestanding [label="Freestanding C application"]
  os [label="Operating system"]
  libc [label="C standard library"]
  compiler [label="C compiler"]
  hosted -> libc
  freestanding -> compiler
  libc -> os
  libc -> compiler

Code generation

Code generation takes the AST as a whole and loops over all its elements and generates the corresponding IR-code snippets from it. At the end of code generation, there is an IR module which can be feed into the optimizers or code generators.

C classes

The C frontend can be used to generate an AST from C code. You can use it to parse C code and analyze its structure like this:

>>> from ppci.api import get_arch
>>> from ppci.lang.c import create_ast
>>> src = "int a, *b;"
>>> ast = create_ast(src, get_arch('msp430').info)
>>> ast
Compilation unit with 2 declarations
>>> from ppci.lang.c import CAstPrinter
>>> printer = CAstPrinter()
>>> printer.print(ast)
Compilation unit with 2 declarations
    Variable [storage=None typ=Basic type int name=a]
        Basic type int
    Variable [storage=None typ=Pointer-type name=b]
            Basic type int
>>> ast.declarations[0].name

Module reference

C front end.

ppci.lang.c.create_ast(src, arch_info, filename='<snippet>', coptions=None)

Create a C ast from the given source

ppci.lang.c.preprocess(f, output_file, coptions=None)

Pre-process a file into the other file.

ppci.lang.c.c_to_ir(source: io.TextIOBase, march, coptions=None, reporter=None)

C to ir translation.

  • source (file-like object) – The C source to compile.
  • march (str) – The targetted architecture.
  • coptions – C specific compilation options.

An ppci.ir.Module.

class ppci.lang.c.CBuilder(arch_info, coptions)

C builder that converts C code into ir-code

class ppci.lang.c.CContext(coptions, arch_info)

A context as a substitute for global data

alignment(typ: ppci.lang.c.types.CType)

Given a type, determine its alignment in bytes

equal_types(typ1, typ2, unqualified=False)

Check for type equality

error(message, location)

Trigger an error at the given location


Evaluate an expression right now! (=at compile time)

gen_global_ival(typ, ival)

Create memory image for initial value of global variable


Create a new instance for the given type specifiers


Test if an expression can be evaluated at compile time


Check if the type specifiers refer to a valid basic type

pack(typ, value)

Pack a type into proper memory format

sizeof(typ: ppci.lang.c.types.CType)

Given a type, determine its size in whole bytes

class ppci.lang.c.CLexer(coptions)

Lexer used for the preprocessor

lex(src, filename)

Read a source and generate a series of tokens


Root parsing function


Scan for a complete character constant


Scan for a complete string


Create tokens from the given text


Generate tokens from characters

class ppci.lang.c.COptions

A collection of settings regarding the C language


Add a path to the list of include paths


Add all the given include paths


Given a set of parsed arguments, apply those

class ppci.lang.c.CPreProcessor(coptions)

A pre-processor for C source code


Register a define

get_define(name: str)

Retrieve the given define!

include(filename, loc, use_current_dir=False)

Turn the given filename into a series of lines

is_defined(name: str) → bool

Check if the given define is defined.


Define predefined macros

process(f, filename=None)

Process the given open file into expanded lines of tokens


Invoked when the __DATE__ macro is expanded


Invoked when the __FILE__ macro is expanded


Invoked when the __LINE__ macro is expanded

undefine(name: str)

Kill a define!

class ppci.lang.c.CParser(context, semantics)

C Parser.

Implemented in a recursive descent way, like the CLANG[1] frontend for llvm, also without the lexer hack[2]. See for the gcc c parser code [3]. Also interesting is the libfim cparser frontend [4].

The clang parser is very versatile, it can handle C, C++, objective C and also do code completion. This one is very simple, it can only parse C.

[1] http://clang.org/ [2] https://en.wikipedia.org/wiki/The_lexer_hack [3] https://raw.githubusercontent.com/gcc-mirror/gcc/master/gcc/c/ c-parser.c

[4] https://github.com/libfirm/cparser/blob/ 0cc43ed4bb4d475728583eadcf9e9726682a838b/src/parser/parser.c


Determine whether we are facing a declaration or not


Advance to the next token


Here the parsing of C is begun …

Parse the given tokens.


Parse a break


Parse a function call


Parse a case


Parse a series of statements surrounded by ‘{‘ and ‘}’


Parse an expression between parenthesis


Parse a constant expression


Parse a continue statement


Parse the rest after the first declaration spec.

For example we have parsed ‘static int’ and we will now parse the rest. This can be either a function, or a sequence of initializable variables. At least it will probably contain some sort of identifier and an optional pointer stuff.


Parse declaration specifiers.

At the end we know type, storage class and qualifiers.

Gathers storage classes: - typedef - extern - static

One of the following type specifiers: - void, char, int, unsigned, long, … - typedef-ed type - struct, union or enum

Type qualifiers: - const - volatile


Parse normal declarations


Given a declaration specifier, parse the rest.

This involves parsing optionally pointers and qualifiers and next the declaration itself.


Parse the default case


Parse a do-while statement


Parse a statement that does nothing!


Parse an enum definition


Parse an expression.

See also: http://en.cppreference.com/w/c/language/operator_precedence


Parse a for statement


Parse function postfix. We have type and name, now parse function arguments

parse_function_declaration(ds, d)

Parse a function declaration with implementation


Parse a goto


Parse an if statement


Parse a label statement


Parse a primary expression


Parse a return statement


Parse a statement


Parse either a statement or a declaration

Returns: a list of statements


Parse a struct or union


Parse an switch statement


Top level start of parsing


Parse the pointer, name and array or function suffixes.

Can be abstract type, and if so, the type may be nameless.

The result of all this action is a type modifier list with the proper type modifications required by the given code.

parse_typedef(ds, d)

Process typedefs


Parse a type specifier used in sizeof(int) for example.


Parse the C-style array or struct initializer stuff


Parse a while statement


Look at the next token to parse without popping it

class ppci.lang.c.CAstPrinter(file=None)

Print AST of a C program

class ppci.lang.c.CSemantics(context)

This class handles the C semantics


Add the given declaration to current scope

apply_type_modifiers(type_modifiers, typ)

Apply the set of type modifiers to the given type

coerce(expr: ppci.lang.generic.nodes.Expression, typ: ppci.lang.c.types.CType)

Try to fit the given expression into the given type


Called at the end of a function

error(message, location)

Trigger an error at the given location

get_common_type(typ1, typ2)

Given two types, determine the common type.

The common type is a type they can both be cast to.


Retrieve a type by type specifiers

init_array(typ, init_env)

Process array initializers

on_array_index(base, index, location)

Check array indexing

on_basic_type(type_specifiers, location)

Handle basic type

on_binop(lhs, op, rhs, location)

Check binary operator

on_builtin_va_arg(arg_pointer, typ, location)

Check va_arg builtin function

on_builtin_va_start(arg_pointer, location)

Check va_start builtin function

on_call(callee, arguments, location)

Check function call for validity

on_cast(to_typ, casted_expr, location)

Check explicit casting

on_char(value, location)

Process a character literal

on_do(body, condition, location)

The almost extinct dodo!

on_enum_value(ctyp, name, value, location)

Handle a single enum value definition

on_field_select(base, field, location)

Check field select expression

on_for(initial, condition, post, body, location)

Check for loop construction

on_function_argument(typ, name, modifiers, location)

Process a function argument into the proper class

on_function_declaration(storage_class, typ, name, modifiers, location)

Handle function declaration

on_if(condition, then_statement, no, location)

Check if statement

on_number(value, location)

React on numeric literal

on_return(value, location)

Check return statement

on_string(value, location)

React on string literal

on_struct_or_union(kind, tag, fields, location)

Handle struct or union definition

on_switch_exit(expression, statement, location)

Handle switch statement

on_ternop(lhs, op, mid, rhs, location)

Handle ternary operator ‘a ? b : c’

on_type(typ, modifiers, location)

Called when a type itself is described

static on_type_qualifiers(type_qualifiers, ctyp)

Handle type qualifiers

on_typedef(typ, name, modifiers, location)

Handle typedef declaration

on_typename(name, location)

Handle the case when a typedef is refered

on_unop(op, a, location)

Check unary operator semantics

on_variable_access(name, location)

Handle variable access

on_variable_declaration(storage_class, typ, name, modifiers, location)

Given a declaration, and a declarator, create the proper object

on_variable_initialization(variable, expression)

Handle a variable initialized to some value

class ppci.lang.c.CSynthesizer

Take an IR-module and convert it into a C-AST.

This does essentially the opposite of the codegenerator.


Synthesize an ir block into C


Convert ir instruction to its corresponding C counterpart

class ppci.lang.c.CPrinter

Render a C program as text


Spit out a declaration


Render compilation unit as C

render_type(typ, name='')

Generate a proper C-string for the given type


digraph "classes_foo" {
"0" [label="{ArrayType|size\l|}", shape="record"];
"1" [label="{BasicType|CHAR : str\lDOUBLE : str\lFLOAT : str\lINT : str\lINTEGER_TYPES : set\lLONG : str\lLONGDOUBLE : str\lLONGLONG : str\lSHORT : str\lUCHAR : str\lUINT : str\lULONG : str\lULONGLONG : str\lUSHORT : str\lVOID : str\ltype_id\l|}", shape="record"];
"2" [label="{CType|is_integer\lis_scalar\lis_struct\lis_union\lis_void\lqualifiers : NoneType\l|}", shape="record"];
"3" [label="{EnumType|complete\lvalues : NoneType\l|}", shape="record"];
"4" [label="{Field|bitsize\lname\loffset\lpadded_size : int\ltyp\l|}", shape="record"];
"5" [label="{FunctionType|argument_types\larguments\lis_vararg : bool\lreturn_type\l|}", shape="record"];
"6" [label="{IndexableType|element_type\l|}", shape="record"];
"7" [label="{PointerType|\l|}", shape="record"];
"8" [label="{StructOrUnionType|complete\lfield_map\lfields : NoneType\lfields : property\lincomplete\ltag : NoneType\l|get_field()\lget_fields()\lhas_field()\lset_fields()\l}", shape="record"];
"9" [label="{StructType|\l|}", shape="record"];
"10" [label="{UnionType|\l|}", shape="record"];
"0" -> "6" [arrowhead="empty", arrowtail="none"];
"1" -> "2" [arrowhead="empty", arrowtail="none"];
"3" -> "2" [arrowhead="empty", arrowtail="none"];
"5" -> "2" [arrowhead="empty", arrowtail="none"];
"6" -> "2" [arrowhead="empty", arrowtail="none"];
"7" -> "6" [arrowhead="empty", arrowtail="none"];
"8" -> "2" [arrowhead="empty", arrowtail="none"];
"9" -> "8" [arrowhead="empty", arrowtail="none"];
"10" -> "8" [arrowhead="empty", arrowtail="none"];
} digraph "classes_foo" {
"0" [label="{ArrayIndex|base\lindex\l|}", shape="record"];
"1" [label="{Binop|a\lb\lop\l|}", shape="record"];
"2" [label="{BuiltIn|\l|}", shape="record"];
"3" [label="{BuiltInVaArg|arg_pointer\l|}", shape="record"];
"4" [label="{BuiltInVaStart|arg_pointer\l|}", shape="record"];
"5" [label="{CExpression|lvalue\ltyp\l|}", shape="record"];
"6" [label="{Cast|expr\lto_typ\l|}", shape="record"];
"7" [label="{CharLiteral|\l|}", shape="record"];
"8" [label="{FieldSelect|base\lfield\l|}", shape="record"];
"9" [label="{FunctionCall|args\lcallee\l|}", shape="record"];
"10" [label="{ImplicitCast|\l|}", shape="record"];
"11" [label="{InitializerList|elements\l|}", shape="record"];
"12" [label="{Literal|value\l|}", shape="record"];
"13" [label="{NumericLiteral|\l|}", shape="record"];
"14" [label="{Sizeof|sizeof_typ\l|}", shape="record"];
"15" [label="{StringLiteral|\l|}", shape="record"];
"16" [label="{Ternop|a\lb\lc\lop\l|}", shape="record"];
"17" [label="{Unop|a\lop\l|}", shape="record"];
"18" [label="{VariableAccess|name\lvariable\l|}", shape="record"];
"0" -> "5" [arrowhead="empty", arrowtail="none"];
"1" -> "5" [arrowhead="empty", arrowtail="none"];
"2" -> "5" [arrowhead="empty", arrowtail="none"];
"3" -> "2" [arrowhead="empty", arrowtail="none"];
"4" -> "2" [arrowhead="empty", arrowtail="none"];
"7" -> "12" [arrowhead="empty", arrowtail="none"];
"8" -> "5" [arrowhead="empty", arrowtail="none"];
"9" -> "5" [arrowhead="empty", arrowtail="none"];
"10" -> "6" [arrowhead="empty", arrowtail="none"];
"12" -> "5" [arrowhead="empty", arrowtail="none"];
"13" -> "12" [arrowhead="empty", arrowtail="none"];
"15" -> "12" [arrowhead="empty", arrowtail="none"];
"16" -> "5" [arrowhead="empty", arrowtail="none"];
"17" -> "5" [arrowhead="empty", arrowtail="none"];
"18" -> "5" [arrowhead="empty", arrowtail="none"];
} digraph "classes_foo" {
"0" [label="{Break|\l|}", shape="record"];
"1" [label="{Case|statement\ltyp\lvalue\l|}", shape="record"];
"2" [label="{Continue|\l|}", shape="record"];
"3" [label="{DeclarationStatement|declaration\l|}", shape="record"];
"4" [label="{Default|statement\l|}", shape="record"];
"5" [label="{DoWhile|body\lcondition\l|}", shape="record"];
"6" [label="{Empty|\l|}", shape="record"];
"7" [label="{ExpressionStatement|expression\l|}", shape="record"];
"8" [label="{For|body\lcondition\linit\lpost\l|}", shape="record"];
"9" [label="{Goto|label\l|}", shape="record"];
"10" [label="{If|condition\lno\lyes\l|}", shape="record"];
"11" [label="{Label|name\lstatement\l|}", shape="record"];
"12" [label="{Return|value\l|}", shape="record"];
"13" [label="{Switch|expression\lstatement\l|}", shape="record"];
"14" [label="{While|body\lcondition\l|}", shape="record"];


This section contains some links to other compilers. These were used to draw inspiration from. In the spirit of: it is better to steal ideas then invent something bad yourself :).

Other C compilers

A c99 frontend for libfirm: https://github.com/libfirm/cparser


tcc: tiny c compiler

This compiler is tiny and uses a single pass to parse and generate code.



This compiler does parsing and type checking in one go.


CDT is an eclipse c/c++ ide.