C compiler

Overview

digraph x {
codegen [label="code generator" ]
source -> lexer [label="text"]
lexer -> preprocessor [label="tokens"]
preprocessor -> parser [label="tokens"]
preprocessor -> source [label="include"]
parser -> semantics [label="callback functions"]
semantics -> scope
semantics -> codegen [label="parsed program (AST)"]
}

The C compilers task is to take C code, and produce intermediate code for the backend. The first step is pre processing, during which a sequence of tokens is transformed into another sequence of tokens. Next comes parsing and code generation. There are two options here: split the parser and code generation passes, such that there is a clean seperation, or do it all at once. Maybe the best choice is to do it all in one pass, and transform a sequence of tokens into IR-code in a single go.

Pre-processor

The preprocessor is a strange thing. It must handle trigraphs, backslash newlines and macros.

The top level design of the preprocessor is the following:

  • Context: Contains state that would be global otherwise.
  • Lexer: processes a raw file into a sequence of tokens
  • Preprocessor: takes the token sequence a does macro expansion, resulting in another stream of tokens.
  • Output: The token stream maybe outputted to file.
  • Feed to compiler: The token stream might be fed into the rest of the compiler.

C compiler

The c compiler consists of the classical stages: parsing and codegeneration. Code generation is done to ir-code.

Parsing

The C parsing is done by two classes CParser and CSemantics. CParser is a recursive descent parser. It dances a tight dance with the CSemantics class. This idea is taken from the Clang project. The CParser takes a token sequence from the preprocessor and matches the C syntax. Whenever a valid C construct is found, it calls the corresponding function on the CSemantics class. The semantics class keeps track of the current scope and records the global declarations. It also check types and lvalues of expressions. At the end of parsing and this type checking, an abstract syntax tree (AST) is build up. This AST is type checked and variables are resolved. This AST can be used for several purposes, for example static code analysis or style checking. It is also possible to generate C code again from this AST.

L-values

The semantics class determines for each expression node whether it is an lvalue or not. The lvalue (short for location value) indicates whether the expression has a memory address, or is a value in a register. Basically it boils down to: can we take the address of this expression with the ‘&’ operator. Numeric literals and the results of addition are not lvalues. Variables are lvalues. The lvalue property is used during code generation to check whether the value must be loaded from memory or not.

Types

Code generation

Code generation takes the AST as a whole and loops over all its elements and generates the corresponding IR-code snippets from it. At the end of code generation, there is an IR module which can be feed into the optimizers or code generators.

C classes

The C frontend can be used to generate an AST from C code. You can use it to parse C code and analyze its structure like this:

>>> from ppci.api import get_arch
>>> from ppci.lang.c import create_ast
>>> src = "int a, *b;"
>>> ast = create_ast(src, get_arch('msp430').info)
>>> ast
Compilation unit with 2 declarations
>>> from ppci.lang.c import CAstPrinter
>>> printer = CAstPrinter()
>>> printer.print(ast)
Compilation unit with 2 declarations
    Variable [storage=None typ=Basic type int name=a]
        Basic type int
    Variable [storage=None typ=Pointer-type name=b]
        Pointer-type
            Basic type int
>>> ast.declarations[0].name
'a'

Module reference

C front end.

ppci.lang.c.create_ast(src, arch_info, filename='<snippet>', coptions=None)

Create a C ast from the given source

ppci.lang.c.preprocess(f, output_file, coptions=None)

Pre-process a file into the other file.

ppci.lang.c.c_to_ir(source: io.TextIOBase, march, coptions=None, reporter=None)

C to ir translation.

Parameters:
  • source (file-like object) – The C source to compile.
  • march (str) – The targetted architecture.
  • coptions – C specific compilation options.
Returns:

An ppci.ir.Module.

class ppci.lang.c.CBuilder(arch_info, coptions)

C builder that converts C code into ir-code

class ppci.lang.c.CContext(coptions, arch_info)

A context as a substitute for global data

alignment(typ: ppci.lang.c.types.CType)

Given a type, determine its alignment in bytes

equal_types(typ1, typ2, unqualified=False)

Check for type equality

error(message, location)

Trigger an error at the given location

eval_expr(expr)

Evaluate an expression right now! (=at compile time)

gen_global_ival(typ, ival)

Create memory image for initial value of global variable

get_type(type_specifiers)

Create a new instance for the given type specifiers

is_const_expr(expr)

Test if an expression can be evaluated at compile time

is_valid(type_specifiers)

Check if the type specifiers refer to a valid basic type

pack(typ, value)

Pack a type into proper memory format

sizeof(typ: ppci.lang.c.types.CType)

Given a type, determine its size in whole bytes

class ppci.lang.c.CLexer(coptions)

Lexer used for the preprocessor

lex(src, filename)

Read a source and generate a series of tokens

lex_c()

Root parsing function

lex_char()

Scan for a complete character constant

lex_string()

Scan for a complete string

lex_text(txt)

Create tokens from the given text

tokenize(characters)

Generate tokens from characters

class ppci.lang.c.COptions

A collection of settings regarding the C language

add_include_path(path)

Add a path to the list of include paths

add_include_paths(paths)

Add all the given include paths

process_args(args)

Given a set of parsed arguments, apply those

class ppci.lang.c.CPreProcessor(coptions)

A pre-processor for C source code

define(macro)

Register a define

get_define(name: str)

Retrieve the given define!

include(filename, loc, use_current_dir=False)

Turn the given filename into a series of lines

is_defined(name: str) → bool

Check if the given define is defined.

predefine_builtin_macros()

Define predefined macros

process(f, filename=None)

Process the given open file into expanded lines of tokens

special_macro_date(macro_token)

Invoked when the __DATE__ macro is expanded

special_macro_file(macro_token)

Invoked when the __FILE__ macro is expanded

special_macro_line(macro_token)

Invoked when the __LINE__ macro is expanded

undefine(name: str)

Kill a define!

class ppci.lang.c.CParser(context, semantics)

C Parser.

Implemented in a recursive descent way, like the CLANG[1] frontend for llvm, also without the lexer hack[2]. See for the gcc c parser code [3]. Also interesting is the libfim cparser frontend [4].

The clang parser is very versatile, it can handle C, C++, objective C and also do code completion. This one is very simple, it can only parse C.

[1] http://clang.org/ [2] https://en.wikipedia.org/wiki/The_lexer_hack [3] https://raw.githubusercontent.com/gcc-mirror/gcc/master/gcc/c/ c-parser.c

[4] https://github.com/libfirm/cparser/blob/ 0cc43ed4bb4d475728583eadcf9e9726682a838b/src/parser/parser.c

is_declaration_statement()

Determine whether we are facing a declaration or not

next_token()

Advance to the next token

parse(tokens)

Here the parsing of C is begun …

Parse the given tokens.

parse_break_statement()

Parse a break

parse_call(callee)

Parse a function call

parse_case_statement()

Parse a case

parse_compound_statement()

Parse a series of statements surrounded by ‘{‘ and ‘}’

parse_condition()

Parse an expression between parenthesis

parse_constant_expression()

Parse a constant expression

parse_continue_statement()

Parse a continue statement

parse_decl_group(ds)

Parse the rest after the first declaration spec.

For example we have parsed ‘static int’ and we will now parse the rest. This can be either a function, or a sequence of initializable variables. At least it will probably contain some sort of identifier and an optional pointer stuff.

parse_decl_specifiers(allow_storage_class=True)

Parse declaration specifiers.

At the end we know type, storage class and qualifiers.

Gathers storage classes: - typedef - extern - static

One of the following type specifiers: - void, char, int, unsigned, long, … - typedef-ed type - struct, union or enum

Type qualifiers: - const - volatile

parse_declarations()

Parse normal declarations

parse_declarator(abstract=False)

Given a declaration specifier, parse the rest.

This involves parsing optionally pointers and qualifiers and next the declaration itself.

parse_default_statement()

Parse the default case

parse_do_statement()

Parse a do-while statement

parse_empty_statement()

Parse a statement that does nothing!

parse_enum()

Parse an enum definition

parse_expression()

Parse an expression.

See also: http://en.cppreference.com/w/c/language/operator_precedence

parse_for_statement()

Parse a for statement

parse_function_arguments()

Parse function postfix. We have type and name, now parse function arguments

parse_function_declaration(ds, d)

Parse a function declaration with implementation

parse_goto_statement()

Parse a goto

parse_if_statement()

Parse an if statement

parse_label()

Parse a label statement

parse_primary_expression()

Parse a primary expression

parse_return_statement()

Parse a return statement

parse_statement()

Parse a statement

parse_statement_or_declaration()

Parse either a statement or a declaration

Returns: a list of statements

parse_struct_or_union()

Parse a struct or union

parse_switch_statement()

Parse an switch statement

parse_translation_unit()

Top level start of parsing

parse_type_modifiers(abstract=False)

Parse the pointer, name and array or function suffixes.

Can be abstract type, and if so, the type may be nameless.

The result of all this action is a type modifier list with the proper type modifications required by the given code.

parse_typedef(ds, d)

Process typedefs

parse_typename()

Parse a type specifier used in sizeof(int) for example.

parse_variable_initializer()

Parse the C-style array or struct initializer stuff

parse_while_statement()

Parse a while statement

peak

Look at the next token to parse without popping it

class ppci.lang.c.CAstPrinter(file=None)

Print AST of a C program

class ppci.lang.c.CSemantics(context)

This class handles the C semantics

add_declaration(declaration)

Add the given declaration to current scope

apply_type_modifiers(type_modifiers, typ)

Apply the set of type modifiers to the given type

coerce(expr: ppci.lang.generic.nodes.Expression, typ: ppci.lang.c.types.CType)

Try to fit the given expression into the given type

end_function(body)

Called at the end of a function

error(message, location)

Trigger an error at the given location

get_common_type(typ1, typ2)

Given two types, determine the common type.

The common type is a type they can both be cast to.

get_type(type_specifiers)

Retrieve a type by type specifiers

init_array(typ, init_env)

Process array initializers

on_array_index(base, index, location)

Check array indexing

on_basic_type(type_specifiers, location)

Handle basic type

on_binop(lhs, op, rhs, location)

Check binary operator

on_builtin_va_arg(arg_pointer, typ, location)

Check va_arg builtin function

on_builtin_va_start(arg_pointer, location)

Check va_start builtin function

on_call(callee, arguments, location)

Check function call for validity

on_cast(to_typ, casted_expr, location)

Check explicit casting

on_char(value, location)

Process a character literal

on_do(body, condition, location)

The almost extinct dodo!

on_enum_value(ctyp, name, value, location)

Handle a single enum value definition

on_field_select(base, field, location)

Check field select expression

on_for(initial, condition, post, body, location)

Check for loop construction

on_function_argument(typ, name, modifiers, location)

Process a function argument into the proper class

on_if(condition, then_statement, no, location)

Check if statement

on_number(value, location)

React on numeric literal

on_return(value, location)

Check return statement

on_string(value, location)

React on string literal

on_struct_or_union(kind, tag, fields, location)

Handle struct or union definition

on_switch_exit(expression, statement, location)

Handle switch statement

on_ternop(lhs, op, mid, rhs, location)

Handle ternary operator ‘a ? b : c’

on_type(typ, modifiers, location)

Called when a type itself is described

static on_type_qualifiers(type_qualifiers, ctyp)

Handle type qualifiers

on_typedef(typ, name, modifiers, location)

Handle typedef declaration

on_typename(name, location)

Handle the case when a typedef is refered

on_unop(op, a, location)

Check unary operator semantics

on_variable_access(name, location)

Handle variable access

on_variable_declaration(storage_class, typ, name, modifiers, location)

Given a declaration, and a declarator, create the proper object

on_variable_initialization(variable, expression)

Handle a variable initialized to some value

class ppci.lang.c.CSynthesizer

Take an IR-module and convert it into a C-AST.

This does essentially the opposite of the codegenerator.

syn_block(block)

Synthesize an ir block into C

syn_instruction(instruction)

Convert ir instruction to its corresponding C counterpart

class ppci.lang.c.CPrinter

Render a C program as text

gen_declaration(declaration)

Spit out a declaration

print(cu)

Render compilation unit as C

render_type(typ, name='')

Generate a proper C-string for the given type

Uml

digraph "classes_foo" {
charset="utf-8"
rankdir=BT
"0" [label="{ArrayType|size\l|}", shape="record"];
"1" [label="{BasicType|CHAR : str\lDOUBLE : str\lFLOAT : str\lINT : str\lINTEGER_TYPES : set\lLONG : str\lLONGDOUBLE : str\lLONGLONG : str\lSHORT : str\lUCHAR : str\lUINT : str\lULONG : str\lULONGLONG : str\lUSHORT : str\lVOID : str\ltype_id\l|}", shape="record"];
"2" [label="{CType|is_integer\lis_scalar\lis_struct\lis_union\lis_void\lqualifiers : NoneType\l|}", shape="record"];
"3" [label="{EnumType|complete\lvalues : NoneType\l|}", shape="record"];
"4" [label="{Field|bitsize\lname\loffset\lpadded_size : int\ltyp\l|}", shape="record"];
"5" [label="{FunctionType|argument_types\larguments\lis_vararg : bool\lreturn_type\l|}", shape="record"];
"6" [label="{IndexableType|element_type\l|}", shape="record"];
"7" [label="{PointerType|\l|}", shape="record"];
"8" [label="{StructOrUnionType|complete\lfield_map\lfields : NoneType\lfields : property\lincomplete\ltag : NoneType\l|get_field()\lget_fields()\lhas_field()\lset_fields()\l}", shape="record"];
"9" [label="{StructType|\l|}", shape="record"];
"10" [label="{UnionType|\l|}", shape="record"];
"0" -> "6" [arrowhead="empty", arrowtail="none"];
"1" -> "2" [arrowhead="empty", arrowtail="none"];
"3" -> "2" [arrowhead="empty", arrowtail="none"];
"5" -> "2" [arrowhead="empty", arrowtail="none"];
"6" -> "2" [arrowhead="empty", arrowtail="none"];
"7" -> "6" [arrowhead="empty", arrowtail="none"];
"8" -> "2" [arrowhead="empty", arrowtail="none"];
"9" -> "8" [arrowhead="empty", arrowtail="none"];
"10" -> "8" [arrowhead="empty", arrowtail="none"];
} digraph "classes_foo" {
charset="utf-8"
rankdir=BT
"0" [label="{ArrayIndex|base\lindex\l|}", shape="record"];
"1" [label="{Binop|a\lb\lop\l|}", shape="record"];
"2" [label="{BuiltIn|\l|}", shape="record"];
"3" [label="{BuiltInVaArg|arg_pointer\l|}", shape="record"];
"4" [label="{BuiltInVaStart|arg_pointer\l|}", shape="record"];
"5" [label="{CExpression|lvalue\ltyp\l|}", shape="record"];
"6" [label="{Cast|expr\lto_typ\l|}", shape="record"];
"7" [label="{CharLiteral|\l|}", shape="record"];
"8" [label="{FieldSelect|base\lfield\l|}", shape="record"];
"9" [label="{FunctionCall|args\lcallee\l|}", shape="record"];
"10" [label="{ImplicitCast|\l|}", shape="record"];
"11" [label="{InitializerList|elements\l|}", shape="record"];
"12" [label="{Literal|value\l|}", shape="record"];
"13" [label="{NumericLiteral|\l|}", shape="record"];
"14" [label="{Sizeof|sizeof_typ\l|}", shape="record"];
"15" [label="{StringLiteral|\l|}", shape="record"];
"16" [label="{Ternop|a\lb\lc\lop\l|}", shape="record"];
"17" [label="{Unop|a\lop\l|}", shape="record"];
"18" [label="{VariableAccess|name\lvariable\l|}", shape="record"];
"0" -> "5" [arrowhead="empty", arrowtail="none"];
"1" -> "5" [arrowhead="empty", arrowtail="none"];
"2" -> "5" [arrowhead="empty", arrowtail="none"];
"3" -> "2" [arrowhead="empty", arrowtail="none"];
"4" -> "2" [arrowhead="empty", arrowtail="none"];
"7" -> "12" [arrowhead="empty", arrowtail="none"];
"8" -> "5" [arrowhead="empty", arrowtail="none"];
"9" -> "5" [arrowhead="empty", arrowtail="none"];
"10" -> "6" [arrowhead="empty", arrowtail="none"];
"12" -> "5" [arrowhead="empty", arrowtail="none"];
"13" -> "12" [arrowhead="empty", arrowtail="none"];
"15" -> "12" [arrowhead="empty", arrowtail="none"];
"16" -> "5" [arrowhead="empty", arrowtail="none"];
"17" -> "5" [arrowhead="empty", arrowtail="none"];
"18" -> "5" [arrowhead="empty", arrowtail="none"];
} digraph "classes_foo" {
charset="utf-8"
rankdir=BT
"0" [label="{Break|\l|}", shape="record"];
"1" [label="{Case|statement\ltyp\lvalue\l|}", shape="record"];
"2" [label="{Continue|\l|}", shape="record"];
"3" [label="{DeclarationStatement|declaration\l|}", shape="record"];
"4" [label="{Default|statement\l|}", shape="record"];
"5" [label="{DoWhile|body\lcondition\l|}", shape="record"];
"6" [label="{Empty|\l|}", shape="record"];
"7" [label="{ExpressionStatement|expression\l|}", shape="record"];
"8" [label="{For|body\lcondition\linit\lpost\l|}", shape="record"];
"9" [label="{Goto|label\l|}", shape="record"];
"10" [label="{If|condition\lno\lyes\l|}", shape="record"];
"11" [label="{Label|name\lstatement\l|}", shape="record"];
"12" [label="{Return|value\l|}", shape="record"];
"13" [label="{Switch|expression\lstatement\l|}", shape="record"];
"14" [label="{While|body\lcondition\l|}", shape="record"];
}

References

This section contains some links to other compilers. These were used to draw inspiration from. In the spirit of: it is better to steal ideas then invent something bad yourself :).

Other C compilers

A c99 frontend for libfirm: https://github.com/libfirm/cparser

tcc

tcc: tiny c compiler

This compiler is tiny and uses a single pass to parse and generate code.

lcc

https://github.com/drh/lcc

This compiler does parsing and type checking in one go.

cdt

CDT is an eclipse c/c++ ide.

http://wiki.eclipse.org/CDT/designs/Overview_of_Parsing