C compiler¶

This page describes the design of the C compiler. If you want to use the compiler, please see ppci-cc or ppci.api.cc().

Overview¶

The C compilers task is to take C code, and produce intermediate code for the backend. The first step is pre processing, during which a sequence of tokens is transformed into another sequence of tokens. Next comes parsing and code generation. There are two options here: split the parser and code generation passes, such that there is a clean seperation, or do it all at once. Maybe the best choice is to do it all in one pass, and transform a sequence of tokens into IR-code in a single go.

Pre-processor¶

The preprocessor is a strange thing. It must handle trigraphs, backslash newlines and macros.

The top level design of the preprocessor is the following:

• Context: Contains state that would be global otherwise.
• CLexer: processes a raw file into a sequence of tokens
• Preprocessor: takes the token sequence a does macro expansion, resulting in another stream of tokens.
• Output: The token stream maybe outputted to file.
• Feed to compiler: The token stream might be fed into the rest of the compiler.

C compiler¶

The c compiler consists of the classical stages: parsing and codegeneration. Code generation is done to ir-code.

Parsing¶

The C parsing is done by two classes CParser and CSemantics. CParser is a recursive descent parser. It dances a tight dance with the CSemantics class. This idea is taken from the Clang project. The CParser takes a token sequence from the preprocessor and matches the C syntax. Whenever a valid C construct is found, it calls the corresponding function on the CSemantics class. The semantics class keeps track of the current scope and records the global declarations. It also check types and lvalues of expressions. At the end of parsing and this type checking, an abstract syntax tree (AST) is build up. This AST is type checked and variables are resolved. This AST can be used for several purposes, for example static code analysis or style checking. It is also possible to generate C code again from this AST.

L-values¶

The semantics class determines for each expression node whether it is an lvalue or not. The lvalue (short for location value) indicates whether the expression has a memory address, or is a value in a register. Basically it boils down to: can we take the address of this expression with the ‘&’ operator. Numeric literals and the results of addition are not lvalues. Variables are lvalues. The lvalue property is used during code generation to check whether the value must be loaded from memory or not.

Hosted vs freestanding¶

A C compiler can be hosted or freestanding. The difference between those two is that a hosted C compiler also provides the standard C library. A freestanding compiler only contains a few really required header files. As a result a hosted compiler is really a combination of a C compiler and a C standard library implementation. Also, the standard library depends on the operating system which is used, where as a freestanding C compiler can be used independent of operating system. Writing an application using a hosted compiler is easier since the standard library is available.

Code generation¶

Code generation takes the AST as a whole and loops over all its elements and generates the corresponding IR-code snippets from it. At the end of code generation, there is an IR module which can be feed into the optimizers or code generators.

C classes¶

The C frontend can be used to generate an AST from C code. You can use it to parse C code and analyze its structure like this:

>>> from ppci.api import get_arch
>>> from ppci.lang.c import create_ast
>>> src = "int a, *b;"
>>> ast = create_ast(src, get_arch('msp430').info)
>>> ast
Compilation unit with 2 declarations
>>> from ppci.lang.c import CAstPrinter
>>> printer = CAstPrinter()
>>> printer.print(ast)
Compilation unit with 2 declarations
Variable [storage=None typ=Basic type int name=a]
Basic type int
Variable [storage=None typ=Pointer-type name=b]
Pointer-type
Basic type int
>>> ast.declarations[0].name
'a'


Module reference¶

C front end.

ppci.lang.c.create_ast(src, arch_info, filename='<snippet>', coptions=None)

Create a C ast from the given source

ppci.lang.c.preprocess(f, output_file, coptions=None)

Pre-process a file into the other file.

ppci.lang.c.c_to_ir(source: io.TextIOBase, march, coptions=None, reporter=None)

C to ir translation.

Parameters: source (file-like object) – The C source to compile. march (str) – The targetted architecture. coptions – C specific compilation options.
ppci.lang.c.print_ast(ast, file=None)

Display an abstract syntax tree.

ppci.lang.c.parse_text(text, arch='x86_64')

Parse given C sourcecode into an AST

ppci.lang.c.render_ast(ast)

Render a C program as text

For example:

>>> from ppci.lang.c import parse_text, print_ast, render_ast
>>> ast = parse_text('int a;')
>>> print_ast(ast)
Compilation unit with 1 declarations
Variable [storage=None typ=Basic type int name=a]
Basic type int
>>> render_ast(ast)
int a;

ppci.lang.c.parse_type(text, context, filename='foo.c')

Parse given C-type AST

class ppci.lang.c.CBuilder(arch_info, coptions)

C builder that converts C code into ir-code

class ppci.lang.c.CContext(coptions, arch_info)

A context as a substitute for global data

alignment(typ: ppci.lang.c.nodes.types.CType)

Given a type, determine its alignment in bytes

error(message, location)

Trigger an error at the given location

eval_expr(expr)

Evaluate an expression right now! (=at compile time)

gen_global_ival(typ, ival)

Create memory image for initial value of global variable

layout_struct(kind, fields)

Layout the fields in the struct

offsetof(typ, field)

Returns the offset of a field in a struct/union in bytes

pack(typ, value)

Pack a type into proper memory format

sizeof(typ: ppci.lang.c.nodes.types.CType)

Given a type, determine its size in whole bytes

class ppci.lang.c.CLexer(coptions)

Lexer used for the preprocessor

lex(src, filename)

Read a source and generate a series of tokens

lex_c()

Root parsing function

lex_char()

Scan for a complete character constant

lex_string()

Scan for a complete string

lex_text(txt)

Create tokens from the given text

tokenize(characters)

Generate tokens from characters

class ppci.lang.c.COptions

A collection of settings regarding the C language

add_include_path(path)

Add a path to the list of include paths

add_include_paths(paths)

Add all the given include paths

process_args(args)

Given a set of parsed arguments, apply those

class ppci.lang.c.CPreProcessor(coptions)

A pre-processor for C source code

define(macro)

Register a define

get_define(name: str)

Retrieve the given define!

include(filename, loc, use_current_dir=False)

Turn the given filename into a series of lines

is_defined(name: str) → bool

Check if the given define is defined.

predefine_builtin_macros()

Define predefined macros

process(f, filename=None)

Process the given open file into expanded lines of tokens

special_macro_date(macro_token)

Invoked when the __DATE__ macro is expanded

special_macro_file(macro_token)

Invoked when the __FILE__ macro is expanded

special_macro_line(macro_token)

Invoked when the __LINE__ macro is expanded

undefine(name: str)

Kill a define!

class ppci.lang.c.CParser(coptions, semantics)

C Parser.

Implemented in a recursive descent way, like the CLANG[1] frontend for llvm, also without the lexer hack[2]. See for the gcc c parser code [3]. Also interesting is the libfim cparser frontend [4].

The clang parser is very versatile, it can handle C, C++, objective C and also do code completion. This one is very simple, it can only parse C.

is_declaration_statement()

Determine whether we are facing a declaration or not

next_token()

parse(tokens)

Here the parsing of C is begun …

Parse the given tokens.

parse_array_initializer(array_typ)

Process array initializers.

For example an array maybe initilized with a single value. Or, one can use an designated initializer.

parse_attributes()

Parse some attributes.

Examples are:

__attribute__((noreturn))

parse_break_statement()

Parse a break

parse_call(callee)

Parse a function call

parse_case_statement()

Parse a case

parse_compound_statement()

Parse a series of statements surrounded by ‘{‘ and ‘}’

parse_condition()

Parse an expression between parenthesis

parse_constant_expression()

Parse a constant expression

parse_continue_statement()

Parse a continue statement

parse_decl_group(decl_spec)

Parse the rest after the first declaration spec.

For example we have parsed ‘static int’ and we will now parse the rest. This can be either a function, or a sequence of initializable variables. At least it will probably contain some sort of identifier and an optional pointer stuff.

parse_decl_specifiers(allow_storage_class=True)

Parse declaration specifiers.

At the end we know type, storage class and qualifiers.

Gathers storage classes: - typedef - extern - static

One of the following type specifiers: - void, char, int, unsigned, long, … - typedef-ed type - struct, union or enum

Type qualifiers: - const - volatile

parse_declarations()

Parse normal declarations

parse_declarator(abstract=False)

Given a declaration specifier, parse the rest.

This involves parsing optionally pointers and qualifiers and next the declaration itself.

parse_default_statement()

Parse the default case

parse_do_statement()

Parse a do-while statement

parse_empty_statement()

Parse a statement that does nothing!

parse_enum()

Parse an enum definition

parse_enum_fields(ctyp, location)

Parse enum declarations

parse_expression()

Parse an expression.

parse_for_statement()

Parse a for statement

parse_function_arguments()

Parse function postfix. We have type and name, now parse function arguments

parse_function_declaration(decl_spec, d)

Parse a function declaration with implementation

parse_gnu_attribute()

Parse a gnu attribute like __attribute__((noreturn))

parse_goto_statement()

Parse a goto

parse_if_statement()

Parse an if statement

parse_initializer_list(typ)

Based on type, determine what we could expect.

This function is only called somewhere within { and }.

parse_inner_variable_initializer()

Parse array or structure initializer within { and }

parse_label()

Parse a label statement

parse_primary_expression()

Parse a primary expression

parse_return_statement()

Parse a return statement

parse_statement()

Parse a statement

parse_statement_or_declaration()

Parse either a statement or a declaration

Returns: a list of statements

parse_struct_fields()

Parse struct or union fields

parse_struct_initializer(struct_typ, initializer)

Match an initializer onto a struct type

parse_struct_or_union()

Parse a struct or union

parse_switch_statement()

Parse an switch statement

parse_translation_unit()

Top level start of parsing

parse_type_modifiers(abstract=False)

Parse the pointer, name and array or function suffixes.

Can be abstract type, and if so, the type may be nameless.

The result of all this action is a type modifier list with the proper type modifications required by the given code.

parse_typedef(decl_spec, d)

Process typedefs

parse_typename()

Parse a type specifier used in sizeof(int) for example.

parse_union_initializer(union_typ, initializer)

Match an initializer onto a union type

parse_variable_initializer(typ)

Parse the C-style array or struct initializer stuff.

Heavily copied from: https://github.com/rui314/8cc/blob/master/parse.c

Argument is the type to initialize.

An initialization can be one of: = 1; = {1, 2, 3}; = {[0] = 3}; // C99 = {.foobar = {23, 3}}; // C99 = {[2..5] = 2}; // C99

parse_while_statement()

Parse a while statement

peek

Look at the next token to parse without popping it

class ppci.lang.c.CAstPrinter(file=None)

Print AST of a C program

class ppci.lang.c.CSemantics(context)

This class handles the C semantics

add_declaration(declaration)

Add the given declaration to current scope

apply_type_modifiers(type_modifiers, typ)

Apply the set of type modifiers to the given type

coerce(expr: ppci.lang.generic.nodes.Expression, typ: ppci.lang.c.nodes.types.CType)

Try to fit the given expression into the given type

end_function(body)

Called at the end of a function

static error(message, location)

Trigger an error at the given location

get_common_type(typ1, typ2, location)

Given two types, determine the common type.

The common type is a type they can both be cast to.

get_type(type_specifiers)

Retrieve a type by type specifiers

not_impl(message, location)

Call this function to mark unimplemented code

on_array_index(base, index, location)

Check array indexing

on_basic_type(type_specifiers, location)

Handle basic type

on_binop(lhs, op, rhs, location)

Check binary operator

on_builtin_offsetof(typ, member, location)

Check offsetof builtin function

on_builtin_va_arg(arg_pointer, typ, location)

Check va_arg builtin function

on_builtin_va_start(arg_pointer, location)

Check va_start builtin function

on_call(callee, arguments, location)

Check function call for validity

on_case(value, statement, location)

Handle a case statement

on_cast(to_typ, casted_expr, location)

Check explicit casting

on_char(value, location)

Process a character literal

on_default(statement, location)

Handle a default label

on_do(body, condition, location)

The almost extinct dodo!

on_enum(tag, location)

Handle enum declaration

on_enum_value(ctyp, name, value, location)

Handle a single enum value definition

on_field_def(storage_class, ctyp, name, modifiers, bitsize, location)

Handle a struct/union field declaration

on_field_select(base, field, location)

Check field select expression

on_for(initial, condition, post, body, location)

Check for loop construction

on_function_argument(typ, name, modifiers, location)

Process a function argument into the proper class

on_function_declaration(storage_class, typ, name, modifiers, location)

Handle function declaration

on_if(condition, then_statement, no, location)

Check if statement

on_number(value, location)

React on numeric literal

on_return(value, location)

Check return statement

on_sizeof(typ, location)

Handle sizeof contraption

on_string(value, location)

React on string literal

on_struct_or_union(kind, tag, fields, location)

Handle struct or union definition

on_switch_exit(expression, statement, location)

Handle switch statement

on_ternop(lhs, op, mid, rhs, location)

Handle ternary operator ‘a ? b : c’

on_type(typ, modifiers, location)

Called when a type itself is described

static on_type_qualifiers(type_qualifiers, ctyp)

Handle type qualifiers

on_typedef(typ, name, modifiers, location)

Handle typedef declaration

on_typename(name, location)

Handle the case when a typedef is refered

on_unop(op, a, location)

Check unary operator semantics

on_variable_access(name, location)

Handle variable access

on_variable_declaration(storage_class, typ, name, modifiers, location)

Given a declaration, and a declarator, create the proper object

on_variable_initialization(variable, expression)

Handle a variable initialized to some value

on_while(condition, body, location)

Handle the while statement

class ppci.lang.c.CSynthesizer

Take an IR-module and convert it into a C-AST.

This does essentially the opposite of the codegenerator.

syn_block(block)

Synthesize an ir block into C

syn_instruction(instruction)

Convert ir instruction to its corresponding C counterpart

class ppci.lang.c.CPrinter

Render a C program as text

gen_declaration(declaration)

Spit out a declaration

gen_expr(expr)

Format an expression as text

gen_statement(statement)

Render a single statement as text

print(compile_unit)

Render compilation unit as C

render_type(typ, name=None)

Generate a proper C-string for the given type

References¶

This section contains some links to other compilers. These were used to draw inspiration from. In the spirit of: it is better to steal ideas then invent something bad yourself :).

Other C compilers¶

A c99 frontend for libfirm: https://github.com/libfirm/cparser

tcc

tcc: tiny c compiler

This compiler is tiny and uses a single pass to parse and generate code.

lcc

https://github.com/drh/lcc

This compiler does parsing and type checking in one go.

cdt

CDT is an eclipse c/c++ ide.

http://wiki.eclipse.org/CDT/designs/Overview_of_Parsing

Cpp internal description:

https://gcc.gnu.org/onlinedocs/cppinternals/

A portable C preprocessor:

http://mcpp.sourceforge.net

The boost wave preprocessor:

http://www.boost.org/doc/libs/1_63_0/libs/wave/

A java implementation of the preprocessor:

http://www.anarres.org/projects/jcpp/

https://github.com/shevek/jcpp

The lcc preprocessor part:

https://github.com/drh/lcc/blob/master/cpp/cpp.h