C compiler¶
This page describes the design of the C compiler. If you want to use the
compiler, please see ppci-cc or ppci.api.cc()
.
Overview¶
The C compilers task is to take C code, and produce intermediate code for the backend. The first step is pre processing, during which a sequence of tokens is transformed into another sequence of tokens. Next comes parsing and code generation. There are two options here: split the parser and code generation passes, such that there is a clean seperation, or do it all at once. Maybe the best choice is to do it all in one pass, and transform a sequence of tokens into IR-code in a single go.
Pre-processor¶
The preprocessor is a strange thing. It must handle trigraphs, backslash newlines and macros.
The top level design of the preprocessor is the following:
- Context: Contains state that would be global otherwise.
CLexer
: processes a raw file into a sequence of tokens- Preprocessor: takes the token sequence a does macro expansion, resulting in another stream of tokens.
- Output: The token stream maybe outputted to file.
- Feed to compiler: The token stream might be fed into the rest of the compiler.
C compiler¶
The c compiler consists of the classical stages: parsing and codegeneration. Code generation is done to ir-code.
Parsing¶
The C parsing is done by two classes CParser
and CSemantics
.
CParser is
a recursive descent parser. It dances a tight dance with the CSemantics class.
This idea is taken from the Clang project. The CParser takes a token sequence
from the preprocessor and matches the C syntax. Whenever a valid C construct
is found, it calls the corresponding function on the CSemantics class. The
semantics class keeps track of the current scope and records the global
declarations. It also check types and lvalues of expressions. At the end of
parsing and this type checking, an abstract syntax tree (AST) is build up.
This AST is type checked and variables are resolved. This AST can be used
for several purposes, for example static code analysis or style checking. It
is also possible to generate C code again from this AST.
L-values¶
The semantics class determines for each expression node whether it is an lvalue or not. The lvalue (short for location value) indicates whether the expression has a memory address, or is a value in a register. Basically it boils down to: can we take the address of this expression with the ‘&’ operator. Numeric literals and the results of addition are not lvalues. Variables are lvalues. The lvalue property is used during code generation to check whether the value must be loaded from memory or not.
Types¶
Hosted vs freestanding¶
A C compiler can be hosted or freestanding. The difference between those two is that a hosted C compiler also provides the standard C library. A freestanding compiler only contains a few really required header files. As a result a hosted compiler is really a combination of a C compiler and a C standard library implementation. Also, the standard library depends on the operating system which is used, where as a freestanding C compiler can be used independent of operating system. Writing an application using a hosted compiler is easier since the standard library is available.
Code generation¶
Code generation takes the AST as a whole and loops over all its elements and generates the corresponding IR-code snippets from it. At the end of code generation, there is an IR module which can be feed into the optimizers or code generators.
C classes¶
The C frontend can be used to generate an AST from C code. You can use it to parse C code and analyze its structure like this:
>>> from ppci.api import get_arch
>>> from ppci.lang.c import create_ast
>>> src = "int a, *b;"
>>> ast = create_ast(src, get_arch('msp430').info)
>>> ast
Compilation unit with 2 declarations
>>> from ppci.lang.c import CAstPrinter
>>> printer = CAstPrinter()
>>> printer.print(ast)
Compilation unit with 2 declarations
Variable [storage=None typ=Basic type int name=a]
Basic type int
Variable [storage=None typ=Pointer-type name=b]
Pointer-type
Basic type int
>>> ast.declarations[0].name
'a'
Module reference¶
C front end.
-
ppci.lang.c.
create_ast
(src, arch_info, filename='<snippet>', coptions=None)¶ Create a C ast from the given source
-
ppci.lang.c.
preprocess
(f, output_file, coptions=None)¶ Pre-process a file into the other file.
-
ppci.lang.c.
c_to_ir
(source: io.TextIOBase, march, coptions=None, reporter=None)¶ C to ir translation.
Parameters: - source (file-like object) – The C source to compile.
- march (str) – The targetted architecture.
- coptions – C specific compilation options.
Returns: An
ppci.ir.Module
.
-
ppci.lang.c.
print_ast
(ast, file=None)¶ Display an abstract syntax tree.
-
ppci.lang.c.
parse_text
(text, arch='x86_64')¶ Parse given C sourcecode into an AST
-
ppci.lang.c.
render_ast
(ast)¶ Render a C program as text
For example:
>>> from ppci.lang.c import parse_text, print_ast, render_ast >>> ast = parse_text('int a;') >>> print_ast(ast) Compilation unit with 1 declarations Variable [storage=None typ=Basic type int name=a] Basic type int >>> render_ast(ast) int a;
-
ppci.lang.c.
parse_type
(text, context, filename='foo.c')¶ Parse given C-type AST.
For example:
>>> from ppci.api import get_arch >>> from ppci.lang.c import parse_type, CContext, COptions >>> msp430_arch = get_arch('msp430') >>> coptions = COptions() >>> context = CContext(coptions, msp430_arch.info) >>> ast = parse_type('int[2]', context) >>> context.eval_expr(ast.size) 2 >>> context.sizeof(ast) 4
-
class
ppci.lang.c.
CBuilder
(arch_info, coptions)¶ C builder that converts C code into ir-code
-
class
ppci.lang.c.
CContext
(coptions, arch_info)¶ A context as a substitute for global data
-
alignment
(typ: ppci.lang.c.nodes.types.CType)¶ Given a type, determine its alignment in bytes
-
static
error
(message, location, hints=None)¶ Trigger an error at the given location
-
eval_expr
(expr)¶ Evaluate an expression right now! (=at compile time)
-
get_field
(typ, field_name)¶ Get the given field.
-
get_field_offsets
(typ)¶ Get a dictionary with offset of fields
-
has_field
(typ, field_name)¶ Check if the given type has the given field.
-
layout_struct
(typ)¶ Layout the fields in the struct.
Things to take in account: - alignment - bit packing - anonynous types
-
offsetof
(typ, field)¶ Returns the offset of a field in a struct/union in bytes
-
pack
(typ, value)¶ Pack a type into proper memory format
-
sizeof
(typ: ppci.lang.c.nodes.types.CType)¶ Given a type, determine its size in whole bytes
-
warning
(message, location, hints=None)¶ Trigger a warning at the given location
-
-
class
ppci.lang.c.
CLexer
(coptions)¶ Lexer used for the preprocessor
-
lex
(src, source_file)¶ Read a source and generate a series of tokens
-
lex_c
()¶ Root parsing function
-
lex_char
()¶ Scan for a complete character constant
-
lex_string
()¶ Scan for a complete string
-
lex_text
(txt)¶ Create tokens from the given text
-
tokenize
(characters)¶ Generate tokens from characters
-
-
class
ppci.lang.c.
COptions
¶ A collection of settings regarding the C language
-
add_include_path
(path)¶ Add a path to the list of include paths
-
add_include_paths
(paths)¶ Add all the given include paths
-
classmethod
from_args
(args)¶ Create a new options object from parsed arguments.
-
process_args
(args)¶ Given a set of parsed arguments, apply those
-
-
class
ppci.lang.c.
CPreProcessor
(coptions)¶ A pre-processor for C source code
-
concat
(lhs, rhs)¶ Concatenate two tokens
-
concatenate
(tokens)¶ Handle the ‘##’ token concatenation operator
-
consume
(typ=None, expand=True)¶ Consume a token of a certain type
-
copy_tokens
(tokens, first_space)¶ Copy a series of tokens.
-
define
(macro)¶ Register a macro
-
define_object_macro
(name, text, protected=False)¶ Define an object like macro.
-
define_special_macro
(name, handler)¶ Define a spcial macro which has a callback function.
-
do_if
(condition, location)¶ Handle #if/#ifdef/#ifndef.
-
eat_line
()¶ Eat up all tokens until the end of the line.
This does not expand macros.
-
error
(msg, hints=None, loc=None)¶ We hit an error condition.
-
eval_expr
()¶ Evaluate an expression
-
expand
(macro_token)¶ Expand a single token into possibly more tokens.
-
expand_macro
(macro, macro_token)¶ Expand a single macro.
-
expand_token_sequence
(tokens)¶ Macro expand a sequence of tokens.
-
gatherargs
(macro)¶ Collect expanded arguments for macro
-
get_define
(name: str)¶ Retrieve the given define!
-
handle_define_directive
(directive_token)¶ Process #define directive.
-
handle_directive
(loc)¶ Handle a single preprocessing directive
-
handle_elif_directive
(directive_token)¶ Process #elif directive.
-
handle_else_directive
(directive_token)¶ Process the #else directive.
-
handle_endif_directive
(directive_token)¶ Process the #endif directive.
-
handle_error_directive
(directive_token)¶ Process #error directive.
-
handle_if_directive
(directive_token)¶ Process an #if directive.
-
handle_ifdef_directive
(directive_token)¶ Handle an #ifdef directive.
-
handle_ifndef_directive
(directive_token)¶ Handle an #ifndef directive.
-
handle_include_directive
(directive_token)¶ Process the #include directive.
-
handle_include_next_directive
(directive_token)¶ Process the #include_next directive.
-
handle_line_directive
(directive_token)¶ Process #line directive.
-
handle_pragma_directive
(directive_token)¶ Process #pragma directive.
-
handle_undef_directive
(directive_token)¶ Process #undef directive.
-
handle_warning_directive
(directive_token)¶ Process #warning directive.
-
in_hideset
(name)¶ Test if the given macro is contained in the current hideset.
-
include
(filename, loc, use_current_dir=False, include_next=False)¶ Turn the given filename into a series of tokens.
-
is_defined
(name: str) → bool¶ Check if the given define is defined.
-
locate_include
(filename, loc, use_current_dir: bool, include_next)¶ Determine which file to use given the include filename.
Parameters: - loc (-) – the location where this include is included.
- use_current_dir (-) – If true, look in the directory of the current file.
-
static
make_token
(from_token, typ, value)¶ Create a new token from another token.
-
next_token
(expand=True)¶ Get next token
-
normalize_space
(args)¶ Normalize spaces in macro expansions.
If we have space, it will be a single space.
-
parse_arguments
()¶ Parse arguments for a function like macro.
- Keep track of parenthesis level.
-
parse_expression
(priority=0)¶ Parse an expression in an #if
- Idea taken from: https://github.com/shevek/jcpp/blob/master/
- src/main/java/org/anarres/cpp/Preprocessor.java
-
parse_included_filename
()¶ Parse filename after #include/#include_next
-
predefine_builtin_macros
()¶ Define predefined macros
-
process_file
(f, filename=None)¶ Process the given open file into tokens.
-
process_tokens
()¶ Process a sequence of tokens into an expanded token sequence.
This function returns an tokens that must be looped over.
-
push_expansion
(expansion)¶ Push a macro expansion on the stack.
-
skip_excluded_block
()¶ Skip the block excluded by if/ifdef.
Skip tokens until we hit #endif or alike.
-
special_macro_counter
(macro_token)¶ Implement __COUNTER__ macro
-
special_macro_date
(macro_token)¶ Invoked when the __DATE__ macro is expanded
-
special_macro_file
(macro_token)¶ Invoked when the __FILE__ macro is expanded
-
special_macro_include_level
(macro_token)¶ Implement __INCLUDE_LEVEL__ macro
-
special_macro_line
(macro_token)¶ Invoked when the __LINE__ macro is expanded
-
special_macro_time
(macro_token)¶ Implement __TIME__ macro
-
stringify
(hash_token, snippet, loc)¶ Handle the ‘#’ stringify operator.
Take care of: - single space between the tokens being stringified - no spaces before first and after last token - escape double quotes of strings and backslash inside strings.
-
substitute_arguments
(macro, args)¶ Return macro contents with substituted arguments.
Pay special care to # and ## operators, When an argument is used in # or ##, it is not macro expanded.
-
token
¶ Peek one token ahead without taking it.
-
tokens_to_string
(tokens)¶ Create a text from the given tokens
-
undefine
(name: str)¶ Kill a define!
-
unget_token
(token)¶ Undo token consumption.
-
-
class
ppci.lang.c.
CParser
(coptions, semantics)¶ C Parser.
Implemented in a recursive descent way, like the CLANG[1] frontend for llvm, also without the lexer hack[2]. See for the gcc c parser code [3]. Also interesting is the libfim cparser frontend [4].
The clang parser is very versatile, it can handle C, C++, objective C and also do code completion. This one is very simple, it can only parse C.
[1] http://clang.org/ [2] https://en.wikipedia.org/wiki/The_lexer_hack [3] https://raw.githubusercontent.com/gcc-mirror/gcc/master/gcc/c/ c-parser.c
[4] https://github.com/libfirm/cparser/blob/ 0cc43ed4bb4d475728583eadcf9e9726682a838b/src/parser/parser.c
-
at_type_id
()¶ Check if the upcoming token is a typedef identifier
-
is_declaration_statement
()¶ Determine whether we are facing a declaration or not
-
next_token
()¶ Advance to the next token
-
parse
(tokens)¶ Here the parsing of C is begun …
Parse the given tokens.
-
parse_array_designator
(init_cursor)¶ Parse array designator like ‘{2, [10]=4}’
-
parse_array_string_initializer
(typ)¶ Handle the special case where an array is initialized with a string.
-
parse_asm_operand
()¶ Parse a single asm operand.
-
parse_asm_operands
()¶ Parse a series of assembly operands. Empty list is allowed.
-
parse_asm_statement
()¶ Parse an inline assembly statement.
See also: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
-
parse_attributes
()¶ Parse some attributes.
Examples are:
__attribute__((noreturn))
-
parse_break_statement
()¶ Parse a break
-
parse_call
(callee)¶ Parse a function call
-
parse_case_statement
()¶ Parse a case.
For example: ‘case 5:’
Or, a gnu extension:
‘case 5 … 10:’
-
parse_compound_statement
()¶ Parse a series of statements surrounded by ‘{‘ and ‘}’
-
parse_condition
()¶ Parse an expression between parenthesis
-
parse_constant_expression
()¶ Parse a constant expression
-
parse_continue_statement
()¶ Parse a continue statement
-
parse_decl_group
(decl_spec)¶ Parse the rest after the first declaration spec.
For example we have parsed ‘static int’ and we will now parse the rest. This can be either a function, or a sequence of initializable variables. At least it will probably contain some sort of identifier and an optional pointer stuff.
-
parse_decl_specifiers
(allow_storage_class=True)¶ Parse declaration specifiers.
At the end we know type, storage class and qualifiers.
Gathers storage classes: - typedef - extern - static
One of the following type specifiers: - void, char, int, unsigned, long, … - typedef-ed type - struct, union or enum
Type qualifiers: - const - volatile
-
parse_declarations
()¶ Parse normal declarations
-
parse_declarator
(abstract=False)¶ Given a declaration specifier, parse the rest.
This involves parsing optionally pointers and qualifiers and next the declaration itself.
-
parse_default_statement
()¶ Parse the default case
-
parse_do_statement
()¶ Parse a do-while statement
-
parse_empty_statement
()¶ Parse a statement that does nothing!
-
parse_enum
()¶ Parse an enum definition
-
parse_enum_fields
(ctyp, location)¶ Parse enum declarations
-
parse_expression
()¶ Parse an expression.
See also: http://en.cppreference.com/w/c/language/operator_precedence
-
parse_for_statement
()¶ Parse a for statement
-
parse_function_arguments
()¶ Parse function postfix.
We have type and name, now parse function arguments.
-
parse_function_declaration
(decl_spec, declarator)¶ Parse a function declaration with implementation
-
parse_gnu_attribute
()¶ Parse a gnu attribute like __attribute__((noreturn))
-
parse_goto_statement
()¶ Parse a goto
-
parse_if_statement
()¶ Parse an if statement
-
parse_initializer
(typ)¶ Parse the C-style array or struct initializer stuff.
Heavily copied from: https://github.com/rui314/8cc/blob/master/parse.c
Argument is the type to initialize.
An initialization can be one of: = 1; = {1, 2, 3}; = {[0] = 3}; // C99 = {.foobar = {23, 3}}; // C99 = {[2..5] = 2}; // C99
-
parse_initializer_list
(typ)¶ Parse braced initializer list.
Parse opening brace, elements and closing brace.
-
parse_initializer_list_element
(init_cursor)¶ Parse an initializer list element with optional designators.
-
parse_initializer_list_sub
(init_cursor, typ)¶ Parse braced initializer list.
Parse opening brace, elements and closing brace.
-
parse_label
()¶ Parse a label statement
-
parse_primary_expression
()¶ Parse a primary expression
-
parse_return_statement
()¶ Parse a return statement
-
parse_statement
()¶ Parse a statement
-
parse_statement_or_declaration
()¶ Parse either a statement or a declaration
Returns: a list of statements
-
parse_struct_designator
(init_cursor)¶ Parse a struct designator in an initializer list.
-
parse_struct_fields
()¶ Parse struct or union fields
-
parse_struct_or_union
()¶ Parse a struct or union
-
parse_switch_statement
()¶ Parse an switch statement
-
parse_translation_unit
()¶ Top level start of parsing
-
parse_type_modifiers
(abstract=False)¶ Parse the pointer, name and array or function suffixes.
Can be abstract type, and if so, the type may be nameless.
The result of all this action is a type modifier list with the proper type modifications required by the given code.
-
parse_typedef
(decl_spec, declarator)¶ Process typedefs
-
parse_typename
()¶ Parse a type specifier used in sizeof(int) for example.
-
parse_variable_declaration
(decl_spec, declarator)¶ Parse variable declaration optionally followed by initializer.
-
parse_while_statement
()¶ Parse a while statement
-
skip_initializer_lists
()¶ Skip superfluous initial values.
-
-
class
ppci.lang.c.
CAstPrinter
(file=None)¶ Print AST of a C program
-
visit
(node)¶ Recursively visit node’s child nodes.
-
-
class
ppci.lang.c.
CSemantics
(context)¶ This class handles the C semantics
-
add_statement
(statement)¶ Helper to emit a statement into the current block.
-
apply_type_modifiers
(type_modifiers, typ)¶ Apply the set of type modifiers to the given type
-
begin
()¶ Enter a new file / compilation unit.
-
check_redeclaration_storage_class
(sym, declaration)¶ Test if we specified an invalid combo of storage class.
-
coerce
(expr: ppci.lang.c.nodes.expressions.CExpression, typ: ppci.lang.c.nodes.types.CType)¶ Try to fit the given expression into the given type.
-
define_tag_type
(tag: str, klass: ppci.lang.c.nodes.types.TaggedType, location)¶ Get or create a tagged type with the given tag kind.
-
end_function
(body)¶ Called at the end of a function
-
enter_scope
()¶ Enter a new symbol scope.
-
error
(message, location, hints=None)¶ Trigger an error at the given location
-
finish_compilation_unit
()¶ Called at the end of a file / compilation unit.
-
get_common_type
(typ1, typ2, location)¶ Given two types, determine the common type.
The common type is a type they can both be cast to.
-
get_type
(type_specifiers)¶ Retrieve a type by type specifiers
-
init_store
(init_cursor, value)¶ Store an initial value at position pointed by cursor.
-
invalid_redeclaration
(sym, declaration, message='Invalid redefinition')¶ Raise an invalid redeclaration error.
-
leave_scope
()¶ Leave a symbol scope.
-
not_impl
(message, location)¶ Call this function to mark unimplemented code
-
on_array_designator
(init_cursor, index, location)¶ Handle array designator.
-
on_array_index
(base, index, location)¶ Check array indexing
-
on_basic_type
(type_specifiers, location)¶ Handle basic type
-
on_binop
(lhs, op, rhs, location)¶ Check binary operator
-
on_builtin_offsetof
(typ, member_name, location)¶ Check offsetof builtin function
-
on_builtin_va_arg
(arg_pointer, typ, location)¶ Check va_arg builtin function
-
on_builtin_va_copy
(dest, src, location)¶ Check va_copy builtin function
-
on_builtin_va_start
(arg_pointer, location)¶ Check va_start builtin function
-
on_call
(callee, arguments, location)¶ Check function call for validity
-
on_case
(value, statement, location)¶ Handle a case statement
-
on_cast
(to_typ, casted_expr, location)¶ Check explicit casting
-
on_char
(value, location)¶ Process a character literal
-
on_compound_literal
(typ, init, location)¶ Check the consistency of compound literals.
-
on_default
(statement, location)¶ Handle a default label
-
on_do
(body, condition, location)¶ The almost extinct dodo!
-
on_enum
(tag, is_definition, location)¶ Handle enum declaration
-
on_enum_value
(ctyp, name, value, location)¶ Handle a single enum value definition
-
on_field_def
(storage_class, ctyp, name, modifiers, bitsize, location)¶ Handle a struct/union field declaration
-
on_field_designator
(init_cursor, field_name, location)¶ Check field designator.
-
on_field_select
(base, field_name, location)¶ Check field select expression
-
on_for
(initial, condition, post, body, location)¶ Check for loop construction
-
on_function_argument
(typ, name, modifiers, location)¶ Process a function argument into the proper class
-
on_function_declaration
(storage_class, typ, name, modifiers, location)¶ Handle function declaration
-
on_if
(condition, then_statement, no, location)¶ Check if statement
-
on_number
(value, location)¶ React on numeric literal
-
on_return
(value, location)¶ Check return statement
-
on_sizeof
(typ, location)¶ Handle sizeof contraption
-
on_string
(value, location)¶ React on string literal
-
on_struct_or_union
(kind, tag, is_definition, fields, location)¶ Handle struct or union definition.
A definition of a struct occurs when we have: struct S; struct S { int f; }; struct { int f; };
-
on_switch_exit
(expression, statement, location)¶ Handle switch statement
-
on_ternop
(lhs, op, mid, rhs, location)¶ Handle ternary operator ‘a ? b : c’
-
on_type
(typ, modifiers, location)¶ Called when a type itself is described
-
static
on_type_qualifiers
(type_qualifiers, ctyp)¶ Handle type qualifiers
-
on_typedef
(typ, name, modifiers, location)¶ Handle typedef declaration
-
on_typename
(name, location)¶ Handle the case when a typedef is refered
-
on_unop
(op, a, location)¶ Check unary operator semantics
-
on_variable_access
(name, location)¶ Handle variable access
-
on_variable_declaration
(storage_class, typ, name, modifiers, location)¶ Given a declaration, and a declarator, create the proper object
-
on_variable_initialization
(variable, expression)¶ Handle a variable initialized to some value
-
on_while
(condition, body, location)¶ Handle the while statement
-
patch_size_from_initializer
(typ, initializer)¶ Fill array size from elements!
-
pointer
(expr)¶ Handle several conversions.
Things handled: - Array to pointer decay. - Function to function pointer
-
promote
(expr)¶ Perform integer promotion on expression.
Integer promotion happens when using char and short types in expressions. Those values are promoted to int type before performing the operation.
-
register_declaration
(declaration)¶ Register declaration into the scope.
-
warning
(message, location, hints=None)¶ Trigger a warning at the given location
-
-
class
ppci.lang.c.
CSynthesizer
¶ Take an IR-module and convert it into a C-AST.
This does essentially the opposite of the codegenerator.
-
syn_block
(block)¶ Synthesize an ir block into C
-
syn_instruction
(instruction)¶ Convert ir instruction to its corresponding C counterpart
-
-
class
ppci.lang.c.
CPrinter
(f=None)¶ Render a C program as text
-
gen_declaration
(declaration)¶ Spit out a declaration
-
gen_expr
(expr)¶ Format an expression as text
-
gen_statement
(statement)¶ Render a single statement as text
-
print
(compile_unit)¶ Render compilation unit as C
-
render_type
(typ, name=None)¶ Generate a proper C-string for the given type
-
-
class
ppci.lang.c.
CTokenPrinter
¶ Printer that can turn a stream of token-lines into text
Uml¶
References¶
This section contains some links to other compilers. These were used to draw inspiration from. In the spirit of: it is better to steal ideas then invent something bad yourself :).
Other C compilers¶
A c99 frontend for libfirm: https://github.com/libfirm/cparser
tcc
tcc: tiny c compiler
This compiler is tiny and uses a single pass to parse and generate code.
lcc
This compiler does parsing and type checking in one go.
cdt
CDT is an eclipse c/c++ ide.
Good resources about preprocessors¶
Cpp internal description:
https://gcc.gnu.org/onlinedocs/cppinternals/
A portable C preprocessor:
The boost wave preprocessor:
http://www.boost.org/doc/libs/1_63_0/libs/wave/
A java implementation of the preprocessor:
http://www.anarres.org/projects/jcpp/
https://github.com/shevek/jcpp
CDT preprocessor: http://git.eclipse.org/c/cdt/org.eclipse.cdt.git/tree/core/org.eclipse.cdt.core/parser/org/eclipse/cdt/internal/core/parser/scanner/CPreprocessor.java
The lcc preprocessor part: