Creating a toy language¶
In this how-to, we will develop our own toy language. We will use textx to define our own language and use the ppci backend for optimization and code generation.
As an example we will create a simple language that can calculate simple expressions and use variables. An example of this toy language looks like this:
b = 2;
c = 5 + 5 * b;
d = 133 * c - b;
print b;
print c;
The language is very limited (which makes it easy to implement), but it contains enough for an example. The example above is stored in a file called ‘example.tcf’ (tcf stands for toy calculator format).
Part 0 - preparation¶
Before we can begin creating the toy language compiler, we need the required dependencies. For that a virtualenv can be created like this:
[windel@hoefnix toydsl]$ virtualenv dslenv
Using base prefix '/usr'
New python executable in /home/windel/HG/ppci/examples/toydsl/dslenv/bin/python3
Also creating executable in /home/windel/HG/ppci/examples/toydsl/dslenv/bin/python
Installing setuptools, pip, wheel...done.
[windel@hoefnix toydsl]$ source dslenv/bin/activate
(dslenv) [windel@hoefnix toydsl]$ pip install textx ppci
Collecting textx
Collecting ppci
Using cached ppci-0.5-py3-none-any.whl
Collecting Arpeggio (from textx)
Installing collected packages: Arpeggio, textx, ppci
Successfully installed Arpeggio-1.5 ppci-0.5 textx-1.4
(dslenv) [windel@hoefnix toydsl]$
After this step, we now have a virtual environment with textx and ppci installed.
Part 1 - textx¶
In this part the parsing of the language will be done. A great deal will be done by textx. For a detailed explanation of the workings of textx, please see: http://igordejanovic.net/textX/
Lets define a grammar file, called toy.tx:
Program: statements*=Statement;
Statement: (PrintStatement | AssignmentStatement) ';';
PrintStatement: 'print' var=ID;
AssignmentStatement: var=ID '=' expr=Expression;
Expression: Sum;
Sum: Product (('+'|'-') Product)*;
Product: Value ('*' Value)*;
Value: ID | INT | ('(' Expression ')');
This grammar is able to parse our toy language. Next we create a python script to load this grammar and parse the toy example program:
from textx.metamodel import metamodel_from_file
toy_mm = metamodel_from_file('toy.tx')
# Load the program:
program = toy_mm.model_from_file('example.tcf')
for statement in program.statements:
print(statement)
Now if we run this file, we see the following:
(dslenv) [windel@hoefnix toydsl]$ python toy.py
<textx:AssignmentStatement object at 0x7f20c9d87cc0>
<textx:AssignmentStatement object at 0x7f20c9d87908>
<textx:AssignmentStatement object at 0x7f20c9d870b8>
<textx:PrintStatement object at 0x7f20c9d87ac8>
<textx:PrintStatement object at 0x7f20c9d95588>
We now have a simple parser for the toy language, and can parse it.
Part 2 - connecting the backend¶
Now that we can parse the dsl, it is time to create new code from the parsed format. To generate code, first the program must be translated to ir code.
The following snippet creates an IR-module, a procedure and a block to store instructions in. Instructions at this point are not machine instructions but abstract instructions that can be translated into any kind of machine code later on.
from ppci import ir
ir_module = ir.Module('toy')
ir_function = ir.Procedure('toy', ir.Binding.GLOBAL)
ir_module.add_function(ir_function)
ir_block = ir.Block('entry')
ir_function.entry = ir_block
ir_function.add_block(ir_block)
Next, we need to translate each statement into some code, but we will do that later.
for statement in program.statements:
print(statement)
First we will add the closing code, that verifies our own constructed module, and compiles the ir code to object code, links this and creates an oj file.
ir_block.add_instruction(ir.Exit())
The code above creates an Exit instruction and adds the instruction to the block. Next we can verify the IR-code, to make sure that the program we created contains no errors. The ir_to_object function translates the program from IR-code into an object for the given target architecture, in this case x86_64, but you could as well use AVR or riscv here.
from ppci.irutils import Verifier
from ppci import api
Verifier().verify(ir_module)
obj1 = api.ir_to_object([ir_module], 'x86_64')
obj = api.link([obj1])
print(obj)
The printed object shows that it conains 11 bytes.
(dslenv) [windel@hoefnix toydsl]$ python toy.py
...
CodeObject of 11 bytes
(dslenv) [windel@hoefnix toydsl]$
We can write the object to file using the following code:
with open('example.oj', 'w') as f:
obj.save(f)
The oj file is a ppci format for object files, pronounced ‘ojee’. It is a readable json format with the object information in it:
{
"arch": "x86_64",
"images": [],
"relocations": [
{
"offset": "0x4",
"section": "code",
"symbol": "toy_toy_epilog",
"type": "apply_b_jmp32"
}
],
"sections": [
{
"address": "0x0",
"alignment": "0x4",
"data": "",
"name": "data"
},
{
"address": "0x0",
"alignment": "0x4",
"data": "55488bece9000000005dc3",
"name": "code"
}
],
"symbols": [
{
"name": "toy_toy",
"section": "code",
"value": "0x0"
},
{
"name": "toy_toy_block_entry",
"section": "code",
"value": "0x4"
},
{
"name": "toy_toy_epilog",
"section": "code",
"value": "0x9"
}
]
}
As you can see, there are two sections, for code and for data. The code section contains some bytes. This is x86_64 machine code.
Part 3 - translating the elements¶
In this part we will create code snippets for each type of TCF code. For this we will use the textx context processor system, and we will also rewrite the initial code such that we have a class that can translate TCF code into IR-code. The entry point to the class will be a compile member function that translates a TCF file into a IR-module.
The whole script now looks like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 | import logging
import struct
from textx.metamodel import metamodel_from_file
from ppci import ir
from ppci.irutils import verify_module
from ppci import api
def pack_string(txt):
ln = struct.pack('<Q', len(txt))
return ln + txt.encode('ascii')
class TcfCompiler:
""" Compiler for the Tcf language """
logger = logging.getLogger('tcfcompiler')
def __init__(self):
self.int_size = 8
self.int_type = ir.i64
self.toy_mm = metamodel_from_file('toy.tx')
self.toy_mm.register_obj_processors({
'PrintStatement': self.handle_print,
'AssignmentStatement': self.handle_assignment,
'Expression': self.handle_expression,
'Sum': self.handle_sum,
'Product': self.handle_product,
})
def compile(self, filename):
self.variables = {}
# Prepare the module:
ir_module = ir.Module('toy')
self.io_print2 = ir.ExternalProcedure('io_print2', [ir.ptr, self.int_type])
ir_module.add_external(self.io_print2)
ir_function = ir.Procedure('toy_toy', ir.Binding.GLOBAL)
ir_module.add_function(ir_function)
self.ir_block = ir.Block('entry')
ir_function.entry = self.ir_block
ir_function.add_block(self.ir_block)
# Load the program:
self.toy_mm.model_from_file('example.tcf')
# Close the procedure:
self.emit(ir.Exit())
verify_module(ir_module)
return ir_module
def emit(self, instruction):
self.ir_block.add_instruction(instruction)
return instruction
def handle_print(self, print_statement):
self.logger.debug('print statement %s', print_statement.var)
name = print_statement.var
value = self.load_var(name)
label_data = pack_string('{} :'.format(name))
label = self.emit(ir.LiteralData(label_data, 'label'))
label_ptr = self.emit(ir.AddressOf(label, 'label_ptr'))
self.emit(ir.ProcedureCall(self.io_print2, [label_ptr, value]))
def handle_assignment(self, assignment):
self.logger.debug(
'assign %s = %s', assignment.var, assignment.expr)
name = assignment.var
assert isinstance(name, str)
# Create the variable on stack, if not already present:
if name not in self.variables:
alloc = self.emit(
ir.Alloc(name + '_alloc', self.int_size, self.int_size))
addr = self.emit(ir.AddressOf(alloc, name + '_addr'))
self.variables[name] = addr
mem_loc = self.variables[name]
value = assignment.expr.ir_value
self.emit(ir.Store(value, mem_loc))
def handle_expression(self, expr):
self.logger.debug('expression')
expr.ir_value = expr.val.ir_value
def handle_sum(self, sum):
""" Process a sum element """
self.logger.debug('sum')
lhs = sum.base.ir_value
for term in sum.terms:
op = term.operator
rhs = term.value.ir_value
lhs = self.emit(ir.Binop(lhs, op, rhs, 'sum', self.int_type))
sum.ir_value = lhs
def handle_product(self, product):
self.logger.debug('product')
lhs = self.get_value(product.base)
for factor in product.factors:
rhs = self.get_value(factor.value)
lhs = self.emit(ir.Binop(lhs, '*', rhs, 'prod', self.int_type))
product.ir_value = lhs
def get_value(self, value):
if isinstance(value, int):
ir_value = self.emit(ir.Const(value, 'constant', self.int_type))
elif isinstance(value, str):
ir_value = self.load_var(value)
else: # It must be an expression!
ir_value = value.ir_value
return ir_value
def load_var(self, var_name):
mem_loc = self.variables[var_name]
return self.emit(ir.Load(mem_loc, var_name, self.int_type))
tcf_compiler = TcfCompiler()
ir_module = tcf_compiler.compile('example.tcf')
obj1 = api.ir_to_object([ir_module], 'x86_64')
obj2 = api.c3c(['bsp.c3', '../../librt/io.c3'], [], 'x86_64')
obj3 = api.asm('linux.asm', 'x86_64')
obj = api.link([obj1, obj2, obj3], layout='layout.mmap')
print(obj)
with open('example.oj', 'w') as f:
obj.save(f)
# Create a linux elf file:
api.objcopy(obj, 'code', 'elf', 'example')
|
And the textx description is modified to include sum and product terms:
Program: statements*=Statement;
Statement: (PrintStatement | AssignmentStatement) ';';
PrintStatement: 'print' var=ID;
AssignmentStatement: var=ID '=' expr=Expression;
Expression: val=Sum;
Sum: base=Product terms*=ExtraTerm;
ExtraTerm: operator=Operator value=Product;
Operator: '+' | '-';
Product: base=Value factors*=ExtraFactor;
ExtraFactor: operator='*' value=Value;
Value: ID | INT | ('(' Expression ')');
When we run this script, the output is the following:
(dslenv) [windel@hoefnix toydsl]$ python toy.py
CodeObject of 117 bytes
(dslenv) [windel@hoefnix toydsl]$
As we can see, the object file has increased in size because we translated the elements.
Part 4 - Creating a linux executable¶
In this part we will create a linux executable from the object code we created. We will do this very low level, without libc, directly using the linux syscall api.
We will start with the low level assembly glue code (linux.asm):
section reset
start:
call toy_toy
call bsp_exit
bsp_syscall:
mov rax, rdi ; abi param 1
mov rdi, rsi ; abi param 2
mov rsi, rdx ; abi param 3
mov rdx, rcx ; abi param 4
syscall
ret
In this assembly snippet, we defined a sequence of code in the reset section which calls our toy_toy function and next the bsp_exit function. Bsp is an abbreviation for board support package, and we need it to connect other code to the platform we run on. The syscall assembly function calls the linux kernel with four parameters.
Next we define the rest of the bsp in bsp.c3:
module bsp;
public function void putc(byte c)
{
syscall(1, 1, cast<int64_t>(&c), 1);
}
function void exit()
{
syscall(60, 0, 0, 0);
}
function void syscall(int64_t nr, int64_t a, int64_t b, int64_t c);
Here we implement two syscalls, namely putc and exit.
For the print function, we will refer to the already existing io module located in the librt folder of ppci. To compile and link the different parts we use the following snippet:
obj1 = api.ir_to_object([ir_module], 'x86_64')
obj2 = api.c3c(['bsp.c3', '../../librt/io.c3'], [], 'x86_64')
obj3 = api.asm('linux.asm', 'x86_64')
obj = api.link([obj1, obj2, obj3], layout='layout.mmap')
In this snippet, three object files are created. obj1 contains our toy languaged compiled into x86 code. obj2 contains the c3 bsp and io code. obj3 contains the assembly sourcecode.
For the link command we also use a layout file, telling the linker where it must place which piece of the object file. In the case of linux, we use the following (layout.mmap):
MEMORY code LOCATION=0x40000 SIZE=0x10000 {
SECTION(reset)
ALIGN(4)
SECTION(code)
}
MEMORY ram LOCATION=0x20000000 SIZE=0xA000 {
SECTION(data)
}
As a final step, we invoke the objcopy command to create a linux ELF executable:
# Create a linux elf file:
api.objcopy(obj, 'code', 'elf', 'example')
This command creates a file called ‘example’, which is an ELF file for linux. The file can be inspected with objdump:
(dslenv) [windel@hoefnix toydsl]$ objdump example -d
example: file format elf64-x86-64
Disassembly of section code:
000000000004001c <toy_toy>:
4001c: 55 push %rbp
4001d: 41 56 push %r14
4001f: 41 57 push %r15
40021: 48 81 ec 18 00 00 00 sub $0x18,%rsp
40028: 48 8b ec mov %rsp,%rbp
000000000004002b <toy_toy_block_entry>:
4002b: 49 be 02 00 00 00 00 movabs $0x2,%r14
40032: 00 00 00
40035: 4c 89 75 00 mov %r14,0x0(%rbp)
40039: 4c 8b 7d 00 mov 0x0(%rbp),%r15
4003d: 49 be 05 00 00 00 00 movabs $0x5,%r14
...
We can now run the executable:
(dslenv) [windel@hoefnix toydsl]$ ./example
Segmentation fault (core dumped)
(dslenv) [windel@hoefnix toydsl]$
Sadly, this is not exactly what we hoped for!
The problem here is that we did not call the io_print function with the proper arguments. To fix this, we can change the print handling routine like this:
def handle_print(self, print_statement):
self.logger.debug('print statement %s', print_statement.var)
name = print_statement.var
value = self.load_var(name)
label_data = pack_string('{} :'.format(name))
label = self.emit(ir.LiteralData(label_data, 'label'))
self.emit(ir.ProcedureCall('io_print2', [label, value]))
We use here io_print2, which takes a label and a value. The label must be packed as a pascal style string, meaning a length integer followed by the string data. We can implement this string encoding with the following function:
def pack_string(txt):
ln = struct.pack('<Q', len(txt))
return ln + txt.encode('ascii')
Now we can compile the TCF file again, and check the result:
(dslenv) [windel@hoefnix toydsl]$ python toy.py
CodeObject of 1049 bytes
(dslenv) [windel@hoefnix toydsl]$ ./example
b :0x00000002
c :0x0000000F
(dslenv) [windel@hoefnix toydsl]$ cat example.tcf
b = 2;
c = 5 + 5 * b;
d = 133 * c - b;
print b;
print c;
(dslenv) [windel@hoefnix toydsl]$
As we can see, the compiler worked out correctly!
Final words¶
In this tutorial we have seen how to create a simple language. The entire example for this code can be found in the examples/toydsl directory in the ppci repository.