Creating a toy language
=======================

In this how-to, we will develop our own toy language. We will use textx to
define the language and the ppci backend for optimization and code
generation.

As an example we will create a simple language that can evaluate simple
expressions and use variables. A program in this toy language looks like
this:

.. code::

    b = 2;
    c = 5 + 5 * b;
    d = 133 * c - b;
    print b;
    print c;

The language is very limited (which makes it easy to implement), but it
contains enough for an example. The example above is stored in a file called
'example.tcf' (tcf stands for toy calculator format).

Part 0 - preparation
--------------------

Before we can begin creating the toy language compiler, we need to install
the required dependencies. For that, a virtualenv can be created like this:

.. code:: bash

    [windel@hoefnix toydsl]$ virtualenv dslenv
    Using base prefix '/usr'
    New python executable in /home/windel/HG/ppci/examples/toydsl/dslenv/bin/python3
    Also creating executable in /home/windel/HG/ppci/examples/toydsl/dslenv/bin/python
    Installing setuptools, pip, wheel...done.
    [windel@hoefnix toydsl]$ source dslenv/bin/activate
    (dslenv) [windel@hoefnix toydsl]$ pip install textx ppci
    Collecting textx
    Collecting ppci
      Using cached ppci-0.5-py3-none-any.whl
    Collecting Arpeggio (from textx)
    Installing collected packages: Arpeggio, textx, ppci
    Successfully installed Arpeggio-1.5 ppci-0.5 textx-1.4
    (dslenv) [windel@hoefnix toydsl]$

After this step, we have a virtual environment with textx and ppci
installed.

Part 1 - textx
--------------

In this part we will parse the language. Most of the work is done by textx.
For a detailed explanation of the workings of textx, please see:
http://igordejanovic.net/textX/

Let's define a grammar file, called toy.tx:

.. code::

    Program: statements*=Statement;
    Statement: (PrintStatement | AssignmentStatement) ';';
    PrintStatement: 'print' var=ID;
    AssignmentStatement: var=ID '=' expr=Expression;
    Expression: Sum;
    Sum: Product (('+'|'-') Product)*;
    Product: Value ('*' Value)*;
    Value: ID | INT | ('(' Expression ')');

This grammar is able to parse our toy language. Next we create a python
script to load this grammar and parse the toy example program:

.. code:: python

    from textx.metamodel import metamodel_from_file

    # Build the metamodel from the grammar:
    toy_mm = metamodel_from_file('toy.tx')

    # Load the program:
    program = toy_mm.model_from_file('example.tcf')
    for statement in program.statements:
        print(statement)

Now if we run this file, we see the following:

.. code:: bash

    (dslenv) [windel@hoefnix toydsl]$ python toy.py

We now have a simple parser for the toy language, and can parse it.

Part 2 - connecting the backend
-------------------------------

Now that we can parse the dsl, it is time to generate code from the parsed
program. To generate code, the program must first be translated to IR-code.
The following snippet creates an IR-module, a procedure and a block to store
instructions in. Instructions at this point are not machine instructions but
abstract instructions that can be translated into any kind of machine code
later on.

.. code:: python

    from ppci import ir

    ir_module = ir.Module('toy')
    ir_function = ir.Procedure('toy', ir.Binding.GLOBAL)
    ir_module.add_function(ir_function)
    ir_block = ir.Block('entry')
    ir_function.entry = ir_block
    ir_function.add_block(ir_block)

Next, we need to translate each statement into some code, but we will do
that later.

.. code:: python

    for statement in program.statements:
        print(statement)

First we will add the closing code, which verifies the module we
constructed, compiles the IR-code to object code, links it and creates an oj
file.

.. code:: python

    ir_block.add_instruction(ir.Exit())

The code above creates an Exit instruction and adds the instruction to the
block.
Next we can verify the IR-code, to make sure that the program we created
contains no errors. The ir_to_object function translates the program from
IR-code into an object for the given target architecture, in this case
x86_64, but you could just as well use AVR or riscv here.

.. code:: python

    from ppci.irutils import Verifier
    from ppci import api

    Verifier().verify(ir_module)
    obj1 = api.ir_to_object([ir_module], 'x86_64')
    obj = api.link([obj1])
    print(obj)

The printed object shows that it contains 11 bytes.

.. code:: bash

    (dslenv) [windel@hoefnix toydsl]$ python toy.py
    ...
    CodeObject of 11 bytes
    (dslenv) [windel@hoefnix toydsl]$

We can write the object to file using the following code:

.. code:: python

    with open('example.oj', 'w') as f:
        obj.save(f)

The oj file is a ppci format for object files, pronounced 'ojee'. It is a
readable json format with the object information in it:

.. code:: json

    {
      "arch": "x86_64",
      "images": [],
      "relocations": [
        {
          "offset": "0x4",
          "section": "code",
          "symbol": "toy_toy_epilog",
          "type": "apply_b_jmp32"
        }
      ],
      "sections": [
        {
          "address": "0x0",
          "alignment": "0x4",
          "data": "",
          "name": "data"
        },
        {
          "address": "0x0",
          "alignment": "0x4",
          "data": "55488bece9000000005dc3",
          "name": "code"
        }
      ],
      "symbols": [
        {
          "name": "toy_toy",
          "section": "code",
          "value": "0x0"
        },
        {
          "name": "toy_toy_block_entry",
          "section": "code",
          "value": "0x4"
        },
        {
          "name": "toy_toy_epilog",
          "section": "code",
          "value": "0x9"
        }
      ]
    }

As you can see, there are two sections, one for code and one for data. The
code section contains 11 bytes of x86_64 machine code; the data section is
empty.

Part 3 - translating the elements
---------------------------------

In this part we will create code snippets for each type of TCF code. For this
we will use the textx context processor system, and we will also rewrite the
initial code such that we have a class that can translate TCF code into
IR-code. The entry point to the class will be a compile member function that
translates a TCF file into an IR-module.
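
Before looking at the real script, the idea of translating each element into
abstract instructions can be sketched in plain Python. The sketch below does
not use textx or ppci: the ``Node``, ``Term`` and ``ToyEmitter`` names, the
tuple instruction format and the temporary names are all hypothetical and
exist only for this illustration. It walks an expression tree and emits one
abstract instruction per element.

.. code:: python

    from collections import namedtuple

    # Hypothetical stand-ins for parsed nodes; not textx or ppci classes.
    Term = namedtuple('Term', ['operator', 'value'])
    Node = namedtuple('Node', ['base', 'extra'])


    class ToyEmitter:
        """Collects abstract three-address instructions as plain tuples."""

        def __init__(self):
            self.code = []
            self.counter = 0

        def new_tmp(self):
            self.counter += 1
            return 't{}'.format(self.counter)

        def gen(self, node):
            # Leaves: an integer literal or a variable name.
            if isinstance(node, int):
                tmp = self.new_tmp()
                self.code.append(('const', tmp, node))
                return tmp
            if isinstance(node, str):
                tmp = self.new_tmp()
                self.code.append(('load', tmp, node))
                return tmp
            # Inner node: fold the extra terms onto the base, left to right.
            result = self.gen(node.base)
            for term in node.extra:
                rhs = self.gen(term.value)
                tmp = self.new_tmp()
                self.code.append((term.operator, tmp, result, rhs))
                result = tmp
            return result

        def assign(self, name, expr):
            self.code.append(('store', name, self.gen(expr)))


    emitter = ToyEmitter()
    # c = 5 + 5 * b
    emitter.assign('c', Node(5, [Term('+', Node(5, [Term('*', 'b')]))]))
    for instruction in emitter.code:
        print(instruction)

Because the product is a nested node, its multiplication is emitted before
the addition that uses its result, which is how operator precedence carries
over from the parse tree to the generated code.
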
The whole script now looks like this:

.. literalinclude:: ../../examples/toydsl/toy.py
    :language: python
    :linenos:

And the textx description is modified to include sum and product terms:

.. code::

    Program: statements*=Statement;
    Statement: (PrintStatement | AssignmentStatement) ';';
    PrintStatement: 'print' var=ID;
    AssignmentStatement: var=ID '=' expr=Expression;
    Expression: val=Sum;
    Sum: base=Product terms*=ExtraTerm;
    ExtraTerm: operator=Operator value=Product;
    Operator: '+' | '-';
    Product: base=Value factors*=ExtraFactor;
    ExtraFactor: operator='*' value=Value;
    Value: ID | INT | ('(' Expression ')');

When we run this script, the output is the following:

.. code:: bash

    (dslenv) [windel@hoefnix toydsl]$ python toy.py
    CodeObject of 117 bytes
    (dslenv) [windel@hoefnix toydsl]$

As we can see, the object file has increased in size because we translated
the elements.

Part 4 - Creating a linux executable
------------------------------------

In this part we will create a linux executable from the object code we
created. We will do this very low level, without libc, directly using the
linux syscall api.

We will start with the low level assembly glue code (linux.asm):

.. code::

    section reset

    start:
        call toy_toy
        call bsp_exit

    bsp_syscall:
        mov rax, rdi ; abi param 1
        mov rdi, rsi ; abi param 2
        mov rsi, rdx ; abi param 3
        mov rdx, rcx ; abi param 4
        syscall
        ret

In this assembly snippet, we defined a sequence of code in the reset section
which calls our toy_toy function and next the bsp_exit function. Bsp is an
abbreviation for board support package, and we need it to connect other code
to the platform we run on. The syscall assembly function calls the linux
kernel with four parameters. Next we define the rest of the bsp in bsp.c3:

.. code::

    module bsp;

    public function void putc(byte c)
    {
        syscall(1, 1, cast(&c), 1);
    }

    function void exit()
    {
        syscall(60, 0, 0, 0);
    }

    function void syscall(int64_t nr, int64_t a, int64_t b, int64_t c);

Here we implement two syscalls, namely putc and exit.
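
For readers unfamiliar with the raw numbers: on x86_64 linux, syscall number
1 is write and 60 is exit, so putc writes a single byte to file descriptor 1
(standard output). A rough Python analogue is sketched below, with
``os.write`` standing in for the raw syscall. The ``fd`` parameter and the
pipe are assumptions added only so the result can be checked; they are not
part of bsp.c3.

.. code:: python

    import os

    SYS_WRITE = 1   # x86_64 linux syscall number for write
    SYS_EXIT = 60   # x86_64 linux syscall number for exit
    STDOUT_FD = 1   # file descriptor of standard output


    def putc(c, fd=STDOUT_FD):
        """Write one byte to fd, mirroring syscall(1, 1, &c, 1)."""
        return os.write(fd, bytes([c]))


    # Write into a pipe instead of stdout so the byte can be read back.
    read_end, write_end = os.pipe()
    written = putc(ord('A'), fd=write_end)
    os.close(write_end)
    data = os.read(read_end, 1)
    os.close(read_end)

The arguments line up with the c3 version: the first is the syscall number,
followed by the file descriptor, the address of the byte and the number of
bytes to write.
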
For the print function, we will refer to the already existing io module
located in the librt folder of ppci.

To compile and link the different parts we use the following snippet:

.. code:: python

    obj1 = api.ir_to_object([ir_module], 'x86_64')
    obj2 = api.c3c(['bsp.c3', '../../librt/io.c3'], [], 'x86_64')
    obj3 = api.asm('linux.asm', 'x86_64')
    obj = api.link([obj1, obj2, obj3], layout='layout.mmap')

In this snippet, three object files are created. obj1 contains our toy
language compiled into x86_64 code. obj2 contains the c3 bsp and io code.
obj3 contains the assembled glue code. For the link command we also use a
layout file, which tells the linker where to place each section of the
object files. In the case of linux, we use the following (layout.mmap):

.. code::

    MEMORY code LOCATION=0x40000 SIZE=0x10000 {
        SECTION(reset)
        ALIGN(4)
        SECTION(code)
    }

    MEMORY ram LOCATION=0x20000000 SIZE=0xA000 {
        SECTION(data)
    }

As a final step, we invoke the objcopy command to create a linux ELF
executable:

.. code:: python

    # Create a linux elf file:
    api.objcopy(obj, 'code', 'elf', 'example')

This command creates a file called 'example', which is an ELF file for linux.
The file can be inspected with objdump:

.. code:: bash

    (dslenv) [windel@hoefnix toydsl]$ objdump example -d

    example:     file format elf64-x86-64


    Disassembly of section code:

    000000000004001c :
       4001c:   55                      push   %rbp
       4001d:   41 56                   push   %r14
       4001f:   41 57                   push   %r15
       40021:   48 81 ec 18 00 00 00    sub    $0x18,%rsp
       40028:   48 8b ec                mov    %rsp,%rbp

    000000000004002b :
       4002b:   49 be 02 00 00 00 00    movabs $0x2,%r14
       40032:   00 00 00
       40035:   4c 89 75 00             mov    %r14,0x0(%rbp)
       40039:   4c 8b 7d 00             mov    0x0(%rbp),%r15
       4003d:   49 be 05 00 00 00 00    movabs $0x5,%r14
    ...

We can now run the executable:

.. code::

    (dslenv) [windel@hoefnix toydsl]$ ./example
    Segmentation fault (core dumped)
    (dslenv) [windel@hoefnix toydsl]$

Sadly, this is not exactly what we hoped for! The problem here is that we did
not call the io_print function with the proper arguments.
To fix this, we can change the print handling routine like this:

.. code:: python

    def handle_print(self, print_statement):
        self.logger.debug('print statement %s', print_statement.var)
        name = print_statement.var
        value = self.load_var(name)
        label_data = pack_string('{} :'.format(name))
        label = self.emit(ir.LiteralData(label_data, 'label'))
        self.emit(ir.ProcedureCall('io_print2', [label, value]))

Here we use io_print2, which takes a label and a value. The label must be
packed as a pascal style string, meaning a length integer followed by the
string data. We can implement this string encoding with the following
function:

.. code:: python

    def pack_string(txt):
        ln = struct.pack('