Creating Project
This guide tells you how to write your own addition expression parser from scratch. Although we have already decided in advance on the grammar that we want to recognize, by analogy with it, you can create any other implementations.
The full result of this code can be downloaded from here.
Before starting, you should create a new composer.json
file and add the
necessary dependencies (see installation) to it. As a
result, the file will look something like this:
{
"require": {
"phplrt/runtime": "^3.6"
},
"autoload": {
"psr-4": {
"Calculator\\": "src"
}
},
"require-dev": {
"phplrt/phplrt": "^3.6"
}
}
Please note that the
composer.json
contains two packages:phplrt/runtime
inside therequire
section, which provides the minimum assembly of components for the parser to work, andphplrt/phplrt
inside therequire-dev
section, which contains all the necessary tools for development.
Assembly Binary
Now we need to create a file that will allow us to process the source code of the grammar and assemble it into a set of rules for the parser and lexer.
To do this, create a file, for example, build.php
and write the following
code in it:
<?php
require __DIR__ . '/vendor/autoload.php';
$compiler = new \Phplrt\Compiler\Compiler();
// Source grammar loading.
$compiler->load('
// TODO
');
// Compiling the grammar into a set of
// instructions for the parser.
$assembly = $compiler->build();
// Saving an assembly to a file.
file_put_contents(__DIR__ . '/grammar.php', $assembly->generate());
Please note that this code is for package development and will not be needed in the future.
After running this code, a file grammar.php
with the configuration
of our future parser will appear.
More information about the assembly is written in the corresponding page.
Creating Parser
Let's now create a working parser code that will load this configuration. The first step is to load the lexer:
<?php
declare(strict_types=1);
namespace Calculator;
use Phplrt\Contracts\Lexer\LexerInterface;
use Phplrt\Lexer\Lexer;
final class Parser
{
private LexerInterface $lexer;
public function __construct()
{
// This will be the path to our configuration file
$config = require __DIR__ . '/grammar.php';
$this->lexer = new Lexer(
//
// The "tokens" array field contains a list of lexer
// states with list of regular expression of tokens.
//
// Since we use only one state, we can load the main one,
// it is called "default" and is always available.
//
$config['tokens']['default'],
//
// This array field contains a list of token names
// to skip.
//
$config['skip'],
);
}
}
More information about how the lexer works and works is written in this page.
Now the more difficult step is to load the parser. Let's modify the existing code and add it.
// ...
use Phplrt\Contracts\Parser\ParserInterface;
use Phplrt\Parser\Parser as Runtime;
final class Parser
{
private LexerInterface $lexer;
private ParserInterface $parser;
public function __construct()
{
$config = require __DIR__ . '/grammar.php';
$this->lexer = /* ... lexer loading ... */
$this->parser = new Runtime(
//
// The first required argument is a
// reference to the lexer.
//
$this->lexer,
//
// The second argument is a list of parser rules.
//
// This list can also be loaded from a "grammar.php"
// config file.
//
$config['grammar'],
// Additional options
[
//
// It is worth paying attention to one option, which
// is also desirable to set: This is the name of the main
// rule from which the analysis of grammar will begin.
//
Runtime::CONFIG_INITIAL_RULE => $config['initial'],
]
);
}
}
More information about how the parser works and works is written in this page.
That's all! Now it remains to add a method for parsing!
final class Parser
{
/* ... constructor ... */
public function parse(string $code): iterable
{
return $this->parser->parse($code);
}
}
Now you can create a parser and use it. Of course it won't return any result since we haven't written any rules. However, the main code is ready.
$math = new Calculator\Parser();
$ast = $math->parse('2 + 2');
var_dump($ast);
Writing Grammar
With the foundation in place, we can now start writing the grammar. You can
read more about all its features on the
grammar documentation page, but here we will focus on the
simplest. To do this, open our first build.php
assembly file and write the
following:
// Source grammar loading.
$compiler->load('
// All digits sequence should be recognized as "number"
%token number \d+
// All "+" chars should be recognized as "plus"
%token plus \+
// All whitespace chars should be ignored
%skip whitespace \s+
// This means that each "expression" matches the sequence:
// - "number" (required)
// - then optional (from 0 to 1):
// - "plus"
// - then another "expression"
expression : number() (::plus:: expression())?
number : <number>
');
Don't forget to call the build.php
script again to generate a new
grammar.php
file. After the grammar.php
file is generated, we can run
the 2 + 2
expression recognition test again.
$ast = $math->parse('2 + 2 + 4');
var_dump($ast);
As a result, this code should output the following result:
array:3 [
0 => Phplrt\Lexer\Token\Token {
-bytes: null
-offset: 0
-value: "2"
-name: "number"
}
1 => Phplrt\Lexer\Token\Token {
-bytes: null
-offset: 4
-value: "2"
-name: "number"
}
2 => Phplrt\Lexer\Token\Token {
-bytes: null
-offset: 8
-value: "4"
-name: "number"
}
]
Don't worry about the
-bytes: null
value beingnull
on output, it will be initialized when this data is read.
As a result, you will get back a list of tokens that you marked as "keep" using
<token-name>
syntax (instead of ::token-name::
).
In the case that you try to recognize an incorrect expression, you will receive an error message.
$math->parse('2 + 2 & 4');
// Syntax error, unrecognized "&"
// 1. | 2 + 2 & 4
// | ^ in /.../lexer/src/Exception/UnrecognizedTokenException.php:19
$math->parse('2 + ');
// Syntax error, unexpected end of input, "number" is expected
// 1. | 2 +
// | in /.../parser/src/Exception/UnexpectedTokenException.php:35
$math->parse('+ 42');
// Syntax error, unexpected "+" (plus), "number" is expected
// 1. | + 42
// | ^ in /.../parser/src/Exception/UnexpectedTokenException.php:35
Building AST
After we have written the parser, we can transform the set of expressions into their corresponding abstract syntax tree.
A builder is used for this, which can be replaced in the future, but for now we
will use the standard one. To do this, open the Parser.php
file and add a new
CONFIG_AST_BUILDER
option to it.
// ...
use Phplrt\Parser\SimpleBuilder;
final class Parser
{
/* ... */
public function __construct()
{
/* ... lexer initialization ... */
$this->parser = new Runtime($this->lexer, $config['grammar'], [
/** ... other options ... */
//
// This option contains a reference to the tree builder instance.
//
// All construction rules will be loaded from the "reducers" section
// of the compiled grammar.
//
Runtime::CONFIG_AST_BUILDER => new SimpleBuilder($config['reducers']),
]);
}
}
After that, two node classes should be created: The first one will be responsible for the literals (number).
<?php
declare(strict_types=1);
namespace Calculator\Node;
use Phplrt\Contracts\Ast\NodeInterface;
final class Number implements NodeInterface
{
public int $value;
public function __construct(int $value)
{
$this->value = $value;
}
public function getIterator(): \Traversable
{
return new \EmptyIterator();
}
}
And the second one for the sum expression.
namespace Calculator\Node;
use Phplrt\Contracts\Ast\NodeInterface;
final class Addition implements NodeInterface
{
public Number $a;
public object $b;
/** @param Number|Addition $b */
public function __construct(Number $a, object $b)
{
$this->a = $a;
$this->b = $b;
}
public function getIterator(): \Traversable
{
return new \EmptyIterator();
}
}
After the classes have been created, the rules for their construction should
be written. To do this, open the grammar file (build.php
) and modify it.
$compiler->load('
// ... token definitions ...
/**
* BEFORE:
*
* expression : number() (::plus:: expression())?
*
* number : <number>
*
*/
expression -> {
// in case of "$children" sequence is a "number()" and "expression()"
if (count($children) === 2) {
return new \Calculator\Node\Addition($children[0], $children[1]);
}
// otherwise (only "number()")
return $children;
}
: number() (::plus:: expression())?
number -> { return new \Calculator\Node\Number((int)$token->getValue()); }
: <number>
');
Let's call the code:
// Execution "2 + 2 + 4" expression:
$ast = $math->parse('2 + 2 + 4');
var_dump($ast);
// Expected Output:
Calculator\Node\Addition {
+a: Calculator\Node\Number {
+value: 2
}
+b: Calculator\Node\Addition {
+a: Calculator\Node\Number {
+value: 2
}
+b: Calculator\Node\Number {
+value: 4
}
}
}