Header by Maggie Appleton
Today we are going to create our own syntax in JavaScript. For simplicity sake and easy understanding, we will stick to a single javascript structure. Variable Declaration. We are going to implement a new syntax for declaring variables in JavaScript. The new syntax definition will be below.
// `set` and `define` to replace `let` and `const`
set name as "Duncan";
// let name = "Duncan";
define k as 1024;
// const k = 1024;
With the syntax, we could split
the input and replace set
and define
with let
and const
respectively but everyone can do that. Let's try something else.
A compiler.
Don't get too scared, it will be a very small and tiny one. For simplicity, our compiler will only support numbers
, strings
, boolean
and null
.
The Compiler
Different compilers work in different ways but break down to the three primary stages:
Parsing
: takes the raw code and turning it into an abstract representation known as an Abstract Syntax Tree (AST)Transformation
: takes the abstract representation and transforms and modifies it into another abstract representation of the target language.Code Generation
: takes the transformed abstract representation and generates the new code based on the given abstract representation.
Parsing
Parsing also gets broken down into two stages. Lexical Analysis
(lexing/ tokenizing) and Syntactic Analysis
. Lexical Analysis
takes the raw code and turn each character it into a token
with the lexer/tokenizer. The tokenizer returns an array of all the tokens for a given syntax.
// Given the code
set age as 18;
The tokenizer
will return the array below.
[
{ type: 'keyword', value: 'set' },
{ type: 'name', value: 'age' },
{ type: 'ident', value: 'as' },
{ type: 'number', value: '18' },
];
Tokens are an array of tiny little objects that describe an isolated piece of the syntax.
Each token is an object with a type
and value
property. The type
holds the type of the current character or set of characters being passed. value
property stores the value of the character being passed.Syntactic Analysis
then takes the tokens and transforms them with a parser function to an abstract representation of the tokens in relation to each other. Usually, we would have two ASTs where one is from our language and the other is for the target language, but for simplicity again, we will build a single AST modify the same one to produce a different AST.
The parser will return the object below.
// Abstract Syntax Tree for `set age as 18;`
{
type: "Program",
body: [
{
type: "VariableDeclaration",
kind: "set",
declarations: [
{
type: "VariableDeclarator",
id: { type: "Identifier", name: "age" },
init: { type: "NumberLiteral", value: 18 },
},
],
},
],
}
Transformation
The next stage for our compiler is transformation. Taking the AST and transforming it into a totally new AST for any programming language or just modifying the same one. We won't generate a new AST, we will just modify it.
On our AST, we have at each level an object with a type
property. These are known as AST Node. These nodes have defined properties on them that describe one isolated part of the tree.
// We have a Node for a "NumberLiteral"
{
type: "NumberLiteral",
value: 18,
}
// A Node for a "VariableDeclarator"
{
type: "VariableDeclarator",
id: { ...object },
init: { ...object },
}
Fortunately for us, we are doing only one thing with our AST, that is Variable Declaration. Let's see how we will modify our AST.
At the VariableDeclaration
node, we have a kind
property that contains the current keyword being used. So we will traverse
the tree and visit
each node until have a Node with type
of VariableDeclaration
and set the kind
property to what keyword we want. let
or const
// AST for `set age as 18;`
{
type: "Program",
body: [
{
type: "VariableDeclaration",
kind: "set", // <- `kind` will be changed to `let` or `const`
declarations: [ [Object] ],
},
],
}
// AST after transforming it
{
type: "Program",
body: [
{
type: "VariableDeclaration",
kind: "let", // <<<<<<<: Changed from `set`
declarations: [ [Object] ],
},
],
}
Code Generation
Now that we have our new AST, we can now generate our code. Our new AST has everything we need. The keyword, the variable name and the value assigned to the variable. The name and value can be found in the VariableDeclarator
node.
Now that's it. A general idea of compilers and how they work. Not all compilers work like this but most certainly do. That's backbone and skeleton of our compiler. If our compiler was a website, all the above will be the HTML.
Let's write some code. 😋
We won't use any external libraries, we will write everything from scratch😺. Also, you must have Node.js installed on your local system. Use any text editor or IDE of your choice.
Create a new directory and run npm init -y
and create a new javascript file with any filename of your choice.
In general, we will have 5 main functions in our code
tokenizer
We will first declare a tokenizer
function with a parameter of input
, the inital code we are going to pass to our compiler as a string. Then initialize a current
and tokens
variable. current
for the current location in the input and tokens
will be an array that will hold the tokens for each individual token
. Then we will add a;
and awhitespace
character to the end.
const tokenizer = (input) => {
let tokens = [];
let current = 0;
// Add the semicolon to the end of the input if one was not provided
// Then add whitespace to the end of the input to indicate the end of the code
if (input[input.length - 1] === ';') {
input += ' ';
} else {
input = input + '; ';
}
};
After the initial declarations in the tokenizer
, we come to the main part. We will have a while
loop that will loop over all the characters in the input
and while there is a character available, we will check for the type of the character and add it to a token
and add the token
to the tokens
array.
const tokenizer = (input) => {
// ...
while (current < input.length - 1) {
// We get the current character first
const currentChar = input[current];
// Now, we test for the types of each character.
// We check for Whitespaces first
// Regex to check for whitespace
const WHITESPACE = /\s+/;
if (WHITESPACE.test(currentChar)) {
// If the current character is a whitespace, we skip over it.
current++; // Go to the next character
continue; // Skip everything and go to the next iteration
}
// We need semicolons They tell us that we are at the end.
// We check for semicolons now and also if the semicolon is at the last but one position
// We only need the semicolons at the end. Any other position means there
// An error
if (currentChar === ';' && currentChar === input[input.length - 2]) {
// If the current character is a semicolon, we create a `token`
let token = {
type: 'semi',
value: ';',
};
// then add it to the `tokens` array
tokens.push(token);
current++; // Go to the next character
continue; // Skip everything and go to the next iteration
}
}
};
We now have check in place for semicolons and whitespaces but there are four more to go. Our compiler supports strings
, numbers
, booleans
and null
. We will now check for the following types. Remmember we are dealing with single characters so we willl need to put some checks in place else we will be pushing single characters as tokens
Still in the while loop
const tokenizer = (input) => {
// ...
while (current < input.length - 1) {
const currentChar = input[current];
//...
// Now we will check for Numbers
const NUMBER = /^[0-9]+$/; // Regex to check if character is a number
// If we use the same method above for the semicolons,
// We create a number `token` and add it to `tokens`, we end up with a token for
// each single number character instead of the number as a whole.
// For example, if we have a number value of `123`, then our tokens will be
//
// [
// { type: 'number', value: 1 },
// { type: 'number', value: 2 },
// { type: 'number', value: 3 },
// ]
//
// Instead of
//
// [
// { type: 'number', value: 123 },
// ]
// which we don't want.
// So we create a `number` variable and check if the next character is a number.
// If the next character is a number, we add it to the `number` variable
// Then add the `number` variable's value as the value in our `token`
// The add the `token` to our `tokens` array
if (NUMBER.test(currentChar)) {
let number = '';
// Check if the next character is a number
while (NUMBER.test(input[current++])) {
number += input[current - 1]; // Add the character to `number`
}
// Create a token with type number
let token = {
type: 'number',
value: parseInt(number), // `number` is a string to we convert it to an integer
};
tokens.push(token); // Add the `token` to `tokens` array
continue;
}
}
};
Now that we have numbers underway, the next on our list is strings
, booleans
and null
values. If we used the same approach for the semicolon and add a token for every character, we could face the same problem where we won't the full token value so we will a different approach similiar to the number check.
Strings will be easy to tackle with first. Each string starts and ends with a "
so based on the same approach for numbers, we check if a character is a "
, If it is, we will add every value that comes after the quote("
) until we meet another quote indicating the end of the string.
const tokenizer = (input) => {
// ...
while (current < input.length - 1) {
const currentChar = input[current];
//...
// Check if character is a string
if (currentChar === '"') {
// If the current character is a quote, that means we have a string
// Initialize an empty strings variable
let strings = '';
// Check if the next character is not a quote
while (input[++current] !== '"') {
// If it is not a quote, it means we still have a string
strings += input[current]; // Add it to the `strings` variable
}
// Create a token with property type string and a value with the `strings` value
let token = {
type: 'string',
value: strings,
};
tokens.push(token); // Add the `token` to the `tokens` array
current++;
continue;
}
}
};
The last check and we are done with our tokenizer
. The check for letters. booleans
, null
and the keywords, set
and define
all have characters that will test true for letters so we will use the same approach as the numbers. If the current character is a letter, we will add it to a new variable and check of the next character is also a letter until we meet a non-letter character then we will return.
const tokenizer = (input) => {
// ...
while (current < input.length - 1) {
const currentChar = input[current];
//...
// Check if the character is a letter
const LETTER = /[a-zA-Z]/; // Regex to check if it is a letter
if (LETTER.test(currentChar)) {
// If the current character is a letter we add it to a `letters` variable
let letters = currentChar;
// Check if the next character is also a letter
while (LETTER.test(input[++current])) {
// We add it to the `letters` variable if it is
letters += input[current];
}
// ...
// See below..
}
}
};
At this point, we have our letters
value but we cannot add it to the tokens
array yet. Each token must have a type
and a value
but for letters, they could be different. Our letters could be true
|| false
which will have a type of boolean
or the letters could be set
|| define
which could have a type of keyword
, so we need another check to check the letters and assign thier token the respective type.
const tokenizer = (input) => {
// ...
while (current < input.length - 1) {
const currentChar = input[current];
//...
const LETTER = /[a-zA-Z]/;
if (LETTER.test(currentChar)) {
// ...
//
// Still in the letter check
// At this point, we have a value for our `letters` so we check for thier types.
//
// We first check if the `letters` is `set` or `define` and we assign the `token` a type `keyword`
if (letters === 'set' || letters === 'define') {
// Add a `token` to the `tokens` array
tokens.push({
type: 'keyword',
value: letters,
});
continue; // We are done. Start the loop all over again
}
// If the letter is `null`, assign the `token` a type `null`
if (letters === 'null') {
tokens.push({
type: 'null',
value: letters,
});
continue;
}
// If the letter is `null`, assign the `token` a type `ident`
if (letters === 'as') {
tokens.push({
type: 'ident',
value: letters,
});
continue;
}
// If the letter is `true` or `false`, assign the `token` a type `boolean`
if (letters === 'true' || letters === 'false') {
tokens.push({
type: 'boolean',
value: letters,
});
continue;
}
// If we don't know the `letters`, it is the variable name.
// Assign the `token` a type `name`
tokens.push({
type: 'name',
value: letters,
});
continue; // Start the loop again
}
}
};
At this point, we are done checking but if the character isn't recognized the our while
loop will be stuck so we need some error checking in place and finally return the tokens
from the tokenizer.
const tokenizer = (input) => {
// ...
while (current < input.length - 1) {
// ....
//
// If the character reaches this point, then its not valid so we throw a TypeError
// with the character and location else we will be stuck in an infinite loop
throw new TypeError('Unknown Character: ' + currentChar + ' ' + current);
}
// Return the `tokens` from the `tokenizer`
return tokens;
};
We are done with the tokenizer
. All the code at this point can be found here.
// You can test your tokenizer with
const tokens = tokenizer('set isEmployed as false');
// [
// { type: 'keyword', value: 'set' },
// { type: 'name', value: 'isEmployed' },
// { type: 'ident', value: 'as' },
// { type: 'boolean', value: 'false' },
// { type: 'semi', value: ';' },
// ]
parser
Now that the heavy lifting has been done for us in the tokenizer
, we move to the parser
. The parser
takes the tokens
produced by the tokenizer
and modifies them into an AST. Out parser will have a walk
function. The walk
function will take the current token
and return the AST Node for that specific token
.
If we had a token
{
type: "number",
value: 1024
}
The AST Node will be:
{
type: "NumberLiteral",
value: 1024
}
The code for our parser
const parser = (tokens) => {
// We will declare a `current` variable to get the current `token`
let current = 0;
// Then our parser will have a walk function
const walk = () => {};
};
The walk
function will be a recursive function. We first get the current token
, check the type
of the token
and return an AST Node based on the type
.
const parser = (tokens) => {
// ...
const walk = () => {
// Get the current `token` with the `current` variable
let token = tokens[current];
// From here, we will check for the `type` of each token and return a node.
if (token.type === 'number') {
// Our token is a `number`,
// We increase the current counter
current++;
// We create a type `NumberLiteral` and the value as the token's `value`
let astNode = {
type: 'NumberLiteral',
value: token.value,
};
// We return the node
return astNode;
}
// We will take the same steps for the `boolean`, `null` and `string` token types
// Check the value, Increment the counter, return a new node
// Check for a string token
if (token.type === 'string') {
current++;
let astNode = {
type: 'StringLiteral',
value: token.value,
};
return astNode;
}
// Check for boolean token
if (token.type === 'boolean') {
current++;
let astNode = {
type: 'BooleanLiteral',
value: token.value,
};
return astNode;
}
// Check for null token
if (token.type === 'null') {
current++;
let astNode = {
type: 'NullLiteral',
value: token.value,
};
return astNode;
}
};
};
We have checks for null
, boolean
,string
and number
token types. Let's focus on the remaining ones, keyword
, name
, semi
and ident
. ident
will always have a value of as
so we won't need a node for it. We will just skip it. semi
also indicates the end of the code so we will ignore it too. We will focus on the keyword
and name
const parser = () => {
// ...
const walk = () => {
let token = tokens[current];
// ...
// We now check for the `keyword` token type
// The presence of a `keyword` token type indicates that we are declaring a variable,
// So the AST node won't be the same as that of `number` or `string`.
// The node will have a `type` property of `VariableDeclaration`, `kind` property of the keyword
// and a `declarations` property which is an array for all the declarations
if (token.type === 'keyword') {
// New AST Node for `keyword`
let astNode = {
type: 'VariableDeclaration',
kind: token.value, // The keyword used. `set` or `define`
declarations: [], // all the variable declarations.
};
// At this stage, we don't need the `keyword` token again. It's value has been used at the astNode.
// So we increase the current and get the next token
// Obviously the next one will be the `name` token and we will call the `walk` function again
// which will have a token type of `name` now and the returned results will be pushed into
// the declarations array
token = tokens[++current]; // Increase the `current` token counter and get the next token.
// Check if there is a token and the next token is not a semicolon
while (token && token.type !== 'semi') {
// if the token is not a semicolon, we add the result of `walk` again into
// the AST Node `declarations` array
astNode.declarations.push(walk());
// We then go to the next token
token = tokens[current];
}
// From here, we don't need the semicolon again, so we remove it from the
// `tokens` array
tokens = tokens.filter((token) => token.type !== 'semi');
// Then we return the AST Node
return astNode;
}
// The last is the `name` token type
// The `name` token type will have a node of type `VariableDeclarator` and an
// `id` which will also be a another node with type `Identifier` and an
// `init` with the type of the value.
// If the token type is a name, we will increse `current` by two to skip the next value after
// `name` which is `ident` and we don't need it.
if (token.type === 'name') {
current += 2; // Increase by 2 to skip `ident`
// Declare a new AST Node and recursively call the `walk` function again
// Which the result will be placed in the `init` property
let astNode = {
type: 'VariableDeclarator',
id: {
type: 'Identifier',
name: token.value,
},
init: walk(), // Call `walk` to return another AST Node and the result is assigned to `init`
};
// Return the AST Node
return astNode;
}
// We throw an error again for an unknown type
throw new Error(token.type);
};
};
We are done with the walk
function, but the function is just declared in the parser
, it's not being used by the parser
so we have to use it.
const parser = () => {
// ..
const walk = () => {
// ...
};
// We will now declare our AST. We have been building the nodes,
// so we have to join the AST as one.
// The type of the AST will be `Program` which will indicate the start of the code
// And a `body` property which will be an array that will contain all the other AST we have generated.
let ast = {
type: 'Program',
body: [],
};
// We then check if there are token's in the `tokens` array and add thier Node to the main AST
while (current < tokens.length) {
ast.body.push(walk());
}
// Final return of the parse function.
return ast;
};
There you have it, the parser
in the flesh. You can use the test case for the tokenizer
above and pass the tokens to the parser and log the results for yourself. You can get all the code up to this point here
traverser
It's time for our traverser
. The traverser
will take the ast
from the parser
and a visitor
. The visitor
will have objects with names of the various AST Node types and each object will have an enter
method. While traversing the AST, when we get to a node with a matching visitor object, we call the enter
method on that object.
// Example Visitor
let visitor = {
VariableDeclaration: {
enter() {},
},
};
// Declaring the `traverser`
const traverser = (ast, visitor) => {};
The traverser
will have two main methods, traverseArray
and traverseNode
. traverseArray
will call traverseNode
on each node in a node array. traverseNode
will take an node and it's parent node and call the visitor method on the node if there is one.
const traverser = (ast, visitor) => {
// `traverseArray` function will allow us to iterate over an array of nodes and
// call the `traverseNode` function
const traverseArray = (array, parent) => {
array.forEach((child) => {
traverseNode(child, parent);
});
};
};
Now that we have the traverseArray
, we can proceed to the main traverseNode
function.
const traverser = (ast, visitor) => {
// ...
// In the `traverseNode`, will get the node `type` object and call the `enter`
// method if the object is present
// Then recursively call the `traverseNode` again on every child node
const traverseNode = (node, parser) => {
// Get the node object on the visitor passed to the `traverser`
let objects = visitor[node.type];
// Check if the node type object is present and call the enter method
// with the node and the parent
if (objects && objects.enter) {
methods.enter(node, parent);
}
// At this point, we will call the `traverseNode` and `traverseArray` methods recursively
// based on each of the given node types
switch (node.type) {
// We'll start with our top level `Program` and call the `traverseArray`
// on the `body` property to call each node in the array with `traverseNode`
case 'Program':
traverseArray(node.body, node);
break;
//We do the same to `VariableDeclaration` and traverse the `declarations`
case 'VariableDeclaration':
traverseArray(node.declarations, node);
break;
// Next is the `VariableDecalarator`. We traverse the `init`
case 'VariableDeclarator':
traverseNode(node.init, node);
break;
// The remaining types don't have any child nodes so we just break
case 'NumberLiteral':
case 'StringLiteral':
case 'NullLiteral':
case 'BooleanLiteral':
break;
// We throw an error if we don't know the `type`
default:
throw new TypeError(node.type);
}
};
// We now start the `traverser` with a call to the `traverseNode` with the
// `ast` and null, since the ast does not have a parent node.
traverseNode(ast, null);
};
That's it for our traverser
. You can get all the code up to this point here.
transformer
Next is our transformer
which will take the AST and modify the AST and return it. Our transformer
will have a visitor
object and it will traverse the AST passed as an argument with the visitor and return the modified AST
Since we are only dealing with Variable Declaration's, our visitor will have only one object,VariableDeclaration
and will change the value of the kind
to the respective equivalent.
const transformer = (ast) => {
// We will start by creating the `visitor` object
const visitor = {
// Then we will create the `VariableDeclaration` object in the `visitor`
VariableDeclaration: {
// Here, we will have the `enter` method which will take the `node` and the `parent`
// Although we won't use the parent (Simplicity)
enter(node, parent) {
// Check if the VariableDeclaration has a `kind` property
// If it has, we change based on the previous one
// `set` -> `let`
// `define` -> `const`
if (node.kind) {
if (node.kind === 'set') {
node.kind = 'let'; // Set it to `let`
} else {
node.kind = 'const';
}
}
},
},
};
};
That's it for our visitor
. Although we could have done more, like things not related to variable declaration. We could have added a NumberLiteral
object to multiply every number by 2 or another method to make every string in a String
uppercase. visitor
is where the mutations and the modifications take place.
let visitor = {
// Multiply every number by 2
NumberLiteral: {
enter(node) {
if (typeof node.value === 'number') {
node.value *= 2;
}
},
},
// Uppercase every string value
StringLiteral: {
enter(node) {
if (typeof node.value === 'string') {
node.value = node.value.toUpperCase();
}
},
},
};
We are done with the visitor
but not the whole transformer
. We need to use the visitor
we created with the traverser
to modify our AST and return the modified AST
const transformer = (ast) => {
// ...visitor
// We will call the `traverser` with the `ast` and the `visitor`
traverser(ast, visitor);
// Finally we return the AST, which has been modified now.
return ast;
};
We are done with the transformer
, you can get all the code up to this point here.
You can test your transformer
with an ast generated by the parser
and compare the difference.
generator
We are done with two phases of our compiler, Parsing and Transformation. It's left with the last phase, Code Generation. We will only have one function for this phase, generator
.
The generator
will be recursively call itself at each node till we get a giant string of all the values. At each node, we will either return a call to another child node or return a value if the node has no children.
const generator = (ast) => {
// Let's break things down by the `type` of the `node`.
// Starting with the smaller nodes to the larger ones
switch (node.type) {
// If our node `type` is either `NumberLiteral`,`BooleanLiteral` or `NullLiteral`
// we just return the value at that `node`.
case 'NumberLiteral':
case 'BooleanLiteral':
case 'NullLiteral':
return node.value; // 18
// For a `StringLiteral`, we need to return the value with quotes
case 'StringLiteral':
return `"${node.value}"`;
// For an `Identifier`, we return the `node`'s name
case 'Identifier':
return node.name; // age
// A `VariableDeclarator` has two more `node`'s so we will call the `generator`
// recursively on the `id` and `init` which in turn will return a value.
// `id` will be called with the `generator` with type `Identifier` which will return a name
// `init` will be called with the `generator` with any of the Literals and will also return a value.
// We then return the results of these values from the VariableDeclarator
case 'VariableDeclarator':
return (
generator(node.id) + // age
' = ' +
generator(node.init) + // 18
';'
); // age = 18;
// For `VariableDeclaration`,
// We will map the `generator` on each `node` in the `declarations`
// The `declarations` will have the `VariableDeclarator` which in turn has `id` and `init`
// which when the generator is called on will return a value
// In total, we will return the `kind` of node with
// a joined string of what we had from mapping the declarations
case 'VariableDeclaration':
return (
node.kind + // let
' ' +
node.declarations.map(generator).join(' ') // age = 18
); // let age = 18;
// If we have a `Program` node. We will map through each node in the `body`
// and run them through the `generator` and join them with a newline.
case 'Program':
return node.body.map(generator).join('\n'); // let age = 18;
// We'll throw an error if we don't know the node
default:
throw new TypeError(node.type);
}
};
Finally, we are done with our generator
and all the three stages. You can get all the code up till this point here.
compiler
Congratulations if you really made it this far. There's only one thing left to do. We need to link all the functions we created and combine it into one single function. We'll name it as the compiler
const compiler = (code) => {
// Take the code and convert it into token
const token = tokenizer(code);
// Take the tokens and parse the into an AST
const ast = parser(tokens);
// Modify the ast into a new one
const mast = transformer(ast);
// Generate the code from the modified AST
const output = generator(mast);
// Return the new compiled code
return output;
};
We can now test our baby compiler
let code = 'set age as 18;';
let _code = 'define name as "Duncan"';
const js = compiler(code);
const _js = compiler(_code);
console.log(js); // let age = 18;
console.log(_js); // const name = "Duncan";
Conclusion
Congratulations once again on making it to the end 🥳🥳🥳. In view of the fact that we wrote all this, it's kind of useless. No one will use it in the real world and also, if we used it in a real javascript code, we will get all sort of errors, unless of course we had a way to use it in the real world. I am planning on building a babel plugin so please check back in a few weeks. I learnt a lot. I hope you did. Thank you for reading. If you face any errors or have any questions, you can find me on twitter.