Recently, I began to delve deeper into Rust. As a way to become more familiar with the language, I decided to write a simple lexer for a mathematical expression such as 10 - 3 + ( ( 4 / 2 ) * ( 8 * 4 ) ).
Writing a lexer shouldn't be a difficult task, especially if you've built one in another language. I've tried to explore Rust features and write idiomatic code as much as possible, without getting too attached to past implementations in other languages.
If you're not familiar with compiler theory: a lexer is the phase of the compiler that transforms your text input into a list of meaningful tokens. Essentially, we want to implement a function like this:
fn tokenizer(input: String) -> Vec<Token> {
    // ...
}
In reality, lexers are not written in a way that produces all of the tokens at once. Usually they produce one token at a time, in an iterator fashion, and that token is passed to the parser that will check grammar rules and build an Abstract Syntax Tree.
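For this post I'll take the simpler batch approach, but as a rough sketch of the iterator style (a hypothetical interface of my own, using the Token enum defined just below), it would look something like this:
use std::iter::Peekable;
use std::str::Chars;

// Hypothetical sketch of an iterator-style lexer: the parser pulls one
// token at a time instead of receiving them all in a Vec.
struct Lexer<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> Iterator for Lexer<'a> {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        // skip whitespace, match the next character, build a token;
        // return None once the input is exhausted
        todo!()
    }
}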
We start by defining our tokens using an enum:
enum Token {
    Number(i64),
    Plus,
    Dash,
    Star,
    Slash,
    LeftParen,
    RightParen,
    EOF,
}
For now, we will focus on the four basic operations, parentheses, and positive integers.
The typical approach to implementing a lexer is by iterating through the text and checking whether the current character is relevant for creating a token. This can be achieved using either a slice and indices or an iterator. In this task, I have chosen to use an iterator as it is considered the more idiomatic approach and allows me to explore the iterator-related APIs.
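For contrast, a slice-and-indices version would look roughly like this (a hypothetical sketch, not the approach used in this post):
// Hypothetical sketch of the slice-and-index alternative: walk the bytes
// manually and advance an index by hand.
fn tokenize_with_indices(input: &str) {
    let bytes = input.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        let _ch = bytes[i] as char;
        // inspect _ch, look ahead with bytes[i + 1..] when needed, push a token...
        i += 1;
    }
}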
Essentially, the core pattern we need is:
let mut tokens: Vec<Token> = Vec::new();
let mut iter = input.chars();

while let Some(ch) = iter.next() {
    match ch {
        // pattern matching logic
        // tokens.push(...)
    }
}
We loop through each character and apply pattern matching logic to it.
Since our mathematical expression is very simple, most of the characters are tokens themselves. Additionally, whitespace has no meaning in our expression. Therefore:
match ch {
    ch if ch.is_whitespace() => continue,
    '(' => tokens.push(Token::LeftParen),
    ')' => tokens.push(Token::RightParen),
    '+' => tokens.push(Token::Plus),
    '-' => tokens.push(Token::Dash),
    '*' => tokens.push(Token::Star),
    '/' => tokens.push(Token::Slash),
}
The only token that is slightly more complex is the Number, which is a sequence of characters. To implement it, we can utilize the matching-ranges feature:
match ch {
    // ...
    '1'..='9' => {
        // ...
    }
}
To implement the body, I have decided to take a functional approach and use only iterators. The idea is to consume the iterator by extracting characters as long as they are digits. We can achieve this by using take_while:
match ch {
    // ...
    '1'..='9' => {
        let s: String = iter.by_ref().take_while(|c| c.is_ascii_digit()).collect();
    }
}
The issue with that approach is that s would not include the first digit, which is ch. Additionally, take_while would have already consumed the first non-digit character, resulting in the loss of one character in the subsequent iteration of the loop.
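To see that second problem concretely, here is a small standalone snippet (my own example, separate from the tokenizer):
fn main() {
    let mut iter = "12+3".chars();
    let first = iter.next().unwrap(); // '1'
    let rest: String = iter.by_ref().take_while(|c| c.is_ascii_digit()).collect();
    println!("{}{}", first, rest); // prints "12": the first digit has to be re-attached by hand
    println!("{:?}", iter.next()); // prints Some('3'): the '+' was silently consumed and lost
}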
The way to work around this is by making iter Peekable and making use of next_if. With it, we can peek at the next element and only consume it if it is a digit. Also, by combining it with from_fn, we can create a new iterator chained with ch.
'1'..='9' => {
    let n: i64 = iter::once(ch)
        .chain(from_fn(|| iter.by_ref().next_if(|s| s.is_ascii_digit())))
        .collect::<String>()
        .parse()
        .unwrap();
    tokens.push(Token::Number(n));
}
With that, we finish our tokenizer function:
use std::iter::{self, from_fn};

fn tokenizer(input: String) -> Vec<Token> {
    let mut tokens: Vec<Token> = Vec::new();
    let mut iter = input.chars().peekable();

    while let Some(ch) = iter.next() {
        match ch {
            ch if ch.is_whitespace() => continue,
            '(' => tokens.push(Token::LeftParen),
            ')' => tokens.push(Token::RightParen),
            '+' => tokens.push(Token::Plus),
            '-' => tokens.push(Token::Dash),
            '*' => tokens.push(Token::Star),
            '/' => tokens.push(Token::Slash),
            '1'..='9' => {
                let n: i64 = iter::once(ch)
                    .chain(from_fn(|| iter.by_ref().next_if(|s| s.is_ascii_digit())))
                    .collect::<String>()
                    .parse()
                    .unwrap();
                tokens.push(Token::Number(n));
            }
            _ => {
                panic!("unrecognized char");
            }
        }
    }
    tokens.push(Token::EOF);
    tokens
}
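A quick way to exercise the function is a small test like this one (note that for assert_eq! to compile, the Token enum above would also need #[derive(Debug, PartialEq)]):
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn tokenizes_a_simple_expression() {
        // "10 - 3" should produce two numbers, a dash, and the EOF marker
        let tokens = tokenizer("10 - 3".to_string());
        assert_eq!(
            tokens,
            vec![Token::Number(10), Token::Dash, Token::Number(3), Token::EOF]
        );
    }
}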
We can improve our error handling by defining a SyntaxError struct:
#[derive(Debug)]
struct SyntaxError {
    message: String,
}

impl SyntaxError {
    fn new(message: String) -> Self {
        SyntaxError { message }
    }
}
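As an optional refinement that isn't needed for anything below, the struct can also implement Display and std::error::Error so it composes nicely with other error types:
use std::error::Error;
use std::fmt;

// Make SyntaxError usable wherever a standard error type is expected.
impl fmt::Display for SyntaxError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.message)
    }
}

impl Error for SyntaxError {}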
With the struct in place, tokenizer can return Result<Vec<Token>, SyntaxError> instead of panicking:
fn tokenizer(input: String) -> Result<Vec<Token>, SyntaxError> {
    let mut tokens: Vec<Token> = Vec::new();
    let mut iter = input.chars().peekable();

    while let Some(ch) = iter.next() {
        match ch {
            ch if ch.is_whitespace() => continue,
            '(' => tokens.push(Token::LeftParen),
            ')' => tokens.push(Token::RightParen),
            '+' => tokens.push(Token::Plus),
            '-' => tokens.push(Token::Dash),
            '*' => tokens.push(Token::Star),
            '/' => tokens.push(Token::Slash),
            '1'..='9' => {
                let n: i64 = iter::once(ch)
                    .chain(from_fn(|| iter.by_ref().next_if(|s| s.is_ascii_digit())))
                    .collect::<String>()
                    .parse()
                    .unwrap();
                tokens.push(Token::Number(n));
            }
            _ => return Err(SyntaxError::new(format!("unrecognized character {}", ch))),
        }
    }
    tokens.push(Token::EOF);
    Ok(tokens)
}
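Calling the new version is only slightly different; here is a minimal usage sketch (the main function is just an illustration):
fn main() {
    let input = String::from("10 - 3 + ( ( 4 / 2 ) * ( 8 * 4 ) )");
    match tokenizer(input) {
        // On success we just report how many tokens were produced.
        Ok(tokens) => println!("{} tokens", tokens.len()),
        // On failure we surface the SyntaxError message instead of panicking.
        Err(err) => eprintln!("syntax error: {}", err.message),
    }
}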
It wasn't a very hard exercise, but it was enough to get me familiar with Rust's syntax, its type system, some error handling patterns, the iterator-related APIs, and the pattern-matching syntax. I've also been doing Rustlings exercises on the side. As someone used to programming in Go, I often find reading Rust code challenging. However, I'm gradually beginning to appreciate its elegance.
Future post ideas: work on an iterator-based version of the lexer; build a parser to evaluate the expression; build a lexer for a more complex language.