Let’s Build A Simple Interpreter 系列全部出自 Ruslan’s Blog,因其中的链接和图片来自国外,如遇加载较慢或无法访问的情况请挂代理。
Let’s Build A Simple Interpreter. Part 2.
In their amazing book “The 5 Elements of Effective Thinking” the authors Burger and Starbird share a story about how they observed Tony Plog, an internationally acclaimed trumpet virtuoso, conduct a master class for accomplished trumpet players. The students first played complex music phrases, which they played perfectly well. But then they were asked to play very basic, simple notes. When they played the notes, the notes sounded childish compared to the previously played complex phrases. After they finished playing, the master teacher also played the same notes, but when he played them, they did not sound childish. The difference was stunning. Tony explained that mastering the performance of simple notes allows one to play complex pieces with greater control. The lesson was clear - to build true virtuosity one must focus on mastering simple, basic ideas.1
The lesson in the story clearly applies not only to music but also to software development. The story is a good reminder to all of us to not lose sight of the importance of deep work on simple, basic ideas even if it sometimes feels like a step back. While it is important to be proficient with a tool or framework you use, it is also extremely important to know the principles behind them. As Ralph Waldo Emerson said:
“If you learn only methods, you’ll be tied to your methods. But if you learn principles, you can devise your own methods.”
On that note, let’s dive into interpreters and compilers again.
Today I will show you a new version of the calculator from Part 1 that will be able to:
- Handle whitespace characters anywhere in the input string
- Consume multi-digit integers from the input
- Subtract two integers (currently it can only add integers)
Here is the source code for your new version of the calculator that can do all of the above:
1 | # Token types |
Save the above code into the calc2.py file or download it directly from GitHub. Try it out. See for yourself that it works as expected: it can handle whitespace characters anywhere in the input; it can accept multi-digit integers, and it can also subtract two integers as well as add two integers.
Here is a sample session that I ran on my laptop:
1 | python calc2.py |
The major code changes compared with the version from are:
-
The get_next_token method was refactored a bit. The logic to increment the pos pointer was factored into a separate method advance.
-
Two more methods were added: skip_whitespace to ignore whitespace characters and integer to handle multi-digit integers in the input.
-
The expr method was modified to recognize INTEGER -> MINUS -> INTEGER phrase in addition to INTEGER -> PLUS -> INTEGER phrase. The method now also interprets both addition and subtraction after having successfully recognized the corresponding phrase.
In Part 1 you learned two important concepts, namely that of a token and a lexical analyzer. Today I would like to talk a little bit about lexemes, parsing, and parsers.
You already know about tokens. But in order for me to round out the discussion of tokens I need to mention lexemes. What is a lexeme? A lexeme is a sequence of characters that form a token. In the following picture you can see some examples of tokens and sample lexemes and hopefully it will make the relationship between them clear:
Now, remember our friend, the expr method? I said before that that’s where the interpretation of an arithmetic expression actually happens. But before you can interpret an expression you first need to recognize what kind of phrase it is, whether it is addition or subtraction, for example. That’s what the expr method essentially does: it finds the structure in the stream of tokens it gets from the get_next_token method and then it interprets the phrase that is has recognized, generating the result of the arithmetic expression.
The process of finding the structure in the stream of tokens, or put differently, the process of recognizing a phrase in the stream of tokens is called parsing. The part of an interpreter or compiler that performs that job is called a parser.
So now you know that the expr method is the part of your interpreter where both parsing and interpreting happens - the expr method first tries to recognize (parse) the INTEGER -> PLUS -> INTEGER or the INTEGER -> MINUS -> INTEGER phrase in the stream of tokens and after it has successfully recognized (parsed) one of those phrases, the method interprets it and returns the result of either addition or subtraction of two integers to the caller.
And now it’s time for exercises again.
- Extend the calculator to handle multiplication of two integers
- Extend the calculator to handle division of two integers
- Modify the code to interpret expressions containing an arbitrary number of additions and subtractions, for example “9 - 5 + 3 + 11”
Check your understanding.
- What is a lexeme?
- What is the name of the process that finds the structure in the stream of tokens, or put differently, what is the name of the process that recognizes a certain phrase in that stream of tokens?
- What is the name of the part of the interpreter (compiler) that does parsing?
I hope you liked today’s material. In the next article of the series you will extend your calculator to handle more complex arithmetic expressions. Stay tuned.
And here is a list of books I recommend that will help you in your study of interpreters and compilers: