Let's Build a Simple Interpreter by Hand, Part 3

2015-8-19 23:25 | Posted by: joejoe0332

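I woke up this morning and thought to myself: "Why do we find it so difficult to learn a new skill?"

I don't think it is just because of the hard work. I think one of the reasons might be that we spend so much time and effort acquiring knowledge by reading articles and watching technical videos that we have too little time left to turn that knowledge into a skill through practice. Take swimming, for example. You can spend a lot of time reading hundreds of books about swimming, talk for hours with experienced swimmers and coaches, and watch every training video available, and you will still sink like a rock the first time you jump into the pool.

The bottom line is: no matter how well you think you know the subject, you have to put that knowledge into practice to turn it into a skill. To help you with the practice part, I put exercises into Part 1 and Part 2 of this series. And yes, I promise you will see more exercises in today's article and in future ones.

Okay, let's get started with today's material, shall we?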

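So far, you have learned how to interpret addition expressions like "7 + 3" and subtraction expressions like "12 - 9".

Today we are going to talk about how to parse (recognize) arithmetic expressions that contain any number of plus and minus operators, for example "7 - 3 + 2 - 1".

The arithmetic expressions discussed in this article can be represented by a syntax diagram like the one below: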

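So what is a syntax diagram?

A syntax diagram is a graphical representation of a programming language's syntax rules.

Basically, a syntax diagram lets you see at a glance which statements are valid in the language and which are not.

Syntax diagrams are easy to read: just follow the paths indicated by the arrows. Some paths are branches (choices) and some are loops.

You can read the syntax diagram above like this: a term, followed by any number of "plus or minus sign followed by a term" pairs. Draw that out piece by piece and you get the diagram above.

You might wonder what a term is. For the purposes of this article, a term is just an integer.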

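Syntax diagrams serve two main purposes:

They represent the syntax of a programming language in graphical form.
They can help you write your parser: you can turn a diagram into code by following a few simple rules.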

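As you already know, the process of recognizing a phrase in a stream of tokens is called parsing, and the part of an interpreter or compiler that does this job is the parser. Parsing is also called syntax analysis, and the parser is also called, you guessed it, a syntax analyzer.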
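To make "a stream of tokens" concrete, here is a minimal sketch of what the parser receives for a short input. The `tokenize` helper and the `TOKEN_PATTERN` regular expression are hypothetical names introduced only for this illustration; they are not part of the calculator built in this series, which reads characters one at a time instead.

```python
import re

# Hypothetical helper, not part of the article's calculator: it only
# illustrates what a token stream looks like for this small grammar.
TOKEN_PATTERN = re.compile(r'\s*(?:(\d+)|([+-]))')


def tokenize(text):
    """Return a list of (type, value) pairs for integers, '+' and '-'."""
    tokens = []
    pos = 0
    text = text.rstrip()  # trailing whitespace carries no token
    while pos < len(text):
        match = TOKEN_PATTERN.match(text, pos)
        if match is None:
            raise ValueError('Unexpected character at position %d' % pos)
        number, operator = match.groups()
        if number is not None:
            tokens.append(('INTEGER', int(number)))
        elif operator == '+':
            tokens.append(('PLUS', '+'))
        else:
            tokens.append(('MINUS', '-'))
        pos = match.end()
    tokens.append(('EOF', None))  # marks the end of the input
    return tokens


print(tokenize('7 - 3 + 2'))
```

The parser's job is then to decide whether such a sequence of token types forms a valid phrase.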

According to the syntax diagram above, all of the following arithmetic expressions are valid:

3
3 + 4
7 - 3 + 2 - 1
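For a grammar this small, the diagram "a term, followed by any number of sign-and-term pairs" can also be written as a regular expression. The sketch below is only an informal way to check inputs against the diagram; the names `EXPRESSION` and `follows_diagram` are made up for this example, and it is not how the parser in this article works.

```python
import re

# term ((+|-) term)*, where a term is an integer and whitespace may
# surround any token. Illustration only, not the article's parser.
EXPRESSION = re.compile(r'\s*\d+(?:\s*[+-]\s*\d+)*\s*\Z')


def follows_diagram(text):
    """Return True if `text` can be produced by walking the diagram."""
    return EXPRESSION.match(text) is not None


for sample in ('3', '3 + 4', '7 - 3 + 2 - 1', '3 +'):
    print(repr(sample), follows_diagram(sample))
```

The first three samples are accepted and the last one is rejected, because a sign with no term after it leaves you stuck in the middle of the diagram.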

Because arithmetic expressions look much the same across programming languages, we can use a Python shell to "test" our syntax diagram. Launch your Python shell and type:

>>> 3
3
>>> 3 + 4
7
>>> 7 - 3 + 2 - 1
5

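No surprises here.

The expression "3 + " is not a valid arithmetic expression, though, because according to the syntax diagram a plus sign must be followed by a term (that is, an integer); otherwise the interpreter reports an error. Try it and see for yourself: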

>>> 3 +
  File "<stdin>", line 1
    3 +
      ^
SyntaxError: invalid syntax
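The same check can be scripted instead of typed into the shell. The sketch below uses Python's built-in `compile` function in `'eval'` mode, which raises `SyntaxError` for malformed expressions; the wrapper name `is_valid_python_expression` is invented for this example. Keep in mind that Python's grammar accepts far more than our little diagram does, so this only mirrors the shell experiment above.

```python
def is_valid_python_expression(source):
    """Return True if Python's own parser accepts `source` as an expression."""
    try:
        # 'eval' mode compiles a single expression without running it.
        compile(source, '<string>', 'eval')
    except SyntaxError:
        return False
    return True


for sample in ('3', '3 + 4', '7 - 3 + 2 - 1', '3 +'):
    print(repr(sample), is_valid_python_expression(sample))
```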

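Testing with the Python shell is great, but we would rather implement our own interpreter in code, right?

You know from the previous articles (Part 1 and Part 2) that the expr method is where both the parser and the interpreter live. To repeat: the parser only recognizes the structure and makes sure the input conforms to the specification, and the interpreter evaluates the expression once the parser has finished recognizing (parsing) it.

Below is the parser code written from the syntax diagram. The rectangular box in the diagram becomes a term method that parses an integer, and the expr method simply follows the flow of the diagram: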

def term(self):
    self.eat(INTEGER)

def expr(self):
    # set current token to the first token taken from the input
    self.current_token = self.get_next_token()
 
    self.term()
    while self.current_token.type in (PLUS, MINUS):
        token = self.current_token
        if token.type == PLUS:
            self.eat(PLUS)
            self.term()
        elif token.type == MINUS:
            self.eat(MINUS)
            self.term()

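You can see that expr calls the term method first. It then enters a while loop that can run any number of times, including zero. Inside the loop, the parser looks at the current token (a plus or a minus sign) to decide which branch to take. Spend some time convincing yourself that the code above really does parse arithmetic expressions and follows the syntax diagram.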
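One way to convince yourself is to run the parsing logic on its own. The sketch below is a standalone variant of the two methods above; to stay short it takes a ready-made list of token types rather than pulling tokens from a lexer, and the class name `Recognizer` and its `current` property are assumptions made for this example only.

```python
INTEGER, PLUS, MINUS, EOF = 'INTEGER', 'PLUS', 'MINUS', 'EOF'


class Recognizer(object):
    """Parser-only sketch: checks the structure and computes nothing."""

    def __init__(self, token_types):
        # Append EOF so there is always a current token to look at.
        self.token_types = list(token_types) + [EOF]
        self.pos = 0

    @property
    def current(self):
        return self.token_types[self.pos]

    def eat(self, token_type):
        # Consume the current token only if it has the expected type.
        if self.current != token_type:
            raise Exception('Invalid syntax')
        self.pos += 1

    def term(self):
        self.eat(INTEGER)

    def expr(self):
        # A term, followed by any number of (sign, term) pairs.
        self.term()
        while self.current in (PLUS, MINUS):
            self.eat(self.current)
            self.term()


# "7 - 3 + 2": accepted, nothing is printed or returned.
Recognizer([INTEGER, MINUS, INTEGER, PLUS, INTEGER]).expr()

# "3 +": rejected, because the plus sign has no term after it.
try:
    Recognizer([INTEGER, PLUS]).expr()
except Exception as exc:
    print(exc)
```

Like the article's expr, this sketch stops after the last sign-and-term pair and does not check that the next token is EOF.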

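The parser itself does not interpret anything, though: if it recognizes a valid expression it stays silent, and if the expression is invalid it raises a syntax error. Next, let's modify the expr method and add the interpreter code: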

def term(self):
    """Return an INTEGER token value"""
    token = self.current_token
    self.eat(INTEGER)
    return token.value

def expr(self):
    """Parser / Interpreter """
    # set current token to the first token taken from the input
    self.current_token = self.get_next_token()
 
    result = self.term()
    while self.current_token.type in (PLUS, MINUS):
        token = self.current_token
        if token.type == PLUS:
            self.eat(PLUS)
            result = result + self.term()
        elif token.type == MINUS:
            self.eat(MINUS)
            result = result - self.term()
 
    return result

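Because the interpreter needs to evaluate the expression, the term method now returns an integer value, and the expr method now performs the addition and subtraction at the appropriate places and returns the result of the interpretation. Even though the code is pretty straightforward, I recommend spending some time studying it.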
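To see how the running result changes as expr consumes each sign-and-term pair, here is a small tracing sketch. It is a simplified stand-in, not the interpreter itself: the function name `evaluate_with_trace` is hypothetical, and it assumes well-formed input whose tokens are separated by spaces.

```python
def evaluate_with_trace(text):
    """Evaluate a space-separated +/- expression, recording every step.

    Assumes well-formed input such as '7 - 3 + 2 - 1'; no error
    handling is attempted in this sketch.
    """
    parts = text.split()
    result = int(parts[0])          # the first term
    trace = [result]
    # Operators sit at odd indexes, each followed by its right-hand term.
    for operator, term in zip(parts[1::2], parts[2::2]):
        if operator == '+':
            result = result + int(term)
        else:
            result = result - int(term)
        trace.append(result)
    return trace


print(evaluate_with_trace('7 - 3 + 2 - 1'))  # [7, 4, 6, 5]
```

The expression is evaluated strictly left to right: 7, then 7 - 3 = 4, then 4 + 2 = 6, then 6 - 1 = 5, which matches the order in which the while loop in expr updates result.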

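OK, let's keep going and look at the complete interpreter code.

Here is the source code for the new version of your calculator, which can handle arithmetic expressions containing any number of addition and subtraction operators: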

# Token types
#
# EOF (end-of-file) token is used to indicate that
# there is no more input left for lexical analysis
INTEGER, PLUS, MINUS, EOF = 'INTEGER', 'PLUS', 'MINUS', 'EOF'


class Token(object):
    def __init__(self, type, value):
        # token type: INTEGER, PLUS, MINUS, or EOF
        self.type = type
        # token value: non-negative integer value, '+', '-', or None
        self.value = value
 
    def __str__(self):
        """String representation of the class instance.

        Examples:
            Token(INTEGER, 3)
            Token(PLUS, '+')
        """
        return 'Token({type}, {value})'.format(
            type=self.type,
            value=repr(self.value)
        )
 
    def __repr__(self):
        return self.__str__()


class Interpreter(object):
    def __init__(self, text):
        # client string input, e.g. "3 + 5", "12 - 5 + 3", etc
        self.text = text
        # self.pos is an index into self.text
        self.pos = 0
        # current token instance
        self.current_token = None
        self.current_char = self.text[self.pos]
 
    ##########################################################
    # Lexer code                                             #
    ##########################################################
    def error(self):
        raise Exception('Invalid syntax')
 
    def advance(self):
        """Advance the `pos` pointer and set the `current_char` variable."""
        self.pos += 1
        if self.pos > len(self.text) - 1:
            self.current_char = None  # Indicates end of input
        else:
            self.current_char = self.text[self.pos]
 
    def skip_whitespace(self):
        while self.current_char is not None and self.current_char.isspace():
            self.advance()
 
    def integer(self):
        """Return a (multidigit) integer consumed from the input."""
        result = ''
        while self.current_char is not None and self.current_char.isdigit():
            result += self.current_char
            self.advance()
        return int(result)
 
    def get_next_token(self):
        """Lexical analyzer (also known as scanner or tokenizer)

        This method is responsible for breaking a sentence
        apart into tokens. One token at a time.
        """
        while self.current_char is not None:
 
            if self.current_char.isspace():
                self.skip_whitespace()
                continue
 
            if self.current_char.isdigit():
                return Token(INTEGER, self.integer())
 
            if self.current_char == '+':
                self.advance()
                return Token(PLUS, '+')
 
            if self.current_char == '-':
                self.advance()
                return Token(MINUS, '-')
 
            self.error()
 
        return Token(EOF, None)
 
    ##########################################################
    # Parser / Interpreter code                              #
    ##########################################################
    def eat(self, token_type):
        # compare the current token type with the passed token
        # type and if they match then "eat" the current token
        # and assign the next token to the self.current_token,
        # otherwise raise an exception.
        if self.current_token.type == token_type:
            self.current_token = self.get_next_token()
        else:
            self.error()
 
    def term(self):
        """Return an INTEGER token value."""
        token = self.current_token
        self.eat(INTEGER)
        return token.value
 
    def expr(self):
        """Arithmetic expression parser / interpreter."""
        # set current token to the first token taken from the input
        self.current_token = self.get_next_token()
 
        result = self.term()
        while self.current_token.type in (PLUS, MINUS):
            token = self.current_token
            if token.type == PLUS:
                self.eat(PLUS)
                result = result + self.term()
            elif token.type == MINUS:
                self.eat(MINUS)
                result = result - self.term()
 
        return result


def main():
    while True:
        try:
            # To run under Python3 replace 'raw_input' call
            # with 'input'
            text = raw_input('calc> ')
        except EOFError:
            break
        if not text:
            continue
        interpreter = Interpreter(text)
        result = interpreter.expr()
        print(result)


if __name__ == '__main__':
    main()

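Save the above code as calc3.py, or download it directly from GitHub. Try it out! See for yourself that it handles arithmetic expressions that follow the syntax diagram I showed you at the beginning.

Here is a sample session that I ran on my own laptop: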

$ python calc3.py
calc> 3
3
calc> 7 - 4
3
calc> 10 + 5
15
calc> 7 - 3 + 2 - 1
5
calc> 10 + 1 + 2 - 3 + 4 + 6 - 15
5
calc> 3 +
Traceback (most recent call last):
  File "calc3.py", line 147, in <module>
    main()
  File "calc3.py", line 142, in main
    result = interpreter.expr()
  File "calc3.py", line 123, in expr
    result = result + self.term()
  File "calc3.py", line 110, in term
    self.eat(INTEGER)
  File "calc3.py", line 105, in eat
    self.error()
  File "calc3.py", line 45, in error
    raise Exception('Invalid syntax')
Exception: Invalid syntax

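Remember the exercises I mentioned at the beginning of the article? Here they are, as promised :)

Draw, with pen and paper, a syntax diagram for arithmetic expressions that contain multiplication and division operators, for example "7 * 4 / 2 * 3".
Modify the source code of the calculator above so that it can handle arithmetic expressions like "7 * 4 / 2 * 3".
Write an interpreter that handles arithmetic expressions like "7 - 3 + 2 - 1". Use any programming language you are comfortable with, write it without looking at any tutorial, and get it to behave like the one above. As you work, keep the key components in mind: a lexer that converts the input into a stream of tokens, a parser that takes the token stream from the lexer and recognizes structure in it, and an interpreter that computes the result once the parser has successfully parsed the expression. String those pieces together, and use the exercise of building an interpreter to turn your knowledge into a skill.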

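Check your understanding

1. What is a syntax diagram?

2. What is syntax analysis?

3. What is a syntax analyzer?

You have made it to the end. Thanks for reading, and don't forget to do the exercises :)

I will be back with a new article in a while. Stay hungry for knowledge!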

PS. Translated by @OSC. Previous articles in this series:

Let's Build a Simple Interpreter by Hand, Part 1

Let's Build a Simple Interpreter by Hand, Part 2