前言

前面基本上把词法分析的部分大致说完了,具体的实现细节以及homework和project在接下来进行完成。今天先来看下WA1的内容,这个资料是在斯坦福的官网上找的,edx上好像没有。

正文

  1. Write regular expressions for the following languages over the alphabet Σ = {0,1} :

    (a) The set of all strings where no two consecutive characters are the same.

    (b) The set of all strings representing a binary number that is a power of 2. Allow for leading

    zeros e.g. 001000.
    (c) The set of all strings containing at least one of 1110 or 1011.

    A:

    (a) (0+ε)(10)*(1+ε) 或者 (1+ε)(01)*(0+ε)

    (b) 0*10*

    (c) (1+0)*(1110+1011)(1+0)*

  2. Draw the DFAs for each of the languages from Question 1.

    A:

    (a)

    a的NFA状态图,然后进行DFA状态图,这里只写一个来举个例子怎么做转换好吧,我草稿纸上写了下不想再打一遍了,将就着看吧,字丑别怪我。

    最后成功拿到我们想要的DFA

    (b)

    不存在ε和多路径所以可以直接画出来DFA

    (c)

  3. Using the techniques covered in class, transform the following NFAs with ε-transitions over
    the given alphabet Σ into DFAs. Note that a DFA must have a transition defined for every
    state and symbol pair, whereas a NFA need not. You must take this fact into account for
    your transformations. Hint: Is there a subset of states the NFA transitions to when fed a
    symbol for which the set of current states has no explicit transition?

    (a) Original NFA, Σ = {a, b}:

    DFA:

    (b) Σ = {a,b,c}

    DFA:

    (c) Σ = {a, b}

    DFA:

  4. Let L be the language over Σ = a; b; c where the following holds: w is in L if at most one
    character has more than one occurrence in w.
    Examples of Strings in L: a, baaaaaac, abc
    Examples of Strings not in L: ccbb, abcab,ε
    Draw an NFA for L.

    至多有一个字符多余出现一次(我应该没理解错吧?),太难画了,放弃直接看状态图

  5. Consider the following tokens and their associated regular expressions, given as a flex scanner
    specification:

    Give an input to this scanner such that the output string is (badger11mushroom2)4 snake3,
    where Ai denotes A repeated i times. (And, of course, the parentheses are not part of the
    output.) You may use similar shorthand notation in your answer.

    上面给了flex语法的文件,flex可以根据你写的正则做对应的操作,转换成c文件,里面的代码其实就是有限状态自动机的代码,然后我们根据题目给的输出反推出输入就行了,flex具体请看这个自己动手写编译器8

    然后我们看下输出,很简单就可以拿到输入照着正则写就行了,答案给的例子是((0011)^11 0100^2)^4 011001

    然后这不是唯一答案只要符合正则就行了

  6. Recall from the lecture that, when using regular expressions to scan an input, we resolve
    conflicts by taking the largest possible match at any point. That is, if we have the following
    flex scanner specification:

    and we see the input string \dot", we will match the second rule and emit T Identifier for
    the whole string, not T Do.
    However, it is possible to have a set of regular expressions for which we can tokenize a
    particular string, but for which taking the largest possible match will fail to break the input
    into tokens. Give an example of a set of regular expressions and an input string such that:
    a) the string can be broken into substrings, where each substring matches one of the regular
    expressions, b) our usual lexer algorithm, taking the largest match at every step, will fail to
    break the string in a way in which each piece matches one of the regular expressions. Explain
    how the string can be tokenized and why taking the largest match won’t work in this case.

    题目的意思就是说在做词法分析时我们总是先选择匹配长度长的规则来,但是在某些情况下匹配最长的规则,会导致之后的分词匹配失败,让我们举个例子,仔细思考下其实很简单就是这里是过度匹配而不是适应匹配,过度匹配虽然能解决歧义性,但是可能不符合适应整个字符串都能完整的被接收,这里直接拿答案的例子了:

    输入abab,理论上正确的分词是a和bab,但是因为最长匹配的原则会先匹配aba,导致最后的b没有正则能够匹配上,这也解释了上面的问题

总结

WA1主要就是画画DFA,我太懒了,懒得全画完,没时间2333,继续学习了,加油。