INT201 W1
This is the note of INT201(Decision Computation and Language).
The lecture of this week is mainly discussing about Preliminaries: principal mathematical ideas necessary to understand the material of the course
.
根据预习的笔记修改了一些内容。
Decision, Computation and Language
This module is about the theory of computation and answering the question: “What are the fundamental capabilities and limitations of computers?”
这个模块是关于计算理论的,并回答这个问题:“计算机的基本能力和局限性是什么?”
The central theme: Are the languagues, their membership problems can not be solved (computed) Algorithmically.
Lecture 1
Language
Formal languages have the property that there is a precise rule that governs what strings belong to the language.
正式语言包括了programming languages, database query languages, various file formats.
然而English, Mandarin这种并不是Formal languages.
what the module is about
- notations for representing formal languages. These give us
- ways to define them precisely
- ways to build compilers that recognise the languages
- ways to check whether
- a string of symbols belongs to a language
- whether two alternative descriptions of languages are actually the same language
- tools to analyse languages
- natural languages (NL): want to recognise valid sentence 自然语言
- programming languages (PL): want to recognise valid program 编程语言
- Some notations/methods for describing a language (other than explicitly listing it) include
- finite state automaton 有限状态自动机
- regular expressions 正则表达式
- context-free grammar 上下文无关语法
- pushdown Automation 叠加自动化
- Turing machine 图灵机
Observations
- 语法可以生成毫无意义的句子。
- 可以生成任意长的句子。
- 语义可以根据语法来确定,例如由单词“and”组成的子句的逻辑连接。
- 语法可以扩展到处理从句、副词等。
- 你不能完全用这种方式定义自然语言句子,但你可以用这种方式定义编程语言。
Stages of compilation
- lexical analysis: divide sequence of characters into tokens, such as variable names, operators, labels. In a natural language tokens are strings of consecutive letters (easy to recognise!) 将字符序列划分为符号
- parsing: identify relationships between tokens 辨别符号之间的关系
- code generation: generate object code 集成目标代码
- code optimisation 代码优化
lexical analysis 词法分析
1 | pay=salary+(overtimerate*overtime); |
Break into tokens as follows:
1 | pay |
这就是一个把句子拆开成token的过程
Parse Tree
Definitions and notation
Nothing serious, man.
Languages
ε is nothing and it’s a word. This’s not an empty set. Empty set is a language.
A language (or formal language) over alphabet A is a subset of A*.
L1L2 ={w1w2: w1 ∈ L1 and w2 ∈ L2}. L1L2是两个集合中的所有元素两两组合之后的集合
L* = {w1w2…wn: n ≥ 0 and w1,w2,…wn ∈ L} L*是一个集合元素中任意组合
Lecture 2
Deterministic Finite Automata 确定性有限自动机
Can be used a model for what happens during lexical analysis — scan program from beginning to end and divide it into tokens. 可以用一个模型来描述词法分析过程中发生的事情——从头到尾扫描程序,并将其划分为标记。
Finite automata are used to specify tokens of programming languages. 有限自动机用于指定编程语言的标记。
Also used in “model checking”, reasoning about systems with objective of proving they satisfy useful properties. 也用于“模型检验”,即以证明系统满足有用性质为目标的推理。
Also used in statistical models for analysing biological and textual sequences. 也用于分析生物和文本序列的统计模型。
DFA适用的对象:
- 任意有限长的字符串
- 任意无限长的有限字符串组成的列表
DFA适用的匹配模式: 保证某个字符出现次数不多于几次;保证字符串长度不超过多少;保证字符b永远出现在字符a后
DFA不适用的匹配模式:
如:字符串列表中a开头的字符串数量多于b开头的字符串;字符串回文;有意义的算数表达式
不知道为什么这个让我想到了BST。可能这就是为什么说DFA可以用于”模型验证”。
A deterministic finite automaton (DFA) has 5 components:
- Q is a finite nonempty set whose members are called states of the automaton; Q是一个有限的非空集,其成员称为自动机的状态;
- A is a finite nonempty set called the alphabet of the automaton; A是一个有限的非空集被叫做字母表的自动机;
- φ is a map from Q × A to Q called the transition function of the automaton; φ是一个从Q × a到Q的映射,称为自动机的转移函数;
- i is a member of Q and is called the initial state; i是Q的一个元素,称为初始状态;
- T is a nonempty subset of Q whose members are called terminal states or accepting states. T是Q的一个非空子集,其成员称为终端状态或接受状态。
因此,假如Q = {q0,q1,q2}
,所以q0
是初始状态,唯一的接受状态是q1
,也就是说终结状态T
是q1
。接受语言L
(有01子串的串的语言)的自动机A的完整描述是A = ({q0,q1,q2},{0,1},φ,q0,{q1})
。φ
是转移函数。
State of a machine tells you something about the prefix that has been read so far. If the string is a member of the language of interest, the state reached when the whole string has been scanned will be an accepting state (a member of T ). 机器的状态告诉您有关到目前为止已读取的前缀的一些信息。如果字符串是感兴趣的语言的成员,当整个字符串被扫描时达到的状态将是接受状态。
Transition function φ tells you how state should change when an additional letter is read by the DFA转移函数φ告诉你当DFA读取一个额外的字母时状态应该如何变化。
DFA常被描述为一个有标记的有向图。
Initially the state is i and if the input word is w = a1a2 …an then, as each letter is read, the state changes and we get q1, q2, . . . , qn defined by 假如w按照要求的顺序输入,state就会根据下面图示进行变化。
转化图的例子
右下角的出口箭头和同心圆(concentric circles)都表示终端态
input word 110100
| symbol | 1 | 1 | 0 | 1 | 0 | 0 |
| —– | —– | —– | —– | —– | —– | —– |
| state | i | t | t | t | t | t |
简化
If φ is a partial function (not defined for some state/letter pairs), then the DFA rejects an input if it ever encounters such a pair. 如果简化自动机,将不能达到最终态的途径状态去掉,只考虑根据当前状态和这一次扫描的字符能否达到最终态,如果不能就拒绝,这里描述为undefined未定义。
This convention often simplifies the definition of a DFA. In the previous example we could use transition table
只有cat和dog可以到达最终态,剩下的都被简化掉了。
An important observation
Any DFA that uses the convention that an undefined transition leads to a rejection, can be converted to a DFA that uses a total transition function (that is, one that is defined for all combinations of input symbols and states). The convention is useful, but it does not add extra expressive power. 任何使用未定义转换导致拒绝约定的DFA,都可以转换为使用总转换函数(即为所有输入符号和状态组合定义的转换函数)的DFA。这种约定是有用的,但并没有增加额外的表达能力。
Any finite language is accepted by some DFA
General rule: i is the initial state. For each prefix p of a word in the list, include state sp with the idea that the machine should be in state sp after p has been read. i是初始状态。对于列表中每个单词的前缀p,包含状态sp,即在读取p之后,机器应该处于状态sp。
这和DFA算法的原理是有关系的,DFA的本质是根据初始状态,通过一系列事件转化为另外一种状态的过程。即:state --> event --> state
.
- 确定:状态以及引起状态转换的事件都是可确定的,不存在“意外”。
- 有穷:状态以及事件的数量都是可穷举的。
练习
见CSDN博主「sanmusen_wu」的原创文章: INT201 决策,计算,语言 笔记
References
- XJTLU MC PowerPoint slides (INT201 Lecture1 & Lecture2)
- CSDN博主「sanmusen_wu」的原创文章: INT201 决策,计算,语言 笔记