Parsing Context-Free Languages
(文脈自由言語の構文解析)
7th lecture, May 31, 2019
Language Theory and Compilers
http://www.sw.it.aoyama.ac.jp/2019/Compiler/lecture7.html
Martin J. Dürst
© 2005-19 Martin J. Dürst, Aoyama Gakuin University
Today's Schedule
- Leftovers and summary from last lecture
  - flex
  - homework
- Notations for grammars
- Constructing a grammar
- Parse tree and abstract syntax tree
- Difficulty of parsing
- Top-down and bottom-up parsing
Collection of Homework from Last Lecture
(bring to next lecture, will be collected)
For a programming language that you know (e.g. C, Java, Ruby,...), search
for a grammar on the Web, print it out, and carefully study it.
Leftovers from Last Lecture
Summary of Last Lecture
- Regular expressions and related tools are limited in their descriptive
power
(example: properly nested parentheses,...)
- Regular expressions,... cannot be used for parsing, but are very useful
for lexical analysis
- Separating lexical analysis and parsing leads to a fast compiler with a
clean structure
- Grammars for (programming) languages can be written as context-free
grammars
- Context-free languages can be recognized with push-down automata
- Nondeterministic push-down automata are more powerful than deterministic
ones, but their implementation is more complex and slower
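As a rough illustration of the points above, the following Python sketch uses an explicit stack, in the spirit of a deterministic push-down automaton, to accept properly nested brackets. It is not part of the lecture material; the function name is_properly_nested and the choice of three bracket kinds are assumptions made for this example.

```python
def is_properly_nested(text):
    """Accept text whose brackets are properly nested; other characters are ignored."""
    closing_to_opening = {")": "(", "]": "[", "}": "{"}
    stack = []  # plays the role of the push-down store
    for ch in text:
        if ch in "([{":
            stack.append(ch)  # push on every opening bracket
        elif ch in closing_to_opening:
            # pop on every closing bracket and check that the kinds agree
            if not stack or stack.pop() != closing_to_opening[ch]:
                return False
    return not stack  # accept only if nothing is left open


print(is_properly_nested("f(a[1], {b})"))
print(is_properly_nested("(()"))
```

The unbounded stack is exactly what a finite automaton (and therefore a regular expression) lacks.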
Homework Due Yesterday
Example solution: xml.l
A Difficult Regular Expression: C Comment
For better readability, we will match /x x/ (with x standing in for *), and use spaces
Try #  Regular Expression        Problem
1      /x .* x/                  /xx/ /xx/ is matched as a single unit
2      /x [^x]* x/               /xxx/ is not matched
3      /x ([^x]|x[^/])* x/       /x xx/ /x x/ is matched as a single unit
4      /x ([^x]|x+[^/])* x/      same as above
5      /x ([^x]|x+[^/x])* x/     /x xx/ is not matched
6      /x ([^x]|x+[^/x])* x+/    Done!
Reference: Mastering Regular Expressions, Jeffrey E.F. Friedl, pp. 168,...
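To try the first and last attempts with real comment delimiters, the Python sketch below replaces x by an escaped * and drops the readability spaces. It uses Python's re module rather than flex, so only the patterns carry over; the names TRY_1, TRY_6 and comments, as well as the test strings, are chosen here for illustration.

```python
import re

# Try 1: the greedy .* runs on to the last */ on the line.
TRY_1 = re.compile(r"/\*.*\*/")
# Try 6: the final pattern; (?:...) is a non-capturing group,
# used here only so that the whole match is easy to extract.
TRY_6 = re.compile(r"/\*(?:[^*]|\*+[^/*])*\*+/")


def comments(pattern, text):
    """Return every substring of text matched by pattern, left to right."""
    return [m.group(0) for m in pattern.finditer(text)]


print(comments(TRY_1, "/**/ /**/"))  # one match covering both comments
print(comments(TRY_6, "/**/ /**/"))  # two separate matches
print(comments(TRY_6, "/***/ /* a **/ code /* b */"))
```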
Homework from Last Lecture
(Removed due to circumstances)
Compound Statement for C
compound_statement
: '{' '}'
| '{' statement_list '}'
| '{' declaration_list '}'
| '{' declaration_list statement_list '}'
;
declaration_list
: declaration
| declaration_list declaration
;
statement_list
: statement
| statement_list statement
;
Block for Java
Be careful to distinguish e.g. literal {} (appearing as such in Java) and
grammar-level {} (indicating 0 or more repetitions)
Block:
{ BlockStatements }
BlockStatements:
{ BlockStatement }
BlockStatement:
LocalVariableDeclarationStatement
ClassOrInterfaceDeclaration
[ Identifier : ] Statement
Different Ways to Express a Grammar
Simple Grammar
- Example: C grammar
- Metasymbols: only →/: and |
- Rewriting rules have the same form as in language theory
BNF
- Example: Java grammar
- Metasymbols: →/:, | (or line break), {}, [], (),...
  (the Java grammar uses fonts to distinguish metasymbols ({}) and
  terminal symbols ({}))
- The right side of a rewriting rule is similar to a regular expression
Formal Rewriting Rules and BNF
There are many different ways to write a grammar:
1. Simplest: Only a list of rewriting rules
2. Connect all the right hand sides that have the same left hand side with |
   ⇒ No fundamental change from 1 (syntactic sugar)
3. Add equivalent of ? in regular expressions (present/absent, often written
   with [...])
   ⇒ Can be rewritten as two different rules
4. Add equivalent of * in regular expressions (0 or more repetitions, often
   written {...})
   ⇒ Can be rewritten with [] and a list
Writing grammars as above is often called BNF (Backus-Naur Form), EBNF
(Extended...), or ABNF (Augmented...),...
Rewriting Connected Right Hand Sides
A → B C | D E
⇒
A → B C
A → D E
Rewriting Presence/Absence
A → B [ C ] D
⇒
A → B C D
A → B D
Rewriting BNF Repetition
A → B { C } D
⇒
A → B [ CList ] D
CList → C | CList C
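The three rewriting steps above can be sketched mechanically. The Python function below is only an illustration under assumptions made here: a right hand side is a list whose items are plain symbol names, ("opt", X) for [ X ], or ("rep", X) for { X }; the function name desugar and the helper name XList are hypothetical.

```python
def desugar(lhs, rhs):
    """Rewrite one rule with [ ] and { } items into plain rules (lhs, [symbols])."""
    helper_rules = []  # rules such as CList -> C | CList C
    variants = [[]]    # all plain right hand sides built so far
    for item in rhs:
        if isinstance(item, str):
            variants = [v + [item] for v in variants]
            continue
        kind, symbol = item
        if kind == "rep":
            # { C } becomes [ CList ] with CList -> C | CList C
            helper = symbol + "List"
            helper_rules += [(helper, [symbol]), (helper, [helper, symbol])]
            symbol = helper
        # [ X ]: one variant with X present, one with X absent
        variants = [v + [symbol] for v in variants] + variants
    return [(lhs, v) for v in variants] + helper_rules


for left, right in desugar("A", ["B", ("rep", "C"), "D"]):
    print(left, "→", " ".join(right))
```

The sketch does not merge helper rules if the same symbol is repeated twice in one rule; a real tool would need to handle that case.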
How to Create a Grammar
1. Write down a simple example word of the language
2. Convert the example to token types (result of lexical analysis)
3. Give names to the various phenomena in the example (e.g.: ...expression,
   ...statement, etc.)
4. Create draft rewriting rules
5. Repeat 1.-4. with more difficult examples, check, and fix
Example of Grammar Creation
- Goal: A grammar for formulæ such as 5 + 3
Goal of Parsing
- Acceptance/rejection of input
- Processing of input:
  - Direct computation of result (for uses such as "pocket calculator")
  - Creation of abstract syntax tree
- Easy to understand error messages
Result of Parsing: Parse Tree and Abstract Syntax Tree
- Parse tree (concrete syntax tree):
  - Leaves are non-terminals (during analysis) and terminals
  - Internal nodes are non-terminals
  - Purpose: Research/analysis of grammars and parsing methods
- Abstract syntax tree:
  - Leaves are subset of terminals (identifiers, literals,...)
  - Nodes correspond to some non-terminals, but are often labeled with
    terminals (operators, keywords,...)
  - Some terminals are ignored (parentheses,...)
Examples of Parse Tree and Abstract Syntax Tree
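As a small textual example, the Python sketch below writes down a parse tree for 5 + 3 * 2 and converts it to an abstract syntax tree. The grammar E → T | E + T; T → n | T * n, the tuple representation, and the function name to_ast are assumptions made for this illustration.

```python
# Parse tree: a non-terminal is (name, [children]); a terminal is a string.
parse_tree = (
    "E",
    [
        ("E", [("T", ["5"])]),
        "+",
        ("T", [("T", ["3"]), "*", "2"]),
    ],
)


def to_ast(node):
    """Convert a parse tree for the grammar above to an abstract syntax tree."""
    if isinstance(node, str):
        return int(node)  # number literals become leaves
    _name, children = node
    if len(children) == 1:
        return to_ast(children[0])  # E → T and T → n add no information
    left, operator, right = children
    # the operator terminal becomes the label of the inner node
    return (operator, to_ast(left), to_ast(right))


print(to_ast(parse_tree))  # ('+', 5, ('*', 3, 2))
```

The abstract syntax tree keeps the operators and the numbers but drops the non-terminal names E and T, which only mattered for deciding the structure.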
Parser Implementation: Top-Down or Bottom-Up
- Top-down parsing:
  - Build the parse tree from the top (root, start symbol)
- Bottom-up parsing:
  - Build the parse tree from the bottom (terminal symbols)
  - During parsing, there may be several (small) parse trees
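To give a feeling for the top-down direction, here is a minimal sketch of one simple top-down technique (recursive descent) used as a "pocket calculator". It assumes the grammar E → T { + T }, T → n { * n } and a token list that has already been produced by lexical analysis; the function names are hypothetical.

```python
def evaluate(tokens):
    """Top-down parse of E → T { + T }, T → n { * n }; returns the computed value."""
    position = 0

    def number():
        nonlocal position
        if position >= len(tokens) or not tokens[position].isdigit():
            raise SyntaxError(f"number expected at token {position}")
        value = int(tokens[position])
        position += 1
        return value

    def term():  # T → n { * n }
        nonlocal position
        value = number()
        while position < len(tokens) and tokens[position] == "*":
            position += 1
            value *= number()
        return value

    # E → T { + T }: start at the root (start symbol) and work downwards
    value = term()
    while position < len(tokens) and tokens[position] == "+":
        position += 1
        value += term()
    if position != len(tokens):
        raise SyntaxError(f"unexpected token at position {position}")
    return value


print(evaluate("5 + 3 * 2".split()))  # 11
```

Each non-terminal corresponds to one function, and the parser decides which rule to apply by looking only at the next token.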
Difficulty of Parsing
- Deterministic and non-deterministic push-down automata differ in
recognition power
- A general algorithm to recognize context-free languages is O(n³)
- An O(n) algorithm is highly desirable, but this means that the grammar
has to be somewhat limited
- In research on parsing, many different methods have been studied:
- Methods easy to understand for humans
- Methods with few limitations on grammars
- Grammars for which a parser can be implemented by hand
- Grammars for which a parser can be implemented automatically
Very General Parsing Method
(Cocke–Younger–Kasami (CYK) algorithm)
- Convert grammar to Chomsky normal form
  (only rules of the forms A → BC, D → d, and S → ε are allowed)
- Consider subsequences of the input of length 1, 2, ...
- Computational complexity is O(n³ · g)
  (where g is the size of the grammar)
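The steps above can be sketched as follows. This is a minimal recognizer, not a full parser: it assumes the grammar is already in Chomsky normal form, does not handle S → ε, and uses an example grammar for non-empty properly nested parentheses that was made up for this illustration (S → SS | LR | LT, T → SR, L → (, R → )). The names cyk, UNARY, and BINARY are hypothetical.

```python
def cyk(word, unary, binary, start="S"):
    """Return True if word is derivable from start (grammar in Chomsky normal form)."""
    n = len(word)
    if n == 0:
        return False  # S → ε is not handled in this sketch
    # table[length - 1][i]: non-terminals that derive word[i : i + length]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, symbol in enumerate(word):  # subsequences of length 1
        table[0][i] = {a for a, terminal in unary if terminal == symbol}
    for length in range(2, n + 1):  # subsequences of length 2, 3, ...
        for i in range(n - length + 1):
            for split in range(1, length):  # three nested loops over n: O(n³)
                left = table[split - 1][i]
                right = table[length - split - 1][i + split]
                for a, b, c in binary:  # factor g: size of the grammar
                    if b in left and c in right:
                        table[length - 1][i].add(a)
    return start in table[n - 1][0]


UNARY = [("L", "("), ("R", ")")]
BINARY = [("S", "S", "S"), ("S", "L", "R"), ("S", "L", "T"), ("T", "S", "R")]

print(cyk("(())()", UNARY, BINARY))
print(cyk("(()", UNARY, BINARY))
```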
Homework
Deadline: June 6, 2019 (Thursday), 19:00
Where to submit: Box in front of room O-529 (building O, 5th floor)
Format: A4 single page (using both sides is okay; NO cover page, staple in
top left corner if more than one page is necessary), easily readable
handwriting (NO printouts), name (kanji and kana) and student number at the top
right
In the problems below, n, +, -, *, and / are terminal symbols. Any other
letters are non-terminal symbols. n denotes an arbitrary number, and the other
symbols denote the four basic arithmetic operations.
1. For the three grammars below, construct all the possible parse trees for
   words of length 5. Find the grammar that allows all and only those parse
   trees that produce correct results.
   - E → n | E - E
   - E → n | n - E
   - E → n | E - n
2. Same as in problem 1 for the four grammars below:
   - E → n | E + E | E * E
   - E → n | E + n | E * n
   - E → T | E + T; T → n | T * n
   - E → T | E * T; T → n | T + n
3. (bonus problem) Based on what you learned from problems 1 and 2, create a
   grammar that allows expressions with the four arithmetic operations
   (without parentheses) to be calculated correctly. Check this grammar with
   expressions of length 5.
- Bring your notebook computer to the next lecture
Glossary
- parse tree (concrete syntax tree): 解析木
- abstract syntax tree: 構文木 (抽象構文木)
- syntactic sugar: 糖衣構文
- top-down parsing: 下向き構文解析
- bottom-up parsing: 上向き構文解析
- pocket calculator: 電卓
- Chomsky normal form: Chomsky 標準形
- four (basic) arithmetic operations: 四則演算