Parsing Context-Free Languages

(文脈自由言語の構文解析)

7th lecture, May 26, 2017

Language Theory and Compilers

http://www.sw.it.aoyama.ac.jp/2017/Compiler/lecture7.html

Martin J. Dürst

Today's Schedule

Leftovers and summary from the last lecture
flex homework
Notations for grammars
Constructing a grammar
Parse tree and abstract syntax tree
Difficulty of parsing
Top-down and bottom-up parsing

Summary of Last Lecture

Regular expressions and related tools are limited in their descriptive power
(example: properly nested parentheses,...)
Regular expressions and related tools cannot be used for parsing, but are very useful for lexical analysis
Separating lexical analysis and parsing leads to a fast compiler with a clean structure
Grammars for (programming) languages can be written as context-free grammars
Context-free grammars can be recognized with push-down automata
Nondeterministic push-down automata are more powerful than deterministic ones, but their implementation is more complex and slower
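
As a rough illustration of why a stack helps where regular expressions fail, the following minimal Python sketch recognizes properly nested brackets with an explicit stack, in the spirit of a deterministic push-down automaton. The names `PAIRS` and `balanced` are made up for this example, and characters other than brackets are simply ignored.

```python
# Hypothetical helper names; a sketch, not part of the lecture material.
PAIRS = {")": "(", "]": "[", "}": "{"}  # closing bracket -> matching opening bracket

def balanced(text):
    """Return True if all brackets in text are properly nested."""
    stack = []
    for ch in text:
        if ch in PAIRS.values():      # opening bracket: push
            stack.append(ch)
        elif ch in PAIRS:             # closing bracket: pop and compare
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack                  # accept only if nothing is left open

print(balanced("{[()]}"))  # nested correctly
print(balanced("([)]"))    # crossing brackets
```

The stack depth is unbounded, which is exactly what a finite automaton (and therefore a regular expression) cannot provide.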

Homework Due Yesterday

Regular Expression for C Comments

For better readability, we will match /x x/, and use spaces

Try #	Regular Expression	Problem
1	`/x .* x/`	`/xx/ /xx/` is matched as a single unit
2	`/x [^x]* x/`	`/xxx/` is not matched
3	`/x ([^x]|x[^/])* x/`	`/x xx/ /x x/` is matched as a single unit
4	`/x ([^x]|x+[^/])* x/`	same as above
5	`/x ([^x]|x+[^/x])* x/`	`/x xx/` is not matched
6	`/x ([^x]|x+[^/x])* x+/`	Done!

Reference: Mastering Regular Expressions, Jeffrey E.F. Friedl, pp. 168,...
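
The first and the last try from the table can be checked with a short Python sketch. It assumes Python's `re` module with the `re.VERBOSE` flag, which ignores spaces outside character classes, so the patterns can be written with spaces as in the table. The names `TRY_1` and `TRY_6` are chosen only for this example, and `x` still stands in for `*`.

```python
import re

# Try 1: greedy .* runs on to the last "x/" in the input.
TRY_1 = re.compile(r"/x .* x/", re.VERBOSE)
# Try 6: the final expression from the table (non-capturing group,
# so that findall returns whole matches).
TRY_6 = re.compile(r"/x (?: [^x] | x+ [^/x] )* x+ /", re.VERBOSE)

text = "/xx/ /xx/"
print(TRY_1.findall(text))  # one match covering both comments
print(TRY_6.findall(text))  # two separate matches

for comment in ["/xxx/", "/x xx/", "/x a x b xx/"]:
    print(comment, bool(TRY_6.fullmatch(comment)))
```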

Homework from Last Lecture

Compound Statement for C

compound_statement
        : '{' '}'
        | '{' statement_list '}'
        | '{' declaration_list '}'
        | '{' declaration_list statement_list '}'
        ;

declaration_list
        : declaration
        | declaration_list declaration
        ;

statement_list
        : statement
        | statement_list statement
        ;

Block for Java

Block: 
     { BlockStatements }

 BlockStatements: 
     { BlockStatement }

 BlockStatement:
     LocalVariableDeclarationStatement
     ClassOrInterfaceDeclaration
     [Identifier :] Statement

Different Ways to Express a Grammar

Simple Grammar

Example: C grammar
Metasymbols: only →, :, |
Rewriting rules have the same form as in language theory

BNF

Example: Java grammar
Metasymbols: →, :, | (or line break), {}, [], (),...
(the Java grammar uses different fonts to distinguish metasymbols such as {} from terminal symbols such as {})
The right side of a rewriting rule is similar to a regular expression

Formal Rewriting Rules and BNF

There are many different ways to write a grammar:

1. Simplest: Only a list of rewriting rules
2. Connect all the right-hand sides with the same left-hand side with |
   ⇒ No fundamental change from 1. (syntactic sugar)
3. Add the equivalent of ? in regular expressions (present/absent, often written with [...])
   ⇒ Can be written as two different rules
4. Add the equivalent of * in regular expressions (often written {...})
   ⇒ Can be rewritten (see next slide)

This kind of grammar is often called BNF (Backus-Naur Form), EBNF (Extended...), or ABNF (Augmented...),...

Rewriting BNF to Simple Grammar Rules

M → a {N} b

⇒

M → a b | a NList b
NList → N | NList N
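
The rewriting step can be mechanized. The sketch below is one possible way to do it in Python, under the assumption that a right-hand side is given as a list whose items are either a symbol (a string) or a repetition written as the tuple `("rep", symbol)`. The function name `eliminate_repetition` and the `...List` naming scheme are invented for this example.

```python
from itertools import product

def eliminate_repetition(lhs, rhs):
    """Rewrite one rule containing {X} repetitions into simple rules.

    Returns a list of (left-hand side, right-hand side) pairs.
    """
    list_rules = []   # rules for the new XList non-terminals
    options = []      # per position: the alternative symbol sequences
    for item in rhs:
        if isinstance(item, tuple):           # ("rep", X) stands for {X}
            _, sym = item
            list_name = sym + "List"
            list_rules.append((list_name, [sym]))
            list_rules.append((list_name, [list_name, sym]))
            options.append([[], [list_name]])  # absent, or one or more X
        else:
            options.append([[item]])
    main_rules = [(lhs, [s for part in choice for s in part])
                  for choice in product(*options)]
    return main_rules + list_rules

for left, right in eliminate_repetition("M", ["a", ("rep", "N"), "b"]):
    print(left, "→", " ".join(right))
```

A rule with k repetitions yields 2^k alternatives for the original left-hand side; if the same symbol is repeated more than once, this sketch emits its list rules more than once.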

How to Create a Grammar

1. Write down a simple example word of the language
2. Convert the example to token types (result of lexical analysis)
3. Give names to the various phenomena in the example (e.g.: ...expression, ...statement, etc.)
4. Create draft rewriting rules
5. Repeat 1.-4. with more difficult examples, check, and fix

Example of Grammar Creation

Goal: A grammar for formulæ such as 5 + 3
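
The step from an example word to token types can be sketched in Python as follows. The token names (`n` for a number, and the operators standing for themselves) follow the convention used in the homework below; `TOKEN_SPEC` and `token_types` are names assumed only for this example, and a real lexer would more likely be generated with flex.

```python
import re

# (token type, regular expression); "skip" marks whitespace to be dropped.
TOKEN_SPEC = [
    ("n", r"\d+"),
    ("+", r"\+"),
    ("-", r"-"),
    ("*", r"\*"),
    ("/", r"/"),
    ("skip", r"\s+"),
]

def token_types(text):
    """Convert an input string to the list of its token types."""
    pos = 0
    result = []
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            m = re.compile(pattern).match(text, pos)
            if m:
                if name != "skip":
                    result.append(name)
                pos = m.end()
                break
        else:  # no pattern matched at this position
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
    return result

print(token_types("5 + 3"))
```

The grammar is then written over these token types rather than over individual characters.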

Goal of Parsing

Acceptance/rejection of input
Processing of input:
- Direct computation of result (for uses such as "pocket calculator")
- Creation of abstract syntax tree
Easy to understand error messages

Result of Parsing: Parse Tree and Abstract Syntax Tree

Parse tree (concrete syntax tree):

Leaves are terminals (and, while parsing is still in progress, also non-terminals)
Internal nodes are non-terminals
Purpose: Research/analysis of grammars and parsing methods

Abstract syntax tree:

Leaves are subset of terminals (identifiers, literals,...)
Nodes correspond to a subset of the non-terminals, but are often labeled with a subset of the terminals (operators, keywords,...)
Some terminals are ignored (parentheses,...)
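
As a small illustration, an abstract syntax tree for `(5 + 3) * 2` can be sketched in Python with nested tuples. The representation `(operator, left, right)` and the names `OPS` and `evaluate` are assumptions made for this example. Internal nodes are labeled with operators, leaves are literals, and the parentheses no longer appear because the tree structure already encodes the grouping.

```python
import operator

# Map operator terminals to the functions that compute them.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

# AST for (5 + 3) * 2: parentheses are not stored.
ast = ("*", ("+", 5, 3), 2)

def evaluate(node):
    """Compute the value of an AST given as nested (op, left, right) tuples."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](evaluate(left), evaluate(right))
    return node  # a leaf: a number literal

print(evaluate(ast))
```

A "pocket calculator" style parser would compute the same value directly while parsing, without ever building the tree.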

Examples of Parse Tree and Abstract Syntax Tree

Parser Implementation: Top-Down or Bottom-Up

Top-down parsing: Build the parse tree from the top (root, start symbol)
Bottom-up parsing: Build the parse tree from the bottom (terminal symbols); during parsing, there may be several (small) parse trees
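
A minimal sketch of the top-down approach, written in Python as a recursive-descent recognizer. The grammar S → ( S ) S | ε for properly nested parentheses is chosen here only for illustration, and the function names `parse_s` and `is_nested` are made up. Each call to `parse_s` corresponds to expanding the non-terminal S, starting from the start symbol at the root.

```python
def parse_s(text, pos=0):
    """Expand S → ( S ) S | ε at position pos; return the position after S."""
    if pos < len(text) and text[pos] == "(":
        inner_end = parse_s(text, pos + 1)         # the inner S
        if inner_end < len(text) and text[inner_end] == ")":
            return parse_s(text, inner_end + 1)    # the trailing S
        raise SyntaxError(f"expected ')' at position {inner_end}")
    return pos  # S → ε: consume nothing

def is_nested(text):
    """Accept text only if one S covers the whole input."""
    try:
        return parse_s(text) == len(text)
    except SyntaxError:
        return False

for example in ["(()())", "(()", ")("]:
    print(example, is_nested(example))
```

The decision between the two alternatives of S needs only the next input character, which is what makes this grammar easy to parse by hand.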

Difficulty of Parsing

Deterministic and non-deterministic push-down automata differ in recognition power
A general algorithm to recognize context-free languages is O(n³)
An O(n) algorithm is highly desirable, but this means that the grammar has to be somewhat limited
In research on parsing, many different methods have been studied:
- Methods easy to understand for humans
- Methods with few limitations on grammars
- Grammars for which a parser can be implemented by hand
- Grammars for which a parser can be implemented automatically

Very General Parsing Method

(Cocke–Younger–Kasami (CYK) algorithm)

Convert grammar to Chomsky normal form
(only rules of the forms A → BC, D → d, and S → ε are allowed)
Consider subsequences of the input of length 1, 2, ...
- (for subsequences of length ≥3) consider many different ways of splitting
  (dynamic programming, same as the matrix chain multiplication problem in Data Structures and Algorithms)
- Check and store all applicable rewriting rules
Computational complexity is O(n³ · g) (where g is the size of the grammar)
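
The steps above can be sketched in Python as follows. The grammar used here (non-empty properly nested parentheses, already in Chomsky normal form) is an assumption chosen for illustration, as are the names `cyk`, `UNARY`, and `BINARY`; the rule S → ε is left out, so the empty word is simply rejected.

```python
# Grammar in Chomsky normal form:
#   S → S S | L R | L A,  A → S R,  L → (,  R → )
UNARY = [("L", "("), ("R", ")")]
BINARY = [("S", ("S", "S")), ("S", ("L", "R")),
          ("S", ("L", "A")), ("A", ("S", "R"))]

def cyk(word, unary=UNARY, binary=BINARY, start="S"):
    """Return True if word can be derived from start."""
    n = len(word)
    if n == 0:
        return False  # S → ε is not handled in this sketch
    # table[length][i]: non-terminals that derive word[i:i+length]
    table = [[set() for _ in range(n)] for _ in range(n + 1)]
    for i, ch in enumerate(word):                    # subsequences of length 1
        table[1][i] = {lhs for lhs, t in unary if t == ch}
    for length in range(2, n + 1):                   # longer subsequences
        for i in range(n - length + 1):
            for split in range(1, length):           # all ways of splitting
                left = table[split][i]
                right = table[length - split][i + split]
                for lhs, (b, c) in binary:           # all applicable rules
                    if b in left and c in right:
                        table[length][i].add(lhs)
    return start in table[n][0]

for example in ["()", "(())", "()()", "(()"]:
    print(example, cyk(example))
```

The three nested loops over `length`, `i`, and `split` give the n³ factor, and the innermost loop over the rules gives the factor g.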

Homework

Deadline: June 1st, 2017 (Thursday), 19:00

Where to submit: Box in front of room O-529 (building O, 5th floor)

Format: A4 single page (using both sides is okay; NO cover page, staple in top left corner if more than one page is necessary), easily readable handwriting (NO printouts), name (kanji and kana) and student number at the top right

In the problems below, n, +, -, *, and / are terminal symbols, and any other letters are non-terminal symbols. n denotes an arbitrary number, and the other symbols denote the four basic arithmetic operations.

1. For the three grammars below, construct all the possible parse trees for words of length 5. Find the grammar that allows all and only those parse trees that lead to correct results.
   1. E → n | E - E
   2. E → n | n - E
   3. E → n | E - n
2. Same as in problem 1 for the four grammars below:
   1. E → n | E + E | E * E
   2. E → n | E + n | E * n
   3. E → T | T + T; T → n | n * n
   4. E → T | T * T; T → n | n + n
3. (bonus problem) Based on what you learned from problems 1 and 2, create a grammar that allows expressions with the four arithmetic operations (without parentheses) to be calculated correctly. Check this grammar with expressions of length 5.
4. Bring your notebook computer to the next lecture

Glossary

parse tree (concrete syntax tree): 解析木
abstract syntax tree: 構文木 (抽象構文木)
syntactic sugar: 糖衣構文
top-down parsing: 下向き構文解析
bottom-up parsing: 上向き構文解析
pocket calculator: 電卓
Chomsky normal form: Chomsky 標準形
four (basic) arithmetic operations: 四則演算