Use of Tools for Parsing
(構文解析用のツールの詳細)
10th lecture, June 16, 2017
Language Theory and Compilers
http://www.sw.it.aoyama.ac.jp/2017/Compiler/lecture10.html
Martin J. Dürst
© 2005-17 Martin J. Dürst 青山学院大学
Today's Schedule
- About last week's homework
- Summary of last lecture
- How bison works:
- Different orders of derivation
- LALR parsing
- How to read the .output file
- How to debug bison grammars
Last Week's Homework
Complete calc.y so that with test.in as an input, it produces test.check
Caution: How to deal with unary MINUS
Summary of Last Lecture
- Creating a program with bison and flex needs many steps, so using make
is important
- The input format for bison is very similar to the input format for flex,
but there are also some differences
- bison uses attribute grammars to calculate the result of parsing
- The attributes are referenced as $$ (left hand side) and $1, $2,...
(right hand side) in the C program fragments
- Priority and associativity of operators are expressed in the rewriting
rules of the grammar
How to Express Priorities
- Use a separate nonterminal symbol for each priority level
- Write the grammar starting with the lowest priority (outside)
- On the right hand side of the rewriting rule for a given priority's
nonterminal,
use nonterminals of the same or one level higher priority
- How to select names for nonterminals:
- Use mathematical terms (用語): term (項), factor (因子)
- Use the type of operator, or one representative operator
(shift_expression, mulExpression,...)
How to Express Associativity
- Left associative:
- Use the same nonterminal on the left hand side of the rewriting rule
and on the right hand side to the left of the operator
- Use the one level higher priority nonterminal to the right
of the operator
- (unless this operator is required) Also create a rewriting rule with
just the one level higher nonterminal as the right hand side
- Right associative:
- Exchange the nonterminals to the right and the left of the
operator
How to Express Repetition (Lists)
- Example: A list of statements (statementList or statements)
- Two rewriting rules are needed
- Base rule:
- Repetition of 0 or more times: A rewriting rule with an empty right
hand side
- Repetition of 1 or more times: A rewriting rule with a single element
(e.g. statement) of the list on the right hand side
- Inductive rule:
- Right hand side uses both list and single element nonterminals
- If associativity is important, it determines the order of the two
nonterminals
- If associativity is not important, there are two choices:
- List first: Left recursion; advantage: smaller stack
- Element first: Right recursion
- Examples:
How to Express Other Constructs (e.g. if statement)
Write as is, carefully distinguishing alternatives and terminal/non-terminal
symbols
if_statement : IF OPENPAREN cond CLOSEPAREN statement
| IF OPENPAREN cond CLOSEPAREN statement
ELSE statement
;
Grammar Patterns: Repetition
One or more times:
items: items item
| item
;
Zero or more times:
items: items item
|
;
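Instead of "items item", "item items" is also possible, but with this right
recursion bison has to keep all items on its stack until the end of the list,
so the stack may become a problem for long lists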
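The stack concern can be illustrated with a rough model in C. This is only a sketch, not bison's actual implementation: it counts the symbols on the parser stack, assuming that a reduction happens as soon as a complete right hand side is on top of the stack. The function names max_depth_left and max_depth_right are made up for this illustration.

```c
/* Maximum number of symbols on the stack while parsing n items (n >= 1)
   with the left-recursive rules   items: items item | item ;            */
int max_depth_left(int n)
{
    int depth = 0, max = 0;
    for (int i = 0; i < n; i++) {
        depth++;                 /* shift the next item */
        if (depth > max)
            max = depth;
        depth = 1;               /* reduce "item" or "items item" to items at once */
    }
    return max;
}

/* Same count for the right-recursive rules   items: item items | item ; */
int max_depth_right(int n)
{
    int depth = 0, max = 0;
    for (int i = 0; i < n; i++) {
        depth++;                 /* shift; no reduction is possible before the end */
        if (depth > max)
            max = depth;
    }
    /* the last "item" is reduced to items (depth stays the same) */
    while (depth > 1)
        depth--;                 /* reduce "item items" to items, one step per item */
    return max;
}
```

Under this model, a list of 1000 items needs a stack depth of 2 with "items item", but a depth of 1000 with "item items".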
Grammar Patterns: Associativity
Left associative:
big_exp: big_exp left_associative_operator small_exp
| small_exp
;
Right associative:
big_exp: small_exp right_associative_operator big_exp
| small_exp
;
Grammar Patterns: Priority
(priority is small_exp > middle_exp > big_exp; assuming left
associative)
big_exp: big_exp operator middle_exp
| middle_exp
;
middle_exp: middle_exp operator small_exp
| small_exp
;
Grammar Patterns: Parentheses
small_exp: open_paren big_exp close_paren
| literal
;
Order of Derivation: Leftmost and Rightmost Derivation
With leftmost derivation, the leftmost nonterminal in the
syntax tree is always expanded first
With rightmost derivation, the rightmost nonterminal in
the syntax tree is always expanded first
Simple example grammar:
E → E '+' T
| T
T → integer
Example of input: 5 + 7 + 3
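As a worked example, the two derivations of this input can be written out step by step. The following is a sketch that writes the integer tokens directly as 5, 7, and 3, so that each application of T → integer shows which token is produced.

```latex
% Leftmost derivation: always expand the leftmost nonterminal
\begin{align*}
E &\Rightarrow E + T \Rightarrow E + T + T \Rightarrow T + T + T \\
  &\Rightarrow 5 + T + T \Rightarrow 5 + 7 + T \Rightarrow 5 + 7 + 3
\end{align*}
% Rightmost derivation: always expand the rightmost nonterminal
\begin{align*}
E &\Rightarrow E + T \Rightarrow E + 3 \Rightarrow E + T + 3 \\
  &\Rightarrow E + 7 + 3 \Rightarrow T + 7 + 3 \Rightarrow 5 + 7 + 3
\end{align*}
```

Both derivations use the same rewriting rules the same number of times and lead to the same syntax tree; only the order of the steps differs.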
Derivation Choices
Different choices may:
- Generate different words: This is necessary to be able to process
different inputs
- Generate the same word, but with different syntax trees: Ambiguous
grammar, try to avoid
- Generate the same syntax tree, but in different orders
(leftmost/rightmost/... derivation): Different parsing algorithm
Kinds of Analysis Methods
- LL: Read input from the left, use leftmost derivation (used in top-down
parsing)
- LR: Read input from the left, use rightmost derivation (in reverse
order; used in bottom-up parsing)
- LL(1): LL, with one token lookahead
- LR(1): LR, with one token lookahead
- LALR: A kind of LR(1), used widely in yacc and bison
The labels are also used for grammars:
"This grammar is LL(1)" means: this grammar can be used with an LL(1) parser
Understanding bison: The .output File
bison -v creates a file with extension .output, containing the following
interesting details:
- [Problems: Unused terminal symbols, conflicts]
- Grammar: Numbered rewriting rules (rule number 0 is
$accept: start symbol $end)
- Terminals: Numbered terminal symbols (numbers are ASCII codes or ≥ 256)
- Nonterminals, with numbers of rules where they appear
- States, with the following information for each state:
- Rewriting rules (. shows current position)
- Terminal symbols (or $default) and the action (shift, reduce, goto)
if this symbol is the next symbol in the input
- Nonterminal symbols: goal state of transition after reduction
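Understanding bison: Debugging
#define YYDEBUG 1 compiles the debugging code into the parser; setting the
variable yydebug to a nonzero value then switches on the debugging output
The output shows how bison works: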
- A (pushdown) stack is used to store:
- States (of an automaton)
- Already read terminals and reduced nonterminals
- The automaton and the next input token decide the action to be taken
- There are three possible actions:
- shift: Read a token and put it on the stack (together with a
state)
- reduce: Convert some tokens and/or nonterminals on the stack to a
single nonterminal using a rewriting rule
(a reduce action is always followed by a goto to another state)
- accept: Stop processing and accept the input
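The three actions can be seen at work in a small hand-written shift-reduce parser for the earlier example grammar E → E '+' T | T, T → integer. This is a sketch under assumptions, not code generated by bison: the state numbers are chosen by hand and need not match those in a .output file, and the names parse, struct token, and TOK_... are made up for this illustration.

```c
#include <stdio.h>

/* Token types and tokens (names are made up for this sketch) */
enum { TOK_INT, TOK_PLUS, TOK_END };
struct token { int type, value; };

/*
 * Hand-made states for   E: E '+' T | T ;   T: integer ;
 *   0: start              1: E seen           2: T seen (reduce E: T)
 *   3: integer seen (reduce T: integer)       4: E '+' seen
 *   5: E '+' T seen (reduce E: E '+' T)
 * Returns 1 and stores the sum in *result on accept, 0 on a syntax error.
 */
int parse(const struct token *tok, int *result, int trace)
{
    /* parallel stacks for states and attribute values;
       this grammar never needs more than 4 entries */
    int state[8], value[8];
    int top = 0;

    state[0] = 0;
    value[0] = 0;
    for (;;) {
        int s = state[top];

        if ((s == 0 || s == 4) && tok->type == TOK_INT) {
            if (trace) printf("shift integer %d, go to state 3\n", tok->value);
            top++;
            state[top] = 3;
            value[top] = tok->value;
            tok++;
        } else if (s == 1 && tok->type == TOK_PLUS) {
            if (trace) printf("shift '+', go to state 4\n");
            top++;
            state[top] = 4;
            value[top] = 0;
            tok++;
        } else if (s == 1 && tok->type == TOK_END) {
            if (trace) printf("accept\n");
            *result = value[top];
            return 1;
        } else if (s == 2 || s == 3 || s == 5) {
            int pop = (s == 5) ? 3 : 1;          /* length of the right hand side */
            int v = (s == 5) ? value[top - 2] + value[top] : value[top];
            int next;

            top -= pop;                          /* remove the right hand side */
            if (s == 3)
                next = (state[top] == 0) ? 2 : 5;    /* goto on T */
            else
                next = 1;                            /* goto on E */
            if (trace) printf("reduce in state %d, go to state %d\n", s, next);
            top++;
            state[top] = next;
            value[top] = v;
        } else {
            return 0;                            /* no action possible: syntax error */
        }
    }
}
```

For the tokens of 5 + 7 + 3 followed by an end token, this sketch accepts with the value 15. With trace set to a nonzero value, it prints one line per shift, reduce, or accept step, similar in spirit to the debugging output of a bison parser.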
Conflicts and Ambiguous Grammars
- When bison is run, it may report conflicts:
- shift/reduce conflicts: Both shift and reduce are possible; by default,
shift is chosen
- reduce/reduce conflicts: There is more than one way to reduce
- bison just chooses one of the alternatives:
- If this is the right selection, we should be fine
(but we may want to fix the grammar anyway)
- If this is the wrong selection, we need to fix the grammar
- Grammar example:
E → E '-' E | integer
- For 5 - 3 - 7, this grammar allows two interpretations:
(5-3) - 7 and 5 - (3-7)
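That the two interpretations really differ can be checked with a few lines of C. The helper names eval_left and eval_right are made up for this sketch; they evaluate a chain of subtractions grouped from the left and from the right, respectively.

```c
/* Left associative: ((v[0] - v[1]) - v[2]) - ...   (n >= 1) */
int eval_left(const int *v, int n)
{
    int r = v[0];
    for (int i = 1; i < n; i++)
        r = r - v[i];
    return r;
}

/* Right associative: v[0] - (v[1] - (v[2] - ...))   (n >= 1) */
int eval_right(const int *v, int n)
{
    int r = v[n - 1];
    for (int i = n - 2; i >= 0; i--)
        r = v[i] - r;
    return r;
}
```

For the values 5, 3, 7, eval_left gives (5-3) - 7 = -5, while eval_right gives 5 - (3-7) = 9, so the parser's choice changes the result of the calculation.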
Another Example of Ambiguity
The grammar for if-else is a famous example of ambiguity:
if (...) if (...) ...; else ...;
can be parsed in two ways:
if (...) {
if (...)
...;
else
...;
}
or
if (...) {
if (...)
...;
}
else
...;
This creates a shift/reduce conflict.
The first way of parsing is correct (for C), and is chosen by bison because
in a shift/reduce conflict, shift is selected.
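The behavior of C itself can be checked with a small function. This is a sketch; the name dangling_else is made up, and some compilers warn about the missing braces, which is exactly the ambiguity discussed here.

```c
/* Returns 1 or 2 depending on which branch ran, or 0 if neither ran. */
int dangling_else(int a, int b)
{
    int x = 0;
    if (a) if (b) x = 1; else x = 2;   /* the else belongs to the inner if */
    return x;
}
```

With a equal to 0, the function returns 0 whatever the value of b is. If the else belonged to the outer if, it would return 2 in that case.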
Grammar of bison Rewriting Rules
rewritingRule → nonterminalSymbol ":" rightHandList ";"
rightHandList → rightHand | rightHand "|" rightHandList
rightHand → symbolList "{" CFragment "}"
symbolList → symbol | symbol symbolList
symbol → nonterminalSymbol | terminalSymbol
How to Combine flex and bison
- In the .y file, list all token types:
%token NUM PLUS ASTERISK ...
- In the .y file, define the type of the attributes:
#define YYSTYPE int
- In the .lex file, define one or more rules for each token type
- In the .y file, define the rewriting rules of the grammar
- In the .y file, write the program fragments to calculate attribute values
- Process (with flex/bison), compile, and test
Advantages and Problems of Bottom-Up Parsing
- Advantages:
- No fear of left recursion
- Wider range of grammars
- Automatic creation of parser
- Problems:
- Very hard to create parser by hand
- Ambiguity needs attention
Homework: Simple Programming Language
Deadline: June 29, 2017 (Thursday in two weeks), 19:00
Prepare questions so that you can ask them in next week's lecture!
Where to submit: Box in front of room O-529 (building O, 5th floor)
Format: Submit the files language.lex and language.y, A4 using BOTH sides (↓↓, not ↓↑), stapled in
upper left if more than one page, NO cover page, NO wrapping lines, legible
font size, non-proportional font, portrait (not landscape), formatted
(indents,...) for easy visibility, name (kanji and kana) and student number as
a comment at the top right
Expand the simple calculator of calc.y to a small programming language.
- The only data type is integer
- Variable names are single letter (A-Za-z, 52 in total), no
declarations/definitions, initialized to 0
- Implement as many C operators as possible (at least: +, -, *, /, ==, !=,
<=, >=, <, >, &&, ||, =, ?:, the rest is voluntary)
- The result of && and || is not 1, but the value that decided the
expression (e.g. 0 || 7 ⇒ 7)
- The result is 0 or 1 for relational operators (==, !=, <=, >=,
<, >)
- Express priorities and associativity directly in the grammar (do not use
%left, %right,...)
- Statements end with ;. There are no flow control statements (if,...) or
functions
- @ expression outputs the result of "expression" followed by a newline
(@ has lowest priority)
- # starts a comment; the comment ends at the end of the line
Hints for Homework
- Start from the calc files, but change the file names (including in the
makefile)
- Define additional tokens in .y
- Write rules for additional tokens in .lex
- If you have a shift/reduce or reduce/reduce conflict:
- Check the .output file
- Check using different inputs
- Expand your grammar little by little, always carefully testing
- Build up your own test file, and expand it together with expansions to
the grammar
- Create tests that check important aspects of the grammar (priorities,
associativity, errors)
- Save test outputs and use them for automatic comparison
Glossary
- unary (operator): 単項 (演算子)
- leftmost derivation: 最左導出
- rightmost derivation: 最右導出
- reverse order: 逆順
- lookahead: 先読み
- non-proportional font: 等幅のフォント