Use of Tools for Parsing
(Details of Tools for Parsing)
10th lecture, June 21, 2019
Language Theory and Compilers
http://www.sw.it.aoyama.ac.jp/2019/Compiler/lecture10.html
Martin J. Dürst
© 2005-19 Martin J. Dürst 青山学院大学
Today's Schedule
- About last lecture's homework
- Summary of last lecture
- How to write a grammar:
  - Priorities
  - Associativity
  - Repetition (lists)
  - Other constructs
- How bison works:
  - Different orders of derivation
  - LALR parsing
  - How to read the .output file
  - How to debug bison grammars
Last Lecture's Homework
Removed due to circumstances
Summary of Last Lecture
- Creating a program with bison and flex needs many steps, so using make
  is important
- The input format for bison is very similar to the input format for flex,
  but there are also some differences
- bison uses attribute grammars to calculate the result of parsing
- The attributes are referenced as $$ (left hand side) and $1, $2,...
  (right hand side) in the C program fragments
- Priority and associativity of operators are expressed in the rewriting
rules of the grammar
Leftovers from Last Lecture
How to Express Priorities
- Use a separate nonterminal symbol for each priority level
- Write the grammar starting with the lowest priority (outside)
- On the right hand side of the rewriting rule for a given priority's
  nonterminal, use nonterminals of the same or one level higher priority
- How to select names for nonterminals:
  - Use mathematical terms (用語): term (項), factor (因子)
  - Use the type of operator, or one representative operator
    (shift_expression, mulExpression,...)
Grammar Patterns: Priority
(priority is small_exp > middle_exp > big_exp; assuming left
associative)
big_exp: big_exp operator middle_exp
| middle_exp
;
middle_exp: middle_exp operator small_exp
| small_exp
;
How to Express Associativity
- Left associative:
  - Use the same nonterminal on the left hand side of the rewriting rule
    and on the right hand side to the left of the operator
  - Use the one level higher priority nonterminal to the right
    of the operator
  - (unless this operator is required) Also create a rewriting rule with
    just the one level higher nonterminal as the right hand side
- Right associative:
  - Exchange the nonterminals to the right and the left of the operator
Grammar Patterns: Associativity
Left associative:
big_exp: big_exp left_assoc_op small_exp
| small_exp
;
Right associative:
big_exp: small_exp right_assoc_op big_exp
| small_exp
;
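The effect of the two patterns can be checked with a small C sketch. This is not bison output: the function names eval_left_assoc and eval_right_assoc are made up for illustration, and the operator is assumed to be subtraction.

```c
#include <stddef.h>

/* Left associative: ((v[0] - v[1]) - v[2]) - ...
   Mirrors  big_exp: big_exp '-' small_exp | small_exp ;
   Assumes n >= 1. */
int eval_left_assoc(const int *v, size_t n)
{
    int result = v[0];
    for (size_t i = 1; i < n; i++)
        result = result - v[i];
    return result;
}

/* Right associative: v[0] - (v[1] - (v[2] - ...))
   Mirrors  big_exp: small_exp '-' big_exp | small_exp ;
   Assumes n >= 1. */
int eval_right_assoc(const int *v, size_t n)
{
    if (n == 1)
        return v[0];
    return v[0] - eval_right_assoc(v + 1, n - 1);
}
```

For the operands 5, 7, 3, the left associative version computes (5-7)-3 = -5, and the right associative version computes 5-(7-3) = 1.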
How to Express Repetition (Lists)
- Example: A list of statements (statementList or statements)
- Two rewriting rules are needed
- Base rule:
  - Repetition of 0 or more times: A rewriting rule with an empty right
    hand side
  - Repetition of 1 or more times: A rewriting rule with a single element
    (e.g. statement) of the list on the right hand side
- Inductive rule:
  - Right hand side uses both list and single element nonterminals
  - If associativity is important, it determines the order of the two
    nonterminals
  - If associativity is not important, there are two choices:
    - List first: Left recursion; advantage: smaller stack
    - Element first: Right recursion
Grammar Patterns: Repetition
Zero or more times:
items: items item
|
;
One or more times:
items: items item
| item
;
Instead of "items item", "item items" is also possible, but then all items
stay on bison's stack until the end of the list, so the stack size may become
a problem
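The difference in stack use can be estimated with a simplified model. The sketch below only counts grammar symbols on the stack while a list of n items is read with the "one or more times" pattern. It ignores states and the tokens inside each item, and the function names are hypothetical.

```c
/* Simplified model: counts only the symbols "items" and "item" on the
   parser stack while a list of n items is read. */

/* items: items item | item ;   (left recursion) */
int max_depth_left_recursion(int n)
{
    int depth = 0, max = 0;
    for (int i = 0; i < n; i++) {
        depth++;                 /* shift item */
        if (depth > max)
            max = depth;
        if (depth == 2)
            depth = 1;           /* reduce "items item" to items */
        /* otherwise: reduce "item" to items, depth stays 1 */
    }
    return max;
}

/* items: item items | item ;   (right recursion) */
int max_depth_right_recursion(int n)
{
    int depth = 0, max = 0;
    for (int i = 0; i < n; i++) {
        depth++;                 /* shift item; no reduction is possible yet */
        if (depth > max)
            max = depth;
    }
    /* End of list: the last item is reduced to items (depth unchanged),
       then "item items" is reduced to items until one symbol is left. */
    while (depth > 1)
        depth--;
    return max;
}
```

In this model, a list of 1000 items needs a stack depth of 2 with left recursion, but 1000 with right recursion.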
Grammar Patterns: Parentheses
small_exp: open_paren big_exp close_paren
| literal
;
How to Express Other Constructs (e.g. if statement)
Write as is, carefully distinguishing alternatives and terminal/non-terminal
symbols
if_statement : IF OPENPAREN cond CLOSEPAREN statement
| IF OPENPAREN cond CLOSEPAREN statement
ELSE statement
;
Order of Derivation: Leftmost and Rightmost Derivation
With leftmost derivations, the leftmost nonterminal in the
syntax tree is always expanded first
With rightmost derivations, the rightmost nonterminal in
the syntax tree is always expanded first
Simple example grammar:
E → E '-' T
| T
T → integer
Example of input: 5 - 7 - 3
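For this input, the two orders can be written out step by step. The following is a worked example derived from the grammar above, where each number stands for one integer token. Both derivations produce the same syntax tree; only the order of the steps differs.

```latex
\begin{align*}
\text{leftmost:}\quad
E &\Rightarrow E - T \Rightarrow E - T - T \Rightarrow T - T - T \\
  &\Rightarrow 5 - T - T \Rightarrow 5 - 7 - T \Rightarrow 5 - 7 - 3 \\
\text{rightmost:}\quad
E &\Rightarrow E - T \Rightarrow E - 3 \Rightarrow E - T - 3 \\
  &\Rightarrow E - 7 - 3 \Rightarrow T - 7 - 3 \Rightarrow 5 - 7 - 3
\end{align*}
```

Read from the end to the start, the rightmost derivation gives the order in which a bottom-up parser reduces the input.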
Derivation Choices
Different choices may generate:
- Different words:
This is necessary to be able to process different inputs
- Different syntax trees (same word):
Ambiguous grammar, try to avoid
- Different orders (leftmost/rightmost/... derivation; same syntax
tree):
Different parsing algorithm
Kinds of Analysis Methods
- LL: Read input from the left, use leftmost derivation (used in
top-down parsing)
- LR: Read input from the left, use rightmost derivation in reverse
order (used in bottom-up parsing)
- LL(1): LL, with one token lookahead
- LR(1): LR, with one token lookahead
- LALR: A simplified kind of LR(1) with smaller tables, used widely in
yacc and bison
The labels are also used for grammars:
grammar g is LL(1) ⇔ grammar g can be used with an
LL(1) parser.
Understanding bison: The .output File
bison -v creates a file with extension .output,
containing the following interesting details:
- [Problems: Unused terminal symbols, conflicts]
- Grammar: Numbered rewriting rules (rule number 0 is
  $accept: start_symbol $end)
- Terminals: Numbered terminal symbols (numbers are ASCII codes or ≥ 256)
- Nonterminals, with numbers of rules where they appear
- States, with the following information for each state:
  - Rewriting rules (. shows current position)
  - Terminal symbols (or $default) and the action (shift, reduce, goto)
    if this symbol is the next symbol in the input
  - Nonterminal symbols: goal state of transition after reduction
Understanding bison: Debugging
#define YYDEBUG 1 compiles debugging support into the parser; the trace is
printed when the variable yydebug is set to a nonzero value
The output shows how bison works:
- A (pushdown) stack is used to store:
  - States (of an automaton)
  - Already read terminals and reduced nonterminals
- The automaton and the next input token decide the action to be taken
- There are three possible actions:
  - shift: Read a token and put it on the stack (together with a state)
  - reduce: Convert some tokens and/or nonterminals on the stack to a
    single nonterminal using a rewriting rule
    (a reduce action is always followed by a goto to another state)
  - accept: Stop processing and accept the input
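The three actions can be sketched with a hand-written parser for the grammar E → E '-' T | T, T → integer. This is a minimal illustration, not the code that bison generates: bison stores state numbers and uses tables, whereas here simple checks of the stack contents stand in for the automaton (so the goto is implicit). The names parse_subtractions, next_token, and the SYM_ constants are made up for this sketch.

```c
#include <ctype.h>

enum symbol { SYM_INT, SYM_MINUS, SYM_T, SYM_E, SYM_END, SYM_ERROR };

struct entry { enum symbol sym; int value; };

/* Minimal scanner: reads one token from *p and advances *p. */
static enum symbol next_token(const char **p, int *value)
{
    while (**p == ' ')
        (*p)++;
    if (**p == '\0')
        return SYM_END;
    if (**p == '-') {
        (*p)++;
        return SYM_MINUS;
    }
    if (isdigit((unsigned char)**p)) {
        *value = 0;
        while (isdigit((unsigned char)**p)) {
            *value = *value * 10 + (**p - '0');
            (*p)++;
        }
        return SYM_INT;
    }
    return SYM_ERROR;
}

/* Returns 1 (accept) and stores the value in *result,
   or returns 0 on a syntax error. */
int parse_subtractions(const char *input, int *result)
{
    struct entry stack[8];   /* this grammar never needs more than 3 entries */
    int top = 0;             /* number of entries on the stack */
    int value = 0;
    enum symbol tok = next_token(&input, &value);

    for (;;) {
        if (top > 0 && stack[top - 1].sym == SYM_INT) {
            stack[top - 1].sym = SYM_T;           /* reduce T -> integer */
        } else if (top == 3 && stack[2].sym == SYM_T) {
            stack[0].value -= stack[2].value;     /* reduce E -> E '-' T */
            top = 1;
        } else if (top == 1 && stack[0].sym == SYM_T) {
            stack[0].sym = SYM_E;                 /* reduce E -> T */
        } else if (top == 1 && tok == SYM_END) {
            *result = stack[0].value;             /* accept */
            return 1;
        } else if ((tok == SYM_INT &&
                    (top == 0 || stack[top - 1].sym == SYM_MINUS)) ||
                   (tok == SYM_MINUS && top == 1)) {
            stack[top].sym = tok;                 /* shift */
            stack[top].value = value;
            top++;
            tok = next_token(&input, &value);
        } else {
            return 0;                             /* syntax error */
        }
    }
}
```

For the input 5 - 7 - 3, the sketch shifts 5, reduces it to T and then to E, shifts - and 7, reduces to E with value -2, and continues in the same way until it accepts with the value -5.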
Conflicts and Ambiguous Grammars
- When running bison, it may show some conflicts:
  - shift/reduce conflicts: Both shift and reduce are possible; shift is
    always chosen
  - reduce/reduce conflicts: There is more than one way to reduce
- bison just chooses one of the selections:
  - If this is the right selection, we are fine
    (but we may want to fix the grammar anyway)
  - If this is the wrong selection, we need to fix the grammar
- Grammar example:
  E → E '-' E | integer
  For 5 - 3 - 7, this grammar allows two interpretations:
  (5-3) - 7 and 5 - (3-7)
Another Example of Ambiguity
The grammar for if-else is a famous example of ambiguity:
if (...) if (...) ...; else ...;
can be parsed in two ways:
if (...) {
if (...) ...;
else ...;
}
or
if (...) {
if (...) ...;
}
else ...;
This creates a shift/reduce conflict.
The first way of parsing is correct (for C), and is chosen by bison
because in a shift/reduce conflict, shift is selected.
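That C attaches the else to the nearest if can be observed directly. The function below is only an illustration (the name branch_taken is made up). Its indentation follows the first interpretation, and compilers may warn about the missing braces.

```c
/* Returns which branch was taken; "none" means that no branch ran. */
const char *branch_taken(int outer, int inner)
{
    const char *taken = "none";
    if (outer)
        if (inner)
            taken = "inner then";
        else              /* belongs to the nearest if, i.e. if (inner) */
            taken = "else";
    return taken;
}
```

With outer false, the whole nested statement is skipped and the function returns "none". If the else belonged to the outer if, the same call would return "else".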
Grammar of bison Rewriting Rules (META!)
rewritingRule → nonterminalSymbol ":" rightHandList ";"
rightHandList → rightHand | rightHand "|" rightHandList
rightHand → symbolList "{" CFragment "}"
symbolList → symbol | symbol symbolList
symbol → nonterminalSymbol | terminalSymbol
How to Combine flex and bison
- In the .y file, list all token types:
  %token NUM PLUS ASTERISK ...
- In the .y file, define the type of the attributes:
  #define YYSTYPE int
- In the .lex file, define one or more rules for each token type
- In the .y file, define the rewriting rules of the grammar
- In the .y file, write the program fragments to calculate attribute values
- Process (with flex/bison), compile, and test
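The interface between the two tools is the function yylex, which returns the token type, and the variable yylval, which carries the attribute of the token. The sketch below is a hand-written stand-in for the scanner that flex would generate, meant only to show this interface. The token numbers, the set_input helper, and the input_pos variable are assumptions for illustration; a real flex scanner reads from yyin and gets the token numbers from the header that bison generates.

```c
#include <ctype.h>

/* Token types, as bison would define them for: %token NUM PLUS ASTERISK
   The concrete numbers are an assumption; bison chooses values above
   the range of character codes. */
enum token_type { NUM = 258, PLUS, ASTERISK };

#define YYSTYPE int
YYSTYPE yylval;                      /* attribute value of the current token */

static const char *input_pos = "";   /* hypothetical input, replaces yyin */

void set_input(const char *text)
{
    input_pos = text;
}

/* Hand-written stand-in for the scanner that flex generates.
   Returns the token type, or 0 at the end of the input. */
int yylex(void)
{
    while (*input_pos == ' ' || *input_pos == '\n')
        input_pos++;                           /* skip white space */
    if (*input_pos == '\0')
        return 0;
    if (isdigit((unsigned char)*input_pos)) {
        yylval = 0;
        while (isdigit((unsigned char)*input_pos))
            yylval = yylval * 10 + (*input_pos++ - '0');
        return NUM;
    }
    if (*input_pos == '+') { input_pos++; return PLUS; }
    if (*input_pos == '*') { input_pos++; return ASTERISK; }
    return (unsigned char)*input_pos++;        /* other characters: own code */
}
```

After set_input("12 + 3"), successive calls to yylex return NUM (with yylval 12), PLUS, NUM (with yylval 3), and finally 0.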
Advantages and Problems of Bottom-Up Parsing
- Advantages:
  - No problems with left recursion
  - Wider range of grammars
  - Automatic creation of parser
- Problems:
  - Very hard to create parser by hand
  - Ambiguities need attention
Homework: A Calculator for Rational Numbers
Deadline: July 4, 2019 (Thursday in two weeks), 19:00
Prepare questions so that you can ask them in next week's lecture!
Where to submit: Box in front of room O-529 (building O, 5th floor)
Format: Submit the files rationals.lex and rationals.y (in this order!),
A4 using BOTH sides (↓↓, not ↓↑), portrait (not landscape), stapled in upper
left if more than one page, NO cover page, NO wrapping lines, legible font
size, non-proportional font, formatted (indents,...) for easy visibility,
name (kanji and kana) and student number as a comment at the top right
Change the simple calculator of calc.y to a calculator that can handle
integers and rationals.
- Statements are separated with ; (semicolons).
- Rationals are expressed as [numerator, denominator].
- Inside [], division is not allowed. Make sure this is
  checked by the grammar.
- Nesting of [] is not allowed. Make sure this is checked by
  the grammar.
- Calculations are exact, not using floating point numbers.
- Print out the result of each statement.
- Results are given as irreducible fractions, with a minus sign on the
numerator if applicable.
- Example result: Result is -53/17
- Use the grammar to define priorities and associativities (do NOT use
  %left, %right,...).
Bonus points (発展問題) for dealing with hours, months, years,...
Hints for Homework
- Start from the calc files.
- Change the filenames (including in the makefile).
- Use YYSTYPE to define a type that can handle rationals and
  integers (both in .lex and .y)
- Define additional tokens in .y (%token rule)
- Write rules for additional tokens in .lex
- If you have a shift/reduce or reduce/reduce conflict:
  - Check the .output file
  - Check using different inputs
- Expand your grammar little by little, always carefully testing
- Build up your own test file, and expand it together with expansions to
the grammar
- Create tests that check important aspects of the grammar (priorities,
associativity, errors)
- Save test outputs and use them for automatic comparison
Glossary
- unary (operator): 単項 (演算子)
- leftmost derivation: 最左導出
- rightmost derivation: 最右導出
- reverse order: 逆順
- lookahead: 先読み
- non-proportional font: 等幅のフォント
- rational number: 有理数
- numerator: 分子
- denominator: 分母
- sign: 符号
- irreducible fraction: 既約分数