























A set of lecture notes from the University of Illinois at Urbana-Champaign's CS421 course on lexing and the use of ocamllex for lexical analysis. The notes discuss how strings of characters are turned into computer instructions through lexing and parsing, and how regular expressions and finite automata are used to recognize tokens.
























Outline Overview Lexing ocamllex Activity
Mark Hills
June 27, 2006
Based on slides by Mattox Beckman, as updated by Vikram Adve, Gul
Agha, and Elsa Gunter
Overview
Lexing
ocamllex
Activity
Strings are converted into ASTs in two phases:
Lexing: Convert strings (streams of characters) into lists (or
streams) of tokens, representing words in the
language
Parsing: Convert lists of tokens into abstract syntax trees
Overview Strategy Options
With lexing, we break sequences of characters into different
syntactic categories, called tokens. As an example, we could break
this:
asd 123 jkl 3.14
into this:
[String "asd"; Int 123; String "jkl"; Float 3.14]
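In OCaml, tokens like these are naturally represented as values of a variant type. The declaration below is a minimal sketch: the type name `token` and the binding `tokens` are assumptions chosen for illustration, while the constructor names follow the token list above.

```ocaml
(* Hypothetical token type: one constructor per syntactic category. *)
type token =
  | String of string  (* a word such as asd *)
  | Int of int        (* an integer literal such as 123 *)
  | Float of float    (* a floating-point literal such as 3.14 *)

(* The token list from the example, written as an OCaml value. *)
let tokens : token list = [String "asd"; Int 123; String "jkl"; Float 3.14]
```

Note that list elements are separated by semicolons; a comma would build a tuple instead.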
To solve this, we will modify the behavior of the DFA.
◮ If we reach a character for which the current state has no
transition, stop processing the string.
◮ If we are in an accepting state at that point, return the token
corresponding to what we matched, together with the remainder of
the string.
◮ Then use iteration or recursion to keep pulling tokens out of the
remainder.
◮ If we were not in an accepting state, fail: the input has invalid
syntax.
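The modified DFA procedure above can be sketched in OCaml as follows. This is a hand-written illustration, not what a lexer generator produces. The state names, the functions `step`, `accept`, `next_token`, and `tokenize`, and the decision to skip spaces between tokens are all assumptions made for the example.

```ocaml
type token = String of string | Int of int | Float of float

(* DFA states; InWord, InInt and InFrac are the accepting ones. *)
type state = Start | InWord | InInt | AfterDot | InFrac

let is_digit c = '0' <= c && c <= '9'
let is_letter c = ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')

(* Transition function: None means there is no transition. *)
let step state c =
  match state with
  | Start when is_letter c -> Some InWord
  | Start when is_digit c -> Some InInt
  | InWord when is_letter c -> Some InWord
  | InInt when is_digit c -> Some InInt
  | InInt when c = '.' -> Some AfterDot
  | AfterDot when is_digit c -> Some InFrac
  | InFrac when is_digit c -> Some InFrac
  | _ -> None

(* An accepting state turns the matched text into a token. *)
let accept state text =
  match state with
  | InWord -> Some (String text)
  | InInt -> Some (Int (int_of_string text))
  | InFrac -> Some (Float (float_of_string text))
  | Start | AfterDot -> None

(* Run the DFA from [start] until no transition applies. Return the
   token and the position where the remainder of the input begins. *)
let next_token input start =
  let n = String.length input in
  let rec run state pos =
    if pos >= n then (state, pos)
    else
      match step state input.[pos] with
      | Some next -> run next (pos + 1)
      | None -> (state, pos)
  in
  let state, stop = run Start start in
  match accept state (String.sub input start (stop - start)) with
  | Some tok -> (tok, stop)
  | None -> failwith "invalid syntax"

(* Recursion keeps pulling tokens out of the remainder. *)
let rec tokenize input pos =
  if pos >= String.length input then []
  else if input.[pos] = ' ' then tokenize input (pos + 1)
  else
    let tok, rest = next_token input pos in
    tok :: tokenize input rest
```

Calling `tokenize "asd 123 jkl 3.14" 0` yields the four tokens from the earlier example. An input such as `12.` fails, because the machine stops in the non-accepting state `AfterDot` and this simple scheme never backs up to an earlier accepting state.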
We need a few core items to get this working:
◮ Some way to refer to the input string; we'll call this the
lexing buffer
◮ A set of regular expressions that correspond to tokens in our
language
◮ A corresponding set of actions to take when tokens are
matched
The lexer generator can then compile the regular expressions into
state machines, which are used to process the lexing buffer. When we
reach an accepting state and can take no further transitions, we
apply the corresponding action.
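The three ingredients can be put together in a small OCaml sketch. Everything here is an assumption made for illustration: the `buffer` record stands in for the lexing buffer, the hand-written matchers `digits`, `letters`, and `float_literal` stand in for state machines compiled from regular expressions, and `rules` pairs each matcher with its action. The longest match wins, with ties going to the earlier rule, which is the policy ocamllex uses.

```ocaml
type token = String of string | Int of int | Float of float

(* Lexing buffer: the input plus the position of the next unread char. *)
type buffer = { text : string; mutable pos : int }

let is_digit c = '0' <= c && c <= '9'
let is_letter c = ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')

(* Number of consecutive characters satisfying [p], starting at [pos]. *)
let run_length p text pos =
  let n = String.length text in
  let rec go i = if i < n && p text.[i] then go (i + 1) else i in
  go pos - pos

(* Matchers return the length of their match at [pos], or 0 for none. *)
let digits text pos = run_length is_digit text pos
let letters text pos = run_length is_letter text pos
let float_literal text pos =
  let whole = digits text pos in
  let dot = pos + whole in
  let has_dot = whole > 0 && dot < String.length text && text.[dot] = '.' in
  let frac = if has_dot then digits text (dot + 1) else 0 in
  if frac > 0 then whole + 1 + frac else 0

(* The rule table: each matcher is paired with the action to apply. *)
let rules = [
  (float_literal, fun s -> Float (float_of_string s));
  (digits, fun s -> Int (int_of_string s));
  (letters, fun s -> String s);
]

(* Find the longest match, advance the buffer, and apply the action. *)
let next_token buf =
  let pick best (matcher, action) =
    let len = matcher buf.text buf.pos in
    match best with
    | Some (best_len, _) when best_len >= len -> best
    | _ -> if len > 0 then Some (len, action) else best
  in
  match List.fold_left pick None rules with
  | None -> failwith "invalid syntax"
  | Some (len, action) ->
      let lexeme = String.sub buf.text buf.pos len in
      buf.pos <- buf.pos + len;
      action lexeme
```

Each call to `next_token` consumes one token and leaves the buffer positioned at the remainder. The sketch does not skip whitespace; a real lexer would add a rule for it.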
Getting Started Lexer Input Regular Expressions Example 1 Example 2 Scanning Comments
◮ Lexer definitions for ocamllex are written in a file with a
.mll extension. The file lists the regular expressions in a
table, each with an associated action.
◮ OCaml code for the lexer is generated with
    ocamllex file.mll
◮ This writes the code for the lexer to the file file.ml
{ header }
let ident = regexp ...
rule entrypoint [arg1... argn] = parse
  | regexp { action }
  | ...
  | regexp { action }
and entrypoint [arg1... argn] = parse
  ...
and ...
{ trailer }
◮ The header and trailer contain arbitrary OCaml code that is
inserted into the generated .ml file
◮ Shorthands for regular expressions can be introduced with
let ident = regexp
◮ Multiple entry points turn into multiple functions in the .ml
file, each taking the given arguments plus an additional argument
for the lexing buffer
◮ Character ranges: match any character in the range, based on
character codes: [c1 - c2], for example ['a' - 'z']
◮ Negative character ranges: match any character not in the range:
[^ c1 - c2]
◮ e* has the same meaning as we've already seen
◮ e+ means one or more, same as ee*
◮ e? means one or none, same as e + ε
◮ e1 # e2 means the characters in e1 but not in e2
◮ ident is shorthand for an earlier definition of a regular
expression using let
◮ e1 as id binds the matched string to id
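The equivalences in this list (e+ as ee*, e? as e or the empty string) can be checked with a small regular-expression matcher. The sketch below uses Brzozowski derivatives, which is not how ocamllex works internally (ocamllex compiles the expressions to automata). The type `re` and the functions `plus`, `opt`, `nullable`, `deriv`, and `matches` are all names assumed for this illustration.

```ocaml
type re =
  | Empty                    (* matches no string at all *)
  | Eps                      (* matches only the empty string *)
  | Range of char * char     (* character range c1 to c2 *)
  | NotRange of char * char  (* any character outside c1 to c2 *)
  | Seq of re * re           (* concatenation *)
  | Alt of re * re           (* alternation *)
  | Star of re               (* zero or more repetitions *)

(* Derived operators, defined exactly as in the list above. *)
let plus e = Seq (e, Star e)  (* one or more *)
let opt e = Alt (e, Eps)      (* one or none *)

(* Does the expression match the empty string? *)
let rec nullable = function
  | Empty | Range _ | NotRange _ -> false
  | Eps | Star _ -> true
  | Seq (a, b) -> nullable a && nullable b
  | Alt (a, b) -> nullable a || nullable b

(* The expression matching what may follow after reading character c. *)
let rec deriv c = function
  | Empty | Eps -> Empty
  | Range (lo, hi) -> if lo <= c && c <= hi then Eps else Empty
  | NotRange (lo, hi) -> if lo <= c && c <= hi then Empty else Eps
  | Seq (a, b) ->
      let left = Seq (deriv c a, b) in
      if nullable a then Alt (left, deriv c b) else left
  | Alt (a, b) -> Alt (deriv c a, deriv c b)
  | Star a as star -> Seq (deriv c a, star)

(* A string matches if the expression left after consuming every
   character accepts the empty string. *)
let matches e s =
  let n = String.length s in
  let rec go e i = if i = n then nullable e else go (deriv s.[i] e) (i + 1) in
  go e 0

(* Example definitions mirroring let-shorthands in a lexer file. *)
let digits = plus (Range ('0', '9'))
let float_literal = Seq (digits, Seq (Range ('.', '.'), digits))
```

With these definitions, `matches digits "123"` holds while `matches digits ""` does not, and `matches float_literal "3.14"` holds while `matches float_literal "3."` does not.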
The page for the ocamllex tool is at
http://caml.inria.fr/pub/docs/manual-ocaml/manual026.html
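(* Assumed definitions: the preview omits the part of the file that
   defines result, digits, and letters. *)
{ type result = Int of int | Float of float | String of string }
let digits = ['0'-'9']+
let letters = ['a'-'z' 'A'-'Z']+

rule main = parse
  | digits '.' digits as f { Float (float_of_string f) }
  | digits as n            { Int (int_of_string n) }
  | letters as s           { String s }
  | _                      { main lexbuf }  (* skip any other character *)

{ let newlexbuf = Lexing.from_channel stdin in
  print_string "Ready to lex.";
  print_newline ();
  main newlexbuf }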
# #use "test.ml";;
...
val main : Lexing.lexbuf -> result = <fun>
val __ocaml_lex_main_rec :
  Lexing.lexbuf -> int -> result = <fun>
Ready to lex.
hi there 234 5.
- : result = String "hi"
What happened to the rest?