In lecture, you learned that the first task in a compiler’s workflow is lexical
analysis. A lexical analyzer (or tokenizer) converts the raw sequence of
characters from the code into a sequence of tokens. To do that, the
lexical analyzer scans through the sequence of characters in the code, groups
them together into lexemes, and identifies the token class for each lexeme.
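
The scan-group-classify loop described above can be sketched as follows. This is only an illustration, not the structure of the provided starter code: SketchToken, sketchTokenize, and the OTHER class are hypothetical names, and the sketch ignores keywords, decimal points, strings, and the identifier length limit.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token record; the assignment's Lexer.h defines its own types.
struct SketchToken {
    std::string tokenClass;  // label assigned to the lexeme, e.g. "ID"
    std::string lexeme;      // the exact characters that were grouped together
};

// Scans the characters left to right, groups them into lexemes, and labels
// each lexeme with a token class. Whitespace separates lexemes and is dropped.
std::vector<SketchToken> sketchTokenize(const std::string& code) {
    std::vector<SketchToken> tokens;
    std::size_t i = 0;
    while (i < code.size()) {
        unsigned char c = static_cast<unsigned char>(code[i]);
        if (std::isspace(c)) {
            ++i;
            continue;
        }
        std::size_t start = i;
        if (std::isalpha(c) || c == '_') {
            // Letters, digits, and underscores extend an identifier-like lexeme.
            while (i < code.size() &&
                   (std::isalnum(static_cast<unsigned char>(code[i])) || code[i] == '_')) {
                ++i;
            }
            tokens.push_back({"ID", code.substr(start, i - start)});
        } else if (std::isdigit(c)) {
            // A run of digits forms an integer lexeme (no decimal point here).
            while (i < code.size() && std::isdigit(static_cast<unsigned char>(code[i]))) {
                ++i;
            }
            tokens.push_back({"NUMBER", code.substr(start, i - start)});
        } else {
            // Any other character becomes a one-character lexeme.
            ++i;
            tokens.push_back({"OTHER", code.substr(start, 1)});
        }
    }
    return tokens;
}
```

For the input count (42), this sketch yields four tokens: an ID, an OTHER for each parenthesis, and a NUMBER.
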
In this programming assignment, you will gain hands-on experience with
lexical analysis by implementing a partial lexical analyzer in C++ that scans
streams of C code.
Please follow these instructions prior to starting the assignment:
1. Install cmake from here, or run sudo apt-get install cmake in your
2. Run our build.sh script as follows: bash build.sh .
Main Assignment (100 Points)
We have provided a C++ file, src/Lexer.cpp, and a header file, src/Lexer.h,
that contain setup code and helper classes/functions for tokens, token
classes, state transitions, and outputting tokens. Your task is to fill in the
missing parts of the tokenizer so that it generates all tokens for an input
code snippet.
The TODO comments indicate all parts of the lexical analyzer that you need to
implement in this assignment:
1. stateTransition: we have implemented the state transition for the if
keyword. You are responsible for implementing the rest of the state
2. tokenizeCode: we have generated tokens for parentheses, curly braces,
and the if keyword. You are responsible for generating the rest of the
Keywords (20 Points): any tokens from the list [if/else, for, while,
Token class: KEYWORD
Identifiers (20 Points): any token that begins with an alphabetic character
(uppercase or lowercase) or an underscore (_), followed by at most 16
alphanumeric characters and/or underscores (EXCEPT for the keyword
tokens)
Examples of valid identifiers: test, test1, _id1, and test_1_id_2
Token class: ID
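
As a rough illustration of the identifier rule, the check below validates an already-extracted lexeme. The helper isIdentifier is hypothetical and not part of the provided Lexer.h. The keyword set is passed in by the caller; the set used in the note below holds only example entries, not the assignment's full keyword list.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper: returns true if `lexeme` satisfies the identifier rule.
// `keywords` holds the lexemes that must be classified as KEYWORD instead.
bool isIdentifier(const std::string& lexeme, const std::set<std::string>& keywords) {
    if (lexeme.empty()) {
        return false;
    }
    // The first character must be a letter or an underscore.
    unsigned char first = static_cast<unsigned char>(lexeme[0]);
    if (!std::isalpha(first) && first != '_') {
        return false;
    }
    // At most 16 characters may follow the first one.
    if (lexeme.size() - 1 > 16) {
        return false;
    }
    // Every following character must be alphanumeric or an underscore.
    for (std::size_t i = 1; i < lexeme.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(lexeme[i]);
        if (!std::isalnum(c) && c != '_') {
            return false;
        }
    }
    // Keywords match the pattern but are excluded from the ID class.
    return keywords.count(lexeme) == 0;
}
```

With a keyword set such as {"if", "else", "for", "while"}, the call isIdentifier("test_1_id_2", keywords) returns true, while "while" and "1test" are rejected. In the actual assignment this logic would more likely be expressed through state transitions than through a standalone check.
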
Numbers (20 Points): any numerical tokens optionally containing a
decimal point/period (.), i.e., both integers and floating-point numbers
Examples of valid numbers: 1, 1.0, 1.01, and .01
Token class: NUMBER
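
A minimal sketch of the number rule, assuming a lexeme counts as a number when it contains only digits and at most one period, with at least one digit. The helper isNumber is hypothetical. Note that under this assumption a trailing period such as 1. is accepted, which the rule above does not explicitly address.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: returns true for lexemes such as 1, 1.0, 1.01, and .01.
bool isNumber(const std::string& lexeme) {
    bool seenDigit = false;  // at least one digit is required
    bool seenPoint = false;  // at most one decimal point is allowed
    for (char ch : lexeme) {
        unsigned char c = static_cast<unsigned char>(ch);
        if (std::isdigit(c)) {
            seenDigit = true;
        } else if (ch == '.' && !seenPoint) {
            seenPoint = true;
        } else {
            // A second period or any other character disqualifies the lexeme.
            return false;
        }
    }
    return seenDigit;
}
```

The two flags correspond to what a state machine would track: whether the scanner is before or after the decimal point, and whether it has reached an accepting state.
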
Strings (20 Points): any tokens represented by a sequence of characters
(including the empty sequence) that begins and ends with double quotes
("). You are not required to handle escape characters like \".
Examples of strings: "Hello", "", and "1.01"
Token class: STRING
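
Because escape characters do not need to be handled, a string lexeme ends at the first double quote after the opening one. The sketch below shows that idea. The helper scanString and its convention of returning an empty string on failure are assumptions made for illustration, not part of the provided code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the string lexeme (quotes included) that starts
// at index `start`, or an empty string if code[start] is not a double quote or
// no closing double quote follows it.
std::string scanString(const std::string& code, std::size_t start) {
    if (start >= code.size() || code[start] != '"') {
        return "";
    }
    // With no escape handling, the next double quote always closes the string.
    std::size_t close = code.find('"', start + 1);
    if (close == std::string::npos) {
        return "";
    }
    return code.substr(start, close - start + 1);
}
```

A valid result always contains at least the two quote characters, so an empty return value cannot be confused with the empty string literal "".
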