What Makes a Programming Language?
A programming language has an alphabet, words, a grammar, statements, semantics, and various ways of organizing all of these in order to create a computer program. Flex helps developers create a tool called a lexical analyzer, which identifies the words of a program during the compilation/interpretation process.
A compiler takes a text file and parses it character by character, trying to match patterns at each of the aforementioned levels. The first pass is the lexical analysis; it ensures that there are no lexical errors, which are characters that do not contribute to meaningful words.
The meaningful words of a programming language are described by a regular language. Regular expressions are one implementation for describing a regular language, and a flex lexical analyzer is one implementation for parsing text while looking for matches to regular expressions.
Essentially, programming a flex lexer means defining various regular expressions which detail all of the possible words and characters that are meaningful in a correct program for your language.
BooleanLogicLanguage Example Language
To illustrate flex, this post uses BooleanLogicLanguage, an example language that a flex-based lexer will lexically analyze. For no reason in particular, the purpose of this language is to evaluate Boolean logic expressions.
This diagram is an example of what a correct program looks like in BooleanLogicLanguage. The lexical analyzer should be able to parse it without errors. Note that in this language, 'INTEGER(' and ')INTEGER' are the 'words' used to separate pieces of code, as opposed to '(', ')', '{', or '}'. When creating your own language you are free to do whatever you want, but following convention to some degree ensures that the tool you are creating is easy to use.
NOTE: A programming language is a tool for using a computer. A GUI is a tool for using a computer. Siri or Cortana are tools for using a computer, etc.
This diagram is a sketch of the regular expressions that the flex program will use to describe meaningful words. The phrases on the left-hand side are token names. Token names are not necessarily important during lexical analysis, but they are of the utmost importance during syntactic analysis (which comes after lexical analysis during compilation/interpretation and is not covered in this post).
The most difficult part of this process is defining tokens and figuring out what sort of regular expression should describe them. For this example, I decided to make a language that would evaluate Boolean logic. Then I started writing potential programs in this language, and once I had written enough that looked interesting and useful, I defined tokens and regular expressions to ensure those particular programs would be correct. In short, I made up code that I liked and then fit tokens and regular expressions to make it work.
A Quick Note on Regular Expressions (Regex)
[…] denotes a single thing, whose identity is dictated by what’s inside the brackets.
A single character is a single thing.
* denotes zero or more of the single thing before it.
? denotes one or zero of the single thing before it.
(…) is just a grouping, usually used with * or ?.
The difference between […] and (…) is that square brackets represent one of whatever is inside them, while parentheses are a grouping. For example, given [abc] and (abc): 'a', 'b', and 'c' each match the former, while only 'abc' matches the latter.
- denotes a range inside brackets and has very useful applications: A-Z, a-z, 0-9.
Flex
A flex file is like a C program, except it has a more strictly defined layout.
```
%{
C code declarations
%}

definitions of regular expressions

%%
regular_expression1    code to execute on match (such as a macro)
regular_expression2    other code to execute on match (such as a function)
%%

C code functions
main function
```
The power of flex comes from being able to define regular expressions (between '%}' and the first '%%') and then attach them to code (between the two '%%' markers). The additional areas for C code are both handy and what give the lexer its true functionality: doing something when a regex is matched.
NOTE: flex essentially turns this flex file (extension '.l') into a C program, which is then compiled like you would compile any C program. The result is an executable which you run on a text file containing programming code. In this case, the program writes its output both to a text file (Tokens.txt) and to stdout (the terminal).
```
%{
#include <stdlib.h>  /* malloc, calloc */
#include <string.h>  /* strlen, strcpy */

int numErrors = 0;
char* arg;

typedef struct node {
    char* lexeme;
    char* token;
    struct node* next;
} node_t;

node_t head;
node_t* current = &head;

int yywrap(void);
void store(char* lexeme);
void error(void);
void printStored(void);
%}

whitespace          (\t|" "|\r)
newline             \n
real                -?[0-9]+\.[0-9]+
integer             -?[0-9]+
string              \".*\"
boolean_op_binary   (" AND "|" OR "|" XOR ")
boolean_op_unary    "NOT "
boolean_literal     ("True"|"False")
identifier          [A-Z]+
separator_open      ({real}|{integer})\(
separator_close     \)({real}|{integer})
assignment          =
IO                  "print "
terminal            ;

%%
{whitespace}          {ECHO;}
{newline}             {ECHO;}
{real}                {ECHO;}
{integer}             {ECHO;}
{string}              {ECHO;}
{boolean_op_binary}   {ECHO; arg = "BOOL_OP_BINARY"; store(yytext);}
{boolean_op_unary}    {ECHO; arg = "BOOL_OP_UNARY"; store(yytext);}
{boolean_literal}     {ECHO; arg = "BOOL_LITERAL"; store(yytext);}
{identifier}          {ECHO; arg = "IDENTIFIER"; store(yytext);}
{separator_open}      {ECHO; arg = "OPEN"; store(yytext);}
{separator_close}     {ECHO; arg = "CLOSE"; store(yytext);}
{assignment}          {ECHO; arg = "ASSIGN"; store(yytext);}
{IO}                  {ECHO; arg = "IO"; store(yytext);}
{terminal}            {ECHO; arg = "TERMINAL"; store(yytext);}
.                     {ECHO; numErrors++; error();}
%%

int yywrap(void) {
    return 1;
}

void store(char* lexeme) {
    /* allocate strlen+1 bytes for each copied string */
    current->lexeme = malloc(strlen(lexeme) + 1);
    strcpy(current->lexeme, lexeme);
    current->token = malloc(strlen(arg) + 1);
    strcpy(current->token, arg);
    /* calloc zeroes the new tail node, so its next pointer is NULL */
    node_t* temp = calloc(1, sizeof(node_t));
    current->next = temp;
    current = current->next;
}

void error(void) {
    printf("[e]");
}

void printStored(void) {
    node_t* c = &head;
    FILE* f = fopen("Tokens.txt", "w");
    while (c->next) {
        fprintf(f, "%s\t%s\n", c->lexeme, c->token);
        c = c->next;
    }
    fclose(f);
    printf("Tokens.txt written.\n");
}

int main(int argc, char *argv[]) {
    /* ensure the right number of command line arguments */
    if (argc != 2) {
        printf("Please enter one filename as an argument.\n");
        return -1;
    }
    /* open the file named by the second argument */
    yyin = fopen(argv[1], "r");
    yylex();
    fclose(yyin);
    printf("\nLexicalErrors %d\n", numErrors);
    printStored();
    return 0;
}
```
The above code is a flex file which lexically analyzes BooleanLogicLanguage.
```makefile
CC=gcc
CFLAGS=
LexerFile=lexer

lexer: lex.yy.c
	$(CC) $(CFLAGS) -o lexer lex.yy.c

lex.yy.c: $(LexerFile).l
	flex $(LexerFile).l
```
The above code is a makefile which, when run in the same directory as the flex file, will create the 'lexer' program. This was tested on Ubuntu 16.04; the GNU C Compiler (gcc) is required in addition to flex.
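Assuming the flex source is saved as `lexer.l` (the name the makefile's `LexerFile` variable expects) and a BooleanLogicLanguage program is saved as `program.txt` (a filename chosen here for illustration), the build-and-run cycle looks like this:

```shell
make                 # runs: flex lexer.l, then gcc -o lexer lex.yy.c
./lexer program.txt  # echoes the program, then prints the error count
cat Tokens.txt       # the lexeme/token pairs written by printStored()
```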
```
P = True;
R = False;
Q = 1(NOT P)1 XOR 2(P AND 3(NOT R)3)2;
print Q;
```
The above text should be parsed without errors by our lexer. And the lexer should output a Tokens.txt matching each lexeme to the token describing its regex.
```
P	IDENTIFIER
=	ASSIGN
True	BOOL_LITERAL
;	TERMINAL
R	IDENTIFIER
=	ASSIGN
False	BOOL_LITERAL
;	TERMINAL
Q	IDENTIFIER
=	ASSIGN
1(	OPEN
NOT	BOOL_OP_UNARY
P	IDENTIFIER
)1	CLOSE
XOR	BOOL_OP_BINARY
2(	OPEN
P	IDENTIFIER
AND	BOOL_OP_BINARY
3(	OPEN
NOT	BOOL_OP_UNARY
R	IDENTIFIER
)3	CLOSE
)2	CLOSE
;	TERMINAL
print	IO
Q	IDENTIFIER
;	TERMINAL
```
Here is the Tokens.txt for that initial errorless program.
![Lexical errors are designated with a '[e]' for this lexer.](http://xrds.acm.org/blog/wp-content/uploads/2017/12/Screen-Shot-2017-12-10-at-5.19.44-PM-300x97.png)
Lexical errors are designated with an '[e]' by this lexer, and the marker appears just after the erroneous character.
What is Next?
Classically, once you can identify words, including what type of word (token) each is, you can create a grammar. Comparing a programming language to a natural language, you could think of a token as a 'noun', 'verb', or 'adjective'. The step after lexical analysis (checking for correctness of words) is syntactic analysis (checking for correctness of grammar). Making the comparison to natural languages again, an English grammar could include PHRASE: article noun verb ('The dog ran', 'A bird flies', etc.). In BooleanLogicLanguage there could be STATEMENT: IDENTIFIER ASSIGN BOOL_LITERAL (Q = True, A = False, etc.). Each of those is just one type of phrase or statement; you can have multiple definitions for a production in the grammar of your language.
Non-classically? I don’t know. Natural Language Processing (NLP) is a very active area, but I’m not aware of anyone using those techniques to parse textual programming languages.