



CS 3723: Supplemental Notes on Formal Languages, 2/12/2008

Formal languages are not concerned with what anything means, but rather only with what is or is not a correctly formed sentence (i.e., a “well formed formula (wff)”) in the language. The ‘sentences’ are simply sequences of symbols from some ‘alphabet’ and do not have any inherent meaning. A formal language is simply the (possibly infinite) set of sequences that are in the language.
One way of describing a formal language is to give a grammar that says how to generate its sentences. A grammar consists of a set of rewriting (production) rules, each consisting of two sequences of symbols: a left-hand side and a right-hand side. Some of these symbols occur in at least one wff (these are called terminals) and some never do (these are non-terminals). The general procedure is to start with some designated start symbol, apply one of the rewriting rules to get a sequence of symbols, and keep on iteratively applying rewriting rules to parts of the sequence: wherever the left-hand side of a rule appears in the sequence, it may be replaced by that rule's right-hand side. The left-hand side of each rule must always contain at least one non-terminal, and the rewriting stops when one is left with a sequence made up entirely of terminal symbols. All of the sequences that can be produced in this manner are the wff in the language described by the grammar.
For example, we might define a language with:

Production Rules:
  S → wABx
  wA → yA
  ABx → z

Where:
- the start symbol is S
- the non-terminals are S, A, and B
- the terminals are w, x, y, and z

There are exactly two wff in this language: wz and yz.
wz can be produced using the derivation: S → wABx → wz.
yz can be produced using the derivation: S → wABx → yABx → yz.
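The rewriting procedure can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original notes: the names RULES, successors, and derive_all are made up for the example, each symbol is assumed to be a single character, and max_len is an assumed safety bound so that the search would also stop on a grammar with an infinite language.

```python
from collections import deque

# Production rules from the example above, as (left-hand side, right-hand side).
RULES = [("S", "wABx"), ("wA", "yA"), ("ABx", "z")]
NONTERMINALS = set("SAB")


def successors(form):
    """Yield every sequence reachable from `form` by applying one rule once."""
    for lhs, rhs in RULES:
        start = form.find(lhs)
        while start != -1:
            yield form[:start] + rhs + form[start + len(lhs):]
            start = form.find(lhs, start + 1)


def derive_all(start="S", max_len=10):
    """Breadth-first search over derivations; returns the set of wff found.

    max_len is an assumed cut-off (not from the notes); this small grammar
    never reaches it.
    """
    seen = {start}
    queue = deque([start])
    wffs = set()
    while queue:
        form = queue.popleft()
        if not any(sym in NONTERMINALS for sym in form):
            wffs.add(form)  # only terminals remain, so rewriting stops
            continue
        for nxt in successors(form):
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return wffs


print(sorted(derive_all()))  # ['wz', 'yz']
```

The search visits S, wABx, yABx, wz, and yz, which matches the two derivations given above.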
The class of languages that can be described with this type of grammar is known as the “recursively enumerable” languages. If you restrict the kinds of symbols that can appear on the left and right sides, and the relationship between the two sequences, you can define other, less general classes of formal languages. Four of these classes make up the Chomsky Hierarchy (see Figure). The most restricted class is the regular languages, and all regular languages are also context-free languages. All context-free languages are also context-sensitive languages, and all context-sensitive languages are also recursively enumerable languages. (As an aside, these classes can also be defined in terms of the abstract machine required to look at a sequence of symbols and decide whether the sequence is or is not a wff for a given language. To recognize arbitrary recursively enumerable languages requires a Turing machine.)
Figure: The Chomsky Hierarchy, from most general to most restricted: Recursively Enumerable (Type-0), Context-Sensitive (Type-1), Context-Free (Type-2), LR(1), LL(1), Regular (Type-3).

For our purposes, the most interesting class of languages is the context-free languages. The syntax of most programming languages can be (and is) described by a context-free grammar.^1 Context-free grammars are restricted so that the left-hand side of every production rule consists of exactly one non-terminal. The right-hand side may, however, still be an arbitrary sequence of non-terminal and terminal symbols.
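The restriction on left-hand sides is easy to check mechanically. The following is a minimal sketch, assuming single-character symbols and rules stored as (left-hand side, right-hand side) pairs; the function name is_context_free and the second grammar (BALANCED) are hypothetical illustrations, not taken from the notes.

```python
def is_context_free(rules, nonterminals):
    """True if every left-hand side is exactly one non-terminal symbol."""
    return all(len(lhs) == 1 and lhs in nonterminals for lhs, _rhs in rules)


# The earlier example grammar: two rules have multi-symbol left-hand sides.
GENERAL = [("S", "wABx"), ("wA", "yA"), ("ABx", "z")]

# A hypothetical grammar for a^n b^n (n >= 1), added only for contrast.
BALANCED = [("S", "aSb"), ("S", "ab")]

print(is_context_free(GENERAL, {"S", "A", "B"}))  # False
print(is_context_free(BALANCED, {"S"}))           # True
```

Note that this checks the form of a grammar, not the language it describes: a language is context-free if some context-free grammar generates it, even when a particular grammar written for it is not in that form.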
When applied to programming languages, context-free grammars are usually written in a notation called BNF (Backus Normal Form or Backus-Naur Form). In BNF, one normally uses words instead of letters as the individual symbols,^2 and non-terminals are distinguished from terminals by bracketing the non-terminals with < and >. BNF also replaces ‘→’ with ‘::=’.
The following is an example of a BNF grammar:
  <expr> ::= <expr> + <expr>
  <expr> ::= <expr> − <expr>
  <expr> ::=

This would include wff such as 1, 2, 1 + 1, 1 + 2 − 1, and 1 − 1 − 1.
BNF also uses a ‘|’ symbol as short-hand to separate multiple production rules arising from the same non-terminal, so the grammar above could be rewritten:
  <expr> ::= <expr> + <expr> | <expr> − <expr> |
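The ‘|’ shorthand can be expanded back into separate rules mechanically. The sketch below assumes that symbols are separated by spaces, that ‘|’ and ‘::=’ never occur as terminals, and that each rule fits on one line. The third alternative, written here as <number>, is a hypothetical stand-in chosen only so the example is complete; it is not taken from the grammar above. The function name expand_alternatives is likewise made up.

```python
def expand_alternatives(bnf_text):
    """Split each 'lhs ::= alt1 | alt2' line into one (lhs, rhs) rule per alternative."""
    rules = []
    for line in bnf_text.strip().splitlines():
        lhs, rhs = line.split("::=")
        for alt in rhs.split("|"):
            # Each right-hand side becomes a tuple of symbols.
            rules.append((lhs.strip(), tuple(alt.split())))
    return rules


# <number> is a hypothetical third alternative, used for illustration only.
LONG_FORM = """
<expr> ::= <expr> + <expr>
<expr> ::= <expr> - <expr>
<expr> ::= <number>
"""
SHORT_FORM = "<expr> ::= <expr> + <expr> | <expr> - <expr> | <number>"

# Both spellings describe the same set of production rules.
print(expand_alternatives(LONG_FORM) == expand_alternatives(SHORT_FORM))  # True
```

Storing each right-hand side as a tuple of symbols, rather than a string, reflects the BNF convention that symbols are whole words rather than single letters.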
For more information: Mitchell 4.1; Aho, et al., Compilers: Principles, Techniques, and Tools; Wikipedia, Formal languages
(^1) More specifically, most programming language grammars belong to a subclass of context-free grammars known as LR grammars, which can be parsed by automatic tools, such as yacc. Some, like Pascal, belong to even smaller subclasses, such as LL(1).

(^2) Programming languages usually have separate ‘lexical’ rules that define a regular language for putting together letters to form words. These rules are often expressed using regular expressions, which can be processed using automatic tools, such as lex.