

Material Type: Assignment; Professor: Forrest; Class: ST: Prog Analy &Mechanization; Subject: Computer Science; University: University of New Mexico; Term: Unknown 2003;


The basic idea of this assignment is to see how the genetic algorithm works on a language-induction problem. In language induction, the learning system is presented with a set of symbol strings defined over a fixed-size alphabet. These strings, known as sentences, are exemplars of a language (any set of legal sentences defines a language). The induction problem is to figure out a minimal procedure for recognizing legal strings in the language (that is, the set of exemplar sentences). In our assignment we will be using finite-state automata (FSA) to represent the procedure.
A trivial solution to this problem would be simply to define the language of all possible strings over the alphabet. Then, any possible set of exemplars would be recognized by our FSA. To make the problem more interesting, we will supply exemplars of strings that are in the language (positive examples) and strings that are not in the language (negative examples). Your GA should discover FSAs that recognize all the positive examples and reject all the negative examples.
We will study two methods for representing FSAs using genetic algorithms: the fixed-size table method and the variable-length genome method with historical markers. We will suggest the basic representation strategy and you are welcome to use existing software, but you are expected to design and implement your own fitness function.
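To make the fixed-size table idea concrete, here is a minimal Python sketch of one way a flat genome could be decoded into an FSA and run on a sentence. The genome layout, the function names `decode_genome` and `accepts`, and the choice of state 0 as the start state are all assumptions made for illustration; they are not the representation prescribed by the assignment, and the fitness function is deliberately left out.

```python
from typing import List, Sequence, Tuple


def decode_genome(
    genome: Sequence[int], n_states: int, n_symbols: int
) -> Tuple[List[List[int]], List[bool]]:
    """Split a flat integer genome into a transition table and accept flags.

    Hypothetical layout: n_states * n_symbols next-state genes, followed by
    n_states accept genes. Values are wrapped so that any integer is legal.
    """
    expected = n_states * n_symbols + n_states
    if len(genome) != expected:
        raise ValueError(f"genome must have {expected} genes, got {len(genome)}")
    table = [
        [genome[s * n_symbols + k] % n_states for k in range(n_symbols)]
        for s in range(n_states)
    ]
    offset = n_states * n_symbols
    accepting = [genome[offset + s] % 2 == 1 for s in range(n_states)]
    return table, accepting


def accepts(
    table: List[List[int]],
    accepting: List[bool],
    alphabet: Sequence[str],
    sentence: str,
    start: int = 0,
) -> bool:
    """Run the deterministic FSA on a sentence; True if it ends in an accept state."""
    column = {symbol: k for k, symbol in enumerate(alphabet)}
    state = start
    for symbol in sentence:
        if symbol not in column:
            return False  # symbols outside the alphabet are rejected
        state = table[state][column[symbol]]
    return accepting[state]


# A two-state machine for "even number of a's": state 0 = even, state 1 = odd.
table, accepting = decode_genome([1, 0, 1, 0], n_states=2, n_symbols=1)
print(accepts(table, accepting, ["a"], "aaaa"))
print(accepts(table, accepting, ["a"], "aaa"))
```

Because every gene is wrapped with a modulus, mutation and crossover can never produce an illegal table under this layout, which is one reason a fixed-size encoding is convenient for a simple GA.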
You are free to choose any publicly available genetic algorithm software or to write your own using the language of your choice. One possible choice is a very simple genetic algorithm written in C by Gary Flake (see URLs below) which generates FSAs for playing the Prisoner’s Dilemma.
If you have never heard of a finite-state automaton, you can find a description and formal definition in any automata theory book and in most compiler books (among other applications, finite-state machines are used for the lexical analysis phase of compilers). Here are some materials we found on the web:
If there is sufficient interest, we can hold an extra discussion session to review the basics of FSAs.
Gary Flake’s book The Computational Beauty of Nature, published by MIT Press, has an excellent collection of software that accompanies it. The software, including the genetic algorithm, is available from http://mitpress.mit.edu/books/FLAOH/cbnhtml/source.html. The general download page is: http://mitpress.mit.edu/books/FLAOH/cbnhtml/download.html. There are UNIX, Mac, and Windows versions, both binary and source. The basic, architecture-independent (i.e., UNIX or Windows) source archive is: http://mitpress.mit.edu/books/FLAOH/cbnhtml/cbn-noarch-src+docs.tgz. The most relevant program for us is called gaipd (GA for Iterated Prisoner’s Dilemma). You may also wish to examine or use the NEAT software described in our readings. That is available in several versions from: http://www.cs.utexas.edu/users/nn/pages/software/software.html. Once you have your GA turning over and generating FSAs, you will find it convenient to print the FSAs out in graphical form. The program GraphViz, available from http://www.research.att.com/sw/tools/graphviz/, allows you to display FSAs graphically. Note: You will need to convert the format of your GA representation into a format that Dot, the GraphViz layout program, understands.
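As an illustration of the conversion to Dot, the following Python sketch writes a transition-table FSA as a Dot digraph. The table layout (one row per state, one column per alphabet symbol), the node names `q0`, `q1`, ..., and the function name `fsa_to_dot` are assumptions for this example; your own GA representation will likely need a different traversal.

```python
from typing import List, Sequence


def fsa_to_dot(
    table: List[List[int]],
    accepting: List[bool],
    alphabet: Sequence[str],
    start: int = 0,
    name: str = "fsa",
) -> str:
    """Return Dot source for a deterministic FSA stored as a transition table."""
    lines = [
        f"digraph {name} {{",
        "  rankdir=LR;",
        "  __start [shape=point];",  # invisible-ish marker pointing at the start state
        f"  __start -> q{start};",
    ]
    # Accepting states are drawn with the conventional double circle.
    for state, is_accepting in enumerate(accepting):
        shape = "doublecircle" if is_accepting else "circle"
        lines.append(f"  q{state} [shape={shape}];")
    # One labelled edge per (state, symbol) entry in the table.
    for state, row in enumerate(table):
        for symbol, target in zip(alphabet, row):
            label = symbol.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'  q{state} -> q{target} [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


# Two-state machine for "even number of a's" (state 0 accepts).
dot_source = fsa_to_dot([[1], [0]], [True, False], ["a"])
print(dot_source)
```

Saving the output to a file such as `fsa.dot` and running `dot -Tpng fsa.dot -o fsa.png` should render the machine as an image.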
Each input file first lists the legal letters in the alphabet (separated by single spaces), followed by a blank line. Then a set of positive examples will appear, one per line. The end of the positive examples will be denoted by a blank line. Next comes a set of negative examples, one per line. For example, if the language we are recognizing consists of any string made up of an even number of occurrences of the letter “a,” the input file might look as follows:
a

aa
aaaa
aaaaaa

a
aaa
aaaaa
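A reader for this file format can be sketched in a few lines of Python. The function name `parse_examples` is hypothetical, and since the handout does not specify how the optional test set is laid out, any sections after the negative examples are returned unparsed rather than interpreted.

```python
from typing import List, Tuple


def parse_examples(text: str) -> Tuple[List[str], List[str], List[str], List[List[str]]]:
    """Parse alphabet, positive examples, and negative examples.

    Sections are separated by blank lines. Any further sections are returned
    as-is in the fourth element, because their layout is not specified.
    """
    sections: List[List[str]] = [[]]
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line:
            sections[-1].append(line)
        elif sections[-1]:
            sections.append([])  # a blank line closes the current section
    if not sections[-1]:
        sections.pop()
    if not sections:
        raise ValueError("input has no alphabet section")

    # The alphabet is a single line of space-separated letters.
    alphabet = sections[0][0].split()
    positives = sections[1] if len(sections) > 1 else []
    negatives = sections[2] if len(sections) > 2 else []
    extra = sections[3:]
    return alphabet, positives, negatives, extra


sample = "a\n\naa\naaaa\naaaaaa\n\na\naaa\naaaaa\n"
alphabet, positives, negatives, extra = parse_examples(sample)
print(alphabet, positives, negatives)
```

One limitation of this sketch is that the empty sentence cannot be listed as an example, because a blank line is already used as the section separator.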
Each example language will be supplied in a separate file. An important distinction in supervised machine learning is the one between training sets and test sets. Testing on data separate from the data used to train the system shows how well the system generalizes from a small set of examples to a general concept. Some of our data sets will include a third set of examples, the test set, consisting of positive and negative examples.
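The value of a held-out test set can be seen in a small Python sketch. Both recognizers below are hypothetical stand-ins for evolved FSAs, the test sentences are made up for illustration, and `accuracy` is only a plain evaluation measure, not a suggested fitness function.

```python
from typing import Callable, Sequence


def accuracy(
    recognizer: Callable[[str], bool],
    positives: Sequence[str],
    negatives: Sequence[str],
) -> float:
    """Fraction of examples classified correctly (accept positives, reject negatives)."""
    total = len(positives) + len(negatives)
    if total == 0:
        raise ValueError("need at least one example")
    correct = sum(1 for s in positives if recognizer(s))
    correct += sum(1 for s in negatives if not recognizer(s))
    return correct / total


def memorizer(sentence: str) -> bool:
    """Accepts exactly the training positives and nothing else."""
    return sentence in {"aa", "aaaa", "aaaaaa"}


def even_a(sentence: str) -> bool:
    """Accepts any sentence with an even number of a's."""
    return sentence.count("a") % 2 == 0


train_pos, train_neg = ["aa", "aaaa", "aaaaaa"], ["a", "aaa", "aaaaa"]
test_pos, test_neg = ["aaaaaaaa"], ["aaaaaaa"]  # made-up held-out sentences

for name, recognizer in [("memorizer", memorizer), ("even_a", even_a)]:
    train = accuracy(recognizer, train_pos, train_neg)
    test = accuracy(recognizer, test_pos, test_neg)
    print(f"{name}: train={train:.2f} test={test:.2f}")
```

Both recognizers are perfect on the training data, but only the one that captures the general concept stays perfect on the held-out sentences; the memorizer drops to 0.50.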
You may work with one partner, if you choose. On March 5 you are expected to hand in a working GA (source code listing), some plots of its performance on the text examples (e.g., mean fitness plotted against generation), printouts of example high-fitness FSAs, and a 3-5 page writeup. On March 12, you should hand in an augmented writeup that describes your new representation, any changes to the fitness function, your new results, the comparison between the two methods, and your conclusions. The writeup should describe at least the following: