
In this lab, students will write a program to identify and count the occurrences of meaningful words in English text by creating lists of stop words and content words. The lab includes instructions for creating two versions of the program, one using unordered lists and the other using ordered lists, and for analyzing the relative efficiency of both representations. Students are encouraged to experiment with several input files and observe the relationship between the topics of the texts and the lists of words found.
Typology: Lab Reports

Instructions
In Natural Language Understanding, we try to enable computers to “understand” languages such as English and Spanish. It has been found that computers can classify text according to its topic (for example, sports, politics, or science) by simply counting the occurrences of meaningful words in the text to be classified and comparing those frequencies with the frequencies found in texts of known classes.
In this lab you will write a program to count the occurrences of meaningful words in English text. To decide if a word is meaningful (also known as a content word), we will use a list of words that are known to provide little meaning (also known as stop words) that can be found at
Your task consists of writing two versions of a program to do the following:
For version one of your program, use unordered lists to store stop words and content words; for version two, use ordered lists for both types of words. In both cases, report the total number of string comparisons performed by the program and analyze the relative efficiency of the two representations. Perform experiments using several input files and observe the relationship between the topics of these texts and the lists of words found by your programs.
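The two versions described above could be sketched in Python as follows. This is only one possible reading of the assignment, not a reference solution. All function names (`tokenize`, `find_unordered`, `find_ordered`, `count_content_words`) are hypothetical, and the sketch makes several assumptions: words are lowercased runs of letters and apostrophes, the unordered version uses linear search, the ordered version uses binary search, each probe of a list element counts as one string comparison, and the cost of initially sorting the stop list is not counted.

```python
import re


def tokenize(text):
    """Lowercase the text and return its words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def find_unordered(words, target, stats):
    """Linear search in an unordered list; returns the index or -1."""
    for i, w in enumerate(words):
        stats["comparisons"] += 1  # one string comparison per element examined
        if w == target:
            return i
    return -1


def find_ordered(words, target, stats):
    """Binary search in a sorted list; returns (found, index).

    If the target is absent, index is where it should be inserted.
    Assumption: each probe counts as a single (three-way) string comparison.
    """
    lo, hi = 0, len(words)
    while lo < hi:
        mid = (lo + hi) // 2
        stats["comparisons"] += 1
        if words[mid] == target:
            return True, mid
        if words[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return False, lo


def count_content_words(text, stop_words, ordered):
    """Count content words; returns (counts dict, number of comparisons)."""
    stats = {"comparisons": 0}
    # Sorting cost is deliberately not counted (an assumption of this sketch).
    stops = sorted(stop_words) if ordered else list(stop_words)
    content, counts = [], []  # parallel lists: word and its frequency
    for word in tokenize(text):
        if ordered:
            is_stop, _ = find_ordered(stops, word, stats)
            if is_stop:
                continue
            found, i = find_ordered(content, word, stats)
            if found:
                counts[i] += 1
            else:
                content.insert(i, word)  # keep the content list sorted
                counts.insert(i, 1)
        else:
            if find_unordered(stops, word, stats) != -1:
                continue
            i = find_unordered(content, word, stats)
            if i != -1:
                counts[i] += 1
            else:
                content.append(word)
                counts.append(1)
    return dict(zip(content, counts)), stats["comparisons"]


if __name__ == "__main__":
    # Made-up sample data; the real stop list comes from the assignment.
    sample = "The cat and the dog and the cat"
    stop_list = ["the", "and", "a"]
    for ordered in (False, True):
        words, comparisons = count_content_words(sample, stop_list, ordered)
        print("ordered" if ordered else "unordered", words, comparisons)
```

Both versions should report the same word frequencies and differ only in the number of comparisons. On inputs this small the difference is slight; the gap is expected to widen as the stop list and the content list grow, since linear search examines up to every element while binary search halves the remaining range at each probe.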