Computer Science Assignment: Implementing a Word Frequency Counter

An assignment for a computer science course in which students write a Java program that reads an input file, stores its words in a hash table using chaining for collision resolution, and reports the n most frequently occurring words, the total number of unique words, and the length of the longest chain. The program must handle command line arguments and use LinkedList for the chains of the hash table.

Uploaded on 07/30/2009

koofers-user-ap2-1


Computer Science 3310

Program 5

Your assignment is to write, run, and test a program that does the following:

  1. Read an input file whose name will be specified as a command line argument, breaking the lines of the file into words.
  2. Store the words in a hash table (collision resolution to be done by chaining) along with a count of how many times the word appears in the text.
  3. Produce the following as output to standard output.
    • The n most frequently occurring words where the value for n is specified on the command line. Output one line per word, giving the word and how many times it occurs.
    • The total number of unique words found in the file.
    • The length of the longest chain in the hash table.

PROGRAM DETAILS

Command Line Arguments: Java lets a program user specify program arguments at run time. The code for doing this is included in the file WordCount.java, which is provided as a starting template for the assignment. This lets you specify program parameters when execution is requested rather than going through a sequence of input prompts. For example, the command java WordCount words.txt 17 runs the program with words.txt as the input file and 17 as the value for n. The code provided in WordCount.java checks the validity of these arguments automatically. If you are developing your program without using command line execution, you will want to comment out all of the code involving the command line arguments and manually prompt for the name of the file and the value for n before beginning the processing. However, when you submit your final program, it must handle the command line arguments as specified in the file I provide.
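As a rough illustration of what such argument checking can look like, here is a minimal sketch. It is not the code from WordCount.java; the class name ArgsSketch, the method parseN, and the choice to reject non-positive values of n are all assumptions made for this example.

```java
// Sketch only: the real checks live in the provided WordCount.java template.
final class ArgsSketch {
    /** Returns n parsed from the second argument, or -1 if the arguments are unusable. */
    static int parseN(String[] args) {
        if (args.length != 2) {
            return -1; // expect exactly: <file name> <n>
        }
        try {
            int n = Integer.parseInt(args[1]);
            return n > 0 ? n : -1; // assumption: n must be positive
        } catch (NumberFormatException e) {
            return -1; // second argument was not an integer
        }
    }

    public static void main(String[] args) {
        int n = parseN(args);
        if (n < 0) {
            System.err.println("usage: java ArgsSketch <file> <n>");
            return;
        }
        System.out.println("file=" + args[0] + ", n=" + n);
    }
}
```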

Words: A “word” is either (1) a sequence of letters terminated by a non-letter, or (2) a sequence of letters that contains an apostrophe, where a letter must both precede and follow the apostrophe. The following examples show input lines and the words found in them:

  Input                                  Words
  now is, the999 time8dkfj couldn't      now is the time dkfj couldn't
  999xjk,isnt' he 'would've done it**    xjk isnt he would've done it
  more sample''words                     more sample words

You may assume that a word begins and ends on the same line of text. In order to count "The" and "the" as occurrences of the same word, you must store all words in lower case letters. Code to do that is provided in the WordCount.java template. Note that the proper representation of the apostrophe as a character constant is '\'' (i.e. apostrophe, backslash, apostrophe, apostrophe). The method Character.isLetter(ch), which returns true if the char variable ch is a letter, will be useful.
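The word rules above can be sketched as a single left-to-right scan of a line. This is one possible reading of the rules, not the template's code: the class WordScanSketch and the method wordsIn are hypothetical names, and the sketch returns a list rather than storing words in the hash table.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: splits one line into lower-case words under the rules above.
final class WordScanSketch {
    static List<String> wordsIn(String line) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            // An apostrophe belongs to a word only when a letter precedes it
            // (the word in progress is non-empty) and a letter follows it.
            boolean innerApostrophe = ch == '\''
                    && current.length() > 0
                    && i + 1 < line.length()
                    && Character.isLetter(line.charAt(i + 1));
            if (Character.isLetter(ch) || innerApostrophe) {
                current.append(Character.toLowerCase(ch));
            } else if (current.length() > 0) {
                words.add(current.toString()); // a non-letter ends the word
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            words.add(current.toString()); // word that runs to the end of the line
        }
        return words;
    }
}
```

With this sketch, the line 999xjk,isnt' he 'would've done it** yields xjk, isnt, he, would've, done, it, because neither the trailing apostrophe of isnt' nor the leading apostrophe of 'would've has a letter on both sides.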

Data Structure: Since chaining is being used for collision resolution, the hash table will be an array of lists. You will use the Java API class LinkedList for the list type. Remember that item positions in a LinkedList begin at 0. Each list will contain items of type WordItem. The file WordItem.java, which defines this type, will be provided. Declarations for the table as well as the value for TABLESIZE are provided in WordCount.java. You will need to write the code to initialize each entry of the hash table.
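Initializing the table means giving every array slot its own empty list, since a freshly allocated array of objects holds only null references. The sketch below shows the idea under stated assumptions: the nested WordItem class is a hypothetical stand-in for the provided WordItem.java, and the TABLESIZE value of 101 is invented for the example.

```java
import java.util.LinkedList;

// Sketch only: builds an array of empty chains.
final class TableInitSketch {
    // Hypothetical stand-in; the course supplies the real WordItem.java.
    static final class WordItem {
        final String word;
        int count = 1;
        WordItem(String word) { this.word = word; }
    }

    static final int TABLESIZE = 101; // assumed value; the template defines the real one

    @SuppressWarnings("unchecked") // Java cannot create a generic array directly
    static LinkedList<WordItem>[] newTable() {
        LinkedList<WordItem>[] table = new LinkedList[TABLESIZE];
        for (int i = 0; i < table.length; i++) {
            table[i] = new LinkedList<>(); // every slot starts as an empty chain
        }
        return table;
    }
}
```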

Predefined functions: To assist you in developing the program, I provide two functions: (1) hash, the hash function, which takes a word and produces a hash value between 0 and TABLESIZE-1; (2) wordCopy, which takes an input word and makes a copy of it, allocating the necessary memory. See WordCount.java for the details.

Required functions: (you may need/want others)

  • A function that, given an input buffer (i.e., a String containing a line of text), processes the buffer: it finds all the words in the line, uses wordCopy to create a separate String variable for each word, and stores each word in the hash table.
  • A function to store a word in the hash table. What should happen if the word is already in the table?
  • Function(s) for determining the n most frequent words once you’ve finished reading the file.
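One plausible shape for the store function is sketched below: walk the chain selected by the hash value, increment the count if the word is already present, and otherwise append a new item. Everything named here is an assumption for illustration. WordItem stands in for the provided class, TABLESIZE is an invented value, and hash is a placeholder built on String.hashCode, not the hash function supplied in WordCount.java.

```java
import java.util.LinkedList;
import java.util.ListIterator;

// Sketch only: insert-or-increment on a chained hash table.
final class StoreSketch {
    // Hypothetical stand-in for the provided WordItem.java.
    static final class WordItem {
        final String word;
        int count = 1;
        WordItem(String word) { this.word = word; }
    }

    static final int TABLESIZE = 101; // assumed; the template defines the real value

    // Placeholder for the provided hash function (not the course's actual hash).
    static int hash(String word) {
        return Math.floorMod(word.hashCode(), TABLESIZE);
    }

    @SuppressWarnings("unchecked")
    static LinkedList<WordItem>[] newTable() {
        LinkedList<WordItem>[] table = new LinkedList[TABLESIZE];
        for (int i = 0; i < table.length; i++) {
            table[i] = new LinkedList<>();
        }
        return table;
    }

    /** Adds word with count 1, or increments its count if already stored. */
    static void store(LinkedList<WordItem>[] table, String word) {
        LinkedList<WordItem> chain = table[hash(word)];
        ListIterator<WordItem> it = chain.listIterator();
        while (it.hasNext()) {
            WordItem item = it.next();
            if (item.word.equals(word)) {
                item.count++; // seen before: bump the count, add nothing
                return;
            }
        }
        chain.add(new WordItem(word)); // first occurrence
    }

    /** Returns the stored count for word, or 0 if it is absent. */
    static int countOf(LinkedList<WordItem>[] table, String word) {
        ListIterator<WordItem> it = table[hash(word)].listIterator();
        while (it.hasNext()) {
            WordItem item = it.next();
            if (item.word.equals(word)) {
                return item.count;
            }
        }
        return 0;
    }
}
```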

Technical Implementation Requirements: One of the best ways to determine whether students have mastered some principles of data abstraction is to have them apply those principles in their software solutions. To that end, you must implement your solution for this program using the following features in order to receive full credit for correctness (assuming that your program produces correct answers as well):

  • As already specified, you must use the LinkedList structure for each chain (see the Java API documentation for LinkedList).
  • Whenever you are potentially searching an entire chain (i.e. one of your LinkedList structures), you must use an iterator. This requires use of the listIterator method (see the API for LinkedList). Iterators are also described in your text beginning on page 272.
  • When you are determining the n most frequent words, after you identify the most frequently occurring word, you must somehow mark it so that it is not selected again. The simplest way is to reset its count, but the most efficient way is to remove it from the table. You must remove it to earn full credit (rather than simply reset the count).
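The removal requirement can be sketched as follows: scan every chain with a list iterator to find the item with the largest count, then remove that item through an iterator so it cannot be selected again. Calling the method n times would give the n most frequent words. The class TopSketch, its nested WordItem, and the method names are hypothetical, and breaking ties in favor of the first item encountered is an assumption the assignment does not specify.

```java
import java.util.LinkedList;
import java.util.ListIterator;

// Sketch only: select-and-remove of the most frequent word.
final class TopSketch {
    // Hypothetical stand-in for the provided WordItem.java.
    static final class WordItem {
        final String word;
        int count;
        WordItem(String word, int count) { this.word = word; this.count = count; }
    }

    @SuppressWarnings("unchecked")
    static LinkedList<WordItem>[] newTable(int size) {
        LinkedList<WordItem>[] table = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            table[i] = new LinkedList<>();
        }
        return table;
    }

    /** Removes and returns the item with the highest count, or null if the table is empty. */
    static WordItem removeMostFrequent(LinkedList<WordItem>[] table) {
        WordItem best = null;
        LinkedList<WordItem> bestChain = null;
        for (LinkedList<WordItem> chain : table) {
            ListIterator<WordItem> it = chain.listIterator();
            while (it.hasNext()) {
                WordItem item = it.next();
                if (best == null || item.count > best.count) {
                    best = item; // ties keep the first item seen
                    bestChain = chain;
                }
            }
        }
        if (best == null) {
            return null;
        }
        // Second pass over the winning chain: remove via the iterator.
        ListIterator<WordItem> it = bestChain.listIterator();
        while (it.hasNext()) {
            if (it.next() == best) {
                it.remove();
                break;
            }
        }
        return best;
    }
}
```

Because this approach empties the table as it goes, any statistics that depend on the full table, such as the number of unique words and the longest chain, need to be computed before the first removal.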

Test Data: You need to verify your program on small test files of your own data. Run your program on the same file for various values of n. In particular, try choosing values of n where the cutoff point involves words that occur the same number of times. I will test your program on large files. In particular, I have an on-line copy of Mark Twain’s book, Huckleberry Finn. It is located at http://www.cs.ecu.edu/~rws/c3310/Book. Each file consists of a set of chapters. For example, the file part1.txt contains the first nine chapters. Assuming you’ve downloaded the file, the command line java WordCount part1.txt 10 should produce output something like the following:

The 10 most frequently occurring words were:

  1. and occurred 1137 times
  2. the occurred 879 times
  3. i occurred 785 times
  4. a occurred 611 times
  5. to occurred 540 times
  6. it occurred 446 times
  7. was occurred 408 times
  8. he occurred 339 times
  9. of occurred 290 times
  10. in occurred 264 times

There were a total of 2331 unique words
The longest chain was 22
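The two summary figures at the end of the output can be gathered in one pass over the table, as in the sketch below. It assumes every slot holds a non-null chain and that each unique word is stored exactly once, so the sum of the chain sizes is the unique word count. The class and method names are hypothetical.

```java
import java.util.LinkedList;

// Sketch only: computes the unique word count and the longest chain length.
final class StatsSketch {
    /** Returns {unique word count, longest chain length}. */
    static int[] uniqueAndLongest(LinkedList<?>[] table) {
        int unique = 0;
        int longest = 0;
        for (LinkedList<?> chain : table) {
            unique += chain.size(); // one list entry per unique word
            longest = Math.max(longest, chain.size());
        }
        return new int[] {unique, longest};
    }
}
```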