Word Concordance Application: Building a Dictionary ADT for Main Words with Line Numbers

This assignment has several parts -- a comparison of dictionary/map ADTs (section 5.2.3.3 of the text) and a

concordance-production application using the dictionary ADTs. A Webster’s dictionary definition of concordance

is: “an alphabetical list of the main words in a work.” In addition to the main words, I want you to keep track of

all the line numbers where these main words occur.

WORD & LINE CONCORDANCE APPLICATION

The goal of this assignment is to process a textual data file (hw4data.txt) to generate a word concordance with

line numbers for each main word. A dictionary ADT is well suited to storing the word concordance, with the word as

the dictionary key and a list of its line numbers as the associated value. Since the concordance

should only keep track of the “main” words, there will also be a second file of stop words (stop_words.txt).

The stop-words file will contain a list of stop words (e.g., “a”, “the”, etc.) -- these words will not be included in the

concordance even if they do appear in the data file. Sample files might be:

Sample hw4data.txt file

    This is a sample data (text) file to
    be processed by your word-concordance program.

    The real data file is much bigger.

Sample stop_words.txt file

    a
    about
    can
    the
    this
    was

Sample output file

    bigger: 4
    concordance: 2
    data: 1 4
    file: 1 4
    much: 4
    processed: 2
    program: 2
    real: 4
    sample: 1
    text: 1
    word: 2
    your: 2

Notes:

1) Words are defined to be sequences of letters delimited by any non-letter.

(e.g., white space, punctuation, parentheses, dashes, double quotes, etc.)

2) There is to be no distinction made between upper and lower case letters.

(e.g., "CAT" is the same word as "cat")

3) Blank lines are to be counted in the line numbering.

(e.g., line 3 above is blank)
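Notes 1 and 2 together describe how a line breaks into words. A minimal Python sketch of one possible createWordList follows. It assumes that "letters" means the ASCII letters a-z only and uses the standard re module; the handout does not prescribe either choice.

```python
import re

def createWordList(line):
    """Return the words on one line, lowercased, in order of appearance."""
    # A word is a maximal run of letters; every other character is a delimiter.
    # Assumption: "letters" means ASCII a-z only.
    return re.findall(r"[a-z]+", line.lower())
```

Under this sketch a hyphenated term such as word-concordance yields two separate words, which agrees with the sample output, where "word" and "concordance" are listed separately.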

The general algorithm for the word-concordance program is:

1) Read the stop_words.txt file into a dictionary containing only stop words, called stopWordDict.

(WARNING: Strip the newline (‘\n’) character from the end of the stop word before adding it to stopWordDict)

2) Process the hw4data.txt file one line at a time to build the word-concordance dictionary (called

wordConcordanceDict) containing “main” words for the keys with a list of their associated line numbers as

their values. The main loop is something like:

lineCounter = 1

for each line in the data file do

    processLine( lineCounter, line, wordConcordanceDict... )

    lineCounter += 1

3) Traverse the wordConcordanceDict alphabetically by key to generate a text file containing the concordance

words printed out in alphabetical order along with their corresponding line numbers.
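Steps 1 and 3 can be sketched in Python as below. This is only an illustration under stated assumptions: Python's built-in dict stands in for whichever dictionary ADT from the text you use, the function names readStopWords and formatConcordance are hypothetical, stop words are lowercased to match note 2, and the "word: n n" output format is copied from the sample output file.

```python
def readStopWords(lines):
    """Build stopWordDict from an iterable of lines, one stop word per line."""
    stopWordDict = {}
    for line in lines:
        word = line.strip().lower()    # strip() drops the trailing '\n'
        if word:                       # ignore blank lines in the stop-words file
            stopWordDict[word] = True  # only the keys matter; True is a placeholder
    return stopWordDict

def formatConcordance(wordConcordanceDict):
    """Return output lines such as 'data: 1 4', ordered alphabetically by key."""
    outputLines = []
    for word in sorted(wordConcordanceDict):
        numbers = " ".join(str(n) for n in wordConcordanceDict[word])
        outputLines.append(f"{word}: {numbers}")
    return outputLines
```

Both functions accept and return plain sequences of strings, so they can be tested without touching the file system. An open file is itself an iterable of lines, so a file opened on stop_words.txt could be passed straight to readStopWords, and the returned output lines could be written to the output file one per line.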

The general algorithm for the processLine ( lineCounter, line, wordConcordanceDict ... ) function is:

wordList = createWordList(line)

for each word in the wordList do

    if the word is not in the stopWordDict then

        if the word is in the wordConcordanceDict then

            look up the line-#-list value associated with the word in the wordConcordanceDict

            append the lineCounter to the end of the line-#-list

        else

            add the word with an associated [lineCounter] list value (e.g., list containing first line #) to the wordConcordanceDict
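The processLine pseudocode translates almost line for line into Python. The sketch below rests on several assumptions: the built-in dict again stands in for the dictionary ADT, stopWordDict is passed as a fourth parameter (the handout leaves the remaining parameters as "..."), and createWordList is a stand-in helper that keeps only runs of ASCII letters.

```python
import re

def createWordList(line):
    # Assumed helper: lowercase the line and keep runs of ASCII letters.
    return re.findall(r"[a-z]+", line.lower())

def processLine(lineCounter, line, wordConcordanceDict, stopWordDict):
    """Record lineCounter for every non-stop word that appears on the line."""
    for word in createWordList(line):
        if word in stopWordDict:
            continue  # stop words never enter the concordance
        if word in wordConcordanceDict:
            # Word seen before: extend its existing list of line numbers.
            wordConcordanceDict[word].append(lineCounter)
        else:
            # First occurrence: start a new list holding this line number.
            wordConcordanceDict[word] = [lineCounter]
```

As in the pseudocode, a word that occurs twice on the same line records that line number twice. The handout does not say whether such duplicates should be suppressed.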

(Note: I strongly suggest that the logic for reading words and assigning line numbers to them be developed and

tested separately from other aspects of the program. This could be accomplished by reading a sample file and

printing out the words recognized with their corresponding line numbers without any other word processing.)
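One way to follow this suggestion is a small stand-alone check that pairs every recognized word with its line number and does nothing else. The function name wordsWithLineNumbers is hypothetical, and the sketch assumes words are runs of ASCII letters.

```python
import re

def wordsWithLineNumbers(lines):
    """Return (lineNumber, word) pairs for every word, with no stop-word filtering."""
    pairs = []
    # enumerate(..., start=1) numbers lines from 1; blank lines still advance the count.
    for lineCounter, line in enumerate(lines, start=1):
        for word in re.findall(r"[a-z]+", line.lower()):
            pairs.append((lineCounter, word))
    return pairs
```

Printing each pair for a short sample file makes it easy to confirm by eye that blank lines are counted and that punctuation splits words as intended, before any dictionary code is involved.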

Data Structures (CS 1520) Homework #4

DICTIONARY ADT COMPARISON

EXTRA CREDIT POSSIBILITIES:

Data Structures (CS 1520) Homework #4 Due: 3/15/13 (Fri.) at 3 PM
