Information Retrieval 6, Exercises - Computer Science, Exercises of Artificial Intelligence

Prof. Paul McNamee, Information Retrieval, Computer Science, Artificial Intelligence, Johns Hopkins University, Exercises, Google Translate, Dictionary Translation, Translation ambiguity, Language Models, Language Identification

Typology: Exercises

2010/2011

Uploaded on 11/09/2011

605.744 Information Retrieval
Spring 2011 Paul McNamee
Homework #6 (due in 2 weeks)
Multilingual Issues (100=10+10+10+10+60 points)
This assignment will give you a chance to consider multilingual issues in text retrieval. Specifically, you will take a
look at an integrated web-based translation and search system and implement a simple approach to language identification. I
assume you have read the assigned paper by Kishida.
Google Translate (10 points)
Investigate the translate and search utility at http://translate.google.com/translate_s. Try the following English queries
(and optionally others) to search pages written in (1) Spanish, (2) Thai, and (3) Ukrainian: “Japanese tsunami”, “1000
places to see before you die”, “JHU public health”. Make observations about the quality of the results compared to
regular (i.e., non-translation) English search, and consider issues such as: are the translated (to-English) pages useful,
what types of errors occur, what do you find impressive or substandard, and how easy is the interface to use.
Dictionary Translation (10 points)
Describe three significant problems that arise when using dictionaries to translate queries in cross-language information
retrieval.
Translation ambiguity (10 points)
Briefly explain how Pirkola has attempted to cope with translation ambiguity in CLIR.
Language Models (10 points)
Briefly explain how statistical language models can be modified to support cross-language information retrieval.
Language Identification (60 points)
There are several approaches to language identification. Typical methods include: (1) using common words
(stopwords) as features; (2) using character-based language modeling techniques; (3) vector space comparison between
a training document and each test document (the training 'document' may be a large sample of text); and (4)
compression techniques (i.e., train a compression model for each language and see how each 'test' document
compresses using each model). On the course web page I have included samples of English, French, and Spanish text,
along with test files for each. Each test file contains 1000 sentences with one sentence on each line. The files are in the
ISO-8859-1 (Latin-1) encoding. Your task is to build a classifier to predict language and to evaluate its results on the
test documents. Describe your methods and results.
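As an illustration of approach (1), here is a minimal sketch of a stopword-based identifier. The stopword lists are tiny, hand-picked examples and are assumptions, not lists supplied with the assignment; the names `STOPWORDS`, `identify_language`, and `classify_file` are hypothetical. It scores a sentence by counting how many of its tokens appear in each language's list, and it reads files as ISO-8859-1 as the assignment specifies.

```python
# Tiny illustrative stopword lists (assumed for this sketch, not from the course page).
STOPWORDS = {
    "english": {"the", "of", "and", "to", "in", "is", "that", "it", "for", "you"},
    "french": {"le", "la", "les", "de", "des", "et", "est", "que", "je", "vous"},
    "spanish": {"el", "la", "los", "de", "del", "y", "que", "en", "es", "por"},
}


def identify_language(sentence: str) -> str:
    """Return the language whose stopword list matches the most tokens."""
    tokens = sentence.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    # Ties (including all-zero scores) go to the first language in STOPWORDS.
    return max(scores, key=scores.get)


def classify_file(path: str) -> list[str]:
    """Classify each non-empty line of a Latin-1 file, one sentence per line."""
    with open(path, encoding="iso-8859-1") as handle:
        return [identify_language(line) for line in handle if line.strip()]
```

Whitespace tokenization leaves punctuation attached to words, and shared words such as "la" or "de" count for more than one language, so a real solution would likely need longer lists and better tokenization.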
Evaluation. Assess the performance of your classifier by calculating precision, recall, and F-scores for each language;
you will obtain three metrics for each language. Precision(Lang) = fraction of the cases where you predict language=Lang
in which you are correct. Recall(Lang) = fraction of the cases where the true language is Lang in which your prediction is correct.
Both precision and recall are values between 0.0 and 1.0. F-scores can be computed as F = 2*P*R/(P+R). Show your work
for calculating precision and recall (i.e., show numerators and denominators) and report scores with at least four digits
of precision. You should try to obtain 90% accuracy on the test sets. (FYI: I was able to obtain 90% accuracy in each
language simply by using small lists of stopwords in each language.)
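The evaluation described above could be computed along the following lines. This is a sketch under the assumption that the true and predicted labels are available as two parallel lists; the function name `per_language_scores` is hypothetical. It prints numerators and denominators and reports four decimal places, as requested.

```python
from collections import Counter


def per_language_scores(gold, predicted):
    """Return {lang: (precision, recall, f_score)} from parallel label lists."""
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    predicted_counts = Counter(predicted)
    gold_counts = Counter(gold)
    scores = {}
    for lang in sorted(set(gold) | set(predicted)):
        tp = correct[lang]
        # Guard against empty denominators (a language never predicted or never seen).
        precision = tp / predicted_counts[lang] if predicted_counts[lang] else 0.0
        recall = tp / gold_counts[lang] if gold_counts[lang] else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        print(f"{lang}: P = {tp}/{predicted_counts[lang]} = {precision:.4f}, "
              f"R = {tp}/{gold_counts[lang]} = {recall:.4f}, F = {f_score:.4f}")
        scores[lang] = (precision, recall, f_score)
    return scores
```

For example, with gold labels `["en", "en", "fr", "es"]` and predictions `["en", "fr", "fr", "es"]`, English has P = 1/1 and R = 1/2, while French has P = 1/2 and R = 1/1.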
You are not required to use the training data that I provided, and you may use other sources if you like. My texts are
works of fiction/literature from Project Gutenberg. You may use approaches other than those mentioned above, and you
may use publicly available tools (e.g., gzip, language modeling toolkits, SVM_light, decision trees); however, you
should not use software intended to solve the entire identification problem (i.e., you should not rely on products such as
Rosette (by BASIS Technology) or demos such as http://odur.let.rug.nl/~vannoord/TextCat/Demo/).
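Since general-purpose tools such as gzip are allowed, approach (4) might be sketched with Python's `zlib` module, which implements the same DEFLATE compression that gzip uses. This is only one possible reading of the compression idea, not a prescribed method: for each language, measure how many extra compressed bytes a test sentence costs when appended to that language's training sample, and pick the cheapest. The names `extra_bytes` and `identify_by_compression` and the `samples` dictionary are hypothetical.

```python
import zlib


def extra_bytes(training: bytes, sentence: bytes) -> int:
    """Extra compressed bytes needed to encode sentence after the training text."""
    base = len(zlib.compress(training, 9))
    combined = len(zlib.compress(training + b"\n" + sentence, 9))
    return combined - base


def identify_by_compression(sentence: str, samples: dict[str, str]) -> str:
    """Pick the language whose training sample makes the sentence cheapest to compress."""
    encoded = sentence.encode("iso-8859-1")
    costs = {
        lang: extra_bytes(text.encode("iso-8859-1"), encoded)
        for lang, text in samples.items()
    }
    return min(costs, key=costs.get)
```

DEFLATE only looks back 32 KB, so with this sketch only the last 32 KB of each training sample can influence the result; recompressing the sample for every sentence is also slow, which may matter for 3000 test sentences.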
The first two lines of each test file are shown below:
English
Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
Spanish
Reanudación del período de sesiones
Declaro reanudado el período de sesiones del Parlamento Europeo , interrumpido el viernes 17 de diciembre pasado , y reitero a Sus Señorías mi deseo de que hayan tenido unas buenas vacaciones.
French
Reprise de la session
Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.
Extra Credit (up to 8 points)
I'll give 8 points extra-credit on the assignment to the student with the highest accuracy (F-score) on the Spanish data. And 4 points to the student with the second highest accuracy.