





An in-depth exploration of information systems, their types, characteristics, and the importance of effective information retrieval. Topics include the definition of information systems, different types of systems, the role of information systems in organizations, and information retrieval techniques. Students will gain a solid foundation in the fundamentals of information systems and the skills necessary to navigate the vast amount of information available.






cis20.2
design and implementation of software applications II
spring 2008
session # II.1

information models and systems

topics:
• what are information systems?
• what is information?
• knowledge representation
• information retrieval

cis20.2-spring2008-sklar-lecII.
information from data, data analysis and reporting)
• decision support systems (DSS) (e.g., extension of MIS, often with some intelligence, allow prediction, posing of “what if” questions)
• executive information systems (e.g., extension of DSS, contain strategic modeling capabilities, data abstraction, support high-level decision making and reporting, often have fancy graphics for executives to use for reporting to non-technical/non-specialized audiences)
• change management
• broad implementation (not just about software)
• education and training
• skill change
• societal and cultural change
• computers in society
• the internet revolution (internet 2, web 2.0)
• “big brother”
• ubiquitous computing

• is the form of information the information itself? or another kind of information?
• is the meaning of a signal or message the signal or message itself?
• representation (from Norman 1993)
  – why do we write things down?
    ∗ Socrates thought writing would obliterate serious thought
    ∗ sound and gestures fade away
  – artifacts help us reason
  – anything not present in a representation can be ignored (do you agree with that?)
  – things left out of a representation are often those things that are hard to represent, or we don’t know how to represent them
• Claude Shannon, 1940’s, Bell Labs
• studied communication and ways to measure information
• communication = producing the same message at its destination as at its source
• problem: noise can distort the message
• message is encoded between source (transmitter) and destination (receiver)
• many disciplines: mass communication, media, literacy, rhetoric, sociology, psychology, linguistics, law, cognitive science, information science, engineering, medicine...
• human communication theory: do you understand what I mean when I say something?
• what does it mean to say a message is received? is received the same as understood?
• the conduit metaphor
• meaning: syntactic versus semantic
• information organization versus retrieval
• organization: categorizing and describing information objects in ways that the people who need to use them can use them
• retrieval: being able to find the information objects you need when you need them
• two key concepts:
  – precision: of what I found, how much is what I wanted?
  – recall: of what I wanted, how much did I find?
• ideally, we want to maximize both precision and recall; this is the primary goal of the field of information retrieval (IR)
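
As a rough illustration of the two measures: in standard IR usage, precision is the fraction of retrieved items that are relevant, and recall is the fraction of relevant items that were retrieved. The sketch below assumes documents are identified by simple ids; the function name `precision_recall` and the sample ids are made up for this example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: four documents came back, three were truly relevant.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
print(f"precision={p:.2f} recall={r:.2f}")
```

Returning every document in the collection would give perfect recall but poor precision, which is one way to see why the two usually have to be traded off against each other.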
• information remains static
• query remains static
• the value of an IR solution is in how well the retrieved information meets the needs of the retriever
• are these good assumptions?
  – in general, information does not stay static, especially on the internet
  – people learn how to make better queries
• problems with the standard model on the internet:
  – “answer” is a list of hyperlinks that then need to be searched
  – answer list is apparently disorganized
• IR is iterative
• IR doesn’t end with the first answer (unless you’re “feeling lucky”...)
• because humans can recognize a partially useful answer; automated systems cannot always do that
• because humans’ queries change as their understanding improves through the results of previous queries
• because sometimes humans get an answer that is “good enough” to satisfy them, even if the initial goals of IR aren’t met
• a “zone” is an identified region within a document
• typically the document is “marked up” before you search
• content of a zone is free text (unlike parametric fields)
• zones can also be indexed
• example: search for a book with a certain keyword in the title, last name in author and topic in body of document
• does this make the web a database? not really (which you’ll see when we get into database definitions next week)
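
A zone index could be sketched along the following lines. This is only a minimal illustration, assuming each document has already been marked up into named zones of free text; the sample books and the helper names `build_zone_index` and `zone_search` are hypothetical.

```python
from collections import defaultdict

# Hypothetical marked-up documents: each zone holds free text.
docs = {
    "b1": {"title": "information retrieval basics",
           "author": "smith",
           "body": "ranking and indexing of text"},
    "b2": {"title": "database systems",
           "author": "jones",
           "body": "indexing and retrieval of records"},
}


def build_zone_index(docs):
    """Map (zone, term) -> ids of documents whose zone contains the term."""
    index = defaultdict(set)
    for doc_id, zones in docs.items():
        for zone, text in zones.items():
            for term in text.lower().split():
                index[(zone, term)].add(doc_id)
    return index


def zone_search(index, **zone_terms):
    """Return ids of documents that match every zone=term constraint."""
    results = None
    for zone, term in zone_terms.items():
        matches = index.get((zone, term.lower()), set())
        results = matches if results is None else results & matches
    return results or set()


index = build_zone_index(docs)
found = zone_search(index, title="retrieval", author="smith", body="indexing")
print(found)
```

Because the index key includes the zone, the word "retrieval" in a title and the same word in a body are treated as different entries, which is what lets a query ask for a keyword in one zone only.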
• search results can either be Boolean (match or not) or scored
• scored results attempt to assign a quantitative value to how good the result is
• some web searches can return a ranked list of answers, ranked according to their score
• some scoring methods:
  – linear combination of zones (or fields)
  – incidence matrices
... bedrooms) + 0.4 ∗ (1000 = price)
• problem: it is frequently hard for a user to assign a weighting that adequately or accurately reflects their needs/desires
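
A linear combination of fields might look like the sketch below: each field that matches the query contributes its weight to the score. The listings, the query, and the weights (including the 0.6 for bedrooms) are all assumptions made up for illustration, as is the function name `weighted_score`.

```python
def weighted_score(listing, query, weights):
    """Sum the weights of the query fields whose value the listing matches."""
    return sum(
        weight
        for field, weight in weights.items()
        if field in query and listing.get(field) == query[field]
    )


# Hypothetical apartment listings and a hypothetical user weighting.
listings = {
    "a": {"bedrooms": 3, "price": 1000},
    "b": {"bedrooms": 3, "price": 1500},
    "c": {"bedrooms": 2, "price": 1200},
}
query = {"bedrooms": 3, "price": 1000}
weights = {"bedrooms": 0.6, "price": 0.4}

scores = {name: weighted_score(item, query, weights) for name, item in listings.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Changing the weights changes the ranking, which is exactly why choosing them is hard for a user: the numbers have to encode how much each field really matters to that person.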
• recall that a document (or a zone or field in the document) is a binary vector $X \in \{0,1\}^v$
• query is a vector $Y$
• score is the overlap measure: $|X \cap Y|$
• example:

               Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
  Antony             1             ...         ...      ...       ...
  Brutus             1             ...         ...      ...       ...
  Caesar             1             ...         ...      ...       ...
  Calpurnia          1             ...         ...      ...       ...
  Cleopatra          0             ...         ...      ...       ...

• score is the sum of the entries in a row (or column, depending on what the query is)
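
The overlap measure can be sketched with plain Python lists standing in for binary vectors. The three documents and their entries below are hypothetical (they are not the values from the slide's table), and the helper names `overlap` and `query_vector` are invented for this example.

```python
terms = ["antony", "brutus", "caesar", "calpurnia", "cleopatra"]

# Hypothetical incidence vectors: 1 means the term occurs in the document.
docs = {
    "doc_a": [1, 1, 1, 1, 0],
    "doc_b": [0, 0, 1, 0, 1],
    "doc_c": [0, 1, 0, 0, 0],
}


def query_vector(words):
    """Turn a set of query words into a binary vector over `terms`."""
    return [1 if term in words else 0 for term in terms]


def overlap(x, y):
    """Overlap measure: number of positions where both vectors are 1."""
    return sum(xi & yi for xi, yi in zip(x, y))


q = query_vector({"brutus", "caesar"})
scores = {name: overlap(vec, q) for name, vec in docs.items()}
print(scores)
```

Every matching term counts exactly once here, no matter how often it appears in the document or how common it is across the collection.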
• problem: the overlap measure doesn’t consider:
  – term frequency (how often does a term occur in a document)
  – term scarcity in collection (how infrequently does the term occur in all documents in the collection)
  – length of documents searched
• what about density? if a document talks about a term more, then shouldn’t it be a better match?
• what if we have more than one term? this leads to term weighting
• in the previous matrix, instead of 0 or 1 in each entry, put the number of occurrences of each term in a document
• this is called the “bag of words” (multiset) model
• problem:
  – score is based on syntactic count but not on semantic count
  – e.g.: “The Red Sox are better than the Yankees.” is the same as “The Yankees are better than the Red Sox.” (well, only in this example...)
• count versus frequency
  – search for documents containing “ides of march”
  – Julius Caesar has 5 occurrences of “ides”
  – No other play has “ides”
  – “march” occurs in over a dozen plays
  – All the plays contain “of”
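
The word-order problem is easy to see with a small sketch that uses `collections.Counter` as the multiset. The tokenizer here (lowercase, letters only) is an assumption chosen for simplicity, and `bag_of_words` is a name made up for this example.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count each lowercase word in the text, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


a = bag_of_words("The Red Sox are better than the Yankees.")
b = bag_of_words("The Yankees are better than the Red Sox.")

# The two sentences say opposite things but produce the same bag.
print(a == b)
print(a["the"])
```

Since the two bags are equal, any score computed from counts alone cannot tell the two sentences apart.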
• By this scoring measure, the top-scoring play is likely to be the one with the most “of”s; is this what we want?
• NOTE that in the IR literature, “frequency” typically means “count” (not really “frequency” in the engineering sense, which would be count normalized by document length...)
• term frequency (tf)
  – somehow we want to account for the length of the documents we are comparing
• collection frequency (cf)
  – the number of occurrences of a term in a collection (also called corpus)
• document frequency (df)
  – the number of documents in a collection (corpus) containing the term
• tf x idf or tf.idf
  – tf = term frequency
  – idf = inverse document frequency; could be $1/\mathrm{df}$, but more commonly computed as:
    $\mathrm{idf}_i = \log(n / \mathrm{df}_i)$
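
The difference between cf, df and idf can be sketched on a toy collection. The three tiny "documents" below are invented for illustration, and the natural logarithm is an assumption, since the slides do not say which base is used.

```python
import math
from collections import Counter

# Hypothetical toy corpus: document id -> list of tokens.
corpus = {
    "d1": "ides of march ides of march".split(),
    "d2": "march of time".split(),
    "d3": "out of office".split(),
}


def collection_frequency(corpus):
    """cf: total number of occurrences of each term in the whole collection."""
    cf = Counter()
    for tokens in corpus.values():
        cf.update(tokens)
    return cf


def document_frequency(corpus):
    """df: number of documents that contain each term at least once."""
    df = Counter()
    for tokens in corpus.values():
        df.update(set(tokens))
    return df


def inverse_document_frequency(corpus):
    """idf_i = log(n / df_i), using the natural log (an assumption)."""
    n = len(corpus)
    return {term: math.log(n / count) for term, count in document_frequency(corpus).items()}


cf = collection_frequency(corpus)
df = document_frequency(corpus)
idf = inverse_document_frequency(corpus)
print(cf["of"], df["of"], idf["of"])
```

A term that appears in every document, like "of" here, gets an idf of zero, so it stops contributing to the score however often it occurs.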
• “weight” of term $i$ occurring in document $d$ ($w_{i,d}$) is then:
    $w_{i,d} = \mathrm{tf}_{i,d} \times \mathrm{idf}_i = \mathrm{tf}_{i,d} \times \log(n / \mathrm{df}_i)$
  where
    $\mathrm{tf}_{i,d}$ = frequency of term $i$ in document $d$
    $n$ = total number of documents in collection
    $\mathrm{df}_i$ = number of documents in collection that contain term $i$
  – weight increases with the number of occurrences within a document
  – weight increases with the rarity of the term across the whole collection
• so now we recompute the matrix using the $w_{i,d}$ formula for each entry in the matrix, and then we can do our ranking with a query
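
Putting the pieces together, a tf.idf weight matrix and a simple ranking could be sketched as follows. The toy corpus is invented, the natural log is an assumption, and scoring a query by summing the weights of its terms is just one plausible choice; the slides do not specify how the query is combined with the matrix.

```python
import math
from collections import Counter

# Hypothetical toy corpus: document id -> list of tokens.
corpus = {
    "d1": "ides of march ides of march".split(),
    "d2": "march of time".split(),
    "d3": "out of office".split(),
}


def tfidf_weights(corpus):
    """Return {doc_id: {term: tf * log(n / df)}} for every term in every document."""
    n = len(corpus)
    df = Counter()
    for tokens in corpus.values():
        df.update(set(tokens))  # count each term once per document
    weights = {}
    for doc_id, tokens in corpus.items():
        tf = Counter(tokens)  # raw count of each term in this document
        weights[doc_id] = {
            term: count * math.log(n / df[term]) for term, count in tf.items()
        }
    return weights


def rank(weights, query_terms):
    """Score each document by summing the weights of the query terms (an assumption)."""
    scores = {
        doc_id: sum(w.get(term, 0.0) for term in query_terms)
        for doc_id, w in weights.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


weights = tfidf_weights(corpus)
ranking = rank(weights, ["ides", "of", "march"])
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

In this toy collection "of" appears in every document, so its weight is zero everywhere and the ranking is driven by the rarer terms "ides" and "march".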