CSE 591
Pattern-based relation
mining
Fall 2008
http://www.public.asu.edu/~jhakenbe/591/
Last Monday
• Relation mining: spot (defined) relations/associations
between named entities in text
• “John is married to Alice.”
• “CASP8 binds to the death domain of FADD.”
• “The G56R mutation in NR2E3 accounts for ADRP.”
Gene     Mutation   Disease   Relation
NR2E3    Gly56Arg   ADRP      Cause

Protein  Protein    Relation
CASP8    FADD       Spatial interaction

Person   Person     Relation
John     Alice      Spouses
Corpus-level statistics
• evaluate co-occurrences of the same pair in a large
corpus, especially across different texts:
1) get instances on sentence level
2) aggregate into corpus-level results
3) decide whether due to chance or statistically significant
• measures: for instance,
- pointwise mutual information (PMI)
- log-likelihood ratio (LLR)
- co-citation (hypergeometric distribution)
• prerequisite: normalization of entities to IDs
(homonyms, synonyms)
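A minimal sketch of steps 1) to 3) using PMI, assuming sentence-level entity sets are already available and normalized to IDs. The function name `pmi_scores` and the toy data are illustrative assumptions, not part of the lecture; a positive score means the pair co-occurs more often than expected by chance.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(sentences):
    """Compute PMI for every entity pair that co-occurs in some sentence.

    `sentences` is a list of sets of normalized entity IDs (one set per sentence).
    """
    n = len(sentences)
    single = Counter()  # number of sentences mentioning each entity
    pair = Counter()    # number of sentences mentioning both entities
    for entities in sentences:
        single.update(entities)
        pair.update(frozenset(p) for p in combinations(sorted(entities), 2))

    scores = {}
    for p, n_xy in pair.items():
        x, y = sorted(p)
        p_xy = n_xy / n
        p_x, p_y = single[x] / n, single[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy corpus (made up): entity names stand in for normalized IDs.
sentences = [
    {"CASP8", "FADD"},
    {"CASP8", "FADD"},
    {"CASP8", "NR2E3"},
    {"NR2E3"},
]
scores = pmi_scores(sentences)
for pair_key, value in sorted(scores.items()):
    print(pair_key, round(value, 3))
```

PMI alone does not say whether a score is statistically significant; with small counts it is unstable, which is one reason LLR or a hypergeometric test is listed alongside it.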
Max. entropy modeling
• use two MEM models
- sentence filtering (pre-classification)
- classification of pairs
• features: see last week
- normalization factor Z
- K feature functions fj(c,x)
- model parameters aj
- fj(c,x) ∈ {0,1}
- N outcome labels
- labels c, observations x
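The symbols listed above fit the usual conditional maximum-entropy form. The formula itself is not reproduced in these notes, so the following is a sketch under that assumption (product form with one weight a_j per binary feature function):

```latex
p(c \mid x) = \frac{1}{Z(x)} \prod_{j=1}^{K} a_j^{\,f_j(c,x)},
\qquad
Z(x) = \sum_{c'} \prod_{j=1}^{K} a_j^{\,f_j(c',x)}
```

Here the sum in Z(x) runs over all N outcome labels c', so that the probabilities for one observation x add up to 1.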
“These observations establish that RsmC negatively regulates rsmB
transcription but positively affects RsmA production.”
Protein-protein interactions
[Figure: sculpture of a potassium channel]
[Figure: signal transduction pathways]
• finding interactions and networks helps to
understand processes and how they influence each
other ➠ and how they might be influenced
Pattern-based extraction
• despite high variation in language, many descriptions
of interactions/relations follow characteristic patterns
- “John went into the building”
[Person] * [Movement] * [Location]
- “RsmA interacts with rsmB”
[Protein] * [Interaction-type] * [Protein]
- “LRRK2 is involved in Alzheimer’s disease”
[Protein] * [Involvement] * [Disease]
‣“Peter came out of the building”
‣“Mary went into the bank”
‣“Paul is sitting at the table”
‣“The first person to exit the bank was Peter”
‣“One of the proteins involved in AD is LRRK2”
➱ many different patterns are required to capture even the most frequent variations, or else very generic patterns
Word-sequence patterns
• comparable to regular expressions
• fixed parts and options
- John|Mary went|came into|out_of a|the building|bank
✓ “Mary came out of a bank”
✗ “Paul came out of the shop”
• concepts
- [Person] [verb-movement] (in|into|out|out_of) [det] [Location]
✗ “John left the building”
• optional parts
- [Person] [verb-movement] (in|into|out|out_of)? [det] [Location]
✗ “John, absent-minded, entered the wrong building”
• wildcards
- [Person] * [verb-movement] (in|into|out|out_of)? [det] * [Location]
also matches “John did nothing and Mary went into the bank”
➱ wrong fact: John went into the bank
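The progression from strict to optional to wildcard patterns can be tried out with ordinary regular expressions. This is a sketch only: the small word lists in `CONCEPTS` are made-up stand-ins for real entity and part-of-speech recognizers, and "out of" is written as two words rather than the single token out_of.

```python
import re

# Hypothetical toy lexicons; a real system would use NER and POS tagging here.
CONCEPTS = {
    "Person": ["John", "Mary", "Peter", "Paul"],
    "verb-movement": ["went", "came", "left", "entered"],
    "det": ["a", "the"],
    "Location": ["building", "bank", "shop"],
}


def concept(name):
    """Return a non-capturing alternation over the words of one concept."""
    return "(?:" + "|".join(map(re.escape, CONCEPTS[name])) + ")"


PERSON = r"\b(?P<person>" + concept("Person") + ")"
VERB = concept("verb-movement")
PREP = r"(?:into|in|out of|out)"
DET = concept("det")
LOCATION = r"(?P<location>" + concept("Location") + r")\b"
GAP = r"(?:\S+\s+)*?"  # wildcard '*': zero or more intervening tokens

strict = re.compile(rf"{PERSON}\s+{VERB}\s+{PREP}\s+{DET}\s+{LOCATION}")
optional = re.compile(rf"{PERSON}\s+{VERB}\s+(?:{PREP}\s+)?{DET}\s+{LOCATION}")
wildcard = re.compile(
    rf"{PERSON},?\s+{GAP}{VERB}\s+(?:{PREP}\s+)?{DET}\s+{GAP}{LOCATION}"
)


def extract(pattern, sentence):
    """Return (person, location) for the first match, or None."""
    m = pattern.search(sentence)
    return (m.group("person"), m.group("location")) if m else None


print(extract(strict, "Mary came out of a bank"))
print(extract(strict, "John left the building"))
print(extract(optional, "John left the building"))
print(extract(wildcard, "John, absent-minded, entered the wrong building"))
print(extract(wildcard, "John did nothing and Mary went into the bank"))
```

The last call reproduces the problem from the slide: the wildcard skips over "did nothing and Mary" and pairs John with the bank. Each relaxation raises recall and opens the door to false positives of this kind.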
Hand-picked patterns
• recall ~40% ➠ never enough patterns that still yield
high precision
Table 1. Frame representation and accuracy for 100 randomly selected cases.
(Score = probability score; hits are counted per corpus; precision is for the saccharomyces corpus, in percent.)

Frame                                                                 Score  Cell-cycle hits  Saccharomyces hits  Precision (%)
Type I
[syntactical class = proteins] (0-5 words) [verbs] (0-5) [proteins]   4      2628             13667               68
[proteins] (0-5) [verbs] (6-10) [proteins]                            3      969              5380                50
[proteins] (6-10) [verbs] (0-5) [proteins]                            3      892              5090                54
[proteins] (0-10) [verbs] (0-10) [proteins]                           2      278              1672                33
[proteins] () [verbs] () [proteins]                                   1      1632             11080               21
protein verbs protein                                                 NA     6399             36889               NA
[proteins] () [verbs] (0-3) but not (0-3) [proteins]                  0      26               64                  NA
[proteins] () cannot (0-3) [verbs] () [proteins]                      0      7                24                  NA
[proteins] () does not (0-3) [verbs] () [proteins]                    0      38               235                 NA
[proteins] () did not (0-3) [verbs] () [proteins]                     0      34               218                 NA
[proteins] () was not (0-3) [verbs] () [proteins]                     0      12               77                  NA
[proteins] () not (0-3) [verbs] () by () [proteins]                   0      6                101                 NA
[proteins] () not required for (0-3) [verbs] () [proteins]            0      4                10                  NA
[proteins] () failed to (0-3) [verbs] () [proteins]                   0      2                67                  NA
Negations                                                             NA     129              796                 NA
Type II
[verbs] of (0-3) [proteins] (0-3) by (0-3) [proteins]                 5      1                17                  40 (*)
[verbs] of (0-3) [proteins] (0-3) to (0-3) [proteins]                 5      29               294                 97
[nouns] of (0-3) [proteins] (0-3) by (0-3) [proteins]                 5      93               400                 91
[nouns] of (0-3) [proteins] (0-3) with (0-3) [proteins]               5      66               386                 95
[nouns] between (0-3) [proteins] (0-3) and (0-3) [proteins]           5      83               437                 94
Alignment
• common technique in computational biology and
linguistics
• finds similar sequences and the similarities in sequences
- cosine distance tells you that two objects are similar, but not
why and where they are similar/identical/dissimilar
• we usually speak of pairwise alignment, comparing two
sequences
protein strongly binds to protein
protein interacts with the protein
protein never binds to protein
protein regulates the protein
protein inhibits a protein
➱ protein {strongly,never}? {binds, .., ..} {to, with}? {the, a}? protein
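A sketch of pairwise alignment on token sequences, using plain global alignment (Needleman-Wunsch) with unit scores. The scoring values, the function names `align` and `to_pattern`, and the way two aligned sentences are merged into one pattern are assumptions chosen for illustration, not the method of any cited paper.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of two token lists; returns (score, aligned pairs)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two tokens
                score[i - 1][j] + gap,      # token of a against a gap
                score[i][j - 1] + gap,      # token of b against a gap
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


def to_pattern(pairs):
    """Merge an alignment into a pattern: {x,y} for variants, '?' if optional."""
    parts = []
    for x, y in pairs:
        if x == y:
            parts.append(x)
        else:
            options = sorted(t for t in (x, y) if t is not None)
            suffix = "?" if None in (x, y) else ""
            parts.append("{" + ",".join(options) + "}" + suffix)
    return " ".join(parts)


s1 = "protein strongly binds to protein".split()
s2 = "protein never binds to protein".split()
total, pairs = align(s1, s2)
print(total, to_pattern(pairs))

s3 = "protein binds to protein".split()
s4 = "protein binds to the protein".split()
print(to_pattern(align(s3, s4)[1]))
```

Unlike a single similarity score, the traceback shows where the two sentences agree and where they differ, which is what makes it usable for building patterns.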
Learning patterns
• resulting patterns, sorted by support
• filtering rules:
• precision/recall around 80%
Huang et al.
[Figure: Fig. 4. Pattern examples extracted from about 1200 sentences. The star symbol denotes a protein name. Words for each component of a pattern are separated by a semicolon. Action words are not completely listed.]
[Table: Table 8. The recall and precision experiments. Columns: Keyword, TP, TP+TN, TP+FP, Recall, Precision, Fβ=1]
such as ‘PTN NN PTN’ because there are never such segment ‘protein1 interaction protein2’ defining a real interaction between protein1 and protein2. Some patterns, such as ‘PTN VBZ IN CC IN PTN’ which should be ‘PTN VBZ IN PTN CC IN PTN’ (protein1 interacts with protein2 and with protein3),
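A small sketch of "sorted by support" plus a filtering step, assuming candidate patterns are already available as part-of-speech tag strings. The reject list contains only the ‘PTN NN PTN’ case mentioned above; the `min_support` threshold, the function name `rank_patterns`, and the candidate counts are illustrative assumptions.

```python
from collections import Counter

# Patterns rejected by rule; 'PTN NN PTN' is the example given in the excerpt.
REJECTED = {"PTN NN PTN"}


def rank_patterns(candidates, min_support=2):
    """Count candidate patterns, drop rejected or rare ones, sort by support."""
    support = Counter(candidates)
    kept = [
        (pattern, count)
        for pattern, count in support.items()
        if count >= min_support and pattern not in REJECTED
    ]
    # Highest support first; ties broken alphabetically for a stable order.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Made-up candidate list, one entry per sentence the pattern was derived from.
candidates = (
    ["PTN VBZ IN PTN"] * 3
    + ["PTN NN PTN"] * 4
    + ["PTN VBZ PTN"] * 2
    + ["PTN VBD TO PTN"]
)
ranked = rank_patterns(candidates)
for pattern, count in ranked:
    print(count, pattern)
```

Note that ‘PTN NN PTN’ has the highest raw support in this toy list and is still removed, which is the point of a rule-based filter on top of frequency.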
References
- Pyysalo et al. (2006) Relationship type ontology. http://mars.cs.utu.fi/BioInfer/?q=relationship_ontology
- Saveanu. Cells need interactions. http://www.functionalgenomics.org.uk/sections/resources/protein-protein.htm
- Blaschke and Valencia (2002) The Frame-Based Module of the SUISEKI Information Extraction System.
- Huang et al. (2004) Discovering patterns to extract protein-protein interactions from full texts.
- Riesbeck (1986) From Conceptual Analyzer to Direct Memory Access Parsing: An Overview. Advances in Cognitive Sciences, pp. 237-258.
- Livingston and Riesbeck (2007) Using Episodic Memory in a Memory Based Parser to Assist Machine Reading. AAAI Spring Symposium on Machine Reading.