







These notes introduce cluster analysis, focusing on hierarchical cluster analysis (HCA). HCA is an unsupervised clustering method that groups samples into homogeneous classes based on their distance or similarity. The three most common methods for linking samples in HCA are single link, complete link, and centroid link. The notes also give examples of dendrograms produced using single, complete, and centroid linkages for raw and autoscaled data.
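As a rough illustration of the three linkage methods named above, the sketch below computes the distance between two clusters under each rule. It assumes Euclidean distance; the function names and the sample coordinates are hypothetical and are not taken from the notes.

```python
from itertools import product
from math import dist


def single_link(a, b):
    """Smallest distance between any member of a and any member of b."""
    return min(dist(p, q) for p, q in product(a, b))


def complete_link(a, b):
    """Largest distance between any member of a and any member of b."""
    return max(dist(p, q) for p, q in product(a, b))


def centroid(points):
    """Coordinate-wise mean of a cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def centroid_link(a, b):
    """Distance between the centroids of the two clusters."""
    return dist(centroid(a), centroid(b))


# Hypothetical two-variable samples, invented for illustration only.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

print("single:  ", single_link(cluster_a, cluster_b))
print("complete:", complete_link(cluster_a, cluster_b))
print("centroid:", centroid_link(cluster_a, cluster_b))
```

For the same pair of clusters, the single-link distance is never larger than the complete-link distance, which is one reason the three rules can produce different dendrograms from the same data.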
Typology: Study notes








[Figure: Histogram (Petal width); x-axis: Petal width; y-axis: Relative frequency.]
Not exactly the best classification. It does show that there is some skew to the results (more men in class one and more women in class two), but there is a fair amount of overlap.
Now, our points have been linked into three clusters.
All points have now been linked.
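The step-by-step linking described above (first down to three clusters, then until every point is linked) can be sketched as a simple agglomerative loop. This is a minimal sketch assuming single linkage and Euclidean distance; the function names and the sample values are hypothetical, and the notes do not say which software or algorithm produced the original figures.

```python
from math import dist


def single_link(a, b, points):
    """Smallest distance between a member of cluster a and a member of cluster b."""
    return min(dist(points[i], points[j]) for i in a for j in b)


def agglomerate(points, n_clusters=1):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Returns the final clusters (lists of point indices) and the merge
    history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    history = []
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = single_link(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        history.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters, history


# Hypothetical one-variable samples, invented for illustration only.
points = [(0.0,), (1.0,), (5.0,), (6.5,), (12.0,)]

three, _ = agglomerate(points, n_clusters=3)
everything, history = agglomerate(points, n_clusters=1)
print("three clusters:", three)
print("merge distances:", [d for _, _, d in history])
```

The merge distances in `history` are the linkage levels at which a dendrogram would draw each join.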
Dendrograms
We can now see how our samples are linked. The higher the linkage level, the lower the similarity.
[Figure: dendrogram; similarity axis running from 1.0 to 0.]

Dendrograms
This plot appears to indicate that there are three groups of samples that can only be linked at very low similarity values.
[Figure: dendrogram of samples A B C D E F G H I J.]

Dendrograms
Let's look again at our single linkage example and see what the dendrogram would look like.
[Figure: example dendrogram; similarity axis running from 1.0 to 0.]

A real example
Substances commonly used as accelerants were assayed by capillary column GC/MS. At present, accelerants are identified based on boiling point range.
- Class assignments: A, B, C, D, E
Goal: To determine if multivariate data treatment has the potential for classification of accelerants.

Analysis conditions
Neat samples were spiked with a known amount of an internal standard.
- SP-5 25 m x 0.2 mm I.D. column
- 1 μl sample, 100:1 split injection
- 50 °C for 5 min; 10 °C/min ramp; hold at 250 °C
- Total run time: 30 minutes
- Mass range: 50-150 AMU
- ISTD: octadeuteronaphthalene
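The dendrograms in these notes are drawn on a similarity axis that runs from 1.0 down to 0, with higher linkage levels meaning lower similarity. The notes do not give the formula, so the sketch below assumes one common convention, similarity = 1 - d / d_max, where d_max is the largest distance between any two samples. The function name and the sample coordinates are hypothetical.

```python
from itertools import combinations
from math import dist


def similarity_scale(points):
    """Return a function mapping a distance d to 1 - d / d_max.

    d_max is the largest pairwise Euclidean distance in the data set
    (an assumed convention, not one stated in the notes).
    """
    d_max = max(dist(p, q) for p, q in combinations(points, 2))
    return lambda d: 1.0 - d / d_max


# Hypothetical two-variable samples, invented for illustration only.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
to_similarity = similarity_scale(points)

print(to_similarity(0.0))   # identical samples
print(to_similarity(5.0))   # samples linked at half the maximum distance
print(to_similarity(10.0))  # the two most distant samples
```

Under this convention a linkage distance larger than d_max would give a negative similarity, so the scale is not strictly bounded at 0 for every linkage rule.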
Raw - centroidal linkage
[Figure: dendrogram "Centroid, Raw"; leaves labeled by class (a-e); axis: Similarity.]

Raw - comparison
Centroidal linkage appears to give the best results.
[Figure: three raw-data dendrograms; leaves labeled by class (a-e).]

Autoscaled - single link
[Figure: dendrogram "Single, Scaled"; leaves labeled by class (a-e); axis: Similarity.]

Autoscaled - complete link
[Figure: dendrogram "Complete, Scaled"; leaves labeled by class (a-e); axis: Similarity.]

Autoscaled - centroid link
[Figure: dendrogram "Centroidal, Scaled"; leaves labeled by class (a-e); axis: Similarity.]

Autoscaled - centroid link
For this example, a centroid link best reflects what we already know about the data.
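The autoscaled panels above depend on a preprocessing step that the notes do not spell out. Autoscaling conventionally means mean-centering each variable and dividing it by its standard deviation, so that every variable carries equal weight in the distance calculation. The sketch below assumes the sample (n - 1) standard deviation; the function name and the data values are hypothetical.

```python
from statistics import mean, stdev


def autoscale(rows):
    """Mean-center each column and divide it by its sample standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # n - 1 denominator (an assumption)
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, sds))
        for row in rows
    ]


# Hypothetical samples with two variables on very different scales,
# invented for illustration only.
rows = [(1.0, 100.0), (2.0, 300.0), (3.0, 500.0)]
scaled = autoscale(rows)

for row in scaled:
    print(row)
```

After autoscaling, both columns have mean 0 and standard deviation 1, so the second variable no longer dominates Euclidean distances simply because its raw numbers are larger.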
Iris dataset
A famous data set published by R.A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, 7, 179-188 (1936). He measured four physical properties of irises to see if they could be used to classify any of three different species. He used the length and width of the sepal and petal.

Iris dataset
Species: I. setosa, I. versicolor, I. virginica
Properties: petal width, petal length, sepal width, sepal length
150 samples - no missing values
HCA was conducted on both raw and scaled data. Both single linkage and complete linkage were evaluated.

Autoscaled data, centroidal linkage
[Figure: dendrogram; leaves labeled by class (1, 2, 3); axis: Dissimilarity.]
One class is distinct but the other two overlap.

Iris dataset
So it should be possible to classify samples. HCA just does not provide as useful a view as we had hoped for.
[Figure: scatter plot of raw data; axes: Petal length, Petal width.]

Iris dataset
So there was useful information in the dataset.
- HCA: not a good tool. Reducing the four measurements into a single one actually makes the data worse.
- Autoscaling: had little or no effect. The actual numbers were all of a similar range.
Moral: just because a method doesn't work does not mean that there is no useful information.

Classification of Mycobacteria
Investigators at the CDC wanted to see if it was possible to identify mycobacteria using pattern recognition of an HPLC analysis of mycolic acids.
Mycobacteria include a number of respiratory and non-respiratory pathogens such as M. tuberculosis.
C70-C90 α-branched β-hydroxy mycolic acids were selected as they are known to be in the cell walls of these bacteria.
[Figure labels: Sulawesi, Costa Rica, Ethiopia, Sumatra, Kenya, Columbia]
Using XLStat
Classification criteria that can be minimized.