

Material Type: Project; Professor: Holmes; Class: DATA MINING; Subject: Statistics & Applied Probability; University: University of California - Santa Barbara; Term: Summer A 2009;
Typology: Study Guides, Projects, Research


The DUNGAREE data set gives the number of pairs of four different types of dungarees sold at stores. Each row represents an individual store. There are five columns in the data set. One column is the store identification number, and the remaining columns contain the number of pairs of each type of jeans sold.
a. Open a new diagram in your Exercise project. Name the diagram Jeans.
b. Add an Input Data Source node to the diagram and assign the data set DUNGAREE.
c. Examine the distribution of the variables. Are there any unusual data values? Are there missing values that should be replaced? Are the model roles and measurement levels assigned to the variables appropriate?
d. Assign the variable STOREID the model role id and the variable SALESTOT the model role rejected. Be sure that the remaining variables have the input model role and interval measurement level. Why should the variable SALESTOT be rejected?
e. Add a Clustering node to the diagram workspace and connect it to the Input Data Source node.
f. Open the Clustering node. Choose the standard deviation standardization method.
g. Run the diagram from the Clustering node and examine the results. Which of the variables was the most important in determining the clusters?
h. After examining the results, summarize the nature of the clusters.
i. Add an Insight node to the diagram. Use box plots to compare the numbers of different types of jeans and the total number of jeans sold in each of the clusters. Do the results you see here agree with the conclusions you drew from looking at the Clustering node results?
j. Use the Insight node to visualize the clusters.
k. Add an SOM/Kohonen node to the diagram and connect it to the Input Data Source node.
l. Open the SOM/Kohonen node. Choose the standard deviation standardization method. To better compare the clusters to those generated by the Clustering node, set the map to two rows and three columns.
m. Run the diagram from the SOM/Kohonen node and examine the results. Which of the variables was the most important in determining the clusters?
n. After examining the results, summarize the nature of the clusters.
o. Add an Insight node to the diagram. Use box plots to compare the numbers of different types of jeans in each of the clusters. Do the results you see here agree with the conclusions you drew from looking at the Clustering node results?
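
For readers who want to see the idea behind the standardization and clustering steps outside Enterprise Miner, the following is a minimal Python sketch. It does not reproduce what the Clustering node does internally: it uses a plain k-means loop with a simplified seeding rule, and the column names `type1` and `type2` and all sales figures are hypothetical toy values, not taken from DUNGAREE.

```python
import statistics

# Hypothetical toy data: two jeans types for six stores (NOT the DUNGAREE values).
sales = {
    "type1": [10, 12, 11, 50, 52, 51],
    "type2": [40, 42, 41, 10, 12, 11],
}

def standardize(columns):
    """Rescale each column to mean 0 and sample standard deviation 1."""
    scaled = {}
    for name, values in columns.items():
        mean = statistics.fmean(values)
        sd = statistics.stdev(values)
        scaled[name] = [(v - mean) / sd for v in values]
    return scaled

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Plain k-means with evenly spaced starting seeds (a simplification)."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each store goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(statistics.fmean(dim) for dim in zip(*members))
    return labels, centroids

scaled = standardize(sales)
points = list(zip(*scaled.values()))  # one tuple of standardized values per store
labels, centroids = kmeans(points, k=2)

# Per-cluster means on the original scale, a numeric stand-in for the box plots.
cluster_means = {
    c: {
        name: statistics.fmean([v for v, label in zip(values, labels) if label == c])
        for name, values in sales.items()
    }
    for c in sorted(set(labels))
}
print(labels)
print(cluster_means)
```

Standardizing first keeps a jeans type with large raw counts from dominating the distance calculation. Comparing the per-cluster means on the original scale plays the same role as the box plots in the Insight node steps.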