Bioinformatics and Machine Learning Methods - Advance Database Systems | EECS 700, Assignments of Deductive Database Systems

Material Type: Assignment; Class: Special Topics: Advanced Database Systems; Subject: Elect Engr & Computer Science; University: University of Kansas; Term: Fall 2005;

Uploaded on 03/10/2009

EECS 700 Bioinformatics and Machine Learning Methods
Fall 2005
Homework #2
Due Oct. 24th, 2005
Classification of Leukemia with gene expression profiles. The training set (in the file
golub-data-train.txt) and the test set (in the file golub-data-independent.txt) are available at
http://people.eecs.ku.edu/~yazhang/course/f05/Homework.html.
Background on the dataset: The training set contains gene expression profiles for 38 bone
marrow samples from acute leukemia patients, with each profile consisting of about 7000 gene
expression levels. The training samples are labeled as either ALL (acute lymphoid leukemia) or
AML (acute myeloid leukemia), two clinically distinct types of leukemia. The ALL type samples
can further be divided into T-lineage ALL and B-lineage ALL. Finally, there is a test
("independent") set of 50 additional samples, also consisting of AML, T-lineage ALL, and
B-lineage ALL leukemia types. Your task is to distinguish between ALL and AML. (Reference:
"Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression
Monitoring", Golub et al., Science, 286, 1999.)
1. (20 pts) Try kNN classifiers with k=1, 3, 5, 7, to give a baseline performance measure. Explain
what distance measure you are using.
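As an illustration of the baseline in question 1, here is a minimal kNN sketch in plain Python. It assumes Euclidean distance, which is only one possible choice (the assignment leaves the distance measure open), and it runs on made-up two-dimensional toy points rather than the Golub files. The names euclidean and knn_predict are hypothetical helpers, not part of any course-provided code.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k, dist=euclidean):
    """Label a query profile by majority vote among its k nearest training profiles."""
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy stand-in for the training set (assumption: real profiles have ~7000 values each).
train_X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
train_y = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

# One prediction per value of k; odd k avoids ties in a two-class problem.
predictions = {k: knn_predict(train_X, train_y, [0.3, 0.2], k) for k in (1, 3, 5)}
```

Passing a different function as dist (for example a correlation-based distance) changes the measure without touching the voting logic.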
2. (20 pts) Train soft-margin linear SVM classifiers. State clearly what implementation of SVM
classifiers you use, what command-line options you use for training, and how you tune the
soft-margin parameter (usually denoted C).
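To make the tuning step in question 2 concrete, the sketch below picks the soft-margin parameter C by leave-one-out cross-validation on the training set only. It is not any particular SVM package: train_linear_svm is a hypothetical, deliberately simple subgradient-descent solver for the objective 0.5*||w||^2 + C * sum of hinge losses, the grid of C values is an arbitrary assumption, and the data are toy points with labels encoded as +1 and -1.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, C=1.0, epochs=200, lr=0.01):
    """Full-batch subgradient descent on 0.5*||w||^2 + C * sum(max(0, 1 - y*(w.x + b)))."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the regularizer 0.5*||w||^2
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1:  # margin violated: hinge term is active
                grad_w = [g - C * yi * xj for g, xj in zip(grad_w, xi)]
                grad_b -= C * yi
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(model, x):
    w, b = model
    return 1 if dot(w, x) + b >= 0 else -1


def loo_accuracy(X, y, C):
    """Leave-one-out accuracy: train on all samples but one, test on the held-out one."""
    hits = 0
    for i in range(len(X)):
        model = train_linear_svm(X[:i] + X[i + 1:], y[:i] + y[i + 1:], C=C)
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)


def pick_C(X, y, grid):
    """Return the grid value with the best leave-one-out accuracy (first one wins ties)."""
    scores = {C: loo_accuracy(X, y, C) for C in grid}
    return max(grid, key=lambda C: scores[C]), scores


# Toy data (assumption): +1 and -1 stand in for the two leukemia classes.
X = [[-2.0, -1.0], [-1.0, -2.0], [-1.5, -1.5], [2.0, 1.0], [1.0, 2.0], [1.5, 1.5]]
y = [-1, -1, -1, 1, 1, 1]

best_C, scores = pick_C(X, y, [0.01, 0.1, 1.0, 10.0])
model = train_linear_svm(X, y, C=best_C)
train_predictions = [predict(model, x) for x in X]
```

The point of the design is that the test set is never consulted while choosing C; with only 38 training samples, leave-one-out is affordable, though k-fold cross-validation would serve the same purpose.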
3. (30 pts) For each kNN/SVM experiment, report the predicted labels and calculate the
"confusion matrix" on the test set for the 2-class ALL versus AML problem (choose one class to
be "Positive" and one class to be "Negative"):
Actual \ Predicted    Negative    Positive
Negative              A           B
Positive              C           D
where A, B, C, D are the numbers of test examples falling into each category, and calculate the
following simple statistics:
Accuracy: (A+D)/(A+B+C+D)
Sensitivity (True Positive Rate): D/(C+D)
Specificity (True Negative Rate): A/(A+B)
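The counts and statistics above can be computed mechanically from the actual and predicted labels. The sketch below is one way to do it; the label lists are invented for illustration, AML is arbitrarily chosen as the "Positive" class, and the function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive):
    """Return (A, B, C, D) laid out as in the table: rows are actual, columns are predicted."""
    a = b = c = d = 0
    for act, pred in zip(actual, predicted):
        if act != positive and pred != positive:
            a += 1  # actual Negative, predicted Negative
        elif act != positive and pred == positive:
            b += 1  # actual Negative, predicted Positive
        elif act == positive and pred != positive:
            c += 1  # actual Positive, predicted Negative
        else:
            d += 1  # actual Positive, predicted Positive
    return a, b, c, d


def summary_stats(a, b, c, d):
    """Accuracy, sensitivity and specificity; assumes both classes occur in the test set."""
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "sensitivity": d / (c + d),
        "specificity": a / (a + b),
    }


# Made-up labels for eight test samples (assumption, not real results).
actual = ["ALL", "ALL", "ALL", "AML", "AML", "ALL", "AML", "ALL"]
predicted = ["ALL", "AML", "ALL", "AML", "ALL", "ALL", "AML", "ALL"]

counts = confusion_counts(actual, predicted, positive="AML")
stats = summary_stats(*counts)
```

For these invented labels the counts are A=4, B=1, C=1, D=2, so accuracy is 6/8, sensitivity is 2/3 and specificity is 4/5.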
4. (30 pts) Rerun the SVM experiments using your choice of two kernels. Again, report the types
of kernels and the kernel parameters that you used, and report the confusion matrix. Did the use
of kernels with the SVM improve performance on the test set or lead to overfitting?
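For reference, two commonly used kernels are the polynomial kernel and the Gaussian (RBF) kernel; the assignment does not prescribe which two to pick, so treating these as the pair is an assumption. The sketch below only defines the kernel functions and a Gram matrix helper on toy vectors, with default parameter values (degree, coef0, gamma) chosen arbitrarily; it does not train a kernel SVM.

```python
import math


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel: (x . z + coef0) ** degree."""
    return (sum(a * b for a, b in zip(x, z)) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def gram_matrix(X, kernel):
    """Matrix of kernel values between every pair of samples in X."""
    return [[kernel(xi, xj) for xj in X] for xi in X]


# Toy vectors (assumption): stand-ins for gene expression profiles.
X = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
K_poly = gram_matrix(X, polynomial_kernel)
K_rbf = gram_matrix(X, rbf_kernel)
```

When answering the overfitting question, comparing training-set accuracy with test-set accuracy for each kernel and parameter setting is a reasonable check: a large gap between the two suggests the kernel is fitting the 38 training samples too closely.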
