CS 461 Homework 4

Due: Midnight, February 21, 2008

Please your answers to these questions in <yourlastname>-hw4.txt (or PDF).

Part 1: Parametric Methods (45 points)

For what kinds of situations is the Bernoulli distribution a good choice as a parametric model? Give an example. (5 points)
If I flip a coin and get { tails, tails, tails, heads, tails, heads }, what is the maximum likelihood estimate of the probability of getting heads? (5 points)
Imagine that I always assume ("estimate") that the probability of getting heads from any coin is 0.5 (i.e., I assume that they are all fair coins). If I find a coin that is off-balance, so that P(heads) for this coin is really 0.62, what is the bias of my estimate for this coin? (5 points)
How is the MAP estimate different from the maximum likelihood estimate of a given value theta? (5 points)
Weka: This time, you'll get a chance to use Weka's "Experimenter", which makes it easy to compare several classifiers on the same data set(s). You may want to use the same technique for part of your project!

The data set you will use is wine.arff. It contains three classes, each of which corresponds to wine from a different cultivar (1, 2, or 3). There are 13 features, which are the results of different chemical analyses of the wine. Human wine tasters often claim to be able to distinguish among these three types -- let's see if machine learning can also learn to do it.

First, what is a baseline you might use to evaluate any machine learning algorithm against, on this problem? Remember, a baseline should assign a new wine sample to one of the three classes, without observing any of the features. (5 points)
Now, to get started, run Weka and click on "Experimenter" (instead of "Explorer"). The "Setup" panel should appear. Click "New" to create a new experiment. Leave all of the defaults set (Experiment type-> Cross-validation, 10 folds, Classification).

Under "Datasets", click "Add new..." and browse to add this file: wine.arff. We'll just use one data set, but in general you could add more to see how your algorithm does on many different problems.

Under "Algorithms", click "Add new.." and click "Choose" to pull up the same algorithm chooser that we used before. Add each of the following, clicking "OK" after each one:
- k-NN: lazy -> IBk, then set "KNN" to 3 to specify the use of the three nearest neighbors.
- DT: trees -> J48, leave "unpruned" set to "False" (so it prunes).
- SVM: functions -> SMO, then set "useRBF" to "True" (Gaussian kernel) and set "gamma" to 0.1.
- ANN: functions -> MultilayerPerceptron, use the defaults.
- Naive Bayes: bayes -> NaiveBayes, use the defaults.
Now switch to the "Run" panel and click "Start". This will do 10 runs of 10-fold cross-validation (splitting the folds differently each time).

Report your results: Switch to the "Analyse" panel and click "Experiment". To compare the accuracy achieved by each method, leave all of the settings on their defaults and click "Perform test". This will report the average accuracy achieved by each algorithm. List the results you got, one number (accuracy) for each of the five algorithms. (5 points)
Analyze your results: (15 points)

Looking at the help for the MLP (on the "Setup" panel, select the MLP line in the Algorithms box, click "Edit selected...", then click "More" to read about the options), what does specifying 'a' (the default) for the "hiddenLayers" field do? How many nodes will be in the hidden layer for this data set?

Set "Comparison field" to "Time_training" and click "Perform test". List all of the results. Which one took the longest to train?

Set "Comparison field" to "Time_testing" and click "Perform test". List all of the results. Which one took the longest to test?

Part 2: Clustering (55 points)

When using the k-means clustering algorithm, we seek to minimize the variance of the solution. In general, what happens to the variance of a partition as you increase the value of k, the number of clusters, and why? (5 points)
For what value of k can you always get a variance of 0? Why? (5 points)
Show 2 iterations of the k-means algorithm (k=2) on the following one-dimensional data set: (10 points)

Data: [ 3, 7, 2, 9, 8, 5, 10, 3, 6 ]

First iteration: cluster centers (randomly chosen): 1, 7

Data assignment:
- Cluster 1: { 3, 2, 3 }
- Cluster 2: { 7, 9, 8, 5, 10, 6 }
Show the cluster centers, then the data assignments, that would be obtained for each of two more iterations.

After your iterations, has the algorithm converged to a solution at this point, or not? How can you tell?
Consider the following diagram showing the same 7 balls that have been clustered into two different partitions. In each partition, there are two clusters. One cluster labels the balls red, and the other cluster labels the balls white. What is the Rand Index of agreement between the two partitions? Show your work. (10 points)
Weka: Go back to the "Explorer" to do some clustering. Load in wine.arff. Select the "Cluster" tab. Click "Choose" and select SimpleKMeans. Click on its name to change "numClusters" to 3. Under "Cluster mode", select "Classes to clusters evaluation" and make sure "(Nom) class" is specified below that. Click "Start".

Weka will show you the mean of each cluster (across all features) and then show a confusion matrix that compares the clusters you got with the "true" class labels in the data. Since k-means doesn't know which class is which, they may not line up in the order you expect. Below the confusion matrix, Weka tells you which class (1, 2, or 3) is the most likely match for clusters 0, 1, and 2. It then reports the number of incorrectly classified instances.

Show the confusion matrix you obtained, with the rows re-arranged so that the "correctly clustered" items are on the diagonal. That is, the first row should be the class that matches cluster 0, the second row should be the class that matches cluster 1, and the third row should be the class that matches cluster 2. (5 points)

How many items were incorrectly clustered? (5 points)
Now do the same thing with EM. Click "Choose", select "EM", and then specify that "numClusters" should be 3. Now the output is not just a mean for each feature, but a normal distribution, so you get a mean and standard deviation for each one.

Show the re-arranged confusion matrix for EM: (5 points)

How many items were incorrectly clustered? (5 points)
Using what you know about how k-means and EM work, why do you think you got different results for the two algorithms on the same data set? (5 points)

What to turn in

Upload this file to CSNS under "Homework 4":

<yourlastname>-hw4.txt (or <yourlastname>-hw4.pdf if you prefer to submit in PDF format)

Extra credit

If you wish to tackle some extra credit, you may do so here. You can earn up to 10 points to be applied to any of your homework assignments (but not to exceed 100 on any assignment). To receive these points, you must get at least a 70% on the main part of Homework 4, so finish the regular assignment before moving on to this part.

A recent development in the world of clustering has been the ability to incorporate prior knowledge when clustering.
This paper shows how to incorporate knowledge about individual pairs of data points, in terms of whether they should or should not be clustered together. Explain, in your own words, how this is accomplished. (5 points)
Each time you run k-means with a different initialization (choice of starting cluster centers), you may get a different result out (a local optimum, starting from that point). Since the starting point is not always the same distance from the local optimum, the runtime of k-means also changes depending on its initialization. If the data set is large, k-means can take a very long time to run. However, a good choice of starting point will help keep the runtime low. Can you think of a good way to initialize the cluster centers, other than using random selection, to help minimize runtime? (5 points)

What to turn in

Upload this file to CSNS under "Homework 4: Extra Credit":

<yourlastname>-hw4-extra.txt (or <yourlastname>-hw4-extra.pdf if you prefer to submit in PDF format)