Homework 4
CS522, Fall 2008


Due: Wednesday, November 26

Please upload your files to CSNS. Your submission should include all source code, documentation (optional), and a text file, hw4.txt, with detailed instructions on how to compile and run your program on the CS3 server. File uploading is disabled automatically after 11:59 PM on the due date, so please turn in your work on time.

[Readings]

[K-Means] (60pt)

Implement the K-Means algorithm as described in Chapter 7.4.1 of the textbook. You may use any programming language of your choice, as long as your program can be compiled and run on CS3.
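
For orientation, here is a minimal sketch of the usual K-Means iteration: assign every record to its nearest centroid, recompute each centroid as the mean of its records, and repeat until no assignment changes. It is not taken from the textbook, so details such as initialization and the stopping rule may differ from Section 7.4.1. The class name KMeansSketch, the method names, the use of squared Euclidean distance, and the choice to leave an empty cluster's centroid unchanged are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Random;

class KMeansSketch {
    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Picks k distinct records as initial centroids (requires k <= points.length). */
    static double[][] randomCentroids(double[][] points, int k, long seed) {
        Random rng = new Random(seed);
        int n = points.length;
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            int j = c + rng.nextInt(n - c); // partial Fisher-Yates shuffle
            int tmp = order[c]; order[c] = order[j]; order[j] = tmp;
            centroids[c] = points[order[c]].clone();
        }
        return centroids;
    }

    /**
     * Runs the iteration starting from the given centroids (updated in place)
     * and returns the cluster index of every point.
     */
    static int[] cluster(double[][] points, double[][] centroids, int maxIterations) {
        int n = points.length, k = centroids.length, d = points[0].length;
        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: move each point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double dist = squaredDistance(points[i], centroids[c]);
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            if (!changed) break; // converged: no point switched clusters

            // Update step: each centroid becomes the mean of its points.
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int a = 0; a < d; a++) sums[assignment[i]][a] += points[i][a];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // assumption: keep an empty cluster's centroid
                for (int a = 0; a < d; a++) centroids[c][a] = sums[c][a] / counts[c];
            }
        }
        return assignment;
    }
}
```

A call might look like KMeansSketch.cluster(points, KMeansSketch.randomCentroids(points, 7, 42L), 100), where the seed and the iteration cap are arbitrary. Because attributes on very different scales can dominate Euclidean distance, normalizing the selected attributes first may be worth considering.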

Use your K-Means implementation to cluster the Forest CoverType dataset. You do not have to use all the attributes for clustering; for example, you may choose to use only the numerical attributes or only the binary attributes. Your program should take the input data file as a command-line parameter, as shown below (the examples use Java, but as stated earlier, you may use other programming languages):

java KMeans <dataset>
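
As a rough illustration of reading the file named on the command line, the sketch below loads a comma-separated file into memory and separates attributes from class labels. It assumes that each line is one record of numeric fields and that the class label is the last field; please check this against the actual data file. The class name DatasetLoaderSketch and its methods are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DatasetLoaderSketch {
    /** Parses one comma-separated record into numeric fields. */
    static double[] parseLine(String line) {
        String[] parts = line.trim().split(",");
        double[] row = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            row[i] = Double.parseDouble(parts[i].trim());
        }
        return row;
    }

    /** Reads every non-empty line of the file as one record. */
    static List<double[]> load(String path) throws IOException {
        List<double[]> rows = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.trim().isEmpty()) rows.add(parseLine(line));
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java DatasetLoaderSketch <dataset>");
            System.exit(1);
        }
        List<double[]> rows = load(args[0]);
        double[][] features = new double[rows.size()][];
        int[] labels = new int[rows.size()];
        for (int i = 0; i < rows.size(); i++) {
            double[] row = rows.get(i);
            // Assumption: the last field is the class label, the rest are attributes.
            features[i] = Arrays.copyOf(row, row.length - 1);
            labels[i] = (int) row[row.length - 1];
        }
        // Clustering and scoring would follow here; nothing is printed in this sketch.
    }
}
```

To cluster on a subset of the attributes, the Arrays.copyOf call could be replaced with a copy of only the chosen columns.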

Since there are seven forest cover types, your program should produce seven clusters. Each cluster is labeled with the class label of its majority class, and a record is considered "correctly clustered" if its class label matches the label of the cluster it belongs to. Your program should print to the console the percentage of correctly clustered records, e.g., 50%. Please do not print any debugging information in the submitted version.
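
The scoring rule above can be sketched as follows: count how often each class occurs in each cluster, take the largest count per cluster as its number of correctly clustered records, and divide the total by the number of records. The class name ClusterAccuracySketch and the method signature are hypothetical, and the sketch assumes that cluster indices and class labels have already been mapped to 0-based integers.

```java
class ClusterAccuracySketch {
    /**
     * Returns the percentage of records whose class label equals the
     * majority label of the cluster they were assigned to.
     * Assumes assignment[i] is in 0..k-1 and labels[i] is in 0..numClasses-1.
     */
    static double percentCorrect(int[] assignment, int[] labels, int k, int numClasses) {
        int[][] counts = new int[k][numClasses];
        for (int i = 0; i < assignment.length; i++) {
            counts[assignment[i]][labels[i]]++;
        }
        int correct = 0;
        for (int c = 0; c < k; c++) {
            // The majority class of cluster c contributes its count as "correct".
            int majority = 0;
            for (int count : counts[c]) {
                majority = Math.max(majority, count);
            }
            correct += majority;
        }
        return 100.0 * correct / assignment.length;
    }
}
```

For example, with two clusters holding class labels {0, 0, 1} and {1, 1, 1}, the majority counts are 2 and 3, so 5 of the 6 records are correctly clustered, or about 83.3%.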

Note that