CS522 HW4

Homework 4
CS522, Fall 2008

Due: Wednesday, November 26

Please upload your files to CSNS. The files should include all the source code, documentation (optional), and a text file hw4.txt, which contains detailed instructions on how to compile and run your program on the CS3 server. Note that file uploading will be disabled automatically after 11:59 PM on the due date, so please turn in your work on time.

[Readings]

Read Chapters 6.11-6.14 and 7.1-7.4 of the textbook.

[K-Means] (60pt)

Implement the K-Means algorithm as described in Chapter 7.4.1 of the textbook. You may use any programming language of your choice, as long as your program can be compiled and run on CS3.

Use your K-Means implementation to cluster the Forest CoverType dataset. Note that you do not have to use all the attributes for clustering; for example, you may choose to use only the numerical attributes or only the binary attributes. Your program should take the input data file as a command-line argument, as shown below (I'm using Java for examples, but as stated earlier, you may use other programming languages):

java KMeans <dataset>
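
As an illustration of the command-line convention above, here is a minimal sketch of reading the data file named by the first argument. It assumes the file follows the UCI covtype.data layout (comma-separated, no header row, 10 quantitative attributes first, and the cover type as an integer in the last column) and that only the quantitative attributes are used. The class name CoverTypeLoader and the methods parseFeatures and parseLabel are hypothetical names chosen for this example; they are not required by the assignment.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CoverTypeLoader {
    // Assumption: the first 10 columns are the quantitative attributes.
    static final int NUM_NUMERIC = 10;

    /** Parses the first NUM_NUMERIC comma-separated fields of a record. */
    public static double[] parseFeatures(String line) {
        String[] fields = line.trim().split(",");
        double[] x = new double[NUM_NUMERIC];
        for (int j = 0; j < NUM_NUMERIC; j++) {
            x[j] = Double.parseDouble(fields[j].trim());
        }
        return x;
    }

    /** Parses the class label, assumed to be the last field of a record. */
    public static int parseLabel(String line) {
        String[] fields = line.trim().split(",");
        return Integer.parseInt(fields[fields.length - 1].trim());
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java CoverTypeLoader <dataset>");
            System.exit(1);
        }
        List<double[]> features = new ArrayList<double[]>();
        List<Integer> labels = new ArrayList<Integer>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().length() == 0) {
                    continue; // skip blank lines
                }
                features.add(parseFeatures(line));
                labels.add(Integer.valueOf(parseLabel(line)));
            }
        } finally {
            in.close();
        }
        // The clustering step would run here on `features`; `labels` is
        // held back and used only to score the resulting clusters.
    }
}
```

The labels are kept separate from the feature vectors so that the class attribute cannot influence the clustering itself.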

Since there are seven forest cover types, your program should produce seven clusters. We label each cluster with the class label of the majority class in the cluster, and if a record has the same label as the label of the cluster it belongs to, we consider the record "correctly clustered". Your program should output to the console the percentage of the correctly clustered records, e.g. 50%. Please do not output any debugging information in the submitted version.
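
The scoring rule above can be sketched as follows. This is only one possible way to compute it, assuming each record's cluster index and true class label are available as parallel int arrays; the class name ClusterAccuracy and the method percentCorrect are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClusterAccuracy {
    /**
     * Returns the percentage of records whose class label equals the
     * majority label of the cluster they were assigned to.
     *
     * clusterOf[i] is the cluster index (0..k-1) of record i,
     * labelOf[i] is the true class label of record i.
     */
    public static double percentCorrect(int[] clusterOf, int[] labelOf, int k) {
        if (clusterOf.length == 0) {
            return 0.0;
        }
        // counts.get(c) maps each class label to its frequency in cluster c.
        List<Map<Integer, Integer>> counts = new ArrayList<Map<Integer, Integer>>();
        for (int c = 0; c < k; c++) {
            counts.add(new HashMap<Integer, Integer>());
        }
        for (int i = 0; i < clusterOf.length; i++) {
            Map<Integer, Integer> m = counts.get(clusterOf[i]);
            Integer n = m.get(labelOf[i]);
            m.put(labelOf[i], n == null ? 1 : n + 1);
        }
        // The records carrying a cluster's majority label are exactly the
        // "correctly clustered" records of that cluster.
        int correct = 0;
        for (Map<Integer, Integer> m : counts) {
            int best = 0;
            for (int n : m.values()) {
                best = Math.max(best, n);
            }
            correct += best;
        }
        return 100.0 * correct / clusterOf.length;
    }

    public static void main(String[] args) {
        // Toy data: cluster 0 holds labels {1,1,1,2}, cluster 1 holds {2,2,2,3}.
        int[] clusterOf = {0, 0, 0, 0, 1, 1, 1, 1};
        int[] labelOf   = {1, 1, 1, 2, 2, 2, 2, 3};
        System.out.printf("%.0f%%%n", percentCorrect(clusterOf, labelOf, 2)); // prints 75%
    }
}
```

In the toy data, three of the four records in each cluster carry the majority label, so 6 of 8 records (75%) count as correctly clustered. Note that under this rule two clusters may end up with the same majority label.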

Note that:

Use of existing clustering code found online or from other sources will be considered cheating.
To receive full credit for the assignment, you must include a brief description of your implementation in hw4.txt. The description should include, but is not limited to, the distance function used, how centroids are selected and updated, and any performance optimizations you implemented.
The three most accurate clustering results will receive up to 20% extra credit, and the three least accurate will receive a penalty of up to 20%.