CS522 HW4

Homework 4
CS522, Winter 2011

Due: Friday, March 11

Please upload your files to CSNS. The files should include all the source code, documentation (optional), and a text file hw4.txt, which contains detailed instructions on how to compile and run your program on the CS3 server. Note that file uploading will be disabled automatically after 11:59PM of the due date, so please turn in your work on time.

For this assignment, you are going to implement the K-Means algorithm described in Chapter 7.4.1 of the textbook. You may use any programming language of your choice, as long as your program can be compiled and run on CS3.

Your program should take the input data file as a command line parameter, as shown below (I'm using Java for examples, but as stated earlier, you may use other programming languages):

java KMeans <inputFile>

The input file contains the records to be clustered. Each record occupies one line in the following format:

AttributeValue1,AttributeValue2,...,AttributeValueN,ClassLabel

You may assume that all attribute values are numeric, and the class labels are strings.

The parameter K for the algorithm should be the number of classes in the input file. Your program should perform basic K-Means clustering K^K/K! times and choose the best clustering based on SSE. The console output of your program should be the following (and nothing else):

The number of records in each cluster.
The SSE of the clustering.
The "accuracy" of the clustering.

We define the accuracy of the clustering as follows: we label each cluster with the class label of the majority class in the cluster, and if a record has the same label as the label of the cluster it belongs to, we consider the record "correctly clustered", and the accuracy of the clustering is simply the percentage of the correctly clustered records.

To receive full credit for this assignment, your implementation must achieve at least 75% accuracy on the Iris dataset.