CS522 HW3

Homework 3
CS522, Fall 2007

Due: Wednesday, November 14

Please upload your files to CSNS. The files should include all the source code, documentation (optional), and a text file hw3.txt, which contains detailed instructions on how to compile and run your program on the CS3 server. Note that file uploading will be disabled automatically after 11:59 PM on the due date, so please turn in your work on time.

[Readings]

Read Sections 2.6.1, 5.4, 8.3.1, 8.3.2 (up to GSP), and 6.1-6.3 of the textbook.

[Decision Tree Induction] (90pt)

For this assignment, you are going to implement a decision tree classifier as described in Section 6.3 of the textbook. You may use any programming language, as long as your program can be compiled and run on CS3.

Use the Forest CoverType dataset to test your classifier as follows (I'm going to use Java for examples, but as stated earlier, you may use other programming languages):

java DecisionTreeClassifier <TrainingSet> <TestSet> <ResultSet>

TrainingSet is the name of a text file that contains the records for training the classifier. In the case of the Forest CoverType dataset, the training set should include half of the records in the dataset. Note that each record occupies one line in the data file, the attributes of each record are separated by commas, and the last attribute is the class label.
TestSet contains the other half of the records in the Forest CoverType dataset, and the file has the same format as the training set. Note that the class labels of the records are included in the test set so you can calculate the success rate of your classifier, but you may not use these class labels for any other purpose.
ResultSet should be the same as the test set except that the class labels are replaced by the ones predicted by your classifier.
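
To illustrate the record format described above, here is a minimal Java sketch that splits a record line into its attributes and class label, and rebuilds the line with a different label, as needed for ResultSet. The class name RecordFormat, its method names, and the sample record are made up for illustration; they are not part of the assignment or of the dataset.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only: handles one comma-separated
// record line whose last field is the class label.
class RecordFormat {

    // Returns the class label, i.e. the last comma-separated field.
    static String classLabel(String line) {
        String[] fields = line.trim().split(",");
        return fields[fields.length - 1];
    }

    // Returns the attribute values, i.e. every field except the last one.
    static String[] attributes(String line) {
        String[] fields = line.trim().split(",");
        return Arrays.copyOf(fields, fields.length - 1);
    }

    // Returns the record with its class label replaced by the predicted one.
    static String withLabel(String line, String predicted) {
        String trimmed = line.trim();
        int lastComma = trimmed.lastIndexOf(',');
        return trimmed.substring(0, lastComma + 1) + predicted;
    }

    public static void main(String[] args) {
        String record = "2596,51,3,258,0,5"; // made-up, shortened record
        System.out.println(classLabel(record));
        System.out.println(Arrays.toString(attributes(record)));
        System.out.println(withLabel(record, "2"));
    }
}
```

Writing each test record through something like withLabel keeps the attribute values byte-for-byte identical to the test set, so only the final field differs.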

Your classifier should also print the percentage of correctly classified records to the console, e.g. 50%. Please do not print any debugging information in the submitted version.
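
The percentage of correctly classified records can be computed by comparing the predicted labels against the actual labels in the test set. The following is a rough sketch under the assumption that both sets of labels have already been collected into lists in the same record order; the class name Accuracy, the method percentCorrect, and the sample labels are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
class Accuracy {

    // Returns the percentage (0 to 100) of positions at which the predicted
    // label equals the actual label. Both lists must be in the same order.
    static double percentCorrect(List<String> actual, List<String> predicted) {
        if (actual.size() != predicted.size()) {
            throw new IllegalArgumentException("label lists differ in length");
        }
        if (actual.isEmpty()) {
            return 0.0; // avoid dividing by zero on an empty test set
        }
        int correct = 0;
        for (int i = 0; i < actual.size(); i++) {
            if (actual.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return 100.0 * correct / actual.size();
    }

    public static void main(String[] args) {
        // Made-up labels: three of the four predictions match.
        List<String> actual = Arrays.asList("1", "2", "2", "5");
        List<String> predicted = Arrays.asList("1", "2", "3", "5");
        System.out.printf("%.2f%%%n", percentCorrect(actual, predicted));
    }
}
```

Multiplying by 100.0 before dividing keeps the arithmetic in floating point, so the result is not truncated by integer division.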

Note that:

Use of existing classification code found online or from other sources will be considered cheating.
The five most accurate classifiers will receive up to 20% extra credit, and the five least accurate classifiers will receive a penalty of up to 20%.

[Extra Credit] (5pt)

Given n distinct items, how many sequences with n events are there? Please give your answer and explain your reasoning in hw3.txt.