Homework 3
CS522, Fall 2007
Due: Wednesday, November 14
Please upload your files to CSNS.
The files should include all the source code, documentation (optional),
and a text file hw3.txt that contains detailed instructions on how to
compile and run your program on the CS3 server. Note that file uploading
will be disabled automatically after 11:59PM on the due date, so please
turn in your work on time.
[Readings]
- Read Chapter 2.6.1, 5.4, 8.3.1, 8.3.2 (up to GSP), 6.1-6.3
of the textbook.
[Decision Tree Induction] (90pt)
For this assignment, you are going to implement a decision tree
classifier as described in Chapter 6.3 of the textbook. You may use any
programming language of your choice, as long as your program can be
compiled and run on CS3. Use the Forest CoverType dataset to test your
classifier as follows (I'm going to use Java for examples, but as stated
earlier, you may use other programming languages):
    java DecisionTreeClassifier <TrainingSet> <TestSet> <ResultSet>
- TrainingSet is the name of a text file that contains the records for
training the classifier. In the case of the Forest CoverType dataset,
the training set should include half of the records in the dataset. Note
that each record occupies one line in the data file, the attributes of
each record are separated by commas, and the last attribute is the class
label.
- TestSet contains the other half of the records in the Forest CoverType
dataset, and the file has the same format as the training set. Note that
the class labels of the records are included in the test set so you can
calculate the success rate of your classifier, but you may not use these
class labels for any other purpose.
- ResultSet should be the same as the test set except that the class
labels are replaced by the ones predicted by your classifier.
Your classifier should also print to the console the percentage of correctly classified records, e.g. 50%. Please do not output any debugging information in the submitted version.
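To illustrate the record format and the accuracy output described above, here is a minimal Java sketch of the bookkeeping only, not of the classifier. The class name EvaluationSketch, the method names, the made-up four-field records, and the constant predictor are all assumptions for illustration; they are not part of the assignment or of the Forest CoverType dataset. In a real run the lines would be read from the TestSet file and written to the ResultSet file instead of being kept in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EvaluationSketch {

    // The class label is the text after the last comma of a record line.
    static String labelOf(String record) {
        return record.substring(record.lastIndexOf(',') + 1).trim();
    }

    // Copy of the record with its last attribute replaced by the given label.
    static String withLabel(String record, String label) {
        return record.substring(0, record.lastIndexOf(',') + 1) + label;
    }

    // Percentage of result lines whose label matches the test line
    // at the same position.
    static double accuracyPercent(List<String> testLines, List<String> resultLines) {
        if (testLines.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < testLines.size(); i++) {
            if (labelOf(testLines.get(i)).equals(labelOf(resultLines.get(i)))) {
                correct++;
            }
        }
        return 100.0 * correct / testLines.size();
    }

    // HYPOTHETICAL placeholder standing in for your decision tree:
    // it ignores the record and always predicts class "2".
    static String predict(String record) {
        return "2";
    }

    public static void main(String[] args) {
        // Made-up records (three attributes plus a class label), for illustration only.
        List<String> testLines = Arrays.asList(
                "10,20,30,5", "11,21,31,2", "12,22,32,1", "13,23,33,2");

        // Build the result set: same records, labels replaced by predictions.
        List<String> resultLines = new ArrayList<>();
        for (String line : testLines) {
            resultLines.add(withLabel(line, predict(line)));
        }

        // Two of the four made-up records have label "2", so this prints 50%.
        System.out.printf("%.0f%%%n", accuracyPercent(testLines, resultLines));
    }
}
```

Note that in this sketch the test labels are read only when computing the percentage; the predictor never looks at them, which matches the restriction on the use of test set labels.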
Note that:
- Use of existing classification code found online or from other sources
will be considered cheating.
- The five most accurate classifiers will receive up to 20% extra
credit, and the five least accurate classifiers will receive up to a 20%
penalty.
[Extra Credit] (5pt)
Given n distinct items, how many sequences with n events are there? Please give your answer and explain your reasoning in hw3.txt.