Homework 2
CS522, Winter 2012

Due: Friday, February 3 (Part I) and Friday, February 10 (Part II)


[Readings]

[Association Rule Mining with Apriori Algorithm]

For this assignment, you are going to implement the Apriori algorithm described in Chapter 6.2 of the textbook to mine strong association rules. You may use any programming language of your choice, as long as your program can be compiled and run on CS3.

Please send an email to csun@calstatela.edu and ask for an account on CS3. A number of programming languages are available on CS3, including C, C++, Java, and several scripting languages such as Perl, Python, and Ruby. Please try compiling and running a simple program on CS3 before you decide what language you are going to use.

Part I. Test Dataset (30pt)

In the first part of the assignment you are going to design a test dataset which you will use to verify the correctness of your code in Part II. Your test dataset must meet the following requirements:

Collaborating on the test dataset design is considered cheating. Anyone who submits a dataset that is identical or sufficiently similar to somebody else's will receive an automatic F for the course.

For this part please submit two files. The first file should contain the dataset in the input file format described in Part II. The second file should contain a step-by-step description of applying the Apriori algorithm to the dataset to find the frequent itemsets and the strong association rules with 30% minimum support and 60% minimum confidence.
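
As a hedged illustration of the two thresholds, the sketch below converts a relative minimum support into a support count and computes the confidence of a rule A -> B as count(A and B together) / count(A). The class name Thresholds, the method names, and the ten-transaction example are assumptions made for illustration only; they are not part of the assignment.

```java
// Hypothetical helper; all names here are illustrative, not required by the assignment.
class Thresholds {
    // Smallest support count that satisfies a relative minimum support,
    // i.e. ceil(numTransactions * percent / 100), done in integer arithmetic
    // to avoid floating-point rounding surprises.
    static int minSupportCount(int numTransactions, int minSupportPercent) {
        return (numTransactions * minSupportPercent + 99) / 100;
    }

    // Confidence of A -> B: transactions containing both A and B,
    // divided by transactions containing A.
    static double confidence(int countAB, int countA) {
        return (double) countAB / countA;
    }

    public static void main(String[] args) {
        int n = 10;                              // assumed dataset size
        int minCount = minSupportCount(n, 30);   // 30% minimum support
        System.out.println("min support count = " + minCount);

        // Assume A occurs in 5 transactions and A together with B in 4.
        double conf = confidence(4, 5);
        System.out.println("confidence = " + conf + ", strong = " + (conf >= 0.60));
    }
}
```

With a dataset whose size is not a multiple of 10, the rounding-up step matters: an itemset must appear in at least the computed number of transactions to reach 30% support.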

Part II. Algorithm Implementation (70pt)

Your implementation of the Apriori algorithm must take four command-line parameters as follows (I'm going to use Java for examples, but as stated earlier, you may use other programming languages):

java Apriori <inputFile> <outputFile> <minSupportCount> <minConfidence>
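
One possible way to read the four parameters is sketched below. The class name AprioriArgs and the range checks are assumptions; in particular, the sketch assumes that minSupportCount is a positive integer and that minConfidence is given as a fraction between 0 and 1, which the assignment does not state explicitly.

```java
// Hypothetical argument holder for:
//   java Apriori <inputFile> <outputFile> <minSupportCount> <minConfidence>
class AprioriArgs {
    final String inputFile;
    final String outputFile;
    final int minSupportCount;
    final double minConfidence;

    AprioriArgs(String[] args) {
        if (args.length != 4) {
            throw new IllegalArgumentException(
                "usage: java Apriori <inputFile> <outputFile> <minSupportCount> <minConfidence>");
        }
        inputFile = args[0];
        outputFile = args[1];
        // Both parse calls throw NumberFormatException on malformed input.
        minSupportCount = Integer.parseInt(args[2]);
        minConfidence = Double.parseDouble(args[3]);
        // Assumed ranges: count >= 1, confidence in [0, 1] (also rejects NaN).
        if (minSupportCount < 1 || !(minConfidence >= 0.0 && minConfidence <= 1.0)) {
            throw new IllegalArgumentException("thresholds out of range");
        }
    }
}
```

A main method could then start with `AprioriArgs a = new AprioriArgs(args);` and report the usage message if an exception is thrown.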

A -> B, SupportCount, Confidence

For example:

48 170 -> 38 39, 1193, 0.7662170841361593
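
A line in this format could be produced along the following lines. This is only a sketch: the class name RuleFormat and its methods are hypothetical, and it assumes that items are integers and that each side of the rule is printed as a space-separated list, as in the example above.

```java
// Hypothetical formatter for one output line: "A -> B, SupportCount, Confidence".
class RuleFormat {
    // Joins item IDs with single spaces, e.g. {48, 170} becomes "48 170".
    static String joinItems(int[] items) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < items.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(items[i]);
        }
        return sb.toString();
    }

    // Builds the full rule line; the confidence is printed with Java's
    // default double-to-string conversion.
    static String format(int[] antecedent, int[] consequent,
                         int supportCount, double confidence) {
        return joinItems(antecedent) + " -> " + joinItems(consequent)
                + ", " + supportCount + ", " + confidence;
    }

    public static void main(String[] args) {
        // Values copied from the example line in the assignment.
        System.out.println(format(new int[]{48, 170}, new int[]{38, 39},
                                  1193, 0.7662170841361593));
    }
}
```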

After you complete your program, please test it with the test dataset you designed in Part I to make sure the results are correct, and then test the program with a larger dataset (e.g. retail.txt [1]) to make sure its performance is reasonable.

Note that

For this part please submit your source code, documentation (optional), and a text file hw2.txt which contains detailed instructions on how to compile and run your program on CS3.

[1] retail.txt is a dataset provided by Tom Brijs. It contains anonymized market basket data from a Belgian retail store.