CS522 HW2

Homework 2
CS522, Fall 2007

Due: Wednesday, October 17

Please upload your files to CSNS. The files should include all the source code, documentation (optional), and a text file hw2.txt, which contains detailed instructions on how to compile and run your program on the CS3 server. Note that file uploading will be disabled automatically after 11:59PM of the due date, so please turn in your work on time.

[Readings]

Read Chapter 5.1 and 5.2 of the textbook.

[Apriori Algorithm Implementation] (60pt)

For this assignment, you are going to write a frequent itemset mining program using the Apriori algorithm described in Chapter 5.2.1 of the textbook. You may use any programming language of your choice, as long as your program can be compiled and run on CS3.

Please send an email to csun@calstatela.edu and ask for an account on CS3. A number of programming languages are available on CS3, including C, C++, Object-C, Java, and several scripting languages such as Perl, Python, and Ruby. Please try compiling and running a simple program on CS3 first before you decide what language you are going to use.

You program must take three command line parameters as following (I'm going to use Java for examples, but as stated earlier, you may use other programming languages):

java FrequentSetMiner <inputFile> <outputFile> <minSupportCount>

inputFile is the name of a text file that contains the transactions. Each transaction occupies one line, and the items in each transaction are separated by white spaces. You may assume that the items are represented by integers starting from 0. Two sample input files, test.txt and retail.txt [1] are available.
outputFile is the name of a text file that contains the frequent itemsets found. Each frequent itemset occupies one line, which includes the items separated by white spaces, followed by a colon, and then the support count of the itemset, e.g.

38 48 110 : 1361

minSupportCount is the minimum support count threshold, which is an integer greater than 0, e.g. 1000.

After you complete your program, please test it with a small dataset (e.g. test.txt) to make sure the results are correct, and then test it with a larger dataset (e.g. retail.txt) to make sure its performance is reasonable.

Note that

Use of existing Apriori code or other frequent pattern mining code found online or from other sources will be considered cheating.
I will time the running time of each program. The five fastest programs will receive up to 20% extra credit, and the five slowest programs and the programs that cannot finish the tests due to code inefficiency (e.g. memory overflow) will receive up to 20% penalty.

[1] retail.txt is a dataset provided by Tom Brijs and contains the anonymized retail market basket data from an anonymous Belgian retail store.