Homework 2
CS522, Winter 2011
Due: Wednesday, February 2
Please upload your files to CSNS.
The files should include all the source code, documentation (optional),
and a text file hw2.txt,
which contains detailed instructions on how to compile and run your
program on the CS3 server. Note that file
uploading will be disabled automatically after 11:59PM
of the due date, so please turn in your work on time.
[Readings]
- Read Chapter 5.1 and 5.2 of the textbook.
[Association Rule Mining with Apriori Algorithm] (80pt)
For this assignment, you are going to implement the Apriori Algorithm described in Chapter
5.2 of the textbook to mine strong association rules. You may use any programming language
of your choice, as long
as your program can be compiled and run on CS3.
Please send an email to csun@calstatela.edu
and ask for an account on CS3.
A number of programming languages are available on CS3, including C,
C++, Java, and several scripting languages such as Perl,
Python, and Ruby. Please try
compiling and running a simple program on CS3 before you decide
what language you are going to use.
You program must take four command line parameters as follows (I'm
going to use Java for examples, but as stated earlier, you may use
other programming languages):
java Apriori
<inputFile> <outputFile> <minSupportCount> <minConfidence>
- inputFile is the name of a text file that contains the
transactions. Each transaction occupies one line, and the items in each
transaction are separated by white spaces. You may assume that the
items are represented by integers starting from 0. Two sample input
files, test.txt
and retail.txt [1]
are available.
- outputFile is the name of a text file that contains
the strong association rules discovered. Each association rule occupies one line in the format of
A -> B, SupportCount, Confidence
For example:
48 170 -> 38 39, 1193, 0.7662170841361593
- minSupportCount is the minimum support count, which is an integer greater than 0, e.g. 1000.
- minConfidence is the minimum confidence, which is a value between 0 and 1, e.g. 0.7.
After you complete your program, please test it with a small dataset
(e.g. test.txt) to make sure the results are correct, and then test it
with a larger dataset (e.g. retail.txt) to make sure its performance is reasonable.
Note that
- Use of existing Apriori code or other frequent pattern
mining code found online or from other sources will be considered
cheating.
- Your program may not rely on any external tools or databases.
- To receive full credit, your program must be able to mine the retail dataset (i.e. retail.txt) with minSupportCount=1000 and minConfidence=0.7 within 30 seconds on CS3.
- I will time the running time of each program. The five
fastest
programs will receive up to 15% extra credit, and the five slowest
programs and the programs that cannot finish the tests due to code
inefficiency (e.g. memory overflow) will receive up to 30% extra penalty.
[1] retail.txt is a dataset provided by Tom Brijs and contains the anonymized retail market basket data from an anonymous Belgian retail store.