CS454, Summer 2009

1. Consider the following three transactions:

T1: {a1,a2,...,a99,a100}

T2: {a1,a2,...,a74,a75}

T3: {a1,a2,...,a49,a50}

Let minimum support count min_sup = 2.

(a) How many frequent itemsets are there?

(b) List all the closed itemset(s).

(c) List all the maximal itemset(s).
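The distinction between frequent, closed, and maximal itemsets can be checked by brute force on a scaled-down analogue of the three nested transactions above (here over a1..a5, since enumerating subsets of a1..a100 is infeasible); this is an illustrative sketch, not the answer to the exercise:

```python
from itertools import combinations

# Toy analogue of Exercise 1: three nested transactions over a1..a5.
transactions = [
    {"a1", "a2", "a3", "a4", "a5"},
    {"a1", "a2", "a3", "a4"},
    {"a1", "a2", "a3"},
]
min_sup = 2
items = sorted(set().union(*transactions))

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Enumerate every non-empty itemset and keep the frequent ones.
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(set(combo))
        if s >= min_sup:
            frequent[frozenset(combo)] = s

# Closed: no proper superset has the same support.
closed = [x for x in frequent
          if not any(x < y and frequent[y] == frequent[x] for y in frequent)]
# Maximal: no proper superset is frequent at all.
maximal = [x for x in frequent if not any(x < y for y in frequent)]

print(len(frequent))                       # number of frequent itemsets
print(sorted(sorted(c) for c in closed))   # closed frequent itemsets
print(sorted(sorted(m) for m in maximal))  # maximal frequent itemsets
```

Note how the counts behave: every non-empty subset of the second transaction is frequent, but only two itemsets are closed and only one is maximal.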

2. Suppose we want to build a Decision Tree Classifier to classify animals based on the following training set. Use Gain Ratio with Entropy to determine the best split point for the root node of the decision tree.

| Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class |
|------|------------------|------------|-------------|------------------|-----------------|----------|------------|-------|
| Human | warm-blooded | hair | yes | no | no | yes | no | mammal |
| Python | cold-blooded | scales | no | no | no | no | yes | reptile |
| Salmon | cold-blooded | scales | no | yes | no | no | no | fish |
| Whale | warm-blooded | hair | yes | yes | no | no | no | mammal |
| Frog | cold-blooded | none | no | semi | no | yes | yes | amphibian |
| Komodo Dragon | cold-blooded | scales | no | no | no | yes | no | reptile |
| Bat | warm-blooded | hair | yes | no | yes | yes | yes | mammal |
| Pigeon | warm-blooded | feathers | no | no | yes | yes | no | bird |
| Cat | warm-blooded | fur | yes | no | no | yes | no | mammal |
| Leopard Shark | cold-blooded | scales | yes | yes | no | no | no | fish |
| Turtle | cold-blooded | scales | no | semi | no | yes | no | reptile |
| Penguin | warm-blooded | feathers | no | semi | no | yes | no | bird |
| Porcupine | warm-blooded | quills | yes | no | no | yes | yes | mammal |
| Eel | cold-blooded | scales | no | yes | no | no | no | fish |
| Salamander | cold-blooded | none | no | semi | no | yes | yes | amphibian |
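The quantities involved in the split evaluation (entropy, information gain, and gain ratio) can be sketched as follows; the function names are mine, and only two of the attributes are included here for brevity:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr_index, labels):
    """Information gain of a multiway split on attr_index, divided by its split info."""
    n = len(rows)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[attr_index] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

# Each row: (Body Temperature, Gives Birth), in the table's record order.
rows = [
    ("warm-blooded", "yes"), ("cold-blooded", "no"), ("cold-blooded", "no"),
    ("warm-blooded", "yes"), ("cold-blooded", "no"), ("cold-blooded", "no"),
    ("warm-blooded", "yes"), ("warm-blooded", "no"), ("warm-blooded", "yes"),
    ("cold-blooded", "yes"), ("cold-blooded", "no"), ("warm-blooded", "no"),
    ("warm-blooded", "yes"), ("cold-blooded", "no"), ("cold-blooded", "no"),
]
labels = ["mammal", "reptile", "fish", "mammal", "amphibian", "reptile",
          "mammal", "bird", "mammal", "fish", "reptile", "bird",
          "mammal", "fish", "amphibian"]

print(gain_ratio(rows, 0, labels))  # Body Temperature
print(gain_ratio(rows, 1, labels))  # Gives Birth
```

To find the best root split, compute the gain ratio for every attribute this way and pick the largest.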

3. Suppose you are given the web server log of a website to perform some data mining tasks. The website consists of 7 web pages: P1, P2, ..., P7. The topology of the website is shown in Figure 1, where a link from Pi to Pj indicates that there is a hyperlink on Pi pointing to Pj.

Figure 1. Website Topology

We define a session as an ordered sequence of pages visited by one user, where the time between consecutive page visits is below a certain threshold (e.g. 30 minutes). After preprocessing the data, we identify the following sessions from the server log:

S1: <P1,P2,P5,P2,P7,P3>

S2: <P1,P4,P5,P1,P6,P7>

S3: <P1,P6,P1,P4>

S4: <P5,P4,P1,P6,P7>

S5: <P5,P3>

S6: <P1,P2,P7,P3>

S7: <P2,P7>

S8: <P1,P2,P4,P1,P2,P6,P7,P3>

Let min_sup=3. Use the GSP algorithm to discover the frequent sequences, which in this case are navigation paths frequently followed by the users.
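The core of the computation can be sketched as below. This is a simplified GSP for single-page elements (plain subsequence support counting and a naive suffix/prefix join, pruning by support only), not the full algorithm with hashing and itemset elements:

```python
sessions = [
    ["P1", "P2", "P5", "P2", "P7", "P3"],
    ["P1", "P4", "P5", "P1", "P6", "P7"],
    ["P1", "P6", "P1", "P4"],
    ["P5", "P4", "P1", "P6", "P7"],
    ["P5", "P3"],
    ["P1", "P2", "P7", "P3"],
    ["P2", "P7"],
    ["P1", "P2", "P4", "P1", "P2", "P6", "P7", "P3"],
]
min_sup = 3

def is_subsequence(seq, session):
    # Order-preserving (not necessarily contiguous) containment check.
    it = iter(session)
    return all(any(page == x for x in it) for page in seq)

def support(seq):
    return sum(1 for s in sessions if is_subsequence(seq, s))

# Level 1: frequent single pages.
pages = sorted({p for s in sessions for p in s})
freq = {1: [(p,) for p in pages if support((p,)) >= min_sup]}

# Level k: join frequent (k-1)-sequences whose suffix and prefix overlap,
# then keep only candidates meeting min_sup.
k = 2
while freq[k - 1]:
    candidates = {a + (b[-1],) for a in freq[k - 1] for b in freq[k - 1]
                  if a[1:] == b[:-1]}
    freq[k] = sorted(c for c in candidates if support(c) >= min_sup)
    k += 1

for level, seqs in freq.items():
    if seqs:
        print(level, seqs)
```

Note that a sequence such as <P1,P1> can legitimately be frequent here, since several sessions visit P1 twice.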

4. Use the dataset in Exercise 3. Let similarity(P4,P6) = similarity(P5,P7) = 0.9; for any other pair of pages, similarity(Pi,Pj) = 1 if i = j and similarity(Pi,Pj) = 0 if i != j. Use the Needleman-Wunsch Algorithm ([LuDM05]) to determine the best alignment of S3 and S4.
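A sketch of the dynamic-programming recurrence, using the similarity function above. The gap score of 0 is my assumption; [LuDM05] may use a different penalty, which would change the optimal alignment:

```python
sim = {("P4", "P6"): 0.9, ("P6", "P4"): 0.9,
       ("P5", "P7"): 0.9, ("P7", "P5"): 0.9}

def similarity(a, b):
    return 1.0 if a == b else sim.get((a, b), 0.0)

def needleman_wunsch(s, t, gap=0.0):
    """Global alignment maximizing total similarity; gap score is an assumption."""
    n, m = len(s), len(t)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i-1][j-1] + similarity(s[i-1], t[j-1]),
                              score[i-1][j] + gap,
                              score[i][j-1] + gap)
    # Traceback to recover one optimal alignment (diagonal moves preferred).
    aligned, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i-1][j-1] + similarity(s[i-1], t[j-1]):
            aligned.append((s[i-1], t[j-1])); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i-1][j] + gap:
            aligned.append((s[i-1], "-")); i -= 1
        else:
            aligned.append(("-", t[j-1])); j -= 1
    aligned.reverse()
    return score[n][m], aligned

S3 = ["P1", "P6", "P1", "P4"]
S4 = ["P5", "P4", "P1", "P6", "P7"]
best, alignment = needleman_wunsch(S3, S4)
print(best)
print(alignment)
```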

5. Use the dataset in Exercise 3 to build a Markov Model as in [LuDM05]. In this model, each page is a state, and there are also a start state and an end state. The transitional probability from Pi to Pj is the probability that a user navigates from Pi to Pj. The transitional probability from the start state S to Pi is the probability that Pi is the entry point of a user (i.e. the first page in a session), and the transitional probability from Pi to the end state E is the probability that Pi is the exit point of a user (i.e. the last page of a session).
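Estimating the transitional probabilities is a counting exercise; a minimal sketch using the sessions from Exercise 3, with the start and end states written as "S" and "E" as in the text:

```python
from collections import Counter

sessions = [
    ["P1", "P2", "P5", "P2", "P7", "P3"],
    ["P1", "P4", "P5", "P1", "P6", "P7"],
    ["P1", "P6", "P1", "P4"],
    ["P5", "P4", "P1", "P6", "P7"],
    ["P5", "P3"],
    ["P1", "P2", "P7", "P3"],
    ["P2", "P7"],
    ["P1", "P2", "P4", "P1", "P2", "P6", "P7", "P3"],
]

# Count every observed transition, padding each session with S and E.
counts = Counter()
outgoing = Counter()
for s in sessions:
    path = ["S"] + s + ["E"]
    for a, b in zip(path, path[1:]):
        counts[(a, b)] += 1
        outgoing[a] += 1

def prob(a, b):
    """Maximum-likelihood estimate of the transition probability a -> b."""
    return counts[(a, b)] / outgoing[a] if outgoing[a] else 0.0

print(prob("S", "P1"))  # probability that a session starts at P1
print(prob("P2", "P7"))
```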

6. We define a transaction as an unordered set of pages visited by a user in the same session, so the dataset in Exercise 3 becomes the following transactions:

T1: {P1,P2,P3,P5,P7}

T2: {P1,P4,P5,P6,P7}

T3: {P1,P4,P6}

T4: {P1,P4,P5,P6,P7}

T5: {P3,P5}

T6: {P1,P2,P3,P7}

T7: {P2,P7}

T8: {P1,P2,P3,P4,P6,P7}

Let min_sup=3.

(a) Use the Apriori Algorithm to discover all the frequent itemsets. Note that in this case, the frequent itemsets are the sets of pages that are often visited together by a user.

(b) Draw the Frequent Itemsets Graph as shown in [MobasherDLN01].

7. Use the dataset in Exercise 6. Let min_sup=3 and the minimum confidence min_conf=70%. Use the Apriori Algorithm to discover all the strong association rules. Note that in this case, a strong association rule A->B tells us that a user who visits a certain set of pages A is highly likely to also visit the set of pages B.
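Both the frequent-itemset discovery of Exercise 6 and the rule generation here can be sketched in a few lines. This is a simplified level-wise search (candidates are extended one item at a time and pruned by support only, without the full Apriori subset-pruning step):

```python
from itertools import combinations

transactions = [
    {"P1", "P2", "P3", "P5", "P7"},
    {"P1", "P4", "P5", "P6", "P7"},
    {"P1", "P4", "P6"},
    {"P1", "P4", "P5", "P6", "P7"},
    {"P3", "P5"},
    {"P1", "P2", "P3", "P7"},
    {"P2", "P7"},
    {"P1", "P2", "P3", "P4", "P6", "P7"},
]
min_sup, min_conf = 3, 0.7

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Level-wise search: extend each frequent k-itemset by one item at a time.
items = sorted(set().union(*transactions))
frequent = {frozenset([i]): support(frozenset([i]))
            for i in items if support(frozenset([i])) >= min_sup}
level = list(frequent)
while level:
    next_level = []
    for itemset in level:
        for i in items:
            cand = itemset | {i}
            if len(cand) == len(itemset) + 1 and cand not in frequent:
                s = support(cand)
                if s >= min_sup:
                    frequent[cand] = s
                    next_level.append(cand)
    level = next_level

# Strong rules A -> B: confidence = sup(A ∪ B) / sup(A).
rules = []
for itemset, s in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), r)):
            conf = s / frequent[antecedent]
            if conf >= min_conf:
                rules.append((sorted(antecedent), sorted(itemset - antecedent), conf))

print(len(frequent), len(rules))
```

The lookup `frequent[antecedent]` is always safe because, by the Apriori property, every subset of a frequent itemset is itself frequent and was therefore found at an earlier level.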

8. [FuBH00] proposed two methods for recommending web pages to a user:

- Rank the discovered frequent itemsets based on the user's navigation history, and make a recommendation based on the highest ranked frequent itemset(s).
- Find the closest neighbor of this user (i.e. the user who is most similar to this user), and recommend the pages visited by the closest neighbor to this user.

Use the dataset in Exercise 6. Suppose we know that each transaction is made by a different user, and we use Jaccard Coefficient as the similarity measure.

(a) Use the frequent itemset ranking method to recommend a page to the user who made T3.

(b) Use the closest neighbor method to recommend a page to the user who made T3.
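The closest-neighbor step of part (b) reduces to computing the Jaccard Coefficient between T3 and every other transaction; a minimal sketch, where I take the recommendation to be the pages the neighbor visited that the target user has not:

```python
def jaccard(a, b):
    """Jaccard Coefficient: |intersection| / |union|."""
    return len(a & b) / len(a | b)

transactions = {
    "T1": {"P1", "P2", "P3", "P5", "P7"},
    "T2": {"P1", "P4", "P5", "P6", "P7"},
    "T3": {"P1", "P4", "P6"},
    "T4": {"P1", "P4", "P5", "P6", "P7"},
    "T5": {"P3", "P5"},
    "T6": {"P1", "P2", "P3", "P7"},
    "T7": {"P2", "P7"},
    "T8": {"P1", "P2", "P3", "P4", "P6", "P7"},
}

target = "T3"
# Ties are broken by transaction order (T2 before T4 here).
neighbor = max((t for t in transactions if t != target),
               key=lambda t: jaccard(transactions[target], transactions[t]))
recommend = transactions[neighbor] - transactions[target]
print(neighbor, sorted(recommend))
```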

9. Use the K-Medoids clustering algorithm with Jaccard Coefficient as the similarity measure ([SilvaLCT06] and lecture notes) to divide the transactions in Exercise 6 into two clusters.
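Since there are only 8 transactions and k = 2, the medoid search can be done exhaustively over all 28 pairs; this sketch therefore skips the usual iterative swap phase of K-Medoids (PAM) but minimizes the same cost function:

```python
from itertools import combinations

def jaccard_dist(a, b):
    # Distance form of the Jaccard Coefficient.
    return 1 - len(a & b) / len(a | b)

transactions = [
    frozenset({"P1", "P2", "P3", "P5", "P7"}),
    frozenset({"P1", "P4", "P5", "P6", "P7"}),
    frozenset({"P1", "P4", "P6"}),
    frozenset({"P1", "P4", "P5", "P6", "P7"}),
    frozenset({"P3", "P5"}),
    frozenset({"P1", "P2", "P3", "P7"}),
    frozenset({"P2", "P7"}),
    frozenset({"P1", "P2", "P3", "P4", "P6", "P7"}),
]

def total_cost(medoids):
    # Each transaction is assigned to its nearest medoid; cost is the distance sum.
    return sum(min(jaccard_dist(t, m) for m in medoids) for t in transactions)

best = min(combinations(transactions, 2), key=total_cost)

clusters = {0: [], 1: []}
for i, t in enumerate(transactions):
    d = [jaccard_dist(t, m) for m in best]
    clusters[d.index(min(d))].append(f"T{i+1}")
print(clusters)
```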

10. Use the dataset in Exercise 6 and the structure of the website shown in Figure 1. Let min_sup=3 and the minimum "interestingness" threshold=0.2. Use the techniques proposed in [CooleyTS99] to discover all the interesting frequent itemsets.

11. Develop an algorithm to find the largest continuous intersection between two sequences as defined in [SpiliopoulouMBN03].

12. The sessions in Exercise 3 are reconstructed from the web server log. Suppose the following are the real user sessions:

S1: <P1,P2,P5>

S2: <P1,P2,P7,P3>

S3: <P1,P4,P5,P1,P6,P7>

S4: <P1,P6,P1,P4,P5,P4,P1,P6,P7>

S5: <P5,P3,P1,P2,P7,P3>

S6: <P2,P7>

S7: <P1,P2,P4,P1,P2,P6,P7,P3>

Compute M_cr, M_crs, M_cre, M_crse, M_o, and M_s as defined in [SpiliopoulouMBN03].

13. Suppose the sessions in Exercise 3 are identified from the web server log using heuristics based on page stay time and session duration time.

(a) Apply the second phase of Smart-SRA [BayirTCF09] to discover the set of sessions that also satisfy the topology and maximality requirements. You may assume that the timestamp requirements are always satisfied, i.e. there's no need to recheck page stay time.

(b) Let min_sup=3. Use the Sequential Apriori algorithm [BayirTCF09] to discover the frequent patterns.

14. Exercise 2 lists fifteen records. Use the first twelve records as a training set and the last three (i.e. Porcupine, Eel, and Salamander) for testing. Use Naive Bayesian Classification to classify the testing records.

15. Exercise 2 lists fifteen records. Use the first twelve records as a training set and the last three (i.e. Porcupine, Eel, and Salamander) for testing. Use kNN to classify the testing records.
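A kNN sketch over the first twelve records. The choice of k = 3 and of simple-matching distance for the categorical attributes are my assumptions; the lecture notes may prescribe different ones:

```python
from collections import Counter

# Training set: first 12 records, attributes in the order of the Exercise 2
# table (Name and Class excluded from the feature tuple).
train = [
    (("warm-blooded", "hair", "yes", "no", "no", "yes", "no"), "mammal"),
    (("cold-blooded", "scales", "no", "no", "no", "no", "yes"), "reptile"),
    (("cold-blooded", "scales", "no", "yes", "no", "no", "no"), "fish"),
    (("warm-blooded", "hair", "yes", "yes", "no", "no", "no"), "mammal"),
    (("cold-blooded", "none", "no", "semi", "no", "yes", "yes"), "amphibian"),
    (("cold-blooded", "scales", "no", "no", "no", "yes", "no"), "reptile"),
    (("warm-blooded", "hair", "yes", "no", "yes", "yes", "yes"), "mammal"),
    (("warm-blooded", "feathers", "no", "no", "yes", "yes", "no"), "bird"),
    (("warm-blooded", "fur", "yes", "no", "no", "yes", "no"), "mammal"),
    (("cold-blooded", "scales", "yes", "yes", "no", "no", "no"), "fish"),
    (("cold-blooded", "scales", "no", "semi", "no", "yes", "no"), "reptile"),
    (("warm-blooded", "feathers", "no", "semi", "no", "yes", "no"), "bird"),
]

def distance(a, b):
    # Simple matching distance: number of attributes that differ.
    return sum(x != y for x, y in zip(a, b))

def knn(query, k=3):
    nearest = sorted(train, key=lambda rec: distance(rec[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

porcupine = ("warm-blooded", "quills", "yes", "no", "no", "yes", "yes")
eel = ("cold-blooded", "scales", "no", "yes", "no", "no", "no")
print(knn(porcupine))
print(knn(eel))
```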

16. Use the data in the BBN example in the lecture notes to calculate the following probabilities:

(a) P(HeartDisease=Yes|Diet=Healthy)

(b) P(HeartDisease=Yes|BloodPressure=High,ChestPain=Yes)

17. Suppose we use the Apriori Algorithm to discover frequent itemsets from the transactions shown in Exercise 6. For each of the five different approaches proposed in [JinYA04], calculate the maximum size of the memory needed for the Reduction Object(s) during the discovery process. For this exercise, use the following parameters:

- min_sup = 3
- Each item can be represented by an integer (i.e. 4 bytes)
- Each lock can be represented by an integer (i.e. 4 bytes)
- For the Fixed Locking approach, the number of locks used is 8.
- The size of a cache block is 64 bytes.