Wednesday, November 30, 2016

Association Rule Mining: Example & R Code

Let's try to solve the following association rule mining problem.

Minimum support = 30%



Solution: Association Rule Mining

Transaction ID   Items Bought
 1               {Laptop, Printer, Tablet, Headset}
 2               {Printer, Monitor, Tablet}
 3               {Laptop, Printer, Tablet, Headset}
 4               {Laptop, Monitor, Tablet, Headset}
 5               {Printer, Monitor, Tablet, Headset}
 6               {Printer, Tablet, Headset}
 7               {Monitor, Tablet}
 8               {Laptop, Printer, Monitor}
 9               {Laptop, Tablet, Headset}
10               {Printer, Tablet}


Step 1: Calculate the minimum support count.
The minimum support is given as 30%. With 10 transactions, the minimum support count is 30% × 10 = 3.
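The same arithmetic in R (a one-off check, not part of the mining itself):

```r
# Minimum support count = support fraction applied to the number of
# transactions, rounded up to the next whole transaction.
n_transactions <- 10
min_support <- 0.30
min_support_count <- ceiling(min_support * n_transactions)
print(min_support_count)  # 3
```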

C1

Items       Support Count
{Laptop}    5
{Printer}   7
{Tablet}    9
{Headset}   6
{Monitor}   5

All five items meet the minimum support count of 3, so L1 is identical to C1.

L1

Items       Support Count
{Laptop}    5
{Printer}   7
{Tablet}    9
{Headset}   6
{Monitor}   5



C2

Items                 Support Count
{Laptop, Printer}     3
{Laptop, Tablet}      4
{Laptop, Headset}     4
{Laptop, Monitor}     2
{Printer, Tablet}     6
{Printer, Headset}    4
{Printer, Monitor}    3
{Tablet, Headset}     6
{Tablet, Monitor}     4
{Headset, Monitor}    2

{Laptop, Monitor} and {Headset, Monitor} fall below the minimum support count of 3, so they are pruned.

L2

Items                 Support Count
{Laptop, Printer}     3
{Laptop, Tablet}      4
{Laptop, Headset}     4
{Printer, Tablet}     6
{Printer, Headset}    4
{Printer, Monitor}    3
{Tablet, Headset}     6
{Tablet, Monitor}     4

C3

Items                         Support Count
{Laptop, Printer, Tablet}     2
{Laptop, Printer, Headset}    2
{Laptop, Tablet, Headset}     4
{Printer, Monitor, Tablet}    2
{Printer, Tablet, Headset}    4

{Laptop, Printer, Tablet}, {Laptop, Printer, Headset}, and {Printer, Monitor, Tablet} fall below the minimum support count of 3, so they are pruned.

L3

Items                         Support Count
{Laptop, Tablet, Headset}     4
{Printer, Tablet, Headset}    4

C4 is not generated: the only possible candidate, {Laptop, Printer, Tablet, Headset}, has subsets such as {Laptop, Printer, Tablet} that are not in L3, so it is pruned. The algorithm stops here.
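The pruning rule can be sketched in base R. The helper name all_subsets_frequent and the alphabetical item ordering are illustrative, not from the original post:

```r
# Apriori pruning: a candidate k-itemset is kept only if every one of
# its (k-1)-subsets is already frequent (i.e. appears in L(k-1)).
L3 <- list(c("Headset", "Laptop", "Tablet"),
           c("Headset", "Printer", "Tablet"))
candidate <- c("Headset", "Laptop", "Printer", "Tablet")

all_subsets_frequent <- function(cand, Lk) {
  subsets <- combn(cand, length(cand) - 1, simplify = FALSE)
  all(sapply(subsets, function(s) any(sapply(Lk, setequal, s))))
}

all_subsets_frequent(candidate, L3)  # FALSE -> candidate pruned, no C4
```

Here {Headset, Laptop, Printer} is a 3-subset of the candidate that is not in L3, so the check fails.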



Result:

Frequent itemsets

{Laptop}
{Printer}
{Tablet}
{Headset}
{Monitor}
{Laptop, Printer}
{Laptop, Tablet}
{Laptop, Headset}
{Printer, Tablet}
{Printer, Headset}
{Printer, Monitor}
{Tablet, Headset}
{Tablet, Monitor}
{Laptop, Tablet, Headset}
{Printer, Tablet, Headset}
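The hand-computed counts can be cross-checked with a small base-R helper (support_count is an illustrative name, not a library function):

```r
# The ten transactions from the problem statement
transactions <- list(
  c("Laptop", "Printer", "Tablet", "Headset"),
  c("Printer", "Monitor", "Tablet"),
  c("Laptop", "Printer", "Tablet", "Headset"),
  c("Laptop", "Monitor", "Tablet", "Headset"),
  c("Printer", "Monitor", "Tablet", "Headset"),
  c("Printer", "Tablet", "Headset"),
  c("Monitor", "Tablet"),
  c("Laptop", "Printer", "Monitor"),
  c("Laptop", "Tablet", "Headset"),
  c("Printer", "Tablet")
)

# Number of transactions containing every item of the given itemset
support_count <- function(itemset, trans) {
  sum(sapply(trans, function(t) all(itemset %in% t)))
}

support_count(c("Printer", "Tablet"), transactions)            # 6
support_count(c("Laptop", "Tablet", "Headset"), transactions)  # 4
support_count(c("Laptop", "Monitor"), transactions)            # 2 -> pruned
```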

Efficiency of Apriori vs FP-Growth Algorithm:

Each time the database is scanned to count the candidates in Ck and produce Lk, every transaction is read, even those that contain none of the candidate k-itemsets. For any large application, repeatedly reading data from disk is expensive.

The candidate space also explodes: a set of 100 items has 2^100 − 1 ≈ 1.27 × 10^30 possible itemsets, far too many for any computer to store and count.

The FP-Growth algorithm avoids this by scanning the database only twice, which makes it much more efficient and faster than Apriori:

1st scan: find the frequent single items.
2nd scan: construct the FP-tree, which is then mined recursively without any further database scans.




Implementation of the above problem using R programming language:
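The original post showed the code as a screenshot, which is lost. Below is a minimal sketch using the arules package (an assumption; install it with install.packages("arules")) that mines the same frequent itemsets computed by hand above:

```r
library(arules)  # assumes the arules package is installed

# The ten transactions from the problem statement
transactions <- list(
  c("Laptop", "Printer", "Tablet", "Headset"),
  c("Printer", "Monitor", "Tablet"),
  c("Laptop", "Printer", "Tablet", "Headset"),
  c("Laptop", "Monitor", "Tablet", "Headset"),
  c("Printer", "Monitor", "Tablet", "Headset"),
  c("Printer", "Tablet", "Headset"),
  c("Monitor", "Tablet"),
  c("Laptop", "Printer", "Monitor"),
  c("Laptop", "Tablet", "Headset"),
  c("Printer", "Tablet")
)
trans <- as(transactions, "transactions")

# Mine frequent itemsets at 30% minimum support (count >= 3 of 10)
itemsets <- apriori(trans,
                    parameter = list(supp = 0.3,
                                     target = "frequent itemsets"))

# Should list the 15 frequent itemsets from the result table above
inspect(sort(itemsets, by = "support"))
```

Setting target = "frequent itemsets" makes apriori() return itemsets rather than rules; to generate association rules instead, drop that setting and supply a conf (minimum confidence) value in the parameter list.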
