Market Basket Analysis, Collaborative Filtering - Unsupervised Algorithm - Part 4
Market Basket Analysis (MBA) and Collaborative Filtering (CF)
Market Basket Analysis
Market Basket Analysis is one of the most common and useful data mining and analysis methods. It uses association rule mining to identify products that are frequently bought together. The underlying objective of association rule mining is to find associations between different items in a set. In the case of market basket analysis, the items are the products purchased by a customer and the set is a transaction. In short, market basket analysis
- Is an unsupervised data mining technique
- Uncovers products which are frequently bought together
- Creates if-then scenario rules
Figure-1: Various use cases of MBA
It finds applications in diverse areas such as manufacturing, telecom, insurance, retail marketing, medicine, banking, education, restaurants, etc. These organizations use association rule analysis to decide how to bundle their products to increase sales profit; to design catalogues, menu layouts and store displays; to plan inventory control and promotional/advertisement campaigns; to determine the timing of product groups; and so on.
When you browse e-commerce sites like Amazon, you will find similar and associated products grouped together. Marketing firms also find associations between different products and customers. They learn which product advertisements are effective in promoting their sales. MBA can be used to analyze which age, region and income groups will buy which sets of items. In short, MBA is useful for anticipating customer behaviour.
Besides increasing sales profits, association rules can also be used in other fields. In medical diagnosis, for instance, understanding which symptoms tend to be comorbid (in medicine, comorbid means co-occurring) can help improve patient care and prescription. By studying the co-occurrence of side effects in patients with multiple prescriptions, a hospital can find previously unknown drug interactions and use this information to warn patients.
Finding Association Rules
This is the process of finding associations between different items in the available data. It discovers the probability of the co-occurrence of items in a collection, such as: people who buy item X also tend to buy item Y. The following figure illustrates the concept.
Figure-2: Association rule mining
Other Application Areas
Although market
basket analysis generally conjures up pictures of shopping carts and
supermarket shoppers, it is important to realize that there are many other
areas in which it can be applied. These include:
- Analysis of credit card purchases.
- Analysis of telephone calling patterns.
- Identification of fraudulent medical insurance claims (consider cases where common rules are broken).
- Analysis of telecom service purchases.
- Etc.
As mentioned earlier, MBA uses association rule learning algorithms. The most common algorithms are:
1. Apriori
2. Eclat
The Problem
When we go grocery
shopping, we often have a standard list of things to buy. Each shopper
has a distinctive
list, depending on one’s needs and preferences.
A housewife might
buy healthy ingredients for a family dinner.
A bachelor might
buy beer and chips.
Figure-3: Frequent Itemsets
Understanding
these buying patterns can help to increase sales in several ways.
If there is a pair
of items, X and Y, that are frequently bought together:
- Both X and Y can be placed on the same shelf, so that buyers of one item would be prompted to buy the other.
- Promotional discounts could be applied to just one out of the two items.
- Advertisements on X could be targeted at buyers who purchase Y.
- X and Y could be combined into a new product, such as having toothpaste with toothbrush.
While
we may know that certain items are frequently bought together, the question is,
how do we uncover these associations from a large new dataset on which no
analysis has been done?
In order to answer
such questions, we need to learn some definitions.
Definitions
1. Itemset: A collection of one or more items; a k-itemset is a set of k items.
Figure-4: Itemsets for different values of k.
2. Support Count: The frequency of occurrence of an itemset X.
3. Support(X): The fraction of transactions that contain the itemset X:
Support(X) = frequency(X) / N
where N is the total number of transactions.
Association rule analysis
This is a technique to uncover how items are associated with each other. There are three common ways to measure association.
Measure 1: Support. This says how popular an itemset is, as measured by the proportion of transactions in which the itemset appears. The range of this measure is [0,1].
The support of an item X is simply the ratio of the number of transactions in which X appears to the total number of transactions.
Consider the example of a group of items: Apple, Beer, Chicken, Sugar, Milk and Pear.
Example
Figure-5: Example Transactions
In Figure-5 above, the support of {apple} is 4 out of 8, or 50%.
Frequent item-set mining
This is data mining that focuses on sets of items or sequences of actions or events (for example, it can even be the order in which we get dressed). A set of items (a collection of one or more items) that appears in many baskets is said to be "frequent".
Pairs of items purchased together may provide insight into customer purchase behaviour. Frequent item-sets can contain multiple items. For instance, the support of {apple, beer, rice} is 2 out of 8, or 25%. When there are more than two items in an itemset, the analysis is called multidimensional MBA.
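Both of the support figures just quoted can be reproduced with a short Python sketch. The eight transactions below are hypothetical, chosen only so that the counts match the numbers quoted from Figure-5 (4 out of 8 for {apple}, 2 out of 8 for {apple, beer, rice}):
```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    count = sum(1 for t in transactions if itemset.issubset(t))
    return count / len(transactions)

# Hypothetical transactions, chosen to reproduce the support/confidence/lift
# values discussed in this post.
transactions = [
    {"apple", "beer", "rice", "chicken"},
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "pear"},
    {"milk", "beer", "rice", "chicken"},
    {"milk", "beer", "rice"},
    {"milk", "beer"},
    {"milk", "pear"},
]

print(support({"apple"}, transactions))                  # 0.5  -> 4 out of 8
print(support({"apple", "beer", "rice"}, transactions))  # 0.25 -> 2 out of 8
```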
Support Threshold
If the sales of items beyond a certain proportion tend to have a significant impact on business profits, that proportion can be used as a support threshold. This value (which ranges from 0 to 1) is used to identify item-sets with support values above the threshold; these are called significant itemsets. Note that the support metric is defined for itemsets, not for association rules.
Figure-6: Setting a minimum support threshold
Minimum support threshold:
Item-sets whose support is greater than or equal to the minimum support threshold (min_sup) are retained. The threshold is set by the user; as an example, min_sup could be 0.3.
As an exercise, find the item-sets with a minimum support of 0.3 in the following transactions.
TID | Items
T1  | M, O, N, K, E, Y
T2  | D, O, N, K, E, Y
T3  | M, A, K, E
T4  | M, U, C, K, Y
T5  | C, O, O, K, I, E
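One way to check your answer is to enumerate candidate item-sets by brute force, which is feasible here because there are only a handful of distinct items. The sketch below is a simple, non-optimized illustration:
```python
from itertools import combinations

exercise_transactions = {
    "T1": {"M", "O", "N", "K", "E", "Y"},
    "T2": {"D", "O", "N", "K", "E", "Y"},
    "T3": {"M", "A", "K", "E"},
    "T4": {"M", "U", "C", "K", "Y"},
    "T5": {"C", "O", "K", "I", "E"},  # duplicate items in a basket count once
}
min_sup = 0.3
items = sorted(set().union(*exercise_transactions.values()))
n = len(exercise_transactions)

# Enumerate every candidate itemset of size 1, 2, 3, ... and keep those
# whose support meets the minimum support threshold.
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):
        count = sum(1 for t in exercise_transactions.values() if set(candidate) <= t)
        if count / n >= min_sup:
            print(set(candidate), round(count / n, 2))
```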
Measure 2: Confidence.
This is a measure of how likely item Y is to be purchased when item X is purchased, expressed as the rule {X => Y}. The range of this measure is [0,1].
It is measured as the proportion of transactions containing item X in which item Y also appears.
In Figure-5, the confidence of {apple => beer} is 3 out of 4, or 75%. That is, it is the ratio of the number of transactions in which apple and beer were purchased together to the number of transactions containing apple. This measure takes into account the popularity of the antecedent item (apple).
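Reusing the `support` helper and the hypothetical `transactions` list from the earlier sketch, confidence is simply a ratio of two support values:
```python
def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

print(confidence({"apple"}, {"beer"}, transactions))  # 0.75 -> 3 out of 4
```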
Support and Confidence
Consider two sets of items A and B, assumed to be disjoint. The support and confidence measures for the rule between A and B are expressed as
A => B [Support, Confidence]
The part before => is referred to as the if part (antecedent) and the part after => as the then part (consequent). The confidence of a rule is the probability of observing the consequent in a transaction given that it contains the antecedent. This metric is directed, not symmetric: Confidence(A => B) is in general different from Confidence(B => A).
Figure-7: Antecedent and Consequent
Here A and B are sets of items in the transaction data. Consider the purchase of a computer and anti-virus software, expressed as
Computer => Anti-virus Software [Support = 20%, Confidence = 60%]
The above rule says:
- 20% of all transactions contain the itemset {computer, anti-virus software}.
- 60% of the transactions containing a computer also contain anti-virus software.
Support and confidence measure how significant an association rule is. Significance is judged against minimum support and minimum confidence thresholds, which let the user compare rule strength according to their own requirements.
Strong rules: A rule A => B [Support, Confidence] that satisfies both min_sup and min_confidence is called a strong rule.
Drawback of confidence measure
One
drawback of the confidence measure is that it might misrepresent the importance
of an association. This is because it only accounts for how popular apples are,
but not beers. If beers are also very popular in general, there will be a
higher chance that a transaction containing apples will also contain beers,
thus inflating the confidence measure. To account for the base popularity of
both constituent items, we use a third measure called lift.
Measure 3: Lift.
Lift: Lift gives the correlation between A and B in the rule A => B. Correlation shows how the itemset A affects the itemset B. The range is [0,∞).
Lift(A => B) = Support(A ∪ B) / (Support(A) × Support(B))
This is similar to the confidence measure, but unlike confidence it also takes into account the popularity of item B. The confidence of A => B may overstate the likelihood of B being purchased with A if B is independently popular. A lift value greater than one indicates how much more often the antecedent and consequent of the rule A => B occur together than would be expected if they were statistically independent.
In other words, lift says how likely item B is to be purchased when item A is purchased, while controlling for how popular item B is. If the lift of A => B equals 1, then A and B are independent.
In Figure-5, the lift of {apple => beer} is 1, which implies no association between the items. A lift value greater than 1 means that item B is likely to be bought if item A is bought, while a value less than 1 means that item B is unlikely to be bought if item A is bought.
- Lift (A => B) = 1 means that there is no correlation within the itemset.
- Lift (A => B) > 1 means that there is a positive correlation within the itemset, i.e., the products A and B are more likely to be bought together.
- Lift (A => B) < 1 means that there is a negative correlation within the itemset, i.e., the products A and B are unlikely to be bought together.
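Continuing with the same hypothetical transactions and the `support` helper, lift is just a ratio of supports:
```python
def lift(antecedent, consequent, transactions):
    """Lift of the rule antecedent => consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / (support(antecedent, transactions) *
                   support(consequent, transactions))

print(lift({"apple"}, {"beer"}, transactions))  # 1.0 -> apple and beer look independent
```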
Association
Rule Mining is viewed as a two-step approach:
- Frequent Itemset Generation: Find all frequent item-sets with support >= pre-determined min_support count.
- Rule Generation: List all Association Rules from frequent item-sets. Calculate Support and Confidence for all rules. Prune rules that fail min_support and min_confidence thresholds.
Summary of Support, Confidence and Lift
The
following figure summarizes the concepts of Support, Confidence and Lift for MBA.
Figure-8: Summary of Support, Confidence and Lift measures
Measure 4: Leverage
Leverage: The leverage of A => B computes the difference between the frequency of A and B appearing together and the frequency that would be expected if A and B were independent. A leverage value of 0 indicates that they are independent. This measure is computed as
Leverage(A => B) = Support(A ∪ B) − Support(A) × Support(B)
The range of this measure is [-1,1].
Measure 5: Conviction
Conviction: The conviction of the rule A => B measures how strongly the consequent depends on the antecedent. A high conviction means that the consequent is highly dependent on the antecedent. If the confidence of the rule A => B equals 1 (a perfect confidence score), the conviction becomes infinite. Similar to the lift measure, if A and B are independent the conviction of A => B equals 1. The range is [0,∞).
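Both measures can be written in terms of the `support` and `confidence` helpers defined above; the formulas used below are the standard definitions of leverage and conviction:
```python
def leverage(antecedent, consequent, transactions):
    """Support(A ∪ B) - Support(A) * Support(B); 0 means A and B are independent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both - support(antecedent, transactions) * support(consequent, transactions)

def conviction(antecedent, consequent, transactions):
    """(1 - Support(B)) / (1 - Confidence(A => B)); 1 means A and B are independent."""
    conf = confidence(antecedent, consequent, transactions)
    if conf == 1.0:
        return float("inf")  # perfect confidence -> infinite conviction
    return (1 - support(consequent, transactions)) / (1 - conf)

print(leverage({"apple"}, {"beer"}, transactions))    # 0.0 for the hypothetical data
print(conviction({"apple"}, {"beer"}, transactions))  # 1.0 for the hypothetical data
```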
What is the Apriori Algorithm?
Apriori is an algorithm designed for frequent itemset mining and association rule learning over transactional databases. It works on rows of data called transactions, identifying items and itemsets that commonly occur together. For example, for e-commerce/retail shops with large databases, the algorithm can find groups of individual items that are frequently purchased together.
The algorithm was proposed by Rakesh Agrawal and Ramakrishnan Srikant in 1994. The apriori principle can greatly reduce the number of itemsets we need to examine. Put simply, the apriori principle states that if an itemset is infrequent, then all its supersets must also be infrequent.
This
means that if {beer} was found to be infrequent, we can expect {beer, pizza} to
be equally or even more infrequent. So in consolidating the list of popular
itemsets, we need not consider {beer, pizza}, nor any other itemset
configuration that contains beer.
Finding itemsets with high support
Using the apriori
principle, the number of itemsets that have to be examined can be pruned, and
the list of popular itemsets can be obtained in these steps:
Step 0.
Start with itemsets containing just a single item, such as {apple} and {pear}.
Step 1.
Determine the support for itemsets. Keep the itemsets that meet your minimum
support threshold, and remove itemsets that do not.
Step 2.
Using the itemsets you have kept from Step 1, generate all the possible itemset
configurations.
Step 3.
Repeat Steps 1 & 2 until there are no more new itemsets
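A compact and deliberately unoptimized sketch of these steps, reusing the hypothetical `transactions` list from the earlier support example, could look like the following (a production system would use a library implementation instead):
```python
def apriori_frequent_itemsets(transactions, min_sup):
    """Return {itemset: support} for all itemsets with support >= min_sup."""
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 0 and Step 1: start from single items and keep those above the threshold.
    items = set().union(*transactions)
    current = [frozenset([i]) for i in items if sup(frozenset([i])) >= min_sup]
    frequent = {s: sup(s) for s in current}

    k = 2
    while current:
        # Step 2: build k-item candidates only from the surviving (k-1)-itemsets,
        # which is exactly the pruning the apriori principle allows.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Step 1 again: keep candidates that meet the support threshold.
        current = [c for c in candidates if sup(c) >= min_sup]
        frequent.update({c: sup(c) for c in current})
        k += 1  # Step 3: repeat until no new itemsets survive.
    return frequent

for itemset, s in apriori_frequent_itemsets(transactions, 0.3).items():
    print(set(itemset), round(s, 2))
```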
Reducing candidate itemsets using the Apriori Algorithm
As an illustration, suppose {apple} is determined to have low support: it is removed, and all other itemset configurations that contain apple need not be considered. This alone can reduce the number of itemsets to consider by more than half.
Note
that the support threshold that you pick in Step 1 could be based on formal
analysis or past experience. If you discover that sales of items beyond a
certain proportion tend to have a significant impact on your profits, you might
consider using that proportion as your support threshold.
Finding item rules with high confidence or lift
We
have seen how the apriori algorithm can be used to identify itemsets with high
support. The same principle can also be used to identify item associations with
high confidence or lift. Finding rules with high confidence or lift is less
computationally taxing once high-support itemsets have been identified, because
confidence and lift values are calculated using support values.
Take
for example the task of finding high-confidence rules. If the rule
{beer, chips -> apple}
has
low confidence, all other rules with the same constituent items and with apple
on the right hand side would have low confidence too. Specifically, the rules
{beer -> apple, chips}
{chips -> apple, beer}
would
have low confidence as well. As before, lower level candidate item rules can be
pruned using the apriori algorithm, so that fewer candidate rules need to be
examined.
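In practice these steps are rarely coded from scratch; Python libraries such as mlxtend implement both frequent-itemset mining and rule generation. The sketch below shows a typical usage pattern on the hypothetical transactions used earlier (exact function signatures may differ slightly between mlxtend versions):
```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical transactions (same as in the earlier sketches).
baskets = [
    ["apple", "beer", "rice", "chicken"], ["apple", "beer", "rice"],
    ["apple", "beer"], ["apple", "pear"],
    ["milk", "beer", "rice", "chicken"], ["milk", "beer", "rice"],
    ["milk", "beer"], ["milk", "pear"],
]

# One-hot encode the baskets into a boolean DataFrame.
te = TransactionEncoder()
one_hot = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

# Step 1: frequent itemsets with support >= 0.3.
frequent_itemsets = apriori(one_hot, min_support=0.3, use_colnames=True)

# Step 2: generate association rules and prune by a minimum confidence threshold.
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```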
Limitations of Apriori
- Computationally expensive: Even though the apriori algorithm reduces the number of candidate itemsets to consider, this number can still be huge when store inventories are large or when the support threshold is low. An alternative is to reduce the number of comparisons by using advanced data structures, such as hash tables, to sort candidate itemsets more efficiently.
- Spurious associations: Analysis of large inventories involves more itemset configurations, and the support threshold might have to be lowered to detect certain associations. However, lowering the support threshold can also increase the number of spurious associations detected. To ensure that identified associations are generalizable, they could first be distilled from a training dataset and then have their support and confidence assessed on a separate test dataset.
- Missing values and data noise: Missing values and noise that contaminate the data can affect the accuracy of market basket analysis.
Eclat Algorithm
The ECLAT algorithm stands for Equivalence Class Clustering and bottom-up Lattice Traversal. It is one of the popular methods of association rule mining, and a more efficient and scalable alternative to the Apriori algorithm. While the Apriori algorithm works horizontally, imitating a breadth-first search of a graph, the ECLAT algorithm works on a vertical data layout (each item is stored together with the set of transaction IDs that contain it), much like a depth-first search of a graph. This vertical approach makes ECLAT faster than Apriori.
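A minimal sketch of this vertical approach, again assuming the hypothetical `transactions` list from the earlier sketches: each item is mapped to its tidset (the set of transaction IDs containing it), and supports of larger itemsets are obtained by intersecting tidsets depth-first.
```python
import math

def eclat(prefix, items, min_count, results):
    """Depth-first search over (item, tidset) pairs, extending `prefix`."""
    while items:
        item, tids = items.pop()
        if len(tids) >= min_count:
            results[frozenset(prefix | {item})] = len(tids)
            # Intersect this item's tidset with the remaining items' tidsets.
            suffix = [(other, tids & other_tids) for other, other_tids in items]
            eclat(prefix | {item}, suffix, min_count, results)

# Build the vertical representation: item -> set of transaction IDs.
vertical = {}
for tid, basket in enumerate(transactions):
    for item in basket:
        vertical.setdefault(item, set()).add(tid)

results = {}
min_count = math.ceil(0.3 * len(transactions))  # support threshold of 0.3
eclat(set(), list(vertical.items()), min_count, results)
for itemset, count in results.items():
    print(set(itemset), count / len(transactions))
```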
Compared with Apriori: Apriori is typically applied to large datasets, while Eclat suits small and medium datasets. Apriori repeatedly scans the original (real) dataset, whereas Eclat scans the currently generated intermediate dataset. Apriori is generally slower than Eclat.
Advantages over the Apriori algorithm:
- Memory Requirements: Since the ECLAT algorithm uses a Depth-First Search approach, it uses less memory than Apriori algorithm.
- Speed: The ECLAT algorithm is typically faster than the Apriori algorithm.
- Number of Computations: The ECLAT algorithm does not involve the repeated scanning of the data to compute the individual support values.
Collaborative filtering (CF) and Recommender Systems
Collaborative Filtering is an unsupervised data mining method for filtering items using the opinions of other people. This type of analysis is used for recommender systems. For example, products, books, news items, movies, web pages, resorts, restaurants, etc. that are likely to be of interest to a user are identified based on the behaviour of similar users (for more related figures refer to earlier blogs on Paradigms of Learning Algorithms and Unsupervised Learning part-1). This technique has its roots in the age-old human behaviour of sharing opinions with other people. That is active collaborative filtering, with direct human participation.
A limitation of active collaborative filtering systems is that
they require a community of people who know each other.
As computers evolved, individual opinions could be stored in databases and retrieved with appropriate queries. This led to the development of automated collaborative filtering and recommender systems. During the development of collaborative filtering, two models evolved: 1) pull-active systems, which require the user to know whose opinions to trust, and 2) push-active systems, which require the user to know to whom a particular piece of content may be interesting. Automated collaborative filtering (ACF) systems relieve users of this burden by using a database of historical user opinions to automatically match each individual to others with similar opinions.
The
underlying assumption of the collaborative filtering approach is that if a
person A has the same opinion as a person B on
an issue, A is more likely to have B's opinion on a different issue than
that of another randomly chosen person. For example, a collaborative
filtering recommendation system for television tastes could make
predictions about which television show a user should like, given a partial
list of that user's tastes (likes or dislikes). Note that these predictions are specific to
the user, but use information gleaned from many users.
Item Based or item-item collaborative filtering:
This is a model-based algorithm for making recommendations. The method uses similarities between items' consumption histories; for item-based collaborative filtering, detailed user information is not important. The similarity values between items are measured by observing all the users who have rated items in a dataset, and the similarity between two items depends on the ratings given to the items by users who have rated both of them. Similarity scores are calculated using one of a number of similarity measures. Common measures are the cosine similarity (vector-based) measure, the Pearson (correlation-based) similarity measure, and the adjusted cosine similarity (which adjusts for the fact that different users rate on different scales). These similarity values are used to predict ratings for user-item pairs: once a model has been built with one of the similarity measures, the rating for any user-item pair can be predicted using a weighted sum.
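A minimal numpy sketch of this idea, using a tiny hypothetical user-item rating matrix (rows are users, columns are items, 0 means "not rated"): item-item cosine similarities are computed first, and an unknown rating is then predicted as a similarity-weighted sum of the user's other ratings.
```python
import numpy as np

# Hypothetical ratings: rows = users, columns = items, 0 = not rated.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Item-item cosine similarity computed over the rating columns.
norms = np.linalg.norm(R, axis=0)
item_sim = (R.T @ R) / np.outer(norms, norms)

def predict(user, item):
    """Weighted-sum prediction of R[user, item] from the user's other ratings."""
    rated = np.where(R[user] > 0)[0]        # items this user has already rated
    weights = item_sim[item, rated]
    return float(weights @ R[user, rated] / weights.sum())

print(round(predict(0, 2), 2))  # predicted rating of user 0 for item 2
```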
In the more general sense, collaborative filtering is
the process of filtering for information or patterns using techniques involving
collaboration among multiple agents, viewpoints, data sources, etc. Applications
of collaborative filtering typically involve very large data sets.
Collaborative filtering methods have been
applied to many different kinds of data including: sensing and monitoring data,
such as in mineral exploration, environmental sensing over large areas or
multiple sensors; financial data, such as financial service institutions that
integrate many financial sources; or in electronic commerce and web
applications where the focus is on user data, etc.
User Based:
This method considers similarities between users' consumption histories together with item similarities. Starting from the item-based similarity matrix, we can check which items a user has consumed. For each such item the top-K neighbours are determined, the consumption record of each neighbour is noted, and a similarity score is calculated. The items with the highest scores are recommended, as in the sketch below.
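As a rough sketch of a user-based variant (reusing the hypothetical rating matrix `R` and the numpy import from the previous snippet), the K most similar users are found first, and each item the target user has not yet rated is scored by a similarity-weighted sum of those neighbours' ratings.
```python
def recommend(user, k=2, n_items=2):
    """Recommend up to n_items unrated items for `user` using its K nearest users."""
    user_norms = np.linalg.norm(R, axis=1)
    user_sim = (R @ R.T) / np.outer(user_norms, user_norms)  # user-user cosine
    neighbours = [u for u in np.argsort(-user_sim[user]) if u != user][:k]

    scores = {}
    for item in np.where(R[user] == 0)[0]:                   # unrated items only
        weights = user_sim[user, neighbours]
        scores[item] = float(weights @ R[neighbours, item] / weights.sum())
    return sorted(scores, key=scores.get, reverse=True)[:n_items]

print(recommend(0))  # item indices recommended for user 0
```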
Figure Credits:
Figure-1: rsquaredacademy.com
Figure-2: techleer.com
Figure-3: sciencedirect.com
Figure-4: saedsayad.com
Figure-5: KDnuggets.com
Figure-6: datacamp.com
Figure-7: i.stack.imgur.com
Figure-8: saedsayad.com