Tuesday, February 17, 2015

Apriori Principle



When mining a data set for patterns that meet a minimum support requirement, we want to find all of the patterns that are supported without wasting time looking at patterns that aren't.  This sounds simple, but in much larger data sets it can become difficult to keep track of which patterns are supported and which aren't.  The Apriori Principle helps with this.

To explain, let's use the data in this table and assume that the minimum support is 2.



We start by looking for single items that meet the support threshold.  In this case, it's simply A, B, C, D, and E, because each of these appears at least twice in the table.  This is summarized in the single item support table below.

single item support table
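This first counting step can be sketched in a few lines of Python. Since the post's data table is an image, the transactions below are hypothetical, chosen only to be consistent with the supports described in this post:

```python
from collections import Counter

# Hypothetical transactions, consistent with the supports described in
# this post (the original data table is not reproduced here).
transactions = [
    {"B", "C", "D"},
    {"B", "C", "D"},
    {"A", "B"},
    {"A", "B"},
    {"B", "E"},
    {"B", "E"},
]

MIN_SUPPORT = 2

def item_support(transactions):
    """Count how many transactions contain each single item."""
    counts = Counter()
    for t in transactions:
        counts.update(t)
    return counts

counts = item_support(transactions)
frequent_items = {item for item, c in counts.items() if c >= MIN_SUPPORT}
print(sorted(frequent_items))  # every item A..E appears at least twice
```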

Next, we take all of the items that meet the support requirements (everything so far in this example) and make all of the patterns/combinations we can out of them: AB, AC, AD, AE, BC, BD, BE, CD, CE, DE.  When we list all of these combinations in a table and determine the support for each, we get a table that looks like this.
2 items support before filtering

Several of these patterns don't meet the support threshold of 2, so we remove them from the list of options.
2 item support table
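The pair-generation and filtering step looks like this in Python. Again, the transactions are hypothetical stand-ins consistent with the counts described above:

```python
from itertools import combinations

# Hypothetical transactions consistent with the supports described in
# the post (the original data table is an image); minimum support is 2.
transactions = [
    {"B", "C", "D"}, {"B", "C", "D"},
    {"A", "B"}, {"A", "B"},
    {"B", "E"}, {"B", "E"},
]
MIN_SUPPORT = 2
frequent_items = ["A", "B", "C", "D", "E"]  # all single items pass

def support(pattern, transactions):
    """Number of transactions that contain every item in the pattern."""
    return sum(1 for t in transactions if set(pattern) <= t)

# All 2-item candidates built from the frequent single items
candidates = [frozenset(p) for p in combinations(frequent_items, 2)]
frequent_pairs = {c for c in candidates
                  if support(c, transactions) >= MIN_SUPPORT}
print(sorted("".join(sorted(p)) for p in frequent_pairs))
```

With this data, the pairs that survive the filter are AB, BC, BD, BE, and CD.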

At this point, we use the surviving items to make other patterns that contain 3 items.  If you logically work through all of the options, you'll get a list like this: ABC, ABD, ABE, BCD, BCE, BDE.  (Notice that I didn't list ABCD or BCDE here because they are 4 items long.)


Before I create the support table for these, let's look at these patterns.  The first one, ABC, was created by combining AB and BC.  If you look in the 2 item support table (before or after filtering), you'll find that AC doesn't have the minimum support required.  If AC isn't supported, a more complicated pattern that includes AC (such as ABC) can't be supported either.  This is the key point of the Apriori Principle.  So, without having to go back to the original data, we can exclude some of the 3-item patterns.  When we do this, we eliminate ABC (AC not supported), ABD (AD not supported), ABE (AE not supported), BCE (CE not supported), and BDE (DE not supported).  This process of removing patterns that can't be supported, because one of their subsets (shorter combinations) isn't supported, is called pruning.  Pruning leaves only BCD, which has a support of 2.

3 item support table
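The pruning step described above can be sketched directly: a 3-item candidate survives only if every 2-item subset of it is frequent. The frequent pairs below are the ones from the hypothetical data used earlier:

```python
from itertools import combinations

# Frequent 2-item patterns from the previous step (hypothetical data,
# consistent with the counts described in the post).
frequent_pairs = {frozenset(p) for p in ["AB", "BC", "BD", "BE", "CD"]}

# 3-item candidates built by combining the frequent pairs
candidates = [frozenset(c) for c in ["ABC", "ABD", "ABE", "BCD", "BCE", "BDE"]]

def prune(candidates, frequent_smaller):
    """Keep only candidates whose 2-item subsets are all frequent.
    This is the Apriori Principle: no superset of an unsupported
    pattern can itself be supported."""
    kept = []
    for cand in candidates:
        subsets = [frozenset(s) for s in combinations(sorted(cand), 2)]
        if all(s in frequent_smaller for s in subsets):
            kept.append(cand)
    return kept

survivors = prune(candidates, frequent_pairs)
print(["".join(sorted(c)) for c in survivors])  # only BCD survives pruning
```

Note that no support counting against the original data was needed to throw out the other five candidates.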

The final list of all of the patterns that have support greater than or equal to 2 is summarized here.


Takeaways:
  1. The Apriori Principle can be used to simplify the pattern generation process when mining patterns in data sets.
  2. If a simple pattern is not supported, then a more complicated pattern that contains it cannot be supported (e.g. if AC isn't supported, there is no way that ABC is supported).
  3. You can also look at takeaway 2 in the opposite direction: if a complicated pattern meets the minimum support requirements, then all of the simpler patterns that can be created from that complicated pattern must be supported (e.g. if ABC is supported, then AB, BC, AC, A, B, and C all have to be supported).
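Putting the whole walkthrough together, here is a compact end-to-end sketch of the procedure, again using hypothetical transactions consistent with the supports described in this post:

```python
from itertools import combinations

# Hypothetical transactions consistent with the supports described in
# the post; minimum support is 2.
transactions = [
    {"B", "C", "D"}, {"B", "C", "D"},
    {"A", "B"}, {"A", "B"},
    {"B", "E"}, {"B", "E"},
]
MIN_SUPPORT = 2

def support(pattern, transactions):
    """Number of transactions that contain every item in the pattern."""
    return sum(1 for t in transactions if pattern <= t)

def apriori(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    # Level 1: frequent single items
    frequent = [{frozenset([i]) for i in items
                 if support(frozenset([i]), transactions) >= min_support}]
    k = 2
    while frequent[-1]:
        prev = frequent[-1]
        # Candidate generation: combine frequent (k-1)-item patterns
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Pruning: every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        frequent.append({c for c in candidates
                         if support(c, transactions) >= min_support})
        k += 1
    return [level for level in frequent if level]

for level in apriori(transactions, MIN_SUPPORT):
    print(sorted("".join(sorted(p)) for p in level))
```

With this data the loop reproduces the walkthrough: all five single items, then the pairs AB, BC, BD, BE, CD, then only BCD.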

3 comments:

  1. "e.g. if AC isn't supported, there is no way that ABC is supported"
    should be
    "e.g. if AB isn't supported, there is no way that ABC is supported"
    or
    "e.g. if BC isn't supported, there is no way that ABC is supported"
    right?

  2. I think that you’re focusing on the sequence of letters. This post is discussing frequent patterns in sets, which have no sequence. I just wrote them in alphabetical order for convenience.
