Thursday, February 26, 2015

Lift

Lift is an objective measure of "interestingness" that has been used in various fields including statistics.  It is also an option when we are data mining.  In case you've run across this measure and don't fully understand it, let me give you a quick summary of what it is, how it's calculated and some of its properties.

Much like support and confidence, lift uses a numerical calculation to determine how important a pattern or correlation is.  If we use the table below as a simple example, we can see that people who buy oranges are less likely to buy an apple and vice versa (this is called a negative correlation), but in a data mining process we want an automated way to see this.  Lift is one way to automate that determination.


To calculate lift, first find the support for both items together (in this case 2/20 = 10%); this will be your numerator.  Notice that I used the "% of total" version of support.  If you use raw counts instead, you will get a different answer and it will be wrong.  For the denominator, multiply the individual supports of the two items together (the bottom left and top right total values); in this case that would be 8/20 * 7/20 = 0.14.  The lift for buying apples and oranges together is therefore 0.10 / 0.14, or ~0.714 for this example.
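As a rough sketch of the arithmetic above in Python (the function name `lift` and its parameter names are my own, chosen for illustration, and do not come from any particular library):

```python
from fractions import Fraction

def lift(both, total_a, total_b, n):
    """Lift of items A and B from raw transaction counts.

    both    -- transactions containing both items
    total_a -- transactions containing item A (with or without B)
    total_b -- transactions containing item B (with or without A)
    n       -- total number of transactions
    """
    # Use the "% of total" version of support for every term.
    support_both = Fraction(both, n)
    support_a = Fraction(total_a, n)
    support_b = Fraction(total_b, n)
    return support_both / (support_a * support_b)

# Counts from the example: 2 joint purchases, item totals of 8 and 7, 20 transactions.
example = lift(both=2, total_a=8, total_b=7, n=20)
print(example)         # exact value as a fraction
print(float(example))  # about 0.714
```

Using exact fractions is just a convenience so that no rounding creeps in; plain floating-point division gives the same answer to the precision that matters here.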

So what does this 0.714 really mean?  If the lift were to work out to exactly 1, that would mean there is no correlation at all between the 2 items (buying one tells you nothing about buying the other).  Since our lift is less than 1, there is a negative correlation (that's what we could see with our own intuition before we started).  If the lift turns out to be greater than 1, then there is a positive correlation.  If you look at how lift is calculated, you will notice that because all of the values that go into the fraction are counts (or fractions of counts), the value of lift can never be negative; it is 0 only when the two items never appear together.  It can get really BIG though.  If I change the table above to have lots of transactions without apples or oranges, then I can get a really big number for my lift.

If you do the same calculation we did above on this table, you get a lift of 357,143.32.  This huge swing in the lift value is the greatest limitation of lift in data mining.  The only thing I changed in the data I was analyzing was the number of transactions that didn't have apples or oranges.  Intuitively, this shouldn't make any difference to whether the correlation is interesting or not.  That is why data mining has developed other measures of interestingness that are called null-invariant.  Null-invariant measures aren't sensitive to that lower right value in the table (the count of transactions containing neither item).  Eventually I'll write a blog post about those and add a link here, but that won't be tonight.
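To see this sensitivity in numbers, here is a small Python sketch that recomputes lift from the four cells of the table. The cell values are my assumption: they are derived from the totals quoted earlier (2 joint purchases, item totals of 8 and 7, 20 transactions, which leaves 7 with neither item), and the enlarged table is assumed to raise only the "neither" count to 10,000,000, a value that happens to reproduce the lift quoted above. The function name `lift_from_cells` is hypothetical.

```python
def lift_from_cells(both, only_a, only_b, neither):
    """Lift computed from the four cells of a 2x2 table of transaction counts."""
    n = both + only_a + only_b + neither   # total transactions
    total_a = both + only_a                # all transactions containing A
    total_b = both + only_b                # all transactions containing B
    return (both / n) / ((total_a / n) * (total_b / n))

# Assumed cells for the original table: 2 both, 6 and 5 single-item, 7 neither.
small = lift_from_cells(both=2, only_a=6, only_b=5, neither=7)

# Same purchases, but with an assumed 10,000,000 transactions containing neither item.
large = lift_from_cells(both=2, only_a=6, only_b=5, neither=10_000_000)

print(round(small, 3))  # 0.714
print(round(large, 2))  # 357143.32
```

Nothing about the apple and orange purchases changed between the two calls; only the count of unrelated transactions did, yet the lift flips from below 1 to far above it.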