Saturday, May 2, 2015

How to Deal with Mixed Data Types when Performing K-means Clustering

When we look at a group of dots on a page, most of us can intuitively put these dots into groups that appear to "belong" together because they seem to be close to each other.  In data mining we call these clusters.  Clustering with the human eye works pretty well for problems in 2 or sometimes 3 dimensions.  As an example, if you were to look at the chart below, how would you cluster/group the points together?

I think that most of us would group the 3 points in the upper left corner together, and then group the points on the right side together.  Or maybe you'd group the upper left together, and then create 2 groups on the right side (one for the upper 2 dots, and one for the lower 2).  As you can see, this practice is not really definitive, but there is a lot of information to be gained from a data set by grouping its points together.  Therefore, data mining seeks to make the process of finding the best clusters objective so that we can perform this analysis on multi-dimensional data with little to no human interaction.

Sometimes our data is not purely numerical.  We may have to deal with data that has categories like male/female, ethnic background, etc.  Or, maybe our data set has ordered categories (ordinal data) like t-shirt size (small, medium, large, extra large), or relative temperature (cold, warm, and hot).  We would like to still be able to cluster our data points, even though not all of the data types fit the same cookie-cutter mould.  To explain this, I'm going to start by showing a simple example of the k-means algorithm with just numerical data.  Then, we'll add in a category and an ordinal data type to see how we can perform the same process with different data types.

Since we already have a simple 2-dimensional set of data points above, we'll stick with that example and use the k-means algorithm to solve for the best clusters.  The 'k' in k-means is a user-defined variable that controls how many clusters we're going to look for.  To keep things simple we'll set k=2, which means that when we're done, we'll have the data points in the example split and assigned to 2 groups.

To mathematically define these 2 clusters, k-means uses 2 points in space that aren't on the graph yet.  These points are called cluster centers, or centroids.  Think of a centroid as the center of gravity of all of the points assigned to it.  When we're all done, we want the distance between these centroids and the points assigned to them to be minimized.  If we do that, we'll get centroids that have data points tightly packed around them, which is intuitively what we are looking for when we are doing this by eye.  The original k-means algorithm starts by randomly picking values for these cluster centers and then iteratively improves their position to minimize the distance as described above.  For our example, I have randomly placed 2 cluster centers (red and green).  These cluster centers are also referred to as seeds.  The red seed is located at (2.2087, 4.1076) and the green seed is located at (4.8579, 0.3341).



The question is how we will define "distance" from the data points to these seeds.  There are multiple ways to define the distance between 2 points.  To visualize a couple of them, let's say that we have randomly picked two cluster centers (in red and green below) and we're trying to calculate the distance to one of the other blue data points.


A very simplistic way of calculating the distance between 2 data points is to use the "Manhattan" distance.  Think of this as how far you would have to walk to get somewhere in Manhattan, New York where it's all city blocks and you can't just jump over or walk through buildings.  This is also called the L1 norm.  I described how to calculate this in my PageRank blog post.  If you're curious about it you can learn more there.
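For two points (x1, y1) and (x2, y2) in the plane, the Manhattan distance is just the sum of the absolute differences in each coordinate:

\[ d_{Manhattan} = |x_1 - x_2| + |y_1 - y_2| \]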

Another way of calculating distance is called the Euclidean distance.  This is like the saying "the shortest distance between 2 points is a straight line".  To calculate this distance we use the square root of the sum of the squares.  If we have 2 points in space where point 1 = (x1,y1) and point 2 = (x2,y2), then the Euclidean distance between them looks like this
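Written out, that square root of the sum of the squares is:

\[ d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \]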


There are actually several other distance metrics we could use, but since it is fairly simple and it was the one used in the original paper on k-means, we'll stick with the Euclidean distance in our example.

Now that we know how we're going to calculate distance and we know where our seeds are located, we can start the iterative process of finding the best location for the centroids.  The process goes like this (there's a small code sketch after the list if you prefer to read it that way):

  1. Calculate the distances between all of the data points and all of the cluster centers
  2. Assign each data point to the cluster center closest to it
  3. Calculate the average x and y values for the points assigned to each cluster center
  4. Set each cluster center's location to be equal to the average of the points assigned to it
  5. Check to see if any data points have been assigned to a different cluster center
    1. If so, go back to the top and repeat
    2. If not, then end the algorithm, you've got your answer
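
Here's that loop as a minimal Python sketch, using the 7 data points from the chart and the red/green seeds from above (the variable names are mine, just for illustration):

import math

# The 7 example data points and the 2 random seeds (red, green) from above
points = [(2, 3), (1, 5), (3, 5), (7, 5), (9, 1), (8, 4), (7, 2)]
centers = [(2.2087, 4.1076), (4.8579, 0.3341)]

def euclidean(p, q):
    # Square root of the sum of the squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

assignments = None
while True:
    # Steps 1 and 2: assign each point to its nearest cluster center
    new_assignments = [min(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
                       for p in points]
    # Step 5: if no point changed clusters, we're done
    if new_assignments == assignments:
        break
    assignments = new_assignments
    # Steps 3 and 4: move each center to the average of its assigned points
    for c in range(len(centers)):
        members = [p for p, a in zip(points, assignments) if a == c]
        if members:  # if a center has no points, you could re-seed it randomly here
            centers[c] = tuple(sum(coords) / len(members) for coords in zip(*members))

print(centers)      # final centroid locations
print(assignments)  # which cluster each point belongs to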

If you didn't follow that, it will become clear as we work through the example.  For step 1, I'll show you a couple of the calculations by hand and then jump to a table of all of the distances.  If we take the data point at (1,5), the distance from that point to the red seed at (2.2087,4.1076) is
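Plugging the numbers into the Euclidean formula (rounded a bit), the distances from (1,5) to the red and green seeds work out to roughly:

\[ d_{red} = \sqrt{(1 - 2.2087)^2 + (5 - 4.1076)^2} = \sqrt{1.4610 + 0.7964} \approx 1.502 \]

\[ d_{green} = \sqrt{(1 - 4.8579)^2 + (5 - 0.3341)^2} = \sqrt{14.8834 + 21.7706} \approx 6.054 \]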


You can see that the distance from (1,5) to the green cluster center at (4.8579, 0.3341) is worked out above as well.  At this point, I'm going to assume that we can all replicate this until we're sick of it, so I'll just give you the rest of the answers in this table


Now that we have that completed we can assign the points to their nearest cluster center.  The following points get assigned to the red cluster center: (2,3); (1,5); (3,5); (7,5).  All the rest get assigned to the green cluster center.



Now we need to take the average values of the points assigned to each cluster center.  For the red center, the average X value is (2+1+3+7)/4 = 3.25 and the average Y value is (3+5+5+5)/4 = 4.5.  For the green center, the average X value is (9+8+7)/3 = 8 and the average Y value is (1+4+2)/3 = 2.3333.  Now we assign these average values to be the new locations of the 2 cluster centers.  So the new location of the red cluster center is (3.25, 4.5) and the new location of the green cluster center is (8, 2.3333).

Since this is the first iteration, it's obvious that we have changed the assignments of the data points to the cluster centers (they were previously unassigned).  So, we have to repeat the process.  With our new cluster centers of (3.25, 4.5) and (8, 2.3333), the distances between the points and the cluster centers look like this


When you look to see which cluster center is closest to each data point, you can see that some of the assignments are going to change.  The data point at (7,5) is now closer to the green cluster center than the red cluster center.


Now we take the average values of all of the data points assigned to red and green.  The new location of the red center is (2,4.3333) and the new location of the green center is (7.75, 3).


If you do the distance calculations again and assign the points to their nearest centers, as we have in prior iterations, you'll find that none of the data points get assigned to a different center even though the locations of the centers moved a little bit.  We now have our final answer.  The red cluster center is located at (2, 4.3333) and points (1,5), (2,3), and (3,5) are assigned to the red cluster.  The green cluster center is located at (7.75, 3) and points (7,2), (7,5), (8,4), and (9,1) are assigned to the green cluster.

In a nutshell, that's how k-means works.  It can be expanded to many, many dimensions using similar calculations.  For n different dimensions (using n here so we don't confuse it with the k that counts clusters) the Euclidean distance calculation looks like this
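For two points p and q with coordinates p1 through pn and q1 through qn, one common way to write it is:

\[ d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \]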


Don't get confused by the letters in this formula.  You just keep adding squares of differences for each dimension you're trying to compute the distance for.  So if you're trying to calculate the distance between 2 points in 4 dimensional space like (1,2,3,4) and (10,9,8,7), you would get
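Working that out:

\[ d = \sqrt{(1-10)^2 + (2-9)^2 + (3-8)^2 + (4-7)^2} = \sqrt{81 + 49 + 25 + 9} = \sqrt{164} \approx 12.81 \]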

Once you adapt this to your problem, the rest of the algorithm is basically the same.

The selection of the seed cluster centers is also very important.  As I mentioned, I randomly picked the values in this example.  If we randomly pick different values, depending on the "shape" and structure of the data, we can get different answers.  The originator of the algorithm suggested that you just run the algorithm several times with different random starting values and then pick the answer that minimizes the total distance between the cluster centers and their assigned data points.  The k-means++ algorithm improves on this, but that's outside the scope of this post.

The last thing I need to tell you before we change topics is what you need to do if for some reason no data points get assigned to one of your cluster centers.  If this happens, basically, the cluster center is too far away from the data.  One simple way to deal with this is just to randomly assign a new location to that cluster center again and continue the algorithm.  This might make the algorithm wander around a little more looking for the solution, but it's better than only getting 1 cluster center when you were looking for 2.

Dealing with Different Data Types
Now what do you do if your data set looks like this?


Before you answer "uuuhh... give up?" read the rest of this post.

I know that there are some of you out there that thought "white and black aren't colors; they're shades" when you looked at the table above.  OK Mr. Smarty Pants, you're right but let it go for now.

The most important part of adding in different data types is figuring out a way to measure "distance" between these variables.  We'll start by describing how this can be done with the size category above.  Since size has a sort of intuitive order to it, where medium is bigger than small and large is bigger than medium, we can create a pseudo-distance fairly easily.  The easiest way to think of this is to just make the lowest value ("small" in our case) equal to 0 and the highest value ("large") equal to 1.  Then we just divide up the space between 0 and 1 equally to fit in any other values we have in our list.  Since we only have 3 values in our example, this is easy and we set "medium" to 0.5.  If you have a longer list of values, you can use this general equation to assign values to your labels between 0 and 1.
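One common form of this equation (and the one consistent with the values we just assigned by hand) is:

\[ z_i = \frac{r_i - 1}{M - 1} \]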

In this equation, Zi is a pseudo-location like an X or Y value.  The ri term is the rank of the label in the list you're trying to convert, and M is the total number of labels in the list.  So, for our example, M=3, and if I'm trying to calculate Zi for "medium" then ri is equal to 2; ri would be 1 for "small" and 3 for "large".  Take some time to plug those numbers into the equation above to make sure you can see how this gives you the values we already intuitively assigned in the paragraph above.

Now that we have these "location" values for size, we can treat the size variable just like the X and Y variables.  We'll have to modify X and Y soon too, but there'll be more on that later.

Now how do we deal with the pesky color value?  Some of you out there may think that color could be handled the same way as size because we can use the light frequency spectrum values for order.  Let's just assume instead, for demonstration purposes, that the colors come from a color wheel with no beginning or end.  That means we have to treat them like categories.  For categories, we don't transform the values ("white", "red", or "black") into numbers, but we use a different distance calculation.  We say that if 2 points have the same color, the category distance between them is 0.  If the 2 points have different colors, the category distance between them is 1.  So the distance between "white" and "white" is 0, between "white" and "red" is 1, and between "white" and "black" is 1 too.  You can see that it is more difficult to be "close" to a point here if there are multiple values to choose from in a category.

You may have noticed that our transformations of size and color both have a scale from 0 to 1.  If we just leave the values of X and Y the way they are, their distances will appear much larger than the distances for size and color just because of their scale.  To solve this problem we can scale X and Y down to something close to the 0-to-1 range.  Once we do this, all of the different variables (X, Y, size, color) will get roughly the same distance weighting.  A simple way to scale numerical variables like X and Y is to compute the Z-score.  The Z-score will likely look very familiar to you if you've been through a statistics course.
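The Z-score of a value x is:

\[ Z = \frac{x - \mu}{\sigma} \]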

The Z will replace our X and Y values in our table above after this transformation.  The x in the equation is a specific value of X or Y in the table above.  Mu (μ) is the average (mean) value of X or Y, and sigma (σ) is the standard deviation of X or Y.  For example, if I want to transform the values in the Y column, I need the average (1+4+2+3+5+5+5)/7 = 3.5714..., and the standard deviation, which equals 1.6183... (if you want more understanding of how to calculate a standard deviation you can easily find this on Wikipedia here).  Now that I have those 2 values, I'll calculate the first couple Z-scores for the Y variable.
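For the first two Y values in that list (1 and 4), the Z-scores work out to roughly:

\[ Z = \frac{1 - 3.5714}{1.6183} \approx -1.589 \qquad Z = \frac{4 - 3.5714}{1.6183} \approx 0.265 \]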



If we complete all of these calculations, we can transform our table into something that we can use to measure distance with all of these different data types.


Now that we've adjusted all of the variables, we're ready to introduce the mixed variable type distance equation.  It works pretty much like the Manhattan distance described earlier.
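One standard way to write it (consistent with the explanation below) is a weighted average of the per-attribute distances:

\[ D_{ij} = \frac{\sum_{p=1}^{P} w_{ij}^{(p)} \, d_{ij}^{(p)}}{\sum_{p=1}^{P} w_{ij}^{(p)}} \]

where d_ij^(p) is the distance between points i and j on attribute p, and w_ij^(p) is the weight given to that attribute.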


The first time I saw this I just about fell over from all of the subscripts and superscripts in this equation.  It looks WAY more complicated than it actually is.  Think of it as a weighted average of distance.  The Dij on the left is the distance between points i and j (e.g. point #'s 1 and 5 in our table above).  The summations on the top and bottom are just summing up over all of the attributes/variables we are analyzing in our table.  So P would be equal to 4 because we have 4 attributes (Zx, Zy, Zsize, and Color).  The Wij terms are user-selected weights that can be used to push the clustering to consider some attributes to be more important than others.  In our example, we'll set these equal to 1 for simplicity.  Once we do that, the formula is really just an average (mean) Manhattan distance of all of the attributes in our table.  If you didn't follow that, you'll see what I mean when we find the new cluster centers for our example.
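Here's a small Python sketch of that mixed distance calculation (the attribute names are just mine for illustration, and the weights are all 1, like in our example):

def mixed_distance(a, b, weights=None):
    # a and b are dicts with scaled numeric attributes (zx, zy, zsize)
    # and one categorical attribute (color)
    attrs = ["zx", "zy", "zsize", "color"]
    weights = weights or {attr: 1.0 for attr in attrs}

    total, weight_sum = 0.0, 0.0
    for attr in attrs:
        if attr == "color":
            d = 0.0 if a[attr] == b[attr] else 1.0   # category distance: 0 if same, 1 if different
        else:
            d = abs(a[attr] - b[attr])               # Manhattan-style absolute difference
        total += weights[attr] * d
        weight_sum += weights[attr]
    return total / weight_sum

# Example: the distance between the two seed cluster centers picked below
red_seed   = {"zx": -0.614, "zy":  0.107, "zsize": 0.5, "color": "red"}
green_seed = {"zx": -0.160, "zy": -0.116, "zsize": 1.0, "color": "black"}
print(mixed_distance(red_seed, green_seed))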

If I plot the Zx and Zy variables on a chart (I can't plot 4 dimensions, and 3 dimensions is a little tricky) it would look like this.


You can see that it is not so easy to visually identify where the cluster centers are going to end up. This is why we solve this with math instead of by eye.  To start, just like last time, we randomly select new seed cluster centers and then calculate the distances from all of the data points to these cluster centers.  Suppose that I pick the seed locations to be (-0.614, 0.107, 0.5, "red") and (-0.160, -0.116, 1, "black").  I can plot these on the chart like this


Notice that I changed the size labels from words to their coded number values in this version of the graph.  The seeds I "randomly" selected have values for Zsize that matched exactly with "medium" and "large", but there is no restriction that ordinal attribute seeds match one of the values in the series.  I could have just as easily picked a value like 0.33356.  Of course the category attribute values have to be selected from the available options ("white", "red", or "black").  Now we just need to compute some distances.  Like last time, I'll show you how to calculate the distances from the 2 seeds to the point in the top left corner, and then give the rest of the answers.


Notice that I am using absolute values for the distance calculations.  When you use the Euclidean distance you don't need to worry about this because a squared number will always be positive.  The other thing you need to notice is that for the last term in the numerator, I just did the color category distance comparison in my head and put it in there.  The first seed (the red diamond...don't get confused here with the colors of the cluster centers and the labels) had its color label equal to "red" and the data point in the top left is "red" too, so the color distance between them is 0.  For the green seed, it was labelled "black" so the color distance between it and the top left data point is 1.  If you continue to calculate the distances between the cluster centers and all of the data points, you'll end up with a table that looks like this


I highlighted in green the distances that are lower for each data point.  We'll use these lower distances to assign each point to a cluster center.  So points 1, 2, 4, and 6 are assigned to the green cluster center and points 3, 5, and 7 are assigned to the red cluster center.  Now, we have to calculate the new cluster centers based on the means of all of the attributes.  We do this the exact same way we did it before, except for the color attribute.  Zx, Zy, and Zsize are all now numerical and we just calculate the mean of the values assigned to each cluster.  For the color attribute, we don't actually take the mean.  Instead we take the mode.  If you don't remember what a mode is, it's just the value that comes up most often in the data.  So for the red cluster (points 3, 5, and 7) we have 2 points that are "red" and 1 point that is "white", so we'll make the new cluster center point "red".  Similarly, the green cluster center has points that are "white", "black", "black", and "black", so the new green cluster center is assigned to be "black".  There is a possibility of a tie between 2 values.  If that ever happens, the easiest thing to do is just to create an arbitrary tie-breaker rule like "always prefer white over red, and prefer red over black".  This rule will create some bias in the results, but in a large data set this will happen pretty infrequently, so it is not much of a concern.  If we do all of the mean calculations and assign the new colors to the cluster centers, we get new cluster centers that look like the last 2 rows of the table below.
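In code, that update step might look something like this (reusing the illustrative attribute names from the distance sketch above):

from collections import Counter

# Arbitrary tie-breaker preference order for the categorical attribute
COLOR_TIE_BREAK = ["white", "red", "black"]

def update_center(members):
    # members is the list of points (dicts) assigned to this cluster
    center = {}
    for attr in ["zx", "zy", "zsize"]:
        center[attr] = sum(m[attr] for m in members) / len(members)  # mean for numeric attributes
    counts = Counter(m["color"] for m in members)
    best = max(counts.values())
    # Mode for the category, with ties broken by the preference order above
    center["color"] = next(c for c in COLOR_TIE_BREAK if counts.get(c, 0) == best)
    return center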


If we plot this on the graph, we can see the shift in the location of the cluster centers.


If we go through the same distance calculation process as before we see the distances come out like this

You may notice that although the distances are different, the assignments are the same.  The red cluster center is still closest to points 3, 5, and 7 and the green cluster center is closest to points 1, 2, 4, and 6.  That's our signal to stop the algorithm and give the final assignments.  They look like this


In 2D space this final answer to the clustering problem doesn't look so awesome, but as far as I can tell with my limited 4-dimensional vision it's the optimal solution with the seeds we started with.  The total distance from the cluster centers to the assigned data points can be calculated from the distances in the table above.  Red distances = 0.465 + 0.592 + 0.686 = 1.743.  Green distances = 0.996 + 0.311 + 0.437 + 0.591 = 2.335.  The total distance is 4.078.  As mentioned, we have no guarantee that this is the absolute best way to cluster these data points.  I used Excel to run this clustering hundreds of times and found a better solution whose total distance was only 3.229.  The clustering for that solution looks like this.


This looks a lot like the solution for the non categorical problem.  It seems that the clusters we found in that example are still pretty significant in this example.

That summarizes how to do K-means clustering with numerical and non-numerical data.  I hope this has helped somebody out there.  Let me know if you have any comments or questions about anything in this post.