Cluster analysis partitions the marks in the view into clusters, where the marks within each cluster are more similar to one another than they are to marks in other clusters. Tableau distinguishes clusters using color.
Note: For additional insight into how clustering works in Tableau, see the blog post Understanding Clustering in Tableau 10.
Tableau uses the k-means algorithm for clustering. For a given number of clusters k, the algorithm partitions the data into k clusters. Each cluster has a center (centroid) that is the mean value of all the points in that cluster. K-means locates centers through an iterative procedure that minimizes distances between individual points in a cluster and the cluster center. In Tableau, you can specify a desired number of clusters, or have Tableau test different values of k and suggest an optimal number of clusters (see Determining the optimal number of clusters).
K-means requires an initial specification of cluster centers. Starting with one cluster, the method chooses a variable whose mean is used as a threshold for splitting the data in two. The centroids of these two parts are then used to initialize k-means to optimize the membership of the two clusters. Next, one of the two clusters is chosen for splitting and a variable within that cluster is chosen whose mean is used as a threshold for splitting that cluster in two. K-means is then used to partition the data into three clusters, initialized with the centroids of the two parts of the split cluster and the centroid of the remaining cluster. This process is repeated until a set number of clusters is reached.
Tableau uses Lloyd’s algorithm with squared Euclidean distances to compute the k-means clustering for each k. Combined with the splitting procedure to determine the initial centers for each k > 1, the resulting clustering is deterministic, with the result dependent only on the number of clusters.
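A single splitting step of the initialization procedure could look roughly like the sketch below. The text does not say how the splitting variable is chosen, so the sketch assumes the variable with the largest variance. The function names `centroid` and `split_at_mean` are hypothetical, and this is an illustration rather than Tableau's implementation.

```python
def centroid(points):
    """Mean of each coordinate across a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def split_at_mean(points):
    """Split one cluster in two at the mean of a single variable.

    Assumption: the variable with the largest variance is used; the
    documentation does not state how the variable is selected.
    Returns the centroids of the two parts, which would then seed k-means.
    """
    center = centroid(points)
    n_vars = len(center)
    variances = [
        sum((p[d] - center[d]) ** 2 for p in points) / len(points)
        for d in range(n_vars)
    ]
    d = max(range(n_vars), key=variances.__getitem__)

    # The mean of the chosen variable is the splitting threshold.
    low = [p for p in points if p[d] <= center[d]]
    high = [p for p in points if p[d] > center[d]]
    if not high:
        raise ValueError("all points are identical; nothing to split")
    return centroid(low), centroid(high)


points = [(1.0, 10.0), (2.0, 11.0), (9.0, 10.0), (10.0, 11.0)]
seed_a, seed_b = split_at_mean(points)
print(seed_a, seed_b)
```

Because every step is a fixed rule with no random draws, repeating the split on the same data always yields the same seeds, which is consistent with the deterministic behavior described above.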
The algorithm starts by picking initial cluster centers.
It then partitions the marks by assigning each to its nearest center.
Then it refines the results by computing new centers for each partition by averaging all the points assigned to the same cluster.
It then reviews the assignment of marks to clusters and reassigns any marks that are now closer to a different center than before.
The clusters are redefined and marks are reassigned iteratively until no more changes are occurring.
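The assign-and-update loop described above can be sketched in plain Python as follows. This is a minimal illustration of Lloyd's algorithm with squared Euclidean distances, not Tableau's code. The function names, the `max_iter` safeguard, and the handling of empty clusters (the old center is kept) are assumptions made for the sketch.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, centers, max_iter=100):
    """Alternate assignment and center updates until nothing changes."""
    centers = [tuple(c) for c in centers]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(len(centers)),
                key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no mark changed cluster, so the procedure has converged
        assignment = new_assignment

        # Update step: each center becomes the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return assignment, centers


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = lloyd_kmeans(points, centers=[points[0], points[3]])
print(labels)
print(centers)
```

With the two initial centers chosen here, the first three points end up in one cluster and the last three in the other.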
Tableau uses the Calinski-Harabasz criterion to assess cluster quality. The Calinski-Harabasz criterion is defined as
$$CH(k) = \frac{SS_B / (k - 1)}{SS_W / (N - k)}$$
where $SS_B$ is the overall between-cluster variance, $SS_W$ the overall within-cluster variance, $k$ the number of clusters, and $N$ the number of observations.
The greater the value of this ratio, the more cohesive the clusters (low within-cluster variance) and the more distinct/separate the individual clusters (high between-cluster variance).
Since the Calinski-Harabasz index is not defined for k=1, it cannot be used to detect one-cluster cases.
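As an illustration, the index can be computed as (SSB / (k - 1)) / (SSW / (N - k)). The sketch below treats SSB and SSW as sums of squared distances, which is the usual convention for this index. The function names are hypothetical, and returning infinity when SSW is zero is an assumption of the sketch.

```python
def centroid(points):
    """Mean of each coordinate across a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def calinski_harabasz(points, labels):
    """Ratio of between-cluster to within-cluster dispersion."""
    n = len(points)
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    k = len(clusters)
    if k < 2 or k >= n:
        # k = 1 would divide by zero in (k - 1); k = N in (N - k).
        raise ValueError("the index requires 2 <= k < N")

    overall = centroid(points)
    ssb = 0.0  # between-cluster sum of squares
    ssw = 0.0  # within-cluster sum of squares
    for members in clusters.values():
        c = centroid(members)
        ssb += len(members) * squared_distance(c, overall)
        ssw += sum(squared_distance(p, c) for p in members)

    if ssw == 0:
        return float("inf")  # assumption: perfectly tight clusters
    return (ssb / (k - 1)) / (ssw / (n - k))


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = calinski_harabasz(points, [0, 0, 1, 1])  # tight, well-separated
poor = calinski_harabasz(points, [0, 1, 0, 1])  # stretched, overlapping
print(good, poor)
```

The grouping that keeps nearby points together scores far higher than the one that pairs distant points, matching the interpretation of the ratio given above.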
If you do not specify the number of clusters, Tableau picks the number of clusters corresponding to the first local maximum of the Calinski-Harabasz index. By default, k-means will be run for up to 25 clusters if the first local maximum of the index is not reached for a smaller value of k. If you specify the number of clusters yourself, the maximum value you can set is 50.
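The selection rule could be sketched as below. The index values are made up for illustration, and the function name `first_local_maximum` is hypothetical. The sketch also assumes that the smallest k counts as a local maximum if the index drops immediately afterward, and that the largest tested k is returned when the index never drops; the text does not specify either edge case.

```python
def first_local_maximum(scores):
    """Return the first k whose index value is followed by a lower one.

    scores maps each tested k (consecutive integers starting at 2)
    to its Calinski-Harabasz index value.
    """
    ks = sorted(scores)
    for prev_k, next_k in zip(ks, ks[1:]):
        if scores[next_k] < scores[prev_k]:
            return prev_k
    # Assumption: if the index never decreases, fall back to the largest k.
    return ks[-1]


# Hypothetical index values for k = 2..5.
scores = {2: 120.0, 3: 310.5, 4: 280.2, 5: 400.0}
best_k = first_local_maximum(scores)
print(best_k)
```

Note that the rule stops at k = 3 in this example even though k = 5 has a higher value, because it looks for the first local maximum rather than the global one.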
Note: If a categorical variable (that is, a dimension) has more than 25 unique values, then Tableau will disregard that variable when computing clusters.
When a measure has null values, Tableau assigns the rows that contain nulls to a Not Clustered category. Categorical variables (that is, dimensions) that return * for ATTR (meaning that not all values are identical) are also not clustered.
Tableau scales values automatically so that columns having a larger range of magnitudes don’t dominate the results. For example, an analyst could be using inflation and GDP as input variables for clustering, but because GDP values are in trillions of dollars, this could cause the inflation values to be almost completely disregarded in the computation. Tableau uses a scaling method called min-max normalization, in which each value of a variable is mapped to a value between 0 and 1 by subtracting the variable's minimum and dividing by its range.
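A minimal sketch of min-max normalization, using made-up GDP and inflation figures, is shown below. The function name `min_max_scale` is hypothetical, and mapping a constant column to 0.0 is an assumption of the sketch, since the range would otherwise be zero.

```python
def min_max_scale(values):
    """Map each value to [0, 1] as (value - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical inputs on very different scales.
gdp = [1.0e12, 3.0e12, 21.0e12]   # dollars
inflation = [1.5, 2.0, 6.5]       # percent

print(min_max_scale(gdp))
print(min_max_scale(inflation))
```

After scaling, both variables span the same 0 to 1 range, so neither dominates the squared Euclidean distances used by k-means.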