About: This article describes the Clustering subtab in the Analysis tab.
Table of Contents
- Introduction
- Clustering
- Binning
- Clustering Models
- Clustering Analysis Controls/Options
- Video Tutorial
- Related Articles
Introduction
The clustering feature categorizes the records in your dataset into bins, using one or more variables as the criteria for the binning process (see Binning below). Predict uses the “Lloyd’s K-Means Clustering” algorithm; however, there are several other methods for binning records based on their observed values.
Information is very often stored either as a number on a spectrum or as a category. Predictive modeling benefits from both kinds of data, and clusters can help your model capture behaviors even better. For example, imagine that a characteristic in your dataset, like SAT scores, has a fairly straightforward relationship with retention: the higher the score, the more likely the student is to be retained.
But there are many cases where there will be deviations from this “pretty straightforward” relationship. These are the cases where a cluster will be useful. The linear term itself can’t do much to improve the fit. Using a series of clusters, as specified below, we could fine-tune our model. The nice part is, with the clusters formed, computers take care of this correction for us.
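To make the idea of a cluster-based correction concrete, here is a minimal sketch in Python. The SAT scores, retention rates, and cluster labels are made-up toy numbers, and the two-step approach (fit a line, then subtract each cluster's average leftover error) is a simplification chosen for illustration. It is not a description of how Predict fits its models, where cluster variables would normally enter the model alongside the linear term.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return mean_y - slope * mean_x, slope

# Made-up toy data: retention rises with SAT score, except for a dip in the middle band.
sat = [900, 950, 1000, 1150, 1200, 1250, 1400, 1450, 1500]
retention = [0.60, 0.62, 0.64, 0.58, 0.59, 0.60, 0.80, 0.82, 0.84]
cluster = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # hypothetical cluster labels

intercept, slope = fit_line(sat, retention)
residuals = [y - (intercept + slope * x) for x, y in zip(sat, retention)]

# Per-cluster correction: the average residual within each cluster.
correction = {}
for c in set(cluster):
    member_residuals = [r for r, lab in zip(residuals, cluster) if lab == c]
    correction[c] = sum(member_residuals) / len(member_residuals)

# Compare the squared error of the line alone with the line plus cluster corrections.
sse_linear = sum(r ** 2 for r in residuals)
sse_with_clusters = sum((r - correction[lab]) ** 2 for r, lab in zip(residuals, cluster))
print(sse_linear, sse_with_clusters)
```

Because each correction is the mean of the residuals in its cluster, the second error total can never be larger than the first. The deviation that the straight line misses is absorbed by the cluster term.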
Clustering
In this case, “Lloyd’s Algorithm” forms “k” categories, where k is a number that the user specifies. The algorithm then assigns each record in your data to one of those categories. The perk of creating these categories through K-means, instead of one of the other methods, is that the algorithm’s clusters take into consideration the distribution exhibited in your data. A manually generated binning scheme might include, for example, a “50 to 54” age category, but K-means clustering would not create that category if the dataset has no records in that age range. K-means clusters reflect the data rather than a predefined, arbitrary scheme. Because the clusters depend on the data they were generated from, strict comparisons across analyses require importing identical clustering models.
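The two alternating steps of Lloyd's algorithm can be sketched in a few lines of plain Python. This is a generic textbook version, not Predict's implementation; the function name `lloyd_kmeans`, the random starting points, the iteration cap, and the sample ages are all assumptions made for illustration.

```python
import random

def lloyd_kmeans(points, k, iterations=100, seed=0):
    """Return (centroids, labels) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k records chosen at random
    labels = None
    for _ in range(iterations):
        # Assignment step: each record joins its nearest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:  # no record changed cluster: converged
            break
        labels = new_labels
    return centroids, labels

# Hypothetical ages with no records between 39 and 59.
ages = [(18,), (19,), (20,), (21,), (35,), (36,), (38,), (60,), (62,), (65,)]
centroids, labels = lloyd_kmeans(ages, k=3)
print(centroids, labels)
```

The result depends on the starting centroids, so different seeds can produce different clusters for the same data. That dependence is one reason identical clustering models are needed when results must be compared strictly.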
Binning
Binning refers to the process of grouping records in your dataset according to a specified rule. Binning uses the raw values found within columns to identify which larger group each record belongs to. This could be hard-coded, as in the case of fixed values, like “0-5”, “6-10”, “11-15”, or dynamic, as in the case of the K-Means Clustering. The end result is an additional column which serves as an over-arching grouping variable.
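As an illustration of the hard-coded case, the sketch below applies the fixed ranges mentioned above to a few records and stores the result in a new column. The column name `visits`, the record layout, and the `"other"` fallback label are hypothetical, and the rule assumes whole-number values.

```python
def fixed_bin(value):
    """Hard-coded binning rule using the ranges 0-5, 6-10, and 11-15."""
    if 0 <= value <= 5:
        return "0-5"
    if 6 <= value <= 10:
        return "6-10"
    if 11 <= value <= 15:
        return "11-15"
    return "other"  # hypothetical fallback for values outside the listed ranges

# Hypothetical records; "visits" is an invented column name.
records = [
    {"id": 1, "visits": 3},
    {"id": 2, "visits": 7},
    {"id": 3, "visits": 12},
]

# The binning result becomes an additional grouping column on each record.
for record in records:
    record["visits_bin"] = fixed_bin(record["visits"])

print([r["visits_bin"] for r in records])
```

A dynamic scheme such as K-means replaces the fixed boundaries in `fixed_bin` with boundaries derived from the data itself.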
Clustering Models
A Clustering Model is the decision rule created by the Clustering process. It allocates records to the appropriate bins based on their values in whichever columns were referenced during clustering.
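For K-means, one common way to express such a decision rule is "assign each record to the nearest cluster mean." The sketch below assumes the model is stored as a list of cluster means; that storage format, the function name `assign_cluster`, and the numbers are assumptions for illustration, not a description of Predict's internal model format.

```python
def assign_cluster(record, centroids):
    """Return the index of the centroid closest to the record (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(record, centroids[j])),
    )

# Hypothetical cluster means saved from an earlier clustering run on one column (age).
saved_centroids = [(19.5,), (36.0,), (62.0,)]

# Applying the same saved rule to new records keeps cluster assignments comparable.
new_records = [(25,), (40,), (70,)]
assignments = [assign_cluster(r, saved_centroids) for r in new_records]
print(assignments)
```

Reusing the saved means, rather than generating fresh clusters for each dataset, is what keeps the bins identical from one analysis to the next.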
Clustering Analysis Controls/Options
- K (number of clusters): This dropdown lets you choose the number of “clusters” or “groupings” you want to form.
- X Axis (projected): This allows you to choose which of your “Included Variables” will be displayed along the x axis (horizontal) of the “Clustering Visualization.”
- Y Axis (projected): This allows you to choose which of your “Included Variables” will be displayed along the y axis (vertical) of the “Clustering Visualization.”
- Granularity: The number in this box sets how many data points will be displayed on the “Clustering Visualization.” You can adjust this number manually, or with the slider.
- Generate Clusters: Clicking this tells the software to create the desired number of clusters with respect to any variables that you’ve included.
- Save Variables: This adds the cluster variable you created to the dataset, and enables you to use it for analyses in the various other tabs of Veera Predict.
- Clustering Visualization: This graph plots data points with respect to their characteristics (which are labeled on the x and y axes). The color coding serves to identify which cluster the data point belongs to, while the black squares represent the mean associated with each cluster.
- Computed Means: This tabular readout identifies the record count of each cluster (the “Population” column) as well as the mean(s) of the variable(s) used in the clustering process.
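The kind of summary shown in the Computed Means readout can be reproduced with a short sketch: count the records in each cluster and average each clustering variable within it. The function name `computed_means`, the two sample variables (an SAT score and a GPA), and the labels are hypothetical.

```python
from collections import defaultdict

def computed_means(points, labels):
    """Return {cluster: {"population": count, "means": per-variable means}}."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {
        label: {
            "population": len(members),
            "means": tuple(sum(col) / len(members) for col in zip(*members)),
        }
        for label, members in sorted(groups.items())
    }

# Hypothetical records (SAT score, GPA) with cluster labels already assigned.
points = [(1000, 3.0), (1100, 3.4), (1400, 3.8), (1500, 3.6)]
labels = [0, 0, 1, 1]

summary = computed_means(points, labels)
for label, row in summary.items():
    print(label, row["population"], row["means"])
```

Each row's means are the coordinates of that cluster's mean marker in the visualization, projected onto whichever two variables are selected for the axes.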