About: This article provides a glossary of key terms found throughout the Predict module.
Predict
Key Word |
Definition |
Analysis / Saved Analysis |
The tab that contains all subtabs, memorized models, filters or transforms, and reports is known as the "analysis". Analyses can be saved as a .vpa (Veera Predict Analysis) file type. Saved analyses can be re-opened using the Load Saved Analysis option on the Workspace tab. |
Automine | Automine, also known as "Automated Mining", is a tab within the Model Subtab that allows users to quickly evaluate all variables against the Y-variable to find any statistical relationships that exist. |
Binary Data |
A data type consisting of only two possible categories, “0” or “1”. Predictive models can use Binary data as either the Y-variable or, if it is related to the outcome, an independent variable. |
Build | An option to execute the model fitting process using only the variables manually added to the Included Variables section. |
Build Automatically | An option to execute the model fitting process using a Stepwise Regression method and considering every variable that is available according to the significance determined in the automine process. No manual variable selection is required. |
Build Stepwise | An option to execute the model fitting process using a Stepwise Regression method (adding and removing independent variables iteratively and testing for significance after each iteration), using only the variables that have been added to the Included Variables section. |
Categorical Data | A data type consisting of a finite number of possible categories or types (e.g., Gender {male, female}). Categorical data may not be used as an outcome but can be used as a variable in the model. |
Clustering |
Clustering partitions similar data points into groups by finding the minimum distance between each data point and a centroid that defines the center of each cluster with respect to one or more chosen characteristics. The process forms k (a user-chosen number) clusters. |
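The centroid-based process described above can be sketched in a few lines. This is a minimal illustration of the general k-means idea, not Predict's actual implementation; the function names, the fixed iteration count, and the requirement to supply starting centroids are all assumptions made for the example.

```python
import math

def assign_to_nearest(points, centroids):
    """For each point, return the index of the closest centroid (Euclidean distance)."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and moving centroids to the cluster mean."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        labels = assign_to_nearest(points, centroids)
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                # New centroid = coordinate-wise mean of the cluster's members.
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # Keep the old centroid if no point was assigned to it.
                updated.append(old)
        centroids = updated
    return assign_to_nearest(points, centroids), centroids

# Example with k = 2 starting centroids (hypothetical data).
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centers = k_means(points, [(0, 0), (10, 10)])
```

Here k is simply the number of starting centroids passed in.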
Coefficient |
Each variable that enters a predictive model is assigned a "coefficient" number. These coefficients, along with the intercept, make up the formula of the predictive model. Each coefficient is multiplied by the variable's value to arrive at the variable's contribution to a record's probability score. A negative coefficient |
Concordance |
A means of evaluating a logistic regression model’s predictive quality. Concordance is only calculated when the Y-variable is binary (0 or 1). Each record in a dataset is assigned a probability score by the model. Then, each record where the Y-variable = 1 is paired with every record where the Y-variable = 0. A pair is concordant when the record with Y-variable = 1 receives the higher probability score. The ratio of concordant pairs to the total number of pairs is represented as the Percent Concordant. |
Connection | A pointer to a data source that allows data to enter Predict. |
Continuous Data | A data type consisting of a number that may include decimals as an approximation of a real number. |
Correlation Analysis | The correlation coefficient is a number between -1 and +1 that measures the degree of association between two variables. A positive value implies a positive association while a negative value implies an inverse association. |
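As an illustration of how such a coefficient is computed, the sketch below implements the Pearson correlation coefficient. The glossary does not state which correlation measure Predict uses, so Pearson is an assumption here, and the function name is hypothetical.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of products of deviations from the means.
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical data: y rises with x in the first call and falls in the second.
positive = pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8])
negative = pearson_correlation([1, 2, 3, 4], [8, 6, 4, 2])
```

The first call returns a value at (or within rounding of) +1 and the second a value at -1, matching the positive and inverse associations described above.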
Date Data | A data type consisting of a calendar date. Date data can enter Predict but cannot be utilized by a predictive model. |
Decile Analysis |
A model visualization that represents the accuracy of the predicted outcomes in comparison to the actual outcomes. A decile analysis is created by scoring a dataset and then sorting by the score from high to low. The dataset is then divided into 10 equal bins, and the average actual value is calculated for each bin. This is a good way to gauge the fit of a model before moving forward. An ideal representation contains a stair-step pattern with the highest columns on the left gradually decreasing towards the right of the chart. |
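The steps above (score, sort high to low, split into 10 bins, average the actual values) can be sketched as follows. This is an illustrative outline only; the function name and the way records are split when the count is not divisible by 10 are assumptions, not Predict's documented behavior.

```python
def decile_analysis(scores, actuals, bins=10):
    """Return the average actual value per bin, highest-scoring bin first."""
    if len(scores) < bins:
        raise ValueError("need at least one record per bin")
    # Sort records by model score, high to low.
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    averages = []
    for i in range(bins):
        # Integer boundaries keep the bins as equal in size as possible.
        chunk = ranked[i * n // bins:(i + 1) * n // bins]
        averages.append(sum(actual for _, actual in chunk) / len(chunk))
    return averages

# Hypothetical scored dataset: the outcome is 1 for the higher-scoring half.
scores = [i / 20 for i in range(20)]
actuals = [1 if s >= 0.5 else 0 for s in scores]
deciles = decile_analysis(scores, actuals)
```

Plotting the returned list as columns from left to right gives the chart described above; a well-fitting model produces values that decrease from the first bin to the last.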
Discordance |
A means of evaluating a logistic regression model’s predictive quality. Discordance is only calculated when the Y-variable is binary (0 or 1). Each record in a dataset is assigned a probability score by the model. Then, each record where the Y-variable = 1 is paired with every record where the Y-variable = 0. A pair is discordant when the record with Y-variable = 1 receives the lower probability score. The ratio of discordant pairs to the total number of pairs is represented as the Percent Discordant. |
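The pairing procedure described in the Concordance and Discordance entries can be sketched as below. The function name is hypothetical, and counting tied pairs separately is an assumption; the glossary does not say how Predict treats pairs with equal scores.

```python
def concordance_summary(scores, outcomes):
    """Pair every Y = 1 record with every Y = 0 record and classify each pair."""
    ones = [s for s, y in zip(scores, outcomes) if y == 1]
    zeros = [s for s, y in zip(scores, outcomes) if y == 0]
    total_pairs = len(ones) * len(zeros)
    if total_pairs == 0:
        raise ValueError("need at least one record of each outcome")
    # Concordant: the Y = 1 record has the higher score. Discordant: the lower.
    concordant = sum(1 for a in ones for b in zeros if a > b)
    discordant = sum(1 for a in ones for b in zeros if a < b)
    tied = total_pairs - concordant - discordant
    return {
        "percent_concordant": 100 * concordant / total_pairs,
        "percent_discordant": 100 * discordant / total_pairs,
        "percent_tied": 100 * tied / total_pairs,
    }

# Hypothetical scores: 2 records with Y = 1 and 3 with Y = 0 give 6 pairs.
summary = concordance_summary([0.9, 0.4, 0.3, 0.2, 0.6], [1, 1, 0, 0, 0])
```

In this example five of the six pairs are concordant and one is discordant (the Y = 1 record scored 0.4 against the Y = 0 record scored 0.6).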
Field Delimiter |
A character that defines the beginning or end of a value, field, or variable. Example: "Column1, Column2". The comma is the delimiter between Column1 and Column2. |
Frequency Analysis |
A tool in the Analyze Subtab that displays the number of records that fall into each category of any binary or categorical variable in the dataset. An option to include an additional binary or categorical variable to create a contingency table for the frequencies, row percentages, column percentages, and total percentages is also available. |
Hold-out Sample |
A hold-out sample is a random sample from the data set that is withheld and not included in the modeling process. After a model is built using the non-hold-out data (the "training" data), it is then applied to the hold-out sample to test and validate the accuracy of the model. The default hold-out sample size in Predict is 50% of records withheld for validation. This can be adjusted in the Model Options tab. |
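A random hold-out split can be sketched as follows. The 50% default mirrors the default described above; the function name, the `seed` parameter, and the shuffle-then-slice approach are assumptions for illustration rather than a description of how Predict draws its sample.

```python
import random

def split_hold_out(records, hold_out_fraction=0.5, seed=0):
    """Randomly split records into (training, hold_out) lists."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_hold_out = round(len(shuffled) * hold_out_fraction)
    hold_out = shuffled[:n_hold_out]
    training = shuffled[n_hold_out:]
    return training, hold_out

# Hypothetical dataset of 100 record IDs.
training, hold_out = split_hold_out(range(100))
```

The model is fit on `training` only and then scored against `hold_out` to check how well it generalizes.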
Included Variables | The subset of variables from a dataset that get included into the model build either by manual or automatic addition to the "Included Variables" section on the Model subtab. |
Intercept | A coefficient calculated during model fitting and included in the model formula so that the fitted line/plane/hyperplane is not forced to pass through the origin (the 0,0 coordinate). Including the intercept allows a better fit of the model to the data. |
Logistic Regression | A type of regression analysis used when the Y-variable is a binary variable. The outcome of predicting the binary variable is a probability value (0-1.0) relating the likelihood of the positive outcome (Y-variable = 1) to occur. |
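To show how a logistic regression formula turns an intercept and coefficients into a probability, here is a minimal sketch using the standard logistic function. The variable names and numbers are hypothetical and do not come from any Predict model.

```python
import math

def logistic_probability(intercept, coefficients, record):
    """Probability that Y-variable = 1 for one record.

    coefficients and record are dicts keyed by variable name.
    """
    # Each coefficient is multiplied by the record's value for that variable.
    linear_score = intercept + sum(
        coef * record[name] for name, coef in coefficients.items()
    )
    # The logistic function maps any real number into the range 0 to 1.
    return 1 / (1 + math.exp(-linear_score))

# Hypothetical model with a single variable.
probability = logistic_probability(-1.0, {"hs_gpa": 0.5}, {"hs_gpa": 2.0})
```

In this example the linear score is -1.0 + 0.5 × 2.0 = 0, which the logistic function maps to a probability of 0.5.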
Mean | The sum of observations divided by the total number of observations. |
Means Analysis | A tool in the Analyze Subtab that allows the average value of a numeric variable to be calculated and disaggregated by categories within other binary or categorical variables in the dataset. |
Missing Handling |
An option available in the Automine Tab to determine what happens to records that are null during the modeling process. This is determined at a variable/column level. Options include:
|
Memorized Model | A model formula that gets stored within the analysis for a model built with specific filters, transformations, inclusions and exclusions. Multiple versions of models can be 'memorized' within the same analysis. Model formulas can be saved outside of Predict as .vpsm (Veera Predict Scoring Model) file types by selecting the memorized model and using the save icon. |
Model | A mathematical process used to predict future outcomes based on historical patterns. |
Model Availability |
An option available in the Automine Tab to determine whether a variable is available for the modeling process. The default setting is that any variable that is related to the Y-Variable is available during the model fitting. Any selection other than the default manually overrides the software's determination of availability/related-ness. Model Availability options include:
|
Model Steps | A tool available at the time of a model building in the bottom left corner of the Model Subtab. The model steps explain each iteration of the stepwise regression, which variables were considered, entered the model, or removed from consideration. |
Multivariate Analysis | An analysis involving the relationship between two or more variables at a time. The multivariate option in the Visualize Subtab displays a visual representation of the effect and association of two or three variables to one another. |
Odds Ratio |
A measure of association between an independent variable (exposure) and the dependent variable (outcome). The odds ratio conveys the impact of a one unit increase or decrease of the independent variable on the odds of a "success" outcome. Odds ratios can be interpreted by their distance from 1.0. OR = 1: Exposure does not affect the odds of a "success" outcome. |
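In logistic regression, the odds ratio for a variable is obtained by exponentiating its coefficient. The sketch below illustrates that relationship; the function name and example coefficients are hypothetical.

```python
import math

def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in the variable: exp(coefficient)."""
    return math.exp(coefficient)

# A coefficient of 0 means the variable does not change the odds.
no_effect = odds_ratio(0.0)
# A coefficient of ln(2) doubles the odds for each one-unit increase.
doubling = odds_ratio(math.log(2))
```

A coefficient of 0 therefore corresponds to OR = 1, the "no effect" reference point described above.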
Ordinary Least Squares Regression |
Also known as "OLS" or linear regression. A type of regression analysis used when the Y-variable is a continuous variable. OLS estimates the coefficients of a linear regression equation to model the relationship between independent variables and the dependent variable (Y-variable). The outcome of predicting a continuous variable using OLS is a point estimate of the value. |
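For the simplest case of one independent variable, the OLS estimates have a closed form, sketched below. The function name and data are hypothetical, and models with several variables are normally fit with matrix methods rather than this formula.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing squared error for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data lying exactly on the line y = 1 + 2x.
intercept, slope = fit_simple_ols([0, 1, 2], [1, 3, 5])
# Point estimate for a new record with x = 4.
prediction = intercept + slope * 4
```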
Outcome |
The event or occurrence that the regression model is built to predict. Synonyms: Y-Variable, Dependent Variable, Response. |
Outliers | Values that are atypical (by definition), or infrequently observed. |
P-Value | The probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. In regression, the null hypothesis is that a variable is not associated with the outcome. A small p-value means the observed association would be unlikely to arise from random variation alone, so the null hypothesis is rejected and the variable is considered significantly associated with the outcome. |
Percent Concordance |
The ratio of concordant pairs to the total number of pairs. See Concordance. |
Percent Discordance | The ratio of discordant pairs to the total number of pairs. See Discordance. |
Predict (verb) | A formulaic process of evaluating what the likelihood of an outcome is based on historic patterns. |
Profiling Analysis |
A tool in the Analyze Subtab that performs a hypothesis test for each variable in the dataset across the two categories from any binary variable. Example: Testing whether the average High School GPA is significantly different between students who enrolled vs. did not enroll. |
Score | The process of applying a model formula to a cohort of records to "score" or predict the value/likelihood of an outcome. |
Standard Error (S.E) | The standard error indicates the extent of deviation of regression coefficients across cases and helps to measure the precision of the estimate of the coefficient. The smaller the standard error, the more precise the estimate is. |
Suggest Variable | A tool in the Model Subtab that adds one variable at a time from the Variables pool to the model, choosing the variable identified as the best addition given what has already been included. |
Text Data | A data type consisting of any combination of alpha-numeric characters. Text data can enter Predict but cannot be utilized by a predictive model. |
Transformations (variable) | Mathematical functions applied to the original variables in the dataset to generate additional variables for consideration by the model. Variable transformations are automatically generated and included in the variable pool if they are related to the outcome. Selecting the "View 'new variable' suggestions" box under the Variables section will show related variable transformations. |
Univariate Analysis | An analysis involving a single variable. The univariate option in the Visualize Subtab displays a visual representation of the distribution and summary statistics of the selected variable. |
Variable | A column/field/attribute in a dataset. |
Wald Chi-Square Test Statistic |
The Wald statistic is analogous to the t-test in linear regression. It is used to assess whether a regression coefficient is significantly different from zero. The Wald statistic is the ratio of the square of the regression coefficient to the square of the standard error of the coefficient and is asymptotically distributed as a chi-square distribution. |
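The ratio described above can be computed directly from a coefficient and its standard error. The sketch below also derives a p-value, assuming a chi-square distribution with one degree of freedom (the usual case for a single coefficient); the function names and example numbers are hypothetical.

```python
import math

def wald_chi_square(coefficient, standard_error):
    """Wald statistic: squared coefficient divided by squared standard error."""
    return (coefficient / standard_error) ** 2

def wald_p_value(wald_statistic):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    # With 1 degree of freedom, P(X > w) equals erfc(sqrt(w / 2)).
    return math.erfc(math.sqrt(wald_statistic / 2))

# Hypothetical coefficient of 2.0 with a standard error of 1.0.
statistic = wald_chi_square(2.0, 1.0)
p_value = wald_p_value(statistic)
```

A larger Wald statistic yields a smaller p-value, indicating stronger evidence that the coefficient differs from zero.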
Y-Variable |
An outcome that depends on several other factors. The variable in a dataset that is selected to predict for. Synonyms: Outcome, Response, Dependent Variable |