About: This article provides a glossary of key terms found throughout the Predict module.
Predict
Key Word 
Definition 
Analysis / Saved Analysis 
The tab that contains all subtabs, memorized models, filters or transforms, and reports is known as the "analysis". Analyses can be saved as a .vpa (Veera Predict Analysis) file type. Saved analyses can be reopened using the Load Saved Analysis option on the Workspace tab. 
Automine  Automine, also known as "Automated Mining", is a tab within the Model Subtab that allows users to quickly evaluate all variables against the Y-variable to find any statistical relationships that exist.
Binary Data 
A data type consisting of only two possible categories, “0” or “1”. Predictive models can use binary data either as the Y-variable or, if it is related to the outcome, as an independent variable that receives a coefficient.
Build  An option to execute the model fitting process using only the variables manually added to the Included Variables section. 
Build Automatically  An option to execute the model fitting process using a Stepwise Regression method and considering every variable that is available according to the significance determined in the automine process. No manual variable selection is required. 
Build Stepwise  An option to execute the model fitting process using a Stepwise Regression method (adding and removing independent variables iteratively and testing for significance after each iteration), using only the variables that have been added to the Included Variables section.
Categorical Data  A data type consisting of a finite number of possible categories or types (e.g., Gender: male, female). Categorical data may not be used as an outcome but can be used as a variable in the model.
Clustering 
Clustering partitions similar data points into groups by finding the minimum distance between each data point and a centroid that defines the center of each cluster with respect to one or more chosen characteristics. The process forms k (a user-chosen number) clusters.
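The description above matches the general k-means approach. The following is a minimal sketch of that approach, not Predict's actual implementation; the function name `k_means`, the fixed iteration count, and the random choice of starting centroids are all assumptions made for illustration.

```python
import math
import random

def k_means(points, k, iterations=20, seed=0):
    """Partition points (tuples of numbers) into k clusters.

    Returns (centroids, labels). Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random records
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels
```

Calling `k_means(points, 2)` on two well-separated groups of points should assign each group its own label.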
Coefficient 
Each variable that enters a predictive model is assigned a "coefficient" number. These coefficient numbers, along with the intercept, are what create the formula of the predictive model. Each coefficient is multiplied by the variable's value to arrive at the variable's contribution to a record's probability score. A negative coefficient 
Concordance 
A means of evaluating a logistic regression model’s predictive quality. Concordance is only calculated when the Y-variable is binary (0 or 1). Each record in a dataset is assigned a probability score by the model. Then, each record where the Y-variable = 1 is paired with every record where the Y-variable = 0. A pair is concordant when the record where the Y-variable = 1 has the higher probability score. The ratio of concordant pairs to the total number of pairs is represented as the Percent Concordant.
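The pairing procedure can be sketched in a few lines. This is an illustrative sketch rather than Predict's own code: the function name `concordance` is hypothetical, and it assumes the common convention that a pair is concordant when the Y = 1 record scores higher, discordant when it scores lower, and tied otherwise.

```python
def concordance(actuals, scores):
    """Return percent concordant, discordant, and tied pairs.

    actuals: 0/1 outcomes; scores: model probability scores (same order).
    """
    ones = [s for y, s in zip(actuals, scores) if y == 1]
    zeros = [s for y, s in zip(actuals, scores) if y == 0]
    total_pairs = len(ones) * len(zeros)
    if total_pairs == 0:
        raise ValueError("need at least one record with Y = 1 and one with Y = 0")
    # Pair every Y = 1 record with every Y = 0 record and compare scores.
    concordant = sum(1 for s1 in ones for s0 in zeros if s1 > s0)
    discordant = sum(1 for s1 in ones for s0 in zeros if s1 < s0)
    tied = total_pairs - concordant - discordant
    return {
        "percent_concordant": 100.0 * concordant / total_pairs,
        "percent_discordant": 100.0 * discordant / total_pairs,
        "percent_tied": 100.0 * tied / total_pairs,
    }
```

The same function also yields the Percent Discordant, since both measures come from the same set of pairs.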
Connection  A pointer to a data source that allows data to enter Predict. 
Continuous Data  A data type consisting of a number that may include decimals as an approximation of a real number. 
Correlation Analysis  The correlation coefficient is a number between -1 and +1 that measures the degree of association between two variables. A positive value implies a positive association while a negative value implies an inverse association.
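As an illustration, the sketch below computes the Pearson correlation coefficient, the most common coefficient with this -1 to +1 range. The glossary does not state which coefficient Predict reports, so treating it as Pearson is an assumption, and `pearson_correlation` is a hypothetical name.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of co-deviations and sums of squared deviations from each mean.
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / math.sqrt(spread_x * spread_y)
```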
Date Data  A data type consisting of a calendar date. Date data can enter Predict but cannot be utilized by a predictive model. 
Decile Analysis 
A model visualization that represents the accuracy of the predicted outcomes in comparison to the actual outcomes. A decile analysis is created by scoring a dataset and then sorting by the score from high to low. The dataset is then divided into 10 equal bins, and the average actual value is calculated for each bin. This is a good way to gauge the fit of a model before moving forward. An ideal representation contains a stair-step pattern with the highest columns on the left gradually decreasing towards the right of the chart.
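The sort-bin-average steps can be sketched as follows. This is a rough illustration under stated assumptions: `decile_analysis` is a hypothetical name, and how Predict handles datasets that do not divide evenly into 10 bins is not documented here, so the sketch simply makes the bins as equal as integer division allows.

```python
def decile_analysis(scores, actuals, bins=10):
    """Average actual outcome per bin after sorting records by score, high to low."""
    if len(scores) < bins:
        raise ValueError("need at least one record per bin")
    # Sort records by model score, highest first.
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    averages = []
    for b in range(bins):
        # Integer arithmetic keeps the bin sizes as equal as possible.
        chunk = ranked[b * n // bins:(b + 1) * n // bins]
        averages.append(sum(actual for _, actual in chunk) / len(chunk))
    return averages
```

For a well-fitting model, the returned averages should decrease from the first bin to the last, which is the stair-step pattern described above.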
Discordance 
A means of evaluating a logistic regression model’s predictive quality. Discordance is only calculated when the Y-variable is binary (0 or 1). Each record in a dataset is assigned a probability score by the model. Then, each record where the Y-variable = 1 is paired with every record where the Y-variable = 0. A pair is discordant when the record where the Y-variable = 1 has the lower probability score. The ratio of discordant pairs to the total number of pairs is represented as the Percent Discordant.
Field Delimiter 
A character that defines the beginning or end of a value, field, or variable. Example: Column1, Column2. The comma is the delimiter between Column1 and Column2.
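To illustrate how a delimiter separates fields, here is a small sketch using Python's standard `csv` module. It is independent of Predict; `split_fields` is a hypothetical helper name.

```python
import csv
import io

def split_fields(text, delimiter=","):
    """Parse delimited text into rows of field values."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=delimiter,
        skipinitialspace=True,  # ignore a space that follows the delimiter
    )
    return list(reader)
```

With the default comma delimiter, `split_fields("Column1, Column2")` yields one row containing the two fields `Column1` and `Column2`; passing `delimiter="|"` handles pipe-delimited text the same way.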
Frequency Analysis 
A tool in the Analyze Subtab that displays the number of records that fall into each category of any binary or categorical variable in the dataset. An option to include an additional binary or categorical variable to create a contingency table for the frequencies, row percentages, column percentages, and total percentages is also available. 
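A frequency count and a two-variable contingency table of this kind could be computed roughly as below. This is an illustrative sketch, not the tool's implementation; the function names and the dictionary layout of the result are assumptions.

```python
from collections import Counter

def frequency_table(values):
    """Count how many records fall into each category."""
    return Counter(values)

def contingency_table(row_values, col_values):
    """Cross-tabulate two variables with a count and percentages per cell."""
    counts = Counter(zip(row_values, col_values))
    row_totals = Counter(row_values)
    col_totals = Counter(col_values)
    total = len(row_values)
    return {
        (r, c): {
            "count": n,
            "row_pct": 100.0 * n / row_totals[r],    # share of the row category
            "col_pct": 100.0 * n / col_totals[c],    # share of the column category
            "total_pct": 100.0 * n / total,          # share of all records
        }
        for (r, c), n in counts.items()
    }
```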
Holdout Sample 
A holdout sample is a random sample from the dataset that is withheld and not included in the modeling process. After a model is built using the non-holdout data (the "training" data), it is then applied to the holdout sample to test and validate the accuracy of the model. The default holdout sample size in Predict is 50% of records withheld for validation. This can be adjusted in the Model Options tab.
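A random holdout split can be sketched as follows. The 50% default mirrors the setting described above; the function name `split_holdout`, the seeded shuffle, and the rounding rule are assumptions for illustration and do not reflect how Predict draws its sample internally.

```python
import random

def split_holdout(records, holdout_fraction=0.5, seed=0):
    """Randomly split records into (training, holdout) lists."""
    rng = random.Random(seed)  # seeded so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * holdout_fraction)
    holdout = shuffled[:cut]    # withheld for validation
    training = shuffled[cut:]   # used to build the model
    return training, holdout
```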
Included Variables  The subset of variables from a dataset that are included in the model build, either by manual or automatic addition to the "Included Variables" section on the Model subtab.
Intercept  A constant term calculated during model fitting and included in the model formula so that the fitted line/plane/hyperplane is not forced to pass through the origin (0,0). Including the intercept allows a better fit of the model to the data.
Logistic Regression  A type of regression analysis used when the Y-variable is a binary variable. The outcome of predicting the binary variable is a probability value (0 to 1.0) representing the likelihood that the positive outcome (Y-variable = 1) occurs.
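The way an intercept and coefficients combine into a probability can be sketched with the standard logistic function. The function name, the dictionary-based record layout, and the variable names in the usage note are hypothetical; this is not a Predict scoring file format.

```python
import math

def logistic_score(intercept, coefficients, record):
    """Probability that Y = 1 for one record under a logistic model.

    coefficients: {variable name: coefficient}; record: {variable name: value}.
    """
    # Linear part: intercept plus each coefficient times its variable's value.
    z = intercept + sum(coef * record[name] for name, coef in coefficients.items())
    # The logistic function maps the linear part to a value between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))
```

For example, `logistic_score(-1.0, {"gpa": 0.5}, {"gpa": 3.0})` scores a record with a made-up `gpa` variable; a larger positive coefficient or value pushes the score toward 1.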
Mean  The sum of observations divided by the total number of observations. 
Means Analysis  A tool in the Analyze Subtab that allows the average value of a numeric variable to be calculated and disaggregated by categories within other binary or categorical variables in the dataset. 
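Disaggregating an average by category amounts to a grouped mean, sketched below. The function name `means_by_category` is hypothetical and the sketch says nothing about how the tool presents its results.

```python
from collections import defaultdict

def means_by_category(values, categories):
    """Average of a numeric variable within each category of another variable."""
    groups = defaultdict(list)
    for value, category in zip(values, categories):
        groups[category].append(value)
    # Mean = sum of observations divided by the number of observations.
    return {category: sum(vals) / len(vals) for category, vals in groups.items()}
```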
Missing Handling 
An option available in the Automine Tab to determine what happens to records with null (missing) values during the modeling process. This is determined at a variable/column level. Options include:

Memorized Model  A model formula that gets stored within the analysis for a model built with specific filters, transformations, inclusions and exclusions. Multiple versions of models can be 'memorized' within the same analysis. Model formulas can be saved outside of Predict as .vpsm (Veera Predict Scoring Model) file types by selecting the memorized model and using the save icon. 
Model  A mathematical process used to predict future outcomes based on historical patterns. 
Model Availability 
An option available in the Automine Tab to determine whether a variable is available for the modeling process. The default setting is that any variable that is related to the Y-Variable is available during the model fitting. Any selection other than the default manually overrides the software's determination of availability/relatedness. Model Availability options include:

Model Steps  A tool available while a model is being built, located in the bottom left corner of the Model Subtab. The model steps explain each iteration of the stepwise regression: which variables were considered, which entered the model, and which were removed from consideration.
Multivariate Analysis  An analysis involving the relationship between two or more variables at a time. The multivariate option in the Visualize Subtab displays a visual representation of the effect and association of two or three variables to one another. 
Odds Ratio 
A measure of association between an independent variable (exposure) and the dependent variable (outcome). The odds ratio conveys the impact of a one unit increase or decrease of the independent variable on the odds of a "success" outcome. Odds ratios can be interpreted by their distance from 1.0. OR = 1: Exposure does not affect odds of "success" outcome. OR > 1: Exposure is associated with higher odds of "success" outcome. OR < 1: Exposure is associated with lower odds of "success" outcome.
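In logistic regression, the odds ratio for a variable is conventionally obtained by exponentiating its coefficient. The sketch below assumes that convention; whether Predict reports the value in exactly this way is not confirmed by this glossary, and `odds_ratio` is a hypothetical name.

```python
import math

def odds_ratio(coefficient, unit_change=1.0):
    """Multiplicative change in the odds of "success" for a given change in the variable."""
    # A logistic coefficient is a change in log-odds, so exponentiate it.
    return math.exp(coefficient * unit_change)
```

A coefficient of 0 gives an odds ratio of exactly 1 (no effect), a positive coefficient gives a value above 1, and a negative coefficient gives a value below 1.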
Ordinary Least Squares Regression 
Also known as "OLS" or linear regression. A type of regression analysis used when the Y-variable is a continuous variable. OLS estimates the coefficients of a linear regression equation to model the relationship between independent variables and the dependent variable (Y-variable). The outcome of predicting a continuous variable using OLS is a point estimate of the value.
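For a single independent variable, the OLS coefficients have a simple closed form, sketched below. This is a simplified one-variable illustration under that assumption (Predict models can include many variables), and `fit_simple_ols` is a hypothetical name.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-deviation of x and y divided by the squared deviation of x.
    slope = (
        sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        / sum((a - mean_x) ** 2 for a in x)
    )
    # Intercept: chosen so the fitted line passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope
```

The point estimate for a new record is then `intercept + slope * value`.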
Outcome 
The event or occurrence that is being predicted using regression. Synonyms: Y-Variable, Dependent Variable, Response.
Outliers  Values that are atypical (by definition), or infrequently observed. 
P-Value  The probability of seeing an association at least as strong as the one observed if the null hypothesis were true. In regression, the null hypothesis is that a variable is not associated with the outcome. A small p-value means the observed association would be unlikely under the null hypothesis, so the null hypothesis is rejected: the variable is associated with the outcome strongly enough that the relationship is unlikely to be a result of random variation.
Percent Concordance 
The ratio of concordant pairs to the total number of pairs. See Concordance. 
Percent Discordance  The ratio of discordant pairs to the total number of pairs. See Discordance. 
Predict (verb)  A formulaic process of estimating the likelihood of an outcome based on historical patterns.
Profiling Analysis 
A tool in the Analyze Subtab that performs a hypothesis test for each variable in the dataset across the two categories from any binary variable. Example: Testing whether the average High School GPA is significantly different between students who enrolled vs. did not enroll. 
Score  The process of applying a model formula to a cohort of records to "score" or predict the value/likelihood of an outcome. 
Standard Error (S.E.)  The standard error indicates how much the estimate of a regression coefficient would be expected to vary from sample to sample, and so measures the precision of the estimate of the coefficient. The smaller the standard error, the more precise the estimate is.
Suggest Variable  A tool in the Model Subtab that adds one variable at a time to the model from the Variables pool. The variable added is the one identified as the best candidate given what has already been included.
Text Data  A data type consisting of any combination of alphanumeric characters. Text data can enter Predict but cannot be utilized by a predictive model. 
Transformations (variable)  Mathematical functions applied to the original variables in the dataset to generate additional variables for consideration by the model. Variable transformations are automatically generated and included in the variable pool if they are related to the outcome. Selecting the "View 'new variable' suggestions" box under the Variables section will show related variable transformations. 
Univariate Analysis  An analysis involving a single variable. The univariate option in the Visualize Subtab displays a visual representation of the distribution and summary statistics of the selected variable. 
Variable  A column/field/attribute in a dataset. 
Wald Chi-Square Test Statistic
The Wald statistic is analogous to the t-test in linear regression. It is used to assess whether a regression coefficient differs significantly from zero. The Wald statistic is the ratio of the square of the regression coefficient to the square of the standard error of the coefficient and is asymptotically distributed as a chi-square distribution.
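The ratio described above can be computed directly. The sketch below also derives a p-value, assuming a chi-square distribution with 1 degree of freedom (the usual case for a single coefficient); that assumption and the name `wald_chi_square` are illustrative and not taken from Predict.

```python
import math

def wald_chi_square(coefficient, standard_error):
    """Return (Wald statistic, p-value) for one regression coefficient."""
    # Squared coefficient divided by squared standard error.
    statistic = (coefficient / standard_error) ** 2
    # Upper-tail probability of a chi-square with 1 degree of freedom,
    # which equals erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

A coefficient that is large relative to its standard error produces a large statistic and a small p-value.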
Y-Variable
An outcome that depends on several other factors. The variable in a dataset that is selected as the target of the prediction. Synonyms: Outcome, Response, Dependent Variable.