Machine Learning: Data Analysis and Discovery
Excerpts from the Qlik AutoML training, pulling out the key steps you should complete before running any Machine Learning (ML) analysis.
Data analysis and discovery
Explore the data
How many observations have you collected?
Is this enough to predict on?
Are there gaps or nulls in key data points?
Do you need to reassess your data collection?
Do you understand the distribution of your data?
Do you have a normal distribution?
Is it skewed? Are there outliers?
What is the range, mean, and median?
Which data is time-variant and which is non-time-variant?
With time-variant data, is it timestamped so that it can be aggregated appropriately?
Will it be available at the time of the prediction?
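The exploration questions above (observation count, nulls, range, mean, median, skew, outliers) can be answered with a quick profile of each numeric column. The following is a minimal sketch using only the Python standard library. The function name `profile_column`, the sample `ages` data, and the 1.5 x IQR outlier rule are illustrative assumptions, not something taken from the Qlik training.

```python
import statistics

def profile_column(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    median = statistics.median(present)
    stdev = statistics.pstdev(present)
    # Skewness as the third standardized moment (about 0 for symmetric data).
    third_moment = sum((v - mean) ** 3 for v in present) / len(present)
    skew = third_moment / stdev ** 3 if stdev else 0.0
    # Flag outliers with the common 1.5 * IQR rule (an assumed convention).
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    outliers = [v for v in present if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    return {
        "observations": len(values),
        "nulls": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean,
        "median": median,
        "skew": skew,
        "outliers": outliers,
    }

# Hypothetical sample column with two missing values and one extreme value.
ages = [23, 25, 31, None, 28, 35, 27, 30, 95, None]
summary = profile_column(ages)
print(summary)
```

A mean well above the median, together with a positive skew, suggests a right tail that deserves a closer look before modeling.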
Review correlations - get early insights into the data in order to refine your hypothesis
Target correlation
Check whether stratification and/or correlation of the target exists across some of the features
Check for signals
Are the directional patterns in the feature-to-target relationships intuitive?
Correlation matrix
Features that are highly correlated with one another may be redundant and a source of noise rather than additional signal
Consider selecting a single feature from groups that appear to capture the same behaviors in the data
Or else determine if there is a single feature driving both
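Both correlation checks above can be sketched in a few lines: correlate each feature with the target, then scan feature pairs for near-duplicates. This is an illustrative sketch in plain Python. The feature names, the sample values, and the 0.9 redundancy cutoff are assumptions chosen for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical churn data: tenure_days is just tenure_months rescaled.
features = {
    "tenure_months": [1, 3, 6, 12, 24, 36],
    "tenure_days": [30, 90, 180, 360, 720, 1080],
    "support_calls": [2, 5, 3, 4, 1, 2],
}
target = [1, 1, 1, 0, 0, 0]  # 1 = churned

# Target correlation: is the direction of each relationship intuitive?
target_corr = {name: pearson(col, target) for name, col in features.items()}

# Correlation matrix scan: pairs above the (assumed) 0.9 cutoff look redundant.
names = list(features)
redundant_pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(pearson(features[a], features[b])) > 0.9
]
print(target_corr)
print(redundant_pairs)
```

Here the two tenure features carry the same information, so keeping only one of them would be the natural choice.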
Apply business knowledge
Is historical data indicative of today's operating environment?
Have systems or data collection practices significantly changed during the collection window?
How does your domain experience explain and validate the data?
If the data doesn't align with your assumptions, either there are data issues or the assumptions are off
What additional features need to be collected or engineered?
Clean and generalize - aim for a model that generalizes whenever possible
Remove outliers, which could impede an algorithm's ability to discern general patterns in the data
Address distribution oddities like skews, tails, multi-modal shapes in your data
May require additional data transformation or future feature design
One hint: group low-volume categories, and round or remove the tails in numeric features
Replace null or missing values
With "Other" or "Unknown" when appropriate, in order to gain extra value from a sparse column
Address correlated features
By removing redundant features
By engineering new features to extract additional information
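The cleaning hints above (grouping low-volume categories, filling nulls with a placeholder, trimming numeric tails) might look like the following in plain Python. This is a sketch under assumptions: the helper names, the `min_count` cutoff, and the clipping bounds are hypothetical and would need to be chosen from your own data.

```python
from collections import Counter

def group_rare(categories, min_count=2, other="Other", unknown="Unknown"):
    """Fill missing categories with `unknown`, then fold low-volume ones into `other`."""
    filled = [unknown if c is None else c for c in categories]
    counts = Counter(filled)
    return [
        c if c == unknown or counts[c] >= min_count else other
        for c in filled
    ]

def clip_tails(values, lower, upper):
    """Pull extreme numeric values back to the given bounds instead of dropping rows."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical sparse category column and a numeric column with a long tail.
plans = ["basic", "basic", "pro", None, "pro", "enterprise", "trial", "basic"]
cleaned_plans = group_rare(plans)

spend = [12, 15, 14, 13, 400, 16]
clipped_spend = clip_tails(spend, lower=0, upper=50)

print(cleaned_plans)
print(clipped_spend)
```

Clipping keeps the row while limiting the influence of the extreme value. Removing the row outright is the alternative when the value is clearly an error.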
Feature engineering - the process of creating new features from current ones
To gain additional predictive power from source data collected to address a business question
Date feature engineering
Parsing a date into columns (MM, DD, YY)
Creating segments like seasons, quarters, semesters
Calculating the difference between two dates
Others
Gender
Assigning gender based on Mr or Mrs
Creating median income from zipcode and income
Parsing customer address to City, State, Zip
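The feature engineering examples above can be sketched with the Python standard library. Everything named here is hypothetical: the helper functions, the `Mr`/`Mrs` mapping values, and the assumed address layout of "street, city, ST zip". Real addresses and titles are messier and would need more careful handling.

```python
import statistics
from collections import defaultdict
from datetime import date

def date_features(d):
    """Parse a date into separate columns plus a quarter segment."""
    return {"month": d.month, "day": d.day, "year": d.year,
            "quarter": (d.month - 1) // 3 + 1}

def days_between(start, end):
    """Date difference in whole days."""
    return (end - start).days

# Assumed mapping; any other title falls back to "Unknown".
TITLE_TO_GENDER = {"Mr": "Male", "Mrs": "Female"}

def gender_from_name(full_name):
    """Assign gender from a leading title; assumes a non-empty name."""
    title = full_name.split()[0].rstrip(".")
    return TITLE_TO_GENDER.get(title, "Unknown")

def split_address(address):
    """Split an address, assuming the layout 'street, city, ST zip'."""
    street, city, state_zip = [part.strip() for part in address.split(",")]
    state, zip_code = state_zip.split()
    return {"street": street, "city": city, "state": state, "zip": zip_code}

def median_income_by_zip(records):
    """Build a median-income feature from (zip, income) pairs."""
    by_zip = defaultdict(list)
    for zip_code, income in records:
        by_zip[zip_code].append(income)
    return {z: statistics.median(incomes) for z, incomes in by_zip.items()}

signup = date_features(date(2023, 8, 15))
tenure_days = days_between(date(2023, 1, 1), date(2023, 3, 1))
gender = gender_from_name("Mrs. Ada Park")
address = split_address("12 Main St, Springfield, IL 62701")
incomes = median_income_by_zip(
    [("62701", 40000), ("62701", 60000), ("62701", 50000), ("10001", 90000)]
)
print(signup, tenure_days, gender, address, incomes)
```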
Feature design - reviewing the features in your dataset to determine what possible issues may exist or improvements that can be made
Architecting good features includes
Leveraging business acumen - you are the expert. Use it to your advantage
Expressing features in a way that ties to the target
Consider factors like
Should time factor into the feature?
Does rate of change matter?
Should a feature be normalized to account for differences across subsets of data?
Do null values mean something?
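Three of the factors above translate directly into derived features: a rate of change, a value normalized within its own subset, and an explicit flag for nulls. The sketch below is illustrative only. The function names are hypothetical, and it assumes non-zero previous values and non-zero group means.

```python
def rate_of_change(series):
    """Relative change from each value to the next; the first entry has no predecessor."""
    # Assumes no previous value is zero.
    return [None] + [(curr - prev) / prev for prev, curr in zip(series, series[1:])]

def normalize_within_group(values, groups):
    """Divide each value by the mean of its own group so subsets become comparable."""
    totals, counts = {}, {}
    for value, group in zip(values, groups):
        totals[group] = totals.get(group, 0) + value
        counts[group] = counts.get(group, 0) + 1
    return [value / (totals[group] / counts[group])
            for value, group in zip(values, groups)]

def null_flag(values):
    """1 where the value is missing, so the model can learn whether missingness matters."""
    return [1 if v is None else 0 for v in values]

monthly_usage = [100, 110, 99]
usage_change = rate_of_change(monthly_usage)

# Hypothetical store sales: small and large stores are on very different scales.
sales = [10, 30, 100, 300]
store_size = ["small", "small", "large", "large"]
relative_sales = normalize_within_group(sales, store_size)

last_contact = [5, None, 12, None]
contact_missing = null_flag(last_contact)
print(usage_change, relative_sales, contact_missing)
```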
Recognizing data leakage - when the data that you are using to train an ML algorithm includes the information you are trying to predict
Data leakage can lead to false assumptions
Data leakage can lead to a model performing better in training than in the real world
It can cause false assurance about how well the model actually performs
Prevent data leakage
Pay attention to time constraints included in your identified business questions
All data inside of the training set must be relevant to the time constraints set forth by the business questions
Types of data leakage - both result in the model performing more poorly in the real world than in training
One or more features in the training set include information that wouldn't be known at the time of prediction
One or more features in the training set can be used to derive the target variable you are trying to predict
If the AutoML score is > 85%, do additional analysis to check for data leakage
Ways to identify data leakage
High scores - if scores are really high, there might be leakage
Feature importance - if one feature is a lot more important than everything else, it may be leaking the target
Chronological holdout - if this score is drastically lower than the cross-validation score, there might be leakage
Logic - will you have the information for records at the time you want to make a prediction? Will the records be the same in 30 days?
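The first three checks above can be combined into a simple screening helper that takes scores and feature importances you have already obtained from your modeling tool. This is a rough sketch, not a Qlik AutoML API. The function name is hypothetical, the 0.85 threshold mirrors the 85% rule of thumb above, and `gap_threshold` and `dominance_ratio` are assumed values to tune for your own use.

```python
def leakage_warnings(cv_score, holdout_score, importances,
                     score_threshold=0.85, gap_threshold=0.10, dominance_ratio=3.0):
    """Return a list of reasons to suspect data leakage (empty if none are triggered)."""
    warnings = []
    # High scores: suspiciously good performance is worth a second look.
    if cv_score > score_threshold:
        warnings.append("high score")
    # Chronological holdout: a much lower score than cross-validation is a red flag.
    if cv_score - holdout_score > gap_threshold:
        warnings.append("holdout gap")
    # Feature importance: one feature dwarfing the runner-up may encode the target.
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] > dominance_ratio * ranked[1][1]:
        warnings.append(f"dominant feature: {ranked[0][0]}")
    return warnings

# Hypothetical results from a churn model.
report = leakage_warnings(
    cv_score=0.97,
    holdout_score=0.71,
    importances={"refund_issued": 0.82, "tenure_months": 0.10, "support_calls": 0.08},
)
print(report)
```

A warning is only a prompt for the logic check: ask whether the flagged feature would really be known at the time of prediction.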
How to fix data leakage
If you identify a column that should not be used to train a model, drop it from the model's features but keep it in the data set
Hold time constant on that feature so that it becomes a good feature (e.g. fixing aggregation)
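Holding time constant on an aggregated feature can be illustrated by aggregating only the events known on or before a cutoff date. The sketch below is one possible reading of that fix. The function name, the purchase records, and the cutoff date are all hypothetical.

```python
from datetime import date

def total_spend_as_of(purchases, prediction_point):
    """Sum only the purchases made on or before the prediction point."""
    return sum(amount for purchased_on, amount in purchases
               if purchased_on <= prediction_point)

# Hypothetical purchase history for one customer.
purchases = [
    (date(2023, 1, 10), 40.0),
    (date(2023, 2, 5), 25.0),
    (date(2023, 4, 20), 90.0),  # happens after the prediction point
]

# Leaky version: aggregates everything, including events from the future.
leaky_total = sum(amount for _, amount in purchases)

# Fixed version: the aggregation is held constant at the prediction point.
safe_total = total_spend_as_of(purchases, date(2023, 3, 1))
print(leaky_total, safe_total)
```

The leaky total includes a purchase that would not have been known when the prediction was made, which is exactly the kind of information the training set should exclude.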
Ways to prevent data leakage
Having well-defined business questions, with a learning framework built from the following ingredients
Event trigger
Target (value & horizon)
Features
Prediction point