Motivation and Machine Learning (Lesson 3) Part 2
Feature selection helps you answer the question: "What are the features that are most useful for a given model?" One reason we need it is that the number of features in your original dataset may be very high.
- Eliminates irrelevant, redundant and highly correlated features
- Reduces dimensionality for increased performance, since many ML models do not cope well with data that has a very large number of dimensions (many features).
- We can improve the situation of having too many features through dimensionality reduction.
Commonly used techniques are:
PCA (Principal Component Analysis)
t-SNE (t-Distributed Stochastic Neighbor Embedding)
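To make the idea of dimensionality reduction concrete, here is a minimal sketch of the first step of PCA in plain Python. It finds the direction of greatest variance with power iteration and projects two-feature rows onto it. The function names, the sample data and the iteration count are assumptions made for illustration; in practice you would normally use a library implementation.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (feature means, unit vector along the direction of greatest variance)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d) of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by the covariance matrix and normalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(rows, means, component):
    """Reduce each row to a single number: its coordinate along the component."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(component)))
            for r in rows]

# Hypothetical data: two highly correlated features (second is roughly 2x the first).
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 8.0], [5.0, 9.8]]
means, pc1 = first_principal_component(data)
reduced = project(data, means, pc1)  # 2 features per row reduced to 1
```

Because the two features are almost perfectly correlated, one component keeps nearly all of the variance, which is the situation in which dropping dimensions costs little.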
Azure ML prebuilt modules:
Filter-based feature selection: identify columns in the input dataset that have the greatest predictive power
Permutation feature importance: determine the best features to use by computing the feature importance scores
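The idea behind permutation feature importance can be sketched in a few lines of plain Python: shuffle one feature column at a time and measure how much the model's error increases. This is only an illustration of the general technique, not the implementation behind the Azure ML module. The toy model, the data and the helper names are all hypothetical.

```python
import random

def mean_squared_error(model, X, y):
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Average increase in error when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_squared_error(model, X, y)
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the dataset with only this one column permuted.
            X_perm = [row[:col] + [value] + row[col + 1:]
                      for row, value in zip(X, shuffled)]
            increases.append(mean_squared_error(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Hypothetical, already-trained model: it uses feature 0 and ignores feature 1.
def toy_model(row):
    return 3.0 * row[0]

X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [3.0 * row[0] for row in X]
scores = permutation_importance(toy_model, X, y)
```

Here the score for feature 0 is large because shuffling it breaks the predictions, while the score for feature 1 is zero because the model never reads it.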
Data drift is a change in the input data for a model. Over time, data drift degrades the model's performance, because the input data drifts farther and farther from the data on which the model was trained. Monitoring for and identifying data drift helps maintain model performance; left undetected, drift lets accuracy decline over time.
Common causes of data drift include:
Changes in the upstream process: the data input source changed, the measurement unit changed, the calibration of the data-gathering equipment changed, etc.
Changes in the quality of the data: changes in customer behaviour, changes in seasonality, etc.
Monitoring for Data Drift
Azure Machine Learning allows you to set up dataset monitors that can alert you about data drift and even take automatic actions to correct data drift.
The process of monitoring for data drift involves:
Specifying a baseline dataset – usually the training dataset
Specifying a target dataset – usually the input data for the model
Comparing these two datasets over time, to monitor for differences
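The baseline-versus-target comparison can be sketched for a single numeric feature as follows. This is a deliberately simple check that measures how far the target mean has moved in units of the baseline standard deviation. It is not how Azure Machine Learning dataset monitors compute drift; the function names, the sample values and the threshold of 2.0 are assumptions chosen for illustration.

```python
import statistics

def drift_score(baseline, target):
    """Shift of the target mean, measured in baseline standard deviations."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    return abs(statistics.mean(target) - base_mean) / base_std

def detect_drift(baseline, target, threshold=2.0):
    # threshold is a hypothetical cut-off; tune it for your own data.
    return drift_score(baseline, target) > threshold

# Baseline: temperatures (Celsius) seen during training.
training_temps = [20.1, 19.8, 20.5, 21.0, 19.6, 20.3, 20.9, 19.9]
# Target A: recent input data in the same unit.
recent_same_unit = [20.4, 19.7, 20.8, 20.2]
# Target B: the upstream measurement unit silently changed to Fahrenheit.
recent_fahrenheit = [68.7, 67.5, 69.4, 68.4]

drift_a = detect_drift(training_temps, recent_same_unit)
drift_b = detect_drift(training_temps, recent_fahrenheit)
```

With these values only the second target is flagged. A real monitor would run a check like this on a schedule and across every feature.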
Here are different types of comparisons you might want to make when monitoring for data drift:
Comparing input data vs. training data: This is a proxy for model accuracy; that is, an increased difference between the input vs. training data is likely to result in a decrease in model accuracy.
Comparing different samples of time series data: In this case, you are checking for a difference between one time period and another. For example, a model trained on data collected during one season may perform differently when given data from another time of year. Detecting this seasonal drift in the data will alert you to potential issues with your model's accuracy.
Model Training Basics
In model training, the goal is to give the model a set of input features, X, and have it predict the value of some output feature, y. First establish the type of problem: is it a classification or a regression problem? Decide whether you need to scale or encode the data. Then identify the input features you need, or create new ones through feature engineering.
Model training is the iterative process that involves selecting the hyperparameters, training the model and then evaluating the model performance. Once you've trained your model, you can now run it on the test dataset to see how it performs.
Given a regression problem modeled as y = bx + c,
where y is the expected output and x is the input, the values b and c represent the slope and intercept respectively. In machine learning these are the parameters, which are learned from the data during model training. Examples: weights and biases.
In contrast to parameters, hyperparameters are not values that are learned from the data during training. Rather, they are values that we set before training. Examples: learning rate, batch size, number of clusters, number of layers in a deep network.
Because we do not know the best values for these before training, we usually start with a best guess, run the training, adjust the hyperparameters and retrain until we find values that work well.
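The difference between parameters and hyperparameters can be seen in a small sketch that fits y = bx + c with gradient descent. The parameters b and c are learned from the data, while learning_rate and epochs are hyperparameters chosen before training. The function name, the default values and the sample data are assumptions made for illustration.

```python
def train_linear_model(xs, ys, learning_rate=0.01, epochs=5000):
    """Fit y = b*x + c by minimizing mean squared error with gradient descent."""
    b, c = 0.0, 0.0  # parameters: start from an arbitrary guess
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to b and c.
        grad_b = sum(2 * (b * x + c - y) * x for x, y in zip(xs, ys)) / n
        grad_c = sum(2 * (b * x + c - y) for x, y in zip(xs, ys)) / n
        # The learning rate (a hyperparameter) controls the step size.
        b -= learning_rate * grad_b
        c -= learning_rate * grad_c
    return b, c

# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
b, c = train_linear_model(xs, ys)
```

After training, b and c end up close to 2 and 1. Changing learning_rate or epochs changes how quickly (or whether) training gets there, which is why those values usually need several rounds of adjustment.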
Data for ML is typically split into three parts:
- Training data
- Validation data
- Test data
Training data is used to learn the values of the parameters. The model's performance is checked on the validation data, and we adjust the hyperparameters until the model performs well on it. Finally, we do a last check of the model's performance on the test data, which the model has never seen.
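A three-way split can be sketched as below. The 60/20/20 proportions, the seed and the function name are assumptions for illustration; the right proportions depend on how much data you have.

```python
import random

def split_dataset(rows, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle the rows and split them into train, validation and test sets."""
    rows = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)   # fixed seed keeps the split reproducible
    n_test = int(len(rows) * test_fraction)
    n_val = int(len(rows) * val_fraction)
    test = rows[:n_test]
    validation = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, validation, test

records = list(range(100))  # stand-in for 100 labelled examples
train, validation, test = split_dataset(records)
```

The three sets never overlap, so the test set stays unseen until the final check.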
The Process For Model Training on Azure:
Collect Data - Prepare Data - Train Model - Evaluate Model - Deploy Model - Retrain Model
Terms to know:
Workspace: This is the very first thing you need to create. It is the centralized place for working with all the components of the machine learning process.
Experiment: This is a container that helps you group the various artifacts related to your machine learning processes.
Run: This is one of the artifacts in an experiment. It is a process that is delivered to and executed on one of the compute resources. Examples: the training or validation of models, feature engineering code.
Model registry: This is a service that provides snapshots and versioning for your trained models.
Compute Instances: This refers to a cloud-based workstation that gives you access to various development environments, such as Jupyter Notebooks.
A problem is a classification problem when the expected outputs are categorical or discrete. There are three categories:
- Binary Classification: The output is one of only two classes, for example 0 and 1, or true and false. Spam email detection and fraud detection are common applications.
- Multi-Class Single-Label Classification: Unlike binary classification, the output can take one of multiple classes (three or more values), for example recognizing written numbers from 1 to 10 or recognizing the days of the week.
- Multi-Class Multi-Label Classification: Here we also have multiple categories, but the output can belong to more than one class. Unlike the previous case, where the output must belong to a single class, here it can belong to several classes at once.
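The three categories differ mainly in how a model's scores are turned into labels. The sketch below assumes a model that outputs a score between 0 and 1 per class; the function names, labels, scores and the 0.5 threshold are all hypothetical.

```python
def predict_binary(score, threshold=0.5):
    """Binary: a single score is mapped to one of two classes."""
    return 1 if score >= threshold else 0

def predict_multiclass(scores, labels):
    """Multi-class single-label: exactly one class wins (highest score)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best]

def predict_multilabel(scores, labels, threshold=0.5):
    """Multi-class multi-label: every class above the threshold is returned."""
    return [label for label, s in zip(labels, scores) if s >= threshold]

spam = predict_binary(0.91)
day = predict_multiclass([0.1, 0.7, 0.2], ["Mon", "Tue", "Wed"])
tags = predict_multilabel([0.8, 0.3, 0.6], ["sports", "politics", "finance"])
```

Note that the multi-label result is a list, since zero, one or several classes may apply to the same input.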