Motivation and Machine Learning - Part 2
Learnt about Tabular Data - data simply arranged in a tabular format, like an Excel spreadsheet, with rows, columns and cells where they intersect.
Each row describes a single observation, product or entity.
Columns describe the properties or features of the item. Column values can be continuous (numeric values that can take any value within a range) or discrete (categorical) values, which come from a limited set and need to be converted to numbers.
A cell represents a single value at a row and column intersection.
In machine learning we ultimately work with numbers, specifically vectors. So everything that isn't numbers (categorical variables, text, pictures, videos, audio inputs) is eventually converted to arrays of numbers.
Revision - 2.8 Scaling Data and 2.9 Encoding categorical data
The point of scaling data is transforming it to fit within some range or scale, say 0-1 or 1-100. It doesn't affect the algorithms because every value is scaled the same way, and it can speed up the training process.
Two common approaches to scaling: standardization and normalization.
Standardization rescales data to have a mean of zero and a variance of one (variance is an indication of the data's dispersion). Formula: (x - mean) / standard deviation
Normalization simply rescales the data into the range [0, 1]. Formula: (x - xmin) / (xmax - xmin)
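The two formulas above can be sketched directly in NumPy on a small made-up array (the numbers here are just illustrative):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Standardization: (x - mean) / standard deviation -> mean 0, variance 1
standardized = (x - x.mean()) / x.std()

# Normalization: (x - xmin) / (xmax - xmin) -> values in [0, 1]
normalized = (x - x.min()) / (x.max() - x.min())

print(standardized.mean())          # ~0.0
print(normalized.min(), normalized.max())  # 0.0 1.0
```

In practice Scikit-learn's StandardScaler and MinMaxScaler do the same thing, but writing it out makes the formulas concrete.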
Categorical data needs to be encoded into numbers, so that it can be modeled by machine learning algorithms. Two approaches: Ordinal encoding and One hot Encoding.
Ordinal encoding assumes there's an order or rank of importance. It converts categorical data into integers ranging from zero to the number of categories minus 1.
One hot encoding doesn't assume order. It creates a new column for each category, so the number of new columns equals the number of distinct categories.
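A minimal sketch of both encodings in plain Python, using a made-up "size" column (Scikit-learn's OrdinalEncoder and OneHotEncoder do this for you in real projects):

```python
sizes = ["small", "large", "medium", "small"]

# Ordinal encoding: assumes an order, maps each category to an integer 0..n-1
order = {"small": 0, "medium": 1, "large": 2}
ordinal = [order[s] for s in sizes]
print(ordinal)  # [0, 2, 1, 0]

# One-hot encoding: no order assumed; one new column per distinct category
categories = sorted(set(sizes))  # ['large', 'medium', 'small']
one_hot = [[1 if s == c else 0 for c in categories] for s in sizes]
print(one_hot[0])  # [0, 0, 1] -> 'small'
```

Notice the one-hot rows have exactly one 1, and the number of columns equals the number of distinct categories, as described above.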
I revised lessons 2.10 - Image data and 2.11 - Text data. My summary:
Image data needs to be represented as numbers before it can be fed to ML algorithms, which is why we describe it in terms of pixels.
A pixel can be viewed as a tiny square of color obtained by a combination of 3 color channels (RGB) or more.
So images are represented as 3D arrays with height, width and number of channels. It's important to use the same aspect ratio for ML image data (aspect ratio is the ratio of an image's width to its height).
It's also important to normalize image data by subtracting per channel mean pixel values from individual pixels.
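The per-channel mean subtraction can be sketched with NumPy broadcasting on a tiny fake image (random pixel values, just for illustration):

```python
import numpy as np

# A fake 4x4 RGB image: height x width x channels
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3)).astype(np.float64)

# Per-channel mean, computed over the height and width axes
channel_means = image.mean(axis=(0, 1))  # shape (3,)

# Subtract each channel's mean from its pixels (broadcasting)
normalized = image - channel_means
print(normalized.mean(axis=(0, 1)))  # each channel mean is now ~0
```

After the subtraction, each color channel is centered around zero, which tends to help training converge.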
Text data is also processed into numbers before use. There are three steps: normalization, tokenization, then vectorization.
Normalization transforms a piece of text into a canonical (official) form. This helps handle multiple words that mean the same thing (lemmatization) and remove unnecessary words (stop words).
Tokenization refers to splitting a string of text into a list of smaller parts.
Then comes vectorization. The point is to encode the text in numerical form: after normalizing, we identify the most relevant features or keywords and assign a value to each one.
Two approaches: TF-IDF(Term frequency inverse document frequency) and Word2Vec
The difference between them: TF-IDF ignores the order of words and gives a matrix whose dimensions are the number of documents and the number of words in the vocabulary, while Word2Vec gives a unique vector for each word based on the words appearing around that word.
Sorry, the summary is longer than I planned. Hope you like it. :blush: Let's keep going.
I took lessons 2.12 - 2.14. It centered on the two perspectives on machine learning: the computer science perspective and the statistical perspective.
In the computer science perspective, we try to determine a program that, given the input data (input features), can produce the correct output or expected prediction.
In the statistical perspective, the ML algorithm is trying to learn a hypothetical function f that maps the given input variables (x) to the output, a dependent variable (y).
The challenge is the same in both perspectives: determining a program or learning a function for a dependent variable.
2.15(Tools for Machine Learning)
ML tools consist of libraries, development environments and services that provide support for the ML ecosystem.
Common libraries: Scikit-learn, a classical machine learning library.
Keras, TensorFlow and PyTorch - popular deep learning libraries.
Development environments: the most popular is Jupyter notebooks, where you can write your code in Python and run it in separate cells. Other environments are Azure Notebooks, Azure Databricks, Visual Studio Code, etc.
Cloud services are needed to support the development environments. Microsoft Azure is a leading cloud computing service for building, testing, deploying and managing apps and services through Microsoft data centers. Cloud services are important because sometimes the data is too large, or you need faster processors to handle your ML workloads, and these can easily be provisioned in the cloud.
Microsoft has Azure Data Science VMs (virtual machines), Azure Databricks, Azure Machine Learning Compute and even SQL Server ML Services to handle all kinds of ML workloads irrespective of size.
For notebooks, Jupyter, Databricks, R markdown and Apache Zeppelin are the most popular.
A library is a collection of pre-written (and compiled) code which you can make use of in your own project by simply importing it.
2.16 (ML libraries)
Python is a programming language which can be used for ML. It has two important libraries for this: Pandas, which lets you work efficiently with dataframes, and NumPy, which supports high-level mathematical functions for numerical optimization and operations on arrays.
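A quick sketch of how the two fit together, on a made-up two-column table (column names are just illustrative):

```python
import numpy as np
import pandas as pd

# Pandas DataFrame: tabular data with named columns
df = pd.DataFrame({"height": [1.2, 1.5, 1.8], "weight": [40, 55, 70]})
print(df["height"].mean())  # 1.5

# NumPy array: fast element-wise math on the same numbers
arr = df.to_numpy()
print(arr.shape)  # (3, 2) -> 3 rows, 2 columns
```

Pandas handles the "tabular data" side (rows, columns, cells), and NumPy handles the raw numeric arrays that ML algorithms ultimately consume.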
For ML: the most popular library is Scikit-learn. There is also Spark MLlib, which is also used for classical ML.
For DL: there are two core libraries, TensorFlow and PyTorch. Keras is a library that was developed to make TensorFlow easier to work with for deep learning.
For visualization: we have Seaborn, Plotly, Matplotlib, and Bokeh. Seaborn provides a high-level interface and additional features beyond Matplotlib. Bokeh generates interactive data visualizations.
Revision - Lessons 2.17-2.18
Cloud services for ML provide support for managing the core assets used in ML projects.
Datasets: define, version and monitor data used in ML runs
Experiments & runs: organize ML workloads and keep track of each task executed through the service.
Pipelines: provide structural flow of tasks to model complex ML flows.
Model registry: manages models and registers them, with support for versioning and deployment to production.
Endpoints: expose real-time endpoints for scoring, as well as pipelines for advanced automation.
Datastores: data sources connected to the service environment, like blob stores, file share stores, data lake stores and databases. Simply places you can create datasets from.
Compute: This is a designated compute resource where you run your ML trainings.
Difference between models and algorithms
Models are specific representations learnt from data by algorithms. The model is the end product.
Algorithms: can be seen as prescriptive recipes to transform data into models.
Linear regression assumes there's a linear relationship between x, the independent variable, and y, the dependent variable. Represented as y = mx + c, where m is the slope and c is the intercept.
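A tiny sketch of fitting y = mx + c with NumPy's polyfit, on made-up points that lie exactly on the line y = 2x + 1:

```python
import numpy as np

# Points that lie exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

# Fit a degree-1 polynomial: returns the slope m and intercept c
m, c = np.polyfit(x, y, 1)
print(m, c)  # slope ~2.0, intercept ~1.0
```

Scikit-learn's LinearRegression does the same job for real tabular data with many features; polyfit just makes the y = mx + c form easy to see.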
:sunglasses: That's quite a lot. Hope you found it useful.
Automated ML: rapidly iterates over many combinations of algorithms and hyperparameters to help you find the best model based on your defined metrics.