Study notes
Machine learning is a data science technique used to extract patterns from data, allowing computers to identify related data and forecast future outcomes, behaviors, and trends.
It is often described as a new programming paradigm:
- Traditional programming - You provide the program with rules and data, and in return it gives you results or answers.
- Machine learning - You provide data and answers; the result of training a machine learning algorithm is that the algorithm has learned the rules to map the input data to answers.
- Data cleaning deals with:
  - Imputation of null values
  - Converting data types
  - Duplicate records
  - Outliers
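The four cleaning tasks above can be sketched in plain Python. This is only an illustration under stated assumptions: the records, the `age` column, the 0-120 plausibility range, and the choice of median imputation are all hypothetical and not taken from the course material.

```python
from statistics import median

# Hypothetical raw records: ages arrive as strings, one is missing,
# one row is duplicated, and one value is implausibly large.
raw = [
    {"id": 1, "age": "34"},
    {"id": 2, "age": None},
    {"id": 3, "age": "29"},
    {"id": 3, "age": "29"},
    {"id": 4, "age": "31"},
    {"id": 5, "age": "410"},
]

# 1. Duplicate records: keep the first occurrence of each id.
seen, rows = set(), []
for r in raw:
    if r["id"] not in seen:
        seen.add(r["id"])
        rows.append(dict(r))

# 2. Converting data types: string -> int, leaving nulls as None.
for r in rows:
    r["age"] = int(r["age"]) if r["age"] is not None else None

# 3. Outliers: drop values outside an assumed plausible range (0-120).
rows = [r for r in rows if r["age"] is None or 0 <= r["age"] <= 120]

# 4. Imputation of null values: fill with the median of the known ages.
fill = median(r["age"] for r in rows if r["age"] is not None)
for r in rows:
    if r["age"] is None:
        r["age"] = fill

cleaned = rows
print(cleaned)
```

The order of the steps matters here: outliers are removed before imputation so that the extreme value does not distort the median used to fill the gap.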
- Feature engineering
  - Aggregation (count, sum, average, mean, median, and the like)
  - Part-of (year of date, month of date, week of date, and the like)
  - Binning (grouping entities into bins and then applying aggregations)
  - Flagging (boolean conditions resulting in True or False)
  - Frequency-based (calculating the frequencies of the levels of one or more categorical variables)
  - Embedding (transforming one or more categorical or text features into a new set of features, possibly with a different cardinality)
  - Deriving by example
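A few of the feature engineering techniques above (part-of, binning, flagging, frequency-based, and aggregation) can be illustrated with a small plain-Python sketch. The `orders` records, the bin thresholds, and the helper `bin_amount` are hypothetical and chosen only to make each technique visible.

```python
from collections import Counter
from datetime import date
from statistics import mean

# Hypothetical order records.
orders = [
    {"customer": "a", "day": date(2023, 1, 15), "amount": 20.0, "channel": "web"},
    {"customer": "a", "day": date(2023, 2, 3), "amount": 130.0, "channel": "store"},
    {"customer": "b", "day": date(2023, 2, 20), "amount": 55.0, "channel": "web"},
]

def bin_amount(amount):
    """Binning: map a numeric amount to an assumed low/medium/high bucket."""
    if amount < 50:
        return "low"
    if amount < 100:
        return "medium"
    return "high"

# Frequency-based: how often each level of a categorical variable occurs.
channel_counts = Counter(o["channel"] for o in orders)

features = []
for o in orders:
    features.append({
        **o,
        "month": o["day"].month,                 # part-of (month of date)
        "amount_bin": bin_amount(o["amount"]),   # binning
        "is_large": o["amount"] >= 100,          # flagging (True or False)
        "channel_freq": channel_counts[o["channel"]] / len(orders),
    })

# Aggregation: average order amount per customer.
by_customer = {}
for o in orders:
    by_customer.setdefault(o["customer"], []).append(o["amount"])
avg_amount = {c: mean(v) for c, v in by_customer.items()}

print(features[0])
print(avg_amount)
```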
- Data scaling
  - Normalization
  - Standardization
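The difference between the two scaling methods is easiest to see on a small example. The sketch below assumes the common definitions: normalization as min-max rescaling to the range [0, 1], and standardization as rescaling to mean 0 and standard deviation 1 (z-scores). The input values are made up.

```python
from statistics import mean, pstdev

values = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical feature column

# Normalization (min-max): rescale so the smallest value is 0 and the largest is 1.
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization (z-score): subtract the mean, divide by the standard deviation.
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]

print(normalized)
print(standardized)
```

Normalization keeps the values bounded, while standardization keeps them centered; which one suits a model better depends on the algorithm and on how sensitive the data is to outliers.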
- Data encoding
  - Ordinal encoding
  - One-hot encoding
Azure Databricks libraries:
- MLlib (Apache Spark legacy, RDD-based API) - org.apache.spark.mllib
- Spark ML - the newer, DataFrame-based API of the same MLlib library; this is the one used in Azure Databricks - org.apache.spark.ml
- Splitting data
- Training model
  - Transformer
    Takes a DataFrame as input and returns a new DataFrame as output.
    Used for feature engineering and feature selection, since the result of a transformer is another DataFrame.
    Example: read in a text column, map it into a set of feature vectors, and output a DataFrame with the newly mapped column.
    Method: .transform()
  - Estimator
    Takes a DataFrame as input and returns a model.
    Example: the LinearRegression machine learning algorithm, which accepts a DataFrame and produces a Model.
    Method: .fit()
  - Pipeline
    Combines transformers and estimators.
    Breaking the training process into a series of stages makes it easier to combine multiple algorithms.
    Method: .fit()
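To make the transformer / estimator / pipeline contract concrete, here is a toy imitation in plain Python. It is not the Spark ML API: a list of dicts stands in for a DataFrame, and every class name (`AddSquare`, `ToyLinearRegression`, `ToyLinearModel`, `ToyPipeline`, `ToyPipelineModel`) is hypothetical. It only mirrors the shape of the calls described above: `transform()` returns new rows, `fit()` returns a model, and a pipeline's `fit()` runs its stages in order.

```python
from statistics import mean

class AddSquare:
    """Toy transformer: rows in, new rows out (adds an x_squared column)."""
    def transform(self, rows):
        return [{**r, "x_squared": r["x"] ** 2} for r in rows]

class ToyLinearModel:
    """A fitted model acts as a transformer: it adds a prediction column."""
    def __init__(self, feature, slope, intercept):
        self.feature, self.slope, self.intercept = feature, slope, intercept

    def transform(self, rows):
        return [{**r, "prediction": self.slope * r[self.feature] + self.intercept}
                for r in rows]

class ToyLinearRegression:
    """Toy estimator: fit() learns a one-feature least-squares line."""
    def __init__(self, feature, label):
        self.feature, self.label = feature, label

    def fit(self, rows):
        xs = [r[self.feature] for r in rows]
        ys = [r[self.label] for r in rows]
        mx, my = mean(xs), mean(ys)
        slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                 / sum((x - mx) ** 2 for x in xs))
        return ToyLinearModel(self.feature, slope, my - slope * mx)

class ToyPipelineModel:
    """Result of fitting a pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class ToyPipeline:
    """Chains stages; estimators are fitted, transformers are applied as-is."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator -> fitted model
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed the output to the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)

# Hypothetical training data following y = 2 * x^2 + 1.
train = [{"x": 1, "y": 3}, {"x": 2, "y": 9}, {"x": 3, "y": 19}]
pipeline = ToyPipeline([AddSquare(), ToyLinearRegression("x_squared", "y")])
model = pipeline.fit(train)
scored = model.transform([{"x": 4}])
print(scored)
```

In Spark ML itself the corresponding pieces would be classes from `pyspark.ml` (or `org.apache.spark.ml`) operating on real DataFrames; the toy version only shows why splitting training into stages makes the steps easy to recombine.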
- Validate model
  After training, Spark ML provides built-in summary statistics such as Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the coefficient of determination ...
  All of the above are summary measures based on the training data.
  With a validation dataset, it is possible to calculate the same summary statistics on a never-before-seen set of data by running the model's transform() function against the validation dataset.
  Method: .transform()
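The validation statistics named above can be computed by hand once predictions for the validation set are available. The sketch below uses plain Python and made-up numbers; `actual` and `predicted` are hypothetical stand-ins for the label column and for the prediction column that the model's transform() would add to the validation dataset. The helper name `regression_metrics` is also an assumption.

```python
from math import sqrt
from statistics import mean

def regression_metrics(actual, predicted):
    """Return RMSE, MAE, and the coefficient of determination (R^2)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    rmse = sqrt(mean(e ** 2 for e in errors))      # root mean square error
    mae = mean(abs(e) for e in errors)             # mean absolute error
    mu = mean(actual)
    ss_res = sum(e ** 2 for e in errors)           # residual sum of squares
    ss_tot = sum((a - mu) ** 2 for a in actual)    # total sum of squares
    return {"rmse": rmse, "mae": mae, "r2": 1 - ss_res / ss_tot}

# Hypothetical validation labels and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

Comparing these values with the same measures from the training data gives a rough sense of overfitting: a model that scores much worse on the validation set has likely memorized the training data rather than learned a general rule.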
Resources
Prepare data for machine learning with Azure Databricks - Training | Microsoft Learn
Train a machine learning model with Azure Databricks - Training | Microsoft Learn