Training models in Azure Databricks

Study notes

Machine learning is a data science technique used to extract patterns from data, allowing computers to identify related data and forecast future outcomes, behaviors, and trends.
It is described as a new programming paradigm:
  • Traditional programming - you provide the program with rules and data, and in return it gives you results (answers).
  • Machine learning - you provide the algorithm with data and answers; the result of training is that the algorithm has learned the rules that map the input data to the answers.
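
To make the contrast concrete, here is a minimal plain-Python sketch (not Azure Databricks code). The temperature-conversion task, the function names, and the sample data are all assumptions chosen for illustration: the first function has the rule written by hand, while the second recovers the same rule from example inputs and answers using an ordinary least-squares line fit.

```python
# Traditional programming: a person writes the rule, the program returns answers.
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32


# Machine learning (toy version): data and answers go in, the rule comes out.
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: Celsius inputs and their known Fahrenheit answers.
inputs = [0.0, 10.0, 20.0, 30.0, 40.0]
answers = [32.0, 50.0, 68.0, 86.0, 104.0]

slope, intercept = fit_line(inputs, answers)
print(f"learned rule: f = {slope:.2f} * c + {intercept:.2f}")
```

The learned slope and intercept match the hand-written rule because the sample answers contain no noise; real data would only approximate it.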
Real-world data is messy. Before you can feed it to an ML model, you must bring it into a coherent format that the model can ingest and process.
  • Data cleaning deals with:
    • Imputation of null values
    • Converting data types
    • Duplicate records
    • Outliers
  • Feature engineering
    • Aggregation (count, sum, mean, median, and the like)
    • Part-of (year of date, month of date, week of date, and the like)
    • Binning (grouping entities into bins and then applying aggregations)
    • Flagging (boolean conditions resulting in True or False)
    • Frequency-based (calculating the frequencies of the levels of one or more categorical variables)
    • Embedding (transforming one or more categorical or text features into a new set of features, possibly with a different cardinality)
    • Deriving by example
  • Data scaling
    • Normalization
    • Standardization
  • Data encoding
    • Ordinal encoding
    • One-hot encoding
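
A few of the preparation steps listed above can be sketched in plain Python. This is only an illustration of the ideas, not the Spark ML implementation; every function name and sample value here is an assumption made for the example, and the scaling helpers assume the input values are not all identical (otherwise they would divide by zero).

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Data cleaning: replace missing values (None) with the mean of the known ones."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def drop_duplicates(rows):
    """Data cleaning: keep the first occurrence of each row (rows must be hashable)."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def normalize(values):
    """Data scaling: min-max normalization into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Data scaling: shift to mean 0 and scale to standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def ordinal_encode(values, order):
    """Data encoding: map each level to its position in an explicit ordering."""
    index = {level: i for i, level in enumerate(order)}
    return [index[v] for v in values]


def one_hot_encode(values):
    """Data encoding: one 0/1 column per level, with levels sorted alphabetically."""
    levels = sorted(set(values))
    return [[1 if v == level else 0 for level in levels] for v in values]


# Hypothetical sample data.
ages = impute_mean([20, None, 40])
rows = drop_duplicates([("a", 1), ("b", 2), ("a", 1)])
sizes = ordinal_encode(["low", "high", "medium"], order=["low", "medium", "high"])
colors = one_hot_encode(["red", "green", "red"])

print(ages, normalize(ages), standardize(ages))
print(rows, sizes, colors)
```

Ordinal encoding suits categories with a natural order (low < medium < high), while one-hot encoding avoids implying an order where none exists (red vs. green).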

Azure Databricks libraries:
  • MLlib (legacy, RDD-based Apache Spark API) - org.apache.spark.mllib
  • Spark ML - the same MLlib library, but its newer DataFrame-based API. This is the one used in Azure Databricks - org.apache.spark.ml
Train & Validate a model (same old story) - attended
  1. Splitting data
  2. Training model
    1. transformer
      Takes a DataFrame as an input and returns a new DataFrame as an output.
      Used for feature engineering and feature selection, since the result of a transformer is another DataFrame.
      Example: read in a text column, map it to a set of feature vectors, and output a DataFrame with the newly mapped column.
      .transform()
    2. estimator
      Takes a DataFrame as an input and returns a model.
      Example: the LinearRegression machine learning algorithm. It accepts a DataFrame and produces a Model.
      .fit()
    3. pipeline
      Combines transformers and estimators.
      Breaking the training process into a series of stages makes it easier to combine multiple algorithms.
      .fit()
  3. Validate model
    After training, Spark ML provides built-in summary statistics such as Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the coefficient of determination (R²) ...
    All of the above are summary measures based on the training data.
    From there, with a validation dataset, you can calculate the same summary statistics on never-before-seen data by running the model's transform() function against the validation dataset.
    .transform()
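
The split / train / validate steps above can be sketched end to end with a plain-Python toy. This is not the Spark ML API: the classes below (UnitScaler, ToyLinearRegression, LinearModel, Pipeline, PipelineModel) are hypothetical stand-ins that only mimic the fit() / transform() contract, lists of dicts stand in for DataFrames, and the data is synthetic. In Spark the split is normally random (for example with DataFrame.randomSplit); a fixed every-fifth-row split is used here only to keep the sketch deterministic.

```python
import math


class UnitScaler:
    """Toy transformer: rows in, new rows out, with an added 'feature' column."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [{**r, "feature": r["x"] / self.factor} for r in rows]


class LinearModel:
    """Fitted model: itself a transformer that adds a 'prediction' column."""

    def __init__(self, slope, intercept):
        self.slope, self.intercept = slope, intercept

    def transform(self, rows):
        return [{**r, "prediction": self.slope * r["feature"] + self.intercept}
                for r in rows]


class ToyLinearRegression:
    """Toy estimator: fit() takes rows and returns a LinearModel."""

    def fit(self, rows):
        xs = [r["feature"] for r in rows]
        ys = [r["label"] for r in rows]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        slope = sxy / sxx
        return LinearModel(slope, my - slope * mx)


class PipelineModel:
    """Result of Pipeline.fit(): applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains transformers and estimators; estimators are fitted as they are reached."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: replace it with its model
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)   # feed the output to the next stage
        return PipelineModel(fitted)


def regression_metrics(rows):
    """RMSE, MAE and R² computed from 'label' and 'prediction' columns."""
    errors = [r["label"] - r["prediction"] for r in rows]
    labels = [r["label"] for r in rows]
    mean_label = sum(labels) / len(labels)
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return {
        "rmse": math.sqrt(ss_res / len(rows)),
        "mae": sum(abs(e) for e in errors) / len(rows),
        "r2": 1 - ss_res / ss_tot,
    }


# Synthetic data (an assumption): label = 3x + 2 with alternating +/-0.5 noise.
data = [{"x": float(x), "label": 3.0 * x + 2.0 + (0.5 if x % 2 else -0.5)}
        for x in range(20)]

# 1. Split: every fifth row is held out for validation.
validation = data[4::5]
train = [r for i, r in enumerate(data) if i % 5 != 4]

# 2. Train: fit the whole pipeline on the training rows only.
model = Pipeline([UnitScaler(10.0), ToyLinearRegression()]).fit(train)

# 3. Validate: transform() on never-before-seen rows, then summarize the errors.
metrics = regression_metrics(model.transform(validation))
print(metrics)
```

With this synthetic data the validation RMSE and MAE both come out at about 0.5, which is the size of the injected noise, and R² is close to 1.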

Resources
Prepare data for machine learning with Azure Databricks - Training | Microsoft Learn
Train a machine learning model with Azure Databricks - Training | Microsoft Learn