Apache Spark (Advanced) on Hadoop

2 Day Course
Hands On
Code QAASAH

This course has been retired. Please view the currently available Big Data courses.

Modules


Datasets and Catalogs (23 topics)

  • What is a Dataset?
  • Dataset versus SQL/DataFrames
  • When to use which object
  • Serialization performance using Encoders
  • Encoders and semi-structured data
  • Dataset caching (1 of 2)
  • 01: Dataset Caching (2 of 2)
  • 02a/b/c: Common ways to create DS
  • 03: Creating DS from an RDD
  • Cannot create DS these ways
  • 04: Casting DS and convert DS to DF to RDD
  • 05a: map() on DS means lose column names
  • 05b: map() characteristics on Dataset
  • 06: select on DS
  • 07: filter() and groupBy() on DS
  • 08: joinWith() on DS
  • 09: explain() on DS
  • 10: Catalog: List Hive databases
  • 11: Catalog: List Hive tables, Spark Views
  • 12: Catalog: List column names on table
  • 13: Catalog: List Spark functions
  • Review Questions: Datasets/Catalog
  • In Review: Datasets/Catalog
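
The topics above cover the typed Dataset API and the Catalog. A minimal sketch of several of them, assuming Spark 2.x with Scala (the Person case class and column names are illustrative, not taken from the course materials):

    import org.apache.spark.sql.SparkSession

    // Illustrative case class; any case class with supported field types gets an Encoder
    case class Person(name: String, age: Int)

    object DatasetBasics {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("DatasetBasics")
          .enableHiveSupport()       // lets the Catalog see Hive databases and tables
          .getOrCreate()
        import spark.implicits._     // Encoders plus the .toDS()/.toDF() syntax

        // Common ways to create a Dataset: from a local collection or from an RDD
        val people  = Seq(Person("Ann", 34), Person("Bob", 29), Person("Cal", 41)).toDS()
        val fromRdd = spark.sparkContext.parallelize(Seq(Person("Dee", 52))).toDS()

        // map() returns a Dataset of the lambda's result type; the column names are lost
        val ages = people.map(p => p.age)

        // Typed filter(), untyped groupBy(), and converting DS -> DF -> RDD
        val adults = people.filter(_.age >= 30)
        adults.groupBy($"age").count().show()
        val asDF  = people.toDF()
        val asRDD = people.rdd

        // joinWith() keeps both sides as typed objects in a Dataset of pairs
        val pairs = people.joinWith(fromRdd, people("age") < fromRdd("age"))

        // explain() shows the plan Catalyst produced for the typed query
        pairs.explain(true)

        // Catalog: list Hive databases, tables/views, and registered functions
        spark.catalog.listDatabases().show(false)
        spark.catalog.listTables().show(false)
        spark.catalog.listFunctions().show(false)

        spark.stop()
      }
    }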

Catalyst and Tungsten functionalities (26 topics)

  • Before we Begin: Open Zeppelin note
  • DataFrames, Datasets and Views use Catalyst/Tungsten
  • Catalyst optimizer overview
  • 01a: Catalyst: Join on 2 Spark Views demo
  • 01a: Catalyst demo: Join on 2 Spark views
  • But RDDs can't use Catalyst
  • Loading data in Spark 2.x and Catalyst
  • 02a: Load data (old way), then Join (1 of 3)
  • Execution Plan from 'old way' loading (2 of 3)
  • 02b: DataFrameReader: Load/Execution Plan (3 of 3)
  • 03a: Dropping hints to Catalyst (1 of 2)
  • 03b: Dropping hints to Catalyst (2 of 2)
  • 04a: Catalyst: Column pruning demo
  • 04b: Catalyst: Column (& Partition) pruning
  • Catalyst: Predicate pushdown concepts
  • 05: Catalyst: Predicate pushdown (1 of 2)
  • 05: Catalyst: Predicate pushdown (2 of 2)
  • Tungsten overview
  • Tungsten: Binary processing
  • Tungsten: Improved Memory usage
  • 06: Tungsten: Improved Caching demo
  • 07: Tungsten: Whole-stage code gen
  • 08: Tungsten: Whole-stage code gen demo
  • Tungsten: Whole-stage code gen Vectorization
  • Review Questions: Catalyst/Tungsten
  • In Review: Catalyst/Tungsten
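
The Catalyst and Tungsten behaviours listed above can all be observed from explain() output. A minimal sketch, assuming Spark 2.x with Scala; the Parquet paths and column names are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    object CatalystSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("CatalystSketch").getOrCreate()

        // Hypothetical Parquet inputs; columnar sources let Catalyst push work into the scan
        val orders    = spark.read.parquet("/data/orders")
        val customers = spark.read.parquet("/data/customers")

        // Column pruning and predicate pushdown: only the selected columns are read,
        // and the filter shows up as a pushed filter on the Parquet scan
        val bigOrders = orders
          .select("order_id", "customer_id", "amount")
          .filter(orders("amount") > 100)

        // Dropping a hint to Catalyst: broadcast the small side of the join
        val joined = bigOrders.join(broadcast(customers), "customer_id")

        // explain(true) shows the analyzed, optimized and physical plans, including
        // BroadcastHashJoin and the whole-stage code generation done by Tungsten
        joined.explain(true)

        // Tungsten caching stores the cached data in a compact columnar binary format
        joined.cache()
        joined.count()

        spark.stop()
      }
    }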

Machine Learning (73 topics)

  • 2 types of Machine Learning
  • How Models Are Created
  • Four common MLlib functions
  • What is Supervised Learning?
  • Spark Supervised Learning workflow
  • Walking the Workflow: Predicting SPAM (1 of 3)
  • Walking the Workflow: Predicting SPAM (2 of 3)
  • Walking the Workflow: Predicting SPAM (3 of 3)
  • Unsupervised Learning
  • RDD - Machine Learning (MLlib)
  • Walking the Workflow: Predicting SPAM (1 of 3)
  • KMeans scenario
  • 01a: KMeans - Load data
  • 01b: KMeans - Create Model and Predict
  • 01c: KMeans - Compare Actual to Predict
  • Collaborative Filtering (CF) recommender
  • Will Carl like 'Star Wars'?
  • 02a: CF - Load Movie data
  • 02b: CF - Create Model and Factors
  • 02c: CF - Map MovieID to MovieName
  • 02d: CF - Make User recommendation
  • Classification Functions (Supervised)
  • Before we Begin: Classification uses LabeledPoint. So what is LabeledPoint?
  • CASTing X-var and Y-vars for LabeledPoint
  • Logistic Regression, Support Vector Machines, Naive Bayes and Decision Tree (Supervised)
  • 03a: Logistic Regression, Support Vector Machines, Naive Bayes, and Decision Tree
  • 03b: Logistic Regression, Support Vector Machines, Naive Bayes, and Decision Tree
  • 03c: Logistic Regression, Support Vector Machines, Naive Bayes, and Decision Tree
  • 03c: Logistic Regression, Support Vector Machines, Naive Bayes, and Decision Tree (cont.)
  • DataFrames - Machine Learning (ML)
  • ML Pipeline Terminology
  • How ML Pipeline Works
  • 02: Predict Bike Rentals (GBT Regression)
  • 02a: Know the Data
  • 02b: Load and View Data types
  • Clean the Data (remove columns)
  • 02c: Clean the Data (remove columns) (cont.)
  • 02d: Clean the Data (change to Double)
  • 02e: Visualize the DataFrame
  • 02f: Create Train/Test Set from DataFrame
  • Train ML Pipeline - The Big Picture
  • 02g: Define Feature Processing Pipeline
  • 02h: Define Model Training of Pipeline
  • 02i: Add CrossValidation to Pipeline
  • 02j: Tie Features/Model Together in Pipeline
  • 02k: Train the Pipeline
  • 02l: Make Predictions, evaluate Results
  • 02l: Make Predictions, evaluate Results (cont.)
  • 02m/n: Visualize the Model's DataFrame
  • Improving the Model
  • Predict Titanic Survivors (Random Forest)
  • 03a: Know the Data
  • 03b: Load and view Data types and Data
  • 03c: Clean data - Add column 'FamilySize'
  • 03d: Clean data - Replace NULLs (cont.)
  • 03e: Clean data - Replace empty strings (cont.)
  • 03f: Split DataFrame into TrainDF / TestDF
  • 03g: IMPORT ML packages
  • 03h: Index Categorical and Label columns
  • 03i: Assemble all Features into Vector
  • 03j: Using Decision Tree classifier
  • 03k: Retrieve Original labels
  • 03l: Create Pipeline
  • 03m: Selecting the best Model
  • 03n: Make Prediction using TestDF
  • Review Questions: Machine Learning
  • In Review: Machine Learning
  • But wait, there's more (for MLlib) (Appendix)
  • Linear Regression scenario (Supervised)
  • Linear Regression (1 of 6)
  • Linear Regression (2 of 6)
  • Linear Regression (3 of 6)
  • Linear Regression (4 of 6)
  • Linear Regression (5 of 6)
  • Linear Regression (6 of 6)
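
The bike-rental and Titanic walk-throughs above follow the same spark.ml Pipeline pattern: index categorical columns, assemble the features into a vector, train a model, predict on a held-out set, and evaluate. A minimal sketch, assuming Spark 2.2+ with Scala; the tiny in-memory DataFrame and its columns are illustrative only:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
    import org.apache.spark.sql.SparkSession

    object PipelineSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("PipelineSketch").getOrCreate()
        import spark.implicits._

        // Illustrative Titanic-style rows: (sex, age, familySize, label = survived)
        val df = Seq(
          ("male",   22.0, 1.0, 0.0), ("female", 38.0, 2.0, 1.0),
          ("female", 26.0, 1.0, 1.0), ("male",   35.0, 1.0, 0.0),
          ("male",   54.0, 3.0, 0.0), ("female",  4.0, 4.0, 1.0),
          ("male",   40.0, 2.0, 0.0), ("female", 58.0, 1.0, 1.0)
        ).toDF("sex", "age", "familySize", "label")

        // Split into a training set and a held-out test set
        val Array(trainDF, testDF) = df.randomSplit(Array(0.7, 0.3), seed = 42)

        // Index the categorical column, assemble all features into one vector column,
        // then train a decision tree; the Pipeline ties the stages together
        val indexer = new StringIndexer()
          .setInputCol("sex").setOutputCol("sexIndexed").setHandleInvalid("keep")
        val assembler = new VectorAssembler()
          .setInputCols(Array("sexIndexed", "age", "familySize"))
          .setOutputCol("features")
        val tree = new DecisionTreeClassifier()
          .setLabelCol("label").setFeaturesCol("features")

        val pipeline = new Pipeline().setStages(Array(indexer, assembler, tree))

        // Train the whole pipeline, then make predictions on the test set
        val model       = pipeline.fit(trainDF)
        val predictions = model.transform(testDF)

        // Evaluate the predictions (area under the ROC curve for the binary label)
        val auc = new BinaryClassificationEvaluator().setLabelCol("label").evaluate(predictions)
        println(s"Test AUC = $auc")

        spark.stop()
      }
    }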

Prerequisites

To get the most out of this training, you should have the following knowledge or experience, as it builds the foundation for this advanced course.

  • Apache Spark (Basic) on Hadoop
