Apache Spark (Basic) on Hadoop

3 Day Course
Hands On
Code QAASBH

This course has been retired. Please view the currently available Big Data courses.

Modules

Intro and Setup (2 topics)

  • How to start Spark and Zeppelin services in Ambari
  • How to log in to Spark using Python and Scala

Spark Architecture (3 topics)

  • What is Apache Spark?
  • Spark components (Driver, Context, YARN, HDFS, Workers, Executors)
  • Spark processing (Jobs, Stages, Tasks)

Getting Started with RDDs (3 topics)

  • Running queries in Python, Scala and Zeppelin
  • Creating RDDs
  • Queries using the most popular Transformations and Actions

Pair RDDs (2 topics)

  • Differences between RDDs and Pair RDDs
  • Actions on one Pair RDD, and Transformations on one and on two Pair RDDs

Spark SQL (2 topics)

  • Working with DataFrames, Tables and Datasets
  • Catalyst optimizer overview

Spark Streaming (2 topics)

  • Working with DStreams
  • Stateless and Stateful Streaming labs using HDFS and Sockets

Visualizations using Zeppelin (2 topics)

  • Creating various Charts using DataFrames and Tables
  • How to create Pivot charts and Dynamic forms

Spark UI (2 topics)

  • Overview of Jobs, Stages and Tasks
  • Monitoring Spark jobs in Spark UI

Performance Tuning (2 topics)

  • Caching, Checkpointing, Accumulators and Broadcast Variables
  • Hashed Partitions, Tungsten, Executor memory and Serialization

Spark Applications (2 topics)

  • Creating an application via spark-submit
  • Parameter configurations (number of executors, driver memory, executor cores, etc.)

Spark 2.0 Machine Learning (ML) (2 topics)

  • How ML Pipelines work
  • Making Predictions using a Decision Tree

Prerequisites

To get the most out of this training, you should already have the following knowledge or experience, as these topics will not be covered in class.

  • Hadoop Distributed File System (HDFS), YARN (Yet Another Resource Negotiator) and the MapReduce processing engine
  • Scala or Python coding
  • Linux command line experience
