Spark for Java Developers

Big Data with Java Lambdas!

The course is 6 hours long, roughly equivalent to a 3-day training course.

  • Get started with the amazing Apache Spark parallel computing framework. This course is designed especially for Java developers.
  • Learn all of the fundamentals you need to understand the main operations you can perform in Spark Core.
  • Deploy to a live EMR hardware cluster.
  • Understand the internals of Spark and how it optimizes your execution plans.
  • Get some great practice with Java 8 Lambdas - our most "functional" course to date!
  • There will be a follow-on module covering SparkSQL later in the year.

You'll need to be familiar with Java. We'll be using lambdas throughout, but this course is a good introduction to them if you're not familiar with them already.


Having problems? Check the errata.

Introduction 16m 56s

A brief overview of Spark and some of the jargon terms you'll be encountering.


Getting Started 21m 35s

Let's get Spark "installed" - it's just a Maven dependency.


Reduces 14m 19s

Reduces are fundamental operations in Spark. Here we'll do a very basic reduce to establish the idea.
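To give a feel for the idea before the chapter, here is a minimal sketch of a reduce. It uses plain java.util.stream (Java 9+) as a dependency-free stand-in rather than Spark itself; the class and method names are made up for illustration. In Spark, JavaRDD.reduce takes the same kind of two-argument lambda.

```java
import java.util.List;

class ReduceSketch {
    // Collapse the whole list into one value by repeatedly applying
    // a two-argument function to a running result and the next element.
    // (Hypothetical helper; a Spark RDD would be reduced with rdd.reduce(...).)
    static int sum(List<Integer> values) {
        return values.stream().reduce(0, (a, b) -> a + b);
    }
}
```

The function passed to a reduce should be associative, so that partial results computed on different partitions can be combined in any grouping.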


Update - problems with NotSerializableExceptions? 6m 28s

If, in the next chapter on "Mapping" (or any later chapter), you hit a NotSerializableException, it is because your machine has enough CPUs for Spark to treat each CPU as a node in a cluster, and this causes a crash with System.out.println. See this video for a simple workaround.


Mapping 17m 45s

Mapping allows you to transform an RDD from one form to another.
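As a rough illustration, the sketch below maps a list of integers to their square roots, so the element type changes from Integer to Double. It uses java.util.stream as a stand-in so that it runs without Spark on the classpath; the names are hypothetical. JavaRDD.map accepts a one-argument lambda in the same way.

```java
import java.util.List;
import java.util.stream.Collectors;

class MapSketch {
    // Apply the same function to every element, producing a new collection
    // of the same size whose elements may have a different type.
    static List<Double> squareRoots(List<Integer> values) {
        return values.stream()
                .map(v -> Math.sqrt(v))
                .collect(Collectors.toList());
    }
}
```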


Tuples 18m 12s

Commonly used in Scala, Tuples appear everywhere in the Spark Core API. We can use them in Java, but they are a bit awkward.


PairRDDs 41m 30s

A PairRDD is a key/value representation of a dataset.
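The sketch below shows the key/value idea on a made-up log format ("LEVEL: message"): each line becomes a (level, 1) pair, and the values are then summed per key. It is plain Java (9+), with Map.Entry standing in for a tuple and Collectors.toMap standing in for a per-key reduce; in Spark the corresponding calls would be mapToPair and reduceByKey on a JavaPairRDD.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class PairSketch {
    // Turn each "LEVEL: message" line into a (level, 1) pair,
    // then merge pairs that share a key by adding their values.
    // The log format and method name are assumptions for illustration.
    static Map<String, Long> countByLevel(List<String> logLines) {
        return logLines.stream()
                .map(line -> Map.entry(line.split(":")[0], 1L))
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        Long::sum,        // merge function, applied when keys collide
                        TreeMap::new));   // sorted map, so iteration order is predictable
    }
}
```

Unlike a Java Map, a PairRDD can hold the same key many times, which is why an explicit per-key merge step is needed.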


FlatMap and Filtering 14m 46s

FlatMap looks complicated, but it's a simple transformation. We'll also see how to filter.
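Here is a small sketch of both operations together: each sentence is flat-mapped into its individual words (one input, zero or more outputs), and short words are then filtered out. It uses java.util.stream as a stand-in with hypothetical names. Note one difference in Spark itself: from Spark 2.x onward, the lambda passed to JavaRDD.flatMap returns an Iterator rather than a Stream.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class FlatMapFilterSketch {
    // flatMap: split every sentence into words and flatten them into one stream.
    // filter: keep only the words that are at least minLength characters long.
    static List<String> longWords(List<String> sentences, int minLength) {
        return sentences.stream()
                .flatMap(s -> Arrays.stream(s.split(" ")))
                .filter(w -> w.length() >= minLength)
                .collect(Collectors.toList());
    }
}
```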


Reading Files 13m 26s

We can read local files, or read from big data file systems such as S3 or HDFS.


Keyword Ranking 41m 47s

A major exercise: we'll automatically generate keywords for training courses based on their subtitle files.


Sorts and Coalesces 28m 44s

There are some common misunderstandings about sorts, and we'll address them here. Also: what is coalesce used for, and when shouldn't it be used?


Deploying to EMR 40m 42s

We'll now deploy to a live cluster. Spark can deploy to Hadoop YARN clusters, or you can build a standalone cluster. Here we use Amazon EMR. Even if you're not using EMR, do watch this chapter, as there is a lot to learn from running on real hardware.


Joins 27m 27s

One last transformation type on the course: how to do inner, outer, full and Cartesian joins.


Big Data Big Exercise 51m 35s

A chance for you to practice everything - a real "course ranking" process we run here at VirtualPairProgrammers.


Performance 80m 8s

A deeper look into the internals of Spark.

Copyright ©2024