Overview of Spark - Bigdata Bootcamp

Learning Objectives

Know basic Scala syntax.
Be familiar with Spark RDD operations.
Be able to work with advanced tools built on top of Spark.

In this chapter, you will learn how to use Spark, an in-memory cluster computing framework for parallel data processing. Because time is limited, we will focus mainly on the interactive shell.

Spark was originally developed in Scala, a functional programming language that runs on the JVM. Although most Spark functions also have Python and Java APIs, this course gives examples in Scala only, for its conciseness and simplicity. Interested students can learn more about using Python and Java with Spark from the official Spark documentation.

This chapter is logically divided into the following sections:

Scala Basic: You will learn or review basic Scala syntax in the interactive shell, including how to declare variables of different types and how to make function calls. How to compile and run a standalone Scala program will also be covered.
Spark Basic: In this section, you will learn how to load data into Spark and how to do some basic processing, such as converting raw strings into instances of a predefined class, filtering out items with missing fields, and counting the results.
Spark SQL: In this section, you will learn how to use SQL-like syntax for data processing in Spark. You will see how the tasks of the previous section can be achieved with Spark SQL.
Spark Graphx: GraphX is a Spark component designed specifically for graph data processing. In this section, you will learn how to construct a graph and run the PageRank and Connected Components algorithms on it.
Spark MLlib: By following the instructions in the previous sections, you will obtain a pre-processed dataset suitable for a machine learning task. In this section, you will learn how to convert that data into feature vectors and how to apply existing MLlib algorithms to predict whether a patient will have heart disease.
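
As a preview of the Spark Basic steps (parse raw strings into a predefined class, drop items with a missing field, count what remains), here is a minimal sketch in plain Scala. It uses an ordinary Scala List rather than a Spark RDD so that it runs without a Spark installation; RDDs offer map and filter methods of the same shape. The Patient class, its fields, and the sample records are hypothetical and are not taken from the course data.

```scala
import scala.util.Try

// Hypothetical record type; the field names are assumptions for illustration.
case class Patient(id: String, age: Option[Int])

object SparkBasicSketch {
  // Convert one raw comma-separated line into a Patient.
  // An empty, absent, or non-numeric age field becomes None (a missing value).
  def parse(line: String): Patient = {
    // The limit -1 keeps trailing empty fields, so "p2," yields two fields.
    val fields = line.split(",", -1).map(_.trim)
    val age = if (fields.length > 1) Try(fields(1).toInt).toOption else None
    Patient(fields(0), age)
  }

  // Keep only the records whose age field is present.
  def complete(patients: List[Patient]): List[Patient] =
    patients.filter(_.age.isDefined)

  def main(args: Array[String]): Unit = {
    // Made-up sample lines standing in for a loaded text file.
    val raw = List("p1,63", "p2,", "p3,41")
    val patients = raw.map(parse)       // raw string -> predefined class
    val kept = complete(patients)       // drop items with a missing field
    println(kept.size)                  // count the remaining records
  }
}
```

With the three sample lines above, the second record has no age, so two records remain. In the Spark shell the same chain would start from a loaded dataset instead of a List.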