Spark Job Stage

Amit Kumar
3 min read · Jan 16, 2021

All we are going to do in Spark is read data from a source, load it into Spark to process it and hold intermediate results, and finally write the result back to a destination. Basically, it is an ELT (extract, load and transform) process.

In this process we need a data structure to hold the data in Spark. We have three options for holding data in Spark:

  1. RDD
  2. DataFrame (DF)
  3. Dataset (DS)

Spark 2.x recommends using DF and DS and avoiding raw RDDs, but the critical fact is that both DF and DS ultimately compile down into RDDs. So, under the hood, everything in Spark is an RDD.
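To see this for yourself, here is a minimal sketch (assuming Spark 2.x or later running in local mode; the object name and sample data are purely illustrative) that builds a Dataset, converts it to a DataFrame, and peeks at the RDD both compile into:

```scala
import org.apache.spark.sql.SparkSession

object RddUnderTheHood {
  def main(args: Array[String]): Unit = {
    // Local session purely for illustration
    val spark = SparkSession.builder()
      .appName("rdd-under-the-hood")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A Dataset built from an in-memory collection
    val ds = Seq(("spark", 1), ("rdd", 2), ("dataframe", 3)).toDS()

    // A DataFrame is just a Dataset of Row objects
    val df = ds.toDF("word", "count")

    // Both expose the RDD they ultimately compile into
    val underlying = df.rdd
    println(underlying.getClass)          // an RDD of Row objects
    println(underlying.getNumPartitions)  // the data is split into partitions

    spark.stop()
  }
}
```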

RDD stands for Resilient Distributed Dataset.

The name RDD itself defines its significance:

  • Resilient: fault-tolerant. Each RDD remembers the lineage of transformations that produced it, so lost partitions can be recomputed.
  • Distributed: the data is partitioned across the nodes of the cluster.
  • Dataset: a collection of records, i.e. the data itself.

RDD offers two types of operations: transformations and actions.

  • Transformation: creates a new distributed dataset from an existing one. In other words, it creates a new RDD from an existing RDD.
  • Action: sends a result back to the driver (or writes it to external storage), and hence produces a non-distributed result.

map and reduceByKey are transformations, and collect is an action. All transformations in Spark are lazy: they do not compute their results until an action requires a result to be produced.
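Here is a minimal sketch of that exact pipeline (again assuming a local Spark setup; the object name and sample data are illustrative). Nothing actually runs until collect() is called:

```scala
import org.apache.spark.sql.SparkSession

object LazyTransformations {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-transformations")
      .master("local[*]") // local mode for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

    // Transformations: nothing is computed yet; Spark only records the lineage
    val pairs  = words.map(w => (w, 1))   // new RDD of (word, 1) pairs
    val counts = pairs.reduceByKey(_ + _) // still lazy

    // Action: collect() triggers the job and brings the
    // (non-distributed) result back to the driver
    counts.collect().foreach(println)     // e.g. (a,3), (b,2), (c,1)

    spark.stop()
  }
}
```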

All of this is coordinated by the driver program: it records the lineage of transformations, and when an action is called it submits the job to the executors on the cluster.

Some examples of transformations are map, filter, flatMap, distinct, and reduceByKey; some examples of actions are collect, count, take, first, and saveAsTextFile. A short sketch of a few of them follows.
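The following sketch (once more assuming a local setup, with illustrative names and data) exercises a few common transformations and actions:

```scala
import org.apache.spark.sql.SparkSession

object TransformationActionExamples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("transformation-action-examples")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.parallelize(Seq("spark is fast", "spark is lazy", "rdd is core"))

    // Transformations: each returns a new RDD and is evaluated lazily
    val words         = lines.flatMap(_.split(" ")) // split lines into words
    val withoutIs     = words.filter(_ != "is")     // drop a stop word
    val distinctWords = withoutIs.distinct()        // remove duplicates

    // Actions: each triggers a job and returns a value to the driver
    println(distinctWords.count())          // number of distinct words
    println(distinctWords.take(2).toList)   // first two elements
    distinctWords.collect().foreach(println)

    spark.stop()
  }
}
```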

Originally published at https://amitnxtt.blogspot.com on January 16, 2021.
