Big Data and Spark

Amit Kumar
3 min read · Jan 13, 2021


As we already know, if we have a massive amount of data and want to process it, for whatever purpose, it isn't efficient or cost-effective to do it on a single computer, no matter how big and powerful that individual machine is. At some point we will surely hit a bottleneck.

  • We also know that Hadoop offered a revolutionary solution to this problem: a distributed storage and computing framework plus a cluster resource manager.

However, writing MapReduce programs is hard for developers, and many criticize MapReduce for its poor performance.
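To see why MapReduce felt verbose, here is a minimal word-count sketch in plain Python that imitates the map, shuffle, and reduce phases. This is an illustration of the programming model only, not actual Hadoop API code; in a real job each phase runs distributed across the cluster.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["spark is fast", "hadoop is a framework"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["is"])  # → 2
```

Even this toy version needs three explicit phases; real MapReduce code adds mapper/reducer classes, job configuration, and serialization on top, which is exactly the boilerplate developers complained about.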

Later, Apache Spark came to market. We can think of Spark as a compelling alternative to, or replacement for, Hadoop MapReduce.

If we compare MapReduce with Spark, we find Spark is often 10 to 100 times faster than MapReduce. So we can think of Spark as the successor to Hadoop MapReduce.

What is Spark?

Let's look at the Spark home page…

Lightning-fast cluster computing. A fast and general engine for large-scale data processing.

Let's look at the Databricks page…

Powerful, open source, easy to use. That's correct, Spark is indeed all of this, but it all sounds like a marketing pitch. What exactly is Spark?

We can define Spark like this…

Why is Spark so great?

At a high level, there are three main reasons for its popularity and rapid adoption.

  1. It abstracts away the fact that we are coding against a cluster of computers.
  • In the best case, we are working with tables and SQL queries. It feels like working with a database.
  • In the worst case, we are working with collections. It feels like working with Scala or Python collections.
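That "collections" feel can be shown with a plain-Python stand-in. This is only a sketch of the shape of the code, not PySpark itself; the comment shows roughly what the same pipeline would look like on a real `SparkContext` (here assumed to be named `sc`).

```python
# Plain-Python stand-in for the "collections" feel of Spark's RDD API.
# In real PySpark the same pipeline would read roughly:
#   sc.parallelize(nums).map(lambda x: x * x).filter(lambda x: x > 10).collect()
nums = range(1, 7)
squares = map(lambda x: x * x, nums)     # like RDD.map
big = filter(lambda x: x > 10, squares)  # like RDD.filter
result = list(big)                       # like RDD.collect: materialize results
print(result)  # → [16, 25, 36]
```

The point is that Spark lets us chain familiar operations like `map` and `filter`, while the framework quietly distributes the work across the cluster.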

2. It is a unified platform that combines batch processing, structured data handling with a SQL-like language, near real-time stream processing, graph processing, and machine learning. All of these come in a single framework, usable from our favorite programming language.

3. Ease of use: compared with MapReduce code, Spark code is much shorter, simpler, and easier to read and understand.

Its growing ecosystem of libraries offers ready-to-use algorithms and tools.

So, we are now able to answer the questions below:

  1. What is Apache Spark?
  2. What do we do with Apache Spark?

The answers are:

  1. A distributed computing platform.
  2. We create programs and execute them on a Spark cluster.

How do we execute a program on a Spark cluster?
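One common way is Spark's `spark-submit` tool. The sketch below assumes a PySpark script named `my_app.py` (a hypothetical file name) and a Spark standalone cluster whose master host you would substitute:

```shell
# Submit a Python application to a Spark cluster.
# my_app.py and <master-host> are placeholders for illustration;
# 7077 is the default port of a Spark standalone master.
spark-submit \
  --master spark://<master-host>:7077 \
  my_app.py
```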
