Spark Data Processing Architecture

Spark's data processing model is interesting to understand. Spark is built around a concept called the RDD (Resilient Distributed Dataset), which is an immutable distributed collection of elements. You write a Spark job by creating an RDD and then applying a series of transformations and actions to it. The resulting job is divided into stages, and each stage ends up being executed as a set of small tasks.

We are going to look at this process in detail using the standard word count example. I am going to use my virtual machine for the demonstration.

I have a text file of around 3 GB placed in HDFS, with 29 blocks allocated to it. Here is the output of the hdfs fsck command.
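As a cross-check, the same block count can also be read programmatically through the Hadoop FileSystem API. This is a minimal sketch, assuming a live SparkContext named sc (as in spark-shell) and an illustrative file path:

import org.apache.hadoop.fs.{FileSystem, Path}

val path   = new Path("hdfs:///user/demo/bigfile.txt")           // assumed location of the 3 GB file
val fs     = FileSystem.get(sc.hadoopConfiguration)              // reuse the SparkContext's Hadoop config
val status = fs.getFileStatus(path)
val blocks = fs.getFileBlockLocations(status, 0, status.getLen)  // one entry per HDFS block
println(s"Blocks: ${blocks.length}")                             // expect 29 for this file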


The file contains a single line which is repeated many times; it was created just to hold dummy data. Now let's look at the Spark word count program for this.
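Below is a minimal sketch of the standard word count in Scala; the SparkContext variable sc and the HDFS paths are assumptions for illustration.

val textFile = sc.textFile("hdfs:///user/demo/bigfile.txt")  // creates the input RDD from the file

val counts = textFile
  .flatMap(line => line.split(" "))   // transformation: split each line into words
  .map(word => (word, 1))             // transformation: pair each word with a count of 1
  .reduceByKey(_ + _)                 // transformation: sum the counts per word (requires a shuffle)

counts.saveAsTextFile("hdfs:///user/demo/wordcount-output")  // action: this is what triggers execution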


If you look at the above, it starts by reading the text file and applies two consecutive transformations, flatMap and map, followed by reduceByKey.

Spark follows a lazy evaluation model, and therefore no execution takes place until an action is invoked.
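For instance, building the counts RDD in the listing above launches nothing on the cluster; a job is submitted only when an action is called:

// counts so far is only a recorded lineage of transformations;
// no task has run on the cluster yet.
val total = counts.count()   // action: a Spark job is submitted only at this point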

When you call an action on the above RDD, Spark creates a DAG for the list of transformations. The DAG divides the operators into stages: consecutive operators that do not require any data shuffling are grouped into one stage, and an operator that requires a shuffle starts a new stage.

In our example

textFile(), flatMap(), map()  --->  will be grouped into Stage 1
reduceByKey()                 --->  will be grouped into Stage 2
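One way to see this grouping without opening the UI is RDD.toDebugString, which prints the lineage of the final RDD; the indentation step in its output marks the shuffle boundary between the two stages:

// The ShuffledRDD introduced by reduceByKey appears at a new
// indentation level, i.e. the boundary between Stage 1 and Stage 2.
println(counts.toDebugString)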

Below is the Spark UI, which shows that the two stages have been launched.



Now each stage holds a list of tasks to be executed on the executors. The number of tasks is determined by the number of partitions in the RDD. When the input comes from HDFS, one partition is created per HDFS block by default, so in our example 29 partitions will be created for the input RDD.
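This is easy to verify on the input RDD from the listing above:

// textFile() creates one partition per HDFS block by default,
// so this is expected to print 29 for our 29-block file.
println(textFile.partitions.length)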

Below is a pictorial representation of the above explanation.



Once the tasks are created, their execution starts on the worker nodes, and the status is reported back to the driver node.

Happy Sparking....!!!!!

Reference: Learning Spark
