Posts

Showing posts from May, 2016

Spark SQL DataFrame Basic

Spark SQL is a Spark module for structured data processing. Users can execute SQL queries while taking advantage of Spark's in-memory data processing architecture.

DataFrame: a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database and can be constructed from Hive tables, external databases, or existing RDDs. The entry point for all functionality in Spark SQL is the SQLContext class or one of its descendants.

There are two ways to convert an existing RDD into a DataFrame:
1. Inferring the schema
2. Programmatically specifying the schema

We will see examples of both, step by step.

Inferring the schema: this is done using case classes. The names of the arguments to the case class are read using reflection and become the names of the columns. Below is sample code:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.DataFrame
…
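To make the excerpt concrete, here is a minimal sketch of schema inference with a case class, in the spirit described above; the Person case class, the people.txt path and its comma-separated name,age layout are assumptions for illustration, not taken from the full post.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Case class defined at top level so Spark's reflection can read its fields.
case class Person(name: String, age: Int)

object InferSchemaExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("InferSchemaExample")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._          // enables rdd.toDF()

    // Column names come from the case class fields via reflection.
    val peopleDF = sc.textFile("/user/root/people.txt")   // hypothetical path
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
      .toDF()

    peopleDF.registerTempTable("people")   // Spark 1.x style temp table
    sqlContext.sql("SELECT name FROM people WHERE age > 30").show()

    sc.stop()
  }
}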

Spark Data Processing Architecture

Spark's data processing model is very interesting to understand. Spark is based on a concept called 'RDD', an immutable distributed collection of elements. You start writing your Spark job by creating an RDD and then applying multiple transformations and actions to it. Each job is divided into stages, which end up being executed as many small tasks.

We are going to look at this process in detail using the standard word count example. I am going to use my virtual machine for the demonstration. I have around 3 GB of text file placed in HDFS, with 29 blocks allocated to it; here is the output of the fsck command. The file contains a single line which is repeated many times, just dummy information. Now let's look at the Spark word count program for this (a sketch follows below).

If you look at the above, it starts with reading the text file and applies two consecutive transformations, flatMap and map, followed by reduceByKey. Spark follows a lazy evaluation model and…
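Here is a minimal word-count sketch matching the description above; the HDFS input and output paths are placeholders, not the actual 3 GB file used in the demonstration.

// Read the text file, then chain flatMap -> map -> reduceByKey.
val textFile = sc.textFile("hdfs:///user/root/bigfile.txt")   // placeholder path
val counts = textFile
  .flatMap(line => line.split(" "))   // transformation: one record per word
  .map(word => (word, 1))             // transformation: pair each word with 1
  .reduceByKey(_ + _)                 // wide transformation: introduces a shuffle stage boundary
// Nothing has executed yet (lazy evaluation); the action below launches the job.
counts.saveAsTextFile("hdfs:///user/root/wordcount-output")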

Pipe in Spark

Spark is a distributed parallel processing framework developed in Scala. Spark also supports Java and Python APIs for writing Spark jobs, but there might be cases where you want to use some other language for RDD data processing, e.g. a shell script or R, because not every piece of functionality can be provided by the three supported languages. Spark provides a pipe method on RDDs. Spark's pipe lets us write parts of jobs in any language we want, as long as it can read from and write to Unix standard streams. It allows us to perform a transformation on an RDD and write the result to standard output.

Today we will perform a word count by parallelizing a collection, but in between we will use a shell script for splitting the sentences. Sample code:

val scriptLocation = "/root/user/shashi/spark/code/pipe/splittingScript.sh"
val input = sc.parallelize(List("this is file", "file is this", "this is hadoop"))
val output = input.pipe(scriptLocation)…
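A hedged sketch of how this pipe-based word count could be completed; the behaviour of splittingScript.sh (reading lines from stdin and emitting one word per line) and the map/reduceByKey steps after the truncation are assumptions, not the post's exact code.

// splittingScript.sh is assumed to look roughly like:
//   #!/bin/bash
//   while read line; do for w in $line; do echo "$w"; done; done
// and it must be present at the same path on every worker node.
val scriptLocation = "/root/user/shashi/spark/code/pipe/splittingScript.sh"
val input = sc.parallelize(List("this is file", "file is this", "this is hadoop"))
val output = input
  .pipe(scriptLocation)        // each partition's records are fed to the script's stdin
  .map(word => (word, 1))      // the script's stdout lines become the new RDD
  .reduceByKey(_ + _)
output.collect().foreach(println)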

Custom Partitioner in Spark

Welcome back to the Sparking world.....!!!! Today we will be looking at one of the most interesting operations that we can perform in Spark: partitioning.

If you have a basic understanding of Spark, then you know that Spark works on a concept called 'RDD'. Each RDD is partitioned across the cluster, and you can choose the number of partitions allocated to a newly created RDD. For example:

val testFile = sc.textFile("/user/root/testfile.txt", 3)

The second parameter of the textFile function indicates the number of partitions to be allocated to the RDD testFile. If you want to verify this, you can use the command below in Scala:

testFile.partitions.size

Spark supports two types of partitioner:
1. Hash partitioner - uses the Java Object.hashCode method to determine the partition.
2. Range partitioner - uses a range to distribute the keys that fall within that range to the respective partitions.

One basic advantage is that, as similar kinds of data are co-located, the…
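The excerpt is cut off before the custom partitioner itself, so here is an illustrative sketch of one; the FirstLetterPartitioner name and its first-character keying scheme are invented for the example, not taken from the original post.

import org.apache.spark.Partitioner

// Routes keys to partitions based on their first character.
class FirstLetterPartitioner(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int = {
    val k = key.toString
    if (k.isEmpty) 0
    else math.abs(k.charAt(0).toInt) % numPartitions
  }

  // Spark uses equals() to decide whether two RDDs are partitioned the same way.
  override def equals(other: Any): Boolean = other match {
    case p: FirstLetterPartitioner => p.numPartitions == numPartitions
    case _ => false
  }
}

// Usage: partitionBy is available on pair RDDs only.
val pairs = sc.parallelize(List(("apple", 1), ("banana", 2), ("avocado", 3)))
val partitioned = pairs.partitionBy(new FirstLetterPartitioner(3))
println(partitioned.partitioner)   // Some(FirstLetterPartitioner@...)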

Read/Write data from Hbase using Spark

Apache Spark is an open source distributed computing framework for processing large datasets. Spark mainly works on the concept of RDD: an immutable, distributed collection of objects partitioned across the cluster. Two types of operation can be performed on an RDD:
1. Transformation
2. Action

Today we will see how we can use these transformations and actions on HBase data. Let's take the traditional 'Employee' example.

Pre-requisite: make sure Spark, HBase and Hadoop are functioning properly.

We will be running the following list of operations on HBase data using Spark:
1. Create a table in HBase.
2. Write data into HBase.
3. Read data from HBase.
4. Apply a filter operation on the HBase data.
5. Store the result into HDFS.

Just to give a high-level overview, I will write employee information into HBase and filter those employee records whose age is less than 55.

Query: filter those records from the HBase data whose age is less than 55.
Expected output: I should get two records with age 53.
…
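As a rough sketch of the read-and-filter part of this workflow (the full post is truncated above), the code below reads the 'Employee' table through TableInputFormat and filters on age; the column family 'personal', the qualifiers 'name' and 'age', and the HDFS output path are assumptions for illustration.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "Employee")

// Each record is a (row key, Result) pair read from HBase.
val employeeRDD = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// Transformation: extract (name, age), then keep employees younger than 55.
val youngerThan55 = employeeRDD
  .map { case (_, result) =>
    val name = Bytes.toString(result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name")))
    val age  = Bytes.toString(result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("age"))).toInt
    (name, age)
  }
  .filter { case (_, age) => age < 55 }

// Action: store the filtered result into HDFS.
youngerThan55.saveAsTextFile("/user/root/employee_age_lt_55")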