Posts

Showing posts from 2016

Hello World with Apache Sentry

Apache Sentry is an Apache project for role-based authorization in Hadoop. Sentry works pretty well with Apache Hive. In this blog we will talk about creating a policy in Sentry using the Beeline (HiveServer2) shell. Pre-requisite: I have a Cloudera VM with Sentry installed on it. Hive authorization is done by creating policies in Sentry, and Sentry policies can only be created by Sentry admins. We need to create a Sentry admin group and add that group to the Sentry admin list using Cloudera Manager (in sentry-site.xml). Let's create a user sentryAdmin with group sentryAdmin. Fire the below command on Linux: useradd sentryAdmin Now let's add this group to the Sentry admin list. Go to Cloudera Manager - Sentry - Configuration. Select Sentry (Service-Wide) from Scope and Main from Category. Add sentryAdmin to Admin Groups (sentry.service.admin.group) and restart the Sentry service. It's time to create a policy for a user. Now let's say that I have a database in Hive and I want to give read p…
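The policy creation itself happens from the Beeline shell with SQL-style statements. A minimal sketch, run by a user from the sentryAdmin group; the database, role and group names below are hypothetical, not taken from the post:

-- create a role, give it read access on a database, and hand it to a group
CREATE ROLE read_only_role;
GRANT SELECT ON DATABASE testdb TO ROLE read_only_role;
GRANT ROLE read_only_role TO GROUP analysts;
SHOW ROLES;   -- verify that the role is visible

With this in place, any Hive user in the analysts group can read tables in testdb but cannot modify them.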

Developing Custom Processor in Apache Nifi

Apache NiFi was developed to automate the flow of data between different systems. Apache NiFi is based on technology previously called “Niagara Files” that was in development and used at scale within the NSA for the last 8 years and was made available to the Apache Software Foundation through the NSA Technology Transfer Program. NiFi is built around FlowFiles, which are the heart of it. A FlowFile is a data record, consisting of a pointer to its content (payload) and attributes to support the content, and it is associated with one or more provenance events. The attributes are key/value pairs that act as the metadata for the FlowFile, such as the FlowFile filename. The content is the actual data or payload of the file. Provenance is a record of what has happened to the FlowFile. Each of these parts has its own repository (repo) for storage. Each FlowFile is processed by a FlowFile processor. Processors have access to the attributes of a given FlowFile and its content stream. Processo…
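A custom processor is simply a JVM class that extends AbstractProcessor and works on FlowFiles inside onTrigger. Below is a minimal sketch in Java, assuming the standard nifi-api dependency; the class name, attribute key and relationship description are hypothetical, not taken from the post:

import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class HelloAttributeProcessor extends AbstractProcessor {

    public static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("FlowFiles that were tagged successfully")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();           // pull one FlowFile from the incoming queue
        if (flowFile == null) {
            return;
        }
        // add a key/value attribute (metadata) without touching the content stream
        flowFile = session.putAttribute(flowFile, "hello.processor", "seen");
        session.transfer(flowFile, REL_SUCCESS);     // route it to the 'success' relationship
    }
}

Once packaged as a NAR and placed on NiFi's classpath, it shows up in the UI alongside the built-in processors.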

Cleared Databricks Spark Certification

Last week I cleared the Databricks Spark developer certification. The exam was not that difficult, but it had some tricky questions and a bunch of Scala/Python Spark programming challenges.
1. You must go through the ‘Learning Spark’ book. Try practicing all the transformations and actions given in the book.
2. There were 40 questions in the exam and all were MCQ. I am not sure about the passing percentage; I scored 70% to pass.
3. 90% of the questions were from Spark Core, Spark SQL and Spark Streaming. There were 2 questions on machine learning and 1 question on GraphX. For GraphX I referred to the MapR blog; it has a nice explanation to begin with. https://www.mapr.com/blog/how-get-started-using-apache-spark-graphx-scala
4. Make sure you do enough hands-on before appearing for the exam. Most of the questions were related to programming, so you must have a thorough understanding of how the API works in detail. https://github.com/databricks/learning-spark
5. Try the examples in all three languages, you…

Spark SQL DataFrame Basic

Spark SQL is a Spark module for structured data processing. Users can execute SQL queries while taking advantage of Spark's in-memory data processing architecture. DataFrame: a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database and can be constructed from Hive, an external database or an existing RDD. The entry point for all functionality in Spark SQL is the SQLContext class or one of its descendants. There are two ways to convert an existing RDD into a DataFrame: 1. Inferring the schema 2. Programmatically specifying the schema. We will see examples of both, step by step. Inferring the schema: this is done by using case classes. The names of the arguments to the case class are read using reflection and become the names of the columns. Below is sample code: import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.sql.SQLContext import org.apache.spark.sql.DataFrame …
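A minimal sketch of the inferring-schema approach with the Spark 1.x API (SQLContext) used in the post; the Person case class, its fields and the people.txt path are made up for illustration:

// schema inference via a case class: field names become column names
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object InferSchemaExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("InferSchemaExample"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._              // enables rdd.toDF()

    // each input line looks like "name,age"
    val people = sc.textFile("/user/root/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
      .toDF()

    people.registerTempTable("people")         // Spark 1.x API
    sqlContext.sql("SELECT name FROM people WHERE age > 21").show()
  }
}

The case class is declared outside the method so Scala reflection can see its fields; toDF() comes from the sqlContext.implicits._ import.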

Spark Data Processing Architecture

The Spark data processing model is very interesting to understand. Spark works on a concept called 'RDD', which is an immutable distributed collection of elements. You start writing your Spark job by creating an RDD and subsequently apply multiple transformations and actions to it. Each job gets divided into stages and ends up being executed as many small tasks. We are going to look at this process in detail by taking the standard word count example. I am going to use my virtual machine for the demonstration. I have around 3 GB of text file placed in HDFS, with 29 blocks allocated to it. Here is the output of the fsck command. The file contains a single line which is repeated multiple times; it is just dummy information. Now let's look at the Spark word count program for this. If you look at the above, it starts with reading the text file and applies two consecutive transformations, flatMap and map, followed by reduceByKey. Spark follows a lazy evaluation model and t…
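For reference, a minimal sketch of the word count job being discussed; the HDFS paths are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

    val counts = sc.textFile("/user/root/bigfile.txt")    // one partition per HDFS block (29 here)
      .flatMap(line => line.split(" "))                    // narrow transformation
      .map(word => (word, 1))                              // narrow transformation
      .reduceByKey(_ + _)                                  // wide transformation: shuffle boundary

    counts.saveAsTextFile("/user/root/wordcount-output")   // action: triggers the job
  }
}

Nothing runs until saveAsTextFile is called: the action triggers one job, reduceByKey introduces a shuffle boundary so the job splits into two stages, and each stage runs one task per partition.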

Pipe in Spark

Spark is a distributed parallel processing framework developed using Scala. Spark also supports Java and Python APIs for writing Spark jobs. But there might be a case where you want to use some other language for RDD data processing, e.g. shell script, R, etc., because not every functionality can be provided by the three supported languages. Spark provides a pipe method on RDDs. Spark's pipe lets us write parts of jobs using any language we want, as long as it can read and write to Unix standard streams. It allows us to perform a transformation on an RDD and write the result to standard output. Today we will perform word count by parallelizing a collection, but in between we will use a shell script for splitting the sentences. Sample code: val scriptLocation = "/root/user/shashi/spark/code/pipe/splittingScript.sh" val input = sc.parallelize(List("this is file", "file is this", "this is hadoop")) val output = input.pipe(scriptLocation).ma…
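A minimal sketch of the complete pipe-based word count, assuming splittingScript.sh reads lines from standard input and emits one word per line:

import org.apache.spark.{SparkConf, SparkContext}

object PipeWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PipeWordCount"))

    val scriptLocation = "/root/user/shashi/spark/code/pipe/splittingScript.sh"
    val input = sc.parallelize(List("this is file", "file is this", "this is hadoop"))

    // pipe() streams each element to the script's stdin and reads its stdout line by line
    val words = input.pipe(scriptLocation)

    val counts = words.map(word => (word.trim, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)
  }
}

The script has to be executable and available at the same path on every worker node (or distributed to the executors with sc.addFile).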

Custom Partitioner in Spark

Welcome back to the Sparking world.....!!!! Today we will be looking at one of the most interesting operations that we can perform in Spark - partitioning. If you have a basic understanding of Spark then you might know that Spark works on a concept called 'RDD'. Each RDD is partitioned across the cluster, and one can choose the number of partitions allocated to a newly created RDD. For example: val testFile = sc.textFile("/user/root/testfile.txt", 3) The second parameter of the textFile function indicates the number of partitions to be allocated to the RDD testFile. If you want to verify this then you can use the below command in Scala: testFile.partitions.size Spark supports two types of partitioner: 1. Hash Partitioner - uses Java's Object.hashCode method to determine the partition. 2. Range Partitioner - uses ranges to distribute keys that fall within a given range to the respective partitions. One basic advantage is that, as similar kinds of data are co-located, the…
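A custom partitioner is just a subclass of org.apache.spark.Partitioner that implements numPartitions and getPartition. A minimal sketch with a made-up rule (keys starting with a vowel go to partition 0, everything else is hashed across the remaining partitions):

import org.apache.spark.{Partitioner, SparkConf, SparkContext}

class VowelFirstPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 2, "need at least two partitions")

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    val k = key.toString.toLowerCase
    if (k.nonEmpty && "aeiou".contains(k.charAt(0))) 0            // vowel keys -> partition 0
    else 1 + math.abs(k.hashCode % (partitions - 1))               // others spread over the rest
  }
}

object CustomPartitionerExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CustomPartitionerExample"))
    val pairs = sc.parallelize(List(("apple", 1), ("banana", 1), ("orange", 1), ("grape", 1)))
    val partitioned = pairs.partitionBy(new VowelFirstPartitioner(3))
    println(partitioned.partitions.size)   // 3
  }
}

partitionBy only works on pair RDDs, and co-locating related keys this way pays off when the same RDD is joined or grouped repeatedly.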

Read/Write data from Hbase using Spark

Apache Spark is an open source distributed computing framework for processing large datasets. Spark mainly works on the concept of RDD. An RDD is an immutable, distributed collection of objects partitioned across the cluster. Two types of operations can be performed on an RDD: 1. Transformation 2. Action. Today we will see how we can use these transformations and actions on HBase data. Let's take the traditional 'Employee' example. Pre-requisite: make sure Spark, HBase and Hadoop are functioning properly. We will be running the following list of operations on HBase data using Spark: create a table in HBase, write data into HBase, read data from HBase, apply a filter operation on the HBase data, and store the result into HDFS. Just to give a high-level overview, I will write employee information into HBase and filter those employee records whose age is less than 55. Query: filter those records from the HBase data whose age is less than 55. Expected output: I should get two records with age 53. Be…
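To read the table back into Spark, one common route is newAPIHadoopRDD with HBase's TableInputFormat. A minimal sketch of the read, filter and store-to-HDFS steps, assuming a hypothetical employee table whose row key is the name and which has a personal:age column stored as a string:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object HBaseFilterExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseFilterExample"))

    // point TableInputFormat at the employee table
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "employee")

    val hbaseRDD = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // pull (name, age) pairs out of each HBase Result
    val employees = hbaseRDD.map { case (_, result) =>
      val name = Bytes.toString(result.getRow)
      val age  = Bytes.toString(result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("age"))).toInt
      (name, age)
    }

    val youngerThan55 = employees.filter { case (_, age) => age < 55 }
    youngerThan55.saveAsTextFile("/user/root/employee-under-55")
  }
}

Writes go the other way, through the HBase client Put API or TableOutputFormat; the sketch above covers only the read, filter and store steps.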