Posts

Showing posts from 2015

Joining Spark RDDs

Hi Friends, today I will demonstrate how you can perform joins on Spark RDDs. We are going to focus on three basic join operations:

1. Join (inner)
2. Left outer join
3. Right outer join

Let's take the standard Employee/Department example and create two RDDs: one holding employee data and another holding department data.

Employee table:

Eid  EName  LName
101  Sam    Flam
102  Scot   Rut
103  Jas    Tez

val EmpRDD = sc.parallelize(Seq((101,"Sam","Flam"), (102,"Scot","Rut"), (103,"Jas","Tez")))
Array[(Int, String, String)] = Array((101,Sam,Flam), (102,Scot,Rut), (103,Jas,Tez)) // output

Department table:

DeptId  DepartmentName  Eid
D01     Computer        101
D02     Electronic      104
D03     Civil           102

val DeptRDD = sc.parallelize(Seq(("D01","Computer",101), ("D02","Electronic",104), ("D03","Civil",102)))
Array[(String, String, Int)] = Array((D01,Computer,101), (D02,Electronic,104), (D03,Civil,102)) // output
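Spark's join operators are defined on pair RDDs, so before joining, each RDD above would first be keyed by Eid (for example with keyBy). As a quick, Spark-free sketch of what the three joins return, here is the same semantics on plain Java maps; JoinDemo and its method names are mine for illustration, and Spark itself is not used here:

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

public class JoinDemo {
    // Inner join: only keys present in both maps survive
    // (mirrors what EmpRDD.keyBy(...).join(DeptRDD.keyBy(...)) returns).
    static <K, V, W> Map<K, Map.Entry<V, W>> inner(Map<K, V> left, Map<K, W> right) {
        Map<K, Map.Entry<V, W>> out = new TreeMap<>();
        for (Map.Entry<K, V> e : left.entrySet()) {
            if (right.containsKey(e.getKey())) {
                out.put(e.getKey(), Map.entry(e.getValue(), right.get(e.getKey())));
            }
        }
        return out;
    }

    // Left outer join: every left key appears; the right side is an
    // empty Optional when absent (Spark's leftOuterJoin uses Option).
    static <K, V, W> Map<K, Map.Entry<V, Optional<W>>> leftOuter(Map<K, V> left, Map<K, W> right) {
        Map<K, Map.Entry<V, Optional<W>>> out = new TreeMap<>();
        for (Map.Entry<K, V> e : left.entrySet()) {
            out.put(e.getKey(),
                    Map.entry(e.getValue(), Optional.ofNullable(right.get(e.getKey()))));
        }
        return out;
    }

    public static void main(String[] args) {
        // The post's data, keyed by Eid.
        Map<Integer, String> emp  = Map.of(101, "Sam Flam", 102, "Scot Rut", 103, "Jas Tez");
        Map<Integer, String> dept = Map.of(101, "D01 Computer", 104, "D02 Electronic", 102, "D03 Civil");

        System.out.println(inner(emp, dept).keySet());     // [101, 102]
        System.out.println(leftOuter(emp, dept).keySet()); // [101, 102, 103]
    }
}
```

A right outer join is just the left outer join with the two sides swapped: every Eid in the department data appears (including 104, which has no matching employee), with the employee side wrapped in an Optional.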

RHive - Integration of R and Hive with a simple demo

RHive is a package that lets you write Hive queries in R: import the RHive library in R and you can start writing Hive queries from it. Today we are going to walk through the installation and a simple example to demonstrate the power of R.

Note: all commands shown below were executed and tested on CentOS 6.x.

Before we proceed, we need certain things in place. Since I used CentOS, I installed a couple of prerequisites before beginning the RHive installation.

Install Ant (required for building and packaging the project):

sudo yum install ant

Install the JDK:

sudo yum install java-1.6.0-openjdk

Set JAVA_HOME in the .bashrc file:

export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.36.x86_64

Install git

Put in HBase using REST API (Stargate)

Hi Friends, today I am going to demonstrate a simple example of putting data into an HBase table using the Stargate REST API. In this example, I have a simple Employee JSON document which I want to store into one of the cells of HBase. Here is my sample JSON document:

{"Employee":{"Ename":"Sam","Eid":12}}

This sample contains an employee name and an employee ID. Instead of this JSON document, I could use any other sample data that I want to store in HBase. I created a table in HBase named testtable with a column family colfam1:

create 'testtable', 'colfam1'

Now I have everything available to begin with. The next step is to form your REST query and simply fire it to insert data into HBase. Here is a sample example taken from https://wiki.apache.org/hadoop/Hbase/Stargate for inserting data into HBase:

curl -H "Content-Type: text/xml" --data '[...]' http://localhost:8000/test/testrow/test:testcolumn

Ma
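The '[...]' placeholder in the curl command is a Stargate CellSet XML body, in which the row key, column name ("family:qualifier"), and cell value are all Base64-encoded. A minimal sketch of building that body in Java follows; the class and method names are mine, and the table/column choices are just the ones from this post:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class StargateBody {
    private static String b64(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    // Builds the single-cell CellSet XML payload that replaces '[...]'
    // in the curl command. Stargate expects the row key, the column
    // ("family:qualifier"), and the value to be Base64-encoded.
    public static String putBody(String rowKey, String column, String value) {
        return "<CellSet><Row key=\"" + b64(rowKey) + "\">"
             + "<Cell column=\"" + b64(column) + "\">" + b64(value)
             + "</Cell></Row></CellSet>";
    }

    public static void main(String[] args) {
        String json = "{\"Employee\":{\"Ename\":\"Sam\",\"Eid\":12}}";
        System.out.println(putBody("row1", "colfam1:emp", json));
    }
}
```

The resulting string would then be sent with Content-Type: text/xml to the Stargate endpoint for the table and row, the same way the curl example does.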

Custom Source in Flume

Flume provides a way for you to write your own source. As we know, there are default source types available in Flume such as exec, spooldir, and Twitter. Here I have tried a small demonstration of a custom Flume source. In this example, I have written a MySource Java class which reads single lines from the input, concatenates them as output, and passes the result to the channel.

Example:

Sample input file:
20
50
50
04
17
59
18
43
28
58
27
81

Sample output file:
20
2050
205050
20505004
2050500417
205050041759
20505004175918
2050500417591843
205050041759184328
20505004175918432858

The first line is concatenated with the next, and the process continues in this way. Here is my Java code.

MySource.java

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.Charset;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException
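Independent of the Flume plumbing, the transformation itself is just a running concatenation of the input lines. A small self-contained sketch of that step (class and method names are mine; in the real source, this logic would sit inside the source's process loop, wrapping each string in a Flume Event before handing it to the channel):

```java
import java.util.ArrayList;
import java.util.List;

public class RunningConcat {
    // For each input line, emit the concatenation of all lines read so far,
    // mirroring what the custom source hands to the Flume channel.
    public static List<String> runningConcat(List<String> lines) {
        List<String> out = new ArrayList<>();
        StringBuilder acc = new StringBuilder();
        for (String line : lines) {
            acc.append(line);
            out.add(acc.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(runningConcat(List.of("20", "50", "50", "04")));
        // [20, 2050, 205050, 20505004]
    }
}
```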

Rhadoop - Integration of R and Hadoop using rhdfs

Hi Friends, for the last few days I have been trying to integrate R and Hadoop. I found that there are a couple of packages that can be used for R and Hadoop integration, for example Rhipe, Rhadoop, BigR, etc. Of these, the easiest one I found is Rhadoop, as while using Rhipe I faced a lot of dependency issues. Here is a small demonstration of how to get started with rhdfs. If you want more, you can try rmr for writing MapReduce jobs.

The packages available with Rhadoop are:
1. rmr
2. rhdfs
3. rhbase
4. plyrmr

Note: I have installed everything on CentOS 6.x.

The first thing you need is R in place. Use the commands below to install R:

sudo su -c 'rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm'
sudo yum install R

Once R is installed, just type R from the shell and you should be able to enter the R shell. The next thing is to install the dependencies for the Rhadoop package. Download rhdfs