RHive - Integration of R and Hive with a Simple Demo

RHive is a package for writing Hive queries in R. You import the RHive library in R and can then run Hive queries from it. In this post we will install RHive and its prerequisites, then work through a simple example to demonstrate the power of R.

Note: All commands shown below were run and tested on CentOS 6.x.

Before we proceed, a few things need to be in place. Since I used CentOS, I installed a couple of prerequisites that are required before starting the RHive installation.

Install Ant: it is required for building and packaging the project. The command below installs it.

sudo yum install ant                                                                                               

Install JDK

sudo yum install java-1.6.0-openjdk                                                                 

Set JAVA_HOME in .bashrc file

export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.36.x86_64                    
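The exact JDK directory name changes with the installed package version, so the path above may not match your machine. As a rough sketch, the Java home can be derived from the java executable by resolving its symlinks. This assumes GNU readlink with the -f option (available on CentOS); java_home_from_bin is a hypothetical helper name, and the result may be a JRE directory rather than the JDK one, so compare it with what is under /usr/lib/jvm before using it.

```shell
# Print the Java home directory for a given java executable by resolving
# symlinks and stripping the trailing /bin/java.
# java_home_from_bin is a hypothetical helper name, not part of any tool.
java_home_from_bin() {
  real_java=$(readlink -f "$1") || return 1
  dirname "$(dirname "$real_java")"
}

# Usage on the machine itself (only runs if java is on the PATH):
if command -v java >/dev/null 2>&1; then
  java_home_from_bin "$(command -v java)"
fi
```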

Install git

sudo yum install git

Set up your HIVE_HOME and HADOOP_HOME

export HIVE_HOME=/usr/lib/hive                                                                                
export HADOOP_HOME=/usr/lib/hadoop                                                                  
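Before moving on, it can help to confirm that the three variables are actually set in the current shell and point at real directories. The snippet below is only a sketch; check_home is a hypothetical helper name, and the paths it reports are whatever you exported above.

```shell
# Report whether a variable is set and points at an existing directory.
# check_home is a hypothetical helper name used only for this sketch.
check_home() {
  name="$1"
  eval "value=\${$name:-}"          # look up the variable by name
  if [ -z "$value" ]; then
    echo "$name: not set"
    return 1
  elif [ ! -d "$value" ]; then
    echo "$name: $value is not a directory"
    return 1
  fi
  echo "$name: ok ($value)"
}

# Check every variable and keep going even if one of them fails.
for v in JAVA_HOME HIVE_HOME HADOOP_HOME; do
  check_home "$v" || true
done
```

If a variable is reported as not set, re-open the terminal or run source ~/.bashrc so the exports take effect.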

Once this is done, install R and RStudio, which we will use to access Hive and run the example.

To install R, run

sudo yum install R

For RStudio Server, run the commands below.


wget https://download2.rstudio.org/rstudio-server-rhel-0.99.484-x86_64.rpm
sudo yum install --nogpgcheck rstudio-server-rhel-0.99.484-x86_64.rpm                   

Note: The RStudio commands above match my machine configuration; check the RStudio site for the download that is compatible with your machine.

To check that RStudio Server is accessible, open your browser and go to the address below.

http://ip:8787

ip: IP address of the machine where you installed RStudio Server.

The URL above should show the RStudio web page.

You have now installed and configured all the prerequisites for RHive. Next, clone the RHive repository and install the RHive package in R.

git clone https://github.com/nexr/RHive.git
cd RHive
ant build
R CMD build RHive
wget https://cran.r-project.org/src/contrib/rJava_0.9-7.tar.gz
wget https://cran.r-project.org/src/contrib/Rserve_1.7-3.tar.gz
R CMD INSTALL rJava_0.9-7.tar.gz
R CMD INSTALL Rserve_1.7-3.tar.gz
R CMD INSTALL RHive_2.0-0.10.tar.gz             
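The tarball name produced by R CMD build depends on the version in the package's DESCRIPTION file, so it may differ from RHive_2.0-0.10.tar.gz on your checkout. A small sketch for picking up whatever was built, assuming the tarball sits in the directory you pass in (find_rhive_tarball is a hypothetical helper name):

```shell
# Print the most recently modified RHive source tarball in a directory
# (defaults to the current directory). Prints nothing if none is found.
# find_rhive_tarball is a hypothetical helper name.
find_rhive_tarball() {
  ls -t "${1:-.}"/RHive_*.tar.gz 2>/dev/null | head -n 1
}

# Usage, from the directory where "R CMD build RHive" was run:
#   R CMD INSTALL "$(find_rhive_tarball .)"
```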

Once the commands above complete successfully, you are ready to use RHive.
To test your installation, start the R shell and run the commands below to check that you get output.

library(RHive)
rhive.init()
rhive.connect(ip,port,hiveServer2)
rhive.query("show databases")

The last command should list all the databases present in Hive; if it doesn't, please re-check your configuration.

Let's draw a 3D pie chart using R now. Before that, we need some sample data in Hive that can be accessed from R.

For this I created a simple Students table. Here is the command to create the Students Hive table.

create table Students(
Sname String,
score int,
subject String
)
row format delimited 
fields terminated by '|'
stored as TextFile;

Sample Input file for student table.

Sname|Score|Subject
Griffith|20|Computer
Rigel|19|Computer
Amos|44|Computer
Mason|35|Computer
Otto|67|Computer
Chester|16|Computer
Henry|15|Computer
Jerry|96|Computer
Grant|86|Computer

Load the data into the table and open the RStudio Server web page (http://ip:8787).
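The load step itself is not shown in this post, so here is one possible way to do it from the shell. It is a sketch under a few assumptions: the hive command-line client is on the PATH, the sample file is on the local disk, and the header line is removed first so it is not loaded as a data row (its score would most likely come through as NULL). strip_header and load_students are hypothetical helper names, and the file paths in the usage comment are placeholders.

```shell
# Print a file without its first (header) line.
strip_header() {
  tail -n +2 "$1"
}

# Load a header-free, pipe-delimited file into the Students table
# through the Hive CLI. Replaces any rows already in the table.
load_students() {
  hive -e "LOAD DATA LOCAL INPATH '$1' OVERWRITE INTO TABLE Students;"
}

# Usage (paths are placeholders):
#   strip_header students.txt > /tmp/students_nohdr.txt
#   load_students /tmp/students_nohdr.txt
```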

You should see a screen something like this.



Here I am going to create a student performance pie chart. Based on the marks scored in the Computer subject, each student is placed in a grade band: Very Poor, Average, Good or Very Good.
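As a quick cross-check that does not need Hive at all, the same banding can be computed directly from the sample file. The sketch below uses the bands from the R script in this post (up to 25, 26 to 40, 41 to 75, above 75); grade_counts is a hypothetical helper name, and the temporary file simply reproduces the sample input shown earlier.

```shell
# Count students per grade band from a pipe-delimited file with a header.
# Output order: very poor, average, good, very good.
grade_counts() {
  awk -F'|' '
    NR == 1  { next }            # skip the header line
    $2 <= 25 { poor++;  next }   # up to 25
    $2 <= 40 { avg++;   next }   # 26 to 40
    $2 <= 75 { good++;  next }   # 41 to 75
             { vgood++ }         # above 75
    END { printf "%d %d %d %d\n", poor, avg, good, vgood }
  ' "$1"
}

# Run it against a copy of the sample data from this post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Sname|Score|Subject
Griffith|20|Computer
Rigel|19|Computer
Amos|44|Computer
Mason|35|Computer
Otto|67|Computer
Chester|16|Computer
Henry|15|Computer
Jerry|96|Computer
Grant|86|Computer
EOF
grade_counts "$sample"   # prints: 4 1 2 2
rm -f "$sample"
```

These four numbers are what the four count queries should return once the table is loaded with the same rows.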

Here is my R script. I am not a good R programmer, but I have tried my best to get the right result. :P

library(RHive)
library(plotrix)  # provides pie3D()

rhive.init()
rhive.connect(ip, 10000)  # ip: address of your Hive server

# Count the students that fall into each score band
poor   <- rhive.query("select count(score) from students where score <= 25")
avgs   <- rhive.query("select count(score) from students where score > 25 and score <= 40")
goodst <- rhive.query("select count(score) from students where score > 40 and score <= 75")
vgoods <- rhive.query("select count(score) from students where score > 75")

# Each count comes back as a single value; convert it to a plain integer
marks <- c(as.integer(poor), as.integer(avgs), as.integer(goodst), as.integer(vgoods))

lbls <- c("Very Poor", "Average", "Good", "Very Good")
pct  <- round(marks / sum(marks) * 100)
lbls <- paste(lbls, pct)            # add percents to labels
lbls <- paste(lbls, "%", sep = "")  # add % to labels

pie3D(marks, labels = lbls, col = rainbow(length(lbls)),
      main = "Pie Chart of Student Performance")

Output from the script above:

Reference:
https://github.com/nexr/RHive

Hope this will help....:)
