Pipe in Spark

Spark is a distributed parallel processing framework developed in Scala. Spark also supports Java and Python APIs for writing Spark jobs. But there might be cases where you want to use some other language for RDD data processing, e.g. a shell script or R, because not every piece of functionality can be provided by the three supported languages.

Spark provides a pipe method on RDDs. Spark's pipe lets us write parts of a job in any language we want, as long as that language can read from and write to the Unix standard streams.

It allows an external process to transform an RDD: each element is written to the process's standard input, and each line the process writes to standard output becomes an element of the resulting RDD.
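As a quick sanity check of these semantics (a minimal sketch, run in the spark-shell where sc is predefined), piping an RDD through the standard Unix cat command returns the elements unchanged, since cat simply echoes its standard input to standard output:

 // Each RDD element goes to the process's stdin as one line;
 // each line the process prints becomes one element of the result RDD.
 val echoed = sc.parallelize(List("a", "b", "c")).pipe("cat")
 echoed.collect  // Array(a, b, c)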

Today we will perform a word count on a parallelized collection, but in between we will use a shell script for splitting the sentences.

Sample Code:

 // Path to the shell script that splits each sentence into words
 val scriptLocation = "/root/user/shashi/spark/code/pipe/splittingScript.sh"

 // Parallelize a small collection of sentences into an RDD
 val input = sc.parallelize(List("this is file", "file is this", "this is hadoop"))

 // Pipe each sentence through the script (one word per output line),
 // then count the words with the usual map/reduceByKey pattern
 val output = input.pipe(scriptLocation).map(word => (word, 1)).reduceByKey(_ + _)

 output.collect

If you look at the third statement, the pipe method invokes the shell script; the script splits each sentence into words and writes them to standard output, which Spark reads back so that the map function can perform the subsequent operations.
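To see exactly what map receives, you can collect the piped RDD on its own (a sketch reusing the input and scriptLocation values defined above; element order follows the partitioning):

 // One element per line the script wrote to stdout, i.e. one word per element
 input.pipe(scriptLocation).collect
 // e.g. Array(this, is, file, file, is, this, this, is, hadoop)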

Sample Shell Script:

 #!/bin/sh
 # Read one line (one sentence) at a time from standard input
 while read input
 do
     # Rely on shell word-splitting to break the sentence into words
     for word in $input
     do
         echo $word
     done
 done
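One caveat: pipe resolves the script path on each executor, so on a real cluster the script must exist at the same path on every node and be executable (e.g. chmod +x splittingScript.sh). A common alternative, sketched here under the assumption that the same script is used, is to ship the script with the job via SparkContext.addFile and resolve the node-local copy with SparkFiles.get:

 import org.apache.spark.SparkFiles

 // Ship the script to every node that runs a task for this job
 sc.addFile("/root/user/shashi/spark/code/pipe/splittingScript.sh")

 // Pipe through the node-local copy of the script, then count words as before
 val counts = input.pipe(SparkFiles.get("splittingScript.sh")).map(word => (word, 1)).reduceByKey(_ + _)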

Output:
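Assuming the script is executable and reachable at the path above, collect returns the per-word counts; the ordering of the pairs may differ between runs:

 Array((this,3), (is,3), (hadoop,1), (file,2))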


Happy Sparking.....!!!!
