Posts

Showing posts from September, 2014

Distributed Cache in Hadoop

Side Data Distribution: Side data can be defined as extra read-only data needed by a job to process the main dataset. The challenge is to make side data available to all the map or reduce tasks (which are spread across the cluster) in a convenient and efficient fashion.

Distributed Cache: It provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run.

Let's take a standard word-count example with the distributed cache. I have a file article.txt placed in the HDFS file system. While running the word-count example, I will read the articles from this file and ignore them in the word count.

Sample Cache File: The cache file contains three articles: 'a', 'an', 'the'.

Input File for Word Count:

Here is the program: import java.io.BufferedReader; import java.io.IOException; import java.io.InputStreamReader; import java.net.URI; import java.util.HashSet; import java.util.StringTokenizer; import org.apache.hadoo
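The full listing is cut off above. As a rough sketch of the approach (not the author's exact code; the class name CacheWordCountMapper and the use of the old DistributedCache helper are my assumptions), the mapper can load the cached article.txt in setup() and skip those words while counting:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Set<String> articles = new HashSet<String>();
    private final Text word = new Text();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Local copies of the cached files are available on every task node.
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cacheFiles != null) {
            for (Path cacheFile : cacheFiles) {
                BufferedReader reader = new BufferedReader(new FileReader(cacheFile.toString()));
                try {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        articles.add(line.trim().toLowerCase());   // 'a', 'an', 'the'
                    }
                } finally {
                    reader.close();
                }
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(line.toString());
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken().toLowerCase();
            if (!articles.contains(token)) {        // ignore the cached articles
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

In the driver, the file would be registered with something like DistributedCache.addCacheFile(new URI("/user/hadoop/article.txt"), job.getConfiguration()) before submitting the job; the HDFS path shown here is only a placeholder.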

Anagram Grouping using MapReduce

Anagram Grouping: Input files contain any number of varying strings; we need to identify anagram words and group them together.

Problem Statement: Find the anagram words in the input files, group them together, and convert each word to upper case.

Solution: To solve this problem I am going to use ChainMapper and ChainReducer. All configuration is done in the ChainRunner class. The flow of the program is: SortKeyMapper --> CombineKeyReducer --> UpperCaseMapper

SortKeyMapper: This reads the input file line by line, tokenizes each line, and sorts the letters of each token to form the key, keeping the original word as the value. In this way Hadoop will group similar words. For example, 'aba' and 'baa' both get sorted to 'aab', so the reducer receives the key 'aab' with the group {aba, baa}.

CombineKeyReducer: It reads each member of the group and combines them into a single string separated by tabs.

UpperCaseMapper: It converts each word to upper case, and the final output will be written
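The post's listing is truncated here. Below is a hedged reconstruction of the chained flow using the old mapred API; the class names (ChainRunner, SortKeyMapper, CombineKeyReducer, UpperCaseMapper) come from the post, but the method bodies are my own sketch of the described logic rather than the original code:

import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainRunner {

    // Stage 1: emit (sorted letters of the word, original word) so anagrams share a key.
    public static class SortKeyMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                char[] letters = token.toLowerCase().toCharArray();
                Arrays.sort(letters);                 // "aba" and "baa" both become "aab"
                out.collect(new Text(new String(letters)), new Text(token));
            }
        }
    }

    // Stage 2: combine all words that share a sorted key into one tab-separated string.
    public static class CombineKeyReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text sortedKey, Iterator<Text> words,
                           OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            StringBuilder group = new StringBuilder();
            while (words.hasNext()) {
                if (group.length() > 0) {
                    group.append('\t');
                }
                group.append(words.next().toString());
            }
            out.collect(sortedKey, new Text(group.toString()));
        }
    }

    // Stage 3: upper-case the grouped words before they are written out.
    public static class UpperCaseMapper extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {
        public void map(Text sortedKey, Text group,
                        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            out.collect(sortedKey, new Text(group.toString().toUpperCase()));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf job = new JobConf(ChainRunner.class);
        job.setJobName("anagram-grouping");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // SortKeyMapper --> CombineKeyReducer --> UpperCaseMapper
        ChainMapper.addMapper(job, SortKeyMapper.class,
                LongWritable.class, Text.class, Text.class, Text.class, true, new JobConf(false));
        ChainReducer.setReducer(job, CombineKeyReducer.class,
                Text.class, Text.class, Text.class, Text.class, true, new JobConf(false));
        ChainReducer.addMapper(job, UpperCaseMapper.class,
                Text.class, Text.class, Text.class, Text.class, true, new JobConf(false));

        JobClient.runJob(job);
    }
}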

Hadoop - Custom Partitioner implementation using wordcount.

Here is a simple implementation of a Partitioner class using the standard word-count example. In the word-count example below, integer keys will be sent to the first reducer and text keys will be sent to the second reducer using a custom partitioner.

Sample Input to Program:
1 2 3 text sample file
3 2 4 timepass testing text file sample 1

Expected Output:
part-00000
1 2
2 2
3 2
4 1

part-00001
file 2
sample 2
testing 1
text 2
timepass 1

Runner Class: import java.io.IOException; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapred.FileInputFormat; import org.apache.hadoop.mapred.FileOutputFormat; import org.apache.hadoop.mapred.JobClient; import org.apache.hadoop.mapred.JobConf; import org.apache.hadoop.mapred.Partitioner; import org.apache.hadoop.mapred.TextInputFormat; import org.apache.hadoop.mapred.TextOutputFormat; public class wc_
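The runner listing is truncated above. A minimal sketch of the partitioner itself, using the same old mapred API as the post's imports (the class name WordPartitioner is my placeholder, not the author's), could look like this:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {

    @Override
    public void configure(JobConf job) {
        // no extra configuration needed for this simple routing rule
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Purely numeric keys ("1", "2", ...) go to the first reducer (part-00000),
        // plain text keys ("file", "sample", ...) go to the second one (part-00001).
        if (key.toString().matches("\\d+")) {
            return 0;
        }
        return 1 % numPartitions;   // falls back to 0 if the job runs with a single reducer
    }
}

The runner would then wire it in with conf.setPartitionerClass(WordPartitioner.class) and conf.setNumReduceTasks(2), which is what produces the two output files shown above.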