Posts

Showing posts from 2014

Cleared CCD-410 Cloudera Hadoop Developer Exam

Feeling relaxed and happy. Finally cleared the Cloudera Hadoop Developer certification (CCD-410) exam today. The exam was not that difficult, but it had some tricky questions and a couple of programming questions. Here are some suggestions and links for preparing for this exam:
1. Prepare from Hadoop: The Definitive Guide (I referred to the 3rd edition of the book).
2. Read the Yahoo Hadoop developer tutorial.
3. Make short notes so that it is easy to revise at the end.
4. Try running the examples given in the Definitive Guide as well as in Hadoop in Action (custom RecordReader, DistributedCache, secondary sort, Combiner, Partitioner, etc.).
5. Play with the different Hadoop commands and with mappers/reducers.
6. Have a basic understanding of the Hadoop ecosystem components, especially those covered in the Definitive Guide.
All the best...!!!!
Sqoop
Sqoop is an open source tool that allows users to extract data from a structured data store into Hadoop for further processing. After the data is loaded into HDFS, it can be processed by MapReduce or by a higher-level language such as Hive. Data can also be exported from HDFS back to a relational database for users. By default, Sqoop generates comma-delimited text files for imported data. Sqoop uses the primary key of the relational table for splitting, and each split is processed by an individual mapper. Sqoop runs 4 mappers by default to load data into HDFS; the number of mappers can be changed with the -m <number of mappers> option, as in the example below.
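A minimal sqoop import invocation matching the description above might look like the following; the connection string, table, credentials and target directory are illustrative placeholders, not values from the post.

sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --table customers \
  --username dbuser -P \
  --target-dir /user/hadoop/customers \
  -m 8

Here -m 8 overrides the default of 4 mappers; if it is omitted, Sqoop splits the table across 4 mappers using the primary key.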

Distributed Cache in Hadoop

Side Data Distribution: Side data can be defined as extra read-only data needed by a job to process the main dataset. The challenge is to make the side data available to all the map or reduce tasks (which are spread across the cluster) in a convenient and efficient fashion.
Distributed Cache: It provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run.
Let's take the standard word count example with the distributed cache. I have a file article.txt placed in the HDFS file system. While running the word count example, I will read the articles from this file and ignore them in the word count.
Sample cache file: the cache file contains the three articles 'a', 'an' and 'the'.
Input file for word count:
Here is the program (truncated in this excerpt; a minimal sketch follows): import java.io.BufferedReader; import java.io.IOException; import java.io.InputStreamReader; import java.net.URI; import java.util.HashSet; import java.util.StringTokenizer; import org.apache.hadoo
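Since the program listing is cut off above, here is a minimal, self-contained sketch of the same idea against the old mapred API. The class name StopWordMapper and the file-reading logic are my reconstruction, not the post's exact code; the cache file is assumed to have been registered in the driver with DistributedCache.addCacheFile().

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Mapper that loads the cached article list once and skips those words while counting.
public class StopWordMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final Set<String> articles = new HashSet<String>();
    private final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void configure(JobConf job) {
        try {
            // Files added with DistributedCache.addCacheFile() are copied to each task node.
            Path[] cacheFiles = DistributedCache.getLocalCacheFiles(job);
            if (cacheFiles != null) {
                for (Path cacheFile : cacheFiles) {
                    BufferedReader reader = new BufferedReader(new FileReader(cacheFile.toString()));
                    String line;
                    while ((line = reader.readLine()) != null) {
                        articles.add(line.trim().toLowerCase());
                    }
                    reader.close();
                }
            }
        } catch (IOException e) {
            throw new RuntimeException("Failed to read distributed cache file", e);
        }
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken().toLowerCase();
            if (!articles.contains(token)) {   // ignore 'a', 'an', 'the' from the cache file
                word.set(token);
                output.collect(word, one);
            }
        }
    }
}

In the driver the cache file would be registered before submitting the job, for example DistributedCache.addCacheFile(new URI("/user/hadoop/article.txt"), conf); the HDFS path here is an illustrative placeholder.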

Anagram Grouping using MapReduce

Anagram Grouping: The input files contain a number of varying strings; we need to identify the words that are anagrams of each other and group them together.
Problem statement: Find the anagram words in the input files, group them together, and convert each word to upper case.
Solution: To solve this problem I am going to use ChainMapper and ChainReducer. All configuration is done in the ChainRunner class. The flow of the program is: SortKeyMapper --> CombineKeyReducer --> UpperCaseMapper.
SortKeyMapper: Reads the input file line by line, tokenizes each line, and sorts the characters of each token to form the key, keeping the original token as the value. In this way Hadoop groups anagram words together; for example, 'aba' and 'baa' both sort to 'aab', so the reducer receives the key 'aab' with the group {aba, baa}. A sketch of this mapper is shown below.
CombineKeyReducer: Reads each member of the group and combines them into a single tab-separated string.
UpperCaseMapper: Converts each word to upper case, and the final output is written.
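A minimal sketch of the SortKeyMapper idea described above, using the old mapred API that ChainMapper and ChainReducer expect; the class name matches the post, but the body is my reconstruction rather than the original code.

import java.io.IOException;
import java.util.Arrays;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (sorted characters of the word, original word) so that anagrams share the same key.
public class SortKeyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            String word = tokens.nextToken();
            char[] chars = word.toCharArray();
            Arrays.sort(chars);                       // 'aba' and 'baa' both become 'aab'
            output.collect(new Text(new String(chars)), new Text(word));
        }
    }
}

In the ChainRunner driver these pieces would be wired on a single JobConf with ChainMapper.addMapper(...) for SortKeyMapper, ChainReducer.setReducer(...) for CombineKeyReducer, and ChainReducer.addMapper(...) for UpperCaseMapper.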

Hadoop - Custom Partitioner implementation using wordcount.

Here is a simple implementation of a Partitioner class using the standard word count example. In the word count below, integer tokens are sent to the first reducer and text tokens are sent to the second reducer by the custom partitioner.
Sample input to the program: 1 2 3 text sample file 3 2 4 timepass testing text file sample 1
Expected output:
part-00000
1 2
2 2
3 2
4 1
part-00001
file 2
sample 2
testing 1
text 2
timepass 1
Runner class (truncated in this excerpt; a partitioner sketch follows): import java.io.IOException; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapred.FileInputFormat; import org.apache.hadoop.mapred.FileOutputFormat; import org.apache.hadoop.mapred.JobClient; import org.apache.hadoop.mapred.JobConf; import org.apache.hadoop.mapred.Partitioner; import org.apache.hadoop.mapred.TextInputFormat; import org.apache.hadoop.mapred.TextOutputFormat; public class wc_
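The runner-class listing above is cut off in this excerpt, and the heart of the technique is the partitioner itself, so here is a minimal sketch against the old mapred API. The class name and the numeric/text test are my assumptions about how the post routes keys, not its exact code.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes purely numeric word-count keys to reducer 0 and all other text keys to reducer 1.
public class NumberTextPartitioner implements Partitioner<Text, IntWritable> {

    @Override
    public void configure(JobConf job) {
        // no configuration needed for this simple partitioner
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions < 2) {
            return 0; // with a single reducer everything goes to partition 0
        }
        return key.toString().matches("\\d+") ? 0 : 1;
    }
}

The driver would then call conf.setPartitionerClass(NumberTextPartitioner.class) and conf.setNumReduceTasks(2) so that part-00000 and part-00001 are produced as shown in the expected output.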

Hadoop - Sequence and Map File

Hadoop Sequence Files: A sequence file is a flat file with binary key/value pairs. There are three different sequence file formats:
1. Uncompressed key/value records.
2. Record-compressed key/value records - only the values are compressed.
3. Block-compressed key/value records - keys and values are collected in blocks separately and compressed.
Small File Problem: All the small files can be stored as the values of a single sequence file, with one key per file. This reduces the overhead on the NameNode of storing metadata information about each individual file; a minimal sketch of this is shown below.
MapFile: A map file is actually a directory. Within it there is an "index" file and a "data" file. The data file is a sequence file holding the keys and associated values. The index file is smaller; it contains key/value pairs where the key is an actual key from the data and the value is that key's byte offset, which allows fast lookups.
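A minimal sketch of the small-file idea described above: each local file becomes one record in a single sequence file, with the file name as the key and the raw bytes as the value. The paths and the class name are illustrative assumptions, not from the original post.

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs every file in a local directory into a single sequence file (file name -> file bytes).
public class SmallFilePacker {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/hadoop/smallfiles.seq");        // illustrative target path

        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
        try {
            File[] inputs = new File("/data/smallfiles").listFiles(); // illustrative local dir
            if (inputs == null) {
                return; // directory missing or empty
            }
            for (File f : inputs) {
                byte[] bytes = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        } finally {
            writer.close();
        }
    }
}

Because the whole directory ends up as one sequence file, the NameNode tracks a single file's metadata instead of one entry per small file, and the records can still be streamed to a MapReduce job.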