Hadoop Development

This 65-hour, instructor-led Hadoop training course delivers the key concepts and expertise necessary to create robust data processing applications using Apache Hadoop. Through lectures and interactive hands-on exercises, attendees will learn Hadoop and its ecosystem components.


Upon completion of the course, attendees will be prepared to take the Hadoop developer and Hadoop administrator certification exams from Cloudera or Hortonworks. Certification is a great differentiator; it helps establish individuals as leaders in their field, providing customers with tangible evidence of skills and expertise.


At VisualPath, clearing the certification is only a small goal; our main objective is teaching you how to use Hadoop and its ecosystem in production. The entire training runs on a live multi-node Hadoop cluster hosted in the cloud.



→ Introduction

→ Hadoop: Basic Concepts

  • What is Hadoop?
  • The Hadoop Distributed File System
  • How Hadoop MapReduce Works
  • Anatomy of a Hadoop Cluster

→ Hadoop Daemons

  • Master Daemons
    • Name node
    • Job Tracker
    • Secondary name node
  • Slave Daemons
    • Data node
    • Task tracker

→ HDFS (Hadoop Distributed File System)

  • Blocks and Splits
    • Input Splits
    • HDFS Blocks
  • Data Replication
    • Hadoop Rack Awareness
  • Data high availability
  • Data Integrity
  • Cluster architecture and block placement
  • Accessing HDFS
    • JAVA Approach
    • CLI Approach
  • Programming Practices
  • Developing MapReduce Programs in
    • Local Mode: running without HDFS and MapReduce
    • Pseudo-distributed Mode: running all daemons on a single node
    • Fully distributed Mode: running daemons on dedicated nodes
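
As a small illustration of the Blocks and Data Replication topics in the HDFS module, the sketch below estimates how many blocks a file occupies and how much raw cluster storage its replicas consume. It assumes a 64 MB block size and a replication factor of 3, which are common Hadoop 1.x defaults; an actual cluster may be configured differently. The function name `hdfs_footprint` is hypothetical and used only for this example.

```python
import math

MB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Estimate block count and raw storage for one file.

    Assumed defaults: 64 MB blocks, replication factor 3.
    The last block only holds the remaining bytes, so raw storage
    is the file size times the replication factor.
    """
    blocks = math.ceil(file_size_bytes / block_size_bytes)
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        "raw_bytes": file_size_bytes * replication,
    }
```

Under these assumptions, a 200 MB file splits into 4 blocks (three full blocks and one partial), stored as 12 block replicas across the cluster.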

Hadoop Administrative Tasks

→ Set up Hadoop clusters using Apache, Cloudera and Hortonworks distributions

  • Make a fully distributed Hadoop cluster on a single laptop/desktop
  • Name Node in Safe mode
  • Meta Data Backup
  • Integrating Kerberos security in Hadoop

Hadoop Developer Tasks

→ Writing a MapReduce Program

  • Examining a Sample MapReduce Program
    • With several examples
  • Basic API Concepts
  • The Driver Code
  • The Mapper
  • The Reducer
  • Hadoop's Streaming API
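
To give a feel for the Mapper and Reducer roles listed above, here is a minimal word-count sketch written in the style of a Hadoop Streaming job. It is not course material: the function names (`mapper`, `reducer`, `run_local`) are hypothetical, and the sort step only imitates what the framework's shuffle would do between the two phases on a real cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one (word, 1) pair per word, as a streaming mapper would."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1


def reducer(pairs):
    """Sum the counts for each word.

    Pairs must arrive sorted by key, which is what the shuffle
    phase guarantees to a real reducer.
    """
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort (stand-in for shuffle) -> reduce in one process."""
    return dict(reducer(sorted(mapper(lines))))
```

Calling `run_local(["a b a", "b c"])` counts `a` and `b` twice and `c` once. In an actual Streaming job the mapper and reducer would be separate scripts reading from standard input and writing tab-separated key/value lines.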

→ Performing several Hadoop jobs

  • The configure and close Methods
  • Sequence Files
  • Record Reader
  • Record Writer
  • Role of Reporter
  • Output Collector
  • Processing XML files
  • Counters
  • Directly Accessing HDFS
  • ToolRunner
  • Using The Distributed Cache

→ Common MapReduce Algorithms

  • Sorting and Searching
  • Indexing
  • Classification/Machine Learning
  • Term Frequency - Inverse Document Frequency
  • Word Co-Occurrence
  • Hands-On Exercise: Creating an Inverted Index
  • Identity Mapper
  • Identity Reducer
  • Exploring well known problems using MapReduce applications
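
The inverted index exercise in this module can be previewed with a small in-process simulation. The sketch below assumes documents are given as a dictionary of hypothetical document IDs to text; the helper names (`map_document`, `shuffle`, `reduce_postings`, `build_inverted_index`) are illustrative only and do not come from the Hadoop API.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map phase: emit (word, doc_id) once per distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def shuffle(pairs):
    """Stand-in for the framework shuffle: group values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_postings(word, doc_ids):
    """Reduce phase: produce a sorted, de-duplicated posting list."""
    return word, sorted(set(doc_ids))


def build_inverted_index(documents):
    """Run map, shuffle and reduce over a dict of {doc_id: text}."""
    pairs = (
        pair
        for doc_id, text in documents.items()
        for pair in map_document(doc_id, text)
    )
    return dict(reduce_postings(w, ids) for w, ids in shuffle(pairs).items())
```

On a cluster the same logic would be split between a Mapper class and a Reducer class, with the grouping done by the framework rather than by `shuffle`.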

→ Debugging MapReduce Programs

  • Testing with MRUnit
  • Logging
  • Other Debugging Strategies

→ Advanced MapReduce Programming

  • A Recap of the MapReduce Flow
  • The Secondary Sort
  • Customized Input Formats and Output Formats

→ Monitoring and debugging on a Production Cluster

  • Counters
  • Skipping Bad Records
  • Rerunning failed tasks with IsolationRunner

→ Tuning for Performance in MapReduce

  • Reducing network traffic with a combiner
  • Partitioners
  • Using Compression
  • Reusing the JVM
  • Running with speculative execution
  • Refactoring code and rewriting algorithms
  • Parameters affecting Performance
  • Other Performance Aspects
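
Two of the tuning topics above, combiners and partitioners, can be illustrated with a short sketch. The `combine` function pre-aggregates one mapper's output so fewer pairs cross the network, and `partition` picks a reducer for a key. Both names are hypothetical, and the CRC32 hash is only a stand-in chosen for this example; Hadoop's default HashPartitioner uses the key's Java `hashCode()` modulo the number of reduce tasks.

```python
import zlib
from collections import Counter


def combine(pairs):
    """Locally sum values per key on a single mapper's output.

    This is safe for operations like sum or count, where partial
    results can be merged again by the reducer.
    """
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers):
    """Choose a reducer index for a key with a deterministic hash.

    CRC32 is an assumption made for this sketch, not Hadoop's own hash.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```

Because the same key always maps to the same reducer index, all partial sums for a word end up at one reducer, which can then produce the final total.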

Hadoop Ecosystem

→ HBase

  • HBase concepts
  • HBase architecture
    • Region server architecture
    • File storage architecture
  • HBase basics
    • Column access
    • Scans
  • HBase use cases
  • Install and configure HBase on a multi-node cluster
  • Create a database; develop and run sample applications
  • Access data stored in HBase using clients like Java, Python and Perl
  • HBase and Hive Integration
  • HBase admin tasks
    • Defining schema and basic operations

→ Hive

  • Hive concepts
  • Hive architecture
  • Install and configure Hive on a cluster
  • Create a database and access it from a Java client
  • Buckets
  • Partitions
  • Joins in Hive
    • Inner joins
    • Outer Joins
  • Hive UDF
  • Hive UDAF
  • Hive UDTF
  • Develop and run sample applications in Java/Python to access Hive


→ Pig

  • Pig basics
  • Install and configure Pig on a cluster
  • Pig vs MapReduce and SQL
  • Pig vs Hive
  • Write sample Pig Latin scripts
  • Modes of running Pig
    • Running in Grunt shell
    • Programming in Eclipse
    • Running as Java program
  • Pig UDFs
  • Pig Macros

→ Flume, Chukwa, Avro, Scribe, Thrift

  • Flume and Chukwa concepts
  • Use cases of Thrift, Avro and Scribe
  • Install and configure Flume on a cluster
  • Create a sample application to capture logs from Apache using Flume

→ CDH4 Enhancements

  • Name Node High Availability
  • Name Node federation
  • Fencing
  • YARN

→ Hadoop Challenges

  • Hadoop disaster recovery
  • Suitable use cases for Hadoop

Added privileges

  • 400 interview questions will be provided
  • Mock interviews will be conducted for each individual as needed
  • Every participant will do a POC of a production-grade project with real data
  • We will explain at least 4 real-time projects that are in production use at various companies
  • Resume preparation and guidance for clearing interviews
  • 24x7 blog support: http://www.hadoopatvisualpath.blogspot.in
  • Core Java concepts required for learning Hadoop are covered as part of the Hadoop course itself
  • Unix concepts and basic commands are also covered as part of the course