Hadoop Tools

Overview

The Hadoop ecosystem provides a rich set of tools that extend core Hadoop capabilities. These tools enable SQL-like querying (Hive), scripting (Pig), NoSQL storage (HBase), in-memory analytics (Spark), workflow orchestration (Oozie), data ingestion (Sqoop, Flume, NiFi), and cluster management (Ambari). Each tool integrates with HDFS and/or YARN, supporting scalable, fault-tolerant Big Data solutions. Below you’ll find detailed descriptions and hands-on examples for each tool.

Hive – SQL on Hadoop

Hive exposes HDFS data as tables and lets you query them using HiveQL. Under the hood, queries compile to MapReduce, Tez, or Spark jobs.

Example: Create an external table over log files and run an aggregate.

-- Create external table
CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (
  host STRING,
  `timestamp` STRING,
  request STRING,
  status INT,
  bytes INT
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/data/logs/';

-- Run query: count requests per status code
SELECT status, COUNT(*) AS cnt
FROM access_logs
GROUP BY status
ORDER BY cnt DESC;
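
To run this outside an interactive session, save the statements to a file (the file name and HiveServer2 URL below are placeholders) and execute them with the Hive CLI or Beeline:

# Run the saved script with the Hive CLI
hive -f access_logs.hql

# Or via Beeline against HiveServer2 (JDBC URL is a placeholder)
beeline -u jdbc:hive2://hiveserver:10000/default -f access_logs.hql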

Pig – Data Flow Scripting

Pig Latin scripts transform and analyze large datasets. Scripts compile to MapReduce or Tez tasks.

Example: Count visitors per URL from web logs.

-- Load raw logs
logs = LOAD '/data/logs' USING PigStorage(' ')
  AS (host:chararray, timestamp:chararray, url:chararray, status:int, bytes:int);

-- Group by URL and count
grp = GROUP logs BY url;
counts = FOREACH grp GENERATE group AS url, COUNT(logs) AS cnt;

-- Filter popular pages and store result
popular = FILTER counts BY cnt > 100;
STORE popular INTO '/output/pig/popular_urls' USING PigStorage(',');
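
One way to run the script (the file name is illustrative) is with the pig command, choosing the execution mode explicitly:

# Run the Pig script on the cluster in MapReduce mode
pig -x mapreduce popular_urls.pig

# Or run locally against the local filesystem for quick testing
pig -x local popular_urls.pig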

HBase – NoSQL Wide-Column Store

HBase provides real-time random access to large tables on HDFS.

Example: Create a table, put and get data using HBase shell.

# Launch HBase shell
hbase shell

# Create table with column family
create 'users', 'info'

# Insert rows
put 'users', 'user1', 'info:name', 'Alice'
put 'users', 'user1', 'info:email', 'alice@example.com'

# Retrieve row
get 'users', 'user1'

# Scan table for first 10 rows
scan 'users', {LIMIT => 10}
exit
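
The same commands can also be run non-interactively by passing a command file to the shell, which is handy for scripting (the file name is illustrative):

# Save shell commands to a file and run them in batch
echo "create 'users', 'info'" > hbase_commands.txt
echo "put 'users', 'user1', 'info:name', 'Alice'" >> hbase_commands.txt
hbase shell hbase_commands.txt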

Spark – In-Memory Analytics

Spark runs on YARN or standalone, offering fast in-memory data processing.

Example: Use pyspark to perform word count.

from pyspark import SparkContext
sc = SparkContext(appName="WordCount")

# Read text file from HDFS
lines = sc.textFile("hdfs:///data/text/input.txt")

# Word count
counts = (lines.flatMap(lambda line: line.split())  
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

# Save results
counts.saveAsTextFile("hdfs:///output/spark/wordcount")
sc.stop()
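
Saved as a script (the file name is illustrative), the job can be submitted to YARN with spark-submit and monitored from the command line:

# Submit the PySpark word count to YARN in cluster mode
spark-submit --master yarn --deploy-mode cluster wordcount.py

# List running YARN applications to check its status
yarn application -list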

Oozie – Workflow Orchestration

Oozie coordinates Hadoop jobs using XML workflows.

Example: A simple workflow definition snippet.

<workflow-app name="WordCountWf" xmlns="uri:oozie:workflow:0.5">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property><name>mapred.mapper.class</name><value>WordCount$TokenizerMapper</value></property>
        <property><name>mapred.input.dir</name><value>/data/input</value></property>
        <property><name>mapred.output.dir</name><value>/data/output</value></property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>Workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
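
A sketch of submitting the workflow, assuming it has been uploaded to HDFS and a job.properties file points at it (host names, ports, and paths are placeholders):

# job.properties (placeholders for your cluster)
# nameNode=hdfs://namenode:8020
# jobTracker=resourcemanager:8032
# oozie.wf.application.path=${nameNode}/user/hadoop/wordcount-wf

# Submit and start the workflow, then check its status
oozie job -oozie http://oozie-server:11000/oozie -config job.properties -run
oozie job -oozie http://oozie-server:11000/oozie -info <job-id>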

Sqoop – RDBMS Import/Export

Sqoop transfers data between Hadoop and relational databases.

Example: Import MySQL table to HDFS.

sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username user --password secret \
  --table transactions \
  --target-dir /data/sales/transactions \
  --split-by id \
  --num-mappers 4
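
The reverse direction works with sqoop export; the target table and HDFS directory below are illustrative placeholders, not part of the import example:

# Export aggregated results from HDFS back to MySQL (table and dir are placeholders)
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username user --password secret \
  --table transactions_summary \
  --export-dir /output/sales/summary \
  --num-mappers 4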

Flume – Log Collection

Flume ingests streaming data into HDFS or Kafka.

Example: A simple Flume agent configuration.

agent.sources = src1
agent.sinks = sink1
agent.channels = ch1

agent.sources.src1.type = exec
agent.sources.src1.command = tail -F /var/log/messages
agent.sources.src1.channels = ch1

agent.channels.ch1.type = memory
agent.channels.ch1.capacity = 1000

agent.sinks.sink1.type = hdfs
agent.sinks.sink1.channel = ch1
agent.sinks.sink1.hdfs.path = hdfs:///data/logs/%y-%m-%d/
# Escape sequences in the path require a timestamp; use the sink's local time
agent.sinks.sink1.hdfs.useLocalTimeStamp = true
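
Assuming the configuration above is saved as agent.conf (the file name is illustrative), the agent can be started with flume-ng; the agent name must match the prefix used in the config:

# Start the Flume agent defined above
flume-ng agent --name agent --conf ./conf --conf-file agent.conf \
  -Dflume.root.logger=INFO,console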

NiFi – Data Flow Management

NiFi provides a web UI to design flows with processors, connections, and controller services. Example: Ingest HTTP POST data into HDFS using the ListenHTTP and PutHDFS processors, as sketched below.
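
As a sketch, assuming a ListenHTTP processor is configured on port 8081 with its default base path of contentListener (host and port are placeholders), data can be pushed into the flow with curl and then written to HDFS by PutHDFS:

# Post a sample record to the ListenHTTP processor (host/port are placeholders)
curl -X POST -d '{"user":"alice","action":"login"}' \
  http://nifi-host:8081/contentListener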

Ambari – Cluster Management

Ambari simplifies cluster deployment and monitoring. Use its REST API to programmatically add hosts or services. Example:


curl -u admin:admin -H "X-Requested-By: ambari" \
  -X POST -d '{"Hosts": {"host_name": "node4.example.com"}}' \
  http://ambari-server:8080/api/v1/clusters/MyCluster/hosts
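
Similar GET requests read cluster state; for example, listing the services registered in the (placeholder) cluster MyCluster:

# List services in the cluster (cluster name and credentials are placeholders)
curl -u admin:admin -H "X-Requested-By: ambari" \
  http://ambari-server:8080/api/v1/clusters/MyCluster/services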
  

Next Steps

Hands-on: deploy each tool on a sandbox cluster, run the examples, and monitor behavior. Integrate tools into end-to-end workflows, tuning configurations for performance and reliability.

Previous: Hadoop Components | Next: Hadoop Development Environment
