Hadoop Tools
Overview
The Hadoop ecosystem provides a rich set of tools that extend core Hadoop capabilities. These tools enable SQL-like querying (Hive), scripting (Pig), NoSQL storage (HBase), in-memory analytics (Spark), workflow orchestration (Oozie), data ingestion (Sqoop, Flume, NiFi), and cluster management (Ambari). Each tool integrates with HDFS/YARN, allowing scalable, fault-tolerant Big Data solutions. Below you’ll find detailed descriptions and hands-on examples for each tool.
Hive – SQL on Hadoop
Hive exposes HDFS data as tables and lets you query them using HiveQL. Under the hood, queries compile to MapReduce, Tez, or Spark jobs.
Example: Create an external table over log files and run an aggregate.
-- Create external table
CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (
host STRING,
`timestamp` STRING,
request STRING,
status INT,
bytes INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/data/logs/';
-- Run query: count requests per status code
SELECT status, COUNT(*) AS cnt
FROM access_logs
GROUP BY status
ORDER BY cnt DESC;
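One way to run these statements is through Beeline against HiveServer2; the server URL and the file name access_logs.hql below are placeholders for your environment.
# Execute the HiveQL file via Beeline; adjust the HiveServer2 URL for your cluster
beeline -u jdbc:hive2://hiveserver:10000/default -f access_logs.hql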
Pig – Data Flow Scripting
Pig Latin scripts transform and analyze large datasets. Scripts compile to MapReduce or Tez jobs.
Example: Count visitors per URL from web logs.
-- Load raw logs
logs = LOAD '/data/logs' USING PigStorage(' ')
  AS (host:chararray, timestamp:chararray, url:chararray, status:int, bytes:int);
-- Group by URL and count
grp = GROUP logs BY url;
counts = FOREACH grp GENERATE group AS url, COUNT(logs) AS cnt;
-- Filter popular pages and store result
popular = FILTER counts BY cnt > 100;
STORE popular INTO '/output/pig/popular_urls' USING PigStorage(',');
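To execute the script on the cluster, a typical invocation looks like the following (the file name popular_urls.pig is assumed):
# Run in MapReduce mode; use -x local for quick testing against local files
pig -x mapreduce popular_urls.pig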
HBase – NoSQL Wide-Column Store
HBase provides real-time random access to large tables on HDFS.
Example: Create a table, put and get data using HBase shell.
# Launch HBase shell
hbase shell
# Create table with column family
create 'users', 'info'
# Insert rows
put 'users', 'user1', 'info:name', 'Alice'
put 'users', 'user1', 'info:email', 'alice@example.com'
# Retrieve row
get 'users', 'user1'
# Scan table for first 10 rows
scan 'users', {LIMIT => 10}
exit
Spark – In-Memory Analytics
Spark runs on YARN or standalone, offering fast in-memory data processing.
Example: Use pyspark to perform word count.
from pyspark import SparkContext
sc = SparkContext(appName="WordCount")
# Read text file from HDFS
lines = sc.textFile("hdfs:///data/text/input.txt")
# Word count
counts = (lines.flatMap(lambda line: line.split())
.map(lambda w: (w, 1))
.reduceByKey(lambda a, b: a + b))
# Save results
counts.saveAsTextFile("hdfs:///output/spark/wordcount")
sc.stop()
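Saved as a script, the job can be submitted to YARN roughly as follows (the file name wordcount.py is assumed):
# Submit the PySpark script to YARN in client mode
spark-submit --master yarn --deploy-mode client wordcount.py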
Oozie – Workflow Orchestration
Oozie coordinates Hadoop jobs using XML workflows.
Example: A simple workflow definition snippet.
<workflow-app name="WordCountWf" xmlns="uri:oozie:workflow:0.5">
<start to="wordcount"/>
<action name="wordcount">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property><name>mapred.mapper.class</name><value>WordCount$TokenizerMapper</value></property>
<property><name>mapred.input.dir</name><value>${nameNode}/data/input</value></property>
<property><name>mapred.output.dir</name><value>${nameNode}/data/output</value></property>
</configuration>
</map-reduce>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail"><message>Workflow failed</message></kill>
<end name="end"/>
</workflow-app>
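Once the workflow is deployed to HDFS, it can be submitted with the Oozie CLI; the server URL below is a placeholder, and job.properties must define jobTracker, nameNode, and oozie.wf.application.path.
# Submit and start the workflow
oozie job -oozie http://oozie-server:11000/oozie -config job.properties -run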
Sqoop – RDBMS Import/Export
Sqoop transfers data between Hadoop and relational databases.
Example: Import MySQL table to HDFS.
sqoop import \
--connect jdbc:mysql://db.example.com/sales \
--username user --password secret \
--table transactions \
--target-dir /data/sales/transactions \
--split-by id \
--num-mappers 4
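Export works in the opposite direction; here is a sketch of pushing HDFS results back into MySQL (the table and directory names are placeholders):
sqoop export \
--connect jdbc:mysql://db.example.com/sales \
--username user --password secret \
--table transaction_summary \
--export-dir /output/sales/summary \
--num-mappers 4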
Flume – Log Collection
Flume ingests streaming data into HDFS or Kafka.
Example: A simple Flume agent configuration.
agent.sources = src1
agent.sinks = sink1
agent.channels = ch1
agent.sources.src1.type = exec
agent.sources.src1.command = tail -F /var/log/messages
agent.sources.src1.channels = ch1
agent.channels.ch1.type = memory
agent.channels.ch1.capacity = 1000
agent.sinks.sink1.type = hdfs
agent.sinks.sink1.channel = ch1
agent.sinks.sink1.hdfs.path = hdfs:///data/logs/%y-%m-%d/
agent.sinks.sink1.hdfs.useLocalTimeStamp = true
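The agent can then be started with the flume-ng launcher; the configuration directory and file name below are assumptions.
# Start the agent named "agent" (matching the property prefix above)
flume-ng agent --conf /etc/flume/conf --conf-file agent.conf --name agent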
NiFi – Data Flow Management
NiFi provides a web UI to design flows with processors, connections, and controller services.
Example: Ingest HTTP POST data into HDFS using the ListenHTTP and PutHDFS processors, then feed the flow as shown below.
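As a sketch of feeding such a flow, assuming ListenHTTP is configured to listen on port 8081 with the base path contentListener (both are configuration choices, not guaranteed defaults):
# Post a JSON event to the ListenHTTP endpoint; PutHDFS then writes it to HDFS
curl -X POST -H "Content-Type: application/json" \
-d '{"event": "page_view", "url": "/index.html"}' \
http://nifi-host:8081/contentListener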
Ambari – Cluster Management
Ambari simplifies cluster deployment and monitoring. Use its REST API to programmatically add hosts or services. Example:
curl -u admin:admin -H "X-Requested-By: ambari" \
-X POST -d '{"Hosts": {"host_name": "node4.example.com"}}' \
http://ambari-server:8080/api/v1/clusters/MyCluster/hosts
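A read-only call such as the following confirms the API is reachable and lists the cluster's services (host and cluster name as in the example above):
# List services registered in the cluster
curl -u admin:admin -H "X-Requested-By: ambari" \
http://ambari-server:8080/api/v1/clusters/MyCluster/services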
Next Steps
Hands-on: deploy each tool on a sandbox cluster, run the examples, and monitor behavior. Integrate tools into end-to-end workflows, tuning configurations for performance and reliability.