
Big Data Terminology



YouTube Videos
There are many educational videos about Hadoop on YouTube. Below are links to two
videos and one video series that are recommended for getting familiar with Hadoop concepts.
Demystifying Hadoop
This is a 17-minute video that provides a simple but thorough description of
Hadoop with lots of easy-to-understand pictures.
Basic Introduction to Apache Hadoop
This 15-minute video provides a high-level overview of Hadoop. It also discusses the
primary components (projects) of Hadoop and how they interact with each other.
Introduction to Hadoop and each of the projects associated with Hadoop
This is a series of 3-minute videos. Each video discusses a specific Hadoop project.
The series is a good follow-up to the video above. Some of the videos get a little
technical, but overall the series stays at a high level. The projects that Control-M
for Hadoop supports are Pig, Hive, MapReduce, Sqoop, and the HDFS Filewatcher.
Glossary of Terms associated with Control-M for Hadoop
This section contains a few of the basic terms that you must be familiar with to have
a conversation about Hadoop batch processing. It is not a comprehensive list but
rather a starter set.
Hadoop An open source project that lets you manage big data (volume, variety,
velocity). It is an ecosystem of projects that provide a common set of services.
The Hadoop framework makes it easy to scale from a single server to thousands of
machines with a high degree of fault tolerance. Its four main benefits are
scalability, cost effectiveness, flexibility, and availability (fault tolerance).
Oozie A job scheduler that handles Apache Hadoop jobs. It is part of Hadoop
distributions in the same way that cron is part of Unix systems. It is a project
within the Apache Hadoop ecosystem.

The Control-M conversion utility provides a wizard-based interface to convert
Oozie jobs to Control-M, which virtually eliminates the need to redefine
existing Oozie jobs before running them in Control-M. Starting with
Control-M for Hadoop 9.0, customers can also use Control-M to run Oozie jobs
without converting them. Customers can therefore choose either to convert
their jobs or to run them as already defined in Oozie.

Spark In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark uses
in-memory processing and can provide performance up to 100 times faster than
MapReduce for certain applications. It loads data into a cluster's memory and
queries it repeatedly. Spark is well suited to machine learning algorithms and to
streaming data, such as data from devices and sensors.

Control-M for Hadoop 9.0 has added support for Spark.

HDFS Hadoop Distributed File System. Data in Hadoop is broken down into blocks
and distributed across servers. Hadoop uses inexpensive servers: basic machines
with their own internal disk drives. HDFS replicates each block across multiple
machines so that the failure of one machine does not cause data loss or processing
failure (by default, each block is stored on three servers).
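The block splitting and replication described in the HDFS entry can be illustrated with a small sketch. This is a toy model in plain Python, not the HDFS API: the function names, the tiny block size, the node names, and the round-robin placement are all assumptions made for illustration. Real HDFS uses much larger blocks and a rack-aware placement policy.

```python
from __future__ import annotations


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, servers: list[str],
                   replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct servers (simple round-robin)."""
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        block_id: [servers[(block_id + r) % len(servers)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_blocks(placement: dict[int, list[str]], failed: set[str]) -> list[int]:
    """Return the ids of blocks that still have at least one live replica."""
    return [block_id for block_id, hosts in placement.items()
            if any(host not in failed for host in hosts)]


# Hypothetical 1000-byte file and a four-node cluster.
blocks = split_into_blocks(b"x" * 1000, block_size=256)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])

# With three replicas per block, losing two nodes leaves every block readable.
survivors = readable_blocks(placement, failed={"node1", "node2"})
```

In this model, a block becomes unreadable only when all three servers that hold it fail at the same time, which is the reason for keeping multiple replicas.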
MapReduce MapReduce is the heart of Hadoop because it enables massively parallel
processing. The name refers to two separate functions, Map and Reduce, that are
performed in that order. Map splits input data into chunks that can be processed in
parallel as tasks. Reduce then aggregates or summarizes the data from the tasks
into final results. MapReduce takes care of scheduling tasks (through the Job
Tracker), monitoring them, and re-executing failed tasks.
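The Map-then-Reduce flow can be sketched with the classic word count example. The code below is plain Python rather than the Hadoop API, and the function names and sample chunks are assumptions made for illustration. It also runs the map tasks one after another, whereas Hadoop would run them in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key, as happens between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: summarize all values for one key into a single result."""
    return key, sum(values)


# Each string stands in for one chunk of the input data.
chunks = ["big data big", "data hadoop", "big hadoop"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each chunk is mapped independently of the others, the map step is the part that can be spread across many machines.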
Job Tracker The Job Tracker farms out MapReduce tasks to specific nodes (servers)
in a cluster. A task tracker on each node submits the tasks and monitors the work,
reporting the success or failure of each task to the Job Tracker.
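The interaction between the Job Tracker and the task trackers can be pictured as an assign, report, and retry loop. The sketch below is a simplified simulation in plain Python under stated assumptions: `run_job`, `flaky_tracker`, the node names, and the retry limit are hypothetical, and real Hadoop scheduling is considerably more involved.

```python
def run_job(tasks, nodes, run_task, max_attempts=3):
    """Assign each task to a node and re-execute it elsewhere if it fails."""
    results = {}
    for index, task in enumerate(tasks):
        for attempt in range(max_attempts):
            # Pick the next node in turn for each new attempt.
            node = nodes[(index + attempt) % len(nodes)]
            # run_task stands in for the task tracker's success/failure report.
            if run_task(node, task):
                results[task] = (node, attempt + 1)
                break
        else:
            raise RuntimeError(f"task {task!r} failed {max_attempts} times")
    return results


def flaky_tracker(node, task):
    """Hypothetical task tracker: every task sent to node2 fails."""
    return node != "node2"


results = run_job(["map-0", "map-1", "map-2"],
                  ["node1", "node2", "node3"],
                  flaky_tracker)
print(results)
```

Each result records the node that completed the task and the number of attempts it took, so a task that was first sent to the failing node shows up as completed elsewhere on its second attempt.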
Pig A Hadoop project that includes a programming language, called Pig Latin,
used to develop programs that analyze large data sets. It lets you focus on what
you want to do rather than how it gets done. It can handle any kind of data. Pig
programs end up as MapReduce jobs that run on clusters. Writing Pig programs is
easier than writing MapReduce programs.
Hive A Hadoop project that allows a SQL-like language to be used to query and
analyze large data sets stored in HDFS. Hive includes a language called HiveQL
that lets you write queries that end up as MapReduce jobs. Organizations that
already have SQL programming skills tend to use Hive rather than Pig.
Sqoop An application used for transferring data between relational databases and
Hadoop. It can be used to import data from relational databases, such as MS SQL
Server databases, to populate HBase. It can also be used to export data from
Hadoop to a relational database.
DistCp DistCp (Distributed Copy) is a tool used for large inter-cluster and intra-cluster copying in Hadoop.

Control-M for Hadoop 9 has added support for DistCp.


Global ESO Enablement