All you need to know about Hadoop
1) Hadoop and Big data-
i) What is Big data?
– Big data is more of a marketing term than a technical one; almost everything gets called big data these days.
– Big data is commonly described by three Vs-
a) Volume – data is now collected in very large amounts
b) Velocity – the speed at which data is generated and accessed
c) Variety – all types of data formats: structured, semi-structured and unstructured, such as log files, pictures, audio files, communication records and email.
ii) What is Hadoop?
Hadoop is divided into two components-
a) Open-source data storage – the Hadoop Distributed File System (HDFS)
b) Processing – the MapReduce API (see the word-count sketch below)
Definition – Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing.
– Hadoop is not a database; it is an alternative file system.
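To make the processing component concrete, here is a minimal word-count job written against the Hadoop MapReduce Java API. This is only a sketch: the class names are arbitrary and the input and output paths are passed on the command line as placeholders for real HDFS paths.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory (placeholder)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (placeholder)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, it would be launched with something like hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output, where both paths are just examples.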
2) How did Hadoop get here?
– Hadoop was created by Doug Cutting while he was working on Nutch, an open-source web search engine. The main goal was to find a way to return web search results faster by distributing data and calculations across different computers, so that multiple tasks could run simultaneously. Around the same time, Google, another search engine project, was working on the same concept.
– In 2006, Cutting joined Yahoo and took with him the Nutch project as well as ideas based on Google’s early work with automating distributed data storage and processing. In 2008, Yahoo released Hadoop as an open-source project.
Fun fact – Hadoop was the name of a yellow toy elephant that belonged to Doug Cutting's son.
3) When should you use Hadoop?
a) When you have a huge volume of data
b) When the data is unstructured
c) Non-transactional data – write once, read many times
d) Behavioural data – observational information collected about user actions and activities; a good example is Flipkart's product recommendations.
4) When not to use Hadoop?
a) You require random, interactive access to data
b) Small datasets (or a large number of small files)
c) When you want to store sensitive data
d) Real-time data
5) How does data get into Hadoop?
There are numerous ways to get data into Hadoop. Here are just a few:
a) Using a Java program with the HDFS API (see the sketch after this list)
b) Using shell scripts / HDFS shell commands
c) Using Sqoop to import structured data from a relational database into HDFS, Hive or HBase
d) Using Flume to continuously load data from logs into Hadoop.
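As a sketch of option (a), the snippet below uses the HDFS FileSystem API from Java to copy a local file into HDFS; the NameNode address and both paths are hypothetical placeholders. Option (b) boils down to the equivalent shell command, e.g. hdfs dfs -put /tmp/access.log /data/logs/.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadIntoHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; in practice this is usually picked up from core-site.xml
    conf.set("fs.defaultFS", "hdfs://namenode:8020");

    try (FileSystem fs = FileSystem.get(conf)) {
      Path local = new Path("/tmp/access.log");        // local source file (placeholder)
      Path remote = new Path("/data/logs/access.log"); // HDFS destination (placeholder)
      fs.copyFromLocalFile(local, remote);
      System.out.println("Copied " + local + " to " + fs.getUri() + remote);
    }
  }
}
```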
6) Hadoop Ecosystem
a) Pig – a platform for manipulating data stored in HDFS. It consists of a compiler for MapReduce programs and a high-level language called Pig Latin. It provides a way to perform data extraction, transformation and loading, and basic analysis without having to write MapReduce programs.
b) Hive – a data warehousing and SQL-like query language that presents data in the form of tables. Hive programming is similar to database programming (a minimal JDBC query example follows this list).
c) HBase – a non-relational, distributed database that runs on top of Hadoop. HBase tables can serve as input and output for MapReduce jobs.
d) Zookeeper – an application that coordinates distributed processes.
e) Ambari – a web interface for managing, configuring and testing Hadoop services and components.
f) Flume – software that collects, aggregates and moves large amounts of streaming data into HDFS.
g) Sqoop – a connection and transfer mechanism that moves data between Hadoop and relational databases.
h) Oozie – a Hadoop job scheduler.
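To illustrate how Hive presents Hadoop data as tables, here is a minimal sketch that runs a SQL-like HiveQL query from Java over JDBC. The HiveServer2 address, credentials, table name and columns are all hypothetical, and the Hive JDBC driver jar is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC URL; host, port and database are placeholders
    String url = "jdbc:hive2://hiveserver:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "demo", "");
         Statement stmt = conn.createStatement()) {

      // HiveQL reads like ordinary SQL; 'page_views' and its columns are hypothetical
      ResultSet rs = stmt.executeQuery(
          "SELECT country, COUNT(*) AS visits "
          + "FROM page_views GROUP BY country "
          + "ORDER BY visits DESC LIMIT 10");

      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getString(2));
      }
    }
  }
}
```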
You can see a full list of Apache Hadoop-related projects on the official Apache website.