Overview of Hadoop Ecosystem

Of late, I have been looking into the Big Data space, and Hadoop in particular. When I started, I found that there are a great many products and tools related to Hadoop. This post summarizes what I discovered about the Hadoop ecosystem.

Hadoop Ecosystem

A short overview of each is given below:

Data Collection - The primary objective of these tools is to move data into a Hadoop cluster

Flume: Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. It was developed by Cloudera and is currently being incubated at the Apache Software Foundation. Details can be found here.

Scribe: Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and to be robust to network and node failures. It was developed by Facebook and can be found here.

Chukwa: Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework, and inherits Hadoop's scalability and robustness. Details about Chukwa can be found here.

Kafka: Kafka provides a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. Kafka is being incubated at the Apache Software Foundation, and details about Kafka can be found here.
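
To make the publish-subscribe idea concrete, here is a minimal in-memory sketch in Python. It is not the Kafka API: the class name ToyTopicLog, the topic name and the consumer names are all made up for illustration. It only shows the general pattern of producers appending to a topic while each consumer reads at its own pace.

```python
from collections import defaultdict


class ToyTopicLog:
    """In-memory stand-in for a publish-subscribe log (hypothetical, not the Kafka API)."""

    def __init__(self):
        self._messages = defaultdict(list)  # topic -> ordered list of messages
        self._offsets = defaultdict(int)    # (consumer, topic) -> next offset to read

    def publish(self, topic, message):
        """Append a message to the topic and return its offset."""
        self._messages[topic].append(message)
        return len(self._messages[topic]) - 1

    def poll(self, consumer, topic):
        """Return the messages this consumer has not seen yet and advance its offset."""
        start = self._offsets[(consumer, topic)]
        batch = self._messages[topic][start:]
        self._offsets[(consumer, topic)] = len(self._messages[topic])
        return batch


log = ToyTopicLog()
log.publish("page_views", {"user": "u1", "url": "/home"})
log.publish("page_views", {"user": "u2", "url": "/cart"})

# Two independent consumers each receive the full stream.
analytics_batch = log.poll("analytics", "page_views")
audit_batch = log.poll("audit", "page_views")

# A later message is delivered only as the "new" part of the stream.
log.publish("page_views", {"user": "u1", "url": "/checkout"})
analytics_next = log.poll("analytics", "page_views")
```

The point of the sketch is that publishing and consuming are decoupled: the producer does not need to know how many consumers exist or how far each one has read.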

Sqoop - Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Using Sqoop, you can load data from an external system into HDFS and populate tables in Hive and HBase. Cloudera is the creator of Sqoop, which is currently undergoing incubation at the Apache Software Foundation. More information on this project can be found here.

HIHO - HIHO aims to integrate Hadoop with existing data-centric systems. Hadoop needs to be connected to other systems, such as databases and report display tools, to fully leverage the functionality offered by each. The HIHO project attempts to do so. Details on HIHO can be found here.

Core Engine - These form the core engine for storage and processing

HDFS - HDFS (Hadoop Distributed File System) is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data. HDFS is part of the core Hadoop framework and is a subproject of the Apache Hadoop project. Additional information pertaining to HDFS can be found here.
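
The split between metadata and data can be sketched with a toy model in Python. This is only an illustration under heavy simplifications: the classes ToyNameNode and ToyDataNode are hypothetical, the block size is tiny, blocks are placed round-robin, and there is no replication, all of which differ from a real HDFS cluster.

```python
class ToyDataNode:
    """Stores the actual block contents, keyed by block id (hypothetical class)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}


class ToyNameNode:
    """Keeps only metadata: which blocks make up a file and where each block lives."""

    def __init__(self, datanodes, block_size=8):
        self.datanodes = datanodes
        self.block_size = block_size  # bytes per block; tiny on purpose
        self.file_blocks = {}         # path -> list of (block_id, datanode_name)

    def write(self, path, data):
        """Split data into blocks and spread them round-robin over the DataNodes."""
        placements = []
        for start in range(0, len(data), self.block_size):
            index = start // self.block_size
            block_id = f"{path}#{index}"
            node = self.datanodes[index % len(self.datanodes)]
            node.blocks[block_id] = data[start:start + self.block_size]
            placements.append((block_id, node.name))
        self.file_blocks[path] = placements

    def read(self, path):
        """Look up block locations in the metadata, then fetch each block in order."""
        by_name = {node.name: node for node in self.datanodes}
        return b"".join(
            by_name[node_name].blocks[block_id]
            for block_id, node_name in self.file_blocks[path]
        )


nodes = [ToyDataNode("dn1"), ToyDataNode("dn2")]
namenode = ToyNameNode(nodes)
namenode.write("/logs/a.txt", b"hello hadoop world")
restored = namenode.read("/logs/a.txt")
block_locations = [node_name for _, node_name in namenode.file_blocks["/logs/a.txt"]]
```

Note that the toy NameNode never holds file contents itself; it only records where the blocks are, which is the division of labour described above.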

MapReduce - Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. A good starting point for understanding MapReduce is here.
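
The programming model is easiest to see with the classic word count. The sketch below runs in a single Python process and does not use Hadoop at all; the function names map_phase, shuffle and reduce_phase are my own labels for the three stages, chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
# counts == {"hadoop": 2, "stores": 1, "data": 2, "processes": 1}
```

On a real cluster the map calls run in parallel on different nodes and the shuffle moves data across the network, but the contract of each stage is the same as in this sketch.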

Data Storage - Specialized storage on Hadoop

HBase - HBase is the Hadoop database. We can think of it as a distributed, scalable, big data store. The most common scenario for HBase is when you need random, real-time read/write access to your Big Data. HBase is a subproject of the Apache Hadoop project, and information on HBase can be found here.

Support Extensions - These help in extending and supporting core activities on Hadoop

Avro - Avro is an addition to the Apache family of products and addresses the area of data serialization. Apache Avro is a data serialization system that provides rich data structures; a compact, fast, binary data format; a container file to store persistent data; remote procedure call (RPC); and integration with dynamic languages. Information on Avro can be found here.

ZooKeeper - ZooKeeper is part of the Apache family of products and is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications like Hadoop. Details about ZooKeeper can be found here.

Oozie - Oozie is a workflow/coordination system for managing Apache Hadoop jobs. Oozie is being incubated at the Apache Software Foundation, and details regarding Oozie can be found here.
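
A workflow of jobs is essentially a set of steps with dependencies between them. The sketch below is not Oozie and does not use its workflow definition format; it only uses Python's standard graphlib module to show how a valid run order can be derived from such dependencies. The job names are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key lists the jobs that must finish before it can start.
dependencies = {
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
    "export": {"aggregate"},
}

# static_order() yields every job after all of its prerequisites.
run_order = list(TopologicalSorter(dependencies).static_order())

for job in run_order:
    print(f"running {job}")
```

Here "report" and "export" both depend only on "aggregate", so either may come first; a workflow system is free to run such independent steps in parallel.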

Thrift - The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi and other languages. A Thrift API is also available for connecting to HBase. Details on Thrift can be found here.

Scripting Languages - These provide easy ways to facilitate processing on Hadoop

Jaql - Jaql is a high-level scripting language for JavaScript Object Notation (JSON). It is able to run on Hadoop and break most requests down into Map/Reduce tasks. Jaql borrows heavily from SQL, XQuery, LISP, Pig Latin, JavaScript and Unix pipes. It was developed primarily inside IBM and is part of the BigInsights project. A good starting point for Jaql is here.

Pig - Pig is a high-level scripting language for data transformation. It is a Hadoop subproject and can be viewed as a platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs coupled with infrastructure for evaluating these programs. Details about Pig can be found here.

Hive - Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive is a subproject of the Apache Hadoop project, and information on Hive can be found here.

Data Analytics - These are specialized software products for performing analytics on data stored in Hadoop

Intellicus: Intellicus extends its ad hoc reporting and dashboarding solution to support Big Data. It provides data source connectivity through Hive to Hadoop variants such as Apache and Cloudera, and supports both background batch processing and online query processing for ad hoc reporting using HiveQL. Intellicus has a basic free edition and also offers feature-rich paid editions. The website for Intellicus is https://www.intellicus.com

Karmasphere - Karmasphere provides self-service access to Big Data and analytic functions for faster, more efficient analysis and collaboration. It lets you visually explore data for patterns and trends, and iteratively analyze data using familiar queries and skills. Look up details of Karmasphere at https://karmasphere.com

Monitoring and management - These help to monitor and manage the jobs and data on Hadoop

Ganglia: Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency. Ganglia is a BSD-licensed open-source project that grew out of the University of California, Berkeley Millennium Project. The details of Ganglia can be found here.

Hue: Hue is both a web UI for Hadoop and a framework for creating interactive web applications. It features a FileBrowser for accessing HDFS, JobSub and JobBrowser applications for submitting and viewing MapReduce jobs, and a Beeswax application for interacting with Hive. Hue grew out of Cloudera, and details pertaining to Hue can be found here.

Cacti: Cacti is a complete network graphing solution designed to harness the power of RRDTool's data storage and graphing functionality. Cacti provides a fast poller, advanced graph templating, multiple data acquisition methods, and user management features out of the box. There is a good set of Hadoop Cacti templates that can be used to visualize the data exposed by Hadoop over JMX. Look for details on Cacti here.

Hadoop High-Level Applications - These are applications and frameworks that leverage Hadoop for specialized purposes

Search Related - This software leverages Hadoop for web crawling and indexing.

Nutch: Apache Nutch is an open source web-search software project of the Apache Software Foundation. Nutch is a web crawler written in Java. By using it, we can find web page hyperlinks in an automated manner. Nutch can run on Hadoop clusters, and a good place to begin reading about this is here.

Solr: Solr is the popular, fast, open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr integrates easily with Nutch: Nutch crawls the web and Solr indexes the crawled data. Details of Solr can be found here.

Machine Learning - Specialized software that implements machine learning algorithms over Hadoop.

Mahout - Mahout is a subproject in the Apache family of products on Hadoop. The goal of the Apache Mahout project is to build scalable machine learning libraries. Mahout's core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. A starting point for information on Mahout is here.
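
To give a feel for what batch collaborative filtering computes, here is a small single-process sketch in Python based on item co-occurrence counts. It is not Mahout code and makes no claim about how Mahout implements its recommenders; the function names and the sample preference data are invented for illustration.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence(user_items):
    """Count how often each ordered pair of items appears together for the same user."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for a, b in permutations(sorted(items), 2):
            counts[a][b] += 1
    return counts


def recommend(user, user_items, top_n=2):
    """Score unseen items by how often they co-occur with items the user already has."""
    counts = cooccurrence(user_items)
    seen = user_items[user]
    scores = defaultdict(int)
    for item in seen:
        for other, count in counts[item].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Hypothetical preference data: which tools each user has shown interest in.
prefs = {
    "ann": {"hbase", "hive"},
    "bob": {"hbase", "hive", "pig"},
    "cat": {"hive", "pig"},
    "dan": {"hbase"},
}
suggestions = recommend("dan", prefs)
# suggestions == ["hive", "pig"]
```

The counting step is a natural fit for map/reduce, since each user's item pairs can be emitted independently and then summed per pair.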

Comments

Dharmendra Chouhan, May 30, 2013 at 3:44 PM
Hi Girish,

I am working with Intellicus, one of the products that you referred to in your blog. I would like to share the progress made by Intellicus in the Big Data space during the last year. The latest release of Intellicus provides data source connectivity to Hive, Pig, HBase, custom MapReduce jobs, HDFS, AWS and S3. One can also invoke an R server from Intellicus and create interactive charts in Intellicus from the response received from the R server. Intellicus also supports ad hoc reporting over Greenplum, Vertica, Teradata, GridSQL and Sand.

Intellicus also provides a unique solution which is MOLAP on Hadoop, which takes the execution of the cube to Hadoop rather than bringing data out of Hadoop.
