Saturday, 7 May 2016

What is Apache Hadoop?

Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure.

The Core of Hadoop: MapReduce

The important innovation of MapReduce is the ability to take a query over a dataset, divide it, and run it in parallel over multiple nodes. Distributing the computation solves the problem of data too large to fit onto a single machine. Combine this technique with commodity Linux servers and you have a cost-effective alternative to massive computing arrays.
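The divide-and-combine idea can be sketched in plain Python. This is a toy simulation of the model, not the Hadoop API: a word count where each "split" stands for data processed on a separate node.

```python
from collections import defaultdict
from itertools import chain

# Toy simulation of the MapReduce model: map each input split to
# (key, value) pairs, shuffle the pairs by key, then reduce each group.

def map_phase(split):
    # Emit (word, 1) for every word in one input split.
    return [(word, 1) for word in split.split()]

def shuffle(pairs):
    # Group all intermediate values by key, as the framework would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts for one word.
    return key, sum(values)

# Each element stands for a split processed independently on one node.
splits = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(map_phase(s) for s in splits)
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # counts combined across all splits
```

Each split is mapped with no knowledge of the others; only the shuffle and reduce steps bring the partial results together, which is what lets the map work run in parallel across a cluster.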

Programming Hadoop at the MapReduce level is a case of working with the Java APIs, and manually loading data files into HDFS (Hadoop Distributed File System).
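Loading files into HDFS is done from the command line with the `hdfs dfs` tool; a typical session (the paths here are illustrative) looks like:

```shell
# Copy a local file into HDFS (paths are illustrative)
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put access.log /user/hadoop/input/
hdfs dfs -ls /user/hadoop/input
```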

Programmability

The Hadoop ecosystem offers two solutions for making Hadoop programming easier: Pig and Hive.

Programming Pig

Pig is a programming language that simplifies the common tasks of working with Hadoop: loading data, expressing transformations on the data, and storing the final results.
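A short Pig Latin script covers all three of those tasks. The file path and schema below are hypothetical, just to show the shape of a script:

```pig
-- Load raw log lines, keep only errors, count them per source
-- (file path and schema are hypothetical)
logs    = LOAD '/user/hadoop/input/access.log'
          USING PigStorage('\t') AS (source:chararray, level:chararray);
errors  = FILTER logs BY level == 'ERROR';
grouped = GROUP errors BY source;
counts  = FOREACH grouped GENERATE group AS source, COUNT(errors) AS n;
STORE counts INTO '/user/hadoop/output/error_counts';
```

Each statement names an intermediate relation, so a data flow reads top to bottom as a series of transformations rather than as a single monolithic query.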

Programming Hive

Hive enables Hadoop to operate as a data warehouse. It superimposes structure on data in HDFS and then permits queries over the data using a familiar SQL-like syntax. As with Pig, Hive's core capabilities are extensible.
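Superimposing structure and then querying it might look like this in HiveQL. The table and column names are illustrative:

```sql
-- Impose a schema on files already sitting in HDFS (names are illustrative)
CREATE EXTERNAL TABLE access_logs (source STRING, level STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hadoop/input/';

-- Then query with familiar SQL-like syntax
SELECT source, COUNT(*) AS errors
FROM access_logs
WHERE level = 'ERROR'
GROUP BY source;
```

Note that the `EXTERNAL TABLE` only describes data already in HDFS; nothing is copied or converted until a query runs.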

Choosing between Hive and Pig can be confusing. Hive is more suitable for data warehousing tasks, with predominantly static structure and the need for frequent analysis. Hive's closeness to SQL makes it an ideal point of integration between Hadoop and other business intelligence tools.

Pig gives the developer more agility for the exploration of large datasets, allowing the development of succinct scripts for transforming data flows for incorporation into larger applications.

The Hadoop Bestiary

  • Ambari Deployment, configuration and monitoring
  • Flume Collection and import of log and event data
  • HBase Column-oriented database scaling to billions of rows
  • HCatalog Schema and data type sharing over Pig, Hive and MapReduce
  • HDFS Distributed redundant file system for Hadoop
  • Hive Data warehouse with SQL-Like access
  • Mahout Library of machine learning and data mining algorithms
  • MapReduce Parallel computation on server clusters
  • Pig High-level programming language for Hadoop computations
  • Oozie Orchestration and workflow management
  • Sqoop Imports data from relational databases
  • Whirr Cloud-agnostic deployment of clusters
  • Zookeeper Configuration management and coordination

Getting data in and out: Sqoop and Flume

Improved interoperability with the rest of the data world is provided by Sqoop and Flume. Sqoop is a tool designed to import data from relational databases into Hadoop, either directly into HDFS or into Hive. Flume is designed to import streaming flows of log data directly into HDFS.
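A typical Sqoop invocation imports one relational table into HDFS. The connection string, credentials, and table name below are placeholders:

```shell
# Import one table from MySQL into HDFS (connection details are placeholders)
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username reporter -P \
  --table orders \
  --target-dir /user/hadoop/orders
```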

Hive's SQL friendliness means that it can be used as a point of integration with the vast universe of database tools capable of making connections via JDBC or ODBC database drivers.

Coordination and Workflow: Zookeeper and Oozie

As computing nodes can come and go, members of the cluster need to synchronize with each other, know where to access services, and know how they should be configured. This is the purpose of Zookeeper.

The Oozie component provides features to manage the workflow and dependencies, removing the need for developers to code custom solutions.
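An Oozie workflow is declared as XML rather than coded by hand. A minimal sketch with a single Pig action might look like this (the workflow name, script, and transitions are illustrative):

```xml
<!-- Minimal Oozie workflow sketch: one Pig action (names are illustrative) -->
<workflow-app name="error-counts" xmlns="uri:oozie:workflow:0.4">
  <start to="count-errors"/>
  <action name="count-errors">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>count_errors.pig</script>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Pig action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

The `ok` and `error` transitions are where the dependency management lives: Oozie, not the developer's own code, decides what runs next when a step succeeds or fails.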

Management and Deployment: Ambari and Whirr

Ambari is intended to help system administrators deploy and configure Hadoop, upgrade clusters, and monitor services. Through an API, it may be integrated with other system management tools.

Whirr is a highly complementary component. It offers a way of running services, including Hadoop, on cloud platforms. Whirr is cloud neutral and currently supports the Amazon EC2 and Rackspace services.

Machine Learning: Mahout

Every organization's data are diverse and particular to their needs. However, there is much less diversity in the kinds of analyses performed on the data. The Mahout project is a library of Hadoop implementations of common analytical computations. Use cases include user collaborative filtering, user recommendations, clustering, and classification.
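The flavour of one such computation, user-based collaborative filtering, can be sketched in a few lines of plain Python. This is a toy illustration of the idea, not Mahout's API, and the rating data is made up:

```python
import math

# Toy user-based collaborative filtering: recommend items that the
# most similar user has rated but the target user has not.

ratings = {  # user -> {item: rating}; made-up example data
    "ann": {"a": 5, "b": 3, "c": 4},
    "bob": {"a": 4, "b": 3, "d": 5},
    "eve": {"c": 2, "d": 1},
}

def cosine(u, v):
    # Cosine similarity, with the dot product taken over shared items.
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    return dot / (math.sqrt(sum(x * x for x in u.values()))
                  * math.sqrt(sum(x * x for x in v.values())))

def recommend(user):
    # Find the nearest neighbour and suggest their unseen items.
    others = [(cosine(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    _, nearest = max(others)
    return sorted(set(ratings[nearest]) - set(ratings[user]))

print(recommend("ann"))
```

Mahout's value is running computations of exactly this shape as MapReduce jobs over datasets far too large for a single machine's memory.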
