boletín de noticias

Reciba actualizaciones recientes de Hortonworks por correo electrónico

Una vez al mes, recibir nuevas ideas, tendencias, información de análisis y conocimiento de macrodatos.

AVAILABLE NEWSLETTERS:

Sign up for the Developers Newsletter

Una vez al mes, recibir nuevas ideas, tendencias, información de análisis y conocimiento de macrodatos.

cta

Empezar

nube

¿Está preparado para empezar?

Descargue sandbox

¿Cómo podemos ayudarle?

* Entiendo que puedo darme de baja en cualquier momento. Agradezco asimismo la información complementaria porporcionada en la Política de privacidad de Hortonworks.
cerrarBotón de cerrar
Proyectos de Apache
Apache Hadoop

Apache Hadoop

MENÚ

INFORMACIÓN GENERAL

Apache Hadoop es una plataforma de software de código abierto para el almacenamiento distribuido y procesamiento distribuido de grandes conjuntos de datos en clústeres informáticos construidos del hardware de productos básicos.  Los servicios de Hadoop prevén almacenamiento de datos, procesamiento de datos, acceso a los datos, gestión de datos, seguridad y operaciones.

A Brief History and Benefits

History

The genesis of Hadoop came from the Google File System paper that was published in October 2003. This paper spawned another research paper from Google – MapReduce: Simplified Data Processing on Large Clusters. Development started in the Apache Nutch project, but was moved to the new Hadoop subproject in January 2006. The first committer added to the Hadoop project was Owen O’Malley in March 2006. Hadoop 0.1.0 was released in April 2006 and continues to be evolved by the many contributors to the Apache Hadoop project. Hadoop was named after one of the founder’s toy elephant.

En 2011, Rob Bearden se asoció con Yahoo! para establecer Hortonworks con 24 ingenieros del equipo de Hadoop original, incluyendo a los fundadores Alan Gates, Arun Murthy, Devaraj Das, Mahadev Konar, Owen O'Malley, Sanjay Radia y Suresh Srinivas.

Beneficios

Some of the reasons organizations use Hadoop is its’ ability to store, manage and analyze vast amounts of structured and unstructured data quickly, reliably, flexibly and at low-cost.

  • Scalability and Performance – distributed processing of data local to each node in a cluster enables Hadoop to store, manage, process and analyze data at petabyte scale.
  • Reliability – large computing clusters are prone to failure of individual nodes in the cluster. Hadoop is fundamentally resilient – when a node fails processing is re-directed to the remaining nodes in the cluster and data is automatically re-replicated in preparation for future node failures.
  • Flexibility – unlike traditional relational database management systems, you don’t have to created structured schemas before storing data. You can store data in any format, including semi-structured or unstructured formats, and then parse and apply schema to the data when read.
  • Low Cost – unlike proprietary software, Hadoop is open source and runs on low-cost commodity hardware.

Hadoop Capabilities

Data Storage

data_storageThe Hadoop Distributed File System (HDFS) provides scalable, fault-tolerant, cost-efficient storage for your big data lake. It was designed to span large clusters of commodity servers scaling up to hundreds of petabytes and thousands of servers. By distributing storage across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage.

 

Data Processing

data_processingMapReduce is the original framework for writing massively parallel applications that process large amounts of structured and unstructured data stored in HDFS. MapReduce can take advantage of the locality of data, processing it near the place it is stored on each node in the cluster in order to reduce the distance over which it must be transmitted.

More recently, Apache Hadoop YARN opened Hadoop to other data processing engines that can now run alongside existing MapReduce jobs to process data in many different ways at the same time, such as Apache Spark. YARN provides the centralized resource management that enables you to process multiple workloads simultaneously.  YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture.

Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data.

Data Access and Analysis

data_analyticsApplications can interact with the data in Hadoop using batch or interactive SQL (Apache Hive) or low-latency access with NoSQL (Apache HBase).  Hive allows business users and data analysts to use their preferred business analytics, reporting and visualization tools with Hadoop. Data stored in HDFS in Hadoop can be searched using Apache Solr.

 

Data Governance and Security

data_securityThe Hadoop ecosystem extends data access and processing with powerful tools for data governance and integration including centralized security administration (Apache Ranger) and data classification tagging (Apache Atlas), which combined enable dynamic data access policies that proactively prevent data access violations from occurring. Hadoop perimeter security is also available to integrate with existing enterprise security systems and control user access to Hadoop (Apache Knox).

Foros

Hadoop in our Blog

Hadoop in the Press

Seminarios web y presentaciones