Apache Hive

OVERVIEW

The de facto standard for SQL queries in Hadoop

Since its incubation in 2008, Apache Hive has been considered the de facto standard for interactive SQL queries over petabytes of data in Hadoop.

With the completion of the Stinger Initiative, and the next phase of Stinger.next, the Apache community has greatly improved Hive’s speed, scale and SQL semantics. Hive easily integrates with other critical data center technologies using a familiar JDBC interface.

What Hive Does

Hadoop was built to organize and store massive amounts of data of all shapes, sizes and formats. Because of Hadoop’s “schema on read” architecture, a Hadoop cluster is a perfect reservoir of heterogeneous data—structured and unstructured—from a multitude of sources.

Data analysts use Hive to query, summarize, explore and analyze that data, then turn it into actionable business insight.

Advantages of using Hive for enterprise SQL in Hadoop

Feature Description
Familiar
    Query data with a SQL-based language.
Fast
    Interactive response times, even over huge datasets.
Scalable and Extensible
    As data variety and volume grow, more commodity machines can be added without a corresponding reduction in performance.
Compatible
    Works with traditional data integration and data analytics tools.

How Hive Works

Hive on LLAP (Live Long and Process) uses persistent query servers with intelligent in-memory caching to avoid Hadoop’s batch-oriented latency, delivering query response times as low as sub-second against smaller data volumes. Hive on Tez continues to provide excellent batch query performance against petabyte-scale data sets.

The tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units: databases contain tables, which are made up of partitions. Data can be accessed via a simple query language, and Hive supports overwriting or appending data.

Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets.
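The directory layout described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Hive's implementation: the warehouse path, the table name, and the `partition_dir` and `bucket_index` helpers are hypothetical, and the modulo on an integer key is a simplified stand-in for Hive's bucketing hash, which depends on the column type and Hive version. The `column=value` sub-directory naming follows the convention Hive uses for partitions.

```python
from pathlib import PurePosixPath


def partition_dir(table_dir, partition_spec):
    """Build the sub-directory for one partition under a table directory.

    partition_spec is an ordered list of (column, value) pairs; each pair
    becomes one nested 'column=value' directory level.
    """
    path = PurePosixPath(table_dir)
    for column, value in partition_spec:
        path = path / f"{column}={value}"
    return path


def bucket_index(key, num_buckets):
    """Pick a bucket for an integer key.

    Simplified stand-in: real Hive hashes the bucketing column first.
    """
    return key % num_buckets


# Hypothetical table directory for a table 'orders' in database 'sales'.
table_dir = "/warehouse/sales.db/orders"

part = partition_dir(table_dir, [("year", 2017), ("month", 6)])
print(part)                   # /warehouse/sales.db/orders/year=2017/month=6
print(bucket_index(1234, 4))  # 2
```

Because each partition is its own directory, a query that filters on the partition columns only needs to read the matching sub-directories rather than the whole table.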

Hive supports all the common primitive data types, such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as structs, maps and arrays.
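As a rough illustration of how primitive and complex types combine, the Python sketch below assembles a HiveQL `CREATE TABLE` statement as a string. The table name, column names, and the `create_table_ddl` helper are hypothetical and only serve the example; the type spellings (`ARRAY<...>`, `MAP<...>`, `STRUCT<...>`) follow HiveQL syntax. The statement is printed, not executed against a cluster.

```python
def create_table_ddl(table, columns):
    """Render a CREATE TABLE statement from (column_name, hive_type) pairs."""
    body = ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)
    return f"CREATE TABLE {table} (\n{body}\n)"


# Hypothetical schema mixing primitive and complex types.
columns = [
    ("customer_id", "BIGINT"),                          # primitive
    ("name", "STRING"),                                 # primitive
    ("phone_numbers", "ARRAY<STRING>"),                 # list of values
    ("preferences", "MAP<STRING, STRING>"),             # key/value pairs
    ("address", "STRUCT<street:STRING, city:STRING>"),  # named fields
]

ddl = create_table_ddl("customers", columns)
print(ddl)
```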

Innovation & Focus

The Stinger and Stinger.next initiatives successfully delivered a fundamentally new Apache Hive, which evolved Hive’s traditional architecture and made it much faster, with richer, standards-compliant SQL semantics and petabyte scalability. We continue to work within the Hive community to advance these four key facets of Hive:

Speed
Deliver sub-second query response times
Scale
The only SQL interface to Hadoop designed for queries that scale from terabytes to petabytes
SQL
Enable transactions and SQL:2011 Analytics for Hive
OLAP
Provide native interfaces for MDX-compliant analytics tools, including Excel

The focus for the next major Hive release will be:

Focus Planned Enhancements
MDX interface
    Provide native support for the MDX (multi-dimensional expressions) query language, the de facto query language used by online analytical processing (OLAP) tools, including Microsoft Excel.
Complete SQL:2011 Compliance
    Close the remaining gaps to achieve complete SQL:2011 compliance, enabling maximum compatibility with SQL-based BI, visualization and machine learning tools.

Recent Hive Releases

Apache Hive Version Prior Enhancements
2.1
  • Hive LLAP: Persistent query servers with intelligent in-memory caching.
  • ACID GA: Hardened and proven at scale.
  • Expanded SQL Compliance: More capable integration with BI tools.
  • Performance: Interactive query, 2x faster ETL.
  • Security: Row / Column security extending to views, Column level security for Spark.
  • Operations: LLAP integration in Ambari, new Grafana dashboards.
2.0
  • Speed: HBase to store Hive Metadata
  • Workflow: HPL/SQL – Implementing Procedural SQL in Hive
  • Scale: first version of LLAP, and Hive on Spark
  • SQL: Hive-on-Spark Self Union/Join
1.2
  • Speed: Vectorized Map Join brings up to 5x faster map joins
  • Scale: Hybrid-Hybrid Grace Hash Join allows analytical queries at large scale without complex tuning
  • Scale: Bloom Filter support added to ORCFile
  • SQL: Added support for UNION DISTINCT and Interval Types
0.14
  • Speed: Cost-based optimizer for star and bushy join queries
  • Scale: Temporary tables
  • Scale: Transactions with ACID semantics
0.13
  • Speed: Hive on Tez, vectorized query engine & cost-based optimizer
  • Scale: dynamic partition loads and smaller hash tables
  • SQL: CHAR & DECIMAL datatypes, subqueries for IN / NOT IN
0.12
  • Speed: Vectorized query engine & ORCFile predicate pushdown
  • SQL: Support for VARCHAR and DATE semantics, GROUP BY on structs and unions

What’s new in HDP 2.6

Hive LLAP is now GA

Hive LLAP delivers MPP performance at Hadoop scale, provides ACID transactions and is 100% compatible with existing Hive applications. Apache Ambari delivers a one-click experience for enabling LLAP, making Hive LLAP the easiest SQL on Hadoop solution to get started with.

ACID MERGE

Hive now supports ANSI SQL-compliant MERGE to complement the existing INSERT/UPDATE/DELETE. Managing data in Hadoop is now easier and faster than ever.
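To make the idea concrete, the Python sketch below models what a MERGE does conceptually: each incoming change updates or deletes a matching target row, or is inserted when there is no match. It works on in-memory dictionaries and does not talk to Hive. The `merge` function, the `id`/`value`/`deleted` field names, and the sample rows are all hypothetical, and the sketch assumes each id appears at most once in the source.

```python
def merge(target, source):
    """Apply a list of changes to a target keyed by id, MERGE-style.

    target: dict mapping id -> value
    source: list of dicts with 'id', and either 'value' or 'deleted': True
    Returns a new dict; the input target is left unmodified.
    """
    result = dict(target)
    for change in source:
        key = change["id"]
        deleted = change.get("deleted", False)
        if key in result:
            if deleted:
                del result[key]                # WHEN MATCHED AND deleted THEN DELETE
            else:
                result[key] = change["value"]  # WHEN MATCHED THEN UPDATE
        elif not deleted:
            result[key] = change["value"]      # WHEN NOT MATCHED AND NOT deleted THEN INSERT
    return result


target = {1: "alice", 2: "bob", 3: "carol"}
source = [
    {"id": 2, "value": "robert"},  # existing row: update
    {"id": 3, "deleted": True},    # existing row: delete
    {"id": 4, "value": "dave"},    # new row: insert
]

merged = merge(target, source)
print(merged)  # {1: 'alice', 2: 'robert', 4: 'dave'}
```

The point of a single MERGE statement is that all three outcomes are decided in one pass over the source, instead of issuing separate INSERT, UPDATE and DELETE statements.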

More Powerful SQL

Hive is now able to run all 99 TPC-DS queries out-of-the-box with only trivial rewrites required, thanks to the addition of new set operators and multi subquery support. Migrating workloads to Hive has never been easier.

New Hive View

Hive View 2.0 introduces a database manager to let you create, modify and manage Hive tables. In addition, a new visual explain makes it easy to pinpoint and optimize expensive query operations.

More Powerful Tez UI

The new Total Timeline View shows you exactly where a query spends its time, from planning to execution to cleanup, making it easy to pinpoint and fix query bottlenecks.

SQL and HiveQL

Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop for providing data summarization, ad-hoc query, and analysis of large datasets. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).

Naturally, there are a number of differences between SQL and HiveQL, but there are a lot of similarities too, and recent releases of Hive bring HiveQL closer still to SQL-92 compatibility.

Below is some basic information to get you started. Of course, if you really want to get to grips with Hive, then take a look at the full language manual.

Retrieving Information

Retrieving information (general)
    MySQL: SELECT from_columns FROM table WHERE conditions;
    Hive: SELECT from_columns FROM table WHERE conditions;
Retrieving all values
    MySQL: SELECT * FROM table;
    Hive: SELECT * FROM table;
Retrieving some values
    MySQL: SELECT * FROM table WHERE rec_name = "value";
    Hive: SELECT * FROM table WHERE rec_name = "value";
Retrieving with multiple criteria
    MySQL: SELECT * FROM TABLE WHERE rec1 = "value1" AND rec2 = "value2";
    Hive: SELECT * FROM TABLE WHERE rec1 = "value1" AND rec2 = "value2";
Retrieving specific columns
    MySQL: SELECT column_name FROM table;
    Hive: SELECT column_name FROM table;
Retrieving unique output
    MySQL: SELECT DISTINCT column_name FROM table;
    Hive: SELECT DISTINCT column_name FROM table;
Sorting
    MySQL: SELECT col1, col2 FROM table ORDER BY col2;
    Hive: SELECT col1, col2 FROM table ORDER BY col2;
Sorting in reverse
    MySQL: SELECT col1, col2 FROM table ORDER BY col2 DESC;
    Hive: SELECT col1, col2 FROM table ORDER BY col2 DESC;
Counting rows
    MySQL: SELECT COUNT(*) FROM table;
    Hive: SELECT COUNT(*) FROM table;
Grouping with counting
    MySQL: SELECT owner, COUNT(*) FROM table GROUP BY owner;
    Hive: SELECT owner, COUNT(*) FROM table GROUP BY owner;
Maximum value
    MySQL: SELECT MAX(col_name) AS label FROM table;
    Hive: SELECT MAX(col_name) AS label FROM table;
Selecting from multiple tables (join same table using alias with "AS")
    MySQL: SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name;
    Hive: SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);
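The only row above where the two dialects are written differently is the join. The Python sketch below checks that the comma-style join and the explicit `JOIN ... ON` form return the same rows. It uses SQLite from the Python standard library purely as a local stand-in, so it does not exercise MySQL or Hive themselves. The `owner` column and the sample rows are invented for the example.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pet (name TEXT, owner TEXT);
CREATE TABLE event (name TEXT, comment TEXT);
INSERT INTO pet VALUES ('Fluffy', 'Harold'), ('Buffy', 'Harold'), ('Claws', 'Gwen');
INSERT INTO event VALUES ('Fluffy', 'litter of 4 kittens'), ('Buffy', 'vet visit');
""")

# Comma-style join with the condition in WHERE (the MySQL entry above).
implicit_join = (
    "SELECT pet.name, comment FROM pet, event "
    "WHERE pet.name = event.name ORDER BY pet.name"
)
# Explicit JOIN ... ON (the Hive entry above).
explicit_join = (
    "SELECT pet.name, comment FROM pet JOIN event "
    "ON (pet.name = event.name) ORDER BY pet.name"
)

implicit_rows = conn.execute(implicit_join).fetchall()
explicit_rows = conn.execute(explicit_join).fetchall()

print(explicit_rows)                   # [('Buffy', 'vet visit'), ('Fluffy', 'litter of 4 kittens')]
print(implicit_rows == explicit_rows)  # True
conn.close()
```

An `ORDER BY` is added to both queries so the comparison does not depend on the order in which the engine happens to return rows.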

Metadata

Selecting a database
    MySQL: USE database;
    Hive: USE database;
Listing databases
    MySQL: SHOW DATABASES;
    Hive: SHOW DATABASES;
Listing tables in a database
    MySQL: SHOW TABLES;
    Hive: SHOW TABLES;
Describing the format of a table
    MySQL: DESCRIBE table;
    Hive: DESCRIBE (FORMATTED|EXTENDED) table;
Creating a database
    MySQL: CREATE DATABASE db_name;
    Hive: CREATE DATABASE db_name;
Dropping a database
    MySQL: DROP DATABASE db_name;
    Hive: DROP DATABASE db_name (CASCADE);
