Impala vs Hive vs Spark SQL: Choosing the Right SQL Engine for the Job in Cloudera Data Warehouse



We never have enough data. And we don't just want more data ... we want new kinds of data that will help us better understand our products, customers, and markets. We are always in search of new data, data of all shapes and sizes, structured and otherwise. We want to open our doors to a new generation of business and technical specialists who will enthusiastically embrace new databases and technologies with us, changing how we interact with data and the impact it has on our lives.

Let me give a real-life example of what I mean. About two years ago, data saved the life of my friend's daughter. She was born with seven heart defects. Thanks to new technologies such as interactive 3D imaging, virtual modeling, smarter ECG analysis, modern bedside patient monitoring, and other advanced data-driven medical procedures, she survived two open-heart surgeries and now lives a healthy life. Data saved her life. This is what drives me every day to search for innovative solutions and ways to deliver data faster to the people who need it most.

I am proud to be part of the Cloudera Data Warehouse (CDW) team, powered by the Cloudera Data Platform (CDP). CDP was built from the ground up as an Enterprise Data Cloud (EDC). An EDC is a multi-purpose platform for running many kinds of workloads in one place. With hybrid and multi-cloud support, CDP can run anywhere: on bare metal, in a private cloud, or in a public cloud. As more cloud solutions are adopted as part of our digital transformation plans, hybrid and multi-cloud deployments are becoming the new norm. However, these combined environments are harder to manage, which in turn creates new risks around security, lineage tracking, and regulatory compliance. To address these problems, CDP provides advanced security and governance capabilities that make it possible to open up access to data without violating anyone's security policy, or the law.

CDW on CDP is a new service that lets BI analytics teams create self-service data warehouses. You can quickly spin up new data warehouses for your own use, or share access with a team and work against a single shared dataset. Remember when you could manage your data warehouse on your own, without involving platform and infrastructure teams? You don't, because it was never possible before. CDW makes it possible.

CDW makes a variety of SQL engines available, but with great choice comes confusion. Let's look at the SQL engines available in CDW on CDP and discuss which SQL option is best suited to each kind of task.

So many options! Impala? Hive LLAP? Spark? What should you use, and when? Let's figure it out.

Impala SQL Engine


Impala is a popular open-source MPP engine with a broad feature set in Cloudera Distribution Hadoop (CDH) and CDP. Impala has earned the market's trust with its low-latency, highly interactive SQL queries. Impala's capabilities are very broad: it not only supports the Hadoop Distributed File System (HDFS) with Parquet, Optimized Row Columnar (ORC), JavaScript Object Notation (JSON), Avro, and text formats, but also has built-in support for Kudu, Microsoft Azure Data Lake Storage (ADLS), and Amazon Simple Storage Service (S3). Impala offers strong security through Apache Sentry or Apache Ranger and, as you may know, can support thousands of users on clusters of hundreds of nodes over multi-petabyte datasets. Let's look at the overall Impala architecture.



Impala uses the StateStore to check the health of the cluster. If an Impala node goes offline for any reason, the StateStore notifies all the other nodes so the unreachable node is skipped. The Impala catalog service manages metadata for all SQL statements across all nodes in the cluster. The StateStore and the catalog service communicate with the Hive MetaStore for block and file locations, and then pass metadata on to the worker nodes. When a query arrives, it goes to one of the many coordinators, where it is compiled and planning begins. Plan fragments are returned, and the coordinator orchestrates their execution. Intermediate results are streamed between the Impala daemons and then returned to the client.
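To make this concrete, here is a minimal sketch of submitting a query to an Impala coordinator from Python using the impyla client. The hostname, database, table, and column names are hypothetical placeholders, not anything prescribed by Impala itself.

```python
# Minimal sketch: querying Impala from Python via the impyla client.
# Host, database, table, and column names are hypothetical placeholders.
from impala.dbapi import connect

# Connect to an Impala coordinator (21050 is Impala's default HiveServer2 port).
conn = connect(host='impala-coordinator.example.com', port=21050)
cur = conn.cursor()

# The coordinator compiles and plans this query, distributes plan fragments
# to the worker daemons, and streams intermediate results back to the client.
cur.execute("""
    SELECT customer_id, SUM(order_total) AS total_spend
    FROM sales.orders
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""")
for customer_id, total_spend in cur.fetchall():
    print(customer_id, total_spend)

cur.close()
conn.close()
```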

This architecture is ideal when we need business intelligence data marts to answer queries with low latency, as is typical for ad-hoc, self-service, and data discovery use cases. In this scenario, customers report response times on complex queries ranging from under one second to five seconds.

For Internet of Things (IoT) data and similar scenarios, Impala, together with streaming solutions such as NiFi, Kafka, or Spark Streaming and storage like Kudu, can provide a continuous pipeline with end-to-end latency under ten seconds. With built-in read/write support for S3, ADLS, HDFS, Hive, HBase, and more, Impala is an excellent SQL engine to use when operating a cluster of up to 1,000 nodes with more than 100 trillion rows in its tables, or datasets of 50 PB or larger.
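As an illustration of the Kudu integration mentioned above, the sketch below creates a Kudu-backed table through Impala, the kind of table a streaming pipeline would continuously write into. The database, table, and column names are invented for the example.

```python
# Hypothetical sketch: a Kudu-backed table created through Impala for an
# IoT-style streaming pipeline. All names here are illustrative.
from impala.dbapi import connect

conn = connect(host='impala-coordinator.example.com', port=21050)
cur = conn.cursor()

# Kudu tables require a primary key and an explicit partitioning scheme;
# hash partitioning on sensor_id spreads writes across tablet servers.
cur.execute("""
    CREATE TABLE IF NOT EXISTS iot.sensor_readings (
        sensor_id    BIGINT,
        reading_time TIMESTAMP,
        reading      DOUBLE,
        PRIMARY KEY (sensor_id, reading_time)
    )
    PARTITION BY HASH (sensor_id) PARTITIONS 16
    STORED AS KUDU
""")

# A streaming job (NiFi, a Kafka consumer, or Spark Streaming) would insert
# rows continuously; Impala can then query them with seconds-level freshness.
cur.execute("SELECT COUNT(*) FROM iot.sensor_readings")
print(cur.fetchone())
```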

Hive LLAP


LLAP (which stands for "Live Long and Process", or "Low-Latency Analytical Processing") is a Hive-based execution engine that supports long-running processes sharing the same caching and processing resources. This execution model gives us very low-latency SQL responses, because there is no spin-up time for the requested resources.



In addition, LLAP enforces and tracks security policies, so all of its work is transparent to the user, which helps Hive compete on workload performance with even the most popular traditional data warehouses in use today.

Hive LLAP offers the most mature SQL engine in the big data ecosystem. Hive LLAP was built for huge volumes of data, giving users the full capabilities of an Enterprise Data Warehouse (EDW): transforming large volumes of data and running long or heavy SQL queries with hundreds of joins. Hive supports materialized views, surrogate keys, and constraints similar to traditional relational database management systems, including built-in caching of query results and query data. Hive LLAP can take the burden off repetitive queries, cutting response times to a fraction of a second. Hive LLAP can also run federated queries over HDFS (Hadoop Distributed File System) and object stores, as well as over real-time streams, working with Kafka and Druid.
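To illustrate the materialized-view support mentioned above, here is a hedged sketch using the PyHive client against a HiveServer2 endpoint. The host, schema, table, and view names are assumptions made for the example.

```python
# Sketch of Hive's materialized views via the PyHive client.
# Host, port, and table/view names are assumptions for illustration.
from pyhive import hive

conn = hive.connect(host='hive-llap.example.com', port=10000)
cur = conn.cursor()

# Hive 3 materialized views precompute and store the result; the optimizer
# can transparently rewrite matching queries to read from the view instead
# of re-aggregating the base table on every run.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS sales_by_region AS
    SELECT region, SUM(amount) AS total_amount
    FROM warehouse.sales
    GROUP BY region
""")

# This repeated aggregation can be answered from the materialized view,
# bringing the response time down to a fraction of the original.
cur.execute("SELECT region, SUM(amount) FROM warehouse.sales GROUP BY region")
print(cur.fetchall())
```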

Hive LLAP is thus ideally suited as an Enterprise Data Warehouse (EDW) solution where we face many long-running queries that require heavy transformations or multiple joins between tables over large datasets. Thanks to the caching technology built into Hive LLAP, we now have customers who can join 330 billion records with 92 billion other records, with or without a partition key, and get results in seconds.

Spark SQL



Spark is a high-performance, general-purpose data processing engine built for distributed processing, with a wide range of applications. There are many Spark libraries for data science and machine learning practitioners that provide a higher-level programming model for rapid development. Spark SQL, MLlib, Spark Streaming, and GraphX sit on top of the core Spark engine.



Spark SQL is a structured data processing module compatible with a variety of data sources, with support for Hive, Avro, Parquet, ORC, JSON, and JDBC. Spark SQL is efficient on semi-structured datasets and integrates with the Hive MetaStore and NoSQL stores such as HBase. Spark is often used through programmatic APIs in our favorite languages such as Java, Python, R, and Scala.
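Here is a minimal PySpark sketch of what this looks like in practice: reading Parquet data and mixing the DataFrame API with plain SQL. The file path, view name, and columns are hypothetical.

```python
# Minimal PySpark sketch: Spark SQL over a columnar dataset.
# File path, view name, and column names are hypothetical.
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark share table definitions with the Hive MetaStore.
spark = (SparkSession.builder
         .appName("spark-sql-demo")
         .enableHiveSupport()
         .getOrCreate())

# Read a Parquet dataset and expose it to SQL as a temporary view.
events = spark.read.parquet("s3a://example-bucket/events/")
events.createOrReplaceTempView("events")

# The same engine answers SQL and DataFrame calls interchangeably.
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM events
    GROUP BY page
    ORDER BY views DESC
    LIMIT 5
""")
top_pages.show()
```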

Spark can be very useful when you need to embed SQL queries in Spark programs working with large data volumes under high load. Spark helps many of our users in Global 100 companies cut down their streaming data processing. Combined with MLlib, we see many of our customers embracing Spark as an excellent engine for machine learning on data warehouse workloads. Thanks to its high performance, low latency, and strong third-party tool integration, Spark SQL offers the best of both worlds between programmatic and SQL access.
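As a hedged example of the Spark SQL plus MLlib combination just described, the sketch below feeds rows selected with SQL into a clustering model. The table name and feature columns are invented for illustration.

```python
# Hypothetical sketch: feeding SQL query results into MLlib.
# Table name and feature columns are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = (SparkSession.builder
         .appName("sql-to-mllib")
         .enableHiveSupport()
         .getOrCreate())

# Select features straight from a warehouse table with SQL...
df = spark.sql("SELECT recency, frequency, monetary FROM warehouse.customer_stats")

# ...assemble them into the single vector column MLlib expects...
features = VectorAssembler(
    inputCols=["recency", "frequency", "monetary"],
    outputCol="features",
).transform(df)

# ...and fit a clustering model on the same distributed engine.
model = KMeans(k=4, seed=42).fit(features)
model.transform(features).select("features", "prediction").show(5)
```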

So which SQL engine to use?



Since you can work with the same data in CDW on CDP, you can choose the right engine for each type of workload: data engineering, traditional EDW, ad-hoc analytics, BI dashboards, Online Analytical Processing (OLAP), or Online Transaction Processing (OLTP). The diagram below offers some guidelines for this choice, showing which engines are well suited to each of these goals.



Conclusion


If you run an EDW supporting BI dashboards, Hive LLAP will give you the best results. When you need an ad-hoc, self-service, and exploratory data warehouse, look to the strengths of Impala. For data engineering with long-running queries and no high-concurrency requirements, Spark SQL is a great choice. If you do need high-concurrency support, take a look at Hive on Tez. For OLAP support on time-series data, add Druid; and if you are looking for OLTP with low latency and high concurrency, consider adding Phoenix.

In short, there are many SQL engines in CDW on CDP, and that is deliberate. Offering choice before forcing a decision is the best way to optimize for high-performance, high-concurrency workloads on massive datasets. CDW on CDP enables data sharing under a single system of security, governance, lineage, and metadata, letting you mix and match SQL engines over optimized storage. This gives users the freedom to choose the best SQL engine for each of their workloads.

Source: https://habr.com/ru/post/undefined/

