10 Distributed SQL Query Engines for Big Data

This time I am sharing a post called 10 Distributed SQL Query Engines for Big Data. Many thanks for your time, it’s truly appreciated!

Data… Data… Data… Yep, it’s everywhere, from software to salt stores, and all of it gets tagged as Big Data. But who is the friend that can help us get insights and value out of data in distributed systems? If we listed every such tool in the industry, it would surely fill more than a quarter of the pie (I am not a statistician; this is all from my own reading and industry follow-ups).

Here is a collection of 10 of them: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto, DryadLINQ, and Jaql. I am sharing my findings in the hope that they add a bit of value to your journey through the big data analytics landscape.

The SQL Lineage: Edgar Frank “Ted” Codd proposed a language called DSL/Alpha, along with the definition of the relational model, for manipulating the data in relational tables. IBM formed a group to create a simplified version of DSL/Alpha that they called SQUARE. SEQUEL is the refined version of SQUARE, which was finally renamed SQL. Also, an interesting note I learnt from a well-known book: SQL is not an acronym for anything (although many people will insist it stands for “Structured Query Language”). When referring to the language, it is equally acceptable to say the letters individually as S. Q. L. or to use the word sequel.

1) Hive:

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix. Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services. For more details …

2) Impala:

Impala raises the bar for SQL query performance on Apache Hadoop while retaining a familiar user experience. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time. Furthermore, Impala uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax) as Apache Hive, providing a familiar and unified platform for batch-oriented or real-time queries. For that reason, Hive users can utilize Impala with little setup overhead. For more details …

3) HAWQ:

HAWQ is designed as an MPP SQL processing engine optimized for analytics with full transaction support. HAWQ breaks complex queries into small tasks and distributes them to MPP query processing units for execution. The query planner, dynamic pipeline, leading edge interconnect, and the specific query executor optimization for distributed storage are designed to work seamlessly, to support the highest level of performance and scalability. For more details …
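The idea of breaking a query into small tasks and merging their results can be illustrated in a few lines of Python. This is only a conceptual sketch, not HAWQ's actual planner or executor: the names `partial_sum` and `run_query`, the in-memory sample rows, and the thread pool standing in for MPP processing units are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data: (region, amount) rows of a sales table.
rows = [("east", 10), ("west", 5), ("east", 7),
        ("north", 3), ("west", 8), ("east", 1)]


def partial_sum(chunk):
    """One small task: aggregate a single slice of the data."""
    totals = Counter()
    for region, amount in chunk:
        totals[region] += amount
    return totals


def run_query(data, workers=3):
    """Sketch of: SELECT region, SUM(amount) FROM sales GROUP BY region."""
    # Split the input into one slice per worker (stand-in for data segments).
    chunks = [data[i::workers] for i in range(workers)]
    # Each worker computes a partial aggregate over its own slice.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # Merge the partial results into the final answer.
    final = Counter()
    for part in partials:
        final.update(part)
    return dict(final)


result = run_query(rows)
```

The point of the sketch is that each slice can be aggregated independently, so adding workers only changes how the data is split, not the final answer.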

4) IBM Big SQL:

Big SQL is a software layer that enables IT professionals to create tables and query data in BigInsights using familiar SQL statements. To do so, programmers use standard SQL syntax and, in some cases, SQL extensions created by IBM to make it easy to exploit certain Hadoop-based technologies. The SQL query engine supports joins, unions, grouping, common table expressions, windowing functions, and other familiar SQL expressions. Furthermore, you can influence the data access strategy for your queries with optimization hints and configuration options. Depending on the nature of your query, your data volumes, and other factors, Big SQL can use Hadoop’s MapReduce framework to process various query tasks in parallel or execute your query locally within the Big SQL server on a single node — whichever may be most appropriate for your query. For more details …
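To give a feel for the kind of familiar SQL mentioned above (a join, grouping, and a common table expression), here is a small sketch. It runs against SQLite through Python's built-in `sqlite3` module purely as a local stand-in; it is not Big SQL, the `customers` and `orders` tables are hypothetical, and Big SQL's own dialect and extensions may differ.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO orders VALUES (1, 20), (1, 5), (2, 30), (2, 1);
    """
)

query = """
WITH big_orders AS (                           -- common table expression
    SELECT customer_id, amount FROM orders WHERE amount >= 5
)
SELECT c.name, COUNT(*) AS n, SUM(b.amount) AS total
FROM customers AS c
JOIN big_orders AS b ON b.customer_id = c.id   -- join
GROUP BY c.name                                -- grouping
ORDER BY c.name
"""
rows = conn.execute(query).fetchall()
conn.close()
```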

5) Drill:

Drill is an open source, low latency SQL query engine for Hadoop and NoSQL. Modern big data applications such as social, mobile, web and IoT deal with a larger number of users and larger amount of data than the traditional transactional applications. The datasets associated with these applications evolve rapidly, are often self-describing and can include complex types such as JSON and Parquet. Apache Drill is built from the ground up to provide low latency queries natively on such rapidly evolving multi-structured datasets at scale. For more details …
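What “self-describing” and “rapidly evolving” mean in practice can be sketched with plain Python. This is not Drill's API or its execution model; the helper names `get_path` and `select` and the sample records are hypothetical, and the sketch only shows the idea of querying JSON without declaring a schema first.

```python
import json

# Hypothetical newline-delimited JSON; later records gain new fields.
raw = """
{"user": "ann", "clicks": 3}
{"user": "bob", "clicks": 5, "device": {"os": "android"}}
{"user": "cy", "device": {"os": "ios", "model": "x"}}
"""


def get_path(record, path):
    """Follow a dotted path such as 'device.os'; return None if absent."""
    value = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def select(lines, columns):
    """Project the requested columns from each record, schema-free."""
    for line in lines:
        if line.strip():
            record = json.loads(line)
            yield tuple(get_path(record, col) for col in columns)


result = list(select(raw.splitlines(), ["user", "clicks", "device.os"]))
```

Records that lack a field simply yield `None` for that column, so new fields can appear in the data without any table definition being changed.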

6) Tajo:

Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load) on large data sets stored on HDFS (Hadoop Distributed File System) and other data sources. By supporting SQL standards and leveraging advanced database techniques, Tajo allows direct control of distributed execution and data flow across a variety of query evaluation strategies and optimization opportunities. For more details …

7) Pig:

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At the present time, Pig’s infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig’s language layer currently consists of a textual language called Pig Latin, which has the following key properties:

Ease of programming. It is trivial to achieve parallel execution of simple, “embarrassingly parallel” data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.

Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.

Extensibility. Users can create their own functions to do special-purpose processing.

For more details …
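To make “data flow sequences” a little more concrete, here is a rough Python sketch of a filter, group, and aggregate flow. It is not Pig Latin and not how Pig compiles programs; the `visits` data and the step names are hypothetical, and each named intermediate result simply plays the role that a relation would play in a data flow script.

```python
from collections import defaultdict

# Hypothetical input: (user, url, seconds_on_page) records.
visits = [
    ("ann", "/home", 12),
    ("bob", "/home", 3),
    ("ann", "/docs", 40),
    ("cy", "/docs", 25),
    ("bob", "/docs", 1),
]

# Step 1: filter, keeping visits longer than 2 seconds.
long_visits = [v for v in visits if v[2] > 2]

# Step 2: group the remaining visits by url.
by_url = defaultdict(list)
for user, url, seconds in long_visits:
    by_url[url].append(seconds)

# Step 3: for each group, compute the visit count and average time.
summary = {
    url: (len(times), sum(times) / len(times))
    for url, times in by_url.items()
}
```

Because every step only consumes the output of the previous one, a system is free to reorder or parallelize the work behind the scenes, which is the optimization opportunity the properties above describe.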

8) Presto:

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook. For more details …

9) DryadLINQ:

DryadLINQ is a simple, powerful, and elegant programming environment for writing large-scale data parallel applications running on large PC clusters.

The goal of DryadLINQ is to make distributed computing on large compute clusters simple enough for every programmer. DryadLINQ combines two important pieces of Microsoft technology: the Dryad distributed execution engine and the .NET Language Integrated Query (LINQ). Dryad provides reliable, distributed computing on thousands of servers for large-scale data parallel applications. LINQ enables developers to write and debug their applications in a SQL-like query language, relying on the entire .NET library and using Visual Studio. For more details …

10) Jaql:

Jaql is primarily a query language for JavaScript Object Notation (JSON), but it supports more than just JSON. It allows you to process both structured and nontraditional data and was donated by IBM to the open source community (just one of many contributions IBM has made to open source).

Specifically, Jaql allows you to select, join, group, and filter data that is stored in HDFS, much like a blend of Pig and Hive. Jaql’s query language was inspired by many programming and query languages, including Lisp, SQL, XQuery, and Pig. Jaql is a functional, declarative query language that is designed to process large data sets. For parallelism, Jaql rewrites high-level queries, when appropriate, into “low-level” queries consisting of MapReduce jobs. For more details …
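The rewrite of a high-level query into MapReduce steps can be pictured roughly as follows. This is a plain Python sketch, not Jaql syntax and not its real rewriter; the `orders` records and the function names `map_phase`, `reduce_phase`, and `run_job` are hypothetical. It expresses “filter paid orders, group by customer, sum the totals” as an explicit map, shuffle, and reduce.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical JSON-like records.
orders = [
    {"customer": "ann", "total": 20, "status": "paid"},
    {"customer": "bob", "total": 15, "status": "open"},
    {"customer": "ann", "total": 5, "status": "paid"},
    {"customer": "bob", "total": 30, "status": "paid"},
]


def map_phase(record):
    """Filter and project: emit (key, value) pairs for paid orders only."""
    if record["status"] == "paid":
        yield record["customer"], record["total"]


def reduce_phase(key, values):
    """Aggregate all values that share one key."""
    return {"customer": key, "sum": sum(values)}


def run_job(records):
    # Map: apply map_phase to every record.
    pairs = [pair for record in records for pair in map_phase(record)]
    # Shuffle: sort so that equal keys are adjacent, then group them.
    pairs.sort(key=itemgetter(0))
    grouped = groupby(pairs, key=itemgetter(0))
    # Reduce: one call per distinct key.
    return [reduce_phase(key, [v for _, v in group]) for key, group in grouped]


result = run_job(orders)
```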

Beyond this, we are collecting the differences between these engines in terms of technical keywords, which will be shared shortly. Please feel free to comment and suggest.

Copied from kumarchinnakali
