Apache Spark

The following open source distributed SQL engines are capable of querying extremely large datasets:

Apache Hive
Apache Impala
Presto
Apache Drill

This list does not include Spark SQL. This optional reading briefly explains what Spark is, what Spark SQL is, and why Spark SQL is not on the list.

What Is Apache Spark, and What Is Spark SQL?

Apache Spark is a large-scale data processing engine that can run a wide range of data processing workloads. It provides several libraries for performing different kinds of work, and one of these libraries is Spark SQL.

Spark SQL is Spark’s library for working with structured data. The name “Spark SQL” seems to suggest that the SQL query language is the central piece of this library, but it is not. Support for the SQL query language is just one part of what Spark SQL provides. Spark SQL also provides programming interfaces, not based on the SQL query language, for several programming languages (Scala, Java, Python, and R).

Who Uses Spark SQL?

Spark SQL is most often used by data scientists, data engineers, and big data application developers. It helps these users work with structured data inside their Spark applications.

Spark SQL is not widely used by data analysts. Compared to Hive and Impala, Spark SQL is less well integrated into the ecosystem of tools that data analysts use, and this lack of integration, tooling, and support has limited its adoption among analysts. Furthermore, the architecture of Apache Spark makes Spark SQL inherently less efficient as a query engine for data analysts than purpose-built query engines like Impala.

However, there have been recent efforts to make Spark SQL a more viable option for data analysts running interactive SQL queries. If these efforts succeed, we will consider adding more details about Spark SQL to this course. At present, though, the number of data analysts using Spark SQL remains relatively small, and there are obstacles to its broader use. As a result, we recommend that data analysts focus on learning Hive and Impala.

Spark SQL Is Compatible with Hive and Impala

The good news is that Spark SQL was designed to be highly compatible with Hive and Impala. Spark SQL can query the same tables that Hive and Impala can, and the Spark SQL query syntax is almost entirely compatible with Hive’s query syntax. So even though this course does not mention Spark SQL by name (except in this reading), you can take the skills you’ll learn in this course and apply them directly to Spark SQL.

For more information about Spark SQL compatibility with Hive, see https://spark.apache.org/docs/latest/sql-migration-guide-hive-compatibility.html (but note that many of the details described there are beyond the scope of this course).