Science and technology

Scaling relational databases with Apache Spark SQL and DataFrames

Relational databases are right here to remain, whatever the hype and the arrival of newer databases typically referred to as NoSQL databases. The easy cause is that relational databases implement important construction, constraints, and supply a pleasant, declarative language to question information (which we love): SQL!

However, scale has all the time been an issue with relational databases. Most enterprises within the 21st century are loaded with wealthy information shops and repositories and need to take most benefit of their Big Data for actionable insights. Relational databases is likely to be well-liked, however they do not scale very properly except we spend money on a correct Big Data administration technique. This includes serious about potential information sources, information volumes, constraints, schemas, ETL (extract-transform-load), entry and querying patterns, and way more!

This article will cowl some wonderful advances made for leveraging the facility of relational databases, however “at scale,” utilizing among the newer parts from Apache Spark—Spark SQL and DataFrames. Most notably, we’ll cowl the next subjects.

  1. Motivations and challenges with scaling relational databases
  2. Understanding Spark SQL & DataFrames
    • Goals
    • Architecture and options
    • Performance
  3. The second article on this sequence presents a real-world case study/tutorial on Spark SQL with hands-on examples

We will probably be trying on the main challenges and motivations for folks working so exhausting and investing time in constructing new parts in Apache Spark so we are able to carry out SQL at scale. We may even look at the key structure, interfaces, options, and efficiency benchmarks for Spark SQL and DataFrames. Last, however most significantly, we’ll cowl a real-world case research on analyzing intrusion assaults based mostly on KDD 99 Cup Data utilizing Spark SQL and DataFrames by leveraging Databricks Cloud Platform for Spark.

Motivations and challenges on scaling relational databases for Big Data

Relational information shops are straightforward to construct and question. Also, customers and builders typically choose writing easy-to-interpret, declarative queries in a human-like readable language akin to SQL. However, as information begins growing in quantity and selection, the relational strategy doesn’t scale properly sufficient for constructing Big Data functions and analytical methods. Following are some main challenges.

  • Dealing with differing kinds and sources of information, which could be structured, semi-structured, and unstructured
  • Building ETL pipelines to and from varied information sources, which can result in growing a variety of particular customized code, thereby growing technical debt over time
  • Having the aptitude to carry out each conventional enterprise intelligence (BI)-based analytics and superior analytics (machine studying, statistical modeling, and so on.), the latter of which is certainly difficult to carry out in relational methods

Big Data analytics just isn’t one thing that was simply invented yesterday! We have had success on this area with Hadoop and the MapReduce paradigm. This was highly effective, however typically sluggish, and gave customers a low-level, procedural programming interface that required folks to put in writing a variety of code for even quite simple information transformations. However, as soon as Spark was launched, it actually revolutionized the way in which Big Data analytics was achieved with a give attention to in-memory computing, fault tolerance, high-level abstractions, and ease of use.

From then, a number of frameworks and methods like Hive, Pig, and Shark (which developed into Spark SQL) offered wealthy relational interfaces and declarative querying mechanisms to Big Data shops. The problem remained that these instruments have been both relational or procedural-based, and we could not have the very best of each worlds.

However, in the actual world, most information and analytical pipelines may contain a mix of relational and procedural code. Forcing customers to decide on both one finally ends up complicating issues and growing person efforts in growing, constructing, and sustaining totally different functions and methods. Apache Spark SQL builds on the beforehand talked about SQL-on-Spark effort referred to as Shark. Instead of forcing customers to select between a relational or a procedural API, Spark SQL tries to allow customers to seamlessly intermix the 2 and carry out information querying, retrieval, and evaluation at scale on Big Data.

Understanding Spark SQL and DataFrames

Spark SQL basically tries to bridge the hole between the 2 fashions we talked about beforehand—the relational and procedural fashions—with two main parts.

  • Spark SQL supplies a DataBody API that may carry out relational operations on each exterior information sources and Spark’s built-in distributed collections—at scale!
  • To help all kinds of numerous information sources and algorithms in Big Data, Spark SQL introduces a novel extensible optimizer referred to as Catalyst, which makes it straightforward so as to add information sources, optimization guidelines, and information sorts for superior analytics akin to machine studying.

Essentially, Spark SQL leverages the facility of Spark to carry out distributed, strong, in-memory computations at huge scale on Big Data. Spark SQL supplies state-of-the-art SQL efficiency and in addition maintains compatibility with all present constructions and parts supported by Apache Hive (a preferred Big Data warehouse framework) together with information codecs, user-defined features (UDFs), and the metastore. Besides this, it additionally helps in ingesting all kinds of information codecs from Big Data sources and enterprise information warehouses like JSON, Hive, Parquet, and so forth, and performing a mix of relational and procedural operations for extra complicated, superior analytics.


Let’s take a look at among the fascinating information about Spark SQL, together with its utilization, adoption, and targets, a few of which I’ll shamelessly as soon as once more copy from the wonderful and authentic paper on “Relational Data Processing in Spark.” Spark SQL was first launched in May 2014 and is probably now one of the vital actively developed parts in Spark. Apache Spark is certainly probably the most lively open supply undertaking for Big Data processing, with a whole bunch of contributors.

Besides being an open supply undertaking, Spark SQL has began seeing mainstream business adoption. It has already been deployed in very large-scale environments. Facebook has a wonderful case research about “Apache Spark @Scale: A 60 TB+ production use case.” The firm was doing information preparation for entity rating, and its Hive jobs used to take a number of days and had many challenges, however Facebook was efficiently capable of scale and improve efficiency utilizing Spark. Check out the fascinating challenges they confronted on this journey!

Another fascinating truth is that two-thirds of Databricks Cloud (a hosted service operating Spark) prospects use Spark SQL inside different programming languages. We may even showcase a hands-on case research utilizing Spark SQL on Databricks in part two of this sequence.

The main targets for Spark SQL, as outlined by its creators, are:

  1. Support relational processing, each inside Spark packages (on native RDDs) and on exterior information sources, utilizing a programmer-friendly API
  2. Provide excessive efficiency utilizing established DBMS strategies
  3. Easily help new information sources, together with semi-structured information and exterior databases amenable to question federation
  4. Enable extension with superior analytics algorithms akin to graph processing and machine studying

Architecture and options

We will now check out the important thing options and structure round Spark SQL and DataFrames. Some key ideas to bear in mind right here could be across the Spark ecosystem, which has been always evolving over time.

RDD (Resilient Distributed Dataset) is probably the largest contributor behind all of Spark’s success tales. It is mainly a knowledge construction, or relatively a distributed reminiscence abstraction to be extra exact, that enables programmers to carry out in-memory computations on giant distributed clusters whereas retaining facets like fault tolerance. You may also parallelize a variety of computations and transformations and monitor the entire lineage of transformations, which can assist in effectively recomputing misplaced information. Spark fans might want to learn a wonderful paper round RDDs, “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.” Also, Spark works with the idea of drivers and employees, as depicted within the following determine.

You can usually create an RDD by studying in information from information, databases, parallelizing present collections, and even transformations. Typically, transformations are operations that can be utilized to remodel the info into totally different facets and dimensions, relying on the way in which we need to wrangle and course of the info. They are additionally lazily evaluated, which means that, even for those who outline a change, the outcomes should not computed till you apply an motion, which usually requires a consequence to be returned to the motive force program (and it computes all utilized transformations then!).

Shout out to fellow information scientist and pal Favio Vázquez and his wonderful article “Deep Learning With Apache Spark,” the place I received some wonderful concepts and content material, together with the previous determine. Check it out!

Now that we all know in regards to the common structure of how Spark works, let’s take a more in-depth look into Spark SQL. Typically, Spark SQL runs as a library on high of Spark, as we noticed within the determine protecting the Spark ecosystem. The following determine offers a extra detailed peek into the standard structure and interfaces of Spark SQL.

The determine clearly exhibits the assorted SQL interfaces, which could be accessed by JDBC/ODBC or by a command-line console, in addition to the DataBody API built-in into Spark’s supported programming languages (we will probably be utilizing Python). The DataBody API may be very highly effective and permits customers to lastly intermix procedural and relational code! Advanced features like UDFs (user-defined features) may also be uncovered in SQL, which can be utilized by BI instruments.

Spark DataFrames are very fascinating and assist us leverage the facility of Spark SQL and mix its procedural paradigms as wanted. A Spark DataBody is mainly a distributed assortment of rows (row sorts) with the identical schema. It is mainly a Spark Dataset organized into named columns. Some extent to notice right here is that Datasets are an extension of the DataBody API that gives a type-safe, object-oriented programming interface. Hence, they’re obtainable solely in Java and Scala and we’ll, due to this fact, be specializing in DataFrames.

A DataBody is equal to a desk in a relational database (however with extra optimizations beneath the hood) and may also be manipulated in related methods to the “native” distributed collections in Spark (RDDs). Spark DataFrames have some fascinating properties, a few of that are talked about beneath.

  1. Unlike RDDs, DataFrames normally preserve monitor of their schema and help varied relational operations that result in a extra optimized execution.
  2. DataFrames could be constructed from tables, similar to present Hive tables in your Big Data infrastructure, and even from present RDDs.
  3. DataFrames could be manipulated with direct SQL queries and in addition utilizing the DataBody DSL (domain-specific language), the place we are able to use varied relational operators and transformers akin to the place and groupBy.
  4. Also, every DataBody may also be considered as an RDD of row objects, permitting customers to name procedural Spark APIs akin to map.
  5. Finally, a given, however a degree to all the time keep in mind, in contrast to conventional dataframe APIs (Pandas), Spark DataFrames are lazy, in that every DataBody object represents a logical plan to compute a dataset, however no execution happens till the person calls a particular “output operation” akin to save.

This ought to provide you with sufficient perspective on Spark SQL, DataFrames, important options, ideas, structure, and interfaces. Let’s wrap up this part by having a look at efficiency benchmarks.


Releasing a brand new characteristic with out the best optimizations could be lethal, and the oldsters who constructed Spark did tons of efficiency assessments and benchmarking! Let’s check out some fascinating outcomes. The first determine showcasing some outcomes is depicted beneath.

In these experiments, they in contrast the efficiency of Spark SQL towards Shark and Impala utilizing the AMPLab Big Data benchmark, which makes use of an internet analytics workload developed by Pavlo, et al. The benchmark incorporates 4 sorts of queries with totally different parameters performing scans, aggregation, joins, and a UDF-based MapReduce job. The dataset was 110GB of information after compression utilizing the columnar Parquet format. We see that in all queries, Spark SQL is considerably sooner than Shark and usually aggressive with Impala. The Catalyst optimizer is answerable for this, which reduces CPU overhead (we’ll cowl this briefly). This characteristic makes Spark SQL aggressive with the C++ and LLVM-based Impala engine in lots of of those queries. The largest hole from Impala is in Query 3a the place Impala chooses a greater be part of plan, as a result of the selectivity of the queries makes one of many tables very small.

The following graphs present some extra efficiency benchmarks for DataFrames and common Spark APIs and Spark + SQL.

Finally, the next graph exhibits a pleasant benchmark results of DataFrames vs. RDDs in numerous languages, which supplies an fascinating perspective on how optimized DataFrames could be.

Secret to efficiency: the Catalyst optimizer

Why is Spark SQL so quick and optimized? The cause is due to a brand new extensible optimizer, Catalyst, based mostly on practical programming constructs in Scala. While we can’t go into intensive particulars about Catalyst right here, it’s value a point out because it helps in optimizing DataBody operations and queries.

Catalyst’s extensible design has two functions.

  • Makes it straightforward so as to add new optimization strategies and options to Spark SQL, particularly to sort out numerous issues round Big Data, semi-structured information, and superior analytics
  • Ease of having the ability to prolong the optimizer—for instance, by including information source-specific guidelines that may push filtering or aggregation into exterior storage methods or help for brand spanking new information sorts

Catalyst helps each rule-based and cost-based optimization. While extensible optimizers have been proposed prior to now, they’ve usually required a fancy domain-specific language to specify guidelines. Usually, this results in having a big studying curve and upkeep burden. In distinction, Catalyst makes use of commonplace options of the Scala programming language, akin to pattern-matching, to let builders use the total programming language whereas nonetheless making guidelines straightforward to specify.

At its core, Catalyst incorporates a common library for representing timber and making use of guidelines to govern them. On high of this framework, it has libraries particular to relational question processing (e.g., expressions, logical question plans), and a number of other units of guidelines that deal with totally different phases of question execution: evaluation, logical optimization, bodily planning, and code technology to compile elements of queries to Java bytecode. Interested in understanding extra particulars about Catalyst and doing a deep-dive? You can try this wonderful “Deep Dive into Spark SQL’s Catalyst Optimizer” from Databricks.

Click by to the second article on this sequence for a hands-on tutorial based mostly on a real-world dataset to know at how one can use Spark SQL. 

This article initially appeared on Medium’s Towards Data Science channel and is republished with permission.

What to learn subsequent

Most Popular

To Top