History of Spark
Apache Spark originated as a research project at UC Berkeley’s AMPLab, focusing on big data analytics. It introduced a programming model that offers broader application support compared to MapReduce while maintaining automatic fault tolerance.
MapReduce is inefficient for certain types of applications that involve low-latency data sharing across parallel operations. These applications are common in analytics and include iterative algorithms (e.g., machine learning and graph algorithms like PageRank), interactive data mining, and streaming applications that maintain aggregate state.
Conventional MapReduce and DAG engines are suboptimal for these applications because they follow acyclic data flow. Each job reads data from stable storage, performs computations, and writes the results back to replicated storage, incurring significant data loading and writing costs at each step.
Spark addresses these challenges with its resilient distributed datasets (RDDs) abstraction. RDDs can be stored in memory without replication, rebuilding lost data on failure using lineage information. This approach enables Spark to outperform existing models by up to 100x in multi-pass analytics.
The initial version of Spark supported only batch processing. However, due to early adoption and compelling use cases for interactive data science and ad hoc queries, AMPLab projects like Shark emerged. Shark was an early SQL engine built on top of Spark that enabled SQL-like queries, and it later evolved into what is now Spark SQL.
In my view, Spark’s success can be attributed to its creators’ foresight in anticipating the varied needs of distributed computing setups. They proactively incorporated capabilities like batch processing, streaming, machine learning, SQL, and graph processing into the core Spark project. This approach transformed Spark into a comprehensive ecosystem, eliminating the need for multiple standalone projects with compatibility issues, as often faced with MapReduce. Consequently, Spark gained a significant advantage over MapReduce.
Spark aims to cater to a wide range of data users by providing compatible language APIs, including Scala, Python, Java, R, and SQL.
Furthermore, Spark benefits from a vibrant community and commercial backing from Databricks. This support encourages new users through certifications, training, meetups, books, and continuous support. Strong support and incentives are crucial for languages and products to thrive, an area where MapReduce lags behind Spark.
Source: https://spark.apache.org/research.html
Image Source by Author
Interaction with Spark
There are several ways in which you can interact with Spark:
Local mode: Also known as developer mode, this mode runs everything on a single computer without requiring any additional configuration (a minimal example of starting a local session follows this list).
Cluster mode: In this mode, Spark applications are deployed in a cluster through a resource manager such as YARN or Mesos. It allows for distributed processing across multiple nodes in the cluster.
Interactive mode: This mode involves using a web-based notebook such as Zeppelin, which is the preferred way for data scientists to interact with Spark. Notebooks provide an interactive and collaborative environment for executing Spark code and analyzing data.
Spark console: You can run individual Spark commands and test lines of code, similar to the Python CLI.
IntelliJ IDEA: You can configure IntelliJ to develop your Spark applications. For Scala, SBT can be used to set up the environment; a similar setup works for Java.
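As a minimal sketch of local mode, the snippet below starts a SparkSession from Python on a single machine; the application name is just an illustrative choice, and in cluster mode the master would instead point at a resource manager such as YARN.

```python
from pyspark.sql import SparkSession

# Minimal sketch: start Spark in local (developer) mode, using all available cores.
# "local-mode-demo" is only an illustrative application name.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("local-mode-demo")
    .getOrCreate()
)

print(spark.version)  # confirm which Spark version the session is running on

spark.stop()
```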
Image Source by Author
For comprehensive details on each version, please refer to the official Apache Spark documentation.
Spark supports the following language APIs:
Scala — Spark is primarily written in Scala, a functional language, and it inherits many core functional-language features.
Java
Python
SQL
R
The diagram below shows how Spark bridges the other language APIs to the core engine, which is written in Scala.
Image Source by Author
Your primary focus is on crafting your business logic in the language that suits you best — whether it’s R or Python. The SparkSession object is at your disposal, serving as the gateway to executing your Spark code.
To illustrate, consider a concise program that calculates the squares of the numbers from 1 to 10. This program showcases how easily you can transition across languages. Spark conceals the intricate inner workings of translating your code to run on the JVM and distributes the work across the cluster according to your Spark environment’s configuration. As a user, your core task is to solve problems by expressing your business logic in your preferred language; the Spark engine and your infrastructure administrators take care of the remaining intricacies.
Should you encounter performance challenges, Spark equips you with tools and settings to fine-tune various stages. You can collaborate with your team and administrators, conducting tests to enhance performance and consequently reduce computational costs. We’ll delve deeper into these options when we cover the topic of performance.
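As one small, hedged example of such a setting, the sketch below changes the number of shuffle partitions on an existing session; the value 8 is arbitrary and only meant to illustrate the API, not a recommendation.

```python
# Assumes `spark` is an existing SparkSession (e.g., the one created above).
# spark.sql.shuffle.partitions controls how many partitions Spark uses when
# shuffling data for joins and aggregations; 8 is an arbitrary illustrative value.
spark.conf.set("spark.sql.shuffle.partitions", "8")

print(spark.conf.get("spark.sql.shuffle.partitions"))
```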
You can find the code used below on GitHub at the following location:
Python
From your standalone Spark directory, open pyspark to access the Python console and start an interactive session.
Image Source by Author
Image Source by Author
I am using Spark version 3.2.4.
Image Source by Author
You can run commands in the console and monitor the DAG at http://localhost:4040; we will talk about the DAG later.
In the first line of code, I create a range of numbers from 1 to 10 and convert it into a DataFrame, which can be thought of as a table, similar to Excel with rows and columns. The DataFrame concept is not unique to Spark; if you have worked with R or Python, you have seen similar concepts.
Once the DataFrame is created, I apply the power function to the numbers and then call the action show(). It’s important to note that without calling an action (in this case show()), Spark will not perform any computations immediately. This is because of lazy evaluation: Spark postpones the actual execution of transformations until an action is called, optimizing the overall computation process. This laziness has many benefits, which we will discuss later.
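The screenshots above are not reproduced here, so the sketch below is a reconstruction based on the description rather than the exact code from the images; it assumes spark.range(), which yields a column named id, and the Spark SQL pow function.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pow

spark = SparkSession.builder.appName("squares-demo").getOrCreate()

# A DataFrame with the numbers 1 to 10 (spark.range is end-exclusive).
numbers_df = spark.range(1, 11)

# Transformation only: nothing runs yet because of lazy evaluation.
squares_df = numbers_df.withColumn("square", pow(col("id"), 2))

# The action show() triggers the actual computation and prints the result.
squares_df.show()
```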
Scala
Similarly, you can use bin/spark-shell to access the Scala console and start an interactive session.
Image Source by Author
The same program as in Python, calculating the squares of 1 to 10, this time in Scala.
Image Source by Author
SQL
Image Source by Author
Image Source by Author
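The SQL screenshots are not reproduced here; one way to express the same computation in SQL from the PySpark console is sketched below (an assumption about, not a copy of, the code in the images). It uses Spark SQL’s built-in range table-valued function.

```python
# The same squares computation expressed in SQL and run through spark.sql().
# range(1, 11) is a built-in table-valued function that produces a column `id`.
spark.sql("SELECT id, id * id AS square FROM range(1, 11)").show()
```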
Similarly, you can run your Java and R code.
I will be using PySpark (the Python API) as my language of choice in this series.
Let's examine the explain plans of these three programs: although the language APIs differ, internally Spark executes the same plan. We will look more closely at explain plans later in the series.
Image Source by Author
Image Source by Author
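If you would rather print the plan yourself than read it off the screenshots, you can call explain() on the DataFrame; this is a generic illustration of the API rather than the exact output shown above.

```python
# Print the physical plan for the squares DataFrame built earlier.
squares_df.explain()

# Spark 3.x also supports more detailed output, e.g. the formatted mode.
squares_df.explain("formatted")
```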
I know that I have introduced a lot of new terminology in this series and have intentionally deferred a closer examination of it. This was a deliberate choice, to ensure that every reader, whether they code in Spark every day or not, can take away a thing or two from these articles and gradually become familiar with the terminology, eventually building up to the concepts.
In case you missed the previous article, you can find it at this link.
https://arunadas.hashnode.dev/spark-series-1-why-spark
If establishing a connection interests you, you can find my LinkedIn profile at the following link: linkedin.com/in/arunadas29