Spark Series #3 : Architecture of Spark


Julia Morgan, a renowned American architect and engineer, eloquently captured the essence of architectural expression. While she spoke in the context of physical building architecture, I find her sentiment resonating deeply within the domain of software architecture as well. In both scenarios, be it constructing edifices or crafting software systems, a robust foundation, effective functionality, and a clear purpose are pivotal.

I think the creators of Spark did a fantastic job architecting the product. They clearly understood the pain points of distributed frameworks and addressed them in the core Spark engine. The rest, such as support for multiple languages and dedicated libraries for machine learning, SQL, and graph processing, are embellishments that fall into place naturally once the foundation, or core, is strong.

Resilient Distributed Datasets (RDDs) serve as the foundational components of Spark, providing a distributed memory abstraction that empowers programmers to execute in-memory computations on extensive clusters while maintaining fault tolerance. By leveraging RDDs, Spark surpasses existing models, exhibiting performance enhancements of up to 100 times in multi-pass analytics.

RDDs cater to two specific types of applications within computing frameworks:

Iterative algorithms: RDDs excel in supporting iterative algorithms, enabling efficient processing by preserving data in memory. This capability significantly enhances performance, making Spark an ideal choice for applications that involve repetitive computations.

Interactive data mining tools: RDDs also prove beneficial for interactive data mining tools, as they facilitate quick data access and manipulation by storing it in memory. This approach leads to improved responsiveness and agility in performing data exploration and analysis tasks.

RDDs play a crucial role in Spark’s success, allowing programmers to harness the power of distributed memory and achieve exceptional performance gains in both iterative algorithms and interactive data mining applications.

Image by Author

Source research paper: people.csail.mit.edu/matei/papers/2012/nsdi..
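To make the iterative use case concrete, here is a minimal PySpark sketch of the pattern RDDs were designed for. It assumes a local[*] master and a hypothetical comma-separated points.txt file; the point is simply that the parsed data is cached in memory once and reused across passes instead of being re-read from disk each time.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-iteration").getOrCreate()
sc = spark.sparkContext

# Parse the input once and keep it in memory for every subsequent pass.
points = sc.textFile("points.txt") \
           .map(lambda line: [float(x) for x in line.split(",")]) \
           .cache()

# Each iteration reuses the cached RDD instead of re-reading the file.
for i in range(10):
    total = points.map(lambda p: sum(p)).reduce(lambda a, b: a + b)
    print(f"pass {i}: running total = {total}")

spark.stop()
```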

Although DataFrames and Datasets (the higher-level structured APIs) took prominence over RDDs (the lower-level API) with the shift that occurred around Spark 2.0 in 2016, RDDs have not been phased out. They remain the foundational low-level abstraction on which Spark is built. While your direct interaction with RDDs might have reduced as a user, it’s important to recognize that behind the scenes, RDDs are still the driving force orchestrating the operations.
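A quick PySpark sketch of that relationship, assuming a local[*] master: every DataFrame exposes the RDD it is built on through its .rdd attribute, which is one small window into the low-level layer working behind the scenes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("df-over-rdd").getOrCreate()

df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])

# The same data viewed through the low-level API Spark uses internally.
print(df.rdd.take(2))               # a list of Row objects
print(df.rdd.getNumPartitions())    # how the data is split for parallelism

spark.stop()
```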

Spark application architecture

Image by Author

In a distributed setup, your data is distributed across various worker nodes in an HDFS (Hadoop Distributed File System) setup. Spark provides a framework to coordinate work among these worker nodes.

Now, you have several options for cluster management. If you’re not dealing with a distributed network at all, perhaps you just want to run Spark on your laptop or a single machine, you can use local mode, which needs no separate cluster manager. Spark also ships with its own simple standalone cluster manager as part of the Apache Spark package, useful when you want a small dedicated cluster without installing anything else.

On the other hand, when you’re dealing with a distributed setup in a production environment, where your data, whether millions or billions of rows depending on your use case, is spread across many machines and adds up to gigabytes or petabytes, you have several alternatives: YARN, Mesos, or Kubernetes can serve as the cluster manager. Which one you use is simply a configurable setting.
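As a hedged sketch of how the cluster manager really is just a setting, the snippet below builds the same application against different masters. The host names and ports are illustrative placeholders, and in practice the YARN and Kubernetes masters are usually supplied through spark-submit rather than hard-coded in the application.

```python
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("cluster-manager-demo")

# Pick one, depending on where the application should run:
builder = builder.master("local[*]")                       # single machine, threads
# builder = builder.master("spark://master-host:7077")     # Spark standalone cluster
# builder = builder.master("yarn")                         # Hadoop YARN
# builder = builder.master("k8s://https://k8s-api:6443")   # Kubernetes

spark = builder.getOrCreate()
print(spark.sparkContext.master)
spark.stop()
```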

A Spark application consists of two main components:

  1. Driver Process: The driver process is the heart or brain (whichever association fits better for you) of the Spark application. It runs the main() function of the application and creates the SparkSession (from Spark 2.0 onwards, SparkSession is the entry point). It resides on a node in the cluster.

The driver process assumes the following primary roles:

  1. Retaining essential details concerning the Spark application.

  2. Reacting to the commands and instructions of the user’s program.

  3. Analyzing the work required by the user’s program.

  4. Distributing that work across the executors.

  5. Scheduling tasks on the executors.

2. Executor Processes: Executors are responsible for carrying out the actual work assigned by the driver.

Executors have two main objectives:

  1. Performing the tasks assigned by the driver.

  2. Reporting back the execution state of those tasks to the driver.

To put it simply, within Apache Spark both the driver and the executors can be understood as distinct operating units. They are responsible for carrying out work and can run either on a single machine or across many machines. In local mode, the driver and executors run as threads on your own computer, in contrast to the typical setup where they are distributed across the machines of a cluster. This local-mode configuration is commonly used during development and testing, since it makes debugging and experimentation easier.
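Here is a minimal local-mode sketch of that split. Everything outside the lambda runs in the driver process, while the lambda passed to map() is shipped to the executors, which in local[4] are simply four worker threads on your machine.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[4]") \
    .appName("driver-executor-demo") \
    .getOrCreate()

# The driver builds the plan...
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=4)

# ...and the executors run the tasks, reporting results back to the driver.
print(rdd.map(lambda x: x * x).sum())

spark.stop()
```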

Some important takeaways from our discussion are:

  1. Each application is assigned its own executor processes.

  2. Executor processes remain active throughout the application’s lifespan.

  3. Executors execute tasks using multiple threads.

  4. This arrangement isolates applications from each other.

  5. Isolation occurs in terms of scheduling (each driver schedules its tasks) and executors (tasks from different apps run in separate JVMs).

  6. This setup prevents seamless data sharing between Spark applications (different SparkSession instances) without storing data externally (see the sketch below).
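A hedged sketch of that last point: two independent SparkSessions (normally two separate applications) can only hand data to each other through external storage. The /tmp/shared_events path below is just an illustrative location.

```python
from pyspark.sql import SparkSession

# Application A writes its result to a shared location...
spark_a = SparkSession.builder.master("local[*]").appName("producer").getOrCreate()
spark_a.createDataFrame([(1, "click"), (2, "view")], ["id", "event"]) \
       .write.mode("overwrite").parquet("/tmp/shared_events")
spark_a.stop()

# ...and application B (a different session, usually a different process)
# reads it back from that location.
spark_b = SparkSession.builder.master("local[*]").appName("consumer").getOrCreate()
spark_b.read.parquet("/tmp/shared_events").show()
spark_b.stop()
```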

Executors primarily execute the computational tasks specified by your Spark code, doing the actual processing of data. The driver, on the other hand, can be written in any of the language APIs Spark provides, such as Python or Scala; it orchestrates the overall execution by coordinating tasks and managing the flow of data.

In this framework, the cluster manager plays a pivotal role in monitoring and managing resource allocation. It ensures that computational resources like memory and processing power are efficiently distributed among the various components, including the driver and multiple executors. By overseeing this allocation, the cluster manager optimizes the utilization of the underlying hardware and facilitates the smooth execution of Spark applications, ultimately contributing to enhanced performance and scalability.

I’d like to highlight a point that confused me for a while: the term “driver” is used in two distinct contexts in Spark. In the context of a Spark application, the “driver” is the process responsible for maintaining the application’s state as it runs on the cluster. In the cluster manager’s context, the “driver” (also referred to as the “master”) means something different: there it refers to a physical machine managed by the cluster manager, not a process.

You have three primary execution modes in Spark:

  1. Cluster Mode: Predominant in production environments, this mode involves the cluster manager launching the driver process on a worker node within the cluster.

  2. Client Mode: Similar to cluster mode, but with the driver residing on the client machine that submitted the application.

  3. Local Mode: Particularly suitable for testing and experimentation, this mode involves the entire Spark application operating on a single machine. Parallelism is achieved through the use of threads.

The following features in Spark architecture contribute to its efficiency and optimization:

  1. Lazy Evaluation: Spark operates lazily, delaying transformations until an action is invoked. This allows Spark to optimize the execution by analyzing the entire workflow before initiating any operations.

  2. Directed Acyclic Graph (DAG): Transformations in Spark are recorded as a DAG of lineage information, which enables fault tolerance: lost RDD partitions can be recomputed from their parent RDDs rather than re-running the whole job.

  3. Immutable Data: Data in Spark is immutable; RDDs are never modified in place. Instead, every transformation produces a new RDD. This immutability ensures data integrity and simplifies parallel processing.

  4. Optimizer (Catalyst): Spark incorporates an in-built optimizer called Catalyst, which optimizes the execution plan of the DAG. This optimization step enhances performance and resource utilization.

  5. Data Persistence: Spark provides the ability to cache data in memory or on disk, enabling persistence options. This caching mechanism enhances data access and processing speed.

  6. Partitioning and Parallelism: Spark leverages partitioning to divide data across nodes, facilitating parallelism. This parallel execution across nodes contributes to faster execution times in Spark.

These features collectively make Spark a highly efficient and optimized framework for big data processing. We will look into these features closely in the upcoming Spark series.
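Before moving on, here is a small PySpark sketch, assuming a local[*] master, that ties a few of these features together: the transformations are lazy, explain() shows the plan Catalyst produced from the DAG, and repartition() and cache() control parallelism and persistence.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()

df = spark.range(1_000_000)                      # no work happens yet
filtered = df.filter(F.col("id") % 2 == 0)       # still lazy: only the DAG grows
doubled = filtered.withColumn("twice", F.col("id") * 2)

doubled.explain()                                # inspect the optimized plan

doubled = doubled.repartition(8).cache()         # partitioning + in-memory persistence
print(doubled.count())                           # the action that triggers execution
print(doubled.count())                           # the second action reads from the cache

spark.stop()
```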

Now, let’s swiftly delve into some of the supplementary features that I alluded to initially — extra functionalities that come bundled with the Spark core.

Spark Components

Spark Streaming

With Spark Structured Streaming, you can utilize the familiar structured APIs (DataFrames and Datasets) of Spark. This eliminates the need to develop and maintain separate technology stacks for batch and streaming processing. Furthermore, the unified APIs simplify the migration of existing batch Spark jobs to streaming jobs.

Spark Structured Streaming abstracts away complexities such as incremental processing, checkpointing, and watermarks. This allows you to build streaming applications and pipelines without the need to learn new concepts or tools.
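As a small, self-contained sketch of that unified API, the example below uses the built-in rate source (which just generates timestamped rows) so nothing external is needed; the aggregation is ordinary DataFrame code applied to a stream.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("stream-demo").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# The same DataFrame operations you would use in a batch job.
counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

query = counts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination(30)   # let it run for roughly 30 seconds
query.stop()
spark.stop()
```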

SQL and Dataframes

Spark SQL combines a cost-based optimizer (Catalyst Optimizer), columnar storage, and code generation to ensure speedy query execution. It seamlessly scales to handle large-scale queries across thousands of nodes while maintaining fault tolerance. You can leverage Spark’s engine for historical data processing without the need for a separate engine.

Spark SQL enables the seamless integration of SQL queries with Spark programs, offering flexibility in data processing.

In addition, since the Spark 2.0 release, Spark SQL supports HiveQL syntax, Hive SerDes, and UDFs, enabling access to existing Hive warehouses.

For connectivity with business intelligence tools, Spark SQL provides standard JDBC and ODBC connectivity.
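A minimal sketch of mixing the two styles, assuming a local[*] master: register a DataFrame as a temporary view, query it with plain SQL, and get a DataFrame back.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

sales = spark.createDataFrame(
    [("north", 100), ("south", 250), ("north", 175)],
    ["region", "amount"],
)
sales.createOrReplaceTempView("sales")

# The SQL result is itself a DataFrame, so the two styles compose freely.
spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
""").show()

spark.stop()
```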

GraphX

GraphX provides seamless integration of graph and collection operations. It offers a unified system that combines ETL (Extract, Transform, Load), exploratory analysis, and iterative graph computation. With GraphX, you can work with data as both graphs and collections, efficiently transform and join graphs with RDDs (Resilient Distributed Datasets), and develop custom iterative graph algorithms using the Pregel API.

MLlib

MLlib is compatible with Java, Scala, Python, and R programming languages. It seamlessly integrates with Spark’s APIs and supports interoperability with NumPy in Python (since Spark 0.9) and R libraries (since Spark 1.5). MLlib can utilize various Hadoop data sources such as HDFS, HBase, and local files, making it effortless to integrate into existing Hadoop workflows.
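To give a feel for the API, here is a small, self-contained sketch of MLlib’s DataFrame-based pipeline in Python; the tiny inline dataset is purely illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.master("local[*]").appName("mllib-demo").getOrCreate()

train = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.5, 0.8, 0.0), (0.9, 0.1, 1.0)],
    ["f1", "f2", "label"],
)

# Assemble raw columns into a feature vector, then fit a classifier.
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])

model = pipeline.fit(train)
model.transform(train).select("f1", "f2", "label", "prediction").show()

spark.stop()
```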

Third-Party Libraries

Spark-packages.org serves as an external platform managed by the community, offering a comprehensive catalog of third-party libraries, add-ons, and applications that are compatible with Apache Spark. For a complete list please visit Apache Spark's official website.

Source — spark.apache.org

I’ve intentionally reiterated numerous concepts in multiple ways to ensure they become ingrained in your understanding. While it’s essential for you to grasp these terminologies and distributed computing concepts, I don’t want you to feel overwhelmed by them. I believe that if your infrastructure is appropriately aligned with your data volumes and configured with default executors, memory, and other crucial parameters, the majority of your batch jobs or Spark applications (whichever term you prefer) should operate seamlessly.

However, it’s worth noting that you may encounter unique scenarios like long-running jobs or instances where previously functioning jobs now fail due to high volumes or insufficient memory. This is where performance tuning comes into play, requiring a closer examination of what’s happening behind the scenes. When you find yourself in such situations, these concepts will serve as valuable tools to troubleshoot your batch jobs effectively.

In case you missed the previous article, you can find it at this link.

arunadas.hashnode.dev/spark-series-2-evolut..

If establishing a connection interests you, you can find my LinkedIn profile at the following link: linkedin.com/in/arunadas29
