Spark Series # 1 : Why Spark?

Image created by Aruna Das using AI

What is Big Data?

We are living in the data era. As a biological being, you are a significant source of big data, both internally and externally. Internally, you carry a multitude of minerals such as iron, zinc, calcium, phosphorus, magnesium, sodium, potassium, chloride, and sulfur, while various amino acids and proteins serve as your fundamental building blocks.

We are incredibly intricate chemical factories, with highly sophisticated motor capabilities and immense neuronal power. Heart rate readings, ECGs, blood tests, and DNA and RNA sequencing are a few examples of how this biological big data is collected. Medical science has made significant progress over the past decades in understanding human physiology. In my opinion, however, the unknowns still waiting to be explored far outnumber what we already understand.

Externally, we continuously generate data through our interactions with the world and with other people. Logging into email or social media accounts, paying at coffee shops and food courts, and using credit cards all generate data every day. Interacting with voice-controlled devices such as smart lights, fans, and TVs, or driving semi- or fully-autonomous cars, leaves behind a digital footprint as well.

Throughout history, humans and other living beings have been consistent data generators; this is not solely a product of the 21st century. Modern advancements have expanded the list of generators to include machines: software, IoT sensors, satellites, and more. Thanks to notable progress in the semiconductor and chip industries, capturing and storing data has become more convenient, affordable, and compact than ever before.

Around 2005, the top clock speed of high-end processors plateaued at roughly 4 GHz, and it has not increased significantly since. To see why this matters, remember that processors are built from transistors: electronic switches combined into logic gates, which in turn carry out arithmetic and complex logical operations. Before 2005, every increase in processor clock speed made applications faster essentially for free.

That free ride ended around 2005, because driving clock speeds higher generated more heat than densely packed chips could dissipate. To keep improving performance, the industry shifted to multi-core architectures, which gave rise to programming paradigms such as parallel processing and multithreading, and to frameworks like Hadoop. These innovations let software exploit multiple processor cores and whole clusters of machines, ultimately enhancing overall system throughput.
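
To make the shift concrete, here is a minimal Python sketch of the parallel-processing idea (purely illustrative, and not tied to any of the frameworks above): instead of waiting for a single faster core, the work is split across all available cores.

```python
# Minimal sketch of parallel processing on a multi-core machine (illustrative only).
# Pool starts one worker process per CPU core by default and splits the work among them.
from multiprocessing import Pool

def square(n):
    # Stand-in for CPU-bound work that each core can perform independently
    return n * n

if __name__ == "__main__":
    numbers = range(1_000_000)
    with Pool() as pool:
        results = pool.map(square, numbers)  # chunks of work run in parallel, order preserved
    print(sum(results))
```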

This data holds immense value. Companies can use it to profile individuals accurately and identify potential future customers. In government elections, data mining plays a significant role in understanding people's issues at a local level, and addressing those issues builds confidence by improving everyday lives.

Microblogging platforms have empowered ordinary individuals to quickly amplify relevant topics and bring about positive change within hours. The correct utilization of data has made this all possible.

Managing and analyzing such large amounts of information is a necessity that presents unique challenges, commonly summarized as the 5 Vs of big data: volume, value, variety, velocity, and veracity.

The 5 Vs of big data (image created by the author)

What is Hadoop?

Hadoop is a comprehensive framework that enables the storage and processing of massive datasets using cost-effective hardware within a distributed network.

It essentially combines two key components:

HDFS (Hadoop Distributed File System) for storage and MapReduce for processing.

MapReduce is responsible for performing computations on the extensive datasets residing in HDFS and extracting valuable insights from them. A job is expressed as a map phase, which transforms input records into key/value pairs, and a reduce phase, which aggregates those pairs into results; Hadoop runs both phases in parallel across the cluster.
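
As a rough illustration of the model (a local, single-process simulation sketched here for clarity, not code from Hadoop itself), a word count splits into a map step that emits key/value pairs, a shuffle that groups the pairs by key, and a reduce step that aggregates each group. On a real cluster, Hadoop runs many copies of the map and reduce steps in parallel over HDFS blocks.

```python
# Local simulation of the MapReduce word-count pattern (illustrative sketch only).
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word, 1                      # emit a (key, value) pair per word

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)              # group values by key
    return groups.items()

def reduce_phase(grouped):
    for key, values in grouped:
        yield key, sum(values)                 # aggregate each key's values

lines = ["spark makes big data simple", "big data needs big tools"]
print(dict(reduce_phase(shuffle(map_phase(lines)))))
```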

Apache Spark provides an alternative to MapReduce. Spark is a computing engine and a set of libraries for parallel data processing on a cluster of machines. It can process large-scale data stored in HDFS (among many other sources) and turn it into actionable results and valuable insights.
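
For comparison, here is a minimal PySpark sketch of the same word count. It assumes a working Spark installation, and the HDFS path is a placeholder rather than anything referenced in this article.

```python
# Minimal PySpark word count (sketch; the input path is a placeholder).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("why-spark-demo").getOrCreate()

lines = spark.read.text("hdfs:///data/sample.txt").rdd.map(lambda row: row[0])
counts = (lines.flatMap(lambda line: line.split())   # split each line into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```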

While MapReduce and Apache Spark are both distributed data processing frameworks, they have some key differences in terms of their architecture, performance, and features.

Here are some of the main differences between MapReduce and Spark:

Main differences between MapReduce and Spark (image created by the author)

Overall, Spark is the more widely adopted, modern, and feature-rich of the two frameworks. It offers better performance, flexible programming models, in-memory data processing, and a broader ecosystem for a wide range of data processing and analytics needs.
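
One place the in-memory model shows up directly in code is caching: a dataset that will be reused across several computations can be kept in memory after the first pass instead of being re-read from disk each time. A small hypothetical sketch (the path and column name are placeholders):

```python
# Illustrative sketch of Spark's in-memory reuse (path and column are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

events = spark.read.parquet("hdfs:///data/events").cache()  # keep in memory after first use

# Both queries reuse the cached data instead of re-reading it from HDFS.
print(events.count())
print(events.filter(events["status"] == "error").count())

spark.stop()
```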

However, MapReduce is still widely used, particularly for large-scale batch-processing tasks in Hadoop environments.
