Spark Series #7: Data Ingestion From Modern Files

We will continue to look into file ingestion , modern files like parquet, ORC, Avro. We already discuss the different file types including these modern ones with their comparative study if you missed it here is the link -

https://hashnode.com/post/clm9fbk8t00010akucg8q8mfr

For this exercise purpose I am using below flow:

Image by Author

I used a util spark code to convert csv files into avro, ORC and parquet for this exercise purpose and then again spark code to read those files and display on console. All the code can be found in the github location -

https://github.com/arunadas/Spark-Series

These modern file formats address the shortcoming of existing file format specially in splitting , compression , schema evaluation, row vs column preference in distributed computing set up. In Distributed computing set up with big data you will find these file formats because of their efficient storage and processing, which are crucial for keeping the cost under control from both processing(reading) and storage cost involved in these systems.

Parquet Parser: Spark natively supports the Parquet file format, which is a columnar storage format optimized for big data processing. Spark’s Parquet parser enables efficient reading and writing of Parquet files

Image by Author

Output

Image by Author

2. ORC Parser: Spark also supports the ORC (Optimized Row Columnar) file format, which is another columnar storage format optimized for analytics workloads. The ORC parser in Spark allows you to read and write ORC files.

Image by Author

Output

Image by Author

3. Avro Parser: Spark provides support for the Avro data serialization format. The Avro parser allows you to read and write Avro data, which is self-describing and schema-based.

To ingest avro file You will need to add a library , as Avro is not natively supported by spark. I am using this version org.apache.spark:spark-avro_2.12:3.2.1 once you installed it the parser will work but you will have to pass the package as argument to your spark-submit.

bin/spark-submit — packages org.apache.spark:spark-avro_2.12:3.2.1 /home/aruna/spark/app/article/SparkSeries/SparkSeries5.1/avroToDataframe.py

Image by Author

Output

Image by Author

Spark Series #7: Data Ingestion From Modern Files

Output

Output

Output

Did you find this article valuable?