Benchmarking Pandas vs. Spark

Tri Juhari
4 min read · Apr 17, 2021


Hello everyone, on this occasion I would like to share some insights about benchmarking Pandas against Spark. Apache Spark has become the de facto unified analytics engine for processing large volumes of data (big data) in a distributed environment. Yet more and more users are choosing to run Spark on a single machine, often a laptop, to process small to large datasets, rather than on a large Spark cluster. This choice is mainly made for the following reasons:

  1. A single, unified API that scales from processing small data on a laptop to big data on a cluster.
  2. Support for many popular programming languages such as Python, R, Scala, and Java, along with ANSI SQL support.
  3. Integration with PyData libraries, for example the Pandas library through Pandas user-defined functions (UDFs), as sketched below.
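
To illustrate the third point, here is a minimal sketch of a Pandas UDF (vectorized UDF) in PySpark. The Parquet path and the tax factor are assumptions made for the example, not part of the original benchmark, and pyarrow must be installed.

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.master("local[*]").getOrCreate()

@pandas_udf(DoubleType())
def add_tax(price: pd.Series) -> pd.Series:
    # Ordinary Pandas code, applied to batches of rows in parallel across Spark tasks.
    return price * 1.1

df = spark.read.parquet("store_sales.parquet")  # hypothetical path
df.select(add_tax(df["ss_list_price"])).show(5)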

While the above may be obvious, users are often surprised to find that:

  1. Installing Spark on a single node requires no configuration (just download it and run it).
  2. Spark can often be faster than other PyData tools because it runs work in parallel across all cores.
  3. Spark has lower memory consumption and can process more data than fits in a laptop's memory, because it does not load all of the data into memory before processing it. Other PyData tools have to load the full dataset first, which can require a very large amount of memory. A rough sketch of this difference follows.
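
The sketch below contrasts the two approaches on the same aggregation; the Parquet path is a placeholder. Spark only materializes what the query needs, while Pandas loads the whole table before it can start.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Spark builds a lazy plan and processes the file in partitions when the
# aggregation runs, so the full table never has to sit in memory at once.
spark_max = spark.read.parquet("store_sales.parquet").agg({"ss_list_price": "max"}).collect()

# Pandas materializes the entire table in memory before the aggregation can start.
pandas_max = pd.read_parquet("store_sales.parquet")["ss_list_price"].max()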

Configuring Apache Spark on a Laptop

wget http://www-us.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
tar -xvf spark-2.3.0-bin-hadoop2.7.tgz
cd spark-2.3.0-bin-hadoop2.7
bin/pyspark

Spark is also available on PyPI, Homebrew, and Conda, and can be installed with one of the following commands:

pip install pyspark

brew install apache-spark
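
After a pip install, a local session can also be started directly from Python. The short check below is just a sanity test and is not part of the original benchmark.

from pyspark.sql import SparkSession

# "local[*]" runs Spark on the current machine using all available cores;
# no cluster or extra configuration is needed.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("single-node-check")
    .getOrCreate()
)

print(spark.range(1000).count())  # should print 1000
spark.stop()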

PySpark vs. Pandas

The dataset used in this benchmark is the “store_sales” table, which consists of 23 columns of Long/Double data types. The benchmark uses three common SQL queries to compare Spark and Pandas on a single node:

Query 1. SELECT max(ss_list_price) FROM store_sales
Query 2. SELECT count(distinct ss_customer_sk) FROM store_sales
Query 3. SELECT sum(ss_net_profit) FROM store_sales GROUP BY ss_store_sk
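
For reference, here is a sketch of how the three queries might be run with both engines. The Parquet path is a placeholder and the snippet is illustrative, not the exact benchmark harness.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.read.parquet("store_sales.parquet").createOrReplaceTempView("store_sales")

# Spark SQL versions of the three queries.
q1 = spark.sql("SELECT max(ss_list_price) FROM store_sales")
q2 = spark.sql("SELECT count(distinct ss_customer_sk) FROM store_sales")
q3 = spark.sql("SELECT sum(ss_net_profit) FROM store_sales GROUP BY ss_store_sk")
for q in (q1, q2, q3):
    q.show()

# Rough Pandas equivalents (the whole table must fit in memory first).
df = pd.read_parquet("store_sales.parquet")
r1 = df["ss_list_price"].max()
r2 = df["ss_customer_sk"].nunique()
r3 = df.groupby("ss_store_sk")["ss_net_profit"].sum()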

To demonstrate this, we first measured the maximum data sizes (Parquet and CSV) that Pandas can load on a single node with 244 GB of memory, and then compared the performance of the three queries.

Setup and Configuration

Hardware

Using a virtual machine with the following specifications:

  • CPU core count: 32 virtual cores (16 physical cores), Intel Xeon CPU E5-2686 v4 @ 2.30 GHz
  • System memory: 244 GB
  • Total local disk space for shuffle: 4 x 1900 GB NVMe SSD

Software

  • OS: Ubuntu 16.04
  • Spark: Apache Spark 2.3.0 in local cluster mode
  • Pandas version: 0.20.3
  • Python version: 2.7.12

Scalability Testing

Pandas requires a large amount of memory to load the dataset at scale factors 10 through 270, using pandas and pyarrow. The memory footprint grows roughly linearly with dataset size: the larger the dataset, the more memory is needed to load it.
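
One simple way to see this in practice is to load a Parquet file with Pandas and report its in-memory size. This is only a sketch with a placeholder path, not the original measurement.

import pandas as pd

df = pd.read_parquet("store_sales.parquet")  # loads the whole table into RAM (via pyarrow)
mem_gb = df.memory_usage(deep=True).sum() / 1024.0 ** 3
print("In-memory size: {:.1f} GB".format(mem_gb))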

Performance Testing

The performance test was performed by running the SQL queries above on the “store_sales” table (scale factors 10 to 260) in Parquet format. PySpark ran in local cluster mode with 10 GB of memory and 16 threads. The observation is that as the input data size grows, PySpark delivers better performance even with its limited memory, while Pandas crashes when loading a dataset larger than 39 GB.
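
A sketch of that kind of setup is shown below: local mode with 16 threads and 10 GB of memory, timing one of the queries. The path and configuration details are assumptions for illustration.

import time
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[16]")                   # 16 worker threads on one machine
    .config("spark.driver.memory", "10g")  # cap the single-node memory at 10 GB
    .getOrCreate()
)
spark.read.parquet("store_sales.parquet").createOrReplaceTempView("store_sales")

start = time.time()
spark.sql("SELECT max(ss_list_price) FROM store_sales").collect()
print("Query 1 took {:.1f} s".format(time.time() - start))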

Because it executes work in parallel across all of its cores, PySpark is faster than Pandas in these tests, even when PySpark does not cache its data before running the queries.
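
If the same data were queried repeatedly, it could optionally be cached first. The lines below sketch that optional step (reusing the session from the previous sketch); the benchmark above did not use it.

sales = spark.read.parquet("store_sales.parquet")
sales.cache()   # mark the DataFrame for in-memory caching
sales.count()   # an action that materializes the cache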

Conclusion

For single-node analytics on large datasets, Spark produces faster runtimes than Pandas.

References:

https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html
