Spark 3.0 is out, and there are a ton of improvements! But there is one nice improvement that is not yet highlighted in the announcement post: filter pushdown for CSV files.
Prior to Spark 3.0, when you loaded a CSV file, the whole file was read into memory and the filter was applied afterwards, which wastes CPU cycles and bandwidth. Now the data can be filtered as the files are read. This is similar to filter pushdown in Parquet, but now for CSV files.
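To make the difference concrete, here is a toy sketch in plain Scala (no Spark needed). It is not Spark's implementation: the `Flight` case class, the sample rows, and the `conversions` counter are all made up for illustration. The idea it models is that checking the predicate on the filter column first avoids converting rows that would be thrown away anyway.

```scala
// Toy model of filter pushdown for CSV; NOT Spark's actual code.
// Flight, the sample rows, and the conversion counter are illustrative assumptions.
object CsvPushdownSketch {
  final case class Flight(origin: String, destination: String, delayMinutes: Int)

  val lines: Seq[String] = Seq("UA,SFO,12", "AA,JFK,3", "UA,ORD,45", "DL,ATL,0")

  // Counts how many rows were fully converted into Flight objects.
  var conversions: Int = 0

  private def convert(fields: Array[String]): Flight = {
    conversions += 1
    Flight(fields(0), fields(1), fields(2).toInt)
  }

  // Without pushdown: convert every row, then filter.
  def readThenFilter(origin: String): Seq[Flight] =
    lines.map(line => convert(line.split(","))).filter(_.origin == origin)

  // With pushdown: test the filter column first, convert only matching rows.
  def filterWhileReading(origin: String): Seq[Flight] =
    lines.map(_.split(",")).filter(fields => fields(0) == origin).map(convert)

  def main(args: Array[String]): Unit = {
    conversions = 0
    val eager = readThenFilter("UA")
    val eagerConversions = conversions

    conversions = 0
    val pushed = filterWhileReading("UA")
    val pushedConversions = conversions

    println(s"same result: ${eager == pushed}")
    println(s"rows converted without pushdown: $eagerConversions")
    println(s"rows converted with pushdown: $pushedConversions")
  }
}
```

Both functions return the same rows, but the second one only converts the rows that pass the filter. On a large file with a selective filter, that skipped work is where the savings would come from.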
Here is a quick example: I load a CSV file (flights dataset from Kaggle), then filter by ORIGIN_AIRPORT, then print out the execution plan.
val df = spark.read.option("header", "true")
  .csv("flights.csv") // placeholder path to the flights CSV
  .filter($"ORIGIN_AIRPORT" === "UA")
df.explain()
With Spark 3.0, you would see a plan like the one in the screenshot below. The main highlight is the new DataFilters entry in the plan, which describes the filters.
I also ran a quick benchmark and did see a performance improvement.
Spark 2.4: Time to count a 7 GB CSV file with filter: 37,240.61 ms
Spark 3.0: Time to count a 7 GB CSV file with filter: 31,579.83 ms
The difference is about 5,661 ms, an improvement of roughly 15%.
If I remove the filter, both versions take almost the same time, about 13,209.25 ms.
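For anyone who wants to repeat this kind of measurement, here is a minimal sketch of a timing helper together with the improvement arithmetic. The names `timeMs` and `relativeImprovement` are hypothetical helpers, not Spark APIs, and this is not necessarily how the numbers above were collected.

```scala
// Sketch of a timing helper and the improvement arithmetic.
// timeMs and relativeImprovement are hypothetical helper names, not Spark APIs.
object BenchmarkSketch {
  // Runs the block once and returns its result with the elapsed wall-clock time in ms.
  def timeMs[A](block: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (result, elapsedMs)
  }

  // Fraction of the baseline time that was saved.
  def relativeImprovement(baselineMs: Double, improvedMs: Double): Double =
    (baselineMs - improvedMs) / baselineMs

  def main(args: Array[String]): Unit = {
    val spark24Ms = 37240.61 // Spark 2.4 measurement from this post
    val spark30Ms = 31579.83 // Spark 3.0 measurement from this post
    val savedMs = spark24Ms - spark30Ms
    val pct = relativeImprovement(spark24Ms, spark30Ms) * 100
    println(f"saved $savedMs%.2f ms ($pct%.1f%%)")
  }
}
```

In a Spark shell, something like `timeMs(df.count())` would time a single run. A single run is noisy, so averaging several runs would give a more trustworthy number.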
And the nicest thing is that you don’t need to change anything: this optimization just works when you switch to Spark 3.0!
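If you want to compare both behaviors on the same Spark 3.0 installation, the pushdown is controlled by the `spark.sql.csv.filterPushdown.enabled` setting, which is on by default. Below is a minimal sketch, assuming a local session and a placeholder `flights.csv` with a header row. The object name and the path are illustrative, not taken from the benchmark above.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: run the same filtered count with CSV filter pushdown on and off.
// The file path and app name are placeholders.
object CsvPushdownToggle {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-pushdown-toggle")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sets the flag, then counts the rows matching the filter.
    def filteredCount(pushdownEnabled: Boolean): Long = {
      spark.conf.set("spark.sql.csv.filterPushdown.enabled", pushdownEnabled.toString)
      spark.read
        .option("header", "true")
        .csv("flights.csv") // placeholder path
        .filter($"ORIGIN_AIRPORT" === "UA")
        .count()
    }

    println(s"pushdown on:  ${filteredCount(pushdownEnabled = true)}")
    println(s"pushdown off: ${filteredCount(pushdownEnabled = false)}")
    spark.stop()
  }
}
```

Both calls should return the same count, since the setting only changes where the filter is applied. Timing each call is what would show the effect.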