Slow Reads for S3 Files in Pandas & How to Optimize it

Sneha Ghantasala
Thomson Reuters Labs
4 min read · Feb 17, 2023


As part of one of our projects, we needed to load pickle files in the range of tens of GB from AWS S3 in AWS SageMaker endpoints (the private subnet is configured with S3 VPC endpoints). When using the pandas library to do this, passing the S3 URI of the file as an argument takes 16 min, but downloading the file first and then passing the file object takes 3 min. Why?

Our Scenario
We are using pandas to read a large pickle file (around 12 GB) from AWS S3. We are trying to understand why pandas shows such a significant performance difference when reading the same file, depending on how the argument is passed.

Our Experiments
Let’s deep dive into our two ways of loading the pickle data from S3 to investigate the reason behind the performance gap. To replicate these experiments, you would need to match the library versions we used.

Please note that the code tests were run in the same environment with the same file, multiple times — restarting kernels, changing kernels, and changing the order.

Test 1
Read the pickle file from S3 using the pandas read_pickle function, passing the S3 URI.
Time taken: ~16 min

import pandas as pd
import time

start_time = time.time()

# Pass the S3 URI directly; pandas opens the object through s3fs/fsspec.
s3_uri = "{s3_uri}"
pq_file_data = pd.read_pickle(s3_uri)

end_time = time.time()
total_time = end_time - start_time
print(total_time)

# 943.84 sec

Test 2
Get the pickle file from S3 using boto3 and pass the returned object directly to the pandas read_pickle function.
Time taken: ~3 min

import pandas as pd
import time
import boto3

start_time = time.time()

s3_bucket_name = "{s3_bucket_name}"
filename = "{pickle_filename_s3_path}"

# Fetch the object with boto3 and pass its streaming body to pandas.
s3 = boto3.resource("s3")
file_obj = s3.Bucket(s3_bucket_name).Object(filename).get()["Body"]
pq_file_data = pd.read_pickle(file_obj)

end_time = time.time()
total_time = end_time - start_time
print(total_time)

# 178.28 sec

Passing the S3 URI takes ~16 min, while fetching the file with boto3 and passing the object directly takes ~3 min. Passing the URI is about 5x slower.

Initial Investigation Steps

  1. Reading pandas documentation and going through the source code.
  2. Using Python’s built-in profiler to understand where the time goes (see the sketch after this list). We could also build a graph from the profiler’s output; a few such graphs are attached in the Appendix section, where the red blocks are the most time-consuming ones.
  3. Using a debugger to understand the code flow for the two code tests.
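
For reference, here is a minimal sketch of such a profiling run using Python’s built-in cProfile module; the S3 URI and output file name are placeholders, and tools such as snakeviz or gprof2dot can turn the dumped stats into a call graph.

import cProfile
import pstats

import pandas as pd

s3_uri = "{s3_uri}"

# Profile the slow read (Test 1) to see which calls dominate.
profiler = cProfile.Profile()
profiler.enable()
pd.read_pickle(s3_uri)
profiler.disable()

# Print the 20 most expensive calls by cumulative time and dump the raw
# stats so they can later be rendered as a graph.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(20)
stats.dump_stats("read_pickle_s3_uri.prof")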

Initial Findings

  1. Test 1 streams the file in chunks, while Test 2 downloads the entire file and then passes it, so some performance gap is expected (i.e. Test 1 will take longer than Test 2). But the observed difference is quite high: is a 5x gap justified?
  2. The source code shows that pandas internally calls the pickle library (Python’s built-in library) to read pickle files (at least in our scenario).
  3. For Test 1, pandas uses the s3fs library to establish a connection to S3 and read the file (a simplified sketch of this path follows this list).
  4. The profiler graphs for Code Test 2 suggest that the read spends a lot of time in the ‘acquire’ method of ‘_thread.lock’ objects. This doesn’t help explain the large performance gap.
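
Putting findings 2 and 3 together, Test 1 is roughly equivalent to the following simplified sketch (not pandas’ exact code path): s3fs opens the object as a file-like handle that streams blocks from S3, and pickle reads from that handle.

import pickle

import s3fs

s3_uri = "{s3_uri}"

# s3fs opens the object as a file-like handle that fetches blocks on demand
# instead of downloading the whole file up front.
fs = s3fs.S3FileSystem()
with fs.open(s3_uri, mode="rb") as f:
    pq_file_data = pickle.load(f)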

Further Investigation Steps

  1. Reading the documentation and reviewing the source code of the pickle and s3fs libraries.
  2. Identifying which parameters affect the read performance and experimenting with them.

Our Final Findings

  1. S3fs implements fsspec and pandas uses the fsspec interface to access file systems (in our case S3).
  2. The parameters default_block_size and default_cache_type of s3fs affect the read performance. At the time of writing this blog, s3fs had default_cache_type set to bytes and default_block_size set to 5*2**20 bytes (5 MB). We can override them via pandas; below is the code snippet for doing so.
import pandas as pd

pd.read_pickle(
    "{s3_uri}",
    storage_options={"default_block_size": {block_size}, "default_cache_type": "{cache_type}"},
)
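
The storage_options dictionary is forwarded to the s3fs.S3FileSystem constructor, so the pandas call above is roughly equivalent to configuring the filesystem directly. The block size and cache type below are illustrative values, not recommendations.

import pickle

import s3fs

# Illustrative values only: a larger block size and the readahead cache type.
fs = s3fs.S3FileSystem(
    default_block_size=64 * 2**20,   # e.g. 64 MB blocks instead of the 5 MB default
    default_cache_type="readahead",
)
with fs.open("{s3_uri}", mode="rb") as f:
    pq_file_data = pickle.load(f)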

Here’s a table outlining the experiments done with different configurations obtained by modifying the values for those parameters.

The experiments suggest that changing the default_cache_type to readahead gives a good read performance improvement, and a GitHub issue for this was raised in the s3fs GitHub repository.

The experiment results were acknowledged by the s3fs developers, and the default_cache_type has since been changed to readahead.

Note: The experiments weren’t performed multiple times with the same parameters for most of the configurations, so the read times can vary by a few seconds.

The cache_type parameter actually refers to the buffering scheme used while reading the file and the block_size parameter is used to configure the buffer size as mentioned here. For more information about the different cache types, you could refer to this.

Conclusion
Pandas actually has nothing to do with the read performance of the files. It’s s3fs!

We can configure the required parameters via Pandas for better performance. It may not be easy to know which one would be optimal — experiments like the one above may be required.

Choosing a cache type and block size involves trade-offs. Here are some observations:

Did you know the ‘all’ cache type is faster than downloading the full file and passing it to read_pickle? However, it requires about 2x the memory, since it holds the full file in memory while unpickling.

And finally, the ‘readahead’ cache’s memory requirement is much closer to the actual file size, but it is ~1.75x slower (given an optimal block_size).
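
If you want to reproduce this kind of comparison, a rough approach is to record the wall-clock time and the process’s peak memory around the read, for example with the standard-library resource module on Linux (where ru_maxrss is reported in kilobytes). The URI and parameter values below are placeholders.

import resource
import time

import pandas as pd

start_time = time.time()
pq_file_data = pd.read_pickle(
    "{s3_uri}",
    storage_options={
        "default_block_size": 64 * 2**20,   # illustrative value
        "default_cache_type": "readahead",  # swap in "all" to compare
    },
)
total_time = time.time() - start_time

# ru_maxrss is the peak resident set size; on Linux it is reported in KB.
peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
print(f"read time: {total_time:.1f} s, peak memory: {peak_mb:.0f} MB")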

Thank you!

Appendix

Profiler Output Graph: Code Test 1
Profiler Output Graph: Code Test 2
