
Data-Management Case: too many CSV files -- Part 1

I have created the following dataset: I have 12 process methods that I run on 277,827 raw-data files (broken into two uneven sets). The output-data columns vary by method, and the output-data rows vary by raw-data file. So now I have 51 GB of data spread across 3,333,924 CSV files. I would like to combine these into a single file, because just reading them can take hours.

I'm a big fan of HDF5 because it is a versatile and common binary format that is easily accessible from interpreted languages such as Python. Still, I asked ChatGPT whether it was the right choice here. I also asked for the fastest way to read all these CSV files, since I was going to have to do that either way.

What follows is our conversation. (A tip: make sure you include all the key concepts in your question.)

Me: i have thousands of csv files with tables that I would like to store into an hdf5 file. what is the fastest way to read these files and what is the best format to save these tables in the hdf5 file?

To efficiently read thousands of CSV files and store them in an HDF5 file, you should focus on parallelized reading and efficient HDF5 storage formats. Here’s the optimal approach:


1. Fastest Way to Read CSV Files

The primary bottleneck in reading thousands of CSV files is I/O speed and parsing efficiency. Here’s how you can speed it up:

A. Use pandas.read_csv() with Optimized Parameters

Python’s pandas.read_csv() is well-optimized, but you can speed it up further by:

  • Using only necessary columns (usecols parameter)

  • Setting dtype explicitly if known (avoids auto-detection overhead)

  • Using engine='c' (default) for faster parsing

  • Using low_memory=False to avoid type inference issues

  • Disabling index parsing if unnecessary (index_col=False)

import pandas as pd 
df = pd.read_csv('file.csv', usecols=['col1', 'col2'], dtype={'col1': 'float32', 'col2': 'int32'}, engine='c', low_memory=False)
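Side note from me: for thousands of files, that same tuned call can simply be looped over a glob and concatenated. A minimal sketch, assuming the files share the hypothetical col1/col2 schema and live under data/ (mine don't all share a schema, so treat this as illustration only):

import glob
import pandas as pd

# Read every CSV with the same tuned settings, then stack them into one table
files = glob.glob('data/*.csv')
frames = [pd.read_csv(f,
                      usecols=['col1', 'col2'],
                      dtype={'col1': 'float32', 'col2': 'int32'},
                      engine='c',
                      low_memory=False)
          for f in files]
df = pd.concat(frames, ignore_index=True)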

B. Use dask.dataframe for Parallel Reading

Dask is great for handling large numbers of CSVs in parallel:

import dask.dataframe as dd 
df = dd.read_csv('data/*.csv', dtype={'col1': 'float32', 'col2': 'int32'})

This automatically parallelizes file loading.
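Worth knowing: Dask is lazy, so the parallel reads only actually happen when you ask for a result. Continuing the same hypothetical col1/col2 example:

import dask.dataframe as dd

# dd.read_csv only builds a lazy task graph; no files are read yet
df = dd.read_csv('data/*.csv', dtype={'col1': 'float32', 'col2': 'int32'})

# compute() triggers the parallel read and returns a regular pandas DataFrame
pdf = df.compute()
print(pdf.shape)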

C. Use polars.read_csv() for Maximum Speed

Polars is significantly faster than pandas for CSV parsing:

import polars as pl
df = pl.read_csv('file.csv', dtypes={'col1': pl.Float32, 'col2': pl.Int32})

For multiple files:

df = pl.concat([pl.read_csv(f) for f in file_list])

Polars is particularly efficient because its CSV reader is multi-threaded, and its lazy API (scan_csv) can defer and optimize work across many files.
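The lazy side of Polars mentioned here is scan_csv, which defers all work until collect(). A small sketch, again assuming the hypothetical col1/col2 columns and a Polars version whose scan_csv accepts glob patterns:

import polars as pl

# scan_csv builds a lazy query over every file matching the glob
lazy = pl.scan_csv('data/*.csv')

# Execution is deferred and multi-threaded; it happens at collect()
df = lazy.select(['col1', 'col2']).collect()
print(df.shape)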


and now for the second part:


2. Best Format to Save in HDF5

HDF5 stores data in a binary format, making it much faster than CSV. The optimal way to store tables depends on access patterns:

A. Use fixed format for Fastest Writing (if no appends needed)

df.to_hdf('data.h5', key='table_name', mode='w', format='fixed')
  • Pros: Fast write speed.

  • Cons: Cannot be appended to later.
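Whichever format you pick, reading a stored table back is a one-liner (the file and key names just mirror the snippet above):

import pandas as pd

# Load the table stored under the given key back into a DataFrame
df = pd.read_hdf('data.h5', key='table_name')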

B. Use table format for Appending & Querying

df.to_hdf('data.h5', key='table_name', mode='a', format='table', data_columns=True)
  • Pros: Allows appending and efficient queries on columns.

  • Cons: Slightly slower than fixed.
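The real payoff of format='table' with data_columns=True is that you can filter at read time; appending rows to an existing key also needs append=True in to_hdf. A sketch, reusing the hypothetical col1 column from earlier:

import pandas as pd

# The filter is evaluated inside PyTables, so only matching rows are loaded
subset = pd.read_hdf('data.h5', key='table_name', where='col1 > 0.5')
print(len(subset))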


C. Use Chunking & Compression for Large Datasets

For very large datasets:

df.to_hdf('data.h5', key='table_name', mode='a', format='table', complevel=5, complib='blosc')
  • Compression (blosc): Reduces file size with minimal speed loss.

  • Chunking: HDF5 stores data in chunks for efficient access.
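Once a file like this exists, pandas' HDFStore is a handy way to check what actually got written:

import pandas as pd

# Open read-only and inspect the contents
with pd.HDFStore('data.h5', mode='r') as store:
    print(store.keys())   # one entry per stored table
    print(store.info())   # storage format and shape for each key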


I found it awesome that ChatGPT recommended using my favorite parallel-processing library: concurrent.futures!!!


3. Fastest Way to Process & Store Multiple CSVs

Option 1: Using pandas with Parallel I/O

from concurrent.futures import ProcessPoolExecutor
import pandas as pd
import glob

def process_csv(file):
    # Parse one CSV in a worker process; explicit dtypes avoid type inference
    df = pd.read_csv(file, dtype={'col1': 'float32', 'col2': 'int32'})
    key = file.split('/')[-1].replace('.csv', '')
    return key, df

if __name__ == '__main__':
    files = glob.glob('data/*.csv')
    with ProcessPoolExecutor() as executor:
        # CSVs are parsed in parallel, but the writes stay in the main process:
        # a single HDF5 file is not safe for concurrent writes
        for key, df in executor.map(process_csv, files):
            df.to_hdf('data.h5', key=key, mode='a', format='table', data_columns=True)

This parses the CSVs on multiple CPU cores in parallel while keeping the HDF5 writes in a single process.

Option 2: Using dask for Large Data

import dask.dataframe as dd 
df = dd.read_csv('data/*.csv', dtype={'col1': 'float32', 'col2': 'int32'})
df.to_hdf('data.h5', key='table', format='table', mode='w')

Dask automatically optimizes performance and parallelism.

Conclusion

  • Fastest Reading: Use Dask (for large datasets) or Polars (for speed).

  • Best HDF5 Format:

    • Use fixed format for fastest writing.

    • Use table format if you need appending or querying.

  • Parallel Processing: Use ProcessPoolExecutor or Dask for optimal efficiency.



I went with Option 1 because I needed more control, but I ended up rewriting the process_csv function for even more control and a better way of storing the data. There is a practical limit on the number of objects you can save in the table format.... but we'll leave the details for another time. I think this provided a great introduction to new concepts.

Did you know about the fixed vs table formats in HDF5?


