Data-Management Case: too many CSV files -- Part 2
- silviamazzoni
- Feb 25
- 2 min read
Of course, nothing is simple. I had to make adjustments to the simple script ChatGPT provided. That's what I like about generative AI: I can actually have a conversation with it!
So here is my prompt:
i do not know the contents of each file, they vary. I would like to save all of the files into a single hdf5 file.
And the lesson begins:
To efficiently store thousands of CSV files in an HDF5 file, you'll want a solution that:
Reads CSV files quickly
Handles varying file structures
Efficiently stores data in HDF5 format
We have a plan!
Step 1: Read CSV Files Efficiently
Since CSV files are plain text, reading them can be slow. The best tools to speed up this process are:
pandas (read_csv) – Works well for structured tabular data.
dask (dask.dataframe.read_csv) – Useful for parallel processing if files are large.
polars (pl.read_csv) – Optimized for performance, faster than Pandas.
Since your files vary in content, we assume they can be loaded as DataFrames (tables). If some have issues (e.g., missing headers), handling errors gracefully is important.
import pandas as pd
import os

# Path to your CSV files
csv_folder = "path/to/csv/files"
hdf5_file = "output.h5"

# Note: pandas' to_hdf() opens and closes the HDF5 file itself (via PyTables),
# so wrapping the loop in h5py.File() is unnecessary and can lock the file.
for i, filename in enumerate(os.listdir(csv_folder)):
    if filename.endswith(".csv"):
        file_path = os.path.join(csv_folder, filename)
        try:
            df = pd.read_csv(file_path)  # Read CSV into a DataFrame
            df.to_hdf(hdf5_file, key=f"table_{i}", mode="a",
                      format="table", data_columns=True)
            print(f"Stored {filename} as table_{i}")
        except Exception as e:
            print(f"Skipping {filename}: {e}")
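If pandas itself becomes the bottleneck, the reader is easy to swap out. Here is a minimal sketch using polars instead (my illustration, not ChatGPT's; it assumes polars and pyarrow are installed, since to_pandas() relies on pyarrow):

import os
import polars as pl

csv_folder = "path/to/csv/files"
hdf5_file = "output.h5"

for i, filename in enumerate(os.listdir(csv_folder)):
    if filename.endswith(".csv"):
        file_path = os.path.join(csv_folder, filename)
        try:
            df = pl.read_csv(file_path)  # multithreaded CSV parser
            # polars has no native HDF5 writer, so convert for the write only
            df.to_pandas().to_hdf(hdf5_file, key=f"table_{i}", mode="a",
                                  format="table", data_columns=True)
        except Exception as e:
            print(f"Skipping {filename}: {e}")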
Step 2: Choosing the Best HDF5 Format
HDF5 supports different ways to store tables:
Fixed Format (format="fixed")
Fast for writing and reading
Doesn't support appending new data later
Best for one-time storage of static data
Table Format (format="table")
Slower than fixed format but supports appending
Allows querying data more flexibly
Ideal for cases where files may be updated or extended later
Since your files vary in structure and size, using format="table" is the best choice for flexibility.
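To see the difference in action, here is a small sketch (the file name, key names, and the x column are made up for illustration):

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# Fixed format: fast, but passing append=True later raises an error
df.to_hdf("demo.h5", key="fixed_example", format="fixed")

# Table format: slower, but appending works...
df.to_hdf("demo.h5", key="table_example", format="table", data_columns=True)
df.to_hdf("demo.h5", key="table_example", format="table", append=True)

# ...and so does querying on disk against indexed columns
subset = pd.read_hdf("demo.h5", key="table_example", where="x > 1")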
Step 3: Speeding Up the Process
If the number of files is huge, reading and writing in parallel can improve speed. You can use concurrent.futures to process multiple CSV files at the same time:
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
import os

csv_folder = "path/to/csv/files"
hdf5_file = "output.h5"

def process_csv(file_tuple):
    index, filename = file_tuple
    file_path = os.path.join(csv_folder, filename)
    try:
        df = pd.read_csv(file_path)
        df.to_hdf(hdf5_file, key=f"table_{index}", mode="a",
                  format="table", data_columns=True)
        return f"Stored {filename} as table_{index}"
    except Exception as e:
        return f"Skipping {filename}: {e}"

# Get list of CSV files
csv_files = [(i, f) for i, f in enumerate(os.listdir(csv_folder))
             if f.endswith(".csv")]

# Process files in parallel
with ProcessPoolExecutor() as executor:
    results = list(executor.map(process_csv, csv_files))

# Print results
for r in results:
    print(r)
Key Takeaways
✅ Fast reading: Use pandas.read_csv() or polars.read_csv()
✅ Fast writing: Use df.to_hdf() with format="table"
✅ Parallel processing: Use ProcessPoolExecutor to speed up file handling
✅ Single HDF5 file: Store each CSV file as a separate dataset (key=f"table_{i}")
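A follow-up worth noting: since every CSV ends up under its own key, pd.HDFStore can list and reload the datasets later:

import pandas as pd

with pd.HDFStore("output.h5", mode="r") as store:
    print(store.keys())    # e.g. ['/table_0', '/table_1', ...]
    df = store["table_0"]  # load one dataset back as a DataFrame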
Of course, I needed more:
I needed to modify the script so that concurrent.futures would work on my Windows machine (see the sketch after this list).
I had more problems with the data output.
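The first issue is a classic: on Windows, multiprocessing spawns a fresh interpreter for each worker, so the executor must live under an if __name__ == "__main__": guard. And because HDF5 does not handle concurrent writes to a single file from multiple processes, it is safer to parse the CSVs in parallel and do all the writing from the main process. Here is a sketch of that pattern (an illustration of the idea, not my final script):

from concurrent.futures import ProcessPoolExecutor
import os
import pandas as pd

csv_folder = "path/to/csv/files"
hdf5_file = "output.h5"

def read_csv_file(file_tuple):
    # Workers only parse the CSV; they never touch the HDF5 file
    index, filename = file_tuple
    try:
        return index, filename, pd.read_csv(os.path.join(csv_folder, filename))
    except Exception as e:
        return index, filename, e

if __name__ == "__main__":  # required on Windows, where workers are spawned
    csv_files = [(i, f) for i, f in enumerate(os.listdir(csv_folder))
                 if f.endswith(".csv")]

    with ProcessPoolExecutor() as executor:
        for index, filename, result in executor.map(read_csv_file, csv_files):
            if isinstance(result, Exception):
                print(f"Skipping {filename}: {result}")
            else:
                # Single writer: only the main process opens output.h5
                result.to_hdf(hdf5_file, key=f"table_{index}", mode="a",
                              format="table", data_columns=True)
                print(f"Stored {filename} as table_{index}")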