Data-Management Case: too many CSV files -- Part 2
- silviamazzoni
- Feb 25
- 2 min read
Of course, nothing is simple. I had to make adjustments to the simple script ChatGPT provided. That's what I like about generative AI: I can actually have a conversation with it!
So here is my prompt:
i do not know the contents of each file, they vary. I would like to save all of the files into a single hdf5 file.
And the lesson begins:
To efficiently store thousands of CSV files in an HDF5 file, you'll want a solution that:
Reads CSV files quickly
Handles varying file structures
Efficiently stores data in HDF5 format
We have a plan!
Step 1: Read CSV Files Efficiently
Since CSV files are plain text, reading them can be slow. The best tools to speed up this process are:
pandas (read_csv) – Works well for structured tabular data.
dask (dask.dataframe.read_csv) – Useful for parallel processing if files are large.
polars (pl.read_csv) – Optimized for performance, faster than Pandas.
Since your files vary in content, we assume they can be loaded as DataFrames (tables). If some have issues (e.g., missing headers), handling errors gracefully is important.
import pandas as pd
import os

# Path to your CSV files
csv_folder = "path/to/csv/files"
hdf5_file = "output.h5"

# Note: pandas' to_hdf() opens and closes the HDF5 file itself (via PyTables),
# so wrapping the loop in h5py.File() is unnecessary and can lock the file.
for i, filename in enumerate(os.listdir(csv_folder)):
    if filename.endswith(".csv"):
        file_path = os.path.join(csv_folder, filename)
        try:
            df = pd.read_csv(file_path)  # Read CSV into a DataFrame
            df.to_hdf(hdf5_file, key=f"table_{i}", mode="a",
                      format="table", data_columns=True)
            print(f"Stored {filename} as table_{i}")
        except Exception as e:
            print(f"Skipping {filename}: {e}")
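If pandas itself becomes the bottleneck, the reader is easy to swap out. Here is a minimal sketch using polars instead (my illustration, not ChatGPT's; it assumes polars and pyarrow are installed, since to_pandas() relies on pyarrow):

import os
import polars as pl

csv_folder = "path/to/csv/files"
hdf5_file = "output.h5"

for i, filename in enumerate(os.listdir(csv_folder)):
    if filename.endswith(".csv"):
        file_path = os.path.join(csv_folder, filename)
        try:
            df = pl.read_csv(file_path)  # multithreaded CSV parser
            # polars has no native HDF5 writer, so convert for the write only
            df.to_pandas().to_hdf(hdf5_file, key=f"table_{i}", mode="a",
                                  format="table", data_columns=True)
        except Exception as e:
            print(f"Skipping {filename}: {e}")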
Step 2: Choosing the Best HDF5 Format
HDF5 supports different ways to store tables:
Fixed Format (format="fixed")
Fast for writing and reading
Doesn't support appending new data later
Best for one-time storage of static data
Table Format (format="table")
Slower than fixed format but supports appending
Allows querying data more flexibly
Ideal for cases where files may be updated or extended later
Since your files vary in structure and size, using format="table" is the best choice for flexibility.
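To see the difference in action, here is a small sketch (the file name, key names, and the x column are made up for illustration):

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# Fixed format: fast, but passing append=True later raises an error
df.to_hdf("demo.h5", key="fixed_example", format="fixed")

# Table format: slower, but appending works...
df.to_hdf("demo.h5", key="table_example", format="table", data_columns=True)
df.to_hdf("demo.h5", key="table_example", format="table", append=True)

# ...and so does querying on disk against indexed columns
subset = pd.read_hdf("demo.h5", key="table_example", where="x > 1")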
Step 3: Speeding Up the Process
If the number of files is huge, reading and writing in parallel can improve speed. You can use concurrent.futures to process multiple CSV files at the same time:
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
import os

csv_folder = "path/to/csv/files"
hdf5_file = "output.h5"

def process_csv(file_tuple):
    index, filename = file_tuple
    file_path = os.path.join(csv_folder, filename)
    try:
        df = pd.read_csv(file_path)
        df.to_hdf(hdf5_file, key=f"table_{index}", mode="a",
                  format="table", data_columns=True)
        return f"Stored {filename} as table_{index}"
    except Exception as e:
        return f"Skipping {filename}: {e}"

# Get list of CSV files
csv_files = [(i, f) for i, f in enumerate(os.listdir(csv_folder))
             if f.endswith(".csv")]

# Process files in parallel
with ProcessPoolExecutor() as executor:
    results = list(executor.map(process_csv, csv_files))

# Print results
for r in results:
    print(r)
Key Takeaways
✅ Fast reading: Use pandas.read_csv() or polars.read_csv()
✅ Fast writing: Use df.to_hdf() with format="table"
✅ Parallel processing: Use ProcessPoolExecutor to speed up file handling
✅ Single HDF5 file: Store each CSV file as a separate dataset (key=f"table_{i}")
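A follow-up worth noting: since every CSV ends up under its own key, pd.HDFStore can list and reload the datasets later:

import pandas as pd

with pd.HDFStore("output.h5", mode="r") as store:
    print(store.keys())    # e.g. ['/table_0', '/table_1', ...]
    df = store["table_0"]  # load one dataset back as a DataFrame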
Of course, I needed more:
I needed to modify the script so that concurrent.futures would work on my Windows machine (see the sketch after this list).
I had more problems with the data output.
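The first issue is a classic: on Windows, multiprocessing spawns a fresh interpreter for each worker, so the executor must live under an if __name__ == "__main__": guard. And because HDF5 does not handle concurrent writes to a single file from multiple processes, it is safer to parse the CSVs in parallel and do all the writing from the main process. Here is a sketch of that pattern (an illustration of the idea, not my final script):

from concurrent.futures import ProcessPoolExecutor
import os
import pandas as pd

csv_folder = "path/to/csv/files"
hdf5_file = "output.h5"

def read_csv_file(file_tuple):
    # Workers only parse the CSV; they never touch the HDF5 file
    index, filename = file_tuple
    try:
        return index, filename, pd.read_csv(os.path.join(csv_folder, filename))
    except Exception as e:
        return index, filename, e

if __name__ == "__main__":  # required on Windows, where workers are spawned
    csv_files = [(i, f) for i, f in enumerate(os.listdir(csv_folder))
                 if f.endswith(".csv")]

    with ProcessPoolExecutor() as executor:
        for index, filename, result in executor.map(read_csv_file, csv_files):
            if isinstance(result, Exception):
                print(f"Skipping {filename}: {result}")
            else:
                # Single writer: only the main process opens output.h5
                result.to_hdf(hdf5_file, key=f"table_{index}", mode="a",
                              format="table", data_columns=True)
                print(f"Stored {filename} as table_{index}")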