Read Multiple CSV Files

Load all files from your directory into Pandas DataFrames

python

pandas

Author

Possible Institute

Published

January 4, 2023

multiple_csv.py

import time

import pandas as pd

start = time.time()

# Build the list of file names: data-01.tsv ... data-07.tsv
data_list = ["data-0{0}.tsv".format(i) for i in range(1, 8)]

# Read the first 10,000 rows of each tab-separated file into its own DataFrame
data_frames = [
    pd.read_csv("../url_path/{0}".format(name), sep="\t", nrows=10000)
    for name in data_list
]

# Preview the first rows of each DataFrame
for df in data_frames:
    print(df.head())

end = time.time()
elapsed = end - start  # time.time() differences are in seconds

print("Total time of the process (seconds):", elapsed)
if elapsed < 60:
    print("We have processed all files in less than one minute.")
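The script above keeps one DataFrame per file. If the goal is a single DataFrame holding all files, one option is to collect the paths with `glob` and combine the results with `pd.concat`. The following is only a minimal sketch, assuming every file has the same columns. The helper name `read_all_tsv` and the `source_file` column are hypothetical additions for illustration, and the sample files are written to a temporary directory so the snippet runs on its own.

```python
import glob
import os
import tempfile

import pandas as pd


def read_all_tsv(directory, pattern="*.tsv", nrows=None):
    """Read every file matching `pattern` in `directory` into one DataFrame.

    Hypothetical helper for illustration; assumes all files share columns.
    """
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r} in {directory!r}")

    frames = []
    for path in paths:
        frame = pd.read_csv(path, sep="\t", nrows=nrows)
        # Record which file each row came from (illustrative extra column)
        frame["source_file"] = os.path.basename(path)
        frames.append(frame)

    # Stack all frames and renumber the rows from 0
    return pd.concat(frames, ignore_index=True)


# Demo: write three small sample files, then read them back as one DataFrame
demo_dir = tempfile.mkdtemp()
for i in range(1, 4):
    sample = pd.DataFrame({"id": [i, i + 10], "value": [i * 1.5, i * 2.5]})
    sample.to_csv(os.path.join(demo_dir, f"data-0{i}.tsv"), sep="\t", index=False)

combined = read_all_tsv(demo_dir)
print(combined)
```

Using a glob pattern avoids hard-coding the number of files, so the same call keeps working when files are added to or removed from the directory.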

For performance, especially when dealing with a large number of CSV files or large individual files, it can be beneficial to use Dask instead of Pandas. Dask is a flexible library for parallel computing in Python that integrates well with the existing Python ecosystem and is particularly good at working with large datasets.

Here’s how you can use Dask to read multiple CSV files more efficiently:


import dask.dataframe as dd

# Use Dask to read all CSV files
df = dd.read_csv('~/GithubPI/website/dataset/*.csv')

# Compute as Pandas dataframe
all_data = df.compute()

In this code, dd.read_csv('~/GithubPI/website/dataset/*.csv') finds all CSV files matching the wildcard pattern and returns a Dask DataFrame. This operation is lazy: it does not load the full data but builds up a task graph (Dask may read a small sample to infer column types). The df.compute() call triggers the actual computation and returns a Pandas DataFrame, so the combined result must fit in memory.

For large datasets, this Dask version is often faster than the Pandas version because Dask splits the data into partitions and processes them in parallel. Working on one partition at a time can also reduce memory use, as long as the final result is small enough to hold in memory.

As with the Pandas version, make sure the path points to your files: either use a path relative to the directory your Python script or notebook runs from, or provide the full path to the files.

Remember that Dask can be overkill for small datasets, because dividing the work into smaller tasks and scheduling them adds overhead. For larger datasets, however, Dask can significantly outperform Pandas.