Code
import dask.dataframe as dd

# Use Dask to read all CSV files
df = dd.read_csv('~/GithubPI/website/dataset/*.csv')

# Compute as Pandas dataframe
all_data = df.compute()
Save all files in a Pandas DataFrame from your directory
Possible Institute
January 4, 2023
multiple_csv.py
import time

import pandas as pd

start = time.time()

# Build the list of file names: data-01.tsv ... data-07.tsv
data_list = ["data-0{0}.tsv".format(i) for i in range(1, 8)]

# Drop the extension to get the base names
# (str.strip(".tsv") removes characters, not the suffix, so slice instead)
data_name = [name[:-len(".tsv")] for name in data_list]

# Read the first 10,000 rows of each tab-separated file into its own DataFrame
frames = [
    pd.read_csv("../url_path/{0}.tsv".format(name), sep="\t", nrows=10000)
    for name in data_name
]

# Preview each DataFrame
for frame in frames:
    print(frame.head())

end = time.time()
print("Total time of the process:", end - start)

# time.time() returns seconds, so one minute is 60
if end - start < 60:
    print("We have processed all files in less than one minute.")
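The script above keeps each file in a separate DataFrame. If the goal is a single DataFrame, as the title suggests, one possible approach is to collect the paths with glob and stack the results with pd.concat. This is only a sketch: the temporary directory, the file names, and the id/score columns are made up so that the example runs on its own.

```python
import glob
import os
import tempfile

import pandas as pd

# Create a few small tab-separated files in a temporary directory (made-up data)
tmp_dir = tempfile.mkdtemp()
for i in range(1, 4):
    pd.DataFrame({"id": [i], "score": [i * 1.5]}).to_csv(
        os.path.join(tmp_dir, "data-0{0}.tsv".format(i)), sep="\t", index=False
    )

# Collect the matching paths in a stable, sorted order
paths = sorted(glob.glob(os.path.join(tmp_dir, "data-*.tsv")))

# Read each file, then stack the pieces into one DataFrame with a fresh index
pieces = [pd.read_csv(path, sep="\t") for path in paths]
combined = pd.concat(pieces, ignore_index=True)

print(combined.shape)
print(combined.head())
```

Sorting the paths keeps the row order reproducible, and ignore_index=True avoids repeated index values from the individual files.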
For performance, especially when dealing with a large number of CSV files or large individual files, it can be beneficial to use Dask instead of Pandas. Dask is a flexible library for parallel computing in Python that integrates well with the existing Python ecosystem and is particularly good at working with large datasets.
Here’s how you can use Dask to read multiple CSV files more efficiently:
In this code, the dd.read_csv call with a wildcard path such as 'dataset/*.csv' reads all CSV files matching the pattern and returns a Dask DataFrame. This operation is lazy: apart from sampling the start of the data to infer column types, it does not load the files but builds up a task graph. The df.compute() call triggers the actual computation and returns a Pandas DataFrame.
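This Dask version can be faster than the Pandas version, particularly for large datasets, because Dask splits the data into smaller chunks (partitions) and processes them in parallel. Working in chunks can also reduce memory use, although calling compute() on the whole DataFrame still loads the full result into memory as a single Pandas DataFrame.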
As with the Pandas version, make sure the path points to where the CSV files actually are. A relative path is resolved against the current working directory of your Python script or notebook, so provide the full path if the files are stored elsewhere.
Keep in mind that Dask can be overkill for small datasets, since it adds overhead for splitting the work into smaller tasks and scheduling them. For larger datasets, that overhead is usually outweighed by the parallelism, and Dask can outperform Pandas.