
Working with large CSV files in Python

Last Updated : 12 Mar, 2024

Data plays a key role in building machine learning and AI models. In today's world, where data is generated at an astronomical rate by every computing device and sensor, it is important to handle huge volumes of data correctly. One of the most common formats for storing data is Comma-Separated Values (CSV). Reading a very large CSV file into memory all at once can exhaust the available RAM, leading to out-of-memory errors or system crashes.


The following are a few ways to effectively handle large data files in .csv format and read large CSV files in Python. The dataset we are going to use is gender_voice_dataset.

  • Using pandas.read_csv() with the chunksize parameter
  • Using Dask
  • Using compression

Read large CSV files in Python Pandas using pandas.read_csv(chunksize)

One way to process large files is to read them in chunks of reasonable size: each chunk is read into memory and processed before the next one is read. The chunksize parameter specifies the size of each chunk as a number of lines. With chunksize set, read_csv() returns an iterator that yields the chunks one at a time, so only a small part of the file needs to fit in memory at any moment.

The following is the code to read entries in chunks.

chunk = pandas.read_csv(filename, chunksize=...)
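In practice, the chunks are consumed in a loop, with each chunk processed as an ordinary DataFrame. The following is a minimal, self-contained sketch using a small in-memory CSV with made-up columns (not the actual dataset), so it can be run as-is:

```python
import pandas as pd
from io import StringIO

# A tiny in-memory CSV standing in for the large file
# (column names here are illustrative, not from the real dataset).
csv_data = "label,value\n" + "\n".join(f"a,{i}" for i in range(10)) + "\n"

total = 0
row_count = 0
for chunk in pd.read_csv(StringIO(csv_data), chunksize=3):
    # Each chunk is a regular DataFrame with at most 3 rows
    total += chunk["value"].sum()
    row_count += len(chunk)

print(row_count, total)  # 10 rows, running sum 0+1+...+9 = 45
```

Aggregating per chunk like this keeps peak memory bounded by the chunk size rather than the file size.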


The code below measures the time taken to read the dataset without using chunks:

Python3
# import required modules
import pandas as pd
import numpy as np
import time

# time taken to read data
s_time = time.time()
df = pd.read_csv("gender_voice_dataset.csv")
e_time = time.time()

print("Read without chunks: ", (e_time-s_time), "seconds")

# data
df.sample(10)

Output:

The dataset used in this example contains 986,894 rows and 21 columns. The time taken is about 4 seconds, which might not seem long, but for files with millions of rows the time taken to read the entries has a direct effect on the efficiency of the pipeline.

Now, let us use chunks to read the CSV file:

Python3
# import required modules
import pandas as pd
import numpy as np
import time

# time taken to read data
s_time_chunk = time.time()
chunk = pd.read_csv('gender_voice_dataset.csv', chunksize=1000)
e_time_chunk = time.time()

print("With chunks: ", (e_time_chunk-s_time_chunk), "sec")
df = pd.concat(chunk)

# data
df.sample(10)

Output:

As you can see, creating the chunked reader appears much faster than reading the whole file. Note, however, that this timing is somewhat misleading: with chunksize set, read_csv() returns the iterator almost immediately, and the actual parsing only happens as the chunks are consumed (here, inside pd.concat()). The real benefit of chunking is reduced peak memory usage, since the whole file never has to be held in memory at once.
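With chunksize, read_csv() returns almost immediately and the real parsing happens only when the chunks are consumed. The two phases can be timed separately to see this; the sketch below uses a synthetic in-memory CSV rather than the actual dataset:

```python
import time
from io import StringIO
import pandas as pd

# Synthetic CSV with 100,000 rows, standing in for the large file.
csv_data = "x\n" + "\n".join(str(i) for i in range(100_000))

t0 = time.time()
reader = pd.read_csv(StringIO(csv_data), chunksize=10_000)
t1 = time.time()        # creating the iterator: near-instant, nothing parsed yet

df = pd.concat(reader)  # the actual parsing happens here, chunk by chunk
t2 = time.time()

print(f"iterator: {t1 - t0:.4f}s, full read: {t2 - t1:.4f}s, rows: {len(df)}")
```

On typical runs, nearly all of the time is spent in the second phase, confirming that the iterator construction itself does little work.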

Read large CSV files in Python Pandas Using Dask

Dask is an open-source Python library that brings parallelism and scalability to Python by building on existing libraries such as pandas, NumPy, and scikit-learn.

To install:

pip install dask


Dask is preferred over plain chunking when the workload can use multiple CPU cores or clusters of machines (known as distributed computing). In addition, it provides parallel, scaled-out versions of the NumPy, pandas, and scikit-learn APIs. The following is the code to read files using Dask:

Python3
# import required modules
import pandas as pd
import numpy as np
import time
from dask import dataframe as dd

# time taken to read data
s_time_dask = time.time()
dask_df = dd.read_csv('gender_voice_dataset.csv')
e_time_dask = time.time()

print("Read with dask: ", (e_time_dask-s_time_dask), "seconds")

# data
dask_df.head(10)

Output:

Read large CSV files in Python Pandas Using Compression

The `compression` parameter of pandas' `read_csv` lets you read compressed CSV files directly. Specify the compression type (e.g., 'gzip', 'zip', 'xz') with the `compression` parameter, or leave the default `compression='infer'` and pandas will detect the type from the file extension. Support for gzip, bz2, zip, and xz is built into the Python standard library, so no extra installation is needed for these; only less common codecs such as Zstandard require an additional package, for example:

pip install zstandard
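As a small round-trip sketch (the file name here is hypothetical), a DataFrame can be written out gzip-compressed and read back with the same `compression` setting:

```python
import pandas as pd

# Hypothetical file name; gzip support is built into the standard library.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.to_csv("sample.csv.gz", index=False, compression="gzip")

# compression='infer' (the default) would also detect gzip from the .gz extension.
restored = pd.read_csv("sample.csv.gz", compression="gzip")
print(restored.shape)  # (3, 2)
```

Compressed CSVs trade a little CPU time for much smaller files on disk, which can also speed up reads from slow storage.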

In the example below, the Python code uses pandas to read a large CSV file in chunks, printing the shape of each chunk and the data within it, while handling a missing file or other unexpected errors.

Python3
import pandas as pd

chunk_size = 1000
compression_type = None  # Set to None for non-compressed files

file_path = '/content/drive/MyDrive/voice.csv'

try:
    chunk_iterator = pd.read_csv(file_path, chunksize=chunk_size, compression=compression_type)

    for i, chunk in enumerate(chunk_iterator):
        print(f'Chunk {i + 1} shape: {chunk.shape}')
        print(chunk)  # Print the data in each chunk

except FileNotFoundError:
    print(f"Error: File '{file_path}' not found. Please provide the correct file path.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Output :


Note: The dataset at the link has around 3,000 rows. Additional data was added separately for the purposes of this article to increase the size of the file; it does not exist in the original dataset.
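Chunked reading and compression can also be combined, since read_csv accepts both parameters together. A self-contained sketch with a hypothetical file name:

```python
import pandas as pd

# Write a small gzip-compressed CSV (file name is illustrative).
pd.DataFrame({"value": range(50)}).to_csv(
    "big.csv.gz", index=False, compression="gzip"
)

rows = 0
for chunk in pd.read_csv("big.csv.gz", chunksize=20, compression="gzip"):
    rows += len(chunk)  # chunks of 20, 20, and 10 rows

print(rows)  # 50
```

This combination keeps both disk usage and peak memory low, which is useful when the uncompressed file would not fit comfortably on either.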

