Struggling with Data Cleaning in Python—Best Practices for Large Datasets?

Neeljy
Hey everyone,

I’ve been working with pandas to clean datasets, and while I can handle small ones easily, I run into trouble when dealing with large datasets (millions of rows). Operations that work fine on smaller sets, like .dropna(), .drop_duplicates(), and .apply(), seem to slow down significantly or even cause memory issues.

Some specific issues I’m facing:

  • Handling missing values efficiently – Should I always use .fillna() or are there better ways to impute values at scale?
  • Removing duplicates in a large dataset – .drop_duplicates() sometimes takes forever, any optimizations?
  • Inconsistent formatting – Text columns have mixed casing, extra spaces, and different date formats. Is there a more scalable way to standardize them without looping?
  • Outlier detection at scale – Using .describe() and IQR works for small data, but what about large datasets?
  • Performance concerns – Are Dask, Modin, or Polars worth using for large-scale data cleaning?
I’d love to hear real-world experiences from others handling big data.

What are your best practices for cleaning and preprocessing large datasets? Any tips on optimizing performance without running out of memory?
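To make the third bullet concrete, here is a minimal sketch of the kind of vectorized standardization I mean (the columns and values are made up, and format="mixed" assumes pandas 2.x):

```python
import pandas as pd

# Toy frame standing in for one chunk of the real data (columns are invented).
df = pd.DataFrame({
    "name":   ["  Alice ", "alice", "BOB", None],
    "city":   ["New York", "new york ", "Boston", "boston"],
    "signup": ["2024-01-05", "05/01/2024", "Jan 5 2024", None],
    "amount": ["10.5", "10.5", "oops", "7"],
})

# Vectorized text standardization -- no .apply(), no Python-level loop.
for col in ["name", "city"]:
    df[col] = df[col].str.strip().str.lower()

# Mixed date formats: parse each element, coercing unparseable ones to NaT.
# format="mixed" requires pandas 2.x.
df["signup"] = pd.to_datetime(df["signup"], format="mixed", errors="coerce")

# Numeric coercion instead of row-wise try/except.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Dedupe on the columns that define identity rather than the whole frame.
df = df.drop_duplicates(subset=["name", "city"])

print(df)
```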
 
Welcome to Access World! We're so happy to have you join us as a member of our community. As the most active Microsoft Access discussion forum on the internet, with posts dating back more than 20 years, we have a wealth of knowledge and experience to share with you.

We're a friendly and helpful community, so don't hesitate to ask any questions you have or share your own experiences with Access. We're here to support you and help you get the most out of this powerful database program.

To get started, we recommend reading the post linked below. It contains important information for all new users of the forum:

https://www.access-programmers.co.uk/forums/threads/new-member-read-me-first.223250/

We hope you have a great time participating in the discussion and learning from other Access enthusiasts. We look forward to having you around!
 
Hello and welcome to the forums.

We like to keep actual technical questions separate from the introduction forum. I am going to move your post to one of the technical forums.

Since this is your first post with us, there is no way you could have known our preferences. This is done without penalty or any hint of chastisement.
 
Not to mention vile contumely and inglorious invective.
 
I've worked with Python a little, my son a lot, and I've definitely worked with SQL. My advice is that Python is going to be a lot slower than many other languages, and you might consider cleaning the data (or at least partially cleaning it) in a SQL database rather than in Python (which, for all you know, may be looping behind the scenes anyway).

I've never heard of Dask, Modin, or Polars, so I can't comment on those, except to say that a new scripting language comes out just about every time you turn around, and being new doesn't make them good. New scripting languages are almost like new crypto coins: everyone wants to make one.

I'd try good old SQL for your data cleansing. You can install SQL Server Express for free, and if you include the data tools you can use SSIS packages to transform and route data in different directions, or use a SQL query to cleanse a large dataset in one pass with all sorts of conditional statements and replacements.

The newest stuff is not necessarily the best - nor even necessarily likely to be.
 
I've used SQL to clean around 1M rows of data from a monthly-produced Excel file.

SQL:
  1. Appends the data to a temporary table, with all fields treated as text and an indexed hash value of all fields (excluding the PK) created to subsequently identify duplicates
  2. Looks for and removes duplicates within the table
  3. Looks for and removes records which have already been imported
  4. Identifies fields which do not match the required specification for the field (e.g. null, string too long/short, value too large/small, out of date range, etc.). This made use of a meta table containing the specification for each field
  5. Gives the user the opportunity to review and not exclude those records where one or more fields do not meet the specification (the default is to exclude)
  6. Appends the resultant data, including the hash field, to the destination table. Any records the user flagged to import despite being out of spec were marked in a 'flag' field as a CSV value identifying the user and the fields and values that were out of spec (the query would resolve issues by truncating or substitution)

The first pass (steps 1-4), to identify records to be excluded, took around 45 seconds.
The user review took as long as they needed; typically there were around 50-100 records to review.
The second pass (step 6), to import, took around 60 seconds.

This process ran in Access and updated tables in SQL Server, hence the long second-pass time, since you could only insert one record at a time.
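For anyone wanting to prototype the same idea on the Python side before committing to SQL, here is a rough pandas sketch of steps 1-4 (a per-row hash for duplicate detection plus a spec check driven by a small meta table). The column names, the spec, and the "already imported" lookup are all invented for illustration:

```python
import pandas as pd

# Toy incoming batch (stands in for the monthly Excel extract; columns invented).
new = pd.DataFrame({
    "name":   ["Alice", "Alice", "Bob", "X" * 80],
    "amount": [10, 10, -5, 3000],
})

# Step 1: a per-row hash over all fields (exclude the PK column in real use)
# so duplicates can be found with a simple equality check on one column.
new["row_hash"] = pd.util.hash_pandas_object(new, index=False)

# Step 2: remove duplicates within the batch.
new = new.drop_duplicates(subset="row_hash")

# Step 3: remove rows whose hash already exists in the destination table
# (an empty set here stands in for a lookup against that table).
already_imported: set[int] = set()
new = new[~new["row_hash"].isin(already_imported)]

# Step 4: a tiny "meta table" of per-field rules, then flag out-of-spec rows
# instead of silently dropping them, so a user can review the flagged ones.
spec = {
    "name":   {"max_len": 50},
    "amount": {"min": 0, "max": 1000},
}
too_long     = new["name"].str.len() > spec["name"]["max_len"]
out_of_range = ~new["amount"].between(spec["amount"]["min"], spec["amount"]["max"])
new["out_of_spec"] = too_long | out_of_range

print(new)
```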
 
Sounds like a very satisfying outcome! SQL is so great for jobs like this. And I find myself making good use of TRY_CAST.
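For the pandas side of the OP's question, the closest analogue I know of to TRY_CAST is coercion, which turns unparseable values into NaN/NaT instead of raising. A tiny sketch with invented values:

```python
import pandas as pd

s = pd.Series(["42", "17.5", "oops", None])

# Like TRY_CAST: values that cannot be converted become NaN rather than erroring.
nums = pd.to_numeric(s, errors="coerce")

# Same idea for dates: the unparseable value becomes NaT.
dates = pd.to_datetime(pd.Series(["2024-01-05", "not a date"]), errors="coerce")

print(nums)
print(dates)
```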