How to automate importing 31 .csv files into SQL Server?

That's pretty good time. Make sure you don't have table indexes created when importing data, otherwise the loading will crawl. If you're appending extra data, drop the indexes and recreate them after loading.
I could tell it was slower with the Natural Key index in place, so I decided to drop the index prior to the import and recreate it after the import in C#.
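For reference, a minimal C# sketch of that drop/load/recreate pattern. The table name, the index name, and the assumption that the CSV rows are already parsed into a DataTable are all placeholders, not the poster's actual code; the natural key columns (ManifestNum, LineNum) come from later in the thread.

```csharp
// Sketch only: drop the natural-key index, bulk load, then recreate the index.
// dbo.Manifest and IX_Manifest_NaturalKey are placeholder names; a plain (non-unique)
// index is shown because the raw files contain duplicate rows.
using System.Data;
using Microsoft.Data.SqlClient;

static void BulkLoadWithoutIndex(string connectionString, DataTable rows)
{
    using var conn = new SqlConnection(connectionString);
    conn.Open();

    // Drop the natural-key index so the load isn't slowed down by index maintenance.
    new SqlCommand("DROP INDEX IF EXISTS IX_Manifest_NaturalKey ON dbo.Manifest;", conn)
        .ExecuteNonQuery();

    // Stream the rows in with SqlBulkCopy.
    using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.Manifest", BatchSize = 10_000 })
    {
        bulk.WriteToServer(rows);
    }

    // Recreate the index once all the data is in place.
    new SqlCommand("CREATE INDEX IX_Manifest_NaturalKey ON dbo.Manifest (ManifestNum, LineNum);", conn)
        .ExecuteNonQuery();
}
```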

How about table relationships? Do you recommend I strip those out too?

Thank you
 
Glad you got something you like that worked :)
Tuck the idea of learning SSIS in your back pocket; it will expand your horizons exponentially.
 
I am skipping the records that are duplicates while keeping the Natural Key index in place.
Umm, I don't think that's a good idea. I don't even think it's possible to preserve a snapshot of a table-related index if you delete all records and re-import the data. If you drop the table, it drops all related indexes. If you delete all records but keep the table, the indexes get updated to basically blank, and the deleted records are not physically removed; rather, they're flagged as logically deleted. Your best bet is to drop the table(s), recreate them, and load all the data with the dup records removed, or include the dups, create non-unique indexes, then remove the dups with a delete query.
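A minimal sketch of the "load the dups, then delete them" route, assuming a placeholder dbo.Manifest table and the (ManifestNum, LineNum) natural key mentioned later in the thread:

```csharp
// Sketch only: delete duplicate natural keys after the raw load, keeping one row per key.
using Microsoft.Data.SqlClient;

static int RemoveDuplicates(string connectionString)
{
    const string sql = @"
        WITH Ranked AS (
            SELECT ROW_NUMBER() OVER (
                       PARTITION BY ManifestNum, LineNum
                       ORDER BY (SELECT NULL)) AS RowNum
            FROM dbo.Manifest
        )
        DELETE FROM Ranked WHERE RowNum > 1;";

    using var conn = new SqlConnection(connectionString);
    conn.Open();
    using var cmd = new SqlCommand(sql, conn) { CommandTimeout = 0 };
    return cmd.ExecuteNonQuery();   // number of duplicate rows removed
}
```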
How about table relationships? Do you recommend I strip those out too?
If you drop the tables and recreate them, which I strongly recommend, the relationships are automatically removed and you have to re-define them. For best performance, especially with large tables, it's best to start out with brand-spanking-new tables, load all the data, then create all the indexes, relationships, constraints, perms, etc.

You should also create a clustered index on the primary key so that the table data is physically ordered the same way as the clustered index. You will also want to update the table's statistics so the cost-based query optimizer can build the best query execution plans.
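A sketch of those post-load steps, assuming a hypothetical ManifestId identity column as the primary key and the same placeholder table name as above:

```csharp
// Sketch only: post-load housekeeping — clustered primary key, natural-key index,
// then a statistics refresh. ManifestId is an assumed NOT NULL identity column.
using Microsoft.Data.SqlClient;

static void FinalizeTable(string connectionString)
{
    string[] steps =
    {
        // Clustered PK physically orders the rows by the key.
        "ALTER TABLE dbo.Manifest ADD CONSTRAINT PK_Manifest PRIMARY KEY CLUSTERED (ManifestId);",

        // Non-clustered index on the natural key for lookups.
        "CREATE INDEX IX_Manifest_NaturalKey ON dbo.Manifest (ManifestNum, LineNum);",

        // Refresh statistics so the cost-based optimizer builds good plans.
        "UPDATE STATISTICS dbo.Manifest WITH FULLSCAN;"
    };

    using var conn = new SqlConnection(connectionString);
    conn.Open();
    foreach (var sql in steps)
    {
        using var cmd = new SqlCommand(sql, conn) { CommandTimeout = 0 };
        cmd.ExecuteNonQuery();
    }
}
```

The FULLSCAN option takes longer than sampled statistics but gives the optimizer the most accurate picture right after a full reload.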
 
You are right, skipping the duplicate records in C# was a poor choice. I now run a removeDuplicate query after all the data is imported. Thanks.

My current approach is to truncate the table, remove all indexes, import the data, remove dups, and recreate the indexes. On Monday a fresh data dump is released. I want to see how much things change from week to week before fully optimizing. Perhaps I only need to update monthly, and my current approach may be sufficient. I'll know more on Monday and strategize accordingly.
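In case it helps anyone else, a rough outline of that weekly cycle. The table and index names are placeholders, and the actual CSV import is passed in as a delegate (for example, the SqlBulkCopy sketch earlier in the thread):

```csharp
// Sketch only: truncate, drop indexes, bulk import, remove dups, recreate indexes.
using System;
using Microsoft.Data.SqlClient;

static void WeeklyRefresh(string connectionString, Action loadAllCsvFiles)
{
    string[] before =
    {
        "TRUNCATE TABLE dbo.Manifest;",                                   // 1. empty the table
        "DROP INDEX IF EXISTS IX_Manifest_NaturalKey ON dbo.Manifest;"    // 2. drop the indexes
    };
    string[] after =
    {
        // 4. delete duplicate natural keys left by the raw load
        @"WITH Ranked AS (
              SELECT ROW_NUMBER() OVER (PARTITION BY ManifestNum, LineNum
                                        ORDER BY (SELECT NULL)) AS RowNum
              FROM dbo.Manifest)
          DELETE FROM Ranked WHERE RowNum > 1;",
        // 5. recreate the indexes
        "CREATE INDEX IX_Manifest_NaturalKey ON dbo.Manifest (ManifestNum, LineNum);"
    };

    RunAll(connectionString, before);
    loadAllCsvFiles();                 // 3. bulk import the 31 .csv files
    RunAll(connectionString, after);
}

static void RunAll(string connectionString, string[] statements)
{
    using var conn = new SqlConnection(connectionString);
    conn.Open();
    foreach (var sql in statements)
    {
        using var cmd = new SqlCommand(sql, conn) { CommandTimeout = 0 };
        cmd.ExecuteNonQuery();
    }
}
```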

Thank you for all the advice. Really appreciate it.
 
Sounds like a good plan. Virtually everything takes longer to do when dealing with big tables, so optimization is crucial. This data dump you're importing seems like government data for an analytical decision support system. We used to import a 4.5GB PostgreSQL dataset of all U.S. Federal contracts for trend analysis. In a table with a million records, every character you trim from a column saves roughly 1MB. We summarized data by creating aggregate tables, and we created lookup tables for certain text values, then replaced the text values in all tables with integer foreign keys joined to the lookup tables.
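For anyone curious, a sketch of that lookup-table trick; WasteDescription and the table names are made-up placeholders, not columns from the actual dump:

```csharp
// Sketch only: move a repeated text column into its own lookup table and replace it
// with an integer foreign key in the big table.
using Microsoft.Data.SqlClient;

static void ReplaceTextWithLookup(string connectionString)
{
    string[] steps =
    {
        // 1. Build the lookup table from the distinct text values.
        @"CREATE TABLE dbo.WasteDescriptionLookup (
              WasteDescriptionId INT IDENTITY(1,1) PRIMARY KEY,
              WasteDescription   NVARCHAR(200) NOT NULL UNIQUE);",
        @"INSERT INTO dbo.WasteDescriptionLookup (WasteDescription)
          SELECT DISTINCT WasteDescription FROM dbo.Manifest WHERE WasteDescription IS NOT NULL;",

        // 2. Add the integer key to the big table and backfill it from the lookup.
        "ALTER TABLE dbo.Manifest ADD WasteDescriptionId INT NULL;",
        @"UPDATE m SET m.WasteDescriptionId = l.WasteDescriptionId
          FROM dbo.Manifest m
          JOIN dbo.WasteDescriptionLookup l ON l.WasteDescription = m.WasteDescription;",

        // 3. Drop the wide text column and wire up the foreign key.
        "ALTER TABLE dbo.Manifest DROP COLUMN WasteDescription;",
        @"ALTER TABLE dbo.Manifest ADD CONSTRAINT FK_Manifest_WasteDescription
          FOREIGN KEY (WasteDescriptionId) REFERENCES dbo.WasteDescriptionLookup (WasteDescriptionId);"
    };

    using var conn = new SqlConnection(connectionString);
    conn.Open();
    foreach (var sql in steps)
    {
        using var cmd = new SqlCommand(sql, conn) { CommandTimeout = 0 };
        cmd.ExecuteNonQuery();
    }
}
```

Doing the swap after the bulk load keeps the import itself simple, and the big table shrinks because each row stores a 4-byte integer instead of the repeated text.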

Keep us informed about your progress, and best wishes for success.
 
Yes, this is US EPA data. I will be analyzing it to perform Hazardous Waste Market analysis for customers.

I find it a little disappointing that the EPA doesn't enforce simple data integrity such as a natural key on (ManifestNum, LineNum). I found the same lack of data integrity in Canada. I assume it's the same in most departments?
 