How to automate importing 31 .csv files into SQL Server?

That's pretty good time. Make sure you don't have table indexes created when importing data, otherwise the loading will crawl. If you're appending extra data, drop the indexes and recreate them after loading.
I could tell it was slower with the Natural Key index in place, so I decided to drop the index prior to the import and recreate it after the import in C#.
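For reference, a minimal C# sketch of that drop/load/recreate pattern. The table name, the index name, and the assumption that the CSV rows are already parsed into a DataTable are all placeholders, not the poster's actual code; the natural key columns (ManifestNum, LineNum) come from later in the thread.

```csharp
// Sketch only: drop the natural-key index, bulk load, then recreate the index.
// dbo.Manifest and IX_Manifest_NaturalKey are placeholder names; a plain (non-unique)
// index is shown because the raw files contain duplicate rows.
using System.Data;
using Microsoft.Data.SqlClient;

static void BulkLoadWithoutIndex(string connectionString, DataTable rows)
{
    using var conn = new SqlConnection(connectionString);
    conn.Open();

    // Drop the natural-key index so the load isn't slowed down by index maintenance.
    new SqlCommand("DROP INDEX IF EXISTS IX_Manifest_NaturalKey ON dbo.Manifest;", conn)
        .ExecuteNonQuery();

    // Stream the rows in with SqlBulkCopy.
    using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.Manifest", BatchSize = 10_000 })
    {
        bulk.WriteToServer(rows);
    }

    // Recreate the index once all the data is in place.
    new SqlCommand("CREATE INDEX IX_Manifest_NaturalKey ON dbo.Manifest (ManifestNum, LineNum);", conn)
        .ExecuteNonQuery();
}
```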

How about table relationships? Do you recommend I strip those out too?

Thank you
 
Glad you got something you like that worked :)
Tuck the idea of learning SSIS in your back pocket; it will expand your horizons exponentially.
 
I am skipping the records that are duplicates while keeping the Natural Key index in place.
Umm, I don't think that's a good idea. I don't even think it's possible to preserve a snapshot of a table-related index if you delete all records and re-import the data. If you drop the table, it drops all related indexes. If you delete all records but keep the table, the indexes get updated to basically blank, and the deleted records are not physically removed; rather, they're flagged as logically deleted. Your best bet is to drop the table(s), recreate them, and load all the data with the dup records removed, or include the dups, create non-unique indexes, then remove the dups with a delete query.
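A minimal sketch of the "load the dups, then delete them" route, assuming a placeholder dbo.Manifest table and the (ManifestNum, LineNum) natural key mentioned later in the thread:

```csharp
// Sketch only: delete duplicate natural keys after the raw load, keeping one row per key.
using Microsoft.Data.SqlClient;

static int RemoveDuplicates(string connectionString)
{
    const string sql = @"
        WITH Ranked AS (
            SELECT ROW_NUMBER() OVER (
                       PARTITION BY ManifestNum, LineNum
                       ORDER BY (SELECT NULL)) AS RowNum
            FROM dbo.Manifest
        )
        DELETE FROM Ranked WHERE RowNum > 1;";

    using var conn = new SqlConnection(connectionString);
    conn.Open();
    using var cmd = new SqlCommand(sql, conn) { CommandTimeout = 0 };
    return cmd.ExecuteNonQuery();   // number of duplicate rows removed
}
```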
How about table relationships? Do you recommend I strip those out too?
If you drop the tables and recreate them, which I strongly recommend, the relationships are automatically removed and you have to re-define them. For best performance, especially with large tables, it's best to start out with brand-spanking-new tables, load all the data, then create all the indexes, relationships, constraints, perms, etc.

You should also create a clustered index on the primary key so that the table data is physically ordered the same way as the clustered index. You will also want to update the table's statistics so the cost-based query optimizer can build the best query execution plans.
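A sketch of those post-load steps, assuming a hypothetical ManifestId identity column as the primary key and the same placeholder table name as above:

```csharp
// Sketch only: post-load housekeeping — clustered primary key, natural-key index,
// then a statistics refresh. ManifestId is an assumed NOT NULL identity column.
using Microsoft.Data.SqlClient;

static void FinalizeTable(string connectionString)
{
    string[] steps =
    {
        // Clustered PK physically orders the rows by the key.
        "ALTER TABLE dbo.Manifest ADD CONSTRAINT PK_Manifest PRIMARY KEY CLUSTERED (ManifestId);",

        // Non-clustered index on the natural key for lookups.
        "CREATE INDEX IX_Manifest_NaturalKey ON dbo.Manifest (ManifestNum, LineNum);",

        // Refresh statistics so the cost-based optimizer builds good plans.
        "UPDATE STATISTICS dbo.Manifest WITH FULLSCAN;"
    };

    using var conn = new SqlConnection(connectionString);
    conn.Open();
    foreach (var sql in steps)
    {
        using var cmd = new SqlCommand(sql, conn) { CommandTimeout = 0 };
        cmd.ExecuteNonQuery();
    }
}
```

The FULLSCAN option takes longer than sampled statistics but gives the optimizer the most accurate picture right after a full reload.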
 
You are right, skipping the duplicate records in C# was a poor choice. I now run a removeDuplicate query after all the data is imported. Thanks.

My current approach is to truncate the table, remove all indexes, import the data, remove dups, and recreate the indexes. On Monday a fresh data dump is released. I want to see how much things change from week to week before fully optimizing. Perhaps I only need to update monthly, and my current approach may be sufficient. I'll know more on Monday and strategize accordingly.
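In case it helps anyone else, a rough outline of that weekly cycle. The table and index names are placeholders, and the actual CSV import is passed in as a delegate (for example, the SqlBulkCopy sketch earlier in the thread):

```csharp
// Sketch only: truncate, drop indexes, bulk import, remove dups, recreate indexes.
using System;
using Microsoft.Data.SqlClient;

static void WeeklyRefresh(string connectionString, Action loadAllCsvFiles)
{
    string[] before =
    {
        "TRUNCATE TABLE dbo.Manifest;",                                   // 1. empty the table
        "DROP INDEX IF EXISTS IX_Manifest_NaturalKey ON dbo.Manifest;"    // 2. drop the indexes
    };
    string[] after =
    {
        // 4. delete duplicate natural keys left by the raw load
        @"WITH Ranked AS (
              SELECT ROW_NUMBER() OVER (PARTITION BY ManifestNum, LineNum
                                        ORDER BY (SELECT NULL)) AS RowNum
              FROM dbo.Manifest)
          DELETE FROM Ranked WHERE RowNum > 1;",
        // 5. recreate the indexes
        "CREATE INDEX IX_Manifest_NaturalKey ON dbo.Manifest (ManifestNum, LineNum);"
    };

    RunAll(connectionString, before);
    loadAllCsvFiles();                 // 3. bulk import the 31 .csv files
    RunAll(connectionString, after);
}

static void RunAll(string connectionString, string[] statements)
{
    using var conn = new SqlConnection(connectionString);
    conn.Open();
    foreach (var sql in statements)
    {
        using var cmd = new SqlCommand(sql, conn) { CommandTimeout = 0 };
        cmd.ExecuteNonQuery();
    }
}
```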

Thank you for all the advice. Really appreciate it.
 
Sounds like a good plan. Virtually everything takes longer to do when dealing with big tables, so optimization is crucial. This data dump you're importing seems like government data for an analytical decision support system. We used to import a 4.5GB PostgreSQL dataset of all U.S. Federal contracts for trend analysis. In a table with a million records, every character you trim from a column saves roughly 1MB. We summarized data by creating aggregate tables, and we created lookup tables for certain text values, then replaced the text values in all tables with integer foreign keys joined to the lookup tables.
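For anyone curious, a sketch of that lookup-table trick; WasteDescription and the table names are made-up placeholders, not columns from the actual dump:

```csharp
// Sketch only: move a repeated text column into its own lookup table and replace it
// with an integer foreign key in the big table.
using Microsoft.Data.SqlClient;

static void ReplaceTextWithLookup(string connectionString)
{
    string[] steps =
    {
        // 1. Build the lookup table from the distinct text values.
        @"CREATE TABLE dbo.WasteDescriptionLookup (
              WasteDescriptionId INT IDENTITY(1,1) PRIMARY KEY,
              WasteDescription   NVARCHAR(200) NOT NULL UNIQUE);",
        @"INSERT INTO dbo.WasteDescriptionLookup (WasteDescription)
          SELECT DISTINCT WasteDescription FROM dbo.Manifest WHERE WasteDescription IS NOT NULL;",

        // 2. Add the integer key to the big table and backfill it from the lookup.
        "ALTER TABLE dbo.Manifest ADD WasteDescriptionId INT NULL;",
        @"UPDATE m SET m.WasteDescriptionId = l.WasteDescriptionId
          FROM dbo.Manifest m
          JOIN dbo.WasteDescriptionLookup l ON l.WasteDescription = m.WasteDescription;",

        // 3. Drop the wide text column and wire up the foreign key.
        "ALTER TABLE dbo.Manifest DROP COLUMN WasteDescription;",
        @"ALTER TABLE dbo.Manifest ADD CONSTRAINT FK_Manifest_WasteDescription
          FOREIGN KEY (WasteDescriptionId) REFERENCES dbo.WasteDescriptionLookup (WasteDescriptionId);"
    };

    using var conn = new SqlConnection(connectionString);
    conn.Open();
    foreach (var sql in steps)
    {
        using var cmd = new SqlCommand(sql, conn) { CommandTimeout = 0 };
        cmd.ExecuteNonQuery();
    }
}
```

Doing the swap after the bulk load keeps the import itself simple, and the big table shrinks because each row stores a 4-byte integer instead of the repeated text.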

Keep us informed about your progress, and best wishes for success.
 
Yes, this is US EPA data. I will be analyzing it to perform Hazardous Waste Market analysis for customers.

I find it a little disappointing that the EPA doesn't enforce simple data integrity such as a natural key on (ManifestNum, LineNum). I found the same lack of data integrity in Canada. I assume it's the same in most departments?
 