How to automate importing 31 .csv files into SQL Server?

That's pretty good time. Make sure you don't have table indexes in place when importing data, otherwise the loading will crawl. If you're appending extra data, drop the indexes and recreate them after loading.
I could tell it was slower with the Natural Key index in place, so I decided to drop the index prior to the import and recreate it after the import from C#.
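In T-SQL terms the sequence amounts to something like this (just a sketch; the table, index, and file names below are placeholders, and the BULK INSERT simply stands in for whatever bulk-load call the C# code makes):

Code:
-- Placeholder names for illustration: dbo.Details, IX_Details_NaturalKey
DROP INDEX IX_Details_NaturalKey ON dbo.Details;

-- Load one of the 31 .csv files; repeat (or loop from C#) for each file.
-- FIRSTROW = 2 assumes the file has a header row.
BULK INSERT dbo.Details
FROM 'C:\data\manifests_01.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

-- Recreate the natural-key index once all files are loaded
CREATE UNIQUE NONCLUSTERED INDEX IX_Details_NaturalKey
    ON dbo.Details (ManifestNum, LineNum);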

How about table relationships? Do you recommend I strip those out too?

Thank you
 
Glad you got something you like that worked :)
Tuck the idea of learning SSIS in your back pocket; it will expand your horizons exponentially.
 
I am skipping the records that are duplicates while keeping the Natural Key index in place.
Umm, I don't think that's a good idea. I don't even think it's possible to preserve a snapshot of a table-related index if you delete all the records and re-import the data. If you drop the table, it drops all related indexes. If you delete all records but keep the table, the indexes are updated to essentially empty, and the deleted records are not physically removed; rather, they're flagged as logically deleted. Your best bet is to drop the table(s), recreate them, and load all the data with the dup records removed, or include the dups, create non-unique indexes, then remove the dups with a delete query.
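If you go the non-unique index route, the dedup delete can be as simple as this sketch (dbo.Details, the key columns, and the ID tiebreaker are assumed names, not necessarily your schema):

Code:
-- Keep one row per (ManifestNum, LineNum) and delete the rest.
WITH Ranked AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY ManifestNum, LineNum
               ORDER BY ID) AS rn
    FROM dbo.Details
)
DELETE FROM Ranked
WHERE rn > 1;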
How about table relationships? Do you recommend I strip those out too?
If you drop the tables and recreate them, which I strongly recommend, the joins are automatically removed and you have to re-define them. For best performance, especially with large tables, it's best to start out with brand-spanking-new tables, load all the data, then create all the indexes, joins, constraints, perms, etc.

You should also create a clustered index on the primary key so that the table data is physically reordered in the same order as the clustered index. You will also want to update the table statistics so the cost-based query optimizer can build the best query execution plans.
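For example (a sketch only, assuming an ID surrogate key and placeholder names):

Code:
-- Placeholder names for illustration (dbo.Details, PK_Details, ID column)
ALTER TABLE dbo.Details
    ADD CONSTRAINT PK_Details PRIMARY KEY CLUSTERED (ID);

-- Refresh statistics after the bulk load so the optimizer sees current data
UPDATE STATISTICS dbo.Details WITH FULLSCAN;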
 
You are right, skipping the duplicated records in C# was a poor choice. I now run a removeDuplicate query after all the data is imported. Thanks.

My current approach is to truncate the table, remove all indexes, import the data, remove dups, and recreate the indexes. On Monday a fresh data dump is released. I want to see how much things change from week to week before fully optimizing. Perhaps I only need to update monthly, and my current approach may be sufficient. I'll know more on Monday and strategize accordingly.

Thank you for all the advice. Really appreciate it.
 
Sounds like a good plan. Virtually everything takes longer when dealing with big tables, so optimization is crucial. This data dump you're importing seems like government data for an analytical decision support system. We used to import a 4.5GB PostgreSQL dataset of all U.S. Federal contracts for trend analysis. In a table with a million records, for every alpha character you trim, you save roughly 1MB. We summarized data by creating aggregate tables, created lookup tables for certain text fields, and then replaced the text values with foreign key integer values joined to the lookup tables.
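A rough sketch of the lookup-table trick (the table and column names here are made up for illustration, not the actual contract schema):

Code:
-- Replace a repeated text column (AgencyName) with an integer key
-- into a small lookup table. All names are hypothetical.
CREATE TABLE dbo.Agency (
    AgencyID   int IDENTITY(1,1) PRIMARY KEY,
    AgencyName varchar(200) NOT NULL UNIQUE
);

INSERT INTO dbo.Agency (AgencyName)
SELECT DISTINCT AgencyName FROM dbo.Contracts;

ALTER TABLE dbo.Contracts ADD AgencyID int NULL;
GO  -- batch break so the new column is visible to the UPDATE below

UPDATE c
SET    c.AgencyID = a.AgencyID
FROM   dbo.Contracts AS c
JOIN   dbo.Agency    AS a ON a.AgencyName = c.AgencyName;

-- Once verified, drop the wide text column and keep the integer foreign key
ALTER TABLE dbo.Contracts DROP COLUMN AgencyName;
ALTER TABLE dbo.Contracts ADD CONSTRAINT FK_Contracts_Agency
    FOREIGN KEY (AgencyID) REFERENCES dbo.Agency (AgencyID);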

Keep us informed about your progress, and best wishes for success.
 
Yes, this is US EPA data. I will be analyzing it to perform Hazardous Waste market analysis for customers.

I find it a little disappointing that the EPA doesn't enforce simple data integrity such as a Natural Key on (ManifestNum, LineNum). I found the same lack of data integrity in Canada. I assume it's the same in most departments?
 
I find that strange. What do the specs say about that data?
There are not too many duplicates. I think there were around 10 cases like the below.

[screenshot of the duplicate rows omitted]
 

After you remove the dups you should be able to successfully create the following composite index:

Code:
CREATE UNIQUE CLUSTERED INDEX idxname ON dbo.tablename (ManifestNum, LineNum);
 

Yes, I'm doing that, except I always use ID for the clustered index. I did find some records that have ManifestNum == null. I have to remove them. I am surprised SQL Server successfully indexed this.
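The cleanup I have in mind is just something along these lines (dbo.Details stands in for the actual detail table):

Code:
-- Remove the rows with no manifest number before creating the
-- unique (ManifestNum, LineNum) index. Table name is a placeholder.
DELETE FROM dbo.Details
WHERE ManifestNum IS NULL;

-- Alternative: a filtered unique index that simply excludes those rows
-- CREATE UNIQUE INDEX IX_Details_NaturalKey
--     ON dbo.Details (ManifestNum, LineNum)
--     WHERE ManifestNum IS NOT NULL;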

[screenshots of the rows with null ManifestNum omitted]
 

Actually, the clustered index should be created on the field(s) that are most frequently queried on. Primary Key fields automatically have an index created on them. They need it so foreign keys in other related tables can join to the primary key in the parent.
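For example (hypothetical parent/child table names, not your actual schema):

Code:
-- The child table's foreign key joins to the parent's primary key.
ALTER TABLE dbo.ManifestLine
    ADD CONSTRAINT FK_ManifestLine_Manifest
    FOREIGN KEY (ManifestID) REFERENCES dbo.Manifest (ID);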

I did find some records that have ManifestNum == null. I have to remove them. I am surprised SQL Server successfully indexed this.

SQL Server and other db servers allow null values to be indexed. However, Access doesn't.

I thought both ManifestNum and LineNum columns determined the uniqueness for each row?
 

Yes, ManifestNum and LineNum are the Natural Key for the Details table. My understanding is that the Primary Key is always the Clustered Index? I always use ID as the primary key, and I will be using the ID in all the table joins and table relationships.
 
Primary Key IDs will always have a unique index for joins; however, you can also create an additional clustered index on the column(s) your application is going to query on most often, to improve performance. The clustered index doesn't necessarily have to include the primary key.
e.g. If your application is mostly going to query on Location and HazmatName, then create a composite clustered index on both of those fields so the db engine will physically sort the records in the table by Location, HazmatName, and the clustered index will be in that same order. If there are dup records with the same two field values, the clustered index doesn't have to be unique. You're only allowed one clustered index per table, so choose the column(s) wisely.
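A sketch of that arrangement (placeholder names; note the ID primary key is declared nonclustered so the single clustered slot is free for the query columns - if the PK was originally created clustered, it has to be dropped and re-added first):

Code:
-- Placeholder names for illustration. The PK keeps a unique nonclustered
-- index for joins; the one clustered index goes on the query columns.
ALTER TABLE dbo.Details
    ADD CONSTRAINT PK_Details PRIMARY KEY NONCLUSTERED (ID);

CREATE CLUSTERED INDEX IX_Details_Location_Hazmat
    ON dbo.Details (Location, HazmatName);  -- non-unique is fine here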

If you're going to periodically import fresh sets of data, create a script that drops/recreates the tables, imports the new snapshot, and creates the indexes, joins, constraints, perms, etc.

https://stackoverflow.com/questions...use-clustered-vs-non-clustered-index#18306484

https://stackoverflow.com/questions/5070529/difference-between-clustered-and-nonclustered-index
 
You could create one huge file in DOS with something along the lines of the following, I believe?
Code:
copy file1.csv + file2.csv + file3.csv fullfile.csv
I know this thread has been solved elsewhere, but the suggestion above would take multiple lines of code, as the limit on the command line (as far as I can establish - please point me in the right direction if I'm wrong) is still 255 characters.
 
The Windows command-line limit is 8,191 characters; however, the Access text field limit is 255.
 
