How to automate importing 31 .csv files into SQL Server?

Sounds like a good plan. Virtually everything takes longer with big tables, so optimization is crucial. The data dump you're importing looks like government data for an analytical decision-support system. We used to import a 4.5 GB PostgreSQL dataset of all U.S. Federal contracts for trend analysis. In a table with a million records, every character you trim from a column saves roughly 1 MB. We summarized the data in aggregate tables, created lookup tables for frequently repeated text values, and then replaced those text values throughout with integer foreign keys joined back to the lookup tables.
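The lookup-table idea above can be sketched in a few lines of Python, applied to one text column before import. This is a minimal illustration, not the original import code; the column values are hypothetical.

```python
def build_lookup(values):
    """Map each distinct text value to a small integer code (1-based surrogate key)."""
    lookup = {}
    codes = []
    for v in values:
        if v not in lookup:
            lookup[v] = len(lookup) + 1
        codes.append(lookup[v])
    return lookup, codes

# Hypothetical repetitive text column from a manifest file:
waste_descriptions = ["Lead acid batteries", "Waste oil",
                      "Lead acid batteries", "Waste oil", "Solvents"]
lookup, codes = build_lookup(waste_descriptions)
# 'lookup' becomes the rows of the lookup table;
# 'codes' are the integer foreign-key values stored in the big table.
```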

Keep us informed about your progress, and best wishes for success.
Yes this is US EPA data. I will be analyzing it to perform Hazardous Waste Market analysis for customers.

I find it a little disappointing that the EPA doesn't enforce simple data integrity such as a natural key on (ManifestNum, LineNum). I found the same lack of data integrity in Canada. I assume it's the same in most departments?
 
I find that strange. What do the specs say about that data?
There are not too many duplicates. I think there were around 10 cases like the one below.

[attached screenshot of the duplicate rows]
 
After you remove the dups you should be able to successfully create the following composite index:

Code:
CREATE UNIQUE CLUSTERED INDEX idxname ON dbo.tablename (ManifestNum, LineNum);
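If you'd rather drop the duplicates from the CSVs before the data ever reaches SQL Server, here is a minimal Python sketch. The key columns come from the thread; the row values are hypothetical.

```python
def dedupe_rows(rows, key_fields=("ManifestNum", "LineNum")):
    """Keep only the first row seen for each natural-key value."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Example with dict rows such as csv.DictReader produces (values hypothetical):
rows = [
    {"ManifestNum": "M001", "LineNum": "1", "Kg": "100"},
    {"ManifestNum": "M001", "LineNum": "1", "Kg": "100"},  # duplicate
    {"ManifestNum": "M001", "LineNum": "2", "Kg": "50"},
]
print(len(dedupe_rows(rows)))  # 2
```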

Yes, I'm doing that, except I always use ID for the clustered index. I did find some records where ManifestNum is NULL; I have to remove them. I am surprised SQL Server indexed this successfully.

[attached screenshots]
 
I thought the ManifestNum and LineNum columns together determined the uniqueness of each row?

Yes, ManifestNum and LineNum are the natural key for the Details table. My understanding is that the primary key is always the clustered index? I always use ID as the primary key, and I will be using the ID in all the table joins and relationships.
 
You could, I believe, create one huge file in DOS with something along the lines of
Code:
copy file1.csv+file2.csv+file3.csv fullfile.csv
I know this thread has been solved elsewhere, but the suggestion above would take multiple lines of code, as the limit on the command line (as far as I can establish; please point me in the right direction if I'm wrong) is still 255 characters.
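A scripted alternative sidesteps the command-line length limit and also avoids repeating each file's header row, which a plain `copy` concatenation would include. A minimal sketch, with hypothetical file names:

```python
import csv

def concat_csvs(paths, out_path):
    """Concatenate CSV files, keeping the header from the first file only."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for i, path in enumerate(paths):
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)
                if i == 0:
                    writer.writerow(header)  # write the header once
                writer.writerows(reader)

# Usage (hypothetical file names):
# from pathlib import Path
# concat_csvs(sorted(Path(".").glob("manifest_*.csv")), "manifest_all.csv")
```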
 
I downloaded today's manifest data. 35,636 manifests were added in a week, totaling 176,932,466 kg of waste. I will likely do the update weekly.

Thank you for all the input and advice.
 
I know this is older, but I've used Python and SSIS to accomplish this. Both can work as standalone or combined solutions, depending on how variable you need this to be.
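For the Python route, one common pattern is to loop over the files and issue a T-SQL BULK INSERT for each. A minimal sketch; the table name, folder, and connection string are assumptions, not details from the thread:

```python
def bulk_insert_sql(csv_path, table="dbo.Manifests"):
    """Build a T-SQL BULK INSERT statement for one CSV file (table name assumed)."""
    return (
        f"BULK INSERT {table} "
        f"FROM '{csv_path}' "
        "WITH (FORMAT = 'CSV', FIRSTROW = 2);"  # FIRSTROW = 2 skips the header row
    )

# With pyodbc (hypothetical connection string), the loop would look like:
# import pyodbc
# from pathlib import Path
# conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};...")
# for path in sorted(Path(r"C:\data").glob("*.csv")):
#     conn.execute(bulk_insert_sql(path))
#     conn.commit()
```

Note that FORMAT = 'CSV' requires SQL Server 2017 or later; on older versions you would specify FIELDTERMINATOR and ROWTERMINATOR instead.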
 
