Update 1.74 million records

GregD
Registered User. Joined Oct 12, 2007. Messages: 47.
It is not really 1.74 million records, but Access thinks it is, for some reason.

Here's what happens: I get a CSV file in with 2,196 lines. There are two date fields that are formatted poorly. Sometimes it is mddyyyy and sometimes it is mmddyyyy. I import the data into one table and then copy everything except those two fields to another table.

There are two date/time fields in the new table that are left empty at first. I then run two update queries to format and convert those poorly formatted date fields.

The query simply joins the two tables on three fields and then updates the date field. When I hit the preview button on the query, it takes about a second and says it will update 2,196 records. Perfect.

When I actually run the query, it takes about half an hour and tells me it will be updating 1.74 million records. Any ideas why this is happening? If and when the query runs, there are still only 2,196 records in both tables.
 
There's probably a problem with your joins. We'd need more info; can you post the SQL for your query?

If you right-click on the title bar of your query, you can select SQL View and copy/paste the text here.
 

Well, I'm happy to show you the SQL, but it's about as straightforward as it gets. The tables are identical except for the two date fields. In the import table they are Long Integer and in the new table they are Date/Time. Also, I mis-wrote in my original post: the original dates come in as either mddyy or mmddyy. That is why the text import sees them as Long and not Date/Time, or even Text.

Code:
UPDATE tblEFPVitalsInput INNER JOIN tblEFPVitals ON (tblEFPVitals.LastName = tblEFPVitalsInput.F3) AND (tblEFPVitals.FirstName = tblEFPVitalsInput.F2) AND (tblEFPVitalsInput.F1 = tblEFPVitals.ChartNumber) SET tblEFPVitals.EncDate = FixEncDate(tblEFPVitalsInput.F6);
 
Joining the tables using names could cause these issues. I'd have to have a look at the data. Is there any other unique identifier that is the same? I would guess that the names are not unique enough.
 

There is really nothing else to go on. In the health insurance industry it is first name, last name, and date of birth; I do this all day, every day. In this case I'm joining on first name, last name, and chart number, because this data is screwy and they supply the age rather than the DOB. In the next query I join to another table to get the DOB. That one runs fine. It is only 2,200 records and the tables are identical. This thing is driving me nuts.

Due to federal regulations I can't post any data.
 
What I do in times like this:

  1. Import the data into a staging table.
  2. Use an AutoNumber field to create a unique ID for the staging table.
  3. Add all the fields, including the AutoNumber, to the new table (data type Long Integer).
  4. Join the staging table and the new table on the AutoNumber field; you will have a one-to-one match.
  5. Perform the updates, then remove the field from the main table if required (or leave it for later updates).
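Outside Access, the mechanics of steps 1 through 5 can be sketched with Python's built-in sqlite3; the table and field names here are invented for illustration, not the actual tables from this thread:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Steps 1-2: staging table with an AutoNumber-style unique id
cur.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " first TEXT, last TEXT, rawdate TEXT)")
cur.executemany("INSERT INTO staging (first, last, rawdate) VALUES (?, ?, ?)",
                [("John", "Smith", "12252007"), ("John", "Smith", "01052007")])

# Step 3: the destination table carries the staging id as a plain integer
cur.execute("CREATE TABLE main (stage_id INTEGER, first TEXT,"
            " last TEXT, encdate TEXT)")
cur.execute("INSERT INTO main (stage_id, first, last)"
            " SELECT id, first, last FROM staging")

# Steps 4-5: join on the id, so one row matches exactly one row,
# even though both sample rows share the same name
cur.execute("UPDATE main SET encdate ="
            " (SELECT rawdate FROM staging WHERE staging.id = main.stage_id)")

updated = cur.execute(
    "SELECT COUNT(*) FROM main WHERE encdate IS NOT NULL").fetchone()[0]
print(updated)  # 2: both duplicate-name rows updated exactly once
```

The point of the AutoNumber join is that it sidesteps the uniqueness question entirely: the id is unique by construction, so duplicate names can never multiply the match.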
 

That's not a bad idea.

What I just ended up doing was writing a small chunk of code to reformat the date in the CSV file before I import it. The code reformats the date to mm-dd-yyyy before it is imported, and the text import then recognizes it as a date. I can now skip the whole query altogether. My code took 14 seconds to run.

There is always more than one way to skin a cat.


Code:
' Rewrite column 6 of the CSV file, turning m(m)ddyy dates into mm-dd-yyyy
Private Sub FixDate(sFileName As String)
    Dim sLines() As String
    Dim sParts() As String
    Dim sTemp As String
    Dim sNewLine As String
    Dim lFileNum As Long
    Dim k As Long
    Dim i As Long

    ' Read the whole file in one gulp and split it into lines
    lFileNum = FreeFile
    Open sFileName For Input As lFileNum
    sTemp = Input(LOF(lFileNum), lFileNum)
    Close lFileNum

    sLines = Split(sTemp, vbCrLf)
    sTemp = ""
    For k = 0 To UBound(sLines)
        If Len(Trim(sLines(k))) > 0 Then
            ' Column 6 (index 5) holds the badly formatted date
            sParts = Split(sLines(k), ",")
            sParts(5) = FormatDate(sParts(5))
            ' Rebuild the line, then drop the trailing comma
            sNewLine = ""
            For i = 0 To UBound(sParts)
                sNewLine = sNewLine & sParts(i) & ","
            Next
            sLines(k) = Left(sNewLine, Len(sNewLine) - 1)
        End If
        sTemp = sTemp & sLines(k) & vbCrLf
    Next

    ' Write the corrected text back over the original file;
    ' the trailing semicolon stops Print # adding an extra blank line
    lFileNum = FreeFile
    Open sFileName For Output As lFileNum
    Print #lFileNum, sTemp;
    Close lFileNum
End Sub

' Turn m(m)ddyy into mm-dd-yyyy; returns "0" if the input is not 5 or 6 digits
Private Function FormatDate(sDateIn As String) As String
    If Len(sDateIn) = 5 Then sDateIn = "0" & sDateIn   ' pad mddyy to mmddyy
    If Len(sDateIn) <> 6 Then
        FormatDate = "0"
        Exit Function   ' bail out so the bad value is not reformatted below
    End If
    sDateIn = Left(sDateIn, 4) & "20" & Right(sDateIn, 2)  ' mmdd20yy
    FormatDate = Format(sDateIn, "&&-&&-&&&&")             ' mm-dd-yyyy
End Function
 
The other odd thing about this that I didn't mention: every time I opened the query in design view, or switched between SQL and design view, it dropped two of the three joins. I recreated the query and it did the same thing. Only when I put brackets [] around the field names in the SQL view did this stop. Before that I compacted and repaired the MDB file, and that did not help. There is just something screwy about this.
 
That pretty much explains why you're having problems: you're getting a Cartesian product.

Access is very annoying about this. I actually copy the text of my SQL before going into design view so I can restore it after Access screws it up. You might want to do the same. Once you've got it the way you want it, avoid design view at all costs.
 

I'm not familiar with the term "Cartesian product". What does it mean? Even if it explains why I'm having problems, what is the problem? Even after I got the joins to stop dropping by bracketing the field names, the query still wouldn't run properly. I have written hundreds of queries in Access and have gotten some odd results, but this is a first for me.
 
A Cartesian product is the result of every row of a table or query joining to every row in a different table or query. Since Access is removing your join criteria, this is what's happening.

Perhaps you could try removing the inner-join criteria (just use "FROM table1, table2") and putting the criteria in a WHERE clause? I frequently do that (which Access also screws up, sometimes).
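For what it's worth, the two forms are equivalent when the criteria survive intact. A quick sketch outside Access, using Python's built-in sqlite3 with simplified stand-ins for the tables (the F1/F2/F3 field names follow the posted SQL; the data is invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE tblInput (F1 INTEGER, F2 TEXT, F3 TEXT)")
cur.execute("CREATE TABLE tblVitals (ChartNumber INTEGER,"
            " FirstName TEXT, LastName TEXT)")
cur.executemany("INSERT INTO tblInput VALUES (?, ?, ?)",
                [(1, "John", "Smith"), (2, "Jane", "Doe")])
cur.executemany("INSERT INTO tblVitals VALUES (?, ?, ?)",
                [(1, "John", "Smith"), (2, "Jane", "Doe")])

# INNER JOIN form: the three criteria live in the ON clause
join_count = cur.execute(
    """SELECT COUNT(*) FROM tblInput INNER JOIN tblVitals
       ON tblVitals.ChartNumber = tblInput.F1
       AND tblVitals.FirstName = tblInput.F2
       AND tblVitals.LastName = tblInput.F3""").fetchone()[0]

# WHERE form: the same criteria, written as a filter on the cross product
where_count = cur.execute(
    """SELECT COUNT(*) FROM tblInput, tblVitals
       WHERE tblVitals.ChartNumber = tblInput.F1
       AND tblVitals.FirstName = tblInput.F2
       AND tblVitals.LastName = tblInput.F3""").fetchone()[0]

print(join_count, where_count)  # 2 2: identical results either way
```

Moving the criteria into the WHERE clause doesn't change the result set; it just takes the join conditions out of the part of the SQL that the query designer likes to rewrite.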
 
Thanks. Yeah, that's what I thought was happening at first, but then I wasn't sure. So you're suggesting that even though I restore the joins, Access still drops them when it runs the query, even though all three joins appear in both the SQL and design views? I guess that is as good an explanation as any.
 
Since all 3 fields need to be considered together, wouldn't a concatenation of the fields and a join on the result work in this instance?

((tblEFPVitals.LastName & tblEFPVitals.FirstName & tblEFPVitals.ChartNumber) = (tblEFPVitalsInput.F3 & tblEFPVitalsInput.F2 & tblEFPVitalsInput.F1))

I also noticed that the order of the entries in the third AND condition is switched compared to the first two. Will this make any difference?
 
Since all 3 fields need to be considered together, wouldn't a concatenation of the fields and a join on the result work in this instance?

((tblEFPVitals.LastName & tblEFPVitals.FirstName & tblEFPVitals.ChartNumber) = (tblEFPVitalsInput.F3 & tblEFPVitalsInput.F2 & tblEFPVitalsInput.F1))

Yes, I do this a lot when the tables are not identical. The only difference is that I take the first 3 letters of the first name, the first 4 of the last name, and the DOB or chart number to create a unique match key, and then join on that. It is a good way to find matches with free-text entries on names.

I also noticed that the order of the entries in the third AND condition is switched compared to the first two. Will this make any difference?

That is an excellent observation. I bet that is what was causing it to drop the joins. Very odd, though. I'm going to take a look at this next week.
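That partial-name match key can be sketched in a few lines; the helper name and sample names below are hypothetical, and in Access this would be a calculated expression rather than a function:

```python
def match_key(first, last, chart):
    """First 3 letters of the first name, first 4 of the last name,
    plus the chart number (or DOB), upper-cased for case-insensitive matching."""
    return (first[:3] + last[:4] + str(chart)).upper()

# Free-text name variations still land on the same key
print(match_key("Gregory", "Davidson", 1234))  # GREDAVI1234
print(match_key("Greg", "davidson", 1234))     # GREDAVI1234
```

Truncating the names makes the key tolerant of nicknames and typos past the first few letters, at the cost of occasional false matches, which is why anchoring it with a chart number or DOB matters.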
 
if you are joining on non-unique fields, then each row in one table will join to every matching row in the other

so if you have in one table

100 entries for John Smith,

and in another table 80 entries for John Smith,

then joining the tables on the names John Smith will give you 8000 (80 x 100) instances of John Smith.

If you can extract a subquery with unique data for each table, then you could set the query properties to unique values, which will eliminate the duplicates, and therefore remove the cartesian product.

It depends on what data is in the table though - it really arises because the spreadsheet (csv) is non-normalised, and you have repeating groups within the data.
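The 80 x 100 arithmetic is easy to reproduce. A small sketch using Python's built-in sqlite3 (the data is invented to match the John Smith example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t1 (name TEXT)")
cur.execute("CREATE TABLE t2 (name TEXT)")
cur.executemany("INSERT INTO t1 VALUES (?)", [("John Smith",)] * 100)
cur.executemany("INSERT INTO t2 VALUES (?)", [("John Smith",)] * 80)

# Joining on a non-unique name: every t1 row pairs with every matching t2 row
n = cur.execute(
    "SELECT COUNT(*) FROM t1 INNER JOIN t2 ON t1.name = t2.name").fetchone()[0]
print(n)  # 8000, i.e. 100 x 80

# Unique-values subqueries collapse the duplicates before the join
d = cur.execute(
    """SELECT COUNT(*)
       FROM (SELECT DISTINCT name FROM t1) a
       INNER JOIN (SELECT DISTINCT name FROM t2) b ON a.name = b.name"""
).fetchone()[0]
print(d)  # 1: the Cartesian blow-up is gone
```

This is also roughly how 2,196 input rows can balloon to 1.74 million: it only takes a few heavily repeated key values on each side.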
 

Yes, I get that. I'm all too familiar with the phenomenon and how to correct it; it was the term "Cartesian product" that I was unfamiliar with.

We do now know the problem, because it has been stated several times in this thread, but the root cause is not as clear as you are making it out to be. I won't go over everything again, but if you go back and re-read the entire thread, I think you will agree with me.
 
I also noticed that the order of the entries for the third AND Statement is switched compared to the first two. Will this make any difference?

That was it. Once you pointed it out, it was painfully obvious. :mad:
 
i appreciate what you are saying, but if you are getting a cartesian join, you must be joining the tables on a non-unique key; therefore you need to find a way of achieving a unique join, i would have thought.

one issue is that if you have a cartesian join, you won't know which of the multiple cross-joins has provided the data for the update.

nuff said - no more from me now
 
That was it. Once you pointed it out, it was painfully obvious. :mad:

I am glad that I was able to be of help to someone for a change. So many people have helped me and others out, and I was hoping that I could begin to return the favors by paying it forward and helping others in kind.
 
