SELECT DISTINCT not removing duplicate records in staging table

Hi guys,

I have a database with an import process which normalises incoming data and appends to various tables. No issues with that. I also have a function within that process which counts the number of new entries for a summary popup when the process has completed.

This works by querying the staging table, prior to the append, into a recordset and using .RecordCount to increment the count (multiple files can be imported at once, so this effectively provides a running count, per file, to give a total for the whole import).

I thought it was working fine but this morning I noticed that the count which appeared on the popup was 1 greater than the number of actual new records. I checked the source files and noticed that, for whatever reason, there was a duplicate entry in there. So I presume that's why the count was out by 1.

There's no integrity issue in the main tables as the composite primary keys ensure that duplication shouldn't be a problem. Indeed, the record in question, duplicated in the source, appears only once in the main table post-import. So not too worried about that.

However, I need the count in the popup to be accurate (it tells the users how many new entries require further investigation). And what's puzzling me is that I use DISTINCT in the query, which I would have thought should eliminate any potential dupes in the recordset and thus provide the correct count. It seems it doesn't?

Here is the piece of the code where the count is calculated / incremented:

Code:
Public lngNewBalancesTBI As Long    ' Defined in a separate module...
-------
Dim dbs As Database
Dim rst As Recordset
Dim strSQL As String
....
strSQL = "SELECT [COLOR=red]DISTINCT[/COLOR] [tblBalanceHolding].[BalanceDate],  [tblBalanceHolding].[AccountID] " & _
    "FROM (([tblBalanceHolding] " & _
    "INNER JOIN [tblAccounts] ON [tblBalanceHolding].[AccountID] = [tblAccounts].[AccountID]) " & _
    "INNER JOIN [tblCurrencies] ON [tblAccounts].[CcyID] = [tblCurrencies].[CcyID]) " & _
    "INNER JOIN [tblRates] ON [tblBalanceHolding].[BalanceDate] = [tblRates].[RateDate] " & _
        "AND [tblAccounts].[CcyID] = [tblRates].[CcyID] " & _
    "WHERE " & _
        "([tblCurrencies].[Emerging] = True " & _
        "AND " & _
        "(Abs([tblBalanceHolding].[Amount]*[tblRates].[FXRate])) >= " & _
        DLookup("[Threshold]", "[tblThresholds]", "[CurrencyGroup]='Emerging'") & ")" & _
    " OR " & _
        "([tblCurrencies].[Emerging] = False " & _
        "AND " & _
        "(Abs([tblBalanceHolding].[Amount]*[tblRates].[FXRate])) >= " & _
        DLookup("[Threshold]", "[tblThresholds]", "[CurrencyGroup]='NonEmerging'") & ")"
 
Set dbs = CurrentDb
With dbs
    Set rst = .OpenRecordset(strSQL)
    With rst
        If .RecordCount > 0 Then
            .MoveLast
            lngNewBalancesTBI = lngNewBalancesTBI + .RecordCount
        End If
        .Close
    End With
End With
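
For reference, a sketch of how the same count could be taken with a single scalar query instead of the MoveLast / .RecordCount step (this reuses the strSQL and dbs variables from the block above, and assumes a reasonably recent Jet/ACE engine that accepts a parenthesised derived table):

Code:
' Sketch only: wrap the existing DISTINCT query in a derived table and count its rows in one pass
Dim rstCount As DAO.Recordset
Set rstCount = dbs.OpenRecordset("SELECT COUNT(*) AS NewCount FROM (" & strSQL & ") AS T")
lngNewBalancesTBI = lngNewBalancesTBI + rstCount!NewCount
rstCount.Close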

Any suggestions why the dupe, which is still present in the staging table, also makes it over to the recordset, even though I'm using DISTINCT?

Thanks

Al
 
well it seems like you are doing

Load file
Count in file
Load file
Count in file

The records are unique inside the file and thus are counted, but across the board they aren't (any more)?

Either way... your data LOOKS the same but isn't, i.e. your date could be showing only a date but also have a time component...
Or a trailing space or something along those lines?
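
A quick way to check for either of those would be something like the query below - a sketch only, using the field names from the earlier post and assuming AccountID is a text field (adjust for the real types):

Code:
SELECT BalanceDate,
       CDbl(BalanceDate) AS RawDateValue,
       AccountID,
       Len(AccountID) AS IDLength
FROM tblBalanceHolding;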
 
Thanks namliam!

I see where you're going - but that's not it...

Between files, there is a separate duplication removal process. Any records which are already present in the main table are removed from the staging table up front. So the staging table only ever contains "brand new" records for that particular file.

So really the process is :

Load file
Remove anything already present in main table
Count file
Append to main table
Load file
Remove anything already present in main table
Count file
Append to main table

So it's not the same record propagating from file to file, it's only the dupe within the same file.

I like your point about the time component! There is a time component in the source file but it is fairly redundant - a bog standard midnight stamp on every single date (basically, no decimal component to the date integer). My query to read the date into the staging table takes only the date part anyway. So that shouldn't be an issue (dumb assumption?? :confused:)

I'm going to try adding DISTINCT to the query to read the data from the raw source to the staging table to see if that has any effect. Then the dupe should never make it as far as the count query and the count should therefore be accurate (?) I hadn't bothered including it before as I had no reason to believe there would be dupes in the source, and even if there were any, the keys on the table would prevent them from being an issue.

Only reason I'm hesitant is, if DISTINCT isn't removing the dupe in the count query, the same issue may arise when I'm creating the staging table. I've checked the source file and the dupe records are absolutely identical. Stumped!
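
For what it's worth, the staging-table build I have in mind looks roughly like this - a sketch only, with an illustrative linked-sheet name (lnkBalanceSheet) and placeholder columns (RawDate etc.) standing in for the real ones; one caveat is that DISTINCT can't be applied across Memo/Long Text or OLE fields:

Code:
' Sketch only: rebuild the staging table with exact duplicates stripped at source.
' lnkBalanceSheet, RawDate and the other column names are illustrative placeholders.
strSQL = "SELECT DISTINCT DateValue([RawDate]) AS BalanceDate, [AccountID], [Amount] " & _
         "INTO [tblBalanceHolding] FROM [lnkBalanceSheet]"
CurrentDb.Execute strSQL, dbFailOnError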
 
basically, no decimal component to the date integer
You mean date double, but I get your point :)
Though doubles are not always what we perceive them to be.... 1.1 isn't always 1.1 but actually 1.099999999999999999678546763478, which in turn can cause issues, but these are few and far between. In particular I have not seen this happen with (truncated) dates at all.

Not 100% sure if the distinct will work with the OR or not...
might it cause issues between Emerging being True/False to satisfy the individual where clauses?

Don't think it should, but as long as we are guessing....
 
D'uhhh - sorry, yes, I meant double! Or rather, I meant the 'double' value was effectively an integer. Not getting into the semantics - we both know what we mean! :rolleyes:

Yeah I don't think the DISTINCT would have a problem with the OR either? Which is why I found it odd in the first place.

Hopefully, applying the DISTINCT when I'm generating the staging table will do the trick. No WHERE clauses there; I take everything in and then whittle down any nonsense before appending. Fingers crossed...
 
Just a thought, not knowing the details of your set up: could you put a unique composite index on the staging table? Seems this would prevent any duplicate on the combination of fields making up this index at the DBMS level.
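
Something along these lines, perhaps - just a sketch, using the two fields from the count query as the composite key:

Code:
CurrentDb.Execute _
    "CREATE UNIQUE INDEX idxNoDupes " & _
    "ON tblBalanceHolding (BalanceDate, AccountID)", dbFailOnError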
 
Hmmm, I could... But the staging table is temporary and gets dumped after each import has been completed. I use a simple SELECT INTO statement to generate the table from scratch. Might have difficulties applying a composite index to the table after records (which may or may not comply with the index) are already present. Which means I would have to create the blank table and then INSERT INTO instead. Would really rather just strip out the dupes with a DISTINCT statement than go to all the effort of defining a temporary table via VBA. Appreciate this does actually sound like better practice though...
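
If it ever comes to that, the VBA needn't be too painful - a rough sketch only, with illustrative table and field names rather than the actual schema:

Code:
' Rough sketch of the "define first, then append" route; names are illustrative
With CurrentDb
    .Execute "CREATE TABLE tblStaging ([BalanceDate] DATETIME, [AccountID] LONG, [Amount] CURRENCY)", dbFailOnError
    .Execute "CREATE UNIQUE INDEX idxNoDupes ON tblStaging ([BalanceDate], [AccountID])", dbFailOnError
    ' No dbFailOnError here: rows violating the unique index are skipped instead of aborting the append
    .Execute "INSERT INTO tblStaging ([BalanceDate], [AccountID], [Amount]) " & _
             "SELECT [RawDate], [AccountID], [Amount] FROM [lnkBalanceSheet]"
End With

Running the append without dbFailOnError means Jet just drops the rows that violate the index rather than rolling back the whole insert, which is effectively the DISTINCT behaviour enforced at the engine level.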
 
But the staging table is temporary and gets dumped after each import has been completed.

The data gets removed after each import, or you remove and rebuild the table?
If only data gets dumped, then you can put the unique composite index on the table before you add any data.

If you do the unique composite index and do an import, duplicates will not be written to the table. But once you delete the data, you could only identify duplicates within the next import. That is the dupe ignore process is only for during any import session.
 
No, the entire table gets dropped completely and is created anew with each file.
 
I'm not following the rationale for the process.
If you are recreating the table (and I don't know if it's exactly the same structure, but if it is) after each file, why exactly?

I guess I'm missing a key point???

Can you give us a sample of data in the main table and some data coming in in the import?

You are loading/appending data into a main table. Where exactly does the duplication fit in the process?
 
I'm only recreating the staging table - not the main table.

I create a linked file to a spreadsheet and SELECT INTO a temporary staging table. I then run a couple of additional queries between the staging table and some static tables so that I can keep the data normalised.

Then, I run a number of integrity checks, namely to see whether there are any records in the staging table which already exist in the main table. If there are, I check to see if any details have changed (which would prompt a warning and an option to overwrite the existing details - a fairly rare occurrence, but it gives me peace of mind) before eventually deleting any duplicate records, leaving me with only "brand new" records in the staging table. I append these to the main table and purge the staging table, ready for the next import.

As it stands, I am only guarding against duplication between the staging table and the main table. My problem appears to be duplication within the staging table.
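
That "remove anything already present" step is essentially a correlated delete - a sketch only, with tblBalances standing in for whatever the main table is actually called:

Code:
DELETE tblBalanceHolding.*
FROM tblBalanceHolding
WHERE EXISTS (
    SELECT 1 FROM tblBalances
    WHERE tblBalances.BalanceDate = tblBalanceHolding.BalanceDate
      AND tblBalances.AccountID = tblBalanceHolding.AccountID
);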
 
This is not for finding FILES. This approach, as I tried to explain it, is about preventing duplicates.

The story as I understand it is as follows:
The OP has a spreadsheet that he has linked to his Access database.
He has a staging table and a Main Table in his database.
He imports data into the staging table.
He wants to move records from the staging table to the main table if certain conditions exist.
(These conditions are still not clear to me).
(He mentions if records already exist in the Main table, then he says he sees if details have changed)
(He also says only brand new records in the staging table and APPEND these to the main table)
He feels the duplication is in the staging table.

My suggestion (without knowing what makes a duplicate a duplicate) is: put a unique composite index on the staging table. If there are duplicates based on the unique composite index, the record will not be added to the staging table. By definition the unique composite index only allows unique values, so a duplicate is rejected by the DBMS itself. So the staging table for this import session can only have unique values.
He then removes the staging table, recreates the staging table, then proceeds to import another spreadsheet. And the process repeats until he has finished all spreadsheets.

I don't know the details by which the OP moves/appends/updates existing records in the main table. He said APPEND only new records. There is no need to remove the staging table and then recreate it -- but we don't know the nitty-gritty details.

Please note:
My comment re "That is the dupe ignore process is only for during any import session." makes little sense as I re read it.

What I should say is. Since the OP is removing the staging table after importing each sheet, we can't know if there are duplicates in SheetA and sheetB - perhaps this isn't part of the OP's issue, but we don't know that for sure. My suggestion was for preventing duplicates during the import of data. If the OP imports only 1 sheet in a session, then only duplicates in that 1 spreadsheet would be identified and prevented from being stored in the staging table.
 
jdraw,

Apologies for any lack of clarity - but you seem to have it pretty much nailed! You are correct in that technically there is no reason to purge the staging table and then recreate it each time. However, I have multiple import processes from multiple sources which utilise some modular functions. It is far easier for me to create / dump the staging table(s) on the fly than custom build multiple staging tables which will take up space in the deployed FE and make it more difficult for me to manage. By creating them at runtime, I can let each individual FE take care of it itself and I just have to worry about the permanent tables in the BE. But I take your point on board.

The conditions for the staging element are fairly complex and I didn't want to overload the thread with lots of redundant detail. Suffice to say, it is necessary to examine the incoming data and identify / strip out problem data prior to appending to the permanent tables. This is partly (but not entirely) due to data integrity concerns with the incoming data (over which I have no control).

The duplication is definitely within the staging table (i.e. duplication within the same worksheet). And I agree with your suggestion to add a composite index to the staging table to prevent them. What puzzles me - still - is why SELECT DISTINCT doesn't do the same job?

But thanks for the detailed responses, it is very much appreciated!
 
Distinct records have exactly the same values in all of the fields in your select list. For example, if you are selecting DISTINCT dates and have records such as 12/3/2015 8:23:15 AM and 12/3/2015 8:23:16 AM -- these are NOT duplicates.

Please show us a sample of your "duplicates".
 
I understand that jdraw - which is why I'm puzzled - the records (or Excel rows) in question had identical values for all fields (or cells), including the one date (or datetime) field / cell.

I can't post the actual sample as it contains sensitive information. And little point in me posting a share-safe replica as, naturally, I will simply post the same record twice! :rolleyes:

Suffice to say, though, I have examined them thoroughly and cannot find any element in one that would distinguish it from the other.

It also seems to be a pretty rare occurrence; I haven't seen a similar issue since I first posted this...!
 
Is this sorted?

The easiest way is to compare in code then, sort of like this:

Code:
Dim fld As DAO.Field
' record1 and record2 are DAO recordsets positioned on the two suspect rows
For Each fld In record1.Fields
    If Nz(record1.Fields(fld.Name).Value, "<null>") <> Nz(record2.Fields(fld.Name).Value, "<null>") Then
        Debug.Print "Mismatch in field: " & fld.Name
    End If
Next fld
Note that Null and a ZLS (zero-length string) may appear the same on inspection, so watch out for Nulls.
 
including the one date (or datetime) field / cell
It may be formatting to the same dd/mm/yyyy hh:mm:ss value but have a slightly different time element - it is handled differently between SQL Server and Access for example, so if your spreadsheet was originally sourced from SQL Server, this may explain the difference.

Suggest viewing it as a number, rather than a formatted date, although Dave's suggestion should pick it up.
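
e.g. something like this in the Immediate window (with rst positioned on each suspect row in turn) would show the full underlying Double rather than the formatted date:

Code:
' Print the raw Double behind the date so any stray time fraction becomes visible
Debug.Print Format$(CDbl(rst!BalanceDate), "0.0000000000")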
 
