Solved Query runs slow with NOT IN

Brother Arnold · Nov 28, 2022

I'm running a simple query that provides a list of client IDs that are in the client table but are not linked to the monitoring table. It's running a little slow and if I click to go to the last record for example it just hangs. Client has 18k records, monitoring 12k so result is 6k. The IDs are indexed on both tables

Here's the code

Code:

SELECT tblClient.ID
FROM tblClient
WHERE tblClient.ID NOT IN (SELECT tblMonitoring.client_ID FROM tblMonitoring);

Have I got the code right? Are there any tricks to speed it up?

pbaldy · Nov 28, 2022

A join would likely be faster. Play with the unmatched query wizard.

isladogs · Nov 28, 2022

Agree with @pbaldy
Subqueries are always slower and should only be used where no better method exists.
In this case an unmatched query will be FAR faster and a wizard is available to assist you

CORRECTION : slower

Brother Arnold · Nov 28, 2022

pbaldy said:
A join would likely be faster. Play with the unmatched query wizard.

Thanks, I completely forgot that existed. I don't work in Access enough to keep everything fresh in my mind

pbaldy · Nov 28, 2022

Happy to help.

ebs17 · Nov 29, 2022

... WHERE tblClient.ID NOT IN (SELECT ...
The NOT IN prevents index use. Therefore the query must be slow.

The following variant will run about as fast as the JOIN variant, despite using a subquery.

SQL:

SELECT
   tblClient.ID
FROM
   tblClient
WHERE
   NOT EXISTS
      (
         SELECT
            NULL
         FROM
            tblMonitoring
         WHERE
            tblMonitoring.client_ID = tblClient.ID
      )

Not using subqueries can be a personal stylistic device. A blanket disqualification is not convincing, but also reveals a lack of knowledge.

isladogs · Nov 29, 2022

@ebs17
We've had this discussion before and you won't be surprised that I disagree with part of your comments

Allen Browne wrote this in one of his articles on subqueries: Microsoft Access tips: Surviving subqueries (allenbrowne.com)

Subqueries are amazingly flexible and incredibly frustrating. They have an important place in your repertoire, but you must have other tricks up your sleeve as well.
In general, subqueries are slower and less stable than JOINs. You must test them with real-world data samples to know how they will perform.
They are certainly worth pursuing and mastering for the flexibility they provide.

I agree with all of that
Subqueries are an extremely useful tool and I use them wherever appropriate
I have also extensively tested their speed in comparison with other approaches to achieve the same result.

I agree that NOT IN will be slow and NOT EXISTS will be MUCH faster. However, an unmatched query may well be faster still

I wasn't impressed by your comment

Not using subqueries can be a personal stylistic device. A blanket disqualification is not convincing, but also reveals a lack of knowledge.

You are entitled to your opinions but there is no need to criticize others

ebs17 · Nov 29, 2022

I don't criticize people, I criticize statements like "Subqueries are always slow". ("I agree that ... will be MUCH faster.")
I criticize blanket statements along the lines of "put the contents of a subquery into its own stored query and all is well and fast".

I myself take a more differentiated approach, so I get to know strengths and weaknesses and can make well-founded choices.
In the example above, the question was not so much whether JOIN or subquery, but rather the use or non-use of indices for the amounts of data mentioned, and that is a crucial question in addition to the query design. The advantage of the JOIN variant is that the developer is less likely to implement a non-use of the JOIN himself. I could do that very easily if I wanted to.
It is also true that the JOIN variant interferes with or even prevents the updateability of the query. So there are more criteria to choose from than just a maybe slightly higher speed.

I don't know unstable queries due to the use of subqueries, at least not from my pen. Performance can be measured. Above all, it must be sufficient for the required task.

Pat Hartman · Nov 29, 2022

@ebs17 We DO KNOW that Jet/ACE do NOT optimize subqueries efficiently. Therefore, unless you have no other way of obtaining the result you need, you should use the more efficient method which will be a join, either left or inner. Therefore in 100% of cases, I start with a join as long as a join will solve the problem. I'm not sure why you seem to be encouraging starting with a subquery and THEN using a join if that doesn't work.

Colin has done an excellent job of creating a tool to measure performance. And as we can see by this example, the join wins again. Why not teach students to start with a join unless he knows from experience that the join will not work?

Is there some RDBMS out there where the subquery is the superior method? Even when I was working with DB2, Oracle, and SQL Server almost exclusively, I always used joins rather than subqueries unless I actually needed a subquery. I am always flabbergasted by the number of examples I see posted which use subqueries but don't need to. I wonder if there is some teacher out there that doesn't teach people how to use joins

Galaxiom · Nov 30, 2022

ebs17 said:
... WHERE tblClient.ID NOT IN (SELECT ...
The NOT IN prevents index use. Therefore the query must be slow.

The following variant will run about as fast as the JOIN variant, despite using a subquery.

SQL:

SELECT tblClient.ID FROM tblClient WHERE NOT EXISTS ( SELECT NULL FROM tblMonitoring WHERE tblMonitoring.client_ID = tblClient.ID )

Not using subqueries can be a personal stylistic device. A blanket disqualification is not convincing, but also reveals a lack of knowledge.

That query is also slow because it uses a correlated subquery which runs for every record in tblClient.

I've never had speed problems with uncorrelated subqueries.

ebs17 · Nov 30, 2022

Now I've given @Galaxiom a like because he considers the subquery to be a bit nuanced and not something to avoid altogether if there's any alternative. However, the query with the NOT IN construct is not a correlated subquery, it is only executed once. The NOT EXISTS variant I have shown is such a correlated query, but EXISTS has some special technical features.

In a forum that is no longer active, there was practically the same task as mentioned above, except that it was a delete query and not a selection query. Due to the change from NOT IN to NOT EXISTS, the query took almost half a second instead of several minutes (according to that user's measurement). The bottom line was an acceleration by a factor of about 2500. You could really hear and feel the AHA of this user.

isladogs · Nov 30, 2022

I like to base my conclusions on data.
I ran some tests on 2 almost identical tables of 10000 UK postcodes tblSource & tblData where one randomly selected record in tblData was changed

I measured the time taken to look for the unmatched record using 3 queries as discussed above:

qryNotIN

Code:

SELECT tblData.Postcode
FROM tblData
WHERE tblData.Postcode NOT IN (SELECT tblSource.Postcode FROM tblSource);

qryNotEXISTS

Code:

SELECT tblData.Postcode FROM tblData
WHERE (((Exists (SELECT NULL FROM tblSource WHERE tblSource.Postcode = tblData.Postcode))=False));

qryUnmatched

Code:

SELECT tblData.Postcode
FROM tblData LEFT JOIN tblSource ON tblData.[Postcode] = tblSource.[Postcode]
WHERE (((tblSource.Postcode) Is Null));

I deliberately made the first two queries the same as in earlier posts and used the wizard for the third query

Unfortunately, the JET ShowPlan feature cannot handle subqueries so we cannot know the precise execution plan used by Access for these
So, for the first set of tests, the Postcodes field was deliberately NOT indexed

These were the results

Code:

Check for unmatched record - NON-INDEXED FIELD:
1. Query qryNotIN:  Time Taken = 85.1875 s
2. Query qryNotExists:  Time Taken = 38.875 s
3. Query qryUnmatched :  Time Taken = 0.171875 s


Check for unmatched record - NON-INDEXED FIELD:
1. Query qryNotIN:  Time Taken = 84.15234 s
2. Query qryNotExists:  Time Taken = 38.51563 s
3. Query qryUnmatched :  Time Taken = 0.111437 s


Check for unmatched record - NON-INDEXED FIELD:
1. Query qryNotIN:  Time Taken = 83.67188 s
2. Query qryNotExists:  Time Taken = 38.79688 s
3. Query qryUnmatched :  Time Taken = 0.125 s

Both subqueries were very slow but the unmatched query was very fast

However, of course, fields used in searches should be indexed so Access can search the indexes rather than scan all records
So I then indexed the Postcodes field in each table and repeated the tests.

Code:

Check for unmatched record - INDEXED FIELD:
1. Query qryNotIN:  Time Taken = 84.67188 s
2. Query qryNotExists:  Time Taken = 0.1523438 s
3. Query qryUnmatched :  Time Taken = 0.1171875 s


Check for unmatched record - INDEXED FIELD:
1. Query qryNotIN:  Time Taken = 84.54297 s
2. Query qryNotExists:  Time Taken = 0.1367188 s
3. Query qryUnmatched :  Time Taken = 0.109375 s


Check for unmatched record - INDEXED FIELD:
1. Query qryNotIN:  Time Taken = 84.25391 s
2. Query qryNotExists:  Time Taken = 0.140625 s
3. Query qryUnmatched :  Time Taken = 0.109375 s

Conclusions:
1. Query NotIN cannot use the index (as stated above by @ebs17) so it runs just as slowly - approx 84s
2. Query qryNotEXISTS does use the index and runs almost 300x faster -approx 0.14s
3. Query qryUnmatched takes a similar time as before but is still the fastest of the 3 methods in these tests - approx 0.11s

Finally I repeated the last two tests (after restoring the index to the Postcode field) but this time using 1 million records in the two tables
For obvious reasons, I didn't bother with qryNotIN in this case as it would probably have crashed Access

The results were:

Code:

Check for unmatched record - INDEXED FIELD:
1. Query qryNotExists:  Time Taken = 3.75 s
2. Query qryUnmatched :  Time Taken = 2.65625 s


Check for unmatched record - INDEXED FIELD:
1. Query qryNotExists:  Time Taken = 3.90625 s
2. Query qryUnmatched :  Time Taken = 2.75 s


Check for unmatched record - INDEXED FIELD:
1. Query qryNotExists:  Time Taken = 4.109375 s
2. Query qryUnmatched :  Time Taken = 2.6875 s

Although both queries were fast, the unmatched query took about 2.7s compared to 3.9s for NotEXISTS

ebs17 · Nov 30, 2022

Thank you for testing and posting the results.

From the results one would conclude that with qryUnmatched it doesn't matter whether there are indices or not. I find that astonishing to the point of impossible. After all, setting referential integrity on a relationship automatically indexes the foreign key because that's so important.

@Pat Hartman:

We DO KNOW that Jet/ACE do NOT optimize subqueries efficiently.

That's a killer argument. But not for me.

1) What exactly should an optimization achieve with such a clear task?
2) The first optimizer is the developer who formulates the query. A built-in NOT or <> as an index killer cannot be ignored by a SQL optimizer and can rarely be optimized away. Sometimes it should and will be enough to simply execute the query as it is.

isladogs · Nov 30, 2022

ebs17 said:
Thank you for testing and posting the results.

From the results one would conclude that with qryUnmatched it doesn't matter whether there are indices or not. I find that astonishing to the point of impossible. After all, setting referential integrity on a relationship automatically indexes the foreign key because that's so important.

To be honest I was also surprised how fast the unmatched query ran on the smaller recordset of 10K records when the field wasn't indexed

Just retested with 1 million records in tblData and 10000 in tblSource

Code:

Check for unmatched record - NON-INDEXED FIELD:
2. Query qryUnmatched :  Time Taken = 4.851563 s

Check for unmatched record - INDEXED FIELD:
2. Query qryUnmatched :  Time Taken = 2.652344 s

So indexing does make a difference even to the unmatched query
As that doesn't use a subquery, I ran the query execution plans for that query in both cases

First of all - not indexed

Code:

--- qryUnmatched ---Postcode field NOT INDEXED---

- Inputs to Query -
Table 'tblData'
Table 'tblSource'
- End inputs to Query -

01) Sort table 'tblData'
02) Sort table 'tblSource'
03) Outer Join result of '01)' to result of '02)'
      merging indexes
      join expression "tblData.[Postcode]=tblSource.[Postcode]"
04) Restrict rows of result of 03)
      by scanning
      testing expression "tblSource.Postcode Is Null"

I'm not clear what merging indexes means above as the join field has no indexes in either table ...hence scanning the table records

And again with the field indexed

Code:

--- qryUnmatched ---Postcode field INDEXED---

- Inputs to Query -
Table 'tblData'
Table 'tblSource'
    Using index 'Postcode'
    Having Indexes:
    Postcode 10000 entries, 37 pages, 10000 values
      which has 1 column, fixed
    ID 10000 entries, 18 pages, 10000 values
      which has 1 column, fixed, clustered and/or counter
- End inputs to Query -

01) Outer Join table 'tblData' to table 'tblSource'
      using index 'tblSource!Postcode'
      join expression "tblData.[Postcode]=tblSource.[Postcode]"
02) Restrict rows of result of 01)
      by scanning
      testing expression "tblSource.Postcode Is Null"

Perhaps you would like to try a similar test for yourself and let us know the outcomes

Pat Hartman · Nov 30, 2022

That's a killer argument. But not for me.

So, you don't believe the results of the tests Colin ran?????? Clearly, the Access query engine does not create an optimal execution plan for queries with subqueries. Access doesn't show us what it did create but the results are pretty clear. Why do you keep pushing to have novices, who have no idea that an execution plan is even created for each query, to use subqueries when they aren't necessary?

Over the years, MS has cleaned up the execution plan creation process. In earlier versions, the space used to create the plan was abandoned and generated serious bloat. Now they're doing something different that allows them to clean up after themselves without leaving bloat in the FE. But, we can't make them fix the inefficient subquery problem. It just isn't a high enough priority when there are other options in most cases.

isladogs · Dec 16, 2022

Following on from my earlier test results, I have just published a web article discussing the speed tests I used a few weeks ago to compare three methods of identifying unmatched records:
a) NOT IN subquery (as used by the OP)

Code:

SELECT tblData.Postcode FROM tblData WHERE tblData.Postcode NOT IN (SELECT tblSource.Postcode FROM tblSource);

b) NOT EXISTS subquery
Originally I saved this query exactly as @ebs17 suggested:

Code:

SELECT tblData.Postcode FROM tblData WHERE NOT Exists (SELECT NULL FROM tblSource WHERE tblSource.Postcode = tblData.Postcode);

Access then proceeded to rewrite this in a way it deemed more efficient!

Code:

SELECT tblData.Postcode FROM tblData WHERE (Exists (SELECT NULL FROM tblSource WHERE tblSource.Postcode = tblData.Postcode))=False;

I tested both. The times were almost identical in each case!

c) Simple unmatched query using the wizard (my original suggestion)

Code:

SELECT tblData.Postcode FROM tblData LEFT JOIN tblSource ON tblData.[Postcode] = tblSource.[Postcode] WHERE (((tblSource.Postcode) Is Null));

I tested this for both non-indexed & indexed fields on two tables of 10K records, repeating each test 10 times to minimise the effects of natural variation and calculated averages etc

The non-indexed results were:

The two subquery methods were both very slow though the NOT IN query is far slower than NOT EXISTS.
By comparison, the simple unmatched query created using the query wizard is blisteringly fast!

I then indexed the search field in both tables and repeated the tests

The differences from the first set of tests are interesting:
a) The NOT IN query is almost as slow as before indicating that it cannot make use of the indexing
b) The NOT EXISTS query is now very fast indicating that it does use indexing
c) The unmatched query is still (just) the fastest though indexing appears to play little part in this set of results.

I then repeated the two faster tests for a larger indexed dataset of 1 million records in one of the tables

The greater efficiency of the simple unmatched query became even more apparent in this case with the unmatched query now taking just under 75% of the time taken by NOT EXISTS. I didn't attempt to time the NOT IN query for this large dataset for what, I hope, are obvious reasons.

Finally, I repeated those two tests using 10 million records. The average results in this case are shown below

In both cases, the total time is about 9x larger and the time difference is therefore even more significant

Conclusions
A NOT IN subquery is a very inefficient method of finding unmatched records as it cannot use indexes.
By comparison, a NOT EXISTS subquery is far better as it does use indexes

However, an unmatched query (such as that created by the wizard) is more efficient (faster) than either type of subquery
This becomes even more noticeable with larger datasets.

Although the NOT EXISTS subquery does work well for indexed fields, it is more complex to create and still significantly slower than a simple unmatched query for large datasets. I haven't changed my original opinion and see no benefit in using subqueries for this purpose.

The article includes two versions of the test database - indexed & not-indexed so anyone interested can try this for themselves.
It can be found at

Speed Comparison Tests 12

This article compares three methods of finding differences in records: the unmatched query created with the wizard compared to two methods using subqueries. It is the twelfth in a series of articles discussing various tests done to compare the efficiency of different approaches to coding...

www.isladogs.co.uk

Hope some of you find it interesting

Pat Hartman · Dec 16, 2022

Thanks for the report Colin. I hope we can put this to bed finally. The join is faster. I'm not asking you to test this but it is pretty likely that the results would be different for SQL server since so many people just automatically write sub queries even when there is no need to since a simple join would also work. I think they like the way the SQL looks when it is formatted or something. I've never figured it out.

isladogs · Dec 16, 2022

It may well be different for SQL Server.
Certainly not now, but I'm intending to do a number of tests using SS BE tables next year.

ebs17 · Dec 28, 2022

A little reflection from my point of view:

Query runs slowly with NOT IN

My first choice of the alternative is the unmatched query with the LEFT JOIN, by the way, before the entire discussion and before the measurements provided, because practical experience has taught me that. This would be amplified if one had to compare over more than one key. After all, everyone wants to use the simplest, best, and fastest... to leave room for bigger, right things to do.

However, the formulated task is also extremely simple: 2 tables, comparison via a key.
How can / should you formulate a query when the conditions are more extensive and complicated: More tables involved, additional criteria, additional requirements.
If the only additional requirement is that the query should be updatable, the LEFT JOIN construct will fail here.
What to do? Now that there is no other way (let's rule out the creation of additional temporary tables, because it's not optimal either), do we reach into the black box with the unloved and unoptimizable subqueries and pull out the closest available one?
You would only be familiar with the next best instead of the closest available if you didn't demonize and equate subqueries globally, but if you also dealt with them and as a result were able to fall back on experience and habits, i.e. if you had an idea of the effort, index use, correlation and similar things.

Subqueries are always slow and should only be used where no better method exists.

The discussion and the measurements have shown that even in the original constellation, a subquery does not necessarily have to be slow. You can deal with it. If you claim that you always use the fastest, then you have to be measured against it in all your actions. I think lived practice will find some contradictions, at each.

isladogs · Dec 28, 2022

ebs17 said:
You would only be familiar with the next best instead of the closest available if you didn't demonize and equate subqueries globally, but if you also dealt with them and as a result were able to fall back on experience and habits, i.e. if you had an idea of the effort, index use, correlation and similar things.

Despite your repeated attempts to wrongly typecast me as someone who neither uses nor understands how to use subqueries, let me state again that is far from the truth.

I have been using Access to a high level for over 25 years and I regularly use subqueries. I do so whenever it is beneficial.

For example, see my comments & quote by Allen Browne in post #7

Allen Browne wrote this in one of his articles on subqueries: Microsoft Access tips: Surviving subqueries (allenbrowne.com)
Subqueries are amazingly flexible and incredibly frustrating. They have an important place in your repertoire, but you must have other tricks up your sleeve as well.
In general, subqueries are slower and less stable than JOINs. You must test them with real-world data samples to know how they will perform.
They are certainly worth pursuing and mastering for the flexibility they provide.

I agree with all of that
Subqueries are an extremely useful tool and I use them wherever appropriate

My conclusions in post #16 were justified by data

Although the NOT EXISTS subquery does work well for indexed fields, it is more complex to create and still significantly slower than a simple unmatched query for large datasets. I haven't changed my original opinion and see no benefit in using subqueries for this purpose.

ebs17 said:
My first choice of the alternative is the unmatched query with the LEFT JOIN, by the way, before the entire discussion and before the measurements provided, because practical experience has taught me that. This would be amplified if one had to compare over more than one key. After all, everyone wants to use the simplest, best, and fastest... to leave room for bigger, right things to do.

Good. We agree. Yet you consistently argued the point earlier?

ebs17 said:
However, the formulated task is also extremely simple: 2 tables, comparison via a key.
How can / should you formulate a query when the conditions are more extensive and complicated: More tables involved, additional criteria, additional requirements.
If the only additional requirement is that the query should be updatable, the LEFT JOIN construct will fail here.

Why mention that? If I’m looking for records in one table but not another, updating the data isn’t normally relevant

ebs17 said:
The discussion and the measurements have shown that even in the original constellation, a subquery does not necessarily have to be slow. You can deal with it. If you claim that you always use the fastest, then you have to be measured against it in all your actions. I think lived practice will find some contradictions, at each.

Speed is always an important factor but so is the complexity of code used

The comment I made that does need editing is the line in post #3 where I wrongly omitted 2 letters. I have now amended it to read:

Subqueries are always slower and should only be used where no better method exists.

Perhaps you won't agree with that comment either.
If so, I challenge you to devise one or more subqueries that do EXACTLY the same task as a single well optimised query and run faster – better still significantly faster. If you can do so, please upload for others to benefit from your wisdom

I look forward to your response

Solved Query runs slow with NOT IN

Member

Wino Moderator

MVP / VIP

Member

Wino Moderator

Well-known member

MVP / VIP

Well-known member

Super Moderator

Super Moderator

Well-known member

MVP / VIP

Well-known member

MVP / VIP

Super Moderator

MVP / VIP

Super Moderator

MVP / VIP

Well-known member

MVP / VIP

Similar threads

Users who are viewing this thread