Need some help making a bad query good (AC2007)

AOB · Jul 3, 2015

Hi guys,

I have hastily put together an admittedly awful query - or rather a series of awful queries - in order to get around a duplication issue in a source file. I have no control over the source file so have to live with the duplication that exists there. And I would prefer to build a query to handle the duplication rather than introduce a requirement for the source to be refined prior to running the process.

Here is the background...

The source (a linked spreadsheet) has 3 key fields : Book, Centre & Source. All text fields. And something of the order of 30,000 rows / records

Each Book *should* be assigned to just one Centre (each Centre can be associated with multiple Books)

Books can appear in multiple records but, generally speaking, each instance of any Book *should* show the same Centre each time.

However, on inspection, it appears there are a handful of Books which have multiple Centres. In order to identify which Centre is correct, I have to refer to the Source field.

The dataset produced by my query should therefore show each distinct Book, with the relevant Centre, according to the following criteria :

If the Book / Centre relationship is consistent (only one Centre assigned to a particular Book), then include that Book / Centre combination
If the Book / Centre relationship is not consistent (more than one Centre assigned to a particular Book), then check the Source for each combination
- If the Source starts with "ABCD" then include that Book / Centre combination
- If none of the Sources starts with "ABCD" then omit that Book completely from the final dataset

Here are the queries that I've written in order to achieve the above...

qryDistinctBooks (this gives me the complete list of distinct, non-null, Books)

Code:

SELECT DISTINCT lnkBooks.[Book]
FROM lnkBooks
WHERE Nz(lnkBooks.[Book],"")<>""

qryDistinctBookCentreCombos (this gives me the complete list of distinct combinations of Book / Centre)

Code:

SELECT DISTINCT lnkBooks.[Book], lnkBooks.[Centre]
FROM lnkBooks
WHERE Nz(lnkBooks.[Book],"")<>""
    AND Nz(lnkBooks.[Centre],"")<>""

qryBooksWithMultipleCentres (this gives me the list of Books which are assigned to more than one Centre)

Code:

SELECT [qryDistinctBookCentreCombos].[Book], Count(*) AS NoOfDistinctCentres
FROM qryDistinctBookCentreCombos 
GROUP BY [qryDistinctBookCentreCombos].[Book]
HAVING Count(*)>1

qryBooksWithMultipleCentresSourceABCD (this gives me, for each of the Books assigned to more than one Centre, the Centre corresponding to a Source of "ABCD", if one exists)

Code:

SELECT qryBooksWithMultipleCentres.[Book], lnkBooks.[Centre], lnkBooks.[Source]
FROM qryBooksWithMultipleCentres 
LEFT JOIN lnkBooks 
    ON qryBooksWithMultipleCentres.[Book] = lnkBooks.[Book]
GROUP BY qryBooksWithMultipleCentres.[Book], lnkBooks.[Centre], lnkBooks.[Source]
HAVING lnkBooks.[Source] LIKE "ABCD*"

qryFinalBookToCentre (the query that uses all of the above to generate the final dataset)

Code:

SELECT qryDistinctBooks.[Book], qryBooksWithMultipleCentresSourceABCD.[Centre]
FROM qryDistinctBooks 
INNER JOIN qryBooksWithMultipleCentresSourceABCD 
    ON qryDistinctBooks.[Book] = qryBooksWithMultipleCentresSourceABCD.[Book]
 
UNION
 
SELECT qryDistinctBooks.[Book], qryDistinctBookCentreCombos.[Centre]
FROM qryDistinctBooks 
INNER JOIN qryDistinctBookCentreCombos 
    ON qryDistinctBooks.[Book] = qryDistinctBookCentreCombos.[Book]
WHERE qryDistinctBookCentreCombos.[Book] NOT IN 
    (SELECT qryBooksWithMultipleCentres.[Book] 
     FROM qryBooksWithMultipleCentres)

Horrendous, isn't it? I know, I know... I ran this yesterday and it took more than 12 hours to complete (I left it overnight) Not sure if it's as a result of the use of the LIKE operator in qryBooksWithMultipleCentresSourceABCD or something else. But it just generally looks pants.

Can anybody suggest how this can be improved? I need to get the run time down to minutes / seconds rather than hours / days!

Thanks in advance & apologies for the size of the post

Please keep the SQL-slagging to a minimum; I am more than aware just how awful the above is, I am not suggesting for one second that this is good query design!!

AOB

jdraw · Jul 3, 2015

Is this process a 1 time thing to clean up the file?
Or is this something done on repeatedly?

Do you have a test data file you could post?

AOB · Jul 3, 2015

No the process is performed monthly and with the latest version of the source file each time, provided by another party. So the source file could change each month (but the fundamental logic remains) Hence my preference to incorporate a query into the process to use the data appropriately.

Can't share the file as it contains sensitive information. Was hoping I had put sufficient detail into the post.

Broadly speaking though, it should turn something like this :

Code:

[COLOR=blue][B][Book] | [Centre] | [Source][/B][/COLOR]
[B][COLOR=blue]-------|----------|---------[/COLOR][/B]
 Alan  | Dublin   | ABCDBlah
 Barry | London   | WXYZBlah
 Conor | Paris    | ABCDBlah
 Conor | New York | WXYZBlah
 David | Geneva   | PQRSBlah
 David | Lisbon   | WXYZBlah

Into something like this :

Code:

[COLOR=blue][B][Book] | [Centre][/B][/COLOR]
[B][COLOR=blue]-------|----------[/COLOR][/B]
Alan   | Dublin
Barry  | London
Conor  | Paris

If that makes sense?

CJ_London · Jul 3, 2015

one of the reasons your query is slow is because it is working off an excel spreadsheet which is not indexed.

you should see a huge improvement if you import the data to a temporary table and index the fields

also this is a much simpler query

Code:

 SELECT distinct tblxl.book, tblxl.centre
FROM tblxl
WHERE (((tblxl.source) Like 'abcd*' Or (select count(*) from tblxl as T where book=[tblxl].book)=1));

jdraw · Jul 3, 2015

Agree 100% with CJ.

AOB · Jul 3, 2015

Agreed on the indexing certainly! It did cross my mind (this isn't one of my own DB's, I inherited it from a former colleague) as this would normally be one of the first things I'd correct!

Love the reduced query also, let me give that a bash

Thanks guys!

CJ_London · Jul 3, 2015

be interested to know how much quicker it is running off your spreadsheet and if you do decide to import and index, what the improvement is there

AOB · Jul 3, 2015

Me too - I'm currently testing the simplified query on unindexed linked s/s, still quite slow but will let it run its course to see how slow.

Then I'll add a step to import the data into a dedicated table and repeat - will report back...

AOB · Jul 3, 2015

Gave up on querying the spreadsheet directly after about 45 minutes...

But the introduction of an import procedure into a new table, addition of indexes to same and query on the indexed local table? Roughly 6-8 seconds!

Thanks so much!

jdraw · Jul 3, 2015

Yes - great learning experience. Database is different than spreadsheet -- different tools for different jobs.
Good luck with your project.

CJ_London · Jul 3, 2015

was that with your 'bad' query or the simplified query?

just for fun, can you compare the two?

vbaInet · Jul 4, 2015

Another twist: move the subquery to the FROM clause and use the Left() function.
By moving it to the FROM clause you will see improvement and because the field isn't indexed Left() in this instance should perform better than LIKE:

Code:

SELECT  A.book, 
	A.centre,
FROM 	tblxl AS A
	INNER JOIN ( SELECT   book, 
		    	      Count(*) AS BookCount
		     FROM     tblxl
		     GROUP BY book 
		   ) AS B
	ON	   A.book = B.book
WHERE	B.BookCount = 1
    	OR Left(A.[source], 4) = "ABCD";

NB: Of course, nothing will beat the local indexed table CJ suggested.

CJ_London · Jul 4, 2015

vbaInets query design should be faster still - particularly with indexing.

AOB · Jul 6, 2015

This is marvellous stuff, thanks guys!

just for fun, can you compare the two?

CJ, I had already deleted the awful queries but I was similarly curious so recreated them to test on an indexed table version of the linked spreadsheet. After 20 minutes I stopped it... I reckon it would have completed in considerably less time than the linked version but it was still prohibitively slow.

Another twist: move the subquery to the FROM clause and use the Left() function

vbaInet, thanks a million for the suggestion! I am already more than satisifed with the 6-8 second latency but I will incorporate your take on the problem anyway, not so much for the time-saving but for the learning curve.

As always, thank you all for your valued advice!

AOB · Sep 28, 2015

Guys,

Sorry to resurrect but... I've found an issue...

In a scenario where the same book / center combination appears more than once :

Code:

[B][COLOR=#0000ff][Book] | [Centre] | [Source][/COLOR][/B]
[B][COLOR=blue]-------|----------|---------[/COLOR][/B]
 Alan  | Dublin   | EFGHBlah
 Alan  | Dublin   | EFGHBlah

These queries :

Code:

SELECT distinct tblxl.book, tblxl.centre
FROM tblxl
WHERE (((tblxl.source) Like 'abcd*' Or (select count(*) from tblxl as T where book=[tblxl].book)=1));

Code:

SELECT  A.book, 
    A.centre,
FROM     tblxl AS A
    INNER JOIN ( SELECT   book, 
                      Count(*) AS BookCount
             FROM     tblxl
             GROUP BY book 
           ) AS B
    ON       A.book = B.book
WHERE    B.BookCount = 1
        OR Left(A.[source], 4) = "ABCD";

...excludes that book / centre combination from the final output?

Even though it's the same combination both times, the Count(*) shows 2 which thus fails to fulfill the criteria for inclusion?

I only want to exclude book / centre combinations where there are multiple different centres assigned to a given book - if the centre is consistent across multiple records, I want to keep it in?

Any suggestions how the query can be tweaked to allow for this?

jdraw · Sep 28, 2015

??
Given these records, what should the result be??

Book Centre ID Source
------ ---------- --- -----------
Alan Dublin 1 ABCDBlah
Barry London 2 WXYZBlah
Conor Paris 3 ABCDBlah
Conor New York 4 WXYZBlah
David Geneva 5 PQRSBlah
David Lisbon 6 WXYZBlah
Alan Dublin 7 EFGHBlah
Alan Dublin 8 EFGHBlah

AOB · Sep 29, 2015

Probably a bad choice of example - see revised records thus :

Code:

 [B][COLOR=#0000ff][Book] | [Centre] | [Source][/COLOR][/B]
[B][COLOR=blue]--------|----------|---------[/COLOR][/B]
 Alan   | Dublin   | ABCDBlah
 Barry  | London   | WXYZBlah
 Conor  | Paris    | ABCDBlah
 Conor  | New York | WXYZBlah
 David  | Geneva   | PQRSBlah
 David  | Lisbon   | WXYZBlah
 Elliot | Prague   | EFGHBlah
 Elliot | Prague   | WXYZBlah

Result should be as follows :

Code:

[B][COLOR=#0000ff][Book] | [Centre] [/COLOR][/B]
[B][COLOR=#0000ff]-------|----------[/COLOR][/B]
Alan   | Dublin
Barry  | London
Conor  | Paris
Elliot | Prague

Alan / Dublin makes it through as it is the only record for that book (source is irrelevant)
Barry / London makes it through for the same reason (source irrelevant)
Conor / Paris makes it through as that book / centre combination is the only one (for Conor) where the source starts with ABCD.
David does not make it through because there are multiple centres associated with it and none of them have a source starting with ABCD
Elliot / Prague makes it through, even though it is present multiple times, the book / centre combination is consistent (source is irrelevant)

jdraw · Sep 29, 2015

Here is one that seems to work: (Try it to confirm)

Code:

SELECT A.book
	,A.centre
FROM tblx1 AS A
INNER JOIN (
	SELECT book
		,Count(*) AS BookCount
	FROM tblx1
	GROUP BY book
	) AS B ON A.book = B.book
WHERE B.BookCount = 1
	OR Left(A.[source], 4) = "ABCD"

UNION

SELECT book
	,centre
FROM (
	SELECT tblX1.Book
		,Count(tblX1.Centre) AS CountOfCentre
		,tblX1.Centre
	FROM tblX1
	GROUP BY tblX1.Book
		,tblX1.Centre
	HAVING (((Count(tblX1.Centre)) > 1))
	);

Result:

Code:

book	centre
-----  -------
Alan	Dublin
Barry	London
Conor	Paris
Elliot	Prague

AOB · Sep 30, 2015

Mmmm, no that's not working for me jdraw?

It is returning a single record for multiple combinations of the same book / centre (a good thing)

But it's also returning multiple centres for certain books (i.e. multiple records for the same book)

(e.g. the "Conor" example below)

I have to have one record per book (a single book / centre combination)

To clarify, where I have the following in "tblx1" :

Code:

 [B][COLOR=#0000ff][Book] | [Centre] | [Source][/COLOR][/B]
[B][COLOR=blue]--------|----------|---------[/COLOR][/B]
 Conor  | New York | WXYZBlah
 Conor  | New York | WXYZBlah
 Conor  | Paris    | ABCDBlah
 Conor  | New York | WXYZBlah

The query is returning this :

Code:

 [B][COLOR=#0000ff][Book] | [Centre][/COLOR][/B]
[B][COLOR=blue]--------|----------[/COLOR][/B]
 Conor  | New York 
 Conor  | Paris

I need it to return this :

Code:

 [B][COLOR=#0000ff][Book] | [Centre][/COLOR][/B]
[B][COLOR=blue]--------|----------[/COLOR][/B] 
 Conor  | Paris

(Just to clarify - I notice you use tblx1 in the first half of the query, and tblX1 in the second half - I've assumed this is a typo, am I correct?...)

AOB · Sep 30, 2015

I've had a go - not as elegant as your suggestions but it seems to work?...

Code:

SELECT DISTINCT tblX1.Book, tblX1.Centre
FROM tblX1
INNER JOIN
    (SELECT A.Book, Count(A.Centre) AS NoOfCentres
     FROM 
         (SELECT DISTINCT tblX1.Book, tblX1.Centre 
          FROM tblX1)  AS A
     GROUP BY A.Book) AS B
ON tblX1.Book=B.Book
WHERE Left(tblX1.Source,4)="ABCD" And B.NoOfCentres>1
 
UNION
 
SELECT DISTINCT tblX1.Book, tblX1.Centre
FROM tblX1
INNER JOIN 
    (SELECT C.Book, Count(C.Centre) AS NoOfCentres
     FROM 
         (SELECT DISTINCT tblX1.Book, tblX1.Centre 
          FROM tblX1)  AS C
     GROUP BY C.Book) AS D 
ON tblX1.Book=D.Book
WHERE D.NoOfCentres=1

I say it seems to work because the current version of tblX1 has 29,987 unique / distinct books (albeit across 47,375 total records)

The above returns 29,923 book / centre combinations (with each book represented only once)

Suggesting the offset of 64 between the two must be those books which a) have multiple distinct centres assigned to them and b) none of which have a source starting with "ABCD" (and thus should be excluded anyway)

I've run a quick query to isolate those 64 missing books and checked the logic of the above and it seems to hold true.

So like I say, this seems to work - although I'm sure there's a more elegant way of doing it?

Need some help making a bad query good (AC2007)

Registered User.

Super Moderator

Registered User.

Super Moderator

Super Moderator

Registered User.

Super Moderator

Registered User.

Registered User.

Super Moderator

Super Moderator

AWF VIP

Super Moderator

Registered User.

Registered User.

Super Moderator

Registered User.

Super Moderator

Registered User.

Registered User.

Similar threads

Users who are viewing this thread