@isladogs shows a representation with variants for Comparison of Grouped Aggregate Queries. I've been looking into this, using the test environment, which I find clearer, according to Speed tests and speed comparisons for everyone. This measures a single run of each query.
I used an additional query that improves the previously optimal query. Because much of the discussion relates to indexes on SENTestHistory.Level and PupilData.Reading, there are test runs on all four ways of indexing the two fields. The results are as follows:
The following general statements can be made:
- The indexes on the named fields play no role in the actual grouping and minimally to no role in aggregation. COUNT(TH.Level) can be replaced by the more optimal COUNT(*).
- The non-equi join SENTestHistory.Level < PupilData.Reading:
So: These two indexes are not used in the query variants and are therefore superfluous here. The tendentially best results are when they are not set. However, it seems that if they exist despite not being used, they lead to processing and confusion. The two outliers in the subquery variant are particularly affected by confusion. Here I would simply say: Jet is crazy (is overwhelmed).
- I've often made statements about Pro subquery. In the variant used (correlated subquery with tablescan per execution) I would not have thought of using it because the performance here is devastating and the disaster grows exponentially with increasing dataset numbers. Here in the test it is only executed 1482 times (records from PupilData) and it's bad. Therefore, follow-up thoughts are also academic/superfluous.
Discussion of the most optimized query to date (qryCountCriteria):
SELECT PD.PupilID, PD.Surname, PD.Forename, PD.YearGroup, PD.Reading, PD.ReadingDate, Count(TH.Level) AS CountOfLevel
FROM PupilData AS PD INNER JOIN SENTestHistory AS TH ON PD.PupilID = TH.PupilID
WHERE TH.Type="Reading" And TH.Level<PD.Reading
GROUP BY PD.PupilID, PD.Surname, PD.Forename, PD.YearGroup, PD.Reading, PD.ReadingDate
HAVING Count(TH.Level)>1
ORDER BY PD.Surname, PD.Forename
Grouping is a comparative operation (done after filtering into FROM and WHERE parts) and can potentially use an index. There is no index for the entire field list used, and that doesn't really make sense either. Without index: Add up the required bytes in this list. These must be compared record by record. A single key, which is indexed anyway, does a much better job.
My suggestion:
Grouping with TH.PupilID (not PD.PupilID), because the actual aggregation takes place in TH, so does grouping.
Sorting is the last in query execution, here over remaining 97 records. With that number, thoughts about index don't matter much anymore (calculation on table field breaks index usage).
Finally, a content-related question, detached from syntax and performance topics:
String: '9:03' > '11:09', '11:09' < '12:06'
Can this comparison bring correct results in this format?
I used an additional query that improves the previously optimal query. Because much of the discussion relates to indexes on SENTestHistory.Level and PupilData.Reading, there are test runs on all four ways of indexing the two fields. The results are as follows:
The following general statements can be made:
- The indexes on the named fields play no role in the actual grouping and minimally to no role in aggregation. COUNT(TH.Level) can be replaced by the more optimal COUNT(*).
- The non-equi join SENTestHistory.Level < PupilData.Reading:
Source: Michael Zimmermann: Performance in Queries from 2005Theta joins, i.e. comparisons like
A.id < B.id
A.id <= B.id
A.id = B.id * 2
lead to a cross join without using the index.
So: These two indexes are not used in the query variants and are therefore superfluous here. The tendentially best results are when they are not set. However, it seems that if they exist despite not being used, they lead to processing and confusion. The two outliers in the subquery variant are particularly affected by confusion. Here I would simply say: Jet is crazy (is overwhelmed).
- I've often made statements about Pro subquery. In the variant used (correlated subquery with tablescan per execution) I would not have thought of using it because the performance here is devastating and the disaster grows exponentially with increasing dataset numbers. Here in the test it is only executed 1482 times (records from PupilData) and it's bad. Therefore, follow-up thoughts are also academic/superfluous.
Discussion of the most optimized query to date (qryCountCriteria):
SELECT PD.PupilID, PD.Surname, PD.Forename, PD.YearGroup, PD.Reading, PD.ReadingDate, Count(TH.Level) AS CountOfLevel
FROM PupilData AS PD INNER JOIN SENTestHistory AS TH ON PD.PupilID = TH.PupilID
WHERE TH.Type="Reading" And TH.Level<PD.Reading
GROUP BY PD.PupilID, PD.Surname, PD.Forename, PD.YearGroup, PD.Reading, PD.ReadingDate
HAVING Count(TH.Level)>1
ORDER BY PD.Surname, PD.Forename
Grouping is a comparative operation (done after filtering into FROM and WHERE parts) and can potentially use an index. There is no index for the entire field list used, and that doesn't really make sense either. Without index: Add up the required bytes in this list. These must be compared record by record. A single key, which is indexed anyway, does a much better job.
My suggestion:
SQL:
SELECT
TH.PupilID,
FIRST(PD.Surname) AS Surname,
FIRST(PD.Forename) AS Forename,
FIRST(PD.YearGroup) AS YearGroup,
FIRST(PD.Reading) AS Reading,
FIRST(PD.ReadingDate) AS ReadingDate,
COUNT(*) AS CountOfLevel
FROM
PupilData AS PD
INNER JOIN SENTestHistory AS TH
ON PD.PupilID = TH.PupilID
WHERE
TH.Type = "Reading"
AND
TH.Level < PD.Reading
GROUP BY
TH.PupilID
HAVING
COUNT(*) > 1
ORDER BY
FIRST(PD.Surname),
FIRST(PD.Forename)
Grouping with TH.PupilID (not PD.PupilID), because the actual aggregation takes place in TH, so does grouping.
Sorting is the last in query execution, here over remaining 97 records. With that number, thoughts about index don't matter much anymore (calculation on table field breaks index usage).
The second sentence is wrong, as the detailed QEP's already show: The indexes of Type and PupilID are used.The use of indexes on suitable fields is generally beneficial in speeding up searches and sorts.
However, it appears to offer no benefit in aggregate queries.
Finally, a content-related question, detached from syntax and performance topics:
SENTestHistory.Level < PupilData.ReadingNOTE: Reading age is expressed in years:months e.g. 13:07. Students take targeted tests aimed at a certain range e.g. 9:03 to 11:09
String: '9:03' > '11:09', '11:09' < '12:06'
Can this comparison bring correct results in this format?
Last edited: