Using a custom aggregate function with the GROUP BY clause

funkykizzy · Dec 20, 2013

Hi all!

I am trying to calculate annual percentiles of a large set of data and I have only been successful at retrieving the percentile of the entire data set (and not by the grouping). See provided example database for code/query. Query1 is what I want to happen to make the Percentiles table.

Thanks!

plog · Dec 20, 2013

Can you provide specific data examples of what you want? Post some sample starting data (including field and table names) and then what data should be returned based on that sample data. Use this formatting:

TableNameHere
Field1Here, Field2Here, Field3Here
David, 13, 5/18/1998
Sally, 27, 3/31/2001
Tim, 4, 11/9/2008

funkykizzy · Dec 20, 2013

Sure!

I am looking to get the Percentiles table populated.

Here is what it should look like:
Percentiles
State, Method, Parameter, Year, Average, Sum, 90thPercentile
a, a, a, 2001, ##, ##, ##
a, a, a, 2002, ##, ##, ##
a, a, b, 2001, ##, ##, ##
a, a, b, 2002, ##, ##, ##
a, a, c, 2001, ##, ##, ##
a, a, c, 2002, ##, ##, ##
a, b, a, 2001, ##, ##, ##
etc.
Where the "##" represents a calculated value based on the grouping.

Does that help?

plog · Dec 20, 2013

Nope.

Give me starting data and then what the results should be based on that sample data.

funkykizzy · Dec 20, 2013

It was in my example database, but I don't think I posted it right. I'll try again.

Original data = Table1
Query for calculations = Query1
Results wanted = Percentiles

Is that better?

Thanks!

funkykizzy · Dec 20, 2013

I'm starting to think that it has something to do with the "*" in my Dcount arguments (in Module1). But I'm having trouble solving the issue.

Thanks!

plog · Dec 20, 2013

If all the values are to be 93, then this will do it:

90th: 93

funkykizzy · Dec 20, 2013

No, unfortunately all the percentiles are not 93. I want it to be calculated by year (in other words, each row should have a different value in the Percentile field).

I think it has to do with the "*" in my DCount method in the Criteria parameter of my DMin(,,DCount) method. But when I try to use "Query1" as the criteria, I get the error :"an action query cannot be used as a row source". What now?

How is the avg SQL function able to do this calculation correctly? And how do I get my custom function to operate the same way??

Thanks again!

plog · Dec 20, 2013

What now? Post some sample starting data and then what results should be based on that sample starting data.

That will help me get my head around what you want. Your words and code are not helping.

funkykizzy · Dec 20, 2013

Sorry.

OriginalData
State, Method, Parameter, Year, Samples, Value
a, a, a, 2001, 2, 34
a, a, a, 2001, 3, 56
a, a, a, 2001, 1, 4
a, a, a, 2002, 1, 27
a, a, a, 2002, 4, 38
a, a, a, 2002, 1, 13
a, a, b, 2001, 2, 43
a, a, b, 2001, 1, 25
a, a, b, 2001, 3, 64
etc...

Group by: state, method, parameter, year
and calculate the total number of samples, the average value and the 90th percentile
to result in:

Percentiles
State, Method, Parameter, Year, SumofSamples, AvgofValue, 90thPercentile
a, a, a, 2001, 6, 31.3, 51.6
a, a, a, 2002, 6, 26, 35.8
a, a, b, 2001, 6, 44, 59.8
etc...

How's that?

plog · Dec 20, 2013

Better. Now its time for explanations:

How is the 51.6 calculated for the first result? You are barred from using database jargon (Dcount, Dmin). Explain in english to a 6th grader how this calculation is made.

funkykizzy · Dec 20, 2013

Programmatically - it should be done with my code in Module1 (I can't help myself)

Theoretically (to a 6th grader) - if the data was sorted with the largest numbers on the top and the smallest on the bottom, the 90th percentile is the point in the data at which 10% of the bigger numbers are above it, and 90% of everything else (the lower numbers) are below it). Similarly, the 50th percentile is the spot in a list of data sorted the same way where 50% of the data is above it and 50% of the data is below it.

Literally - I put my example data into Excel and used the =PERCENTILE(range, k) to get the result I am looking for using my Access code.

plog · Dec 20, 2013

Theoretically (to a 6th grader) - if the data was sorted with the largest numbers on the top and the smallest on the bottom, the 90th percentile is the point in the data at which 10% of the bigger numbers are above it, and 90% of everything else (the lower numbers) are below it).

Let's just work with your first example data (a,a,a,2001). That gives us 3 values:

56
34
4

I don't see how you get 51.6 out of there. 56 * .9 equals 50.4. Walk me through the calculation of the 51.6 using the numbers from your sample.

funkykizzy · Dec 20, 2013

That, my friend, is a VERY good question.

Unfortunately, calculating a percentile is a little more complicated than just multiplying by the percent.

First, we rank our data:
1 4
2 34
3 56

And the way we are calculating the percentile with Excel is:

Let N be the number of values and P be the percentile where 0<P<1.
Then, (N-1)P=k+d , where k=integer part and d=decimal part
-- with our data -- (N-1)P = (3-1)0.9 = 1.8 where k = 1 and d = 0.8

Then, P-th percentile is : v_{k+1} + d (v_{k+2} - v_{k+1})
where v_k is the value at position/rank k

So, with our data:
P-th percentile = v_{1+1} + 0.8 (v_{1+2} - v_{1+1}), which
= v_2 + 0.8(v_3 - v_2) or = 34 + 0.8(56-34) which = 51.6

I hope I explained that well enough. Let me know if you're confused.

Thanks!

plog · Dec 20, 2013

Excellent explanation. I achieved what you wanted through 2 queries, the first of which is the real key:

Code:

SELECT Table1.State, Table1.method, Table1.parameter, Table1.Year, Sum(Table1.samples) AS Total_Samples, Avg(Table1.value) AS Avg_Value, Max(Table1.Value) AS Max_Value, DMax("[value]","Table1","[State]='" & [State] & "' AND [method]='" & [method] & "' AND [parameter]='" & [parameter] & "' AND [Year]=" & [Year] & " AND [value]<" & Max([value]))*1 AS SecondMax_Value
FROM Table1
GROUP BY Table1.State, Table1.method, Table1.parameter, Table1.Year;

Name the above query 'sub_PercentileRank'. It gets all the fields you want for the final result, plus 2--the maximum value of a grouping and the second maximum value of the grouping. With those 2 values you can calculate the percentile you want. I did that in the main query:

Code:

SELECT sub_PercentileRank.State, sub_PercentileRank.method, sub_PercentileRank.parameter, sub_PercentileRank.Year, sub_PercentileRank.Total_Samples, sub_PercentileRank.Avg_Value, [SecondMax_Value]+0.8*([Max_Value]-[SecondMax_Value]) AS 90th
FROM sub_PercentileRank;

I hard coded the values for 90th percentile (1 and .8), but you can create your own function and pass it the Max_Value and SecondMax_Value values and have it do it for whatever percentage you want.

I didn't debug your function to see where it went wrong, but here's where you went wrong in building it:

1. 'value' and 'Year' are poor choices for field names because using them makes it hard to code--if you don't delimit them properly in your code, it will not run properly.

2. You swung for the fences instead of trying to get base hits and eventually drive in a run in your code. In a module, you don't have to cram your code onto one line. Use as much vertical space as you need. I saw a DMin nested inside a DCount then either another DCount was nested inside that DMin or you were doing multiplication with it on that first DCount. That's why I didn't even start to try and understand what it did and tell you were you went wrong.

Instead, do a DCount and set that value to a variable, then if needed use that variable in the next DMin and set that value to another variable and so on and so on. Then when it doesn't work, you can spit out all those variables to see which one is holding bad data and narrow down where the whole thing went wrong.

funkykizzy · Dec 20, 2013

Thank you thank you thank you!!

Wow, that was a lot simpler than what I was trying to do! I'm going to give it a try to see if it works.

Thanks again!

funkykizzy · Dec 20, 2013

Sorry to be such a dunce, plog, but I am assuming that the first code is a subquery called sub_PercentileRank for the second code. Is that correct?

plog · Dec 20, 2013

Correct. That's why you should name it as such. The second query you can name whatever because its not being referenced by anything else.

Actually, once you set it up and have it working, you should be able to rename them both to whatever you want and Access will help you when you rename it.

funkykizzy · Dec 20, 2013

You are a genius! That works great! ...for the dummy data...

Unfortunately, my actual data has varying N-values for each grouping (about 50 to 60) and my client wants the 5th, 50th, and 95th percentiles. So I will take what I have learned from your comments/suggestions and see if I can't work something out.

Thanks again!

plog · Dec 20, 2013

You should be able to break this part of the final query:

[SecondMax_Value]+0.8*([Max_Value]-[SecondMax_Value]) AS 90th

Into a function. Right now its hard coded 90%, but you could make it work for any percentage per your explanation for calculating the integer and decimal part of a percentile.

Using a custom aggregate function with the GROUP BY clause

Registered User.

Attachments

Banishment Pending

Registered User.

Banishment Pending

Registered User.

Attachments

Registered User.

Banishment Pending

Registered User.

Banishment Pending

Registered User.

Banishment Pending

Registered User.

Banishment Pending

Registered User.

Banishment Pending

Registered User.

Registered User.

Banishment Pending

Registered User.

Banishment Pending

Similar threads

Users who are viewing this thread