Calculate Median value for group of transactions within the same industry

crashboat

Registered User.
Local time
Yesterday, 16:41
Joined
Oct 15, 2013
Messages
11
I am using the code below to evaluate values for each transaction to determine the median:
Function MedianF(pTable As String, pfield As String, pgroup As String) As Single
'*******************************************
'Purpose: Return median value from a recordset
'Inputs: ? medianF("Values", "Industry") <enter>
'Output: results are currently incorrect
'*******************************************
Dim rs As Recordset
Dim strSQL As String
Dim n As Integer
Dim sglHold As Single
Dim groupValue As String 'I added this field to pass the transaction Industry

strSQL = "SELECT " & pfield & " from " & pTable & " WHERE " & pfield & "<>0 Order by " & pfield & ";"
groupValue = pgroup
Set rs = CurrentDb.OpenRecordset(strSQL)
rs.MoveLast
n = rs.RecordCount
rs.Move -Int(n / 2)

If n Mod 2 = 1 Then 'odd number of elements
MedianF = rs(pfield)
Else 'even number of elements
sglHold = rs(pfield)
rs.MoveNext
sglHold = sglHold + rs(pfield)
MedianF = sglHold / 2
End If
rs.Close
End Function

The code generates median results based on the value but not the correct ones, as I check results in Excel using the Median function. I need to see the median for the transactions within the same industry and I do not know how to do that :banghead: . Below are some sample transactions including the Industry, Value, mymedian (which the current results based on my code) and the expected median result (which I calculated in Excel based solely on the transactions listed here). Note that mymedian results listed in the 3rd column are determined using the complete data set (over 2000 transactions). I only included a few sample transactions below to show that I am currently getting 2 different median results: 44.62 and 50.82. Also be aware that I do not want to evaluate any zero value transactions, which I believe I am handling in the strSQL line above where I indicate <>0 . I need to see a Median result for each industry and the result should be the same for all transactions within the same industry.


Industry Value mymedian Expected Median
Aerospace 18.01 50.82 40.45
Aerospace 53.02 50.82 40.45
Aerospace 109.18 50.82 40.45
Aerospace 61.54 44.62 40.45
Aerospace 62.90 44.62 40.45
Air Freight 38.18 44.62 35.22
Air Freight 32.27 50.82 35.22
Auto Components 28.44 44.62 59.10
Auto Components 34.33 44.62 59.10
Auto Components 36.17 44.62 59.10
Auto Components 45.42 50.82 59.10
Auto Components 83.87 50.82 59.10

This is how I am passing the values from the query to the module, passing the value of Period2 if Period 1 = 0, else passing Period1:
mymedian: IIf([Period1]=0, medianF("qry_My_Query","Period2","Classification_Industry"), medianF("qry_My_Query","Period1","Classification_Industry"))


Your help would be greatly appreciated.
 
Last edited:
Your pgroup parameter is never used in the construction of the recordset. Yeah, you pass it to the function, and then assign it within the function to a variable called groupValue. But that's it. You never use it to limit what records are selected in the recordset.

One problem you are going to run into is that your Function MedianF() is too abstract, as written, to accommodate this additional parameter. That routine is written to receive a tablename and a fieldname, and if you want to add a constraint you'll need to add another fieldname parameter and a value parameter, but if you do that you sort of break the abstraction.

I'd probably keep that MedianF routine, assuming you use it in other places, and write a more specific Function GetIndustryMedian() into which all you pass is the fieldname and the industry, and it knows the table. Then that function can call MedianF() with a sub-query for a table, like . . .
Code:
Function GetIndustryMedian(Fieldname as string, Industry as string) as Single
   Dim Table as String
[COLOR="Green"]   'construct a subquery for the table param[/COLOR]
   Table = "( SELECT * FROM qry_My_Query WHERE Industry = '" & industry & "' ) "
[COLOR="Green"]   'now call the original MedianF()[/COLOR]
   GetIndustryMedian = MedianF(Table, Fieldname)
End Function
Does that make sense?
 
Thank you very much lagbolt for your reply. I understand I never used the pgroup parameter in my code as that is the part I am not clear with. I thought I needed build something like a Do While loop to get all the transactions for an industry, calculate the Median and then move on to the next Industry. I am glad to see that building a loop was not included in your reply, but I am not clear how the subquery in your example will accomplish this. Am I overcomplicating this?

I don't care about breaking the abstraction the function as I am not using MedianF anywhere else, but I understand what you say about creating a second function CreateIndustryMedian. In your example you are passing 2 variables: fieldname and Industry. I am assuming the 'fieldname' refers to the amount.

A few additional questions to clarify:
CreateIndustryMedian: The variables I am passing are amount and industry from qry_My_Query. From here I call MedianF. Do I need to edit the variable names to the same ones I am using with MedianF which are:
Function MedianF(pTable As String, pfield As String) As Single.

MedianF: The variables I am passing are table name and amount, but what about the industry? If I am calculating the median for each industry, don't I need to pass that variable? I am not clear with this. Please advise.

Thanks for your assistance.

For your reference, here is your code:
Function GetIndustryMedian(Fieldname as string, Industry as string) as Single Dim Table as String 'construct a subquery for the table param Table = "( SELECT * FROM qry_My_Query WHERE Industry = '" & industry & "' ) " 'now call the original MedianF() GetIndustryMedian = MedianF(Table, Fieldname)End Function
 
Last edited:
You are trying to display the median as a field in a query aren't you?
mymedian: IIf([Period1]=0, medianF("qry_My_Query","Period2","Classification_Industry"), medianF("qry_My_Query","Period1","Classification_Industry"))
  • The variables are the fieldname--Period1 or Period2--and the industry.
  • In the IIf() block you posted, "qry_My_Query" doesn't change so we don't need to pass it to the function. It can be a fixed feature of the function.
  • For the industry parameter, we need to draw that data from the table, not the quote delimited literal yoiu have.
  • Finally we can put the IIF() call inside the function call, and leverge the fact that a zero is the same as a boolean False
. . . and amend things as follows . . .
Code:
MyIndustryMedian: GetIndustryMedian(IIf([Period1], "Period1", "Period2"), [Industry])
Is that clearing things up?
 
Thanks again. You are absolutely correct. I am calling the median function from a query as I ultimately need to see the median by industry in the query output so I can use this to perform other calculations.

I don't understand the 3rd bullet on your last reply. What literal quote are your referring to? For the industry parameter, we need to draw that data from the table, not the quote delimited literal yoiu have.

I still don't see how are we using the industry classification in the function MedianF. That was my original problem.

Here is a summary of the related queries and functions I have so far:
1. Query containing the source data to be used to calculate the median by industry. To keep it simple I called it qry_My_Query, but the real name is qry_KPIs_By_Company.

2. Query where I pass variables to MedianF -- mymedian: IIf([DWC_Period1]=0,medianF("qry_KPIs_By_Company","DWC_Period2"),medianF("qry_KPIs_By_Company","DWC_Period1")). This query runs.

3. Query where I pass veriables to GetIndustryMedian -- Industry_Median: IIf([DWC_Period1]=0,GetIndustryMedian("DWC_Period2","GICS_Classification_Industry"),GetIndustryMedian("DWC_Period1","GICS_Classification_Industry"))

4. Function MedianF(pTable As String, pfield As String) As Single
'*******************************************
'Purpose: Return median value from a recordset
'Inputs: ? medianF("qry_KPIs_By_Company", "Period1 or Period2") <enter>
'*******************************************
Dim rs As Recordset
Dim strSQL As String
Dim n As Integer
Dim sglHold As Single
strSQL = "SELECT " & pfield & " from " & pTable & " WHERE " & pfield & "<>0 Order by " & pfield & ";"
Set rs = CurrentDb.OpenRecordset(strSQL) 'runtime error 3061
rs.MoveLast
n = rs.RecordCount
rs.Move -Int(n / 2)

If n Mod 2 = 1 Then 'odd number of elements
MedianF = rs(pfield)
Else 'even number of elements
sglHold = rs(pfield)
rs.MoveNext
sglHold = sglHold + rs(pfield)
MedianF = sglHold / 2
End If
rs.Close


5. Public Function GetIndustryMedian(Fieldname As String, Industry As String) As Single
Dim Table As String
'construct a subquery for the table param
Table = "( SELECT * FROM qry_KPIs_by_Company WHERE Industry = '" & Industry & "' ) "
'now call the original MedianF()
GetIndustryMedian = MedianF(Table, Fieldname)

Function MedianF run on it own, but I get runtime error when called from GetIndustryMedian. Runtime error 3061. Too few parameters. Expected 1. I get the error on MedianF on the line Set rs = CurrentDb.OpenRecordset(strSQL)
 
Yeah, that third bullet, you can't pass the name of the industry field to the function, you have to pass the actual value the field contains, so this . . .
Industry_Median: IIf([DWC_Period1]=0,GetIndustryMedian("DWC_Period2","GICS_Classification_Industry"),GetIndustryMedian("DWC_Period1","GICS_Classification_Industry"))
. . . should be . . .
Industry_Median: IIf([DWC_Period1]=0,GetIndustryMedian("DWC_Period2",[GICS_Classification_Industry]),GetIndustryMedian("DWC_Period1",[GICS_Classification_Industry]))
Do you see the difference? The first passes the field name, the next passes the value of the field for the current record. Very different.

And the other thing is about readability mostly, you have two function calls inside the IIF() right? But the only thing that changes in the IIF() is the fieldname, so it's clearer if you have the IIF() inside the function call, like this . . .
Industry_Median: GetIndustryMedian(IIf([DWC_Period1],"DWC_Period1","DWC_Period2"), [GICS_Classification_Industry])
Right? The IIF() distinguishes the one fieldname from the other. No need to duplicate the whole function call, just use one function call, and call it with exactly the field name we need.

And if that's not enough, which it might not be, consider posting a database with sensitive data removed, that I can modify and make it work, and then you can reverse engineer it. It's not super simple what's going on here, but it's also not that far from working.

Cheers,
 
Thank you so much for explaining the difference between passing a field name or a value of the field. Excellent! However, I don't understand why wouldn't we want to pass the value for both fields.?

I see how having the IIF inside the function can clean it up a bit, but I still have to test for zero - If Period1 = 0, then pass "Period2", [GICS_Classification_Industry], else "Period1", [GICS_Classification_Industry].

I want to follow your advise and post the database. I included only a few records without any sensitive information. I appreciate your help.

The main query used to calculate median is qry_KPIs_by_Company.
I was trying to view the median by industry in qry_Industry_Median.

Cheers!
 

Attachments

So here's a fix for the immediate problem.

A more basic problem is that the query on which you base your median calculation does not actually provide distinct data by industry, but rather by company. As a result, your median query can return different medians for the same industry--because it's actually calculating it on the company.

More basic still is that fact that your main table is not "Normalized," which is a bit of technical term, but worth a Google. Best search term: "database normalization"

In short, you are using a database table like it's the same as a spreadsheet, and it's not, and that will be a major impediment to working with that data. In a database you want use many tables to model the one-to-many relationships within your data, so a row in an Industry table might have many related rows in a Company table, and a Company table might have many related rows in a RevenueHistory table. Then, calculating RevenueHistory by Industry is a trivial matter, and the passage of time, where you might need to add Period7, Period8, PeriodN, doesn't require you to modify your data structure. When new data comes in you want to always add rows, never columns.

hope this helps,
 

Attachments

You are absolutely correct. This db is not normalized at all. Not an excuse, but it was an urgent request and need to provide a quick solution. I will make the Company number (Excel_ID) my primary key as that is unique. Luckily the number of months is not going to increase as they plan to carry revolving 6 periods (the most current 6 periods). Industry information cannot be unique because we want to have as many companies as possible within an industry and we want to use each of them to calculate the industry median.

To calculate industry median we need to pass the values for each company within that industry based on the most current information available. Many companies do not have data for Period1 (Period1=0), which is why for those cases we want to grab Period2 to have a more representative industry analysis. We want to see a single median per industry. So for each company within the same industry they will have the same Industry Median. We need to allocate the industry median to each company as there are a number of other analyses for the company that will use the industry median. As you stated, currently in the database the median is calculated at the company level rather than at the industry level. That is what I was originally doing and did not know how to do it at the industry level.

Here is an example of the companies for 1 industry in the database and the expected results:
Industry Median per DB Period1 Period2 Industry Median
Aerospace 59.68 0.00 66.34 57.66
Aerospace 48.98 48.98 53.02 57.66

I highlighted red the 2 values that need to be passed to calculate the industry median. Since Period1 = 0 for the first company, then pass Period2. However, Period1 should be passed for the second company as the value is not zero. In this example to calculate the industry median we use Period2 of the first company and Period1 of the second one. Note the industry median should be the same for each company as they pertain to the same industry. This is why I was asking about creating some type of looping to read the industry median. Do you think that will be required?

Many thanks for your assist!
 
Last edited:
. . . but it was an urgent request and need to provide a quick solution.
The quickest solution in a database is to normalize the data. This is the cause of your problem, and these other snags you are running into are the symptoms. Sorry to have to break the news. :(
 
I understand and it is true. This is a bad database. However, if I normalize it, I still have to pass the values for all the companies within an industry to calculate the industry median. That was my original request as I could not get that to work. Any other ideas?
 
If you call this query "qCompanyRevenueByIndustry" . . .
Code:
SELECT GICS_Classification_Industry AS Industry, IIf([Revenue_Period1]=0,[Revenue_Period2],[Revenue_Period1]) AS Revenue
FROM tbl_Data_Main;
. . . and you call this query "qIndustryMedianFinal" . . .
Code:
SELECT Industry, GetIndustryMedian([Industry]) AS Median
FROM qCompanyRevenueByIndustry
GROUP BY Industry;
. . . and amend this function as follows . . .
Code:
Public Function GetIndustryMedian(Industry As String) As Single
   Dim Table As String
[COLOR="Green"]   'construct a subquery for the table param[/COLOR]
   Table = "( SELECT * FROM qCompanyRevenueByIndustry WHERE Industry = '" & Industry & "' ) "
[COLOR="Green"]   'now call the original MedianF()[/COLOR]
   GetIndustryMedian = MedianF(Table, "Revenue")
End Function
. . . does that give you the correct result?
 
Thanks again for your support. I made the changes as you indicated, except for the first query, since the field needed to be passed to determine the median does not exist in the table tbl_Data_Main. That is a calculated field created in qry_KPIs_By_Company. In addition, I simplified the query as we don't need to look for non-zero valued between Period1 and Period2. We decided we are using Period2 for the calculation. I kept your query alias names, but updated this query as follows:

qCompanyRevenueByIndustry
SELECT qry_KPIs_By_Company.GICS_Classification_Industry AS Industry, qry_KPIs_By_Company.DWC_Period2 AS Revenue
FROM qry_KPIs_By_Company;

Unfortunately I get the following runtime error when running qIndustryMedianFinal.
Runtime error 3021. No Current Record.
The code halts at MedianF in the line rs.MoveLast
 
Yeah, so your recordset is empty. The SQL doesn't return any records. You can post a database again if you want, but I can't tell from your description why it fails.
 
Thanks again. I have some good and bad news. But I truly think we are getting closer to the end. Apologies as after I submitted my last reply I realized I had made a change to one of the modules, hence the reason why I had the empty recordset. After correcting that, the code worked and displayed the correct median results per industry!!!

Then I ruined it again :( I got too ambitious and decided to replicate the code as I need to do median calculation using 3 other fields. It is not clean or pretty, as I copied and replicated the GetIndustryCode 3x times (slightly changing the name for the corresponding field. Now I have GetIndustryMedian (the original), GetIndustryMedian_DSO, GetIndustryMedian_DIO and GetIndustryMedian_DPO.

I tested having the first 2 median fields (original and DSO) and the query output was correct, no errors and the results perfect. Then I added the next 2 in the final median query and that's when the trouble started. I do get the results displayed correctly (for a moment) but then it goes into a runtime error that I cannot break.

By the way, if I get into a runtime error and hitting the END button doesn't do anything, how can I get out of it?

I am posting the database again :(
 

Attachments

The problem you can get, if you call a function from a query, is that the function gets called for every row. So if the function errors out and you hit end, you kill the code execution, but only for that call for that row. So you want to have some OK error handling in there in a production environment, or hit debug and fix the routine so the following rows will execute without error.

I'll see what I can find in the db.
 
Here's what I did to the code . . .
Code:
Function MedianF(pTable As String, pfield As String) As Single
'*******************************************
'Purpose:   Return median value from a recordset
'Inputs:    ? medianF("qry_KPIs_By_Company", "Period1 or Period2") <enter>
'*******************************************
    Dim i   As Integer
    Dim tmp As Single

    With CurrentDb.OpenRecordset( _
        "SELECT " & pfield & " " & _
        "FROM " & pTable & " " & _
        "WHERE " & pfield & " <> 0 " & _
        "ORDER BY " & pfield & ";" _
    )
        If Not .EOF Then
            .MoveLast
            i = .RecordCount
            .PercentPosition = 0.5
            
            tmp = .Fields(0)
            If i Mod 2 = 0 Then 'even number of elements
                .MoveNext
                tmp = (tmp + .Fields(0)) / 2
            End If
            MedianF = tmp
        End If
        .Close
    End With

End Function

Public Function GetIndustryMedian(Industry As String, Fieldname As String) As Single
    Dim Table As String
    'construct a subquery for the table param
    Table = "( SELECT * FROM qCompanyRevenueByIndustry WHERE Industry = '" & Industry & "' ) "
    'now call the original MedianF()
    GetIndustryMedian = MedianF(Table, Fieldname)
End Function

The only functional change in the MedianF routine is I've added a check for EOF, which is true when you open an empty recordset, and it skips trying to reference any field data in that case.

I've returned to one function call, to which you now pass the field name.

And here's the query . . .
Code:
SELECT qCompanyRevenueByIndustry.Industry,
GetIndustryMedian([Industry],"DWC_Period2") AS Median_DWC,
GetIndustryMedian([Industry],"DSO_Period2") AS Median_DSO,
GetIndustryMedian([Industry],"DIO_Period2") AS Median_DIO,
GetIndustryMedian([Industry],"DPO_Period2") AS Median_DPO
FROM qCompanyRevenueByIndustry
GROUP BY qCompanyRevenueByIndustry.Industry;
. . . and you can see there that each function call includes the name of the field on which to calculate the Median.

But the only real error with what you had was that you were trying to read data from an empty recordset and that'll fail. All you need to change is that, check for EOF when you open the recordset. These other step are just optimizations I would make that seem simpler to me.
 
You are brilliant! Thank you very much. It works beautifully. I liked how you simplified the function call from the query to pass the 4 variables. I also noticed in the function MedianF you included a line .percentposition = .5 to get to the middle record in the dataset (I assume). I appreciate your notes on checking for EOF.

Now I am going to use your solution to calculate Top Quartile by industry. Wish me luck ;)

Have a great weekend!
 

Users who are viewing this thread

Back
Top Bottom