- Local time
- Today, 02:07
- Joined
- Feb 28, 2001
- Messages
- 27,281
Earlier this month someone came up with a question about trend lines and I made an off-handed reference that it might be done simply via a query and a quick & dirty recordset operation. So I thought it might be a good idea to post this in a way to make in findable by search later.
You can use the following general ideas to do a linear regression. Here are the formulas and data you need.
Requirements:
1. A table with lists of data to be analyzed this way, where you have one independent value and a corresponding dependent value. Further, you need to believe the formula describing that dependency is a straight line.
I.e you have to assume the mathematical relationship: y = a + bx.
The goal is therefore to find a and b having lists of x and corresponding y.
2. The table (suggesting for scientific purposes) should contain these X and Y values in either SINGLE or DOUBLE format - and I always work in DOUBLE if doing regressions for reasons listed later.
3. Write this query, perhaps calling it GENSUM:
NOTE: ORDER BY clauses are NOT NEEDED in this context. See later discussion. GROUP BY clauses would let you do this for multiple sets of sums at once for disjoint data sets in the same table, but for this discussion I am keeping it simple.
4. In your VBA code where you need to do this computation (and I will not assume it to be a subroutine on its own for this example):
DISCUSSION:
In essence, let an Access summation query form the sums, then copy them from the recordset implied by that summation query to your code, then apply the formulas. Now, a fine point: Since addition is an "associative" operation and is also "commutative" you do not need an ORDER BY in the generation of the sums. Having the pairs in ascending, descending, or random order makes NO DIFFERENCE to the math.
There is another pitfall to be mentioned. If it should ever happen that you do this, there is a question to be asked: Is this relationship really expected to be a straight line? If the answer is NO then you need another type of regression. The above method will work but you will need more sums, and the higher the order of the polynomial you are using, the faster the power series starts to mount up. For example, to get a quadratic equation, you will need to diddle with 4th-power summations, which is why I suggested DOUBLE variables - they have a wider exponent range than SINGLE variables. By the time you get to quartic (4th-order) polynomials, you are talking about sums involving the 16th power of your variables, which could quickly blow SINGLE sums out of the water.
Note also that if you have Excel installed along with Access, there is a matrix math library you can use, and some of the really messy equations for higher polynomials can be managed more easily via matrix multiplication. A web search on "VBA MATRIX MATH" will lead you to articles involving use of a list of worksheet functions that look like obj.WorksheetFunction.MMult, obj.WorksheetFunction.MInverse, etc. A web search on "Regression using Sums" will include articles on how to do this via matrix methods.
You can use the following general ideas to do a linear regression. Here are the formulas and data you need.
Requirements:
1. A table with lists of data to be analyzed this way, where you have one independent value and a corresponding dependent value. Further, you need to believe the formula describing that dependency is a straight line.
I.e you have to assume the mathematical relationship: y = a + bx.
The goal is therefore to find a and b having lists of x and corresponding y.
2. The table (suggesting for scientific purposes) should contain these X and Y values in either SINGLE or DOUBLE format - and I always work in DOUBLE if doing regressions for reasons listed later.
3. Write this query, perhaps calling it GENSUM:
Code:
SELECT COUNT(X) AS N, SUM(X) AS SUMX, SUM(Y) AS SUMY, SUM(X*X) AS SUMX2, SUM(Y*Y) AS SUMY2, SUM(X*Y) AS SUMXY
FROM MYDATATABLE
WHERE {fill in any filtration/selection criteria here};
NOTE: ORDER BY clauses are NOT NEEDED in this context. See later discussion. GROUP BY clauses would let you do this for multiple sets of sums at once for disjoint data sets in the same table, but for this discussion I am keeping it simple.
4. In your VBA code where you need to do this computation (and I will not assume it to be a subroutine on its own for this example):
Code:
...
' Define variables to be used in computation
DIM N AS DOUBLE
DIM SUMX AS DOUBLE
DIM SUMY AS DOUBLE
DIM SUMX2 AS DOUBLE
DIM SUMY2 AS DOUBLE
DIM SUMXY AS DOUBLE
' Define variables that we wanted to see
DIM A AS DOUBLE
DIM B AS DOUBLE
' Define a couple of working variables for intermediates
DIM DIVISOR AS DOUBLE
DIM DIVIDEND AS DOUBLE
' Define a way to see the results of the query
DIM RS AS RECORDSET
....
' The actual mode of opening is ALMOST irrelevant because you are
' not going to scan multiple records via the explicit recordset.
SET RS = CURRENTDB.OPENRECORDSET( "GENSUM", dbOpenDynaset )
RS.MOVEFIRST
' Transfer the variables from the recordset to the code
' (It makes typing easier later)
WITH RS
N = ![N]
SUMX = ![SUMX]
SUMY = ![SUMY]
SUMXY = ![SUMXY]
SUMY2 = ![SUMY2]
SUMX2 = ![SUMX2]
END WITH
RS.CLOSE
' Finally, do the math.
DIVISOR = ( N * SUMX2 ) - ( SUMX * SUMX )
DIVIDEND = ( N * SUMXY ) - ( SUMX * SUMY )
B = DIVIDEND / DIVISOR 'B is the slope
DIVIDEND = ( SUMY * SUMX2 ) - ( SUMX * SUMXY )
A = DIVIDEND / DIVISOR 'A is the Y-intercept
DISCUSSION:
In essence, let an Access summation query form the sums, then copy them from the recordset implied by that summation query to your code, then apply the formulas. Now, a fine point: Since addition is an "associative" operation and is also "commutative" you do not need an ORDER BY in the generation of the sums. Having the pairs in ascending, descending, or random order makes NO DIFFERENCE to the math.
There is another pitfall to be mentioned. If it should ever happen that you do this, there is a question to be asked: Is this relationship really expected to be a straight line? If the answer is NO then you need another type of regression. The above method will work but you will need more sums, and the higher the order of the polynomial you are using, the faster the power series starts to mount up. For example, to get a quadratic equation, you will need to diddle with 4th-power summations, which is why I suggested DOUBLE variables - they have a wider exponent range than SINGLE variables. By the time you get to quartic (4th-order) polynomials, you are talking about sums involving the 16th power of your variables, which could quickly blow SINGLE sums out of the water.
Note also that if you have Excel installed along with Access, there is a matrix math library you can use, and some of the really messy equations for higher polynomials can be managed more easily via matrix multiplication. A web search on "VBA MATRIX MATH" will lead you to articles involving use of a list of worksheet functions that look like obj.WorksheetFunction.MMult, obj.WorksheetFunction.MInverse, etc. A web search on "Regression using Sums" will include articles on how to do this via matrix methods.