Need help with statistics work...

charliemopps

So I have a DB that I made to track problems on a network. I don't want to get into what it does because it would get overly complicated, so I'm just going to use an analogy.

Let's pretend that I own several hundred fruit stands and I want to track whether I'm getting bad fruit from my vendors. There are two ways I could end up getting bad fruit.

The first is pretty basic: a bad shipment, a time when entire boxes arrive at a site totally bad. This was pretty simple to find. The fruit-stand workers report how many bad fruits they find of each kind, so I get info like this (but there are hundreds of sites and about a dozen kinds of fruit):

Code:
New York Bananas |0 bad |3 avg bad over past 4 weeks|
New York Apples  |0 bad |4 avg bad over past 4 weeks|
Chicago  Bananas |28 bad|3 avg bad over past 4 weeks|
Chicago  Apples  |3 bad |2 avg bad over past 4 weeks|
St Louis Bananas |2 bad |5 avg bad over past 4 weeks|
St Louis Apples  |1 bad |0 avg bad over past 4 weeks|

Now, I've easily made a query that rules out all but the site with the problem; clearly I got some bad bananas in Chicago. The problem is, my vendor is smart. I called and complained about the bad batch, and he realized that to sneak the bad fruit past me, all he has to do is slip a bad banana into each box. Now no particular site trips the filter, but every site has gone up by 2 bananas.

I have successfully created a query to identify such an occurrence by simply checking the sum of bananas against their average. The problem is that if a single site receives an entire bad box of bananas, it will set this query off as well, making me think there is a "country-wide" problem as well as a single-site problem.
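For reference, the two checks boil down to something like this. The table and field names here are simplified stand-ins (the real ones are different), and the thresholds are just placeholders:

Code:
-- Check 1: single-site problem - a site whose bad count this week is
-- way above its own 4-week average (the +10 cutoff is arbitrary)
SELECT Site, Fruit, BadCount, AvgBad4Wk
FROM WeeklyCounts
WHERE BadCount > AvgBad4Wk + 10;

-- Check 2: country-wide problem - this week's nationwide total per fruit
-- versus the nationwide total of the 4-week averages
SELECT Fruit, Sum(BadCount) AS TotalBad, Sum(AvgBad4Wk) AS TotalAvg
FROM WeeklyCounts
GROUP BY Fruit
HAVING Sum(BadCount) > 1.5 * Sum(AvgBad4Wk);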

Does anyone know how I would go about this statistically, to find both the larger problem and the single-site problem? Is there a way to throw out data we would consider far above the norm?

Thanks!
 
I can follow network-speak.

It appears to me that as long as you can identify the origin of the bad traffic, it shouldn't matter what the source does to mask the problem. Whether all the traffic is bad for a single type of message or whether it is more generally distributed, you still have error rates.

Deciding that a given problem is network-wide is trickier and easier at the same time. Trickier because as you point out, one bad apple can screw up the whole barrel. But easier, because as long as you track the source, you can still isolate whether the origin is unique, regardless of whether the destination sites are being hit wholesale or retail. A network-wide problem will have multiple origins and multiple targets. A site-specific problem won't. One end or the other will be well-defined and narrowly manifested.

For a math reference, try searching for "univariate analysis" to see how to compute single-variable variances, standard deviations, and the like. That is the machinery you need; it is how you will eventually be able to decide that things have gone west on you.
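To make that concrete, here is a rough sketch using your simplified table and field names from above (WeeklyCounts, BadCount, AvgBad4Wk; all made up for illustration). The idea: compute the per-fruit mean and standard deviation across sites, flag any site more than two standard deviations above the mean as a single-site problem, then re-run the country-wide check with those outliers thrown out, so one bad box can't trip the nationwide alarm. In Access you can save the first SELECT as its own query (say, qryStats) and join to it:

Code:
-- Save this as qryStats: per-fruit mean and standard deviation across sites
SELECT Fruit, Avg(BadCount) AS MeanBad, StDev(BadCount) AS SdBad
FROM WeeklyCounts
GROUP BY Fruit;

-- Single-site problems: sites more than 2 standard deviations above the mean
SELECT w.Site, w.Fruit, w.BadCount
FROM WeeklyCounts AS w INNER JOIN qryStats AS s ON w.Fruit = s.Fruit
WHERE w.BadCount > s.MeanBad + 2 * s.SdBad;

-- Country-wide check with those outlier sites excluded
SELECT w.Fruit, Sum(w.BadCount) AS TrimmedTotal, Sum(w.AvgBad4Wk) AS TotalAvg
FROM WeeklyCounts AS w INNER JOIN qryStats AS s ON w.Fruit = s.Fruit
WHERE w.BadCount <= s.MeanBad + 2 * s.SdBad
GROUP BY w.Fruit
HAVING Sum(w.BadCount) > 1.5 * Sum(w.AvgBad4Wk);

The 2-standard-deviation cutoff and the 1.5 multiplier are arbitrary; tune them against your history. If a handful of extreme sites still skews the mean itself, comparing against the median is the more robust route, though Access has no built-in median aggregate, so that takes extra work.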
 
Ok, thanks! I'll check that out.

By network, I didn't mean a computer network. It's a large number of people reporting problems with equipment that is interconnected over a wide area. Again, it's rather complicated and not really important to the statistics at hand, so I'll just leave it out. Just think of it as fruit. lol
 
Just so long as you don't run into a fruit-salad case (which clearly would represent a denormalized database...)
 
