variability of measures of central tendency (Python, no graphics)
Kragen Javier Sitaker
kragen at canonical.org
Sat Aug 14 03:37:01 EDT 2010
I don’t know shucks about statistics, but I thought I would toy with
some different random distributions to see how they behaved. Contrary
to my expectations, the sample median was never the more consistent
measure, and was often worse.
#!/usr/bin/python
"""Different measures of central tendency have different variability.
And it depends on the distribution of the underlying data.

Some distributions are easy to characterize from a random sample. If
your data is normally or uniformly distributed, you can get a pretty
good estimate of the mean, which is also the median, after just ten or
twenty data points.

But some other distributions are not so well-behaved. The exponential
distribution is a common one. Its median is well to the left of its
mean. Does the sample mean or the sample median have greater
variance? I hypothesize, without actually doing the math, that the
sample mean of an exponential distribution has proportionally greater
variance, and therefore the sample median is a better measure to use,
if you have to pick one.
"""
from __future__ import division
import random, math, sys
sample_mean = lambda sample: sum(sample)/len(sample)
sample_means = lambda samples: map(sample_mean, samples)
# wrong for even-sized samples (takes the upper of the two middle
# values instead of averaging them), but close enough:
sample_median = lambda sample: sorted(sample)[len(sample)//2]
sample_medians = lambda samples: map(sample_median, samples)
uniform_sample = lambda n: [random.uniform(0, 1) for ii in range(n)]
expo_sample = lambda n: [random.expovariate(1) for ii in range(n)]
def standard_deviation(sample):
    mean = sample_mean(sample)
    return math.sqrt(sum((x - mean)**2 for x in sample)/(len(sample)-1))
uniform_samples = lambda n, m: [uniform_sample(m) for ii in range(n)]
expo_samples = lambda n, m: [expo_sample(m) for ii in range(n)]
def compare(n, m):
    print "%d samples of %d items each:" % (n, m)
    print "Uniform:",
    describe(uniform_samples(n, m))
    print "Exponential:",
    describe(expo_samples(n, m))
def describe(samples):
    means, medians = sample_means(samples), sample_medians(samples)
    print "standard deviation of mean %.2f (mean mean %.2f), of median %.2f (mean median %.2f)" % (
        standard_deviation(means), sample_mean(means),
        standard_deviation(medians), sample_mean(medians))
if __name__ == '__main__':
    compare(int(sys.argv[1]), int(sys.argv[2]))
(End of `variance.py`.)
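The `sample_median` above picks the item at index `len(sample)//2`, which for an even-sized sample is the upper of the two middle values. Here is a minimal sketch of the conventional definition, which averages them. It assumes Python 3 rather than the Python 2 of `variance.py`, and `upper_median` and `true_median` are hypothetical names that are not part of the script; the standard library's `statistics.median` is used only as a cross-check.

```python
import statistics


def upper_median(sample):
    """Same rule as sample_median in variance.py: the item at index len//2."""
    return sorted(sample)[len(sample) // 2]


def true_median(sample):
    """Conventional median: average the two middle items when len is even."""
    ordered = sorted(sample)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


if __name__ == '__main__':
    sample = [1, 2, 3, 10]
    print(upper_median(sample))       # 3
    print(true_median(sample))        # 2.5
    print(statistics.median(sample))  # 2.5
```

For odd-sized samples the two functions agree; the difference only shows up with even sizes such as the 20 used in the example run.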
Example usage:
: kragen at inexorable:~/devel/inexorable-misc ; ./variance.py 10000 20
10000 samples of 20 items each:
Uniform: standard deviation of mean 0.06 (mean mean 0.50), of median 0.11 (mean median 0.52)
Exponential: standard deviation of mean 0.22 (mean mean 1.00), of median 0.24 (mean median 0.77)
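These figures can be checked against textbook results for order statistics. The following is only a sketch, assuming Python 3, and every function name in it is hypothetical. It relies on two standard facts: the k-th smallest of m uniform(0, 1) draws follows a Beta(k, m+1-k) distribution, and the k-th smallest of m rate-1 exponential draws has mean equal to the sum of 1/j, and variance equal to the sum of 1/j², for j from m-k+1 to m. The last line of output, standard deviation divided by mean, is one possible reading of "proportionally greater variance", not necessarily the one the docstring intended.

```python
import math


def sd_of_sample_mean(variance, m):
    """Standard deviation of the mean of m independent draws."""
    return math.sqrt(variance / m)


def uniform_order_stat(k, m):
    """Mean and sd of the k-th smallest of m uniform(0,1) draws: Beta(k, m+1-k)."""
    a, b = k, m + 1 - k
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)


def expo_order_stat(k, m):
    """Mean and sd of the k-th smallest of m rate-1 exponential draws."""
    terms = range(m - k + 1, m + 1)
    mean = sum(1 / j for j in terms)
    var = sum(1 / j ** 2 for j in terms)
    return mean, math.sqrt(var)


if __name__ == '__main__':
    m = 20
    k = m // 2 + 1  # sorted(sample)[m//2] is the k-th smallest, counting from 1
    u_mean_sd = sd_of_sample_mean(1 / 12, m)  # uniform(0,1) has variance 1/12
    e_mean_sd = sd_of_sample_mean(1, m)       # expovariate(1) has variance 1
    u_med, u_med_sd = uniform_order_stat(k, m)
    e_med, e_med_sd = expo_order_stat(k, m)
    print("Uniform: sd of mean %.2f, sd of median %.2f (mean median %.2f)"
          % (u_mean_sd, u_med_sd, u_med))
    print("Exponential: sd of mean %.2f, sd of median %.2f (mean median %.2f)"
          % (e_mean_sd, e_med_sd, e_med))
    print("Exponential sd/mean: mean %.2f, median %.2f"
          % (e_mean_sd / 1, e_med_sd / e_med))
```

For m = 20 this gives 0.06, 0.11 and 0.52 for the uniform case and 0.22, 0.24 and 0.77 for the exponential case, agreeing with the run above to two decimals. The 0.52 and 0.77, rather than the true medians 0.50 and ln 2 ≈ 0.69, come from `sample_median` taking the upper middle value of an even-sized sample.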
This software is available via
git clone http://canonical.org/~kragen/sw/inexorable-misc.git
(or in <http://canonical.org/~kragen/sw/inexorable-misc>) in the file
`variance.py`.
Like everything else posted to kragen-hacks without a notice to the
contrary, this software is in the public domain.