We consider a histogram bin with N entries
of weighted events,
with weights w_i, i=1,...,N.
The quantity of interest is the sum of the weights, sum_w,
sum_w = sum {w_i} (i=1,...,N).
The error on sum_w is then given by
err(sum_w) = sqrt( sum {w_i^2} ),
since
(err(sum_w))^2 == var(sum_w) = sum {var(w_i)} (i=1,...,N),
i.e. the squared errors of the weighted events add.
The variance var(w_i) of weight w_i
is determined only by the statistical fluctuation of
the number of events considered:
var(w_i) = var(w_i * 1 event) = w_i^2 * var(1 event) = w_i^2,
with Poissonian fluctuation of the number of events ("1 event" has variance 1), and taking w_i to be a constant for event i.
If this sounds difficult at first glance,
just do the exercise and
work out the error propagation for 100 events
split into two groups:
90 events with w_i==1.0 and 10 events with w_i==0.1.
We have
sum_w = 90*1 + 10*0.1 = 91,
the statistical fluctuations come from sqrt(90) and
sqrt(10), giving
var(sum_w) = 1^2 * 90 + 0.1^2 * 10 = 90.1 ,
i.e. the error on sum_w is sqrt(90.1) = 9.49 .
The relative error is 9.49/91 = 0.104.
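The arithmetic of this exercise can be checked with a short script (plain Python here, purely illustrative; this is not the HBOOK implementation):

```python
import math

# 100 events split into two groups: 90 with weight 1.0, 10 with weight 0.1
weights = [1.0] * 90 + [0.1] * 10

sum_w = sum(weights)                     # sum of weights
var_sum_w = sum(w * w for w in weights)  # variance = sum of squared weights
err = math.sqrt(var_sum_w)               # error on sum_w

print(sum_w)        # ~91
print(var_sum_w)    # ~90.1
print(err)          # ~9.49
print(err / sum_w)  # ~0.104
```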
The number of equivalent events is defined as
N_equ = ( sum {w_i} )^2 / sum {w_i^2} .
This number relates the sample of N weighted events to N_equ events with w==1 that would have the same relative statistical fluctuation.
For the example above:
The number of equivalent events there is
N_equ = (sum_w)^2 / sum {w_i^2} = 91^2 / 90.1 = 91.9 events.
This means your statistical fluctuation is about as good (or bad)
as for 92 events with event weight == 1.
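A minimal sketch of the equivalent-event count for the same toy sample (plain Python, illustrative only):

```python
weights = [1.0] * 90 + [0.1] * 10

sum_w = sum(weights)
sum_w2 = sum(w * w for w in weights)

# number of equivalent unweighted events
n_equ = sum_w ** 2 / sum_w2
print(n_equ)  # ~91.9: fluctuates like ~92 events with weight 1

# sanity check: for unit weights, N_equ equals N
unit = [1.0] * 100
print(sum(unit) ** 2 / sum(w * w for w in unit))  # 100.0
```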
For the MC-files for the atm-nu's in the Nature paper:
For 7000 events we get N_equ=2200,
while the number of events in the data sample is 188.
So, the "equivalent" statistics of the MC is only about 12 times that of the data!
In certain regions of variable space,
or for a different
distribution function of the weights (which is the relevant
quantity here!), you can be much better or worse off!
Note also that the other MC sample used (Eva's events)
contains 2100 events (weight=1.) and
has the same statistical significance.
Only a single command in HBOOK or PAW (also from the shell!)
is needed to get the weighted error
handling 'correct' (sorry to all who knew this...) -
invoke statistics BEFORE filling the histogram:
      call hbook1 (id,............)
      call hidopt (id,'stati')
      .....
      call hfill (id,......)
If you also want error bars on the plot, add "call hbarx(id)" after hidopt.
The number of equivalent events is obtained with "x=hstati(id,...)" or
from the PAW shell via
$HINFO(id,'events').
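Outside HBOOK/PAW, the same bookkeeping is easy to reproduce. Below is a minimal sketch (plain Python, with hypothetical class and method names, not the CERNLIB API) of a histogram that accumulates sum_w and sum_w^2 per bin and reports sqrt(sum_w^2) as the bin error, as the 'stati' option does:

```python
import math

class WeightedHist1D:
    """Toy 1-D histogram tracking the sum of weights and the sum of
    squared weights per bin (the information needed for weighted errors)."""

    def __init__(self, nbins, lo, hi):
        self.nbins, self.lo, self.hi = nbins, lo, hi
        self.sum_w = [0.0] * nbins
        self.sum_w2 = [0.0] * nbins

    def fill(self, x, w=1.0):
        if self.lo <= x < self.hi:
            i = int((x - self.lo) / (self.hi - self.lo) * self.nbins)
            self.sum_w[i] += w
            self.sum_w2[i] += w * w

    def error(self, i):
        # weighted bin error: sqrt of the sum of squared weights
        return math.sqrt(self.sum_w2[i])

# the 90/10 toy sample from above, filled into a single bin
h = WeightedHist1D(1, 0.0, 1.0)
for _ in range(90):
    h.fill(0.5, 1.0)
for _ in range(10):
    h.fill(0.5, 0.1)
print(h.sum_w[0], h.error(0))  # ~91, ~9.49
```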
You see immediately that the formula given in the ansatz must be wrong,
since the weight==1 limit comes out
absolutely wrong:
Error=0 instead of sqrt(N)!
The problems:
(1) His "Ansatz" neglects any statistical fluctuation of the data sample.
(2) He makes a numerically wrong assumption about the second
term being small.
Neglecting the second term introduces up to 100% error in the
formula !
For our cos(zenith) distribution
it is about 30%.
These two mistakes by chance compensate, yielding the correct final
formula (Poissonian errors, propagated in quadrature).
Find here also the cos(zenith) plots, for bin=0.025, for bin=0.050, and for bin=0.10, recently updated with the binomial error bars for the data.
For Poissonian statistics only, here are the cos(zenith) plots, for bin=0.025, for bin=0.050, and for bin=0.10. (The latest suggestion was to use Poissonian error bars for the Nature paper.)
- Fig.3 in Nature draft - already changed.
- Fig.2 in Nature draft - the error of the ratios and the fit results will (slightly?) change; for sure the error of the fit results.
- Any statistical check (Chi2, run test, Kolmogorov, ...) of the MC-data compatibility, e.g. for the cos(zenith) distribution, will change quantitatively.
How to correctly assign errors to unweighted events?
The true statistics: the distribution of N_k entries
in a histogram bin, with N entries in the histogram
in total,
follows a binomial distribution.
The error on N_k comes out as
err(N_k) = sqrt( N_k * (1 - N_k/N) ).
Neglecting this, you overestimate the errors by
up to 13% in the cos(zenith) plots.
This effect can also become large, e.g. for an exponential distribution with
a large fraction of the entries in a small number of bins.
Using binomial statistics means that
you consider the number of trials fixed to the number of entries
in the given histogram. This is the case
e.g. if you compare the data to another data set, i.e. make
an (implicitly normalized) density or shape comparison.
Since we are doing this, the application of binomial statistics
seems adequate
in our case.
Author of this page:
Ralf Wischnewski
10-August-2000