The quartiles Q(1), Q(2) and Q(3), the Interquartile Range (IQR), the Semi Interquartile Deviation (SID) and the kth percentile can all be obtained from the original median calculation. I use the freely available, GPL-licensed Octave toolbox to derive these parameters in a few lines of code. The code is transparent and should be easily transportable to Java or any other language with minor adaptation.
Sysp34, check out IQR and SID. Interesting?
More importantly, you forgot to pinpoint the importance of outliers in percentile tracking. I have included this jackpot-winning feature below, just for EL.
Good Luck
Midas
The Octave Toolbox
GNU Octave is a high-level language, primarily intended for numerical computations. It provides a convenient command line interface for solving linear and nonlinear problems numerically, and for performing other numerical experiments using a language that is mostly compatible with Matlab. It may also be used as a batch-oriented language. Octave has extensive tools for solving common numerical linear algebra problems, finding the roots of nonlinear equations, integrating ordinary functions, manipulating polynomials, and integrating ordinary differential and differential-algebraic equations. It is easily extensible and customizable via user-defined functions written in Octave's own language, or using dynamically loaded modules written in C++, C, Fortran, or other languages.
GNU Octave is also freely redistributable software. You may redistribute it and/or modify it under the terms of the GNU General Public License (GPL) as published by the Free Software Foundation. Octave was written by John W. Eaton and many others. Because Octave is free software you are encouraged to help make Octave more useful by writing and contributing additional functions for it, and by reporting any problems you may have.
Algorithm
% STEP 1 - We need to rank the data in order, working from a small
representative sample.
(Btw, use the same trick as last time: first shuffle the sequence, then
sample the distribution by 'bootstrapping' the sample within a
confidence interval to get a nice, small, representative set of the
distribution.
THEN sort the small representative sample in order and compute the
MEDIAN as last time. Use this universal value in the algorithm
below.)
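The preprocessing described above could look roughly like the sketch below. The data in x_full and the sample size n_small are made up for illustration, and the "within confidence interval" check is left out because the post does not say how it was done, so treat this as one possible reading rather than the original trick.

```octave
% Illustrative full data set (hypothetical values).
x_full = [12 7 3 25 8 19 4 31 16 9 22 5 14 28 11];
n_small = 7;                                  % hypothetical size of the small sample

% Shuffle the sequence.
shuffled = x_full(randperm(numel(x_full)));

% Bootstrap: draw n_small values with replacement.
idx = randi(numel(shuffled), 1, n_small);
x = shuffled(idx);

% Sort the small sample and take its median.
y = sort(x);
medianx = median(y);
fprintf('Sample median: %g\n', medianx);
```

The output changes from run to run because the draw is random.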
% The following code should be transparent. The built-in function
'find' is explained at the end; there are Java versions of this
function which can easily be adapted.
% Please note that all the various statistics are easily derived
from simple median equations.
Given a small representative dataset x derived from the preprocessing above,
let:
y = sort(x);
% compute 25th percentile (first quartile)
Q(1) = median(y(find(y<median(y))));
% compute 50th percentile (second quartile)
Q(2) = median(y);
% compute 75th percentile (third quartile)
Q(3) = median(y(find(y>median(y))));
% compute Interquartile Range (IQR)
IQR = Q(3)-Q(1);
% compute Semi Interquartile Deviation (SID)
% The importance and implication of the SID is that if you
% start with the median and go 1 SID unit above it
% and 1 SID unit below it, you should (for a roughly
% symmetric distribution) account for about 50% of the
% data in the original data set
SID = IQR/2;
% determine extreme Q1 outliers (e.g., x < Q1 - 3*IQR)
iy = find(y<Q(1)-3*IQR);
if length(iy)>0,
outliersQ1 = y(iy);
else
outliersQ1 = [];
end
% determine extreme Q3 outliers (e.g., x > Q3 + 3*IQR)
iy = find(y>Q(3)+3*IQR);
if length(iy)>0,
outliersQ3 = y(iy);
else
outliersQ3 = [];
end
% compute total number of outliers
Noutliers = length(outliersQ1)+length(outliersQ3);
% compute the remaining summary statistics
mx = mean(x);
sigma = std(x);
medianx = median(x);
% display results
disp(['Mean: ',num2str(mx)]);
disp(['Standard Deviation: ',num2str(sigma)]);
disp(['Median: ',num2str(medianx)]);
disp(['25th Percentile: ',num2str(Q(1))]);
disp(['50th Percentile: ',num2str(Q(2))]);
disp(['75th Percentile: ',num2str(Q(3))]);
disp(['Semi Interquartile Deviation: ',num2str(SID)]);
disp(['Number of outliers: ',num2str(Noutliers)]);
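As a quick sanity check, here is the quartile and outlier logic run on a small made-up sample with one extreme value. The data are hypothetical, and the upper fence uses Q(3) + 3*IQR, as the "extreme Q3 outliers" comment intends.

```octave
% Hypothetical sample with one extreme value.
x = [7 3 100 1 5 8 2 6 4];
y = sort(x);                          % 1 2 3 4 5 6 7 8 100
m = median(y);                        % 5

Q = zeros(1, 3);
Q(1) = median(y(find(y < m)));        % median of 1 2 3 4   -> 2.5
Q(2) = m;                             %                     -> 5
Q(3) = median(y(find(y > m)));        % median of 6 7 8 100 -> 7.5

IQR = Q(3) - Q(1);                    % 5
SID = IQR / 2;                        % 2.5

% Extreme outliers: more than 3*IQR beyond the quartiles.
outliersQ1 = y(find(y < Q(1) - 3*IQR));   % below -12.5: none
outliersQ3 = y(find(y > Q(3) + 3*IQR));   % above 22.5: 100
Noutliers = length(outliersQ1) + length(outliersQ3);

fprintf('Q1=%g Q2=%g Q3=%g IQR=%g SID=%g outliers=%d\n', ...
        Q(1), Q(2), Q(3), IQR, SID, Noutliers);
```

Indexing with an empty result from find simply gives an empty vector, so the if/else guard is optional here.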
% Percentile Calculation: an Example
% define percent
kpercent = 75;
% STEP 1 - rank the data
y = sort(x);
Nx = length(y);
% STEP 2 - find k% (k/100) of the sample size, Nx.
k = kpercent/100;
result = k*Nx;
% STEP 3 - if this is an integer, add 0.5. If it isn't an integer, round up.
[N,D] = rat(k*Nx);
if isequal(D,1), % k*Nx is an integer, add 0.5
result = result+0.5;
else % round up (ceil, not round, which rounds to the nearest integer)
result = ceil(result);
end
% STEP 4 - Find the number in this position. If your depth ends
% in 0.5, then take the midpoint between the two numbers.
if result ~= floor(result), % depth ends in .5
Qk = mean(y(result-0.5:result+0.5));
else
Qk = y(result);
end
% display result
fprintf(1,['\nThe ',num2str(kpercent),'th percentile is ',num2str(Qk),'.\n']);
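To see both branches of steps 3 and 4 in action, here is a small sketch that runs the same rule on two made-up samples, one where k*Nx is an integer and one where it is not. The data are hypothetical, and the floor/ceil tests are just one way to do the integer check. It assumes 0 < kpercent < 100 so that the depth stays inside the sample.

```octave
kpercent = 75;
% Two hypothetical samples: 8 values (75% of 8 = 6, an integer)
% and 9 values (75% of 9 = 6.75, not an integer).
samples = {[4 8 15 16 23 42 50 61], [4 8 15 16 23 42 50 61 70]};
Qk = zeros(1, numel(samples));

for s = 1:numel(samples)
  y = sort(samples{s});               % STEP 1: rank the data
  Nx = numel(y);
  depth = kpercent * Nx / 100;        % STEP 2: k% of the sample size
  if depth == floor(depth)            % STEP 3: integer -> add 0.5
    depth = depth + 0.5;
  else                                %         otherwise round up
    depth = ceil(depth);
  end
  if depth ~= floor(depth)            % STEP 4: depth ends in .5 -> midpoint
    Qk(s) = mean(y([depth - 0.5, depth + 0.5]));
  else
    Qk(s) = y(depth);
  end
  fprintf('n = %d: the %dth percentile is %g\n', Nx, kpercent, Qk(s));
end
```

For the 8-value sample the depth is 6.5, so the result is the midpoint of the 6th and 7th values (46). For the 9-value sample the depth rounds up to 7, giving the 7th value (50).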
______________________________________________________________________________________________________________________________________________________________
% Octave Built in Function (Btw easily transportable code)
FIND Find indices of nonzero elements.
I = FIND(X) returns the linear indices of the array X that are
nonzero. X may be a logical expression.
Use IND2SUB(I,SIZE(X)) to calculate
multiple subscripts from the linear indices I.
I = FIND(X,K) returns at most the first K indices of X that are
nonzero. K must be a positive integer, but can be of any numeric
type.
I = FIND(X,K,'first') is the same as I = FIND(X,K).
I = FIND(X,K,'last') returns at most the last K indices of X
that are nonzero.
[I,J] = FIND(X,...) returns the row and column indices instead of
linear indices into X. This syntax is especially useful when
working with sparse matrices. If X is an N-dimensional array
where N > 2, then J is a linear index over the N-1 trailing
dimensions of X.
[I,J,V] = FIND(X,...) also returns a vector V containing the
values that correspond to the row and column indices I and J.
If X is a logical expression, then V will contain the values
returned after evaluating that expression.
%Examples:
Let
X = [1 0 4 -3 0 0 0 8 6];
‘Using Octave’
indices = find(X); returns linear indices for the nonzero entries of
X.
indices =
1 3 4 8 9
You can use a logical expression to define X. For example,
find(X > 2); returns the linear indices corresponding to the entries of X that are greater than 2.
ans =
3 8 9
The following commands
Let:
X = [3 2 0; -5 0 7; 0 0 1];
[i,j,v] = find(X)
return
i =
1
2
1
2
3
a vector of row indices of the nonzero entries of X,
j =
1
1
2
3
3
a vector of column indices of the nonzero entries of X, and
v =
3
-5
2
7
1
a vector containing the nonzero entries of X.
Some operations on a vector
Let:
x = [11 0 33 0 55]';
find(x)
ans =
1
3
5
find(x == 0)
ans =
2
4
find(0 < x & x < 10*pi)
ans =
1
For the matrix
M = magic(3)
M =
8 1 6
3 5 7
4 9 2
find(M > 3, 4)
returns the indices of the first four entries of M that are greater than 3.
ans =
1
3
5
6
If X is a vector of all zeros, find(X) returns an empty, 0-by-1 matrix. For example:
indices = find([0;0;0])
indices =
Empty matrix: 0-by-1
QED.
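For anyone porting the quartile code to a language without 'find', the sketch below shows two equivalent ways to get the same selection: a logical mask, and a plain loop that collects indices and maps directly onto a for-loop in Java. The sample data are made up for illustration.

```octave
y = [1 2 3 4 5 6 7 8 100];            % hypothetical sorted sample
m = median(y);                        % 5

% 1) find returns the indices where the condition holds.
viaFind = y(find(y < m));

% 2) A logical mask selects the same elements without find.
viaMask = y(y < m);

% 3) An explicit loop that collects the indices one by one.
idx = [];
for i = 1:numel(y)
  if y(i) < m
    idx(end + 1) = i;
  end
end
viaLoop = y(idx);

disp(isequal(viaFind, viaMask, viaLoop));   % displays 1 when all three agree
```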