how to find smallest-area ellipsoid enclosing 95% of points?

Discussion:

(too old to reply)

Jonathan Thornburg (remove animal to reply)

2019-04-01 04:51:45 UTC

I have a data-analysis problem, and I wonder if someone can suggest
software to solve it.

Summary:
I have a number of points (typically 300 to 1000) in $R^2$. I'm looking
for code (in any reasonable programming language) to find the smallest-area
ellipsoid which encloses 95% of the points. The code doesn't need to be
particularly efficient.

More details:
I have a bunch of time series each consisting of relatively-slowly varying
"background" plus some signals; in certain regions of the data the signals
are well-approximated by damped sinusoids. For each time series, I've
done a nonlinear least-squares fit (using the MINPACK routine LMDIF1) of
a model (signal = spline "background" + damped simusoid) to the data. I'd
like to estimate a confidence region for the fitted damped-sinusoid
frequency and decay rate so that I can assess whether these are
consistent with a theoretical prediction.

The residuals from my fits are highly non-random -- they are dominated
by low-level quasi-periodic oscillations with frequencies similar to
those of the fitted damped-sinusoids. (By "low-level" I mean that the
RMS residual is typically on the order of $10^{-3}$ times the amplitude
of the fitted damped-sinusoid.)

Because of the systematics in the residuals, I don't think a standard
$\chi^2$-square error estimates would be valid. Instead, I have estimated
a confidence region via a Monte-Carlo procedure in which I do trial fits
on randomly-chosen subintervals of the data, and use the scatter in the
trial fits' damped-sinusoid parameters as an estimate of the uncertainty
in those parameters for the original fits.
[The "randomly-chosen" subintervals are subject to
a minimum length of several oscillation periods so
that (a) the oscillation period & decay rate are
well-defined, and (b) the oscillation is not degenerate
with the "background" spline.]
I typically do a few hundred Monte-Carlo trials for each time series.

The result is that for each time series, I now have a set of (say) 300
(period,decay_rate) points in $R^2$. A scatterplot of these shows a cloud
of points (in some cases with "interesting" structure, but I don't think
that's particularly relevant here); I believe that this cloud is a reasonable
estimate of the actual confidence region of the fitted parameters.

The problem I now face is how to summarize such a scatterplot with a few
numbers. (I actually have several hundred time series, each with its own
scatterplot.) In particular, for each time series / scatterplot I'd like
to find the minimum-area ellipsoid enclosing 95% of the points.

Mount et al have published an elegant and efficient algorithm for solving
such problems.
[
D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman,
and A. Y. Wu,
"Quantile Approximation for Robust Statistical Estimation
and k-Enclosing Problems",
Int'l. J. of Computational Geometry and Applications,
volume 10 (2000), pages 593-608.
http://www.cs.umd.edu/users/mount/Papers/QuantApx.pdf
]
Alas, I have not been able to find any implementation of their algorithm
(and Mount himself does not know of one), and I don't want to reimplement
it myself (it's quite complicated, and I have many other things to finish
for this project!).

I strongly suspect that other people using Monte-Carlo methods have
encountered this or a similar problem before, which suggests that there
may already be software available for solving it. So... Can anyone in
this newsgroup point me to software for finding the minimum-area ellipsoid
which contains at least 95% of a specified set of points in $R^2$?

Given that I have only a few hundred such sets-of-points, I could tolerate
the minimum-area-ellipsoid computation being fairly inefficient -- say
up to an hour of CPU time and a few gigabytes of memory.

Thanks,

--
-- "Jonathan Thornburg [remove -animal to reply]" <***@astro.indiana-zebra.edu>
Dept of Astronomy & IUCSS, Indiana University, Bloomington, Indiana, USA
"He wakes me up every morning meowing to death because he wants to
go out, and then when I open the door he stays put, undecided, and
then glares at me when I put him out"
-- Nathalie Loiseau (French minister for European Affairs,
explaining why she named her cat "Brexit")

r***@gmail.com

2019-04-03 21:58:51 UTC

Permalink

Post by Jonathan Thornburg (remove animal to reply)
I have a data-analysis problem, and I wonder if someone can suggest
software to solve it.
I have a number of points (typically 300 to 1000) in $R^2$. I'm looking
for code (in any reasonable programming language) to find the smallest-area
ellipsoid which encloses 95% of the points. The code doesn't need to be
particularly efficient.

ALGLIB has a large number of algorithms in it for solving differential
equations, integration, linear and non-linear optimization, statistical
analysis, neural net algorithms, interpolation (featuring the extremely
efficient and precise interpolation by rational functions as well as
radial basis and inverse distance weighted interpolation and spline).

You can put together your own algorithm and implementation using the
resources at your disposal with the methods provided. To make the
problem on hand accessible to non-linear optimization, what you might do
is smear out the "inside of" function for a point so that it goes from 1
(when well inside an ellipse) to 0 (when well outside an ellipse)
smoothly. Then you can encode the problem as a constrained optimization
problem, ensuring that the total value of the membership function remain
at or above 0.95 x the number of points, while setting the optimization
to minimize the size of the ellipse.

Essentially, what this does is make the boundary of the ellipse
fuzzy. So, you have an adjustable "sharpness" parameter that is used to
define the "membership" function; and the sharper you set up the
problem, the more accurately it will reflect the 100% 0-1-only sharp
limit of the membership function (but the longer the convergence time
will take).

Warning, a proviso: ALGLIB is heavily front-loaded with infrastructure;
because it is actually C that "simulates" C++ (almost as if C++ had been
run through "cfront" or something) and that's in one namespace
(alglib_impl) and then wraps C++ on top of that (in another namespace:
alglib).

But it is fast, if you set it up right. Even the GPL version can run in
multiple cores in parallel, if you have the right compiler and right
compiler settings (GCC can compile to multi-core).

I'll be distributing a branch off the GPL version soon that tears down
much of this infrastructure. It, itself, is just an intermediate form of
a newer, yet to be completed/released, library I named after its
flagship project (the future robot named and modelled on a close
friend); on which several of the issues and/or bugs and/or design
inefficiencies in the library are addressed and resolved. I might post a
followup to on this later. It has a Makefile for "make test" that
contains the configuration I'm using.

Douglas Eagleson

2019-04-04 04:49:07 UTC

Permalink

[Moderator's note: Rest of quoted post deleted. I'm not sure that the
following works. It is clear what needs to be done; the question is how
to do it. Future posts on this topic should ideally point to actual
code. -P.H.]

Consider an ellipsoid as a set of line segments. This is a graphical
approximation of a function. This discrete representation can
be made as fine as desired.

Now for each line segment place an interior normal and exterior
normal pair of points.

For each point in your set calculate the distance to the
two normal's. Do this for each line segment. If the
distance to the interior normal is smaller than to the
exterior normal the point is inside the ellipsoid if
true for all line segments.

Now in the global sense graphically reduce or expand
the ellipsoid until the scoring of 95% for your set
of points is found.

This is a nonfunctional method, but can approximate
with as fine a resolution as wanted.