Support Vector Machine classification and regression.
These classes are experimental. They implement support vector machines for classification and regression. Support vector machines are powerful artificial intelligence engines which construct hyperplanes in very high dimensional feature spaces using time and memory related to the number of training examples, not the size of the feature space.
A dot product <a, b> is equal to
||a|| * ||b|| cos(theta) where the sizes of a and b are scalars
giving the lengths of a and b, and * indicates normal multiplication. 
Theta is the angle between the items a and b.
It should be clear from this that the dot product of two scalars is the value
you would get by the normal notion of multiplication, as cos(theta) is by
definition 1.
The equation of a plane is:
W * x + b = 0, where W is a vector representation of
the normal of the plane, x is any point within the plane and b is a scalar
constant giving the minimal distance of the plane from the origin. If we compute
f(x) = W * x + b for an x that does not lie on the
plane, then f(x) is equal to the minimal distance of x from the
plane. If f(x) = y where y is the correct label or
value for x then we have a working artificial intelligence engine. In reality,
we find an f(x) that minimizes sum_i (||f(x_i) - y_i||)
for the training examples x_i, y_i.
The normal vector W could be represented as a sum of vectors such
that sum_i (w_i) = W. We can further break each w_i
down into two parts, a point (x_i) and a scaling factor
(a_i) so that W = sum_i (a_i * x_i). It follows that
f(x) can be represented by sum_i (a_i * x_i * x) + b.
It is unlikely that the observed values of your data are the best indicators upon which to classify the points. For example, when looking at heart-attacks, you may measure body-mass, height and age, but the distinguishing variable may be some mixture of these three variables, and you forgot to add in the effects of stress. Support vector machines can not tell you to go out and add a stress field to the data, but it can find that optimal mix of height, age and body-mass that pushes you over the edge into a high-risk of heart disease category. It does this by using a kernel function.
In the above situation, the data-space was three dimensional. However, the
solution is embedded in a feature space containing all possible combinations of
these three variables. We can define a mapping function M:X->F such
that M(x) = f and f is a point in the space of all
possible combinations of the tree variables.
To construct a plane in F space, we must calculate the dot product
of two points in that space. The expression
<M(a), M(b)> can be re-written as k(a, b) where
k implicitly maps the points a and b into
F-space, and returns the dot product there. To be useful to us, the
function k must be expressible as
k(W, x) = sum_i (a_i * k(x_i, x)) for suitable values of
a_i and x_i. It so happens that to search through all
combinations of our heart-attack indicators we can use
k(a, b) = (s * a * b + c)^d where s is a scaling
factor, a * b is a dot product, c is a constant and
d is a power. s and c allow you to tune
the relative lengths of the unit vectors along each axis relative to the
input space. d is the number of input space dimensions that
contribute to each feature space dimension. For example, d=2 with
c = 0 and s = 1 would
produce a feature space of the form:
This is in effect the space of input
height*height, height*weight, height*age,
weight*height, weight* weight, weight*age,
age*height, age*weight, age*age
^d (the dimensions of feature
space are the cross product of the input space d times). The variable
c mixes in all of the spaces of the form
input^i; i = {0 .. d}. Points in this feature space represent the
coefficients of all possible polynomials of order d. Since any
polynomeal can be constructed from a weighted sum of other polynomials, this
particular kernel function is useful to us.
So, the equasion for our classifier now becomes:
f(x) = sum_i (a_i * k(x_i, x)) + b
and the problem of training becomes setting a_i and b
optimaly given training data points x_i to minimize the cumulative 
error of ||f(x_i) - y_i|| over all training examples.
It should be obvious that we can only explore the sub-space of the feature
space that is spannable as a linear combination of the input data. This has
the practical effect that although the feature space may be infinite
dimensional, we explore the finite sub-space that the training points are
embedded within. This guarantees that the model f(x) can never
contain more information than the training examples x_i, y_i. Also,
as the hyperplane chosen is the best one, it is most likely to generalise well
to new data, assuming that it was drawn from the same population as the
training examples. In practice, usually only a small subset of the weights
a_i need to be set to a value other than 0, so the solution
is over an even smaller subspace. The training examples that have a non-zero
a_i are called the support vectors, as they support the hyperplane.
Many kernels, such as the polynomial kernel, calculate the dot-product in a feature space that is a combination of the input space dimensions, and do so by transforming the input space dot product in some way. Let us re-visit the polynomial kernel to illustrate one of the most striking points about kernel functions - that they can be nested to build very specific feature spaces.
In this way, kernels can be layered one on another, to scale, stretch and combine sub-spaces to produce the feature space that you believe is optimal for locatingkpoly(a, b | s, c, d) = (s * a * b + c)^dbuta * bis<a, b>in linear space. Rewriting using a kernelkgives:kpoly(a, b | s, c, d, k) = (s * k(a, b) + c)^dwherekperforms the dot product in the space that the polynomial kernel will cross-product.
f(x).
The design philosophy for kernels is as follows. There are basic kernels that calculate a type-specific dot product. These will normally be found by looking for a static final method named Kernel of type SVMKernel within a class. Then there are permutation kernels. These include PolynomialKernel and RadialBaseKernel. These use a primary kernel to calculate the dot product in the natural space for the objects, and then performs some function on that which produces a result equivalent to doing the same dot-product in a very much higher dimensional feature space. To avoid needless recalculation, there are some caching kernels that only calculate the value of the kernel they wrap if needed, and retrieve the value from a cache otherwise.
As far as possible, kernels are java beans.