My ML PhD friends say that SVMs are popular in academic circles in part because you can prove things about them. They are in the NN camp when it comes to solving problems.
I like the basic idea. It feels like a nonlinear relative of PCA (Principal Component Analysis). I also haven't seen NNs used for outlier detection.
SVMs come out of the theory of the Vapnik–Chervonenkis dimension, which is about balancing the best achievable learner performance against the likelihood of actually reaching that theoretical optimum. It's a very nice theory.
The nonlinear stuff is called the "kernel trick", and it is essentially the recognition that an SVM's operation depends on the data exclusively through the n^2 pairwise measurements of how far apart the n data points are from one another. Once you realize that, you can twist the definitions around so as to insert whatever nonlinear notion of "distance" you like. Many ML algorithms can be augmented by the kernel trick.
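To make that concrete, here is a rough sketch using scikit-learn (the RBF kernel, its gamma value, and the toy data are purely illustrative choices, not anything from the article): the classifier is fed nothing but an n x n matrix of pairwise similarities, so you can precompute that matrix with whatever kernel you like.

    # The SVM only sees the n x n Gram matrix of pairwise similarities.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics.pairwise import rbf_kernel

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)   # a toy, nonlinearly separable label

    K = rbf_kernel(X, X, gamma=0.5)           # all the "distance" info the SVM uses
    clf = SVC(kernel="precomputed").fit(K, y)

    # At prediction time you again only supply similarities to the training points.
    K_new = rbf_kernel(X[:10], X, gamma=0.5)
    print(clf.predict(K_new))

Swapping rbf_kernel for any other positive semi-definite kernel changes the notion of "distance" without touching the rest of the algorithm.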
Finally, the geometry of the SVM lets you pose it as a relatively nice optimization problem (a convex quadratic program), whereas many other ML algorithms have only heuristic optimization methods. It makes the algorithm surprisingly fast.
I'm not sure if it's been done before, but neural networks could be used for outlier detection. Autoencoders can also be thought of as a form of nonlinear PCA, and the LISA group in Montreal has actually shown that the reconstruction error of a contractive or denoising autoencoder at a data point serves as an estimator of the probability density at that point.
Actually, NNs (specifically autoencoders) are more like a nonlinear PCA. PCA can be seen as an "information bottleneck": you try to represent an N-dimensional space in K < N dimensions such that your ability to reconstruct the original space is maximized (that is, the "reconstruction error" is minimized).
Autoencoders are precisely the same idea, except that you use nonlinear transformations (e.g. sigmoid or tanh) in addition to the linear transformations (matrix multiplications). You can also use multiple layers in an autoencoder, just like any neural net. (The linearity of PCA makes multiple layers useless there: you can always collapse several linear layers into a single one.)
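If it helps, here is a minimal sketch of that bottleneck idea (assuming PyTorch; the layer sizes, the tanh nonlinearity, and the training loop are illustrative choices, not anyone's reference implementation). The per-point reconstruction error at the end is the quantity that can double as an outlier score, in the spirit of the LISA result mentioned above.

    # Autoencoder as a nonlinear information bottleneck: N dims -> K < N -> N.
    import torch
    import torch.nn as nn

    N, K = 20, 3  # original and bottleneck dimensions

    autoencoder = nn.Sequential(
        nn.Linear(N, K),   # encoder: compress to K dimensions
        nn.Tanh(),         # the nonlinearity is what distinguishes this from PCA
        nn.Linear(K, N),   # decoder: reconstruct the original N dimensions
    )

    x = torch.randn(256, N)  # toy data
    optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-2)
    for _ in range(200):
        optimizer.zero_grad()
        loss = ((autoencoder(x) - x) ** 2).mean()  # mean reconstruction error
        loss.backward()
        optimizer.step()

    # Per-point reconstruction error, usable as an outlier score.
    with torch.no_grad():
        scores = ((autoencoder(x) - x) ** 2).mean(dim=1)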
This may end up as an empty thread, as SVMs are pretty simple and very well known... but I'd like to say that I enjoyed the post. One hobby of mine is reading the same mathematical/technical material through different writers' perspectives.
Nice to hear you enjoyed it. SVMs are indeed quite familiar in the machine learning community. But I think (and hope :)) that more introductory material will help others. And the existing material on one-class classification is not that exhaustive. I hope to be able to write a follow-up on novelty detection in time series; that is the main subject of my master's thesis.
The article and your research are quite interesting. I am currently interested in similar things (more from the theoretical side than the practical application side).
You say at the end of the article that you are interested in "change detection", which is quite broad. What exact setting are you interested in?
Here is a (biased, IMHO) taxonomy of change detection methods for time-series data:
1. Only samples from a time series are provided, no labels or assumptions.
2. The time series has different "modes". For example, you have a time series of an electric motor that can either be in normal operation or abnormal operation (broken). You have "training" data that consists only of normal operation ("inliers"). You then have "test" data that is a time series of the motor including some abnormal (failure) times (the unlabeled test dataset therefore contains both "inliers" and "outliers"). The goal is then to identify the abnormal times (you do not have labels for the abnormal times, so you cannot straightforwardly train a classifier).
3. Time series has different “modes”. For example, you have a time series of an electric motor in normal operation. You only have “test” data that is a time series of the motor and some abnormal points (e.g. failure). The goal is then to identify the outlier points.
4. The time series has different "modes", but you don't have access to labeled sets. A good example is human activity recognition from an accelerometer (e.g. "walking", "running", "dancing", and "sitting"). This is more akin to clustering (i.e. dividing the time series data into four clusters).
For (1), I don't think one-class SVM is the best approach (there are other methods based on divergence estimation). One-class SVM will work well for (3); see the sketch below. There are already some techniques for (2) (I have some ideas to improve them). (4) is an interesting problem, because it is actually an "easier" problem than clustering.
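To be concrete about (3), this is roughly what I mean by using a one-class SVM there (scikit-learn; the nu and gamma values and the toy data are placeholders, not tuned):

    # Fit on "normal" readings only, then flag the test points the model rejects.
    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(1)
    normal = rng.normal(0.0, 1.0, size=(500, 2))            # training data: inliers only
    test = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),   # mostly inliers...
                      rng.normal(6.0, 1.0, size=(5, 2))])   # ...plus a few abnormal points

    ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(normal)
    labels = ocsvm.predict(test)        # +1 = inlier, -1 = outlier
    print(np.where(labels == -1)[0])    # indices of the flagged points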
I assume that you have a labeled dataset. Then perhaps (2) is the best setting for determining change points. All of the training data can then be considered to be one class ("inliers"). The points at which the class changes are the ones you don't know (i.e. the "outliers").
EDIT: Sorry for the wall of text. Also, why your idea is good:
Consider a person going from "sitting" to "jogging". When the change point occurs, the reading on the accelerometer will be high for a short period (in the forward direction). As soon as the person settles into a steady pace, the reading returns to roughly zero.
Thank you for your nice reply; currently I don't have time to look at your suggestions in depth, but I can give a short sketch of my interpretation of "change detection" in this context.
I will consider a signal from the accelerometers in a smartphone. There are already many algorithms to determine the activity a user is currently performing (such as walking, jogging, biking, sitting, etc.). Often these methods take windows of time (1-2 seconds) and compare characteristics with learned models. This way the time series data is clustered/classified into models, and as a side effect an (implicit) segmentation of the time series arises.
My theory is that when the segmentation of the time series is performed explicitly first, the (bigger) chunks of data can be better characterized. So my goal is to find the change points in the time series; a change point means that the underlying process (the user carrying the phone) has changed (e.g. from walking to running).
I want to detect the change points using novelty-detection techniques; so roughly speaking: when a certain number of data points are considered "new"/"outliers", I assume there is a change in the signal and report that moment in time. Then the model adapts to the new distribution of data and watches again for the next change point.
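Roughly, the loop I have in mind looks something like this (only a sketch: scikit-learn's OneClassSVM stands in for whichever novelty detector ends up being used, and the window lengths and thresholds are made-up illustrative values):

    # Windowed change detection: fit on a reference window, slide forward, report a
    # change point when enough points in a window are flagged as outliers, then
    # refit ("adapt") on the new regime.
    import numpy as np
    from sklearn.svm import OneClassSVM

    def detect_changes(signal, ref_len=100, window=20, min_outliers=15):
        changes = []
        start, i = 0, ref_len
        model = OneClassSVM(kernel="rbf", nu=0.05).fit(signal[start:start + ref_len])
        while i + window <= len(signal):
            flags = model.predict(signal[i:i + window])   # +1 inlier, -1 outlier
            if np.sum(flags == -1) >= min_outliers:       # enough novelty -> change point
                changes.append(i)
                start = i                                 # adapt to the new distribution
                model = OneClassSVM(kernel="rbf", nu=0.05).fit(signal[start:start + ref_len])
                i = start + ref_len
            else:
                i += window
        return changes

    # Toy stream with a mean shift halfway through.
    rng = np.random.default_rng(2)
    stream = np.vstack([rng.normal(0, 1, size=(300, 2)),
                        rng.normal(4, 1, size=(300, 2))])
    print(detect_changes(stream))   # should report a change near index 300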
It's always hard to get a bird's-eye view of these things, but from my vantage point, I think SVMs are much more widely used in applications, but have been surpassed by deep learning as the current hot research area in machine learning. There is still SVM research, but it has plateaued compared to the frenzy of the late 1990s and early 2000s, while deep learning has taken off in the past few years. SVMs have instead entered the realm of "standard ML tool": an off-the-shelf technique with robust and well-documented implementations that can be applied in a reasonably straightforward way.
One interesting thing is that the mechanism of an SVM is formally very close to that of a shallow neural network (the decision function is a weighted sum of kernel units, much like a single hidden layer). The main thing that differs is the learning mechanism. I believe neural networks inherently take more time to train than SVMs, but for all I know a network could learn a given set "better".
I believe both NNs and SVMs involve tweaks and judgement to get right.
Thanks for writing this. I love the illustration of the concept with the video. I have little knowledge or background in the area so it is very useful.
I got confused where you introduced the objective function though. Is the reader supposed to understand where that comes from or are you just presenting it as a fact? It also doesn't make sense to me that the function is minimised over 'b' but 'b' does not appear in the function?
The objective function is indeed presented "as a fact", in the sense that it is (more or less) the formulation of Cortes and Vapnik (http://link.springer.com/article/10.1007%2FBF00994018), which is the standard formulation of Support Vector Machines.
The value of 'b' is indeed minimized over, but it only appears in the first constraint, which is what pins it down. The quantity b/||w|| determines the offset of the hyperplane from the origin, and 'w' is the normal vector of that plane.
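For reference, the hard-margin primal (in the spirit of the Cortes and Vapnik paper linked above) looks like this; note that b enters only through the constraints, which is why it can be minimized over even though it does not show up in the objective itself:

    \min_{w,\, b} \;\; \tfrac{1}{2}\, \lVert w \rVert^2
    \quad \text{subject to} \quad y_i \, (w \cdot x_i - b) \ge 1, \qquad i = 1, \dots, n

The separating hyperplane is w · x = b. (Cortes and Vapnik's soft-margin version adds slack variables and a C-weighted penalty, but b plays the same role there.)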