Hacker News | JunaidB's comments

I think this is a good, concise overview of ML model evaluation. I was recently thinking along the same lines and posted about it on my blog here: https://scienceofdata.org/2020/02/23/ml-classifier-evaluatio...

You have nice graphs and cover regression models too. I mostly use these when evaluating models I've built; I also use the Kappa statistic, which can be a useful summary.


Good post! I really liked the ROC curve section. I'll read more about it.


Your point that converging to the exact minimum of the training loss doesn't guarantee the best generalization loss reminds me of the point made in this lecture: https://www.youtube.com/watch?v=k3AiUhwHQ28.

You also made an interesting comment about work not catching on outside of the optimization community - can you recommend some resources or websites to follow in order to see what the optimization community is working on? I've developed an interest in the area but don't really know where to go for "up to date" information.


I’m not that other guy, and I also haven’t read this paper, but it seems quite thorough:

https://arxiv.org/abs/1606.04838


This seems like an excellent review. I'll check it out. Thanks very much!


That's a great point and to be honest I could have been a lot tighter with the terminology. Good advice to take on board for next time - thanks!

Your point about combining optimisation techniques is interesting and I'd love to learn about it a little more. When you say "As such, most good, fast optimization algorithms based on differentiable functions use the steepest descent direction as a metric, or fallback, to guarantee convergence and then use a different direction, most likely a truncated-Newton method, to converge quickly", does this mean that both algorithms are being used together? So first steepest descent is run for a few iterations and then the truncated-Newton method takes over?

If you have some resources where I could read up on this it would be much appreciated!


Though I have complaints with it, Numerical Optimization by Nocedal and Wright is probably the best reference for modern optimization techniques. My complaint is that they also present many historical techniques that I would argue should not be used, and they don't provide clear guidance as to which are the modern, robust algorithms. To be sure, arguments can be made for all sorts of algorithms, but I will contend:

(unconstrained) trust-region Newton-CG [Algorithm 7.2 in Numerical Optimization]
(equality) composite-step SQP method [Algorithm 15.4.1 in Trust-Region Methods by Conn, Gould, and Toint]
(inequality) KNITRO interior-point algorithm [Algorithm 19.4 in Numerical Optimization]
(equality and inequality) a combination of the above

There are many implementation nuances to these algorithms, and they can be made better than their textbook presentation, but I believe them to be a good starting point for modern, fast algorithms.

As far as switching back and forth between the Newton and gradient descent steps, this is largely done in a class of algorithms called dogleg methods. Essentially, the Newton step is tried against a convergence criterion. If the step satisfies this criterion, it is taken. If not, the step is shortened until eventually it becomes the gradient descent step. I'll contend that truncated-CG (Steihaug-Toint CG) does this, but better. Essentially, it's a modified conjugate gradient algorithm for solving the Newton system that maintains a descent direction. The first Krylov vector this method generates is the gradient descent step, so it reduces to that step if convergence proves difficult.
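For the curious, here is a bare-bones sketch of Steihaug-Toint truncated CG along those lines. This is my own simplified rendering, not the book's pseudocode; `Hv` is assumed to be a function returning Hessian-vector products, and `delta` is the trust-region radius:

```python
import numpy as np

def steihaug_cg(Hv, g, delta, tol=1e-8, max_iter=100):
    """Approximately minimize the quadratic model m(p) = g'p + 0.5 p'Hp
    inside the trust region ||p|| <= delta. The first search direction
    is -g (the gradient descent step), so the method falls back to it
    when curvature is bad or the boundary is hit early."""
    p = np.zeros_like(g)
    r = g.copy()                  # residual of the Newton system H p = -g
    d = -r                        # first Krylov direction = steepest descent
    for _ in range(max_iter):
        Hd = Hv(d)
        dHd = d @ Hd
        if dHd <= 0:              # negative curvature: run to the boundary
            return _to_boundary(p, d, delta)
        alpha = (r @ r) / dHd
        p_next = p + alpha * d
        if np.linalg.norm(p_next) >= delta:   # step leaves the trust region
            return _to_boundary(p, d, delta)
        r_next = r + alpha * Hd
        if np.linalg.norm(r_next) < tol:      # Newton system solved
            return p_next
        beta = (r_next @ r_next) / (r @ r)
        d = -r_next + beta * d
        p, r = p_next, r_next
    return p

def _to_boundary(p, d, delta):
    """Intersect the ray p + tau d (tau >= 0) with ||p + tau d|| = delta."""
    a, b, c = d @ d, 2 * (p @ d), p @ p - delta**2
    tau = (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)
    return p + tau * d
```

With a positive definite Hessian and a large enough trust region, this returns the exact Newton step; with a tiny trust region it returns a scaled steepest descent step, which is the dogleg-like behavior described above.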

More broadly, there's a question of whether all the trouble of using second-order information (Hessians) is worth it away from the optimal solution. I will contend, strongly, yes. I base this on experience, but there are some simple thought experiments as well. For example, say we have the gradient descent direction. How far should we travel in this direction? Certainly, we can conduct a line search or play with a "learning parameter". (If you do this, please use a line search, because it will provide vastly better convergence guarantees and performance.) However, if we have the second derivative, we have a model for determining how far to go. Recall, a Taylor series tells us that f(x + dx) ~= f(x) + grad f(x)'dx + 0.5 dx' hess f(x) dx. We can use this to figure out how far to travel: find an optimal alpha for the model J(alpha) ~= f(x + alpha dx) = f(x) + alpha grad f(x)'dx + (alpha^2/2) dx' hess f(x) dx. If dx' hess f(x) dx > 0, the model is convex in alpha and we can simply look for when J'(alpha) = 0, which occurs at alpha = -grad f(x)'dx / (dx' hess f(x) dx). When dx' hess f(x) dx < 0, this says we should take a really long step, as the model predicts the gradient becomes even more negative in this direction the farther we go. Though both methods must be safeguarded (the easiest safeguard is to halve the step until we get descent), the point is that the Hessian provides information that the gradient did not, and this information is useful. This is only one place where it can be used; others include the direction calculation itself, which is what truncated-CG does.
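To make that step-length formula concrete, here is a small numpy sketch on a toy quadratic. The function and variable names are mine, and the long-step constant for the negative-curvature case is an arbitrary placeholder, not a recommended value:

```python
import numpy as np

def quadratic_step_length(g, Hd, d):
    """Step length along direction d from the local quadratic model
    J(alpha) ~= f(x) + alpha g'd + (alpha^2/2) d'Hd."""
    gd = g @ d          # grad f(x)' d; negative for a descent direction
    dHd = d @ Hd        # d' hess f(x) d, via a Hessian-vector product
    if dHd > 0:
        return -gd / dHd          # minimizer of the convex 1-D model
    return 100.0                  # negative curvature: take a long, safeguarded step

# Toy example: f(x) = 0.5 x'Ax with A positive definite
A = np.array([[3.0, 0.0], [0.0, 1.0]])
x = np.array([1.0, 1.0])
g = A @ x                         # gradient of f at x
d = -g                            # steepest descent direction
alpha = quadratic_step_length(g, A @ d, d)
x_new = x + alpha * d             # f(x_new) < f(x) on this model
```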

As a brief aside, the full Hessian is rarely, if ever, computed. Hessian-vector products are enough, which allows the approach to scale to essentially anything a gradient descent method can scale to.
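One common way to get a Hessian-vector product without ever forming the Hessian is a finite difference of the gradient (a rough sketch; autodiff frameworks provide exact versions of the same operation):

```python
import numpy as np

def hessian_vector_product(grad, x, v, eps=1e-6):
    """Approximate hess f(x) @ v with a central difference of the
    gradient, so the full Hessian is never formed or stored."""
    return (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)

# Toy example: f(x) = sum(x**4), so hess f(x) = diag(12 * x**2)
grad = lambda x: 4 * x**3
x = np.array([1.0, 2.0])
v = np.array([1.0, 1.0])
hv = hessian_vector_product(grad, x, v)
```

For an n-dimensional problem this costs two gradient evaluations per product, versus O(n) for the full Hessian, which is why methods like truncated-CG scale.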

As one final comment, the angle observation that you make in the blog post is important. It comes up in a different form when proving convergence of methods; see Theorem 3.2 in Numerical Optimization, which uses expression 3.12. Essentially, to guarantee convergence, the angle between the gradient descent direction and whatever direction we choose must be controlled.
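A toy illustration of such an angle safeguard (the threshold `delta` and the function name are my own, not the book's; the cosine condition is the 3.12-style quantity):

```python
import numpy as np

def safeguarded_direction(g, d, delta=0.1):
    """Fall back to steepest descent when the angle between -g and the
    proposed direction d gets too close to 90 degrees, i.e. when
    cos(theta) = -g'd / (||g|| ||d||) drops below delta."""
    cos_theta = -(g @ d) / (np.linalg.norm(g) * np.linalg.norm(d))
    return d if cos_theta >= delta else -g
```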


Thank you for taking the time to write such a thorough and considered response. I have been working through Engineering Optimization: Methods and Applications by Ravindran, Ragsdell and Reklaitis so far, but I will spend some time in the coming weeks with Nocedal and Wright, per your recommendation.

I intend to write more about what I learn in this area, and I'd be honoured if you would contribute like you did here with your comments, corrections and suggestions! Thank you for the help and the reference; I will definitely be following up.


This is a really important point and I wish I'd mentioned it. The computational considerations (as you've said) make the classic gradient descent method infeasible in practice, so we resort to stochastic estimates or quasi-Newton approaches (which I'm still looking into).

My main objective was to highlight that, given we are performing classic gradient descent, the negative gradient direction yields the greatest local reduction in the function value. Essentially it was a point to highlight the underlying calculus. Wayne Winston, in his book Operations Research: Applications and Algorithms, has an interesting passage discussing the gradient as the direction of maximum increase (he was looking at steepest ascent).
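That first-order claim can be checked numerically: among unit directions, none decreases f faster than the normalized negative gradient. A quick sketch (the quadratic test function and names are mine):

```python
import numpy as np

# Among unit directions u, the directional derivative grad f(x)' u is
# minimized at u = -g / ||g||: no direction decreases f faster, to first order.
grad = lambda x: np.array([2 * x[0], 6 * x[1]])   # f(x) = x0^2 + 3 x1^2
x = np.array([1.0, 1.0])
g = grad(x)

rng = np.random.default_rng(0)
sampled = [g @ (u / np.linalg.norm(u)) for u in rng.standard_normal((1000, 2))]
steepest = g @ (-g / np.linalg.norm(g))           # equals -||g||
# min(sampled) can approach but never beat `steepest`
```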


Thank you for taking the time to look through the post. When I learned this formally it was introduced as steepest descent (Cauchy's variant). Like you, I didn't find it concrete until I looked at the surface plots. I concur with your point: it's interesting to see the different paths people take toward learning the same thing.


I too, like the commenter above, find it fascinating how various concepts are quickly accessible (or not!) to different people. Usually I blame it on the teacher's presentation (for my part, I am completely unable to explain pointers (in, say, C) to people who haven't already managed to get it), but sometimes it is also the building blocks that the learner already has in place.

So - maybe a rude question - but had you taken Calculus, Differential Equations or Vector Calculus (div, grad and curl and all that) before you started in on this more integrated material?


Not rude at all! Thanks for the question. My background is in Statistics, not Machine Learning (that came later), so I covered these topics without reference to ML applications. I suppose I never learned to connect these ideas when I was learning about ML; it was only when I went back to my old material that I realised how they relate. I agree with your point about the teacher's presentation: when I was learning about ML it was kept separate from raw Calculus.

A short answer to your question: yes, I took those courses before I started looking at the more integrated material.

