
I don’t understand the obsession with LOC for wrappers - that’s the whole point of a wrapper. It makes things much easier for the user at the expense of making them less hackable.

The title should instead be “Library for low-code RLHF in Python”.



Another problem with the title: the article is about DPO, which doesn’t do reinforcement learning, so it isn’t RLHF. I guess RLHF has more name recognition than DPO.


Honestly a much bigger problem than LOC. It’s a completely different algorithm.


This was discussed in another comment: DPO is pretty much strictly better than RLHF + PPO, and far more stable during training. Yes, DPO is not technically "RL", but that's semantics for the most part. DataDreamer does support PPO training if you want, but it's so unstable that it's become a less popular choice.


In the DPO paper linked from the OP page, DPO is described as "a simple RL-free algorithm for training language models from preferences." So as you say, "not technically RL."

Given that, shouldn't the first sentence on the linked page end with "...in a process known as DPO (...)" ? Ditto for the title.

It sounds like you're saying that the terms RL and RLHF should subsume DPO because it solves the same problem with similar results. But they're different techniques, and there are established terms for both.


I think the other comment thread covers this well. They are different techniques, but the line between RL & SL is quite fuzzy. The DPO authors advertise it as a "non-RL" technique precisely to get away from RL's reputation for unstable training, but they also treat the language model as an (implicit) reward model, similar to PPO. The point is well taken, though; I will update this page to clarify the differences and avoid confusion.
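To make the "implicit reward model" point concrete, here is a minimal per-example sketch of the DPO objective from the paper (not DataDreamer's actual code): the policy's log-probability ratio against a frozen reference model acts as the reward, and a logistic loss pushes the chosen completion's reward above the rejected one's.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss; inputs are summed log-probabilities of the
    chosen/rejected completions under the policy and reference models."""
    # Implicit reward for each completion: beta * log(pi / pi_ref).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Logistic loss on the reward margin: minimized as the policy
    # prefers the chosen completion over the rejected one.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In practice this is averaged over a batch, with the log-probs computed by the actual language models; no separate reward model or PPO rollout is needed, which is where the stability advantage comes from.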


> DPO is pretty much strictly better than RLHF + PPO

Out of genuine curiosity, do you have any pointers/evidence to support this? I know that some of the industry-leading research labs haven't switched over to DPO yet, despite DPO being significantly faster than RLHF. It might just be organizational inertia, but I don't know. I would be very happy if simpler alternatives like DPO were as good as RLHF or better, but I haven't seen that proof yet.


I can second that. From what I’ve heard from people at leading labs, it’s not clear that DPO is worth switching to from RLHF.


This. If I'm the type of person who wants to do RLHF, then I'm the type of person who wants control and doesn't like delegating it to imported libraries.


This is built for ML researchers out of an academic lab. There's a ton of functionality in the library (beyond RLHF and alignment) covering things ML researchers do every day to write papers and run experiments, which the library abstracts and makes repeatable and reusable.

Unless your research hypothesis is specifically around improving or changing RLHF, it's unlikely you should be implementing it from scratch. Abstractions are useful for a reason. The library is quite configurable to let you tune any knobs you would want.


This is developed for researchers, so I assure you it’s very hackable and configurable. ;-) But I appreciate the feedback on the title!


Honestly the amount of complicated boilerplate that you're supposed to write from scratch every time you do something with ML (in some of the major frameworks) deterred me from touching anything ML-related for a long time.

As far as I understand, what the training loop does is pretty static, and you don't need to understand most of it in order to "do ML". At the same time, it's full of complicated things to get right, which would be much easier to manage through well-defined parameters instead of a mix of boilerplate and config.
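For what it's worth, the "static scaffolding" being described can be shown with a toy example - a plain-Python gradient-descent loop for a 1-D linear model (not any particular framework's API). Everything except the hyperparameters is the kind of boilerplate a wrapper hides behind config:

```python
def train(data, lr=0.1, epochs=100):
    """Fit y = w*x + b to (x, y) pairs by gradient descent on squared error.
    The loop structure (forward pass, gradient accumulation, optimizer
    step) is the boilerplate; lr and epochs are the actual config."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:                      # forward + backward pass
            err = (w * x + b) - y
            grad_w += 2 * err * x / len(data)
            grad_b += 2 * err / len(data)
        w -= lr * grad_w                       # optimizer step
        b -= lr * grad_b
    return w, b
```

Real training loops add batching, device placement, mixed precision, checkpointing, and logging, but the skeleton stays the same - which is exactly why it's a good candidate for abstraction.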


I always appreciate these projects because I just dive into the code itself and copy out what I need once the wrapper becomes too much of a burden.


That’s totally valid and something we would even encourage! This project is for researchers so if there is a point where the abstraction is no longer useful, by all means configure, or subclass, or copy code.


Of course. And they're not saying they don't have a place.

They're saying: why does it matter if it's 50 vs 60 or even 100? It's a wrapper, which should be fewer lines. That's the whole point - abstracting things further and making assumptions.

Of course you can use them. Of course you can strip them out later and use the underlying code. But the LOC shouldn't be the important part.


Yes you do. Most casuals are downright afraid of code. This messaging is meant to make the project more approachable.

Kind of like how everybody knows the pop-science version of e = mc^2, but most are completely oblivious that it takes a bunch of whiteboards to derive it and to what it all actually means.

Without a pithy formula, there's no way for the actual ideas to spread to the mainstream for you to somehow hear about them.


This reminds me of the advice Stephen Hawking's publisher gave him: that every equation he included in his book, A Brief History of Time, would cut its sales in half. As a result, the only equation that ended up in the book was E=mc^2.



