Adding guardrails to large language models

Animats · on March 16, 2023

Well, sort of. This is mostly for using a large language model to generate JSON or XML or SQL, something for which there's a syntax checker. It guarantees only that the output has the right syntax. If used for censorship, it's just looking for keywords.

codetrotter · on March 16, 2023

Maybe I am a bit too lazy but I don’t understand what “guardrails” add that I wouldn’t get simply from:

1. Denying outputs with blacklisted words or phrases, and

2. Deserialising the JSON with serde_json [0], and denying output that fails to deserialise. If my requirements are very specific I can use strongly typed structs. If my requirements are looser I can use serde_json's Value type, etc.

[0]: https://docs.rs/serde_json/latest/serde_json/
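For concreteness, here is a minimal sketch of those two checks. It uses Python's standard `json` module in place of serde_json so that it runs with no dependencies. The blocklist, the required fields, and the `check_output` name are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical blocklist, purely for illustration.
BLOCKED_PHRASES = ("drop table", "rm -rf")

# Hypothetical "strongly typed" shape: required field name -> expected type.
REQUIRED_FIELDS = {"name": str, "age": int}


def check_output(text):
    """Return (parsed, None) if the text passes, else (None, reason)."""
    # Check 1: deny output containing a blocklisted phrase.
    lowered = text.lower()
    for phrase in BLOCKED_PHRASES:
        if phrase in lowered:
            return None, f"blocked phrase: {phrase!r}"

    # Check 2: deny output that fails to deserialise.
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        return None, f"invalid JSON: {err.msg}"

    # Stricter variant: require specific fields with specific types.
    if not isinstance(data, dict):
        return None, "expected a JSON object"
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            return None, f"missing field: {field}"
        # Exact type match, so that True is not accepted as an int.
        if type(data[field]) is not expected:
            return None, f"wrong type for field: {field}"
    return data, None
```

Dropping the `REQUIRED_FIELDS` loop gives the looser variant, where any well-formed JSON is accepted.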

krandiash · on March 16, 2023

I've been very mildly involved in this project, so I can give my two cents. While it's true that structural / type checks are not difficult to implement, there's no real need for a back-and-forth when you do static checks -- you either fail out, or run rules to fix.

There's something a bit different that we (should) expect with LLMs (and FMs more generally) since they are fundamentally interactive, so you can actually get them to correct things in interesting ways. Passing the outputs of static checkers back to the models is one nice trick. I (and some friends) have been exploring some stuff with using models in the loop for evaluation (more research side), and I think guardrails is directionally exciting in bringing that kind of vision into more production type settings. There's also just the crud of dealing with LLM code...
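A rough sketch of the "pass the checker's output back to the model" trick, under stated assumptions: `model` is a hypothetical callable that takes a prompt string and returns a completion string, the wording of the retry prompt is invented, and none of this is the Guardrails project's actual API. The static check here is just JSON parsing.

```python
import json


def generate_with_reask(model, prompt, max_attempts=3):
    """Call `model` until its output parses as JSON, feeding errors back.

    `model` is a hypothetical callable: prompt string in, completion out.
    """
    current_prompt = prompt
    for _ in range(max_attempts):
        output = model(current_prompt)
        try:
            return json.loads(output)
        except json.JSONDecodeError as err:
            # Hand the checker's complaint back to the model and ask again.
            current_prompt = (
                f"{prompt}\n\n"
                f"Your previous answer was:\n{output}\n\n"
                f"It failed to parse as JSON: {err.msg} "
                f"(line {err.lineno}, column {err.colno}). "
                "Reply with corrected JSON only."
            )
    raise ValueError(f"no valid JSON after {max_attempts} attempts")


def fake_model(prompt):
    """Stand-in for a real LLM: answers correctly only after feedback."""
    if "failed to parse" in prompt:
        return '{"city": "Paris"}'
    return '{"city": "Paris"'  # missing closing brace


result = generate_with_reask(fake_model, "Return a JSON object with a city.")
```

With a purely static check you would stop at the first failure. Here the failure message becomes part of the next prompt, which is the back-and-forth described above.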

codetrotter · on March 16, 2023

Thank you. That makes a lot of sense! Good idea :)

asimpletune · on March 16, 2023

Wouldn’t the right way to create AI guardrails be to have an antagonistic AI act as a moderator? Like, you have one model trained to be as accurate as possible in fulfilling the prompt, and then another AI trained on how human moderators apply the terms of another, “moderation” prompt. Then you have the two fight on a large training set, and when you’re done you have generated a moderated AI.

jszymborski · on March 16, 2023

"LLM moderation" is something that sounds downright dystopic.

Speaking to your comment practically, I feel like it would probably be possible to prompt an LLM to successfully "express X concept that breaks ToS" in such a way that moderation doesn't flag it. It may take clever prompt engineering but that's what these jailbreaks are.

ape4 · on March 16, 2023

To be effective, the moderator AI would need to be as smart as (or smarter than) the source AI. Think of all the ways we have already seen people get around restrictions. Giving instructions for murder isn't allowed, but people said they were writing a novel and wanted to have a murder in it, and asked how it could be done. A smart moderator would see what the user is trying to do and stop it.

la64710 · on March 16, 2023

It will be fun to watch them argue.

can16358p · on March 16, 2023

Something like a GAN?

It will probably end up as something similar, though.

thr717272 · on March 16, 2023

What does FM mean in this context? I already see it mentioned in one top level comment and in the first thread but I don't see a definition here or on the project page.

Edit: after adding "Large Language Model" to my query, it seems I found the explanation: FM stands for "foundation model".

https://kagi.com/search?q=llm+large+language+model+fm&r=no&s...

doppenhe · on March 16, 2023

This is great, thanks for sharing. A key component in evolving FM-based applications is making them feel as deterministic as possible rather than probabilistic. A framework like this would help generate trust in the outputs of these FMs. Exciting.

eurasiantiger · on March 16, 2023

This is gonna be great for generating domain-specific mock data.

anileated · on March 16, 2023

You can bet people will ask LLMs to generate the rail files ^_^