> Data extraction tasks are amongst the easiest to evaluate because there’s a known “right” answer.
Wrong. There can be a lot of subjectivity, and pretending that some golden answer exists does more harm than good and narrows the scope of what you can build.
My other main problem with data extraction tasks, and why I'm not satisfied with any of the existing eval tools, is that the schemas I write can change drastically as my understanding of the problem increases. And nothing really seems to handle that well; I mostly just resort to reading diffs of what happens when I change something and reading the input/output data very closely. Marimo is fantastic for anything visual like this btw.
Also there is a difference between: the problem in reality → the business model → your db/application schema → the schema you send to the LLM. And to actually improve your schema/prompt you have to be mindful of the entire problem stack and how you might separate things that are handled through post processing rather than by the LLM directly.
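To make that stack concrete, here's a minimal sketch (all field names made up) of what I mean by keeping the LLM-facing schema loose and closing the gap to the DB schema in post-processing instead of in the prompt:

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class LLMExtraction:
    """Schema sent to the LLM: loose, forgiving types models fill reliably."""
    company_name: str
    deal_size: str  # e.g. "$1.2M", "about 3 million USD"

@dataclass
class DBRecord:
    """Schema your application actually stores: strict, normalized types."""
    company_name: str
    deal_size_usd: Optional[int]

def post_process(raw: LLMExtraction) -> DBRecord:
    """Normalize the LLM's loose string into a strict DB value."""
    m = re.search(r"([\d.]+)\s*(m|million)?", raw.deal_size.lower().replace("$", ""))
    amount = None
    if m:
        value = float(m.group(1))
        amount = int(value * 1_000_000) if m.group(2) else int(value)
    return DBRecord(company_name=raw.company_name.strip(), deal_size_usd=amount)
```

The point is that when your understanding of "deal size" changes, you edit `post_process` and `DBRecord` without touching the prompt or invalidating old LLM outputs.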
> Abstract model calls. Make swapping GPT-4 for Claude a one-line change.
And in practice random limitations like structured output API schema limits between providers can make this non-trivial. God I hate the Gemini API.
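For example, any "one-line swap" abstraction I've written ends up growing per-provider schema pruning like this (the keyword sets below are placeholders, not the real restrictions — those live in each provider's docs and change over time):

```python
# Hypothetical per-provider sets of JSON Schema keywords that a provider's
# structured-output endpoint rejects. Real lists are in each provider's docs.
UNSUPPORTED = {
    "provider_a": {"additionalProperties", "$defs", "allOf"},
    "provider_b": {"patternProperties"},
}

def prune_schema(schema, provider):
    """Recursively drop JSON Schema keywords a given provider's API refuses."""
    banned = UNSUPPORTED[provider]
    if isinstance(schema, dict):
        return {k: prune_schema(v, provider) for k, v in schema.items()
                if k not in banned}
    if isinstance(schema, list):
        return [prune_schema(v, provider) for v in schema]
    return schema
```

So the model name is a one-line change, but the schema you can actually send is not.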
This is very true! I could have been more careful/precise in how I worded this. I was really trying to just get across that it's in a sense easier than some tasks that can be much more open-ended.
I'll think about how to word this better, thanks for the feedback!
This is extremely true. In fact, from what we see, many/most of the problems to be solved with LLMs do not have ground-truth values; even hand-labeled data tends to be mostly subjective.
I think they're just saying that data extraction tasks are easy to evaluate because for a given input text/file you can specify the exact structured output you expect from it.
I got claude to reverse engineer the extension and compare it to changedetection, and here's what it came up with. Apologies for the clanker slop, but I think it's in poor taste not to attribute the open-source tool the service is built on (one that's also funded by their SaaS plan).
---
Summary: What Is Objectively Provable
- The extension stores its config under the key changedetection_config
- 16 API endpoints in the extension are 1:1 matches with changedetection.io's documented API
- 16 data model field names are exact matches with changedetection.io's Watch model (including obscure ones like time_between_check_use_default, history_n, notification_muted, fetch_backend)
- The authentication mechanism (x-api-key header) is identical
- The default port (5000) matches changedetection.io's default
- Custom endpoints (/auth/, /feature-flags, /email/, /generate_key, /pregate) do NOT exist in changedetection.io — these are proprietary additions
- The watch limit error format is completely different from changedetection.io's, adding billing-specific fields (current_plan, upgrade_required)
- The extension ships with error tracking that sends telemetry (including user emails on login) to the developer's GlitchTip server at 100% sample rate
The extension is provably a client for a modified/extended changedetection.io backend. The open question is only the degree of modification - whether it's a fork, a proxy wrapper, or a plugin system. But the underlying engine is unambiguously changedetection.io.
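For reference, the request shape described above takes a few lines to reproduce; the `/api/v1/watch` path and `x-api-key` header are from changedetection.io's documented API, while the key and host here are placeholders:

```python
import urllib.request

def build_list_watches_request(api_key: str, base: str = "http://localhost:5000"):
    """Build (but don't send) a request to changedetection.io's watch-list
    endpoint, authenticated the same way the extension does: an x-api-key
    header against port 5000."""
    return urllib.request.Request(
        f"{base}/api/v1/watch",
        headers={"x-api-key": api_key},
        method="GET",
    )

req = build_list_watches_request("your-api-key-here")
```

If the hosted backend answers requests shaped like this, that alone is strong evidence for the fork claim.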
Fair point, and I should have been upfront about this earlier. The backend is a fork of changedetection.io. I've built on top of it — added the browser extension workflow, element picker, billing, auth, notifications, and other things — but the core detection engine comes from their project. That should have been clearly attributed from the start, and I'll add it to the docs and about page.
changedetection.io is a genuinely great project. What I'm trying to build on top of it is the browser-first UX layer and hosted product (plus an AI-focused approach) that makes it easier for non-technical users to get value from it without self-hosting.
Apologies but I will use this thread as an opportunity to report CC VSCode extension bugs because I don't think there's an official channel that actually gets read by humans.
> yeah they're shipping too fast and everything is buggy as shit
- fork conversation button doesn't even work anymore in vscode extension
- sometimes when I reconnect to my remote SSH in VSCode, previously loaded chats become inaccessible. The chats are still there in the .jsonl files but for some reason the CC extension becomes incapable of reading them.
-- this issue happens so frequently that I ended up making a skill to allow CC to dig up info from the bugged sessions
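For anyone hitting the same thing, here's a rough sketch of the recovery idea: pull messages straight out of a session `.jsonl` file when the extension refuses to load it. The `"message"`/`"role"`/`"content"` field names are guesses from inspecting my own files — check yours first:

```python
import json
from pathlib import Path

def dump_session(path):
    """Best-effort dump of role/content pairs from a session .jsonl file,
    skipping blank, truncated, or non-message lines."""
    lines = []
    for raw in Path(path).read_text().splitlines():
        if not raw.strip():
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip partially written lines
        if not isinstance(entry, dict):
            continue
        msg = entry.get("message", entry)  # field layout is an assumption
        role, content = msg.get("role"), msg.get("content")
        if role and content:
            text = content if isinstance(content, str) else json.dumps(content)
            lines.append(f"{role}: {text}")
    return "\n".join(lines)
```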
Are there good open models out there that beat gemini 2.5 flash on price? I often run data extraction queries ("here is this article, tell me xyz") with structured output (pydantic) and wasn't aware of any feasible (= supports pydantic) cheap enough soln :/
> every single product/feature I've used other than the Claude Code CLI has been terrible
yeah they're shipping too fast and everything is buggy as shit
- fork conversation button doesn't even work anymore in vscode extension
- sometimes when I reconnect to my remote SSH in VSCode, previously loaded chats become inaccessible. The chats are still there in the .jsonl files but for some reason the CC extension becomes incapable of reading them.
Batshit situation, respectable position from Dario throughout.
But there's some irony in this happening to Anthropic after all the constant hawkish fearmongering about the evil Chinese (and the anti-open-source-AI sentiment too).
Horrific comparison point. LLM inference is way more expensive locally for single users than running batch inference at scale in a datacenter on actual GPUs/TPUs.