Yeah, it still feels like an unsolved problem. I've also seen many teams spend hours on human review in eval pipelines (and that cost adds up with each new model release).
I’m building EventSentinel.ai, a predictive AI platform that monitors hardware and network infrastructure to detect early signals of failures and connectivity issues before they cause downtime.
I’m looking for a few early-stage design partners (SRE / DevOps / IT / Network teams) who:
Manage on‑prem or hybrid infrastructure with critical uptime requirements
Are currently using tools like Datadog, PRTG, Zabbix, or similar, but still deal with “surprise” incidents
Are open to trying an MVP and giving candid feedback in short sessions
What you’d get:
- Early access to our predictive failure and anomaly detection features
- Direct influence on the roadmap based on your needs
- Free usage during the MVP phase (and preferential terms later)
If this sounds relevant, comment “interested” and I’ll follow up with details, or email me at gabriele@eventsentinel.ai
In some cases I've seen teams rely on a mix of automated metrics and human review, especially for production systems where reliability matters a lot.
But evaluation pipelines for AI still seem much less standardized than traditional software monitoring.