
pplonski86 · 2026-04-14T12:35:37 1776170137

We built a benchmark to evaluate LLMs on real data analysis workflows. Instead of single prompts, each task is a sequence of prompts (steps), similar to how a human data analyst works in practice. Each run is saved as a full Python notebook, including prompts, code, and outputs. We evaluated runs on task completion, code correctness, output quality, reasoning, and reliability. Each workflow is executed multiple times and scored automatically.

Modern LLMs perform very well on individual steps. The benchmark currently includes 23 workflows from different data analysis tasks (EDA, ML, NLP, statistics ...). Across the 23 workflows, the top three models were gpt-oss:120b at 9.87/10, gpt-5.4 at 9.65/10, and glm-5.1 at 9.48/10, which is very high in my opinion. The results show that modern LLMs also perform very well on complete data analysis workflows. All feedback is welcome! I uploaded all notebooks for each model: https://github.com/pplonski/ai-for-data-analysis
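
For readers wondering how repeated runs could roll up into a single number per model, here is a minimal sketch. It assumes each run receives one 0-10 score and that every workflow is weighted equally; the actual benchmark may aggregate differently. The function name, workflow names, and scores are made up for illustration and are not results from the repository.

```python
from statistics import mean

def model_score(runs_by_workflow):
    """Average the run scores inside each workflow, then average across workflows.

    runs_by_workflow maps a workflow name to the list of 0-10 scores
    from its repeated runs (hypothetical structure, not the repo's format).
    """
    per_workflow = {name: mean(scores) for name, scores in runs_by_workflow.items()}
    overall = mean(list(per_workflow.values()))
    return per_workflow, overall

if __name__ == "__main__":
    # Made-up scores for two hypothetical workflows, two runs each.
    example = {"eda": [10, 9], "ml": [8, 8]}
    per_workflow, overall = model_score(example)
    print(per_workflow, overall)
```

Averaging within a workflow first keeps a workflow with many runs from dominating the final score.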

pplonski86 · 2026-03-31T14:02:07 1774965727

I thought it was an open source project on GitHub? https://github.com/anthropics/claude-code no?

athorax · 2026-03-31T14:12:52 1774966372

Did you even look in that repo?

pplonski86 · 2026-03-31T08:40:56 1774946456

Can someone explain ELI5 how it works? And how many data points can it read?

pplonski86 · 2026-02-03T12:27:29 1770121649

Do we need rockets to put satellites into space? Can't it be done with balloons? https://www.youtube.com/watch?v=NFieAD5Gpms

MPSimmons · 2026-02-03T12:36:44 1770122204

Balloons work by displacing the atmosphere (mostly nitrogen with some oxygen) with something lighter (helium or hydrogen). This causes buoyancy, and makes the balloon rise.

This only works so long as the atmosphere being displaced weighs more than the balloon plus the payload. As soon as the air gets thin enough that the weight of the balloon+payload is equal to the weight of the air that would fill the volume of the balloon, then it stops rising. (Or, more likely the balloon rips open because it expanded farther than it could stretch).

Usually, this is really high in the atmosphere, but it's definitely not space.
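
The ceiling described above can be estimated with a rough sketch. It assumes a simple isothermal atmosphere where air density falls off exponentially with a scale height of about 8.5 km, and a balloon with a fixed maximum volume; the balloon size and mass in the example are made up for illustration.

```python
import math

RHO_SEA_LEVEL = 1.225   # kg/m^3, standard sea-level air density
SCALE_HEIGHT = 8500.0   # m, rough scale height of an isothermal atmosphere (assumption)

def air_density(altitude_m):
    """Air density in the simplified exponential atmosphere."""
    return RHO_SEA_LEVEL * math.exp(-altitude_m / SCALE_HEIGHT)

def float_altitude(max_volume_m3, total_mass_kg):
    """Altitude where the displaced air weighs the same as the whole balloon.

    total_mass_kg is envelope + lifting gas + payload.
    Returns 0.0 if the balloon is too heavy to leave the ground.
    """
    ratio = RHO_SEA_LEVEL * max_volume_m3 / total_mass_kg
    if ratio <= 1:
        return 0.0
    return SCALE_HEIGHT * math.log(ratio)

if __name__ == "__main__":
    # Hypothetical balloon: 1000 m^3 at full stretch, 10 kg in total.
    h = float_altitude(1000.0, 10.0)
    print(f"float altitude: {h / 1000:.1f} km")
```

With these made-up numbers the balloon levels off at roughly 41 km, well below the 100 km line commonly used as the edge of space. Because the altitude grows only with the logarithm of volume over mass, making the balloon ten times bigger adds only about 20 km in this model.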

This is all ignoring that orbit requires going sideways really, really fast (so fast, actually, that it requires falling, but going sideways so fast that the earth curves away and you miss).

gilbetron · 2026-02-03T13:21:11 1770124871

"Space" aka Orbit, is done not by going high, but by going fast.
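
To put a number on "going fast", here is a small sketch of the circular orbit speed formula v = sqrt(GM / r). The constants are standard values for Earth; the two altitudes are just examples chosen for comparison.

```python
import math

MU_EARTH = 3.986004418e14  # m^3/s^2, Earth's gravitational parameter (G * M)
R_EARTH = 6.371e6          # m, mean Earth radius

def circular_orbit_speed(altitude_m):
    """Speed needed to stay in a circular orbit at the given altitude."""
    return math.sqrt(MU_EARTH / (R_EARTH + altitude_m))

if __name__ == "__main__":
    for altitude_km in (40, 400):
        v = circular_orbit_speed(altitude_km * 1000.0)
        print(f"{altitude_km} km: {v / 1000:.2f} km/s")
```

Both altitudes need a sideways speed of well over 7 km/s. Height barely changes the requirement, and a balloon supplies essentially none of that speed.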

pplonski86 · 2026-02-03T12:20:09 1770121209

It is not that easy to build such an app from scratch ... it all requires a lot of work, even with AI help. I think the most important thing is to provide an easy-to-use UI first. If speed or missing features become blockers for further innovation, then maybe a native app will be created at some point.

pplonski86 · 2026-01-28T21:10:33 1769634633

I have dual boot on a decent laptop. When it is doing nothing on Windows, the fan is always on. Is it computing something? On Linux it is just silent.

pplonski86 · 2026-01-27T07:28:09 1769498889

What if AI starts to have a sense of craft? We are just missing the verify and critique models that will tell other models what looks good.

pplonski86 · 2026-01-27T07:26:07 1769498767

Thank you for sharing. Is there a new container for each code run, or does it stay the same for the whole conversation?

aryehof · 2026-01-27T07:57:56 1769500676

It’s maintained for the conversation. You can ask it for details like this.

pplonski86 · 2026-01-27T07:22:59 1769498579

There are so many models. Is there any website with a list of all of them and a comparison of performance on different tasks?

Reubend · 2026-01-27T07:31:11 1769499071

The post actually has great benchmark tables inside of it. They might be outdated in a few months, but for now, it gives you a great summary. Seems like Gemini wins on image and video perf, Claude is the best at coding, ChatGPT is the best for general knowledge.

But ultimately, you need to try them yourself on the tasks you care about and just see. My personal experience is that right now, Gemini Pro performs the best at everything I throw at it. I think it's superior to Claude and all of the OSS models by a small margin, even for things like coding.

Imustaskforhelp · 2026-01-27T07:54:04 1769500444

I like Gemini Pro's UI over Claude so much, but honestly I might start using Kimi K2.5 if it's open source & just +/- Gemini Pro/ChatGPT/Claude, because at that point I feel like the differences in results are negligible and we are getting SOTA open source models again.

wobfan · 2026-01-27T10:03:40 1769508220

> honestly I might start using Kimi K2.5 if its open source & just +/- Gemini Pro/Chatgpt/Claude because at that point I feel like the results are negligible and we are getting SOTA open source models again.

Me too!

> I like Gemini Pro's UI over Claude so much

This I don't understand. I mean, I don't see a lot of difference in both UIs. Quite the opposite, apart from some animations, round corners and color gradings, they seem to look very alike, no?

Imustaskforhelp · 2026-01-27T11:29:49 1769513389

Y'know I ended up buying Kimi's moderato plan which is 19$ but they had this unique idea where you can talk to a bot and they could reduce the price

I made it reduce the price of first month to 1.49$ (It could go to 0.99$ and my frugal mind wanted it haha but I just couldn't have it do that lol)

Anyways, afterwards, for privacy purposes (and I am a minor, so I don't have a card), I ended up going to g2a to get a 10$ Visa gift card and used it. (I had to pay 1$ extra, but sure.)

Installed kimi code on my mac and trying it out. Honestly, I am kind of liking it.

My internal benchmark is creating pomodoro apps in golang web... Gemini 3 pro has nailed it, I just tried the kimi version and it does have some bugs but it feels like it added more features.

Gonna have to try it out for a month.

I mean I just wish it was this cheap for the whole year :< (As I could then move from, say using the completely free models)

Gonna have to try it out more!

coffeeri · 2026-01-27T07:27:05 1769498825

There is https://artificialanalysis.ai

XCSme · 2026-01-27T12:25:26 1769516726

There are many lists, but I find all of them outdated or containing wrong information or missing the actual benchmarks I'm looking for.

I was thinking, that maybe it's better to make my own benchmarks with the questions/things I'm interested in, and whenever a new model comes out run those tests with that model using open-router.
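
A personal benchmark like this can start very small. The sketch below is one possible shape, not a real OpenRouter client: `ask` is a hypothetical placeholder for whatever function sends a prompt to a model and returns its text answer, and the stub model and test cases are made up so the example runs offline.

```python
def run_benchmark(ask, cases):
    """Run each (prompt, check) pair through the model and report the pass rate.

    ask:   callable taking a prompt string and returning the model's answer
           (hypothetical; swap in a real API call for actual use).
    cases: list of (prompt, check) where check(answer) returns True or False.
    """
    results = []
    for prompt, check in cases:
        answer = ask(prompt)
        results.append((prompt, bool(check(answer))))
    passed = sum(ok for _, ok in results)
    return passed / len(results), results

def stub_model(prompt):
    """Stand-in for a real model so the sketch runs without network access."""
    return "4" if "2 + 2" in prompt else "I don't know"

if __name__ == "__main__":
    cases = [
        ("What is 2 + 2? Answer with a number only.", lambda a: a.strip() == "4"),
        ("What is the capital of France?", lambda a: "Paris" in a),
    ]
    score, results = run_benchmark(stub_model, cases)
    print(score, results)
```

Keeping the checks as plain functions makes it easy to rerun the same questions whenever a new model comes out.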

pplonski86 · 2026-01-27T08:14:45 1769501685

Thank you! Exactly what I was looking for

pplonski86 · 2026-01-26T11:19:39 1769426379

Maybe point 9, trust but verify, should be extended to AI coworkers as well. I would love to have tools to verify AI code by quantity.

jrflowers · 2026-01-26T12:03:04 1769428984

Whichever human is in charge of the chat bots is your coworker. The person responsible for the output of the bots is the one you would trust but verify.