adiraja's comments

We're building a developer tool (currently a VSCode / Cursor extension) that helps performance optimization engineers. One of the things we hated when analyzing performance and writing kernels ourselves was that instrumentation such as timing code, TorchRecord, or NVTX markers couldn't be permanent. Either the instrumented code went out of sync and caused merge conflicts with the original branch, or we had to keep remembering to remove it before merging the optimized branch.

When building our dev tool we decided to try out runtime injection of these statements with Python AST rewriting. You highlight a region in VSCode / Cursor and choose the type of AST injection you want; the extension stores that choice in a config, and you can use our SDK to inject the code at runtime without actually editing your source code.
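To make the idea concrete, here is a minimal sketch of runtime injection via AST rewriting using only the standard library. It is not the nCompass SDK's API: `marker`, `WrapRegion`, `inject`, and the `_injected_*` names are hypothetical, and `marker` is a stand-in for an NVTX range or profiler record that only logs enter and exit. The source text is left untouched; only the compiled function is wrapped.

```python
import ast
import contextlib


@contextlib.contextmanager
def marker(name, log):
    # Hypothetical stand-in for an NVTX range / profiler record.
    log.append(f"enter {name}")
    try:
        yield
    finally:
        log.append(f"exit {name}")


class WrapRegion(ast.NodeTransformer):
    """Wrap top-level statements of one function whose line falls in [start, end]."""

    def __init__(self, func_name, start, end, label):
        self.func_name, self.start, self.end, self.label = func_name, start, end, label

    def visit_FunctionDef(self, node):
        if node.name != self.func_name:
            return node
        before = [s for s in node.body if s.lineno < self.start]
        inside = [s for s in node.body if self.start <= s.lineno <= self.end]
        after = [s for s in node.body if s.lineno > self.end]
        if not inside:
            return node
        # Build: with _injected_marker(label, _injected_log): <inside>
        call = ast.Call(
            func=ast.Name(id="_injected_marker", ctx=ast.Load()),
            args=[ast.Constant(value=self.label),
                  ast.Name(id="_injected_log", ctx=ast.Load())],
            keywords=[],
        )
        with_stmt = ast.With(
            items=[ast.withitem(context_expr=call, optional_vars=None)],
            body=inside,
        )
        node.body = before + [with_stmt] + after
        return node


def inject(source, func_name, start, end, label, log):
    """Compile `source` with the region wrapped and return the rewritten function."""
    tree = WrapRegion(func_name, start, end, label).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    namespace = {"_injected_marker": marker, "_injected_log": log}
    exec(compile(tree, "<injected>", "exec"), namespace)
    return namespace[func_name]


SOURCE = """\
def step(x):
    y = x * 2
    z = y + 1
    return z
"""

log = []
step = inject(SOURCE, "step", 2, 3, "hot-region", log)  # wrap lines 2-3 only
result = step(5)
```

After this runs, `result` holds the same value the original function would return, and `log` records one enter and one exit for the wrapped region. A real injector would likely hook imports rather than take a source string, which this sketch does not attempt.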

We don't want to limit this to just TorchRecord and NVTX markers, so if you think you could benefit from this, we'd really appreciate contributions and suggestions for new injectors - or tell us why you think this is a terrible idea :)

SDK - https://github.com/nCompass-tech/ncompass
Extension - https://open-vsx.org/extension/nCompassTech/ncprof-vscode


Thank you! Absolutely, I'll send over a DM and we can take it from there!


Thanks for your comments. Absolutely, as we were mentioning in one of the other threads, we are really keen on building towards having a reproducible dashboard of efficiency and other metrics.

Also, regarding having no rate limits, we agree this is a real challenge, and it's part of why we're interested in building this. The clever GPU utilization tricks are exactly what we're building out, and we're looking forward to seeing what issues we run into at such scale.


It now makes sense that when we tested the domain ncompass.com it took us to a Microsoft home page, which is why we're ncompass.tech :)


That’s hilarious. I bet if you reach out to Microsoft, they will give you that domain. There’s no way they’re using the trademark anymore.


Hey, great that you mentioned this. We actually had BAAI/bge-m3 on our list of models to put up in the near future to see if people had use for it over an API. It's great to hear that this is something you're looking for. If you could let us know if there was a specific model you wanted to run, we can look into getting that put up soon.


ColBERT and ColQwen are underserved and would benefit from a latency-optimized inference service.


Awesome, we really appreciate the suggestions! We'll look into getting these up and running shortly!


We focused mainly on the scheduling side of things. So we essentially prioritize prefills over decodes. In order to do this correctly, we had to monitor KV cache usage and whenever it's close to running out of memory, we schedule more decodes again.

So this means that you end up either having many decodes wait for prefills to complete or you end up scheduling decodes with prefills. Both scenarios result in slower decodes which is why we're seeing an increase in the ITL. This is the main tradeoff we've made.
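A minimal sketch of that policy, under stated assumptions: all names here (`Request`, `schedule_step`, `simulate`, `high_watermark`) are hypothetical, KV cache usage is counted in tokens, and each step is either one prefill or one decode pass over all running requests. It is a toy model of "prefill first until the KV cache is nearly full", not our production scheduler.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    prompt_tokens: int
    decode_tokens: int
    decoded: int = 0


def schedule_step(waiting, running, kv_used, kv_capacity, high_watermark=0.9):
    """Prefer a prefill unless it would push KV usage past the watermark."""
    if waiting and kv_used + waiting[0].prompt_tokens <= high_watermark * kv_capacity:
        return "prefill", waiting.popleft()
    if running:
        return "decode", list(running)
    return "idle", None


def simulate(requests, kv_capacity):
    waiting, running, kv_used, trace = deque(requests), [], 0, []
    while waiting or running:
        kind, work = schedule_step(waiting, running, kv_used, kv_capacity)
        if kind == "prefill":
            kv_used += work.prompt_tokens          # prompt now occupies KV cache
            running.append(work)
            trace.append(("prefill", work.rid))
        elif kind == "decode":
            trace.append(("decode", [r.rid for r in work]))
            for r in work:
                r.decoded += 1
                kv_used += 1                        # one new token per request
                if r.decoded == r.decode_tokens:    # finished: free its KV entries
                    kv_used -= r.prompt_tokens + r.decoded
                    running.remove(r)
        else:
            break  # head-of-line prompt can never fit; stop the toy simulation
    return trace


trace = simulate(
    [Request(0, 40, 2), Request(1, 40, 2), Request(2, 40, 2)],
    kv_capacity=100,
)
```

In this toy run the third request's prefill is held back until the first two finish decoding and release their KV cache, which is the waiting behavior behind the ITL tradeoff described above.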


So, while time to first token is lower, throughput might also be lower in most cases?


Per-user throughput might be lower at the moment, yes. We're working on GPU kernel-level optimizations now to fix that.

But across all users on our system, the throughput is better because doing more prefills or a large number of grouped decodes has better utilization of the GPU.

The idea is that this works for someone who wants to build a product that is consistent across users in terms of initial response but can trade off some E2E latency. It ensures that no one waits a long time before getting the first response.


I don’t really get it. Prefill saturates compute and decode saturates memory bandwidth. Why are you not doing mixed batch?


You're totally right and we are doing a mixed batch. What we changed was the priority of performing prefills over decodes.

When looking at a variety of workloads, we realized that prioritizing finishing a query (prioritizing decodes) led to underutilization of the GPU. We noticed there tended not to be enough concurrently running requests (because prefill wasn't prioritized) to meaningfully utilize the memory bandwidth with the available decodes. This led to a system that was unfortunately neither compute nor memory bound.

By running mixed batches that prioritize prefills we still compute some decode tokens in our spare capacity, but ensure compute is as saturated as possible. This additionally leads to a buildup of decodes, so that when we are primarily computing decode we're pushing our memory bandwidth as much as we can.

Of course there are still plenty of improvements that can be made on this front. Finding a dynamic balance between prefill and decode that pushes both memory bandwidth and compute to their limits is the goal from a scheduling perspective. There are a whole host of factors, such as the model architecture, input-token:output-token ratio, underlying hardware, KV-cache allocation (and many more), that all play into the pressure placed on memory and compute, so there's definitely still exploration to be done!
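As a rough illustration of a prefill-first mixed batch, here is a sketch under assumptions not stated above: each step has a fixed token budget, prompts may be split into chunks across steps, and every decode costs one token. `build_mixed_batch` and `token_budget` are hypothetical names, not part of any real serving framework.

```python
def build_mixed_batch(prefill_queue, decode_ids, token_budget):
    """Fill the step's token budget with prefill work first, then decodes.

    prefill_queue: list of (request_id, remaining_prompt_tokens)
    decode_ids:    list of request ids, each needing one decode token
    Returns (batch, remaining_prefill, deferred_decodes).
    """
    batch, budget, remaining_prefill = [], token_budget, []
    for rid, tokens in prefill_queue:
        take = min(tokens, budget)
        if take > 0:
            batch.append(("prefill", rid, take))
            budget -= take
        if tokens > take:
            remaining_prefill.append((rid, tokens - take))
    # Spare capacity, if any, goes to decodes.
    scheduled = decode_ids[:budget]
    batch.extend(("decode", rid, 1) for rid in scheduled)
    return batch, remaining_prefill, decode_ids[len(scheduled):]


# Step 1: prefills consume the whole budget, so decodes build up.
batch1, left1, deferred1 = build_mixed_batch(
    [("a", 6), ("b", 5)], ["x", "y", "z"], token_budget=8
)
# Step 2: the leftover prefill chunk runs, and decodes use the spare capacity.
batch2, left2, deferred2 = build_mixed_batch(left1, deferred1, token_budget=8)
```

In step 1 the decodes are deferred entirely; in step 2 they ride along with the last prefill chunk. A dynamic policy would adjust the split per step instead of always letting prefill go first.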

