We're building a developer tool (currently a VSCode / Cursor extension) that helps performance optimization engineers. One of the things we hated when analyzing performance and writing kernels ourselves was that instrumentation like timing code, TorchRecord, or NVTX markers couldn't be permanent. Either the instrumented code went out of sync and caused merge conflicts with the original branch, or we had to keep remembering to remove it before merging the optimized branch.
When building our dev tool, we decided to try runtime injection of these statements using Python AST rewriting. You highlight a region in VSCode / Cursor and pick the type of AST injection you want. The extension stores that in a config, and you can use our SDK to inject the code at runtime without actually editing your source code.
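To make the idea concrete, here is a minimal sketch of runtime injection via AST rewriting using only the standard library `ast` module. It is not the SDK's actual API: `marker`, `WrapRegion`, `inject`, and the line-range arguments are hypothetical names for illustration, and `marker` is a stand-in for a real profiling range such as an NVTX marker. It only handles statements at the top level of a function body.

```python
import ast
import contextlib

EVENTS = []  # records region enter/exit so the injection is observable


@contextlib.contextmanager
def marker(name):
    # Hypothetical stand-in for a real profiling marker (e.g. an NVTX range).
    EVENTS.append(("enter", name))
    try:
        yield
    finally:
        EVENTS.append(("exit", name))


class WrapRegion(ast.NodeTransformer):
    """Wrap function-body statements lying within lines [start, end]
    in `with marker(name):`."""

    def __init__(self, start, end, name):
        self.start, self.end, self.name = start, end, name

    def visit_FunctionDef(self, node):
        self.generic_visit(node)  # also handle nested function definitions
        inside = [s for s in node.body
                  if self.start <= s.lineno and s.end_lineno <= self.end]
        if not inside:
            return node
        first = node.body.index(inside[0])
        call = ast.Call(func=ast.Name(id="marker", ctx=ast.Load()),
                        args=[ast.Constant(value=self.name)], keywords=[])
        with_node = ast.With(
            items=[ast.withitem(context_expr=call, optional_vars=None)],
            body=inside,
        )
        # Statements in a body are ordered by line, so the selection is contiguous.
        node.body[first:first + len(inside)] = [with_node]
        return node


def inject(source, start, end, name):
    """Compile `source` with the region wrapped; the source text is untouched."""
    tree = WrapRegion(start, end, name).visit(ast.parse(source))
    ast.fix_missing_locations(tree)  # new nodes need line/column info
    namespace = {"marker": marker}
    exec(compile(tree, "<injected>", "exec"), namespace)
    return namespace


SOURCE = """\
def work(n):
    total = 0
    for i in range(n):
        total += i * i
    return total
"""

ns = inject(SOURCE, start=3, end=4, name="hot_loop")  # wrap only the loop
result = ns["work"](4)
```

After the call, `EVENTS` holds one enter/exit pair for `hot_loop`, while `SOURCE` itself is unchanged. In a real setup the line range and marker type would presumably come from the stored config rather than from arguments.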
We don't want to limit this to just TorchRecord and NVTX markers, so if you think you could benefit from this, we'd really appreciate contributions and suggestions for new injectors - or tell us why you think this is a terrible idea :)
Thanks for your comments. Absolutely, as we were mentioning in one of the other threads, we are really keen on building towards having a reproducible dashboard of efficiency and other metrics.
Also, regarding the lack of rate limits: we agree this is a real challenge, and it's part of why we're interested in building this. The clever GPU utilization tricks are exactly what we're building out, and we're looking forward to seeing what issues we run into at that scale.
Hey, great that you mentioned this. We actually had BAAI/bge-m3 on our list of models to put up in the near future to see if people had use for it over an API. It's great to hear that this is something you're looking for. If you could let us know if there was a specific model you wanted to run, we can look into getting that put up soon.
We focused mainly on the scheduling side of things. So we essentially prioritize prefills over decodes. In order to do this correctly, we had to monitor KV cache usage and whenever it's close to running out of memory, we schedule more decodes again.
So this means that you end up either having many decodes wait for prefills to complete or scheduling decodes together with prefills. Both scenarios result in slower decodes, which is why we're seeing an increase in the ITL. This is the main tradeoff we've made.
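A rough sketch of that policy, under simplifying assumptions that are ours rather than a description of the production scheduler: each step does either one prefill or one decode pass (no mixed batches), KV cache usage is counted in tokens, and `high_watermark` is a hypothetical threshold for "close to running out of memory".

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    prompt_tokens: int
    decode_tokens_left: int
    generated: int = 0

    def kv_tokens(self):
        # KV cache entries held by this request (prompt plus generated tokens).
        return self.prompt_tokens + self.generated


class PrefillFirstScheduler:
    def __init__(self, kv_capacity_tokens, high_watermark=0.9):
        # Assumed threshold: stop admitting prefills past this many cached tokens.
        self.kv_limit = int(kv_capacity_tokens * high_watermark)
        self.kv_used = 0
        self.waiting = deque()  # requests not yet prefilled
        self.running = []       # prefilled requests still decoding

    def add(self, request):
        self.waiting.append(request)

    def step(self):
        # Prefer a prefill whenever the KV cache has room for the next prompt.
        if self.waiting:
            head = self.waiting[0]
            if self.kv_used + head.prompt_tokens <= self.kv_limit:
                self.waiting.popleft()
                self.kv_used += head.prompt_tokens
                self.running.append(head)
                return ("prefill", [head.rid])
        # Otherwise fall back to decoding, which eventually frees cache space.
        # (A prompt larger than kv_limit would idle forever in this sketch.)
        if not self.running:
            return ("idle", [])
        batch = [r.rid for r in self.running]
        for r in self.running:
            r.generated += 1
            r.decode_tokens_left -= 1
            self.kv_used += 1
        for r in [r for r in self.running if r.decode_tokens_left == 0]:
            self.kv_used -= r.kv_tokens()  # release the finished request's cache
            self.running.remove(r)
        return ("decode", batch)


sched = PrefillFirstScheduler(kv_capacity_tokens=100)
for req in (Request("A", 50, 2), Request("B", 30, 1), Request("C", 40, 1)):
    sched.add(req)
trace = [sched.step() for _ in range(7)]
```

In this toy run, A and B are prefilled back to back, C's prompt would push the cache past the watermark, so decodes run until enough space is freed, and only then is C prefilled. A and B wait on each other's prefill before their first decode step, which is the ITL cost described above.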
Per-user throughput might be lower at the moment, yes. We're working on GPU kernel-level optimizations now to fix that.
But across all users on our system, throughput is better, because running more prefills or a large group of batched decodes utilizes the GPU better.
The idea is that this works for someone who wants to build a product that is consistent across users in terms of initial response but can trade off some E2E latency. It ensures that no one waits a long time before getting the first response.
You're totally right and we are doing a mixed batch. What we changed was the priority of performing prefills over decodes.
When looking at a variety of workloads, we realized that prioritizing finishing a query (prioritizing decodes) led to underutilization of the GPU. There tended not to be enough concurrently running requests (because prefill wasn't prioritized) to meaningfully utilize the memory bandwidth with the available decodes. This led to a system that was unfortunately neither compute bound nor memory bound.
By running mixed batches that prioritize prefills we still compute some decode tokens in our spare capacity, but ensure compute is as saturated as possible. This additionally leads to a buildup of decodes, so that when we are primarily computing decode we're pushing our memory bandwidth as much as we can.
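As an illustration of "prefill first, decodes in spare capacity", here is a small sketch of building one mixed batch against a per-step token budget. The function name, the budget, and the idea of splitting a long prompt across steps (chunked prefill) are assumptions made for this example, not details of our scheduler.

```python
def build_mixed_batch(prefill_queue, decode_queue, token_budget):
    """Fill one step's token budget with prefill work first, then decodes.

    prefill_queue: list of (request_id, prompt_tokens_left)
    decode_queue:  list of request ids, each needing one token this step
    Returns (prefill_chunks, decode_ids).
    """
    budget = token_budget
    prefill_chunks = []
    for rid, tokens_left in prefill_queue:
        if budget == 0:
            break
        chunk = min(tokens_left, budget)  # assumed: long prompts are chunked
        prefill_chunks.append((rid, chunk))
        budget -= chunk
    # Whatever budget remains goes to decodes, one token per request.
    decode_ids = decode_queue[:budget]
    return prefill_chunks, decode_ids


# Prefill-heavy step: the prompts consume the whole budget, decodes wait.
heavy = build_mixed_batch([("a", 300), ("b", 500)], ["x", "y", "z"], 512)

# Nearly full step: two decode tokens fit in the spare capacity.
spare = build_mixed_batch([("a", 510)], ["x", "y", "z"], 512)

# No pending prefills: the step is decode-only.
decode_only = build_mixed_batch([], ["x", "y", "z"], 512)
```

The decodes skipped in prefill-heavy steps accumulate, which is the buildup that makes the later decode-dominated steps large enough to stress memory bandwidth.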
Of course, there are still plenty of improvements to be made on this front. From a scheduling perspective, the goal is to find a dynamic balance between prefill and decode that pushes both memory bandwidth and compute to their limits. A whole host of factors, such as the model architecture, input-token:output-token ratio, underlying hardware, and KV-cache allocation (and many more), play into the pressure placed on memory and compute, so there's definitely still exploration to be done!
SDK - https://github.com/nCompass-tech/ncompass
Extension - https://open-vsx.org/extension/nCompassTech/ncprof-vscode