
Can you point to the autobatching packages? What strategy are they taking? Will they recognize opportunities to combine compatible operations within a given function body into a batch? Does one need a cost model for merging and splitting data into batches?

Also, what does an approach like bucketing even look like under the approach Julia is taking? The idea there, of course, is to allow some 'slop': combine many similar examples whose tensor sizes differ by small amounts, and carefully define all your primitive operations so that they can ignore the padding used to fit similar tensors into a uniform shape. Doing this requires awareness of the tensor sizes all the way back to how you sample the training data, so I don't see how compiler magic can match the performance you get from bucketing.
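To make the padding-with-slop idea concrete, here is a minimal NumPy sketch of length-bucketing (the function and parameter names are my own, not from any package mentioned above). Sequences are sorted by length so each batch holds similarly-sized examples, then padded up to a bucket boundary; the mask lets downstream operations ignore the padding:

```python
import numpy as np

def bucket_batches(sequences, bucket_width=8, batch_size=4):
    """Group variable-length sequences into batches of similar length,
    padding each batch's members up to a shared bucket length.
    `bucket_width` is the 'slop': lengths within one batch may differ
    by up to roughly this amount, so padding waste stays small."""
    # Sort by length so neighboring sequences have similar sizes.
    ordered = sorted(sequences, key=len)
    batches = []
    for i in range(0, len(ordered), batch_size):
        group = ordered[i:i + batch_size]
        # Round the batch's max length up to a multiple of bucket_width,
        # so only a handful of distinct shapes (and compiled kernels) exist.
        max_len = -(-max(len(s) for s in group) // bucket_width) * bucket_width
        padded = np.zeros((len(group), max_len), dtype=np.float32)
        mask = np.zeros((len(group), max_len), dtype=bool)
        for row, seq in enumerate(group):
            padded[row, :len(seq)] = seq
            mask[row, :len(seq)] = True
        batches.append((padded, mask))
    return batches
```

Note that the batch composition is decided at data-sampling time, before any model code runs — which is exactly why it is hard to see a compiler recovering this on its own.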

Of course, bucketing becomes more complex for things like trees, graphs, and other higher-level objects. And bucketing can, in theory, introduce bias into your gradients, if there is any correlation between an example's gradient and its tensor shape.



From the blog post:

"Automatic Batching

To get the most from these accelerators – which can have significant overheads per kernel launch, but scale very well over input size – it is common to batch programs, applying the forwards and backwards passes to multiple training examples at once. In simple cases, such as with convolutional nets, it's simple to handle this by concatenating, say, 10 images along an extra batch dimension. But this task becomes much harder when dealing with variably-structured inputs, such as trees or graphs.

Most researchers address this by taking on the significant burden of batching code by hand. Different solutions have been proposed for different frameworks (DyNet, TensorFlow Fold), which heuristically try to batch some high-level operations together when possible, but these typically either have their own usability issues or do not achieve the performance of hand-written code.

We suggest that this problem is identical to that of Single Program Multiple Data (SPMD) programming, which has been well-studied by the language and compiler community for decades, and becomes visible in more recent approaches to batching like matchbox. Indeed, it is very similar to the model of parallelism used by GPUs internally, and has been implemented as a compiler transform for the SIMD units of CPUs. Taking inspiration from this work, we are implementing the same transform in Julia to provide SPMD programming both for scalar SIMD units and for model-level batching. This allows us to reach the ideal of writing simple code that operates on individual samples, while still getting the best performance on modern hardware."
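The SPMD transform the post describes — write code for one sample, compile it to run over a batch — can be illustrated by hand in NumPy (this is my own sketch of the idea, not the Julia implementation; JAX exposes an automatic version of this as `vmap`):

```python
import numpy as np

# Per-example code: a tiny forward pass written for a single sample.
def forward_single(w, x):      # x has shape (3,)
    return np.tanh(w @ x)      # result has shape (4,)

# What the SPMD transform effectively produces: the same computation
# rewritten over a leading batch axis, so the per-sample matrix-vector
# products become one batched matrix multiply.
def forward_batched(w, xs):    # xs has shape (batch, 3)
    return np.tanh(xs @ w.T)   # result has shape (batch, 4)

w = np.ones((4, 3))
xs = np.random.rand(10, 3)

# The transformed code agrees with looping over samples one at a time.
assert np.allclose(forward_batched(w, xs),
                   np.stack([forward_single(w, x) for x in xs]))
```

The promise is that researchers write only `forward_single`-style code, and the compiler derives the batched version — for fixed-shape inputs this is mechanical, and the open question raised above is whether it can match hand-tuned bucketing for variably-shaped ones.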



