I've found myself adopting this philosophy for a specific use case: monitoring ML training jobs. It's pretty common to see people output training metrics (loss, validation accuracy, etc.) every N batches, iterations, or epochs. That makes sense for a lot of reasons, and it's simple to do. But when you're exploring models with wildly varying inference latencies, or using different hardware, varying batch sizes, or a differently sized dataset, the same N may end up reporting too infrequently to give you an idea of what's happening, or too frequently and just spamming the output.
Checkpointing the model every N iterations/epochs/batches has a similar problem: you may end up saving too few checkpoints and risk losing work, or waste a lot of time and space on too many.
So I've often found myself implementing monitoring and checkpointing callbacks based on wall-clock time instead, e.g., reporting every half hour and checkpointing every two hours.
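A minimal sketch of what I mean, in plain Python (the `TimedCallback` class and the `log_metrics`/`save_model`/`train_step` names are hypothetical placeholders, not from any particular framework): the callback fires based on elapsed wall-clock time rather than step count, so the same code behaves sensibly regardless of batch size or hardware.

```python
import time


class TimedCallback:
    """Calls `fn` once at least `interval_seconds` have elapsed since the
    last call, no matter how many training steps happened in between."""

    def __init__(self, interval_seconds, fn):
        self.interval_seconds = interval_seconds
        self.fn = fn
        self.last_fired = time.monotonic()

    def maybe_fire(self, *args, **kwargs):
        now = time.monotonic()
        if now - self.last_fired >= self.interval_seconds:
            self.fn(*args, **kwargs)
            self.last_fired = now


# Hypothetical usage inside a training loop:
# report_cb = TimedCallback(30 * 60, log_metrics)         # report every half hour
# checkpoint_cb = TimedCallback(2 * 60 * 60, save_model)  # checkpoint every two hours
#
# for step, batch in enumerate(loader):
#     loss = train_step(batch)
#     report_cb.maybe_fire(step, loss)
#     checkpoint_cb.maybe_fire(step)
```

The check still runs every step, which is cheap; only the expensive work (logging, saving) is throttled by time.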