I enjoyed our metrics systems at Amazon’s and wrote one with a similar API at Okta and should really look at writing another one to open source.
One of the huge missing things in metrics systems, imho, is keeping granular metrics in the context of a business operation and then using late aggregation for trends. Last I looked nearly every metrics systems either logged individual events and and required processing for any rollup or aggregated too early and you couldn’t determine the effect on any individual operation/request. There’s a happy medium where you can get per-request counts, stats, and timing and still roll those up at the host/data center/region/granularity to get higher level trends.
Most metrics APIs are incompatible with this idea, however.
You're likely talking about "wide events" Which is essentially as many dimensions as possible attached to an event.
I believe Meta was the one to develop an internal tool for handling this named Scuba.
One of the huge missing things in metrics systems, imho, is keeping granular metrics in the context of a business operation and then using late aggregation for trends. Last I looked nearly every metrics systems either logged individual events and and required processing for any rollup or aggregated too early and you couldn’t determine the effect on any individual operation/request. There’s a happy medium where you can get per-request counts, stats, and timing and still roll those up at the host/data center/region/granularity to get higher level trends.
Most metrics APIs are incompatible with this idea, however.