The layout is doable with hobby servos, but you'd need to patch in current sensing for that bit of the feedback. It's not terribly difficult conceptually but it's an extra complication that most servo power distribution boards don't give you.
You can also strap a capstan to the servo axle, if that's your thing. I've prototyped that myself in the past. You can go surprisingly far with an FDM printer, an SG90, and some dyneema bowstring. One thing I haven't tried is modding one for continuous rotation to get around the way the capstan drive limits the output angle you can achieve - I was happy reducing from ~180deg to ~45deg for what I was doing - but that's relatively well-trodden ground. Might pull that project out of the storage box it's languishing in at some point.
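For anyone wanting to size something similar, here's a rough sketch of the arithmetic behind that ~180deg to ~45deg reduction. It assumes an ideal capstan drive (no cord slip, no friction, cord thickness ignored), and the 5 mm capstan radius and the torque figure are made-up illustration values, not measurements from my build.

```python
# Ideal capstan drive arithmetic (assumes no slip, no friction,
# and ignores cord thickness). All numeric inputs are hypothetical.

def capstan_reduction(input_range_deg: float, output_range_deg: float) -> float:
    """Reduction ratio needed to map the servo's travel onto the output travel."""
    return input_range_deg / output_range_deg

def output_drum_radius(capstan_radius_mm: float, reduction: float) -> float:
    """The output drum radius scales with the reduction ratio."""
    return capstan_radius_mm * reduction

def ideal_output_torque(servo_torque: float, reduction: float) -> float:
    """Torque is multiplied by the same ratio in the lossless case."""
    return servo_torque * reduction

ratio = capstan_reduction(180.0, 45.0)   # servo travel -> output travel
drum_mm = output_drum_radius(5.0, ratio) # hypothetical 5 mm capstan
torque = ideal_output_torque(1.5, ratio) # hypothetical 1.5 kg*cm servo

print(f"ratio {ratio:.1f}:1, drum radius {drum_mm:.1f} mm, torque {torque:.1f} kg*cm")
```

Real-world torque will come in lower once friction and cord stretch are accounted for, so treat the last number as an upper bound.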
I think you're joking, but to clarify -- not personally yours. A misbehaving worker box, an app server in the staging environment, etc. A resource owned by the organization for which you work, where it would not be appropriate for you to customize it to your own liking.
The blog's title can be misleading here: "we" in this context refers to the Cognition team. I don't work at Cognition, I just thought this was interesting.
> On August 29, a routine load balancing change unintentionally increased the number of short-context requests routed to the 1M context servers. At the worst impacted hour on August 31, 16% of Sonnet 4 requests were affected.
Interesting, this implies that the 1M context servers perform worse on short-context requests.
Perhaps this is due to some KV cache compression, eviction or sparse attention scheme being applied on these 1M context servers?
> All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required. It is also recommended to modify the factor as needed. For example, if the typical context length for your application is 524,288 tokens, it would be better to set factor as 2.0.
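To make the quoted advice concrete, here's a minimal sketch of how the factor falls out of the numbers. The 262,144 native length is an assumption inferred from the quote itself (524,288 / 2.0); check the model card for the real value. The dict keys follow the `rope_scaling` layout commonly seen in Hugging Face style `config.json` files, and the helper names are my own, not part of any library.

```python
# Sketch: pick a static YaRN factor from the typical context length.
# ASSUMPTION: native_len = 262_144, inferred from the quoted example
# (524,288 tokens -> factor 2.0). Helper names are hypothetical.

def yarn_factor(typical_len: int, native_len: int):
    """Return the scaling factor, or None if no scaling is needed."""
    if typical_len <= native_len:
        # Static YaRN applies the same factor to every input, so leave it
        # off when requests already fit in the native window.
        return None
    return typical_len / native_len

def rope_scaling_config(typical_len: int, native_len: int):
    """Build a rope_scaling fragment, or None to omit it entirely."""
    factor = yarn_factor(typical_len, native_len)
    if factor is None:
        return None
    return {
        "rope_type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": native_len,
    }

print(rope_scaling_config(524_288, 262_144))
print(rope_scaling_config(8_000, 262_144))
```

The second call returns `None`, which is the point of the quote: with a static factor, short requests are better served with no `rope_scaling` entry at all.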
The key issue is that their post-mortem never explained what went wrong on two out of three issues.
All I know is that my requests can now travel along three completely different code paths, each on its own stack and tuned differently. Those optimizations can flip overnight, independent of any model-version bump—so whatever worked yesterday may already be broken today.
I really don't get the praise that they are getting for this postmortem. It only made me more annoyed.
They sounded a tinge strange, like they’ve almost crossed the uncanny valley, only to succumb at the final 3% stretch.
I was suspicious, but their ability to understand my complex request and the relatively low latency make an LLM -> TTS or e2e voice model unlikely.
This post finally solved the mystery.