I'm impressed. Not by the first 90%; that's par for the course (and why I'm a big believer in open source). No, what impresses me is that they shared their solution, one which, in their own words, is 'not a proper fix'. The engineer in me would be embarrassed by the fix, but you can't fault that it works. There's a workaround, so from the business perspective any further time spent fixing this could be spent working on new features instead.
Still, I have a bunch of unanswered questions. Why not just upgrade all the hosts to Xen 4.4? Does recompiling the 3.0+ kernel without the bad 'if' in /net/ipv4/esp4.c make the problem go away? Does the problem happen if there's only one VM on a host? Of the seven AES-NI instructions, which one is faulting? How often does it fault? The final question, though, is: what causes it to fault?
Why not upgrade Xen? We run on cloud VMs, so we aren't in control of the version of Xen that's being used.
Patch the kernel? Honestly, we don't have the expertise to take on that level of effort. We're hiring though. ;)