I’m more impressed by Xeon Gold vs EPYC perf. The fact that Power9 is currently slower on workloads that heavily benefit from hand-optimized code specific to Intel CPUs should not really surprise anybody. It’s likely that this code just uses a “portable”, slow implementation on Power9, simply because nobody has had the systems to work with yet. It’s also likely that most software in these tests can’t take advantage of such a huge number of threads.
The hardware used in this benchmark isn't very well chosen. All the systems have a different number of cores (Power 16, AMD 32, Intel 40), and the AMD system has the additional disadvantage over the others of being 1P instead of 2P, which approximately halves the available cooling and correspondingly limits clock speeds. The results aren't very surprising given that.
Making comparisons like this can make sense when one of the competitors doesn't make an equivalent product, but they all do make products with an equivalent number of sockets and cores.
It depends on what you want to compare. If it's performance at a certain price point, you select systems at that point. You can compare on performance per watt or per volume or more abstract comparisons (ideally unbound by price or size) of performance per core, socket or thread at various load factors.
I believe the fair comparison would be performance per dollar - comparing similarly priced components. Same sockets and cores can have wildly different costs.
Comparing like with like can give you a better idea of what's going on even if the prices aren't the same. It's a matter of isolating the architecture from the other factors. If a 40 core 2P Intel system is faster than a 32 core 1P AMD system, you don't know if the difference is architectural or just the extra cores and sockets, so you don't know whether to expect a 48 core 2P AMD system to be faster or slower than the Intel system.
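To make the "isolate the architecture" point concrete, here is a rough sketch of normalizing a benchmark score by core count. The scores are made-up placeholders, not numbers from the article, and the linear extrapolation at the end is exactly the assumption you can't safely make without testing a matching system:

```python
# Hypothetical benchmark scores (higher is better). These are placeholders
# for illustration only and are NOT taken from the article.
systems = {
    "Intel 2P": {"cores": 40, "sockets": 2, "score": 1000.0},
    "AMD 1P": {"cores": 32, "sockets": 1, "score": 720.0},
}


def score_per_core(system):
    """Normalize a raw score by the number of cores in the system."""
    return system["score"] / system["cores"]


def naive_extrapolation(system, target_cores):
    """Assume perfectly linear scaling with core count (an optimistic assumption)."""
    return score_per_core(system) * target_cores


for name, system in systems.items():
    print(f"{name}: {score_per_core(system):.2f} points per core")

# What a 48-core 2P AMD box would score IF scaling were linear.
amd_48_guess = naive_extrapolation(systems["AMD 1P"], 48)
print(f"AMD 2P (48 cores), linear guess: {amd_48_guess:.0f}")
```

With these invented numbers the raw score favors Intel while the linear guess favors a 48-core AMD system, which is why only a like-for-like test settles the question.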
And the tested systems aren't equivalently priced here either.
> the AMD system has the additional disadvantage over the others of being 1P instead of 2P, which approximately halves the available cooling and correspondingly limits clock speeds.
This doesn't make sense to me. I'd assume each system is spec'ed correctly for the TDP of its processors. Each system has its own HSF and case solution.
Multiprocessor systems do perform at less than 2x their single-processor counterparts, but that has nothing to do with thermals in a properly designed system.
If you are memory throughput bound, and both processors are working independently (think embarrassingly parallelizable problems) you'd probably benefit from the total throughput change.
Otherwise it's pretty easy to get worse performance due to NUMA issues.
POWER doesn't have that many more threads. The top-of-the-line is 22-core 88-thread, compared to 32-core 64-thread from AMD and 28-core 56-thread from Intel.
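For what it's worth, the arithmetic on those figures, using only the core and thread counts quoted above (the dictionary keys are just labels I picked):

```python
# Core and hardware-thread counts as quoted in the comment above.
chips = {
    "POWER9": (22, 88),
    "AMD EPYC": (32, 64),
    "Intel Xeon": (28, 56),
}


def threads_per_core(cores, threads):
    """SMT width: hardware threads divided by physical cores."""
    return threads // cores


smt = {name: threads_per_core(c, t) for name, (c, t) in chips.items()}
for name, ways in smt.items():
    print(f"{name}: {ways}-way SMT")

# Total hardware threads of the POWER9 part relative to the EPYC part.
thread_ratio = chips["POWER9"][1] / chips["AMD EPYC"][1]
print(f"POWER9 vs EPYC total threads: {thread_ratio:.3f}x")
```

So the POWER part gets its thread count from 4-way SMT on fewer cores, while the x86 parts use 2-way SMT on more cores.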
I have no idea what having that many threads per core means for performance.
Extra threads mean that there's more likely to be something else available to schedule when one thread's pipelines are stalled due to a fetch from DRAM or MMIO, basically. There are very much diminishing returns beyond two, but for some workloads (big mostly-RAM-resident data sets with poor locality of reference -- consumer databases, say) it's worth it. It's unclear to me that 4-way multithreading is going to help any of the benchmarks in this test.
All diminishing returns really depend on your workload. We normally would say that for caches, with no point having enormous caches on desktop and typical x86 server processors, but IBM's mainframe CPUs have tons of L3 and even more tons of L4 caches, as well as dedicated cores (in the form of secondary processors) for offloading all kinds of tasks from the main CPUs, each with its own cache architecture.
In the specific case of POWER8 and POWER9 the cores are seriously overprovisioned with execution resources and you really need at least 2 threads running in order to make full use of them.
Could be. But the point was more that there's a very thin regime between "waiting for DRAM latency too often" (where more threads can help) and "bound by DRAM bandwidth" (where they won't). The DRAM isn't nearly as parallel as the cores are and saturates really fast.
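A toy model of that "thin regime", in case it helps. Everything here is an assumption for illustration: each thread alternates between a fixed number of compute cycles and a fixed DRAM stall, extra threads fill the stalls perfectly, and DRAM is a hard cap expressed in "fully busy core" equivalents. Real hardware is much messier.

```python
def core_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles a core does useful work with `threads` hardware threads.

    Assumes each thread is busy compute/(compute+stall) of the time and that
    other threads can always fill its stalls, up to 100% utilization.
    """
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, threads * per_thread)


def system_throughput(cores, threads, compute_cycles, stall_cycles, dram_cap):
    """Aggregate throughput in busy-core equivalents, capped by DRAM bandwidth.

    `dram_cap` is a hypothetical limit on how many fully busy cores the
    memory system can keep fed.
    """
    latency_bound = cores * core_utilization(threads, compute_cycles, stall_cycles)
    return min(latency_bound, dram_cap)


# 16 cores, threads that stall 80% of the time, DRAM able to feed 5 busy cores.
results = {
    n: system_throughput(16, n, compute_cycles=20, stall_cycles=80, dram_cap=5.0)
    for n in (1, 2, 4)
}
for n, value in results.items():
    print(f"{n} thread(s) per core: {value:.1f} busy-core equivalents")
```

With these invented parameters, going from one to two threads per core helps, but the bandwidth cap is already hit at two, so four threads per core buys nothing more.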
In that case it was even more serious. Like the PPUs of the Cell processors, each core runs one instruction every other cycle. If you have only one thread, you effectively have half the throughput.
Offhand, I would guess that it's good for I/O-bound tasks, where you can have lots of threads waiting for input that don't need CPU time. A busy database maybe.
The 8-core version only has 8 (each) ALUs, LSUs, and vector units. https://en.wikipedia.org/wiki/POWER9#Core If each core has 4 threads "running" on it, some of them are not going to be executing.
In before questions of "why would I pay so much for such a slow machine" - the target market for Talos is not necessarily those who seek raw performance.
For many of us, having a workstation that is at least in the same rough range of performance while being free and clear of IME and PSP is worth the price premium.
RISC-V is slowly getting there, but for real work, this is a decent option for those of us in the tinfoil-hat crowd.
I concur; while I think it's great to see these POWER systems get some press I think that this article in particular puts forth the wrong set of priorities, addresses speed unfairly in more than one way, and presents unreasonable conclusions in the end. That these qualities are addressed in the article doesn't transform poor choices into reasonable choices, or unfairness into informative review.
The benchmarks seem unfair to me because they compare code that either in code choice (specific instruction sets) or algorithm design could favor the Intel/AMD systems. Larabel addresses this but still publishes these results as if we could use them as a reasonable baseline for future comparisons when the software is optimized for POWER systems.
Also on performance, I imagine most computer users will find that any modern computer will do what they need it to do quickly enough to get real work done. Check out the videos this POWER system team put together for their other systems (systems that aren't radically different from this one) like https://www.youtube.com/watch?v=05NNFJj3Mrw (use youtube-dl or avideo to avoid running YouTube's nonfree software) or at https://www.raptorengineering.com/TALOS/op_ue4_gl.php ; you'll see games run well, multi-monitor setups run well; and users can expect to get plenty of work done. The bottleneck is usually elsewhere: how quickly a server gives the user data, how much space the computer has to store data (either in RAM or permanent storage), and other more complex delays imposed by some other part of a process well beyond one's own computer. Most computer users simply don't do tasks akin to what's being tested here. The most mathematically challenging thing they do is decode video and that's often designed to be faster than encoding that same video. Even video encoding is somewhat overrated as this is typically done far fewer times than decoding that same video. One can't get a computer fast enough to avoid purchasing many computers to do large encoding jobs; it's not a matter of switching to some modern Intel/AMD system.
As you said, the respect for privacy and software freedom is underrated to a fault in this Phoronix article. I'm also disappointed in how readily the readership that posts about this article buys right into the framing of the debate around performance and ignores or minimizes the importance of software freedom and respect for privacy. By the metric of software freedom -- which systems respect a user's software freedom? -- modern Intel/AMD chipsets simply fail. There's no contest but there's plenty to talk about and teach about. Intel/AMD systems come with backdoors (pitched as sysadmin conveniences) we are prohibited from altering thus preventing most computer users from truly owning their computers, regardless of programmatic skill or desire. Larabel acknowledges this in a paragraph toward the end but that hardly gives this important issue its due attention. Particularly when one buys a computer they'll use for years, making a bad decision here pays off for those who want to spy on you for years to come regardless of which OSes are run atop that system. Finally, I think those that discount the importance of privacy are flatly lying; everyone has something to hide and most people reading this rely on networked computing in their daily lives. It's critical we have computers we can trust for all of our computing needs.
In addition to all the other caveats, it's important to remember that Power9's biggest market target is probably highly transactional, tens-of-thousands-of-users applications (basically, databases of one flavor or another) where how the system handles NUMA, cache coherency and atomics is far more important than how fast it can churn through embarrassingly parallel video encoding tasks.
Edit: whoops, the article ran against the dual 8-core version, not the 22-core one. Looks like $10,600 as tested, with the 500GB HD, 256GB RAM, and workstation graphics.
I nearly always find benchmark numbers useless because they don't have nearly enough information about how they're run and what the profile is. It's difficult to gain much understanding otherwise. (A classic case is communication-sensitive programs without information on MPI parameters like collective algorithms.)
Can someone comment on how any of these that are basically computational behave -- e.g. memory pressure, FP density, threading? I can't immediately find profile information, for instance, though I could derive it with some effort.
Also, is there definitive information somewhere on the POWER9 SIMD implementation (e.g. vector width)? I couldn't find any when I last looked.
Yes, although I looked again, and found some things I understand well. There's certainly no profile information there, and they don't appear to be well-specified at build/run time. Some don't seem useful, e.g. reference HPCG basically reduces to STREAM.
A lot of these benchmarks (video and audio encode, for example) have hand-coded assembly or hand-coded functions full of SSE/AVX intrinsics for x86_64. I doubt they do the equivalent for PPC64LE (yet), so the results aren't surprising (or useful).
But this shouldn't be the case at least for the PyBench and PHPBench benchmarks, or am I wrong?
I really didn't expect such a huge difference in those single-threaded tests (especially taking into account that the Power9 CPU runs at 3.8 GHz...).
How much is this an artifact of better code optimization for x64 systems vs. PPC? Before Power9 I was under the impression that PPC was dead, and so never put any effort into optimizing anything for it. I imagine this is common.
Much like the current Intel architecture is amd64, "ppc64" or "ppc64le" (for little-endian) is the name of the architecture, even for current POWER products.
> Much like the current Intel architecture is amd64
Wrong. For x86-64 there exist two implementations, which are called by their vendors AMD64 and Intel 64 (the latter was marketed by Intel under the name EM64T for a long time, but now Intel seems to use the name "Intel 64").
These are not identical, though mostly compatible. If you want to have examples where they differ, look at
for instructions marked with "Df64" and "F64" (for the meaning of Df64 and F64 cf. http://sandpile.org/x86/opc_enc.htm). Another more subtle difference can be found at slide 141-142 (though better start at slide 133) of
Irrelevant. ppc means "any power based/derived architecture". It's called amd64 in many places where it simply means 64-bit x86. Linux does, Microsoft does.
Not really. As I understand it, PPC is a fork made for consumer devices, and IBM independently maintained a somewhat different POWER architecture for its servers.
Power9 might be amazing, but it is worthless to me until I can buy off-the-shelf motherboards for it from one of the big ten Taiwanese motherboard manufacturers. This is the reason why x86-64 has been so successful.
I know this is about Power9, but these benchmarks really make me look forward to a time when companies will be ditching these first gen AMD EPYC processors on eBay for cheap. Good times are coming.