Power9 Benchmarks vs. Intel Xeon vs. AMD EPYC Performance on Debian Linux (phoronix.com)
110 points by rbanffy on April 13, 2018 | 57 comments


I’m more impressed by Xeon Gold vs EPYC perf. The fact that Power 9 is currently slower on workloads that heavily benefit from hand-optimized code specific to Intel CPUs should not really surprise anybody. It’s likely that this code just uses a “portable”, slow implementation on Power9, just because nobody has the systems to work with yet. Additionally it’s also likely that most software in these tests can’t take advantage of such a huge number of threads.


> I’m more impressed by Xeon Gold vs EPYC perf.

The hardware used in this benchmark isn't very well chosen. All the systems have a different number of cores (Power 16, AMD 32, Intel 40), and the AMD system has the additional disadvantage over the others of being 1P instead of 2P, which approximately halves the available cooling and correspondingly limits clock speeds. The results aren't very surprising given that.

Making comparisons like this can make sense when one of the competitors doesn't make an equivalent product, but they all do make products with an equivalent number of sockets and cores.


It depends on what you want to compare. If it's performance at a certain price point, you select systems at that point. You can compare on performance per watt or per volume or more abstract comparisons (ideally unbound by price or size) of performance per core, socket or thread at various load factors.
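To make that concrete, here is a minimal sketch of normalizing one benchmark score along several axes. Every figure below is a made-up placeholder, not a number from the Phoronix article, and the `System` fields and the `normalized` helper are names chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    score: float      # benchmark result, higher is better (placeholder)
    price_usd: float  # placeholder
    watts: float      # placeholder
    cores: int


def normalized(s: System) -> dict:
    """Return the same score divided by price, power, and core count."""
    return {
        "per_dollar": s.score / s.price_usd,
        "per_watt": s.score / s.watts,
        "per_core": s.score / s.cores,
    }


# Hypothetical systems; none of these values come from the article.
systems = [
    System("A", score=1000.0, price_usd=10000.0, watts=400.0, cores=16),
    System("B", score=1500.0, price_usd=5000.0, watts=300.0, cores=32),
]

for s in systems:
    print(s.name, normalized(s))
```

With these placeholder inputs, A comes out ahead per core while B comes out ahead per dollar and per watt, which is the point: the ranking depends on which axis you pick.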


Then test against 2 EPYC 7451s as well for a fairer comparison.


Also, some tuning for the POWER software would probably be very welcome.


I believe the fair comparison would be performance per dollar - comparing similarly priced components. Same sockets and cores can have wildly different costs.


Comparing like with like can give you a better idea of what's going on even if the prices aren't the same. It's a matter of isolating the architecture from the other factors. If a 40 core 2P Intel system is faster than a 32 core 1P AMD system, you don't know if the difference is architectural or just the extra cores and sockets, so you don't know whether to expect a 48 core 2P AMD system to be faster or slower than the Intel system.

And the tested systems aren't equivalently priced here either.


I still need to see raw performance: single core, multi-core single socket, and multi-socket.


Per dollar of CPU, CPU + power, CPU + motherboard, TCO?


> the AMD system has the additional disadvantage over the others of being 1P instead of 2P, which approximately halves the available cooling and correspondingly limits clock speeds.

This doesn't make sense to me. I'd assume each system to be spec'ed correctly for the TDP of the processors. Each system has its own HSF and case solution.

Multiprocessor systems do perform at less than 2x of their single-processor counterparts, but it has nothing to do with thermals in a properly designed system.


I wonder how 1P vs 2P would affect memory bandwidth. The 1P vs 2P difference would probably be deeper than just CPU clock speed.

Also, I wonder if these were Meltdown and Spectre patched.


If you are memory throughput bound, and both processors are working independently (think embarrassingly parallelizable problems) you'd probably benefit from the total throughput change.

Otherwise it's pretty easy to get worse performance due to NUMA issues.


EPYC is already four NUMA nodes per socket though.


POWER doesn't have that many more threads. The top-of-the-line is 22-core 88-thread, compared to 32-core 64-thread from AMD and 28-core 56-thread from Intel.

I have no idea what having that many threads per core means for performance.


Extra threads mean that there's more likely to be something else available to schedule when one thread's pipelines are stalled due to a fetch from DRAM or MMIO, basically. There are very much diminishing returns beyond two, but for some workloads (big mostly-RAM-resident data sets with poor locality of reference -- consumer databases, say) it's worth it. It's unclear to me that 4-way multithreading is going to help any of the benchmarks in this test.
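A back-of-the-envelope way to see the diminishing returns is an idealized latency-hiding model: each hardware thread alternates between `compute` cycles of useful work and `stall` cycles waiting on memory. The function name and the cycle counts below are assumptions for illustration only. The model assumes stalls from different threads overlap perfectly and ignores contention for execution units, cache, and DRAM bandwidth, so it is an upper bound rather than a prediction.

```python
def core_utilization(n_threads: int, compute: float, stall: float) -> float:
    """Idealized fraction of cycles in which the core does useful work.

    Each thread runs for `compute` cycles, then stalls for `stall` cycles.
    Stalls from different threads are assumed to overlap perfectly.
    """
    return min(1.0, n_threads * compute / (compute + stall))


# Two hypothetical workloads: balanced, and stall-heavy (poor locality).
for compute, stall in [(100, 100), (100, 300)]:
    for n in (1, 2, 4, 8):
        print(compute, stall, n, core_utilization(n, compute, stall))
```

In the balanced case two threads already saturate the core in this model, so going to four buys nothing. Only the stall-heavy case keeps improving up to four threads, and eight adds nothing in either case.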


All diminishing returns really depend on your workload. We normally would say that for caches, with no point having enormous caches on desktop and typical x86 server processors, but IBM's mainframe CPUs have tons of L3 and even more tons of L4 caches, as well as dedicated cores (in the form of secondary processors) for offloading all kinds of tasks from the main CPUs, each with its own cache architecture.


In the specific case of POWER 8 and 9 the cores are seriously overprovisioned with execution resources and you really need at least 2 threads running in order to make full use of them.


Could be. But the point was more that there's a very thin regime between "waiting for DRAM latency too often" (where more threads can help) and "bound by DRAM bandwidth" (where they won't). The DRAM isn't nearly as parallel as the cores are and saturates really fast.


Knights Corner was another case where you normally needed multiple threads.


In that case it was even more serious. Like the PPUs of the Cell processors, each core runs one instruction every other cycle. If you have only one thread, you effectively have half the throughput.


Some of the models even have 8-way SMT!


Offhand, I would guess that it's good for I/O-bound tasks, where you can have lots of threads waiting for input that don't need CPU time. A busy database maybe.


You don't need CPU threads for threads that don't need CPU...


The 8-core version only has 8 (each) ALUs, LSUs, and vector units. https://en.wikipedia.org/wiki/POWER9#Core If each core has 4 threads "running" on it, some of them are not going to be executing.


Still: the kernel won’t schedule threads that are waiting on I/O, I believe.


Ok that's an interesting point. So they'd have to be waiting on a fetch instruction for the time not to be totally wasted?


Blocked tasks are not scheduled until unblocked.
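A quick way to observe this from user space, as a rough sketch: start a few hundred software threads that do nothing but block, and compare process CPU time with wall-clock time. The thread count and the sleep length are arbitrary choices, and an event wait stands in here for a thread blocked on I/O.

```python
import threading
import time

N_THREADS = 200          # arbitrary count for illustration
stop = threading.Event()


def waiter() -> None:
    # Blocks until the event is set; a blocked thread is not runnable.
    stop.wait()


threads = [threading.Thread(target=waiter) for _ in range(N_THREADS)]
cpu_before = time.process_time()
wall_before = time.perf_counter()
for t in threads:
    t.start()
time.sleep(0.5)          # every waiter stays blocked during this window
stop.set()
for t in threads:
    t.join()
cpu_used = time.process_time() - cpu_before
wall_used = time.perf_counter() - wall_before
print(f"wall: {wall_used:.2f}s, cpu: {cpu_used:.2f}s")
```

The CPU time should come out as a small fraction of the wall time, mostly thread start-up cost, however many hardware threads the machine has.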


Is there any way to price out a POWER9 itself? Just for comparison.


Do you mean just the processor? RaptorCS sells processor + mainboard bundles.


In before questions of "why would I pay so much for such a slow machine" - the target market for Talos is not necessarily those who seek raw performance.

For many of us, having a workstation that is at least in the same rough range of performance while being free and clear of IME and PSP is worth the price premium.

RISC V is slowly getting there, but for real work, this is a decent option for us tinfoil hat crowd.


Sorry for the grim picture, but what is stopping RISC-V SoC makers from bolting on their own IME and PSP nonsense?


Presumably, the "tinfoil hat users" would just elect to buy a different board without those features grafted on.

It's an open design, so those additions would be an OEM decision.


I find it very unlikely that an OEM will emerge that will suddenly cater to us “tinfoil hat users”.


I concur; while I think it's great to see these POWER systems get some press I think that this article in particular puts forth the wrong set of priorities, addresses speed unfairly in more than one way, and presents unreasonable conclusions in the end. That these qualities are addressed in the article doesn't transform poor choices into reasonable choices, or unfairness into informative review.

The benchmarks seem unfair to me because they compare code that either in code choice (specific instruction sets) or algorithm design could favor the Intel/AMD systems. Larabel addresses this but still publishes these results as if we could use them as a reasonable baseline for future comparisons when the software is optimized for POWER systems.

Also on performance, I imagine most computer users will find that any modern computer will do what they need it to do quickly enough to get real work done. Check out the videos this POWER system team put together for their other systems (systems that aren't radically different from this one) like https://www.youtube.com/watch?v=05NNFJj3Mrw (use youtube-dl or avideo to avoid running YouTube's nonfree software) or at https://www.raptorengineering.com/TALOS/op_ue4_gl.php ; you'll see games run well, multi-monitor setups run well; and users can expect to get plenty of work done. The bottleneck is usually elsewhere: how quickly a server gives the user data, how much space the computer has to store data (either in RAM or permanent storage), and other more complex delays imposed by some other part of a process well beyond one's own computer. Most computer users simply don't do tasks akin to what's being tested here. The most mathematically challenging thing they do is decode video and that's often designed to be faster than encoding that same video. Even video encoding is somewhat overrated as this is typically done far fewer times than decoding that same video. One can't get a computer fast enough to avoid purchasing many computers to do large encoding jobs; it's not a matter of switching to some modern Intel/AMD system.

As you said, the respect for privacy and software freedom is underrated to a fault in this Phoronix article. I'm also disappointed in how readily the readership that posts about this article buys right into the framing of the debate around performance and ignores or minimizes the importance of software freedom and respect for privacy. By the metric of software freedom -- which systems respect a user's software freedom? -- modern Intel/AMD chipsets simply fail. There's no contest but there's plenty to talk about and teach about. Intel/AMD systems come with backdoors (pitched as sysadmin conveniences) we are prohibited from altering thus preventing most computer users from truly owning their computers, regardless of programmatic skill or desire. Larabel acknowledges this in a paragraph toward the end but that hardly gives this important issue its due attention. Particularly when one buys a computer they'll use for years, making a bad decision here pays off for those who want to spy on you for years to come regardless of which OSes are run atop that system. Finally, I think those that discount the importance of privacy are flatly lying; everyone has something to hide and most people reading this rely on networked computing in their daily lives. It's critical we have computers we can trust for all of our computing needs.


In addition to all the other caveats, it's important to remember that Power9's biggest market target is probably highly transactional, tens-of-thousands-of-users applications (basically, databases of one flavor or another) where how the system handles NUMA, cache coherency and atomics is far more important than how fast it can churn through embarrassingly parallel video encoding tasks.


Should have probably built/tested the comparison systems without SSDs, since the POWER9 had an SAS disk in it.


Does the article give full system prices? Perhaps in my rush to the graphs I missed them.

I'm interested to know the price:performance comparison of the systems (wrt kernel compilation time)


You can price out the POWER system they tested here. https://secure.raptorcs.com/content/TL2WK2/purchase.html Remember to type in "2" for the quantity of 22-core CPU upgrades you want. Looks like at least $9k.

Edit: whoops, the article ran against the dual 8-core version, not the 22-core one. Looks like $10,600 as tested, with the 500GB HD, 256GB RAM, and workstation graphics.


I nearly always find benchmark numbers useless because they don't have nearly enough information about how they're run and what the profile is. It's difficult to gain much understanding otherwise. (A classic case is communication-sensitive programs without information on MPI parameters like collective algorithms.)

Can someone comment on how any of these that are basically computational behave -- e.g. memory pressure, FP density, threading? I can't immediately find profile information, for instance, though I could derive it with some effort.

Also, is there definitive information somewhere on the POWER9 SIMD implementation (e.g. vector width)? I couldn't find it when I last looked.


Did you have a look at https://www.phoronix-test-suite.com/ ?


Yes, although I looked again, and found some things I understand well. There's certainly no profile information there, and they don't appear to be well-specified at build/run time. Some don't seem useful, e.g. reference HPCG basically reduces to STREAM.


A lot of these benchmarks (video and audio encode, for example) have hand-coded assembly or hand-coded functions full of SSE/AVX intrinsics for x86_64. I doubt they do the equivalent for PPC64LE (yet), so the results aren't surprising (or useful).
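The general pattern looks something like the sketch below: a table of tuned routines keyed by architecture, with a generic fallback for everything else. This is an illustration of the idea only, not how any particular encoder is structured. All function names are hypothetical, and the "tuned" routine is an ordinary Python function standing in for hand-written SIMD code.

```python
import platform


def dot_portable(a, b):
    """Plain loop that works everywhere; stands in for the generic path."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_x86_tuned(a, b):
    """Stand-in for a hand-tuned SSE/AVX routine (no real SIMD here)."""
    return float(sum(x * y for x, y in zip(a, b)))


# Hypothetical dispatch table: only 64-bit x86 has a tuned entry.
TUNED = {"x86_64": dot_x86_tuned, "amd64": dot_x86_tuned}


def select_dot(machine: str):
    """Pick the tuned routine if one exists, else the portable fallback."""
    return TUNED.get(machine.lower(), dot_portable)


dot = select_dot(platform.machine())
print(dot.__name__, dot([1.0, 2.0], [3.0, 4.0]))
```

On a ppc64le machine the lookup misses and the portable routine runs, which is the situation the comment above suspects for these benchmarks until someone adds a POWER entry to the table.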


But this shouldn't be the case at least for the PyBench and PHPBench benchmarks, or am I wrong? I really didn't expect such a huge difference in those single-threaded tests (especially taking into account the Power9 CPU running at 3.8GHz...).


I generally avoid languages like Python and PHP, so I do not know for sure but are they still not JITted on x86?


How much is this an artifact of better code optimization for x64 systems vs. PPC? Before Power9 I was under the impression that PPC was dead, and so never put any effort into optimizing anything for it. I imagine this is common.


Power9 isn't PPC: PPC is a derivative of the Power architecture made for Apple, Power9 is IBM's server architecture.


Much like the current Intel architecture is amd64, "ppc64" or "ppc64le" (for little-endian) is the name of the architecture, even for current POWER products.


> Much like the current Intel architecture is amd64

Wrong. For x86-64 there exist two implementations, which are called by their vendors AMD64 and Intel 64 (the latter was marketed by Intel under the name EM64T for a long time, but now Intel seems to use the name "Intel 64").

These are not identical, though mostly compatible. If you want to have examples where they differ, look at

> http://sandpile.org/x86/opc_1.htm

> http://sandpile.org/x86/opc_2.htm

for instructions marked with "Df64" and "F64" (for the meaning of Df64 and F64 cf. http://sandpile.org/x86/opc_enc.htm). Another more subtle difference can be found at slide 141-142 (though better start at slide 133) of

> https://www.blackhat.com/docs/us-17/thursday/us-17-Domas-Bre...


Irrelevant. ppc means "any power based/derived architecture". It's called amd64 in many places where it simply means 64bit x86. Linux does, Microsoft does.
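In practice, tooling mostly has to cope with several spellings for the same target. A small sketch of normalizing them follows; the alias table is illustrative and not exhaustive, and `canonical_arch` is a hypothetical helper name. The entries reflect names that are in common use: Windows reports `AMD64`, Linux `uname -m` reports `x86_64` or `ppc64le`, and Debian calls its ports `amd64` and `ppc64el`.

```python
import platform

# Illustrative alias table mapping common spellings to one canonical name.
ALIASES = {
    "amd64": "x86_64",
    "x86_64": "x86_64",
    "x64": "x86_64",
    "ppc64": "ppc64",
    "ppc64le": "ppc64le",
    "ppc64el": "ppc64le",      # Debian's port name
    "powerpc64le": "ppc64le",  # spelling used in GNU target triplets
}


def canonical_arch(name: str) -> str:
    """Map a reported machine name to a canonical one; unknowns pass through."""
    key = name.lower()
    return ALIASES.get(key, key)


print(platform.machine(), "->", canonical_arch(platform.machine()))
```

None of this settles the naming argument; it only shows that the labels are treated as aliases for a target rather than as statements about which vendor's implementation is present.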


Not really, as I understand it, PPC is a fork made for consumer devices and IBM independently maintained a somewhat different POWER architecture for its servers.


Why is the Intel machine 2P and all others are 1P?


> Those were the systems available for this initial round of testing

The POWER9 system is also 2P.


Power9 might be amazing, but it is worthless to me until I can buy off-the-shelf motherboards for it from one of the big ten Taiwanese motherboard manufacturers. This is the reason why x86-64 has been so successful.


I know this is about Power9, but these benchmarks really make me look forward to a time when companies will be ditching these first gen AMD EPYC processors on eBay for cheap. Good times are coming.


Well, Ryzen 2 is almost here, and Rome (EPYC 2) is going to be next year, I believe.


This benchmark is a joke. Those 3 CPUs cannot be compared to each other.


Why?



