Performance Measurement on ARM
After working mostly with ARM processors in the 200...400 MHz range in many Embedded Linux projects over the last few years, we have recently seen an interesting development in the market:
- ARM CPUs, long known for their low power consumption, are becoming faster and faster (examples: OMAP3, Beagleboard, MX51/MX53).
- x86, long known for its high computing performance, is becoming more and more SoC-like, power friendly and slower.
If you read the marketing material from the chip manufacturers, it sounds as if ARM is the next x86 (in terms of performance) and x86 is the next ARM (in terms of power consumption). But where do we stand today? How fast are modern ARM derivatives?
The Pengutronix Kernel team wanted to know, so we measured in order to get some real numbers. Here are the results, and they raise some interesting questions. Don't take the "Observations" below as rigorous science; they are my attempt to sum up the results in short claims.
To find out more about the real speed of today's hardware, we collected some typical industrial boards in our lab. This is the list of devices we benchmarked:
|Board||SoC (Vendor)||Clock||Core (Architecture)|
|phyCORE-PXA270||PXA270 (Marvell)||520 MHz||XScale (ARMv5)|
|phyCORE-i.MX27||MX27 (Freescale)||400 MHz||ARM926 (ARMv5)|
|phyCORE-i.MX35||MX35 (Freescale)||532 MHz||ARM1136 (ARMv6)|
|TI/Mistral TMDXEVM3503||OMAP3503 (Texas Instruments)||500 MHz||Cortex-A8 (ARMv7)|
|Beagleboard||OMAP3530 (Texas Instruments)||500 MHz||Cortex-A8 (ARMv7)|
|phyCORE-Atom||Z510 (Intel)||1100 MHz||Atom|
How fast are these boards? Yours truly assumed that the order in the table above more or less reflects ascending performance: the PXA270 is a platform from the past, the MX27 represents the current generation of bus-matrix-optimized ARM9s, the ARM11 should be the next step up, the Cortex-A8 appears to be the next killer platform, and the Atom would probably be an order of magnitude above that.
So let's look at what we've measured.
Floating Point Multiplication (lat_ops)
This benchmark measures the time for a floating point multiplication. It serves as an indication of the computing power and is heavily influenced by whether or not a SoC has a hardware floating point unit. Here are the results:
The PXA270 and i.MX27 both have no hardware floating point unit, so the difference between the plots seems to directly reflect the different CPU clock speed.
An interesting observation is that the MX35 (ARM1136, 532 MHz) is faster than the OMAPs (Cortex-A8, 500 MHz). The frequency differs by 6%, whereas the speed is about 25% higher.
Observation 1: So even if scaled to the same frequency, the ARM11 is faster than the Cortex-A8!
Observation 2: The Atom needs 4.5 ns per multiplication. It runs at about twice the clock frequency of the MX35, but needs less than one third of the MX35's time (15 ns).
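To make the "scaled to the same frequency" argument a bit more concrete, here is a small back-of-the-envelope calculation. It only uses the figures quoted above (4.5 ns and 15 ns, the clock rates from the table, and the rough 25% speed difference); the helper name cycles_per_op is made up for this sketch and is not part of lmbench.

```python
# Figures quoted in the text above; everything else is derived from them.
MX35_MHZ, MX35_NS = 532, 15.0    # ARM1136
ATOM_MHZ, ATOM_NS = 1100, 4.5    # Atom Z510
OMAP_MHZ = 500                   # Cortex-A8

def cycles_per_op(time_ns, clock_mhz):
    """Convert a latency in nanoseconds into CPU clock cycles."""
    return time_ns * clock_mhz / 1000.0

mx35_cycles = cycles_per_op(MX35_NS, MX35_MHZ)
atom_cycles = cycles_per_op(ATOM_NS, ATOM_MHZ)

# Observation 1: the MX35 is roughly 25% faster at a roughly 6% higher clock,
# so dividing the speed ratio by the clock ratio gives the per-clock advantage.
per_clock_advantage = 1.25 / (MX35_MHZ / OMAP_MHZ)

print(f"MX35: {mx35_cycles:.2f} cycles per multiplication")
print(f"Atom: {atom_cycles:.2f} cycles per multiplication")
print(f"ARM11 vs. Cortex-A8 per clock: {per_clock_advantage:.2f}x")
```

In other words, the MX35 spends about 8 clock cycles per multiplication and the Atom about 5, and the ARM11 keeps an advantage of roughly 17% over the Cortex-A8 once the clock difference is factored out.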
Memory Bandwidth (bw_mem)
We measure the memory transfer speed with the bw_mem benchmark.
Observation 3: There is a factor of 2 between the PXA270 and the MX27/MX35.
Observation 4: The OMAP is twice as fast as the i.MX ARM9/ARM11 parts.
Observation 5: The Atom is still 2.7 times faster than the OMAP, at 2.2 times the clock rate.
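For readers who want a feeling for what such a bandwidth number means, the following is a minimal sketch of a copy bandwidth measurement. It is not bw_mem and will not reproduce the numbers above: the function name, the 8 MB buffer size and the best-of-5 strategy are assumptions chosen for illustration.

```python
import time

def copy_bandwidth_mb_s(size_bytes=8 * 1024 * 1024, repeats=5):
    """Best-of-N bandwidth of a buffer-to-buffer copy, in MB/s."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src              # one bulk copy of the whole buffer
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return size_bytes / best / 1e6

if __name__ == "__main__":
    print(f"copy bandwidth: {copy_bandwidth_mb_s():.0f} MB/s")
```

The buffer should be clearly larger than the CPU caches, otherwise the result reflects cache speed rather than memory speed.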
Context Switch Time (lat_ctx)
http://www.bitmover.com/lmbench/lat_ctx.8.html
An important indicator of system speed is the time needed to switch the CPU context. This benchmark measures the context switching time; the number of processes and their size can be configured. The processes are started, read a token from a pipe, perform a certain amount of work and pass the token on to the next process.
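The token passing scheme described above can be sketched in a few lines. This is only an illustration of the principle, not the lat_ctx implementation: the function name and parameters are made up, the "certain amount of work" per process is left out, and the pipe overhead is not subtracted from the result.

```python
import os
import time

def token_ring_time_us(nprocs=2, rounds=1000):
    """Average time in microseconds for one token hand-over in a pipe ring."""
    # pipes[i] carries the token from process i to process (i + 1) % nprocs.
    pipes = [os.pipe() for _ in range(nprocs)]
    children = []
    for i in range(1, nprocs):
        pid = os.fork()
        if pid == 0:
            # Child i: read from the previous pipe, write to its own pipe.
            rfd, wfd = pipes[i - 1][0], pipes[i][1]
            try:
                for _ in range(rounds):
                    os.write(wfd, os.read(rfd, 1))
            finally:
                os._exit(0)
        children.append(pid)

    # The parent is process 0: it injects the token and waits for its return.
    rfd, wfd = pipes[nprocs - 1][0], pipes[0][1]
    start = time.perf_counter()
    for _ in range(rounds):
        os.write(wfd, b"x")
        os.read(rfd, 1)
    elapsed = time.perf_counter() - start

    for pid in children:
        os.waitpid(pid, 0)
    for r, w in pipes:
        os.close(r)
        os.close(w)
    # Each round consists of nprocs hand-overs.
    return elapsed / (rounds * nprocs) * 1e6

if __name__ == "__main__":
    print(f"{token_ring_time_us():.1f} us per hand-over")
```

On a multi-core machine the processes may run on different cores, so the sketch only approximates a context switch if everything is pinned to a single CPU (for example with taskset).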
Observation 6: This shows impressively how slow the PXA is: it is a factor of 40 slower than the Atom, and still a factor of 3 slower than the ARM926.
Observation 7: The MX35/ARM1136 has almost the same speed as the Cortex-A8. I would have thought that the newer Cortex would be much faster, somewhere between the ARM11 and the Atom. But the Cortex is still three times slower than the Atom, although at about half the clock rate.
Syscall Performance (lat_syscall.open)
In order to estimate the performance of calling operating system functionality, we measured the syscall latency with lat_syscall. The benchmark performs an open() and a close() on a 1 MB random data file located in a ramdisk (tmpfs), accessing the file with a relative path (absolute paths seem to give different results). The time for both operations together is measured.
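The measurement described above boils down to a tight loop around open() and close(). The sketch below mimics the setup (1 MB of random data, relative path), but it is not lat_syscall: the function names are made up, interpreter overhead is included in the result, and the temporary directory is only located on tmpfs if the system is configured that way.

```python
import os
import tempfile
import time

def open_close_time_us(path, iterations=10000):
    """Average time in microseconds for one open() + close() pair."""
    start = time.perf_counter()
    for _ in range(iterations):
        fd = os.open(path, os.O_RDONLY)
        os.close(fd)
    return (time.perf_counter() - start) / iterations * 1e6

def measure(iterations=10000):
    """Create a 1 MB random file and time open/close via a relative path."""
    old_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as tmpdir:
        with open(os.path.join(tmpdir, "testfile"), "wb") as f:
            f.write(os.urandom(1024 * 1024))
        os.chdir(tmpdir)          # so that "testfile" is a relative path
        try:
            return open_close_time_us("testfile", iterations)
        finally:
            os.chdir(old_cwd)     # leave the directory before it is removed

if __name__ == "__main__":
    print(f"open+close: {measure():.2f} us")
```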
Observation 8: The PXA isn't too bad when it comes to syscalls.
Observation 9: The MX27 jumps between 15 and 40, so this measurement is probably invalid.
Observation 10: The Cortex-A8 and the ARM11 are almost equally fast.
Observation 11: Even between the OMAP/ARM11 and the Atom, there is only a factor of 1.5.
Process Forking (lat_proc)
The lat_proc benchmark forks processes and measures the time to do so.
Observation 12: The ARM11 is even better than the Cortex-A8! I had expected the newer Cortex to perform better here.
Observation 13: The Atom is 3 times as fast, at 2 times the clock frequency.
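As an illustration of what a fork measurement looks like, here is a minimal sketch that forks a child which exits immediately and waits for it. It loosely corresponds to a plain fork-and-exit test and is not the lat_proc code; the function name and iteration count are assumptions.

```python
import os
import time

def fork_exit_time_us(iterations=200):
    """Average time in microseconds for fork() + child exit + waitpid()."""
    start = time.perf_counter()
    for _ in range(iterations):
        pid = os.fork()
        if pid == 0:
            os._exit(0)           # child: exit right away, skip all cleanup
        os.waitpid(pid, 0)        # parent: reap the child before continuing
    return (time.perf_counter() - start) / iterations * 1e6

if __name__ == "__main__":
    print(f"fork+exit: {fork_exit_time_us():.0f} us")
```

Note that the result depends on the size of the forking process, since the kernel has to duplicate its page tables; a Python interpreter is considerably larger than a small C benchmark binary.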
Yes, these measurements are probably not completely scientifically correct. But the intention was to give us a rough idea of how the systems perform.
We expected the Cortex-A8 to be an order of magnitude faster than the ARM11. This doesn't seem to be the case. Only the memory bandwidth is much higher; most of the other benchmarks show almost the same values. It is currently totally unclear to us where the performance gain we expected from an ARMv7 core over an ARMv6 core went.
There seems to be a pattern that, at double the clock frequency, the Atom is often three times faster than the ARM11/Cortex-A8.
Do you have any remarks, ideas about the observed effects and other things you might want to tell us? We want to improve this article with the help of the community. So please send us your feedback to the mail address in the box below.
|Thanks to ...||... for ...|
|Jochen Frieling||all the measurements|
|Marc Kleine-Budde||porting all kernels to 2.6.34|