Performance Assessment using the Dhrystone v2.1 Benchmark on the Xilinx ML507 FPGA Platform
Overview
Test Setup
Hardware Architecture
Clocking
Infrastructure
Glossary
The Dhrystone Benchmark
IBM PowerPC Performance Libraries
Test Results
Measurement
Results
Inter-/Extrapolated Results
Cheated Results
Measurement
Results
Inter-/Extrapolated Results
Summary & Conclusions
References
I am well aware of the saying 'fake, lie, benchmark'. I therefore try to follow best practice by clearly specifying the overall setup: the hardware system, the software and compiler versions, and the compilation flags used. Moreover, I do not tweak any system specifics in order to reach the highest scores, but rely on ordinary compiler flags such as '-O2' and '-O3'. The corresponding results are listed below; any interpretation is left to the prospective reader.
Note that in general, benchmark numbers are meaningless
without proper specification of compiler settings and
benchmarking conditions [4].
Last but not least, keep in mind Dilbert:
Hardware | IBM PowerPC 440 CPU running at 400 MHz; CPM & PLB running at 133 MHz; Peripherals: 32 kB BRAM, RS232 UART, interrupt controller, timer |
Software | Dhrystone benchmark version 2.1 (Language: C) [5]; selected tests employ the IBM PowerPC Performance Libraries [6] |
Compiler | powerpc-eabi-gcc (GCC) 4.1.1 20060524 |
Compilation | Program compiled without 'register' attribute; separate compilation of files dhry_1.c and dhry_2.c, as intended by the Dhrystone benchmark |
The hardware architecture is depicted below. Only essential peripheral blocks have been attached to the CPU.
Hardware architecture for the performance assessment of the embedded IBM PowerPC 440 CPU using the Dhrystone benchmark. The connected peripheral blocks are 32 kB of on-board memory, a timer, an interrupt controller and an RS232 UART for serial communication.
The clocking scheme of the system architecture has to adhere
to a number of rules imposed by the interconnect and
device specifics. The applied clocking scheme and the
corresponding clock ratios are listed below.
Clock | Frequency | Clock Ratio |
CPU core clock | 400 MHz | |
CPM clock | 133 MHz | CPU:CPM 3:1 |
MPLB (PLB_v46_0) | 133 MHz | CPU:MPLB 3:1, CPM:MPLB 1:1 |
APU | Auxiliary Processing Unit |
BRAM | Block Random Access Memory |
CPM | Communications Processor Module |
DAC | Digital-to-Analog Converter |
DCR | Device Control Register |
DMA | Direct Memory Access |
FCB | Fabric Coprocessor Bus |
FCM | Fabric Coprocessor Module |
FPU | Floating-Point Unit |
GPIO | General Purpose Input/Output |
MCI | Memory Controller Interface |
MPLB | Processor Local Bus Master |
SPLB | Processor Local Bus Slave |
PLB | Processor Local Bus |
PPC | PowerPC |
The Dhrystone benchmark is a synthetic benchmark program that performs a series of CPU-centric operations such as integer arithmetic, comparisons, and logic and string operations. Ported to the C programming language in 1988, the Dhrystone version 2.1 benchmark relies mostly on standard C library functions such as strcmp(), strcpy(), and memcpy(), and does not involve any multiply-accumulate or floating-point operations. The benchmark was mainly intended to characterize the integer performance of CPUs at the dawn of the Internet age in the 1980s and 1990s. The measured performance is expressed in a benchmark-specific metric called Dhrystones per second. This number can then be converted to Dhrystone MIPS (DMIPS).
What does 1 DMIPS stand for? The VAX-11/780 has been selected as the 'reference 1 MIPS machine'. It scores 1757 Dhrystones per second in the Dhrystone benchmark and hence constitutes the reference for 1 DMIPS of compute performance. As you can see from the measurement results below, the examined IBM PowerPC 440 CPU has a performance equivalent to that of up to 889 VAX-11/780 machines. Did you get it? Well, even for me, the VAX-11/780 is an obscure ancient computing device from Digital Equipment Corporation (DEC), introduced in October 1977. It is a kind of computer dinosaur I haven't had the pleasure of becoming acquainted with...
The VAX-11/780 from Digital Equipment Corporation (DEC), introduced in 1977. It ran at 5 MHz and incorporated 32 bit addressing, 16 registers, 2 kB cache and 128 kB - 8 MB ECC RAM. This system achieves 1757 Dhrystones per second in the Dhrystone benchmark and serves as the reference for the benchmark metric: its performance is defined as 1 Dhrystone MIPS (DMIPS). (Source: en.wikipedia.org / Digital Equipment Corporation)
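The conversion described above is a single division by the VAX-11/780 reference score. A minimal sketch, assuming the reference value of 1757 Dhrystones per second given in the text; the function and constant names are illustrative and not part of the Dhrystone sources:

```python
# Reference score of the VAX-11/780 (the "1 MIPS machine"), as stated in the text.
VAX_11_780_DHRYSTONES_PER_SECOND = 1757


def dhrystones_to_dmips(dhrystones_per_second: float) -> float:
    """Convert a Dhrystones-per-second score to Dhrystone MIPS (DMIPS)."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND


if __name__ == "__main__":
    # Best score measured with separate compilation (-O3 -mppcperflib).
    score = 1_562_376
    print(f"{dhrystones_to_dmips(score):.1f} DMIPS")
```

By construction, a machine that scores exactly 1757 Dhrystones per second comes out at 1 DMIPS.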
Selected tests employ the compilation flag '-mppcperflib', which links in the IBM PowerPC Performance Libraries for optimized low-level integer and floating-point emulation and optimized string handling routines [6]. According to Xilinx, the IBM PowerPC Performance Libraries may yield an average threefold speedup for applications that make heavy use of these routines.
Caution: The IBM PowerPC Performance Libraries are only
intended for improving the execution of emulated floating-point
arithmetic and hence cannot be used in conjunction with
floating-point hardware, i.e., with the '-mfpu'
switch active.
Using separate compilation of files dhry_1.c and dhry_2.c, as
intended by the Dhrystone benchmark.
Any interpretation is left to the prospective reader.
IBM PowerPC 440 @ 400 MHz
The gray shaded table entries represent compiler settings violating the intended compilation rules of the synthetic benchmark.
Compiler optimization flags | -O3 -mppcperflib¹º | -O3¹¹ | -O2 -mppcperflib²º | -O2²¹ |
Execution time for 100'000'000 iterations [s] | 64.01 | 75.75 | 98.8 | 110.8 |
Dhrystones per second | 1'562'376 | 1'320'147 | 1'012'608 | 902'930 |
Dhrystone MIPS (DMIPS) | 889.2 | 751.4 | 576.3 | 513.9 |
Time for one run through Dhrystone [us] | 0.64 | 0.76 | 0.99 | 1.11 |
DMIPS/MHz | 2.22 | 1.88 | 1.44 | 1.28 |
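The rows of the table above are all derived from the measured execution time. The following sketch shows one way to reproduce that arithmetic; the names are illustrative. Because the execution times in the table are rounded, recomputed values can differ slightly from the listed Dhrystones per second:

```python
from dataclasses import dataclass

# Reference score of the VAX-11/780 in Dhrystones per second (from the text).
VAX_REFERENCE = 1757


@dataclass
class DhrystoneMetrics:
    dhrystones_per_second: float
    dmips: float
    microseconds_per_run: float
    dmips_per_mhz: float


def derive_metrics(iterations: int, seconds: float, cpu_mhz: float) -> DhrystoneMetrics:
    """Derive the table metrics from one timed benchmark run."""
    if iterations <= 0 or seconds <= 0 or cpu_mhz <= 0:
        raise ValueError("iterations, seconds and cpu_mhz must be positive")
    dhrystones_per_second = iterations / seconds
    dmips = dhrystones_per_second / VAX_REFERENCE
    microseconds_per_run = seconds / iterations * 1e6
    return DhrystoneMetrics(
        dhrystones_per_second, dmips, microseconds_per_run, dmips / cpu_mhz
    )


if __name__ == "__main__":
    # -O3 -mppcperflib column: 100'000'000 iterations in 64.01 s at 400 MHz.
    m = derive_metrics(100_000_000, 64.01, 400)
    print(f"{m.dmips:.1f} DMIPS, {m.dmips_per_mhz:.2f} DMIPS/MHz")
```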
Using 889 DMIPS @ 400 MHz as reference data:
CPU clock [MHz] | 100 | 200 | 300 | 400 | 500 | 550 |
Dhrystones per second | 390'594 | 781'188 | 1'171'782 | 1'562'376 | 1'952'970 | 2'148'267 |
Dhrystone MIPS (DMIPS) | 222 | 445 | 667 | 889 | 1112 | 1223 |
DMIPS/MHz | 2.22 |
Using 576 DMIPS @ 400 MHz as reference data:
CPU clock [MHz] | 100 | 200 | 300 | 400 | 500 | 550 |
Dhrystones per second | 253'152 | 506'304 | 759'456 | 1'012'608 | 1'265'760 | 1'392'336 |
Dhrystone MIPS (DMIPS) | 144 | 288 | 432 | 576 | 720 | 793 |
DMIPS/MHz | 1.44 |
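The inter-/extrapolated values above follow from scaling the 400 MHz measurement linearly with the CPU clock, which is why DMIPS/MHz stays constant across each table. A minimal sketch of that scaling, with illustrative names. It assumes the benchmark remains purely CPU-bound at every clock frequency, which the tables take for granted but which was not measured:

```python
def extrapolate_dhrystones(reference_score: float, reference_mhz: float,
                           target_mhz: float) -> float:
    """Scale a Dhrystones-per-second score linearly with the CPU clock."""
    if reference_mhz <= 0 or target_mhz <= 0:
        raise ValueError("clock frequencies must be positive")
    return reference_score * target_mhz / reference_mhz


if __name__ == "__main__":
    # Reference: 1'012'608 Dhrystones per second at 400 MHz (-O2 -mppcperflib).
    for mhz in (100, 200, 300, 400, 500, 550):
        score = extrapolate_dhrystones(1_012_608, 400, mhz)
        print(f"{mhz} MHz: {score:.0f} Dhrystones per second")
```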
I was curious about the impact on Dhrystone performance if you merge files dhry_1.c and dhry_2.c into a single file, as you are not supposed to do [2]. The results are listed below. Again, any interpretation is left to the prospective reader.
IBM PowerPC 440 @ 400 MHz
The gray shaded table entries represent compiler settings violating the intended compilation rules of the synthetic benchmark.
Compiler optimization flags | -O3 -mppcperflib¹º | -O3¹¹ | -O2 -mppcperflib²º | -O2²¹ |
Execution time for 100'000'000 iterations [s] | 49.25 | 61.51 | 94.0 | 105.5 |
Dhrystones per second | 2'030'390 | 1'625'766 | 1'063'775 | 947'798 |
Dhrystone MIPS (DMIPS) | 1155.6 | 925.3 | 605.4 | 539.4 |
Time for one run through Dhrystone [us] | 0.49 | 0.62 | 0.94 | 1.06 |
DMIPS/MHz | 2.89 | 2.31 | 1.51 | 1.35 |
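For readers who want to put the separate-compilation and merged-file tables side by side, the relative change per compiler setting can be computed directly from the DMIPS rows. The sketch below only restates the measured numbers; the dictionary and function names are illustrative:

```python
# DMIPS at 400 MHz, copied from the two result tables.
SEPARATE_FILES = {
    "-O3 -mppcperflib": 889.2,
    "-O3": 751.4,
    "-O2 -mppcperflib": 576.3,
    "-O2": 513.9,
}
MERGED_FILES = {
    "-O3 -mppcperflib": 1155.6,
    "-O3": 925.3,
    "-O2 -mppcperflib": 605.4,
    "-O2": 539.4,
}


def relative_gain_percent(before: float, after: float) -> float:
    """Relative change from 'before' to 'after', in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    for flags, separate in SEPARATE_FILES.items():
        gain = relative_gain_percent(separate, MERGED_FILES[flags])
        print(f"{flags}: {gain:+.1f} %")
```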
Using 1156 DMIPS @ 400 MHz as reference data:
CPU clock [MHz] | 100 | 200 | 300 | 400 | 500 | 550 |
Dhrystones per second | 507'598 | 1'015'195 | 1'522'793 | 2'030'390 | 2'537'988 | 2'791'787 |
Dhrystone MIPS (DMIPS) | 289 | 578 | 867 | 1156 | 1445 | 1589 |
DMIPS/MHz | 2.89 |
Using 605 DMIPS @ 400 MHz as reference data:
CPU clock [MHz] | 100 | 200 | 300 | 400 | 500 | 550 |
Dhrystones per second | 265'944 | 531'888 | 797'831 | 1'063'775 | 1'329'719 | 1'462'691 |
Dhrystone MIPS (DMIPS) | 151 | 303 | 454 | 605 | 757 | 833 |
DMIPS/MHz | 1.51 |
Using gcc 4.1.1 and the IBM PowerPC Performance Libraries on an embedded PowerPC 440 CPU running at 400 MHz, it was possible to achieve up to 576 Dhrystone MIPS (DMIPS) with a code size that fits into the cache and with compiler optimization flags that are officially allowed by the Dhrystone benchmark. Extrapolated to 550 MHz, the PowerPC 440 would achieve 793 DMIPS for the version 2.1 benchmark, which is well below the single-core performance of 1100+ DMIPS advertised by Xilinx [1]. If aggressive compiler optimization techniques beyond the intention of the benchmark are used, the Dhrystone performance increases dramatically. This is ultimately due to the fact that the synthetic Dhrystone benchmark was neither intended nor designed to cope with such compiler-based code optimizations [4].
However, the relevance of these performance numbers is questionable in general: the code must run entirely from cache, without any I/O transfers, to show the best results. As soon as larger code size, costly I/O transfers, and different compiler options are involved, these numbers are merely theoretical.
Last but not least, it is very impressive to see how strongly the
different code optimization techniques of the compiler
influence the execution time of one and the same piece of code.
This makes it obvious that a CPU performance measuring tool
such as a benchmark needs to be designed with the
hardware, software and compiler architectures and
capabilities clearly in mind.
[1] Xilinx Inc., Virtex-5 Family Brochure, Dec 2008
[2] Xilinx Inc., Virtex-5 Website
[3] Paul Glover, Running the Dhrystone 2.1 Benchmark on a Virtex-II Pro PowerPC Processor, Xilinx Application Note (XAPP507), July 2005
[4] Alan R Weiss, Dhrystone Benchmark - History, Analysis, "Scores" and Recommendations, White Paper, Nov. 1, 2002
[5] netlib.org, Benchmark Programs and Reports
[6] sourceforge.net, IBM PowerPC Performance Libraries
Last updated: 2012/12/30