Mandelbrot/Julia Fractals on Xilinx FPGA

Floating-Point Performance Characterization of PowerPC 440 CPU using Fractal Sets

Table of Contents

Overview
Features
Hardware Architecture
   Clocking Infrastructure
   Glossary
Fractal Kernels
   Basic Kernel
   Dual-Issue Kernel
Implementation Results
Performance Assessment
   Julia Set
   Mandelbrot Set
Summary & Conclusions
Outlook
References

Overview

This page briefly describes the implementation of a fractal compute kernel on an FPGA prototyping platform with an embedded PowerPC 440 CPU and floating-point coprocessor, including visualization of the fractal set on an external TFT display. The main objectives of this project are summarized as follows:

Hands-on experience concerning the capabilities of the Xilinx Virtex-5 ML507 FPGA platform with XC5VFX70T FPGA [1, 2, 3].
Assessment of the integrated development flow of the Xilinx embedded development kit (EDK) including HW/SW tool chain.
Examination of the quality and versatility of the accompanying IP components.
Getting accustomed to the system architecture of the embedded processor block (CPU, cache, crossbar, APU interface).
Performance characterization of the embedded PowerPC 440 CPU with and without a single-/double-precision floating-point coprocessor.
Assessment of the impact of the compiler's different code optimization techniques for this embedded platform.

The project performs all computations for the fractal sets on the embedded PowerPC 440 RISC CPU in order to characterize the CPU-based processing performance. If the computational performance turns out to fall significantly short of what the intended application requires, selected computational tasks can still be offloaded from the CPU to hardware-based processing.

Features

The hardware architecture consists of:

IBM PowerPC 440 RISC CPU running at 400 MHz
single-precision floating-point coprocessor implemented as soft-IP running at 200 MHz
512 MB DDR2 SDRAM operated at 200 MHz
TFT device supporting VGA resolution (640x480 pixels) with 256k colors, configured to perform autonomous fetch of video data from DDR2 memory
various peripheral devices attached to PLB_v46_0 clocked at 100 MHz
dedicated bus instance PLB_v46_1 running at 200 MHz and connecting TFT PLB master to SPLB0
peer-to-peer support from SPLB0 to MCI

The associated software includes:

colormap generation similar to MATLAB's Jet colormap, with colormap size equal to max. number of iterations
unified dual-issue fractal kernel for computation of Julia and Mandelbrot sets
computation of fractal sets at VGA resolution, i.e., a total of 307'200 pixels
interrupt-based user control including zoom, position, and colormap alterations
selection of different pre-defined Julia and Mandelbrot sets
animation of fractal sets through continuous zooming or colormap alterations
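The colormap item in the list above can be illustrated with a minimal sketch. It assumes a common piecewise-linear approximation of the Jet colormap (dark blue to cyan to yellow to dark red) and 8 bits per color channel; the names rgb8_t, jet_ramp, jet_color and make_colormap are hypothetical and not taken from the project, and the actual TFT pixel format is not modeled here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 8-bit-per-channel color; the real TFT format may differ. */
typedef struct { uint8_t r, g, b; } rgb8_t;

/* Triangular ramp with its peak at 4*t == center, clamped to [0, 1]. */
static float jet_ramp(float t, float center)
{
    float d = 4.0f * t - center;
    if (d < 0.0f)
        d = -d;
    float v = 1.5f - d;
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}

/* Jet-like color for entry `index` of a colormap with `size` entries. */
rgb8_t jet_color(int index, int size)
{
    assert(size > 1 && index >= 0 && index < size);
    float t = (float)index / (float)(size - 1);   /* position in [0, 1] */
    rgb8_t c;
    c.r = (uint8_t)(255.0f * jet_ramp(t, 3.0f) + 0.5f);
    c.g = (uint8_t)(255.0f * jet_ramp(t, 2.0f) + 0.5f);
    c.b = (uint8_t)(255.0f * jet_ramp(t, 1.0f) + 0.5f);
    return c;
}

/* Fill a table with one color per entry, e.g. size == max_iter. */
void make_colormap(rgb8_t *map, int size)
{
    for (int i = 0; i < size; i++)
        map[i] = jet_color(i, size);
}
```

With a table of max_iter entries, the iteration count of a pixel would have to be clamped to the last entry before lookup, since the kernels described further below can return values up to max_iter+1.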

Hardware Architecture

The hardware architecture is depicted below. Besides the TFT display and the serial communication, all components are part of the Xilinx ML507 FPGA prototyping board.

Hardware architecture for the fractal project. Visual output is accomplished through the VGA TFT controller and the TFT display. User control is realized through serial communication and push buttons. The VGA TFT controller implements a PLB master device, which is capable of performing autonomous data transfers from its assigned video memory, in this case a section of the DDR2 SDRAM. In this context, it is important to note that the PowerPC CPU needs to perform a cache flush of the video data once it is completely computed and ready to be displayed.

Clocking Infrastructure

The clocking scheme of the system architecture must adhere to a number of rules imposed by the interconnect and the individual devices. The applied clocking scheme and the relevant clock ratios are listed below.

Clock	Frequency	Clock Ratio
CPU core clock	400 MHz
FCM clock (APU)	200 MHz	CPU:FCM 2:1
CPM clock	200 MHz	CPU:CPM 2:1
MCI	200 MHz	CPM:MCI 1:1
DDR2	200 MHz	MCI:MEM 1:1
MPLB (PLB_v46_0)	100 MHz	CPU:MPLB 4:1, CPM:MPLB 2:1
SPLB0 (PLB_v46_1)	200 MHz	CPU:SPLB0 2:1, CPM:SPLB0 1:1
TFT	25 MHz

Glossary

APU	Auxiliary Processing Unit
BRAM	Block Random Access Memory
CPM	Clock and Power Management
DAC	Digital-to-Analog Converter
DCR	Device Control Register
DMA	Direct Memory Access
FCB	Fabric Coprocessor Bus
FCM	Fabric Coprocessor Module
FPU	Floating-Point Unit
GPIO	General Purpose Input/Output
MCI	Memory Controller Interface
MPLB	Processor Local Bus Master
SPLB	Processor Local Bus Slave
PLB	Processor Local Bus
PPC	PowerPC

Fractal Kernels

Basic Kernel

The initial kernel for computing the fractal sets has been written in plain C using floating-point arithmetic. For each data point (pixel) in the complex-valued coordinate system, the basic fractal kernel computes:

iter = 0;

while ((zr*zr + zi*zi <= max_val) && (iter++ < max_iter)) {
  // fundamental equation: z = z^2 + c
  zr_tmp = zr*zr - zi*zi + cr;  // compute real part
  zi = 2*zr*zi + ci;            // compute imaginary part
  zr = zr_tmp;
}

This kernel contains the following (dominant) floating-point operations:

4 floating-point multiplications:
   zr*zr
   zi*zi
   2*zr*zi (counts as two multiplications)

4 floating-point add/sub operations:
   zr² + zi²
   zr² - zi² + cr (counts as two operations)
   2*zr*zi + ci

1 floating-point compare:
   zr² + zi² <= max_val

Here, I only focus on add/sub, multiply and compare operations and intentionally neglect any load/store instructions. According to the FPU data sheet, the add/sub operation requires 5 (6) clock cycles, the multiplication 4 (6) clock cycles, and the compare operation 4 (4) clock cycles in single- (double-) precision data format for the 200 MHz high-speed FPU variant. Moreover, the FPU coprocessor is specified to be fully pipelined for add/sub and multiply operations, and incorporates 32 floating-point registers.
In the end, the fractal kernel exhibits 9 dominant floating-point operations (FLOPs) per iteration, a fact that is used in the subsequent performance analysis.
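For experiments on a host PC, the loop above can be wrapped in a function. The following is a minimal sketch, not the project's actual code: the function names are hypothetical, and the way the Julia and Mandelbrot cases are derived from the same loop follows the textbook definitions (Julia: z starts at the pixel coordinate and c is constant; Mandelbrot: z starts at 0 and c is the pixel coordinate). How the project's unified kernel maps its parameters is not stated in the text.

```c
#include <assert.h>

/* Escape-time loop with the same structure as the basic kernel above.
 * Returns the iteration counter, which lies in [0, max_iter + 1]. */
int fractal_iter(float zr, float zi, float cr, float ci,
                 float max_val, int max_iter)
{
    int iter = 0;
    assert(max_iter >= 0);
    while ((zr*zr + zi*zi <= max_val) && (iter++ < max_iter)) {
        float zr_tmp = zr*zr - zi*zi + cr;  /* real part of z^2 + c */
        zi = 2*zr*zi + ci;                  /* imaginary part of z^2 + c */
        zr = zr_tmp;
    }
    return iter;
}

/* Julia set: z starts at the pixel coordinate (x, y), c is a constant. */
int julia_iter(float x, float y, float cr, float ci,
               float max_val, int max_iter)
{
    return fractal_iter(x, y, cr, ci, max_val, max_iter);
}

/* Mandelbrot set: z starts at 0, c is the pixel coordinate (x, y). */
int mandelbrot_iter(float x, float y, float max_val, int max_iter)
{
    return fractal_iter(0.0f, 0.0f, x, y, max_val, max_iter);
}
```

A call such as julia_iter(x, y, -0.4f, 0.6f, 4.0f, 130) would then yield the iteration count for one pixel. The value 4.0f for max_val (escape radius 2, squared) is an assumption; the text does not state the value used in the project.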

Dual-Issue Kernel

A successful attempt to improve the computational performance was made by extending the basic fractal kernel to a dual-issue variant. The revised fractal kernel now computes two independent (adjacent) pixels per loop iteration and thus improves the computational performance based on the following rationale:

Temporal Locality:	The data required for processing is still present in the data cache and/or the CPU/FPU registers. No extra load/store operations are required.
Spatial Locality:	The data required for processing is accessed through one or several contiguous memory access operations and is cached in the corresponding cache locations. In general, sequential access on contiguous data is superior to random access.
Data Dependency:	i) Identical operations on the same data can be fused, e.g., the twofold occurrence of zr*zr (or zi*zi) in the code example above is merged by the C compiler into a single operation and the corresponding intermediate result is re-used. ii) Independent data can be interleaved in order to increase the computational load on certain processing blocks. As such, a certain degree of concurrency in data processing can increase the aggregated computational throughput. Nevertheless, the applicable degree of concurrency depends on the individual CPU/coprocessor architecture, e.g., the number of CPU/FPU registers, support for multiple pipelined instructions, and the cache architecture. Therefore, successful application of concurrency requires careful software adaptations in order to achieve increased aggregated computational performance.

In short, the main goal of the dual-issue fractal kernel is to increase the average load on the FPU by exploiting its support for pipelined instructions. This goal is ultimately motivated by the fundamental guiding principle 'Make the common case fast!'.

The implementation of the dual-issue fractal kernel looks as follows:

iter0 = 0;
iter1 = 0;
do_flag0 = 1;
do_flag1 = 1;

while (do_flag0 || do_flag1) {

  if (do_flag0) {
    do_flag0 = ((zr0*zr0 + zi0*zi0 <= max_val) && (iter0++ < max_iter));
    // fundamental equation: z = z^2 + c
    zr_tmp0 = zr0*zr0 - zi0*zi0 + cr0;  // compute real part
    zi0 = 2*zr0*zi0 + ci0;              // compute imaginary part
    zr0 = zr_tmp0;
  }

  if (do_flag1) {
    do_flag1 = ((zr1*zr1 + zi1*zi1 <= max_val) && (iter1++ < max_iter));
    // fundamental equation: z = z^2 + c
    zr_tmp1 = zr1*zr1 - zi1*zi1 + cr1;  // compute real part
    zi1 = 2*zr1*zi1 + ci1;              // compute imaginary part
    zr1 = zr_tmp1;
  }
}

Further experiments with a triple-issue fractal kernel showed a decrease in overall performance.

Please note that the above code examples allow the variable iter to end up in the range [0, max_iter+1], depending on which abort condition terminates the loop; the value max_iter+1 occurs when the iteration limit is reached.
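A simple way to gain confidence in the two-pixel loop is to compare its iteration counts against the single-pixel loop on a host PC. The sketch below wraps both loops from this section in functions; the type point_t and the names iter_single and iter_pair are hypothetical and only serve this illustration.

```c
#include <assert.h>

/* Hypothetical container for one pixel: start value z and constant c. */
typedef struct { float zr, zi, cr, ci; } point_t;

/* Single-pixel loop, same structure as the basic kernel. */
int iter_single(point_t p, float max_val, int max_iter)
{
    int iter = 0;
    assert(max_iter >= 0);
    while ((p.zr*p.zr + p.zi*p.zi <= max_val) && (iter++ < max_iter)) {
        float t = p.zr*p.zr - p.zi*p.zi + p.cr;
        p.zi = 2*p.zr*p.zi + p.ci;
        p.zr = t;
    }
    return iter;
}

/* Two-pixel loop, same structure as the dual-issue kernel: each pixel
 * keeps iterating until its own flag drops, the loop ends when both did. */
void iter_pair(point_t p0, point_t p1, float max_val, int max_iter,
               int *iter0, int *iter1)
{
    int i0 = 0, i1 = 0;
    int run0 = 1, run1 = 1;
    assert(max_iter >= 0);
    while (run0 || run1) {
        if (run0) {
            run0 = (p0.zr*p0.zr + p0.zi*p0.zi <= max_val) && (i0++ < max_iter);
            float t = p0.zr*p0.zr - p0.zi*p0.zi + p0.cr;
            p0.zi = 2*p0.zr*p0.zi + p0.ci;
            p0.zr = t;
        }
        if (run1) {
            run1 = (p1.zr*p1.zr + p1.zi*p1.zi <= max_val) && (i1++ < max_iter);
            float t = p1.zr*p1.zr - p1.zi*p1.zi + p1.cr;
            p1.zi = 2*p1.zr*p1.zi + p1.ci;
            p1.zr = t;
        }
    }
    *iter0 = i0;
    *iter1 = i1;
}
```

Note that the two-pixel loop updates z once more after the abort condition has been detected. This extra update does not change the iteration counter, because the flag prevents any further evaluation for that pixel.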

Implementation Results


Julia set with max_iter = 130 and c = -0.4 + 0.6i as rendered by the embedded system at VGA resolution (equivalent MATLAB picture)		Mandelbrot set with max_iter = 120 and c = 0 as rendered by the embedded system at VGA resolution (equivalent MATLAB picture)


The Xilinx ML507 FPGA prototyping board with additional cooling solution including fan. Additional cooling is mandatory when using a lot of external high-speed I/O switching, in this case the DDR2 SDRAM.		Julia set computed on the Xilinx ML507 prototyping platform at VGA resolution (before fixing the vertically mirrored image).


Julia set at increased zoom level. The number of colors in the colormap equals the maximum number of iterations.		Julia set at further increased zoom level. Navigation is either accomplished through keyboard cursor keys and RS232 transmission, or through push buttons on the FPGA board.


Mandelbrot set computed on the Xilinx ML507 prototyping platform.		Mandelbrot set at increased zoom level.


Mandelbrot set at further increased zoom level.		The maximum number of iterations can be dynamically increased at any zoom level in order to improve the level of details. Changing the maximum number of iterations instantly affects the colormap generation.

Performance Assessment

Using VGA resolution, the fractal image computation involves a total of 307'200 pixels. Each pixel represents a complex-valued number consisting of two single- or double-precision floating-point values, i.e., 8 or 16 bytes per pixel in total. The total computation time per image is shown in the tables below for one selected Julia set and one selected Mandelbrot set.

Julia Set

Parameters:
c = -0.4 + 0.6i
x_origin = -1.4400
y_origin = -1.0800
resolution = 0.0045
max_iter = 130
dimensions = 640x480 pixels
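The parameters above suggest a linear mapping from pixel indices to the complex plane: 640 x 0.0045 = 2.88 and 480 x 0.0045 = 2.16, so the image spans [-1.44, 1.44] horizontally and [-1.08, 1.08] vertically, with the origin values marking one corner. The following sketch shows such a mapping under that assumption; view_t and pixel_to_point are hypothetical names, and the direction of the row axis (top-down or bottom-up) is also an assumption.

```c
#include <assert.h>

/* Hypothetical description of the visible region of the complex plane. */
typedef struct {
    double x_origin;    /* real coordinate of pixel column 0 */
    double y_origin;    /* imaginary coordinate of pixel row 0 */
    double resolution;  /* step in the complex plane per pixel */
    int width;          /* image width in pixels */
    int height;         /* image height in pixels */
} view_t;

/* Map pixel (col, row) to the complex coordinate x + i*y. */
void pixel_to_point(const view_t *v, int col, int row, double *x, double *y)
{
    assert(col >= 0 && col < v->width);
    assert(row >= 0 && row < v->height);
    *x = v->x_origin + col * v->resolution;
    *y = v->y_origin + row * v->resolution;
}
```

With the Julia parameters listed above, pixel (320, 240) lands at the center of the complex plane up to rounding. Zooming would then amount to reducing resolution and adjusting the origin values accordingly.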

Measurement Results

average number of iterations per pixel = 21.3

FPU coprocessor	not present		single-precision FPU	double-precision FPU
data format	single-precision	double-precision	single-precision	double-precision
basic kernel	20.185s	30.839s	1.248s	1.416s
dual-issue kernel	20.835s	31.321s	0.944s	1.007s
performance gain	-3.2 %	-1.6 %	24.4 %	28.9 %

Floating-Point Performance Analysis

Since we have previously identified the dominant floating-point operations per iteration of the fractal kernel, i.e., 9 FLOPs per iteration, the corresponding floating-point performance can be calculated based on the following parameters:

pixels per image	307'200
average iterations per pixel	21.3
FLOPs per iteration	9

FPU coprocessor	not present		single-precision FPU	double-precision FPU
data format	single-precision	double-precision	single-precision	double-precision
basic kernel	2'918'000	1'910'000	47'188'000	41'589'000
dual-issue kernel	2'827'000	1'880'000	62'384'000	58'481'000
performance gain	-3.2 %	-1.6 %	24.4 %	28.9 %

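Numbers represent FLOPS (floating-point operations per second). The performance gain row repeats the relative reduction in computation time from the measurement results above; expressed as a ratio of FLOPS, the dual-issue kernel reaches about 1.32 times the throughput of the basic kernel with the single-precision FPU (62'384'000 / 47'188'000).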
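The entries of both tables can be reproduced from the listed parameters and the measured times. The sketch below shows the arithmetic; the function names are hypothetical, and the gain is computed as the relative reduction in computation time, which matches the percentages given in the tables.

```c
#include <assert.h>

/* Sustained floating-point operations per second for one image. */
double sustained_flops(double pixels, double avg_iter,
                       double flop_per_iter, double seconds)
{
    assert(seconds > 0.0);
    return pixels * avg_iter * flop_per_iter / seconds;
}

/* Relative reduction in computation time, in percent. */
double time_gain_percent(double t_basic, double t_dual)
{
    assert(t_basic > 0.0);
    return 100.0 * (t_basic - t_dual) / t_basic;
}
```

For example, sustained_flops(307200, 21.3, 9, 1.248) evaluates to roughly 47.19 million, and time_gain_percent(1.248, 0.944) to roughly 24.4, in line with the single-precision FPU column of the Julia set tables.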

Mandelbrot Set

Parameters:
c = 0
x_origin = -1.0600
y_origin = 0.2350
resolution = 0.0003
max_iter = 120
dimensions = 640x480 pixels

Measurement Results

average number of iterations per pixel = 33.2

FPU coprocessor	not present		single-precision FPU	double-precision FPU
data format	single-precision	double-precision	single-precision	double-precision
basic kernel	30.985s	47.417s	1.898s	2.171s
dual-issue kernel	31.761s	47.752s	1.426s	1.510s
performance gain	-2.5 %	-0.7 %	24.9 %	30.4 %

Floating-Point Performance Analysis

pixels per image	307'200
average iterations per pixel	33.2
FLOPs per iteration	9

FPU coprocessor	not present		single-precision FPU	double-precision FPU
data format	single-precision	double-precision	single-precision	double-precision
basic kernel	2'962'000	1'936'000	48'362'000	42'281'000
dual-issue kernel	2'890'000	1'922'000	64'370'000	60'789'000
performance gain	-2.5 %	-0.7 %	24.9 %	30.4 %

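Numbers represent FLOPS (floating-point operations per second). The performance gain row repeats the relative reduction in computation time from the measurement results above; expressed as a ratio of FLOPS, the dual-issue kernel reaches about 1.33 times the throughput of the basic kernel with the single-precision FPU (64'370'000 / 48'362'000).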

Summary & Conclusions

The PowerPC 440 floating-point coprocessor realized as soft-IP and clocked at 200 MHz has a theoretical asymptotic computational performance of 200 MFLOPS. However, it is nearly impossible for practical software to reach this performance due to the inherent overhead of load/store operations, data dependencies, and the limited number of FPU registers.

Nevertheless, it was possible in this project to achieve an FPU performance of up to 64 MFLOPS at application level with software written solely in the C programming language. This is a solid result; however, there may be room for further improvement by implementing the fractal kernel with assembler instructions.
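As a rough cross-check of these figures, the ratio of sustained to peak throughput can be worked out from the numbers reported above, taking the best measured case (64.37 MFLOPS for the single-precision Mandelbrot run with the dual-issue kernel) and assuming that the 200 MFLOPS peak corresponds to one floating-point operation per FPU clock cycle:

```latex
\[
\eta = \frac{64.37\ \text{MFLOPS}}{200\ \text{MFLOPS}} \approx 0.32,
\qquad
\frac{200 \times 10^{6}\ \text{cycles/s}}{64.37 \times 10^{6}\ \text{operations/s}}
\approx 3.1\ \text{FPU cycles per operation}
\]
```

In other words, under this assumption the FPU completes one of the counted floating-point operations about every third clock cycle on average.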

Outlook

Future steps and improvements might involve:

PS/2 or USB mouse support
Low-level assessment of fractal kernel including FPU load conditions (instruction-level assessment)
Implementation of optimized fractal kernel using assembler instructions
Implementation of fractal kernel as dedicated PowerPC 440 coprocessor module
Implementation of fractal kernel as dedicated peripheral hardware block by employing floating-point IP components

References

[1] Xilinx Inc., Virtex-5 Family Brochure, Dec 2008

[2] Xilinx Inc., Virtex-5 Website

[3] Xilinx Inc., Xilinx Virtex-5 ML507 FPGA platform

Last updated: 2012/12/30
