Namazu
Specification
Namazu comprises one server hosting an Altera FPGA. The entire system is supplied by Nallatech and is a Nallatech 385 FPGA.
The IBM host:
- A Six-core Intel Sandybridge CPU
- 32 GB RAM
Altera FPGA:
- Altera Stratix V PCIe-385N A7
- 8GB as 2 x 4GB. CL_DEVICE_GLOBAL_MEM_SIZE is 4GB.
Software:
- OpenCL 1.0 Altera SDK for OpenCL, Version 13.1.4
- Quartus II FPGA Designer
- Altera IP
Getting Access to Namazu
To gain access, please email the ITS RI team at its-ri-team@manchester.ac.uk.
Restrictions on Access
Priority is given to those who funded the system, but other University of Manchester academics and computational researchers may gain access to evaluation and pump-priming purposes.
Accessing the Host Node
Those who have been given access to namazu can login to the node by using qrsh from the zrek login node
qrsh -l fpga_altera bash # or (slightly shorter) qrsh -l fpgaalt bash
No password required from the zrek login node.
Any references to the host in the instructions below are referring to this node (namazu).
Setup
To use the FPGA hardware and software load the following modulefile:
module load apps/binapps/altera/13.1.4
Compilation
Compiling OpenCL programs for the Altera FPGA involves:
- Compiling a host program to run on the CPU. This is done using a standard compiler such as
gcc
org++
. The host program must be linked against the Altera OpenCL host libraries. - Compiling a kernal program to run on the FPGA. Use the
aoc
compiler provided. This is done offline before you execute the host program. This differs from GPU-based OpenCL where the host code usually compiles the kernel code at runtime (i.e. online) and no separate kernel code compilation step by the user is needed. For the FPGA we compile the kernel code in advance of running the host code.
Kernel Compilation
To compile an OpenCL kernel run the following:
aoc -v -c --board pcie385n_a7 my_kernel.cl
where
-v
makes the compiler report which stage it is at. This is recommended otherwise you may think the compiler has hung.-c
generates an intermediate.aoco
object file. This is fairly quick to produce and is strongly recommended because part of the compilation processes (the final linking) can take a long time. Separating the the process in to two stages allows you to rpeeat the first stage until you are ready to create the final executable kernel code (see below).--board
selects the target hardware platformmy_kernel.cl
is your kernel source code
You may also add the --report
and --estimate-throughput
flags to the above compiler line to get more detailed information about the compilation and some performance estimates.
Once your .aoco
file compiles you must link it to produce the final .aocx
executable. This can take a long time (over an hour for the hello_world example).
aoc -v my_kernel.aoco
Very little is reported by the compiler while it is linking the final executable. A new sub-directory will be created in your compilation directory containing a file named quartus_sh_compile.log
which can be examined to see where the compiler is up to.
The resulting my_kernel.aocx
file will be loaded on to the FPGA by your host program when you run the host program.
Host Compilation
The host code is compiled with a standard compiler such as gcc
. You must add compiler flags to pick up the Altera OpenCL host header files and libraries. For example:
g++ -o my_host.exe -I$ALTERAOCLSDKROOT/host/include \ -L$ALTERAOCLSDKROOT/linux64/lib \ -L$ALTERAOCLSDKROOT/host/linux64/lib \ -lalteracl -ldl -lalterahalmmd \ -lalterammdpcie -lelf -lrt -lstdc++ my_host.cpp
The compiler settings can be discovered by running
aocl compile-config aocl link-config
and a simple Makefile example can be generated by running
aocl makefile
Executing the OpenCL Program
When you run the host code it will load the compiled kernel code on to the FPGA. You do not manually flash the kernel code to the device or manually reprogramme the device. However, your host code will be slightly different to existing GPU-based OpenCL host code:
When developing GPU-based OpenCL code your host code typically calls
f = fopen( filename, "r" ); // Open plain-text file (.cl) fread(buffer, ..., f ); // Read kernel source code in to a string program = clCreateProgramWithSource(..., buffer, ...); // Compile source clBuildProgram(program, ...); // Link code kernel = clCreateKernel(program, kernel_name, ...); // Get specific kernel to run // Set up kernel args ... // Then finally send the kernel to the GPU clEnqueueNDRangeKernel(..., kernel, ...);
to compile and link kernel source. For the Altera FPGA your host code should read a binary kernel file (we compiled it ourselves offline earlier):
f = fopen( filename, "rb" ); // Open binary file (.aocx) fread(buffer, ..., f ); // Read compiled kernel program = clCreateProgramWithBinary(..., buffer, ...); // Doesn't actually compile clBuildProgram(program, ...); // Doesn't actually build kernel = clCreateKernel(program, kernel_name, ...); // Get specific kernel to run // Set up kernel args ... // Then finally programme the kernel on to the FPGA clEnqueueNDRangeKernel(..., kernel, ...)
This last step does the programming of the FPGA device. It also programmes all kernels in the program on to the device (you can have more than one kernel defined in your source file – all of them are compiled in to the same program.) If you repeatedly run the kernel on the device (which is perfectly valid!) then the device won’t be reprogrammed every time (so the first call to clEnqueueNDRangeKernel() will be slower than subsequent calls).
Try an Example
The Altera hello_world is available as follows:
module load apps/binapps/altera/13.1.4 # Create a local directory mkdir ~/altera cd ~/altera # Unpack the example in to your local directory tar xzf $ALTERA_HOME/examples/archive/exm_opencl_hello_world_linux64.tgz # Go in to the hello_world subdir cd hello_world # Compile the host code (creates bin/hello_world) make # The kernel (device) code has already been compiled. Hence # you can now run the example as follows: cd bin ./hello_world # # This will load the hello_world_131_pcie385n_a7.aocx kernel on to the device
To compile the device code yourself (carrying on from the previous example):
cd ../device aoc -v -c --board pcie385n_a7 hello_world.cl # # This will create a hello_world.aoco file (and a directory). # Now need to link the .aoco file to create a .aocx file. # !!! WARNING - THIS IS SLOW - CAN TAKE OVER AN HOUR !!! aoc -v hello_world.aoco # # This will create a hello_world.aocx file # Now run the host code with our compile .aocx file cd ../bin mv hello_world_131_pcie385n_a7.aocx hello_world_131_pcie385n_a7.aocx.original cp ../device/hello_world.aocx hello_world_131_pcie385n_a7.aocx ./hello_world
Further Information
Local and online docs:
- Altera OpenCL SDK documentation
- Altera OpenCL SDK Getting Started Guide (pdf)
- Altera OpenCL Examples (be sure to download the 13.1 versions – see archive downloads on each example page).