Namazu

Specification

Namazu comprises one server hosting an Altera FPGA. The entire system is supplied by Nallatech and is a Nallatech 385 FPGA.

The IBM host:

  • A Six-core Intel Sandybridge CPU
  • 32 GB RAM

Altera FPGA:

  • Altera Stratix V PCIe-385N A7
  • 8GB as 2 x 4GB. CL_DEVICE_GLOBAL_MEM_SIZE is 4GB.

Software:

  • OpenCL 1.0 Altera SDK for OpenCL, Version 13.1.4
  • Quartus II FPGA Designer
  • Altera IP

Getting Access to Namazu

To gain access, please email the ITS RI team at its-ri-team@manchester.ac.uk.

Restrictions on Access

Priority is given to those who funded the system, but other University of Manchester academics and computational researchers may gain access to evaluation and pump-priming purposes.

Accessing the Host Node

Those who have been given access to namazu can login to the node by using qrsh from the zrek login node

qrsh -l fpga_altera bash

# or (slightly shorter)
qrsh -l fpgaalt bash

No password required from the zrek login node.

Any references to the host in the instructions below are referring to this node (namazu).

Setup

To use the FPGA hardware and software load the following modulefile:

module load apps/binapps/altera/13.1.4

Compilation

Compiling OpenCL programs for the Altera FPGA involves:

  • Compiling a host program to run on the CPU. This is done using a standard compiler such as gcc or g++. The host program must be linked against the Altera OpenCL host libraries.
  • Compiling a kernal program to run on the FPGA. Use the aoc compiler provided. This is done offline before you execute the host program. This differs from GPU-based OpenCL where the host code usually compiles the kernel code at runtime (i.e. online) and no separate kernel code compilation step by the user is needed. For the FPGA we compile the kernel code in advance of running the host code.

Kernel Compilation

To compile an OpenCL kernel run the following:

aoc -v -c --board pcie385n_a7 my_kernel.cl

where

  • -v makes the compiler report which stage it is at. This is recommended otherwise you may think the compiler has hung.
  • -c generates an intermediate .aoco object file. This is fairly quick to produce and is strongly recommended because part of the compilation processes (the final linking) can take a long time. Separating the the process in to two stages allows you to rpeeat the first stage until you are ready to create the final executable kernel code (see below).
  • --board selects the target hardware platform
  • my_kernel.cl is your kernel source code

You may also add the --report and --estimate-throughput flags to the above compiler line to get more detailed information about the compilation and some performance estimates.

Once your .aoco file compiles you must link it to produce the final .aocx executable. This can take a long time (over an hour for the hello_world example).

aoc -v my_kernel.aoco

Very little is reported by the compiler while it is linking the final executable. A new sub-directory will be created in your compilation directory containing a file named quartus_sh_compile.log which can be examined to see where the compiler is up to.

The resulting my_kernel.aocx file will be loaded on to the FPGA by your host program when you run the host program.

Host Compilation

The host code is compiled with a standard compiler such as gcc. You must add compiler flags to pick up the Altera OpenCL host header files and libraries. For example:

g++ -o my_host.exe -I$ALTERAOCLSDKROOT/host/include \
                   -L$ALTERAOCLSDKROOT/linux64/lib \
                   -L$ALTERAOCLSDKROOT/host/linux64/lib \
                   -lalteracl -ldl -lalterahalmmd \
                   -lalterammdpcie -lelf -lrt -lstdc++ my_host.cpp

The compiler settings can be discovered by running

aocl compile-config
aocl link-config

and a simple Makefile example can be generated by running

aocl makefile

Executing the OpenCL Program

When you run the host code it will load the compiled kernel code on to the FPGA. You do not manually flash the kernel code to the device or manually reprogramme the device. However, your host code will be slightly different to existing GPU-based OpenCL host code:

When developing GPU-based OpenCL code your host code typically calls

f = fopen( filename, "r" );                               // Open plain-text file (.cl)
fread(buffer, ..., f );                                   // Read kernel source code in to a string
program = clCreateProgramWithSource(..., buffer, ...);    // Compile source
clBuildProgram(program, ...);                             // Link code
kernel = clCreateKernel(program, kernel_name, ...);       // Get specific kernel to run
// Set up kernel args ...
// Then finally send the kernel to the GPU
clEnqueueNDRangeKernel(..., kernel, ...);

to compile and link kernel source. For the Altera FPGA your host code should read a binary kernel file (we compiled it ourselves offline earlier):

f = fopen( filename, "rb" );                              // Open binary file (.aocx)
fread(buffer, ..., f );                                   // Read compiled kernel
program = clCreateProgramWithBinary(..., buffer, ...);    // Doesn't actually compile
clBuildProgram(program, ...);                             // Doesn't actually build
kernel = clCreateKernel(program, kernel_name, ...);       // Get specific kernel to run
// Set up kernel args ...
// Then finally programme the kernel on to the FPGA
clEnqueueNDRangeKernel(..., kernel, ...)

This last step does the programming of the FPGA device. It also programmes all kernels in the program on to the device (you can have more than one kernel defined in your source file – all of them are compiled in to the same program.) If you repeatedly run the kernel on the device (which is perfectly valid!) then the device won’t be reprogrammed every time (so the first call to clEnqueueNDRangeKernel() will be slower than subsequent calls).

Try an Example

The Altera hello_world is available as follows:

module load apps/binapps/altera/13.1.4

# Create a local directory
mkdir ~/altera
cd ~/altera

# Unpack the example in to your local directory
tar xzf $ALTERA_HOME/examples/archive/exm_opencl_hello_world_linux64.tgz

# Go in to the hello_world subdir
cd hello_world

# Compile the host code (creates bin/hello_world)
make

# The kernel (device) code has already been compiled. Hence
# you can now run the example as follows:
cd bin
./hello_world
    #
    # This will load the hello_world_131_pcie385n_a7.aocx kernel on to the device

To compile the device code yourself (carrying on from the previous example):

cd ../device
aoc -v -c --board pcie385n_a7 hello_world.cl
  #
  # This will create a hello_world.aoco file (and a directory).

# Now need to link the .aoco file to create a .aocx file.
# !!! WARNING - THIS IS SLOW - CAN TAKE OVER AN HOUR !!!
aoc -v hello_world.aoco
  #
  # This will create a hello_world.aocx file

# Now run the host code with our compile .aocx file
cd ../bin
mv hello_world_131_pcie385n_a7.aocx hello_world_131_pcie385n_a7.aocx.original
cp ../device/hello_world.aocx hello_world_131_pcie385n_a7.aocx
./hello_world

Further Information

Local and online docs:

Last modified on February 2, 2016 at 3:19 pm by Site Admin