AMD Interlagos: Bulldozer architecture for the server world - Software Ecosystem

Software Ecosystem

To take advantage of the Bulldozer architecture, software must account for its peculiarities:

  • optimize for the module architecture: the operating system or hypervisor must recognize Bulldozer and schedule accordingly;
  • exploit the new Bulldozer instructions, in particular the SIMD instructions (SSEn, AVX), cryptographic acceleration and the AMD-exclusive FMA4 and XOP;
  • exploit the performance counters to monitor and improve the software;
  • exploit the C6 state to reduce power consumption and/or enable a more aggressive Turbo Core (e.g. through core parking);
  • exploit the new virtualization acceleration instructions to achieve performance almost identical to that of a native machine.
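
As a minimal sketch of how software can check for these features, the script below reports which of the instruction-set extensions mentioned above the running CPU advertises. It assumes a Linux system, where the kernel lists CPU features on the "flags" line of /proc/cpuinfo; the helper names parse_flags and bulldozer_features are made up for this illustration.

```python
from pathlib import Path

# Linux /proc/cpuinfo flag names for extensions discussed in this article
# (svm is the flag for AMD's hardware virtualization support).
WANTED = ("avx", "aes", "fma4", "xop", "svm")


def parse_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def bulldozer_features(cpuinfo_text):
    """Map each wanted extension to True/False for the given cpuinfo text."""
    flags = parse_flags(cpuinfo_text)
    return {name: name in flags for name in WANTED}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    # Assumption: on systems without /proc/cpuinfo nothing is detected.
    text = path.read_text() if path.exists() else ""
    for name, present in bulldozer_features(text).items():
        print(f"{name:5s} {'yes' if present else 'no'}")
```

A program could use such a check to choose at run time between a generic code path and one built for FMA4 or XOP.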

Among the various instructions supported by AMD Bulldozer CPUs, we describe here the two that are exclusive to AMD: FMA4 and XOP.

FMA4 implements the fused multiply-accumulate operation d = a + b x c and gives the programmer great flexibility, because all four registers (the destination and the three sources) can be chosen independently. The instruction executes in a pipelined FPU: in a single clock cycle a Bulldozer module can perform one 256-bit or two 128-bit floating-point FMAs and, simultaneously, one 256-bit or two 128-bit integer FMAs.
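
To illustrate why a fused operation differs from a separate multiply and add, the sketch below emulates d = a + b x c with a single final rounding, using exact rational arithmetic, and compares it with the ordinary two-step computation. It models only the rounding behavior in double precision, not the hardware instruction; the function names are illustrative.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate d = a + b * c with one rounding at the very end."""
    exact = Fraction(a) + Fraction(b) * Fraction(c)  # exact rational result
    return float(exact)  # single rounding to double precision


def separate_multiply_add(a, b, c):
    """Ordinary computation: b * c is rounded, then the sum is rounded again."""
    return a + b * c


# Inputs chosen so that b * c = 1 - 2**-60, which rounds to 1.0 on its own.
a = -1.0
b = 1.0 + 2.0 ** -30
c = 1.0 - 2.0 ** -30

fused = fused_multiply_add(a, b, c)
separate = separate_multiply_add(a, b, c)
print("fused:   ", fused)
print("separate:", separate)
```

With these inputs the two-step version loses the small term entirely, while the fused version keeps it, which is why numerical code often benefits from FMA in accuracy as well as in speed.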

This class of instructions accelerates applications that rely heavily on this type of calculation. IBM, SPARC and Itanium CPUs already offer such instructions. Intel's x86 CPUs will implement them in 2013, but in the more limited FMA3 version, in which the result overwrites one of the three source registers. This is due to the internal microarchitecture of Intel CPUs, which cannot handle more than 3 registers per instruction.
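
The difference between the two forms can be sketched with a toy register file. This is a simplified model, not the full instruction semantics: registers are dictionary entries, and fma4 and fma3 are hypothetical helpers that mimic a non-destructive four-operand form and a destructive three-operand form (here, the variant that overwrites the addend).

```python
def fma4(regs, dst, a, b, c):
    """Four-operand form: dst = a + b * c, all three sources preserved."""
    regs[dst] = regs[a] + regs[b] * regs[c]


def fma3(regs, acc, b, c):
    """Three-operand form: acc = acc + b * c, the accumulator is overwritten."""
    regs[acc] = regs[acc] + regs[b] * regs[c]


# FMA4: the result lands in a fourth register, the sources stay intact.
r4 = {"xmm0": 0.0, "xmm1": 10.0, "xmm2": 3.0, "xmm3": 4.0}
fma4(r4, "xmm0", "xmm1", "xmm2", "xmm3")

# FMA3: keeping the original addend requires an extra copy first.
r3 = {"xmm0": 0.0, "xmm1": 10.0, "xmm2": 3.0, "xmm3": 4.0}
r3["xmm0"] = r3["xmm1"]  # extra register move to save the addend
fma3(r3, "xmm0", "xmm2", "xmm3")

print(r4["xmm0"], r3["xmm0"])
```

Both sequences produce the same value; the three-operand form simply needs one more register move whenever all the inputs must survive.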

XOP comprises 128-bit and 256-bit instructions with 3 or 4 operands for horizontal addition and subtraction, comparison, shift, rotation, permutation, integer multiply and accumulate, fraction extraction, and conversion to and from the 16-bit floating-point format used in video cards.
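
As an illustration of two of the operation classes listed above, the sketch below models a per-lane rotation and a horizontal addition on a 128-bit vector represented as a Python list of lanes. It is a behavioral model under simplifying assumptions (unsigned lanes, left rotation by a single shared count); the function names are illustrative and do not correspond to compiler intrinsics.

```python
def rotate_lanes_left(lanes, count, width=32):
    """Rotate every `width`-bit lane left by `count` bits."""
    mask = (1 << width) - 1
    count %= width
    return [((x << count) | (x >> (width - count))) & mask for x in lanes]


def horizontal_add_pairs(lanes, width=16):
    """Add adjacent lanes pairwise, widening each sum to 2 * width bits."""
    mask = (1 << (2 * width)) - 1
    return [(lanes[i] + lanes[i + 1]) & mask for i in range(0, len(lanes), 2)]


# Four 32-bit lanes make up one 128-bit vector.
rotated = rotate_lanes_left([0x80000001, 0x00000001, 0xF0000000, 0], 4)
print([hex(x) for x in rotated])

# Eight 16-bit lanes are reduced to four 32-bit sums.
summed = horizontal_add_pairs([1, 2, 3, 4, 0xFFFF, 0xFFFF, 0, 5])
print([hex(x) for x in summed])
```

Without dedicated instructions, a rotation typically takes two shifts and an OR per vector, which is the kind of sequence these operations collapse into one.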

To fully exploit the power of the Bulldozer CPU, software must be compiled with the SSE, 128-bit AVX and FMA4 options enabled and linked against version 5.x of the ACML library.

Software that uses only the instructions shared with Intel CPUs and was not built with the Intel compiler (which checks that the CPU is an Intel one before enabling the new instructions) does not need to be recompiled. To make use of the FMA4 or XOP instructions, you must recompile the software or link it against the ACML 5.x libraries.

Support for the new virtualization instructions is being implemented in all recent software and kernels.

Bulldozer is supported by all the latest compilers except Intel's, which does not support XOP and FMA4 and in which the use of AVX must be forced by hand with the -mavx switch.
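
For GCC, one possible way to select the relevant options is sketched below. The option names -mavx, -mfma4 and -mxop are GCC's; the helper gcc_flags_for and the file name kernel.c are hypothetical, and the command is only printed, not executed.

```python
# Map Linux /proc/cpuinfo flag names to the matching GCC options.
GCC_OPTION = {"avx": "-mavx", "fma4": "-mfma4", "xop": "-mxop"}


def gcc_flags_for(cpu_flags):
    """Return the GCC options for the extensions present in cpu_flags."""
    return [opt for name, opt in GCC_OPTION.items() if name in cpu_flags]


# Example flag set, as a Bulldozer CPU might report it (assumed, not measured).
flags = gcc_flags_for({"sse2", "avx", "fma4", "xop"})
command = ["gcc", "-O2", *flags, "-o", "kernel", "kernel.c"]
print(" ".join(command))
```

When the build targets Bulldozer only, GCC's -march=bdver1 option enables these extensions in a single step.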

Using Libraries:

  • the ACML library: version 4.0 is compatible with AMD Bulldozer, while version 5.0 is optimized for it. It contains basic linear algebra routines (BLAS), advanced linear algebra routines (LAPACK), and routines for FFTs and random numbers. Version 5.1, which contains the same routines extended to double precision and complex numbers, is in development;
  • the AMD libm library, whose version 3.0 is optimized for Bulldozer, contains the standard math functions tuned for this CPU;
  • finally, the Intel IPP library, which is limited to its SSE3 version on AMD CPUs.

 
