Transparent use of library packages optimized for Intel® architecture

03 Apr, 2018

by Victor Rodriguez Bahena

This blog is the first in a "behind the magic" series:

  • Part 1: Transparent use of library packages optimized for Intel® architecture
  • Part 2: Profile-guided optimizations
  • Part 3: Boot time: how to fix it

Subsequent blogs in this series will be published and linked as they become available.

Introduction and problem statement

In the past year, impressive technologies for the data center market were launched; however, many developers do not take advantage of these advances immediately. CPU architectures gain powerful new instructions, but developers often choose backward compatibility over the capabilities of newer hardware. For example, recent processors incorporate powerful vectorization instructions, yet many Linux* distributions do not use these instructions in the code they ship. Developers face the challenge of improving performance while supporting existing installations.

In a previous article, we presented function multi-versioning (FMV) as one solution to this problem. Starting with GCC 6, FMV is supported in both C and C++. The FMV compiler feature makes it easier to develop Linux applications that take advantage of CPU enhanced instructions without the overhead of replicating functions for each target. However, modifying the source code with an “__attribute__” for each function that is a candidate for vectorization is not always a feasible option.
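As a quick refresher, an FMV function carries a target_clones attribute: the compiler emits one clone per listed target plus a resolver that picks the best match at load time. The following is a minimal sketch; the dot-product function and the target list are illustrative, not taken from the earlier article:

/* GCC 6 or later: one clone is generated per target, plus a
   resolver that selects the best clone when the program loads. */
__attribute__((target_clones("avx2","sse4.2","default")))
static double dot(const double *a, const double *b, int n)
{
  double sum = 0.0;
  for (int i = 0; i < n; i++)
    sum += a[i] * b[i];   /* loop the vectorized clones speed up */
  return sum;
}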

To address the problem of maximizing performance while preserving compatibility, the glibc project recently introduced the capability to generate optimized binaries for multiple platform targets without source code changes. This article describes the solution and provides an application example that uses a container.

How to solve the problem

glibc 2.26 adds a mechanism to support a single binary library package optimized for different platforms where library selection is performed at runtime. Hongjiu Lu, one of the authors of this article, developed the patch to make this solution possible.

The patch enables the following:

  • glibc uses hardware detection capabilities during startup to determine the platform in use.
  • glibc builds an array of hardware capability names, which are added to the search path when loading shared objects.
  • During startup, the dl_x86_cpu_features code sets the hardware characteristics by reusing the dl_platform, dl_hwcap, and dl_hwcap_mask variables.
  • During runtime, CPU_FEATURES_ARCH_P (cpu_features) and CPU_FEATURES_CPU_P (cpu_features) are used to set dl_platform with GLRO(dl_platform) = "<platform>".
  • glibc uses platform-specific paths to find its shared libraries, for example: /usr/lib64/haswell/ or /usr/lib64/haswell/avx512_1/ (a small sketch of this lookup follows this list).
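One way to watch this lookup from user space is to dlopen() a library by its soname and ask the dynamic loader where it resolved the file. This is only a hedged sketch, not part of the glibc patch: libopenblas.so.0 is simply the soname used in the examples later in this article, and dlinfo() with RTLD_DI_ORIGIN is a GNU extension.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <limits.h>
#include <stdio.h>

int main(void)
{
  /* glibc tries platform subdirectories (e.g. /usr/lib64/haswell/)
     before the generic /usr/lib64/ on capable CPUs. */
  void *h = dlopen("libopenblas.so.0", RTLD_NOW);
  if (!h) {
    fprintf(stderr, "dlopen: %s\n", dlerror());
    return 1;
  }
  char origin[PATH_MAX];
  if (dlinfo(h, RTLD_DI_ORIGIN, origin) == 0)
    printf("loaded from: %s\n", origin); /* directory glibc chose */
  dlclose(h);
  return 0;
}

Built with gcc and -ldl, this should print the haswell subdirectory on a capable system with the optimized package installed, and the generic library directory otherwise.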

For x86-64, the supported platforms are:

  • haswell: 4th generation Intel® Core™ processors with BMI1, BMI2, LZCNT, MOVBE, POPCNT, Intel® Advanced Vector Extensions 2 (Intel® AVX2), and FMA (fused multiply-add).
  • xeon_phi: Intel® Xeon Phi™ processors with AVX512F, AVX512CD, AVX512ER, and AVX512PF.

In addition, a capability named avx512_1 is defined for x86-64 to cover the Advanced Vector Extensions 512 (AVX-512) instruction sets: AVX512F, AVX512CD, AVX512BW, AVX512DQ, and AVX512VL. The sketch below shows a rough user-space approximation of these platform checks.
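glibc performs this detection internally during startup. Purely as an illustration, GCC's __builtin_cpu_supports builtin lets an ordinary program make roughly the same checks; this is an approximation for the reader, not glibc's detection code:

#include <stdio.h>

int main(void)
{
  __builtin_cpu_init();
  /* Rough user-space approximation of the platform checks above */
  if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
    puts("haswell platform candidate");
  if (__builtin_cpu_supports("avx512f"))
    puts("AVX512F present: avx512_1 capability candidate");
  return 0;
}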

Application examples

The Clear Linux* Project demonstrates how to use this glibc capability in a Linux distribution. First, select the specific math-heavy libraries to optimize and compile them with vector-instruction flags such as -march=haswell or -march=skylake-avx512. Libraries that benefit from an optimized library package are widely used for machine learning, signal processing, and big data analytics.

One library used in these fields is OpenBLAS, an open source implementation of the Basic Linear Algebra Subprograms (BLAS) standard that provides the core routines for linear algebra. Replacing the default build with one compiled for vector instructions speeds up these operations.

The following example uses the OpenBLAS cblas_ddot function to compute, entry by entry, the product of two 2x2 matrices, an operation at the heart of many scientific applications.

#include <stdio.h>
#include <cblas.h>   /* link with -lopenblas */
/* 2x2 matrices in row-major order; c holds the expected product */
static double a[4] = {1, 2, 5, 6};
static double b[4] = {3, 4, 7, 8};
static double c[4] = {17, 20, 57, 68};
int main(void)
{
  int N = 2;
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      /* row i of a (stride 1) dotted with column j of b (stride 2) */
      double res = cblas_ddot(N, &a[N * i], 1, &b[j], 2);
      printf("c[%d][%d] = %g (expected %g)\n", i, j, res, c[N * i + j]);
    }
}
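To build the example, link it against OpenBLAS, for example with gcc -O2 dot.c -lopenblas (the source file name here is arbitrary).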

When we execute this code on a system with a 4th generation Intel® Core™ processor with Intel® AVX2 capabilities but with no /usr/lib64/haswell/ optimized library, the strace output shows that the system looks for /usr/lib64/haswell/libopenblas.so.0 and cannot find it. The glibc resolver ends up using /usr/lib64/libopenblas.so.0 to provide a valid, but not optimized, cblas_ddot function.

However, if we compile OpenBLAS with platform optimizations (see this example), then we have the following list of files:

/usr/lib64/haswell/libopenblas.so
/usr/lib64/haswell/libopenblas.so.0
/usr/lib64/haswell/libopenblas_haswellp-r0.2.20.so

When we run the same dot matrix multiplication with glibc 2.26 and the optimized OpenBLAS libraries, the strace output shows that glibc finds and uses /usr/lib64/haswell/libopenblas.so.0. This selection is based on the dl_platform and dl_hwcap values set at process startup.

In the Clear Linux* Project, the OpenBLAS package includes files for systems with and without Intel® AVX2 support. Shipping both variants lets the same package speed up workloads according to the processor capabilities available at run time, an improvement that extends to machine learning and big data analytics projects that benefit from vectorized instructions.

According to KDnuggets magazine, four languages for analytics, data mining, and data science have become dominant in the last few years. These languages are R*, SAS*, Python*, and SQL*. The KDnuggets article confirms that these languages are used by 91% of data scientists. Applying the advancements in newer computing architectures to these programming languages improves the performance of data science applications.

The next example shows how this solution also benefits matrix multiplication in the R language:

v1 <- matrix(data=c(1,5,2,6),nrow=2,ncol=2)
v2 <- matrix(data=c(3,7,4,8),nrow=2,ncol=2)
v1 %*% v2
Output:
     [,1] [,2]
[1,]   17   20
[2,]   57   68

When we look at the strace output for a program running on a processor with Intel® AVX2 capabilities, we see that the glibc-linked library is /usr/lib64/haswell/libopenblas.so.0, based on the platform discovered at process startup time. Reading the file (using readelf -Ws) shows the functions for matrix multiplication, such as cblas_ddot(), cblas_sdot(), and others. Checking the instructions (using objdump) of the Haswell-optimized OpenBLAS library, we can see the use of fused multiply-add (FMA) instructions:

vfmadd132ss %xmm6,%xmm3,%xmm0
vfnmadd213sd (%r9),%xmm1,%xmm2
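For reference, the following C sketch shows the kind of source pattern that compilers turn into such vfmadd instructions. It is illustrative, not OpenBLAS source, and assumes a build with gcc -O2 -march=haswell:

#include <math.h>
/* With -march=haswell, both functions below typically compile to
   a single vfmadd instruction. */
double muladd(double a, double b, double c)
{
  return fma(a, b, c);   /* explicit fused multiply-add from C99 */
}
double muladd_implicit(double a, double b, double c)
{
  return a * b + c;      /* compiler may contract this into an FMA */
}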

Access to these optimized instructions, whether FMA or Intel® AVX, comes from the new glibc feature: a single binary library package, optimized for different platforms and linked transparently at run time.

Ways to use on containers and virtualization

Another important factor to consider is deployability. In today's data center world, the need to deploy solutions to customers quickly pushes operating system engineers to make solutions available for virtualized environments such as containers and virtual machines.

Containers provide a lightweight, stand-alone system that includes everything needed to run a program, such as system tools, system libraries, and settings. With precompiled, optimized library packages handled by glibc 2.26, a Docker* container user can take advantage of the new architecture's specialized instruction set extensions by running the command:

docker run -it clearlinux/machine-learning

From a data center administrator's point of view, this scalable and easy-to-deploy solution allows current data center applications to take advantage of new instruction set extensions without modifying the host operating system. With this change in glibc, users can run an application inside a container that links to a library, provided by the container image, optimized for the architecture of the host system, without having to do anything else.

Conclusion

Every year, data center technology sees the birth of new instructions and computer architecture technologies. Operating systems and software applications must adopt these features before customers and users can benefit from them.

Developers can maximize their use of Intel® architecture technology at runtime with either of these solutions: a single binary library package optimized for different platforms and selected by glibc at load time, or function multi-versioning, with multiple versions of a function that each target specific instructions. As software developers, we are responsible for providing ways to use this technology in scalable and easy-to-deploy solutions like the ones described in this article.

Credits: Hongjiu Lu (co-author)