Boosting Python* from profile-guided to platform-specific optimizations

12 Feb, 2019

by Victor Rodriguez Bahena

This blog is the second in a "behind the magic" series. Subsequent blogs in this series will be published and linked as they become available.

Introduction

Python* is a major force in the computing industry. From its humble beginnings as a hobbyist project created to fill time over a Christmas holiday, Python has become one of the most popular general-purpose interpreted languages. Its flexibility and ease of use have led to widespread adoption and an enthusiastic user community. Python is now widely used for automation scripts, cloud computing infrastructure, and deep learning.

Python developers are always looking for ways to improve performance. The Python wiki contains recommendations for improving performance and scalability in a Python application, ranging from choosing the best algorithms to taking advantage of interpreter optimizations. Most of these proposed optimizations are strategies and suggestions on how to best use Python language features and are outside the scope of what a modern Linux distribution can control. However, even if these suggestions are followed, a Python application may still fall short of its maximum performance, because there are optimizations that a Linux distribution does heavily influence: enabling Python and its accompanying modules and shared libraries to make the best use of the underlying hardware.

Clear Linux developers have invested heavily in improving the performance of libraries and tools across multiple levels of the Python stack. The goal is to enable developers to realize the full potential of Intel® architecture without having to do anything special themselves. This is achieved by using techniques like profile-guided optimizations and platform-specific optimizations.

Platform optimizations across the Python stack

The next sections describe the following platform optimizations across the Python stack in the Clear Linux project.

  • Patches to implement Intel® Advanced Vector Extensions compiler flags for distutils and the math library
  • Compiler flags to build CPython code

All changes and patches to Python itself in Clear Linux are documented in the python3.spec file, and the patches applied to the original upstream source code are publicly available.

Distutils

The first set of patches to review enables the build of Intel® Advanced Vector Extensions (Intel® AVX) technology across the Python libraries. The distutils package provides the end user with a vast list of Python libraries for multiple development tools, some of which allow users to build and install additional modules into a Python installation. To build Python libraries optimized for x86-64 systems for packages that use the upstream-provided distutils, the Clear Linux team modified the distutils tool scripts with the following patch:

diff --git a/Lib/distutils/unixccompiler.py b/Lib/distutils/unixccompiler.py
index ab4d4de..de09d99 100644
--- a/Lib/distutils/unixccompiler.py
+++ b/Lib/distutils/unixccompiler.py
@@ -116,6 +116,10 @@ class UnixCCompiler(CCompiler):
         try:
             self.spawn(compiler_so + cc_args + [src, '-o', obj] +
                        extra_postargs)
+            self.spawn(compiler_so + cc_args+ ["-march=haswell", "-O3", "-fno-semantic-interposition", "-ffat-lto-objects", "-flto=4"] + [src, '-o', obj + ".avx2"] +
+                       extra_postargs)
+            self.spawn(compiler_so + cc_args+ ["-march=skylake-avx512", "-O3", "-fno-semantic-interposition", "-ffat-lto-objects", "-flto=4", "-mprefer-vector-width=512"] + [src, '-o', obj + ".avx512"] +
+                       extra_postargs)
         except DistutilsExecError as msg:
             raise CompileError(msg)

This part of the patch builds two more versions of the Python libraries using these flags:

-march=haswell 
-march=skylake-avx512

Each library compiled with these flags ends with the suffix .avx2 or .avx512, respectively. The resulting files are installed in the /usr/lib/python3.7/site-packages/ directory; for example, the pandas library includes the following file:

pandas/_libs/skiplist.cpython-37m-x86_64-linux-gnu.so.avx2

(pandas is a Python library for data manipulation and analysis.)
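
A quick way to see which platform-specific variants are installed is to scan site-packages for the extra suffixes. The sketch below is only an illustration; the site-packages path matches the example above and should be adjusted for your installation.

# List platform-specific variants of compiled extension modules.
# The site-packages path is the one from the example above; adjust as needed.
import glob
import os

site_packages = "/usr/lib/python3.7/site-packages"
for suffix in (".avx2", ".avx512"):
    pattern = os.path.join(site_packages, "**", "*.so" + suffix)
    for path in glob.glob(pattern, recursive=True):
        print(path)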

In addition, we added support for dynamically loading the appropriate platform-specific variant of an extension module. This is done with a hack on the file dynload_shlib.c, as shown below:

diff --git a/Python/dynload_shlib.c b/Python/dynload_shlib.c
index f271193..4315237 100644
--- a/Python/dynload_shlib.c
+++ b/Python/dynload_shlib.c
@@ -62,6 +62,8 @@ _PyImport_FindSharedFuncptr(const char *prefix,
     char funcname[258];
     char pathbuf[260];
     int dlopenflags=0;
+    char *pathname2;
+    char *pathname3;

     if (strchr(pathname, '/') == NULL) {
         /* Prefix bare filename with "./" */
@@ -93,7 +95,19 @@ _PyImport_FindSharedFuncptr(const char *prefix,

     dlopenflags = PyThreadState_GET()->interp->dlopenflags;

-    handle = dlopen(pathname, dlopenflags);
+    pathname2 = malloc(strlen(pathname) + strlen(".avx2") + 1);
+    sprintf(pathname2, "%s%s", pathname, ".avx2");
+    pathname3 = malloc(strlen(pathname) + strlen(".avx512") + 1);
+    sprintf(pathname3, "%s%s", pathname, ".avx512");
+
+    if (__builtin_cpu_supports("avx512dq") && access(pathname3, R_OK) == 0)
+        handle = dlopen(pathname3, dlopenflags);
+    else if (__builtin_cpu_supports("avx2") && access(pathname2, R_OK) == 0)
+        handle = dlopen(pathname2, dlopenflags);
+    else
+        handle = dlopen(pathname, dlopenflags);
+    free(pathname2);
+    free(pathname3);

     if (handle == NULL) {
         PyObject *mod_name;

This change enables dynamic loading of extension modules based on the platform where our Python module is running.
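
To make the selection logic easier to follow, here is a rough Python re-implementation of the decision the patched loader makes. This is only an illustration, not the code path CPython actually takes; it reads the CPU feature flags from /proc/cpuinfo (Linux-specific), and the module path at the end is a hypothetical example.

# Illustrative mirror of the variant selection in the dynload_shlib.c patch:
# prefer the .avx512 build, then .avx2, then the baseline .so.
import os

def cpu_flags():
    """Return the CPU feature flags reported by /proc/cpuinfo (Linux only)."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

def select_extension(pathname):
    """Pick the best readable variant of a compiled extension module."""
    flags = cpu_flags()
    if "avx512dq" in flags and os.access(pathname + ".avx512", os.R_OK):
        return pathname + ".avx512"
    if "avx2" in flags and os.access(pathname + ".avx2", os.R_OK):
        return pathname + ".avx2"
    return pathname

# Hypothetical module path used only for demonstration.
print(select_extension("/usr/lib/python3.7/site-packages/pandas/_libs/"
                       "skiplist.cpython-37m-x86_64-linux-gnu.so"))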

By integrating these patches, it is possible to have Intel® AVX2 and Intel® AVX-512 compiled libraries of Python plugin packages (pandas, shown above, is one example).

However, if we review the SPEC files of such projects, we will not see any patches for Intel® AVX technology enablement or multiple %build sections. This is because most libraries generate platform-specific libraries correctly, thanks to the above patch to distutils.

Similar distutils patches must be implemented for packages that ship their own copy of distutils, like the NumPy project, the fundamental package for scientific computing with Python. Because NumPy contains an embedded distutils, the Clear Linux project team implemented similar patches (avx2-distutils.patch and avx2-fortran-distutils.patch) to enable the Intel® AVX technology across the NumPy stack.

Math library

The top-level makefile for Python has one section for the math library, which is a library shared by the math and cmath modules. The math module provides access to the mathematical functions defined by the C standard and the cmath module is used for complex numbers. The Clear Linux project added platform optimization by enabling the Intel® AVX technology for this library via the following patch:

diff --git a/Makefile.pre.in b/Makefile.pre.in
index baa1d0a..7b07d60 100644
--- a/Makefile.pre.in
+++ b/Makefile.pre.in
@@ -591,6 +591,8 @@ pybuilddir.txt: $(BUILDPYTHON)
 # This is shared by the math and cmath modules
 Modules/_math.o: Modules/_math.c Modules/_math.h
      $(CC) -c $(CCSHARED) $(PY_CORE_CFLAGS) -o $@ $<
+     $(CC) -c $(CCSHARED) $(PY_CORE_CFLAGS) -march=haswell -o $@.avx2 $<
+     $(CC) -c $(CCSHARED) $(PY_CORE_CFLAGS) -march=skylake-avx512 -o $@.avx512 $<
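
For context, math and cmath are the two modules that share this compiled math core, so code calling into either of them is served by the object files built above. A trivial usage example:

# math and cmath share the compiled math core built by the makefile rule above.
import cmath
import math

print(math.expm1(1e-5))         # real-valued math function
print(cmath.exp(1j * math.pi))  # complex counterpart from cmath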

CPython code

CPython is the reference implementation of the Python programming language and the most widely used. Written in C and Python, CPython uses the GNU Autotools to configure, build, and install the binaries. In the Clear Linux project, the python3.spec file creates two build environments for platform optimizations. One regular environment:

%configure %python_configure_flags --enable-shared
make %{?_smp_mflags}

The second environment uses the 64-bit CPU optimizations for 4th generation Intel® Core™ processors (formerly codenamed Haswell):

pushd ../Python-avx2
export CFLAGS="$CFLAGS -march=haswell -mfma  "
export CXXFLAGS="$CXXFLAGS -march=haswell -mfma"
%configure %python_configure_flags --enable-shared --bindir=/usr/bin/haswell
make %{?_smp_mflags}
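
One way to check which compiler flags a given interpreter build recorded is to query its build-time configuration. Below is a minimal check, assuming the exported CFLAGS are captured by configure and therefore visible through sysconfig; the exact output will vary between the regular and the platform-optimized build.

# Print compiler-related variables recorded at build time for the running
# interpreter. A platform-optimized build would be expected to show flags
# such as -march=haswell; the regular build would not.
import sysconfig

for var in ("CC", "CFLAGS", "PY_CORE_CFLAGS"):
    print(var, "=", sysconfig.get_config_var(var))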

Profile-guided optimizations across Python

The other part of our approach to improving Python performance on Clear Linux uses profile-guided optimization (PGO). PGO, also known as feedback-directed optimization (FDO), is a compiler optimization technique that uses profiling to improve program runtime performance. PGO in GCC uses static instrumentation to collect profiles; GCC then uses the execution profiles to guide optimizations such as instruction scheduling, branch prediction, basic block reordering, function splitting, and register allocation.

In GCC, the current method of PGO optimization involves the following steps:

  1. Build an instrumented version of the program using the GCC flag -fprofile-generate.
  2. Run the instrumented program with representative training data to collect the execution profile.
  3. Rebuild the source using the profile data as feedback with the GCC option -fprofile-use=sort.gcda (a minimal sketch of these steps follows this list).
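
A minimal sketch of these three steps, driving GCC from Python. The source file, training input, and output names are placeholders for a real program and a representative workload.

# Sketch of the three PGO steps with GCC; example.c and training-input.txt
# are hypothetical placeholders.
import subprocess

# 1. Build an instrumented binary.
subprocess.run(["gcc", "-O2", "-fprofile-generate", "example.c", "-o", "example"],
               check=True)

# 2. Run it on representative training data to produce .gcda profile files.
subprocess.run(["./example", "training-input.txt"], check=True)

# 3. Rebuild using the collected profile as feedback.
subprocess.run(["gcc", "-O2", "-fprofile-use", "example.c", "-o", "example"],
               check=True)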

The upstream Python project provides an easy mechanism in the makefile to change the training task according to the user's needs. Clear Linux takes advantage of this and applies a patch to the Python 3 makefile to change the task to run:

run_profile_task:
        $(LLVM_PROF_FILE) $(RUNSHARED) ./$(BUILDPYTHON) $(PROFILE_TASK) || true

where PROFILE_TASK is defined as:

PROFILE_TASK=<choose your favorite training app>

This change is important because choosing the proper training task is crucial in FDO. Why? Because each application generates different block and edge frequency counts. The information in one profile could optimize the performance of one use case, but at the same time degrade the performance of other applications. For this reason, we highly recommend that developers have the option to define the proper training task for their application use case.

Clear Linux provides an example of how to set up a training task with the patch use-pybench-to-optimize-python.patch; however, other benchmarks can be used for different applications and use cases. Another example of a benchmark that can serve as a training task is the Python performance benchmark suite.

Conclusion

In this blog post, we have shown two methodologies to improve the performance of Python libraries as well as the language interpreter (/usr/bin/python) for multiple scenarios:

  • If we want to use the optimized instruction sets provided by new x86 platforms, we need to make sure that our operating system provides the platform optimizations that we need. Compute-intensive applications such as big data and machine learning workloads are examples of end-user applications that could benefit from this performance boost methodology.
  • If our operating system runs only one kind of application, as in a container use case, we could use the FDO performance boost methodology, where the Python interpreter has been optimized for a specific task through techniques such as instruction scheduling, basic block reordering, function splitting, and register allocation.

In the end, the methodology chosen to improve the performance of a Python application is tightly coupled to the data and experiments that sustain it. However, regardless of the workload and use case, Python on Clear Linux will maximize the performance of the underlying hardware.