Optimizing Cython loop compared to Numpy

#cython: boundscheck=False, wraparound=False, nonecheck=False, cdivision=True, language_level=3
cpdef int query(double[::1] q, double[:,::1] data) nogil:
        int n = data.shape[0]
        int dim = data.shape[1]
        int best_i = -1
        double best_ip = -1
        double ip
    for i in range(n):
        ip = 0
        for j in range(dim):
            ip += q[j] * data[i, j]
        if ip > best_ip:
            best_i = i
            best_ip = ip
    return best_i

After compiling, I time the code from Python:

import numpy as np
import ip
n, dim = 10**6, 10**2
X = np.random.randn(n, dim)
q = np.random.randn(dim)
%timeit ip.query(q, X)

This takes roughly 100ms. Meanwhile the equivalent numpy code:

%timeit np.argmax(q @ X.T)

Takes just around 50ms.

This is odd, since the NumPy code seemingly has to allocate the big array q @ X.T before taking the argmax. I thus wonder if there are some optimizations I am lacking?

I have added extra_compile_args=["-O3", '-march=native'], to my setup.py and I also tried changing the function definition to

cpdef int query(np.ndarray[double] q, np.ndarray[double, ndim=2] data):

but it had virtually no difference in performance.


The operation q @ X.T will be mapped to an implementation of matrix-vector-multiplication (dgemv) from either OpenBlas or MKL (depending on your distribution) under the hood – that means you are against one of the best optimized algorithms out there.

The resulting vector has 1M elements, which results in about 8MB memory. 8MB will not always fit into L3-cache, but even RAM has about 15GB/s bandwith, thus writing/reading 8MB will take at most 1-2ms – not much gain compared to about 50ms of the overall running time.

The most obvios issue with your code, is that it doesn’t calculate the same as q @X.T. It calculates


Because of IEEE 754 the compiler is not allowed to reorder the operations and executes them in this non-optimal order: in order to calculate the second sum, the operation must wait until the first summation is performed. This approach doesn’t use the full potential of modern architectures.

A good dgemv implementation will choose a much better order of operations. A similar issue, but with sums, can be found in this SO-post.

To level the field one could use -ffast-math, which allows compiler to reoder operations and thus make a better use of pipelines.

Here are results on my machine for your benchmark:

%timeit query(q, X)            # 101 ms
%timeit query_ffastmath(q, X)  # 56.3 ms
%timeit np.argmax(q @ X.T)     # 50.2 ms

It is still about 10% worse, but I would be really surprised if compiler could beat a hand-crafted version created by experts especially for my processor.