Python vs Cython

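Case 1: Bytecode execution

The same Python code usually runs faster after being compiled with Cython than when it is run by the Python interpreter.

The reason is that the interpreter is essentially a for/switch loop that executes bytecode one instruction at a time. Compared with native machine code, this indirection makes it hard for the CPU to predict branches and hurts the locality of the instruction cache.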
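The for/switch structure described above can be sketched as a toy stack machine. This is a hypothetical, heavily simplified model written for Python 3, not CPython's actual eval loop; the `run` function and the `(opcode, argument)` tuple format are assumptions made for illustration, and `//` is used to mimic Python 2 integer division on int operands.

```python
def run(code, local_vars):
    """Execute a list of (opcode, argument) pairs on a value stack."""
    stack = []
    for op, arg in code:                  # the "for" part: one iteration per instruction
        if op == "LOAD_FAST":             # the "switch" part: one branch per opcode
            stack.append(local_vars[arg])
        elif op == "LOAD_CONST":
            stack.append(arg)
        elif op == "STORE_FAST":
            local_vars[arg] = stack.pop()
        elif op == "BINARY_MULTIPLY":
            w = stack.pop()
            v = stack.pop()
            stack.append(v * w)
        elif op == "BINARY_ADD":
            w = stack.pop()
            v = stack.pop()
            stack.append(v + w)
        elif op == "INPLACE_DIVIDE":
            w = stack.pop()
            v = stack.pop()
            stack.append(v // w)          # assumption: int operands, Python 2 semantics
        elif op == "RETURN_VALUE":
            return stack.pop()
        else:
            raise ValueError("unknown opcode: %s" % op)


# Hand-written program corresponding to: c = a * b + 1; c /= 2; return c
program = [
    ("LOAD_FAST", "a"), ("LOAD_FAST", "b"), ("BINARY_MULTIPLY", None),
    ("LOAD_CONST", 1), ("BINARY_ADD", None), ("STORE_FAST", "c"),
    ("LOAD_FAST", "c"), ("LOAD_CONST", 2), ("INPLACE_DIVIDE", None),
    ("STORE_FAST", "c"),
    ("LOAD_FAST", "c"), ("RETURN_VALUE", None),
]
result = run(program, {"a": 1, "b": 2})
print(result)
```

Every instruction pays for the loop iteration and the chain of opcode comparisons before any real work happens, which is the overhead that compiled code avoids.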

p1.py

def test(a, b):
    c = a * b + 1
    c /= 2
    return c

Bytecode of p1.py

In [4]: dis.dis(test)
  2           0 LOAD_FAST                0 (a)
              3 LOAD_FAST                1 (b)
              6 BINARY_MULTIPLY
              7 LOAD_CONST               1 (1)
             10 BINARY_ADD
             11 STORE_FAST               2 (c)
  3          14 LOAD_FAST                2 (c)
             17 LOAD_CONST               2 (2)
             20 INPLACE_DIVIDE
             21 STORE_FAST               2 (c)
  4          24 LOAD_FAST                2 (c)
             27 RETURN_VALUE

PyEval_EvalFrameEx() 
        case BINARY_MULTIPLY:
            w = POP();
            v = TOP();
            x = PyNumber_Multiply(v, w);
            Py_DECREF(v);
            Py_DECREF(w);
            SET_TOP(x);
            if (x != NULL) continue;
            break;
...
PyObject *
PyNumber_Multiply(PyObject *v, PyObject *w)
{
    PyObject *result = binary_op1(v, w, NB_SLOT(nb_multiply));
    if (result == Py_NotImplemented) {
        PySequenceMethods *mv = v->ob_type->tp_as_sequence;
        PySequenceMethods *mw = w->ob_type->tp_as_sequence;
        Py_DECREF(result);
        if  (mv && mv->sq_repeat) {
            return sequence_repeat(mv->sq_repeat, v, w);
        }
        else if (mw && mw->sq_repeat) {
            return sequence_repeat(mw->sq_repeat, w, v);
        }
        result = binop_type_error(v, w, "*");
    }
    return result;
}
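The dispatch that `PyNumber_Multiply` performs can be approximated at the Python level. This is a rough model under stated assumptions, not the real algorithm: the helper name `py_multiply` is hypothetical, the subclass-priority rule is omitted, and the separate `sq_repeat` fallback is not modelled because sequence types expose repetition through `__mul__` and `__rmul__` in Python code.

```python
def py_multiply(v, w):
    """Rough Python-level model of how `v * w` is resolved."""
    # 1. Try the left operand's multiplication method.
    forward = getattr(type(v), "__mul__", None)
    if forward is not None:
        result = forward(v, w)
        if result is not NotImplemented:
            return result
    # 2. Fall back to the right operand's reflected method.
    reflected = getattr(type(w), "__rmul__", None)
    if reflected is not None:
        result = reflected(w, v)
        if result is not NotImplemented:
            return result
    # 3. Nobody handled the operation: report a type error.
    raise TypeError("unsupported operand type(s) for *: %r and %r"
                    % (type(v).__name__, type(w).__name__))


print(py_multiply(3, 4))
print(py_multiply(2, [0]))
```

Both the interpreter and the Cython-generated code go through this kind of dynamic dispatch for untyped operands, so the gain in this case comes mainly from removing the bytecode loop, not from the multiplication itself.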

p2.pyx (same content as p1.py)

def test(a, b):
    c = a * b + 1
    c /= 2
    return c

Excerpt of the C code generated by Cython

  /* "p2.pyx":2
 * def test(a, b):
 *     c = a * b + 1             # <<<<<<<<<<<<<<
 *     c /= 2
 *     return c
 */
  __pyx_t_1 = PyNumber_Multiply(__pyx_v_a, __pyx_v_b); if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 2; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_1);
  __pyx_t_2 = PyNumber_Add(__pyx_t_1, __pyx_int_1); if (unlikely(!__pyx_t_2)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 2; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_2);
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
  __pyx_v_c = __pyx_t_2;
  __pyx_t_2 = 0;

t.py

import timeit
total = 50000000
t1 = timeit.Timer("test(1, 2)", "from p1 import test")
print "python:", t1.timeit(total)
import pyximport; pyximport.install()
t2 = timeit.Timer("test(1, 2)", "from p2 import test")
print "cython:", t2.timeit(total)

Benchmark results

$ python t.py
python: 1.63565206528
cython: 0.989973068237

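Case 2: Function calls

In Python everything is an object, including function arguments. Even numeric literals are turned into numeric objects of the corresponding type when Python compiles the code object. In many programs, however, most of the logic works on fixed types, so converting Python functions into C functions can improve efficiency dramatically, especially when the calls happen inside a large loop.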
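The cost of "everything is an object" can be made visible by comparing a Python int with a plain C int. This is a small sketch that assumes CPython on Python 3 (`sys.getsizeof` is implementation-specific); the variable names are chosen for illustration only.

```python
import ctypes
import sys

n = 1
boxed_size = sys.getsizeof(n)             # bytes used by the Python int object
raw_size = ctypes.sizeof(ctypes.c_int)    # bytes used by a plain C int

# The Python value carries a type pointer and a reference count in addition
# to the number itself, so it is several times larger than the raw C value.
print(type(n), boxed_size, raw_size)
```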

p2.pyx

cdef test(int a, int b):
    cdef int c = a * b + 1
    c /= 2
    return c
def timeit(int cnt, int a, int b):
    cdef int i
    for i in range(cnt):
        test(a, b)

Excerpt of the C code generated by Cython

/* "p2.pyx":2
 * cdef test(int a, int b):
 *     cdef int c = a * b + 1             # <<<<<<<<<<<<<<
 *     c /= 2
 *     return c
 */
  __pyx_v_c = ((__pyx_v_a * __pyx_v_b) + 1);

  /* "p2.pyx":7
 * def timeit(int cnt, int a, int b):
 *     cdef int i
 *     for i in range(cnt):             # <<<<<<<<<<<<<<
 *         test(a, b)
 */
  __pyx_t_1 = __pyx_v_cnt;
  for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_1; __pyx_t_2+=1) {
    __pyx_v_i = __pyx_t_2;

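We declared the type of every variable and used cdef to define test as a C function. With that information Cython performs the following optimizations:

Python code is translated into equivalent C code, which matters most for numeric operations
Python loops are translated into equivalent C loops, a clear win for large loops
The equivalent C function is cheap to call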

t.py

import timeit
total = 10000000
t1 = timeit.Timer("test(1, 2)", "from p1 import test")
print "python:", t1.timeit(total)
import pyximport; pyximport.install()
timeit_stmt = "timeit(%d, 1, 2)" % total
t2 = timeit.Timer(timeit_stmt, "from p2 import timeit")
print "cython:", t2.timeit(1)

Benchmark results

$ python t.py
python: 1.61836004257
cython: 0.0431079864502

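Case 3: GIL

GIL: the Global Interpreter Lock, the core of the Python virtual machine's multithreading mechanism.

Python threads are real operating-system threads, with a different underlying implementation on each platform (for example Win32 threads on Windows and pthreads on POSIX systems).

To make operations on all objects thread-safe, the Python interpreter uses one global lock (the GIL) to synchronize all threads. As a result only one Python thread runs at any given moment, which makes the threads behave like pseudo-threads. The GIL is a very coarse-grained lock, and its implementation and performance cost have been debated for years. It has nevertheless stood the test of time, even though it prevents Python from making full use of the CPU on multi-core platforms.

The GIL works simply: a thread sleeps unless it holds the lock, and if the thread holding the lock never releases it, every other thread sleeps forever. For pure-Python threads this is not much of a problem. Python code is executed by the interpreter as a stream of bytecode instructions, and the interpreter keeps count of them: after a certain number of instructions each thread must yield to the others. The role of the operating-system scheduler here is subtle. However it schedules, even if it suspends the thread that holds the lock and tries to wake one that does not, the interpreter gives that thread no chance to execute, so Python objects stay safe.

In general, then, pure-Python programming does not need to take the GIL into account; the two live at different levels. Module-level C code such as CPython C extensions and Cython, however, stands on an equal footing with the Python virtual machine, so the GIL often does need attention there, especially when the code involves blocking IO, long-running computation, or sleeping. Otherwise all of Python waits for the slow operation to return, because the other threads cannot acquire the lock no matter how urgently they need to run.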

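GIL release points:

CPU-bound programs

Most release points are inside the bytecode execution loop (see the PyEval_EvalFrameEx function). The interval between releases is not fixed: it depends on the _Py_CheckInterval setting and on the specific bytecode being executed at that moment.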
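The setting that controls these release points can be inspected from Python. The sketch below is for Python 3, where the instruction-count based check interval described above (exposed as `sys.getcheckinterval()` in Python 2) was replaced by a time-based switch interval, so it illustrates the same idea with a different unit.

```python
import sys

# Python 3: the running thread is asked to give up the GIL after roughly
# this many seconds. Python 2 counted bytecode instructions instead.
interval = sys.getswitchinterval()
print("switch interval (seconds):", interval)
```

`sys.setswitchinterval()` changes the value; a smaller interval makes threads alternate more often at the cost of more switching overhead.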

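IO-bound programs

(or programs that frequently sleep while waiting for some condition, for example lock.acquire in the thread module): the GIL is released automatically inside every API that may trigger IO or sleep (search the source for Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS)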
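The effect of releasing the GIL around a blocking call can be seen with `time.sleep`, which releases the lock while it waits. This is a minimal Python 3 sketch; the helper names `blocking_task` and `run_threads` are hypothetical.

```python
import threading
import time


def blocking_task(seconds):
    # time.sleep releases the GIL while waiting, so other threads can run.
    time.sleep(seconds)


def run_threads(n_thread, seconds):
    """Start n_thread sleeping threads and return the total wall-clock time."""
    threads = [threading.Thread(target=blocking_task, args=(seconds,))
               for _ in range(n_thread)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


elapsed = run_threads(4, 0.2)
print("4 threads sleeping 0.2s each took", elapsed, "seconds")
```

The threads overlap their waiting, so the total should be close to one sleep period and well below the sum of all four.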

p2.pyx

def test(int n_count):
    cdef int c = 1
    cdef int i
    for i in range(n_count):
        c += 2 * 34 + 1
        c /= 2
        c *= 39
    return c
def test_nogil(int n_count):
    cdef int c = 1
    cdef int i
    with nogil:
        for i in range(n_count):
            c += 2 * 34 + 1
            c /= 2
            c *= 39
    return c

t.py

import timeit
n_count = 900000000
n_thread = 2
import threading
def threading_test(n_thread, func, n_count):
    threads = []
    for i in range(n_thread):
        t = threading.Thread(target=func, args=(n_count,))
        threads.append(t)
    for t in threads:
        t.start()
    for t in threads:
        t.join()
import pyximport; pyximport.install()
t1 = timeit.Timer("threading_test(%d, test, %d)" % (n_thread, n_count), "from p2 import test; from __main__ import threading_test")
print "python:", t1.timeit(1)
t2 = timeit.Timer("threading_test(%d, test_nogil, %d)" % (n_thread, n_count), "from p2 import test_nogil; from __main__ import threading_test")
print "python nogil:", t2.timeit(1)

Benchmark results

$ python t.py
python: 8.29753804207
python nogil: 4.25584983826

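Case 4: Object data member access

Access to an object's data members goes through the object's own dictionary (__dict__), which is far less efficient than the offset-based access of a C struct.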
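The dictionary behind an ordinary instance can be inspected directly. This is a short Python 3 sketch using a class of the same shape as the one in this case.

```python
class Test(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b


t = Test(1, 2)

# Instance attributes are stored as key/value pairs in a per-object dict.
print(t.__dict__)   # {'a': 1, 'b': 2}

# Writing to the dict is the same thing as setting the attribute.
t.__dict__["a"] = 10
print(t.a)
```

Each `self.a` in a method therefore costs a hash lookup by name instead of a fixed-offset memory read.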

p1.py

class Test(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b
    def test(self):
        c = self.a * self.b + 1
        c /= 2
        return c

Bytecode analysis

Disassembly of test:
  6           0 LOAD_FAST                0 (self)
              3 LOAD_ATTR                0 (a)
              6 LOAD_FAST                0 (self)
              9 LOAD_ATTR                1 (b)
             12 BINARY_MULTIPLY     
             13 LOAD_CONST               1 (1)
             16 BINARY_ADD          
             17 STORE_FAST               1 (c)
  7          20 LOAD_FAST                1 (c)
             23 LOAD_CONST               2 (2)
             26 INPLACE_DIVIDE      
             27 STORE_FAST               1 (c)
  8          30 LOAD_FAST                1 (c)
             33 RETURN_VALUE

Note the LOAD_ATTR instructions here. In ceval.c they are implemented as:
        case LOAD_ATTR:
            w = GETITEM(names, oparg);
            v = TOP();
            x = PyObject_GetAttr(v, w);
            Py_DECREF(v);
            SET_TOP(x);
            if (x != NULL) continue;
            break;

For an ordinary instance, PyObject_GetAttr here amounts to a lookup in the object's dictionary.

p2.pyx

cdef class Test(object):
    cdef int a, b
    def __init__(self, a, b):
        self.a = a
        self.b = b
    def test(self):
        cdef int c = self.a * self.b + 1
        c /= 2
        return c

Analysis of the Cython-generated output

/* "p2.pyx":7
 *         self.b = b
 *     def test(self):
 *         cdef int c = self.a * self.b + 1             # <<<<<<<<<<<<<<
 *         c /= 2
 *         return c
 */
  __pyx_v_c = ((__pyx_v_self->a * __pyx_v_self->b) + 1);

As the output shows, data members declared in a cdef class are accessed directly by offset.
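Offset-based layout can be illustrated from Python with `ctypes.Structure`, which describes a C struct with fixed field positions. This is only an analogy for what the generated C code does, not how Cython implements cdef classes; the class name `CTest` is hypothetical.

```python
import ctypes


class CTest(ctypes.Structure):
    # Two C ints laid out one after the other, like `cdef int a, b`.
    _fields_ = [("a", ctypes.c_int), ("b", ctypes.c_int)]


# Each field lives at a fixed byte offset from the start of the struct.
offset_a = CTest.a.offset
offset_b = CTest.b.offset
print("a at offset", offset_a, "b at offset", offset_b)

t = CTest(1, 2)
c = (t.a * t.b + 1) // 2   # same arithmetic as test(), with C-style int division
print(c)
```

In compiled code, reading `self->a` is a single load at a known offset, with no name lookup involved.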

t.py

import timeit
total = 10000000
t1 = timeit.Timer("t.test()", "import p1; t = p1.Test(1, 2)")
print "python:", t1.timeit(total)
import pyximport; pyximport.install()
t2 = timeit.Timer("t.test()", "import p2; t = p2.Test(1, 2)")
print "cython:", t2.timeit(total)

Benchmark results

$ python t.py
python: 2.80288910866
cython: 0.585206985474

kingluo/cython.markdown
