I recently bought a new computer, and to mark the occasion I ran BLAS/LAPACK benchmarks on a few of the machines I have.[1]

BLAS (Basic Linear Algebra Subprograms) is a specification for vector and matrix operations, and it is used by a wide range of software including GNU Octave[2], Mathematica, NumPy, R, and LAPACK (described below). Since a well-configured BLAS can make the same code run several times faster, it is worth understanding.[3]

BLAS consists of the following three levels (the number in parentheses is the year each level was published)[4]; a short SciPy example follows the list:

  • Level 1 (1979): scalar-vector and vector-vector operations
    • For example, daxpy literally performs “double precision scalar $a$ times vector $x$ plus vector $y$” and stores the result in $y$
    • $\mathbf{y} \leftarrow a \mathbf{x} + \mathbf{y}$
  • Level 2 (1988): matrix-vector operations
    • For example, sgemv computes a “single precision general matrix-vector product”
    • $\mathbf{y} \leftarrow a \mathbf{Ax} + b \mathbf{y}$
  • Level 3 (1990): matrix-matrix operations
    • For example, gemm computes a “general matrix-matrix product”
    • $\mathbf{C} \leftarrow a \mathbf{AB} + b \mathbf{C}$
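
For a concrete feel for these routines, they can be called directly from Python through SciPy's low-level BLAS wrappers (scipy.linalg.blas); a minimal sketch:

```python
import numpy as np
from scipy.linalg import blas

a = 2.0
x = np.arange(3, dtype=np.float64)
y = np.ones(3, dtype=np.float64)

# Level 1: y <- a*x + y (daxpy)
y = blas.daxpy(x, y, a=a)

# Level 3: C <- a*A*B (dgemm; beta defaults to 0, so there is no C term here)
A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
C = blas.dgemm(alpha=a, a=A, b=B)
```

In everyday use you rarely call these wrappers yourself; numpy.dot on two float64 matrices, for instance, ends up in the dgemm of whichever BLAS NumPy is linked against.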

Implementations of BLAS include NVIDIA's cuBLAS for CUDA, AMD's ACML, Intel's MKL, vecLib inside Apple's Accelerate Framework, the open-source ATLAS, and OpenBLAS, probably the most widely used open-source implementation, among many others.

LAPACK (Linear Algebra PACKage), meanwhile, is a specification covering routines such as least squares solvers and matrix decompositions like the SVD (singular value decomposition); a FORTRAN version was released in 2008. ATLAS and OpenBLAS implement parts of LAPACK, and there are other implementations such as LAPACK++.[5]
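
These routines are what sit behind the high-level linear algebra calls in SciPy; a minimal sketch (assuming a SciPy built against LAPACK, which is the usual case):

```python
import numpy as np
from scipy import linalg

A = np.random.rand(6, 4)
b = np.random.rand(6)

# SVD: SciPy dispatches to a LAPACK driver (gesdd by default)
U, s, Vt = linalg.svd(A)

# Least squares: solved by a LAPACK gelsd-style driver
x, res, rank, sv = linalg.lstsq(A, b)
```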

Commands used for benchmarks

The following commands were used for the measurements (on Ubuntu):

  1. Switch BLAS: sudo update-alternatives --config libblas.so.3
  2. Switch LAPACK: sudo update-alternatives --config liblapack.so.3
  3. Check BLAS/LAPACK linkage with numpy: python -c "import numpy; numpy.__config__.show()"
  4. NumPy test (~30s[6]): python -c "import numpy; numpy.test()"
  5. SciPy test (~1m[6]): python -c "import scipy; scipy.test()"
  6. Theano test (~30m[6]): python -c "import theano; theano.test()"
    • GPU: THEANO_FLAGS=floatX=float32,device=gpu python -c "import theano; theano.test()"[7]
  7. BLAS test: running the following script, which ships with the Theano package, also prints the benchmark results they collected on their machines:

     python `python -c "import os, theano; print os.path.dirname(theano.__file__)"`/misc/check_blas.py
    

Theano BLAS benchmarks:

gemm was run 10 times on 2000x2000 (M=N=K=2000) float64 matrices (all memory layout was in C order).
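
The core of that measurement can be approximated with plain NumPy (a rough sketch, not the actual check_blas.py code; numpy.dot on float64 matrices calls the linked BLAS's dgemm):

```python
import time
import numpy as np

n, iters = 2000, 10
A = np.random.rand(n, n)   # float64, C order by default
B = np.random.rand(n, n)

start = time.time()
for _ in range(iters):
    C = A.dot(B)           # dispatched to the linked BLAS's dgemm
elapsed = time.time() - start
print("%d x dgemm on %dx%d matrices: %.3fs" % (iters, n, n, elapsed))
```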

The detailed specs of the CPUs used in those tests are:

  • Xeon E5345 (2.33GHz, 8M L2 cache, 1333MHz FSB)
  • Xeon E5430 (2.66GHz, 12M L2 cache, 1333MHz FSB)
  • Xeon E5450 (3GHz, 12M L2 cache, 1333MHz FSB)
  • Core 2 E8500 (2.8GHz, hyper-threads enabled)
  • Core i7 930 (2.8GHz, hyper-threads enabled)
  • Core i7 950 (3.07GHz, hyper-threads enabled)
  • Xeon X5560 (2.8GHz, 12M L2 cache, hyper-threads?)
  • Xeon X5550 (2.67GHz, 8M L2 cache?, hyper-threads enabled)

and the libraries tested were:

  • numpy with ATLAS from distribution (FC9) package (1 thread)
  • manually compiled numpy and ATLAS with 2 threads
  • goto 1.26 with 1, 2, 4 and 8 threads
  • goto2 1.13 compiled with multiple threads enabled

The results for each CPU and library are as follows (the number after the slash in a library name is the number of threads used):

| Library/threads | Xeon E5345 | Xeon E5430 | Xeon E5450 | Core 2 E8500 | Core i7 930 | Core i7 950 | Xeon X5560 | Xeon X5550 |
|-----------------|------------|------------|------------|--------------|-------------|-------------|------------|------------|
| numpy_FC9_atlas/1 | 39.2s | 35.0s | 30.7s | 29.6s | 21.5s | 19.60s | | |
| numpy_MAN_atlas/2 | 12.0s | 11.6s | 10.2s | 9.2s | 9.0s | | | |
| goto/1 | 18.7s | 16.1s | 14.2s | 13.7s | 16.1s | 14.67s | | |
| goto/2 | 9.5s | 8.1s | 7.1s | 7.3s | 8.1s | 7.4s | | |
| goto/4 | 4.9s | 4.4s | 3.7s | - | 4.1s | 3.8s | | |
| goto/8 | 2.7s | 2.4s | 2.0s | - | 4.1s | 3.8s | | |

The remaining configurations were each reported with a single time:

| Library/threads | Time |
|-----------------|------|
| numpy 1.3.0 blas | 775.92s |
| openblas/1 | 14.04s |
| openblas/2 | 7.16s |
| openblas/4 | 3.71s |
| openblas/8 | 3.70s |
| mkl 11.0.083/1 | 7.97s |
| mkl 10.2.2.025/1 | 13.7s |
| mkl 10.2.2.025/2 | 7.6s |
| mkl 10.2.2.025/4 | 4.0s |
| mkl 10.2.2.025/8 | 2.0s |
| goto2 1.13/1 | 14.37s |
| goto2 1.13/2 | 7.26s |
| goto2 1.13/4 | 3.70s |
| goto2 1.13/8 | 1.94s |
| goto2 1.13/16 | 3.16s |

The GPU benchmarks are as follows (test time in float32; where several times are listed for a GPU, they were measured under different CUDA versions, from 6.5 down to 3.0, newest first):

| GPU | Test times | Note |
|-----|------------|------|
| K6000/NOECC | 0.06s, 0.06s | |
| K40 | 0.07s | |
| K20m/ECC | 0.08s, 0.08s, 0.07s | |
| K20/NOECC | 0.07s | |
| M2090 | 0.19s | |
| C2075 | 0.25s | |
| M2075 | 0.25s | |
| M2070 | 0.25s, 0.27s, 0.32s | |
| M2070-Q | 0.48s, 0.27s, 0.32s | |
| M2050(Amazon) | 0.25s | |
| C1060 | 0.46s | |
| K600 | 1.04s | |
| GTX Titan Black | 0.05s | |
| GTX Titan(D15U-50) | 0.06s, 0.06s | don't work |
| GTX 780 | 0.06s | |
| GTX 980 | 0.06s | |
| GTX 970 | 0.08s | |
| GTX 680 | 0.11s, 0.12s, 0.154s, 0.218s | |
| GRID K520 | 0.14s | |
| GTX 580 | 0.16s, 0.16s, 0.164s, 0.203s | |
| GTX 480 | 0.19s, 0.19s, 0.192s, 0.237s, 0.27s | |
| GTX 750 Ti | 0.20s | |
| GTX 470 | 0.23s, 0.23s, 0.238s, 0.297s, 0.34s | |
| GTX 660 | 0.18s, 0.20s, 0.23s | |
| GTX 560 | 0.30s | |
| GTX 650 Ti | 0.27s | |
| GTX 765M | 0.27s | |
| GTX 460 | 0.37s, 0.45s | |
| GTX 285 | 0.42s, 0.452s, 0.452s, 0.40s | cuda 3.0 seems faster? driver version? |
| 750M | 0.49s | |
| GT 610 | 2.38s | |
| GTX 550 Ti | 0.57s | |
| GT 520 | 2.68s, 3.06s | |
| 520M | 2.44s, 3.19s | with bumblebee on Ubuntu 12.04 |
| GT 220 | 3.80s | |
| GT 210 | 6.35s | |
| 8500 GT | 10.68s | |

My test results

Seven computers were compared; their hardware specs are as follows:

| Name | Description | OS | CPU | RAM | GPU |
|------|-------------|----|-----|-----|-----|
| tigger | MacBook Air 13" (Early 2014) | Mac OS X 10.10.4 | Intel Core i5-4260U 1.4GHz | 4GB 1600MHz (DDR3) | Intel HD Graphics 5000 1536MB |
| playbook | MacBook Pro 15" (Mid 2014) | Mac OS X 10.10.4 | Intel Core i7-4770HQ 2.2GHz | 16GB 1600MHz (DDR3) | Intel Iris Pro 1536MB |
| joker | PC | Ubuntu 13.10 | Intel Xeon E3-1230 v3 3.30GHz | 8GB 1600MHz (DDR3) | - |
| dada | PC | Ubuntu 14.04.1 | Intel Pentium G620 2.6GHz | 8GB 1067MHz (DDR3) | - |
| daca | PC | Ubuntu 14.04.2 | Intel Core i7-3930K 3.2GHz | 32GB 1600MHz (DDR3) | - |
| merci | PC | Ubuntu 14.04.2 | Intel Core i7-5820K 3.3GHz | 32GB 2133MHz (DDR4) | NVIDIA GeForce GTX 980 |
| labpc | PC | Windows 7 | AMD Phenom II X3 720 2.8GHz | 16GB 2133MHz (DDR3)? | ATI Radeon HD 4850 |

The software/package versions installed on each machine are as follows:

| Name | python | numpy | scipy | theano | cuda |
|------|--------|-------|-------|--------|------|
| tigger | 2.7.6 | 1.8.0rc1 | 0.13.0b1 | - | - |
| playbook | 2.7.10 | 1.9.2 | 0.15.1 | 0.7.0 | - |
| joker | 2.7.5 | 1.8.0 | - | - | - |
| dada | 2.7.6 | 1.9.1 | 0.16.0 | 0.7.0 | - |
| daca | 2.7.6 | 1.9.2 | 0.16.0 | 0.7.0 | - |
| merci | 2.7.6 | 1.9.2 | 0.16.0 | 0.7.0 | 7.0.27 |
| labpc | - | - | - | - | - |

The final results are summarized below (times in seconds; the number in parentheses is the number of tests run; the blas column is the check_blas.py benchmark time):

| Name | numpy | scipy | theano | blas |
|------|-------|-------|--------|------|
| tigger | 109.813 | 275.232 | - | - |
| playbook | 22.767 (5557) | 145.634 (17005) | 5705.634 (2724) | 1.181 |
| joker | 22.279 | - | - | - |
| dada/blas | 26.335 (5580) | 195.697 (18456) | 4546.648 (2722) | 17.37 |
| daca/openblas | 18.965 (5593) | 148.614 (18456) | 3418.434 (2722) | 1.9490 |
| merci/blas | 24.243 (-) | 49.994 (-) | 2700.233 (-) | - |
| merci/openblas | 15.193 (5593) | 114.156 (18456) | 3447.037 (2722) | 2.76 |
| merci/openblas+cuda | 16.393 (5593) | 109.405 (18456) | 4165.916 (19844), FAILED | 0.06 |
| labpc | 39.183 | - | - | - |

Below is a detailed log of the work done on merci, one of the machines listed above.

Step 1: Vanilla Ubuntu

$ sudo apt-get install python-dev python-pip python-nose g++ libopenblas-dev git
$ sudo apt-get install python-numpy     # 1.8.1
$ sudo apt-get install python-scipy     # 0.14.0
$ sudo pip install Theano               # 0.7.0
$ python -c "import numpy; numpy.__config__.show()"  # or, `from numpy.distutils.system_info import get_info; get_info('blas')`
blas_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_info:
    libraries = ['lapack']
    library_dirs = ['/usr/lib']
    language = f77
atlas_threads_info:
  NOT AVAILABLE
blas_opt_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_blas_threads_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
lapack_opt_info:
    libraries = ['lapack', 'blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE
  • NumPy (1.8.1): 24.243s, OK
  • SciPy (0.14.0): 49.994s, OK
  • Theano (0.7.0): 2700.233s, OK

Step 2: With OpenBLAS

$ sudo apt-get install libopenblas-dev
$ sudo apt-get purge python-numpy python-scipy      # http://stackoverflow.com/a/25326614/1054939 http://stackoverflow.com/q/29979539/1054939
$ sudo pip install numpy    # 1.9.2
$ sudo pip install scipy    # 0.16.0
$ sudo update-alternatives --config libblas.so.3    # /usr/lib/openblas-base/libblas.so.3
$ sudo update-alternatives --config liblapack.so.3  # /usr/lib/lapack/liblapack.so.3
$ python -c "import numpy; numpy.__config__.show()"                                                                                                                           [22/7614]
blas_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_info:
    libraries = ['lapack']
    library_dirs = ['/usr/lib']
    language = f77
atlas_threads_info:
  NOT AVAILABLE
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/lib']
    language = f77
openblas_info:
    libraries = ['openblas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
openblas_lapack_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE
  • NumPy (1.9.2): 15.193s (5593), OK
  • SciPy (0.16.0): 114.156s (18456), OK
  • Theano (0.7.0): 3447.037s (2722), OK
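
One knob worth keeping in mind when comparing these numbers is the number of threads OpenBLAS uses: it honors the OPENBLAS_NUM_THREADS (or OMP_NUM_THREADS) environment variable, which must be set before NumPy loads the library. A small sketch:

```python
import os
os.environ["OPENBLAS_NUM_THREADS"] = "4"   # must happen before importing numpy

import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
c = a.dot(b)   # this dgemm now uses up to 4 OpenBLAS threads
```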

As for the large difference in the SciPy test times, I asked on the #scipy IRC channel and was told it is probably due to the difference in the number of tests between versions rather than the performance of BLAS or the libraries themselves.

10:27 PM <e9t_> I'm using Ubuntu 14.04, and installed numpy+scipy with apt-get. Then I installed libopenblas-dev, purged numpy+scipy and reinstalled them with pip. But the test results are peculiar.
10:27 PM <e9t_> - numpy.test(): 24s -> 15s (decreased. great!)
10:27 PM <e9t_> - scipy.test(): 50s -> 114s (increased. why?)
10:27 PM <e9t_> Anyone know the reason?
10:34 PM <jtaylor> e9t_: likely the version difference, not blas
10:35 PM <jtaylor> scipy simply added more tests
10:35 PM <jtaylor> numpy too, but numpy also reduced the time the tests takes in recent versions
10:35 PM <jtaylor> the number of tests should be printed too

Step 3: With CUDA

  1. Pre-installation: install the NVIDIA driver and toolkit

     $ lspci | grep -i nvidia
     $ sudo apt-get install nvidia-346  # nvidia-current installed driver 304.125 which resulted in API mismatch
     $ sudo apt-get install nvidia-cuda-toolkit
     $ nvidia-smi
     NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
    
  2. Install cuda

    NOTE: While running apt-get install cuda (I think), the warning WARNING - No MPI compiler found. appeared. MPI is the message passing interface; it did not look like a serious problem, so I did not take any action for now.

     $ wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_7.0-28_amd64.deb
     $ sudo dpkg -i cuda-repo-ubuntu1404_7.0-28_amd64.deb
     $ sudo apt-get update
     $ sudo apt-get install cuda
     $ sudo reboot 0
     $ nvidia-smi
     Sat Sep  5 05:07:26 2015
     +------------------------------------------------------+
     | NVIDIA-SMI 346.82     Driver Version: 346.82         |
     |-------------------------------+----------------------+----------------------+
     | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
     | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
     |===============================+======================+======================|
     |   0  GeForce GTX 980     Off  | 0000:01:00.0     N/A |                  N/A |
     |  0%   42C    P0    N/A /  N/A |     15MiB /  4095MiB |     N/A      Default |
     +-------------------------------+----------------------+----------------------+
    
     +-----------------------------------------------------------------------------+
     | Processes:                                                       GPU Memory |
     |  GPU       PID  Type  Process name                               Usage      |
     |=============================================================================|
     |    0            C+G   Not Supported                                         |
     +-----------------------------------------------------------------------------+
    
  3. Post-installation: register PATH and run tests

     $ echo "export PATH=/usr/local/cuda-7.0/bin:$PATH" >> ~/.bash_aliases
     $ echo "export LD_LIBRARY_PATH=/usr/local/cuda-7.0/lib64:$LD_LIBRARY_PATH" >> ~/.bash_aliases
     $ source ~/.bash_aliases
     $ cuda-install-samples-7.0.sh ~/tmp
     $ cat /proc/driver/nvidia/version
     NVRM version: NVIDIA UNIX x86_64 Kernel Module  346.82  Wed Jun 17 10:37:46 PDT 2015
     GCC version:  gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04)
     $ nvcc -V
     nvcc: NVIDIA (R) Cuda compiler driver
     Copyright (c) 2005-2015 NVIDIA Corporation
     Built on Mon_Feb_16_22:59:02_CST_2015
     Cuda compilation tools, release 7.0, V7.0.27
     $ cd ~/tmp/NVIDIA_CUDA-7.0_Samples
     $ make
     $ ./bin/x86_64/linux/release/deviceQuery
     ./bin/x86_64/linux/release/deviceQuery Starting...
    
      CUDA Device Query (Runtime API) version (CUDART static linking)
    
     Detected 1 CUDA Capable device(s)
    
     Device 0: "GeForce GTX 980"
       CUDA Driver Version / Runtime Version          7.0 / 7.0
       CUDA Capability Major/Minor version number:    5.2
       Total amount of global memory:                 4096 MBytes (4294639616 bytes)
       (16) Multiprocessors, (128) CUDA Cores/MP:     2048 CUDA Cores
       GPU Max Clock rate:                            1329 MHz (1.33 GHz)
       Memory Clock rate:                             3505 Mhz
       Memory Bus Width:                              256-bit
       L2 Cache Size:                                 2097152 bytes
       Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
       Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
       Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
       Total amount of constant memory:               65536 bytes
       Total amount of shared memory per block:       49152 bytes
       Total number of registers available per block: 65536
       Warp size:                                     32
       Maximum number of threads per multiprocessor:  2048
       Maximum number of threads per block:           1024
       Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
       Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
       Maximum memory pitch:                          2147483647 bytes
       Texture alignment:                             512 bytes
       Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
       Run time limit on kernels:                     Yes
       Integrated GPU sharing Host Memory:            No
       Support host page-locked memory mapping:       Yes
       Alignment requirement for Surfaces:            Yes
       Device has ECC support:                        Disabled
       Device supports Unified Addressing (UVA):      Yes
       Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
       Compute Mode:
          < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    
     deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GeForce GTX 980
     Result = PASS
     $ ./bin/x86_64/linux/release/bandwidthTest
     [CUDA Bandwidth Test] - Starting...
     Running on...
    
      Device 0: GeForce GTX 980
      Quick Mode
    
      Host to Device Bandwidth, 1 Device(s)
      PINNED Memory Transfers
        Transfer Size (Bytes)        Bandwidth(MB/s)
        33554432                     12164.5
    
      Device to Host Bandwidth, 1 Device(s)
      PINNED Memory Transfers
        Transfer Size (Bytes)        Bandwidth(MB/s)
        33554432                     12896.5
    
      Device to Device Bandwidth, 1 Device(s)
      PINNED Memory Transfers
        Transfer Size (Bytes)        Bandwidth(MB/s)
        33554432                     164863.5
    
     Result = PASS
    
     NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
    
  • NumPy (1.9.2): 16.393 (5593), OK
  • SciPy (0.16.0): 109.405 (18456), OK
  • Theano (0.7.0): 4165.916 (19844), FAILED
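
Before rerunning the full test suite it is worth confirming that Theano actually compiles onto the GPU at all; a quick check, loosely adapted from the GPU test snippet in the Theano documentation (run with THEANO_FLAGS=device=gpu,floatX=float32):

```python
import numpy
import theano
import theano.tensor as T

x = theano.shared(numpy.random.rand(1000).astype(theano.config.floatX))
f = theano.function([], T.exp(x))

# With device=gpu the graph should contain Gpu* ops (e.g. GpuElemwise);
# with device=cpu only plain Elemwise ops appear.
print(f.maker.fgraph.toposort())
```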

Step 4: Bleeding edge Theano

$ git clone git://github.com/Theano/Theano.git
$ cd Theano
$ python setup.py develop   # 0.7.0.dev-c042a9c49ac6516b74668747d1e6e6bbe832efba
$ THEANO_FLAGS=init_gpu_device=gpu0,device=cpu,floatX=float32 python -c "import theano; theano.test()"
WARNING (theano.sandbox.cuda): GPU device gpu0 will be initialized, and used if a GPU is needed. However, no computation, nor shared variables, will be implicitly moved to that device. If you want
that behavior, use the 'device' flag instead.
Using gpu device 0: GeForce GTX 980 (CNMeM is enabled)
Theano version 0.7.0.dev-c042a9c49ac6516b74668747d1e6e6bbe832efba
theano is installed in /home/epark/pkgs/Theano/theano
NumPy version 1.9.2
NumPy is installed in /usr/local/lib/python2.7/dist-packages/numpy
Python version 2.7.6 (default, Jun 22 2015, 17:58:13) [GCC 4.8.2]
nose version 1.3.1
......00001     #include <Python.h>
00002   #include <iostream>
00003   #include "theano_mod_helper.h"

... # A more detailed log is available at [this link](http://pastebin.com/WSqrQkYA)

----------------------------------------------------------------------
Ran 19844 tests in 4093.222s

FAILED (KNOWNFAIL=18, SKIP=69, errors=218, failures=192)
  • Theano (0.7.0): 4093.222 (19844), FAILED[8]

Some random comments

  • When looking at test results, it is important to know not only 1) the test time but also 2) the package version and 3) the number of tests. Even with the same version, the number of tests can differ across environments (e.g., OS, GPU).
  • On Ubuntu, installing numpy/scipy with apt-get gives an older version (e.g., 1.8.2 on 14.04) unless you use a PPA; installing with pip gets a more recent version (1.9.2 in my case). Since there can be noticeable speed differences between versions, installing with pip is recommended.
  • On dada and daca, the SciPy tests always failed when NumPy 1.9.1 and SciPy 0.13.3 were installed on Python 2.7.6; on daca the NumPy test even segfaulted. Upgrading to the latest versions may well have fixed those problems.
  1. Some other benchmark codes for BLAS 

  2. Octave seems to be less widely known than I expected; you can think of it as an open-source counterpart of MATLAB.

  3. Much like how one's understanding of segfaults deepens once one starts digging into the Linux kernel or reading core dumps.

  4. The Intel Developer Zone is a good guide if you want a fuller explanation of how each function is named.

  5. Handle different versions of BLAS and LAPACK 

  6. The test times for NumPy, SciPy, and Theano are listed on the Theano website.

  7. Alternatively, use .theanorc.

  8. I posted a question to the Theano users group and was told that when using the GPU, it is not really a problem even if the tests fail…