To celebrate buying a new computer, I ran BLAS/LAPACK benchmarks on several of the machines I have. [1]

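BLAS (Basic Linear Algebra Subprograms) is a specification for vector and matrix operations, and it is used by a wide range of software such as GNU Octave [2], Mathematica, NumPy, R, and LAPACK (described below). Configuring and using BLAS well can make the same code run several times faster, so it is worth understanding. [3]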

BLAS consists of the following three levels (the number in parentheses is the year of publication) [4]:

  • Level 1 (1979): scalar-vector and vector-vector operations
    • For example, daxpy literally performs “double precision scalar $a$ times vector $x$ plus vector $y$” and assigns the result to $y$
    • $\mathbf{y} \leftarrow a \mathbf{x} + \mathbf{y}$
  • Level 2 (1988): matrix-vector operations
    • For example, sgemv literally computes a “single precision general matrix-vector product”
    • $\mathbf{y} \leftarrow a \mathbf{Ax} + b \mathbf{y}$
  • Level 3 (1990): matrix-matrix operations
    • For example, gemm computes a “general matrix-matrix product”
    • $\mathbf{C} \leftarrow a \mathbf{AB} + b \mathbf{C}$
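
To make the three levels concrete, here is a minimal sketch that calls one routine per level through SciPy's `scipy.linalg.blas` wrappers and checks each against the formula above written in plain NumPy. It assumes SciPy is installed; the array sizes and the scalars `a` and `b` are arbitrary values chosen for illustration.

```python
import numpy as np
from scipy.linalg import blas  # thin wrappers around whichever BLAS SciPy is linked to

rng = np.random.default_rng(0)
a, b = 2.0, 0.5  # arbitrary scalars for illustration

# Level 1: daxpy computes a*x + y in double precision.
x = rng.standard_normal(4)
y = rng.standard_normal(4)
axpy_expected = a * x + y
axpy_result = blas.daxpy(x, y.copy(), a=a)  # pass a copy: the wrapper may overwrite y

# Level 2: sgemv computes a*(A @ x) + b*y in single precision.
A32 = rng.standard_normal((3, 4)).astype(np.float32)
x32 = rng.standard_normal(4).astype(np.float32)
y32 = rng.standard_normal(3).astype(np.float32)
gemv_expected = a * (A32 @ x32) + b * y32
gemv_result = blas.sgemv(a, A32, x32, beta=b, y=y32.copy())

# Level 3: dgemm computes a*(A @ B) + b*C in double precision.
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
C = rng.standard_normal((3, 2))
gemm_expected = a * (A @ B) + b * C
gemm_result = blas.dgemm(a, A, B, beta=b, c=C.copy())

print(np.allclose(axpy_result, axpy_expected))
print(np.allclose(gemv_result, gemv_expected, rtol=1e-4, atol=1e-5))
print(np.allclose(gemm_result, gemm_expected))
```

The leading letter of each routine name selects the precision (`s` for single, `d` for double), which is why the Level 2 example uses float32 arrays.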

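Implementations of BLAS include cuBLAS (NVIDIA, for CUDA), ACML (AMD), MKL (Intel), vecLib (part of Apple's Accelerate Framework), the open-source ATLAS, and the open-source OpenBLAS, which is probably the most widely used. There are many other implementations as well.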

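LAPACK (Linear Algebra PACKage), on the other hand, provides routines for problems such as least squares and for matrix decompositions such as the SVD (singular value decomposition). It was first released in 1992 in Fortran 77, and since version 3.2 (2008) it has been written in Fortran 90. ATLAS and OpenBLAS implement parts of LAPACK, and there are other implementations such as LAPACK++. [5]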
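
As a small illustration of the two problems named above, the sketch below solves a least-squares problem and computes a thin SVD with `numpy.linalg`, whose routines are backed by LAPACK. The matrix and right-hand side are random placeholders, not data from the benchmarks.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))  # tall matrix: 6 equations, 3 unknowns
b = rng.standard_normal(6)

# Least squares: find x minimizing ||A @ x - b||_2.
x, residuals, rank, singular_values = np.linalg.lstsq(A, b, rcond=None)

# SVD (thin form): A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rebuilt = U @ np.diag(s) @ Vt

# At the least-squares solution the residual is orthogonal to the columns of A.
normal_equation_residual = A.T @ (A @ x - b)

print(rank)
print(np.allclose(A_rebuilt, A))
print(np.allclose(normal_equation_residual, 0.0))
```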

Commands used for benchmarks

The following commands were used for the measurements (on Ubuntu):

  1. Switch BLAS: sudo update-alternatives --config libblas.so.3
  2. Switch LAPACK: sudo update-alternatives --config liblapack.so.3
  3. Check BLAS/LAPACK linkage with numpy: python -c "import numpy; numpy.__config__.show()"
  4. NumPy test (~30s [6]): python -c "import numpy; numpy.test()"
  5. SciPy test (~1m [6]): python -c "import scipy; scipy.test()"
  6. Theano test (~30m [6]): python -c "import theano; theano.test()"
    • GPU: THEANO_FLAGS=floatX=float32,device=gpu python -c "import theano; theano.test()" [7]
  7. BLAS test: running the following script, which ships with the Theano package, also prints the benchmarks that the Theano developers produced on their own machines:

     python `python -c "import os, theano; print os.path.dirname(theano.__file__)"`/misc/check_blas.py
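
The one-liner above relies on Python 2's `print` statement. As a rough Python 3 equivalent, the sketch below locates a file shipped inside an installed package without importing the package. The helper name `packaged_script_path` is made up for this post, and the `misc/check_blas.py` location is taken from the command above.

```python
import importlib.util
import os


def packaged_script_path(package, relative_path):
    """Return the path of a file shipped inside an installed package.

    Hypothetical helper: it only builds the path and does not check
    that the file exists.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ImportError("package %r is not installed" % package)
    # spec.origin points at the package's __init__.py.
    return os.path.join(os.path.dirname(spec.origin), relative_path)


if __name__ == "__main__":
    try:
        print(packaged_script_path("theano", os.path.join("misc", "check_blas.py")))
    except ImportError as exc:
        print(exc)
```

The printed path can then be passed to `python` in place of the backtick expression.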
    

Theano BLAS benchmarks:

These are 10 executions of gemm on 2000x2000 (M=N=K=2000) float64 matrices. All memory layout was in C order.

The CPUs tested were:

  • Xeon E5345 (2.33Ghz, 8M L2 cache, 1333Mhz FSB)
  • Xeon E5430 (2.66Ghz, 12M L2 cache, 1333Mhz FSB)
  • Xeon E5450 (3Ghz, 12M L2 cache, 1333Mhz FSB)
  • Core 2 E8500 (2.8Ghz, hyper-threads enabled)
  • Core i7 930 (2.8Ghz, hyper-threads enabled)
  • Core i7 950 (3.07GHz, hyper-threads enabled)
  • Xeon X5560 (2.8Ghz, 12M L2 cache, hyper-threads?)
  • Xeon X5550 (2.67GHz, 8M l2 cache?, hyper-threads enabled)

The libraries tested were:

  • numpy with ATLAS from distribution (FC9) package (1 thread)
  • manually compiled numpy and ATLAS with 2 threads
  • goto 1.26 with 1, 2, 4 and 8 threads
  • goto2 1.13 compiled with multiple threads enabled

The results for each CPU and library are as follows (the number after each library name is the number of threads used):

CPU columns, in order: Xeon E5345, Xeon E5430, Xeon E5450, Core 2 E8500, Core i7 930, Core i7 950, Xeon X5560, Xeon X5550

| Library/threads | Times (in CPU column order, empty cells omitted) |
|---|---|
| numpy 1.3.0 blas | 775.92s |
| numpy_FC9_atlas/1 | 39.2s, 35.0s, 30.7s, 29.6s, 21.5s, 19.60s |
| numpy_MAN_atlas/2 | 12.0s, 11.6s, 10.2s, 9.2s, 9.0s |
| goto/1 | 18.7s, 16.1s, 14.2s, 13.7s, 16.1s, 14.67s |
| goto/2 | 9.5s, 8.1s, 7.1s, 7.3s, 8.1s, 7.4s |
| goto/4 | 4.9s, 4.4s, 3.7s, -, 4.1s, 3.8s |
| goto/8 | 2.7s, 2.4s, 2.0s, -, 4.1s, 3.8s |
| openblas/1 | 14.04s |
| openblas/2 | 7.16s |
| openblas/4 | 3.71s |
| openblas/8 | 3.70s |
| mkl 11.0.083/1 | 7.97s |
| mkl 10.2.2.025/1 | 13.7s |
| mkl 10.2.2.025/2 | 7.6s |
| mkl 10.2.2.025/4 | 4.0s |
| mkl 10.2.2.025/8 | 2.0s |
| goto2 1.13/1 | 14.37s |
| goto2 1.13/2 | 7.26s |
| goto2 1.13/4 | 3.70s |
| goto2 1.13/8 | 1.94s |
| goto2 1.13/16 | 3.16s |
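
One way to read the multi-threaded rows is as speedup and parallel efficiency relative to the single-thread time. The sketch below does that arithmetic for the goto2 1.13 rows above; the function name `scaling` is just an illustrative choice.

```python
# Times copied from the goto2 1.13 rows above: threads -> seconds.
goto2_times = {1: 14.37, 2: 7.26, 4: 3.70, 8: 1.94, 16: 3.16}


def scaling(times):
    """Return (threads, speedup, efficiency) tuples relative to 1 thread."""
    base = times[1]
    rows = []
    for threads in sorted(times):
        speedup = base / times[threads]
        rows.append((threads, speedup, speedup / threads))
    return rows


for threads, speedup, efficiency in scaling(goto2_times):
    print("%2d threads: speedup %.2fx, efficiency %.0f%%"
          % (threads, speedup, 100 * efficiency))
```

For these numbers the speedup is about 7.4x at 8 threads and drops to about 4.5x at 16 threads, so using more threads than the sweet spot made this run slower.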

The GPU benchmarks are as follows (test time in float32):

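CUDA version columns, in order: 6.5, 6.0, 5.5, 5.0, 4.2, 4.1, 4.0, 3.2, 3.0

| GPU | Times (in CUDA version column order, empty cells omitted) | Note |
|---|---|---|
| K6000/NOECC | 0.06s, 0.06s | |
| K40 | 0.07s | |
| K20m/ECC | 0.08s, 0.08s, 0.07s | |
| K20/NOECC | 0.07s | |
| M2090 | 0.19s | |
| C2075 | 0.25s | |
| M2075 | 0.25s | |
| M2070 | 0.25s, 0.27s, 0.32s | |
| M2070-Q | 0.48s, 0.27s, 0.32s | |
| M2050(Amazon) | 0.25s | |
| C1060 | 0.46s | |
| K600 | 1.04s | |
| GTX Titan Black | 0.05s | |
| GTX Titan(D15U-50) | 0.06s, 0.06s, don't work | |
| GTX 780 | 0.06s | |
| GTX 980 | 0.06s | |
| GTX 970 | 0.08s | |
| GTX 680 | 0.11s, 0.12s, 0.154s, 0.218s | |
| GRID K520 | 0.14s | |
| GTX 580 | 0.16s, 0.16s, 0.164s, 0.203s | |
| GTX 480 | 0.19s, 0.19s, 0.192s, 0.237s, 0.27s | |
| GTX 750 Ti | 0.20s | |
| GTX 470 | 0.23s, 0.23s, 0.238s, 0.297s, 0.34s | |
| GTX 660 | 0.18s, 0.20s, 0.23s | |
| GTX 560 | 0.30s | |
| GTX 650 Ti | 0.27s | |
| GTX 765M | 0.27s | |
| GTX 460 | 0.37s, 0.45s | |
| GTX 285 | 0.42s, 0.452s, 0.452s, 0.40s | cuda 3.0 seems faster? driver version? |
| 750M | 0.49s | |
| GT 610 | 2.38s | |
| GTX 550 Ti | 0.57s | |
| GT 520 | 2.68s, 3.06s | |
| 520M | 2.44s, 3.19s | with bumblebee on Ubuntu 12.04 |
| GT 220 | 3.80s | |
| GT 210 | 6.35s | |
| 8500 GT | 10.68s | |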
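
The measurement behind these numbers (ten gemm calls on 2000x2000 matrices in C order) can be approximated with NumPy alone. The sketch below is not Theano's check_blas.py; it times `np.dot`, which dispatches to whatever BLAS NumPy is linked against, and converts a timing to GFLOP/s under the usual assumption of 2·n³ floating-point operations per product.

```python
import time

import numpy as np


def time_gemm(n=2000, iters=10, dtype=np.float64):
    """Time `iters` products of two n x n C-ordered matrices, in seconds."""
    rng = np.random.default_rng(0)
    a = np.ascontiguousarray(rng.random((n, n)), dtype=dtype)
    b = np.ascontiguousarray(rng.random((n, n)), dtype=dtype)
    c = np.empty((n, n), dtype=dtype)
    np.dot(a, b, out=c)  # warm-up so library start-up cost is not timed
    start = time.perf_counter()
    for _ in range(iters):
        np.dot(a, b, out=c)
    return time.perf_counter() - start


def gflops(n, iters, seconds):
    """Convert a gemm timing to GFLOP/s, counting 2*n**3 flops per product."""
    return 2.0 * n ** 3 * iters / seconds / 1e9


if __name__ == "__main__":
    elapsed = time_gemm(n=500, iters=10)  # smaller than 2000 to keep the demo quick
    print("500x500 x10: %.3fs, %.1f GFLOP/s" % (elapsed, gflops(500, 10, elapsed)))
```

With this conversion, the 1.94s reported for goto2 1.13 with 8 threads corresponds to roughly 82 GFLOP/s.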

My test results

Seven computers are compared in total, with the following hardware specs:

| Name | Description | OS | CPU | RAM | GPU |
|---|---|---|---|---|---|
| tigger | MacBook Air 13" (Early 2014) | Mac OS X 10.10.4 | Intel Core i5-4260U 1.4GHz | 4GB 1600MHz (DDR3) | Intel HD Graphics 5000 1536MB |
| playbook | MacBook Pro 15" (Mid 2014) | Mac OS X 10.10.4 | Intel Core i7-4770HQ 2.2GHz | 16GB 1600MHz (DDR3) | Intel Iris Pro 1536MB |
| joker | PC | Ubuntu 13.10 | Intel Xeon E3-1230 v3 3.30GHz | 8GB 1600MHz (DDR3) | - |
| dada | PC | Ubuntu 14.04.1 | Intel Pentium G620 2.6GHz | 8GB 1067MHz (DDR3) | - |
| daca | PC | Ubuntu 14.04.2 | Intel Core i7-3930K 3.2GHz | 32GB 1600MHz (DDR3) | - |
| merci | PC | Ubuntu 14.04.2 | Intel Core i7-5820K 3.3GHz | 32GB 2133MHz (DDR4) | NVIDIA GeForce GTX 980 |
| labpc | PC | Windows 7 | AMD Phenom II X3 720 2.8GHz | 16GB 2133MHz (DDR3)? | ATI Radeon HD 4850 |

The software/package versions installed on each machine are as follows:

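| Name | python | numpy | scipy | theano | cuda |
|---|---|---|---|---|---|
| tigger | 2.7.6 | 1.8.0rc1 | 0.13.0b1 | - | - |
| playbook | 2.7.10 | 1.9.2 | 0.15.1 | 0.7.0 | - |
| joker | 2.7.5 | 1.8.0 | - | - | - |
| dada | 2.7.6 | 1.9.1 | 0.16.0 | 0.7.0 | - |
| daca | 2.7.6 | 1.9.2 | 0.16.0 | 0.7.0 | - |
| merci | 2.7.6 | 1.9.2 | 0.16.0 | 0.7.0 | 7.0.27 |
| labpc | - | - | - | - | - |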

The table below is the final summary of the results (the numbers in parentheses are the number of tests):

| name | numpy (s) | scipy (s) | theano (s) | blas (s) |
|---|---|---|---|---|
| tigger | 109.813 | 275.232 | - | - |
| playbook | 22.767 (5557) | 145.634 (17005) | 5705.634 (2724) | 1.181 |
| joker | 22.279 | - | - | - |
| dada/blas | 26.335 (5580) | 195.697 (18456) | 4546.648 (2722) | 17.37 |
| daca/openblas | 18.965 (5593) | 148.614 (18456) | 3418.434 (2722) | 1.9490 |
| merci/blas | 24.243 (-) | 49.994 (-) | 2700.233 (-) | - |
| merci/openblas | 15.193 (5593) | 114.156 (18456) | 3447.037 (2722) | 2.76 |
| merci/openblas+cuda | 16.393 (5593) | 109.405 (18456) | 4165.916 (19844) - FAILED | 0.06 |
| labpc | 39.183 | - | - | - |

Below is the detailed log of the work done on merci, one of the machines listed above.

Step 1: Vanilla Ubuntu

$ sudo apt-get install python-dev python-pip python-nose g++ libopenblas-dev git
$ sudo apt-get install python-numpy     # 1.8.1
$ sudo apt-get install python-scipy     # 0.14.0
$ sudo pip install Theano               # 0.7.0
$ python -c "import numpy; numpy.__config__.show()"  # or, `from numpy.distutils.system_info import get_info; get_info('blas')`
blas_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_info:
    libraries = ['lapack']
    library_dirs = ['/usr/lib']
    language = f77
atlas_threads_info:
  NOT AVAILABLE
blas_opt_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_blas_threads_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
lapack_opt_info:
    libraries = ['lapack', 'blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE
  • NumPy (1.8.1): 24.243s, OK
  • SciPy (0.14.0): 49.994s, OK
  • Theano (0.7.0): 2700.233s, OK

Step 2: With OpenBLAS

$ sudo apt-get install libopenblas-dev
$ sudo apt-get purge python-numpy python-scipy      # http://stackoverflow.com/a/25326614/1054939 http://stackoverflow.com/q/29979539/1054939
$ sudo pip install numpy    # 1.9.2
$ sudo pip install scipy    # 0.16.0
$ sudo update-alternatives --config libblas.so.3    # /usr/lib/openblas-base/libblas.so.3
$ sudo update-alternatives --config liblapack.so.3  # /usr/lib/lapack/liblapack.so.3
$ python -c "import numpy; numpy.__config__.show()"
blas_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_info:
    libraries = ['lapack']
    library_dirs = ['/usr/lib']
    language = f77
atlas_threads_info:
  NOT AVAILABLE
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/lib']
    language = f77
openblas_info:
    libraries = ['openblas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
openblas_lapack_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE
  • NumPy (1.9.2): 15.193s (5593), OK
  • SciPy (0.16.0): 114.156s (18456), OK
  • Theano (0.7.0): 3447.037s (2722), OK
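
`numpy.__config__.show()` reports what NumPy was built against. Which shared library actually gets loaded at run time can also depend on the `update-alternatives` selection, so it can be worth checking the running process. The sketch below is a Linux-only approach that scans `/proc/self/maps`; the function names and the list of name fragments (`blas`, `lapack`, `atlas`, `mkl`) are my own assumptions, not anything NumPy provides.

```python
import re

# Assumed name fragments that identify BLAS/LAPACK-like shared libraries.
_LIB_PATTERN = re.compile(r"(/\S*(?:blas|lapack|atlas|mkl)\S*\.so\S*)", re.IGNORECASE)


def blas_libraries(maps_text):
    """Extract BLAS/LAPACK-like library paths from /proc/<pid>/maps text."""
    found = set()
    for line in maps_text.splitlines():
        match = _LIB_PATTERN.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)


def loaded_blas_libraries():
    """Import NumPy, then list matching libraries mapped into this process (Linux only)."""
    import numpy  # noqa: F401  (importing loads the linked BLAS)
    with open("/proc/self/maps") as maps:
        return blas_libraries(maps.read())
```

Calling `loaded_blas_libraries()` in a Python session on the machine in question prints nothing by itself; print or inspect the returned list to see the paths.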

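As for the large difference in SciPy test time, I asked on the #scipy IRC channel and was told that it is likely due to the different number of tests between versions rather than the performance of BLAS or the libraries themselves.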

10:27 PM <e9t_> I'm using Ubuntu 14.04, and installed numpy+scipy with apt-get. Then I installed libopenblas-dev, purged numpy+scipy and reinstalled them with pip. But the test results are peculiar.
10:27 PM <e9t_> - numpy.test(): 24s -> 15s (decreased. great!)
10:27 PM <e9t_> - scipy.test(): 50s -> 114s (increased. why?)
10:27 PM <e9t_> Anyone know the reason?
10:34 PM <jtaylor> e9t_: likely the version difference, not blas
10:35 PM <jtaylor> scipy simply added more tests
10:35 PM <jtaylor> numpy too, but numpy also reduced the time the tests takes in recent versions
10:35 PM <jtaylor> the number of tests should be printed too

Step 3: With CUDA

  1. Pre-installation: install the NVIDIA toolkit

     $ lspci | grep -i nvidia
     $ sudo apt-get install nvidia-346  # nvidia-current installed driver 304.125 which resulted in API mismatch
     $ sudo apt-get install nvidia-cuda-toolkit
     $ nvidia-smi
     NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
    
  2. Install cuda

    NOTE: While running apt-get install cuda(?), the warning "WARNING - No MPI compiler found." appeared. MPI is the message passing interface, and since this did not look like a serious problem I took no further action for now.

     $ wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_7.0-28_amd64.deb
     $ sudo dpkg -i cuda-repo-ubuntu1404_7.0-28_amd64.deb
     $ sudo apt-get update
     $ sudo apt-get install cuda
     $ sudo reboot 0
     $ nvidia-smi
     Sat Sep  5 05:07:26 2015
     +------------------------------------------------------+
     | NVIDIA-SMI 346.82     Driver Version: 346.82         |
     |-------------------------------+----------------------+----------------------+
     | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
     | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
     |===============================+======================+======================|
     |   0  GeForce GTX 980     Off  | 0000:01:00.0     N/A |                  N/A |
     |  0%   42C    P0    N/A /  N/A |     15MiB /  4095MiB |     N/A      Default |
     +-------------------------------+----------------------+----------------------+
    
     +-----------------------------------------------------------------------------+
     | Processes:                                                       GPU Memory |
     |  GPU       PID  Type  Process name                               Usage      |
     |=============================================================================|
     |    0            C+G   Not Supported                                         |
     +-----------------------------------------------------------------------------+
    
  3. Post-installation: add to PATH and test

     $ echo 'export PATH=/usr/local/cuda-7.0/bin:$PATH' >> ~/.bash_aliases   # single quotes so $PATH is expanded at login, not now
     $ echo 'export LD_LIBRARY_PATH=/usr/local/cuda-7.0/lib64:$LD_LIBRARY_PATH' >> ~/.bash_aliases
     $ source ~/.bash_aliases
     $ cuda-install-samples-7.0.sh ~/tmp
     $ cat /proc/driver/nvidia/version
     NVRM version: NVIDIA UNIX x86_64 Kernel Module  346.82  Wed Jun 17 10:37:46 PDT 2015
     GCC version:  gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04)
     $ nvcc -V
     nvcc: NVIDIA (R) Cuda compiler driver
     Copyright (c) 2005-2015 NVIDIA Corporation
     Built on Mon_Feb_16_22:59:02_CST_2015
     Cuda compilation tools, release 7.0, V7.0.27
     $ cd ~/tmp/NVIDIA_CUDA-7.0_Samples
     $ make
     $ ./bin/x86_64/linux/release/deviceQuery
     ./bin/x86_64/linux/release/deviceQuery Starting...
    
      CUDA Device Query (Runtime API) version (CUDART static linking)
    
     Detected 1 CUDA Capable device(s)
    
     Device 0: "GeForce GTX 980"
       CUDA Driver Version / Runtime Version          7.0 / 7.0
       CUDA Capability Major/Minor version number:    5.2
       Total amount of global memory:                 4096 MBytes (4294639616 bytes)
       (16) Multiprocessors, (128) CUDA Cores/MP:     2048 CUDA Cores
       GPU Max Clock rate:                            1329 MHz (1.33 GHz)
       Memory Clock rate:                             3505 Mhz
       Memory Bus Width:                              256-bit
       L2 Cache Size:                                 2097152 bytes
       Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
       Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
       Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
       Total amount of constant memory:               65536 bytes
       Total amount of shared memory per block:       49152 bytes
       Total number of registers available per block: 65536
       Warp size:                                     32
       Maximum number of threads per multiprocessor:  2048
       Maximum number of threads per block:           1024
       Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
       Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
       Maximum memory pitch:                          2147483647 bytes
       Texture alignment:                             512 bytes
       Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
       Run time limit on kernels:                     Yes
       Integrated GPU sharing Host Memory:            No
       Support host page-locked memory mapping:       Yes
       Alignment requirement for Surfaces:            Yes
       Device has ECC support:                        Disabled
       Device supports Unified Addressing (UVA):      Yes
       Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
       Compute Mode:
          < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    
     deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GeForce GTX 980
     Result = PASS
     $ ./bin/x86_64/linux/release/bandwidthTest
     [CUDA Bandwidth Test] - Starting...
     Running on...
    
      Device 0: GeForce GTX 980
      Quick Mode
    
      Host to Device Bandwidth, 1 Device(s)
      PINNED Memory Transfers
        Transfer Size (Bytes)        Bandwidth(MB/s)
        33554432                     12164.5
    
      Device to Host Bandwidth, 1 Device(s)
      PINNED Memory Transfers
        Transfer Size (Bytes)        Bandwidth(MB/s)
        33554432                     12896.5
    
      Device to Device Bandwidth, 1 Device(s)
      PINNED Memory Transfers
        Transfer Size (Bytes)        Bandwidth(MB/s)
        33554432                     164863.5
    
     Result = PASS
    
     NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
    
  • NumPy (1.9.2): 16.393 (5593), OK
  • SciPy (0.16.0): 109.405 (18456), OK
  • Theano (0.7.0): 4165.916 (19844), FAILED
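
Before running the Theano tests on the GPU, it can help to confirm that the PATH settings from the post-installation step actually took effect in the current shell. The sketch below is one hypothetical way to check this from Python; the function name is made up, and the default directories are simply the cuda-7.0 paths used above.

```python
import os
import shutil


def cuda_environment(cuda_bin="/usr/local/cuda-7.0/bin",
                     cuda_lib="/usr/local/cuda-7.0/lib64",
                     environ=None):
    """Report whether the CUDA directories are on PATH / LD_LIBRARY_PATH."""
    environ = os.environ if environ is None else environ
    path_dirs = environ.get("PATH", "").split(os.pathsep)
    lib_dirs = environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    return {
        "bin_on_path": cuda_bin in path_dirs,
        "lib_on_ld_library_path": cuda_lib in lib_dirs,
        # Full path of nvcc if it can be found on PATH, else None.
        "nvcc": shutil.which("nvcc", path=environ.get("PATH", "")),
    }


if __name__ == "__main__":
    for key, value in cuda_environment().items():
        print("%s: %s" % (key, value))
```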

Step 4: Bleeding edge Theano

$ git clone https://github.com/Theano/Theano.git   # GitHub no longer serves the git:// protocol
$ cd Theano
$ python setup.py develop   # 0.7.0.dev-c042a9c49ac6516b74668747d1e6e6bbe832efba
$ THEANO_FLAGS=init_gpu_device=gpu0,device=cpu,floatX=float32 python -c "import theano; theano.test()"
WARNING (theano.sandbox.cuda): GPU device gpu0 will be initialized, and used if a GPU is needed. However, no computation, nor shared variables, will be implicitly moved to that device. If you want
that behavior, use the 'device' flag instead.
Using gpu device 0: GeForce GTX 980 (CNMeM is enabled)
Theano version 0.7.0.dev-c042a9c49ac6516b74668747d1e6e6bbe832efba
theano is installed in /home/epark/pkgs/Theano/theano
NumPy version 1.9.2
NumPy is installed in /usr/local/lib/python2.7/dist-packages/numpy
Python version 2.7.6 (default, Jun 22 2015, 17:58:13) [GCC 4.8.2]
nose version 1.3.1
......00001     #include <Python.h>
00002   #include <iostream>
00003   #include "theano_mod_helper.h"

... # the full log is available at [this link](http://pastebin.com/WSqrQkYA)

----------------------------------------------------------------------
Ran 19844 tests in 4093.222s

FAILED (KNOWNFAIL=18, SKIP=69, errors=218, failures=192)
  • Theano (0.7.0): 4093.222 (19844), FAILED [8]
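
Since both the elapsed time and the number of tests matter when comparing runs, it is handy to pull them out of the nose summary automatically. The sketch below parses the two summary lines shown in the log above; the function name and the returned dictionary layout are my own choices.

```python
import re


def parse_nose_summary(log_text):
    """Extract test count, elapsed seconds and result counts from nose output."""
    ran = re.search(r"Ran (\d+) tests? in ([\d.]+)s", log_text)
    if ran is None:
        raise ValueError("no 'Ran N tests in Ts' line found")
    tests, seconds = int(ran.group(1)), float(ran.group(2))

    counts = {}
    status = re.search(r"^(OK|FAILED)(?: \((.*)\))?", log_text, re.MULTILINE)
    if status and status.group(2):
        for item in status.group(2).split(", "):
            key, value = item.split("=")
            counts[key] = int(value)

    return {
        "tests": tests,
        "seconds": seconds,
        "status": status.group(1) if status else None,
        "counts": counts,
        "seconds_per_test": seconds / tests if tests else None,
    }


log = """Ran 19844 tests in 4093.222s

FAILED (KNOWNFAIL=18, SKIP=69, errors=218, failures=192)
"""
summary = parse_nose_summary(log)
print(summary["tests"], summary["status"], summary["counts"]["errors"])
```

For this run that works out to about 0.21s per test, versus about 1.27s per test for the CPU-only run (3447.037s for 2722 tests), which shows how misleading the raw totals are when the test counts differ.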

Some random comments

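  • When reading test results, it is important to know not only 1) the test time but also 2) the package version and 3) the number of tests. Even for the same version, the number of tests can differ when run in a different environment (e.g., OS, GPU).
  • On Ubuntu, installing numpy/scipy with apt-get gives you an older version (e.g., 1.8.2 on 14.04) unless you use a PPA. You need to install with pip to get a more recent version (e.g., 1.9.2 in my case). Since speed can differ between versions, I recommend installing with pip.
  • On dada and daca, with Python 2.7.6, NumPy 1.9.1, and SciPy 0.13.3 installed, the SciPy tests always failed. On daca in particular, the NumPy tests even segfaulted. It is possible that upgrading to the latest versions resolved the problem.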
  1. Some other benchmark codes for BLAS

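  2. Octave seems to be less well known than I expected; think of it as an open-source version of MATLAB.

  3. Much as your understanding of segfaults deepens once you start opening up the Linux kernel or reading core dumps.

  4. For more detail on what each function name means, the Intel Developer Zone is a good guide.

  5. Handle different versions of BLAS and LAPACK

  6. The test times for NumPy, SciPy, and Theano are given on the Theano website.

  7. Alternatively, use .theanorc.

  8. I posted a question on Theano users and was told that failing tests are not really a problem when using the GPU…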
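
For footnote 7, the same settings passed through THEANO_FLAGS in the GPU test command can be kept in a `.theanorc` file, which uses an INI layout with a `[global]` section. The sketch below generates that text with Python's `configparser`; the function name is made up, and the values mirror the `floatX=float32,device=gpu` flags used earlier.

```python
import configparser
import io


def theanorc_text(floatX="float32", device="gpu"):
    """Build the contents of a minimal .theanorc as a string."""
    config = configparser.ConfigParser()
    config.optionxform = str  # keep the capital X in floatX (default lowercases keys)
    config["global"] = {"floatX": floatX, "device": device}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


print(theanorc_text())
```

Writing the returned text to `~/.theanorc` is left as a manual step so that an existing file is not overwritten by accident.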