BLAS/LAPACK benchmarks with NumPy, SciPy and Theano
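To celebrate buying a new computer recently, I ran BLAS/LAPACK benchmarks on several of the machines I have.1
BLAS (Basic Linear Algebra Subprograms) is a specification for vector and matrix operations, and it is used by a wide range of software including GNU Octave2, Mathematica, NumPy, R, and LAPACK (described below). Configuring and handling BLAS well can make the same code run several times faster, so it is worth understanding properly.3
BLAS is organized into the following three levels (the number in parentheses is the year of publication) 4: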
- Level 1 (1979): scalar-vector and vector-vector operations
  - For example, `daxpy` literally performs "double precision scalar $a$ times vector $x$ plus vector $y$" and assigns the result to $y$: $\mathbf{y} \leftarrow a \mathbf{x} + \mathbf{y}$
- Level 2 (1988): matrix-vector operations
  - For example, `sgemv` literally computes the "single precision general matrix-vector product": $\mathbf{y} \leftarrow a \mathbf{Ax} + b \mathbf{y}$
- Level 3 (1990): matrix-matrix operations
  - For example, `gemm` computes the "general matrix-matrix product": $\mathbf{C} \leftarrow a \mathbf{AB} + b \mathbf{C}$
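To make the three formulas concrete, here is a minimal plain-Python sketch of what one routine per level computes. The function names `axpy`, `gemv`, and `gemm` are my own stand-ins that return new lists instead of overwriting their arguments; real BLAS implementations update `y` or `C` in place and are vastly faster, so this only illustrates the semantics.

```python
# Plain-Python reference semantics for one routine per BLAS level.
# Illustration only: real BLAS works in place on contiguous arrays.

def axpy(a, x, y):
    """Level 1: return a*x + y for vectors x and y."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def gemv(a, A, x, b, y):
    """Level 2: return a*(A x) + b*y for matrix A and vectors x, y."""
    return [a * sum(Aij * xj for Aij, xj in zip(row, x)) + b * yi
            for row, yi in zip(A, y)]

def gemm(a, A, B, b, C):
    """Level 3: return a*(A B) + b*C for matrices A, B, C."""
    cols_B = list(zip(*B))  # columns of B, so each entry is a dot product
    return [[a * sum(Aik * Bkj for Aik, Bkj in zip(row, col)) + b * Cij
             for col, Cij in zip(cols_B, C_row)]
            for row, C_row in zip(A, C)]

A = [[1.0, 2.0], [3.0, 4.0]]
I = [[1.0, 0.0], [0.0, 1.0]]
print(axpy(2.0, [1.0, 2.0], [10.0, 20.0]))        # [12.0, 24.0]
print(gemv(1.0, A, [1.0, 1.0], 0.0, [0.0, 0.0]))  # [3.0, 7.0]
print(gemm(1.0, A, I, 1.0, A))                    # [[2.0, 4.0], [6.0, 8.0]]
```

The amount of arithmetic per element of input grows with the level, which is why Level 3 routines such as `gemm` benefit the most from a tuned implementation.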
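BLAS implementations include NVIDIA's cuBLAS for CUDA, AMD's ACML, Intel's MKL, vecLib (part of Apple's Accelerate Framework), the open-source ATLAS, and the open-source OpenBLAS, which is probably the most widely used; there are many others as well.
LAPACK (Linear Algebra PACKage), meanwhile, is a specification that implements least squares and matrix decompositions such as the SVD (singular value decomposition). It was first released in 1992 in FORTRAN 77 and moved to Fortran 90 with version 3.2 in 2008. ATLAS and OpenBLAS implement parts of LAPACK, and there are other implementations such as LAPACK++.5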
Commands used for benchmarks
The following commands were used for the measurements (on Ubuntu):
- Switch BLAS:
sudo update-alternatives --config libblas.so.3
- Switch LAPACK:
sudo update-alternatives --config liblapack.so.3
- Check BLAS/LAPACK linkage with numpy:
python -c "import numpy; numpy.__config__.show()"
- NumPy test (~30s 6):
python -c "import numpy; numpy.test()"
- SciPy test (~1m 6):
python -c "import scipy; scipy.test()"
- Theano test (~30m 6):
python -c "import theano; theano.test()"
- GPU 7:
THEANO_FLAGS=floatX=float32,device=gpu python -c "import theano; theano.test()"
BLAS test: running the following script, which ships with the Theano package, also prints the benchmarks that the Theano developers collected on their own machines:
python `python -c "import os, theano; print os.path.dirname(theano.__file__)"`/misc/check_blas.py
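For a quick check without Theano, something similar can be approximated with NumPy alone. The sketch below is not Theano's `check_blas.py`; it is my own rough equivalent, written for Python 3, and the helper name `time_gemm` and the untimed warm-up call are my choices. It assumes NumPy is linked against a BLAS library, so that `np.dot` on two 2-D float arrays ends up in `gemm`.

```python
import time
import numpy as np

def time_gemm(n=2000, iters=10, dtype=np.float64):
    """Return total wall-clock seconds for `iters` n-by-n matrix products."""
    rng = np.random.RandomState(0)
    a = rng.rand(n, n).astype(dtype)  # NumPy arrays are C-ordered by default
    b = rng.rand(n, n).astype(dtype)
    np.dot(a, b)                      # warm-up call, not timed
    start = time.perf_counter()
    for _ in range(iters):
        np.dot(a, b)                  # handled by the linked BLAS, if any
    return time.perf_counter() - start

# A small size keeps the demo fast; the benchmark in this post uses n=2000.
print("10 x gemm(500x500, float64): %.3fs" % time_gemm(n=500))
```

Running it before and after switching BLAS with `update-alternatives` should show whether the switch actually took effect for NumPy.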
Theano BLAS benchmarks:
`gemm` was run 10 times on 2000x2000 (M=N=K=2000) float64 matrices (all memory layouts were in C order). The CPUs used for the tests were:
- Xeon E5345 (2.33GHz, 8M L2 cache, 1333MHz FSB)
- Xeon E5430 (2.66GHz, 12M L2 cache, 1333MHz FSB)
- Xeon E5450 (3GHz, 12M L2 cache, 1333MHz FSB)
- Core 2 E8500 (2.8GHz, hyper-threads enabled)
- Core i7 930 (2.8GHz, hyper-threads enabled)
- Core i7 950 (3.07GHz, hyper-threads enabled)
- Xeon X5560 (2.8GHz, 12M L2 cache, hyper-threads?)
- Xeon X5550 (2.67GHz, 8M L2 cache?, hyper-threads enabled)
and the libraries were:
- numpy with ATLAS from distribution (FC9) package (1 thread)
- manually compiled numpy and ATLAS with 2 threads
- goto 1.26 with 1, 2, 4 and 8 threads
- goto2 1.13 compiled with multiple threads enabled
The results for each CPU and library are as follows (the number after each library name is the number of threads used):
CPU (column order): Xeon E5345, Xeon E5430, Xeon E5450, Core 2 E8500, Core i7 930, Core i7 950, Xeon X5560, Xeon X5550
- numpy 1.3.0 blas: 775.92s
- numpy_FC9_atlas/1: 39.2s, 35.0s, 30.7s, 29.6s, 21.5s, 19.60s
- numpy_MAN_atlas/2: 12.0s, 11.6s, 10.2s, 9.2s, 9.0s
- goto/1: 18.7s, 16.1s, 14.2s, 13.7s, 16.1s, 14.67s
- goto/2: 9.5s, 8.1s, 7.1s, 7.3s, 8.1s, 7.4s
- goto/4: 4.9s, 4.4s, 3.7s, -, 4.1s, 3.8s
- goto/8: 2.7s, 2.4s, 2.0s, -, 4.1s, 3.8s
- openblas/1: 14.04s
- openblas/2: 7.16s
- openblas/4: 3.71s
- openblas/8: 3.70s
- mkl 11.0.083/1: 7.97s
- mkl 10.2.2.025/1: 13.7s
- mkl 10.2.2.025/2: 7.6s
- mkl 10.2.2.025/4: 4.0s
- mkl 10.2.2.025/8: 2.0s
- goto2 1.13/1: 14.37s
- goto2 1.13/2: 7.26s
- goto2 1.13/4: 3.70s
- goto2 1.13/8: 1.94s
- goto2 1.13/16: 3.16s

The GPU benchmarks are as follows (test time in float32):
CUDA version (column order): 6.5, 6.0, 5.5, 5.0, 4.2, 4.1, 4.0, 3.2, 3.0, note
- K6000/NOECC: 0.06s, 0.06s
- K40: 0.07s
- K20m/ECC: 0.08s, 0.08s, 0.07s
- K20/NOECC: 0.07s
- M2090: 0.19s
- C2075: 0.25s
- M2075: 0.25s
- M2070: 0.25s, 0.27s, 0.32s
- M2070-Q: 0.48s, 0.27s, 0.32s
- M2050(Amazon): 0.25s
- C1060: 0.46s
- K600: 1.04s
- GTX Titan Black: 0.05s
- GTX Titan(D15U-50): 0.06s, 0.06s, don't work
- GTX 780: 0.06s
- GTX 980: 0.06s
- GTX 970: 0.08s
- GTX 680: 0.11s, 0.12s, 0.154s, 0.218s
- GRID K520: 0.14s
- GTX 580: 0.16s, 0.16s, 0.164s, 0.203s
- GTX 480: 0.19s, 0.19s, 0.192s, 0.237s, 0.27s
- GTX 750 Ti: 0.20s
- GTX 470: 0.23s, 0.23s, 0.238s, 0.297s, 0.34s
- GTX 660: 0.18s, 0.20s, 0.23s
- GTX 560: 0.30s
- GTX 650 Ti: 0.27s
- GTX 765M: 0.27s
- GTX 460: 0.37s, 0.45s
- GTX 285: 0.42s, 0.452s, 0.452s, 0.40s (note: cuda 3.0 seems faster? driver version?)
- 750M: 0.49s
- GT 610: 2.38s
- GTX 550 Ti: 0.57s
- GT 520: 2.68s, 3.06s
- 520M: 2.44s, 3.19s (note: with bumblebee on Ubuntu 12.04)
- GT 220: 3.80s
- GT 210: 6.35s
- 8500 GT: 10.68s
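Raw seconds are hard to compare across problem sizes, so it can help to convert them to throughput. The sketch below assumes each reported time is the total for all 10 `gemm` calls on 2000x2000 matrices and uses the conventional count of 2·M·N·K floating-point operations per call; the function name `gemm_gflops` is my own. Keep in mind that the CPU numbers are float64 while the GPU numbers are float32, so the two are not directly comparable.

```python
def gemm_gflops(total_seconds, m=2000, n=2000, k=2000, iters=10):
    """Convert the total time for `iters` gemm calls into GFLOP/s.

    Counts 2*m*n*k floating-point operations per gemm: one multiply
    and one add for each term of each inner product.
    """
    flops = 2.0 * m * n * k * iters
    return flops / total_seconds / 1e9

# Times taken from the benchmark listings in this post.
for label, seconds in [("goto2 1.13/8", 1.94),
                       ("openblas/4", 3.71),
                       ("GTX 980", 0.06)]:
    print("%-14s %8.1f GFLOP/s" % (label, gemm_gflops(seconds)))
```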
My test results
Seven machines were compared in total; their hardware specs are as follows:
Name | Description | OS | CPU | RAM | GPU |
---|---|---|---|---|---|
tigger | MacBook Air 13" (Early 2014) | Mac OS X 10.10.4 | Intel Core i5-4260U 1.4GHz | 4GB 1600MHz (DDR3) | |
playbook | MacBook Pro 15" (Mid 2014) | Mac OS X 10.10.4 | Intel Core i7-4770HQ 2.2GHz | 16GB 1600MHz (DDR3) | |
joker | PC | Ubuntu 13.10 | Intel Xeon E3-1230 v3 3.30GHz | 8GB 1600MHz (DDR3) | - |
dada | PC | Ubuntu 14.04.1 | Intel Pentium G620 2.6GHz | 8GB 1067MHz (DDR3) | - |
daca | PC | Ubuntu 14.04.2 | Intel Core i7-3930K 3.2GHz | 32GB 1600MHz (DDR3) | - |
merci | PC | Ubuntu 14.04.2 | Intel Core i7-5820K 3.3GHz | 32GB 2133 MHz (DDR4) | NVIDIA GeForce GTX 980 |
labpc | PC | Windows 7 | AMD Phenom II X3 720 2.8GHz | 16GB 2133 MHz (DDR3)? | |
The software/package versions installed on each machine are as follows:
Name | python | numpy | scipy | theano | cuda |
---|---|---|---|---|---|
tigger | 2.7.6 | 1.8.0rc1 | 0.13.0b1 | - | - |
playbook | 2.7.10 | 1.9.2 | 0.15.1 | 0.7.0 | - |
joker | 2.7.5 | 1.8.0 | - | - | - |
dada | 2.7.6 | 1.9.1 | 0.16.0 | 0.7.0 | - |
daca | 2.7.6 | 1.9.2 | 0.16.0 | 0.7.0 | - |
merci | 2.7.6 | 1.9.2 | 0.16.0 | 0.7.0 | 7.0.27 |
labpc | - | - | - | - | - |
The final summary table (times in seconds; the number in parentheses is the number of tests):
name | numpy | scipy | theano | blas |
---|---|---|---|---|
tigger | 109.813 | 275.232 | - | - |
playbook | 22.767 (5557) | 145.634 (17005) | 5705.634 (2724) | 1.181 |
joker | 22.279 | - | - | - |
dada/blas | 26.335 (5580) | 195.697 (18456) | 4546.648 (2722) | 17.37 |
daca/openblas | 18.965 (5593) | 148.614 (18456) | 3418.434 (2722) | 1.9490 |
merci/blas | 24.243 (-) | 49.994 (-) | 2700.233 (-) | - |
merci/openblas | 15.193 (5593) | 114.156 (18456) | 3447.037 (2722) | 2.76 |
merci/openblas+cuda | 16.393 (5593) | 109.405 (18456) | 4165.916 (19844) - FAILED | 0.06 |
labpc | 39.183 | - | - | - |
Below is the detailed log of the work done on merci, one of the machines listed above.
Step 1: Vanilla Ubuntu
$ sudo apt-get install python-dev python-pip python-nose g++ libopenblas-dev git
$ sudo apt-get install python-numpy # 1.8.1
$ sudo apt-get install python-scipy # 0.14.0
$ sudo pip install Theano # 0.7.0
$ python -c "import numpy; numpy.__config__.show()" # or, `from numpy.distutils.system_info import get_info; get_info('blas')`
blas_info:
libraries = ['blas']
library_dirs = ['/usr/lib']
language = f77
lapack_info:
libraries = ['lapack']
library_dirs = ['/usr/lib']
language = f77
atlas_threads_info:
NOT AVAILABLE
blas_opt_info:
libraries = ['blas']
library_dirs = ['/usr/lib']
language = f77
define_macros = [('NO_ATLAS_INFO', 1)]
atlas_blas_threads_info:
NOT AVAILABLE
openblas_info:
NOT AVAILABLE
lapack_opt_info:
libraries = ['lapack', 'blas']
library_dirs = ['/usr/lib']
language = f77
define_macros = [('NO_ATLAS_INFO', 1)]
atlas_info:
NOT AVAILABLE
lapack_mkl_info:
NOT AVAILABLE
blas_mkl_info:
NOT AVAILABLE
atlas_blas_info:
NOT AVAILABLE
mkl_info:
NOT AVAILABLE
- NumPy (1.8.1): 24.243s, OK
- SciPy (0.14.0): 49.994s, OK
- Theano (0.7.0): 2700.233s, OK
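The figures here appear to be the durations printed by the test runner itself. If you would rather measure the wall-clock time of the whole command, a stdlib-only Python 3 sketch could look like the following; `time_command` is a hypothetical helper of mine, and its result includes interpreter start-up and import time, so it will be slightly larger than the runner's own number.

```python
import subprocess
import sys
import time

def time_command(argv):
    """Run argv as a subprocess and return (wall-clock seconds, exit code)."""
    start = time.perf_counter()
    proc = subprocess.run(argv)
    return time.perf_counter() - start, proc.returncode

# Swap the snippet for "import numpy; numpy.test()" to time a real suite.
seconds, code = time_command([sys.executable, "-c", "print('hello')"])
print("%.3fs, exit code %d" % (seconds, code))
```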
Step 2: With OpenBLAS
$ sudo apt-get install libopenblas-dev
$ sudo apt-get purge python-numpy python-scipy # http://stackoverflow.com/a/25326614/1054939 http://stackoverflow.com/q/29979539/1054939
$ sudo pip install numpy # 1.9.2
$ sudo pip install scipy # 0.16.0
$ sudo update-alternatives --config libblas.so.3 # /usr/lib/openblas-base/libblas.so.3
$ sudo update-alternatives --config liblapack.so.3 # /usr/lib/lapack/liblapack.so.3
$ python -c "import numpy; numpy.__config__.show()"
blas_info:
libraries = ['blas']
library_dirs = ['/usr/lib']
language = f77
lapack_info:
libraries = ['lapack']
library_dirs = ['/usr/lib']
language = f77
atlas_threads_info:
NOT AVAILABLE
blas_opt_info:
libraries = ['openblas']
library_dirs = ['/usr/lib']
language = f77
openblas_info:
libraries = ['openblas']
library_dirs = ['/usr/lib']
language = f77
lapack_opt_info:
libraries = ['lapack', 'blas']
library_dirs = ['/usr/lib']
language = f77
define_macros = [('NO_ATLAS_INFO', 1)]
openblas_lapack_info:
NOT AVAILABLE
lapack_mkl_info:
NOT AVAILABLE
atlas_3_10_threads_info:
NOT AVAILABLE
atlas_info:
NOT AVAILABLE
atlas_3_10_info:
NOT AVAILABLE
blas_mkl_info:
NOT AVAILABLE
mkl_info:
NOT AVAILABLE
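Comparing two of these dumps by eye is tedious. The key difference from Step 1 is that `blas_opt_info` now lists `openblas`. As a rough aid, the sketch below parses text in the layout shown above into a dictionary. `parse_numpy_config` is a hypothetical helper of mine, and it assumes exactly this older text layout; newer NumPy releases print their configuration differently.

```python
import ast

def parse_numpy_config(text):
    """Parse `numpy.__config__.show()`-style text into {section: {key: value}}."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            # A section header such as "blas_opt_info:"
            current = sections.setdefault(line[:-1], {})
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            try:
                current[key.strip()] = ast.literal_eval(value.strip())
            except (ValueError, SyntaxError):
                current[key.strip()] = value.strip()  # bare word, e.g. f77
        # "NOT AVAILABLE" lines simply leave the section empty
    return sections

sample = """\
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/lib']
    language = f77
openblas_lapack_info:
  NOT AVAILABLE
"""
cfg = parse_numpy_config(sample)
print(cfg["blas_opt_info"]["libraries"])  # ['openblas']
print(cfg["openblas_lapack_info"])        # {}
```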
- NumPy (1.9.2): 15.193s (5593), OK
- SciPy (0.16.0): 114.156s (18456), OK
- Theano (0.7.0): 3447.037s (2722), OK
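The SciPy test time differs a lot between the two steps. When I asked on the #scipy IRC channel, I was told it is probably due to the number of tests differing between versions rather than the performance of BLAS or the libraries themselves.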
10:27 PM <e9t_> I'm using Ubuntu 14.04, and installed numpy+scipy with apt-get. Then I installed libopenblas-dev, purged numpy+scipy and reinstalled them with pip. But the test results are peculiar.
10:27 PM <e9t_> - numpy.test(): 24s -> 15s (decreased. great!)
10:27 PM <e9t_> - scipy.test(): 50s -> 114s (increased. why?)
10:27 PM <e9t_> Anyone know the reason?
10:34 PM <jtaylor> e9t_: likely the version difference, not blas
10:35 PM <jtaylor> scipy simply added more tests
10:35 PM <jtaylor> numpy too, but numpy also reduced the time the tests takes in recent versions
10:35 PM <jtaylor> the number of tests should be printed too
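Following that advice, one crude way to compare runs with different test counts is to divide the total time by the number of tests. The sketch below does this for the Step 2 results; `ms_per_test` is my own helper name. Individual tests vary enormously in cost, so this is only a sanity check, and the Step 1 run cannot be normalized this way because its test counts were not recorded.

```python
# (label, total seconds, number of tests) from the Step 2 results above
runs = [
    ("numpy 1.9.2 / openblas", 15.193, 5593),
    ("scipy 0.16.0 / openblas", 114.156, 18456),
    ("theano 0.7.0 / openblas", 3447.037, 2722),
]

def ms_per_test(seconds, n_tests):
    """Average milliseconds per test for one test-suite run."""
    return 1000.0 * seconds / n_tests

for label, seconds, n_tests in runs:
    print("%-24s %9.2f ms/test" % (label, ms_per_test(seconds, n_tests)))
```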
Step 3: With CUDA
Pre-installation: installing the NVIDIA toolkit
$ lspci | grep -i nvidia
$ sudo apt-get install nvidia-346 # nvidia-current installed driver 304.125 which resulted in API mismatch
$ sudo apt-get install nvidia-cuda-toolkit
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
NOTE: While running `apt-get install cuda` (?), the warning `WARNING - No MPI compiler found.` appeared. MPI is the message passing interface; it did not look like a serious problem, so I took no action for now.
$ wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_7.0-28_amd64.deb
$ sudo dpkg -i cuda-repo-ubuntu1404_7.0-28_amd64.deb
$ sudo apt-get update
$ sudo apt-get install cuda
$ sudo reboot 0
$ nvidia-smi
Sat Sep 5 05:07:26 2015
+------------------------------------------------------+
| NVIDIA-SMI 346.82 Driver Version: 346.82 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 980 Off | 0000:01:00.0 N/A | N/A |
| 0% 42C P0 N/A / N/A | 15MiB / 4095MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 C+G Not Supported |
+-----------------------------------------------------------------------------+
Post-installation: setting PATH and testing
$ echo "export PATH=/usr/local/cuda-7.0/bin:$PATH" >> ~/.bash_aliases
$ echo "export LD_LIBRARY_PATH=/usr/local/cuda-7.0/lib64:$LD_LIBRARY_PATH" >> ~/.bash_aliases
$ source ~/.bash_aliases
$ cuda-install-samples-7.0.sh ~/tmp
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 346.82 Wed Jun 17 10:37:46 PDT 2015
GCC version: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04)
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Mon_Feb_16_22:59:02_CST_2015
Cuda compilation tools, release 7.0, V7.0.27
$ cd ~/tmp/NVIDIA_CUDA-7.0_Samples
$ make
$ ./bin/x86_64/linux/release/deviceQuery
./bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 980"
CUDA Driver Version / Runtime Version 7.0 / 7.0
CUDA Capability Major/Minor version number: 5.2
Total amount of global memory: 4096 MBytes (4294639616 bytes)
(16) Multiprocessors, (128) CUDA Cores/MP: 2048 CUDA Cores
GPU Max Clock rate: 1329 MHz (1.33 GHz)
Memory Clock rate: 3505 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GeForce GTX 980
Result = PASS
$ ./bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: GeForce GTX 980
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12164.5
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12896.5
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 164863.5
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
- NumPy (1.9.2): 16.393 (5593), OK
- SciPy (0.16.0): 109.405 (18456), OK
- Theano (0.7.0): 4165.916 (19844), FAILED
Step 4: Bleeding edge Theano
$ git clone git://github.com/Theano/Theano.git
$ cd Theano
$ python setup.py develop # 0.7.0.dev-c042a9c49ac6516b74668747d1e6e6bbe832efba
$ THEANO_FLAGS=init_gpu_device=gpu0,device=cpu,floatX=float32 python -c "import theano; theano.test()"
WARNING (theano.sandbox.cuda): GPU device gpu0 will be initialized, and used if a GPU is needed. However, no computation, nor shared variables, will be implicitly moved to that device. If you want
that behavior, use the 'device' flag instead.
Using gpu device 0: GeForce GTX 980 (CNMeM is enabled)
Theano version 0.7.0.dev-c042a9c49ac6516b74668747d1e6e6bbe832efba
theano is installed in /home/epark/pkgs/Theano/theano
NumPy version 1.9.2
NumPy is installed in /usr/local/lib/python2.7/dist-packages/numpy
Python version 2.7.6 (default, Jun 22 2015, 17:58:13) [GCC 4.8.2]
nose version 1.3.1
......00001 #include <Python.h>
00002 #include <iostream>
00003 #include "theano_mod_helper.h"
... # the full log is available at [this link](http://pastebin.com/WSqrQkYA)
----------------------------------------------------------------------
Ran 19844 tests in 4093.222s
FAILED (KNOWNFAIL=18, SKIP=69, errors=218, failures=192)
- Theano (0.7.0): 4093.222 (19844), FAILED 8
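The runs above pass their settings through `THEANO_FLAGS` on the command line, but the same settings can also be kept in `.theanorc`. As a sketch, the helper below converts a flags string into INI text with Python 3's `configparser`. `flags_to_theanorc` is a hypothetical function of mine, not part of Theano. Placing plain flags under `[global]` follows the Theano configuration docs; treating everything before the last dot of a dotted flag as the section name is my reading of those docs and should be taken as an assumption.

```python
import configparser
import io

def flags_to_theanorc(flags):
    """Convert a THEANO_FLAGS string into .theanorc-style INI text.

    Plain flags go under [global]; for a dotted flag such as
    `nvcc.fastmath`, the part before the last dot is used as the
    section name (an assumption, see the text above).
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str          # preserve case, e.g. floatX
    parser.add_section("global")
    for item in flags.split(","):
        if not item.strip():
            continue                  # tolerate empty entries
        name, _, value = item.strip().partition("=")
        section, dot, option = name.rpartition(".")
        if not dot:
            section = "global"        # no dot: option already holds the name
        if not parser.has_section(section):
            parser.add_section(section)
        parser.set(section, option, value)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()

print(flags_to_theanorc("floatX=float32,device=gpu"))
```

Saving the printed text as `~/.theanorc` should then make the `THEANO_FLAGS=...` prefix unnecessary for those settings.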
Some random comments
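- When reading test results, it is important to know not only 1) the test time but also 2) the package versions and 3) the number of tests. Even with the same version, the number of tests can differ when run in a different environment (e.g. OS, GPU).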
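- On Ubuntu, installing `numpy` or `scipy` with `apt-get` gives you an older version (e.g. 1.8.2 on 14.04) unless you use a PPA. Installing with `pip` gets you a more recent version (e.g. 1.9.2 in my case). Since speed can differ between versions, I recommend installing with `pip`.
- On dada and daca, when Python 2.7.6 had NumPy 1.9.1 and SciPy 0.13.3 installed, the SciPy tests always failed. On daca in particular, the NumPy tests even segfaulted. It is possible that upgrading to the latest versions resolved the problem.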
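Since versions matter this much, it is worth recording them next to every timing. A minimal Python 3 sketch follows; `version_report` is my own helper, and it reports `None` for any package that cannot be imported, including a broken install.

```python
import importlib
import platform

def version_report(packages=("numpy", "scipy", "theano")):
    """Return {name: version string, or None if the package cannot be imported}."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except Exception:  # a broken install also counts as unavailable
            report[name] = None
        else:
            report[name] = getattr(module, "__version__", "unknown")
    return report

print(version_report())
```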
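[2] Octave seems to be less well known than I expected; think of it as an open-source version of MATLAB.
[3] Much as your understanding of segfaults deepens once you start digging into the Linux kernel or reading core dumps.
[4] For more detail on what each function name means, the Intel Developer Zone is a good guide.
[6] The test times for NumPy, SciPy, and Theano are the ones given on the Theano website.
[7] Or use `.theanorc`.
[8] I posted a question to the Theano users list and was told that test failures when using the GPU are not a real problem…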