I have tried to measure the asymmetric memory access effects of NUMA, and failed.

The experiment was performed on an Intel Xeon X5570 @ 2.93 GHz, 2 CPUs, 8 cores.

On a thread pinned to core 0, I allocate an array x of 10,000,000 bytes on core 0's NUMA node with numa_alloc_local. Then I iterate over array x 50 times, reading and writing each byte in the array, and measure the elapsed time for those 50 iterations.

Then, on each of the other cores in my server, I pin a new thread and again measure the elapsed time for 50 iterations of reading and writing every byte of array x.

Array x is large in order to minimize cache effects. We want to measure the speed when the CPU has to go all the way to RAM to load and store, not when the caches are helping.

There are two NUMA nodes in my server, so I expected the cores that have affinity to the node on which array x was allocated to show faster read/write speeds. I am not seeing that.

Why?

Perhaps NUMA is only relevant on systems with more than 8-12 cores, as I have seen suggested elsewhere?
http://lse.sourceforge.net/numa/faq/
#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for(size_t i=0;i<bm.size;++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

void* thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            c = ((char*)y)[j];
            ((char*)y)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core << ": " << (t2 - t1) << std::endl;

    *x = y;
}

void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            c = ((char*)x)[j];
            ((char*)x)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i=0;i<=numa_max_node();++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(50);

    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    for (size_t i(0);i<numcpus;++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);

    return 0;
}
g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp
./numatest

numa_available() 0                    <-- NUMA is available on this system
numa node 0 10101010 12884901888      <-- cores 0,2,4,6 are on NUMA node 0, which is about 12 Gb
numa node 1 01010101 12874584064      <-- cores 1,3,5,7 are on NUMA node 1, which is slightly smaller than node 0

Elapsed read/write by same thread that allocated on core 0: 00:00:01.767428
Elapsed read/write by thread on core 0: 00:00:01.760554
Elapsed read/write by thread on core 1: 00:00:01.719686
Elapsed read/write by thread on core 2: 00:00:01.708830
Elapsed read/write by thread on core 3: 00:00:01.691560
Elapsed read/write by thread on core 4: 00:00:01.686912
Elapsed read/write by thread on core 5: 00:00:01.691917
Elapsed read/write by thread on core 6: 00:00:01.686509
Elapsed read/write by thread on core 7: 00:00:01.689928
No matter which core does the reading and writing, 50 iterations of reading and writing over array x take about 1.7 seconds.

The cache size on my CPUs is 8 MB, so a 10 MB array x may not be big enough to eliminate cache effects. I tried a 100 MB array x, and I tried issuing a full memory fence with __sync_synchronize() inside the innermost loop. It still did not reveal any asymmetry between the NUMA nodes.
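For reference, here is a sketch of what that fenced inner loop looked like; this is a reconstruction based on the description above, so the exact placement of the fence is an assumption:

for (size_t j(0); j < N; ++j)
{
    c = ((char*)x)[j];
    ((char*)x)[j] = c;
    __sync_synchronize();   // full memory fence after each byte; still no visible asymmetry
}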
I also tried reading and writing array x with __sync_fetch_and_add(). Still nothing.
The first thing I want to point out is that you might want to double-check which cores are on each node. I don't recall cores and nodes being interleaved like that. Also, you should have 16 threads because of HT (unless you disabled it).

Another thing:

Socket 1366 Xeon machines are only slightly NUMA, so it will be hard to see the difference. The NUMA effect is much more pronounced on 4P Opterons.

On a system like yours, the node-to-node bandwidth is actually faster than the CPU-to-memory bandwidth. Since your access pattern is completely sequential, you are getting the full bandwidth regardless of whether or not the data is local. A better thing to measure is the latency: try randomly accessing a 1 GB block instead of streaming it sequentially.
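As an illustration of that latency-oriented measurement, here is a minimal sketch (my own, not code from this thread; the function name chase_ns_per_access and the sizes are arbitrary): it builds a random chain over a large buffer and chases it, so every load depends on the previous one and neither sequential prefetching nor overlapping of misses can hide the memory latency. Combined with pin_to_core() and numa_alloc_local() from the question, the ns/access numbers on local versus remote cores should diverge much more clearly than bandwidth does.

// Hypothetical pointer-chasing latency probe (illustrative sketch).
// Every load depends on the previous one, so the hardware prefetcher cannot help.
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

double chase_ns_per_access(std::size_t n, std::size_t iters)
{
    // Build one big cycle that visits every slot in random order
    // (two ~1 GiB index arrays for the default n below).
    std::vector<std::size_t> order(n), next(n);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), std::mt19937_64(42));
    for (std::size_t i = 0; i + 1 < n; ++i) next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];

    auto t0 = std::chrono::steady_clock::now();
    std::size_t idx = order[0];
    for (std::size_t i = 0; i < iters; ++i)
        idx = next[idx];                        // serialized, mostly cache-missing loads
    auto t1 = std::chrono::steady_clock::now();

    volatile std::size_t sink = idx;            // keep the chain from being optimized away
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

int main()
{
    // ~1 GiB of 8-byte indices, roughly the block size suggested above.
    std::cout << chase_ns_per_access(std::size_t(1) << 27, 50000000) << " ns/access\n";
    return 0;
}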
Last thing:

Depending on how aggressively the compiler optimizes, your loop might be optimized out because it doesn't do anything:
c = ((char*)x)[j];
((char*)x)[j] = c;
Something like this will guarantee that it cannot be eliminated by the compiler:
((char*)x)[j] += 1;
Ah ha! Mysticial is right! Somehow, hardware prefetching was optimizing my reads/writes.

If it were a cache optimization, then forcing a memory barrier would defeat it:
c = __sync_fetch_and_add(((char*)x) + j, 1);
But that didn't make any difference. What did make a difference was multiplying my iterator index by the prime 1009 to defeat the prefetching:
*(((char*)x) + ((j * 1009) % N)) += 1;
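One thing worth noting about this trick (my observation, not from the original posts): since 1009 is prime and does not divide N = 10,000,000, the map j -> (j * 1009) % N is a bijection on [0, N), so each pass still touches every byte exactly once; only the order changes into strides the prefetcher cannot follow. A throwaway check:

// Throwaway check that the strided indexing is a permutation of [0, N):
// it is, exactly when gcd(1009, N) == 1.
#include <iostream>

int main()
{
    unsigned long a = 1009, b = 10000000;                        // the prime stride and the array size N
    while (b != 0) { unsigned long t = a % b; a = b; b = t; }    // Euclid's algorithm
    std::cout << "gcd(1009, N) = " << a << "\n";                 // prints 1
    return 0;
}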
With that change, the NUMA asymmetry is clearly revealed:
numa_available() 0
numa node 0 10101010 12884901888
numa node 1 01010101 12874584064
Elapsed read/write by same thread that allocated on core 0: 00:00:00.961725
Elapsed read/write by thread on core 0: 00:00:00.942300
Elapsed read/write by thread on core 1: 00:00:01.216286
Elapsed read/write by thread on core 2: 00:00:00.909353
Elapsed read/write by thread on core 3: 00:00:01.218935
Elapsed read/write by thread on core 4: 00:00:00.898107
Elapsed read/write by thread on core 5: 00:00:01.211413
Elapsed read/write by thread on core 6: 00:00:00.898021
Elapsed read/write by thread on core 7: 00:00:01.207114
At least I think that's what is going on.

Thanks, Mysticial!

EDIT: CONCLUSION ~133%

For anyone just glancing at this post to get a rough idea of the performance characteristics of NUMA, here is the bottom line according to my tests:

Memory access to a non-local NUMA node has about 1.33 times the latency of memory access to a local node (in the numbers above, roughly 1.21 s on the remote cores versus 0.90 s on the local ones).
Thanks for this benchmark code. I've taken your "fixed" version and changed it to pure C + OpenMP, and added a few tests for how the memory system behaves under contention. You can find the new code here.

Here are some sample results from a quad Opteron:
num cpus: 32
numa available: 0
numa node 0 10001000100010000000000000000000 - 15.9904 GiB
numa node 1 00000000000000001000100010001000 - 16 GiB
numa node 2 00010001000100010000000000000000 - 16 GiB
numa node 3 00000000000000000001000100010001 - 16 GiB
numa node 4 00100010001000100000000000000000 - 16 GiB
numa node 5 00000000000000000010001000100010 - 16 GiB
numa node 6 01000100010001000000000000000000 - 16 GiB
numa node 7 00000000000000000100010001000100 - 16 GiB

sequential core 0 -> core 0 : BW 4189.87 MB/s
sequential core 1 -> core 0 : BW 2409.1 MB/s
sequential core 2 -> core 0 : BW 2495.61 MB/s
sequential core 3 -> core 0 : BW 2474.62 MB/s
sequential core 4 -> core 0 : BW 4244.45 MB/s
sequential core 5 -> core 0 : BW 2378.34 MB/s
sequential core 6 -> core 0 : BW 2442.93 MB/s
sequential core 7 -> core 0 : BW 2468.61 MB/s
sequential core 8 -> core 0 : BW 4220.48 MB/s
sequential core 9 -> core 0 : BW 2442.88 MB/s
sequential core 10 -> core 0 : BW 2388.11 MB/s
sequential core 11 -> core 0 : BW 2481.87 MB/s
sequential core 12 -> core 0 : BW 4273.42 MB/s
sequential core 13 -> core 0 : BW 2381.28 MB/s
sequential core 14 -> core 0 : BW 2449.87 MB/s
sequential core 15 -> core 0 : BW 2485.48 MB/s
sequential core 16 -> core 0 : BW 2938.08 MB/s
sequential core 17 -> core 0 : BW 2082.12 MB/s
sequential core 18 -> core 0 : BW 2041.84 MB/s
sequential core 19 -> core 0 : BW 2060.47 MB/s
sequential core 20 -> core 0 : BW 2944.13 MB/s
sequential core 21 -> core 0 : BW 2111.06 MB/s
sequential core 22 -> core 0 : BW 2063.37 MB/s
sequential core 23 -> core 0 : BW 2082.75 MB/s
sequential core 24 -> core 0 : BW 2958.05 MB/s
sequential core 25 -> core 0 : BW 2091.85 MB/s
sequential core 26 -> core 0 : BW 2098.73 MB/s
sequential core 27 -> core 0 : BW 2083.7 MB/s
sequential core 28 -> core 0 : BW 2934.43 MB/s
sequential core 29 -> core 0 : BW 2048.68 MB/s
sequential core 30 -> core 0 : BW 2087.6 MB/s
sequential core 31 -> core 0 : BW 2014.68 MB/s

all-contention core 0 -> core 0 : BW 1081.85 MB/s
all-contention core 1 -> core 0 : BW 299.177 MB/s
all-contention core 2 -> core 0 : BW 298.853 MB/s
all-contention core 3 -> core 0 : BW 263.735 MB/s
all-contention core 4 -> core 0 : BW 1081.93 MB/s
all-contention core 5 -> core 0 : BW 299.177 MB/s
all-contention core 6 -> core 0 : BW 299.63 MB/s
all-contention core 7 -> core 0 : BW 263.795 MB/s
all-contention core 8 -> core 0 : BW 1081.98 MB/s
all-contention core 9 -> core 0 : BW 299.177 MB/s
all-contention core 10 -> core 0 : BW 300.149 MB/s
all-contention core 11 -> core 0 : BW 262.905 MB/s
all-contention core 12 -> core 0 : BW 1081.89 MB/s
all-contention core 13 -> core 0 : BW 299.173 MB/s
all-contention core 14 -> core 0 : BW 299.025 MB/s
all-contention core 15 -> core 0 : BW 263.865 MB/s
all-contention core 16 -> core 0 : BW 432.156 MB/s
all-contention core 17 -> core 0 : BW 233.12 MB/s
all-contention core 18 -> core 0 : BW 232.889 MB/s
all-contention core 19 -> core 0 : BW 202.48 MB/s
all-contention core 20 -> core 0 : BW 434.299 MB/s
all-contention core 21 -> core 0 : BW 233.274 MB/s
all-contention core 22 -> core 0 : BW 233.144 MB/s
all-contention core 23 -> core 0 : BW 202.505 MB/s
all-contention core 24 -> core 0 : BW 434.295 MB/s
all-contention core 25 -> core 0 : BW 233.274 MB/s
all-contention core 26 -> core 0 : BW 233.169 MB/s
all-contention core 27 -> core 0 : BW 202.49 MB/s
all-contention core 28 -> core 0 : BW 434.295 MB/s
all-contention core 29 -> core 0 : BW 233.309 MB/s
all-contention core 30 -> core 0 : BW 233.169 MB/s
all-contention core 31 -> core 0 : BW 202.526 MB/s

two-contention core 0 -> core 0 : BW 3306.11 MB/s
two-contention core 1 -> core 0 : BW 2199.7 MB/s
two-contention core 0 -> core 0 : BW 3286.21 MB/s
two-contention core 2 -> core 0 : BW 2220.73 MB/s
two-contention core 0 -> core 0 : BW 3302.24 MB/s
two-contention core 3 -> core 0 : BW 2182.81 MB/s
two-contention core 0 -> core 0 : BW 3605.88 MB/s
two-contention core 4 -> core 0 : BW 3605.88 MB/s
two-contention core 0 -> core 0 : BW 3297.08 MB/s
two-contention core 5 -> core 0 : BW 2217.82 MB/s
two-contention core 0 -> core 0 : BW 3312.69 MB/s
two-contention core 6 -> core 0 : BW 2227.04 MB/s
two-contention core 0 -> core 0 : BW 3287.93 MB/s
two-contention core 7 -> core 0 : BW 2209.48 MB/s
two-contention core 0 -> core 0 : BW 3660.05 MB/s
two-contention core 8 -> core 0 : BW 3660.05 MB/s
two-contention core 0 -> core 0 : BW 3339.63 MB/s
two-contention core 9 -> core 0 : BW 2223.84 MB/s
two-contention core 0 -> core 0 : BW 3303.77 MB/s
two-contention core 10 -> core 0 : BW 2197.99 MB/s
two-contention core 0 -> core 0 : BW 3323.19 MB/s
two-contention core 11 -> core 0 : BW 2196.08 MB/s
two-contention core 0 -> core 0 : BW 3582.23 MB/s
two-contention core 12 -> core 0 : BW 3582.22 MB/s
two-contention core 0 -> core 0 : BW 3324.9 MB/s
two-contention core 13 -> core 0 : BW 2250.74 MB/s
two-contention core 0 -> core 0 : BW 3305.66 MB/s
two-contention core 14 -> core 0 : BW 2209.5 MB/s
two-contention core 0 -> core 0 : BW 3303.52 MB/s
two-contention core 15 -> core 0 : BW 2182.43 MB/s
two-contention core 0 -> core 0 : BW 3352.74 MB/s
two-contention core 16 -> core 0 : BW 2607.73 MB/s
two-contention core 0 -> core 0 : BW 3092.65 MB/s
two-contention core 17 -> core 0 : BW 1911.98 MB/s
two-contention core 0 -> core 0 : BW 3025.91 MB/s
two-contention core 18 -> core 0 : BW 1918.06 MB/s
two-contention core 0 -> core 0 : BW 3257.56 MB/s
two-contention core 19 -> core 0 : BW 1885.03 MB/s
two-contention core 0 -> core 0 : BW 3339.64 MB/s
two-contention core 20 -> core 0 : BW 2603.06 MB/s
two-contention core 0 -> core 0 : BW 3119.29 MB/s
two-contention core 21 -> core 0 : BW 1918.6 MB/s
two-contention core 0 -> core 0 : BW 3054.14 MB/s
two-contention core 22 -> core 0 : BW 1910.61 MB/s
two-contention core 0 -> core 0 : BW 3214.44 MB/s
two-contention core 23 -> core 0 : BW 1881.69 MB/s
two-contention core 0 -> core 0 : BW 3332.3 MB/s
two-contention core 24 -> core 0 : BW 2611.8 MB/s
two-contention core 0 -> core 0 : BW 3111.94 MB/s
two-contention core 25 -> core 0 : BW 1922.11 MB/s
two-contention core 0 -> core 0 : BW 3049.02 MB/s
two-contention core 26 -> core 0 : BW 1912.85 MB/s
two-contention core 0 -> core 0 : BW 3251.88 MB/s
two-contention core 27 -> core 0 : BW 1881.82 MB/s
two-contention core 0 -> core 0 : BW 3345.6 MB/s
two-contention core 28 -> core 0 : BW 2598.82 MB/s
two-contention core 0 -> core 0 : BW 3109.04 MB/s
two-contention core 29 -> core 0 : BW 1923.81 MB/s
two-contention core 0 -> core 0 : BW 3062.94 MB/s
two-contention core 30 -> core 0 : BW 1921.3 MB/s
two-contention core 0 -> core 0 : BW 3220.8 MB/s
two-contention core 31 -> core 0 : BW 1901.76 MB/s
If anyone has further improvements, I would be happy to hear about them. For example, these are obviously not perfect bandwidth measurements in real-world units (they are likely off by a constant integer factor).

A few remarks:
Use the lstopo utility to get a graphical overview of the topology. In particular, you will see which core numbers belong to which NUMA node (processor socket).

char is probably not the ideal data type for measuring maximum RAM throughput. I suspect that with a 32-bit or 64-bit data type you can push more data through with the same number of CPU cycles. More generally, you should also check that your measurement is not limited by CPU speed but by RAM speed. The ramspeed utility, for example, explicitly unrolls the inner loop in its source code:
for(i = 0; i < blk/sizeof(UTL); i += 32)
{
    b[i] = a[i];
    b[i+1] = a[i+1];
    ...
    b[i+30] = a[i+30];
    b[i+31] = a[i+31];
}
EDIT: on supported architectures, ramsmp actually even uses "hand-written" assembly code for these loops.
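To make the data-type point concrete, here is a minimal sketch of a word-wise copy loop (my own illustration, not ramspeed's actual code; the buffer size and unroll factor are arbitrary): moving uint64_t instead of char transfers 8 bytes per load/store, so the loop is more likely to be limited by RAM than by instruction throughput.

// Illustrative 64-bit copy loop (a sketch, not ramspeed's code).
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <iostream>

int main()
{
    const std::size_t n = std::size_t(1) << 27;   // 128 Mi 64-bit words = 1 GiB per buffer
    std::uint64_t* a = static_cast<std::uint64_t*>(std::malloc(n * sizeof(std::uint64_t)));
    std::uint64_t* b = static_cast<std::uint64_t*>(std::malloc(n * sizeof(std::uint64_t)));
    for (std::size_t i = 0; i < n; ++i) a[i] = i; // touch all pages before timing
    for (std::size_t i = 0; i < n; ++i) b[i] = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; i += 4) {      // modest manual unroll, in the spirit of ramspeed
        b[i]     = a[i];
        b[i + 1] = a[i + 1];
        b[i + 2] = a[i + 2];
        b[i + 3] = a[i + 3];
    }
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    // 2x because n*8 bytes are read and n*8 bytes are written.
    std::cout << (2.0 * n * sizeof(std::uint64_t)) / secs / 1e9 << " GB/s (read + write)\n";
    std::cout << b[n - 1] << "\n";                // keep the copy from being optimized away
    std::free(a);
    std::free(b);
    return 0;
}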
L1/L2/L3 cache effects: it is instructive to measure bandwidth in GByte/s as a function of block size. As you increase the block size you should see roughly four different speeds, corresponding to where you are reading the data from (the caches or main memory). Your processor seems to have 8 MByte of Level 3 (?) cache, so your 10 million bytes may mostly stay in the L3 cache (which is shared among all cores of one processor).
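A rough sketch of such a sweep (illustrative only; the sizes and the read-only access pattern are my choices): for each block size, stream over a buffer of that size repeatedly and report GB/s. The bandwidth should step down as the working set outgrows L1, L2 and L3 and finally spills into RAM.

// Illustrative block-size sweep (a sketch, not the benchmark from this thread).
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    const std::size_t total_bytes = std::size_t(1) << 30;        // ~1 GiB of traffic per block size
    for (std::size_t block = 1 << 12; block <= (std::size_t(1) << 26); block <<= 1) {
        std::vector<std::uint64_t> buf(block / sizeof(std::uint64_t), 1);
        const std::size_t passes = total_bytes / block;
        std::uint64_t sum = 0;

        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t p = 0; p < passes; ++p)
            for (std::size_t i = 0; i < buf.size(); ++i)
                sum += buf[i];                                   // read-only stream over the block
        auto t1 = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(t1 - t0).count();
        std::cout << block / 1024 << " KiB: "
                  << double(total_bytes) / secs / 1e9 << " GB/s"
                  << " (checksum " << sum << ")\n";              // print sum so it stays live
    }
    return 0;
}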
Memory channels: your processor has 3 memory channels. If your memory modules are installed such that you can exploit all of them (see e.g. the motherboard manual), you may want to run more than one thread at the same time. I have seen the effect that, when reading with one thread only, the asymptotic bandwidth is close to that of a single memory module (e.g. 12.8 GByte/s for DDR-1600), while running multiple threads gives an asymptotic bandwidth close to the number of memory channels times the bandwidth of a single memory module.
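And a minimal sketch of the multi-threaded case (again my own illustration, using std::thread rather than the OpenMP code linked above; thread count and buffer sizes are arbitrary): each thread reads its own privately allocated buffer, all threads start streaming together, and the aggregate GB/s is compared for 1, 2, 3, ... threads. With all memory channels populated, the aggregate should keep rising until the channels saturate.

// Illustrative multi-threaded read-bandwidth probe (a sketch, not the linked OpenMP code).
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

int main()
{
    const unsigned nthreads = 4;                                  // vary this and compare results
    const std::size_t words = std::size_t(1) << 26;               // 512 MiB per thread
    std::atomic<bool> go(false);
    std::atomic<std::uint64_t> checksum(0);
    std::vector<double> secs(nthreads, 0.0);

    auto worker = [&](unsigned t) {
        std::vector<std::uint64_t> buf(words, 1);                 // allocated and faulted by this thread
        while (!go.load(std::memory_order_acquire)) {}            // wait so all threads stream together
        auto t0 = std::chrono::steady_clock::now();
        std::uint64_t sum = 0;
        for (std::size_t i = 0; i < buf.size(); ++i) sum += buf[i];
        auto t1 = std::chrono::steady_clock::now();
        secs[t] = std::chrono::duration<double>(t1 - t0).count();
        checksum += sum;                                          // keep the read loop live
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nthreads; ++t) threads.emplace_back(worker, t);
    std::this_thread::sleep_for(std::chrono::seconds(2));         // crude: give buffers time to be built
    go.store(true, std::memory_order_release);
    for (std::size_t t = 0; t < threads.size(); ++t) threads[t].join();

    double aggregate = 0.0;
    for (unsigned t = 0; t < nthreads; ++t)
        aggregate += words * sizeof(std::uint64_t) / secs[t] / 1e9;
    std::cout << nthreads << " threads: ~" << aggregate << " GB/s aggregate"
              << " (checksum " << checksum << ")\n";
    return 0;
}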
You can also use numactl to choose which node the process runs on and where memory is allocated from:
numactl --cpubind=0 --membind=1 <process>
I use this in combination with LMbench to get memory latency numbers:
numactl --cpubind=0 --membind=0 ./lat_mem_rd -t 512
numactl --cpubind=0 --membind=1 ./lat_mem_rd -t 512
If anyone else wants to try this test, here is the modified, working program. I would love to see results from other hardware. This works on my machine with Linux 2.6.34-12-desktop, GCC 4.5.0, and Boost 1.47.
g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp
#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for(size_t i=0;i<bm.size;++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

void* thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            *(((char*)y) + ((j * 1009) % N)) += 1;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core << ": " << (t2 - t1) << std::endl;

    *x = y;
}

void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            *(((char*)x) + ((j * 1009) % N)) += 1;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i=0;i<=numa_max_node();++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(5);

    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    for (size_t i(0);i<numcpus;++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);

    return 0;
}