Intereting Posts

Windows vs Linux – math结果差异如何禁用GNOME桌面屏幕locking？使用PowerShell命令行工具下载文件为什么WM_ACTIVE对lParam（另一个wondow句柄）有NULL？ Windows GDI：水平/垂直DPI docker集装箱的尺寸比实际尺寸大得多我如何使用C ++在Linux中制作托盘图标？如何在Debian Jessie中安装dotnet core sdk recvmmsg（）不填充msghdr.msg_namelen？当我运行一个python脚本两次时，文件名中的语法错误？比较2个文件夹和查找具有不同字节数的文件将在Windows中编写的Python代码工作在Linux？ Linux mremap不释放旧的映射？滚动ListView控件到特定的项目 Microsoft Windows Office Suite之间的Java连接？

有人可以解释下面的内存分配C程序的性能行为吗？

在我的机器上，时间A和时间B交换取决于是否定义了A （它改变了调用这两个calloc的顺序）。

我最初将其归因于寻呼系统。奇怪的是，当使用mmap而不是calloc ，情况更加糟糕 – 两个循环都花费了与预期相同的时间。从strace可以看出， calloc最终导致了两个mmap ，所以没有返回已经分配的内存。

我正在英特尔i7上运行Debiantesting。

 #include <stdlib.h> #include <stdio.h> #include <sys/mman.h> #include <time.h> #define SIZE 500002816 #ifndef USE_MMAP #define ALLOC calloc #else #define ALLOC(a, b) (mmap(NULL, a * b, PROT_READ | PROT_WRITE, \ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)) #endif int main() { clock_t start, finish; #ifdef A int *arr1 = ALLOC(sizeof(int), SIZE); int *arr2 = ALLOC(sizeof(int), SIZE); #else int *arr2 = ALLOC(sizeof(int), SIZE); int *arr1 = ALLOC(sizeof(int), SIZE); #endif int i; start = clock(); { for (i = 0; i < SIZE; i++) arr1[i] = (i + 13) * 5; } finish = clock(); printf("Time A: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC); start = clock(); { for (i = 0; i < SIZE; i++) arr2[i] = (i + 13) * 5; } finish = clock(); printf("Time B: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC); return 0; }

我得到的输出：

  ~/directory $ cc -Wall -O3 bench-loop.c -o bench-loop ~/directory $ ./bench-loop Time A: 0.94 Time B: 0.34 ~/directory $ cc -DA -Wall -O3 bench-loop.c -o bench-loop ~/directory $ ./bench-loop Time A: 0.34 Time B: 0.90 ~/directory $ cc -DUSE_MMAP -DA -Wall -O3 bench-loop.c -o bench-loop ~/directory $ ./bench-loop Time A: 0.89 Time B: 0.90 ~/directory $ cc -DUSE_MMAP -Wall -O3 bench-loop.c -o bench-loop ~/directory $ ./bench-loop Time A: 0.91 Time B: 0.92

简答

calloc第一次被调用时，它明确地清零了内存。而下一次它被称为它假定从mmap返回的内存已被清零。

细节

下面是我检查的一些结论，如果你想要的话，你可以试试自己：

在第一次ALLOC呼叫之前插入一个calloc呼叫。你会看到在这之后，时间A和时间B是相同的。
使用clock()函数来检查每个ALLOC调用需要多长时间。在他们都使用calloc的情况下，您将看到第一个调用比第二个调用需要更长的时间。
使用time来计算calloc版本和USE_MMAP版本的执行时间。当我这样做的时候，我看到USE_MMAP的执行时间一直略少。
我用strace -tt -T运行，它显示了系统调用的时间和花了多长时间。这是输出的一部分：

Strace输出：

 21:29:06.127536 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff806fd000 <0.000014> 21:29:07.778442 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff093a0000 <0.000021> 21:29:07.778563 times({tms_utime=63, tms_stime=102, tms_cutime=0, tms_cstime=0}) = 4324241005 <0.000011>

您可以看到第一个mmap调用花费了0.000014秒，但在下一次系统调用之前已经过了大约1.5秒。然后，第二个mmap调用花费了0.000021秒，之后是数百微秒后的times调用。

我还用gdb一部分应用程序的执行过程，看到第一次调用calloc导致了大量的memset调用，而第二次调用calloc没有调用memset 。如果你有兴趣，你可以在这里看到calloc的源代码（查找__libc_calloc ）。至于为什么calloc是在第一次通话而不是后来的memset ，我不知道。但我相当确信，这解释了你所问的行为。

至于为什么被归零的memset的数组性能有所提高，我的猜测是，这是因为值被加载到TLB而不是缓存中，因为它是一个非常大的数组。无论您询问的性能差异的具体原因是两个calloc调用执行时行为不同。

您还应该使用malloc而不是calloc测试。 calloc所做的一件事是用零填充分配的内存。

我相信你的情况是，当你调用最后一个calloc arr1，然后分配给它时，它已经被错误存入高速缓冲存储器，因为它是最后一个分配和零填充。当calloc arr1 first和arr2 second时，arr2的零填充将arr1推出缓存。

猜猜我可以写更多，或更少，特别是越少越多。

原因可能因系统而异。然而; clib：

用于每个操作的总时间是相反的; 如果你的时间calloc +迭代。

即：

 Calloc arr1 : 0.494992654 Calloc arr2 : 0.000021250 Itr arr1 : 0.430646035 Itr arr2 : 0.790992411 Sum arr1 : 0.925638689 Sum arr2 : 0.791013661 Calloc arr1 : 0.503130736 Calloc arr2 : 0.000025906 Itr arr1 : 0.427719162 Itr arr2 : 0.809686047 Sum arr1 : 0.930849898 Sum arr2 : 0.809711953

第一个calloc随后malloc有更长的执行时间，然后秒。在malloc(0)之前的任何calloc之前调用等用于malloc像调用一样的过程（下面的说明）使用的时间。但是，如果一个人做了几个，可以看到这些电话的时间略有下降。

然而迭代时间会变平。

所以总之所使用的总系统时间是最先被分配的最高时间。然而，这是一个在流程的限制中无法逃脱的开销。

有很多维护正在进行。快速触及一些情况：

简短的页面上

当一个进程请求内存被提供一个虚拟地址范围。该范围通过页表转换为物理内存。如果一个页面逐字节翻译，我们会很快得到巨大的页面表。这就是为什么内存范围是以区块或页面形式提供的原因。页面大小取决于系统。该架构还可以提供各种页面大小。

如果我们查看上面代码的执行情况，并从/ proc / PID / stat中添加一些读取，我们可以看到这个操作（特别是RSS）：

 PID Stat { PID : 4830 Process ID MINFLT : 214 Minor faults, (no page memory read) UTIME : 0 Time user mode STIME : 0 Time kernel mode VSIZE : 2039808 Virtual memory size, bytes RSS : 73 Resident Set Size, Number of pages in real memory } : Init PID Stat { PID : 4830 Process ID MINFLT : 51504 Minor faults, (no page memory read) UTIME : 4 Time user mode STIME : 33 Time kernel mode VSIZE : 212135936 Virtual memory size, bytes RSS : 51420 Resident Set Size, Number of pages in real memory } : Post calloc arr1 PID Stat { PID : 4830 Process ID MINFLT : 51515 Minor faults, (no page memory read) UTIME : 4 Time user mode STIME : 33 Time kernel mode VSIZE : 422092800 Virtual memory size, bytes RSS : 51428 Resident Set Size, Number of pages in real memory } : Post calloc arr2 PID Stat { PID : 4830 Process ID MINFLT : 51516 Minor faults, (no page memory read) UTIME : 36 Time user mode STIME : 33 Time kernel mode VSIZE : 422092800 Virtual memory size, bytes RSS : 51431 Resident Set Size, Number of pages in real memory } : Post iteration arr1 PID Stat { PID : 4830 Process ID MINFLT : 102775 Minor faults, (no page memory read) UTIME : 68 Time user mode STIME : 58 Time kernel mode VSIZE : 422092800 Virtual memory size, bytes RSS : 102646 Resident Set Size, Number of pages in real memory } : Post iteration arr2 PID Stat { PID : 4830 Process ID MINFLT : 102776 Minor faults, (no page memory read) UTIME : 68 Time user mode STIME : 69 Time kernel mode VSIZE : 2179072 Virtual memory size, bytes RSS : 171 Resident Set Size, Number of pages in real memory } : Post free()

正如我们可以看到实际分配在内存中的页面被推迟等待页面请求的arr2 ; 持续到迭代开始。如果我们在arr1 calloc之前添加一个malloc(0) ，我们可以注意到在迭代之前，这两个数组都没有被分配到物理内存中。

由于页面可能不被使用，所以根据请求进行映射会更有效率。这就是为什么当这个过程，即做一个calloc足够数量的页面被保留，但不一定实际分配在实际内存中。

当一个地址被引用时，查询页表。如果地址在未分配的页面中，则系统发生页面错误并且随后分配页面。分配的页面总数被称为驻留集大小 （RSS）。

我们可以通过迭代（触摸）它的1/4来做一个实验。在这里，我还在任何calloc之前添加了malloc(0) 。

 Pre iteration 1/4: RSS : 171 Resident Set Size, Number of pages in real meory for (i = 0; i < SIZE / 4; ++i) arr1[i] = 0; Post iteration 1/4: RSS : 12967 Resident Set Size, Number of pages in real meory Post iteration 1/1: RSS : 51134 Resident Set Size, Number of pages in real meory

为了进一步加速事情，大多数系统还在翻译后备缓冲器 （TLB）中缓存N个最近的页表项。

brk，mmap

当一个进程(c|m|…)alloc堆的上限时，由brk()或sbrk()扩展。这些系统调用是昂贵的，为了弥补这个malloc收集多个较小的调用一个更大的brk（）。

这也影响到free()作为一个负面brk()也是资源昂贵，他们收集和执行作为一个更大的操作。

对于巨大的要求; 即像你的代码中的那样， malloc()使用mmap() 。这个可由mallopt()配置的阈值是一个受过教育的值

我们可以通过修改代码中的SIZE来获得乐趣。如果我们使用malloc.h并使用，

 struct mallinfo minf = mallinfo();

（不，不是Hblkhd ），我们可以显示这个（注意Arena和Hblkhd ，…）：

 Initial: mallinfo { Arena : 0 (Bytes of memory allocated with sbrk by malloc) Ordblks : 1 (Number of chunks not in use) Hblks : 0 (Number of chunks allocated with mmap) Hblkhd : 0 (Bytes allocated with mmap) Uordblks: 0 (Memory occupied by chunks handed out by malloc) Fordblks: 0 (Memory occupied by free chunks) Keepcost: 0 (Size of the top-most releasable chunk) } : Initial MAX = ((128 * 1024) / sizeof(int)) mallinfo { Arena : 0 (Bytes of memory allocated with sbrk by malloc) Ordblks : 1 (Number of chunks not in use) Hblks : 1 (Number of chunks allocated with mmap) Hblkhd : 135168 (Bytes allocated with mmap) Uordblks: 0 (Memory occupied by chunks handed out by malloc) Fordblks: 0 (Memory occupied by free chunks) Keepcost: 0 (Size of the top-most releasable chunk) } : After malloc arr1 mallinfo { Arena : 0 (Bytes of memory allocated with sbrk by malloc) Ordblks : 1 (Number of chunks not in use) Hblks : 2 (Number of chunks allocated with mmap) Hblkhd : 270336 (Bytes allocated with mmap) Uordblks: 0 (Memory occupied by chunks handed out by malloc) Fordblks: 0 (Memory occupied by free chunks) Keepcost: 0 (Size of the top-most releasable chunk) } : After malloc arr2

然后我们从MAX减去sizeof(int) ，得到：

 mallinfo { Arena : 266240 (Bytes of memory allocated with sbrk by malloc) Ordblks : 1 (Number of chunks not in use) Hblks : 0 (Number of chunks allocated with mmap) Hblkhd : 0 (Bytes allocated with mmap) Uordblks: 131064 (Memory occupied by chunks handed out by malloc) Fordblks: 135176 (Memory occupied by free chunks) Keepcost: 135176 (Size of the top-most releasable chunk) } : After malloc arr1 mallinfo { Arena : 266240 (Bytes of memory allocated with sbrk by malloc) Ordblks : 1 (Number of chunks not in use) Hblks : 0 (Number of chunks allocated with mmap) Hblkhd : 0 (Bytes allocated with mmap) Uordblks: 262128 (Memory occupied by chunks handed out by malloc) Fordblks: 4112 (Memory occupied by free chunks) Keepcost: 4112 (Size of the top-most releasable chunk) } : After malloc arr2

我们注册该系统的工作原理如上所述。如果分配的大小低于阈值sbrk被使用，内存由malloc内部处理，否则使用mmap 。

这样的结构也有助于防止内存碎片等。

要点是malloc系列针对一般用途进行了优化。但是，可以修改mmap限制以满足特殊需求。

当/如果修改mmap阈值，注意这个（和低谷100多条线）。。

这可以进一步观察，如果我们填充（触摸）arr1和arr2的每一页之前我们做的时间：

 Touch pages … (Here with page size of 4 kB) for (i = 0; i < SIZE; i += 4096 / sizeof(int)) { arr1[i] = 0; arr2[i] = 0; } Itr arr1 : 0.312462317 CPU arr1 : 0.32 Itr arr2 : 0.312869158 CPU arr2 : 0.31

另请参阅：

编译时选项的概要
重要统计数据
实际上整个文件是一个很好的阅读。

小记：

那么，CPU知道物理地址呢？罗。

在记忆的世界里，有很多东西需要解决 ;）。 内存管理单元 （MMU）是其核心硬件。可以作为CPU或外部芯片的集成部分。

操作系统在引导时配置MMU，并定义各个区域的访问权限（只读，读写等），从而提供一定的安全级别。

我们作为凡人看到的地址是CPU使用的逻辑地址 。 MMU将其转换为物理地址 。

CPU的地址由两部分组成：一个页面地址和一个偏移量。 [PAGE_ADDRESS.OFFSET]

而得到一个物理地址的过程，我们可以有这样的事情：

 .-----. .--------------. | CPU > --- Request page 2 ----> | MMU | +-----+ | Pg 2 == Pg 4 | | +------v-------+ +--Request offset 1 -+ | | (Logical page 2 EQ Physical page 4) [ ... ] __ | | [ OFFSET 0 ] | | | [ OFFSET 1 ] | | | [ OFFSET 2 ] | | | [ OFFSET 3 ] +--- Page 3 | | [ OFFSET 4 ] | | | [ OFFSET 5 ] | | | [ OFFSET 6 ]__| ___________|____________+ [ OFFSET 0 ] | | [ OFFSET 1 ] | ...........+ [ OFFSET 2 ] | [ OFFSET 3 ] +--- Page 4 [ OFFSET 4 ] | [ OFFSET 5 ] | [ OFFSET 6 ]__| [ ... ]

CPU的逻辑地址空间直接链接到地址长度。 32位地址处理器具有2 ³²字节的逻辑地址空间。 物理地址空间是系统可承受多少内存。

还有处理碎片内存，重新对齐等。

这使我们进入交换文件的世界。如果一个进程请求更多的内存然后物理可用; 一个或多个其他进程的页面被转移到磁盘/交换，并且其请求进程“ 窃取 ”它们的页面。 MMU跟踪这个; 因此CPU不必担心内存在哪里。

这进一步把我们带入肮脏的记忆。

如果我们从/ proc / [pid] / smaps打印一些信息，更具体的我们的数组的范围，我们得到像这样：

 Start: b76f3000-b76f5000 Private_Dirty: 8 kB Post calloc arr1: aaeb8000-b76f5000 Private_Dirty: 12 kB Post calloc arr2: 9e67c000-b76f5000 Private_Dirty: 20 kB Post iterate 1/4 arr1 9e67b000-b76f5000 Private_Dirty: 51280 kB Post iterate arr1: 9e67a000-b76f5000 Private_Dirty: 205060 kB Post iterate arr2: 9e679000-b76f5000 Private_Dirty: 410096 kB Post free: 9e679000-9e67d000 Private_Dirty: 16 kB b76f2000-b76f5000 Private_Dirty: 12 kB

当创建虚拟页面时，系统通常会清除页面中的脏位。
当CPU写入该页面的一部分时，脏位被设置; 因此当被交换的页面被写入脏页时，会跳过干净的页面。

只是一个页面扩展进程内存映像的问题。

简介： 分析分配数组所需的时间时会解释时间差异 。最后一次分配的calloc只需要多一点时间，而另一个（或者全部使用mmap的时候）没有时间。内存中的实际分配在第一次访问时可能会延迟。

我对Linux的内存分配内部知之甚少。但是我稍微修改了脚本：我为每个数组操作添加了第三个数组和一些额外的迭代。我已经考虑到老专家的评论，分配阵列的时间没有考虑在内。

结论：使用calloc需要的时间比使用mmap进行分配的时间要长（当分配内存时，mmap virtualy不会使用任何时间，可能会在第一次访问时延迟），使用我的程序在使用mmap或calloc之间几乎没有区别为整体程序执行。

无论如何，首先要说的是，这两个内存分配都发生在内存映射区域，而不是在堆中。为了验证这一点，我添加了一个快速的n'脏暂停，因此你可以检查进程的内存映射（/ proc // maps）

现在你的问题，与calloc最后分配的数组似乎是真的在内存中分配（不推迟）。由于arr1和arr2的行为现在完全相同（第一次迭代很慢，后续迭代更快）。 Arr3在第一次迭代中速度更快，因为内存是先分配的。当使用A宏时，它是从这个好处arr1。我的猜测是内核预先分配了最后一个calloc的内存数组。为什么？我不知道…我已经测试了它也只有一个数组（所以我删除了所有arr2和arr3），然后我有相同的时间（大致）所有10次的arr1迭代。

malloc和mmap的行为都是相同的（结果没有在下面显示），第一次迭代很慢，所有3个数组的后续迭代更快。

注意：所有的结果在各种gcc优化标志（-O0到-O3）上是一致的，所以看起来行为的根源看起来并不是某种类型的gcc操作。

注2：在Ubuntu Precise Pangolin（kernel 3.2）上运行测试，使用GCC 4.6.3

 #include <stdlib.h> #include <stdio.h> #include <sys/mman.h> #include <time.h> #define SIZE 500002816 #define ITERATION 10 #if defined(USE_MMAP) # define ALLOC(a, b) (mmap(NULL, a * b, PROT_READ | PROT_WRITE, \ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)) #elif defined(USE_MALLOC) # define ALLOC(a, b) (malloc(b * a)) #elif defined(USE_CALLOC) # define ALLOC calloc #else # error "No alloc routine specified" #endif int main() { clock_t start, finish, gstart, gfinish; start = clock(); gstart = start; #ifdef A unsigned int *arr1 = ALLOC(sizeof(unsigned int), SIZE); unsigned int *arr2 = ALLOC(sizeof(unsigned int), SIZE); unsigned int *arr3 = ALLOC(sizeof(unsigned int), SIZE); #else unsigned int *arr3 = ALLOC(sizeof(unsigned int), SIZE); unsigned int *arr2 = ALLOC(sizeof(unsigned int), SIZE); unsigned int *arr1 = ALLOC(sizeof(unsigned int), SIZE); #endif finish = clock(); unsigned int i, j; double intermed, finalres; intermed = ((double)(finish - start))/CLOCKS_PER_SEC; printf("Time to create: %.2f\n", intermed); printf("arr1 addr: %p\narr2 addr: %p\narr3 addr: %p\n", arr1, arr2, arr3); finalres = 0; for (j = 0; j < ITERATION; j++) { start = clock(); { for (i = 0; i < SIZE; i++) arr1[i] = (i + 13) * 5; } finish = clock(); intermed = ((double)(finish - start))/CLOCKS_PER_SEC; finalres += intermed; printf("Time A: %.2f\n", intermed); } printf("Time A (average): %.2f\n", finalres/ITERATION); finalres = 0; for (j = 0; j < ITERATION; j++) { start = clock(); { for (i = 0; i < SIZE; i++) arr2[i] = (i + 13) * 5; } finish = clock(); intermed = ((double)(finish - start))/CLOCKS_PER_SEC; finalres += intermed; printf("Time B: %.2f\n", intermed); } printf("Time B (average): %.2f\n", finalres/ITERATION); finalres = 0; for (j = 0; j < ITERATION; j++) { start = clock(); { for (i = 0; i < SIZE; i++) arr3[i] = (i + 13) * 5; } finish = clock(); intermed = ((double)(finish - start))/CLOCKS_PER_SEC; finalres += intermed; printf("Time C: %.2f\n", intermed); } printf("Time C (average): %.2f\n", finalres/ITERATION); gfinish = clock(); intermed = ((double)(gfinish - gstart))/CLOCKS_PER_SEC; printf("Global Time: %.2f\n", intermed); return 0; }

结果：

使用USE_CALLOC

 Time to create: 0.13 arr1 addr: 0x7fabcb4a6000 arr2 addr: 0x7fabe917d000 arr3 addr: 0x7fac06e54000 Time A: 0.67 Time A: 0.48 ... Time A: 0.47 Time A (average): 0.48 Time B: 0.63 Time B: 0.47 ... Time B: 0.48 Time B (average): 0.48 Time C: 0.45 ... Time C: 0.46 Time C (average): 0.46

使用USE_CALLOC和A

 Time to create: 0.13 arr1 addr: 0x7fc2fa206010 arr2 addr: 0xx7fc2dc52e010 arr3 addr: 0x7fc2be856010 Time A: 0.44 ... Time A: 0.43 Time A (average): 0.45 Time B: 0.65 Time B: 0.47 ... Time B: 0.46 Time B (average): 0.48 Time C: 0.65 Time C: 0.48 ... Time C: 0.45 Time C (average): 0.48

使用USE_MMAP

 Time to create: 0.0 arr1 addr: 0x7fe6332b7000 arr2 addr: 0x7fe650f8e000 arr3 addr: 0x7fe66ec65000 Time A: 0.55 Time A: 0.48 ... Time A: 0.45 Time A (average): 0.49 Time B: 0.54 Time B: 0.46 ... Time B: 0.49 Time B (average): 0.50 Time C: 0.57 ... Time C: 0.40 Time C (average): 0.43