"Unexplainable" core dump

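I've seen many core dumps in my life, but this one has me stumped.

Context:

a multi-threaded Linux/x86_64 program running on a cluster of AMD Barcelona CPUs
the code that crashes is executed a lot
running 1000 instances of the program (the exact same optimized binary) under load produces 1-2 crashes per hour
the crashes happen on different machines (but the machines themselves are nearly identical)
the crashes all look the same (same exact address, same call stack)

Here are the details of the crash: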

    Program terminated with signal 11, Segmentation fault.
    #0  0x00000000017bd9fd in Foo()
    (gdb) x/i $pc
    => 0x17bd9fd <_Z3Foov+349>: rex.RB orb $0x8d,(%r15)
    (gdb) x/6i $pc-12
    0x17bd9f1 <_Z3Foov+337>: mov    (%rbx),%eax
    0x17bd9f3 <_Z3Foov+339>: mov    %rbx,%rdi
    0x17bd9f6 <_Z3Foov+342>: callq  *0x70(%rax)
    0x17bd9f9 <_Z3Foov+345>: cmp    %eax,%r12d
    0x17bd9fc <_Z3Foov+348>: mov    %eax,-0x80(%rbp)
    0x17bd9ff <_Z3Foov+351>: jge    0x17bd97e <_Z3Foov+222>

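You'll notice that the crash happened in the middle of the instruction at 0x17bd9fc, shortly after the return from the call at 0x17bd9f6 to a virtual function.

When I examine the virtual table, I see that it is not corrupted in any way: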

    (gdb) x/a $rbx
    0x2ab094951f80: 0x3f8c550 <_ZTI4Foo1+16>
    (gdb) x/a 0x3f8c550+0x70
    0x3f8c5c0 <_ZTI4Foo1+128>: 0x2d3d7b0 <_ZN4Foo13GetEv>

and that it points to this trivial function (as expected):

    (gdb) disas 0x2d3d7b0
    Dump of assembler code for function _ZN4Foo13GetEv:
       0x0000000002d3d7b0 <+0>: push   %rbp
       0x0000000002d3d7b1 <+1>: mov    0x70(%rdi),%eax
       0x0000000002d3d7b4 <+4>: mov    %rsp,%rbp
       0x0000000002d3d7b7 <+7>: leaveq
       0x0000000002d3d7b8 <+8>: retq
    End of assembler dump.
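The vtable lookup can be replayed as plain address arithmetic. The sketch below is only an illustration: the `memory` dictionary holds nothing but the two words printed by gdb in the session, and `resolve_virtual_call` is a made-up helper name that mimics the `mov (%rbx),%eax` / `callq *0x70(%rax)` pair.

```python
# Memory words copied from the gdb session (address -> value read with x/a).
memory = {
    0x2AB094951F80: 0x3F8C550,  # x/a $rbx: first word of the object (vtable pointer)
    0x3F8C5C0: 0x2D3D7B0,       # x/a 0x3f8c550+0x70: the slot used by the call
}
symbols = {0x2D3D7B0: "_ZN4Foo13GetEv"}  # mangled name of Foo1::Get()


def resolve_virtual_call(obj_addr, slot_offset):
    """Hypothetical helper: load the vtable pointer, then the slot at the given offset."""
    vptr = memory[obj_addr]             # mov (%rbx),%eax
    return memory[vptr + slot_offset]   # callq *0x70(%rax)


target = resolve_virtual_call(0x2AB094951F80, 0x70)
print(hex(target), symbols[target])
```

If the vtable pointer or the slot had been overwritten, this lookup would land somewhere other than `_ZN4Foo13GetEv`; here it does not.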

Further, when I look at the return address that Foo1::Get() should have returned to:

    (gdb) x/a $rsp-8
    0x2afa55602048: 0x17bd9f9 <_Z3Foov+345>

I see that it points to the right instruction, so it's as if, during the return from Foo1::Get(), some gremlin came along and incremented %rip by 4.
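The "incremented by 4" observation is easy to double-check from the addresses in the listing. This is a small sketch that uses only values copied from the gdb output; `containing_instruction` is a helper name invented for the illustration.

```python
# (start address, text) pairs copied from the `x/6i $pc-12` listing.
instructions = [
    (0x17BD9F1, "mov (%rbx),%eax"),
    (0x17BD9F3, "mov %rbx,%rdi"),
    (0x17BD9F6, "callq *0x70(%rax)"),
    (0x17BD9F9, "cmp %eax,%r12d"),
    (0x17BD9FC, "mov %eax,-0x80(%rbp)"),
    (0x17BD9FF, "jge 0x17bd97e"),
]
crash_pc = 0x17BD9FD        # frame #0 / x/i $pc
return_address = 0x17BD9F9  # x/a $rsp-8


def containing_instruction(pc):
    """Return (start, text, offset) for the listed instruction that covers pc, else None."""
    for (start, text), (next_start, _) in zip(instructions, instructions[1:]):
        if start <= pc < next_start:
            return start, text, pc - start
    return None


skew = crash_pc - return_address
start, text, offset = containing_instruction(crash_pc)
print(f"crash pc is {skew} bytes past the return address")
print(f"it lands {offset} byte(s) into `{text}` at {start:#x}")
```

So the faulting address is one byte into the `mov` that follows the `cmp` the call should have returned to.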

Plausible explanations?

So, unlikely as it may seem, we appear to have hit an actual bona fide CPU bug.

http://support.amd.com/us/Processor_TechDocs/41322_10h_Rev_Gd.pdf has erratum #721:

721 Processor May Incorrectly Update Stack Pointer

Description

 Under a highly specific and detailed set of internal timing conditions, the processor may incorrectly update the stack pointer after a long series of push and/or near-call instructions, or a long series of pop and/or near-return instructions. The processor must be in 64-bit mode for this erratum to occur.

Potential Effect on System

 The stack pointer value jumps by a value of approximately 1024, either in the positive or negative direction. This incorrect stack pointer causes unpredictable program or system behavior, usually observed as a program exception or crash (for example, a #GP or #UD).

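I once saw an "illegal opcode" crash right in the middle of an instruction. I was working on a Linux port. Long story short, Linux subtracts from the instruction pointer in order to restart a syscall, and in my case this was happening twice (if two signals arrived at the same time).

So that's one possible culprit: the kernel fiddling with your instruction pointer. There may be some other cause in your case.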
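To make the double-restart failure mode concrete, here is a toy model. It assumes x86-64, where the `syscall` instruction is two bytes long (`0f 05`); the answer does not say which architecture the port targeted, and the address below is invented for the example.

```python
SYSCALL_LEN = 2  # assumption: x86-64 `syscall` is encoded as 0f 05


def restart_syscall(ip):
    """Back the saved instruction pointer up so the syscall instruction runs again."""
    return ip - SYSCALL_LEN


syscall_addr = 0x401000                      # hypothetical address of the syscall
ip_after_syscall = syscall_addr + SYSCALL_LEN

once = restart_syscall(ip_after_syscall)     # correct: points at the syscall again
twice = restart_syscall(once)                # bug: two bytes before it

print(f"restarted once:  {once:#x} (offset {once - syscall_addr:+d})")
print(f"restarted twice: {twice:#x} (offset {twice - syscall_addr:+d})")
```

After the second adjustment the instruction pointer sits two bytes before the syscall, which may well be the middle of the preceding instruction. Note that this mechanism moves the pointer backwards, whereas the crash in the question is four bytes forwards.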

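Bear in mind that the processor will sometimes interpret bytes as an instruction even when they were never meant to be decoded that way. So the processor may have executed the "instruction" at 0x17bd9fa, then moved on to 0x17bd9fd, and then generated an illegal opcode exception. (I just made that number up, but experimenting with a disassembler can show you where the processor might have "entered" the instruction stream.)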
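As a rough illustration of decoding from the wrong offset, the sketch below rebuilds the bytes at 0x17bd9fc by hand and decodes them starting one byte in. The byte string is an assumption: `45 80 0f 8d` follows from gdb's own decode at 0x17bd9fd, `89` is the usual opcode for a 32-bit register store, and the `jge` displacement is computed from the target in the listing. `decode_group1_imm8` is a deliberately tiny, hypothetical decoder that understands only one instruction form.

```python
base = 0x17BD9FC
# mov %eax,-0x80(%rbp) = 89 45 80 ; jge 0x17bd97e = 0f 8d 79 ff ff ff (hand-assembled)
code = bytes.fromhex("894580" "0f8d79ffffff")

GROUP1 = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"] + [f"r{n}" for n in range(8, 16)]


def decode_group1_imm8(addr):
    """Decode `[REX] 80 /r ib` with plain (reg) addressing at addr, else return None."""
    i = addr - base
    rex = 0
    if code[i] & 0xF0 == 0x40:       # REX prefix is 0100WRXB
        rex = code[i]
        i += 1
    if code[i] != 0x80:              # only the 8-bit immediate group-1 opcode
        return None
    modrm, imm = code[i + 1], code[i + 2]
    mod, op, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0 or rm in (4, 5):     # skip SIB and disp32 forms
        return None
    reg = rm + (8 if rex & 1 else 0)  # REX.B extends the base register
    return f"{GROUP1[op]}b ${imm:#x},(%{REGS[reg]})"


print(decode_group1_imm8(0x17BD9FC))  # aligned start: not this instruction form
print(decode_group1_imm8(0x17BD9FD))  # one byte in: matches gdb's decode at the crash pc
```

Starting one byte into the `mov`, the tail of that instruction plus the first two bytes of the `jge` read as a store through `%r15`, which is what gdb shows at the crash address.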

Happy debugging!