(gdb) p/x $r3
$5 = 0x3fee50f0
(gdb) p/x $r4
$6 = 0x3fee5100
(gdb)
Both $r3 and $r4 contain correct values.
We have narrowed the search down to a single loop. Here is its full disassembly:
0x20FD78 lfdu fp0, -0x20(r3)
0x20FD7C addi r4, r4, -0x20
0x20FD80 lfd fp1, 8(r3)
0x20FD84 lfd fp2, 0x10(r3)
0x20FD88 lfd fp3, 0x18(r3)
0x20FD8C dcbz r0, r4
0x20FD90 stfd fp0, 0(r4)
0x20FD94 stfd fp1, 8(r4)
0x20FD98 stfd fp2, 0x10(r4)
0x20FD9C stfd fp3, 0x18(r4)
0x20FDA0 bdnz 0x20FD78
0x20FDA4 blr
This loop, driven by the bdnz instruction, will be executed 5 times. Each iteration first decrements both pointers by 32 (lfdu updates $r3, addi updates $r4) and then copies 32 bytes from $r3 to $r4. Don't be confused by the floating-point instructions lfd/stfd used there: they perform no FP calculations, they are used only because they can read/write 8 bytes at once.
dcbz zeroes every byte of the cache block that contains the address in $r4. The idea is that whatever was in this block before can be wiped, because we're about to write new values there anyway.
A cache block on 32-bit PPC is 32 bytes long. On 970, it's 128 bytes long.
My bet is that the G5's dcbz zeroes more bytes than the corresponding dcbz on a 32-bit CPU. That could explain why everything at the address 0x3fee5000 gets wiped away.
The question is how to catch this bug. I assume that the memory corruption happens once the loop crosses the cache block boundary, i.e. when the address in $r4 < 0x3fee5080. This condition will be reached in the 5th loop iteration (0x3fee5100 - 32 bytes * 5 = 0x3fee5060).
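The difference between the two block sizes can be worked out with a few lines of Python. This is just a sketch of my assumption about how dcbz picks its block (round the address down to a multiple of the block size); dcbz_range is a made-up helper name, and the block size is assumed to be a power of two.

```python
def dcbz_range(ea, block_size):
    """Return (first, last) byte address zeroed by dcbz for effective address ea.

    Assumes block_size is a power of two.
    """
    first = ea & ~(block_size - 1)  # round down to the start of the block
    return first, first + block_size - 1


if __name__ == "__main__":
    for size in (32, 128):
        first, last = dcbz_range(0x3FEE5060, size)
        print(f"{size:3d}-byte block: zeroes {first:#x}..{last:#x}")
```

With a 32-byte block only the 32 bytes we are about to overwrite are cleared; with a 128-byte block the same instruction reaches all the way down to 0x3fee5000.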
To test that, I'd set a breakpoint at 0x20FD90 (right after dcbz) and monitor memory changes. Expected values of $r4 are:
1st iteration: 0x3fee50e0
2nd iteration: 0x3fee50c0
3rd iteration: 0x3fee50a0
4th iteration: 0x3fee5080
---- it's the cache block boundary ----
5th iteration: 0x3fee5060
---- end of loop ----
When our breakpoint is reached for the 5th time, $r4 should contain 0x3fee5060 and dcbz is expected to have zeroed the whole cache block 0x3fee5000...0x3fee507f (!)
You can verify that by dumping the memory block at 0x3fee5000 in the 5th iteration.
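Before trying it on real hardware, the expected outcome can be modeled in Python. This is a rough simulation under my assumptions, not a statement about what the CPU actually does: memory is a small bytearray filled with a non-zero pattern, run_loop, BASE and SIZE are made-up names, and the check is done right after the modeled dcbz, which corresponds to the breakpoint at 0x20FD90.

```python
BASE = 0x3FEE5000  # start of the modeled memory window
SIZE = 0x200       # window size, chosen arbitrarily for this sketch


def run_loop(block_size, r3=0x3FEE50F0, r4=0x3FEE5100, count=5):
    """Model the copy loop; return the iterations where BASE..BASE+31 is all zero."""
    mem = bytearray(b"\xAA" * SIZE)  # non-zero filler stands in for real data
    zero_seen = []
    for i in range(1, count + 1):
        r3 -= 0x20                                  # lfdu: pre-decrement source
        r4 -= 0x20                                  # addi: pre-decrement destination
        data = mem[r3 - BASE : r3 - BASE + 32]      # the four lfd loads
        first = (r4 & ~(block_size - 1)) - BASE     # dcbz: start of enclosing block
        mem[first : first + block_size] = bytes(block_size)
        if not any(mem[0:32]):                      # what x/8xw 0x3fee5000 would show
            zero_seen.append(i)
        mem[r4 - BASE : r4 - BASE + 32] = data      # the four stfd stores
    return zero_seen


if __name__ == "__main__":
    print("32-byte blocks: ", run_loop(32))
    print("128-byte blocks:", run_loop(128))
```

If the model is right, the 32-byte case never reports zeroes at 0x3fee5000, while the 128-byte case reports them on the 5th iteration only.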
Below is the same procedure as a GDB session:
(gdb) break *0x203ce8
(gdb) cont
(gdb) display/i $pc
(gdb) break *0x20FD90
(gdb) cont
... we should stop after the 1st dcbz here
(gdb) p/x $r4 (should be 0x3fee50e0)
(gdb) x/8xw 0x3fee5000 (should contain non-zero values)
(gdb) cont (execute 2nd dcbz)
(gdb) cont (execute 3rd dcbz)
(gdb) cont (execute 4th dcbz)
(gdb) p/x $r4 (should be 0x3fee5080)
(gdb) x/8xw 0x3fee5000 (should contain non-zero values)
(gdb) cont (execute 5th dcbz)
(gdb) p/x $r4 (should be 0x3fee5060)
(gdb) x/8xw 0x3fee5000 (will supposedly contain all zeroes)
Could you verify that?
Sorry for the long post. I hope you can follow me...