Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3

On Mon, Mar 16, 2015 at 10:09:04PM +0300, ????? ?????? wrote:
> apx@ct:~$ hd zeros    
> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 001d4a80  1c f0 9f e5 1c f0 9f e5  1c f0 9f e5 1c f0 9f e5  |................|
> *
> 001d4aa0  1c f0 9f e5 74 fb 00 00  bc ff 93 00 c0 ff 93 00  |....t...........|
> 001d4ab0  c4 ff 93 00 c8 ff 93 00  cc ff 93 00 d0 ff 93 00  |................|
> 001d4ac0  d4 ff 93 00 d8 ff 93 00  13 00 00 00 28 63 29 20  |............(c) |
> 001d4ad0  43 6f 70 79 72 69 67 68  74 20 32 30 30 37 2d 32  |Copyright 2007-2|
> 001d4ae0  30 31 32 2c 20 46 72 65  65 73 63 61 6c 65 20 53  |012, Freescale S|
> 001d4af0  65 6d 69 63 6f 6e 64 75  63 74 6f 72 2e 20 41 6c  |emiconductor. Al|
> 001d4b00  6c 20 72 69 67 68 74 73  20 72 65 73 65 72 76 65  |l rights reserve|
> 001d4b10  64 2e 00 00 dd 00 2c 41  11 73 00 00 d3 74 00 00  |d.....,A.s...t..|
> 001d4b20  3d 75 00 00 a9 78 00 00  4f 78 00 00 75 77 00 00  |=u...x..Ox..uw..|
> 001d4b30  07 76 00 00 c3 79 00 00  09 7a 00 00 75 7a 00 00  |.v...y...z..uz..|

I'm seeing this too with 4.0-rc4 _without_ net-next:

0000ba00  2d 20 d4 e5 80 00 52 e3  01 00 52 13 00 20 a0 03  |- ....R...R.. ..|
0000ba10  01 20 a0 13 a4 00 00 0a  1c f0 9f e5 1c f0 9f e5  |. ..............|
0000ba20  1c f0 9f e5 1c f0 9f e5  1c f0 9f e5 1c f0 9f e5  |................|
0000ba30  1c f0 9f e5 1c f0 9f e5  1c f0 9f e5 74 fb 00 00  |............t...|
0000ba40  bc ff 93 00 c0 ff 93 00  c4 ff 93 00 c8 ff 93 00  |................|
0000ba50  cc ff 93 00 d0 ff 93 00  d4 ff 93 00 d8 ff 93 00  |................|
0000ba60  13 00 00 00 28 63 29 20  43 6f 70 79 72 69 67 68  |....(c) Copyrigh|
0000ba70  74 20 32 30 30 37 2d 32  30 31 32 2c 20 46 72 65  |t 2007-2012, Fre|
0000ba80  65 73 63 61 6c 65 20 53  65 6d 69 63 6f 6e 64 75  |escale Semicondu|
0000ba90  63 74 6f 72 2e 20 41 6c  6c 20 72 69 67 68 74 73  |ctor. All rights|
0000baa0  20 72 65 73 65 72 76 65  64 2e 00 00 dd 00 2c 41  | reserved.....,A|
0000bab0  11 73 00 00 d3 74 00 00  3d 75 00 00 a9 78 00 00  |.s...t..=u...x..|
0000bac0  4f 78 00 00 75 77 00 00  07 76 00 00 c3 79 00 00  |Ox..uw...v...y..|
0000bad0  09 7a 00 00 75 7a 00 00  97 22 00 00 49 1f 00 00  |.z..uz..."..I...|
0000bae0  b9 21 00 00 ff 70 00 00  90 21 90 00 23 4a 52 f8  |.!...p...!..#JR.|
0000baf0  20 00 00 68 c8 40 00 f0  01 00 70 47 10 b5 01 23  | ..h.@....pG...#|
0000bb00  1e 4c 1c 3c 8b 40 54 f8  20 00 01 68 0a b1 19 43  |.L.<.@T. ..h...C|

In my case, this appeared in the middle of the bluetooth.ko module.

If we look at physical memory at zero (I'll omit the devmem2 commands):

Value at address 0x00000000: 0xe59ff01c
Value at address 0x00000004: 0xe59ff01c
Value at address 0x00000008: 0xe59ff01c
Value at address 0x0000000c: 0xe59ff01c
Value at address 0x00000010: 0xe59ff01c
Value at address 0x00000014: 0xe59ff01c
Value at address 0x00000018: 0xe59ff01c
Value at address 0x0000001c: 0xe59ff01c
Value at address 0x00000020: 0xe59ff01c
Value at address 0x00000024: 0x0000fb74
Value at address 0x00000028: 0x0093ffbc
Value at address 0x0000002c: 0x0093ffc0
Value at address 0x00000030: 0x0093ffc4
Value at address 0x00000034: 0x0093ffc8
Value at address 0x00000038: 0x0093ffcc
Value at address 0x0000003c: 0x0093ffd0
Value at address 0x00000040: 0x0093ffd4
Value at address 0x00000044: 0x0093ffd8
Value at address 0x00000048: 0x00000013
Value at address 0x0000004c: 0x20296328 <== start of (c) string.
Value at address 0x00000050: 0x79706f43
Value at address 0x00000054: 0x68676972
Value at address 0x00000058: 0x30322074
Value at address 0x0000005c: 0x322d3730
Value at address 0x00000060: 0x2c323130
Value at address 0x00000064: 0x65724620
Value at address 0x00000068: 0x61637365
Value at address 0x0000006c: 0x5320656c
Value at address 0x00000070: 0x63696d65
Value at address 0x00000074: 0x75646e6f
Value at address 0x00000078: 0x726f7463
Value at address 0x0000007c: 0x6c41202e
Value at address 0x00000080: 0x6972206c

This matches the "corruption".  So, the FEC driver is DMAing from
physical address zero.  There's only two ways this can happen - either
if dma_map_single() returns zero, or if the ring already contains a
zero address.

I've thrown into the FEC driver a load of WARN_ON_ONCE(addr == 0) after
_every_ dma_map_single(), and I also have pre-standing detection of
highmem pages in fec_enet_txq_submit_frag_skb().  None of this is firing.

We know that the ARM architecture can write to memory (which includes
memory allocated via dma_alloc_coherent()) with weak ordering, so to
rule that out, I tried adding a barrier between writing the address
and size, and writing the status field everywhere where we touch the
transmit ring.  That had no effect; I still see the corruption.

Next, I've tried setting free descriptors to have a physical address
of 0x40 - which should change the pattern of the corruption (in that
the "(c)" string should appear earlier in the corruption.)  The
reasoning is to prove whether the FEC TX DMA is reading from zero
because it's being instructed to in the transmit ring, or whether
there's something else going on.  The result is (in a different module):

00001000  61 6c 5f 72 65 67 69 73  74 65 72 5f 64 72 69 76  |al_register_driv|
00001010  65 72 73 00 00 00 00 00  00 00 00 00 00 00 00 00  |ers.............|
00001020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001030  00 00 00 00 1d 84 2e 2b  75 73 62 5f 73 65 72 69  |.......+usb_seri|
00001040  61 6c 5f 67 65 6e 65 72  69 63 5f 6f 70 65 6e 00  |al_generic_open.|
00001050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001060  00 00 00 00 00 00 00 00  d4 ff 93 00 d8 ff 93 00  |................|
00001070  13 00 00 00 28 63 29 20  43 6f 70 79 72 69 67 68  |....(c) Copyrigh|
00001080  74 20 32 30 30 37 2d 32  30 31 32 2c 20 46 72 65  |t 2007-2012, Fre|
00001090  65 73 63 61 6c 65 20 53  65 6d 69 63 6f 6e 64 75  |escale Semicondu|
000010a0  63 74 6f 72 2e 20 41 6c  6c 20 72 69 67 68 74 73  |ctor. All rights|
000010b0  20 72 65 73 65 72 76 65  64 2e 00 00 dd 00 2c 41  | reserved.....,A|
000010c0  11 73 00 00 d3 74 00 00  3d 75 00 00 a9 78 00 00  |.s...t..=u...x..|
000010d0  4f 78 00 00 75 77 00 00  07 76 00 00 c3 79 00 00  |Ox..uw...v...y..|

Here, the bytes which start at 0x1068 match what's in the iMX6 at address
0x40, but the bytes which would've come before are missing.  So, the FEC
is transmitting from a descriptor in the ring which contains an address
which has been freed - and that's with my additional barriers in place
which should ensure that the descriptor gets the address and size updates
_before_ it sees that it owns the descriptor.

The question is how, and this can be revealed by this bit of debugging in
fec_enet_tx_queue():

                for (i = 0; i < bdnum; i++) {
                        if (WARN_ON_ONCE(bdp->cbd_sc & BD_ENET_TX_READY))
                                fec_dump(ndev);
                        if (!IS_TSO_HEADER(txq, bdp->cbd_bufaddr))
                                dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
                                                 bdp->cbd_datlen, DMA_TO_DEVICE);
                        bdp->cbd_bufaddr = 0x40;
                        if (i < bdnum - 1)
                                bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);
                }
                txq->tx_skbuff[index] = NULL;

Sure enough, that WARN_ON_ONCE() triggers - and we get _soo_ many dumps
of the transmit ring from fec_dump() that it shows that the driver is very
buggy here.  Here's an illustration of the ring state (un-wrapped to show
the order):

 460  H 0x1c00 0x00000040 1448   (null)
 461    0x1400 0x00000040   66   (null)
 462    0x1c00 0x391c0110 1448   (null)
 463    0x1400 0x8005e780   66   (null)
 464    0x1c00 0x3d120000 1448   (null)
 465    0x1400 0x8005e880   66   (null)
 466    0x1c00 0x391c0c60 1448   (null)
 467    0x1400 0x8005e980   66   (null)
 468    0x1c00 0x3d122000 1448   (null)
...
 508    0x1c00 0x3d136000 1448   (null)
 509    0x1400 0x8005fe80   66   (null)
 510    0x1c00 0x391c88d0 1448   (null)
 511    0x3400 0x8005ff80   66   (null)
   0    0x9c00 0x3d038000 1448   (null)
   1    0x9400 0x80050080   66   (null)
   2    0x9c00 0x391c9420 1448   (null)
   3    0x9400 0x80050180   66   (null)
...
  17    0x9400 0x80050880   66   (null)
  18    0x9c00 0x391cc160 1448   (null)
  19    0x9400 0x80050980   66   (null)
  20    0x9c00 0x3d042000  704 ec6106b0
  21 S  0x1c00 0x00000040 1448   (null)

We hit entry 462, and found that it was still owned by the FEC, so we
triggered the dump - during the time it took to produce the dump,
packets from 462 up to 511 were transmitted, clearing their ownership
bit (bit 15 of the first word.)

How can this happen?  Well, if we're submitted a _huge_ unfragmented
TSO skbuff, then:

                index = fec_enet_get_bd_index(txq->tx_bd_base, bdp_t, fep);
                skb = txq->tx_skbuff[index];
                while (!skb) {
                        bdp_t = fec_enet_get_nextdesc(bdp_t, fep, queue_id);
                        index = fec_enet_get_bd_index(txq->tx_bd_base, bdp_t, fep);
                        skb = txq->tx_skbuff[index];
                        bdnum++;
                }
                if (skb_shinfo(skb)->nr_frags &&
                    (status = bdp_t->cbd_sc) & BD_ENET_TX_READY)
                        break;

skb_shinfo(skb)->nr_frags will be zero, and so we won't test whether the
last entry in the submitted group has finished being transmitted - instead,
we will blindly continue on to the loop which frees all the descriptors,
trampling over those which haven't yet been transmitted.

I can see no reason to test for skb_shinfo(skb)->nr_frags here.  If the
last buffer descriptor for the submitted skb is marked as still being
busy, then the skb *can't* be reaped.  There's no question whether it's
a fragmented skb or not - that's completely irrelevant.

In fact, with that bogus test removed, the corruption goes away.  Patch
below for others to test - once I've finished removing the rest of my
debugging, I'll send it properly.

(It should be noted that calling fec_dump() as per above is enough to make
the bug go away - because it delays overwriting the buffer address long
enough that the FEC can transmit the packets before we stomp over the
still-to-be-transmitted entries.)

Given that this bug can seriously screw data up in undetectable ways (TCP
checksums don't save you, because the FEC generates them on the data which
it read from memory, even if it happened to read the data from the SoC's
boot ROM) we do need to get this fixed ASAP.

Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3

Commit Message

Comments

Patch