Message ID | 20210129195240.31871-2-TheSven73@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | lan743x speed boost | expand |
Context | Check | Description |
---|---|---|
netdev/cover_letter | success | Link |
netdev/fixes_present | success | Link |
netdev/patch_count | success | Link |
netdev/tree_selection | success | Clearly marked for net-next |
netdev/subject_prefix | success | Link |
netdev/cc_maintainers | success | CCed 5 of 5 maintainers |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Link |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | fail | Errors and warnings before: 0 this patch: 1 |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/verify_fixes | success | Link |
netdev/checkpatch | warning | WARNING: line length of 82 exceeds 80 columns |
netdev/build_allmodconfig_warn | fail | Errors and warnings before: 0 this patch: 1 |
netdev/header_inline | success | Link |
netdev/stable | success | Stable not CCed |
> diff --git a/drivers/net/ethernet/microchip/lan743x_main.c b/drivers/net/ethernet/microchip/lan743x_main.c > index f1f6eba4ace4..f485320e5784 100644 > --- a/drivers/net/ethernet/microchip/lan743x_main.c > +++ b/drivers/net/ethernet/microchip/lan743x_main.c > @@ -1957,11 +1957,11 @@ static int lan743x_rx_next_index(struct lan743x_rx *rx, int index) > > static struct sk_buff *lan743x_rx_allocate_skb(struct lan743x_rx *rx) > { > - int length = 0; > + struct net_device *netdev = rx->adapter->netdev; > > - length = (LAN743X_MAX_FRAME_SIZE + ETH_HLEN + 4 + RX_HEAD_PADDING); > - return __netdev_alloc_skb(rx->adapter->netdev, > - length, GFP_ATOMIC | GFP_DMA); > + return __netdev_alloc_skb(netdev, > + netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING, > + GFP_ATOMIC | GFP_DMA); > } > > static void lan743x_rx_update_tail(struct lan743x_rx *rx, int index) > @@ -1977,9 +1977,10 @@ static int lan743x_rx_init_ring_element(struct lan743x_rx *rx, int index, > { > struct lan743x_rx_buffer_info *buffer_info; > struct lan743x_rx_descriptor *descriptor; > - int length = 0; > + struct net_device *netdev = rx->adapter->netdev; > + int length; Please keep to reverse christmass tree. > > - length = (LAN743X_MAX_FRAME_SIZE + ETH_HLEN + 4 + RX_HEAD_PADDING); > + length = netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING; > descriptor = &rx->ring_cpu_ptr[index]; > buffer_info = &rx->buffer_info[index]; > buffer_info->skb = skb; > @@ -2148,11 +2149,18 @@ static int lan743x_rx_process_packet(struct lan743x_rx *rx) > descriptor = &rx->ring_cpu_ptr[first_index]; > > /* unmap from dma */ > + packet_length = RX_DESC_DATA0_FRAME_LENGTH_GET_ > + (descriptor->data0); > if (buffer_info->dma_ptr) { > - dma_unmap_single(&rx->adapter->pdev->dev, > - buffer_info->dma_ptr, > - buffer_info->buffer_length, > - DMA_FROM_DEVICE); > + dma_sync_single_for_cpu(&rx->adapter->pdev->dev, > + buffer_info->dma_ptr, > + packet_length, > + DMA_FROM_DEVICE); > + dma_unmap_single_attrs(&rx->adapter->pdev->dev, > + buffer_info->dma_ptr, > + buffer_info->buffer_length, > + DMA_FROM_DEVICE, > + DMA_ATTR_SKIP_CPU_SYNC); So this patch appears to contain two different changes 1) You only allocate a receive buffer as big as the MTU plus overheads 2) You change the cache operations to operate on the received length. The first change should be completely safe, and i guess, is giving most of the benefits. The second one is where interesting things might happen. So please split this patch into two. If it does break, we can git bisect, and probably end up on the second patch. Thanks Andrew
On Fri, 29 Jan 2021 14:52:35 -0500 Sven Van Asbroeck wrote: > From: Sven Van Asbroeck <thesven73@gmail.com> > > The buffers in the lan743x driver's receive ring are always 9K, > even when the largest packet that can be received (the mtu) is > much smaller. This performs particularly badly on cpu archs > without dma cache snooping (such as ARM): each received packet > results in a 9K dma_{map|unmap} operation, which is very expensive > because cpu caches need to be invalidated. > > Careful measurement of the driver rx path on armv7 reveals that > the cpu spends the majority of its time waiting for cache > invalidation. > > Optimize as follows: > > 1. set rx ring buffer size equal to the mtu. this limits the > amount of cache that needs to be invalidated per dma_map(). > > 2. when dma_unmap()ping, skip cpu sync. Sync only the packet data > actually received, the size of which the chip will indicate in > its rx ring descriptors. this limits the amount of cache that > needs to be invalidated per dma_unmap(). > > These optimizations double the rx performance on armv7. > Third parties report 3x rx speedup on armv8. > > Performance on dma cache snooping architectures (such as x86) > is expected to stay the same. > > Tested with iperf3 on a freescale imx6qp + lan7430, both sides > set to mtu 1500 bytes, measure rx performance: > > Before: > [ ID] Interval Transfer Bandwidth Retr > [ 4] 0.00-20.00 sec 550 MBytes 231 Mbits/sec 0 > After: > [ ID] Interval Transfer Bandwidth Retr > [ 4] 0.00-20.00 sec 1.33 GBytes 570 Mbits/sec 0 > > Test by Anders Roenningen (anders@ronningen.priv.no) on armv8, > rx iperf3: > Before 102 Mbits/sec > After 279 Mbits/sec > > Signed-off-by: Sven Van Asbroeck <thesven73@gmail.com> You may need to rebase to see this: drivers/net/ethernet/microchip/lan743x_main.c:2123:41: warning: restricted __le32 degrades to integer
On Fri, Jan 29, 2021 at 5:01 PM Jakub Kicinski <kuba@kernel.org> wrote: > > You may need to rebase to see this: > > drivers/net/ethernet/microchip/lan743x_main.c:2123:41: warning: restricted __le32 degrades to integer Good catch. The problem goes away with the next commit in the set. This is probably because I rebased to the little endian/big endian patch at the last minute. I'll fix it up in v2.
Hi Andrew, thank you so much for looking at this patch ! On Fri, Jan 29, 2021 at 3:36 PM Andrew Lunn <andrew@lunn.ch> wrote: > > So this patch appears to contain two different changes > 1) You only allocate a receive buffer as big as the MTU plus overheads > 2) You change the cache operations to operate on the received length. > > The first change should be completely safe, and i guess, is giving > most of the benefits. The second one is where interesting things might > happen. So please split this patch into two. If it does break, we can > git bisect, and probably end up on the second patch. > Yes, I tested this extensively on arm7, but you're right, it might behave differently on other platforms. I will split into two, as you suggested.
Sven, see below comments > @@ -2148,11 +2149,18 @@ static int lan743x_rx_process_packet(struct > lan743x_rx *rx) > descriptor = &rx->ring_cpu_ptr[first_index]; > > /* unmap from dma */ > + packet_length = RX_DESC_DATA0_FRAME_LENGTH_GET_ > + (descriptor->data0); It appears you moved this packet_length assignment from just below the following if block, however you left out the le32_to_cpu.See next comment > if (buffer_info->dma_ptr) { > - dma_unmap_single(&rx->adapter->pdev->dev, > - buffer_info->dma_ptr, > - buffer_info->buffer_length, > - DMA_FROM_DEVICE); > + dma_sync_single_for_cpu(&rx->adapter->pdev->dev, > + buffer_info->dma_ptr, > + packet_length, > + DMA_FROM_DEVICE); > + dma_unmap_single_attrs(&rx->adapter->pdev->dev, > + buffer_info->dma_ptr, > + buffer_info->buffer_length, > + DMA_FROM_DEVICE, > + > + DMA_ATTR_SKIP_CPU_SYNC); > buffer_info->dma_ptr = 0; > buffer_info->buffer_length = 0; > } Just below here is the following line packet_length = RX_DESC_DATA0_FRAME_LENGTH_GET_ (le32_to_cpu(descriptor->data0)); This line was moved above the previous if block, but the le32_to_cpu was removed. Is that intentional? Also I don't see any mention of this packet_length assignment (after the if block) being removed. Since packet_length already contains this value, it seems unnecessary to keep this assignment. > @@ -2167,8 +2175,8 @@ static int lan743x_rx_process_packet(struct > lan743x_rx *rx) > int index = first_index; > > /* multi buffer packet not supported */ > - /* this should not happen since > - * buffers are allocated to be at least jumbo size > + /* this should not happen since buffers are allocated > + * to be at least the mtu size configured in the mac. > */ > > /* clean up buffers */ @@ -2628,6 +2636,9 @@ static int > lan743x_netdev_change_mtu(struct net_device *netdev, int new_mtu) > struct lan743x_adapter *adapter = netdev_priv(netdev); > int ret = 0; > > + if (netif_running(netdev)) > + return -EBUSY; > + > ret = lan743x_mac_set_mtu(adapter, new_mtu); > if (!ret) > netdev->mtu = new_mtu; > -- > 2.17.1
Hi Bryan, thank you so much for reviewing, I really appreciate it. On Sat, Jan 30, 2021 at 5:11 PM <Bryan.Whitehead@microchip.com> wrote: > > > /* unmap from dma */ > > + packet_length = RX_DESC_DATA0_FRAME_LENGTH_GET_ > > + (descriptor->data0); > It appears you moved this packet_length assignment from just below the following if block, however you left out the le32_to_cpu.See next comment > Correct on both counts. This is a merge snafu that crept in when I rebased to Alexey's very recent little/big endian patch, at the last minute. My tests didn't catch it, because I'm running on a little-endian cpu, where le32_to_cpu() compiles to a nop. Had I had the good sense to run sparse on every patch, like Jakub has, I would have caught it...
On Sat, Jan 30, 2021 at 5:11 PM <Bryan.Whitehead@microchip.com> wrote: > > It appears you moved this packet_length assignment from just below the following if block, however you left out the le32_to_cpu.See next comment PS this merge snafu is removed completely by the next patch in the set. So this will not prevent you from reviewing/testing the multi-buffer support, should you want to.
This is a pattern we've seen in a few other net driver, so we should be ok. It just is rather hairy and needs a good justification, which seems to be given here.
Hi Sven I can confirm great stability improvement after your patch "lan743x: boost performance on cpu archs w/o dma cache snooping". Please note, that test_ber opens raw sockets as s = socket(AF_PACKET, SOCK_RAW, ETH_P_ALL) and resulting 'average speed' is a average egress speed. Test machine is Intel Pentium G4560 3.50GHz lan743x with rejected virtual phy 'inside' What I had before: $ ifmtu eth7 500 $ test_ber -l eth7 -c 1000 -n 1000000 -f500 --no-conf ... number of sent packets = 1000000 number of received packets = 289017 number of lost packets = 710983 number of out of order packets = 0 number of bit errors = 0 total errors detected = 710983 bit error rate = 0.710983 average speed: 429.3799 Mbit/s $ ifmtu eth7 1500 $ sudo test_ber -l eth7 -c 1000 -n 1000000 -f1500 --no-conf ... number of sent packets = 1000000 number of received packets = 577194 number of lost packets = 422806 number of out of order packets = 0 number of bit errors = 0 total errors detected = 422806 bit error rate = 0.422806 average speed: 644.6557 Mbit/s --- and what I had after your patch: $ ifmtu eth7 500 $ test_ber -l eth7 -c 1000 -n 1000000 -f500 --no-conf ... number of sent packets = 1000000 number of received packets = 711329 number of lost packets = 288671 number of out of order packets = 0 number of bit errors = 0 total errors detected = 288671 bit error rate = 0.288671 average speed: 429.2263 Mbit/s $ ifmtu eth7 1500 $ test_ber -l eth7 -c 1000 -n 1000000 -f1500 --no-conf ... number of sent packets = 1000000 number of received packets = 1000000 number of lost packets = 0 number of out of order packets = 0 number of bit errors = 0 total errors detected = 0 bit error rate = 0 average speed: 644.5405 Mbit/s
Hi Christoph, On Fri, Feb 5, 2021 at 4:31 AM Christoph Hellwig <hch@lst.de> wrote: > > This is a pattern we've seen in a few other net driver, so we should > be ok. It just is rather hairy and needs a good justification, which > seems to be given here. Thank you so much for taking the time to look into this. That is certainly good news !!
Hi Sergej, On Fri, Feb 5, 2021 at 7:44 AM Sergej Bauer <sbauer@blackbox.su> wrote: > > Hi Sven > I can confirm great stability improvement after your patch > "lan743x: boost performance on cpu archs w/o dma cache snooping". > > Test machine is Intel Pentium G4560 3.50GHz > lan743x with rejected virtual phy 'inside' Interesting, so the speed boost patch seems to improve things even on Intel... Would you be able to apply and test the multi-buffer patch as well? To do that, you can simply apply patches [2/6] and [3/6] on top of what you already have. Keeping in mind that Bryan has identified an issue with the above patch, which will get fixed in v2. So YMMV.
On Friday, February 5, 2021 5:07:22 PM MSK you wrote: > Hi Sergej, > > On Fri, Feb 5, 2021 at 7:44 AM Sergej Bauer <sbauer@blackbox.su> wrote: > > Hi Sven > > I can confirm great stability improvement after your patch > > "lan743x: boost performance on cpu archs w/o dma cache snooping". > > > > Test machine is Intel Pentium G4560 3.50GHz > > lan743x with rejected virtual phy 'inside' > > Interesting, so the speed boost patch seems to improve things even on > Intel... > > Would you be able to apply and test the multi-buffer patch as well? > To do that, you can simply apply patches [2/6] and [3/6] on top of > what you already have. > Hi Sven Tests after applying patches [2/6] and [3/6] are: $ ifmtu eth7 500 $ sudo test_ber -l eth7 -c 1000 -n 1000000 -f500 --no-conf ... number of sent packets = 1000000 number of received packets = 713288 number of lost packets = 286712 number of out of order packets = 0 number of bit errors = 0 total errors detected = 286712 bit error rate = 0.286712 average speed: 427.8043 Mbit/s $ ifmtu eth7 1500 $ sudo test_ber -l eth7 -c 1000 -n 1000000 -f500 --no-conf ... number of sent packets = 1000000 number of received packets = 707869 number of lost packets = 292131 number of out of order packets = 0 number of bit errors = 0 total errors detected = 292131 bit error rate = 0.292131 average speed: 431.0163 Mbit/s $ sudo test_ber -l eth7 -c 1000 -n 1000000 -f1500 --no-conf ... number of sent packets = 1000000 number of received packets = 1000000 number of lost packets = 0 number of out of order packets = 0 number of bit errors = 0 total errors detected = 0 bit error rate = 0 average speed: 646.4932 Mbit/s > Keeping in mind that Bryan has identified an issue with the above > patch, which will get fixed in v2. So YMMV. I'll perform tests with v2 as well.
Hi Sergej, On Fri, Feb 5, 2021 at 10:09 AM Sergej Bauer <sbauer@blackbox.su> wrote: > > Tests after applying patches [2/6] and [3/6] are: > $ ifmtu eth7 500 > $ sudo test_ber -l eth7 -c 1000 -n 1000000 -f500 --no-conf Thank you! Is there a way for me to run test_ber myself? Is this a standard, or a bespoke testing tool?
sOn Friday, February 5, 2021 7:39:40 PM MSK Sven Van Asbroeck wrote: > Hi Sergej, > > On Fri, Feb 5, 2021 at 10:09 AM Sergej Bauer <sbauer@blackbox.su> wrote: > > Tests after applying patches [2/6] and [3/6] are: > > $ ifmtu eth7 500 > > $ sudo test_ber -l eth7 -c 1000 -n 1000000 -f500 --no-conf > > Thank you! Is there a way for me to run test_ber myself? > Is this a standard, or a bespoke testing tool? It's kind of bespoke... A part of framework to assist HW guys in developing PHY-device. But the project is finished, so I could ask for permission to send the source to you.
diff --git a/drivers/net/ethernet/microchip/lan743x_main.c b/drivers/net/ethernet/microchip/lan743x_main.c index f1f6eba4ace4..f485320e5784 100644 --- a/drivers/net/ethernet/microchip/lan743x_main.c +++ b/drivers/net/ethernet/microchip/lan743x_main.c @@ -1957,11 +1957,11 @@ static int lan743x_rx_next_index(struct lan743x_rx *rx, int index) static struct sk_buff *lan743x_rx_allocate_skb(struct lan743x_rx *rx) { - int length = 0; + struct net_device *netdev = rx->adapter->netdev; - length = (LAN743X_MAX_FRAME_SIZE + ETH_HLEN + 4 + RX_HEAD_PADDING); - return __netdev_alloc_skb(rx->adapter->netdev, - length, GFP_ATOMIC | GFP_DMA); + return __netdev_alloc_skb(netdev, + netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING, + GFP_ATOMIC | GFP_DMA); } static void lan743x_rx_update_tail(struct lan743x_rx *rx, int index) @@ -1977,9 +1977,10 @@ static int lan743x_rx_init_ring_element(struct lan743x_rx *rx, int index, { struct lan743x_rx_buffer_info *buffer_info; struct lan743x_rx_descriptor *descriptor; - int length = 0; + struct net_device *netdev = rx->adapter->netdev; + int length; - length = (LAN743X_MAX_FRAME_SIZE + ETH_HLEN + 4 + RX_HEAD_PADDING); + length = netdev->mtu + ETH_HLEN + 4 + RX_HEAD_PADDING; descriptor = &rx->ring_cpu_ptr[index]; buffer_info = &rx->buffer_info[index]; buffer_info->skb = skb; @@ -2148,11 +2149,18 @@ static int lan743x_rx_process_packet(struct lan743x_rx *rx) descriptor = &rx->ring_cpu_ptr[first_index]; /* unmap from dma */ + packet_length = RX_DESC_DATA0_FRAME_LENGTH_GET_ + (descriptor->data0); if (buffer_info->dma_ptr) { - dma_unmap_single(&rx->adapter->pdev->dev, - buffer_info->dma_ptr, - buffer_info->buffer_length, - DMA_FROM_DEVICE); + dma_sync_single_for_cpu(&rx->adapter->pdev->dev, + buffer_info->dma_ptr, + packet_length, + DMA_FROM_DEVICE); + dma_unmap_single_attrs(&rx->adapter->pdev->dev, + buffer_info->dma_ptr, + buffer_info->buffer_length, + DMA_FROM_DEVICE, + DMA_ATTR_SKIP_CPU_SYNC); buffer_info->dma_ptr = 0; buffer_info->buffer_length = 0; } @@ -2167,8 +2175,8 @@ static int lan743x_rx_process_packet(struct lan743x_rx *rx) int index = first_index; /* multi buffer packet not supported */ - /* this should not happen since - * buffers are allocated to be at least jumbo size + /* this should not happen since buffers are allocated + * to be at least the mtu size configured in the mac. */ /* clean up buffers */ @@ -2628,6 +2636,9 @@ static int lan743x_netdev_change_mtu(struct net_device *netdev, int new_mtu) struct lan743x_adapter *adapter = netdev_priv(netdev); int ret = 0; + if (netif_running(netdev)) + return -EBUSY; + ret = lan743x_mac_set_mtu(adapter, new_mtu); if (!ret) netdev->mtu = new_mtu;