| Message ID | 1365457381-10184-1-git-send-email-roland@kernel.org (mailing list archive) |
|---|---|
| State | Superseded, archived |
> From: Roland Dreier <roland@purestorage.com>
>
> +	if (wc->byte_len > IPOIB_UD_HEAD_SIZE) {
> +		page = priv->rx_ring[wr_id].page;
> +		priv->rx_ring[wr_id].page = NULL;
> +	} else {
> +		page = NULL;
> +	}
> +
> 	/*
> 	 * If we can't allocate a new RX buffer, dump
> 	 * this packet and reuse the old buffer.
> 	 */
> 	if (unlikely(!ipoib_alloc_rx_skb(dev, wr_id))) {
> 		++dev->stats.rx_dropped;
> +		priv->rx_ring[wr_id].page = page;
> 		goto repost;
> 	}

Can you go through the "else" of the first if (page is NULL), then enter
the second if? If so, isn't the page lost?

Dean
On Tue, Apr 9, 2013 at 6:13 AM, Luick, Dean <dean.luick@intel.com> wrote:
> Can you go through the "else" of the first if (page is NULL), then enter the second if? If so, isn't the page lost?
Thanks, good catch. I'll fix that up.
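The scenario Dean describes is a small packet (wc->byte_len <= IPOIB_UD_HEAD_SIZE) followed by a failed skb allocation: the local page is NULL, so the error path overwrites the ring entry's still-allocated page with NULL and leaks it. A minimal sketch of one way to plug that hole while keeping the v3 structure (an illustration only, not the fix that went into the next revision):

```c
/* Inside ipoib_ib_handle_rx_wc(), v3-style; sketch only. */
if (wc->byte_len > IPOIB_UD_HEAD_SIZE) {
	page = priv->rx_ring[wr_id].page;	/* fragment page goes to the skb */
	priv->rx_ring[wr_id].page = NULL;
} else {
	page = NULL;				/* ring entry keeps its page */
}

if (unlikely(!ipoib_alloc_rx_skb(dev, wr_id))) {
	++dev->stats.rx_dropped;
	if (page)				/* restore only what we detached */
		priv->rx_ring[wr_id].page = page;
	goto repost;
}
```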
> -	IPOIB_UD_HEAD_SIZE	= IB_GRH_BYTES + IPOIB_ENCAP_LEN,
> +	/* add 128 bytes of tailroom for IP/TCP headers */
> +	IPOIB_UD_HEAD_SIZE	= IB_GRH_BYTES + IPOIB_ENCAP_LEN + 128,

Hello,

version 3 of the patch finally works. I can see the performance gains
but I cannot feel them (in real life). Here are the results of my testbed:

Test 1: netperf/netserver, message size 16K

kernel 3.5 default     :  5.1 GBit/s
kernel 3.5 + patch v3  :  7.7 GBit/s
kernel 3.5 + max MTU 3K: 10.8 GBit/s

Test 2: Disk write performance (VM with disk mounted on an IB async NFS server)

 block size | default  | patch v3 | max MTU 3K
------------+----------+----------+-----------
    1 KB    |  10 MB/s |  10 MB/s |  10 MB/s
    2 KB    |  20 MB/s |  21 MB/s |  20 MB/s
    4 KB    |  40 MB/s |  40 MB/s |  43 MB/s
    8 KB    |  68 MB/s |  70 MB/s |  78 MB/s
   16 KB    | 105 MB/s | 105 MB/s | 120 MB/s
   32 KB    | 150 MB/s | 150 MB/s | 170 MB/s
   64 KB    | 200 MB/s | 210 MB/s | 260 MB/s
  128 KB    | 270 MB/s | 290 MB/s | 400 MB/s
  256 KB    | 300 MB/s | 310 MB/s | 430 MB/s
  512 KB    | 305 MB/s | 320 MB/s | 470 MB/s
 1024 KB    | 310 MB/s | 325 MB/s | 500 MB/s
 2048 KB    | 310 MB/s | 325 MB/s | 510 MB/s
 4096 KB    | 370 MB/s | 325 MB/s | 510 MB/s
 8192 KB    | 400 MB/s | 325 MB/s | 520 MB/s

As you can see, netperf throughput increases while NFS does not even care
about the optimizations. Maybe it does not work well with fragmented SKBs.
The max MTU 3K values are once again forced through a hack inside
ipoib_main.c.

Out of curiosity I changed the splitting in your v3 patch from a small head
with a large fragment to a large head with a small fragment via this line:

	IPOIB_UD_HEAD_SIZE = IB_GRH_BYTES + IPOIB_ENCAP_LEN + 3072

In my 2044 MTU case this brings the netperf & NFS throughput to the same
levels as the dirty hack. Of course this no longer reflects a head but
equals more or less to something like a new constant
IPOIB_UD_FIXED_SKB_SIZE.

I guess 4K MTU will not see any further gains, but avoiding the skb_pull
calls should improve speed as well. Maybe a final adaptation could put the
cherry on the cake.

Markus
On Tue, Apr 9, 2013 at 12:52 PM, Markus Stockhausen
<markus.stockhausen@gmx.de> wrote:
>	IPOIB_UD_HEAD_SIZE = IB_GRH_BYTES + IPOIB_ENCAP_LEN + 3072
>
> In my 2044 MTU case this brings the netperf & NFS throughput to
> the same levels as the dirty hack. Of course this no longer
> reflects a head but equals more or less to something like a
> new constant IPOIB_UD_FIXED_SKB_SIZE.

After thinking about this, I'm pretty convinced that this is probably
a good approach. However, it seems that making IPOIB_UD_HEAD_SIZE be
IB_GRH_BYTES + 2048 should be enough, since that's the size of the
largest receive buffer needed with a 2K IB MTU.

I only wonder what the effect on performance would be with an IB MTU
of 4K active; then full-sized packets would be pretty much exactly
split between the linear part and the fragment page. How does GRO
cope with that? I guess in the 2K IB MTU case there's no cost in
having all the data in the linear part of the skb.

 - R.
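Concretely, that suggestion would amount to something like the following in ipoib.h (an illustration of the idea only, not a hunk from any posted revision of the patch):

```c
enum {
	IPOIB_ENCAP_LEN    = 4,

	/* linear part alone covers the largest 2K-IB-MTU receive (GRH included) */
	IPOIB_UD_HEAD_SIZE = IB_GRH_BYTES + 2048,
	IPOIB_UD_RX_SG     = 2, /* max buffer needed for 4K mtu */
};
```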
> I only wonder what the effect on performance would be with an IB MTU
> of 4K active; then full-sized packets would be pretty much exactly
> split between the linear part and the fragment page. How does GRO
> cope with that? I guess in the 2K IB MTU case there's no cost in
> having all the data in the linear part of the skb.
>
>  - R.

Hm, if I think about the current situation, we can only make it better.
When receiving a packet on a 4K HCA we have to pull the IP header into
the linear part of the SKB during GRO handling. That consumes extra CPU
cycles and does not depend on the packet size.

We can avoid this by splitting the packet at a well-defined position.
Your patch made a cut at x+128 bytes. From my understanding the position
should have no performance impact. The 3 cases that we analyzed up to
now are:

- 2K fragment + header pull = fast
- header and some data in linear part + 1.9K fragment = faster
- only linear part + no fragment = fastest

Maybe I'm too hasty (without a 4K MTU test environment), but from the
above I would derive that larger packets will still benefit from the
adapted handling.

Markus
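Mapped onto the receive completion path of the v3 patch, the three cases come down to whether a fragment is attached at all (a sketch assembled for illustration; the names follow the patch, but this is not a hunk from the thread):

```c
if (!ipoib_ud_need_sg(priv->max_ib_mtu)) {
	/* 2K IB MTU: the whole datagram already sits in the linear part */
	skb_put(skb, wc->byte_len);
} else if (wc->byte_len <= IPOIB_UD_HEAD_SIZE) {
	/* 4K IB MTU, small packet: still entirely linear, fragment unused */
	skb_put(skb, wc->byte_len);
} else {
	/*
	 * 4K IB MTU, large packet: headers (plus some data) stay linear and
	 * the rest lands in one page fragment, so GRO no longer has to pull
	 * the IP/TCP headers out of the fragment.
	 */
	skb_put(skb, IPOIB_UD_HEAD_SIZE);
	skb_add_rx_frag(skb, 0, page, 0,
			wc->byte_len - IPOIB_UD_HEAD_SIZE, PAGE_SIZE);
}
```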
On Tue, Apr 16, 2013 at 9:22 AM, Markus Stockhausen
<markus.stockhausen@gmx.de> wrote:
> If I think about the current situation, we can only make it better.
> When receiving a packet on a 4K HCA we have to pull the IP header
> into the linear part of the SKB during GRO handling. That consumes
> extra CPU cycles and does not depend on the packet size.
>
> We can avoid this by splitting the packet at a well-defined position.
> Your patch made a cut at x+128 bytes. From my understanding the
> position should have no performance impact. The 3 cases that we
> analyzed up to now are:
>
> - 2K fragment + header pull = fast
> - header and some data in linear part + 1.9K fragment = faster
> - only linear part + no fragment = fastest
>
> Maybe I'm too hasty (without a 4K MTU test environment), but from
> the above I would derive that larger packets will still benefit
> from the adapted handling.

Yes, makes sense to me as well. I sent a v4 patch that I'm pretty sure
should work well for you.

 - R.
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index eb71aaa..5f0d34c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -64,7 +64,8 @@ enum ipoib_flush_level {
 enum {
 	IPOIB_ENCAP_LEN		  = 4,
 
-	IPOIB_UD_HEAD_SIZE	  = IB_GRH_BYTES + IPOIB_ENCAP_LEN,
+	/* add 128 bytes of tailroom for IP/TCP headers */
+	IPOIB_UD_HEAD_SIZE	  = IB_GRH_BYTES + IPOIB_ENCAP_LEN + 128,
 	IPOIB_UD_RX_SG		  = 2, /* max buffer needed for 4K mtu */
 
 	IPOIB_CM_MTU		  = 0x10000 - 0x10, /* padding to align header to 16 */
@@ -155,6 +156,7 @@ struct ipoib_mcast {
 
 struct ipoib_rx_buf {
 	struct sk_buff *skb;
+	struct page    *page;
 	u64		mapping[IPOIB_UD_RX_SG];
 };
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 2cfa76f..890e2c8 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -92,13 +92,15 @@ void ipoib_free_ah(struct kref *kref)
 }
 
 static void ipoib_ud_dma_unmap_rx(struct ipoib_dev_priv *priv,
+				  struct page *page,
 				  u64 mapping[IPOIB_UD_RX_SG])
 {
 	if (ipoib_ud_need_sg(priv->max_ib_mtu)) {
 		ib_dma_unmap_single(priv->ca, mapping[0], IPOIB_UD_HEAD_SIZE,
 				    DMA_FROM_DEVICE);
-		ib_dma_unmap_page(priv->ca, mapping[1], PAGE_SIZE,
-				  DMA_FROM_DEVICE);
+		if (page)
+			ib_dma_unmap_page(priv->ca, mapping[1], PAGE_SIZE,
+					  DMA_FROM_DEVICE);
 	} else
 		ib_dma_unmap_single(priv->ca, mapping[0],
 				    IPOIB_UD_BUF_SIZE(priv->max_ib_mtu),
@@ -107,23 +109,18 @@ static void ipoib_ud_dma_unmap_rx(struct ipoib_dev_priv *priv,
 
 static void ipoib_ud_skb_put_frags(struct ipoib_dev_priv *priv,
 				   struct sk_buff *skb,
+				   struct page *page,
 				   unsigned int length)
 {
-	if (ipoib_ud_need_sg(priv->max_ib_mtu)) {
-		skb_frag_t *frag = &skb_shinfo(skb)->frags[0];
-		unsigned int size;
+	if (ipoib_ud_need_sg(priv->max_ib_mtu) &&
+	    length > IPOIB_UD_HEAD_SIZE) {
 		/*
-		 * There is only two buffers needed for max_payload = 4K,
+		 * There are only two buffers needed for max_payload = 4K,
 		 * first buf size is IPOIB_UD_HEAD_SIZE
 		 */
-		skb->tail += IPOIB_UD_HEAD_SIZE;
-		skb->len  += length;
-
-		size = length - IPOIB_UD_HEAD_SIZE;
-
-		skb_frag_size_set(frag, size);
-		skb->data_len += size;
-		skb->truesize += PAGE_SIZE;
+		skb_put(skb, IPOIB_UD_HEAD_SIZE);
+		skb_add_rx_frag(skb, 0, page, 0,
+				length - IPOIB_UD_HEAD_SIZE, PAGE_SIZE);
 	} else
 		skb_put(skb, length);
 
@@ -143,9 +140,11 @@ static int ipoib_ib_post_receive(struct net_device *dev, int id)
 	ret = ib_post_recv(priv->qp, &priv->rx_wr, &bad_wr);
 	if (unlikely(ret)) {
 		ipoib_warn(priv, "receive failed for buf %d (%d)\n", id, ret);
-		ipoib_ud_dma_unmap_rx(priv, priv->rx_ring[id].mapping);
+		ipoib_ud_dma_unmap_rx(priv, priv->rx_ring[id].page, priv->rx_ring[id].mapping);
 		dev_kfree_skb_any(priv->rx_ring[id].skb);
 		priv->rx_ring[id].skb = NULL;
+		put_page(priv->rx_ring[id].page);
+		priv->rx_ring[id].page = NULL;
 	}
 
 	return ret;
 }
@@ -156,18 +155,13 @@ static int ipoib_ib_post_receive(struct net_device *dev, int id)
 static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct sk_buff *skb;
 	int buf_size;
-	int tailroom;
 	u64 *mapping;
+	struct page **page;
 
-	if (ipoib_ud_need_sg(priv->max_ib_mtu)) {
-		buf_size = IPOIB_UD_HEAD_SIZE;
-		tailroom = 128; /* reserve some tailroom for IP/TCP headers */
-	} else {
-		buf_size = IPOIB_UD_BUF_SIZE(priv->max_ib_mtu);
-		tailroom = 0;
-	}
+	buf_size = ipoib_ud_need_sg(priv->max_ib_mtu) ?
+		IPOIB_UD_HEAD_SIZE : IPOIB_UD_BUF_SIZE(priv->max_ib_mtu);
 
-	skb = dev_alloc_skb(buf_size + tailroom + 4);
+	skb = dev_alloc_skb(buf_size + 4);
 	if (unlikely(!skb))
 		return NULL;
 
@@ -184,21 +178,24 @@ static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
 	if (unlikely(ib_dma_mapping_error(priv->ca, mapping[0])))
 		goto error;
 
-	if (ipoib_ud_need_sg(priv->max_ib_mtu)) {
-		struct page *page = alloc_page(GFP_ATOMIC);
-		if (!page)
+	page = &priv->rx_ring[id].page;
+	if (ipoib_ud_need_sg(priv->max_ib_mtu) && !*page) {
+		*page = alloc_page(GFP_ATOMIC);
+		if (!*page)
 			goto partial_error;
-		skb_fill_page_desc(skb, 0, page, 0, PAGE_SIZE);
 		mapping[1] =
-			ib_dma_map_page(priv->ca, page,
+			ib_dma_map_page(priv->ca, *page,
 					0, PAGE_SIZE, DMA_FROM_DEVICE);
 		if (unlikely(ib_dma_mapping_error(priv->ca, mapping[1])))
-			goto partial_error;
+			goto map_error;
 	}
 
 	priv->rx_ring[id].skb = skb;
 	return skb;
 
+map_error:
+	put_page(*page);
+	*page = NULL;
 partial_error:
 	ib_dma_unmap_single(priv->ca, mapping[0], buf_size, DMA_FROM_DEVICE);
 error:
@@ -230,6 +227,7 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV;
 	struct sk_buff *skb;
+	struct page *page;
 	u64 mapping[IPOIB_UD_RX_SG];
 	union ib_gid *dgid;
 
@@ -249,9 +247,11 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 			ipoib_warn(priv, "failed recv event "
 				   "(status=%d, wrid=%d vend_err %x)\n",
 				   wc->status, wr_id, wc->vendor_err);
-		ipoib_ud_dma_unmap_rx(priv, priv->rx_ring[wr_id].mapping);
+		ipoib_ud_dma_unmap_rx(priv, priv->rx_ring[wr_id].page, priv->rx_ring[wr_id].mapping);
 		dev_kfree_skb_any(skb);
 		priv->rx_ring[wr_id].skb = NULL;
+		put_page(priv->rx_ring[wr_id].page);
+		priv->rx_ring[wr_id].page = NULL;
 		return;
 	}
 
@@ -265,20 +265,28 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	memcpy(mapping, priv->rx_ring[wr_id].mapping,
 	       IPOIB_UD_RX_SG * sizeof *mapping);
 
+	if (wc->byte_len > IPOIB_UD_HEAD_SIZE) {
+		page = priv->rx_ring[wr_id].page;
+		priv->rx_ring[wr_id].page = NULL;
+	} else {
+		page = NULL;
+	}
+
 	/*
 	 * If we can't allocate a new RX buffer, dump
 	 * this packet and reuse the old buffer.
 	 */
 	if (unlikely(!ipoib_alloc_rx_skb(dev, wr_id))) {
 		++dev->stats.rx_dropped;
+		priv->rx_ring[wr_id].page = page;
 		goto repost;
 	}
 
 	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
 		       wc->byte_len, wc->slid);
 
-	ipoib_ud_dma_unmap_rx(priv, mapping);
-	ipoib_ud_skb_put_frags(priv, skb, wc->byte_len);
+	ipoib_ud_dma_unmap_rx(priv, page, mapping);
+	ipoib_ud_skb_put_frags(priv, skb, page, wc->byte_len);
 
 	/* First byte of dgid signals multicast when 0xff */
 	dgid = &((struct ib_grh *)skb->data)->dgid;
@@ -861,9 +869,12 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 				if (!rx_req->skb)
 					continue;
 				ipoib_ud_dma_unmap_rx(priv,
+						      priv->rx_ring[i].page,
 						      priv->rx_ring[i].mapping);
 				dev_kfree_skb_any(rx_req->skb);
 				rx_req->skb = NULL;
+				put_page(priv->rx_ring[i].page);
+				priv->rx_ring[i].page = NULL;
 			}
 
 			goto timeout;