
[RFC/PATCH,v4] IPoIB: Leave space in skb linear buffer for IP headers

Message ID 1366129801-29234-1-git-send-email-roland@kernel.org (mailing list archive)
State Rejected

Commit Message

Roland Dreier April 16, 2013, 4:30 p.m. UTC
From: Roland Dreier <roland@purestorage.com>

Markus Stockhausen <markus.stockhausen@gmx.de> noticed that IPoIB was
spending significant time doing memcpy() in __pskb_pull_tail().  He
found that this is because his adapter reports a maximum MTU of 4K,
which causes IPoIB datagram mode to receive all the actual data in a
separate page in the fragment list.

We're already allocating extra tailroom for the skb linear part, so we
might as well use it.  In fact, we might as well allocate a big enough
linear part so that all the data fits there in the relatively common
case of a 2K IB MTU, and only use a fragment page for 4K IB MTU.

Cc: Eric Dumazet <edumazet@google.com>
Reported-by: Markus Stockhausen <markus.stockhausen@gmx.de>
Signed-off-by: Roland Dreier <roland@purestorage.com>
---
v4: Leave enough space in linear part of skb so that all data ends up
there with 2K IB MTU.  Still not sure how this affects perf with a
4K IB MTU (should be better since we avoid pulling IP headers out of
first fragment).

 drivers/infiniband/ulp/ipoib/ipoib.h    |  6 ++-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c | 78 +++++++++++++++++++--------------
 2 files changed, 49 insertions(+), 35 deletions(-)
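
For readers following along, here is a small standalone C model (purely illustrative, not part of the patch) of where the bytes of a maximum-size UD datagram land with the old and new IPOIB_UD_HEAD_SIZE. The constants IB_GRH_BYTES = 40 and IPOIB_ENCAP_LEN = 4 are taken from mainline ipoib.h, and the model assumes SG mode is enabled because the adapter reports a 4K maximum MTU, as in the report that motivated this patch.

/*
 * Illustrative model only -- not part of the patch.  With a 4K-capable
 * adapter, ipoib_ud_need_sg() is true, so each receive uses a linear
 * buffer of IPOIB_UD_HEAD_SIZE plus one page fragment.  This models
 * where the bytes of a datagram land for the old and new head sizes.
 */
#include <stdio.h>

#define IB_GRH_BYTES		40
#define IPOIB_ENCAP_LEN		4
#define OLD_UD_HEAD_SIZE	(IB_GRH_BYTES + IPOIB_ENCAP_LEN)	/* 44   */
#define NEW_UD_HEAD_SIZE	(IB_GRH_BYTES + 2048)			/* 2088 */

static void show(const char *tag, int head_size, int ud_payload)
{
	int byte_len = IB_GRH_BYTES + ud_payload;	/* GRH + encap + IP packet */
	int linear   = byte_len < head_size ? byte_len : head_size;
	int frag     = byte_len - linear;

	printf("%s head, %4d-byte datagram: %4d bytes linear, %4d in fragment\n",
	       tag, ud_payload, linear, frag);
}

int main(void)
{
	show("old", OLD_UD_HEAD_SIZE, 2048);	/* all payload lands in the fragment   */
	show("old", OLD_UD_HEAD_SIZE, 4096);
	show("new", NEW_UD_HEAD_SIZE, 2048);	/* everything fits in skb->head        */
	show("new", NEW_UD_HEAD_SIZE, 4096);	/* only the tail spills to the fragment */
	return 0;
}

With the old 44-byte head, a 2K datagram leaves all 2044 payload bytes in the fragment, which is exactly what __pskb_pull_tail() then has to copy headers out of; with the 2088-byte head the same datagram lands entirely in skb->head, and only a 4K datagram spills into the fragment.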

Comments

Eric Dumazet April 16, 2013, 4:56 p.m. UTC | #1
On Tue, Apr 16, 2013 at 9:30 AM, Roland Dreier <roland@kernel.org> wrote:
> From: Roland Dreier <roland@purestorage.com>
>
> Markus Stockhausen <markus.stockhausen@gmx.de> noticed that IPoIB was
> spending significant time doing memcpy() in __pskb_pull_tail().  He
> found that this is because his adapter reports a maximum MTU of 4K,
> which causes IPoIB datagram mode to receive all the actual data in a
> separate page in the fragment list.
>
> We're already allocating extra tailroom for the skb linear part, so we
> might as well use it.  In fact, we might as well allocate a big enough
> linear part so that all the data fits there in the relatively common
> case of a 2K IB MTU, and only use a fragment page for 4K IB MTU.
>
> Cc: Eric Dumazet <edumazet@google.com>
> Reported-by: Markus Stockhausen <markus.stockhausen@gmx.de>
> Signed-off-by: Roland Dreier <roland@purestorage.com>
> ---
> v4: Leave enough space in linear part of skb so that all data ends up
> there with 2K IB MTU.  Still not sure how this affects perf with a
> 4K IB MTU (should be better since we avoid pulling IP headers out of
> first fragment).
>

I am afraid I don't understand what the issue is.

the pull_tail() in itself is not a performance issue: the Intel guys only
recently fixed the fact that IGB/IXGBE drivers were not pulling TCP
headers into skb->head, and nobody noticed.

Real cost is the cache line miss.

Now, if you pull too many bytes in skb->head, say part of TCP payload,
you lose opportunities in TCP coalescing or splice().
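
As a point of reference (an editorial sketch, not something from this thread), the pattern Eric is describing looks roughly like the following: pull only the protocol headers into skb->head with pskb_may_pull() and leave the payload in the page fragments. The fixed IPv4 + TCP header length used here is an assumption for illustration.

#include <linux/skbuff.h>
#include <linux/ip.h>
#include <linux/tcp.h>

/*
 * Editorial sketch: bring only the IPv4 + TCP headers into the linear
 * area.  Pulling more than this (part of the TCP payload) into
 * skb->head is what costs the coalescing and splice() opportunities
 * mentioned above.
 */
static bool rx_pull_headers_only(struct sk_buff *skb)
{
	unsigned int hlen = sizeof(struct iphdr) + sizeof(struct tcphdr);

	/* Copies at most hlen bytes out of the fragments, nothing more. */
	return pskb_may_pull(skb, hlen);
}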
Roland Dreier April 16, 2013, 5:03 p.m. UTC | #2
On Tue, Apr 16, 2013 at 9:56 AM, Eric Dumazet <edumazet@google.com> wrote:
> I am afraid I don't understand what the issue is.
>
> the pull_tail() in itself is not a performance issue: the Intel guys only
> recently fixed the fact that IGB/IXGBE drivers were not pulling TCP
> headers into skb->head, and nobody noticed.
>
> Real cost is the cache line miss.

According to Markus, there is a pretty huge performance penalty if the
whole packet (TCP/IP headers + payload) ends up in the fragment and the
stack has to memcpy the headers into the linear part.

> Now, if you pull too many bytes in skb->head, say part of TCP payload,
> you lose opportunities in TCP coalescing or splice().

So it sounds like this patch is a bad change from that POV, since it
would cause us to have lots of TCP payload in the linear part and then
the second half of the payload in the fragments (if we are running with
a 4K IPoIB MTU).

Seems like the real fix is to enable the header separation feature in
adapters that support it (ConnectX-2 and -3), so that exactly the TCP/IP
headers go in the linear part and the payload goes in the fragment.
That's a bit more involved, but I'll look into doing it.

 - R.
Markus Stockhausen April 16, 2013, 5:44 p.m. UTC | #3
>I am afraid I don't understand what the issue is.
>
>the pull_tail() in itself is not a performance issue: the Intel guys only
>recently fixed the fact that IGB/IXGBE drivers were not pulling TCP
>headers into skb->head, and nobody noticed.
>
>Real cost is the cache line miss.
>
>Now, if you pull too many bytes in skb->head, say part of TCP payload,
>you lose opportunities in TCP coalescing or splice().

With patch v4, netperf and NFS receive performance rises to the
expected values. As I'm no expert in this, I can only repost the
initial performance report that started the whole discussion.
__pskb_pull_tail consumes a lot of time on our Xeon L5420 test
server.


...

- server side: netserver -p 12345
- client side: netperf -H <server_ip> -p 12345 -l 120

Analysis was performed on the server side with
- perf record -a -g sleep 10
- perf report

# Overhead                                         Symbol
# ........  .............................................
#
    19.67%  [k] copy_user_generic_string
            |
            |--99.74%-- skb_copy_datagram_iovec
            |          tcp_recvmsg
            |          inet_recvmsg
            |          sock_recvmsg
            |          sys_recvfrom
            |          system_call_fastpath
            |          recv
            |          |
            |          |--50.17%-- 0x7074656e00667265
            |          |
            |           --49.83%-- 0x6672657074656e
             --0.26%-- [...]
     7.38%  [k] memcpy
            |
            |--84.56%-- __pskb_pull_tail
            |          |
            |          |--81.88%-- pskb_may_pull.part.6
            |          |          skb_gro_header_slow
            |          |          inet_gro_receive
            |          |          dev_gro_receive
            |          |          napi_gro_receive
            |          |          ipoib_ib_handle_rx_wc
            |          |          ipoib_poll
            |          |          net_rx_action
            |          |          __do_softirq


Markus


Eric Dumazet April 16, 2013, 5:55 p.m. UTC | #4
On Tue, Apr 16, 2013 at 10:44 AM, Markus Stockhausen
<markus.stockhausen@gmx.de> wrote:
>
>>I am afraid I don't understand what the issue is.
>>
>>the pull_tail() in itself is not a performance issue: the Intel guys only
>>recently fixed the fact that IGB/IXGBE drivers were not pulling TCP
>>headers into skb->head, and nobody noticed.
>>
>>Real cost is the cache line miss.
>>
>>Now, if you pull too many bytes in skb->head, say part of TCP payload,
>>you lose opportunities in TCP coalescing or splice().
>
> With patch v4, netperf and NFS receive performance rises to the
> expected values. As I'm no expert in this, I can only repost the
> initial performance report that started the whole discussion.
> __pskb_pull_tail consumes a lot of time on our Xeon L5420 test
> server.
>

That's probably because of a cache line miss.

The thing I don't really understand is that normally, the first cache
line (64 bytes) contains both the Ethernet header and IPv4 header.

So what does this adapter do in this respect?

I guess you should try to use IPOIB_UD_HEAD_SIZE=64 to use the whole cache line.

Many drivers use prefetch() to make sure the CPU starts bringing this
cache line into the cache as soon as possible.

A single prefetch() call at the right place might help a lot.
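
For illustration only (Markus asks below exactly which address to use, and that question stands), here is a minimal sketch of what such a prefetch() call could look like in the receive completion path. Prefetching the start of the fragment page, or skb->data in the non-SG case, is an assumption made for this sketch, not something stated in the thread.

#include <linux/prefetch.h>
#include <linux/skbuff.h>
#include <linux/mm.h>

/*
 * Editorial sketch: start pulling the first cache line of the received
 * data toward the CPU as early as possible in the completion handler,
 * so it is hopefully resident by the time GRO touches the IP header.
 * Which address to prefetch is exactly the open question in this
 * thread; the choices below are only illustrative guesses.
 */
static void rx_prefetch_first_cacheline(struct sk_buff *skb, struct page *page)
{
	if (page)
		prefetch(page_address(page));	/* data received into the fragment    */
	else
		prefetch(skb->data);		/* data received into the linear part */
}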
Markus Stockhausen April 17, 2013, 6:38 p.m. UTC | #5
>
>
>
>That's probably because of a cache line miss.
>
>The thing I don't really understand is that normally, the first cache
>line (64 bytes) contains both the Ethernet header and IPv4 header.
>
>So what does this adapter do in this respect?
>
>I guess you should try to use IPOIB_UD_HEAD_SIZE=64 to use the whole
>cache line.
>
>Many drivers use prefetch() to make sure the CPU starts bringing this
>cache line into the cache as soon as possible.
>
>A single prefetch() call at the right place might help a lot.

Hello,

@Eric: Thanks for the tip.

In the 4K max-MTU IPoIB driver path, ipoib_ib_handle_rx_wc() will
produce an empty skb linear part with all the data placed into
the first fragment. napi_gro_receive() then pulls the IP
header out of the fragment into the linear part.

As far as I understand, the pull out of the fragment should come
without additional cost if one calls prefetch() "long" before
the skb_pull().

I'm willing to check this out, but I'm unsure whether the IP header
is aligned to a 64-byte cache line. As a simple guess, I
would add the prefetch here:

static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
  ...
  skb_pull(skb, IB_GRH_BYTES);

  skb->protocol = ((struct ipoib_header *) skb->data)->proto;
  skb_reset_mac_header(skb);
  skb_pull(skb, IPOIB_ENCAP_LEN);
+
+ if (ipoib_ud_need_sg(priv->max_ib_mtu))
+   prefetch(<whatever address>);
  ... 

Can you give me a hint what address one should put into the call?

Thanks in advance.

Markus



Patch

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index eb71aaa..7a56a8e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -64,8 +64,9 @@  enum ipoib_flush_level {
 enum {
 	IPOIB_ENCAP_LEN		  = 4,
 
-	IPOIB_UD_HEAD_SIZE	  = IB_GRH_BYTES + IPOIB_ENCAP_LEN,
-	IPOIB_UD_RX_SG		  = 2, /* max buffer needed for 4K mtu */
+	/* enough so w/ 2K mtu, everything is in linear part of skb */
+	IPOIB_UD_HEAD_SIZE	  = IB_GRH_BYTES + 2048,
+	IPOIB_UD_RX_SG		  = 2, /* max buffer needed for 4K mtu w/ 4K pages */
 
 	IPOIB_CM_MTU		  = 0x10000 - 0x10, /* padding to align header to 16 */
 	IPOIB_CM_BUF_SIZE	  = IPOIB_CM_MTU  + IPOIB_ENCAP_LEN,
@@ -155,6 +156,7 @@  struct ipoib_mcast {
 
 struct ipoib_rx_buf {
 	struct sk_buff *skb;
+	struct page    *page;
 	u64		mapping[IPOIB_UD_RX_SG];
 };
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 2cfa76f..88a4ea3 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -92,13 +92,15 @@  void ipoib_free_ah(struct kref *kref)
 }
 
 static void ipoib_ud_dma_unmap_rx(struct ipoib_dev_priv *priv,
+				  struct page *page,
 				  u64 mapping[IPOIB_UD_RX_SG])
 {
 	if (ipoib_ud_need_sg(priv->max_ib_mtu)) {
 		ib_dma_unmap_single(priv->ca, mapping[0], IPOIB_UD_HEAD_SIZE,
 				    DMA_FROM_DEVICE);
-		ib_dma_unmap_page(priv->ca, mapping[1], PAGE_SIZE,
-				  DMA_FROM_DEVICE);
+		if (page)
+			ib_dma_unmap_page(priv->ca, mapping[1], PAGE_SIZE,
+					  DMA_FROM_DEVICE);
 	} else
 		ib_dma_unmap_single(priv->ca, mapping[0],
 				    IPOIB_UD_BUF_SIZE(priv->max_ib_mtu),
@@ -107,23 +109,18 @@  static void ipoib_ud_dma_unmap_rx(struct ipoib_dev_priv *priv,
 
 static void ipoib_ud_skb_put_frags(struct ipoib_dev_priv *priv,
 				   struct sk_buff *skb,
+				   struct page *page,
 				   unsigned int length)
 {
-	if (ipoib_ud_need_sg(priv->max_ib_mtu)) {
-		skb_frag_t *frag = &skb_shinfo(skb)->frags[0];
-		unsigned int size;
+	if (ipoib_ud_need_sg(priv->max_ib_mtu) &&
+	    length > IPOIB_UD_HEAD_SIZE) {
 		/*
-		 * There is only two buffers needed for max_payload = 4K,
+		 * There are only two buffers needed for max_payload = 4K,
 		 * first buf size is IPOIB_UD_HEAD_SIZE
 		 */
-		skb->tail += IPOIB_UD_HEAD_SIZE;
-		skb->len  += length;
-
-		size = length - IPOIB_UD_HEAD_SIZE;
-
-		skb_frag_size_set(frag, size);
-		skb->data_len += size;
-		skb->truesize += PAGE_SIZE;
+		skb_put(skb, IPOIB_UD_HEAD_SIZE);
+		skb_add_rx_frag(skb, 0, page, 0,
+				length - IPOIB_UD_HEAD_SIZE, PAGE_SIZE);
 	} else
 		skb_put(skb, length);
 
@@ -143,9 +140,11 @@  static int ipoib_ib_post_receive(struct net_device *dev, int id)
 	ret = ib_post_recv(priv->qp, &priv->rx_wr, &bad_wr);
 	if (unlikely(ret)) {
 		ipoib_warn(priv, "receive failed for buf %d (%d)\n", id, ret);
-		ipoib_ud_dma_unmap_rx(priv, priv->rx_ring[id].mapping);
+		ipoib_ud_dma_unmap_rx(priv, priv->rx_ring[id].page, priv->rx_ring[id].mapping);
 		dev_kfree_skb_any(priv->rx_ring[id].skb);
 		priv->rx_ring[id].skb = NULL;
+		put_page(priv->rx_ring[id].page);
+		priv->rx_ring[id].page = NULL;
 	}
 
 	return ret;
@@ -156,18 +155,13 @@  static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct sk_buff *skb;
 	int buf_size;
-	int tailroom;
 	u64 *mapping;
+	struct page **page;
 
-	if (ipoib_ud_need_sg(priv->max_ib_mtu)) {
-		buf_size = IPOIB_UD_HEAD_SIZE;
-		tailroom = 128; /* reserve some tailroom for IP/TCP headers */
-	} else {
-		buf_size = IPOIB_UD_BUF_SIZE(priv->max_ib_mtu);
-		tailroom = 0;
-	}
+	buf_size = ipoib_ud_need_sg(priv->max_ib_mtu) ?
+		IPOIB_UD_HEAD_SIZE : IPOIB_UD_BUF_SIZE(priv->max_ib_mtu);
 
-	skb = dev_alloc_skb(buf_size + tailroom + 4);
+	skb = dev_alloc_skb(buf_size + 4);
 	if (unlikely(!skb))
 		return NULL;
 
@@ -184,21 +178,24 @@  static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
 	if (unlikely(ib_dma_mapping_error(priv->ca, mapping[0])))
 		goto error;
 
-	if (ipoib_ud_need_sg(priv->max_ib_mtu)) {
-		struct page *page = alloc_page(GFP_ATOMIC);
-		if (!page)
+	page = &priv->rx_ring[id].page;
+	if (ipoib_ud_need_sg(priv->max_ib_mtu) && !*page) {
+		*page = alloc_page(GFP_ATOMIC);
+		if (!*page)
 			goto partial_error;
-		skb_fill_page_desc(skb, 0, page, 0, PAGE_SIZE);
 		mapping[1] =
-			ib_dma_map_page(priv->ca, page,
+			ib_dma_map_page(priv->ca, *page,
 					0, PAGE_SIZE, DMA_FROM_DEVICE);
 		if (unlikely(ib_dma_mapping_error(priv->ca, mapping[1])))
-			goto partial_error;
+			goto map_error;
 	}
 
 	priv->rx_ring[id].skb = skb;
 	return skb;
 
+map_error:
+	put_page(*page);
+	*page = NULL;
 partial_error:
 	ib_dma_unmap_single(priv->ca, mapping[0], buf_size, DMA_FROM_DEVICE);
 error:
@@ -230,6 +227,7 @@  static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV;
 	struct sk_buff *skb;
+	struct page *page;
 	u64 mapping[IPOIB_UD_RX_SG];
 	union ib_gid *dgid;
 
@@ -249,9 +247,11 @@  static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 			ipoib_warn(priv, "failed recv event "
 				   "(status=%d, wrid=%d vend_err %x)\n",
 				   wc->status, wr_id, wc->vendor_err);
-		ipoib_ud_dma_unmap_rx(priv, priv->rx_ring[wr_id].mapping);
+		ipoib_ud_dma_unmap_rx(priv, priv->rx_ring[wr_id].page, priv->rx_ring[wr_id].mapping);
 		dev_kfree_skb_any(skb);
 		priv->rx_ring[wr_id].skb = NULL;
+		put_page(priv->rx_ring[wr_id].page);
+		priv->rx_ring[wr_id].page = NULL;
 		return;
 	}
 
@@ -265,20 +265,29 @@  static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	memcpy(mapping, priv->rx_ring[wr_id].mapping,
 	       IPOIB_UD_RX_SG * sizeof *mapping);
 
+	if (wc->byte_len > IPOIB_UD_HEAD_SIZE) {
+		page = priv->rx_ring[wr_id].page;
+		priv->rx_ring[wr_id].page = NULL;
+	} else {
+		page = NULL;
+	}
+
 	/*
 	 * If we can't allocate a new RX buffer, dump
 	 * this packet and reuse the old buffer.
 	 */
 	if (unlikely(!ipoib_alloc_rx_skb(dev, wr_id))) {
 		++dev->stats.rx_dropped;
+		if (page)
+			priv->rx_ring[wr_id].page = page;
 		goto repost;
 	}
 
 	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
 		       wc->byte_len, wc->slid);
 
-	ipoib_ud_dma_unmap_rx(priv, mapping);
-	ipoib_ud_skb_put_frags(priv, skb, wc->byte_len);
+	ipoib_ud_dma_unmap_rx(priv, page, mapping);
+	ipoib_ud_skb_put_frags(priv, skb, page, wc->byte_len);
 
 	/* First byte of dgid signals multicast when 0xff */
 	dgid = &((struct ib_grh *)skb->data)->dgid;
@@ -861,9 +870,12 @@  int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 				if (!rx_req->skb)
 					continue;
 				ipoib_ud_dma_unmap_rx(priv,
+						      priv->rx_ring[i].page,
 						      priv->rx_ring[i].mapping);
 				dev_kfree_skb_any(rx_req->skb);
 				rx_req->skb = NULL;
+				put_page(priv->rx_ring[i].page);
+				priv->rx_ring[i].page = NULL;
 			}
 
 			goto timeout;