Message ID: 1dbd83dfe7f435eecc5bc460e901b47758280f30.1476206016.git.pabeni@redhat.com
State: Not Applicable
On Tue, Oct 11, 2016 at 07:15:44PM +0200, Paolo Abeni wrote:

> Also the connected mode maximum mtu is reduced by 16 bytes to
> cope with the increased hard header len.

Changing the MTU is going to cause annoying interop problems, can you
avoid this?

Jason
On Tue, 2016-10-11 at 11:32 -0600, Jason Gunthorpe wrote:
> On Tue, Oct 11, 2016 at 07:15:44PM +0200, Paolo Abeni wrote:
>
> > Also the connected mode maximum mtu is reduced by 16 bytes to
> > cope with the increased hard header len.
>
> Changing the MTU is going to cause annoying interop problems, can you
> avoid this?

I don't like changing the maximum MTU value either, but I was unable to
find an alternative solution. The PMTU detection should protect against
such issues.
On 10/11/2016 1:32 PM, Jason Gunthorpe wrote:
> On Tue, Oct 11, 2016 at 07:15:44PM +0200, Paolo Abeni wrote:
>
>> Also the connected mode maximum mtu is reduced by 16 bytes to
>> cope with the increased hard header len.
>
> Changing the MTU is going to cause annoying interop problems, can you
> avoid this?

(Paolo did the work I'm describing here, I'm just relaying the
explanation he gave me.)

Not with this particular solution, I don't think. We tried it without
increasing the declared hard header length and it broke in the
skb_clone/GSO paths. In order to make the LL pseudo header get copied
along with the rest of the encap and data on clone, we had to declare
the header. The problem then became that the sg setup is such that we
are limited to 16 4k pages for the sg array, so that header had to come
out of the 64k maximum mtu.
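To make the arithmetic concrete, here is a back-of-envelope sketch in C;
the 0x10/0x20 values match the patch below, while the macro names are
made up for illustration:

/* 16 scatter/gather entries of 4 KB pages bound the CM receive buffer: */
#define CM_SG_PAGES	16
#define CM_BUF_MAX	(CM_SG_PAGES * 4096)	/* 0x10000 = 65536 bytes */

/* Before the patch: 4-byte encap header + alignment pad -> MTU 65520. */
#define OLD_CM_MTU	(CM_BUF_MAX - 0x10)	/* 65520 */

/* After the patch: the 20-byte pseudo header also has to fit in the
 * same buffer, costing another 16 bytes once realigned to 16. */
#define NEW_CM_MTU	(CM_BUF_MAX - 0x20)	/* 65504 */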
On Tue, Oct 11, 2016 at 07:37:32PM +0200, Paolo Abeni wrote:
> On Tue, 2016-10-11 at 11:32 -0600, Jason Gunthorpe wrote:
> > Changing the MTU is going to cause annoying interop problems, can you
> > avoid this?
>
> I don't like changing the maximum MTU value either, but I was unable to
> find an alternative solution. The PMTU detection should protect against
> such issues.

It is more than PMTU: we have instructed all users that this is the MTU
number needed to enable CM mode, so it appears in documentation,
scripts, etc.

There is really no way to re-use some of the existing alignment padding
or exceed 64k?

Jason
On Tue, 2016-10-11 at 11:42 -0600, Jason Gunthorpe wrote:
> On Tue, Oct 11, 2016 at 07:37:32PM +0200, Paolo Abeni wrote:
> > I don't like changing the maximum MTU value either, but I was unable to
> > find an alternative solution. The PMTU detection should protect against
> > such issues.
>
> It is more than PMTU: we have instructed all users that this is the MTU
> number needed to enable CM mode, so it appears in documentation,
> scripts, etc.

AFAICS the max MTU already depends on the underlying h/w; how are such
differences currently coped with? (I'm sorry, I lack some/a lot of IB
background.)
On Tue, Oct 11, 2016 at 01:41:56PM -0400, Doug Ledford wrote:
> declare the header. The problem then became that the sg setup is such
> that we are limited to 16 4k pages for the sg array, so that header had
> to come out of the 64k maximum mtu.

Oh, that clarifies things. Hum, so various options become:

- Use >=17 SGL entries when creating the QP. Is this possible on common
  adapters?
- Use the FRWR infrastructure when necessary. Is there any chance the
  majority of skbs will have at least two physically contiguous pages to
  make this overhead rare? Perhaps as a fall back if many adapters can
  do >=17 SGLs.
- Pad the hard header out to 4k and discard the first page when building
  the sgl.
- Memcopy the first ~8k into a contiguous 8k region on send.
- Move the pseudo header to the end so it can cross the page barrier
  without needing a sgl entry. (probably impossible?)

From Paolo:

> AFAICS the max MTU already depends on the underlying h/w; how are such
> differences currently coped with? (I'm sorry, I lack some/a lot of IB
> background.)

It isn't h/w dependent. In CM mode the MTU is 65520 because that is
what is hard coded into the ipoib driver. We tell everyone to use that
number. Eg see RH's docs on the subject:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/sec-Configuring_IPoIB.html

AFAIK, today everyone just wires that number into their scripts, so we
have to mass change everything to the smaller number. That sounds
really hard; IMHO if there is any way to avoid it we should, even if it
is a little costly.

Jason
On Tue, 2016-10-11 at 12:01 -0600, Jason Gunthorpe wrote:
> It isn't h/w dependent. In CM mode the MTU is 65520 because that is
> what is hard coded into the ipoib driver. We tell everyone to use that
> number. Eg see RH's docs on the subject:
>
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/sec-Configuring_IPoIB.html
>
> AFAIK, today everyone just wires that number into their scripts, so we
> have to mass change everything to the smaller number. That sounds
> really hard; IMHO if there is any way to avoid it we should, even if it
> is a little costly.

Thank you for the details!

The first s/g fragment (the head buffer) is not allocated with the page
allocator, so perhaps there is some not too difficult/costly way out of
this.
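A minimal sketch of the skb layout distinction Paolo is pointing at,
using only standard skb accessors (the helper function itself is
hypothetical):

#include <linux/skbuff.h>

/* The linear "head" buffer (kmalloc'ed along with the skb) is distinct
 * from the paged frags that map 1:1 onto the sg array entries, which
 * is the distinction Paolo is relying on. */
static void example_show_layout(struct sk_buff *skb)
{
	unsigned int linear = skb_headlen(skb);		  /* head buffer bytes */
	unsigned int nfrags = skb_shinfo(skb)->nr_frags;  /* paged fragments */

	pr_info("linear=%u bytes, paged frags=%u\n", linear, nfrags);
}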
On 10/11/2016 2:10 PM, Paolo Abeni wrote:
> On Tue, 2016-10-11 at 12:01 -0600, Jason Gunthorpe wrote:
>> AFAIK, today everyone just wires that number into their scripts, so we
>> have to mass change everything to the smaller number.

Well, not exactly. Even if we put 65520 into the scripts, the kernel
will silently drop it down to 65504. It actually won't require anyone
to change anything, they just won't get the full value. I experimented
with this in the past for other reasons and an overly large MTU setting
just resulted in the max MTU. I don't know if that's changed, but if it
still works that way, this is much less of an issue than it might
otherwise be.

>> That sounds really hard; IMHO if there is any way to avoid it we
>> should, even if it is a little costly.
>
> Thank you for the details!
>
> The first s/g fragment (the head buffer) is not allocated with the page
> allocator, so perhaps there is some not too difficult/costly way out of
> this.
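A minimal sketch of the silent-clamp behavior Doug describes, written as
a hypothetical .ndo_change_mtu handler; this is not the actual ipoib
code, which may instead reject an oversized MTU with -EINVAL:

#include <linux/netdevice.h>

/* Hypothetical handler showing a silent clamp to the device maximum;
 * illustrative only, not the ipoib implementation. */
static int example_change_mtu(struct net_device *dev, int new_mtu)
{
	const int max_mtu = 65504;	/* new CM maximum from the patch */

	if (new_mtu > max_mtu)
		new_mtu = max_mtu;	/* clamp instead of returning -EINVAL */

	dev->mtu = new_mtu;
	return 0;
}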
On Tue, Oct 11, 2016 at 08:10:07PM +0200, Paolo Abeni wrote:
> The first s/g fragment (the head buffer) is not allocated with the page
> allocator, so perhaps there is some not too difficult/costly way out of
> this.

Keep in mind, there is nothing magic about the 16 SGL limit, other than
we know all hardware supports it. That can be bumped up and most
hardware will support a higher value. We'd just have to figure out if
any hardware breaks; Mellanox and Intel should be able to respond to
that question.

Jason
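A sketch of what bumping the limit at QP-creation time could look like,
assuming the 2016-era verbs API (the attrs.max_sge field name is from
that era; everything else here is illustrative):

#include <rdma/ib_verbs.h>

/* Illustrative only: derive the send SG list size from what the device
 * reports instead of hard-coding 16. A 64K CM MTU spread over 4K pages
 * plus a separate header fragment would want 17 entries. */
static int example_max_send_sge(struct ib_device *ca)
{
	const int wanted = 17;

	return min_t(int, wanted, ca->attrs.max_sge);
}

/* ...then, when filling in struct ib_qp_init_attr:
 *	init_attr.cap.max_send_sge = example_max_send_sge(priv->ca);
 */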
On Tue, Oct 11, 2016 at 02:17:51PM -0400, Doug Ledford wrote:
> Well, not exactly. Even if we put 65520 into the scripts, the kernel
> will silently drop it down to 65504. It actually won't require anyone
> to change anything, they just won't get the full value.

So it is just docs and relying on PMTU? That is not as bad..

Still, it would be nice to avoid if at all possible.

Jason
On 10/11/2016 2:30 PM, Jason Gunthorpe wrote:
> So it is just docs and relying on PMTU? That is not as bad..
>
> Still, it would be nice to avoid if at all possible.

I agree, but we have a test getting ready to commence. We'll know
shortly how much the reduced MTU affects things, because they aren't
going to alter any of their setup, just put the new kernel in place and
see what happens.
On 10/11/2016 2:50 PM, Doug Ledford wrote:
> I agree, but we have a test getting ready to commence. We'll know
> shortly how much the reduced MTU affects things, because they aren't
> going to alter any of their setup, just put the new kernel in place
> and see what happens.

Long story short on the MTU stuff: the setups whined a bit about not
being able to set the desired MTU, used the new max MTU instead, and
things otherwise worked fine. But Paolo submitted a v2 patch that
removes this change, so it's all moot anyway.
From: Paolo Abeni <pabeni@redhat.com>
Date: Tue, 11 Oct 2016 19:15:44 +0200

> After commit 9207f9d45b0a ("net: preserve IP control block during GSO
> segmentation"), the GSO CB and the IPoIB CB conflict. That destroys
> the IPoIB address information cached there, causing a severe
> performance regression, as better described here:
>
> http://marc.info/?l=linux-kernel&m=146787279825501&w=2
>
> This change moves the data cached by the IPoIB driver from the skb
> control block into the IPoIB hard header, as was done before commit
> 936d7de3d736 ("IPoIB: Stop lying about hard_header_len and use
> skb->cb to stash LL addresses").
>
> In order to avoid GRO issues, on packet reception the IPoIB driver
> stashes into the skb a dummy pseudo header, so that the received
> packets actually have a hard header matching the declared length.
> Also, the connected mode maximum MTU is reduced by 16 bytes to cope
> with the increased hard header length.
>
> After this commit, IPoIB performance is back to the pre-regression
> value.
>
> Fixes: 9207f9d45b0a ("net: preserve IP control block during GSO segmentation")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Not providing an accurate hard_header_len causes many problems.

In fact we recently fixed the mlxsw driver to stop doing this.
On 10/13/2016 10:24 AM, David Miller wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Tue, 11 Oct 2016 19:15:44 +0200
>
> [commit message snipped]
>
> Not providing an accurate hard_header_len causes many problems.
>
> In fact we recently fixed the mlxsw driver to stop doing this.

Sure, but there are too many users of the cb struct, and whatever
problems you are saying there are from lying about the hard header len
are dwarfed by the problems caused by the inability to store the LL
address anywhere between hard_header and send time.
From: Doug Ledford <dledford@redhat.com>
Date: Thu, 13 Oct 2016 10:35:35 -0400

> Sure, but there are too many users of the cb struct, and whatever
> problems you are saying there are from lying about the hard header len
> are dwarfed by the problems caused by the inability to store the LL
> address anywhere between hard_header and send time.

IB wants to pass addressing information between layers; it needs to
find a safe way to do that. The currently proposed patch does not meet
this criterion.

Pushing metadata before the head of the SKB data pointer is illegal, as
the layers in between might want to push protocol headers, mirror the
packet to another interface, etc.

So this "metadata in SKB data" approach is buggy too.
On Thu, 2016-10-13 at 10:24 -0400, David Miller wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Tue, 11 Oct 2016 19:15:44 +0200
>
> [commit message snipped]
>
> Not providing an accurate hard_header_len causes many problems.
>
> In fact we recently fixed the mlxsw driver to stop doing this.

AFAICS the mlxsw driver did a different thing: it added an additional
header in the xmit function and pushed packets into the network stack
with a mac length different from the declared hard header length.

The hard header specified by the IPoIB driver is built in the create()
header_ops callback, and its length matches the mac header length of
the packets pushed into the network stack by the driver. GRO works
correctly on top of that. From the networking stack PoV the
hard_header_len should be 'accurate'.

Can you please give more details on the problems that may arise from
this?

Thank you,

Paolo
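For reference, the create() hook Paolo mentions is the header_ops
callback a driver registers; a minimal sketch of the wiring, with the
callback body reduced to a stub (the authoritative version is
ipoib_hard_header in the patch below):

#include <linux/netdevice.h>
#include <linux/if_infiniband.h>

/* Stub with the header_ops.create signature; the real body pushes the
 * 4-byte ipoib header, then the 20-byte pseudo header carrying the LL
 * destination address, and returns their combined length. */
static int example_hard_header(struct sk_buff *skb, struct net_device *dev,
			       unsigned short type, const void *daddr,
			       const void *saddr, unsigned int len)
{
	return 24;	/* IPOIB_HARD_LEN = IPOIB_ENCAP_LEN + IPOIB_PSEUDO_LEN */
}

static const struct header_ops example_header_ops = {
	.create = example_hard_header,
};

/* At setup time, hard_header_len must match what create() pushes:
 *	dev->header_ops      = &example_header_ops;
 *	dev->hard_header_len = 24;
 */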
On 10/13/2016 10:43 AM, David Miller wrote:
> From: Doug Ledford <dledford@redhat.com>
> Date: Thu, 13 Oct 2016 10:35:35 -0400
>
> [quoted thread snipped]
>
> IB wants to pass addressing information between layers; it needs to
> find a safe way to do that.

We *had* a safe way to do that. It got broken. What about increasing
the size of skb->cb? Or adding a skb->dgid that is a
u8[INFINIBAND_ALEN]? Or a more generic skb->dest_ll_addr that is sized
to hold the dest address for any link layer?

> Pushing metadata before the head of the SKB data pointer is illegal,
> as the layers in between might want to push protocol headers,

That's a total non-issue for us. There are no headers that protocols
can add before ours.

> mirror the packet to another interface

Doesn't that then mean that the headers specific to this interface
should be stripped before the mirror? If so, I believe the way Paolo
did this patch, that is what will be done.

> etc.
>
> So this "metadata in SKB data" approach is buggy too.
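For context, this is the skb->cb scheme being discussed, as it existed
before the patch (the patch below removes it); the BUILD_BUG_ON is the
compile-time guard that kept it honest:

#include <linux/skbuff.h>
#include <linux/if_infiniband.h>	/* INFINIBAND_ALEN */
#include <net/sch_generic.h>		/* struct qdisc_skb_cb */

/* The pre-patch scheme: LL address stashed in skb->cb behind the
 * qdisc's own cb area, with a build-time check that it fits. The GSO
 * code reusing this region is what clobbered the address. */
struct ipoib_cb {
	struct qdisc_skb_cb	qdisc_cb;
	u8			hwaddr[INFINIBAND_ALEN];
};

static inline struct ipoib_cb *ipoib_skb_cb(const struct sk_buff *skb)
{
	BUILD_BUG_ON(sizeof(skb->cb) < sizeof(struct ipoib_cb));
	return (struct ipoib_cb *)skb->cb;
}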
From: Doug Ledford <dledford@redhat.com>
Date: Thu, 13 Oct 2016 11:20:59 -0400

> We *had* a safe way to do that. It got broken. What about increasing
> the size of skb->cb? Or adding a skb->dgid that is a
> u8[INFINIBAND_ALEN]? Or a more generic skb->dest_ll_addr that is sized
> to hold the dest address for any link layer?

I understand the situation, and I also believe that making sk_buff any
huger than it already is happens to be a non-starter.

> > Pushing metadata before the head of the SKB data pointer is illegal,
> > as the layers in between might want to push protocol headers,
>
> That's a total non-issue for us. There are no headers that protocols
> can add before ours.

Ok, if that's the case, and based upon Paolo's response to me it
appears to be, I guess this is OK for now.

Paolo, please resubmit your patch, thanks.
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 9dbfcc0..5dd01fa 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -63,12 +63,14 @@ enum ipoib_flush_level {
 enum {
         IPOIB_ENCAP_LEN           = 4,
+        IPOIB_PSEUDO_LEN          = 20,
+        IPOIB_HARD_LEN            = IPOIB_ENCAP_LEN + IPOIB_PSEUDO_LEN,
 
         IPOIB_UD_HEAD_SIZE        = IB_GRH_BYTES + IPOIB_ENCAP_LEN,
         IPOIB_UD_RX_SG            = 2, /* max buffer needed for 4K mtu */
 
-        IPOIB_CM_MTU              = 0x10000 - 0x10, /* padding to align header to 16 */
-        IPOIB_CM_BUF_SIZE         = IPOIB_CM_MTU + IPOIB_ENCAP_LEN,
+        IPOIB_CM_MTU              = 0x10000 - 0x20, /* padding to align header to 16 */
+        IPOIB_CM_BUF_SIZE         = IPOIB_CM_MTU + IPOIB_HARD_LEN,
         IPOIB_CM_HEAD_SIZE        = IPOIB_CM_BUF_SIZE % PAGE_SIZE,
         IPOIB_CM_RX_SG            = ALIGN(IPOIB_CM_BUF_SIZE, PAGE_SIZE) / PAGE_SIZE,
         IPOIB_RX_RING_SIZE        = 256,
@@ -134,15 +136,21 @@ struct ipoib_header {
         u16     reserved;
 };
 
-struct ipoib_cb {
-        struct qdisc_skb_cb     qdisc_cb;
-        u8                      hwaddr[INFINIBAND_ALEN];
+struct ipoib_pseudo_header {
+        u8      hwaddr[INFINIBAND_ALEN];
 };
 
-static inline struct ipoib_cb *ipoib_skb_cb(const struct sk_buff *skb)
+static inline void skb_add_pseudo_hdr(struct sk_buff *skb)
 {
-        BUILD_BUG_ON(sizeof(skb->cb) < sizeof(struct ipoib_cb));
-        return (struct ipoib_cb *)skb->cb;
+        char *data = skb_push(skb, IPOIB_PSEUDO_LEN);
+
+        /*
+         * only the ipoib header is present now, make room for a dummy
+         * pseudo header and set skb field accordingly
+         */
+        memset(data, 0, IPOIB_PSEUDO_LEN);
+        skb_reset_mac_header(skb);
+        skb_pull(skb, IPOIB_HARD_LEN);
 }
 
 /* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 4ad297d..1b04c8a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -63,6 +63,8 @@ MODULE_PARM_DESC(cm_data_debug_level,
 #define IPOIB_CM_RX_DELAY       (3 * 256 * HZ)
 #define IPOIB_CM_RX_UPDATE_MASK (0x3)
 
+#define IPOIB_CM_RX_RESERVE     (ALIGN(IPOIB_HARD_LEN, 16) - IPOIB_ENCAP_LEN)
+
 static struct ib_qp_attr ipoib_cm_err_attr = {
         .qp_state = IB_QPS_ERR
 };
@@ -146,15 +148,15 @@ static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev,
         struct sk_buff *skb;
         int i;
 
-        skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12);
+        skb = dev_alloc_skb(ALIGN(IPOIB_CM_HEAD_SIZE, 16));
         if (unlikely(!skb))
                 return NULL;
 
         /*
-         * IPoIB adds a 4 byte header. So we need 12 more bytes to align the
+         * IPoIB adds an IPOIB_ENCAP_LEN byte header, this will align the
          * IP header to a multiple of 16.
          */
-        skb_reserve(skb, 12);
+        skb_reserve(skb, IPOIB_CM_RX_RESERVE);
 
         mapping[0] = ib_dma_map_single(priv->ca, skb->data, IPOIB_CM_HEAD_SIZE,
                                        DMA_FROM_DEVICE);
@@ -624,9 +626,9 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
         if (wc->byte_len < IPOIB_CM_COPYBREAK) {
                 int dlen = wc->byte_len;
 
-                small_skb = dev_alloc_skb(dlen + 12);
+                small_skb = dev_alloc_skb(dlen + IPOIB_CM_RX_RESERVE);
                 if (small_skb) {
-                        skb_reserve(small_skb, 12);
+                        skb_reserve(small_skb, IPOIB_CM_RX_RESERVE);
                         ib_dma_sync_single_for_cpu(priv->ca, rx_ring[wr_id].mapping[0],
                                                    dlen, DMA_FROM_DEVICE);
                         skb_copy_from_linear_data(skb, small_skb->data, dlen);
@@ -663,8 +665,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 
 copied:
         skb->protocol = ((struct ipoib_header *) skb->data)->proto;
-        skb_reset_mac_header(skb);
-        skb_pull(skb, IPOIB_ENCAP_LEN);
+        skb_add_pseudo_hdr(skb);
 
         ++dev->stats.rx_packets;
         dev->stats.rx_bytes += skb->len;
@@ -1583,7 +1584,7 @@ int ipoib_cm_dev_init(struct net_device *dev)
                 max_srq_sge = min_t(int, IPOIB_CM_RX_SG, priv->ca->attrs.max_srq_sge);
                 ipoib_cm_create_srq(dev, max_srq_sge);
                 if (ipoib_cm_has_srq(dev)) {
-                        priv->cm.max_cm_mtu = max_srq_sge * PAGE_SIZE - 0x10;
+                        priv->cm.max_cm_mtu = max_srq_sge * PAGE_SIZE - 0x20;
                         priv->cm.num_frags = max_srq_sge;
                         ipoib_dbg(priv, "max_cm_mtu = 0x%x, num_frags=%d\n",
                                   priv->cm.max_cm_mtu, priv->cm.num_frags);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index be11d5d..830fecb 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -128,16 +128,15 @@ static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
 
         buf_size = IPOIB_UD_BUF_SIZE(priv->max_ib_mtu);
 
-        skb = dev_alloc_skb(buf_size + IPOIB_ENCAP_LEN);
+        skb = dev_alloc_skb(buf_size + IPOIB_HARD_LEN);
         if (unlikely(!skb))
                 return NULL;
 
         /*
-         * IB will leave a 40 byte gap for a GRH and IPoIB adds a 4 byte
-         * header. So we need 4 more bytes to get to 48 and align the
-         * IP header to a multiple of 16.
+         * the IP header will be at IPOIB_HARD_LEN + IB_GRH_BYTES, that is
+         * 64 bytes aligned
          */
-        skb_reserve(skb, 4);
+        skb_reserve(skb, sizeof(struct ipoib_pseudo_header));
 
         mapping = priv->rx_ring[id].mapping;
         mapping[0] = ib_dma_map_single(priv->ca, skb->data, buf_size,
@@ -253,8 +252,7 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
         skb_pull(skb, IB_GRH_BYTES);
 
         skb->protocol = ((struct ipoib_header *) skb->data)->proto;
-        skb_reset_mac_header(skb);
-        skb_pull(skb, IPOIB_ENCAP_LEN);
+        skb_add_pseudo_hdr(skb);
 
         ++dev->stats.rx_packets;
         dev->stats.rx_bytes += skb->len;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index cc1c1b0..823a528 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -925,9 +925,12 @@ static void neigh_add_path(struct sk_buff *skb, u8 *daddr,
                                 ipoib_neigh_free(neigh);
                                 goto err_drop;
                         }
-                        if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE)
+                        if (skb_queue_len(&neigh->queue) <
+                            IPOIB_MAX_PATH_REC_QUEUE) {
+                                /* put pseudoheader back on for next time */
+                                skb_push(skb, IPOIB_PSEUDO_LEN);
                                 __skb_queue_tail(&neigh->queue, skb);
-                        else {
+                        } else {
                                 ipoib_warn(priv, "queue length limit %d. Packet drop.\n",
                                            skb_queue_len(&neigh->queue));
                                 goto err_drop;
@@ -964,7 +967,7 @@ err_drop:
 }
 
 static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
-                             struct ipoib_cb *cb)
+                             struct ipoib_pseudo_header *phdr)
 {
         struct ipoib_dev_priv *priv = netdev_priv(dev);
         struct ipoib_path *path;
@@ -972,16 +975,18 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
 
         spin_lock_irqsave(&priv->lock, flags);
 
-        path = __path_find(dev, cb->hwaddr + 4);
+        path = __path_find(dev, phdr->hwaddr + 4);
         if (!path || !path->valid) {
                 int new_path = 0;
 
                 if (!path) {
-                        path = path_rec_create(dev, cb->hwaddr + 4);
+                        path = path_rec_create(dev, phdr->hwaddr + 4);
                         new_path = 1;
                 }
                 if (path) {
                         if (skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
+                                /* put pseudoheader back on for next time */
+                                skb_push(skb, IPOIB_PSEUDO_LEN);
                                 __skb_queue_tail(&path->queue, skb);
                         } else {
                                 ++dev->stats.tx_dropped;
@@ -1009,10 +1014,12 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
                           be16_to_cpu(path->pathrec.dlid));
 
                         spin_unlock_irqrestore(&priv->lock, flags);
-                        ipoib_send(dev, skb, path->ah, IPOIB_QPN(cb->hwaddr));
+                        ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr));
                         return;
                 } else if ((path->query || !path_rec_start(dev, path)) &&
                            skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
+                        /* put pseudoheader back on for next time */
+                        skb_push(skb, IPOIB_PSEUDO_LEN);
                         __skb_queue_tail(&path->queue, skb);
                 } else {
                         ++dev->stats.tx_dropped;
@@ -1026,13 +1033,15 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
         struct ipoib_dev_priv *priv = netdev_priv(dev);
         struct ipoib_neigh *neigh;
-        struct ipoib_cb *cb = ipoib_skb_cb(skb);
+        struct ipoib_pseudo_header *phdr;
         struct ipoib_header *header;
         unsigned long flags;
 
+        phdr = (struct ipoib_pseudo_header *) skb->data;
+        skb_pull(skb, sizeof(*phdr));
         header = (struct ipoib_header *) skb->data;
 
-        if (unlikely(cb->hwaddr[4] == 0xff)) {
+        if (unlikely(phdr->hwaddr[4] == 0xff)) {
                 /* multicast, arrange "if" according to probability */
                 if ((header->proto != htons(ETH_P_IP)) &&
                     (header->proto != htons(ETH_P_IPV6)) &&
@@ -1045,13 +1054,13 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
                         return NETDEV_TX_OK;
                 }
                 /* Add in the P_Key for multicast*/
-                cb->hwaddr[8] = (priv->pkey >> 8) & 0xff;
-                cb->hwaddr[9] = priv->pkey & 0xff;
+                phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff;
+                phdr->hwaddr[9] = priv->pkey & 0xff;
 
-                neigh = ipoib_neigh_get(dev, cb->hwaddr);
+                neigh = ipoib_neigh_get(dev, phdr->hwaddr);
                 if (likely(neigh))
                         goto send_using_neigh;
-                ipoib_mcast_send(dev, cb->hwaddr, skb);
+                ipoib_mcast_send(dev, phdr->hwaddr, skb);
                 return NETDEV_TX_OK;
         }
 
@@ -1060,16 +1069,16 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
         case htons(ETH_P_IP):
         case htons(ETH_P_IPV6):
         case htons(ETH_P_TIPC):
-                neigh = ipoib_neigh_get(dev, cb->hwaddr);
+                neigh = ipoib_neigh_get(dev, phdr->hwaddr);
                 if (unlikely(!neigh)) {
-                        neigh_add_path(skb, cb->hwaddr, dev);
+                        neigh_add_path(skb, phdr->hwaddr, dev);
                         return NETDEV_TX_OK;
                 }
                 break;
         case htons(ETH_P_ARP):
         case htons(ETH_P_RARP):
                 /* for unicast ARP and RARP should always perform path find */
-                unicast_arp_send(skb, dev, cb);
+                unicast_arp_send(skb, dev, phdr);
                 return NETDEV_TX_OK;
         default:
                 /* ethertype not supported by IPoIB */
@@ -1086,11 +1095,13 @@ send_using_neigh:
                         goto unref;
                 }
         } else if (neigh->ah) {
-                ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(cb->hwaddr));
+                ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(phdr->hwaddr));
                 goto unref;
         }
 
         if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
+                /* put pseudoheader back on for next time */
+                skb_push(skb, sizeof(*phdr));
                 spin_lock_irqsave(&priv->lock, flags);
                 __skb_queue_tail(&neigh->queue, skb);
                 spin_unlock_irqrestore(&priv->lock, flags);
@@ -1122,8 +1133,8 @@ static int ipoib_hard_header(struct sk_buff *skb,
                              unsigned short type,
                              const void *daddr, const void *saddr, unsigned len)
 {
+        struct ipoib_pseudo_header *phdr;
         struct ipoib_header *header;
-        struct ipoib_cb *cb = ipoib_skb_cb(skb);
 
         header = (struct ipoib_header *) skb_push(skb, sizeof *header);
 
@@ -1132,12 +1143,13 @@ static int ipoib_hard_header(struct sk_buff *skb,
 
         /*
          * we don't rely on dst_entry structure,  always stuff the
-         * destination address into skb->cb so we can figure out where
+         * destination address into skb hard header so we can figure out where
          * to send the packet later.
          */
-        memcpy(cb->hwaddr, daddr, INFINIBAND_ALEN);
+        phdr = (struct ipoib_pseudo_header *) skb_push(skb, sizeof(*phdr));
+        memcpy(phdr->hwaddr, daddr, INFINIBAND_ALEN);
 
-        return sizeof *header;
+        return IPOIB_HARD_LEN;
 }
 
 static void ipoib_set_mcast_list(struct net_device *dev)
@@ -1759,7 +1771,7 @@ void ipoib_setup(struct net_device *dev)
 
         dev->flags |= IFF_BROADCAST | IFF_MULTICAST;
 
-        dev->hard_header_len = IPOIB_ENCAP_LEN;
+        dev->hard_header_len = IPOIB_HARD_LEN;
         dev->addr_len = INFINIBAND_ALEN;
         dev->type = ARPHRD_INFINIBAND;
         dev->tx_queue_len = ipoib_sendq_size * 2;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index d3394b6..1909dd2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -796,9 +796,11 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
                         __ipoib_mcast_add(dev, mcast);
                         list_add_tail(&mcast->list, &priv->multicast_list);
                 }
-                if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE)
+                if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) {
+                        /* put pseudoheader back on for next time */
+                        skb_push(skb, sizeof(struct ipoib_pseudo_header));
                         skb_queue_tail(&mcast->pkt_queue, skb);
-                else {
+                } else {
                         ++dev->stats.tx_dropped;
                         dev_kfree_skb_any(skb);
                 }
After commit 9207f9d45b0a ("net: preserve IP control block during GSO
segmentation"), the GSO CB and the IPoIB CB conflict. That destroys the
IPoIB address information cached there, causing a severe performance
regression, as better described here:

http://marc.info/?l=linux-kernel&m=146787279825501&w=2

This change moves the data cached by the IPoIB driver from the skb
control block into the IPoIB hard header, as was done before commit
936d7de3d736 ("IPoIB: Stop lying about hard_header_len and use skb->cb
to stash LL addresses").

In order to avoid GRO issues, on packet reception the IPoIB driver
stashes into the skb a dummy pseudo header, so that the received
packets actually have a hard header matching the declared length. Also,
the connected mode maximum MTU is reduced by 16 bytes to cope with the
increased hard header length.

After this commit, IPoIB performance is back to the pre-regression
value.

Fixes: 9207f9d45b0a ("net: preserve IP control block during GSO segmentation")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           | 24 ++++++++----
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        | 17 ++++----
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        | 12 +++---
 drivers/infiniband/ulp/ipoib/ipoib_main.c      | 54 ++++++++++++++----------
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |  6 ++-
 5 files changed, 67 insertions(+), 46 deletions(-)