net/smc: mark as BROKEN due to remote memory exposure

Message ID	20170510072627.12060-1-hch@lst.de (mailing list archive)
State	Superseded
Headers	show Return-Path: <linux-rdma-owner@kernel.org> From: Christoph Hellwig <hch@lst.de> To: davem@davemloft.net Cc: ubraun@linux.vnet.ibm.com, netdev@vger.kernel.org, linux-rdma@vger.kernel.org, stable@vger.kernel.org Subject: [PATCH] net/smc: mark as BROKEN due to remote memory exposure Date: Wed, 10 May 2017 09:26:27 +0200 Message-Id: <20170510072627.12060-1-hch@lst.de> Sender: linux-rdma-owner@vger.kernel.org Precedence: bulk

Christoph Hellwig May 10, 2017, 7:26 a.m. UTC

The driver has a lot of quality issues due to the lack of RDMA-side
review, and explicitly bypasses APIs to register all memory once a
connection is made, and thus allows remote access to memoery.

Mark it as broken until at least that part is fixed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org

---
 net/smc/Kconfig | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Bart Van Assche May 11, 2017, 2:57 p.m. UTC | #1

On Wed, 2017-05-10 at 09:26 +0200, Christoph Hellwig wrote:
> The driver has a lot of quality issues due to the lack of RDMA-side
> review, and explicitly bypasses APIs to register all memory once a
> connection is made, and thus allows remote access to memoery.
> 
> Mark it as broken until at least that part is fixed.

Since this is the only way to get the BROKEN marker in the v4.11 stable
kernel series:

Acked-by: Bart Van Assche <bart.vanassche@sandisk.com>--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig May 14, 2017, 5:58 a.m. UTC | #2

Dave,

this patch has not been superceeded by anything, can you explain why it
has been marked as such in patchworks?

On Thu, May 11, 2017 at 02:57:43PM +0000, Bart Van Assche wrote:
> On Wed, 2017-05-10 at 09:26 +0200, Christoph Hellwig wrote:
> > The driver has a lot of quality issues due to the lack of RDMA-side
> > review, and explicitly bypasses APIs to register all memory once a
> > connection is made, and thus allows remote access to memoery.
> > 
> > Mark it as broken until at least that part is fixed.
> 
> Since this is the only way to get the BROKEN marker in the v4.11 stable
> kernel series:
> 
> Acked-by: Bart Van Assche <bart.vanassche@sandisk.com>---end quoted text---
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Miller May 14, 2017, 3:51 p.m. UTC | #3

From: Christoph Hellwig <hch@lst.de>
Date: Sun, 14 May 2017 07:58:48 +0200

> this patch has not been superceeded by anything, can you explain why
> it has been marked as such in patchworks?

I think you're being overbearing by requiring this to be marked BROKEN
and I would like you to explore other ways with the authors to fix
whatever perceived problems you think SMC has.

You claim that this is somehow "urgent" is false.  You can ask
distributions to disable SMC or whatever in the short term if it
reallly, truly, bothers you.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bart Van Assche May 14, 2017, 7:08 p.m. UTC | #4

On Sun, 2017-05-14 at 11:51 -0400, David Miller wrote:
> From: Christoph Hellwig <hch@lst.de>

> Date: Sun, 14 May 2017 07:58:48 +0200

> 

> > this patch has not been superceeded by anything, can you explain why

> > it has been marked as such in patchworks?

> 

> I think you're being overbearing by requiring this to be marked BROKEN

> and I would like you to explore other ways with the authors to fix

> whatever perceived problems you think SMC has.

> 

> You claim that this is somehow "urgent" is false.  You can ask

> distributions to disable SMC or whatever in the short term if it

> reallly, truly, bothers you.

Hello Dave,

There is agreement that the user-space API for using the SMC protocol must
be changed, namely by dropping AF_SMC and by making applications use the
SMC protocol through socket(AF_INET..., SOCK_STREAM, ...). What is your
plan to avoid that applications start using and depending on AF_SMC?

Thanks,

Bart.

David Miller May 15, 2017, 12:44 a.m. UTC | #5

From: Bart Van Assche <Bart.VanAssche@sandisk.com>
Date: Sun, 14 May 2017 19:08:50 +0000

> What is your plan to avoid that applications start using and
> depending on AF_SMC?

The API is out there already so we are out of luck, and neither
you nor I nor anyone else can "stop" this from happening.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Parav Pandit May 15, 2017, 1:58 a.m. UTC | #6

Hi Dave,

> -----Original Message-----
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
> owner@vger.kernel.org] On Behalf Of David Miller
> Sent: Sunday, May 14, 2017 7:44 PM
> To: Bart.VanAssche@sandisk.com
> Cc: hch@lst.de; netdev@vger.kernel.org; linux-rdma@vger.kernel.org;
> stable@vger.kernel.org; ubraun@linux.vnet.ibm.com
> Subject: Re: [PATCH] net/smc: mark as BROKEN due to remote memory
> exposure
> 
> From: Bart Van Assche <Bart.VanAssche@sandisk.com>
> Date: Sun, 14 May 2017 19:08:50 +0000
> 
> > What is your plan to avoid that applications start using and depending
> > on AF_SMC?
> 

status = socket(AF_SMC, field, IPPROT_TCP);
Here,
- AF_SMC actually means AF_INET IPv4 addresses!
- IPPROTO_TCP means TCP and RDMA both when socket is AF_SMC.
- When creating socket addresses, use AF_INET based addresses.
-  When invoking bind(), listen(), connect() APIs, use AF_INET addresses instead.
- Supporting IPv6 is TBD with AF_SMC sockets.
- At user level get_addrinfo will continue to return AF_INET addresses.

Such explanation for socket APIs doesn't sound correct.

The primary motivation for SMC protocol was to simplify the applications and library to make use of RDMA.
This kind of API is against such simplicity and creates more confusion.
RFC only gives example and doesn't asks to create new socket family.
I can provide more data, but a simple grep in get_addrinfo() and friend functions in user space has heavy dependence on AF_INET and AF_INET6.

> The API is out there already so we are out of luck, and neither you nor I nor
> anyone else can "stop" this from happening.

I think it is still not too late to fix this API. SMC is released in v4.11 very recently.
v4.12 is still not out.
Given the limitation of protocol being RoCEv1 only, we might not have many users whose applications will stop functioning.
(Which will anyway won't work for RoCEv2, and IPv6 addresses).

I propose,
(a) AF_SMC socket 43 can be marked reserved in future kernel versions to avoid use.
(b) New protocol family that represents TCP and RDMA protocol, may be named IPPROTO_SMC even though it is not a protocol in IP header.

We can possibly target to have this fix in 4.13 kernel timeframe.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the
> body of a message to majordomo@vger.kernel.org More majordomo info
> at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sagi Grimberg May 15, 2017, 6:41 a.m. UTC | #7

Hi Dave,

>> this patch has not been superceeded by anything, can you explain why
>> it has been marked as such in patchworks?
>
> I think you're being overbearing by requiring this to be marked BROKEN
> and I would like you to explore other ways with the authors to fix
> whatever perceived problems you think SMC has.

Well, its not one's opinion, its a real problem. To be fair, this
security breach existed in other RDMA based storage protocols for
years in the past, but we cleaned it up completely.

Our assumption is that *if* the user is willingly choosing to expose its
entire physical address space to remote access to get some performance
boost that's fine, but to have the kernel expose it by default, without
even letting the user to control it is plain irresponsible. IMHO we
should *not* repeat the mistakes of the past and set a higher bar for
RDMA protocol implementations.

> You claim that this is somehow "urgent" is false.  You can ask
> distributions to disable SMC or whatever in the short term if it
> reallly, truly, bothers you.

I doubt that is sufficient given that not all implementations
out there rely on distros. I'm afraid one time is too much in
this case.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Leon Romanovsky May 15, 2017, 7:18 a.m. UTC | #8

On Sun, May 14, 2017 at 11:51:16AM -0400, David Miller wrote:
> From: Christoph Hellwig <hch@lst.de>
> Date: Sun, 14 May 2017 07:58:48 +0200
>
> > this patch has not been superceeded by anything, can you explain why
> > it has been marked as such in patchworks?
>
> I think you're being overbearing by requiring this to be marked BROKEN
> and I would like you to explore other ways with the authors to fix
> whatever perceived problems you think SMC has.

Hi Dave,

Christoph proposed very clear way to remove BROKEN tag, by providing
secure environment by default to all users.

All ULPs in RDMA and lustre in staging [1] are working in secure mode
by default and I don't understand why SMC-R should be different.

From my experience such "BROKEN" tag will help for authors to get proper
priority from their managers to fix it in a first place, before they are
moved to anything else.

>
> You claim that this is somehow "urgent" is false.  You can ask
> distributions to disable SMC or whatever in the short term if it
> reallly, truly, bothers you.

It is not distributions only. For example, RDMA stack is released by
different providers. Mellanox OFED [2] is one of them and it is based on
upstream kernel.

Thanks

[1] e0ccf5d085ab ("staging: lustre: ko2iblnd: Adapt to the removal of ib_get_dma_mr()")
[2] http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Doug Ledford May 16, 2017, 3:57 p.m. UTC | #9

Hi Linus,

I've added you to this thread.  A quick synopsis: Dave sent you the
net/smc driver for 4.11.  Even though it lives in net/smc, it is, for
the most part, a net<->rdma translator and so it is as much an RDMA
driver as anything else.  And upon review, the rdma community does not
believe either the spec/rfc or the driver are the right way to engineer
this particular technology, and the implementation leaves much to be
desired.

On Sun, 2017-05-14 at 20:44 -0400, David Miller wrote:
> From: Bart Van Assche <Bart.VanAssche@sandisk.com>
> Date: Sun, 14 May 2017 19:08:50 +0000
> 
> > What is your plan to avoid that applications start using and
> > depending on AF_SMC?
> 
> The API is out there already so we are out of luck, and neither
> you nor I nor anyone else can "stop" this from happening.

That's not true at all.  There's nothing that says we can't revert this
now before it goes any further.  It's only been in two kernels, I'm
positive it hasn't landed in any distros yet, and it can go back to
being something people can add on the side.  Futher, the "standard"
this is based on is not a real standard, it's just a publication and
has not been through a standard track.  I wouldn't consider this "out
there already" until there is a standard that has gone through the
standard track.

Regardless though, I'm rather purturbed about this entire thing.  If
you are right that because this got into 4.11, it's now a done deal,
then the fact that this went through 4 review cycles on netdev@ that,
as I understand it, spanned roughly one years time, and not one single
person bothered to note that this was as much an RDMA driver as
anything else, and not one person bothered to note that linux-rdma@ was
not on the Cc: list, and not one person told the submitters that they
needed to include linux-rdma@ on the Cc: list of these submissions, and
you took it without any review comments from any RDMA people in the
course of a year, or an ack from me to show that the RDMA portion of
this had at least been given some sort of review, was a collosal fuckup
of cross tree maintainer cooperation.

The SMC driver makes several mistakes that people tried to avoid with
previous RDMA standards, it only supports one out of the five possible
link layers (iWARP, IB, OPA, RoCEv1, RoCEv2), it uses a highly
discouraged and deprecated technique for memory registration that is
considered horribly insecure (handing the keys to the castle to anyone
who connects to the machine, aka, the entire memory space is registered
with one key and that key is given to remote connections, so they can
read any bit of kernel memory they want as opposed to whatever we tell
them to read), and the design as articuled in the published rfc seems
incomplete for dealing with any of the other link layers, indicating
that this should have probably stayed out until the rfc was discussed
and updated to address the shortcomings obviously present in the
current rfc.  With all of these issues outstanding against it, I hope
you can see why I think the way I do about you taking it without ever
consulting any of the RDMA community.

But that leaves us with the question of what to do moving forward.
 Probably the number one concern is that this protocol chose to create
a new AF as opposed to reusing the IPv4 and IPv6 address families and
adding an option similar to SCTP for enabling the new protocol on a
specific socket.  The concern is that we have means of addressing all
of the link layers the RDMA subsystem supports using IPv4 or IPv6 (sort
of...it's possible to have IB or OPA without IPoIB, which leaves them
without an IPv4 or IPv6 address, in which case the rdmacm can use
native GUIDs to resolve the other side, but that only works for verbs
connections, in the case of TCP connections, we always require IPoIB to
be present, and so IPv4 or IPv6 is always sufficient).  In the end,
switching this protocol to use AF_INET and AF_INET6 and a protocol
option to enable SMC may be what we need to do.  That, of course,
changes the user space API.  So, are we truly locked in at this point?
 I would suggest that, since this is only present in 4.11 and 4.12, and
I'm sure this has not landed in any distros as of yet (except maybe
something like Fedora rawhide), we can submit a patch to both the
current kernel and the 4.11 stable to set this code as
CONFIG_EXPERIMENTAL and mark the API as possibly going to undergo
change.  Then let the RDMA community work with IBM to get this properly
fixed so that this is a reasonable RDMA driver and not something the
community is ready to immediately trash, and only after we've got it
whipped into shape and the RDMA community is satisfied it is a
reasonable driver that can continue to work with future planned RDMA
subsystem updates and across various link layers, we remove the
EXPERIMENTAL marker and freeze the API for user space.

David Miller May 16, 2017, 4:29 p.m. UTC | #10

From: Doug Ledford <dledford@redhat.com>
Date: Tue, 16 May 2017 11:57:04 -0400

> Regardless though, I'm rather purturbed about this entire thing.  If
> you are right that because this got into 4.11, it's now a done deal,
> then the fact that this went through 4 review cycles on netdev@ that,
> as I understand it, spanned roughly one years time, and not one single
> person bothered to note that this was as much an RDMA driver as
> anything else, and not one person bothered to note that linux-rdma@ was
> not on the Cc: list, and not one person told the submitters that they
> needed to include linux-rdma@ on the Cc: list of these submissions, and
> you took it without any review comments from any RDMA people in the
> course of a year, or an ack from me to show that the RDMA portion of
> this had at least been given some sort of review, was a collosal fuckup
> of cross tree maintainer cooperation.

We rely on people from various areas of expertiece to contribute to
patch review on netdev and give appropriate feedback.

If you actually look through the history, I made many semantic reviews
of the SMC changes, and kept pushing back.

And in fact I did this several times, making them go through several
revisions, in the hopes that someone would review more of the meat and
substance of the patch set.

Nobody do this for over a year.

I can't push back on people with silly coding style and small semantic
issues forever.  And I think I made a serious effort to keep the
patches getting posted over and over again to make sure they got more
exposure.

I think it's unsettling that there are no RDMA experts, or at least
people remotely knowledgable about this "networking" technology,
subscribed to netdev and taking a cursory look at pactches that might
be relevant and effect that technology either directly or indirectly.

So there is a lot of blame to go around.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig May 16, 2017, 4:30 p.m. UTC | #11

On Tue, May 16, 2017 at 12:29:23PM -0400, David Miller wrote:
> I can't push back on people with silly coding style and small semantic
> issues forever.  And I think I made a serious effort to keep the
> patches getting posted over and over again to make sure they got more
> exposure.

You can tell them to go to linux-rdma.  I'm sending people to the right
mailing list all the time.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Miller May 16, 2017, 4:33 p.m. UTC | #12

From: Christoph Hellwig <hch@lst.de>
Date: Tue, 16 May 2017 18:30:41 +0200

> On Tue, May 16, 2017 at 12:29:23PM -0400, David Miller wrote:
>> I can't push back on people with silly coding style and small semantic
>> issues forever.  And I think I made a serious effort to keep the
>> patches getting posted over and over again to make sure they got more
>> exposure.
> 
> You can tell them to go to linux-rdma.  I'm sending people to the right
> mailing list all the time.

That doesn't cover things that don't directly touch the RDMA code or
infiniband infrastructure.

There should have been RDMA people on netdev who saw this thing and
cried wolf, and they would have had about an entire year, and about
8 instances of this series being posted in which to do so.

It's not like this got posted once or twice and went in with zero
review.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig May 16, 2017, 4:35 p.m. UTC | #13

On Tue, May 16, 2017 at 12:33:08PM -0400, David Miller wrote:
> That doesn't cover things that don't directly touch the RDMA code or
> infiniband infrastructure.
> 
> There should have been RDMA people on netdev who saw this thing and
> cried wolf, and they would have had about an entire year, and about
> 8 instances of this series being posted in which to do so.
> 
> It's not like this got posted once or twice and went in with zero
> review.

None outside of netdev.  Really, if you get rdma patches include
linux-rdma.  Or at least linux-kernel where I will usually catch
this sort of stuff.  netdev is too high traffic and too far away
from my area of work to watch it.

But for example when I wrote my first RDMA ULP I did subsribe
to linux-rdma, started extensive discussions and fixed up lots
of core code.  There is no excuse for other people to not even
try.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Doug Ledford May 16, 2017, 4:36 p.m. UTC | #14

On Tue, 2017-05-16 at 18:30 +0200, Christoph Hellwig wrote:
> On Tue, May 16, 2017 at 12:29:23PM -0400, David Miller wrote:
> > 
> > I can't push back on people with silly coding style and small
> > semantic
> > issues forever.  And I think I made a serious effort to keep the
> > patches getting posted over and over again to make sure they got
> > more
> > exposure.
> 
> You can tell them to go to linux-rdma.  I'm sending people to the
> right
> mailing list all the time.

Indeed.  Every single time a patch comes into linux-rdma that touches
things in net/ or include/net, unless it is exceedingly minor, I check
the To:/Cc: lines on the email and if netdev@ isn't included, or in the
case of complex/tricky items, you aren't directly Cc:ed, then I
specifically tell them to include netdev@ and/or you.  I've even had
things like a 12 patch series that buried three netdev@ appropriate
patches at different points in the series and told the submitter to
move all of the netdev@ related patches to the front and submit them to
netdev@ so they can be reviewed as a group before I would move on to
the others.  It's just what you do.  I've always considered that part
of my job.

David Miller May 16, 2017, 4:41 p.m. UTC | #15

From: Doug Ledford <dledford@redhat.com>
Date: Tue, 16 May 2017 12:36:01 -0400

> On Tue, 2017-05-16 at 18:30 +0200, Christoph Hellwig wrote:
>> On Tue, May 16, 2017 at 12:29:23PM -0400, David Miller wrote:
>> > 
>> > I can't push back on people with silly coding style and small
>> > semantic
>> > issues forever.  And I think I made a serious effort to keep the
>> > patches getting posted over and over again to make sure they got
>> > more
>> > exposure.
>> 
>> You can tell them to go to linux-rdma.  I'm sending people to the
>> right
>> mailing list all the time.
> 
> Indeed.  Every single time a patch comes into linux-rdma that touches
> things in net/ or include/net, unless it is exceedingly minor, I check
> the To:/Cc: lines on the email and if netdev@ isn't included, or in the
> case of complex/tricky items, you aren't directly Cc:ed, then I
> specifically tell them to include netdev@ and/or you.  I've even had
> things like a 12 patch series that buried three netdev@ appropriate
> patches at different points in the series and told the submitter to
> move all of the netdev@ related patches to the front and submit them to
> netdev@ so they can be reviewed as a group before I would move on to
> the others.  It's just what you do.  I've always considered that part
> of my job.

To be quite honest it wasn't exceedingly clear, even to me, that this
had such implications or was directly a RDMA thing.  From my
perspective while reviewing I saw a patch series adding it's own
protocol stack living inside of it's own directory under net/

And, if even one RDMA/infiniband person said to me "you really
shouldn't apply this" then I would have dropped it on the spot.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Doug Ledford May 16, 2017, 4:42 p.m. UTC | #16

On Tue, 2017-05-16 at 12:29 -0400, David Miller wrote:
> From: Doug Ledford <dledford@redhat.com>
> Date: Tue, 16 May 2017 11:57:04 -0400
> 
> > Regardless though, I'm rather purturbed about this entire thing.
>  If
> > you are right that because this got into 4.11, it's now a done
> deal,
> > then the fact that this went through 4 review cycles on netdev@
> that,
> > as I understand it, spanned roughly one years time, and not one
> single
> > person bothered to note that this was as much an RDMA driver as
> > anything else, and not one person bothered to note that linux-rdma@ 
> was
> > not on the Cc: list, and not one person told the submitters that
> they
> > needed to include linux-rdma@ on the Cc: list of these submissions,
> and
> > you took it without any review comments from any RDMA people in the
> > course of a year, or an ack from me to show that the RDMA portion
> of
> > this had at least been given some sort of review, was a collosal
> fuckup
> > of cross tree maintainer cooperation.
> 
> We rely on people from various areas of expertiece to contribute to
> patch review on netdev and give appropriate feedback.
> 
> If you actually look through the history, I made many semantic
> reviews
> of the SMC changes, and kept pushing back.
> 
> And in fact I did this several times, making them go through several
> revisions, in the hopes that someone would review more of the meat
> and
> substance of the patch set.

If you want to walk to the mailbox, you walk to the mailbox, you don't
walk to the grocery store, to the gym, and never even go to the
mailbox.  Likewise, if you want review from RDMA experts, most (maybe
even all) don't subscribe to netdev@ because it's too high traffic, you
don't waste time on silly semantic pushbacks, you send a single email
that says "Please get review from linux-rdma@, thank you."  Don't beat
around the bush, be direct and get it over with.  That's exactly what I
do for all netdev@ related patches coming to linux-rdma@ without a
proper Cc: to netdev@.

> Nobody do this for over a year.
> 
> I can't push back on people with silly coding style and small
> semantic
> issues forever.  And I think I made a serious effort to keep the
> patches getting posted over and over again to make sure they got more
> exposure.
> 
> I think it's unsettling that there are no RDMA experts, or at least
> people remotely knowledgable about this "networking" technology,
> subscribed to netdev and taking a cursory look at pactches that might
> be relevant and effect that technology either directly or indirectly.
> 
> So there is a lot of blame to go around.

Fine, allocate blame however, you like.  What I want to actually settle
is how we are going to move forward.  You didn't respond to that part
of my email.  Your thoughts?

David Miller May 16, 2017, 4:49 p.m. UTC | #17

From: Doug Ledford <dledford@redhat.com>
Date: Tue, 16 May 2017 12:42:43 -0400

> On Tue, 2017-05-16 at 12:29 -0400, David Miller wrote:
>> From: Doug Ledford <dledford@redhat.com>
>> Date: Tue, 16 May 2017 11:57:04 -0400
>> 
>> > Regardless though, I'm rather purturbed about this entire thing.
>>  If
>> > you are right that because this got into 4.11, it's now a done
>> deal,
>> > then the fact that this went through 4 review cycles on netdev@
>> that,
>> > as I understand it, spanned roughly one years time, and not one
>> single
>> > person bothered to note that this was as much an RDMA driver as
>> > anything else, and not one person bothered to note that linux-rdma@ 
>> was
>> > not on the Cc: list, and not one person told the submitters that
>> they
>> > needed to include linux-rdma@ on the Cc: list of these submissions,
>> and
>> > you took it without any review comments from any RDMA people in the
>> > course of a year, or an ack from me to show that the RDMA portion
>> of
>> > this had at least been given some sort of review, was a collosal
>> fuckup
>> > of cross tree maintainer cooperation.
>> 
>> We rely on people from various areas of expertiece to contribute to
>> patch review on netdev and give appropriate feedback.
>> 
>> If you actually look through the history, I made many semantic
>> reviews
>> of the SMC changes, and kept pushing back.
>> 
>> And in fact I did this several times, making them go through several
>> revisions, in the hopes that someone would review more of the meat
>> and
>> substance of the patch set.
> 
> If you want to walk to the mailbox, you walk to the mailbox, you don't
> walk to the grocery store, to the gym, and never even go to the
> mailbox.  Likewise, if you want review from RDMA experts, most (maybe
> even all) don't subscribe to netdev@ because it's too high traffic, you
> don't waste time on silly semantic pushbacks, you send a single email
> that says "Please get review from linux-rdma@, thank you."  Don't beat
> around the bush, be direct and get it over with.  That's exactly what I
> do for all netdev@ related patches coming to linux-rdma@ without a
> proper Cc: to netdev@.

Read my other email, it wasn't %100 clear to me that this was so
strictly RDMA related.  And I kept pushing back with semantic changes
in part because it wasn't clear.

So as far as I was concerned I was not necessarily going to the wrong
store, in fact I wasn't sure what store to go to.

And none of the thousands of subscribers to netdev intuit'd this
either.  Maybe there is a reason for that.

Furthermore, if netdev is too much traffic for one or two RDMA people
to just casually subscribe to on the off chance that a situationm like
this comes up, guess what it is like for me who has to read and review
pretty much every single posting that is placed there?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Doug Ledford May 16, 2017, 5:12 p.m. UTC | #18

On Tue, 2017-05-16 at 12:41 -0400, David Miller wrote:
> From: Doug Ledford <dledford@redhat.com>
> Date: Tue, 16 May 2017 12:36:01 -0400
> 
> > On Tue, 2017-05-16 at 18:30 +0200, Christoph Hellwig wrote:
> >> On Tue, May 16, 2017 at 12:29:23PM -0400, David Miller wrote:
> >> > 
> >> > I can't push back on people with silly coding style and small
> >> > semantic
> >> > issues forever.  And I think I made a serious effort to keep the
> >> > patches getting posted over and over again to make sure they got
> >> > more
> >> > exposure.
> >> 
> >> You can tell them to go to linux-rdma.  I'm sending people to the
> >> right
> >> mailing list all the time.
> > 
> > Indeed.  Every single time a patch comes into linux-rdma that
> touches
> > things in net/ or include/net, unless it is exceedingly minor, I
> check
> > the To:/Cc: lines on the email and if netdev@ isn't included, or in
> the
> > case of complex/tricky items, you aren't directly Cc:ed, then I
> > specifically tell them to include netdev@ and/or you.  I've even
> had
> > things like a 12 patch series that buried three netdev@ appropriate
> > patches at different points in the series and told the submitter to
> > move all of the netdev@ related patches to the front and submit
> them to
> > netdev@ so they can be reviewed as a group before I would move on
> to
> > the others.  It's just what you do.  I've always considered that
> part
> > of my job.
> 
> To be quite honest it wasn't exceedingly clear, even to me, that this
> had such implications or was directly a RDMA thing.  From my
> perspective while reviewing I saw a patch series adding it's own
> protocol stack living inside of it's own directory under net/

OK.  Fair enough.  That implies to me that you probably dove right into
the patches and skimmed the cover letter (http://marc.info/?l=linux-s39
0&m=148397751211964&w=2), because when I read the cover letter, it
seems pretty clear to me that this is very much RDMA related.  In any
case, there are already two other code bases that live under net/ but
also involve RDMA heavily: nfs and rds.  This is a third now.  So, for
future reference, just like the RDMA related nfs/rds patches already
get Cc:ed to linux-rdma@, these probably should be too.

> And, if even one RDMA/infiniband person said to me "you really
> shouldn't apply this" then I would have dropped it on the spot.

I'm glad to hear that.  I wish we had managed to get together on this
sooner.

Doug Ledford May 16, 2017, 5:20 p.m. UTC | #19

On Tue, 2017-05-16 at 12:49 -0400, David Miller wrote:
> From: Doug Ledford <dledford@redhat.com>
> Date: Tue, 16 May 2017 12:42:43 -0400
> 
> > On Tue, 2017-05-16 at 12:29 -0400, David Miller wrote:
> >> From: Doug Ledford <dledford@redhat.com>
> >> Date: Tue, 16 May 2017 11:57:04 -0400
> >> 
> >> > Regardless though, I'm rather purturbed about this entire thing.
> >>  If
> >> > you are right that because this got into 4.11, it's now a done
> >> deal,
> >> > then the fact that this went through 4 review cycles on netdev@
> >> that,
> >> > as I understand it, spanned roughly one years time, and not one
> >> single
> >> > person bothered to note that this was as much an RDMA driver as
> >> > anything else, and not one person bothered to note that linux-
> rdma@ 
> >> was
> >> > not on the Cc: list, and not one person told the submitters that
> >> they
> >> > needed to include linux-rdma@ on the Cc: list of these
> submissions,
> >> and
> >> > you took it without any review comments from any RDMA people in
> the
> >> > course of a year, or an ack from me to show that the RDMA
> portion
> >> of
> >> > this had at least been given some sort of review, was a collosal
> >> fuckup
> >> > of cross tree maintainer cooperation.
> >> 
> >> We rely on people from various areas of expertiece to contribute
> to
> >> patch review on netdev and give appropriate feedback.
> >> 
> >> If you actually look through the history, I made many semantic
> >> reviews
> >> of the SMC changes, and kept pushing back.
> >> 
> >> And in fact I did this several times, making them go through
> several
> >> revisions, in the hopes that someone would review more of the meat
> >> and
> >> substance of the patch set.
> > 
> > If you want to walk to the mailbox, you walk to the mailbox, you
> don't
> > walk to the grocery store, to the gym, and never even go to the
> > mailbox.  Likewise, if you want review from RDMA experts, most
> (maybe
> > even all) don't subscribe to netdev@ because it's too high traffic,
> you
> > don't waste time on silly semantic pushbacks, you send a single
> email
> > that says "Please get review from linux-rdma@, thank you."  Don't
> beat
> > around the bush, be direct and get it over with.  That's exactly
> what I
> > do for all netdev@ related patches coming to linux-rdma@ without a
> > proper Cc: to netdev@.
> 
> Read my other email, it wasn't %100 clear to me that this was so
> strictly RDMA related.  And I kept pushing back with semantic changes
> in part because it wasn't clear.

See my other email.  I think in the fact that you are overloaded on
netdev@ you skimmed the cover letter, which is the one thing that would
have made things clear.

> So as far as I was concerned I was not necessarily going to the wrong
> store, in fact I wasn't sure what store to go to.
> 
> And none of the thousands of subscribers to netdev intuit'd this
> either.  Maybe there is a reason for that.

Yeah, it's an overloaded list would be my guess.  I imagine no one can
keep up with it fully in truth.  Maybe it's time to split some of it
out into sublists?  netdev-core/netdev-ethernet/netdev-packet?  I don't
know, but I know I can't keep up with it.  That you do as well as you
do is probably only because you know how to not spend too much time on
any one thing.  I'm sure you have people you know are submitting
important work that effects the overall net subsystem and you focus on
their stuff, but ancillary things probably only get half attention at
double speed in order to allow you to make it through the day.

> Furthermore, if netdev is too much traffic for one or two RDMA people
> to just casually subscribe to on the off chance that a situationm
> like
> this comes up, guess what it is like for me who has to read and
> review
> pretty much every single posting that is placed there?

I get it.  I don't have the time to both be responsible for linux-rdma
and follow netdev.  I can already tell you what would happen if I
tried.  Eventually netdev would get so far behind I would just mass
delete and start over.  So that doesn't solve the problem.

Anyway, we're just talking out what happened, when what we really need
to focus on is moving forward.  Again, your thoughts on marking SMC
EXPERIMENTAL until it's fixed up and unfreezing the API in case we need
to adjust it to work on different link layers?

David Miller May 16, 2017, 5:36 p.m. UTC | #20

From: Doug Ledford <dledford@redhat.com>
Date: Tue, 16 May 2017 13:20:44 -0400

> Anyway, we're just talking out what happened, when what we really need
> to focus on is moving forward.  Again, your thoughts on marking SMC
> EXPERIMENTAL until it's fixed up and unfreezing the API in case we need
> to adjust it to work on different link layers?

Something like:

	http://patchwork.ozlabs.org/patch/762803/

with the addition of the EXPERIMENTAL dependency?

Sure.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Doug Ledford May 16, 2017, 6:03 p.m. UTC | #21

On Tue, 2017-05-16 at 13:36 -0400, David Miller wrote:
> From: Doug Ledford <dledford@redhat.com>
> Date: Tue, 16 May 2017 13:20:44 -0400
> 
> > Anyway, we're just talking out what happened, when what we really
> need
> > to focus on is moving forward.  Again, your thoughts on marking SMC
> > EXPERIMENTAL until it's fixed up and unfreezing the API in case we
> need
> > to adjust it to work on different link layers?
> 
> Something like:
> 
>         http://patchwork.ozlabs.org/patch/762803/
> 
> with the addition of the EXPERIMENTAL dependency?
> 
> Sure.

Perfect.  I assume you'll submit it since it's in your patchworks?

David Miller May 16, 2017, 6:52 p.m. UTC | #22

From: Doug Ledford <dledford@redhat.com>
Date: Tue, 16 May 2017 14:03:22 -0400

> On Tue, 2017-05-16 at 13:36 -0400, David Miller wrote:
>> From: Doug Ledford <dledford@redhat.com>
>> Date: Tue, 16 May 2017 13:20:44 -0400
>> 
>> > Anyway, we're just talking out what happened, when what we really
>> need
>> > to focus on is moving forward.  Again, your thoughts on marking SMC
>> > EXPERIMENTAL until it's fixed up and unfreezing the API in case we
>> need
>> > to adjust it to work on different link layers?
>> 
>> Something like:
>> 
>>         http://patchwork.ozlabs.org/patch/762803/
>> 
>> with the addition of the EXPERIMENTAL dependency?
>> 
>> Sure.
> 
> Perfect.  I assume you'll submit it since it's in your patchworks?

Ok I applied the patch referenced above, but we don't actually have
an EXPERIMENTAL symbol.  The closest thing we have is BROKEN and
even in this situation that's a bit harsh.

I also applied the patch which converts SMC over to using
IB_PD_UNSAFE_GLOBAL_RKEY.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Doug Ledford May 16, 2017, 7:28 p.m. UTC | #23

On Tue, 2017-05-16 at 14:52 -0400, David Miller wrote:
> From: Doug Ledford <dledford@redhat.com>
> Date: Tue, 16 May 2017 14:03:22 -0400
> 
> > On Tue, 2017-05-16 at 13:36 -0400, David Miller wrote:
> >> From: Doug Ledford <dledford@redhat.com>
> >> Date: Tue, 16 May 2017 13:20:44 -0400
> >> 
> >> > Anyway, we're just talking out what happened, when what we
> really
> >> need
> >> > to focus on is moving forward.  Again, your thoughts on marking
> SMC
> >> > EXPERIMENTAL until it's fixed up and unfreezing the API in case
> we
> >> need
> >> > to adjust it to work on different link layers?
> >> 
> >> Something like:
> >> 
> >>         http://patchwork.ozlabs.org/patch/762803/
> >> 
> >> with the addition of the EXPERIMENTAL dependency?
> >> 
> >> Sure.
> > 
> > Perfect.  I assume you'll submit it since it's in your patchworks?
> 
> Ok I applied the patch referenced above, but we don't actually have
> an EXPERIMENTAL symbol.  The closest thing we have is BROKEN and
> even in this situation that's a bit harsh.

I hadn't realized EXPERIMENTAL was gone.  Which is too bad, because
that's entirely appropriate in this case, and would have had the
desired side effect of keeping it out of any non-cutting edge distros
and warning people of possible API changes.  With EXPERIMENTAL gone,
the closest thing we have is drivers/staging, since that tends to imply
some of the same consequences.  I know you think BROKEN is overly
harsh, but I'm not sure we should just do nothing.  How about we take a
few days to let some of the RDMA people closely review the 143 page
(egads!) rfc (http://www.rfc-editor.org/info/rfc7609) to see if we
think it can be fixed to use multiple link layers with the existing API
in place or if it will require something other than AF_SMC.  If we need
to break API, then I think we should either fix it ASAP and send that
fix to the 4.11 stable series (which probably violates the normative
stable patch size/scope) or if the fix will take longer than this
kernel cycle, then move it to staging both here and in 4.11 stable, and
fix it there and then move it back.  Something like that would prevent
the kind of API flappage we ought not do....

Doug Ledford May 17, 2017, 8:37 p.m. UTC | #24

On Tue, 2017-05-16 at 15:28 -0400, Doug Ledford wrote:
> I hadn't realized EXPERIMENTAL was gone.  Which is too bad, because
> that's entirely appropriate in this case, and would have had the
> desired side effect of keeping it out of any non-cutting edge distros
> and warning people of possible API changes.  With EXPERIMENTAL gone,
> the closest thing we have is drivers/staging, since that tends to
> imply
> some of the same consequences.  I know you think BROKEN is overly
> harsh, but I'm not sure we should just do nothing.  How about we take
> a
> few days to let some of the RDMA people closely review the 143 page
> (egads!) rfc (http://www.rfc-editor.org/info/rfc7609) to see if we
> think it can be fixed to use multiple link layers with the existing
> API
> in place or if it will require something other than AF_SMC.  If we
> need
> to break API, then I think we should either fix it ASAP and send that
> fix to the 4.11 stable series (which probably violates the normative
> stable patch size/scope) or if the fix will take longer than this
> kernel cycle, then move it to staging both here and in 4.11 stable,
> and
> fix it there and then move it back.  Something like that would
> prevent
> the kind of API flappage we ought not do....

So, I've skimmed the entire RFC, focusing on things were I needed to.
 Here's my take on it:

It would have been better with AF_INET/AF_INET6 and an option to enable
SMC than AF_SMC.  The first implementation simply assumes AF_INET in
the presence of AF_SMC.  When IPv6 support is added, some sort of
guessing logic will have to be put in place to try and determine if an
AF_SMC address is actually AF_INET or AF_INET6 since we won't have a
guaranteed way of telling.  Apps can use struct sockaddr_storage as
their normal element to stick the address into, and could rely on the
kernel to interpret it properly based on the AF_INET/AF_INET6
differentiation, and this breaks that.  The RFC gives *some* thought to
adding IPv6 in the future, but not a lot.  It may be that the answer is
that in the future, IPv6 support is enabled by making the IPv6 API be
AF_INET6 + setsockopt(SMC) or the equivalent.  If that's the case, then
I would suggest making the later API specifically call out AF_INET +
setsockopt(SMC) be identical to AF_SMC.

The protocol included a version header in the negotation messages.
 This is good as it allows us to almost immediately start work on
version 2 that fixes the shortcomings of version 1 while maintaining
back compatibility.

After reading the RFC, I can see why they only support one link layer:
RoCE.  The issue here is that they punted on the hard issue of doing
any sort of work to determine if security restrictions on the TCP
connection should also be applied to the RDMA connection.  The RFC
basically says "the RoCE link needs to have the same physical
restrictions (vlan) as the TCP/IP link so that the security limitations
are the same" because they don't do anything to check them essentially.
 For v2 of the protocol, and for different link layer support, this is
no longer sufficient, so there will be significant work to determine
the security context of specific TCP connections and then make sure
that they meet the security context of the RDMA links allowed.

Additionally, the same is likely to be true in terms of SELinux
options.  The addition of the IB/OPA link layers will throw a bit of a
monkey wrench in things because the SELinux control over those links is
slightly different than a normal TCP/IP SELinux control.  For instance,
you might create rules about processes and ports to make sure that the
httpd daemon can only access specific ports on TCP/IP, but on IB/OPA
you would need to create process and P_Key rules as IB/OPA don't have
the same port level semantics, and it's the P_Key on communications
that is enforced network wide, including in the switches, so
controlling what P_Key a process can send/receive on is your best way
to limit what a process can do.  Translation from one to the other
might be rather difficult to do in any sort of automated fashion.

There might have to be some additional work to get this to properly
account items for the RDMA control group elements that were recently
taken into the kernel.

Finally, the RDMA subsystem is in the process of switching to
structured option processing similar to how netlink does it.  For
version 2 of this protocol, since it will be interacting with the RDMA
core in many ways, will be simpler if it switches the on-wire
negotiation packets to follow the same methods as that will allow reuse
of code that it will have to have for properly dealing with the RDMA
subsystem in the future.

So, I'm fine with it being left as is since you accepted the patch that
makes note of the memory registration insecurity in the Kconfig text.
 The fact that this is a versioned protocol means that we can fix the
things we see as being wrong without having to have it all right from
the very start, it can be done incrementally.

Parav Pandit May 17, 2017, 10:37 p.m. UTC | #25

SGkgRG91ZywNCg0KPiAtLS0tLU9yaWdpbmFsIE1lc3NhZ2UtLS0tLQ0KPiBGcm9tOiBsaW51eC1y
ZG1hLW93bmVyQHZnZXIua2VybmVsLm9yZyBbbWFpbHRvOmxpbnV4LXJkbWEtDQo+IG93bmVyQHZn
ZXIua2VybmVsLm9yZ10gT24gQmVoYWxmIE9mIERvdWcgTGVkZm9yZA0KPiBTZW50OiBXZWRuZXNk
YXksIE1heSAxNywgMjAxNyAzOjM4IFBNDQo+IFRvOiBEYXZpZCBNaWxsZXIgPGRhdmVtQGRhdmVt
bG9mdC5uZXQ+DQo+IENjOiBCYXJ0LlZhbkFzc2NoZUBzYW5kaXNrLmNvbTsgdG9ydmFsZHNAbGlu
dXgtZm91bmRhdGlvbi5vcmc7DQo+IGhjaEBsc3QuZGU7IG5ldGRldkB2Z2VyLmtlcm5lbC5vcmc7
IGxpbnV4LXJkbWFAdmdlci5rZXJuZWwub3JnOw0KPiBzdGFibGVAdmdlci5rZXJuZWwub3JnOyB1
YnJhdW5AbGludXgudm5ldC5pYm0uY29tDQo+IFN1YmplY3Q6IFJlOiBbUEFUQ0hdIG5ldC9zbWM6
IG1hcmsgYXMgQlJPS0VOIGR1ZSB0byByZW1vdGUgbWVtb3J5DQo+IGV4cG9zdXJlDQo+IA0KPiBP
biBUdWUsIDIwMTctMDUtMTYgYXQgMTU6MjggLTA0MDAsIERvdWcgTGVkZm9yZCB3cm90ZToNCj4g
PiBJIGhhZG4ndCByZWFsaXplZCBFWFBFUklNRU5UQUwgd2FzIGdvbmUuIMKgV2hpY2ggaXMgdG9v
IGJhZCwgYmVjYXVzZQ0KPiA+IHRoYXQncyBlbnRpcmVseSBhcHByb3ByaWF0ZSBpbiB0aGlzIGNh
c2UsIGFuZCB3b3VsZCBoYXZlIGhhZCB0aGUNCj4gPiBkZXNpcmVkIHNpZGUgZWZmZWN0IG9mIGtl
ZXBpbmcgaXQgb3V0IG9mIGFueSBub24tY3V0dGluZyBlZGdlIGRpc3Ryb3MNCj4gPiBhbmQgd2Fy
bmluZyBwZW9wbGUgb2YgcG9zc2libGUgQVBJIGNoYW5nZXMuIMKgV2l0aCBFWFBFUklNRU5UQUwg
Z29uZSwNCj4gPiB0aGUgY2xvc2VzdCB0aGluZyB3ZSBoYXZlIGlzIGRyaXZlcnMvc3RhZ2luZywg
c2luY2UgdGhhdCB0ZW5kcyB0bw0KPiA+IGltcGx5IHNvbWUgb2YgdGhlIHNhbWUgY29uc2VxdWVu
Y2VzLiDCoEkga25vdyB5b3UgdGhpbmsgQlJPS0VOIGlzDQo+ID4gb3Zlcmx5IGhhcnNoLCBidXQg
SSdtIG5vdCBzdXJlIHdlIHNob3VsZCBqdXN0IGRvIG5vdGhpbmcuIMKgSG93IGFib3V0DQo+ID4g
d2UgdGFrZSBhIGZldyBkYXlzIHRvIGxldCBzb21lIG9mIHRoZSBSRE1BIHBlb3BsZSBjbG9zZWx5
IHJldmlldyB0aGUNCj4gPiAxNDMgcGFnZQ0KPiA+IChlZ2FkcyEpIHJmYyAoaHR0cDovL3d3dy5y
ZmMtZWRpdG9yLm9yZy9pbmZvL3JmYzc2MDkpIHRvIHNlZSBpZiB3ZQ0KPiA+IHRoaW5rIGl0IGNh
biBiZSBmaXhlZCB0byB1c2UgbXVsdGlwbGUgbGluayBsYXllcnMgd2l0aCB0aGUgZXhpc3RpbmcN
Cj4gPiBBUEkgaW4gcGxhY2Ugb3IgaWYgaXQgd2lsbCByZXF1aXJlIHNvbWV0aGluZyBvdGhlciB0
aGFuIEFGX1NNQy4gwqBJZiB3ZQ0KPiA+IG5lZWQgdG8gYnJlYWsgQVBJLCB0aGVuIEkgdGhpbmsg
d2Ugc2hvdWxkIGVpdGhlciBmaXggaXQgQVNBUCBhbmQgc2VuZA0KPiA+IHRoYXQgZml4IHRvIHRo
ZSA0LjExIHN0YWJsZSBzZXJpZXMgKHdoaWNoIHByb2JhYmx5IHZpb2xhdGVzIHRoZQ0KPiA+IG5v
cm1hdGl2ZSBzdGFibGUgcGF0Y2ggc2l6ZS9zY29wZSkgb3IgaWYgdGhlIGZpeCB3aWxsIHRha2Ug
bG9uZ2VyIHRoYW4NCj4gPiB0aGlzIGtlcm5lbCBjeWNsZSwgdGhlbiBtb3ZlIGl0IHRvIHN0YWdp
bmcgYm90aCBoZXJlIGFuZCBpbiA0LjExDQo+ID4gc3RhYmxlLCBhbmQgZml4IGl0IHRoZXJlIGFu
ZCB0aGVuIG1vdmUgaXQgYmFjay4gwqBTb21ldGhpbmcgbGlrZSB0aGF0DQo+ID4gd291bGQgcHJl
dmVudCB0aGUga2luZCBvZiBBUEkgZmxhcHBhZ2Ugd2Ugb3VnaHQgbm90IGRvLi4uLg0KPiANCj4g
U28sIEkndmUgc2tpbW1lZCB0aGUgZW50aXJlIFJGQywgZm9jdXNpbmcgb24gdGhpbmdzIHdlcmUg
SSBuZWVkZWQgdG8uDQo+IMKgSGVyZSdzIG15IHRha2Ugb24gaXQ6DQo+IA0KPiBJdCB3b3VsZCBo
YXZlIGJlZW4gYmV0dGVyIHdpdGggQUZfSU5FVC9BRl9JTkVUNiBhbmQgYW4gb3B0aW9uIHRvIGVu
YWJsZQ0KPiBTTUMgdGhhbiBBRl9TTUMuIMKgVGhlIGZpcnN0IGltcGxlbWVudGF0aW9uIHNpbXBs
eSBhc3N1bWVzIEFGX0lORVQgaW4NCj4gdGhlIHByZXNlbmNlIG9mIEFGX1NNQy4gwqBXaGVuIElQ
djYgc3VwcG9ydCBpcyBhZGRlZCwgc29tZSBzb3J0IG9mDQo+IGd1ZXNzaW5nIGxvZ2ljIHdpbGwg
aGF2ZSB0byBiZSBwdXQgaW4gcGxhY2UgdG8gdHJ5IGFuZCBkZXRlcm1pbmUgaWYgYW4NCj4gQUZf
U01DIGFkZHJlc3MgaXMgYWN0dWFsbHkgQUZfSU5FVCBvciBBRl9JTkVUNiBzaW5jZSB3ZSB3b24n
dCBoYXZlIGENCj4gZ3VhcmFudGVlZCB3YXkgb2YgdGVsbGluZy4gwqBBcHBzIGNhbiB1c2Ugc3Ry
dWN0IHNvY2thZGRyX3N0b3JhZ2UgYXMgdGhlaXINCj4gbm9ybWFsIGVsZW1lbnQgdG8gc3RpY2sg
dGhlIGFkZHJlc3MgaW50bywgYW5kIGNvdWxkIHJlbHkgb24gdGhlIGtlcm5lbCB0bw0KPiBpbnRl
cnByZXQgaXQgcHJvcGVybHkgYmFzZWQgb24gdGhlIEFGX0lORVQvQUZfSU5FVDYgZGlmZmVyZW50
aWF0aW9uLCBhbmQNCj4gdGhpcyBicmVha3MgdGhhdC4gwqBUaGUgUkZDIGdpdmVzICpzb21lKiB0
aG91Z2h0IHRvIGFkZGluZyBJUHY2IGluIHRoZQ0KPiBmdXR1cmUsIGJ1dCBub3QgYSBsb3QuIMKg
SXQgbWF5IGJlIHRoYXQgdGhlIGFuc3dlciBpcyB0aGF0IGluIHRoZSBmdXR1cmUsIElQdjYNCj4g
c3VwcG9ydCBpcyBlbmFibGVkIGJ5IG1ha2luZyB0aGUgSVB2NiBBUEkgYmUNCj4gQUZfSU5FVDYg
KyBzZXRzb2Nrb3B0KFNNQykgb3IgdGhlIGVxdWl2YWxlbnQuIMKgSWYgdGhhdCdzIHRoZSBjYXNl
LCB0aGVuIEkNCj4gd291bGQgc3VnZ2VzdCBtYWtpbmcgdGhlIGxhdGVyIEFQSSBzcGVjaWZpY2Fs
bHkgY2FsbCBvdXQgQUZfSU5FVCArDQo+IHNldHNvY2tvcHQoU01DKSBiZSBpZGVudGljYWwgdG8g
QUZfU01DLg0KPiANCg0KV2hhdCBhcmUgdGhlIHNob3J0Y29taW5ncyBpbiBteSBwcm9wb3NhbCBb
MV0gd2hpY2ggSSBhbSByZWl0ZXJhdGluZyBiZWxvdy4NCkJhcnQgYWxzbyBzdWdnZXN0ZWQgdG8g
ZGVmaW5lIG5ldyBzdHJlYW0gcHJvdG9jb2wgZm9yIFNNQyBzaW1pbGFyIHRvIFNDVFAuDQoNCihh
KSBhZGRyZXNzIGZhbWlseSBzaG91bGQgYmUgZWl0aGVyIEFGX0lORVQgb3IgQUZfSU5FVDYNCihi
KSBzb2NrZXQoKSBBUEkgdG8gaW50cm9kdWNlIG5ldyBwcm90b2NvbCBhcyBQUk9UT19TTUMgZm9y
IGluIHNvY2tldCgpIHN5c3RlbSBjYWxsLg0KDQpXaXRoIHRoaXMgdGhlcmUgaXMgbm8gYWRkaXRp
b25hbCBzZXRzb2Nrb3AoKSBuZWVkZWQuDQoNCldpdGggdGhpcyAtIHVzZXIgc3BhY2UgYXBwbGlj
YXRpb25zLCBkbyBnZXRhZGRyaW5mbygpIHdpdGggaGludCBhcw0KaGludHMuYWlfZmFtaWx5ICA9
IEFGX0lORVQ7DQpoaW50cy5haV9zb2NrdHlwZSA9IFNPQ0tfU1RSRUFNOw0KDQpnZXRhZGRyaW5m
bygpIHJldHVybnMgYmFjayBib3RoIHRoZSBwcm90b2NvbHMgVENQIGFuZCBTTUMuDQpGYW1vdXMg
ZGF0YWJhc2UgYXBwbGljYXRpb24gc3VjaCBhcyBSZWRpcyBjbGllbnQgaXRlcmF0ZXMgb3ZlciBh
bGwgZW50cmllcyBvZiBnZXRhZGRyaW5mbygpIGFuZCBlc3RhYmxpc2hlcyBjb25uZWN0aW9uIHRv
IHNlcnZlcnMuDQoNClRoZXJlIGFyZSBmZXcgYWR2YW50YWdlcyBvZiB0aGlzIGludGVyZmFjZS4N
CjEuIE5vIGNoYW5nZSBpbiBhbnkgbWFrZWZpbGUgb2YgYXBwbGljYXRpb25zIG5lZWRlZCAtIHVu
bGVzcyB1c2VyIHdhbnRzIHRvIHNwZWNpZnkgZXhwbGljaXRseSB0aGF0IGl0IHdhbnRzIHRvIGZv
cmNlIFNNQyBwcm90b2NvbC4NCjIuIE5vIG5lZWQgdG8gZG8gTERfTElCUkFSWV9QUkVMT0FELCAo
d2hpY2ggd29uJ3Qgd29yayBhbnl3YXkgYmVjYXVzZSBiaW5kKCkgY29ubmVjdCgpIG9mIFNNQyBj
aGVja3MgZm9yIEFGX0lORVQpLg0KMy4gTm8gbWFqb3IgY2hhbmdlcyB0byBnbGliYyB0byBwcm9j
ZXNzIEFGX1NNQyBkaWZmZXJlbnRseQ0KZ2xpYmMgcmVmZXJlbmNlcyBJUFBST1RPX1RDUCBhdCAy
MiBwbGFjZXMuIChjb21wYXJlIHRvIEFGX0lORVQgYXQgMTQwKyBwbGFjZXMpLg0KQSBsb3Qgc2lt
cGxlciB0ZXN0IG1hdHJpeCBmb3IgZ2xpYmMgZm9yIG5ldyBwcm90b2NvbA0KNS4gTm8gbmVlZCB0
byByZWNvbXBpbGUgYXBwbGljYXRpb25zLCBhcyBsb25nIGFzIGdldGFkZHJpbmZvIHJldHVybnMg
YWxsIHN0cmVhbWluZyBwcm90b2NvbHMgKFRDUCwgU01DKQ0KNi4gZm9yIGFwcGxpY2F0aW9ucyB0
byBtYWtlIHVzZSBvZiBzZXRzb2Nrb3B0KCkgaXQgbmVlZHMgYW5vdGhlciBrbm9iIGFuZCBoaW50
IGZyb20gb3RoZXIgcGxhY2VzLCB3aGljaCBjYW4gYmUgYXZvaWRlZCBiZWNhdXNlIFNNQyBUQ1Ag
b3B0aW9uIG5lZ290aWF0ZXMgd2l0aCByZW1vdGUgZW5kDQoNCkFuZCByZXByZXNlbnRpbmcgbmV3
IHByb3RvY29sIGFzIG5ldyBwcm90b2NvbCBmb3IgYSBnaXZlbiBhZGRyZXNzIGZhbWlseSBhcHBl
YXJzIGNvcnJlY3QsIGNvbXBhcmUgdG8gc2V0dGluZyBzb2NrZXQgb3B0aW9ucy4NCg0KVG9vbHMg
bGlrZSBDUklVIG9yIHNpbWlsYXIgdG9vbCBtaWdodCBmaW5kIGEgcmFjZSBjb25kaXRpb25zIC0g
d2hlbiBpdCBxdWVyaWVzIHNvY2tldCBvcHRpb24sIFNNQyB3YXMgbm90IHNldCwgYnV0IGxhdGVy
IG9uIFNNQyB3YXMgc2V0LCBhbmQgZG9lcyBpbmNvcnJlY3QgaGFuZGxpbmcuDQpTZXR0aW5nIHNv
Y2tldCgpIHdpdGggU01DIHByb3RvY29sIG1ha2VzIGl0IGVhc2llciB0byB1bmRlcnN0YW5kIGlu
IHRoaXMgYXJlYSBhcyB3ZWxsLg0KDQpJIGhhdmUgYWRkaXRpb25hbCBwcm9wb3NhbCBmb3IgbGlu
ayBncm91cHMsIHJlc291cmNlIGNyZWF0aW9uIGFyZWEuIEkgd2lsbCB0YWtlIHRoYXQgdXAgYWZ0
ZXIgdGhpcyBkaXNjdXNzaW9uLg0KDQpbMV0gaHR0cHM6Ly9wYXRjaHdvcmsua2VybmVsLm9yZy9w
YXRjaC85NzE5Mzc1Lw0KDQo+IFRoZSBwcm90b2NvbCBpbmNsdWRlZCBhIHZlcnNpb24gaGVhZGVy
IGluIHRoZSBuZWdvdGF0aW9uIG1lc3NhZ2VzLg0KPiDCoFRoaXMgaXMgZ29vZCBhcyBpdCBhbGxv
d3MgdXMgdG8gYWxtb3N0IGltbWVkaWF0ZWx5IHN0YXJ0IHdvcmsgb24gdmVyc2lvbiAyDQo+IHRo
YXQgZml4ZXMgdGhlIHNob3J0Y29taW5ncyBvZiB2ZXJzaW9uIDEgd2hpbGUgbWFpbnRhaW5pbmcg
YmFjaw0KPiBjb21wYXRpYmlsaXR5Lg0KPiANCj4gQWZ0ZXIgcmVhZGluZyB0aGUgUkZDLCBJIGNh
biBzZWUgd2h5IHRoZXkgb25seSBzdXBwb3J0IG9uZSBsaW5rIGxheWVyOg0KPiBSb0NFLiDCoFRo
ZSBpc3N1ZSBoZXJlIGlzIHRoYXQgdGhleSBwdW50ZWQgb24gdGhlIGhhcmQgaXNzdWUgb2YgZG9p
bmcgYW55IHNvcnQNCj4gb2Ygd29yayB0byBkZXRlcm1pbmUgaWYgc2VjdXJpdHkgcmVzdHJpY3Rp
b25zIG9uIHRoZSBUQ1AgY29ubmVjdGlvbiBzaG91bGQNCj4gYWxzbyBiZSBhcHBsaWVkIHRvIHRo
ZSBSRE1BIGNvbm5lY3Rpb24uIMKgVGhlIFJGQyBiYXNpY2FsbHkgc2F5cyAidGhlIFJvQ0UNCj4g
bGluayBuZWVkcyB0byBoYXZlIHRoZSBzYW1lIHBoeXNpY2FsIHJlc3RyaWN0aW9ucyAodmxhbikg
YXMgdGhlIFRDUC9JUCBsaW5rIHNvDQo+IHRoYXQgdGhlIHNlY3VyaXR5IGxpbWl0YXRpb25zIGFy
ZSB0aGUgc2FtZSIgYmVjYXVzZSB0aGV5IGRvbid0IGRvIGFueXRoaW5nDQo+IHRvIGNoZWNrIHRo
ZW0gZXNzZW50aWFsbHkuDQo+IMKgRm9yIHYyIG9mIHRoZSBwcm90b2NvbCwgYW5kIGZvciBkaWZm
ZXJlbnQgbGluayBsYXllciBzdXBwb3J0LCB0aGlzIGlzIG5vIGxvbmdlcg0KPiBzdWZmaWNpZW50
LCBzbyB0aGVyZSB3aWxsIGJlIHNpZ25pZmljYW50IHdvcmsgdG8gZGV0ZXJtaW5lIHRoZSBzZWN1
cml0eSBjb250ZXh0DQo+IG9mIHNwZWNpZmljIFRDUCBjb25uZWN0aW9ucyBhbmQgdGhlbiBtYWtl
IHN1cmUgdGhhdCB0aGV5IG1lZXQgdGhlIHNlY3VyaXR5DQo+IGNvbnRleHQgb2YgdGhlIFJETUEg
bGlua3MgYWxsb3dlZC4NCj4gDQo+IEFkZGl0aW9uYWxseSwgdGhlIHNhbWUgaXMgbGlrZWx5IHRv
IGJlIHRydWUgaW4gdGVybXMgb2YgU0VMaW51eCBvcHRpb25zLiDCoFRoZQ0KPiBhZGRpdGlvbiBv
ZiB0aGUgSUIvT1BBIGxpbmsgbGF5ZXJzIHdpbGwgdGhyb3cgYSBiaXQgb2YgYSBtb25rZXkgd3Jl
bmNoIGluDQo+IHRoaW5ncyBiZWNhdXNlIHRoZSBTRUxpbnV4IGNvbnRyb2wgb3ZlciB0aG9zZSBs
aW5rcyBpcyBzbGlnaHRseSBkaWZmZXJlbnQgdGhhbg0KPiBhIG5vcm1hbCBUQ1AvSVAgU0VMaW51
eCBjb250cm9sLiDCoEZvciBpbnN0YW5jZSwgeW91IG1pZ2h0IGNyZWF0ZSBydWxlcyBhYm91dA0K
PiBwcm9jZXNzZXMgYW5kIHBvcnRzIHRvIG1ha2Ugc3VyZSB0aGF0IHRoZSBodHRwZCBkYWVtb24g
Y2FuIG9ubHkgYWNjZXNzDQo+IHNwZWNpZmljIHBvcnRzIG9uIFRDUC9JUCwgYnV0IG9uIElCL09Q
QSB5b3Ugd291bGQgbmVlZCB0byBjcmVhdGUgcHJvY2Vzcw0KPiBhbmQgUF9LZXkgcnVsZXMgYXMg
SUIvT1BBIGRvbid0IGhhdmUgdGhlIHNhbWUgcG9ydCBsZXZlbCBzZW1hbnRpY3MsIGFuZA0KPiBp
dCdzIHRoZSBQX0tleSBvbiBjb21tdW5pY2F0aW9ucyB0aGF0IGlzIGVuZm9yY2VkIG5ldHdvcmsg
d2lkZSwgaW5jbHVkaW5nDQo+IGluIHRoZSBzd2l0Y2hlcywgc28gY29udHJvbGxpbmcgd2hhdCBQ
X0tleSBhIHByb2Nlc3MgY2FuIHNlbmQvcmVjZWl2ZSBvbiBpcw0KPiB5b3VyIGJlc3Qgd2F5IHRv
IGxpbWl0IHdoYXQgYSBwcm9jZXNzIGNhbiBkby4gwqBUcmFuc2xhdGlvbiBmcm9tIG9uZSB0byB0
aGUNCj4gb3RoZXIgbWlnaHQgYmUgcmF0aGVyIGRpZmZpY3VsdCB0byBkbyBpbiBhbnkgc29ydCBv
ZiBhdXRvbWF0ZWQgZmFzaGlvbi4NCj4gDQo+IFRoZXJlIG1pZ2h0IGhhdmUgdG8gYmUgc29tZSBh
ZGRpdGlvbmFsIHdvcmsgdG8gZ2V0IHRoaXMgdG8gcHJvcGVybHkNCj4gYWNjb3VudCBpdGVtcyBm
b3IgdGhlIFJETUEgY29udHJvbCBncm91cCBlbGVtZW50cyB0aGF0IHdlcmUgcmVjZW50bHkNCj4g
dGFrZW4gaW50byB0aGUga2VybmVsLg0KPiANCj4gRmluYWxseSwgdGhlIFJETUEgc3Vic3lzdGVt
IGlzIGluIHRoZSBwcm9jZXNzIG9mIHN3aXRjaGluZyB0byBzdHJ1Y3R1cmVkDQo+IG9wdGlvbiBw
cm9jZXNzaW5nIHNpbWlsYXIgdG8gaG93IG5ldGxpbmsgZG9lcyBpdC4gwqBGb3IgdmVyc2lvbiAy
IG9mIHRoaXMNCj4gcHJvdG9jb2wsIHNpbmNlIGl0IHdpbGwgYmUgaW50ZXJhY3Rpbmcgd2l0aCB0
aGUgUkRNQSBjb3JlIGluIG1hbnkgd2F5cywgd2lsbA0KPiBiZSBzaW1wbGVyIGlmIGl0IHN3aXRj
aGVzIHRoZSBvbi13aXJlIG5lZ290aWF0aW9uIHBhY2tldHMgdG8gZm9sbG93IHRoZSBzYW1lDQo+
IG1ldGhvZHMgYXMgdGhhdCB3aWxsIGFsbG93IHJldXNlIG9mIGNvZGUgdGhhdCBpdCB3aWxsIGhh
dmUgdG8gaGF2ZSBmb3INCj4gcHJvcGVybHkgZGVhbGluZyB3aXRoIHRoZSBSRE1BIHN1YnN5c3Rl
bSBpbiB0aGUgZnV0dXJlLg0KPiANCj4gU28sIEknbSBmaW5lIHdpdGggaXQgYmVpbmcgbGVmdCBh
cyBpcyBzaW5jZSB5b3UgYWNjZXB0ZWQgdGhlIHBhdGNoIHRoYXQgbWFrZXMNCj4gbm90ZSBvZiB0
aGUgbWVtb3J5IHJlZ2lzdHJhdGlvbiBpbnNlY3VyaXR5IGluIHRoZSBLY29uZmlnIHRleHQuDQo+
IMKgVGhlIGZhY3QgdGhhdCB0aGlzIGlzIGEgdmVyc2lvbmVkIHByb3RvY29sIG1lYW5zIHRoYXQg
d2UgY2FuIGZpeCB0aGUgdGhpbmdzDQo+IHdlIHNlZSBhcyBiZWluZyB3cm9uZyB3aXRob3V0IGhh
dmluZyB0byBoYXZlIGl0IGFsbCByaWdodCBmcm9tIHRoZSB2ZXJ5DQo+IHN0YXJ0LCBpdCBjYW4g
YmUgZG9uZSBpbmNyZW1lbnRhbGx5Lg0KPiANCj4gLS0NCj4gRG91ZyBMZWRmb3JkIDxkbGVkZm9y
ZEByZWRoYXQuY29tPg0KPiDCoCDCoCBHUEcgS2V5SUQ6IEI4MjZBMzMzMEU1NzJGREQNCj4gDQo+
IEtleSBmaW5nZXJwcmludCA9IEFFNkIgMUJEQSAxMjJCIDIzQjQgMjY1QiDCoDEyNzQgQjgyNiBB
MzMzIDBFNTcgMkZERA0KPiANCj4gLS0NCj4gVG8gdW5zdWJzY3JpYmUgZnJvbSB0aGlzIGxpc3Q6
IHNlbmQgdGhlIGxpbmUgInVuc3Vic2NyaWJlIGxpbnV4LXJkbWEiIGluIHRoZQ0KPiBib2R5IG9m
IGEgbWVzc2FnZSB0byBtYWpvcmRvbW9Admdlci5rZXJuZWwub3JnIE1vcmUgbWFqb3Jkb21vIGlu
Zm8NCj4gYXQgIGh0dHA6Ly92Z2VyLmtlcm5lbC5vcmcvbWFqb3Jkb21vLWluZm8uaHRtbA0K
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Doug Ledford May 18, 2017, 12:07 a.m. UTC | #26

On 5/17/2017 6:37 PM, Parav Pandit wrote:
> Hi Doug,
> 

>> It would have been better with AF_INET/AF_INET6 and an option to enable
>> SMC than AF_SMC.  The first implementation simply assumes AF_INET in
>> the presence of AF_SMC.  When IPv6 support is added, some sort of
>> guessing logic will have to be put in place to try and determine if an
>> AF_SMC address is actually AF_INET or AF_INET6 since we won't have a
>> guaranteed way of telling.  Apps can use struct sockaddr_storage as their
>> normal element to stick the address into, and could rely on the kernel to
>> interpret it properly based on the AF_INET/AF_INET6 differentiation, and
>> this breaks that.  The RFC gives *some* thought to adding IPv6 in the
>> future, but not a lot.  It may be that the answer is that in the future, IPv6
>> support is enabled by making the IPv6 API be
>> AF_INET6 + setsockopt(SMC) or the equivalent.  If that's the case, then I
>> would suggest making the later API specifically call out AF_INET +
>> setsockopt(SMC) be identical to AF_SMC.
>>
> 
> What are the shortcomings in my proposal [1] which I am reiterating below.
> Bart also suggested to define new stream protocol for SMC similar to SCTP.
> 
> (a) address family should be either AF_INET or AF_INET6
> (b) socket() API to introduce new protocol as PROTO_SMC for in socket() system call.
> 
> With this there is no additional setsockop() needed.
> 
> With this - user space applications, do getaddrinfo() with hint as
> hints.ai_family  = AF_INET;
> hints.ai_socktype = SOCK_STREAM;
> 
> getaddrinfo() returns back both the protocols TCP and SMC.
> Famous database application such as Redis client iterates over all entries of getaddrinfo() and establishes connection to servers.
> 
> There are few advantages of this interface.
> 1. No change in any makefile of applications needed - unless user wants to specify explicitly that it wants to force SMC protocol.
> 2. No need to do LD_LIBRARY_PRELOAD, (which won't work anyway because bind() connect() of SMC checks for AF_INET).
> 3. No major changes to glibc to process AF_SMC differently
> glibc references IPPROTO_TCP at 22 places. (compare to AF_INET at 140+ places).
> A lot simpler test matrix for glibc for new protocol
> 5. No need to recompile applications, as long as getaddrinfo returns all streaming protocols (TCP, SMC)
> 6. for applications to make use of setsockopt() it needs another knob and hint from other places, which can be avoided because SMC TCP option negotiates with remote end
> 
> And representing new protocol as new protocol for a given address family appears correct, compare to setting socket options.
> 
> Tools like CRIU or similar tool might find a race conditions - when it queries socket option, SMC was not set, but later on SMC was set, and does incorrect handling.
> Setting socket() with SMC protocol makes it easier to understand in this area as well.

I have no problem with the proposal in itself, but as IBM released this
publication and did their own implementation prior to submitting things
upstream, and as there might exist in the field implementations, it
depends on whether we wish to call those in the field implementations
experimental and break them as we go to a final implementation of
version 1, or if we consider version 1 baked.  I'm fine breaking it.
After all, that's what happened with XRC back in the day and Mellanox
learned a valuable lesson about upstream first.  I have no problem with
IBM learning that same lesson IMO.  So, I find your proposal, including
breaking the API of the version 1 implementation just taken into the
kernel before it has had time to fully sit and gel, acceptable.

But this is where we kind of need a judgment from on high, and why I
Cc:ed Linus on this thread.  Any input on this issue Linus?

> I have additional proposal for link groups, resource creation area. I will take that up after this discussion.

Look forward to hearing it.

> [1] https://patchwork.kernel.org/patch/9719375/

Leon Romanovsky May 18, 2017, 4:22 a.m. UTC | #27

On Wed, May 17, 2017 at 08:07:00PM -0400, Doug Ledford wrote:
> On 5/17/2017 6:37 PM, Parav Pandit wrote:
> > Hi Doug,
> >
>
> >> It would have been better with AF_INET/AF_INET6 and an option to enable
> >> SMC than AF_SMC.  The first implementation simply assumes AF_INET in
> >> the presence of AF_SMC.  When IPv6 support is added, some sort of
> >> guessing logic will have to be put in place to try and determine if an
> >> AF_SMC address is actually AF_INET or AF_INET6 since we won't have a
> >> guaranteed way of telling.  Apps can use struct sockaddr_storage as their
> >> normal element to stick the address into, and could rely on the kernel to
> >> interpret it properly based on the AF_INET/AF_INET6 differentiation, and
> >> this breaks that.  The RFC gives *some* thought to adding IPv6 in the
> >> future, but not a lot.  It may be that the answer is that in the future, IPv6
> >> support is enabled by making the IPv6 API be
> >> AF_INET6 + setsockopt(SMC) or the equivalent.  If that's the case, then I
> >> would suggest making the later API specifically call out AF_INET +
> >> setsockopt(SMC) be identical to AF_SMC.
> >>
> >
> > What are the shortcomings in my proposal [1] which I am reiterating below.
> > Bart also suggested to define new stream protocol for SMC similar to SCTP.
> >
> > (a) address family should be either AF_INET or AF_INET6
> > (b) socket() API to introduce new protocol as PROTO_SMC for in socket() system call.
> >
> > With this there is no additional setsockop() needed.
> >
> > With this - user space applications, do getaddrinfo() with hint as
> > hints.ai_family  = AF_INET;
> > hints.ai_socktype = SOCK_STREAM;
> >
> > getaddrinfo() returns back both the protocols TCP and SMC.
> > Famous database application such as Redis client iterates over all entries of getaddrinfo() and establishes connection to servers.
> >
> > There are few advantages of this interface.
> > 1. No change in any makefile of applications needed - unless user wants to specify explicitly that it wants to force SMC protocol.
> > 2. No need to do LD_LIBRARY_PRELOAD, (which won't work anyway because bind() connect() of SMC checks for AF_INET).
> > 3. No major changes to glibc to process AF_SMC differently
> > glibc references IPPROTO_TCP at 22 places. (compare to AF_INET at 140+ places).
> > A lot simpler test matrix for glibc for new protocol
> > 5. No need to recompile applications, as long as getaddrinfo returns all streaming protocols (TCP, SMC)
> > 6. for applications to make use of setsockopt() it needs another knob and hint from other places, which can be avoided because SMC TCP option negotiates with remote end
> >
> > And representing new protocol as new protocol for a given address family appears correct, compare to setting socket options.
> >
> > Tools like CRIU or similar tool might find a race conditions - when it queries socket option, SMC was not set, but later on SMC was set, and does incorrect handling.
> > Setting socket() with SMC protocol makes it easier to understand in this area as well.
>
> I have no problem with the proposal in itself, but as IBM released this
> publication and did their own implementation prior to submitting things
> upstream, and as there might exist in the field implementations, it
> depends on whether we wish to call those in the field implementations
> experimental and break them as we go to a final implementation of
> version 1, or if we consider version 1 baked.  I'm fine breaking it.
> After all, that's what happened with XRC back in the day and Mellanox
> learned a valuable lesson about upstream first.  I have no problem with
> IBM learning that same lesson IMO.  So, I find your proposal, including
> breaking the API of the version 1 implementation just taken into the
> kernel before it has had time to fully sit and gel, acceptable.

Doug,
I am amused that after your excellent summary you are not calling to
this v1 of protocol in its real word: BROKEN.

Current implementation and RFC are DOA (dead-on-arrival) and are not
usable for ALL RDMA key players (Mellanox, Intel, Chelsio, ...) and
ALL RDMA major users. It doesn't fit into existing infrastructure
(DB, HPC, glibc, e.t.c.) and doesn't even fulfill the expected from
cover letter (doesn't work with LD_PRELOAD_*).

The fact that IBM implemented it and used it doesn't mean that they
can't continue to leave it with out-of-tree solution for a little
bit more time till usable version will come.

>
> But this is where we kind of need a judgment from on high, and why I
> Cc:ed Linus on this thread.  Any input on this issue Linus?

It should be dropped/moved_to_staging as soon as possible, before
UAPI started to spread.

>
> > I have additional proposal for link groups, resource creation area. I will take that up after this discussion.
>
> Look forward to hearing it.
>
> > [1] https://patchwork.kernel.org/patch/9719375/
>
>
> --
> Doug Ledford <dledford@redhat.com>
>     GPG Key ID: B826A3330E572FDD
>     Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD
>

net/smc: mark as BROKEN due to remote memory exposure

Commit Message

Comments

Patch