[nfs-utils,v2,05/12] getport: recognize "vsock" netid

Message ID 20170630132120.31578-6-stefanha@redhat.com (mailing list archive)
State New, archived

Commit Message

Stefan Hajnoczi June 30, 2017, 1:21 p.m. UTC
Neither libtirpc nor getprotobyname(3) know about AF_VSOCK.  For similar
reasons as for "rdma"/"rdma6", translate "vsock" manually in getport.c.

It is now possible to mount a file system from the host (hypervisor)
over AF_VSOCK like this:

  (guest)$ mount.nfs 2:/export /mnt -v -o clientaddr=3,proto=vsock

The VM's cid address is 3 and the hypervisor is 2.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 support/nfs/getport.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

Comments

Steve Dickson June 30, 2017, 3:01 p.m. UTC | #1
On 06/30/2017 09:21 AM, Stefan Hajnoczi wrote:
> Neither libtirpc nor getprotobyname(3) know about AF_VSOCK.  For similar
> reasons as for "rdma"/"rdma6", translate "vsock" manually in getport.c.
> 
> It is now possible to mount a file system from the host (hypervisor)
> over AF_VSOCK like this:
> 
>   (guest)$ mount.nfs 2:/export /mnt -v -o clientaddr=3,proto=vsock
> 
> The VM's cid address is 3 and the hypervisor is 2.
So this is how vsocks are going to look... 
There is not going to be a way to look up a vsock address?
Since the format of the clientaddr parameter is new, shouldn't
that be documented in the man page?

I guess a general question, is this new mount type
documented anywhere? 

steved.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  support/nfs/getport.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/support/nfs/getport.c b/support/nfs/getport.c
> index 081594c..0b857af 100644
> --- a/support/nfs/getport.c
> +++ b/support/nfs/getport.c
> @@ -217,8 +217,7 @@ nfs_get_proto(const char *netid, sa_family_t *family, unsigned long *protocol)
>  	struct protoent *proto;
>  
>  	/*
> -	 * IANA does not define a protocol number for rdma netids,
> -	 * since "rdma" is not an IP protocol.
> +	 * IANA does not define protocol numbers for non-IP netids.
>  	 */
>  	if (strcmp(netid, "rdma") == 0) {
>  		*family = AF_INET;
> @@ -230,6 +229,11 @@ nfs_get_proto(const char *netid, sa_family_t *family, unsigned long *protocol)
>  		*protocol = NFSPROTO_RDMA;
>  		return 1;
>  	}
> +	if (strcmp(netid, "vsock") == 0) {
> +		*family = AF_VSOCK;
> +		*protocol = 0;
> +		return 1;
> +	}
>  
>  	nconf = getnetconfigent(netid);
>  	if (nconf == NULL)
> @@ -258,14 +262,18 @@ nfs_get_proto(const char *netid, sa_family_t *family, unsigned long *protocol)
>  	struct protoent *proto;
>  
>  	/*
> -	 * IANA does not define a protocol number for rdma netids,
> -	 * since "rdma" is not an IP protocol.
> +	 * IANA does not define protocol numbers for non-IP netids.
>  	 */
>  	if (strcmp(netid, "rdma") == 0) {
>  		*family = AF_INET;
>  		*protocol = NFSPROTO_RDMA;
>  		return 1;
>  	}
> +	if (strcmp(netid, "vsock") == 0) {
> +		*family = AF_VSOCK;
> +		*protocol = 0;
> +		return 1;
> +	}
>  
>  	proto = getprotobyname(netid);
>  	if (proto == NULL)
> 
Chuck Lever June 30, 2017, 3:52 p.m. UTC | #2
Hi Stefan-


> On Jun 30, 2017, at 9:21 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> Neither libtirpc nor getprotobyname(3) know about AF_VSOCK.

Why?

Basically you are building a lot of specialized
awareness in applications and leaving the
network layer alone. That seems backwards to me.


> For similar
> reasons as for "rdma"/"rdma6", translate "vsock" manually in getport.c.

rdma/rdma6 are specified by standards, and appear
in the IANA Network Identifiers database:

https://www.iana.org/assignments/rpc-netids/rpc-netids.xhtml

Is there a standard netid for vsock? If not,
there needs to be some discussion with the nfsv4
Working Group to get this worked out.

Because AF_VSOCK is an address family and the RPC
framing is the same as TCP, the netid should be
something like "tcpv" and not "vsock". I've
complained about this before and there has been
no response of any kind.

I'll note that rdma/rdma6 do not use alternate
address families: an IP address is specified and
mapped to a GUID by the underlying transport.
We purposely did not expose GUIDs to NFS, which
is based on AF_INET/AF_INET6.

rdma co-exists with IP. vsock doesn't have this
fallback.

It might be a better approach to use well-known
(say, link-local or loopback) addresses and let
the underlying network layer figure it out.

Then hide all this stuff with DNS and let the
client mount the server by hostname and use
normal sockaddr's and "proto=tcp". Then you don't
need _any_ application layer changes.

Without hostnames, how does a client pick a
Kerberos service principal for the server?

Does rpcbind implement "vsock" netids?

Does the NFSv4.0 client advertise "vsock" in
SETCLIENTID, and provide a "vsock" callback
service?


> It is now possible to mount a file system from the host (hypervisor)
> over AF_VSOCK like this:
> 
>  (guest)$ mount.nfs 2:/export /mnt -v -o clientaddr=3,proto=vsock
> 
> The VM's cid address is 3 and the hypervisor is 2.

The mount command is supposed to supply "clientaddr"
automatically. This mount option is exposed only for
debugging purposes or very special cases (like
disabling NFSv4 callback operations).

I mean the whole point of this exercise is to get
rid of network configuration, but here you're
adding the need to additionally specify both the
proto option and the clientaddr option to get this
to work. Seems like that isn't zero-configuration
at all.

Wouldn't it be nicer if it worked like this:

(guest)$ cat /etc/hosts
129.0.0.2  localhyper
(guest)$ mount.nfs localhyper:/export /mnt

And the result was a working NFS mount of the
local hypervisor, using whatever NFS version the
two both support, with no changes needed to the
NFS implementation or the understanding of the
system administrator?


> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
> support/nfs/getport.c | 16 ++++++++++++----
> 1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/support/nfs/getport.c b/support/nfs/getport.c
> index 081594c..0b857af 100644
> --- a/support/nfs/getport.c
> +++ b/support/nfs/getport.c
> @@ -217,8 +217,7 @@ nfs_get_proto(const char *netid, sa_family_t *family, unsigned long *protocol)
> 	struct protoent *proto;
> 
> 	/*
> -	 * IANA does not define a protocol number for rdma netids,
> -	 * since "rdma" is not an IP protocol.
> +	 * IANA does not define protocol numbers for non-IP netids.
> 	 */
> 	if (strcmp(netid, "rdma") == 0) {
> 		*family = AF_INET;
> @@ -230,6 +229,11 @@ nfs_get_proto(const char *netid, sa_family_t *family, unsigned long *protocol)
> 		*protocol = NFSPROTO_RDMA;
> 		return 1;
> 	}
> +	if (strcmp(netid, "vsock") == 0) {
> +		*family = AF_VSOCK;
> +		*protocol = 0;
> +		return 1;
> +	}
> 
> 	nconf = getnetconfigent(netid);
> 	if (nconf == NULL)
> @@ -258,14 +262,18 @@ nfs_get_proto(const char *netid, sa_family_t *family, unsigned long *protocol)
> 	struct protoent *proto;
> 
> 	/*
> -	 * IANA does not define a protocol number for rdma netids,
> -	 * since "rdma" is not an IP protocol.
> +	 * IANA does not define protocol numbers for non-IP netids.
> 	 */
> 	if (strcmp(netid, "rdma") == 0) {
> 		*family = AF_INET;
> 		*protocol = NFSPROTO_RDMA;
> 		return 1;
> 	}
> +	if (strcmp(netid, "vsock") == 0) {
> +		*family = AF_VSOCK;
> +		*protocol = 0;
> +		return 1;
> +	}
> 
> 	proto = getprotobyname(netid);
> 	if (proto == NULL)
> -- 
> 2.9.4
> 

--
Chuck Lever



NeilBrown July 7, 2017, 3:17 a.m. UTC | #3
On Fri, Jun 30 2017, Chuck Lever wrote:
>
> Wouldn't it be nicer if it worked like this:
>
> (guest)$ cat /etc/hosts
> 129.0.0.2  localhyper
> (guest)$ mount.nfs localhyper:/export /mnt
>
> And the result was a working NFS mount of the
> local hypervisor, using whatever NFS version the
> two both support, with no changes needed to the
> NFS implementation or the understanding of the
> system administrator?

Yes. Yes. Definitely Yes.
Though I suspect you mean "127.0.0.2", not "129..."??

There must be some way to redirect TCP connections to some address
transparently through to the vsock protocol.
The "sshuttle" program does this to transparently forward TCP connections
over an ssh connection.  Using a similar technique to forward
connections over vsock shouldn't be hard.

Or is performance really critical, and you get too much copying when you
try forwarding connections?  I suspect that is fixable, but it would be
a little less straight forward.

I would really *not* like to see vsock support being bolted into one
network tool after another.

NeilBrown
NeilBrown July 7, 2017, 4:13 a.m. UTC | #4
On Fri, Jul 07 2017, NeilBrown wrote:

> On Fri, Jun 30 2017, Chuck Lever wrote:
>>
>> Wouldn't it be nicer if it worked like this:
>>
>> (guest)$ cat /etc/hosts
>> 129.0.0.2  localhyper
>> (guest)$ mount.nfs localhyper:/export /mnt
>>
>> And the result was a working NFS mount of the
>> local hypervisor, using whatever NFS version the
>> two both support, with no changes needed to the
>> NFS implementation or the understanding of the
>> system administrator?
>
> Yes. Yes. Definitely Yes.
> Though I suspect you mean "127.0.0.2", not "129..."??
>
> There must be some way to redirect TCP connections to some address
> transparently through to the vsock protocol.
> The "sshuttle" program does this to transparently forward TCP connections
> over an ssh connection.  Using a similar technique to forward
> connections over vsock shouldn't be hard.
>
> Or is performance really critical, and you get too much copying when you
> try forwarding connections?  I suspect that is fixable, but it would be
> a little less straight forward.
>
> I would really *not* like to see vsock support being bolted into one
> network tool after another.

I've been digging into this a bit more.  I came across
  https://vmsplice.net/~stefan/stefanha-kvm-forum-2015.pdf

which (on page 7) lists some reasons not to use TCP/IP between guest
and host.

 . Adding & configuring guest interfaces is invasive

That is possibly true.  But adding support for a new address family to
NFS, NFSD, and nfs-utils is also very invasive.  You would need to
install this software on the guest.  I suggest you install different
software on the guest which solves the problem better.

 . Prone to break due to config changes inside guest

This is, I suspect, a key issue.  With vsock, the address of the
guest-side interface is defined by options passed to qemu.  With
normal IP addressing, the guest has to configure the address.

However I think that IPv6 autoconfig makes this work well without vsock.
If I create a bridge interface on the host, run
  ip -6 addr  add fe80::1 dev br0
then run a guest with
   -net nic,macaddr=Ch:oo:se:an:ad:dr \
   -net bridge,br=br0 \

then the client can
  mount [fe80::1%interfacename]:/path /mountpoint

and the host will see a connection from
   fe80::ch:oo:se:an:ad:dr

So from the guest side, I have achieved zero-config NFS mounts from the
host.

I don't think the server can filter connections based on which interface
a link-local address came from.  If that was a problem that someone
wanted to be fixed, I'm sure we can fix it.

If you need to be sure that clients don't fake their IPv6 address, I'm
sure netfilter is up to the task.


 . Creates network interfaces on host that must be managed

What vsock does is effectively create a hidden interface on the host that only the
kernel knows about and so the sysadmin cannot break it.  The only
difference between this and an explicit interface on the host is that
the latter requires a competent sysadmin.

If you have other reasons for preferring the use of vsock for NFS, I'd be
happy to hear them.  So far I'm not convinced.

Thanks,
NeilBrown
Chuck Lever July 7, 2017, 4:14 a.m. UTC | #5
> On Jul 6, 2017, at 11:17 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Fri, Jun 30 2017, Chuck Lever wrote:
>> 
>> Wouldn't it be nicer if it worked like this:
>> 
>> (guest)$ cat /etc/hosts
>> 129.0.0.2  localhyper
>> (guest)$ mount.nfs localhyper:/export /mnt
>> 
>> And the result was a working NFS mount of the
>> local hypervisor, using whatever NFS version the
>> two both support, with no changes needed to the
>> NFS implementation or the understanding of the
>> system administrator?
> 
> Yes. Yes. Definitely Yes.
> Though I suspect you mean "127.0.0.2", not "129..."??

I meant 129.x.  127.0.0 has well-defined semantics as a
loopback to the same host. The hypervisor is clearly a
network entity that is distinct from the local host.

But maybe you could set up 127.0.0.2, .3 for this purpose?
Someone smarter than me could figure out what is best to
use here. I'm not familiar with all the rules for loopback
and link-local IPv4 addressing.

Loopback is the correct analogy, though. It has predictable
host numbers that can be known in advance, and loopback
networking is set up automatically on a host, without the
need for a physical network interface. These are the stated
goals for vsock.

The benefit for re-using loopback here is that every
application that can speak AF_INET can already use it. For
NFS that means all the traditional features work: rpcbind,
NFSv4.0 callback, IP-based share access control, and Kerberos,
and especially DNS so that you can mount by hostname.


> There must be some way to redirect TCP connections to some address
> transparently through to the vsock protocol.
> The "sshuttle" program does this to transparently forward TCP connections
> over an ssh connection.  Using a similar technique to forward
> connections over vsock shouldn't be hard.
> 
> Or is performance really critical, and you get too much copying when you
> try forwarding connections?  I suspect that is fixable, but it would be
> a little less straight forward.
> 
> I would really *not* like to see vsock support being bolted into one
> network tool after another.


--
Chuck Lever



Stefan Hajnoczi July 10, 2017, 6:35 p.m. UTC | #6
On Fri, Jun 30, 2017 at 11:01:13AM -0400, Steve Dickson wrote:
> On 06/30/2017 09:21 AM, Stefan Hajnoczi wrote:
> > Neither libtirpc nor getprotobyname(3) know about AF_VSOCK.  For similar
> > reasons as for "rdma"/"rdma6", translate "vsock" manually in getport.c.
> > 
> > It is now possible to mount a file system from the host (hypervisor)
> > over AF_VSOCK like this:
> > 
> >   (guest)$ mount.nfs 2:/export /mnt -v -o clientaddr=3,proto=vsock
> > 
> > The VM's cid address is 3 and the hypervisor is 2.
> So this is how vsocks are going to look... 
> There is not going to be a way to look up a vsock address?
> Since the format of the clientaddr parameter is new, shouldn't
> that be documented in the man page?

AF_VSOCK does not have name resolution.  The scope of the CID addresses
is just the hypervisor that the VMs are running on.  Inter-VM
communication is not allowed.  The virtualization software has the CIDs
so there's not much use for name resolution.
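
(For concreteness, and not from the patch series: an AF_VSOCK client simply
connects to a fixed (CID, port) pair, so there is nothing to resolve.  A
minimal sketch, with port 2049 chosen only as an example:)

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/socket.h>
  #include <linux/vm_sockets.h>  /* struct sockaddr_vm, VMADDR_CID_HOST */

  #ifndef AF_VSOCK
  #define AF_VSOCK 40            /* Linux value; older libcs may lack it */
  #endif

  int main(void)
  {
      struct sockaddr_vm svm;
      int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

      if (fd < 0) {
          perror("socket");
          return 1;
      }

      memset(&svm, 0, sizeof(svm));
      svm.svm_family = AF_VSOCK;
      svm.svm_cid = VMADDR_CID_HOST;  /* the hypervisor, CID 2 */
      svm.svm_port = 2049;            /* vsock ports are 32-bit */

      if (connect(fd, (struct sockaddr *)&svm, sizeof(svm)) < 0) {
          perror("connect");
          return 1;
      }
      close(fd);
      return 0;
  }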

> I guess a general question, is this new mount type
> documented anywhere? 

Thanks for pointing this out.  I'll update the man pages in the next
revision of this patch series.
Stefan Hajnoczi July 19, 2017, 3:11 p.m. UTC | #7
On Fri, Jun 30, 2017 at 11:52:15AM -0400, Chuck Lever wrote:
> > On Jun 30, 2017, at 9:21 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > Neither libtirpc nor getprotobyname(3) know about AF_VSOCK.
> 
> Why?
> 
> Basically you are building a lot of specialized
> awareness in applications and leaving the
> network layer alone. That seems backwards to me.

Yes.  I posted glibc patches but there were concerns that getaddrinfo(3)
is IPv4/IPv6 only and applications need to be ported to AF_VSOCK anyway,
so there's not much to gain by adding it:
https://cygwin.com/ml/libc-alpha/2016-10/msg00126.html

> > For similar
> > reasons as for "rdma"/"rdma6", translate "vsock" manually in getport.c.
> 
> rdma/rdma6 are specified by standards, and appear
> in the IANA Network Identifiers database:
> 
> https://www.iana.org/assignments/rpc-netids/rpc-netids.xhtml
> 
> Is there a standard netid for vsock? If not,
> there needs to be some discussion with the nfsv4
> Working Group to get this worked out.
>
> Because AF_VSOCK is an address family and the RPC
> framing is the same as TCP, the netid should be
> something like "tcpv" and not "vsock". I've
> complained about this before and there has been
> no response of any kind.
> 
> I'll note that rdma/rdma6 do not use alternate
> address families: an IP address is specified and
> mapped to a GUID by the underlying transport.
> We purposely did not expose GUIDs to NFS, which
> is based on AF_INET/AF_INET6.
> 
> rdma co-exists with IP. vsock doesn't have this
> fallback.

Thanks for explaining the tcp + rdma relationship, that makes sense.

There is no standard netid for vsock yet.

Sorry I didn't ask about "tcpv" when you originally proposed it, I lost
track of that discussion.  You said:

  If this really is just TCP on a new address family, then "tcpv"
  is more in line with previous work, and you can get away with
  just an IANA action for a new netid, since RPC-over-TCP is
  already specified.

Does "just TCP" mean a "connection-oriented, stream-oriented transport
using RFC 1831 Record Marking"?  Or does "TCP" have any other
attributes?

NFS over AF_VSOCK definitely is "connection-oriented, stream-oriented
transport using RFC 1831 Record Marking".  I'm just not sure whether
there are any other assumptions beyond this that AF_VSOCK might not meet
because it isn't IP and has 32-bit port numbers.
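
(For reference, the record marking in RFC 1831 section 10 is just a 4-byte
header on each fragment: the top bit flags the last fragment and the low 31
bits carry the fragment length.  It layers onto any connection-oriented byte
stream, AF_VSOCK included.  A sketch, not taken from any of the patches:)

  #include <stdint.h>
  #include <sys/types.h>
  #include <sys/uio.h>    /* writev */
  #include <arpa/inet.h>  /* htonl */

  /* Send one RPC message as a single RFC 1831 fragment: a 4-byte record
   * mark (last-fragment bit | length) followed by the message bytes. */
  static ssize_t write_rpc_record(int fd, const void *msg, uint32_t len)
  {
      uint32_t mark = htonl(0x80000000u | (len & 0x7fffffffu));
      struct iovec iov[2] = {
          { .iov_base = &mark,       .iov_len = sizeof(mark) },
          { .iov_base = (void *)msg, .iov_len = len },
      };

      return writev(fd, iov, 2);
  }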

> It might be a better approach to use well-known
> (say, link-local or loopback) addresses and let
> the underlying network layer figure it out.
> 
> Then hide all this stuff with DNS and let the
> client mount the server by hostname and use
> normal sockaddr's and "proto=tcp". Then you don't
> need _any_ application layer changes.
> 
> Without hostnames, how does a client pick a
> Kerberos service principal for the server?

I'm not sure Kerberos would be used with AF_VSOCK.  The hypervisor knows
about the VMs, addresses cannot be spoofed, and VMs can only communicate
with the hypervisor.  This leads to a simple trust relationship.

> Does rpcbind implement "vsock" netids?

I have not modified rpcbind.  My understanding is that rpcbind isn't
required for NFSv4.  Since this is a new transport there is no plan for
it to run old protocol versions.

> Does the NFSv4.0 client advertise "vsock" in
> SETCLIENTID, and provide a "vsock" callback
> service?

The kernel patches implement backchannel support although I haven't
exercised it.

> > It is now possible to mount a file system from the host (hypervisor)
> > over AF_VSOCK like this:
> > 
> >  (guest)$ mount.nfs 2:/export /mnt -v -o clientaddr=3,proto=vsock
> > 
> > The VM's cid address is 3 and the hypervisor is 2.
> 
> The mount command is supposed to supply "clientaddr"
> automatically. This mount option is exposed only for
> debugging purposes or very special cases (like
> disabling NFSv4 callback operations).
> 
> I mean the whole point of this exercise is to get
> rid of network configuration, but here you're
> adding the need to additionally specify both the
> proto option and the clientaddr option to get this
> to work. Seems like that isn't zero-configuration
> at all.

Thanks for pointing this out.  Will fix in v2, there should be no need
to manually specify the client address, this is a remnant from early
development.

> Wouldn't it be nicer if it worked like this:
> 
> (guest)$ cat /etc/hosts
> 129.0.0.2  localhyper
> (guest)$ mount.nfs localhyper:/export /mnt
> 
> And the result was a working NFS mount of the
> local hypervisor, using whatever NFS version the
> two both support, with no changes needed to the
> NFS implementation or the understanding of the
> system administrator?

This is an interesting idea, thanks!  It would be neat to have AF_INET
access over the loopback interface on both guest and host.
Jeff Layton July 19, 2017, 3:35 p.m. UTC | #8
On Wed, 2017-07-19 at 16:11 +0100, Stefan Hajnoczi wrote:
> On Fri, Jun 30, 2017 at 11:52:15AM -0400, Chuck Lever wrote:
> > > On Jun 30, 2017, at 9:21 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > 
> > > Neither libtirpc nor getprotobyname(3) know about AF_VSOCK.
> > 
> > Why?
> > 
> > Basically you are building a lot of specialized
> > awareness in applications and leaving the
> > network layer alone. That seems backwards to me.
> 
> Yes.  I posted glibc patches but there were concerns that getaddrinfo(3)
> is IPv4/IPv6 only and applications need to be ported to AF_VSOCK anyway,
> so there's not much to gain by adding it:
> https://cygwin.com/ml/libc-alpha/2016-10/msg00126.html
> 
> > > For similar
> > > reasons as for "rdma"/"rdma6", translate "vsock" manually in getport.c.
> > 
> > rdma/rdma6 are specified by standards, and appear
> > in the IANA Network Identifiers database:
> > 
> > https://www.iana.org/assignments/rpc-netids/rpc-netids.xhtml
> > 
> > Is there a standard netid for vsock? If not,
> > there needs to be some discussion with the nfsv4
> > Working Group to get this worked out.
> > 
> > Because AF_VSOCK is an address family and the RPC
> > framing is the same as TCP, the netid should be
> > something like "tcpv" and not "vsock". I've
> > complained about this before and there has been
> > no response of any kind.
> > 
> > I'll note that rdma/rdma6 do not use alternate
> > address families: an IP address is specified and
> > mapped to a GUID by the underlying transport.
> > We purposely did not expose GUIDs to NFS, which
> > is based on AF_INET/AF_INET6.
> > 
> > rdma co-exists with IP. vsock doesn't have this
> > fallback.
> 
> Thanks for explaining the tcp + rdma relationship, that makes sense.
> 
> There is no standard netid for vsock yet.
> 
> Sorry I didn't ask about "tcpv" when you originally proposed it, I lost
> track of that discussion.  You said:
> 
>   If this really is just TCP on a new address family, then "tcpv"
>   is more in line with previous work, and you can get away with
>   just an IANA action for a new netid, since RPC-over-TCP is
>   already specified.
> 
> Does "just TCP" mean a "connection-oriented, stream-oriented transport
> using RFC 1831 Record Marking"?  Or does "TCP" have any other
> attributes?
> 
> NFS over AF_VSOCK definitely is "connection-oriented, stream-oriented
> transport using RFC 1831 Record Marking".  I'm just not sure whether
> there are any other assumptions beyond this that AF_VSOCK might not meet
> because it isn't IP and has 32-bit port numbers.
> 
> > It might be a better approach to use well-known
> > (say, link-local or loopback) addresses and let
> > the underlying network layer figure it out.
> > 
> > Then hide all this stuff with DNS and let the
> > client mount the server by hostname and use
> > normal sockaddr's and "proto=tcp". Then you don't
> > need _any_ application layer changes.
> > 
> > Without hostnames, how does a client pick a
> > Kerberos service principal for the server?
> 
> I'm not sure Kerberos would be used with AF_VSOCK.  The hypervisor knows
> about the VMs, addresses cannot be spoofed, and VMs can only communicate
> with the hypervisor.  This leads to a simple trust relationship.
> 
> > Does rpcbind implement "vsock" netids?
> 
> I have not modified rpcbind.  My understanding is that rpcbind isn't
> required for NFSv4.  Since this is a new transport there is no plan for
> it to run old protocol versions.
> 
> > Does the NFSv4.0 client advertise "vsock" in
> > SETCLIENTID, and provide a "vsock" callback
> > service?
> 
> The kernel patches implement backchannel support although I haven't
> exercised it.
> 
> > > It is now possible to mount a file system from the host (hypervisor)
> > > over AF_VSOCK like this:
> > > 
> > >  (guest)$ mount.nfs 2:/export /mnt -v -o clientaddr=3,proto=vsock
> > > 
> > > The VM's cid address is 3 and the hypervisor is 2.
> > 
> > The mount command is supposed to supply "clientaddr"
> > automatically. This mount option is exposed only for
> > debugging purposes or very special cases (like
> > disabling NFSv4 callback operations).
> > 
> > I mean the whole point of this exercise is to get
> > rid of network configuration, but here you're
> > adding the need to additionally specify both the
> > proto option and the clientaddr option to get this
> > to work. Seems like that isn't zero-configuration
> > at all.
> 
> Thanks for pointing this out.  Will fix in v2, there should be no need
> to manually specify the client address, this is a remnant from early
> development.
> 
> > Wouldn't it be nicer if it worked like this:
> > 
> > (guest)$ cat /etc/hosts
> > 129.0.0.2  localhyper
> > (guest)$ mount.nfs localhyper:/export /mnt
> > 
> > And the result was a working NFS mount of the
> > local hypervisor, using whatever NFS version the
> > two both support, with no changes needed to the
> > NFS implementation or the understanding of the
> > system administrator?
> 
> This is an interesting idea, thanks!  It would be neat to have AF_INET
> access over the loopback interface on both guest and host.

I too really like this idea better as it seems a lot less invasive.
Existing applications would "just work" without needing to be changed,
and you get name resolution to boot.

Chuck, is 129.0.0.X within some reserved block of addrs such that you
could get a standard range for this? I didn't see that block listed here
during my half-assed web search:

    https://en.wikipedia.org/wiki/Reserved_IP_addresses

Maybe you meant 192.0.0.X ? It might be easier and more future proof to
get a chunk of ipv6 addrs carved out though.
Chuck Lever July 19, 2017, 3:40 p.m. UTC | #9
> On Jul 19, 2017, at 17:35, Jeff Layton <jlayton@redhat.com> wrote:
> 
> On Wed, 2017-07-19 at 16:11 +0100, Stefan Hajnoczi wrote:
>> On Fri, Jun 30, 2017 at 11:52:15AM -0400, Chuck Lever wrote:
>>>> On Jun 30, 2017, at 9:21 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>> 
>>>> Neither libtirpc nor getprotobyname(3) know about AF_VSOCK.
>>> 
>>> Why?
>>> 
>>> Basically you are building a lot of specialized
>>> awareness in applications and leaving the
>>> network layer alone. That seems backwards to me.
>> 
>> Yes.  I posted glibc patches but there were concerns that getaddrinfo(3)
>> is IPv4/IPv6 only and applications need to be ported to AF_VSOCK anyway,
>> so there's not much to gain by adding it:
>> https://cygwin.com/ml/libc-alpha/2016-10/msg00126.html
>> 
>>>> For similar
>>>> reasons as for "rdma"/"rdma6", translate "vsock" manually in getport.c.
>>> 
>>> rdma/rdma6 are specified by standards, and appear
>>> in the IANA Network Identifiers database:
>>> 
>>> https://www.iana.org/assignments/rpc-netids/rpc-netids.xhtml
>>> 
>>> Is there a standard netid for vsock? If not,
>>> there needs to be some discussion with the nfsv4
>>> Working Group to get this worked out.
>>> 
>>> Because AF_VSOCK is an address family and the RPC
>>> framing is the same as TCP, the netid should be
>>> something like "tcpv" and not "vsock". I've
>>> complained about this before and there has been
>>> no response of any kind.
>>> 
>>> I'll note that rdma/rdma6 do not use alternate
>>> address families: an IP address is specified and
>>> mapped to a GUID by the underlying transport.
>>> We purposely did not expose GUIDs to NFS, which
>>> is based on AF_INET/AF_INET6.
>>> 
>>> rdma co-exists with IP. vsock doesn't have this
>>> fallback.
>> 
>> Thanks for explaining the tcp + rdma relationship, that makes sense.
>> 
>> There is no standard netid for vsock yet.
>> 
>> Sorry I didn't ask about "tcpv" when you originally proposed it, I lost
>> track of that discussion.  You said:
>> 
>>  If this really is just TCP on a new address family, then "tcpv"
>>  is more in line with previous work, and you can get away with
>>  just an IANA action for a new netid, since RPC-over-TCP is
>>  already specified.
>> 
>> Does "just TCP" mean a "connection-oriented, stream-oriented transport
>> using RFC 1831 Record Marking"?  Or does "TCP" have any other
>> attributes?
>> 
>> NFS over AF_VSOCK definitely is "connection-oriented, stream-oriented
>> transport using RFC 1831 Record Marking".  I'm just not sure whether
>> there are any other assumptions beyond this that AF_VSOCK might not meet
>> because it isn't IP and has 32-bit port numbers.
>> 
>>> It might be a better approach to use well-known
>>> (say, link-local or loopback) addresses and let
>>> the underlying network layer figure it out.
>>> 
>>> Then hide all this stuff with DNS and let the
>>> client mount the server by hostname and use
>>> normal sockaddr's and "proto=tcp". Then you don't
>>> need _any_ application layer changes.
>>> 
>>> Without hostnames, how does a client pick a
>>> Kerberos service principal for the server?
>> 
>> I'm not sure Kerberos would be used with AF_VSOCK.  The hypervisor knows
>> about the VMs, addresses cannot be spoofed, and VMs can only communicate
>> with the hypervisor.  This leads to a simple trust relationship.
>> 
>>> Does rpcbind implement "vsock" netids?
>> 
>> I have not modified rpcbind.  My understanding is that rpcbind isn't
>> required for NFSv4.  Since this is a new transport there is no plan for
>> it to run old protocol versions.
>> 
>>> Does the NFSv4.0 client advertise "vsock" in
>>> SETCLIENTID, and provide a "vsock" callback
>>> service?
>> 
>> The kernel patches implement backchannel support although I haven't
>> exercised it.
>> 
>>>> It is now possible to mount a file system from the host (hypervisor)
>>>> over AF_VSOCK like this:
>>>> 
>>>> (guest)$ mount.nfs 2:/export /mnt -v -o clientaddr=3,proto=vsock
>>>> 
>>>> The VM's cid address is 3 and the hypervisor is 2.
>>> 
>>> The mount command is supposed to supply "clientaddr"
>>> automatically. This mount option is exposed only for
>>> debugging purposes or very special cases (like
>>> disabling NFSv4 callback operations).
>>> 
>>> I mean the whole point of this exercise is to get
>>> rid of network configuration, but here you're
>>> adding the need to additionally specify both the
>>> proto option and the clientaddr option to get this
>>> to work. Seems like that isn't zero-configuration
>>> at all.
>> 
>> Thanks for pointing this out.  Will fix in v2, there should be no need
>> to manually specify the client address, this is a remnant from early
>> development.
>> 
>>> Wouldn't it be nicer if it worked like this:
>>> 
>>> (guest)$ cat /etc/hosts
>>> 129.0.0.2  localhyper
>>> (guest)$ mount.nfs localhyper:/export /mnt
>>> 
>>> And the result was a working NFS mount of the
>>> local hypervisor, using whatever NFS version the
>>> two both support, with no changes needed to the
>>> NFS implementation or the understanding of the
>>> system administrator?
>> 
>> This is an interesting idea, thanks!  It would be neat to have AF_INET
>> access over the loopback interface on both guest and host.
> 
> I too really like this idea better as it seems a lot less invasive.
> Existing applications would "just work" without needing to be changed,
> and you get name resolution to boot.
> 
> Chuck, is 129.0.0.X within some reserved block of addrs such that you
> could get a standard range for this? I didn't see that block listed here
> during my half-assed web search:
> 
>    https://en.wikipedia.org/wiki/Reserved_IP_addresses

I thought there would be some range of link-local addresses
that could make this work with IPv4, similar to 192. or 10.
that are "unroutable" site-local addresses.

If there isn't then IPv6 might have what we need.


> Maybe you meant 192.0.0.X ? It might be easier and more future proof to
> get a chunk of ipv6 addrs carved out though.


--
Chuck Lever



Chuck Lever July 19, 2017, 3:50 p.m. UTC | #10
> On Jul 19, 2017, at 17:11, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Fri, Jun 30, 2017 at 11:52:15AM -0400, Chuck Lever wrote:
>>> On Jun 30, 2017, at 9:21 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> 
>>> Neither libtirpc nor getprotobyname(3) know about AF_VSOCK.
>> 
>> Why?
>> 
>> Basically you are building a lot of specialized
>> awareness in applications and leaving the
>> network layer alone. That seems backwards to me.
> 
> Yes.  I posted glibc patches but there were concerns that getaddrinfo(3)
> is IPv4/IPv6 only and applications need to be ported to AF_VSOCK anyway,
> so there's not much to gain by adding it:
> https://cygwin.com/ml/libc-alpha/2016-10/msg00126.html
> 
>>> For similar
>>> reasons as for "rdma"/"rdma6", translate "vsock" manually in getport.c.
>> 
>> rdma/rdma6 are specified by standards, and appear
>> in the IANA Network Identifiers database:
>> 
>> https://www.iana.org/assignments/rpc-netids/rpc-netids.xhtml
>> 
>> Is there a standard netid for vsock? If not,
>> there needs to be some discussion with the nfsv4
>> Working Group to get this worked out.
>> 
>> Because AF_VSOCK is an address family and the RPC
>> framing is the same as TCP, the netid should be
>> something like "tcpv" and not "vsock". I've
>> complained about this before and there has been
>> no response of any kind.
>> 
>> I'll note that rdma/rdma6 do not use alternate
>> address families: an IP address is specified and
>> mapped to a GUID by the underlying transport.
>> We purposely did not expose GUIDs to NFS, which
>> is based on AF_INET/AF_INET6.
>> 
>> rdma co-exists with IP. vsock doesn't have this
>> fallback.
> 
> Thanks for explaining the tcp + rdma relationship, that makes sense.
> 
> There is no standard netid for vsock yet.
> 
> Sorry I didn't ask about "tcpv" when you originally proposed it, I lost
> track of that discussion.  You said:
> 
>  If this really is just TCP on a new address family, then "tcpv"
>  is more in line with previous work, and you can get away with
>  just an IANA action for a new netid, since RPC-over-TCP is
>  already specified.
> 
> Does "just TCP" mean a "connection-oriented, stream-oriented transport
> using RFC 1831 Record Marking"?  Or does "TCP" have any other
> attributes?
> 
> NFS over AF_VSOCK definitely is "connection-oriented, stream-oriented
> transport using RFC 1831 Record Marking".  I'm just not sure whether
> there are any other assumptions beyond this that AF_VSOCK might not meet
> because it isn't IP and has 32-bit port numbers.

Right, it is TCP in the sense that it is connection-oriented
and so on. It looks like a stream socket to the RPC client.
TI-RPC calls this "tpi_cots_ord".

But it isn't TCP in the sense that you aren't moving TCP
segments over the link.

I think the "IP / 32-bit ports" is handled entirely within
the address variant that your link is using.

>> It might be a better approach to use well-known
>> (say, link-local or loopback) addresses and let
>> the underlying network layer figure it out.
>> 
>> Then hide all this stuff with DNS and let the
>> client mount the server by hostname and use
>> normal sockaddr's and "proto=tcp". Then you don't
>> need _any_ application layer changes.
>> 
>> Without hostnames, how does a client pick a
>> Kerberos service principal for the server?
> 
> I'm not sure Kerberos would be used with AF_VSOCK.  The hypervisor knows
> about the VMs, addresses cannot be spoofed, and VMs can only communicate
> with the hypervisor.  This leads to a simple trust relationship.

The clients can be exploited if they are exposed in any
way to remote users. Having at least sec=krb5 might be
a way to block attackers from accessing data on the NFS
server from a compromised client.

In any event, NFSv4 will need ID mapping. Do you have a
sense of how the server and clients will determine their
NFSv4 ID mapping domain name? How will the server and
client user ID databases be kept in synchrony? You might
have some issues if there is a "cel" in multiple guests
that are actually different users.


>> Does rpcbind implement "vsock" netids?
> 
> I have not modified rpcbind.  My understanding is that rpcbind isn't
> required for NFSv4.  Since this is a new transport there is no plan for
> it to run old protocol versions.
> 
>> Does the NFSv4.0 client advertise "vsock" in
>> SETCLIENTID, and provide a "vsock" callback
>> service?
> 
> The kernel patches implement backchannel support although I haven't
> exercised it.
> 
>>> It is now possible to mount a file system from the host (hypervisor)
>>> over AF_VSOCK like this:
>>> 
>>> (guest)$ mount.nfs 2:/export /mnt -v -o clientaddr=3,proto=vsock
>>> 
>>> The VM's cid address is 3 and the hypervisor is 2.
>> 
>> The mount command is supposed to supply "clientaddr"
>> automatically. This mount option is exposed only for
>> debugging purposes or very special cases (like
>> disabling NFSv4 callback operations).
>> 
>> I mean the whole point of this exercise is to get
>> rid of network configuration, but here you're
>> adding the need to additionally specify both the
>> proto option and the clientaddr option to get this
>> to work. Seems like that isn't zero-configuration
>> at all.
> 
> Thanks for pointing this out.  Will fix in v2, there should be no need
> to manually specify the client address, this is a remnant from early
> development.
> 
>> Wouldn't it be nicer if it worked like this:
>> 
>> (guest)$ cat /etc/hosts
>> 129.0.0.2  localhyper
>> (guest)$ mount.nfs localhyper:/export /mnt
>> 
>> And the result was a working NFS mount of the
>> local hypervisor, using whatever NFS version the
>> two both support, with no changes needed to the
>> NFS implementation or the understanding of the
>> system administrator?
> 
> This is an interesting idea, thanks!  It would be neat to have AF_INET
> access over the loopback interface on both guest and host.

--
Chuck Lever



Stefan Hajnoczi July 25, 2017, 10:05 a.m. UTC | #11
On Fri, Jul 07, 2017 at 02:13:38PM +1000, NeilBrown wrote:
> On Fri, Jul 07 2017, NeilBrown wrote:
> 
> > On Fri, Jun 30 2017, Chuck Lever wrote:
> >>
> >> Wouldn't it be nicer if it worked like this:
> >>
> >> (guest)$ cat /etc/hosts
> >> 129.0.0.2  localhyper
> >> (guest)$ mount.nfs localhyper:/export /mnt
> >>
> >> And the result was a working NFS mount of the
> >> local hypervisor, using whatever NFS version the
> >> two both support, with no changes needed to the
> >> NFS implementation or the understanding of the
> >> system administrator?
> >
> > Yes. Yes. Definitely Yes.
> > Though I suspect you mean "127.0.0.2", not "129..."??
> >
> > There must be some way to redirect TCP connections to some address
> > transparently through to the vsock protocol.
> > The "sshuttle" program does this to transparently forward TCP connections
> > over an ssh connection.  Using a similar technique to forward
> > connections over vsock shouldn't be hard.
> >
> > Or is performance really critical, and you get too much copying when you
> > try forwarding connections?  I suspect that is fixable, but it would be
> > a little less straight forward.
> >
> > I would really *not* like to see vsock support being bolted into one
> > network tool after another.
> 
> I've been digging into this a bit more.  I came across
>   https://vmsplice.net/~stefan/stefanha-kvm-forum-2015.pdf
> 
> which (on page 7) lists some reasons not to use TCP/IP between guest
> and host.
> 
>  . Adding & configuring guest interfaces is invasive
> 
> That is possibly true.  But adding support for a new address family to
> NFS, NFSD, and nfs-utils is also very invasive.  You would need to
> install this software on the guest.  I suggest you install different
> software on the guest which solves the problem better.

Two different types of "invasive":
1. Requiring guest configuration changes that are likely to cause
   conflicts.
2. Requiring changes to the software stack.  Once installed there are no
   conflicts.

I'm interested and open to a different solution but it must avoid
invasive configuration changes, especially inside the guest.

>  . Prone to break due to config changes inside guest
> 
> This is, I suspect, a key issue.  With vsock, the address of the
> guest-side interface is defined by options passed to qemu.  With
> normal IP addressing, the guest has to configure the address.
> 
> However I think that IPv6 autoconfig makes this work well without vsock.
> If I create a bridge interface on the host, run
>   ip -6 addr  add fe80::1 dev br0
> then run a guest with
>    -net nic,macaddr=Ch:oo:se:an:ad:dr \
>    -net bridge,br=br0 \
> 
> then the client can
>   mount [fe80::1%interfacename]:/path /mountpoint
> 
> and the host will see a connection from
>    fe80::ch:oo:se:an:ad:dr
> 
> So from the guest side, I have achieved zero-config NFS mounts from the
> host.

It is not zero-configuration since [fe80::1%interfacename] contains a
variable, "interfacename", whose value is unknown ahead of time.  This
will make documentation as well as ability to share configuration
between VMs more difficult.  In other words, we're back to something
that requires per-guest configuration and doesn't just work everywhere.

> I don't think the server can filter connections based on which interface
> a link-local address came from.  If that was a problem that someone
> wanted to be fixed, I'm sure we can fix it.
> 
> If you need to be sure that clients don't fake their IPv6 address, I'm
> sure netfilter is up to the task.

Yes, it's common to prevent spoofing on the host using netfilter and I
think it wouldn't be a problem.

>  . Creates network interfaces on host that must be managed
> 
> What vsock does is effectively create a hidden interface on the host that only the
> kernel knows about and so the sysadmin cannot break it.  The only
> difference between this and an explicit interface on the host is that
> the latter requires a competent sysadmin.
> 
> If you have other reasons for preferring the use of vsock for NFS, I'd be
> happy to hear them.  So far I'm not convinced.

Before working on AF_VSOCK I originally proposed adding dedicated
network interfaces to guests, similar to what you've suggested, but
there was resistance for additional reasons that weren't covered in the
presentation:

Using AF_INET exposes the host's network stack to guests, and through
accidental misconfiguration even external traffic could reach the host's
network stack.  AF_VSOCK doesn't do routing or forwarding so we can be
sure that any activity is intentional.

Some virtualization use cases run guests without any network interfaces
as a matter of security policy.  One could argue that AF_VSOCK is just
another network channel, but due to its restricted usage, the attack
surface is much smaller than an AF_INET network interface.
Stefan Hajnoczi July 25, 2017, 12:29 p.m. UTC | #12
On Fri, Jul 07, 2017 at 01:17:54PM +1000, NeilBrown wrote:
> On Fri, Jun 30 2017, Chuck Lever wrote:
> >
> > Wouldn't it be nicer if it worked like this:
> >
> > (guest)$ cat /etc/hosts
> > 129.0.0.2  localhyper
> > (guest)$ mount.nfs localhyper:/export /mnt
> >
> > And the result was a working NFS mount of the
> > local hypervisor, using whatever NFS version the
> > two both support, with no changes needed to the
> > NFS implementation or the understanding of the
> > system administrator?
> 
> Yes. Yes. Definitely Yes.
> Though I suspect you mean "127.0.0.2", not "129..."??
> 
> There must be some way to redirect TCP connections to some address
> transparently through to the vsock protocol.
> The "sshuttle" program does this to transparently forward TCP connections
> over an ssh connection.  Using a similar technique to forward
> connections over vsock shouldn't be hard.

Thanks for the sshuttle reference.  I've taken a look at it and the
underlying iptables extensions.

sshuttle does not have the ability to accept incoming connections but
that can be achieved by adding the IP to the loopback device.

Here is how bi-directional TCP connections can be tunnelled without
network interfaces:

  host           <-> vsock transport <-> guest
  129.0.0.2 (lo)                         129.0.0.2 (lo)
  129.0.0.3 (lo)                         129.0.0.3 (lo)

iptables REDIRECT is used to catch 129.0.0.2->.3 connections on the host
and 129.0.0.3->.2 connections in the guest.  A "connect" command is then
sent across the tunnel to establish a new TCP connection on the other
side.

Note that this isn't NAT since both sides see the correct IP addresses.

Unlike using a network interface (even tun/tap) this tunnelling approach
is restricted to TCP connections.  It doesn't have UDP, etc.

Issues:

1. Adding IPs to dev lo has side effects.  For example, firewall rules
   on dev lo will affect the traffic.  This alone probably prevents the
   approach from working without conflicts on existing guests.

2. Is there a safe address range to use?  Using IPv6 link-local
   addresses as suggested in this thread might work, especially when
   using an OUI so we can be sure there are no address collisions.

3. Performance has already been mentioned since a userspace process
   tunnels from loopback TCP to the vsock transport.  splice(2) can
   probably be used.
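
A rough sketch of the splice(2) forwarding loop that point 3 has in mind
(one direction only, and assuming both file descriptors support splice; a
plain read/write loop is the fallback):

  #define _GNU_SOURCE
  #include <fcntl.h>      /* splice, SPLICE_F_MOVE */
  #include <unistd.h>

  /* Shovel bytes from one connection to the other without copying them
   * through userspace; splice() needs a pipe as the intermediate buffer. */
  static int forward(int from_fd, int to_fd)
  {
      int pipefd[2];
      ssize_t n, m;
      int ret = 0;

      if (pipe(pipefd) < 0)
          return -1;

      while ((n = splice(from_fd, NULL, pipefd[1], NULL,
                         64 * 1024, SPLICE_F_MOVE)) > 0) {
          while (n > 0) {
              m = splice(pipefd[0], NULL, to_fd, NULL, n, SPLICE_F_MOVE);
              if (m <= 0) {
                  ret = -1;
                  goto out;
              }
              n -= m;
          }
      }
      if (n < 0)
          ret = -1;
  out:
      close(pipefd[0]);
      close(pipefd[1]);
      return ret;
  }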

Stefan
NeilBrown July 27, 2017, 5:13 a.m. UTC | #13
On Tue, Jul 25 2017, Stefan Hajnoczi wrote:

> On Fri, Jul 07, 2017 at 02:13:38PM +1000, NeilBrown wrote:
>> On Fri, Jul 07 2017, NeilBrown wrote:
>> 
>> > On Fri, Jun 30 2017, Chuck Lever wrote:
>> >>
>> >> Wouldn't it be nicer if it worked like this:
>> >>
>> >> (guest)$ cat /etc/hosts
>> >> 129.0.0.2  localhyper
>> >> (guest)$ mount.nfs localhyper:/export /mnt
>> >>
>> >> And the result was a working NFS mount of the
>> >> local hypervisor, using whatever NFS version the
>> >> two both support, with no changes needed to the
>> >> NFS implementation or the understanding of the
>> >> system administrator?
>> >
>> > Yes. Yes. Definitely Yes.
>> > Though I suspect you mean "127.0.0.2", not "129..."??
>> >
>> > There must be some way to redirect TCP connections to some address
>> > transparently through to the vsock protocol.
>> > The "sshuttle" program does this to transparently forward TCP connections
>> > over an ssh connection.  Using a similar technique to forward
>> > connections over vsock shouldn't be hard.
>> >
>> > Or is performance really critical, and you get too much copying when you
>> > try forwarding connections?  I suspect that is fixable, but it would be
>> > a little less straight forward.
>> >
>> > I would really *not* like to see vsock support being bolted into one
>> > network tool after another.
>> 
>> I've been digging into this a bit more.  I came across
>>   https://vmsplice.net/~stefan/stefanha-kvm-forum-2015.pdf
>> 
>> which (on page 7) lists some reasons not to use TCP/IP between guest
>> and host.
>> 
>>  . Adding & configuring guest interfaces is invasive
>> 
>> That is possibly true.  But adding support for a new address family to
>> NFS, NFSD, and nfs-utils is also very invasive.  You would need to
>> install this software on the guest.  I suggest you install different
>> software on the guest which solves the problem better.
>
> Two different types of "invasive":
> 1. Requiring guest configuration changes that are likely to cause
>    conflicts.
> 2. Requiring changes to the software stack.  Once installed there are no
>    conflicts.
>
> I'm interested and open to a different solution but it must avoid
> invasive configuration changes, especially inside the guest.

Sounds fair.


>
>>  . Prone to break due to config changes inside guest
>> 
>> This is, I suspect, a key issue.  With vsock, the address of the
>> guest-side interface is defined by options passed to qemu.  With
>> normal IP addressing, the guest has to configure the address.
>> 
>> However I think that IPv6 autoconfig makes this work well without vsock.
>> If I create a bridge interface on the host, run
>>   ip -6 addr  add fe80::1 dev br0
>> then run a guest with
>>    -net nic,macaddr=Ch:oo:se:an:ad:dr \
>>    -net bridge,br=br0 \
>> 
>> then the client can
>>   mount [fe80::1%interfacename]:/path /mountpoint
>> 
>> and the host will see a connection from
>>    fe80::ch:oo:se:an:ad:dr
>> 
>> So from the guest side, I have achieved zero-config NFS mounts from the
>> host.
>
> It is not zero-configuration since [fe80::1%interfacename] contains a
> variable, "interfacename", whose value is unknown ahead of time.  This
> will make documentation as well as ability to share configuration
> between VMs more difficult.  In other words, we're back to something
> that requires per-guest configuration and doesn't just work everywhere.

Maybe.  Why isn't the interfacename known ahead of time?  Once upon a
time it was always "eth0", but I guess guests can rename it....

You can use a number instead of a name.  %1 would always be lo.
%2 seems to always (often?) be the first physical interface.  Presumably
the order in which you describe interfaces to qemu directly maps to the
order that Linux sees.  Maybe %2 could always work.  Maybe we could make
it so that it always works, even if that requires small changes to Linux
(and/or qemu).
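
To be concrete about what the "%" suffix does: it only sets sin6_scope_id
in the socket address, which can be given as a raw interface index or
looked up from a name.  A sketch (the interface name "eth0" is just an
example):

  #include <stdio.h>
  #include <string.h>
  #include <arpa/inet.h>   /* inet_pton, htons */
  #include <net/if.h>      /* if_nametoindex */
  #include <netinet/in.h>  /* struct sockaddr_in6 */

  int main(void)
  {
      struct sockaddr_in6 sin6;

      memset(&sin6, 0, sizeof(sin6));
      sin6.sin6_family = AF_INET6;
      sin6.sin6_port = htons(2049);
      inet_pton(AF_INET6, "fe80::1", &sin6.sin6_addr);

      /* "%eth0" and "%2" are two spellings of the same scope id */
      sin6.sin6_scope_id = if_nametoindex("eth0");  /* or simply 2 */

      printf("scope id %u\n", sin6.sin6_scope_id);
      return 0;
  }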

>
>> I don't think the server can filter connections based on which interface
>> a link-local address came from.  If that was a problem that someone
>> wanted to be fixed, I'm sure we can fix it.
>> 
>> If you need to be sure that clients don't fake their IPv6 address, I'm
>> sure netfilter is up to the task.
>
> Yes, it's common to prevent spoofing on the host using netfilter and I
> think it wouldn't be a problem.
>
>>  . Creates network interfaces on host that must be managed
>> 
>> What vsock does is effectively create a hidden interface on the host that only the
>> kernel knows about and so the sysadmin cannot break it.  The only
>> difference between this and an explicit interface on the host is that
>> the latter requires a competent sysadmin.
>> 
>> If you have other reasons for preferring the use of vsock for NFS, I'd be
>> happy to hear them.  So far I'm not convinced.
>
> Before working on AF_VSOCK I originally proposed adding dedicated
> network interfaces to guests, similar to what you've suggested, but
> there was resistance for additional reasons that weren't covered in the
> presentation:

I would like to suggest that this is critical information for
understanding the design rationale for AF_VSOCK and should be easily
found from http://wiki.qemu.org/Features/VirtioVsock

>
> Using AF_INET exposes the host's network stack to guests, and through
> accidental misconfiguration even external traffic could reach the host's
> network stack.  AF_VSOCK doesn't do routing or forwarding so we can be
> sure that any activity is intentional.

If I understand this correctly, the suggested configuration has the host
completely isolated from network traffic, and the guests directly
control the physical network interfaces, so the guests see external
traffic, but neither the guests nor the wider network can communicate
with the host.

Except that sometimes the guests do need to communicate with the host so
we create a whole new protocol just for that.

>
> Some virtualization use cases run guests without any network interfaces
> as a matter of security policy.  One could argue that AF_VSOCK is just
> another network channel, but due to its restricted usage, the attack
> surface is much smaller than an AF_INET network interface.

No network interfaces, but they still want to use NFS.  Does anyone
think that sounds rational?

"due to it's restricted usage, the attack surface is much smaller"
or "due to it's niche use-cache, bug are likely to go undetected for
longer".  I'm not convinced that is sensible security policy.


I think I see where you are coming from now - thanks.  I'm not convinced
though.  It feels like someone is paranoid about possible exploits using
protocols that they think they understand, so they ask you to create a
new protocol that they don't understand (and so cannot be afraid of).

Maybe the NFS server should be run in a guest.  Surely that would
protect the host's network stack.  This would be a rather paranoid
configuration, but it seems to match the paranoia of the requirements.

I'm not against people being paranoid.  I am against major code changes
to well established software, just to placate that paranoia.

To achieve zero-config, I think link-local addresses are by far the best
answer.  To achieve isolation, some targeted filtering seems like the
best approach.

If you really want traffic between guest and host to go over a vsock,
then some sort of packet redirection should be possible.

NeilBrown
Stefan Hajnoczi July 27, 2017, 10:58 a.m. UTC | #14
On Thu, Jul 27, 2017 at 03:13:53PM +1000, NeilBrown wrote:
> On Tue, Jul 25 2017, Stefan Hajnoczi wrote:
> > On Fri, Jul 07, 2017 at 02:13:38PM +1000, NeilBrown wrote:
> >> On Fri, Jul 07 2017, NeilBrown wrote:
> >> > On Fri, Jun 30 2017, Chuck Lever wrote:
> >> I don't think the server can filter connections based on which interface
> >> a link-local address came from.  If that was a problem that someone
> >> wanted to be fixed, I'm sure we can fix it.
> >> 
> >> If you need to be sure that clients don't fake their IPv6 address, I'm
> >> sure netfilter is up to the task.
> >
> > Yes, it's common to prevent spoofing on the host using netfilter and I
> > think it wouldn't be a problem.
> >
> >>  . Creates network interfaces on host that must be managed
> >> 
> >> What vsock does is effectively create a hidden interface on the host that only the
> >> kernel knows about and so the sysadmin cannot break it.  The only
> >> difference between this and an explicit interface on the host is that
> >> the latter requires a competent sysadmin.
> >> 
> >> If you have other reasons for preferring the use of vsock for NFS, I'd be
> >> happy to hear them.  So far I'm not convinced.
> >
> > Before working on AF_VSOCK I originally proposed adding dedicated
> > network interfaces to guests, similar to what you've suggested, but
> > there was resistance for additional reasons that weren't covered in the
> > presentation:
> 
> I would like to suggest that this is critical information for
> understanding the design rationale for AF_VSOCK and should be easily
> found from http://wiki.qemu.org/Features/VirtioVsock

Thanks, I have updated the wiki.

> To achieve zero-config, I think link-local addresses are by far the best
> answer.  To achieve isolation, some targeted filtering seems like the
> best approach.
> 
> If you really want traffic between guest and host to go over a vsock,
> then some sort of packet redirection should be possible.

The issue we seem to hit with designs using AF_INET and network
interfaces is that they cannot meet the "it must avoid invasive
configuration changes, especially inside the guest" requirement.  It's
very hard to autoconfigure in a way that doesn't conflict with the
user's network configuration inside the guest.

One thought about solving the interface naming problem: if the dedicated
NIC uses a well-known OUI dedicated for this purpose then udev could
assign a persistent name (e.g. "virtguestif").  This gets us one step
closer to non-invasive automatic configuration.

Stefan
Jeff Layton July 27, 2017, 11:33 a.m. UTC | #15
On Thu, 2017-07-27 at 11:58 +0100, Stefan Hajnoczi wrote:
> On Thu, Jul 27, 2017 at 03:13:53PM +1000, NeilBrown wrote:
> > On Tue, Jul 25 2017, Stefan Hajnoczi wrote:
> > > On Fri, Jul 07, 2017 at 02:13:38PM +1000, NeilBrown wrote:
> > > > On Fri, Jul 07 2017, NeilBrown wrote:
> > > > > On Fri, Jun 30 2017, Chuck Lever wrote:
> > > > 
> > > > I don't think the server can filter connections based on which
> > > > interface
> > > > a link-local address came from.  If that was a problem that
> > > > someone
> > > > wanted to be fixed, I'm sure we can fix it.
> > > > 
> > > > If you need to be sure that clients don't fake their IPv6
> > > > address, I'm
> > > > sure netfilter is up to the task.
> > > 
> > > Yes, it's common to prevent spoofing on the host using netfilter
> > > and I
> > > think it wouldn't be a problem.
> > > 
> > > >  . Creates network interfaces on host that must be managed
> > > > 
> > > > What vsock does is effectively create a hidden interface on the
> > > > host that only the
> > > > kernel knows about and so the sysadmin cannot break it.  The
> > > > only
> > > > difference between this and an explicit interface on the host
> > > > is that
> > > > the latter requires a competent sysadmin.
> > > > 
> > > > If you have other reasons for preferring the use of vsock for
> > > > NFS, I'd be
> > > > happy to hear them.  So far I'm not convinced.
> > > 
> > > Before working on AF_VSOCK I originally proposed adding dedicated
> > > network interfaces to guests, similar to what you've suggested,
> > > but
> > > there was resistance for additional reasons that weren't covered
> > > in the
> > > presentation:
> > 
> > I would like to suggest that this is critical information for
> > understanding the design rationale for AF_VSOCK and should be
> > easily
> > found from http://wiki.qemu.org/Features/VirtioVsock
> 
> Thanks, I have updated the wiki.
> 
> > To achieve zero-config, I think link-local addresses are by far the
> > best
> > answer.  To achieve isolation, some targeted filtering seems like
> > the
> > best approach.
> >
> > If you really want traffic between guest and host to go over a
> > vsock,
> > then some sort of packet redirection should be possible.
> 
> The issue we seem to hit with designs using AF_INET and network
> interfaces is that they cannot meet the "it must avoid invasive
> configuration changes, especially inside the guest"
> requirement.  It's
> very hard to autoconfigure in a way that doesn't conflict with the
> user's network configuration inside the guest.
> 
> One thought about solving the interface naming problem: if the
> dedicated
> NIC uses a well-known OUI dedicated for this purpose then udev could
> assign a persistent name (e.g. "virtguestif").  This gets us one step
> closer to non-invasive automatic configuration.

Link-local IPv6 addresses are always present once you bring up an IPv6
interface. You can use them to communicate with other hosts on the same
network segment. They're just not routable. That seems entirely fine here
where you're not dealing with routing anyway.

What I would (naively) envision is a new network interface driver that
presents itself as "hvlo0" or something, much like we do with the
loopback interface. You just need the guest to ensure that it plugs in
that driver and brings up the interface for IPv6.
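
To make that concrete, here is a minimal sketch of a client connecting
over such an address; the "hvlo0" name and the host's fe80::1 address
are assumptions for illustration, not anything this series defines:

  #include <arpa/inet.h>
  #include <net/if.h>
  #include <netinet/in.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  /* Connect to the hypervisor's link-local address on the dedicated NIC. */
  int connect_linklocal(void)
  {
      struct sockaddr_in6 sin6;
      int fd = socket(AF_INET6, SOCK_STREAM, 0);

      if (fd < 0)
          return -1;
      memset(&sin6, 0, sizeof(sin6));
      sin6.sin6_family = AF_INET6;
      sin6.sin6_port = htons(2049);              /* NFS */
      inet_pton(AF_INET6, "fe80::1", &sin6.sin6_addr);
      /* Link-local addresses are only unambiguous per interface, so the
       * scope id pins the connection to the dedicated NIC. */
      sin6.sin6_scope_id = if_nametoindex("hvlo0");
      if (connect(fd, (struct sockaddr *)&sin6, sizeof(sin6)) < 0) {
          close(fd);
          return -1;
      }
      return fd;
  }

The need for a scope id is exactly why a predictable interface name (or
a reliable way to pick the interface) matters for zero configuration.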

Then the only issue is discovery of addresses. The HV should be able to
figure that out and present it. Maybe roll up a new nsswitch module
that queries the HV directly somehow? The nice thing there is that you
get name resolution "for free", since it's just plain old IPv6 traffic
at that point.

AF_VSOCK just seems like a very invasive solution to this problem
that's going to add a lot of maintenance burden to a lot of different
code.
NeilBrown July 27, 2017, 11:11 p.m. UTC | #16
On Thu, Jul 27 2017, Stefan Hajnoczi wrote:

> On Thu, Jul 27, 2017 at 03:13:53PM +1000, NeilBrown wrote:
>> On Tue, Jul 25 2017, Stefan Hajnoczi wrote:
>> > On Fri, Jul 07, 2017 at 02:13:38PM +1000, NeilBrown wrote:
>> >> On Fri, Jul 07 2017, NeilBrown wrote:
>> >> > On Fri, Jun 30 2017, Chuck Lever wrote:
>> >> I don't think the server can filter connections based on which interface
>> >> a link-local address came from.  If that was a problem that someone
>> >> wanted to be fixed, I'm sure we can fix it.
>> >> 
>> >> If you need to be sure that clients don't fake their IPv6 address, I'm
>> >> sure netfilter is up to the task.
>> >
>> > Yes, it's common to prevent spoofing on the host using netfilter and I
>> > think it wouldn't be a problem.
>> >
>> >>  . Creates network interfaces on host that must be managed
>> >> 
>> >> What vsock does is effectively create a hidden interface on the host that only the
>> >> kernel knows about and so the sysadmin cannot break it.  The only
>> >> difference between this and an explicit interface on the host is that
>> >> the latter requires a competent sysadmin.
>> >> 
>> >> If you have other reasons for preferring the use of vsock for NFS, I'd be
>> >> happy to hear them.  So far I'm not convinced.
>> >
>> > Before working on AF_VSOCK I originally proposed adding dedicated
>> > network interfaces to guests, similar to what you've suggested, but
>> > there was resistance for additional reasons that weren't covered in the
>> > presentation:
>> 
>> I would like to suggest that this is critical information for
>> understanding the design rationale for AF_VSOCK and should be easily
>> found from http://wiki.qemu.org/Features/VirtioVsock
>
> Thanks, I have updated the wiki.

Thanks.  But this one:

  Can be used with VMs that have no network interfaces

is really crying out for some sort of justification.

And given that ethernet/tcpip must be some of the most attacked (and
hence hardened) code in Linux, some explanation of why it is thought
that they expose more of an attack surface than some brand new code
might be helpful.

>
>> To achieve zero-config, I think link-local addresses are by far the best
>> answer.  To achieve isolation, some targeted filtering seems like the
>> best approach.
>> 
>> If you really want traffic between guest and host to go over a vsock,
>> then some sort of packet redirection should be possible.
>
> The issue we seem to hit with designs using AF_INET and network
> interfaces is that they cannot meet the "it must avoid invasive
> configuration changes, especially inside the guest" requirement.  It's
> very hard to autoconfigure in a way that doesn't conflict with the
> user's network configuration inside the guest.
>
> One thought about solving the interface naming problem: if the dedicated
> NIC uses a well-known OUI dedicated for this purpose then udev could
> assign a persistent name (e.g. "virtguestif").  This gets us one step
> closer to non-invasive automatic configuration.

I think this is well worth pursuing.  As you say, an OUI allows the
guest to reliably detect the right interface to use a link-local address
on.

Thanks,
NeilBrown


>
> Stefan
Matt Benjamin July 28, 2017, 12:35 a.m. UTC | #17
Hi,

On Fri, Jun 30, 2017 at 11:52 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
> Hi Stefan-

>
> Is there a standard netid for vsock? If not,
> there needs to be some discussion with the nfsv4
> Working Group to get this worked out.
>
> Because AF_VSOCK is an address family and the RPC
> framing is the same as TCP, the netid should be
> something like "tcpv" and not "vsock". I've
> complained about this before and there has been
> no response of any kind.

The ONC record marking is just the length/end-of-transmission bit and
the bytes.  Something is being borrowed, but it isn't TCP.
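
For reference, that framing (RFC 5531 record marking) is just a 4-byte
header carrying a last-fragment flag plus the fragment length ahead of
the XDR bytes; a sketch, independent of whether the bytes then ride on
TCP or a vsock stream:

  #include <arpa/inet.h>
  #include <stdint.h>
  #include <string.h>

  /* Prepend an ONC RPC record-marking header to one fragment. */
  size_t rpc_frame(uint8_t *out, const uint8_t *frag, uint32_t len, int last)
  {
      uint32_t header = len & 0x7fffffff;

      if (last)
          header |= 0x80000000;        /* end-of-record bit */
      header = htonl(header);
      memcpy(out, &header, 4);
      memcpy(out + 4, frag, len);
      return 4 + (size_t)len;
  }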

>
> I'll note that rdma/rdma6 do not use alternate
> address families: an IP address is specified and
> mapped to a GUID by the underlying transport.
> We purposely did not expose GUIDs to NFS, which
> is based on AF_INET/AF_INET6.

but, as you state, vsock is an address family.

>
> rdma co-exists with IP. vsock doesn't have this
> fallback.

doesn't appear to be needed.

>
> It might be a better approach to use well-known
> (say, link-local or loopback) addresses and let
> the underlying network layer figure it out.
>
> Then hide all this stuff with DNS and let the
> client mount the server by hostname and use
> normal sockaddr's and "proto=tcp". Then you don't
> need _any_ application layer changes.

the changes in nfs-ganesha and ntirpc along these lines were rather trivial.

>
> Without hostnames, how does a client pick a
> Kerberos service principal for the server?

no mechanism has been proposed

>
> Does rpcbind implement "vsock" netids?

are they needed?

>
> Does the NFSv4.0 client advertise "vsock" in
> SETCLIENTID, and provide a "vsock" callback
> service?

It should at least do the latter;  does it need to advertise
differently in SETCLIENTID?

>
>
>> It is now possible to mount a file system from the host (hypervisor)
>> over AF_VSOCK like this:
>>
>>  (guest)$ mount.nfs 2:/export /mnt -v -o clientaddr=3,proto=vsock
>>
>> The VM's cid address is 3 and the hypervisor is 2.
>
> The mount command is supposed to supply "clientaddr"
> automatically. This mount option is exposed only for
> debugging purposes or very special cases (like
> disabling NFSv4 callback operations).
>
> I mean the whole point of this exercise is to get
> rid of network configuration, but here you're
> adding the need to additionally specify both the
> proto option and the clientaddr option to get this
> to work. Seems like that isn't zero-configuration
> at all.

This whole line of criticism seems to me kind of off-kilter.  The
concept of cross-vm pipes appears pretty classical, and one can see
why it might not need to follow Internet conventions.

I'll give you that I never found the zeroconf or security rationales
as compelling--which is to say, I wouldn't restrict vsock to
guest-host communications, except by policy.

>
> Wouldn't it be nicer if it worked like this:
>
> (guest)$ cat /etc/hosts
> 129.0.0.2  localhyper
> (guest)$ mount.nfs localhyper:/export /mnt
>
> And the result was a working NFS mount of the
> local hypervisor, using whatever NFS version the
> two both support, with no changes needed to the
> NFS implementation or the understanding of the
> system administrator?
>
>

Not clear; I can understand 2:/export pretty easily, and I don't
think any minds would be blown if "localhyper:" resolved to 2:.

>
> --
> Chuck Lever
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Matt
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Stefan Hajnoczi Aug. 3, 2017, 3:24 p.m. UTC | #18
On Fri, Jul 28, 2017 at 09:11:22AM +1000, NeilBrown wrote:
> On Thu, Jul 27 2017, Stefan Hajnoczi wrote:
> > On Thu, Jul 27, 2017 at 03:13:53PM +1000, NeilBrown wrote:
> >> On Tue, Jul 25 2017, Stefan Hajnoczi wrote:
> >> > On Fri, Jul 07, 2017 at 02:13:38PM +1000, NeilBrown wrote:
> >> >> On Fri, Jul 07 2017, NeilBrown wrote:
> >> >> > On Fri, Jun 30 2017, Chuck Lever wrote:
> >> To achieve zero-config, I think link-local addresses are by far the best
> >> answer.  To achieve isolation, some targeted filtering seems like the
> >> best approach.
> >> 
> >> If you really want traffic between guest and host to go over a vsock,
> >> then some sort of packet redirection should be possible.
> >
> > The issue we seem to hit with designs using AF_INET and network
> > interfaces is that they cannot meet the "it must avoid invasive
> > configuration changes, especially inside the guest" requirement.  It's
> > very hard to autoconfigure in a way that doesn't conflict with the
> > user's network configuration inside the guest.
> >
> > One thought about solving the interface naming problem: if the dedicated
> > NIC uses a well-known OUI dedicated for this purpose then udev could
> > assign a persistent name (e.g. "virtguestif").  This gets us one step
> > closer to non-invasive automatic configuration.
> 
> I think this is well worth pursuing.  As you say, an OUI allows the
> guest to reliably detect the right interface to use a link-local address
> on.

IPv6 link-local addressing with a well-known MAC address range solves
address collisions.  The presence of a network interface still has the
following issues:

1. Network management tools (e.g. NetworkManager) inside the guest
   detect the interface and may auto-configure it (e.g. DHCP).  Guest
   administrators are confronted with a new interface - this opens up
   the possibility that they change its configuration.

2. Default drop firewall policies conflict with the interface.  The
   guest administrator would have to manually configure exceptions for
   their firewall.

3. udev is a Linux-only solution and other OSes do not offer a
   configurable interface naming scheme.  Manual configuration would
   be required.

I still see these as blockers preventing guest<->host file system
sharing.  Users can already manually add a NIC and configure NFS today,
but the goal here is to offer this as a feature that works in an
automated way (useful both for GUI-style virtual machine management and
for OpenStack clouds where guest configuration must be simple and
scale).

In contrast, AF_VSOCK works as long as the driver is loaded.  There is
no configuration.
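
To make the contrast concrete, this is roughly all the guest-side code a
vsock connection needs (a sketch; the port is shown as 2049 only because
that is the usual NFS port):

  #include <sys/socket.h>
  #include <linux/vm_sockets.h>
  #include <string.h>
  #include <unistd.h>

  /* Connect from the guest to the host (hypervisor) over AF_VSOCK. */
  int vsock_connect_host(unsigned int port)
  {
      struct sockaddr_vm svm;
      int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

      if (fd < 0)
          return -1;
      memset(&svm, 0, sizeof(svm));
      svm.svm_family = AF_VSOCK;
      svm.svm_cid = VMADDR_CID_HOST;   /* CID 2: the hypervisor */
      svm.svm_port = port;             /* e.g. 2049 */
      /* No addresses, routes or firewall exceptions to set up first. */
      if (connect(fd, (struct sockaddr *)&svm, sizeof(svm)) < 0) {
          close(fd);
          return -1;
      }
      return fd;
  }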

The changes required to Linux and nfs-utils are related to the sunrpc
transport and configuration.  They do not introduce risks to core NFS or
TCP/IP.  I would really like to get patches merged because I currently
have to direct interested users to building Linux and nfs-utils from
source to try this out.

Stefan
NeilBrown Aug. 3, 2017, 9:45 p.m. UTC | #19
On Thu, Aug 03 2017, Stefan Hajnoczi wrote:

> On Fri, Jul 28, 2017 at 09:11:22AM +1000, NeilBrown wrote:
>> On Thu, Jul 27 2017, Stefan Hajnoczi wrote:
>> > On Thu, Jul 27, 2017 at 03:13:53PM +1000, NeilBrown wrote:
>> >> On Tue, Jul 25 2017, Stefan Hajnoczi wrote:
>> >> > On Fri, Jul 07, 2017 at 02:13:38PM +1000, NeilBrown wrote:
>> >> >> On Fri, Jul 07 2017, NeilBrown wrote:
>> >> >> > On Fri, Jun 30 2017, Chuck Lever wrote:
>> >> To achieve zero-config, I think link-local addresses are by far the best
>> >> answer.  To achieve isolation, some targeted filtering seems like the
>> >> best approach.
>> >> 
>> >> If you really want traffic between guest and host to go over a vsock,
>> >> then some sort of packet redirection should be possible.
>> >
>> > The issue we seem to hit with designs using AF_INET and network
>> > interfaces is that they cannot meet the "it must avoid invasive
>> > configuration changes, especially inside the guest" requirement.  It's
>> > very hard to autoconfigure in a way that doesn't conflict with the
>> > user's network configuration inside the guest.
>> >
>> > One thought about solving the interface naming problem: if the dedicated
>> > NIC uses a well-known OUI dedicated for this purpose then udev could
>> > assign a persistent name (e.g. "virtguestif").  This gets us one step
>> > closer to non-invasive automatic configuration.
>> 
>> I think this is well worth pursuing.  As you say, an OUI allows the
>> guest to reliably detect the right interface to use a link-local address
>> on.
>
> IPv6 link-local addressing with a well-known MAC address range solves
> address collisions.  The presence of a network interface still has the
> following issues:
>
> 1. Network management tools (e.g. NetworkManager) inside the guest
>    detect the interface and may auto-configure it (e.g. DHCP).

Why would this matter?  Auto-configuring may add addresses to the
interface, but will not remove the link-local address.

>                                                                 Guest
>    administrators are confronted with a new interface - this opens up
>    the possibility that they change its configuration.

True, the admin might delete the link-local address themselves.  They
might also delete /sbin/mount.nfs.  Maybe they could even "rm -rf /".
A rogue admin can always shoot themselves in the foot.  Trying to
prevent this is pointless.

>
> 2. Default drop firewall policies conflict with the interface.  The
>    guest administrator would have to manually configure exceptions for
>    their firewall.

This gets back to my original point.  You are willing to stick
required configuration in the kernel and in nfs-utils, but you are not
willing to require some fixed configuration which actually addresses
your problem.  If you want an easy way to punch a firewall hole for a
particular port on a particular interface, then resolve that by talking
with people who understand firewalls.  Not by creating a new protocol
which cannot be firewalled.

>
> 3. udev is a Linux-only solution and other OSes do not offer a
>    configurable interface naming scheme.  Manual configuration would
>    be required.

Not my problem.
If some other OS is lacking important functionality, you don't fix it by
adding rubbish to Linux.  You fix it by fixing those OSes.
For example, if Linux didn't have udev or anything like it, I might be
open to enhancing mount.nfs so that an address syntax like:
  fe80::1%*:xx:yy:xx:*
would mean that the glob pattern should be matched against the MAC address
of each interface and the first such interface used.  This would be a
focused change addressed at fixing a specific issue.  I might not
actually like it, but if it was the best/simplest mechanism to achieve
the goal, I doubt I would fight it.  Fortunately I don't need to decide
as we already have udev.
If some other OS doesn't have a way to find the interface for a particular
MAC address, maybe you need to create one.
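
(A sketch of how small that focused change could be; the helper below is
hypothetical, not proposed mount.nfs code.  It matches the MAC glob
against each interface:)

  #include <fnmatch.h>
  #include <ifaddrs.h>
  #include <netpacket/packet.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>

  /* Return the first interface whose MAC address matches the glob. */
  static int if_by_mac_glob(const char *pattern, char *name, size_t len)
  {
      struct ifaddrs *ifa, *p;
      char mac[18];
      int rc = -1;

      if (getifaddrs(&ifa) < 0)
          return -1;
      for (p = ifa; p != NULL; p = p->ifa_next) {
          struct sockaddr_ll *ll = (struct sockaddr_ll *)p->ifa_addr;

          if (!ll || ll->sll_family != AF_PACKET || ll->sll_halen != 6)
              continue;
          snprintf(mac, sizeof(mac), "%02x:%02x:%02x:%02x:%02x:%02x",
                   ll->sll_addr[0], ll->sll_addr[1], ll->sll_addr[2],
                   ll->sll_addr[3], ll->sll_addr[4], ll->sll_addr[5]);
          if (fnmatch(pattern, mac, 0) == 0) {
              snprintf(name, len, "%s", p->ifa_name);
              rc = 0;
              break;
          }
      }
      freeifaddrs(ifa);
      return rc;
  }

mount.nfs could then combine the matched name with if_nametoindex() to
fill in the scope id for the link-local address.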

>
> I still see these as blockers preventing guest<->host file system
> sharing.  Users can already manually add a NIC and configure NFS today,
> but the goal here is to offer this as a feature that works in an
> automated way (useful both for GUI-style virtual machine management and
> for OpenStack clouds where guest configuration must be simple and
> scale).
>
> In contrast, AF_VSOCK works as long as the driver is loaded.  There is
> no configuration.

I think we all agree that providing something that "just works" is a
worthy goal.  The only question is how much new code can be
justified, and where it should be put.

Given that almost everything you need already exists, it seems best to
just tie those pieces together.

NeilBrown


>
> The changes required to Linux and nfs-utils are related to the sunrpc
> transport and configuration.  They do not introduce risks to core NFS or
> TCP/IP.  I would really like to get patches merged because I currently
> have to direct interested users to building Linux and nfs-utils from
> source to try this out.
>
> Stefan
Matt Benjamin Aug. 3, 2017, 11:53 p.m. UTC | #20
Hi Neil,

On Thu, Aug 3, 2017 at 5:45 PM, NeilBrown <neilb@suse.com> wrote:
> On Thu, Aug 03 2017, Stefan Hajnoczi wrote:

Since the vsock address family has been in the kernel since 4.8, this
argument appears to be about, precisely, tying existing pieces together.
The Ceph developers working on OpenStack Manila did find the NFS over
vsock use case compelling.  I appreciate this because it has
encouraged more interest in the CephFS community around using the
standardized NFS protocol for deployment.

Matt

>
> I think we all agree that providing something that "just works" is a
> worth goal.  In only question is about how much new code can be
> justified, and where it should be put.
>
> Given that almost everything you need already exists, it seems best to
> just tie those pieces together.
>
> NeilBrown
>
>
>>
>> The changes required to Linux and nfs-utils are related to the sunrpc
>> transport and configuration.  They do not introduce risks to core NFS or
>> TCP/IP.  I would really like to get patches merged because I currently
>> have to direct interested users to building Linux and nfs-utils from
>> source to try this out.
>>
>> Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
NeilBrown Aug. 4, 2017, 3:25 a.m. UTC | #21
On Thu, Aug 03 2017, Matt Benjamin wrote:

> Hi Neil,
>
> On Thu, Aug 3, 2017 at 5:45 PM, NeilBrown <neilb@suse.com> wrote:
>> On Thu, Aug 03 2017, Stefan Hajnoczi wrote:
>
> Since the vsock address family is in the tin since 4.8, this argument
> appears to be about, precisely, tying existing pieces together.

No, it is about adding new, unnecessary pieces into various places.

>                                                                  The
> ceph developers working on openstack manila did find the nfs over
> vsock use case compelling.  I appreciate this because it has
> encouraged more interest in the cephfs community around using the
> standardized NFS protocol for deployment.

I'm sure the ceph developers find zero-conf NFS a compelling use case.
I would be surprised if they care whether it is over vsock or IPv6.

But I'm losing interest here.  I'm not a gatekeeper.  If you can
convince Steve/Trond/Anna/Bruce to accept your code, then good luck to
you.  I don't think a convincing case has been made though.

NeilBrown


>
> Matt
>
>>
>> I think we all agree that providing something that "just works" is a
>> worth goal.  In only question is about how much new code can be
>> justified, and where it should be put.
>>
>> Given that almost everything you need already exists, it seems best to
>> just tie those pieces together.
>>
>> NeilBrown
>>
>>
>>>
>>> The changes required to Linux and nfs-utils are related to the sunrpc
>>> transport and configuration.  They do not introduce risks to core NFS or
>>> TCP/IP.  I would really like to get patches merged because I currently
>>> have to direct interested users to building Linux and nfs-utils from
>>> source to try this out.
>>>
>>> Stefan
Stefan Hajnoczi Aug. 4, 2017, 3:56 p.m. UTC | #22
On Fri, Aug 04, 2017 at 07:45:22AM +1000, NeilBrown wrote:
> On Thu, Aug 03 2017, Stefan Hajnoczi wrote:
> > On Fri, Jul 28, 2017 at 09:11:22AM +1000, NeilBrown wrote:
> >> On Thu, Jul 27 2017, Stefan Hajnoczi wrote:
> >> > On Thu, Jul 27, 2017 at 03:13:53PM +1000, NeilBrown wrote:
> >> >> On Tue, Jul 25 2017, Stefan Hajnoczi wrote:
> >> >> > On Fri, Jul 07, 2017 at 02:13:38PM +1000, NeilBrown wrote:
> >> >> >> On Fri, Jul 07 2017, NeilBrown wrote:
> >> >> >> > On Fri, Jun 30 2017, Chuck Lever wrote:
> > I still see these as blockers preventing guest<->host file system
> > sharing.  Users can already manually add a NIC and configure NFS today,
> > but the goal here is to offer this as a feature that works in an
> > automated way (useful both for GUI-style virtual machine management and
> > for OpenStack clouds where guest configuration must be simple and
> > scale).
> >
> > In contrast, AF_VSOCK works as long as the driver is loaded.  There is
> > no configuration.
> 
> I think we all agree that providing something that "just works" is a
> worth goal.  In only question is about how much new code can be
> justified, and where it should be put.
> 
> Given that almost everything you need already exists, it seems best to
> just tie those pieces together.

Neil,
You said downthread you're losing interest but there's a point that I
hope you have time to consider because it's key:

Even if the NFS transport can be set up automatically without
conflicting with the user's system configuration, it needs to stay
available going forward.  A network interface is prone to user
configuration changes through network management tools, firewalls, and
other utilities.  The risk of breakage is significant.

That's not really a technical problem - it will be caused by some user
action - but by using the existing Linux AF_VSOCK feature, that whole
class of issues can be eliminated.

Stefan
NeilBrown Aug. 4, 2017, 10:35 p.m. UTC | #23
On Fri, Aug 04 2017, Stefan Hajnoczi wrote:

> On Fri, Aug 04, 2017 at 07:45:22AM +1000, NeilBrown wrote:
>> On Thu, Aug 03 2017, Stefan Hajnoczi wrote:
>> > On Fri, Jul 28, 2017 at 09:11:22AM +1000, NeilBrown wrote:
>> >> On Thu, Jul 27 2017, Stefan Hajnoczi wrote:
>> >> > On Thu, Jul 27, 2017 at 03:13:53PM +1000, NeilBrown wrote:
>> >> >> On Tue, Jul 25 2017, Stefan Hajnoczi wrote:
>> >> >> > On Fri, Jul 07, 2017 at 02:13:38PM +1000, NeilBrown wrote:
>> >> >> >> On Fri, Jul 07 2017, NeilBrown wrote:
>> >> >> >> > On Fri, Jun 30 2017, Chuck Lever wrote:
>> > I still see these as blockers preventing guest<->host file system
>> > sharing.  Users can already manually add a NIC and configure NFS today,
>> > but the goal here is to offer this as a feature that works in an
>> > automated way (useful both for GUI-style virtual machine management and
>> > for OpenStack clouds where guest configuration must be simple and
>> > scale).
>> >
>> > In contrast, AF_VSOCK works as long as the driver is loaded.  There is
>> > no configuration.
>> 
>> I think we all agree that providing something that "just works" is a
>> worth goal.  In only question is about how much new code can be
>> justified, and where it should be put.
>> 
>> Given that almost everything you need already exists, it seems best to
>> just tie those pieces together.
>
> Neil,
> You said downthread you're losing interest but there's a point that I
> hope you have time to consider because it's key:
>
> Even if the NFS transport can be set up automatically without
> conflicting with the user's system configuration, it needs to stay
> available going forward.  A network interface is prone to user
> configuration changes through network management tools, firewalls, and
> other utilities.  The risk of it breakage is significant.

I've already addressed this issue.  I wrote:

	  True, the admin might delete the link-local address themselves.  They
	  might also delete /sbin/mount.nfs.  Maybe they could even "rm -rf /".
	  A rogue admin can always shoot themselves in the foot.  Trying to
	  prevent this is pointless.


>
> That's not really a technical problem - it will be caused by some user
> action - but using the existing Linux AF_VSOCK feature that whole class
> of issues can be eliminated.

I suggest you look up the proverb about making things fool-proof and
learn to apply it.

Meanwhile I have another issue.  Is it possible for tcpdump, or some
other tool, to capture all the packets flowing over a vsock?  If it
isn't possible to analyse the traffic with wireshark, it will be much
harder to diagnose issues that customers have.

NeilBrown


>
> Stefan
Stefan Hajnoczi Aug. 8, 2017, 2:07 p.m. UTC | #24
On Sat, Aug 05, 2017 at 08:35:52AM +1000, NeilBrown wrote:
> On Fri, Aug 04 2017, Stefan Hajnoczi wrote:
> 
> > On Fri, Aug 04, 2017 at 07:45:22AM +1000, NeilBrown wrote:
> >> On Thu, Aug 03 2017, Stefan Hajnoczi wrote:
> >> > On Fri, Jul 28, 2017 at 09:11:22AM +1000, NeilBrown wrote:
> >> >> On Thu, Jul 27 2017, Stefan Hajnoczi wrote:
> >> >> > On Thu, Jul 27, 2017 at 03:13:53PM +1000, NeilBrown wrote:
> >> >> >> On Tue, Jul 25 2017, Stefan Hajnoczi wrote:
> >> >> >> > On Fri, Jul 07, 2017 at 02:13:38PM +1000, NeilBrown wrote:
> >> >> >> >> On Fri, Jul 07 2017, NeilBrown wrote:
> >> >> >> >> > On Fri, Jun 30 2017, Chuck Lever wrote:
> >> > I still see these as blockers preventing guest<->host file system
> >> > sharing.  Users can already manually add a NIC and configure NFS today,
> >> > but the goal here is to offer this as a feature that works in an
> >> > automated way (useful both for GUI-style virtual machine management and
> >> > for OpenStack clouds where guest configuration must be simple and
> >> > scale).
> >> >
> >> > In contrast, AF_VSOCK works as long as the driver is loaded.  There is
> >> > no configuration.
> >> 
> >> I think we all agree that providing something that "just works" is a
> >> worth goal.  In only question is about how much new code can be
> >> justified, and where it should be put.
> >> 
> >> Given that almost everything you need already exists, it seems best to
> >> just tie those pieces together.
> >
> > Neil,
> > You said downthread you're losing interest but there's a point that I
> > hope you have time to consider because it's key:
> >
> > Even if the NFS transport can be set up automatically without
> > conflicting with the user's system configuration, it needs to stay
> > available going forward.  A network interface is prone to user
> > configuration changes through network management tools, firewalls, and
> > other utilities.  The risk of it breakage is significant.
> 
> I've already addressed this issue.  I wrote:
> 
> 	  True, the admin might delete the link-local address themselves.  They
> 	  might also delete /sbin/mount.nfs.  Maybe they could even "rm -rf /".
> 	  A rogue admin can always shoot themselves in the foot.  Trying to
> 	  prevent this is pointless.

These are not things that I'm worried about.  I agree that it's
pointless trying to prevent them.

The issue is genuine configuration changes either by the user or by
software they are running that simply interfere with the host<->guest
interface.  For example, a default DROP iptables policy.

> Meanwhile I have another issue.  Is it possible for tcpdump, or some
> other tool, to capture all the packets flowing over a vsock?  If it
> isn't possible to analyse the traffic with wireshark, it will be much
> harder to diagnose issues that customers have.

Yes, packet capture is possible.  The vsockmon driver was added in Linux
4.11.  Wireshark has a dissector for AF_VSOCK.

Stefan
diff mbox

Patch

diff --git a/support/nfs/getport.c b/support/nfs/getport.c
index 081594c..0b857af 100644
--- a/support/nfs/getport.c
+++ b/support/nfs/getport.c
@@ -217,8 +217,7 @@  nfs_get_proto(const char *netid, sa_family_t *family, unsigned long *protocol)
 	struct protoent *proto;
 
 	/*
-	 * IANA does not define a protocol number for rdma netids,
-	 * since "rdma" is not an IP protocol.
+	 * IANA does not define protocol numbers for non-IP netids.
 	 */
 	if (strcmp(netid, "rdma") == 0) {
 		*family = AF_INET;
@@ -230,6 +229,11 @@  nfs_get_proto(const char *netid, sa_family_t *family, unsigned long *protocol)
 		*protocol = NFSPROTO_RDMA;
 		return 1;
 	}
+	if (strcmp(netid, "vsock") == 0) {
+		*family = AF_VSOCK;
+		*protocol = 0;
+		return 1;
+	}
 
 	nconf = getnetconfigent(netid);
 	if (nconf == NULL)
@@ -258,14 +262,18 @@  nfs_get_proto(const char *netid, sa_family_t *family, unsigned long *protocol)
 	struct protoent *proto;
 
 	/*
-	 * IANA does not define a protocol number for rdma netids,
-	 * since "rdma" is not an IP protocol.
+	 * IANA does not define protocol numbers for non-IP netids.
 	 */
 	if (strcmp(netid, "rdma") == 0) {
 		*family = AF_INET;
 		*protocol = NFSPROTO_RDMA;
 		return 1;
 	}
+	if (strcmp(netid, "vsock") == 0) {
+		*family = AF_VSOCK;
+		*protocol = 0;
+		return 1;
+	}
 
 	proto = getprotobyname(netid);
 	if (proto == NULL)