
[2/2] nbd: add support for nbd as root device

Message ID 20190612163144.18486-2-roman.stratiienko@globallogic.com (mailing list archive)
State New, archived
Series [1/2] nbd: make sock_xmit() and nbd_add_socket() more generic

Commit Message

Roman Stratiienko June 12, 2019, 4:31 p.m. UTC
From: Roman Stratiienko <roman.stratiienko@globallogic.com>

Adding support to nbd to use it as a root device. This code essentially
provides a minimal nbd-client implementation within the kernel. It opens
a socket and performs the negotiation with the server. Afterwards it passes
the socket to the normal nbd code to handle the connection.

The arguments for the server are passed via kernel command line.
The kernel command line has the format
'nbdroot=[<SERVER_IP>:]<SERVER_PORT>/<EXPORT_NAME>'.
SERVER_IP is optional. If it is not available, it will use the
root_server_addr transmitted through DHCP.

Based on those arguments, the connection to the server is established
and bound to the nbd0 device. The root device therefore is
root=/dev/nbd0.
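
For illustration, a full boot line using this option might look like
this (the server address and export name here are hypothetical):

  ip=dhcp nbdroot=10.0.0.1:10809/rootfs root=/dev/nbd0 rootwait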

The patch was initially posted by Markus Pargmann <mpa@pengutronix.de>
and can be found at https://lore.kernel.org/patchwork/patch/532556/

Change-Id: I78f7313918bf31b9dc01a74a42f0f068bede312c
Signed-off-by: Roman Stratiienko <roman.stratiienko@globallogic.com>
Reviewed-by: Aleksandr Bulyshchenko <A.Bulyshchenko@globallogic.com>
---
 drivers/block/Kconfig |  19 +++
 drivers/block/nbd.c   | 294 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 313 insertions(+)

Comments

Eric Blake June 13, 2019, 12:56 p.m. UTC | #1
On 6/12/19 11:31 AM, roman.stratiienko@globallogic.com wrote:
> From: Roman Stratiienko <roman.stratiienko@globallogic.com>
> 
> Adding support to nbd to use it as a root device. This code essentially
> provides a minimal nbd-client implementation within the kernel. It opens
> a socket and performs the negotiation with the server. Afterwards it passes
> the socket to the normal nbd code to handle the connection.
> 
> The arguments for the server are passed via kernel command line.
> The kernel command line has the format
> 'nbdroot=[<SERVER_IP>:]<SERVER_PORT>/<EXPORT_NAME>'.
> SERVER_IP is optional. If it is not available, it will use the
> root_server_addr transmitted through DHCP.
> 
> Based on those arguments, the connection to the server is established
> and bound to the nbd0 device. The root device therefore is
> root=/dev/nbd0.
> 
> The patch was initially posted by Markus Pargmann <mpa@pengutronix.de>
> and can be found at https://lore.kernel.org/patchwork/patch/532556/
> 
> Change-Id: I78f7313918bf31b9dc01a74a42f0f068bede312c
> Signed-off-by: Roman Stratiienko <roman.stratiienko@globallogic.com>
> Reviewed-by: Aleksandr Bulyshchenko <A.Bulyshchenko@globallogic.com>

> +static int nbd_connection_negotiate(struct socket *sock, char *export_name,
> +				    size_t *rsize, u16 *nflags)
> +{

> +
> +	ret = sock_xmit_buf(sock, 0, &flags, sizeof(flags));
> +	if (ret < 0)
> +		return ret;
> +
> +	*nflags = ntohs(flags);
> +
> +	client_flags = 0;

Why not reply with NBD_FLAG_C_FIXED_NEWSTYLE? Which in turn would be
necessary...
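
Something along these lines (sketch only; neither NBD_FLAG_FIXED_NEWSTYLE
nor NBD_FLAG_C_FIXED_NEWSTYLE is defined in this patch, both are bit 0 per
the NBD spec):

	if (*nflags & NBD_FLAG_FIXED_NEWSTYLE)
		client_flags |= NBD_FLAG_C_FIXED_NEWSTYLE;
	client_flags = htonl(client_flags);	/* 32-bit big-endian field */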

> +
> +	ret = sock_xmit_buf(sock, 1, &client_flags, sizeof(client_flags));
> +	if (ret < 0)
> +		return ret;
> +
> +	magic = cpu_to_be64(nbd_opts_magic);
> +	ret = sock_xmit_buf(sock, 1, &magic, sizeof(magic));
> +	if (ret < 0)
> +		return ret;
> +
> +	opt = htonl(NBD_OPT_EXPORT_NAME);

Why not use NBD_OPT_GO?  That's a better interface (better error
recovery, and some servers can even use it to advertise preferred block
sizing - particularly if a server insists on a 4k minimum block instead
of 512).

...using NBD_OPT_GO would require checking that the server advertised
FIXED_NEWSTYLE as well as replying that we intend to rely on it.
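
The request side would look roughly like this (sketch only: NBD_OPT_GO is
option 7 in the spec and is not defined in this patch; the reply loop over
NBD_REP_INFO/NBD_REP_ACK, which is the larger part of the work, is omitted):

	u32 name_len = strlen(export_name);	/* no trailing NUL on the wire */
	u32 opt = htonl(NBD_OPT_GO);
	u32 data_len = htonl(4 + name_len + 2);
	u32 be_len = htonl(name_len);
	u16 n_info = 0;				/* no NBD_INFO_* requests */

	magic = cpu_to_be64(nbd_opts_magic);	/* "IHAVEOPT" */
	ret = sock_xmit_buf(sock, 1, &magic, sizeof(magic));
	if (ret < 0)
		return ret;
	ret = sock_xmit_buf(sock, 1, &opt, sizeof(opt));
	if (ret < 0)
		return ret;
	ret = sock_xmit_buf(sock, 1, &data_len, sizeof(data_len));
	if (ret < 0)
		return ret;
	ret = sock_xmit_buf(sock, 1, &be_len, sizeof(be_len));
	if (ret < 0)
		return ret;
	ret = sock_xmit_buf(sock, 1, export_name, name_len);
	if (ret < 0)
		return ret;
	ret = sock_xmit_buf(sock, 1, &n_info, sizeof(n_info));
	if (ret < 0)
		return ret;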

Otherwise the idea looks interesting to me.
Eric Blake June 13, 2019, 1:02 p.m. UTC | #2
On 6/12/19 11:31 AM, roman.stratiienko@globallogic.com wrote:
> From: Roman Stratiienko <roman.stratiienko@globallogic.com>
> 
> Adding support to nbd to use it as a root device. This code essentially
> provides a minimal nbd-client implementation within the kernel. It opens
> a socket and performs the negotiation with the server. Afterwards it passes
> the socket to the normal nbd code to handle the connection.
> 
> The arguments for the server are passed via kernel command line.
> The kernel command line has the format
> 'nbdroot=[<SERVER_IP>:]<SERVER_PORT>/<EXPORT_NAME>'.

Did you intend for nbdroot=1234 to connect to port 1234 or to server
1234 port 10809?  Is an export name mandatory even when it is the empty
string, in which case, is the / character mandatory?  Maybe this would
be better written as:

 [<SERVER_IP>[:<SERVER_PORT]][/<EXPORT_NAME]

although that would allow nbdroot= using all defaults (will that still
do the right thing?).

Should we support nbdroot=URI, and tie this in to Rich's proposal [1] on
standardizing the set of URIs that refer to an NBD export?  It seems
like you are still limited to a TCP socket (not Unix) with no
encryption, so this would be equivalent to the URI:

nbd://[server[:port]][/export]
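
For comparison (hypothetical addresses; 10809 is the IANA-assigned NBD
port that the URI form would default to), the two spellings would line
up as:

  nbd://10.0.0.1/            <->  nbdroot=10.0.0.1:10809/
  nbd://10.0.0.1:10810/disk  <->  nbdroot=10.0.0.1:10810/disk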

[1] https://lists.debian.org/nbd/2019/06/msg00011.html
Eric Blake June 13, 2019, 1:14 p.m. UTC | #3
On 6/13/19 8:02 AM, Eric Blake wrote:
> On 6/12/19 11:31 AM, roman.stratiienko@globallogic.com wrote:
>> From: Roman Stratiienko <roman.stratiienko@globallogic.com>
>>
>> Adding support to nbd to use it as a root device. This code essentially
>> provides a minimal nbd-client implementation within the kernel. It opens
>> a socket and performs the negotiation with the server. Afterwards it passes
>> the socket to the normal nbd code to handle the connection.
>>
>> The arguments for the server are passed via kernel command line.
>> The kernel command line has the format
>> 'nbdroot=[<SERVER_IP>:]<SERVER_PORT>/<EXPORT_NAME>'.
> 
> Did you intend for nbdroot=1234 to connect to port 1234 or to server
> 1234 port 10809?  Is an export name mandatory even when it is the empty
> string, in which case, is the / character mandatory?  Maybe this would
> be better written as:
> 
>  [<SERVER_IP>[:<SERVER_PORT]][/<EXPORT_NAME]

Make that:

 [[<SERVER_IP>][:<SERVER_PORT>]][/[<EXPORT_NAME>]]

as well as a blurb that IPv6 requires use of [] around SERVER_IP to
avoid ambiguity between IP address and the port designator.  That would
allow variations such as:

nbdroot=1.2.3.4          # port 10809 and export name '' defaulted
nbdroot=1.2.3.4:10809/   # fully explicit, export name ''
nbdroot=:10810/export    # host defaulted by DHCP, rest is explicit
nbdroot=/                # host and port are defaulted, export is ''
nbdroot=[::1]:10810      # IPv6 localhost and port 10810
nbdroot=::1              # IPv6 localhost, default port 10809

[I'm aware that localhost is probably not going to work based on how
early this is encountered during the boot; rather, consider ::1 as a
placeholder for a more realistic IPv6 address]
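
A parser for that grammar could stay fairly compact. An untested sketch
(function and variable names invented, nothing below is from the patch):

/*
 * Untested sketch: parses [[<SERVER_IP>][:<SERVER_PORT>]][/[<EXPORT_NAME>]],
 * with [] allowed around an IPv6 SERVER_IP.  Modifies s in place.
 */
#include <linux/errno.h>
#include <linux/kernel.h>	/* kstrtou16() */
#include <linux/string.h>	/* strchr(), strrchr() */

static int nbdroot_parse(char *s, char **host, u16 *port, char **name)
{
	char *slash, *colon = NULL;

	*host = s;		/* "" means: take the address from DHCP */
	*port = 10809;		/* IANA-assigned NBD default */
	*name = "";

	slash = strchr(s, '/');
	if (slash) {
		*slash = '\0';
		*name = slash + 1;
	}

	if (*s == '[') {	/* [ ] disambiguates an IPv6 literal */
		char *close = strchr(s, ']');

		if (!close)
			return -EINVAL;
		*host = s + 1;
		*close = '\0';
		colon = strchr(close + 1, ':');
	} else {
		colon = strrchr(s, ':');
		/* more than one ':' means a bare IPv6 address, no port */
		if (colon && colon != strchr(s, ':'))
			colon = NULL;
	}

	if (colon) {
		*colon = '\0';
		return kstrtou16(colon + 1, 10, port);
	}

	return 0;
}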

> 
> although that would allow nbdroot= using all defaults (will that still
> do the right thing?).

nbdroot=                # host from DHCP, port 10809, export ''

> 
> Should we support nbdroot=URI, and tie this in to Rich's proposal [1] on
> standardizing the set of URIs that refer to an NBD export?  It seems
> like you are still limited to a TCP socket (not Unix) with no
> encryption, so this would be equivalent to the URI:
> 
> nbd://[server[:port]][/export]
> 
> [1] https://lists.debian.org/nbd/2019/06/msg00011.html
>
Josef Bacik June 13, 2019, 1:52 p.m. UTC | #4
On Wed, Jun 12, 2019 at 07:31:44PM +0300, roman.stratiienko@globallogic.com wrote:
> From: Roman Stratiienko <roman.stratiienko@globallogic.com>
> 
> Adding support to nbd to use it as a root device. This code essentially
> provides a minimal nbd-client implementation within the kernel. It opens
> a socket and performs the negotiation with the server. Afterwards it passes
> the socket to the normal nbd code to handle the connection.
> 
> The arguments for the server are passed via kernel command line.
> The kernel command line has the format
> 'nbdroot=[<SERVER_IP>:]<SERVER_PORT>/<EXPORT_NAME>'.
> SERVER_IP is optional. If it is not available, it will use the
> root_server_addr transmitted through DHCP.
> 
> Based on those arguments, the connection to the server is established
> and bound to the nbd0 device. The root device therefore is
> root=/dev/nbd0.
> 
> The patch was initially posted by Markus Pargmann <mpa@pengutronix.de>
> and can be found at https://lore.kernel.org/patchwork/patch/532556/
> 
> Change-Id: I78f7313918bf31b9dc01a74a42f0f068bede312c
> Signed-off-by: Roman Stratiienko <roman.stratiienko@globallogic.com>
> Reviewed-by: Aleksandr Bulyshchenko <A.Bulyshchenko@globallogic.com>

Just throw nbd-client in your initramfs.  Every nbd server has its own
handshake protocol; embedding one particular server's handshake protocol into the
kernel isn't the answer here.  Thanks,

Josef
Roman Stratiienko June 13, 2019, 2:45 p.m. UTC | #5
On Thu, Jun 13, 2019 at 4:52 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Wed, Jun 12, 2019 at 07:31:44PM +0300, roman.stratiienko@globallogic.com wrote:
> > From: Roman Stratiienko <roman.stratiienko@globallogic.com>
> >
> > Adding support to nbd to use it as a root device. This code essentially
> > provides a minimal nbd-client implementation within the kernel. It opens
> > a socket and performs the negotiation with the server. Afterwards it passes
> > the socket to the normal nbd code to handle the connection.
> >
> > The arguments for the server are passed via kernel command line.
> > The kernel command line has the format
> > 'nbdroot=[<SERVER_IP>:]<SERVER_PORT>/<EXPORT_NAME>'.
> > SERVER_IP is optional. If it is not available, it will use the
> > root_server_addr transmitted through DHCP.
> >
> > Based on those arguments, the connection to the server is established
> > and bound to the nbd0 device. The root device therefore is
> > root=/dev/nbd0.
> >
> > The patch was initially posted by Markus Pargmann <mpa@pengutronix.de>
> > and can be found at https://lore.kernel.org/patchwork/patch/532556/
> >
> > Change-Id: I78f7313918bf31b9dc01a74a42f0f068bede312c
> > Signed-off-by: Roman Stratiienko <roman.stratiienko@globallogic.com>
> > Reviewed-by: Aleksandr Bulyshchenko <A.Bulyshchenko@globallogic.com>
>
> Just throw nbd-client in your initramfs.  Every nbd server has its own
> handshake protocol; embedding one particular server's handshake protocol into the
> kernel isn't the answer here.  Thanks,
>
> Josef

Hello Josef,

Let me share some of the thoughts that were the motivation for
this solution:

We chose NBD as a tool to run CI tests on our platforms.
We have a wide range of different BSPs with different kinds of images
where using NFSROOT is hard or even impossible.
Most of these BSPs do not use an initramfs, and some of them are Android-based.

Taking all this into account, we would have to put significant effort into
implementing and testing a custom initramfs, and it would not cover all our
needs.

A much easier way is to embed a small client into the kernel and just
enable the configuration when needed.

I believe such a solution will be very useful for a wide range of kernel users.

Also, as far as I know, the mainline nbd-server daemon has only 2
handshake protocols, so-called OLD-STYLE and NEW-STYLE, and OLD-STYLE
is no longer supported. So it should not be a problem; please correct
me if I'm wrong.
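
For reference, the two greetings differ like this (field sizes in bytes,
as I read the NBD spec; the constants are from the spec, not from this
patch):

  NEW-STYLE (what this patch speaks):
    "NBDMAGIC"            8    passphrase
    0x49484156454F5054    8    "IHAVEOPT" (nbd_opts_magic here)
    handshake flags       2    e.g. NBD_FLAG_FIXED_NEWSTYLE
    option haggling follows, ending with NBD_OPT_EXPORT_NAME or NBD_OPT_GO

  OLD-STYLE (dropped from nbd-server):
    "NBDMAGIC"            8    passphrase
    0x00420281861253      8    cliserv magic
    export size           8
    export flags          4
    zeroes                124  (no option phase at all)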

Regards,
Roman
Josef Bacik June 13, 2019, 2:55 p.m. UTC | #6
On Thu, Jun 13, 2019 at 05:45:13PM +0300, Roman Stratiienko wrote:
> On Thu, Jun 13, 2019 at 4:52 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Wed, Jun 12, 2019 at 07:31:44PM +0300, roman.stratiienko@globallogic.com wrote:
> > > From: Roman Stratiienko <roman.stratiienko@globallogic.com>
> > >
> > > Adding support to nbd to use it as a root device. This code essentially
> > > provides a minimal nbd-client implementation within the kernel. It opens
> > > a socket and performs the negotiation with the server. Afterwards it passes
> > > the socket to the normal nbd code to handle the connection.
> > >
> > > The arguments for the server are passed via kernel command line.
> > > The kernel command line has the format
> > > 'nbdroot=[<SERVER_IP>:]<SERVER_PORT>/<EXPORT_NAME>'.
> > > SERVER_IP is optional. If it is not available, it will use the
> > > root_server_addr transmitted through DHCP.
> > >
> > > Based on those arguments, the connection to the server is established
> > > and bound to the nbd0 device. The root device therefore is
> > > root=/dev/nbd0.
> > >
> > > The patch was initially posted by Markus Pargmann <mpa@pengutronix.de>
> > > and can be found at https://lore.kernel.org/patchwork/patch/532556/
> > >
> > > Change-Id: I78f7313918bf31b9dc01a74a42f0f068bede312c
> > > Signed-off-by: Roman Stratiienko <roman.stratiienko@globallogic.com>
> > > Reviewed-by: Aleksandr Bulyshchenko <A.Bulyshchenko@globallogic.com>
> >
> > Just throw nbd-client in your initramfs.  Every nbd server has its own
> > handshake protocol; embedding one particular server's handshake protocol into the
> > kernel isn't the answer here.  Thanks,
> >
> > Josef
> 
> Hello Josef,
> 
> Let me share some of the thoughts that were the motivation for
> this solution:
> 
> We chose NBD as a tool to run CI tests on our platforms.
> We have a wide range of different BSPs with different kinds of images
> where using NFSROOT is hard or even impossible.
> Most of these BSPs do not use an initramfs, and some of them are Android-based.
> 
> Taking all this into account, we would have to put significant effort into
> implementing and testing a custom initramfs, and it would not cover all our
> needs.
> 
> A much easier way is to embed a small client into the kernel and just
> enable the configuration when needed.
> 
> I believe such a solution will be very useful for a wide range of kernel users.
> 
> Also, as far as I know, the mainline nbd-server daemon has only 2
> handshake protocols, so-called OLD-STYLE and NEW-STYLE, and OLD-STYLE
> is no longer supported. So it should not be a problem; please correct
> me if I'm wrong.
> 

I don't doubt you have a good reason to want it; I'm just not clear on why an
initramfs isn't an option.  You have this special kernel with your special
option, and you manage to get these things to boot your special kernel, right?
So why is an initramfs with a tiny nbd-client binary in there not an option?

Also I mean that there are a bunch of different nbd servers out there.  We have
our own here at Facebook, qemu has one, IIRC there's a ceph one.  They all have
their own connection protocols.  The beauty of NBD is that it doesn't have to
know about that part, it just does the block device part, and I'd really rather
leave it that way.  Thanks,

Josef
Eric Blake June 13, 2019, 3:14 p.m. UTC | #7
On 6/13/19 9:45 AM, Roman Stratiienko wrote:

>>
>> Just throw nbd-client in your initramfs.  Every nbd server has its own
>> handshake protocol; embedding one particular server's handshake protocol into the
>> kernel isn't the answer here.  Thanks,

The handshake protocol is well-specified:
https://github.com/NetworkBlockDevice/nbd/blob/cdb0bc57f3faefd7a5562d57ad57cd990781c415/doc/proto.md

All servers implement various subsets of that document for the handshake.

> Also, as far as I know, the mainline nbd-server daemon has only 2
> handshake protocols, so-called OLD-STYLE and NEW-STYLE, and OLD-STYLE
> is no longer supported. So it should not be a problem; please correct
> me if I'm wrong.

You are correct that oldstyle is no longer recommended. However, the
current NBD specification states that newstyle has two different
flavors, NBD_OPT_EXPORT_NAME (which you used, but is also old) and
NBD_OPT_GO (which is newer, but is more likely to encounter differences
where not all servers support it).

The NBD specification includes a compatibility baseline:
https://github.com/NetworkBlockDevice/nbd/blob/cdb0bc57f3faefd7a5562d57ad57cd990781c415/doc/proto.md#compatibility-and-interoperability

and right now, NBD_OPT_GO (and _not_ NBD_OPT_EXPORT_NAME) is the
preferred way forward.  As long as your handshake implementation
complies with the baseline documented there, you'll have maximum
portability to the largest number of servers that also support the
baseline - but not all servers are up to that baseline yet.

So, this becomes a question of how much are you reinventing baseline
portability handshake concerns in the kernel, vs. in initramfs.
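
(Concretely, the baseline behaviour boils down to something like the
sketch below; send_opt() and read_opt_reply() are invented helpers, not
code from this patch:

	ret = send_opt(sock, NBD_OPT_GO, export_name);
	if (!ret)
		ret = read_opt_reply(sock, &reply);
	if (!ret && reply.type == NBD_REP_ERR_UNSUP)
		/* server predates NBD_OPT_GO; fall back */
		ret = send_opt(sock, NBD_OPT_EXPORT_NAME, export_name);

and that NBD_REP_ERR_* error path is exactly what NBD_OPT_EXPORT_NAME
cannot give you.)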
Roman Stratiienko June 13, 2019, 4:05 p.m. UTC | #8
On Thu, Jun 13, 2019 at 6:14 PM Eric Blake <eblake@redhat.com> wrote:
>
> On 6/13/19 9:45 AM, Roman Stratiienko wrote:
>
> >>
> >> Just throw nbd-client in your initramfs.  Every nbd server has its own
> >> handshake protocol; embedding one particular server's handshake protocol into the
> >> kernel isn't the answer here.  Thanks,
>
> The handshake protocol is well-specified:
> https://github.com/NetworkBlockDevice/nbd/blob/cdb0bc57f3faefd7a5562d57ad57cd990781c415/doc/proto.md
>
> All servers implement various subsets of that document for the handshake.
>
> > Also, as far as I know, the mainline nbd-server daemon has only 2
> > handshake protocols, so-called OLD-STYLE and NEW-STYLE, and OLD-STYLE
> > is no longer supported. So it should not be a problem; please correct
> > me if I'm wrong.
>
> You are correct that oldstyle is no longer recommended. However, the
> current NBD specification states that newstyle has two different
> flavors, NBD_OPT_EXPORT_NAME (which you used, but is also old) and
> NBD_OPT_GO (which is newer, but is more likely to encounter differences
> where not all servers support it).
>
> The NBD specification includes a compatibility baseline:
> https://github.com/NetworkBlockDevice/nbd/blob/cdb0bc57f3faefd7a5562d57ad57cd990781c415/doc/proto.md#compatibility-and-interoperability
>
> and right now, NBD_OPT_GO (and _not_ NBD_OPT_EXPORT_NAME) is the
> preferred way forward.  As long as your handshake implementation
> complies with the baseline documented there, you'll have maximum
> portability to the largest number of servers that also support the
> baseline - but not all servers are up to that baseline yet.
>
> So, this becomes a question of how much are you reinventing baseline
> portability handshake concerns in the kernel, vs. in initramfs.
>
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.           +1-919-301-3226
> Virtualization:  qemu.org | libvirt.org
>

Thank you for the review comments; I will address them in v2.

--
Regards,
Roman
Roman Stratiienko June 13, 2019, 4:21 p.m. UTC | #9
> I don't doubt you have a good reason to want it; I'm just not clear on why an
> initramfs isn't an option.  You have this special kernel with your special
> option, and you manage to get these things to boot your special kernel, right?
> So why is an initramfs with a tiny nbd-client binary in there not an option?
>
> Also I mean that there are a bunch of different nbd servers out there.  We have
> our own here at Facebook, qemu has one, IIRC there's a ceph one.  They all have
> their own connection protocols.  The beauty of NBD is that it doesn't have to
> know about that part, it just does the block device part, and I'd really rather
> leave it that way.  Thanks,
>
> Josef


The only reason I prefer embedding the client into the kernel is to save
valuable engineering time on creating and supporting a custom initramfs,
which is not so easy, especially on Android-based systems.

Taking into account that using NBD would require creating a custom
initramfs just for automated testing, many companies would choose
manual deployment instead, which is bad for progress.

I believe that all users of non-standard NBD handshake protocols could
continue to use custom nbd-clients.

Whether you accept this patch or not, I would like it to pass review
from the maintainers and other people involved in NBD development, thus
moving a step closer to getting this mainlined in the future.

--
Regards,
Roman
Josef Bacik June 13, 2019, 4:54 p.m. UTC | #10
On Thu, Jun 13, 2019 at 07:21:43PM +0300, Roman Stratiienko wrote:
> > I don't doubt you have a good reason to want it; I'm just not clear on why an
> > initramfs isn't an option.  You have this special kernel with your special
> > option, and you manage to get these things to boot your special kernel, right?
> > So why is an initramfs with a tiny nbd-client binary in there not an option?
> >
> > Also I mean that there are a bunch of different nbd servers out there.  We have
> > our own here at Facebook, qemu has one, IIRC there's a ceph one.  They all have
> > their own connection protocols.  The beauty of NBD is that it doesn't have to
> > know about that part, it just does the block device part, and I'd really rather
> > leave it that way.  Thanks,
> >
> > Josef
> 
> 
> The only reason I prefer embedding the client into the kernel is to save
> valuable engineering time on creating and supporting a custom initramfs,
> which is not so easy, especially on Android-based systems.
> 

I'm unconvinced that creating an initramfs is any harder than building your own
kernel with this patch and providing the configuration information to the
device, especially for Android, where we inside of Facebook provision Android
devices with a custom initramfs all the time, and have done so for many, many
years.

> Taking into account that using NBD would require creating a custom
> initramfs just for automated testing, many companies would choose
> manual deployment instead, which is bad for progress.
> 
> I believe that all users of non-standard NBD handshake protocols could
> continue to use custom nbd-clients.

My point is that there is no "standard NBD handshake protocol".  That part
exists outside of the portion of the NBD spec that the kernel is concerned
with.  The client is happy to do whatever it pleases.  Now there's no doubt
that the standard nbd client and server that are shipped are the reference
implementation, but my point is that it is still outside the kernel and so
able to change as it sees fit.

> 
> Whether you accept this patch or not, I would like it to pass review
> from the maintainers and other people involved in NBD development, thus
> moving a step closer to getting this mainlined in the future.

...I am the maintainer, but feel free to try to get Jens to take a patch that I
think is unnecessary.  Thanks,

Josef
Wouter Verhelst June 14, 2019, 10:33 a.m. UTC | #11
On Thu, Jun 13, 2019 at 10:55:36AM -0400, Josef Bacik wrote:
> Also I mean that there are a bunch of different nbd servers out there.  We have
> our own here at Facebook, qemu has one, IIRC there's a ceph one.

I can't claim to know about the Facebook one of course, but the qemu one
uses the same handshake protocol as anyone else. The ceph ones that I've
seen do too (but there are various implementations of that, so...).

> They all have their own connection protocols.  The beauty of NBD is
> that it doesn't have to know about that part, it just does the block
> device part, and I'd really rather leave it that way.  Thanks,

Sure.

OTOH, there is definitely also a benefit to using the same handshake
protocol everywhere, for interoperability reasons.
Josef Bacik June 14, 2019, 1:38 p.m. UTC | #12
On Fri, Jun 14, 2019 at 12:33:43PM +0200, Wouter Verhelst wrote:
> On Thu, Jun 13, 2019 at 10:55:36AM -0400, Josef Bacik wrote:
> > Also I mean that there are a bunch of different nbd servers out there.  We have
> > our own here at Facebook, qemu has one, IIRC there's a ceph one.
> 
> I can't claim to know about the Facebook one of course, but the qemu one
> uses the same handshake protocol as anyone else. The ceph ones that I've
> seen do too (but there are various implementations of that, so...).
> 

Ah, for some reason I remembered Qemu's being distinctly different.

I suppose if most of the main ones people use are using the same handshake
protocol, that makes it more compelling.  But there'd have to be a really good
reason why an initramfs isn't viable, and so far I haven't heard a solid reason
why it's not an option other than "it's hard and we don't want to do it."

> > They all have their own connection protocols.  The beauty of NBD is
> > that it doesn't have to know about that part, it just does the block
> > device part, and I'd really rather leave it that way.  Thanks,
> 
> Sure.
> 
> OTOH, there is definitely also a benefit to using the same handshake
> protocol everywhere, for interoperability reasons.
> 

Sure, Facebook's isn't different because we hate the protocol; we just use
Thrift for all of our services, and thus it makes sense for us to use Thrift for
the client connection stuff to make it easy on all the apps that use disagg.
Thanks,

Josef
Wouter Verhelst June 14, 2019, 9:23 p.m. UTC | #13
On Fri, Jun 14, 2019 at 09:38:04AM -0400, Josef Bacik wrote:
> On Fri, Jun 14, 2019 at 12:33:43PM +0200, Wouter Verhelst wrote:
> > On Thu, Jun 13, 2019 at 10:55:36AM -0400, Josef Bacik wrote:
> > > Also I mean that there are a bunch of different nbd servers out there.  We have
> > > our own here at Facebook, qemu has one, IIRC there's a ceph one.
> > 
> > I can't claim to know about the Facebook one of course, but the qemu one
> > uses the same handshake protocol as anyone else. The ceph ones that I've
> > seen do too (but there are various implementations of that, so...).
> > 
> 
> Ah, for some reason I remembered Qemu's being distinctly different.
> 
> I suppose if most of the main ones people use are using the same handshake
> protocol, that makes it more compelling.  But there'd have to be a really good
> reason why an initramfs isn't viable, and so far I haven't heard a solid reason
> why it's not an option other than "it's hard and we don't want to do it."

Oh, I agree with that. It's not hard at all; I'm aware of two
implementations of doing that (I've written an nbd initramfs hook for
Debian's initramfs infrastructure, and others have done it for dracut).
I'm assuming that buildroot will have an initramfs framework too (and if
it doesn't, then that probably just means someone should write it), and
then writing an nbd initramfs hook for that framework should be
reasonably easy.

Also, the proposed initramfs configuration protocol uses the same
"nbdroot" name as my Debian initramfs hooks, but in an incompatible way.
I'd really like to see that be done differently, before/if it's accepted
at all ;-)

Patch

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 20bb4bfa4be6..e17f2376de60 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -273,6 +273,25 @@  config BLK_DEV_NBD
 
 	  If unsure, say N.
 
+config BLK_DEV_NBDROOT
+	bool "Early network block device client support"
+	depends on BLK_DEV_NBD=y
+	---help---
+	  Saying yes will enable kernel NBD client support. This allows you to
+	  connect an entire disk with multiple partitions before mounting rootfs.
+
+	  The arguments for the server are passed via kernel command line.
+	  The kernel command line has the format
+	  'nbdroot=[<SERVER_IP>:]<SERVER_PORT>/<EXPORT_NAME>'.
+	  SERVER_IP is optional. If it is not available, it will use the
+	  root_server_addr transmitted through DHCP.
+
+	  Based on those arguments, the connection to the server is established
+	  and bound to the nbd0 device. The root device therefore is
+	  root=/dev/nbd0.
+
+	  If unsure, say N.
+
 config BLK_DEV_SKD
 	tristate "STEC S1120 Block Driver"
 	depends on PCI
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 63fcfb38e640..cb5e60419e07 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -46,6 +46,35 @@ 
 #define CREATE_TRACE_POINTS
 #include <trace/events/nbd.h>
 
+#include <net/ipconfig.h>
+#include <linux/in.h>
+#include <linux/tcp.h>
+#include <linux/nfs_fs.h>
+#include <linux/nfs.h>
+
+#define ADDR_NONE cpu_to_be32(INADDR_NONE)
+
+static const char nbd_magic[] = "NBDMAGIC";
+static const u64 nbd_opts_magic = 0x49484156454F5054LL;
+
+/* Options used for the kernel driver */
+#define NBD_OPT_EXPORT_NAME 1
+
+#define NBD_DEFAULT_BLOCKSIZE 512 /* bytes */
+
+#define NBD_DEFAULT_TIMEOUT 2 /* seconds */
+
+#define NBD_MAXPATHLEN  NFS_MAXPATHLEN
+
+struct nbdroot {
+	const char *bdev;
+	__be32 server_addr;
+	__be32 server_port;
+	loff_t block_size;
+	int timeout;
+	char server_export[NBD_MAXPATHLEN + 1];
+};
+
 static DEFINE_IDR(nbd_index_idr);
 static DEFINE_MUTEX(nbd_index_mutex);
 static int nbd_total_devices = 0;
@@ -441,6 +470,16 @@  static int sock_xmit(struct socket *sock, int send,
 	return result;
 }
 
+static int sock_xmit_buf(struct socket *sock, int send,
+			 void *buf, size_t size)
+{
+	struct iov_iter iter;
+	struct kvec iov = {.iov_base = buf, .iov_len = size};
+
+	iov_iter_kvec(&iter, WRITE | ITER_KVEC, &iov, 1, size);
+	return sock_xmit(sock, send, &iter, 0, 0);
+}
+
 static int nbd_xmit(struct nbd_device *nbd, int index, int send,
 		     struct iov_iter *iter, int msg_flags, int *sent)
 {
@@ -2301,6 +2340,261 @@  static void __exit nbd_cleanup(void)
 	unregister_blkdev(NBD_MAJOR, "nbd");
 }
 
+#ifdef CONFIG_BLK_DEV_NBDROOT
+
+struct nbdroot nbdroot_0 = {.bdev = "nbd0",
+			    .server_export = "",
+			    .server_addr = ADDR_NONE,
+			    .timeout = NBD_DEFAULT_TIMEOUT,
+			    .block_size = NBD_DEFAULT_BLOCKSIZE};
+
+static int nbd_connect(struct nbdroot *nbdroot, struct socket **socket)
+{
+	struct socket *sock;
+	struct sockaddr_in sockaddr;
+	int err;
+	char val;
+
+	err = sock_create_kern(&init_net, AF_INET, SOCK_STREAM,
+			       IPPROTO_TCP, &sock);
+	if (err < 0)
+		return err;
+
+	sockaddr.sin_family = AF_INET;
+	sockaddr.sin_addr.s_addr = nbdroot->server_addr;
+	sockaddr.sin_port = nbdroot->server_port;
+
+	val = 1;
+	sock->ops->setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &val,
+			      sizeof(val));
+
+	err = sock->ops->connect(sock, (struct sockaddr *)&sockaddr,
+				 sizeof(sockaddr), 0);
+	if (err < 0)
+		return err;
+
+	*socket = sock;
+
+	return 0;
+}
+
+static int nbd_connection_negotiate(struct socket *sock, char *export_name,
+				    size_t *rsize, u16 *nflags)
+{
+	char buf[256];
+	int ret;
+	u64 magic;
+	u16 flags;
+	u32 client_flags;
+	u32 opt;
+	u32 name_len;
+	u64 nbd_size;
+
+	ret = sock_xmit_buf(sock, 0, buf, 8);
+	if (ret < 0)
+		return ret;
+
+	if (strncmp(buf, nbd_magic, 8))
+		return -EINVAL;
+
+	ret = sock_xmit_buf(sock, 0, &magic, sizeof(magic));
+	if (ret < 0)
+		return ret;
+	magic = be64_to_cpu(magic);
+
+	if (magic != nbd_opts_magic)
+		return -EINVAL;
+
+	ret = sock_xmit_buf(sock, 0, &flags, sizeof(flags));
+	if (ret < 0)
+		return ret;
+
+	*nflags = ntohs(flags);
+
+	client_flags = 0;
+
+	ret = sock_xmit_buf(sock, 1, &client_flags, sizeof(client_flags));
+	if (ret < 0)
+		return ret;
+
+	magic = cpu_to_be64(nbd_opts_magic);
+	ret = sock_xmit_buf(sock, 1, &magic, sizeof(magic));
+	if (ret < 0)
+		return ret;
+
+	opt = htonl(NBD_OPT_EXPORT_NAME);
+	ret = sock_xmit_buf(sock, 1, &opt, sizeof(opt));
+	if (ret < 0)
+		return ret;
+
+	name_len = strlen(export_name) + 1;
+	name_len = htonl(name_len);
+	ret = sock_xmit_buf(sock, 1, &name_len, sizeof(name_len));
+	if (ret < 0)
+		return ret;
+
+	ret = sock_xmit_buf(sock, 1, export_name, strlen(export_name) + 1);
+	if (ret < 0)
+		return ret;
+
+	ret = sock_xmit_buf(sock, 0, &nbd_size, sizeof(nbd_size));
+	if (ret < 0)
+		return ret;
+	nbd_size = be64_to_cpu(nbd_size);
+
+	ret = sock_xmit_buf(sock, 0, &flags, sizeof(flags));
+	if (ret < 0)
+		return ret;
+	*nflags = ntohs(flags);
+
+	ret = sock_xmit_buf(sock, 0, buf, 124);
+	if (ret < 0)
+		return ret;
+
+	*rsize = nbd_size;
+
+	return 0;
+}
+
+static int nbd_bind_connection(struct nbdroot *nbdroot, struct nbd_device *nbd,
+			       struct socket *sock, size_t rsize, u32 flags)
+{
+	int conn, ret;
+	struct block_device *bdev = blkdev_get_by_dev(disk_devt(nbd->disk),
+					FMODE_READ | FMODE_WRITE, 0);
+
+	if (IS_ERR(bdev)) {
+		pr_err("nbdroot: blkdev_get_by_dev failed %ld\n",
+		       PTR_ERR(bdev));
+		return PTR_ERR(bdev);
+	}
+
+	conn = nbd->config->num_connections;
+	ret = nbd_add_socket(nbd, sock, false);
+	if (ret) {
+		pr_err("nbdroot: add socket failed %d\n", ret);
+		return ret;
+	}
+
+	mutex_lock(&nbd->config->socks[conn]->tx_lock);
+
+	nbd->config->flags = flags;
+
+	nbd_size_set(nbd, nbdroot->block_size,
+		     div_s64(rsize, nbdroot->block_size));
+
+	nbd->tag_set.timeout = nbdroot->timeout * HZ;
+	blk_queue_rq_timeout(nbd->disk->queue, nbdroot->timeout * HZ);
+
+	mutex_unlock(&nbd->config->socks[conn]->tx_lock);
+
+	ret = nbd_start_device_ioctl(nbd, bdev);
+	if (ret) {
+		pr_err("nbdroot: start device ioctl failed %d\n", ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+static int nbdroot_thread(void *arg)
+{
+	struct nbdroot *nbdroot = (struct nbdroot *)arg;
+	struct socket *sock = 0;
+	size_t rsize;
+	u16 nflags;
+	int ret;
+	dev_t devt = blk_lookup_devt(nbdroot->bdev, 0);
+	struct gendisk *disk = get_gendisk(devt, &ret);
+	struct nbd_device *nbd = (struct nbd_device *)disk->private_data;
+
+	ret = nbd_connect(nbdroot, &sock);
+	if (ret) {
+		pr_err("nbdroot: connect failed %d\n", ret);
+		goto err;
+	}
+
+	ret = nbd_connection_negotiate(sock, nbdroot->server_export,
+				       &rsize, &nflags);
+	if (ret) {
+		pr_err("nbdroot: negotiation failed %d\n", ret);
+		goto err;
+	}
+
+	ret = nbd_bind_connection(nbdroot, nbd, sock, rsize, nflags);
+	if (ret) {
+		pr_err("nbdroot: nbd_bind_connection failed %d\n", ret);
+		goto err;
+	}
+	return 0;
+
+err:
+	pr_err("nbdroot: %s init failed, IP: %pI4, port: %i, export: %s\n",
+	       nbdroot->bdev, &nbdroot->server_addr,
+	       ntohs(nbdroot->server_port), nbdroot->server_export);
+
+	if (sock)
+		sock_release(sock);
+
+	return ret;
+}
+
+static int __init nbdroot_init(void)
+{
+	if (nbdroot_0.server_port != 0)
+		kthread_run(nbdroot_thread, &nbdroot_0, "nbdroot_0");
+
+	return 0;
+}
+
+/* We need this in late_initcall_sync to be sure that the network is set up */
+late_initcall_sync(nbdroot_init);
+
+/*
+ * Parse format "[<SERVER_IP>:]<SERVER_PORT>/<EXPORT_NAME>"
+ */
+static int __init nbdroot_setup(char *line)
+{
+	struct nbdroot *nbdroot = &nbdroot_0;
+	char *export;
+	u16 port;
+	int ret;
+	char buf[NBD_MAXPATHLEN + 1];
+
+	strlcpy(buf, line, sizeof(buf) - 1);
+
+	nbdroot->server_addr = root_nfs_parse_addr(buf);
+
+	if (*buf == '\0')
+		return -EINVAL;
+
+	if (nbdroot->server_addr == ADDR_NONE) {
+		if (root_server_addr == ADDR_NONE) {
+			pr_err("nbdroot: Failed to find server address\n");
+			return -EINVAL;
+		}
+		nbdroot->server_addr = root_server_addr;
+	}
+
+	export = strchr(buf, '/');
+	*export = '\0';
+	++export;
+
+	ret = kstrtou16(buf, 10, &port);
+	if (ret)
+		return ret;
+
+	nbdroot->server_port = htons(port);
+	strlcpy(nbdroot->server_export, export,
+		sizeof(nbdroot->server_export) - 1);
+
+	return 0;
+}
+
+__setup("nbdroot=", nbdroot_setup);
+
+#endif /* CONFIG_BLK_DEV_NBDROOT */
+
 module_init(nbd_init);
 module_exit(nbd_cleanup);