mbox series

[V2,0/1] ubd: add io_uring based userspace block driver

Message ID 20220517055358.3164431-1-ming.lei@redhat.com (mailing list archive)
Headers show
Series ubd: add io_uring based userspace block driver | expand

Message

Ming Lei May 17, 2022, 5:53 a.m. UTC
Hello Guys,

ubd driver is one kernel driver for implementing generic userspace block
device/driver, which delivers io request from ubd block device(/dev/ubdbN) into
ubd server[1] which is the userspace part of ubd for communicating
with ubd driver and handling specific io logic by its target module.

Another thing ubd driver handles is to copy data between user space buffer
and request/bio's pages, or take zero copy if mm is ready for support it in
future. ubd driver doesn't handle any IO logic of the specific driver, so
it is small/simple, and all io logics are done by the target code in ubdserver.

The above two are main jobs done by ubd driver.

ubd driver can help to move IO logic into userspace, in which the
development work is easier/more effective than doing in kernel, such as,
ubd-loop takes < 200 lines of loop specific code to get basically same 
function with kernel loop block driver, meantime the performance is
still good. ubdsrv[1] provide built-in test for comparing both by running
"make test T=loop".

Another example is high performance qcow2 support[2], which could be built with
ubd framework more easily than doing it inside kernel.

Also there are more people who express interests on userspace block driver[3],
Gabriel Krisman Bertazi proposes this topic in lsf/mm/ebpf 2022 and mentioned
requirement from Google. Ziyang Zhang from Alibaba said they "plan to
replace TCMU by UBD as a new choice" because UBD can get better throughput than
TCMU even with single queue[4], meantime UBD is simple. Also there is userspace
storage service for providing storage to containers.

It is io_uring based: io request is delivered to userspace via new added
io_uring command which has been proved as very efficient for making nvme
passthrough IO to get better IOPS than io_uring(READ/WRITE). Meantime one
shared/mmap buffer is used for sharing io descriptor to userspace, the
buffer is readonly for userspace, each IO just takes 24bytes so far.
It is suggested to use io_uring in userspace(target part of ubd server)
to handle IO request too. And it is still easy for ubdserver to support
io handling by non-io_uring, and this work isn't done yet, but can be
supported easily with help o eventfd.

This way is efficient since no extra io command copy is required, no sleep
is needed in transferring io command to userspace. Meantime the communication
protocol is simple and efficient, one single command of
UBD_IO_COMMIT_AND_FETCH_REQ can handle both fetching io request desc and commit
command result in one trip. IO handling is often batched after single
io_uring_enter() returns, both IO requests from ubd server target and
IO commands could be handled as a whole batch.

Remove RFC now because ubd driver codes gets lots of cleanup, enhancement and
bug fixes since V1:

- cleanup uapi: remove ubd specific error code,  switch to linux error code,
remove one command op, remove one field from cmd_desc

- add monitor mechanism to handle ubq_daemon being killed, ubdsrv[1]
  includes builtin tests for covering heavy IO with deleting ubd / killing
  ubq_daemon at the same time, and V2 pass all the two tests(make test T=generic),
  and the abort/stop mechanism is simple

- fix MQ command buffer mmap bug, and now 'xfstetests -g auto' works well on
  MQ ubd-loop devices(test/scratch)

- improve batching submission as suggested by Jens

- improve handling for starting device, replace random wait/poll with
completion

- all kinds of cleanup, bug fix,..

And the patch by patch change since V1 can be found in the following
tree:

https://github.com/ming1/linux/commits/my_for-5.18-ubd-devel_v2

Todo:
	- add lazy user page release for avoiding cost of pinning user pages in
	ubd_copy_pages() most of time, so we can save CPU for handling io logic
	in userpsace


[1] ubd server
https://github.com/ming1/ubdsrv/commits/devel-v2

[2] qcow2 kernel driver attempt
https://www.spinics.net/lists/kernel/msg4292965.html
https://patchwork.kernel.org/project/linux-block/cover/20190823225619.15530-1-development@manuel-bentele.de/#22841183

[3] [LSF/MM/BPF TOPIC] block drivers in user space
https://lore.kernel.org/linux-block/87tucsf0sr.fsf@collabora.com/

[4] Follow up on UBD discussion
https://lore.kernel.org/linux-block/YnsW+utCrosF0lvm@T590/#r

Ming Lei (1):
  ubd: add io_uring based userspace block driver

 drivers/block/Kconfig        |    6 +
 drivers/block/Makefile       |    2 +
 drivers/block/ubd_drv.c      | 1444 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/ubd_cmd.h |  158 ++++
 4 files changed, 1610 insertions(+)
 create mode 100644 drivers/block/ubd_drv.c
 create mode 100644 include/uapi/linux/ubd_cmd.h

Comments

Christoph Hellwig May 17, 2022, 8:01 a.m. UTC | #1
Please pіck a new name for this driver, we already have a block driver
called ubd.
Stefan Hajnoczi May 17, 2022, 2:06 p.m. UTC | #2
Here are some more thoughts on the ubd-control device:

The current patch provides a ubd-control device for processes with
suitable permissions (i.e. root) to create, start, stop, and fetch
information about devices.

There is no isolation between devices created by one process and those
created by another. Therefore two processes that do not trust each other
cannot both use UBD without potential interference. There is also no
isolation for containers.

I think it would be a mistake to keep the ubd-control interface in its
current form since the current global/root model is limited. Instead I
suggest:
- Creating a device returns a new file descriptor instead of a global
  dev_id. The device can be started/stopped/configured through this (and
  only through this) per-device file descriptor. The device is not
  visible to other processes through ubd-control so interference is not
  possible. In order to give another process control over the device the
  fd can be passed (e.g. SCM_RIGHTS). 

Now multiple applications/containers/etc can use ubd-control without
interfering with each other. The security model still requires root
though since devices can be malicious.

FUSE allows unprivileged mounts (see fuse_allow_current_process()). Only
processes with the same uid as the FUSE daemon can access such mounts
(in the default configuration). This prevents security issues while
still allowing unprivileged use cases.

I suggest adapting the FUSE security model to block devices:
- Devices can be created without CAP_SYS_ADMIN but they have an
  'unprivileged' flag set to true.
- Unprivileged devices are not probed for partitions and LVM doesn't
  touch them. This means the kernel doesn't access these devices via
  code paths that might be exploitable.
- When another process with a different uid from ubdsrv opens an
  unprivileged device, -EACCES is returned. This protects other
  uids from the unprivileged device.
- When another process with a different uid from ubdsrv opens a
  _privileged_ device there is no special access check because ubdsrv is
  privileged.

With these changes UBD can be used by unprivileged processes and
containers. I think it's worth discussing the details and having this
model from the start so UBD can be used in a wide range of use cases.

Stefan
Liu Xiaodong May 18, 2022, 6:38 a.m. UTC | #3
On Tue, May 17, 2022 at 01:53:57PM +0800, Ming Lei wrote:
> Hello Guys,
> 
> ubd driver is one kernel driver for implementing generic userspace block
> device/driver, which delivers io request from ubd block device(/dev/ubdbN) into
> ubd server[1] which is the userspace part of ubd for communicating
> with ubd driver and handling specific io logic by its target module.
> 
> Another thing ubd driver handles is to copy data between user space buffer
> and request/bio's pages, or take zero copy if mm is ready for support it in
> future. ubd driver doesn't handle any IO logic of the specific driver, so
> it is small/simple, and all io logics are done by the target code in ubdserver.
> 
> The above two are main jobs done by ubd driver.

Hi, Lei

Your UBD implementation looks great. Its io_uring based design is interesting
and brilliant.
Towards the purpose of userspace block device, last year,
VDUSE initialized by Yongji is going to do a similar work. But VDUSE is under
vdpa. VDUSE will present a virtio-blk device to other userspace process
like containers, while serving virtio-blk req also by an userspace target.
https://lists.linuxfoundation.org/pipermail/iommu/2021-June/056956.html 

I've been working and thinking on serving RUNC container by SPDK efficiently.
But this work requires a new proper userspace block device module in kernel.
The highlevel design idea for userspace block device implementations
should be that: Using ring for IO request, so client and target can exchange
req/resp quickly in batch; Map bounce buffer between kernel and userspace
target, so another extra IO data copy like NBD can be avoid. (Oh, yes, SPDK
needs this kernel module has some more minor functions)

UBD and VDUSE are both implemented in this way, while of course each of
them has specific features and advantages.

Not like UBD which is straightforward and starts from scratch, VDUSE is
embedded in virtio framework. So its implementation is more complicated, but
all virtio frontend utilities can be leveraged.
When considering security/permission issues, feels UBD would be easier to
solve them.

So my questions are:
1. what do you think about the purpose overlap between UBD and VDUSE?
2. Could UBD be implemented with SPDK friendly functionalities? (mainly about
io data mapping, since HW devices in SPDK need to access the mapped data
buffer. Then, in function ubdsrv.c/ubdsrv_init_io_bufs(),
"addr = mmap(,,,,dev->cdev_fd,)", SPDK needs to know the PA of "addr".
Also memory pointed by "addr" should be pinned all the time.)

Thanks
Xiaodong

> 
> ubd driver can help to move IO logic into userspace, in which the
> development work is easier/more effective than doing in kernel, such as,
> ubd-loop takes < 200 lines of loop specific code to get basically same 
> function with kernel loop block driver, meantime the performance is
> still good. ubdsrv[1] provide built-in test for comparing both by running
> "make test T=loop".
> 
> Another example is high performance qcow2 support[2], which could be built with
> ubd framework more easily than doing it inside kernel.
> 
> Also there are more people who express interests on userspace block driver[3],
> Gabriel Krisman Bertazi proposes this topic in lsf/mm/ebpf 2022 and mentioned
> requirement from Google. Ziyang Zhang from Alibaba said they "plan to
> replace TCMU by UBD as a new choice" because UBD can get better throughput than
> TCMU even with single queue[4], meantime UBD is simple. Also there is userspace
> storage service for providing storage to containers.
> 
> It is io_uring based: io request is delivered to userspace via new added
> io_uring command which has been proved as very efficient for making nvme
> passthrough IO to get better IOPS than io_uring(READ/WRITE). Meantime one
> shared/mmap buffer is used for sharing io descriptor to userspace, the
> buffer is readonly for userspace, each IO just takes 24bytes so far.
> It is suggested to use io_uring in userspace(target part of ubd server)
> to handle IO request too. And it is still easy for ubdserver to support
> io handling by non-io_uring, and this work isn't done yet, but can be
> supported easily with help o eventfd.
> 
> This way is efficient since no extra io command copy is required, no sleep
> is needed in transferring io command to userspace. Meantime the communication
> protocol is simple and efficient, one single command of
> UBD_IO_COMMIT_AND_FETCH_REQ can handle both fetching io request desc and commit
> command result in one trip. IO handling is often batched after single
> io_uring_enter() returns, both IO requests from ubd server target and
> IO commands could be handled as a whole batch.
> 
> Remove RFC now because ubd driver codes gets lots of cleanup, enhancement and
> bug fixes since V1:
> 
> - cleanup uapi: remove ubd specific error code,  switch to linux error code,
> remove one command op, remove one field from cmd_desc
> 
> - add monitor mechanism to handle ubq_daemon being killed, ubdsrv[1]
>   includes builtin tests for covering heavy IO with deleting ubd / killing
>   ubq_daemon at the same time, and V2 pass all the two tests(make test T=generic),
>   and the abort/stop mechanism is simple
> 
> - fix MQ command buffer mmap bug, and now 'xfstetests -g auto' works well on
>   MQ ubd-loop devices(test/scratch)
> 
> - improve batching submission as suggested by Jens
> 
> - improve handling for starting device, replace random wait/poll with
> completion
> 
> - all kinds of cleanup, bug fix,..
> 
> And the patch by patch change since V1 can be found in the following
> tree:
> 
> https://github.com/ming1/linux/commits/my_for-5.18-ubd-devel_v2
> 
> Todo:
> 	- add lazy user page release for avoiding cost of pinning user pages in
> 	ubd_copy_pages() most of time, so we can save CPU for handling io logic
> 	in userpsace
> 
> 
> [1] ubd server
> https://github.com/ming1/ubdsrv/commits/devel-v2
> 
> [2] qcow2 kernel driver attempt
> https://www.spinics.net/lists/kernel/msg4292965.html
> https://patchwork.kernel.org/project/linux-block/cover/20190823225619.15530-1-development@manuel-bentele.de/#22841183
> 
> [3] [LSF/MM/BPF TOPIC] block drivers in user space
> https://lore.kernel.org/linux-block/87tucsf0sr.fsf@collabora.com/
> 
> [4] Follow up on UBD discussion
> https://lore.kernel.org/linux-block/YnsW+utCrosF0lvm@T590/#r
> 
> Ming Lei (1):
>   ubd: add io_uring based userspace block driver
> 
>  drivers/block/Kconfig        |    6 +
>  drivers/block/Makefile       |    2 +
>  drivers/block/ubd_drv.c      | 1444 ++++++++++++++++++++++++++++++++++
>  include/uapi/linux/ubd_cmd.h |  158 ++++
>  4 files changed, 1610 insertions(+)
>  create mode 100644 drivers/block/ubd_drv.c
>  create mode 100644 include/uapi/linux/ubd_cmd.h
> 
> -- 
> 2.31.1
Ming Lei May 18, 2022, 7:09 a.m. UTC | #4
On Tue, May 17, 2022 at 03:06:34PM +0100, Stefan Hajnoczi wrote:
> Here are some more thoughts on the ubd-control device:
> 
> The current patch provides a ubd-control device for processes with
> suitable permissions (i.e. root) to create, start, stop, and fetch
> information about devices.
> 
> There is no isolation between devices created by one process and those

I understand linux hasn't device namespace yet, so can you share the
rational behind the idea of device isolation, is it because ubd device
is served by ubd daemon which belongs to one pid NS? Or the user creating
/dev/ubdbN belongs to one user NS?

IMO, ubd device is one file in VFS, and FS permission should be applied,
then here the closest model should be user NS, and process privilege &
file ownership.

> created by another. Therefore two processes that do not trust each other
> cannot both use UBD without potential interference. There is also no

Can you share what the expectation is for this situation?

It is the created UBD which can only be used in this NS, or can only be
visible inside this NS? I guess the latter isn't possible since we don't
have this kind of isolation framework yet.

> isolation for containers.
> 
> I think it would be a mistake to keep the ubd-control interface in its
> current form since the current global/root model is limited. Instead I
> suggest:
> - Creating a device returns a new file descriptor instead of a global
>   dev_id. The device can be started/stopped/configured through this (and
>   only through this) per-device file descriptor. The device is not
>   visible to other processes through ubd-control so interference is not
>   possible. In order to give another process control over the device the
>   fd can be passed (e.g. SCM_RIGHTS). 
> 

/dev/ubdcN can only be opened by the process which is the descendant of
the process which creates the device by sending ADD_DEV.

But the device can be deleted/queried by other processes, however, I
think it is reasonable if all these processes has permission to do that,
such as all processes owns the device with same uid.

So can we apply process privilege & file ownership for isolating ubd device?

If per-process FD is used, it may confuse people, because process can
not delete/query ubd device even though its uid shows it has the
privilege.

> Now multiple applications/containers/etc can use ubd-control without
> interfering with each other. The security model still requires root
> though since devices can be malicious.
> 
> FUSE allows unprivileged mounts (see fuse_allow_current_process()). Only
> processes with the same uid as the FUSE daemon can access such mounts
> (in the default configuration). This prevents security issues while
> still allowing unprivileged use cases.

OK, looks FUSE applies process privilege & file ownership for dealing
with unprivileged mounts.

> 
> I suggest adapting the FUSE security model to block devices:
> - Devices can be created without CAP_SYS_ADMIN but they have an
>   'unprivileged' flag set to true.
> - Unprivileged devices are not probed for partitions and LVM doesn't
>   touch them. This means the kernel doesn't access these devices via
>   code paths that might be exploitable.

The above two makes sense.

> - When another process with a different uid from ubdsrv opens an
>   unprivileged device, -EACCES is returned. This protects other
>   uids from the unprivileged device.

OK, only the user who owns the device can access unprivileged device.

> - When another process with a different uid from ubdsrv opens a
>   _privileged_ device there is no special access check because ubdsrv is
>   privileged.

IMO, it depends if uid of this process has permission to access the
ubd device, and we can set ubd device's owership by the process
credentials.

> 
> With these changes UBD can be used by unprivileged processes and
> containers. I think it's worth discussing the details and having this
> model from the start so UBD can be used in a wide range of use cases.

I am pretty happy to discuss & figure out the details, but not sure
it is one blocker for ubd:

1) kernel driver of loop/nbd or others haven't support the isolation

2) still don't know exact ubd use case for containers


Thanks, 
Ming
Stefan Hajnoczi May 18, 2022, 9:26 a.m. UTC | #5
On Tue, May 17, 2022 at 01:53:57PM +0800, Ming Lei wrote:
> Another thing ubd driver handles is to copy data between user space buffer
> and request/bio's pages, or take zero copy if mm is ready for support it in
> future. ubd driver doesn't handle any IO logic of the specific driver, so
> it is small/simple, and all io logics are done by the target code in ubdserver.

On the topic of zero copy I guess there are two obvious approaches:

1. An mm solution that grants ubdsrv access to the I/O buffer pages. I
   think ZUFS had a strategy based on pinning server threads to CPUs and
   then having a per-CPU vma that can be changed cheaply
   (https://lwn.net/Articles/756625/).

2. A sendfile/splice solution where ubdsrv replies with <fd, offset,
   length> tuples instead of I/O completion and the UBD driver performs
   the I/O on behalf of ubdsrv.

   (A variation is to give ubdsrv a file descriptor so it can call
   sendfile(2) or related syscalls itself without ever having direct
   access to the I/O buffer pages.)

   This direction leads to LBA TLB designs like the old dm-userspace
   target
   (https://listman.redhat.com/archives/dm-devel/2006-April/msg00114.html)
   where the kernel keeps a TLB of <lba, length, fd, offset> so it can
   avoid sending requests to userspace when there is a TLB hit.
   Userspace's job is to program mappings into the LBA TLB and handle
   the slow path (e.g. allocating writes or compressed blocks). IMO the
   downside of this approach is that it's best to have it from the
   beginning - it's hard to retrofit existing ubdsrv code that is
   intended to process every I/O request instead of populating the LBA
   TLB.

Stefan
Stefan Hajnoczi May 18, 2022, 10:45 a.m. UTC | #6
On Wed, May 18, 2022 at 03:09:46PM +0800, Ming Lei wrote:
> On Tue, May 17, 2022 at 03:06:34PM +0100, Stefan Hajnoczi wrote:
> > Here are some more thoughts on the ubd-control device:
> > 
> > The current patch provides a ubd-control device for processes with
> > suitable permissions (i.e. root) to create, start, stop, and fetch
> > information about devices.
> > 
> > There is no isolation between devices created by one process and those
> 
> I understand linux hasn't device namespace yet, so can you share the
> rational behind the idea of device isolation, is it because ubd device
> is served by ubd daemon which belongs to one pid NS? Or the user creating
> /dev/ubdbN belongs to one user NS?

With the current model a process with access to ubd-control has control
over all ubd devices. This is not desirable for most container use cases
because ubd-control usage within a container means that container could
stop any ubd device on the system.

Even for non-container use cases it's problematic that two applications
that use ubd can interfere with each other. If an application passes the
wrong device ID they can stop the other application's device, for
example.

I think it's worth supporting a model where there are multiple ubd
daemons that are not cooperating/aware of each other. They should be
isolated from each other.

> IMO, ubd device is one file in VFS, and FS permission should be applied,
> then here the closest model should be user NS, and process privilege &
> file ownership.

Yes, /dev/ubdbN can has file ownership/permissions and the cgroup device
controller can restrict access too. That works fine when the device was
created previously.

But what about ubd device creation via ubd-control?

The problem is a global control interface like ubd-control gives access
to all ubd devices. There is no way to let an application/container
control (create/start/stop/etc) some ubd devices but not all. I think
ubd-control must be more fine-grained so multiple
applications/containers can use it without the possibility of
interference.

/dev/ubdcN is a separate problem. The cgroup device controller can limit
the device nodes that are accessible from a process. However, this
requires reserving device minor number ranges for each
application/container so they can only mknod/open their own ubd devices
and not devices that don't belong to them. Maybe there is a better
solution?

/dev/ubdbN has similar requirements to /dev/ubdcN. It should be possible
to create a new /dev/ubdbN but not access an existing device that belong

So if we want to let containers create ubd devices without granting them
access to all devices on the system, then the ubd-control interface
needs to be changed (see below) and the container needs a reserved range
of ubdcN minor numbers. Any container using ubdbN needs the cgroup
device controller and file ownership/permissions to open the block
device.

> > created by another. Therefore two processes that do not trust each other
> > cannot both use UBD without potential interference. There is also no
> 
> Can you share what the expectation is for this situation?

Two users should be able to run ubd daemons on the same system without
being able to stop each other's devices.

> It is the created UBD which can only be used in this NS, or can only be
> visible inside this NS? I guess the latter isn't possible since we don't
> have this kind of isolation framework yet.

It should be possible to access the ubd device according to file
ownership/permissions. No new isolation framework is needed for that.

But ubd-control should not grant global access to all ubd devices, at
least not in the typical case of a ubd daemon that just wishes to
create/start/stop its own devices.

> > isolation for containers.
> > 
> > I think it would be a mistake to keep the ubd-control interface in its
> > current form since the current global/root model is limited. Instead I
> > suggest:
> > - Creating a device returns a new file descriptor instead of a global
> >   dev_id. The device can be started/stopped/configured through this (and
> >   only through this) per-device file descriptor. The device is not
> >   visible to other processes through ubd-control so interference is not
> >   possible. In order to give another process control over the device the
> >   fd can be passed (e.g. SCM_RIGHTS). 
> > 
> 
> /dev/ubdcN can only be opened by the process which is the descendant of
> the process which creates the device by sending ADD_DEV.
> 
> But the device can be deleted/queried by other processes, however, I
> think it is reasonable if all these processes has permission to do that,
> such as all processes owns the device with same uid.

I don't think it's a good idea to require all ubd daemons to have
CAP_SYS_ADMIN/same uid. That's the main point I'm trying to make and the
discussion is based on that.

> So can we apply process privilege & file ownership for isolating ubd device?
> 
> If per-process FD is used, it may confuse people, because process can
> not delete/query ubd device even though its uid shows it has the
> privilege.

Is it better to stop the device via ubd-control instead of a
daemon-specific command (or just killing the daemon process)?

Regarding querying the device, the daemon has more information
associated with the device (e.g. if it's an iSCSI initiator it will have
the iSCSI URI). The ubd driver can only tell you the daemon pid and the
block device attributes that should already be available via sysfs.
Quering the daemon will yield more useful information than using
ubd-control.

> > Now multiple applications/containers/etc can use ubd-control without
> > interfering with each other. The security model still requires root
> > though since devices can be malicious.
> > 
> > FUSE allows unprivileged mounts (see fuse_allow_current_process()). Only
> > processes with the same uid as the FUSE daemon can access such mounts
> > (in the default configuration). This prevents security issues while
> > still allowing unprivileged use cases.
> 
> OK, looks FUSE applies process privilege & file ownership for dealing
> with unprivileged mounts.
> 
> > 
> > I suggest adapting the FUSE security model to block devices:
> > - Devices can be created without CAP_SYS_ADMIN but they have an
> >   'unprivileged' flag set to true.
> > - Unprivileged devices are not probed for partitions and LVM doesn't
> >   touch them. This means the kernel doesn't access these devices via
> >   code paths that might be exploitable.
> 
> The above two makes sense.
> 
> > - When another process with a different uid from ubdsrv opens an
> >   unprivileged device, -EACCES is returned. This protects other
> >   uids from the unprivileged device.
> 
> OK, only the user who owns the device can access unprivileged device.
> 
> > - When another process with a different uid from ubdsrv opens a
> >   _privileged_ device there is no special access check because ubdsrv is
> >   privileged.
> 
> IMO, it depends if uid of this process has permission to access the
> ubd device, and we can set ubd device's owership by the process
> credentials.

Yes, file ownership/permissions are still relevant.

> 
> > 
> > With these changes UBD can be used by unprivileged processes and
> > containers. I think it's worth discussing the details and having this
> > model from the start so UBD can be used in a wide range of use cases.
> 
> I am pretty happy to discuss & figure out the details, but not sure
> it is one blocker for ubd:
> 
> 1) kernel driver of loop/nbd or others haven't support the isolation

It may be better to compare it with FUSE where unprivileged users can
run their own servers. Imagine FUSE required a global root control
interface like ubd-control, then it wouldn't be possible to have
unprivileged FUSE mounts.

> 2) still don't know exact ubd use case for containers

There are two common use cases for block devices:
1. File systems or volume managers
2. Direct access for databases, backup tools, disk image tools, etc

The file system use case involved kernel code and probably needs to be
restricted to untrusted containers cannot exploit the kernel file system
implementations. I'll ignore this use case and containers probably
shouldn't do this.

The second use case is when you have any program that can operate on a
block device. It could be an application that imports/exports a block
device from network storage. This kind of application should be able to
do its job without CAP_SYS_ADMIN and it should be able to run in a
container. It might be part of KubeVirt's Containerized Data Importer,
for example, and is deployed as a container.

If ubd supports unprivileged operation then this container use case is
straightforward. If not, then it's problematic because it either
requires a privileged container or some kind of privileged helper
outside the container. At that point people may avoid ubd because it's
too hard to deploy with privilege requirements.

Stefan
Ming Lei May 18, 2022, 12:53 p.m. UTC | #7
On Wed, May 18, 2022 at 11:45:32AM +0100, Stefan Hajnoczi wrote:
> On Wed, May 18, 2022 at 03:09:46PM +0800, Ming Lei wrote:
> > On Tue, May 17, 2022 at 03:06:34PM +0100, Stefan Hajnoczi wrote:
> > > Here are some more thoughts on the ubd-control device:
> > > 
> > > The current patch provides a ubd-control device for processes with
> > > suitable permissions (i.e. root) to create, start, stop, and fetch
> > > information about devices.
> > > 
> > > There is no isolation between devices created by one process and those
> > 
> > I understand linux hasn't device namespace yet, so can you share the
> > rational behind the idea of device isolation, is it because ubd device
> > is served by ubd daemon which belongs to one pid NS? Or the user creating
> > /dev/ubdbN belongs to one user NS?
> 
> With the current model a process with access to ubd-control has control
> over all ubd devices. This is not desirable for most container use cases
> because ubd-control usage within a container means that container could
> stop any ubd device on the system.
> 
> Even for non-container use cases it's problematic that two applications
> that use ubd can interfere with each other. If an application passes the
> wrong device ID they can stop the other application's device, for
> example.
> 
> I think it's worth supporting a model where there are multiple ubd
> daemons that are not cooperating/aware of each other. They should be
> isolated from each other.

Maybe I didn't mention it clearly, I meant the following model in last email:

1) every user can send UBD_CMD_ADD_DEV to /dev/ubd-control

2) the created /dev/ubdcN & /dev/udcbN are owned by the user who creates
it

3) only the user who has permission to /dev/ubdcN can send other control
commands(START_DEV/STOP_DEV/GET_DEV_INFO/GET_QUEUE_AFFINITY/DEL_DEV);
and same with /dev/ubdbN

4) for unprivileged user who owns /dev/ubdbN, limit kernel behavior,
such as, not probed for partitions and LVM, only allow unprivileged
mounts,...

So ubd device can be isolated wrt. user NS.

> 
> > IMO, ubd device is one file in VFS, and FS permission should be applied,
> > then here the closest model should be user NS, and process privilege &
> > file ownership.
> 
> Yes, /dev/ubdbN can has file ownership/permissions and the cgroup device
> controller can restrict access too. That works fine when the device was
> created previously.
> 
> But what about ubd device creation via ubd-control?
> 
> The problem is a global control interface like ubd-control gives access
> to all ubd devices. There is no way to let an application/container
> control (create/start/stop/etc) some ubd devices but not all. I think
> ubd-control must be more fine-grained so multiple
> applications/containers can use it without the possibility of
> interference.
> 
> /dev/ubdcN is a separate problem. The cgroup device controller can limit
> the device nodes that are accessible from a process. However, this
> requires reserving device minor number ranges for each
> application/container so they can only mknod/open their own ubd devices
> and not devices that don't belong to them. Maybe there is a better
> solution?
> 
> /dev/ubdbN has similar requirements to /dev/ubdcN. It should be possible
> to create a new /dev/ubdbN but not access an existing device that belong
> 
> So if we want to let containers create ubd devices without granting them
> access to all devices on the system, then the ubd-control interface
> needs to be changed (see below) and the container needs a reserved range
> of ubdcN minor numbers. Any container using ubdbN needs the cgroup
> device controller and file ownership/permissions to open the block
> device.
> 
> > > created by another. Therefore two processes that do not trust each other
> > > cannot both use UBD without potential interference. There is also no
> > 
> > Can you share what the expectation is for this situation?
> 
> Two users should be able to run ubd daemons on the same system without
> being able to stop each other's devices.

Yeah, the above process privilege & file ownership based way can reach
the goal in user NS.

> 
> > It is the created UBD which can only be used in this NS, or can only be
> > visible inside this NS? I guess the latter isn't possible since we don't
> > have this kind of isolation framework yet.
> 
> It should be possible to access the ubd device according to file
> ownership/permissions. No new isolation framework is needed for that.
> 
> But ubd-control should not grant global access to all ubd devices, at
> least not in the typical case of a ubd daemon that just wishes to
> create/start/stop its own devices.

Yeah, I agree.

> 
> > > isolation for containers.
> > > 
> > > I think it would be a mistake to keep the ubd-control interface in its
> > > current form since the current global/root model is limited. Instead I
> > > suggest:
> > > - Creating a device returns a new file descriptor instead of a global
> > >   dev_id. The device can be started/stopped/configured through this (and
> > >   only through this) per-device file descriptor. The device is not
> > >   visible to other processes through ubd-control so interference is not
> > >   possible. In order to give another process control over the device the
> > >   fd can be passed (e.g. SCM_RIGHTS). 
> > > 
> > 
> > /dev/ubdcN can only be opened by the process which is the descendant of
> > the process which creates the device by sending ADD_DEV.
> > 
> > But the device can be deleted/queried by other processes, however, I
> > think it is reasonable if all these processes has permission to do that,
> > such as all processes owns the device with same uid.
> 
> I don't think it's a good idea to require all ubd daemons to have
> CAP_SYS_ADMIN/same uid. That's the main point I'm trying to make and the
> discussion is based on that.

I meant only the user who owns /dev/ubdcN can send the command to
/dev/ubd-control for controlling /dev/ubdcN. I believe this way is
straightforward.

> 
> > So can we apply process privilege & file ownership for isolating ubd device?
> > 
> > If per-process FD is used, it may confuse people, because process can
> > not delete/query ubd device even though its uid shows it has the
> > privilege.
> 
> Is it better to stop the device via ubd-control instead of a
> daemon-specific command (or just killing the daemon process)?
> 
> Regarding querying the device, the daemon has more information
> associated with the device (e.g. if it's an iSCSI initiator it will have
> the iSCSI URI). The ubd driver can only tell you the daemon pid and the
> block device attributes that should already be available via sysfs.
> Quering the daemon will yield more useful information than using
> ubd-control.

I don't think it is good to interrupt daemon for this admin/control job,
which may distract daemon from handling normal IO tasks, also not necessary
to make daemon implementation more complicated.

We should separate admin task from normal IO handling, which is one
common design pattern.

> 
> > > Now multiple applications/containers/etc can use ubd-control without
> > > interfering with each other. The security model still requires root
> > > though since devices can be malicious.
> > > 
> > > FUSE allows unprivileged mounts (see fuse_allow_current_process()). Only
> > > processes with the same uid as the FUSE daemon can access such mounts
> > > (in the default configuration). This prevents security issues while
> > > still allowing unprivileged use cases.
> > 
> > OK, looks FUSE applies process privilege & file ownership for dealing
> > with unprivileged mounts.
> > 
> > > 
> > > I suggest adapting the FUSE security model to block devices:
> > > - Devices can be created without CAP_SYS_ADMIN but they have an
> > >   'unprivileged' flag set to true.
> > > - Unprivileged devices are not probed for partitions and LVM doesn't
> > >   touch them. This means the kernel doesn't access these devices via
> > >   code paths that might be exploitable.
> > 
> > The above two makes sense.
> > 
> > > - When another process with a different uid from ubdsrv opens an
> > >   unprivileged device, -EACCES is returned. This protects other
> > >   uids from the unprivileged device.
> > 
> > OK, only the user who owns the device can access unprivileged device.
> > 
> > > - When another process with a different uid from ubdsrv opens a
> > >   _privileged_ device there is no special access check because ubdsrv is
> > >   privileged.
> > 
> > IMO, it depends if uid of this process has permission to access the
> > ubd device, and we can set ubd device's owership by the process
> > credentials.
> 
> Yes, file ownership/permissions are still relevant.
> 
> > 
> > > 
> > > With these changes UBD can be used by unprivileged processes and
> > > containers. I think it's worth discussing the details and having this
> > > model from the start so UBD can be used in a wide range of use cases.
> > 
> > I am pretty happy to discuss & figure out the details, but not sure
> > it is one blocker for ubd:
> > 
> > 1) kernel driver of loop/nbd or others haven't support the isolation
> 
> It may be better to compare it with FUSE where unprivileged users can
> run their own servers. Imagine FUSE required a global root control
> interface like ubd-control, then it wouldn't be possible to have
> unprivileged FUSE mounts.
> 
> > 2) still don't know exact ubd use case for containers
> 
> There are two common use cases for block devices:
> 1. File systems or volume managers
> 2. Direct access for databases, backup tools, disk image tools, etc
> 
> The file system use case involved kernel code and probably needs to be
> restricted to untrusted containers cannot exploit the kernel file system
> implementations. I'll ignore this use case and containers probably
> shouldn't do this.
> 
> The second use case is when you have any program that can operate on a
> block device. It could be an application that imports/exports a block
> device from network storage. This kind of application should be able to
> do its job without CAP_SYS_ADMIN and it should be able to run in a
> container. It might be part of KubeVirt's Containerized Data Importer,
> for example, and is deployed as a container.
> 
> If ubd supports unprivileged operation then this container use case is
> straightforward. If not, then it's problematic because it either
> requires a privileged container or some kind of privileged helper
> outside the container. At that point people may avoid ubd because it's
> too hard to deploy with privilege requirements.

OK, thanks for the sharing. In short, container requires unprivileged
operation on block device. I think it makes sense.


Thanks, 
Ming
Ming Lei May 18, 2022, 1:18 p.m. UTC | #8
Hello Liu,

On Wed, May 18, 2022 at 02:38:08AM -0400, Liu Xiaodong wrote:
> On Tue, May 17, 2022 at 01:53:57PM +0800, Ming Lei wrote:
> > Hello Guys,
> > 
> > ubd driver is one kernel driver for implementing generic userspace block
> > device/driver, which delivers io request from ubd block device(/dev/ubdbN) into
> > ubd server[1] which is the userspace part of ubd for communicating
> > with ubd driver and handling specific io logic by its target module.
> > 
> > Another thing ubd driver handles is to copy data between user space buffer
> > and request/bio's pages, or take zero copy if mm is ready for support it in
> > future. ubd driver doesn't handle any IO logic of the specific driver, so
> > it is small/simple, and all io logics are done by the target code in ubdserver.
> > 
> > The above two are main jobs done by ubd driver.
> 
> Hi, Lei
> 
> Your UBD implementation looks great. Its io_uring based design is interesting
> and brilliant.
> Towards the purpose of userspace block device, last year,
> VDUSE initialized by Yongji is going to do a similar work. But VDUSE is under
> vdpa. VDUSE will present a virtio-blk device to other userspace process
> like containers, while serving virtio-blk req also by an userspace target.
> https://lists.linuxfoundation.org/pipermail/iommu/2021-June/056956.html 
> 
> I've been working and thinking on serving RUNC container by SPDK efficiently.
> But this work requires a new proper userspace block device module in kernel.
> The highlevel design idea for userspace block device implementations
> should be that: Using ring for IO request, so client and target can exchange
> req/resp quickly in batch; Map bounce buffer between kernel and userspace
> target, so another extra IO data copy like NBD can be avoid. (Oh, yes, SPDK
> needs this kernel module has some more minor functions)
> 
> UBD and VDUSE are both implemented in this way, while of course each of
> them has specific features and advantages.
> 
> Not like UBD which is straightforward and starts from scratch, VDUSE is
> embedded in virtio framework. So its implementation is more complicated, but
> all virtio frontend utilities can be leveraged.
> When considering security/permission issues, feels UBD would be easier to
> solve them.

Stefan Hajnoczi and I are discussing related security/permission
issues, can you share more details in your case?

> 
> So my questions are:
> 1. what do you think about the purpose overlap between UBD and VDUSE?

Sorry, I am not familiar with VDUSE, motivation of ubd is just to make one
high performance generic userspace block driver. ubd driver(kernel part) is
just responsible for communication and copying data between userspace buffers
and kernel io request pages, and the ubdsrv(userspace) target handles io
logic.

> 2. Could UBD be implemented with SPDK friendly functionalities? (mainly about
> io data mapping, since HW devices in SPDK need to access the mapped data
> buffer. Then, in function ubdsrv.c/ubdsrv_init_io_bufs(),
> "addr = mmap(,,,,dev->cdev_fd,)",

No, that code is actually for supporting zero copy.

But each request's buffer is allocated by ubdsrv and definitely available for any
target, please see loop_handle_io_async() which handles IO from /dev/ubdbN about
how to use the buffer. Fro READ, the target code needs to implement READ
logic and fill data to the buffer, then the buffer will be copied to
kernel io request pages; for WRITE, the target code needs to use the buffer to handle
WRITE and the buffer has been updated with kernel io request.

> SPDK needs to know the PA of "addr".

What is PA? and why?

Userspace can only see VM of each buffer. 

> Also memory pointed by "addr" should be pinned all the time.)

The current implementation only pins pages when copying data between
userspace buffers and kernel io request pages. But I plan to support
three pin behavior:

- never (current behavior, just pin pages when copying pages)
- lazy (pin pages until the request is idle for enough time)
- always (all pages in userpace VM are pinned during the device lifetime)


Thanks, 
Ming
Stefan Hajnoczi May 18, 2022, 3:49 p.m. UTC | #9
On Wed, May 18, 2022 at 08:53:54PM +0800, Ming Lei wrote:
> On Wed, May 18, 2022 at 11:45:32AM +0100, Stefan Hajnoczi wrote:
> > On Wed, May 18, 2022 at 03:09:46PM +0800, Ming Lei wrote:
> > > On Tue, May 17, 2022 at 03:06:34PM +0100, Stefan Hajnoczi wrote:
> > > > Here are some more thoughts on the ubd-control device:
> > > > 
> > > > The current patch provides a ubd-control device for processes with
> > > > suitable permissions (i.e. root) to create, start, stop, and fetch
> > > > information about devices.
> > > > 
> > > > There is no isolation between devices created by one process and those
> > > 
> > > I understand linux hasn't device namespace yet, so can you share the
> > > rational behind the idea of device isolation, is it because ubd device
> > > is served by ubd daemon which belongs to one pid NS? Or the user creating
> > > /dev/ubdbN belongs to one user NS?
> > 
> > With the current model a process with access to ubd-control has control
> > over all ubd devices. This is not desirable for most container use cases
> > because ubd-control usage within a container means that container could
> > stop any ubd device on the system.
> > 
> > Even for non-container use cases it's problematic that two applications
> > that use ubd can interfere with each other. If an application passes the
> > wrong device ID they can stop the other application's device, for
> > example.
> > 
> > I think it's worth supporting a model where there are multiple ubd
> > daemons that are not cooperating/aware of each other. They should be
> > isolated from each other.
> 
> Maybe I didn't mention it clearly, I meant the following model in last email:
> 
> 1) every user can send UBD_CMD_ADD_DEV to /dev/ubd-control
> 
> 2) the created /dev/ubdcN & /dev/udcbN are owned by the user who creates
> it

How does this work? Does userspace (udev) somehow get the uid/gid from
the uevent so it can set the device node permissions?

> 3) only the user who has permission to /dev/ubdcN can send other control
> commands(START_DEV/STOP_DEV/GET_DEV_INFO/GET_QUEUE_AFFINITY/DEL_DEV);
> and same with /dev/ubdbN
> 
> 4) for unprivileged user who owns /dev/ubdbN, limit kernel behavior,
> such as, not probed for partitions and LVM, only allow unprivileged
> mounts,...
> 
> So ubd device can be isolated wrt. user NS.

Cool!

> 
> > 
> > > IMO, ubd device is one file in VFS, and FS permission should be applied,
> > > then here the closest model should be user NS, and process privilege &
> > > file ownership.
> > 
> > Yes, /dev/ubdbN can has file ownership/permissions and the cgroup device
> > controller can restrict access too. That works fine when the device was
> > created previously.
> > 
> > But what about ubd device creation via ubd-control?
> > 
> > The problem is a global control interface like ubd-control gives access
> > to all ubd devices. There is no way to let an application/container
> > control (create/start/stop/etc) some ubd devices but not all. I think
> > ubd-control must be more fine-grained so multiple
> > applications/containers can use it without the possibility of
> > interference.
> > 
> > /dev/ubdcN is a separate problem. The cgroup device controller can limit
> > the device nodes that are accessible from a process. However, this
> > requires reserving device minor number ranges for each
> > application/container so they can only mknod/open their own ubd devices
> > and not devices that don't belong to them. Maybe there is a better
> > solution?
> > 
> > /dev/ubdbN has similar requirements to /dev/ubdcN. It should be possible
> > to create a new /dev/ubdbN but not access an existing device that belong
> > 
> > So if we want to let containers create ubd devices without granting them
> > access to all devices on the system, then the ubd-control interface
> > needs to be changed (see below) and the container needs a reserved range
> > of ubdcN minor numbers. Any container using ubdbN needs the cgroup
> > device controller and file ownership/permissions to open the block
> > device.
> > 
> > > > created by another. Therefore two processes that do not trust each other
> > > > cannot both use UBD without potential interference. There is also no
> > > 
> > > Can you share what the expectation is for this situation?
> > 
> > Two users should be able to run ubd daemons on the same system without
> > being able to stop each other's devices.
> 
> Yeah, the above process privilege & file ownership based way can reach
> the goal in user NS.
> 
> > 
> > > It is the created UBD which can only be used in this NS, or can only be
> > > visible inside this NS? I guess the latter isn't possible since we don't
> > > have this kind of isolation framework yet.
> > 
> > It should be possible to access the ubd device according to file
> > ownership/permissions. No new isolation framework is needed for that.
> > 
> > But ubd-control should not grant global access to all ubd devices, at
> > least not in the typical case of a ubd daemon that just wishes to
> > create/start/stop its own devices.
> 
> Yeah, I agree.
> 
> > 
> > > > isolation for containers.
> > > > 
> > > > I think it would be a mistake to keep the ubd-control interface in its
> > > > current form since the current global/root model is limited. Instead I
> > > > suggest:
> > > > - Creating a device returns a new file descriptor instead of a global
> > > >   dev_id. The device can be started/stopped/configured through this (and
> > > >   only through this) per-device file descriptor. The device is not
> > > >   visible to other processes through ubd-control so interference is not
> > > >   possible. In order to give another process control over the device the
> > > >   fd can be passed (e.g. SCM_RIGHTS). 
> > > > 
> > > 
> > > /dev/ubdcN can only be opened by the process which is the descendant of
> > > the process which creates the device by sending ADD_DEV.
> > > 
> > > But the device can be deleted/queried by other processes, however, I
> > > think it is reasonable if all these processes has permission to do that,
> > > such as all processes owns the device with same uid.
> > 
> > I don't think it's a good idea to require all ubd daemons to have
> > CAP_SYS_ADMIN/same uid. That's the main point I'm trying to make and the
> > discussion is based on that.
> 
> I meant only the user who owns /dev/ubdcN can send the command to
> /dev/ubd-control for controlling /dev/ubdcN. I believe this way is
> straightforward.
> 
> > 
> > > So can we apply process privilege & file ownership for isolating ubd device?
> > > 
> > > If per-process FD is used, it may confuse people, because process can
> > > not delete/query ubd device even though its uid shows it has the
> > > privilege.
> > 
> > Is it better to stop the device via ubd-control instead of a
> > daemon-specific command (or just killing the daemon process)?
> > 
> > Regarding querying the device, the daemon has more information
> > associated with the device (e.g. if it's an iSCSI initiator it will have
> > the iSCSI URI). The ubd driver can only tell you the daemon pid and the
> > block device attributes that should already be available via sysfs.
> > Quering the daemon will yield more useful information than using
> > ubd-control.
> 
> I don't think it is good to interrupt daemon for this admin/control job,
> which may distract daemon from handling normal IO tasks, also not necessary
> to make daemon implementation more complicated.
> 
> We should separate admin task from normal IO handling, which is one
> common design pattern.
> 
> > 
> > > > Now multiple applications/containers/etc can use ubd-control without
> > > > interfering with each other. The security model still requires root
> > > > though since devices can be malicious.
> > > > 
> > > > FUSE allows unprivileged mounts (see fuse_allow_current_process()). Only
> > > > processes with the same uid as the FUSE daemon can access such mounts
> > > > (in the default configuration). This prevents security issues while
> > > > still allowing unprivileged use cases.
> > > 
> > > OK, looks FUSE applies process privilege & file ownership for dealing
> > > with unprivileged mounts.
> > > 
> > > > 
> > > > I suggest adapting the FUSE security model to block devices:
> > > > - Devices can be created without CAP_SYS_ADMIN but they have an
> > > >   'unprivileged' flag set to true.
> > > > - Unprivileged devices are not probed for partitions and LVM doesn't
> > > >   touch them. This means the kernel doesn't access these devices via
> > > >   code paths that might be exploitable.
> > > 
> > > The above two makes sense.
> > > 
> > > > - When another process with a different uid from ubdsrv opens an
> > > >   unprivileged device, -EACCES is returned. This protects other
> > > >   uids from the unprivileged device.
> > > 
> > > OK, only the user who owns the device can access unprivileged device.
> > > 
> > > > - When another process with a different uid from ubdsrv opens a
> > > >   _privileged_ device there is no special access check because ubdsrv is
> > > >   privileged.
> > > 
> > > IMO, it depends if uid of this process has permission to access the
> > > ubd device, and we can set ubd device's owership by the process
> > > credentials.
> > 
> > Yes, file ownership/permissions are still relevant.
> > 
> > > 
> > > > 
> > > > With these changes UBD can be used by unprivileged processes and
> > > > containers. I think it's worth discussing the details and having this
> > > > model from the start so UBD can be used in a wide range of use cases.
> > > 
> > > I am pretty happy to discuss & figure out the details, but not sure
> > > it is one blocker for ubd:
> > > 
> > > 1) kernel driver of loop/nbd or others haven't support the isolation
> > 
> > It may be better to compare it with FUSE where unprivileged users can
> > run their own servers. Imagine FUSE required a global root control
> > interface like ubd-control, then it wouldn't be possible to have
> > unprivileged FUSE mounts.
> > 
> > > 2) still don't know exact ubd use case for containers
> > 
> > There are two common use cases for block devices:
> > 1. File systems or volume managers
> > 2. Direct access for databases, backup tools, disk image tools, etc
> > 
> > The file system use case involved kernel code and probably needs to be
> > restricted to untrusted containers cannot exploit the kernel file system
> > implementations. I'll ignore this use case and containers probably
> > shouldn't do this.
> > 
> > The second use case is when you have any program that can operate on a
> > block device. It could be an application that imports/exports a block
> > device from network storage. This kind of application should be able to
> > do its job without CAP_SYS_ADMIN and it should be able to run in a
> > container. It might be part of KubeVirt's Containerized Data Importer,
> > for example, and is deployed as a container.
> > 
> > If ubd supports unprivileged operation then this container use case is
> > straightforward. If not, then it's problematic because it either
> > requires a privileged container or some kind of privileged helper
> > outside the container. At that point people may avoid ubd because it's
> > too hard to deploy with privilege requirements.
> 
> OK, thanks for the sharing. In short, container requires unprivileged
> operation on block device. I think it makes sense.

Thanks!

Stefan
Ming Lei May 19, 2022, 2:42 a.m. UTC | #10
On Wed, May 18, 2022 at 04:49:03PM +0100, Stefan Hajnoczi wrote:
> On Wed, May 18, 2022 at 08:53:54PM +0800, Ming Lei wrote:
> > On Wed, May 18, 2022 at 11:45:32AM +0100, Stefan Hajnoczi wrote:
> > > On Wed, May 18, 2022 at 03:09:46PM +0800, Ming Lei wrote:
> > > > On Tue, May 17, 2022 at 03:06:34PM +0100, Stefan Hajnoczi wrote:
> > > > > Here are some more thoughts on the ubd-control device:
> > > > > 
> > > > > The current patch provides a ubd-control device for processes with
> > > > > suitable permissions (i.e. root) to create, start, stop, and fetch
> > > > > information about devices.
> > > > > 
> > > > > There is no isolation between devices created by one process and those
> > > > 
> > > > I understand linux hasn't device namespace yet, so can you share the
> > > > rational behind the idea of device isolation, is it because ubd device
> > > > is served by ubd daemon which belongs to one pid NS? Or the user creating
> > > > /dev/ubdbN belongs to one user NS?
> > > 
> > > With the current model a process with access to ubd-control has control
> > > over all ubd devices. This is not desirable for most container use cases
> > > because ubd-control usage within a container means that container could
> > > stop any ubd device on the system.
> > > 
> > > Even for non-container use cases it's problematic that two applications
> > > that use ubd can interfere with each other. If an application passes the
> > > wrong device ID they can stop the other application's device, for
> > > example.
> > > 
> > > I think it's worth supporting a model where there are multiple ubd
> > > daemons that are not cooperating/aware of each other. They should be
> > > isolated from each other.
> > 
> > Maybe I didn't mention it clearly, I meant the following model in last email:
> > 
> > 1) every user can send UBD_CMD_ADD_DEV to /dev/ubd-control
> > 
> > 2) the created /dev/ubdcN & /dev/udcbN are owned by the user who creates
> > it
> 
> How does this work? Does userspace (udev) somehow get the uid/gid from
> the uevent so it can set the device node permissions?

We can let 'ubd list' export the owner info, then udev may override the default
owner with exported info.

Or it can be done inside devtmpfs_create_node() by passing ubd's uid/gid
at default.

For /dev/ubdcN, I think it is safe, since the driver is only
communicating with the userspace daemon, and both belong to same owner.
Also ubd driver is simple enough to get full audited.

For /dev/ubdbN, even though FS isn't allowed to mount, there is still
lots of kernel code path involved, and some code path may not be run
with unprivileged user before, that needs careful audit.

So the biggest problem is if it is safe to export block disk to unprivileged
user, and that is the one which can't be bypassed for any approach.




Thanks, 
Ming
Stefan Hajnoczi May 19, 2022, 9:46 a.m. UTC | #11
On Thu, May 19, 2022 at 10:42:22AM +0800, Ming Lei wrote:
> On Wed, May 18, 2022 at 04:49:03PM +0100, Stefan Hajnoczi wrote:
> > On Wed, May 18, 2022 at 08:53:54PM +0800, Ming Lei wrote:
> > > On Wed, May 18, 2022 at 11:45:32AM +0100, Stefan Hajnoczi wrote:
> > > > On Wed, May 18, 2022 at 03:09:46PM +0800, Ming Lei wrote:
> > > > > On Tue, May 17, 2022 at 03:06:34PM +0100, Stefan Hajnoczi wrote:
> > > > > > Here are some more thoughts on the ubd-control device:
> > > > > > 
> > > > > > The current patch provides a ubd-control device for processes with
> > > > > > suitable permissions (i.e. root) to create, start, stop, and fetch
> > > > > > information about devices.
> > > > > > 
> > > > > > There is no isolation between devices created by one process and those
> > > > > 
> > > > > I understand linux hasn't device namespace yet, so can you share the
> > > > > rational behind the idea of device isolation, is it because ubd device
> > > > > is served by ubd daemon which belongs to one pid NS? Or the user creating
> > > > > /dev/ubdbN belongs to one user NS?
> > > > 
> > > > With the current model a process with access to ubd-control has control
> > > > over all ubd devices. This is not desirable for most container use cases
> > > > because ubd-control usage within a container means that container could
> > > > stop any ubd device on the system.
> > > > 
> > > > Even for non-container use cases it's problematic that two applications
> > > > that use ubd can interfere with each other. If an application passes the
> > > > wrong device ID they can stop the other application's device, for
> > > > example.
> > > > 
> > > > I think it's worth supporting a model where there are multiple ubd
> > > > daemons that are not cooperating/aware of each other. They should be
> > > > isolated from each other.
> > > 
> > > Maybe I didn't mention it clearly, I meant the following model in last email:
> > > 
> > > 1) every user can send UBD_CMD_ADD_DEV to /dev/ubd-control
> > > 
> > > 2) the created /dev/ubdcN & /dev/udcbN are owned by the user who creates
> > > it
> > 
> > How does this work? Does userspace (udev) somehow get the uid/gid from
> > the uevent so it can set the device node permissions?
> 
> We can let 'ubd list' export the owner info, then udev may override the default
> owner with exported info.
> 
> Or it can be done inside devtmpfs_create_node() by passing ubd's uid/gid
> at default.
> 
> For /dev/ubdcN, I think it is safe, since the driver is only
> communicating with the userspace daemon, and both belong to same owner.
> Also ubd driver is simple enough to get full audited.
> 
> For /dev/ubdbN, even though FS isn't allowed to mount, there is still
> lots of kernel code path involved, and some code path may not be run
> with unprivileged user before, that needs careful audit.
> 
> So the biggest problem is if it is safe to export block disk to unprivileged
> user, and that is the one which can't be bypassed for any approach.

Okay.

Stefan
Ming Lei May 19, 2022, 1:33 p.m. UTC | #12
On Tue, May 17, 2022 at 01:53:57PM +0800, Ming Lei wrote:
> Hello Guys,
> 
> ubd driver is one kernel driver for implementing generic userspace block
> device/driver, which delivers io request from ubd block device(/dev/ubdbN) into
> ubd server[1] which is the userspace part of ubd for communicating
> with ubd driver and handling specific io logic by its target module.
> 
> Another thing ubd driver handles is to copy data between user space buffer
> and request/bio's pages, or take zero copy if mm is ready for support it in
> future. ubd driver doesn't handle any IO logic of the specific driver, so
> it is small/simple, and all io logics are done by the target code in ubdserver.
> 
> The above two are main jobs done by ubd driver.
> 
> ubd driver can help to move IO logic into userspace, in which the
> development work is easier/more effective than doing in kernel, such as,
> ubd-loop takes < 200 lines of loop specific code to get basically same 
> function with kernel loop block driver, meantime the performance is
> still good. ubdsrv[1] provide built-in test for comparing both by running
> "make test T=loop".
> 
> Another example is high performance qcow2 support[2], which could be built with
> ubd framework more easily than doing it inside kernel.
> 
> Also there are more people who express interests on userspace block driver[3],
> Gabriel Krisman Bertazi proposes this topic in lsf/mm/ebpf 2022 and mentioned
> requirement from Google. Ziyang Zhang from Alibaba said they "plan to
> replace TCMU by UBD as a new choice" because UBD can get better throughput than
> TCMU even with single queue[4], meantime UBD is simple. Also there is userspace
> storage service for providing storage to containers.
> 
> It is io_uring based: io request is delivered to userspace via new added
> io_uring command which has been proved as very efficient for making nvme
> passthrough IO to get better IOPS than io_uring(READ/WRITE). Meantime one
> shared/mmap buffer is used for sharing io descriptor to userspace, the
> buffer is readonly for userspace, each IO just takes 24bytes so far.
> It is suggested to use io_uring in userspace(target part of ubd server)
> to handle IO request too. And it is still easy for ubdserver to support
> io handling by non-io_uring, and this work isn't done yet, but can be
> supported easily with help o eventfd.
> 
> This way is efficient since no extra io command copy is required, no sleep
> is needed in transferring io command to userspace. Meantime the communication
> protocol is simple and efficient, one single command of
> UBD_IO_COMMIT_AND_FETCH_REQ can handle both fetching io request desc and commit
> command result in one trip. IO handling is often batched after single
> io_uring_enter() returns, both IO requests from ubd server target and
> IO commands could be handled as a whole batch.
> 
> Remove RFC now because ubd driver codes gets lots of cleanup, enhancement and
> bug fixes since V1:
> 
> - cleanup uapi: remove ubd specific error code,  switch to linux error code,
> remove one command op, remove one field from cmd_desc
> 
> - add monitor mechanism to handle ubq_daemon being killed, ubdsrv[1]
>   includes builtin tests for covering heavy IO with deleting ubd / killing
>   ubq_daemon at the same time, and V2 pass all the two tests(make test T=generic),
>   and the abort/stop mechanism is simple
> 
> - fix MQ command buffer mmap bug, and now 'xfstetests -g auto' works well on
>   MQ ubd-loop devices(test/scratch)
> 
> - improve batching submission as suggested by Jens
> 
> - improve handling for starting device, replace random wait/poll with
> completion
> 
> - all kinds of cleanup, bug fix,..
> 
> And the patch by patch change since V1 can be found in the following
> tree:
> 
> https://github.com/ming1/linux/commits/my_for-5.18-ubd-devel_v2

BTW, a one-line fix[1] is added to above branch, which fixes performance
obviously on small BS(< 128k) test. If anyone run performance test,
please include this fix.

[1] https://github.com/ming1/linux/commit/fa91354b418e83953304a3efad4ee6ac40ea6110

Thanks,
Ming
Liu Xiaodong May 23, 2022, 2:56 p.m. UTC | #13
On Wed, May 18, 2022 at 09:18:45PM +0800, Ming Lei wrote:
> Hello Liu,
> 
> On Wed, May 18, 2022 at 02:38:08AM -0400, Liu Xiaodong wrote:
> > On Tue, May 17, 2022 at 01:53:57PM +0800, Ming Lei wrote:
> > > Hello Guys,
> > > 
> > > ubd driver is one kernel driver for implementing generic userspace block
> > > device/driver, which delivers io request from ubd block device(/dev/ubdbN) into
> > > ubd server[1] which is the userspace part of ubd for communicating
> > > with ubd driver and handling specific io logic by its target module.
> > > 
> > > Another thing ubd driver handles is to copy data between user space buffer
> > > and request/bio's pages, or take zero copy if mm is ready for support it in
> > > future. ubd driver doesn't handle any IO logic of the specific driver, so
> > > it is small/simple, and all io logics are done by the target code in ubdserver.
> > > 
> > > The above two are main jobs done by ubd driver.
> > 
> > Not like UBD which is straightforward and starts from scratch, VDUSE is
> > embedded in virtio framework. So its implementation is more complicated, but
> > all virtio frontend utilities can be leveraged.
> > When considering security/permission issues, feels UBD would be easier to
> > solve them.
> 
> Stefan Hajnoczi and I are discussing related security/permission
> issues, can you share more details in your case?

Hi, Ming
Security/permission things covered by your discussion are more than I've
considered.
 
> > 
> > So my questions are:
> > 1. what do you think about the purpose overlap between UBD and VDUSE?
> 
> Sorry, I am not familiar with VDUSE, motivation of ubd is just to make one
> high performance generic userspace block driver. ubd driver(kernel part) is
> just responsible for communication and copying data between userspace buffers
> and kernel io request pages, and the ubdsrv(userspace) target handles io
> logic.
> 
> > 2. Could UBD be implemented with SPDK friendly functionalities? (mainly about
> > io data mapping, since HW devices in SPDK need to access the mapped data
> > buffer. Then, in function ubdsrv.c/ubdsrv_init_io_bufs(),
> > "addr = mmap(,,,,dev->cdev_fd,)",
> 
> No, that code is actually for supporting zero copy.
> 
> But each request's buffer is allocated by ubdsrv and definitely available for any
> target, please see loop_handle_io_async() which handles IO from /dev/ubdbN about
> how to use the buffer. Fro READ, the target code needs to implement READ
> logic and fill data to the buffer, then the buffer will be copied to
> kernel io request pages; for WRITE, the target code needs to use the buffer to handle
> WRITE and the buffer has been updated with kernel io request.
> 

Oh, I see. Yes, you are right. Mmapped addr in ubdsrv_init_io_bufs is not
used yet. Request's buffer is allocated by ubdsrv.


> > SPDK needs to know the PA of "addr".
> 
> What is PA? and why?

Physical address. Sorry, I forgot to expand it.
Previously I've thought Request data buffer is from mmap addr on corresponding
ubd cdev, then I just thought SPDK need to know the PA of the buffer for
its backend hardware devices.
If the request data buffer is allocated by srv process, then it was not
needed. So maybe SPDK can be efficiently working on your corrent implementation.
I'll try to draft an SPDK service backend later.

Thanks
Xiaodong
Ming Lei May 24, 2022, 2:59 a.m. UTC | #14
On Mon, May 23, 2022 at 10:56:43AM -0400, Liu Xiaodong wrote:
> On Wed, May 18, 2022 at 09:18:45PM +0800, Ming Lei wrote:
> > Hello Liu,
> > 
> > On Wed, May 18, 2022 at 02:38:08AM -0400, Liu Xiaodong wrote:
> > > On Tue, May 17, 2022 at 01:53:57PM +0800, Ming Lei wrote:
> > > > Hello Guys,
> > > > 
> > > > ubd driver is one kernel driver for implementing generic userspace block
> > > > device/driver, which delivers io request from ubd block device(/dev/ubdbN) into
> > > > ubd server[1] which is the userspace part of ubd for communicating
> > > > with ubd driver and handling specific io logic by its target module.
> > > > 
> > > > Another thing ubd driver handles is to copy data between user space buffer
> > > > and request/bio's pages, or take zero copy if mm is ready for support it in
> > > > future. ubd driver doesn't handle any IO logic of the specific driver, so
> > > > it is small/simple, and all io logics are done by the target code in ubdserver.
> > > > 
> > > > The above two are main jobs done by ubd driver.
> > > 
> > > Not like UBD which is straightforward and starts from scratch, VDUSE is
> > > embedded in virtio framework. So its implementation is more complicated, but
> > > all virtio frontend utilities can be leveraged.
> > > When considering security/permission issues, feels UBD would be easier to
> > > solve them.
> > 
> > Stefan Hajnoczi and I are discussing related security/permission
> > issues, can you share more details in your case?
> 
> Hi, Ming
> Security/permission things covered by your discussion are more than I've
> considered.

BTW, I'd rather make a summery about the discussion:

1) Stefan suggested that ubd device may be made as one container block
device, which can be isolated from others, such as, the ubd device
created in one container can only be controlled and read write inside
this container, and this way is useful for container use case.

2) the requirement actually needs both /dev/ubdcN and /dev/ubdbN to
be allowed for unprivileged user; so it could be solved by existed
process privilege & file ownership; let user of the process creating
the two devices owns the two devices, and apply FS's file permission
on the two devices;

3) it shouldn't be hard to allow unprivileged user to control
/dev/ubdcN or /dev/ubd-control

- every user can create ubd by sending ADD_DEV command to
  /dev/ubd-control; only user with permission to /dev/ubdcN can send
  other control commands to /dev/ubd-control for controlling/querying
  the specified device

- ubd driver is simple, both in interface and implementation, so we
can make it stable from the beginning

- only the daemon can communicate with /dev/ubdcN, which is only
  allowed to be opened by one process

4) the challenge is in allowing unprivileged user to access /dev/ubdbN:

- no any serious bug in io path(io hang, kernel panic), such as ubd io
  hang may cause sync() hang

- can't affect other users or processes or system, such as, one
  malicious may make a extremely slow device to dirty lots of pages, or
  prevent device from being deleted

- ...

5) as Stefan mentioned, we may start by:
- not allow unprivileged ubd device to be mounted
- not allow unprivileged ubd device's partition table to be read from
  kernel
- not support buffered io for unprivileged ubd device, and only direct io
  is allowed
- maybe more limit for minimizing security risk.


ubd for container is hard, and it should be one extra feature added
in future, especially after fully review/verification.


Thanks,
Ming