Message ID | 20250106142439.216598-1-dlemoal@kernel.org |
---|---|
Series | New zoned loop block device driver |
On 1/6/25 7:24 AM, Damien Le Moal wrote:
> The first patch implements the new "zloop" zoned block device driver
> which allows creating zoned block devices using one regular file per
> zone as backing storage.

Couldn't we do this with ublk and keep most of this stuff in userspace
rather than need a whole new loop driver?
On Mon, Jan 06, 2025 at 07:54:05AM -0700, Jens Axboe wrote:
> On 1/6/25 7:24 AM, Damien Le Moal wrote:
> > The first patch implements the new "zloop" zoned block device driver
> > which allows creating zoned block devices using one regular file per
> > zone as backing storage.
>
> Couldn't we do this with ublk and keep most of this stuff in userspace
> rather than need a whole new loop driver?

I'm pretty sure we could do that. But dealing with ublk is a complete
pain, especially when setting it up and tearing it down all the time for
testing, and would require a lot more code, so why? As-is I can directly
add this to xfstests for the much needed large file system testing that
currently doesn't work for zoned file systems, which is why I started
writing this code (before Damien gladly took over and polished it).
On 1/6/25 8:21 AM, Christoph Hellwig wrote:
> On Mon, Jan 06, 2025 at 07:54:05AM -0700, Jens Axboe wrote:
>> On 1/6/25 7:24 AM, Damien Le Moal wrote:
>>> The first patch implements the new "zloop" zoned block device driver
>>> which allows creating zoned block devices using one regular file per
>>> zone as backing storage.
>>
>> Couldn't we do this with ublk and keep most of this stuff in userspace
>> rather than need a whole new loop driver?
>
> I'm pretty sure we could do that. But dealing with ublk is a complete
> pain, especially when setting it up and tearing it down all the time for
> testing, and would require a lot more code, so why? As-is I can directly
> add this to xfstests for the much needed large file system testing that
> currently doesn't work for zoned file systems, which is why I started
> writing this code (before Damien gladly took over and polished it).

A lot more code where? Not in the kernel. And now we're stuck with a new
driver for a relatively niche use case. Seems like a bad tradeoff to me.
On Mon, Jan 06, 2025 at 08:24:06AM -0700, Jens Axboe wrote:
> A lot more code where?

Very good and relevant question. Some random new repo that no one knows
about? Not very helpful. xfstests itself? Maybe, but that would just
mean other users have to fork it.

> Not in the kernel. And now we're stuck with a new
> driver for a relatively niche use case. Seems like a bad tradeoff to me.

Seriously, if you can't trust Damien and me to maintain a little driver
using completely standard interfaces without any magic, you'll have
different problems keeping the block layer alive :)
On 1/6/25 8:32 AM, Christoph Hellwig wrote:
> On Mon, Jan 06, 2025 at 08:24:06AM -0700, Jens Axboe wrote:
>> A lot more code where?
>
> Very good and relevant question. Some random new repo that no one knows
> about? Not very helpful. xfstests itself? Maybe, but that would just
> mean other users have to fork it.

Why would they have to fork it? Just put it in xfstests itself. These
are very weak reasons, imho.

>> Not in the kernel. And now we're stuck with a new
>> driver for a relatively niche use case. Seems like a bad tradeoff to me.
>
> Seriously, if you can't trust Damien and me to maintain a little driver
> using completely standard interfaces without any magic, you'll have
> different problems keeping the block layer alive :)

Asking "why do we need this driver, when we can accomplish the same with
existing stuff" is a valid question, and I'm a bit puzzled why we can't
just have a reasonable discussion about this. If that simple question
can't be asked, and answered suitably, then something is really amiss.
On Mon, Jan 06, 2025 at 08:38:26AM -0700, Jens Axboe wrote:
> On 1/6/25 8:32 AM, Christoph Hellwig wrote:
> > On Mon, Jan 06, 2025 at 08:24:06AM -0700, Jens Axboe wrote:
> >> A lot more code where?
> >
> > Very good and relevant question. Some random new repo that no one knows
> > about? Not very helpful. xfstests itself? Maybe, but that would just
> > mean other users have to fork it.
>
> Why would they have to fork it? Just put it in xfstests itself. These
> are very weak reasons, imho.

Because that way other users can't use it. Damien has already mentioned
some. And someone would actually have to write that hypothetical thing.

> >> Not in the kernel. And now we're stuck with a new
> >> driver for a relatively niche use case. Seems like a bad tradeoff to me.
> >
> > Seriously, if you can't trust Damien and me to maintain a little driver
> > using completely standard interfaces without any magic, you'll have
> > different problems keeping the block layer alive :)
>
> Asking "why do we need this driver, when we can accomplish the same with
> existing stuff"

There is no "existing stuff".

> is a valid question, and I'm a bit puzzled why we can't
> just have a reasonable discussion about this.

I think this is a valid and reasonable discussion. But maybe we're
just not on the same page. I don't know anything existing and usable,
maybe I've just not found it?
On 1/6/25 8:44 AM, Christoph Hellwig wrote:
> On Mon, Jan 06, 2025 at 08:38:26AM -0700, Jens Axboe wrote:
>> On 1/6/25 8:32 AM, Christoph Hellwig wrote:
>>> On Mon, Jan 06, 2025 at 08:24:06AM -0700, Jens Axboe wrote:
>>>> A lot more code where?
>>>
>>> Very good and relevant question. Some random new repo that no one knows
>>> about? Not very helpful. xfstests itself? Maybe, but that would just
>>> mean other users have to fork it.
>>
>> Why would they have to fork it? Just put it in xfstests itself. These
>> are very weak reasons, imho.
>
> Because that way other users can't use it. Damien has already mentioned
> some.

If it's actually useful to others, then it can become a standalone
thing. Really nothing new there.

> And someone would actually have to write that hypothetical thing.

That is certainly true.

>>>> Not in the kernel. And now we're stuck with a new
>>>> driver for a relatively niche use case. Seems like a bad tradeoff to me.
>>>
>>> Seriously, if you can't trust Damien and me to maintain a little driver
>>> using completely standard interfaces without any magic, you'll have
>>> different problems keeping the block layer alive :)
>>
>> Asking "why do we need this driver, when we can accomplish the same with
>> existing stuff"
>
> There is no "existing stuff"

Right, that's true on both sides now. Yes this kernel driver has been
written, but in practice there is no existing stuff.

>> is a valid question, and I'm a bit puzzled why we can't
>> just have a reasonable discussion about this.
>
> I think this is a valid and reasonable discussion. But maybe we're
> just not on the same page. I don't know anything existing and usable,
> maybe I've just not found it?

Not that I'm aware of, it was just a suggestion/thought that we could
utilize an existing driver for this, rather than have a separate one.
Yes the proposed one is pretty simple and not large, and maintaining it
isn't a big deal, but it's still a new driver and hence why I was asking
"why can't we just use ublk for this". That also keeps the code mostly
in userspace which is nice, rather than needing kernel changes for new
features, changes, etc.
On Mon, Jan 06, 2025 at 10:38:24AM -0700, Jens Axboe wrote:
> > just not on the same page. I don't know anything existing and usable,
> > maybe I've just not found it?
>
> Not that I'm aware of, it was just a suggestion/thought that we could
> utilize an existing driver for this, rather than have a separate one.
> Yes the proposed one is pretty simple and not large, and maintaining it
> isn't a big deal, but it's still a new driver and hence why I was asking
> "why can't we just use ublk for this". That also keeps the code mostly
> in userspace which is nice, rather than needing kernel changes for new
> features, changes, etc.

Well, the reason to do a kernel driver rather than a ublk back end
boils down to a few things:

- writing highly concurrent code is actually a lot simpler in the kernel
  than in userspace because we have the right primitives for it
- these primitives tend to actually be a lot faster than those available
  in glibc as well
- the double context switch into the kernel and back for a ublk device
  backed by a file system will actually show up for some xfstests that
  do a lot of synchronous ops
- having an in-tree kernel driver that you just configure / unconfigure
  from the shell is a lot easier to use than a daemon that needs to
  be running. Especially from xfstests or other test suites that do
  a lot of per-test setup and teardown
- the kernel actually has really nice infrastructure for block drivers.
  I'm pretty sure doing this in userspace would actually be more
  code, while being harder to use and lower performance.

So we could go both ways, but the kernel version was pretty obviously
the preferred one to me. Maybe that's a little biased by doing a lot
of kernel work, and having run into a lot of problems and performance
issues with the SCSI target user backend lately.
On 1/7/25 02:38, Jens Axboe wrote:
>> I think this is a valid and reasonable discussion. But maybe we're
>> just not on the same page. I don't know anything existing and usable,
>> maybe I've just not found it?
>
> Not that I'm aware of, it was just a suggestion/thought that we could
> utilize an existing driver for this, rather than have a separate one.
> Yes the proposed one is pretty simple and not large, and maintaining it
> isn't a big deal, but it's still a new driver and hence why I was asking
> "why can't we just use ublk for this". That also keeps the code mostly
> in userspace which is nice, rather than needing kernel changes for new
> features, changes, etc.

I did consider ublk at some point but did not switch to it because a
ublk backend driver to do the same as zloop in userspace would need a
lot more code to be efficient. And even then, as Christoph already
mentioned, we would still have performance suffer from the context
switches. But that performance point was not the primary stopper
though as this driver is not intended for production use but rather to
be the simplest possible setup that can be used in CI systems to test
zoned file systems (among other zone related things).

A kernel-based implementation is simpler and the configuration
interface literally needs only a single echo bash command to add or
remove devices. This allows minimal VM configurations with no
dependencies on user tools/libraries to run these zoned devices, which
is what we wanted.

I completely agree about the user-space vs kernel tradeoff you
mentioned. I did consider it but the code simplicity and ease of use
in practice won for us and I chose to stick with the kernel driver
approach.

Note that if you are OK with this, I need to send a V2 to correct the
Kconfig description which currently shows an invalid configuration
command example.
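The single-echo interface Damien describes would look roughly like the
sketch below. The control node and parameter names here are assumptions
for illustration only; the authoritative syntax is in the series'
documentation and in the Kconfig text that the mentioned V2 corrects.

```sh
# Hypothetical sketch of the configuration flow being described; the
# /dev/zloop-control node and the parameter names are assumptions, not
# the driver's verified syntax.
mkdir -p /var/local/zloop/0        # backing directory, one file per zone

# Creating a device is a single echo...
echo "add id=0,capacity_mb=16384,zone_size_mb=256,base_dir=/var/local/zloop" \
        > /dev/zloop-control

ls -l /dev/zloop0                  # the emulated zoned block device

# ...and so is removing it.
echo "remove id=0" > /dev/zloop-control
```

Nothing beyond a shell is needed in the test VM, which is the "no
dependencies on user tools/libraries" point made above.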
On 1/6/25 6:08 PM, Damien Le Moal wrote:
> On 1/7/25 02:38, Jens Axboe wrote:
>>> I think this is a valid and reasonable discussion. But maybe we're
>>> just not on the same page. I don't know anything existing and usable,
>>> maybe I've just not found it?
>>
>> Not that I'm aware of, it was just a suggestion/thought that we could
>> utilize an existing driver for this, rather than have a separate one.
>> Yes the proposed one is pretty simple and not large, and maintaining it
>> isn't a big deal, but it's still a new driver and hence why I was asking
>> "why can't we just use ublk for this". That also keeps the code mostly
>> in userspace which is nice, rather than needing kernel changes for new
>> features, changes, etc.
>
> I did consider ublk at some point but did not switch to it because a
> ublk backend driver to do the same as zloop in userspace would need a
> lot more code to be efficient. And even then, as Christoph already
> mentioned, we would still have performance suffer from the context
> switches. But that performance point was not the primary stopper

I don't buy this context switch argument at all. Why would it mean more
sleeping? There's absolutely zero reason why a ublk solution wouldn't be
at least as performant as the kernel one. And why would it need "a lot
more code to be efficient"?

> though as this driver is not intended for production use but rather to
> be the simplest possible setup that can be used in CI systems to test
> zoned file systems (among other zone related things).

Right, that too.

> A kernel-based implementation is simpler and the configuration
> interface literally needs only a single echo bash command to add or
> remove devices. This allows minimal VM configurations with no
> dependencies on user tools/libraries to run these zoned devices, which
> is what we wanted.
>
> I completely agree about the user-space vs kernel tradeoff you
> mentioned. I did consider it but the code simplicity and ease of use
> in practice won for us and I chose to stick with the kernel driver
> approach.
>
> Note that if you are OK with this, I need to send a V2 to correct the
> Kconfig description which currently shows an invalid configuration
> command example.

Sure, I'm not totally against it, even if I think the arguments are
very weak, and in some places also just wrong. It's not like it's a
huge driver.
On 1/6/25 11:05 AM, Christoph Hellwig wrote:
> On Mon, Jan 06, 2025 at 10:38:24AM -0700, Jens Axboe wrote:
>>> just not on the same page. I don't know anything existing and usable,
>>> maybe I've just not found it?
>>
>> Not that I'm aware of, it was just a suggestion/thought that we could
>> utilize an existing driver for this, rather than have a separate one.
>> Yes the proposed one is pretty simple and not large, and maintaining it
>> isn't a big deal, but it's still a new driver and hence why I was asking
>> "why can't we just use ublk for this". That also keeps the code mostly
>> in userspace which is nice, rather than needing kernel changes for new
>> features, changes, etc.
>
> Well, the reason to do a kernel driver rather than a ublk back end
> boils down to a few things:
>
> - writing highly concurrent code is actually a lot simpler in the kernel
>   than in userspace because we have the right primitives for it
> - these primitives tend to actually be a lot faster than those available
>   in glibc as well

That's certainly true.

> - the double context switch into the kernel and back for a ublk device
>   backed by a file system will actually show up for some xfstests that
>   do a lot of synchronous ops

Like I replied to Damien, that's mostly a bogus argument. If you're
doing sync stuff, you can do that with a single system call. If you're
building up depth, then it doesn't matter.

> - having an in-tree kernel driver that you just configure / unconfigure
>   from the shell is a lot easier to use than a daemon that needs to
>   be running. Especially from xfstests or other test suites that do
>   a lot of per-test setup and teardown

This is always true when it's a new piece of userspace, but not
necessarily true once the use case has been established.

> - the kernel actually has really nice infrastructure for block drivers.
>   I'm pretty sure doing this in userspace would actually be more
>   code, while being harder to use and lower performance.

That's very handwavy...

> So we could go both ways, but the kernel version was pretty obviously
> the preferred one to me. Maybe that's a little biased by doing a lot
> of kernel work, and having run into a lot of problems and performance
> issues with the SCSI target user backend lately.

Sure, that is understandable.
On Mon, Jan 06, 2025 at 04:21:18PM +0100, Christoph Hellwig wrote:
> On Mon, Jan 06, 2025 at 07:54:05AM -0700, Jens Axboe wrote:
> > On 1/6/25 7:24 AM, Damien Le Moal wrote:
> > > The first patch implements the new "zloop" zoned block device driver
> > > which allows creating zoned block devices using one regular file per
> > > zone as backing storage.
> >
> > Couldn't we do this with ublk and keep most of this stuff in userspace
> > rather than need a whole new loop driver?
>
> I'm pretty sure we could do that. But dealing with ublk is a complete
> pain, especially when setting it up and tearing it down all the time for
> testing, and would require a lot more code, so why? As-is I can directly

You can link with libublk or add it to rublk, which already supports a
zoned ramdisk, then install rublk from crates.io directly to set up the
test.

Forking a new loop driver could add much more pain since you may have to
address everything we have fixed for loop; please look at 'git log loop'.

Thanks,
Ming
On Mon, Jan 06, 2025 at 04:44:33PM +0100, Christoph Hellwig wrote:
> On Mon, Jan 06, 2025 at 08:38:26AM -0700, Jens Axboe wrote:
> > On 1/6/25 8:32 AM, Christoph Hellwig wrote:
> > > On Mon, Jan 06, 2025 at 08:24:06AM -0700, Jens Axboe wrote:
> > >> A lot more code where?
> > >
> > > Very good and relevant question. Some random new repo that no one knows
> > > about? Not very helpful. xfstests itself? Maybe, but that would just
> > > mean other users have to fork it.
> >
> > Why would they have to fork it? Just put it in xfstests itself. These
> > are very weak reasons, imho.
>
> Because that way other users can't use it. Damien has already mentioned
> some.

- cargo install rublk
- rublk add zoned

Then you can set up xfstests over the ublk/zoned disk; also, Fedora 42
is starting to ship rublk.

Thanks,
Ming
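To make the flow Ming suggests concrete, here is a minimal sketch. The
device names and the xfstests variables are assumptions about a typical
local.config and may not match what rublk actually creates on a given
system.

```sh
# Assumed flow: rublk creates /dev/ublkb0, /dev/ublkb1, etc.; check what
# 'rublk add zoned' actually reports on your system.
cargo install rublk
rublk add zoned        # test device
rublk add zoned        # scratch device

# Illustrative xfstests configuration pointing at the ublk devices.
cat >> local.config <<'EOF'
TEST_DEV=/dev/ublkb0
SCRATCH_DEV=/dev/ublkb1
EOF
```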
On 1/8/25 11:29 AM, Ming Lei wrote:
> On Mon, Jan 06, 2025 at 04:21:18PM +0100, Christoph Hellwig wrote:
>> On Mon, Jan 06, 2025 at 07:54:05AM -0700, Jens Axboe wrote:
>>> On 1/6/25 7:24 AM, Damien Le Moal wrote:
>>>> The first patch implements the new "zloop" zoned block device driver
>>>> which allows creating zoned block devices using one regular file per
>>>> zone as backing storage.
>>>
>>> Couldn't we do this with ublk and keep most of this stuff in userspace
>>> rather than need a whole new loop driver?
>>
>> I'm pretty sure we could do that. But dealing with ublk is a complete
>> pain, especially when setting it up and tearing it down all the time for
>> testing, and would require a lot more code, so why? As-is I can directly
>
> You can link with libublk or add it to rublk, which already supports a
> zoned ramdisk, then install rublk from crates.io directly to set up the
> test.

Thanks, but memory backing is not what we want. We need to emulate large
drives for FS tests (to catch problems such as overflows), and for that,
a file-based storage backing is better.

> Forking a new loop driver could add much more pain since you may have to
> address everything we have fixed for loop; please look at 'git log loop'.

Which is why Christoph started with the kernel driver approach in the
first place: to avoid such issues and difficulties.
On 1/8/25 6:08 AM, Jens Axboe wrote:
>> A kernel-based implementation is simpler and the configuration
>> interface literally needs only a single echo bash command to add or
>> remove devices. This allows minimal VM configurations with no
>> dependencies on user tools/libraries to run these zoned devices, which
>> is what we wanted.
>>
>> I completely agree about the user-space vs kernel tradeoff you
>> mentioned. I did consider it but the code simplicity and ease of use
>> in practice won for us and I chose to stick with the kernel driver
>> approach.
>>
>> Note that if you are OK with this, I need to send a V2 to correct the
>> Kconfig description which currently shows an invalid configuration
>> command example.
>
> Sure, I'm not totally against it, even if I think the arguments are
> very weak, and in some places also just wrong. It's not like it's a
> huge driver.

I am not going to try contesting that our arguments are somewhat weak.
Yes, if we spend enough time on it, we could eventually get something
workable with ublk. But with that said, when you spend your days
developing and testing stuff for zoned storage, having a super easy to
use emulation setup for VMs without any userspace dependencies does a
world of good for productivity. That is a strong argument for those
involved, I think.

So may I send a V2 to get it queued up?
On Tue, Jan 07, 2025 at 02:08:20PM -0700, Jens Axboe wrote:
> > ublk backend driver to do the same as zloop in userspace would need a
> > lot more code to be efficient. And even then, as Christoph already
> > mentioned, we would still have performance suffer from the context
> > switches. But that performance point was not the primary stopper
>
> I don't buy this context switch argument at all.

The zloop write goes straight from kblockd into the filesystem. ublk
switches to userspace, which goes back to the kernel when the file
system writes. Similar double context switch on the completion side.

> Why would it mean more
> sleeping?

?

> There's absolutely zero reason why a ublk solution wouldn't be at
> least as performant as the kernel one.

Well, prove it. From having worked on similar schemes in the past I
highly doubt it.

> And why would it need "a lot more code to be efficient"?

Because we don't have all the nice locking and even infrastructure in
userspace that we have in the kernel.
On Wed, Jan 08, 2025 at 10:29:57AM +0800, Ming Lei wrote:
> You can link with libublk or add it to rublk, which already supports a
> zoned ramdisk, then install rublk from crates.io directly to set up the
> test.

Ramdisks are nicely supported in null_blk already. And Rust crates are a
massive pain as they tend to not be packaged nicely. Exactly what I do
not want to depend on.

> Forking a new loop driver could add much more pain since you may have to
> address everything we have fixed for loop; please look at 'git log loop'.

The biggest problem with the loop driver is the historic baggage in the
user interface. That's sidestepped by this driver (and even for
conventional devices a loop-ng doing the same might be nice, but that's
a separate story).
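For comparison, the zoned-ramdisk setup that null_blk already covers is
also a one-liner; the parameter values below are illustrative only.

```sh
# Zoned, memory-backed emulation using null_blk module parameters;
# zone_size is in MiB, values chosen purely for illustration.
modprobe null_blk nr_devices=1 memory_backed=1 zoned=1 zone_size=256
ls -l /dev/nullb0
```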
On Tue, Jan 07, 2025 at 02:10:45PM -0700, Jens Axboe wrote:
> > - the double context switch into the kernel and back for a ublk device
> >   backed by a file system will actually show up for some xfstests that
> >   do a lot of synchronous ops
>
> Like I replied to Damien, that's mostly a bogus argument. If you're
> doing sync stuff, you can do that with a single system call. If you're
> building up depth, then it doesn't matter.

How do I do a single system call to retrieve the request from the kernel
and execute it on the file system after examining it?
On Wed, Jan 8, 2025 at 1:07 PM Damien Le Moal <dlemoal@kernel.org> wrote:
>
> On 1/8/25 11:29 AM, Ming Lei wrote:
> > On Mon, Jan 06, 2025 at 04:21:18PM +0100, Christoph Hellwig wrote:
> >> On Mon, Jan 06, 2025 at 07:54:05AM -0700, Jens Axboe wrote:
> >>> On 1/6/25 7:24 AM, Damien Le Moal wrote:
> >>>> The first patch implements the new "zloop" zoned block device driver
> >>>> which allows creating zoned block devices using one regular file per
> >>>> zone as backing storage.
> >>>
> >>> Couldn't we do this with ublk and keep most of this stuff in userspace
> >>> rather than need a whole new loop driver?
> >>
> >> I'm pretty sure we could do that. But dealing with ublk is a complete
> >> pain, especially when setting it up and tearing it down all the time for
> >> testing, and would require a lot more code, so why? As-is I can directly
> >
> > You can link with libublk or add it to rublk, which already supports a
> > zoned ramdisk, then install rublk from crates.io directly to set up the
> > test.
>
> Thanks, but memory backing is not what we want. We need to emulate large
> drives for FS tests (to catch problems such as overflows), and for that,
> a file-based storage backing is better.

It is backed by virtual memory, which can be big enough because of swap, and
it is also easy to extend to file-backed support since zloop doesn't store
zone metadata, which is similar to the ram-backed zoned device, actually.

Unlike loop, zloop can only serve test purposes, because each zone's
metadata is always reset when adding a new device.

Thanks,
Ming
On Wed, Jan 08, 2025 at 04:13:01PM +0800, Ming Lei wrote:
> It is backed by virtual memory, which can be big enough because of swap, and

Good luck getting halfway decent performance out of swapping for a 50TB
data set. Or even a partially filled one, which really is the use case
here, so it might only be a TB or so.

> it is also easy to extend to file-backed support since zloop doesn't store
> zone metadata, which is similar to the ram-backed zoned device, actually.

No, zloop does store the write pointer in the file size of each zone.
That's sorta the whole point because it enables things like mount and
even power fail testing.

All of this is mentioned explicitly in the commit logs, documentation and
code comments, so claiming something else here feels a bit uninformed.
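A small sketch of the point being made here: because each zone is one
regular file, the write pointer falls out of the backing file's size.
The directory layout and zone file names below are assumptions for
illustration, not the driver's actual format.

```sh
# Sketch of the "write pointer == backing file size" idea; the layout
# /var/local/zloop/<dev-id>/<zone-number> is an assumption.
zone_size=$((256 * 1024 * 1024))       # assumed zone size: 256 MiB
zone_file=/var/local/zloop/0/000042    # assumed per-zone backing file

wp=$(stat -c %s "$zone_file")          # bytes already written to this zone
if [ "$wp" -eq 0 ]; then
        echo "zone 42: empty"
elif [ "$wp" -lt "$zone_size" ]; then
        echo "zone 42: open, write pointer at byte offset $wp"
else
        echo "zone 42: full"
fi
```

Since that state lives in the backing files themselves, it survives
removing and re-adding a device over the same directory, which is
presumably what makes the mount and power-fail testing mentioned above
possible.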
On Wed, Jan 08, 2025 at 10:09:12AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 08, 2025 at 04:13:01PM +0800, Ming Lei wrote:
> > It is backed by virtual memory, which can be big enough because of swap, and
>
> Good luck getting halfway decent performance out of swapping for a 50TB
> data set. Or even a partially filled one, which really is the use case
> here, so it might only be a TB or so.
>
> > it is also easy to extend to file-backed support since zloop doesn't store
> > zone metadata, which is similar to the ram-backed zoned device, actually.
>
> No, zloop does store the write pointer in the file size of each zone.
> That's sorta the whole point because it enables things like mount and
> even power fail testing.
>
> All of this is mentioned explicitly in the commit logs, documentation and
> code comments, so claiming something else here feels a bit uninformed.

OK, that looks like one smart idea. It is easy to extend rublk/zoned in
this way with io_uring IO emulation. :-)

Thanks,
Ming
On Wed, Jan 08, 2025 at 10:47:57AM +0800, Ming Lei wrote:
> > > Why would they have to fork it? Just put it in xfstests itself. These
> > > are very weak reasons, imho.
> >
> > Because that way other users can't use it. Damien has already mentioned
> > some.
>
> - cargo install rublk
> - rublk add zoned
>
> Then you can set up xfstests over the ublk/zoned disk; also, Fedora 42
> is starting to ship rublk.

Um, I build xfstests on Debian Stable; other people build xfstests on
enterprise Linux distributions (e.g., RHEL). It'd be really nice if we
don't add a Rust dependency to xfstests anytime soon. Or at least, have
a way of skipping tests that have a Rust dependency if xfstests is built
on a system that doesn't have Rust, and not add a Rust dependency to
existing tests, so that we don't suddenly lose a lot of test coverage
all in the name of adding Rust....

- Ted