Message ID | 20230422-uring-getdents-v1-2-14c1db36e98c@codewreck.org (mailing list archive)
State      | New
Series     | io_uring: add getdents support, take 2
On Sat, Apr 22, 2023 at 05:40:19PM +0900, Dominique Martinet wrote:
> This adds support for getdents64 to io_uring, acting exactly like the
> syscall: the directory is iterated from its current position as
> stored in the file struct, and the file's position is updated exactly as
> if getdents64 had been called.
>
> Additionally, since io_uring has no way of issuing a seek, a flag
> IORING_GETDENTS_REWIND has been added that will seek to the start of the
> directory like rewinddir(3) for users that might require such a thing.
> This will act exactly as if seek then getdents64 had been called
> sequentially with no atomicity guarantee:
> if this wasn't clear, it is the responsibility of the caller not to use
> getdents multiple times on a single file in parallel if they want useful
> results, as is currently the case when using the syscall from multiple
> threads.
>
> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
> ---
>  include/uapi/linux/io_uring.h |  7 ++++++
>  io_uring/fs.c                 | 51 +++++++++++++++++++++++++++++++++++++++++++
>  io_uring/fs.h                 |  3 +++
>  io_uring/opdef.c              |  8 +++++++
>  4 files changed, 69 insertions(+)

This doesn't actually introduce non-blocking getdents operations, so
what's the point? If it just shuffles the getdents call off to a
background thread, why bother with io_uring in the first place?

Filesystems like XFS can easily do non-blocking getdents calls - we
just need the NOWAIT plumbing (like we added to the IO path with
IOCB_NOWAIT) to tell the filesystem not to block on locks or IO.
Indeed, filesystems often have async readahead built into their
getdents paths (XFS does), so it seems to me that we really want
non-blocking getdents to allow filesystems to take full advantage of
doing work without blocking and then shuffling the remainder off to
a background thread when it actually needs to wait for IO....

Cheers,
Dave.
Dave Chinner wrote on Mon, Apr 24, 2023 at 08:40:45AM +1000:
> This doesn't actually introduce non-blocking getdents operations, so
> what's the point? If it just shuffles the getdents call off to a
> background thread, why bother with io_uring in the first place?

As said in the cover letter, my main motivation really is simplifying
the userspace application:
- style-wise, mixing plain old getdents(2) or readdir(3) into the middle
  of an io_uring handling loop just feels wrong; but this may just be my
  OCD talking.
- in my understanding io_uring has its own thread pool, so even if the
  actual getdents is blocking, other IOs can progress (assuming there
  are fewer blocked getdents than threads), without having to build
  one's own extra thread pool next to the uring handling.

Looking at io_uring/fs.c, the other "metadata related" calls there also
use the synchronous APIs (renameat, unlinkat, mkdirat, symlinkat and
linkat all do), so I didn't think of that as a problem in itself.

> Filesystems like XFS can easily do non-blocking getdents calls - we
> just need the NOWAIT plumbing (like we added to the IO path with
> IOCB_NOWAIT) to tell the filesystem not to block on locks or IO.
> Indeed, filesystems often have async readahead built into their
> getdents paths (XFS does), so it seems to me that we really want
> non-blocking getdents to allow filesystems to take full advantage of
> doing work without blocking and then shuffling the remainder off to
> a background thread when it actually needs to wait for IO....

I believe that can be done without any change to this API, so that'll be
a very welcome addition when it is ready; I don't think adding the uring
op should wait on this if we can agree a simple wrapper API is good
enough (or come up with a better one if someone has a Good Idea).

(looking at io_uring/rw.c for comparison, io_getdents() will "just" need
to be adjusted to issue an async req if IO_URING_F_NONBLOCK is set, and
the poll/retry logic sorted out)

Thanks,
On Mon, Apr 24 2023 at 08:43:00 +0900, Dominique Martinet quoth thus:
> Dave Chinner wrote on Mon, Apr 24, 2023 at 08:40:45AM +1000:
> > This doesn't actually introduce non-blocking getdents operations, so
> > what's the point? If it just shuffles the getdents call off to a
> > background thread, why bother with io_uring in the first place?
>
> As said in the cover letter my main motivation really is simplifying the
> userspace application:
> - style-wise, mixing plain old getdents(2) or readdir(3) into the middle
>   of an io_uring handling loop just feels wrong; but this may just be my
>   OCD talking.
> - in my understanding io_uring has its own thread pool, so even if the
>   actual getdents is blocking, other IOs can progress (assuming there are
>   fewer blocked getdents than threads), without having to build one's own
>   extra thread pool next to the uring handling.
> Looking at io_uring/fs.c, the other "metadata related" calls there also
> use the synchronous APIs (renameat, unlinkat, mkdirat, symlinkat and
> linkat all do), so I didn't think of that as a problem in itself.

Having written code to create additional worker threads in addition to
using io_uring as a main loop, I'm glad to see this proposal back again.
I think the original work was shot down much too hastily based on the
file positioning issues. Really only two cases of directory position
are useful*, which can be expressed either as an off_t (0 = rewind,
-1 = curpos), or a single bit flag. As I understand the code, the
rewind case shouldn't be any problem. From a practical viewpoint, I
don't think non-blocking would see a lot of use, but it wouldn't hurt.

This also seems like a good place to bring up a point I made with the
last attempt at this code. You're missing an optimization here.
getdents knows whether it is returning a buffer because the next entry
won't fit versus because there are no more entries. As it doesn't
return that information, callers must always keep calling it back until
EOF. This means a completely unnecessary call is made for every open
directory. In other words, for a directory scan where the buffers are
large enough to not overflow, literally twice as many calls are made to
getdents as necessary. As io_uring is in-kernel, it could use an
internal interface to getdents which would return an EOF indicator along
with the (probably non-empty) buffer. io_uring would then return that
flag with the CQE.

(* As an aside, the only place I've ever seen a non-zero lseek on a
directory is in a very resource limited environment, e.g. too small an
open files limit. In the case of a depth-first directory scan, it must
close directories before completely reading them, and reopen / lseek to
their previous position in order to continue. This scenario is
certainly not worth bothering with for io_uring.)

> > Filesystems like XFS can easily do non-blocking getdents calls - we
> > just need the NOWAIT plumbing (like we added to the IO path with
> > IOCB_NOWAIT) to tell the filesystem not to block on locks or IO.
> > Indeed, filesystems often have async readahead built into their
> > getdents paths (XFS does), so it seems to me that we really want
> > non-blocking getdents to allow filesystems to take full advantage of
> > doing work without blocking and then shuffling the remainder off to
> > a background thread when it actually needs to wait for IO....
>
> I believe that can be done without any change to this API, so that'll be
> a very welcome addition when it is ready; I don't think adding the
> uring op should wait on this if we can agree a simple wrapper API is
> good enough (or come up with a better one if someone has a Good Idea)
>
> (looking at io_uring/rw.c for comparison, io_getdents() will "just" need
> to be adjusted to issue an async req if IO_URING_F_NONBLOCK is set, and
> the poll/retry logic sorted out)
>
> Thanks,
> --
> Dominique Martinet | Asmadeus
Thanks!

Clay Harris wrote on Mon, Apr 24, 2023 at 02:29:46AM -0500:
> This also seems like a good place to bring up a point I made with
> the last attempt at this code. You're missing an optimization here.
> getdents knows whether it is returning a buffer because the next entry
> won't fit versus because there are no more entries. As it doesn't
> return that information, callers must always keep calling it back
> until EOF. This means a completely unnecessary call is made for
> every open directory. In other words, for a directory scan where
> the buffers are large enough to not overflow, literally twice
> as many calls are made to getdents as necessary. As io_uring is
> in-kernel, it could use an internal interface to getdents which would
> return an EOF indicator along with the (probably non-empty) buffer.
> io_uring would then return that flag with the CQE.

Sorry I didn't spot that comment in the last iteration of the patch;
that sounds interesting.

This isn't straightforward even in-kernel though: the ctx.actor callback
(filldir64) isn't called when we're done, so we only know we couldn't
fill in the buffer.
We could have the callback record 'buffer full' and consider we're done
if the buffer is full, or just single-handedly declare we are if we have
more than `MAXNAMLEN + sizeof(struct linux_dirent64)` left over, but I
assume a filesystem is allowed to return what it has readily available
and expect the user to come back later?
In which case we cannot use this as a heuristic...

So if we do this, it'll require a way for filesystems to say they're
filling in as much as they can, or go the sledgehammer way of adding an
extra dir_context callback; either way I'm not sure I want to deal with
all that immediately unless I'm told all filesystems will fill as much
as possible without ever failing for any temporary reason in the middle
of iterate/iterate_shared().

Call me greedy, but I believe such a flag in the CQE could also be added
later on without any bad side effects (as it's optional to check it to
stop calling early, and there's no harm in not setting it)?

> (* As an aside, the only place I've ever seen a non-zero lseek on a
> directory, is in a very resource limited environment, e.g. too small
> open files limit. In the case of a depth-first directory scan, it
> must close directories before completely reading them, and reopen /
> lseek to their previous position in order to continue. This scenario
> is certainly not worth bothering with for io_uring.)

(I also thought of userspace NFS/9P servers, as these two at least get
requests from clients with an arbitrary offset, but I'll be glad to
forget about them for now...)
On Mon, Apr 24 2023 at 17:41:18 +0900, Dominique Martinet quoth thus:
> Thanks!
>
> Clay Harris wrote on Mon, Apr 24, 2023 at 02:29:46AM -0500:
> > This also seems like a good place to bring up a point I made with
> > the last attempt at this code. You're missing an optimization here.
> > getdents knows whether it is returning a buffer because the next entry
> > won't fit versus because there are no more entries. As it doesn't
> > return that information, callers must always keep calling it back
> > until EOF. This means a completely unnecessary call is made for
> > every open directory. In other words, for a directory scan where
> > the buffers are large enough to not overflow, literally twice
> > as many calls are made to getdents as necessary. As io_uring is
> > in-kernel, it could use an internal interface to getdents which would
> > return an EOF indicator along with the (probably non-empty) buffer.
> > io_uring would then return that flag with the CQE.
>
> Sorry I didn't spot that comment in the last iteration of the patch;
> that sounds interesting.
>
> This isn't straightforward even in-kernel though: the ctx.actor callback
> (filldir64) isn't called when we're done, so we only know we couldn't
> fill in the buffer.
> We could have the callback record 'buffer full' and consider we're done
> if the buffer is full, or just single-handedly declare we are if we have
> more than `MAXNAMLEN + sizeof(struct linux_dirent64)` left over, but I
> assume a filesystem is allowed to return what it has readily available
> and expect the user to come back later?
> In which case we cannot use this as a heuristic...
>
> So if we do this, it'll require a way for filesystems to say they're
> filling in as much as they can, or go the sledgehammer way of adding an
> extra dir_context callback; either way I'm not sure I want to deal with
> all that immediately unless I'm told all filesystems will fill as much
> as possible without ever failing for any temporary reason in the middle
> of iterate/iterate_shared().

I don't have a complete understanding of this area, but my thought was
not that we would look for any buffer full condition, but rather that
an iterator could be tested for next_entry == EOF.

> Call me greedy, but I believe such a flag in the CQE could also be added
> later on without any bad side effects (as it's optional to check it to
> stop calling early, and there's no harm in not setting it)?

Certainly it could be added later, but I wanted to make sure some
thought was put into it now. It would be nice to have it sooner rather
than later though...

> > (* As an aside, the only place I've ever seen a non-zero lseek on a
> > directory, is in a very resource limited environment, e.g. too small
> > open files limit. In the case of a depth-first directory scan, it
> > must close directories before completely reading them, and reopen /
> > lseek to their previous position in order to continue. This scenario
> > is certainly not worth bothering with for io_uring.)
>
> (I also thought of userspace NFS/9P servers, as these two at least get
> requests from clients with an arbitrary offset, but I'll be glad to
> forget about them for now...)
>
> --
> Dominique Martinet | Asmadeus
Clay Harris wrote on Mon, Apr 24, 2023 at 04:20:55AM -0500:
> > This isn't straightforward even in-kernel though: the ctx.actor callback
> > (filldir64) isn't called when we're done, so we only know we couldn't
> > fill in the buffer.
> > We could have the callback record 'buffer full' and consider we're done
> > if the buffer is full, or just single-handedly declare we are if we have
> > more than `MAXNAMLEN + sizeof(struct linux_dirent64)` left over, but I
> > assume a filesystem is allowed to return what it has readily available
> > and expect the user to come back later?
> > In which case we cannot use this as a heuristic...
> >
> > So if we do this, it'll require a way for filesystems to say they're
> > filling in as much as they can, or go the sledgehammer way of adding an
> > extra dir_context callback; either way I'm not sure I want to deal with
> > all that immediately unless I'm told all filesystems will fill as much
> > as possible without ever failing for any temporary reason in the middle
> > of iterate/iterate_shared().
>
> I don't have a complete understanding of this area, but my thought was
> not that we would look for any buffer full condition, but rather that
> an iterator could be tested for next_entry == EOF.

I'm afraid I don't see any such iterator readily available in the common
code, so that would also have to be per fs as far as I can see :/

Also, some filesystems just don't have a next_entry iterator readily
available; for example on 9p the network protocol is pretty much exactly
like getdents, and a readdir call is issued continuously until either
filldir64 returns an error (e.g. buffer full) or the server returns 0
(or an error); if you leave the v9fs_dir_readdir{,_dotl}() call you lose
this information immediately.

Getting this information out would be doubly appreciated, because the
"final getdents" to confirm the dir has been fully read is redundant
with the previous last 9p readdir which ended the previous iteration
(could also be fixed by caching...), but that'll really require fixing
up the iterate_shared() operation to signal such a thing...

> > Call me greedy, but I believe such a flag in the CQE could also be added
> > later on without any bad side effects (as it's optional to check it to
> > stop calling early, and there's no harm in not setting it)?
>
> Certainly it could be added later, but I wanted to make sure some
> thought was put into it now. It would be nice to have it sooner rather
> than later though...

Yes, if there's an easy way forward I overlooked I'd be happy to spend a
bit more time on it; and discussing never hurts.
In for a penny, in for a pound :)
On Mon, Apr 24, 2023 at 08:43:00AM +0900, Dominique Martinet wrote:
> Dave Chinner wrote on Mon, Apr 24, 2023 at 08:40:45AM +1000:
> > This doesn't actually introduce non-blocking getdents operations, so
> > what's the point? If it just shuffles the getdents call off to a
> > background thread, why bother with io_uring in the first place?
>
> As said in the cover letter my main motivation really is simplifying the
> userspace application:
> - style-wise, mixing plain old getdents(2) or readdir(3) into the middle
>   of an io_uring handling loop just feels wrong; but this may just be my
>   OCD talking.
> - in my understanding io_uring has its own thread pool, so even if the
>   actual getdents is blocking, other IOs can progress (assuming there are
>   fewer blocked getdents than threads), without having to build one's own
>   extra thread pool next to the uring handling.
> Looking at io_uring/fs.c, the other "metadata related" calls there also
> use the synchronous APIs (renameat, unlinkat, mkdirat, symlinkat and
> linkat all do), so I didn't think of that as a problem in itself.

I think you missed the point. getdents is not an exclusive operation -
it is run under shared locking, unlike all the other directory
modification operations you cite above. They use exclusive locking, so
there's no real benefit to trying to run them non-blocking or as async
operations: they are single threaded and will consume a single thread
context from start to end.

Further, one of the main reasons they get punted to the per-thread pool
is so that io_uring can optimise away the lock contention caused by
running multiple work threads on exclusively locked objects; it does
this by only running one work item per inode at a time. This is exactly
what we don't want with getdents - we want to be able to run as many
concurrent getdents and lookup operations in parallel as we can, as they
all use shared locking.

IOWs, getdents and inode lookups are much closer in behaviour and
application use to concurrent buffered data reads than they are to
directory modification operations. We can already do concurrent
getdents/lookup operations on a single directory from userspace with
multiple threads, but the way this series adds support to io_uring
somewhat prevents concurrent getdents/lookup operations on the same
directory inode via io_uring. IOWs, adding getdents support to io_uring
like this is not a step forwards for applications that use/need
concurrency in directory lookup operations.

Keep in mind that if the directory is small enough to fit in the inode,
XFS can return all the getdents information immediately, as it is
guaranteed to be in memory without doing any IO at all. Why should that
commonly-hit fast path get punted to a work queue and suddenly cost an
application at least two extra context switches?

> > Filesystems like XFS can easily do non-blocking getdents calls - we
> > just need the NOWAIT plumbing (like we added to the IO path with
> > IOCB_NOWAIT) to tell the filesystem not to block on locks or IO.
> > Indeed, filesystems often have async readahead built into their
> > getdents paths (XFS does), so it seems to me that we really want
> > non-blocking getdents to allow filesystems to take full advantage of
> > doing work without blocking and then shuffling the remainder off to
> > a background thread when it actually needs to wait for IO....
>
> I believe that can be done without any change to this API, so that'll be
> a very welcome addition when it is ready;

Again, I think you miss the point.

Non-blocking data IO came before io_uring, and we had the infrastructure
in place before io_uring took advantage of it. Application developers
asked the fs developers to add support for non-blocking direct IO
operations, and because we pretty much had all the infrastructure to
support it already in place, it got done quickly via preadv2/pwritev2
and the RWF_NOWAIT flags.

We already pass a struct dir_context to ->iterate_shared(), so we have a
simple way to pass context specific flags down to the filesystem from
iterate_dir(). This is similar to the iocb for file data IO, which
contains the flags field that holds the IOCB_NOWAIT context for io_uring
based IO. So the infrastructure to plumb it all the way down to the fs
implementation of ->iterate_shared is already there.

XFS also has async metadata IO capability, and we use that for readahead
in the xfs_readdir() implementation. Hence we've got all the parts we
need to do non-blocking readdir already in place. This is very similar
to how we already had all the pieces in the IO path ready to do
non-blocking IO well before anyone asked for IOCB_NOWAIT
functionality....

AFAICT, the io_uring code wouldn't need to do much more other than punt
to the work queue if it receives a -EAGAIN result. Otherwise what the
filesystem returns doesn't need to change, and I don't see that we need
to change how the filldir callbacks work, either. We just keep filling
the user buffer until we either run out of cached directory data or the
user buffer is full.

And as I've already implied, several filesystems perform async readahead
from their ->iterate_shared methods, so there's every chance they will
return some data while readahead IO is in progress. By the time the
io_uring processing loop gets back to issuing another getdents
operation, that IO will have completed and the application will be able
to read more dirents without blocking. The filesystem will issue more
readahead while processing what it already has available, and around the
loop we go.

> I don't think adding the uring op should wait on this if we can agree a
> simple wrapper API is good enough (or come up with a better one if
> someone has a Good Idea)

It doesn't look at all hard to me. If you add a NOWAIT context flag to
the dir_context it should be relatively trivial to connect all the parts
together. If you do all the VFS, io_uring and userspace testing
infrastructure work, I should be able to sort out the changes needed to
xfs_readdir() to support non-blocking ->iterate_shared() behaviour.

Cheers,
Dave.
Dave Chinner wrote on Fri, Apr 28, 2023 at 03:06:40PM +1000:
> > - in my understanding io_uring has its own thread pool, so even if the
> >   actual getdents is blocking, other IOs can progress (assuming there are
> >   fewer blocked getdents than threads), without having to build one's own
> >   extra thread pool next to the uring handling.
> > Looking at io_uring/fs.c, the other "metadata related" calls there also
> > use the synchronous APIs (renameat, unlinkat, mkdirat, symlinkat and
> > linkat all do), so I didn't think of that as a problem in itself.
>
> I think you missed the point. getdents is not an exclusive operation -
> it is run under shared locking, unlike all the other directory
> modification operations you cite above. They use exclusive locking, so
> there's no real benefit to trying to run them non-blocking or as async
> operations: they are single threaded and will consume a single thread
> context from start to end.
>
> Further, one of the main reasons they get punted to the per-thread pool
> is so that io_uring can optimise away the lock contention caused by
> running multiple work threads on exclusively locked objects; it does
> this by only running one work item per inode at a time.

Ah, I didn't know that -- thank you for this piece of information.
If we're serializing readdirs per inode, I agree that this
implementation is subpar for filesystems implementing iterate_shared.

> > > Filesystems like XFS can easily do non-blocking getdents calls - we
> > > just need the NOWAIT plumbing (like we added to the IO path with
> > > IOCB_NOWAIT) to tell the filesystem not to block on locks or IO.
> > > Indeed, filesystems often have async readahead built into their
> > > getdents paths (XFS does), so it seems to me that we really want
> > > non-blocking getdents to allow filesystems to take full advantage of
> > > doing work without blocking and then shuffling the remainder off to
> > > a background thread when it actually needs to wait for IO....
> >
> > I believe that can be done without any change to this API, so that'll be
> > a very welcome addition when it is ready;
>
> Again, I think you miss the point.
>
> Non-blocking data IO came before io_uring, and we had the infrastructure
> in place before io_uring took advantage of it. Application developers
> asked the fs developers to add support for non-blocking direct IO
> operations, and because we pretty much had all the infrastructure to
> support it already in place, it got done quickly via preadv2/pwritev2
> and the RWF_NOWAIT flags.
>
> We already pass a struct dir_context to ->iterate_shared(), so we have a
> simple way to pass context specific flags down to the filesystem from
> iterate_dir(). This is similar to the iocb for file data IO, which
> contains the flags field that holds the IOCB_NOWAIT context for io_uring
> based IO. So the infrastructure to plumb it all the way down to the fs
> implementation of ->iterate_shared is already there.

Sure, that sounds like a good approach that isn't breaking the API (not
breaking iterate/iterate_shared implementations that don't look at the
flags, and allowing the filesystems that want to look at it to do so).
Adding such a flag and setting it on the uring side without doing
anything else is trivial enough, if polling/rescheduling can be sorted
out.

(I guess at this rate the dir_context flag could also be modified by the
FS to signal it just filled in the last entry, for the other part of
this thread asking to know when the directory has been fully iterated
without wasting a trivial empty getdents)

> XFS also has async metadata IO capability, and we use that for readahead
> in the xfs_readdir() implementation. Hence we've got all the parts we
> need to do non-blocking readdir already in place. This is very similar
> to how we already had all the pieces in the IO path ready to do
> non-blocking IO well before anyone asked for IOCB_NOWAIT
> functionality....
>
> AFAICT, the io_uring code wouldn't need to do much more other than punt
> to the work queue if it receives a -EAGAIN result. Otherwise what the
> filesystem returns doesn't need to change, and I don't see that we need
> to change how the filldir callbacks work, either. We just keep filling
> the user buffer until we either run out of cached directory data or the
> user buffer is full.

I agree with the fs side of it; I'd like to confirm what the uring side
needs to do before proceeding -- looking at the read/write path there
seems to be a polling mechanism in place to tell uring when to look
again, and I haven't looked at this part of the code yet to see what
happens if no such polling is in place (does uring just retry
periodically?)
I'll have a look this weekend.

> > I don't think adding the uring op should wait on this if we can agree a
> > simple wrapper API is good enough (or come up with a better one if
> > someone has a Good Idea)
>
> It doesn't look at all hard to me. If you add a NOWAIT context flag to
> the dir_context it should be relatively trivial to connect all the parts
> together. If you do all the VFS, io_uring and userspace testing
> infrastructure work, I should be able to sort out the changes needed to
> xfs_readdir() to support non-blocking ->iterate_shared() behaviour.

I still think the userspace <-> ring API is unrelated to the nowait side
of it, but as said in the other part of the thread I'll be happy to give
it a bit more time if that helps moving forward.
I'll send another reply around the end of the weekend with either v2 of
this patch or a few more comments.

Thanks,
Dominique Martinet wrote on Fri, Apr 28, 2023 at 03:14:52PM +0900:
> > We already pass a struct dir_context to ->iterate_shared(), so we
> > have a simple way to add context specific flags down the filesystem
> > from iterate_dir(). This is similar to the iocb for file data IO
> > that contains the flags field that holds the IOCB_NOWAIT context for
> > io_uring based IO. So the infrastructure to plumb it all the way
> > down the fs implementation of ->iterate_shared is already there.
>
> Sure, that sounds like a good approach that isn't breaking the API (not
> breaking iterate/iterate_shared implementations that don't look at the
> flags and allowing the fs that want to look at it to do so)

Hmm, I said that, but io_getdents() needs to know whether the flag will
be honored or not (if it will be honored, we can call this when
issue_flags & IO_URING_F_NONBLOCK, but if we're not sure the fs handles
it then we risk blocking).

I'm not familiar with this part of the VFS, but I do not see any kind of
flag for the filesystem to signal whether it'll handle it or not -- this
is actually similar to iterate vs. iterate_shared, so that'll mean
adding iterate_shared_hasnonblock or something, which is getting silly.

I'm sure you understand this better than me and I'm missing something
obvious here, but I don't think I'll be able to make something I'm happy
with here (in a reasonable timeframe, anyway).

Thanks,
Dominique Martinet wrote on Fri, Apr 28, 2023 at 03:14:52PM +0900:
> > AFAICT, the io_uring code wouldn't need to do much more other than
> > punt to the work queue if it receives a -EAGAIN result. Otherwise
> > what the filesystem returns doesn't need to change, and I don't
> > see that we need to change how the filldir callbacks work, either.
> > We just keep filling the user buffer until we either run out of
> > cached directory data or the user buffer is full.
>
> [...] I'd like to confirm what the uring
> side needs to do before proceeding -- looking at the read/write path
> there seems to be a polling mechanism in place to tell uring when to
> look again, and I haven't looked at this part of the code yet to see
> what happens if no such polling is in place (does uring just retry
> periodically?)

Ok, so this part can work out as you said; I hadn't understood what you
meant by "punt to the work queue", but that should work from my new
understanding of the ring: we can just return EAGAIN if the non-blocking
variant doesn't have immediate results, and call the blocking variant
when we're called again without IO_URING_F_NONBLOCK in flags.
(So there's no need to try to add a form of polling, although that is
possible if we ever become able to do that; I'll just forget about this
and be happy this part is easy.)

That just leaves deciding whether a filesystem handles the non-blocking
variant or not; ideally if we can know early (at prep time) we can even
mark REQ_F_FORCE_ASYNC in flags to skip the non-blocking call for
filesystems that don't handle it, and we get the best of both worlds.

I've had a second look and I still don't see anything obvious though;
I'd rather avoid adding a new variant of iterate()/iterate_shared() --
we could use this as a chance to add a flag to struct file_operations
instead?

e.g., something like mmap_supported_flags:
-----
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7db..2ebbf48ee18b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1761,7 +1761,7 @@ struct file_operations {
 	int (*iopoll)(struct kiocb *kiocb, struct io_comp_batch *,
 			unsigned int flags);
 	int (*iterate) (struct file *, struct dir_context *);
-	int (*iterate_shared) (struct file *, struct dir_context *);
+	unsigned long iterate_supported_flags;
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
@@ -1797,6 +1797,10 @@ struct file_operations {
 			unsigned int poll_flags);
 } __randomize_layout;
 
+/** iterate_supported_flags */
+#define ITERATE_SHARED	0x1
+#define ITERATE_NOWAIT	0x2
+
 struct inode_operations {
 	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
 	const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *);
-----
and fix all usages of iterate_shared.

I guess at this rate it might make sense to rename mmap_supported_flags
to some more generic supported_flags instead?...

It's a bit more than I signed up for, but I guess it's still reasonable
enough. I'll wait for feedback before doing it though; please say if
this sounds good to you and I'll send a v2 with such a flag, as well as
adding flags to dir_context as you had suggested.

Thanks,
On Fri, Apr 28, 2023 at 08:27:53PM +0900, Dominique Martinet wrote:
> Dominique Martinet wrote on Fri, Apr 28, 2023 at 03:14:52PM +0900:
> > > We already pass a struct dir_context to ->iterate_shared(), so we
> > > have a simple way to add context specific flags down the filesystem
> > > from iterate_dir(). This is similar to the iocb for file data IO
> > > that contains the flags field that holds the IOCB_NOWAIT context for
> > > io_uring based IO. So the infrastructure to plumb it all the way
> > > down the fs implementation of ->iterate_shared is already there.
> >
> > Sure, that sounds like a good approach that isn't breaking the API (not
> > breaking iterate/iterate_shared implementations that don't look at the
> > flags and allowing the fs that want to look at it to do so)
>
> Hmm actually I said that, but io_getdents() needs to know if the flag
> will be honored or not (if it will be honored, we can call this when
> issue_flags & IO_URING_F_NONBLOCK but if we're not sure the fs handles
> it then we risk blocking)

See the new FMODE_DIO_PARALLEL_WRITE flag for triggering
filesystem-specific non-blocking io_uring behaviour....

-Dave.
On Sat, Apr 29, 2023 at 05:07:32PM +0900, Dominique Martinet wrote:
> Dominique Martinet wrote on Fri, Apr 28, 2023 at 03:14:52PM +0900:
> > > AFAICT, the io_uring code wouldn't need to do much more other than
> > > punt to the work queue if it receives a -EAGAIN result. Otherwise
> > > the what the filesystem returns doesn't need to change, and I don't
> > > see that we need to change how the filldir callbacks work, either.
> > > We just keep filling the user buffer until we either run out of
> > > cached directory data or the user buffer is full.
> >
> > [...] I'd like to confirm what the uring
> > side needs to do before proceeding -- looking at the read/write path
> > there seems to be a polling mechanism in place to tell uring when to
> > look again, and I haven't looked at this part of the code yet to see
> > what happens if no such polling is in place (does uring just retry
> > periodically?)
>
> Ok so this part can work out as you said, I hadn't understood what you
> meant by "punt to the work queue" but that should work from my new
> understanding of the ring; we can just return EAGAIN if the non-blocking
> variant doesn't have immediate results and call the blocking variant
> when we're called again without IO_URING_F_NONBLOCK in flags.
> (So there's no need to try to add a form of polling, although that is
> possible if we ever become able to do that; I'll just forget about this
> and be happy this part is easy)
>
> That just leaves deciding if a filesystem handles the non-blocking
> variant or not; ideally if we can know early (prep time) we can even
> mark REQ_F_FORCE_ASYNC in flags to skip the non-blocking call for
> filesystems that don't handle that and we get the best of both worlds.
>
> I've had a second look and I still don't see anything obvious though;
> I'd rather avoid adding a new variant of iterate()/iterate_shared() --
> we could use that as a chance to add a flag to struct file_operations
> instead?
> e.g., something like mmap_supported_flags:

I don't think that makes sense - the eventual goal is to make
->iterate() go away entirely and all filesystems use
->iterate_shared(). Hence I think adding flags to select iterate vs
iterate_shared and the locking that is needed is the wrong place to
start from here.

Whether the filesystem supports non-blocking ->iterate_shared() or
not is a filesystem implementation option and io_uring needs that
information to be held on the struct file for efficient
determination of whether it should use non-blocking operations or
not.

We already set per-filesystem file modes via the ->open method;
that's how we already tell io_uring that it can do NOWAIT IO, as
well as async read/write IO for regular files. And now we also use
it for FMODE_DIO_PARALLEL_WRITE, too.

See __io_file_supports_nowait()....

Essentially, io_uring already has the mechanism available to it
to determine if it should use NOWAIT semantics for getdents
operations; we just need to set FMODE_NOWAIT correctly for directory
files via ->open() on the filesystems that support it...

[ Hmmmm - we probably need to be more careful in XFS about what
types of files we set those flags on.... ]

Cheers,

Dave.
Dave Chinner wrote on Mon, May 01, 2023 at 09:32:41AM +1000:
> > I've had a second look and I still don't see anything obvious though;
> > I'd rather avoid adding a new variant of iterate()/iterate_shared() --
> > we could use that as a chance to add a flag to struct file_operations
> > instead? e.g., something like mmap_supported_flags:
>
> I don't think that makes sense - the eventual goal is to make
> ->iterate() go away entirely and all filesystems use
> ->iterate_shared(). Hence I think adding flags to select iterate vs
> iterate_shared and the locking that is needed is the wrong place to
> start from here.

(The flag could just go away when all filesystems not supporting it are
gone, and it could be made the other way around (e.g. an explicit
NOT_SHARED flag to encourage migrations), so I don't really see the
problem with this -- but your next point makes it moot anyway)

> Whether the filesystem supports non-blocking ->iterate_shared() or
> not is a filesystem implementation option and io_uring needs that
> information to be held on the struct file for efficient
> determination of whether it should use non-blocking operations or
> not.

Right, sorry. I was thinking that since it's fs/op-dependent it made
more sense to keep it next to the iterate operation, but it would be a
layering violation to look at the file_operations vector directly from
the uring code... So having it in the struct file is better from that
point of view.

> We already set per-filesystem file modes via the ->open method,
> that's how we already tell io_uring that it can do NOWAIT IO, as
> well as async read/write IO for regular files. And now we also use
> it for FMODE_DIO_PARALLEL_WRITE, too.
>
> See __io_file_supports_nowait()....
>
> Essentially, io_uring already has the mechanism available to it
> to determine if it should use NOWAIT semantics for getdents
> operations; we just need to set FMODE_NOWAIT correctly for directory
> files via ->open() on the filesystems that support it...
Great, I wasn't aware of FMODE_NOWAIT; things are starting to fall into
place. I'll send a v2 around Wednesday or Thursday (yay national
holidays).

> [ Hmmmm - we probably need to be more careful in XFS about what
> types of files we set those flags on.... ]

Yes, FMODE_NOWAIT will be set on directories, as xfs_dir_open calls
xfs_file_open, which sets it unconditionally... So I'll check that
other filesystems don't do something similar as a bonus, but it looks
like none of those that set FMODE_NOWAIT on regular files share the
file open path, so at least that shouldn't be too bad.
Happy to either fold the xfs fix in as a prerequisite patch of this
series or to let you do it, just tell me.

Thanks,
On Mon, May 01, 2023 at 09:49:31AM +0900, Dominique Martinet wrote:
> Dave Chinner wrote on Mon, May 01, 2023 at 09:32:41AM +1000:
> > > I've had a second look and I still don't see anything obvious though;
> > > I'd rather avoid adding a new variant of iterate()/iterate_shared() --
> > > we could use that as a chance to add a flag to struct file_operation
> > > instead? e.g., something like mmap_supported_flags:
> >
> > I don't think that makes sense - the eventual goal is to make
> > ->iterate() go away entirely and all filesystems use
> > ->iterate_shared(). Hence I think adding flags to select iterate vs
> > iterate_shared and the locking that is needed is the wrong place to
> > start from here.
>
> (The flag could just go away when all filesystems not supporting it are
> gone, and it could be made the other way around (e.g. explicit
> NOT_SHARED to encourage migrations), so I don't really see the problem
> with this but next point makes this moot anyway)
>
> > Whether the filesystem supports non-blocking ->iterate_shared() or
> > not is a filesystem implementation option and io_uring needs that
> > information to be held on the struct file for efficient
> > determination of whether it should use non-blocking operations or
> > not.
>
> Right, sorry. I was thinking that since it's fs/op dependant it made
> more sense to keep next to the iterate operation, but that'd be a
> layering violation to look directly at the file_operation vector
> from the uring code... So having it in the struct file is better from
> that point of view.
>
> > We already set per-filesystem file modes via the ->open method,
> > that's how we already tell io_uring that it can do NOWAIT IO, as
> > well as async read/write IO for regular files. And now we also use
> > it for FMODE_DIO_PARALLEL_WRITE, too.
> >
> > See __io_file_supports_nowait()....
> >
> > Essentially, io_uring already has the mechanism available to it
> > to determine if it should use NOWAIT semantics for getdents
> > operations; we just need to set FMODE_NOWAIT correctly for directory
> > files via ->open() on the filesystems that support it...
>
> Great, I wasn't aware of FMODE_NOWAIT; things are starting to fall in
> place.
> I'll send a v2 around Wed or Thurs (yay national holidays)
>
> > [ Hmmmm - we probably need to be more careful in XFS about what
> > types of files we set those flags on.... ]
>
> Yes, FMODE_NOWAIT will be set on directories as xfs_dir_open calls
> xfs_file_open which sets it unconditionally... So I got to check other
> filesystems don't do something similar as a bonus, but it looks like
> none that set FMODE_NOWAIT on regular files share the file open path,
> so at least that shouldn't be too bad.
> Happy to also fold the xfs fix as a prerequisite patch of this series or
> to let you do it, just tell me.

Yeah, we need to audit all the ->open() calls to ensure they do the
right thing - I think what XFS does is a harmless oversight at this
point; we can simply split xfs_file_open() and xfs_dir_open() as
appropriate.

Also, the nowait-enabled ->iterate_shared() method for XFS will look
something like the patch below. It's not tested in any way - I just
wrote it quickly to demonstrate the relative simplicity of converting
all the locking and IO interfaces in xfs_readdir() for NOWAIT
operation. Making it asynchronous (the equivalent of FMODE_BUF_RASYNC)
is a lot more work, but I'm not sure that is necessary given the async
readahead that gets issued...

Cheers,

Dave.
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 709de6d4feb2..44c87fad2714 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -65,6 +65,7 @@ struct io_uring_sqe {
 		__u32	xattr_flags;
 		__u32	msg_ring_flags;
 		__u32	uring_cmd_flags;
+		__u32	getdents_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	/* pack this to avoid bogus arm OABI complaints */
@@ -223,6 +224,7 @@ enum io_uring_op {
 	IORING_OP_URING_CMD,
 	IORING_OP_SEND_ZC,
 	IORING_OP_SENDMSG_ZC,
+	IORING_OP_GETDENTS,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
@@ -258,6 +260,11 @@ enum io_uring_op {
  */
 #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
 
+/*
+ * sqe->getdents_flags
+ */
+#define IORING_GETDENTS_REWIND	(1U << 0)
+
 /*
  * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
  * command flags for POLL_ADD are stored in sqe->len.
diff --git a/io_uring/fs.c b/io_uring/fs.c
index f6a69a549fd4..cb555bc738bd 100644
--- a/io_uring/fs.c
+++ b/io_uring/fs.c
@@ -47,6 +47,13 @@ struct io_link {
 	int				flags;
 };
 
+struct io_getdents {
+	struct file			*file;
+	struct linux_dirent64 __user	*dirent;
+	unsigned int			count;
+	int				flags;
+};
+
 int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
@@ -291,3 +298,47 @@ void io_link_cleanup(struct io_kiocb *req)
 	putname(sl->oldpath);
 	putname(sl->newpath);
 }
+
+int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
+
+	if (READ_ONCE(sqe->off) != 0)
+		return -EINVAL;
+
+	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	gd->count = READ_ONCE(sqe->len);
+	gd->flags = READ_ONCE(sqe->getdents_flags);
+	if (gd->flags & ~IORING_GETDENTS_REWIND)
+		return -EINVAL;
+
+	return 0;
+}
+
+int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
+	int ret;
+
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
+
+	if ((gd->flags & IORING_GETDENTS_REWIND)) {
+		ret = vfs_llseek(req->file, 0, SEEK_SET);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = vfs_getdents(req->file, gd->dirent, gd->count);
+out:
+	if (ret < 0) {
+		if (ret == -ERESTARTSYS)
+			ret = -EINTR;
+
+		req_set_fail(req);
+	}
+
+	io_req_set_res(req, ret, 0);
+	return 0;
+}
+
diff --git a/io_uring/fs.h b/io_uring/fs.h
index 0bb5efe3d6bb..f83a6f3a678d 100644
--- a/io_uring/fs.h
+++ b/io_uring/fs.h
@@ -18,3 +18,6 @@ int io_symlinkat(struct io_kiocb *req, unsigned int issue_flags);
 int io_linkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_linkat(struct io_kiocb *req, unsigned int issue_flags);
 void io_link_cleanup(struct io_kiocb *req);
+
+int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_getdents(struct io_kiocb *req, unsigned int issue_flags);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index cca7c5b55208..8f770c582ab3 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -428,6 +428,11 @@ const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_eopnotsupp_prep,
 #endif
 	},
+	[IORING_OP_GETDENTS] = {
+		.needs_file		= 1,
+		.prep			= io_getdents_prep,
+		.issue			= io_getdents,
+	},
 };
 
@@ -648,6 +653,9 @@ const struct io_cold_def io_cold_defs[] = {
 		.fail			= io_sendrecv_fail,
 #endif
 	},
+	[IORING_OP_GETDENTS] = {
+		.name			= "GETDENTS",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
This adds support for getdents64 to io_uring, acting exactly like the
syscall: the directory is iterated from its current position as stored
in the file struct, and the file's position is updated exactly as if
getdents64 had been called.

Additionally, since io_uring has no way of issuing a seek, a flag
IORING_GETDENTS_REWIND has been added that will seek to the start of
the directory like rewinddir(3) for users that might require such a
thing. This acts exactly as if a seek and then getdents64 had been
called sequentially, with no atomicity guarantee: it is the
responsibility of the caller not to use getdents multiple times on a
single file in parallel if they want useful results, as is currently
the case when using the syscall from multiple threads.

Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
---
 include/uapi/linux/io_uring.h |  7 ++++++
 io_uring/fs.c                 | 51 +++++++++++++++++++++++++++++++++++++++++++
 io_uring/fs.h                 |  3 +++
 io_uring/opdef.c              |  8 +++++++
 4 files changed, 69 insertions(+)