Message ID | c44fcf87d4c9d417b6cdced787091300fd45a3e4.1553637461.git.kirr@nexedi.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock | expand |
On Tue, 2019-03-26 at 23:23 +0000, Kirill Smelkov wrote: > Using scripts/coccinelle/api/stream_open.cocci added in the previous > patch, search and convert to stream_open all in-kernel nonseekable_open > users for which read and write actually do not depend on ppos and where > there is no other methods in file_operations which assume @offset > access. > > I've verified each generated change manually - that it is correct to convert - > and each other nonseekable_open instance left - that it is either not correct > to convert there, or that it is not converted due to current stream_open.cocci > limitations. The script also does not convert files that should be valid to > convert, but that currently have .llseek = noop_llseek or generic_file_llseek > for unknown reason despite file being opened with nonseekable_open (e.g. > drivers/input/mousedev.c) > > Among cases converted 14 were potentially vulnerable to read vs write deadlock > (see details in the previous patch): ... > and the reset were just safe to convert to stream_open because their > read and write do not use ppos at all and corresponding file_operations > do not have methods that assume @offset file access: ... > drivers/char/pcmcia/scr24x_cs.c:95:8-24: WARNING: scr24x_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open. ... > diff --git a/drivers/char/pcmcia/scr24x_cs.c b/drivers/char/pcmcia/scr24x_cs.c > index f6b43d9350f0..04b39c3596cc 100644 > --- a/drivers/char/pcmcia/scr24x_cs.c > +++ b/drivers/char/pcmcia/scr24x_cs.c > @@ -92,7 +92,7 @@ static int scr24x_open(struct inode *inode, struct file *filp) > kref_get(&dev->refcnt); > filp->private_data = dev; > > - return nonseekable_open(inode, filp); > + return stream_open(inode, filp); > } > > static int scr24x_release(struct inode *inode, struct file *filp) Acked-by: Lubomir Rintel <lkundrak@v3.sk> [scr24x_cs] Thanks, Lubo
On 26/03/2019 23:20, Kirill Smelkov wrote: > Commit 9c225f2655 (vfs: atomic f_pos accesses as per POSIX) added locking for > file.f_pos access and in particular made concurrent read and write not possible > - now both those functions take f_pos lock for the whole run, and so if e.g. a > read is blocked waiting for data, write will deadlock waiting for that read to > complete. This caused regression for stream-like files where previously read > and write could run simultaneously, but after that patch could not do so > anymore. See e.g. 581d21a2d0 (xenbus: fix deadlock on writes to /proc/xen/xenbus) > which fixes such regression for particular case of /proc/xen/xenbus. > > The patch that added f_pos lock in 2014 (see https://lkml.org/lkml/2014/2/17/324 > for background discussion) did so to guarantee POSIX thread safety for > read/write/lseek and added the locking to file descriptors of all regular > files. In 2014 that thread-safety problem was not new as it was already discussed > earlier in 2006: https://lwn.net/Articles/180387. However even though 2006'th > version of Linus's patch (https://lwn.net/Articles/180396) was adding f_pos > locking "only for files that are marked seekable with FMODE_LSEEK (thus avoiding > the stream-like objects like pipes and sockets)", 2014'th version - the one that > actually made it into the tree as 9c225f2655 - is doing so irregardless of whether > a file is seekable or not. The reason that it did so is, probably, that there are > many files that are marked non-seekable, but e.g. their read implementation > actually depends on knowing current position to correctly handle the read. Some > examples: > > kernel/power/user.c snapshot_read > fs/debugfs/file.c u32_array_read > fs/fuse/control.c fuse_conn_waiting_read + ... > drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read > arch/s390/hypfs/inode.c hypfs_read_iter > ... > > In despite that, many nonseekable_open users implement read and write with pure > stream semantics - they don't depend on passed ppos at all. And for those cases > where read could wait for something inside, it creates a situation similar to > xenbus - the write could be never made to go until read is done, and read is > waiting for some, potentially external, event, for potentially unbounded time > -> deadlock. Besides xenbus, there are 14 such places in the kernel that I've > found with semantic patch (see below): > > drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write() > drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write() > drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write() > drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write() > net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write() > drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write() > drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write() > drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write() > net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write() > drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write() > drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write() > drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write() > drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write() > drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write() > > In addition to the cases above another regression caused by f_pos locking is > that now FUSE filesystems that implement open with FOPEN_NONSEEKABLE flag, can > no longer implement bidirectional stream-like files - for the same reason > as above e.g. read can deadlock write locking on file.f_pos in the kernel. > FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f7 (fuse: implement > nonseekable open) to support OSSPD (https://github.com/libfuse/osspd; > https://lwn.net/Articles/308445). OSSPD implements /dev/dsp in userspace with > FOPEN_NONSEEKABLE flag, with corresponding read and write routines not > depending on current position at all, and with both read and write being > potentially blocking operations: > > https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406 > https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477 > https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510 > > Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as > "somewhat pipe-like files ..." with read handler not using offset. However > that test implements only read without write and cannot exercise the deadlock > scenario: > > https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131 > https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163 > https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216 > > I've actually hit the read vs write deadlock for real while implementing my > FUSE filesystem where there is /head/watch file, for which open creates > separate bidirectional socket-like stream in between filesystem and its user > with both read and write being later performed simultaneously. And there it is > semantically not easy to split the stream into two separate read-only and > write-only channels: > > https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169 > > Let's fix this regression. The plan is: > > 1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS - doing so would > break many in-kernel nonseekable_open users which actually use ppos in > read/write handlers. > > 2. Add stream_open() to kernel to open stream-like non-seekable file descriptors. > Read and write on such file descriptors would never use nor change ppos. And > with that property on stream-like files read and write will be running without > taking f_pos lock - i.e. read and write could be running simultaneously. > > 3. With semantic patch search and convert to stream_open all in-kernel > nonseekable_open users for which read and write actually do not depend on ppos and > where there is no other methods in file_operations which assume @offset access. > > 4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via steam_open > if that bit is present in filesystem open reply. > > It was tempting to change fs/fuse/ open handler to use stream_open instead of > nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian > codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually > uses offset in its read and write handlers > > https://codesearch.debian.net/search?q=-%3Enonseekable+%3D > https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080 > https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346 > https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481 > > so if we would do such a change it will break a real user. > > 5. Add stream_open and FOPEN_STREAM handling to stable kernels starting from > v3.14+ (the kernel where 9c225f2655 first appeared). This will allow to patch > OSSPD and other FUSE filesystems that provide stream-like files to return > FOPEN_STREAM | FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on > all kernel versions. This should work because fs/fuse/ ignores unknown open > flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that > is not aware of this flag cannot hurt. In turn the kernel that is not aware of > FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to > implement streams without read vs write deadlock. > > This patch: adds stream_open, converts /proc/xen/xenbus to it and adds semantic > patch to automatically locate in-kernel places that are either required to be > converted due to read vs write deadlock, or that are just safe to be converted > because read and write do not use ppos and there are no other funky methods in > file_operations. > > Followup patches are: > > - apply the result of semantic patch; > - add FOPEN_STREAM to fs/fuse. > > Regarding semantic patch I've verified each generated change manually - that it is > correct to convert - and each other nonseekable_open instance left - that it is > either not correct to convert there, or that it is not converted due to current > stream_open.cocci limitations. The script also does not convert files that should > be valid to convert, but that currently have .llseek = noop_llseek or > generic_file_llseek for unknown reason despite file being opened with > nonseekable_open (e.g. drivers/input/mousedev.c) > > Cc: Michael Kerrisk <mtk.manpages@gmail.com> > Cc: Yongzhi Pan <panyongzhi@gmail.com> > Cc: Jonathan Corbet <corbet@lwn.net> > Cc: David Vrabel <david.vrabel@citrix.com> > Cc: Juergen Gross <jgross@suse.com> > Cc: Miklos Szeredi <miklos@szeredi.hu> > Cc: Tejun Heo <tj@kernel.org> > Cc: Kirill Tkhai <ktkhai@virtuozzo.com> > Cc: Arnd Bergmann <arnd@arndb.de> > Cc: Christoph Hellwig <hch@lst.de> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Cc: Julia Lawall <Julia.Lawall@lip6.fr> > Cc: Nikolaus Rath <Nikolaus@rath.org> > Cc: Han-Wen Nienhuys <hanwen@google.com> > Signed-off-by: Kirill Smelkov <kirr@nexedi.com> For the Xen changes: Reviewed-by: Juergen Gross <jgross@suse.com> Juergen
On Tue, Mar 26, 2019 at 12:20 PM Kirill Smelkov <kirr@nexedi.com> wrote: > > Commit 9c225f2655 (vfs: atomic f_pos accesses as per POSIX) added locking for > file.f_pos access and in particular made concurrent read and write not possible > [...] Ok, I have applied this patch - but this patch only - as a well-researched preparatory patch. The actual conversion patch looks fine to me too, and I see a few acks, but I think I'd like to see it during a merge window just because it's large and does significant changes, while this one is purely preparatory. Al, comments? One small note: please don't use lkml.org references since they tend to be slow and flaky - use lore.kernel.org instead. Also, fix your git config to use 12-character git hashes (best done by just removing the explicit hash size entirely, and letting modern git shorten hashes appropriately automatically). Linus
On Sat, Apr 06, 2019 at 07:07:14AM -1000, Linus Torvalds wrote: > On Tue, Mar 26, 2019 at 12:20 PM Kirill Smelkov <kirr@nexedi.com> wrote: > > > > Commit 9c225f2655 (vfs: atomic f_pos accesses as per POSIX) added locking for > > file.f_pos access and in particular made concurrent read and write not possible > > [...] > > Ok, I have applied this patch - but this patch only - as a > well-researched preparatory patch. > > The actual conversion patch looks fine to me too, and I see a few > acks, but I think I'd like to see it during a merge window just > because it's large and does significant changes, while this one is > purely preparatory. > > Al, comments? Linus, thanks a lot for feedback and for accepting the patch. I agree that it is better to merge the bulk-conversion at the time of next merge window. I hope Al will pick that patch up and send to you when the next merge window time comes. Could we please clarify one thing: is the general complete plan to deal with this regression, as outlined in the merged patch, ok? I mean do we agree to further proceed with fs/fuse/ FOPEN_STREAM (patch 3/3) and backport it + stream_open to v3.14+ stable kernels? Unfortunately it requires a bit of cooperation from userspace, but this scheme to fix the regression is the best I could come up with. If FUSE bits are ok, how should we go? Are we ok to merge that patch now, or should we wait for the next merge window? I understand that generally it should be merge window from one side, but since we are trying to fix regression here and if the plan for regression fix is accepted it should be merge now. If we merge now, what should be the way for that patch to get merged? Fixing regression on FUSE side is my reason to do this whole work - that's why I care about it the most and ask. > One small note: please don't use lkml.org references since they tend > to be slow and flaky - use lore.kernel.org instead. Also, fix your git > config to use 12-character git hashes (best done by just removing the > explicit hash size entirely, and letting modern git shorten hashes > appropriately automatically). Thanks for pointing to lore.kernel.org - I'm not very keen to kernel development and did not knew it exists. I was seeing lkml.org being referenced here and there and that's why I used it. I will use lore.kernel.org from now. Short hashes were not due to git config, but due to me trimming them to 10 characters in vim manually. I will use longer hashes as you ask. Kirill
On Sun, Apr 7, 2019 at 10:04 AM Kirill Smelkov <kirr@nexedi.com> wrote: > > Fixing regression on FUSE side is my reason to do this whole work - > that's why I care about it the most and ask. Yeah, we can do the actual FUSE fix, I think. Preferably through the FUSE tree. Miklos? Linus
+Linus, +Al, +linux-fsdevel, +linux-kernel On Tue, Apr 09, 2019 at 11:50:23PM +0200, Rasmus Villemoes wrote: > On 09/04/2019 22.38, Kirill Smelkov wrote: > > On Tue, Apr 09, 2019 at 09:43:37AM +0200, Rasmus Villemoes wrote: > >> On 26/03/2019 23.20, Kirill Smelkov wrote: > >> > >>> 2. Add stream_open() to kernel to open stream-like non-seekable file descriptors. > >>> Read and write on such file descriptors would never use nor change ppos. And > >>> with that property on stream-like files read and write will be running without > >>> taking f_pos lock - i.e. read and write could be running simultaneously. > >>> > >> > >>> diff --git a/fs/read_write.c b/fs/read_write.c > >>> index d83003d856a0..91cb375f9840 100644 > >>> --- a/fs/read_write.c > >>> +++ b/fs/read_write.c > >>> @@ -560,12 +560,13 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_ > >>> > >>> static inline loff_t file_pos_read(struct file *file) > >>> { > >>> - return file->f_pos; > >>> + return file->f_mode & FMODE_STREAM ? 0 : file->f_pos; > >>> } > >>> > >>> static inline void file_pos_write(struct file *file, loff_t pos) > >>> { > >>> - file->f_pos = pos; > >>> + if ((file->f_mode & FMODE_STREAM) == 0) > >>> + file->f_pos = pos; > >>> } > >> > >> Perhaps leave file_pos_read alone (to avoid the conditional load), and > >> do WARN_ON_ONCE((file->f_mode & FMODE_STREAM) && pos != 0) or maybe != > >> file->f_pos in _write to catch wrongly converted drivers - once the dust > >> has settled and patches backported to -stable, that WARN_ON_ONCE can > >> also be removed. > >> > >> Or, a more brutal way to detect .read or .write callbacks that do use > >> ppos despite FMODE_STREAM is to lift the FMODE_STREAM handling to the > >> callers of file_pos_{read,write} and pass on (file->f_mode & > >> FMODE_STREAM) ? NULL : &pos instead of &pos. That would also just add > >> one instead of two conditionals to the read/write paths. > > > > Good idea to pass NULL as ppos into FMODE_STREAM users as this will show > > as oops if a driver is wrongly converted and uses the position. Do you mean > > something like this: > > > > Not that invasive. I was more thinking of reverting the > file_pos_read/write to their original state, then do this one-liner once > per caller of those > > --- a/fs/read_write.c > +++ b/fs/read_write.c > @@ -575,7 +575,7 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, > size_t count) > > if (f.file) { > loff_t pos = file_pos_read(f.file); > - ret = vfs_read(f.file, buf, count, &pos); > + ret = vfs_read(f.file, buf, count, (file->f_mode & FMODE_STREAM) ? NULL : &pos); > if (ret >= 0) > file_pos_write(f.file, pos); > fdput_pos(f); > > The idea being to minimize the number of extra branches introduced in > the read and write syscalls due to the FMODE_STREAM. So the above > catches wrongly done conversions (or backports - perhaps an earlier > version of some ->read callback did use ppos...) in a rather blunt way. I checked generated x86_64 assembly and from the point of view of adding minimum extra branches this is indeed the best version - no branches are added at all as `(file->f_mode & FMODE_STREAM) ? NULL : &pos` is translated to cmov instruction. However file->f_pos writing is still there and it will bug under race detector, e.g. under KTSAN because read and write can be running simultaneously. Maybe it is not only race bug, but also a bit of slowdown as read and write code paths write to the same memory thus needing inter-cpu synchronization if read and write handlers are on different cpus. However for this I'm not sure. > The alternative was to revert file_pos_read() to just return file->f_pos > as it used to, and just do > > diff --git a/fs/read_write.c b/fs/read_write.c > index 177ccc3d405a..97180d61a918 100644 > --- a/fs/read_write.c > +++ b/fs/read_write.c > @@ -565,6 +565,7 @@ static inline loff_t file_pos_read(struct file *file) > > static inline void file_pos_write(struct file *file, loff_t pos) > { > + WARN_ON_ONCE((file->f_mode & FMODE_STREAM) && pos != 0); > file->f_pos = pos; > } > > which is, essentially, also just one more branch (at least for all the > ordinary, non-STREAM, files); this has the advantage of not oopsing, and > just means that ->f_pos gets updated racily until somebody reacts to the > WARN. I prefer we pass NULL for ppos to FOPEN_STREAM files, because above WARN, even if it triggers, will not tell which file implementation is guilty - it will show the stack trace only till generic VFS code, if I understand correctly. > In either case, the current code papers over bugs introduced by > conversion/backporting and has more branches. Yes, thanks for brining this up. It is indeed better that we pass NULL into converted drivers to catch correctness. Please see below for updated patch and branches analysis. > > but I wonder why we are discussing in private? > > Well, the cc list looked too big for some mostly micro-optimzation > comments, and I couldn't figure out a proper subset to trim it to. But > if you think the sanity-checking (either variant) is worth it before the > mass-conversion starts, I'm happy to take the discussion back to lkml. Added relevant (I think) people and lists to Cc. I was going to propose attached patch, but it turned out that if we start to pass ppos=NULL, we also have to update at least new_sync_read, new_sync_write, do_iter_writev_writev, rw_verify_area etc - functions that can be called from under vfs_read/vfs_write/..., at least this way: diff --git a/fs/read_write.c b/fs/read_write.c index 4bdd1731859c..80724e9791bd 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -371,7 +371,7 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t inode = file_inode(file); if (unlikely((ssize_t) count < 0)) return retval; - pos = *ppos; + pos = (ppos ? *ppos : 0); if (unlikely(pos < 0)) { if (!unsigned_offsets(file)) return retval; @@ -400,12 +400,13 @@ static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, lo ssize_t ret; init_sync_kiocb(&kiocb, filp); - kiocb.ki_pos = *ppos; + kiocb.ki_pos = (ppos ? *ppos : 0); iov_iter_init(&iter, READ, &iov, 1, len); ret = call_read_iter(filp, &kiocb, &iter); BUG_ON(ret == -EIOCBQUEUED); - *ppos = kiocb.ki_pos; + if (ppos) + *ppos = kiocb.ki_pos; return ret; } @@ -468,12 +469,12 @@ static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t ssize_t ret; init_sync_kiocb(&kiocb, filp); - kiocb.ki_pos = *ppos; + kiocb.ki_pos = (ppos ? *ppos : 0); iov_iter_init(&iter, WRITE, &iov, 1, len); ret = call_write_iter(filp, &kiocb, &iter); BUG_ON(ret == -EIOCBQUEUED); - if (ret > 0) + if (ret > 0 && ppos) *ppos = kiocb.ki_pos; return ret; } @@ -666,14 +667,15 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter, ret = kiocb_set_rw_flags(&kiocb, flags); if (ret) return ret; - kiocb.ki_pos = *ppos; + kiocb.ki_pos = (ppos ? *ppos : 0); if (type == READ) ret = call_read_iter(filp, &kiocb, iter); else ret = call_write_iter(filp, &kiocb, iter); BUG_ON(ret == -EIOCBQUEUED); - *ppos = kiocb.ki_pos; + if (ppos) + *ppos = kiocb.ki_pos; return ret; } There is no notion of NULL ppos in kiocb, and I'm not familiar with that subsystem, so it would be good to first get feedback before deciding on how/where to move on. Linus, anyone, comments? Thanks beforehand, Kirill ---- 8< ---- From facf84dfc75bf314c5f19160551750001d2d513b Mon Sep 17 00:00:00 2001 From: Kirill Smelkov <kirr@nexedi.com> Date: Thu, 11 Apr 2019 13:07:03 +0300 Subject: [BUGGY PATCH] fs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files ( The patch is buggy - it will fail at NULL pointer dereference in e.g. rw_verify_area ) This amends commit 10dce8af3422 ("fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock") in how position is passed into .read()/.write() handler for stream-like files: Rasmus noticed that we currently pass 0 as position and ignore any position change if that is done by a file implementation. This papers over bugs if ppos is used in files that declare themselves as being stream-like as such bugs will go unnoticed. Even if a file implementation is correctly converted into using stream_open, its read/write later could be changed to use ppos and even though that won't be working correctly, that bug might go unnoticed without someone doing wrong behaviour analysis. It is thus better to pass ppos=NULL into read/write for stream-like files as that don't give any chance for ppos usage bugs because it will oops if ppos is ever used inside .read() or .write(). Rasmus also made the point that FMODE_STREAM patch added 2 branches into ksys_read/ksys_write path which is too much unnecessary overhead and could be reduces. Let's rework the code to pass ppos=NULL and have less branches. The first approach could be to revert file_pos_read and file_pos_write into their pre-FMODE_STREAM version and just do --- a/fs/read_write.c +++ b/fs/read_write.c @@ -575,7 +575,7 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count) if (f.file) { loff_t pos = file_pos_read(f.file); - ret = vfs_read(f.file, buf, count, &pos); + ret = vfs_read(f.file, buf, count, (file->f_mode & FMODE_STREAM) ? NULL : &pos); if (ret >= 0) file_pos_write(f.file, pos); fdput_pos(f); this adds no branches at all as `(file->f_mode & FMODE_STREAM) ? NULL : &pos)` translates to single cmov instruction on x86_64: if (f.file) { 18e5: 48 89 c3 mov %rax,%rbx 18e8: 48 83 e3 fc and $0xfffffffffffffffc,%rbx 18ec: 74 79 je 1967 <ksys_read+0xa7> 18ee: 48 89 c5 mov %rax,%rbp loff_t pos = f.file->f_pos; 18f1: 48 8b 43 68 mov 0x68(%rbx),%rax ret = vfs_read(f.file, buf, count, (f.file->f_mode & FMODE_STREAM) ? NULL : &pos); 18f5: 48 89 e1 mov %rsp,%rcx 18f8: f6 43 46 20 testb $0x20,0x46(%rbx) 18fc: 4c 89 e6 mov %r12,%rsi 18ff: 4c 89 ea mov %r13,%rdx 1902: 48 89 df mov %rbx,%rdi loff_t pos = f.file->f_pos; 1905: 48 89 04 24 mov %rax,(%rsp) ret = vfs_read(f.file, buf, count, (f.file->f_mode & FMODE_STREAM) ? NULL : &pos); 1909: b8 00 00 00 00 mov $0x0,%eax 190e: 48 0f 45 c8 cmovne %rax,%rcx <-- NOTE 1912: e8 00 00 00 00 callq 1917 <ksys_read+0x57> 1913: R_X86_64_PLT32 vfs_read-0x4 However this leaves on unconditional write into file->f_pos after vfs_read call, and even though it should not be affecting semantic, it will bug under race detector (e.g. KTSAN) and probably it might add some inter-CPU overhead if read and write handlers are running on different cores. If we instead move `file->f_mode & FMODE_STREAM` checking into vfs functions that actually dispatch IO and care to load/store file->f_fpos only for non-stream files, even though it is 3 branches if looking at C source, gcc generates only 1 more branch compared to how it was pre-FMODE_STREAM. For example consider updated ksys_read: ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count) { struct fd f = fdget_pos(fd); ssize_t ret = -EBADF; if (f.file) { - loff_t pos = file_pos_read(f.file); - ret = vfs_read(f.file, buf, count, &pos); - if (ret >= 0) - file_pos_write(f.file, pos); + loff_t pos, *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &pos; + if (ppos) + pos = f.file->f_pos; + ret = vfs_read(f.file, buf, count, ppos); + if (ret >= 0 && ppos) + f.file->f_pos = pos; fdput_pos(f); } return ret; The code that gcc generates here is essentially as if the source was: if (f.file->f_mode & FMODE_STREAM) { ret = vfs_read(f.file, buf, count, NULL); } else { loff_t pos = f.file->f_pos; ret = vfs_read(f.file, buf, count, ppos); if (ret >= 0) f.file->f_pos = pos; } fdput_pos(f); i.e. it branches only once after checking file->f_mode and then has two distinct codepaths each with its own vfs_read call: if (f.file) { 18e5: 48 89 c5 mov %rax,%rbp 18e8: 48 83 e5 fc and $0xfffffffffffffffc,%rbp 18ec: 0f 84 89 00 00 00 je 197b <ksys_read+0xbb> 18f2: 48 89 c3 mov %rax,%rbx loff_t pos, *ppos = (f.file->f_mode & FMODE_STREAM) ? NULL : &pos; 18f5: f6 45 46 20 testb $0x20,0x46(%rbp) 18f9: 74 3b je 1936 <ksys_read+0x76> <-- branch pos/no-pos ret = vfs_read(f.file, buf, count, ppos); 18fb: 4c 89 e6 mov %r12,%rsi <-- start of IO without pos 18fe: 31 c9 xor %ecx,%ecx 1900: 4c 89 ea mov %r13,%rdx 1903: 48 89 ef mov %rbp,%rdi 1906: e8 00 00 00 00 callq 190b <ksys_read+0x4b> 1907: R_X86_64_PLT32 vfs_read-0x4 <-- vfs_read(..., ppos=NULL) 190b: 49 89 c4 mov %rax,%r12 if (f.flags & FDPUT_POS_UNLOCK) 190e: f6 c3 02 test $0x2,%bl 1911: 75 51 jne 1964 <ksys_read+0xa4> if (fd.flags & FDPUT_FPUT) 1913: 83 e3 01 and $0x1,%ebx 1916: 75 59 jne 1971 <ksys_read+0xb1> } 1918: 48 8b 4c 24 08 mov 0x8(%rsp),%rcx 191d: 65 48 33 0c 25 28 00 xor %gs:0x28,%rcx 1924: 00 00 1926: 4c 89 e0 mov %r12,%rax 1929: 75 59 jne 1984 <ksys_read+0xc4> 192b: 48 83 c4 10 add $0x10,%rsp 192f: 5b pop %rbx 1930: 5d pop %rbp 1931: 41 5c pop %r12 1933: 41 5d pop %r13 1935: c3 retq pos = f.file->f_pos; 1936: 48 8b 45 68 mov 0x68(%rbp),%rax <-- start of IO with pos ret = vfs_read(f.file, buf, count, ppos); 193a: 4c 89 e6 mov %r12,%rsi 193d: 48 89 e1 mov %rsp,%rcx 1940: 4c 89 ea mov %r13,%rdx 1943: 48 89 ef mov %rbp,%rdi pos = f.file->f_pos; 1946: 48 89 04 24 mov %rax,(%rsp) ret = vfs_read(f.file, buf, count, ppos); 194a: e8 00 00 00 00 callq 194f <ksys_read+0x8f> 194b: R_X86_64_PLT32 vfs_read-0x4 <-- vfs_read(..., ppos=&pos) 194f: 49 89 c4 mov %rax,%r12 if (ret >= 0 && ppos) 1952: 48 85 c0 test %rax,%rax 1955: 78 b7 js 190e <ksys_read+0x4e> f.file->f_pos = pos; 1957: 48 8b 04 24 mov (%rsp),%rax 195b: 48 89 45 68 mov %rax,0x68(%rbp) if (f.flags & FDPUT_POS_UNLOCK) ... If adding one branch to handle FMODE_STREAM is acceptable, this approach should be ok. Not duplicating vfs_read calls at C level is better as C-level code is more clear without such duplication, and gcc cares to optimize the function properly to have only 1 branch depending on file->f_mode. If even 1 extra branch is unacceptable, we should indeed go with the first option described in this patch but be prepared for race-detector bug reports and probably some inter-CPU overhead. Overall I would suggest to use simpler approach presented by hereby patch unless 1-extra jump could be shown to cause noticeable slowdowns in practice. Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Kirill Smelkov <kirr@nexedi.com> --- fs/open.c | 5 +++-- fs/read_write.c | 51 +++++++++++++++++++++++-------------------------- 2 files changed, 27 insertions(+), 29 deletions(-) diff --git a/fs/open.c b/fs/open.c index a00350018a47..9c7d724a6f67 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1219,8 +1219,9 @@ EXPORT_SYMBOL(nonseekable_open); /* * stream_open is used by subsystems that want stream-like file descriptors. * Such file descriptors are not seekable and don't have notion of position - * (file.f_pos is always 0). Contrary to file descriptors of other regular - * files, .read() and .write() can run simultaneously. + * (file.f_pos is always 0 and ppos passed to .read()/.write() is always NULL). + * Contrary to file descriptors of other regular files, .read() and .write() + * can run simultaneously. * * stream_open never fails and is marked to return int so that it could be * directly used as file_operations.open . diff --git a/fs/read_write.c b/fs/read_write.c index 61b43ad7608e..21d350df8109 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -558,27 +558,18 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_ return ret; } -static inline loff_t file_pos_read(struct file *file) -{ - return file->f_mode & FMODE_STREAM ? 0 : file->f_pos; -} - -static inline void file_pos_write(struct file *file, loff_t pos) -{ - if ((file->f_mode & FMODE_STREAM) == 0) - file->f_pos = pos; -} - ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count) { struct fd f = fdget_pos(fd); ssize_t ret = -EBADF; if (f.file) { - loff_t pos = file_pos_read(f.file); - ret = vfs_read(f.file, buf, count, &pos); - if (ret >= 0) - file_pos_write(f.file, pos); + loff_t pos, *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &pos; + if (ppos) + pos = f.file->f_pos; + ret = vfs_read(f.file, buf, count, ppos); + if (ret >= 0 && ppos) + f.file->f_pos = pos; fdput_pos(f); } return ret; @@ -595,10 +586,12 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count) ssize_t ret = -EBADF; if (f.file) { - loff_t pos = file_pos_read(f.file); - ret = vfs_write(f.file, buf, count, &pos); - if (ret >= 0) - file_pos_write(f.file, pos); + loff_t pos, *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &pos; + if (ppos) + pos = f.file->f_pos; + ret = vfs_write(f.file, buf, count, ppos); + if (ret >= 0 && ppos) + f.file->f_pos = pos; fdput_pos(f); } @@ -1013,10 +1006,12 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec, ssize_t ret = -EBADF; if (f.file) { - loff_t pos = file_pos_read(f.file); - ret = vfs_readv(f.file, vec, vlen, &pos, flags); - if (ret >= 0) - file_pos_write(f.file, pos); + loff_t pos, *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &pos; + if (ppos) + pos = f.file->f_pos; + ret = vfs_readv(f.file, vec, vlen, ppos, flags); + if (ret >= 0 && ppos) + f.file->f_pos = pos; fdput_pos(f); } @@ -1033,10 +1028,12 @@ static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec, ssize_t ret = -EBADF; if (f.file) { - loff_t pos = file_pos_read(f.file); - ret = vfs_writev(f.file, vec, vlen, &pos, flags); - if (ret >= 0) - file_pos_write(f.file, pos); + loff_t pos, *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &pos; + if (ppos) + pos = f.file->f_pos; + ret = vfs_writev(f.file, vec, vlen, ppos, flags); + if (ret >= 0 && ppos) + f.file->f_pos = pos; fdput_pos(f); }
On Thu, Apr 11, 2019 at 5:38 AM Kirill Smelkov <kirr@nexedi.com> wrote: > > However file->f_pos writing is still there and it will bug under race > detector, e.g. under KTSAN because read and write can be running > simultaneously. Maybe it is not only race bug, but also a bit of > slowdown as read and write code paths write to the same memory thus > needing inter-cpu synchronization if read and write handlers are on > different cpus. However for this I'm not sure. I doubt it's noticeable, but it would probably be good to simply not do the write for streams. That *could* be done somewhat similarly, with just changing th address to be updated, or course. In fact, we *used* to (long ago) pass in the address of "file->f_pos" itself to the low-level read/write routines. We then changed it to do that indirection through a local copy of pos (and file_pos_read/file_pos_write) because we didn't do the proper locking, so different read/write versions could mess with each other (and with lseek). But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos accesses as per POSIX") did was to add the proper locking at least for the cases that we care about deeply, so we *could* say that we have three cases: - FMODE_ATOMIC_POS: properly locked, - FMODE_STREAM: no pos at all - otherwise a "mostly don't care - don't mix!" and so we could go back to not copying the pos at all, and instead do something like loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos; ret = vfs_write(f.file, buf, count, ppos); and perhaps have a long-term plan to try to get rid of the "don't mix" case entirely (ie "if you use f_pos, then we'll do the proper locking") (The above is obviously surrounded by the fdget_pos()/fdput_pos() that implements the locking decision). Linus
On Thu, Apr 11, 2019 at 09:22:56AM -0700, Linus Torvalds wrote: > On Thu, Apr 11, 2019 at 5:38 AM Kirill Smelkov <kirr@nexedi.com> wrote: > > > > However file->f_pos writing is still there and it will bug under race > > detector, e.g. under KTSAN because read and write can be running > > simultaneously. Maybe it is not only race bug, but also a bit of > > slowdown as read and write code paths write to the same memory thus > > needing inter-cpu synchronization if read and write handlers are on > > different cpus. However for this I'm not sure. > > I doubt it's noticeable, but it would probably be good to simply not > do the write for streams. > > That *could* be done somewhat similarly, with just changing th address > to be updated, or course. > > In fact, we *used* to (long ago) pass in the address of "file->f_pos" > itself to the low-level read/write routines. We then changed it to do > that indirection through a local copy of pos (and > file_pos_read/file_pos_write) because we didn't do the proper locking, > so different read/write versions could mess with each other (and with > lseek). > > But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos > accesses as per POSIX") did was to add the proper locking at least for > the cases that we care about deeply, so we *could* say that we have > three cases: > > - FMODE_ATOMIC_POS: properly locked, > - FMODE_STREAM: no pos at all > - otherwise a "mostly don't care - don't mix!" > > and so we could go back to not copying the pos at all, and instead do > something like > > loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos; > ret = vfs_write(f.file, buf, count, ppos); > > and perhaps have a long-term plan to try to get rid of the "don't mix" > case entirely (ie "if you use f_pos, then we'll do the proper > locking") > > (The above is obviously surrounded by the fdget_pos()/fdput_pos() that > implements the locking decision). Ok Linus, thanks for feedback. Please consider applying or queueing the following patches. If the patches are ok, the first one is probably better to be applied now - to catch wrongly converted drivers right from start, while the second one could be probably better delayed till merge window. However please do as what you prefer is best. Kirill ---- 8< ---- From 328d4cc5dbb1d6fdbff1f80f1eed9f04b9a5de1c Mon Sep 17 00:00:00 2001 From: Kirill Smelkov <kirr@nexedi.com> Date: Fri, 12 Apr 2019 12:31:57 +0300 Subject: [PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files This amends commit 10dce8af3422 ("fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock") in how position is passed into .read()/.write() handler for stream-like files: Rasmus noticed that we currently pass 0 as position and ignore any position change if that is done by a file implementation. This papers over bugs if ppos is used in files that declare themselves as being stream-like as such bugs will go unnoticed. Even if a file implementation is correctly converted into using stream_open, its read/write later could be changed to use ppos and even though that won't be working correctly, that bug might go unnoticed without someone doing wrong behaviour analysis. It is thus better to pass ppos=NULL into read/write for stream-like files as that don't give any chance for ppos usage bugs because it will oops if ppos is ever used inside .read() or .write(). Note 1: rw_verify_area, new_sync_{read,write} needs to be updated because they are called by vfs_read/vfs_write & friends before file_operations .read/.write . Note 2: if file backend uses new-style .read_iter/.write_iter, position is still passed into there as non-pointer kiocb.ki_pos . Currently stream_open.cocci (semantic patch added by 10dce8af3422) ignores files whose file_operations has *_iter methods. Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Kirill Smelkov <kirr@nexedi.com> --- fs/open.c | 5 ++-- fs/read_write.c | 75 +++++++++++++++++++++++++++++-------------------- 2 files changed, 47 insertions(+), 33 deletions(-) diff --git a/fs/open.c b/fs/open.c index a00350018a47..9c7d724a6f67 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1219,8 +1219,9 @@ EXPORT_SYMBOL(nonseekable_open); /* * stream_open is used by subsystems that want stream-like file descriptors. * Such file descriptors are not seekable and don't have notion of position - * (file.f_pos is always 0). Contrary to file descriptors of other regular - * files, .read() and .write() can run simultaneously. + * (file.f_pos is always 0 and ppos passed to .read()/.write() is always NULL). + * Contrary to file descriptors of other regular files, .read() and .write() + * can run simultaneously. * * stream_open never fails and is marked to return int so that it could be * directly used as file_operations.open . diff --git a/fs/read_write.c b/fs/read_write.c index 61b43ad7608e..d62556be6848 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -371,7 +371,7 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t inode = file_inode(file); if (unlikely((ssize_t) count < 0)) return retval; - pos = *ppos; + pos = (ppos ? *ppos : 0); if (unlikely(pos < 0)) { if (!unsigned_offsets(file)) return retval; @@ -400,12 +400,13 @@ static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, lo ssize_t ret; init_sync_kiocb(&kiocb, filp); - kiocb.ki_pos = *ppos; + kiocb.ki_pos = (ppos ? *ppos : 0); iov_iter_init(&iter, READ, &iov, 1, len); ret = call_read_iter(filp, &kiocb, &iter); BUG_ON(ret == -EIOCBQUEUED); - *ppos = kiocb.ki_pos; + if (ppos) + *ppos = kiocb.ki_pos; return ret; } @@ -468,12 +469,12 @@ static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t ssize_t ret; init_sync_kiocb(&kiocb, filp); - kiocb.ki_pos = *ppos; + kiocb.ki_pos = (ppos ? *ppos : 0); iov_iter_init(&iter, WRITE, &iov, 1, len); ret = call_write_iter(filp, &kiocb, &iter); BUG_ON(ret == -EIOCBQUEUED); - if (ret > 0) + if (ret > 0 && ppos) *ppos = kiocb.ki_pos; return ret; } @@ -558,15 +559,10 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_ return ret; } -static inline loff_t file_pos_read(struct file *file) -{ - return file->f_mode & FMODE_STREAM ? 0 : file->f_pos; -} - -static inline void file_pos_write(struct file *file, loff_t pos) +/* file_ppos returns &file->f_pos or NULL if file is stream */ +static inline loff_t *file_ppos(struct file *file) { - if ((file->f_mode & FMODE_STREAM) == 0) - file->f_pos = pos; + return file->f_mode & FMODE_STREAM ? NULL : &file->f_pos; } ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count) @@ -575,10 +571,14 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count) ssize_t ret = -EBADF; if (f.file) { - loff_t pos = file_pos_read(f.file); - ret = vfs_read(f.file, buf, count, &pos); - if (ret >= 0) - file_pos_write(f.file, pos); + loff_t pos, *ppos = file_ppos(f.file); + if (ppos) { + pos = *ppos; + ppos = &pos; + } + ret = vfs_read(f.file, buf, count, ppos); + if (ret >= 0 && ppos) + f.file->f_pos = pos; fdput_pos(f); } return ret; @@ -595,10 +595,14 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count) ssize_t ret = -EBADF; if (f.file) { - loff_t pos = file_pos_read(f.file); - ret = vfs_write(f.file, buf, count, &pos); - if (ret >= 0) - file_pos_write(f.file, pos); + loff_t pos, *ppos = file_ppos(f.file); + if (ppos) { + pos = *ppos; + ppos = &pos; + } + ret = vfs_write(f.file, buf, count, ppos); + if (ret >= 0 && ppos) + f.file->f_pos = pos; fdput_pos(f); } @@ -673,14 +677,15 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter, ret = kiocb_set_rw_flags(&kiocb, flags); if (ret) return ret; - kiocb.ki_pos = *ppos; + kiocb.ki_pos = (ppos ? *ppos : 0); if (type == READ) ret = call_read_iter(filp, &kiocb, iter); else ret = call_write_iter(filp, &kiocb, iter); BUG_ON(ret == -EIOCBQUEUED); - *ppos = kiocb.ki_pos; + if (ppos) + *ppos = kiocb.ki_pos; return ret; } @@ -1013,10 +1018,14 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec, ssize_t ret = -EBADF; if (f.file) { - loff_t pos = file_pos_read(f.file); - ret = vfs_readv(f.file, vec, vlen, &pos, flags); - if (ret >= 0) - file_pos_write(f.file, pos); + loff_t pos, *ppos = file_ppos(f.file); + if (ppos) { + pos = *ppos; + ppos = &pos; + } + ret = vfs_readv(f.file, vec, vlen, ppos, flags); + if (ret >= 0 && ppos) + f.file->f_pos = pos; fdput_pos(f); } @@ -1033,10 +1042,14 @@ static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec, ssize_t ret = -EBADF; if (f.file) { - loff_t pos = file_pos_read(f.file); - ret = vfs_writev(f.file, vec, vlen, &pos, flags); - if (ret >= 0) - file_pos_write(f.file, pos); + loff_t pos, *ppos = file_ppos(f.file); + if (ppos) { + pos = *ppos; + ppos = &pos; + } + ret = vfs_writev(f.file, vec, vlen, ppos, flags); + if (ret >= 0 && ppos) + f.file->f_pos = pos; fdput_pos(f); }
On Fri, Apr 12, 2019 at 03:41:44PM +0300, Kirill Smelkov wrote: > On Thu, Apr 11, 2019 at 09:22:56AM -0700, Linus Torvalds wrote: > > On Thu, Apr 11, 2019 at 5:38 AM Kirill Smelkov <kirr@nexedi.com> wrote: > > > > > > However file->f_pos writing is still there and it will bug under race > > > detector, e.g. under KTSAN because read and write can be running > > > simultaneously. Maybe it is not only race bug, but also a bit of > > > slowdown as read and write code paths write to the same memory thus > > > needing inter-cpu synchronization if read and write handlers are on > > > different cpus. However for this I'm not sure. > > > > I doubt it's noticeable, but it would probably be good to simply not > > do the write for streams. > > > > That *could* be done somewhat similarly, with just changing th address > > to be updated, or course. > > > > In fact, we *used* to (long ago) pass in the address of "file->f_pos" > > itself to the low-level read/write routines. We then changed it to do > > that indirection through a local copy of pos (and > > file_pos_read/file_pos_write) because we didn't do the proper locking, > > so different read/write versions could mess with each other (and with > > lseek). > > > > But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos > > accesses as per POSIX") did was to add the proper locking at least for > > the cases that we care about deeply, so we *could* say that we have > > three cases: > > > > - FMODE_ATOMIC_POS: properly locked, > > - FMODE_STREAM: no pos at all > > - otherwise a "mostly don't care - don't mix!" > > > > and so we could go back to not copying the pos at all, and instead do > > something like > > > > loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos; > > ret = vfs_write(f.file, buf, count, ppos); > > > > and perhaps have a long-term plan to try to get rid of the "don't mix" > > case entirely (ie "if you use f_pos, then we'll do the proper > > locking") > > > > (The above is obviously surrounded by the fdget_pos()/fdput_pos() that > > implements the locking decision). > > Ok Linus, thanks for feedback. Please consider applying or queueing the > following patches. Resending those patches one by one in separate emails, if maybe 2 patches inline was problematic to handle...
On Sun, Apr 07, 2019 at 02:09:08PM -1000, Linus Torvalds wrote: > On Sun, Apr 7, 2019 at 10:04 AM Kirill Smelkov <kirr@nexedi.com> wrote: > > > > Fixing regression on FUSE side is my reason to do this whole work - > > that's why I care about it the most and ask. > > Yeah, we can do the actual FUSE fix, I think. Preferably through the > FUSE tree. Miklos? Linus, thanks for feedback. I'm not sure what is going on, but I'm afraid we can wait very long here. It's been already 1 week, and I was also struggling to get feedback from Miklos on my FUSE patches starting from end of February without any luck: https://lore.kernel.org/linux-fsdevel/cover.1553680185.git.kirr@nexedi.com/ https://lore.kernel.org/linux-fsdevel/12f7d0d98555ee0d174d04bb47644f65c07f035a.1553680185.git.kirr@nexedi.com/ https://lore.kernel.org/linux-fsdevel/d74b17b9d33c3dcc7a1f2fa2914fb3c4e7cda127.1553680185.git.kirr@nexedi.com/ https://lore.kernel.org/linux-fsdevel/cover.1553677194.git.kirr@nexedi.com/ https://lore.kernel.org/linux-fsdevel/d916d401a80e9834c95970d86ca71c0154e988a6.1553677194.git.kirr@nexedi.com/ https://lore.kernel.org/linux-fsdevel/e0b43507976d6ea9010f1bacaef067f18de49f1f.1553677194.git.kirr@nexedi.com/ https://lore.kernel.org/linux-fsdevel/dc47c061f20c464ccf46b43822b062dca6486e90.1553637462.git.kirr@nexedi.com/ I'm not sure what to do with my FUSE patches. It feels like knocking into closed door... Kirill
diff --git a/drivers/xen/xenbus/xenbus_dev_frontend.c b/drivers/xen/xenbus/xenbus_dev_frontend.c index c3e201025ef0..0782ff3c2273 100644 --- a/drivers/xen/xenbus/xenbus_dev_frontend.c +++ b/drivers/xen/xenbus/xenbus_dev_frontend.c @@ -622,9 +622,7 @@ static int xenbus_file_open(struct inode *inode, struct file *filp) if (xen_store_evtchn == 0) return -ENOENT; - nonseekable_open(inode, filp); - - filp->f_mode &= ~FMODE_ATOMIC_POS; /* cdev-style semantics */ + stream_open(inode, filp); u = kzalloc(sizeof(*u), GFP_KERNEL); if (u == NULL) diff --git a/fs/open.c b/fs/open.c index 0285ce7dbd51..b6bf04a2ef8c 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1209,3 +1209,21 @@ int nonseekable_open(struct inode *inode, struct file *filp) } EXPORT_SYMBOL(nonseekable_open); + +/* + * stream_open is used by subsystems that want stream-like file descriptors. + * Such file descriptors are not seekable and don't have notion of position + * (file.f_pos is always 0). Contrary to file descriptors of other regular + * files, .read() and .write() can run simultaneously. + * + * stream_open never fails and is marked to return int so that it could be + * directly used as file_operations.open . + */ +int stream_open(struct inode *inode, struct file *filp) +{ + filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE | FMODE_ATOMIC_POS); + filp->f_mode |= FMODE_STREAM; + return 0; +} + +EXPORT_SYMBOL(stream_open); diff --git a/fs/read_write.c b/fs/read_write.c index d83003d856a0..91cb375f9840 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -560,12 +560,13 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_ static inline loff_t file_pos_read(struct file *file) { - return file->f_pos; + return file->f_mode & FMODE_STREAM ? 0 : file->f_pos; } static inline void file_pos_write(struct file *file, loff_t pos) { - file->f_pos = pos; + if ((file->f_mode & FMODE_STREAM) == 0) + file->f_pos = pos; } ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count) diff --git a/include/linux/fs.h b/include/linux/fs.h index 3e85cb8e8c20..cd59946bc3a4 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -155,6 +155,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset, #define FMODE_OPENED ((__force fmode_t)0x80000) #define FMODE_CREATED ((__force fmode_t)0x100000) +/* File is stream-like */ +#define FMODE_STREAM ((__force fmode_t)0x200000) + /* File was opened by fanotify and shouldn't generate fanotify events */ #define FMODE_NONOTIFY ((__force fmode_t)0x4000000) @@ -3075,6 +3078,7 @@ extern loff_t no_seek_end_llseek_size(struct file *, loff_t, int, loff_t); extern loff_t no_seek_end_llseek(struct file *, loff_t, int); extern int generic_file_open(struct inode * inode, struct file * filp); extern int nonseekable_open(struct inode * inode, struct file * filp); +extern int stream_open(struct inode * inode, struct file * filp); #ifdef CONFIG_BLOCK typedef void (dio_submit_t)(struct bio *bio, struct inode *inode, diff --git a/scripts/coccinelle/api/stream_open.cocci b/scripts/coccinelle/api/stream_open.cocci new file mode 100644 index 000000000000..350145da7669 --- /dev/null +++ b/scripts/coccinelle/api/stream_open.cocci @@ -0,0 +1,363 @@ +// SPDX-License-Identifier: GPL-2.0 +// Author: Kirill Smelkov (kirr@nexedi.com) +// +// Search for stream-like files that are using nonseekable_open and convert +// them to stream_open. A stream-like file is a file that does not use ppos in +// its read and write. Rationale for the conversion is to avoid deadlock in +// between read and write. + +virtual report +virtual patch +virtual explain // explain decisions in the patch (SPFLAGS="-D explain") + +// stream-like reader & writer - ones that do not depend on f_pos. +@ stream_reader @ +identifier readstream, ppos; +identifier f, buf, len; +type loff_t; +@@ + ssize_t readstream(struct file *f, char *buf, size_t len, loff_t *ppos) + { + ... when != ppos + } + +@ stream_writer @ +identifier writestream, ppos; +identifier f, buf, len; +type loff_t; +@@ + ssize_t writestream(struct file *f, const char *buf, size_t len, loff_t *ppos) + { + ... when != ppos + } + + +// a function that blocks +@ blocks @ +identifier block_f; +identifier wait_event =~ "^wait_event_.*"; +@@ + block_f(...) { + ... when exists + wait_event(...) + ... when exists + } + +// stream_reader that can block inside. +// +// XXX wait_* can be called not directly from current function (e.g. func -> f -> g -> wait()) +// XXX currently reader_blocks supports only direct and 1-level indirect cases. +@ reader_blocks_direct @ +identifier stream_reader.readstream; +identifier wait_event =~ "^wait_event_.*"; +@@ + readstream(...) + { + ... when exists + wait_event(...) + ... when exists + } + +@ reader_blocks_1 @ +identifier stream_reader.readstream; +identifier blocks.block_f; +@@ + readstream(...) + { + ... when exists + block_f(...) + ... when exists + } + +@ reader_blocks depends on reader_blocks_direct || reader_blocks_1 @ +identifier stream_reader.readstream; +@@ + readstream(...) { + ... + } + + +// file_operations + whether they have _any_ .read, .write, .llseek ... at all. +// +// XXX add support for file_operations xxx[N] = ... (sound/core/pcm_native.c) +@ fops0 @ +identifier fops; +@@ + struct file_operations fops = { + ... + }; + +@ has_read @ +identifier fops0.fops; +identifier read_f; +@@ + struct file_operations fops = { + .read = read_f, + }; + +@ has_read_iter @ +identifier fops0.fops; +identifier read_iter_f; +@@ + struct file_operations fops = { + .read_iter = read_iter_f, + }; + +@ has_write @ +identifier fops0.fops; +identifier write_f; +@@ + struct file_operations fops = { + .write = write_f, + }; + +@ has_write_iter @ +identifier fops0.fops; +identifier write_iter_f; +@@ + struct file_operations fops = { + .write_iter = write_iter_f, + }; + +@ has_llseek @ +identifier fops0.fops; +identifier llseek_f; +@@ + struct file_operations fops = { + .llseek = llseek_f, + }; + +@ has_no_llseek @ +identifier fops0.fops; +@@ + struct file_operations fops = { + .llseek = no_llseek, + }; + +@ has_mmap @ +identifier fops0.fops; +identifier mmap_f; +@@ + struct file_operations fops = { + .mmap = mmap_f, + }; + +@ has_copy_file_range @ +identifier fops0.fops; +identifier copy_file_range_f; +@@ + struct file_operations fops = { + .copy_file_range = copy_file_range_f, + }; + +@ has_remap_file_range @ +identifier fops0.fops; +identifier remap_file_range_f; +@@ + struct file_operations fops = { + .remap_file_range = remap_file_range_f, + }; + +@ has_splice_read @ +identifier fops0.fops; +identifier splice_read_f; +@@ + struct file_operations fops = { + .splice_read = splice_read_f, + }; + +@ has_splice_write @ +identifier fops0.fops; +identifier splice_write_f; +@@ + struct file_operations fops = { + .splice_write = splice_write_f, + }; + + +// file_operations that is candidate for stream_open conversion - it does not +// use mmap and other methods that assume @offset access to file. +// +// XXX for simplicity require no .{read/write}_iter and no .splice_{read/write} for now. +// XXX maybe_steam.fops cannot be used in other rules - it gives "bad rule maybe_stream or bad variable fops". +@ maybe_stream depends on (!has_llseek || has_no_llseek) && !has_mmap && !has_copy_file_range && !has_remap_file_range && !has_read_iter && !has_write_iter && !has_splice_read && !has_splice_write @ +identifier fops0.fops; +@@ + struct file_operations fops = { + }; + + +// ---- conversions ---- + +// XXX .open = nonseekable_open -> .open = stream_open +// XXX .open = func -> openfunc -> nonseekable_open + +// read & write +// +// if both are used in the same file_operations together with an opener - +// under that conditions we can use stream_open instead of nonseekable_open. +@ fops_rw depends on maybe_stream @ +identifier fops0.fops, openfunc; +identifier stream_reader.readstream; +identifier stream_writer.writestream; +@@ + struct file_operations fops = { + .open = openfunc, + .read = readstream, + .write = writestream, + }; + +@ report_rw depends on report @ +identifier fops_rw.openfunc; +position p1; +@@ + openfunc(...) { + <... + nonseekable_open@p1 + ...> + } + +@ script:python depends on report && reader_blocks @ +fops << fops0.fops; +p << report_rw.p1; +@@ +coccilib.report.print_report(p[0], + "ERROR: %s: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix." % (fops,)) + +@ script:python depends on report && !reader_blocks @ +fops << fops0.fops; +p << report_rw.p1; +@@ +coccilib.report.print_report(p[0], + "WARNING: %s: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open." % (fops,)) + + +@ explain_rw_deadlocked depends on explain && reader_blocks @ +identifier fops_rw.openfunc; +@@ + openfunc(...) { + <... +- nonseekable_open ++ nonseekable_open /* read & write (was deadlock) */ + ...> + } + + +@ explain_rw_nodeadlock depends on explain && !reader_blocks @ +identifier fops_rw.openfunc; +@@ + openfunc(...) { + <... +- nonseekable_open ++ nonseekable_open /* read & write (no direct deadlock) */ + ...> + } + +@ patch_rw depends on patch @ +identifier fops_rw.openfunc; +@@ + openfunc(...) { + <... +- nonseekable_open ++ stream_open + ...> + } + + +// read, but not write +@ fops_r depends on maybe_stream && !has_write @ +identifier fops0.fops, openfunc; +identifier stream_reader.readstream; +@@ + struct file_operations fops = { + .open = openfunc, + .read = readstream, + }; + +@ report_r depends on report @ +identifier fops_r.openfunc; +position p1; +@@ + openfunc(...) { + <... + nonseekable_open@p1 + ...> + } + +@ script:python depends on report @ +fops << fops0.fops; +p << report_r.p1; +@@ +coccilib.report.print_report(p[0], + "WARNING: %s: .read() has stream semantic; safe to change nonseekable_open -> stream_open." % (fops,)) + +@ explain_r depends on explain @ +identifier fops_r.openfunc; +@@ + openfunc(...) { + <... +- nonseekable_open ++ nonseekable_open /* read only */ + ...> + } + +@ patch_r depends on patch @ +identifier fops_r.openfunc; +@@ + openfunc(...) { + <... +- nonseekable_open ++ stream_open + ...> + } + + +// write, but not read +@ fops_w depends on maybe_stream && !has_read @ +identifier fops0.fops, openfunc; +identifier stream_writer.writestream; +@@ + struct file_operations fops = { + .open = openfunc, + .write = writestream, + }; + +@ report_w depends on report @ +identifier fops_w.openfunc; +position p1; +@@ + openfunc(...) { + <... + nonseekable_open@p1 + ...> + } + +@ script:python depends on report @ +fops << fops0.fops; +p << report_w.p1; +@@ +coccilib.report.print_report(p[0], + "WARNING: %s: .write() has stream semantic; safe to change nonseekable_open -> stream_open." % (fops,)) + +@ explain_w depends on explain @ +identifier fops_w.openfunc; +@@ + openfunc(...) { + <... +- nonseekable_open ++ nonseekable_open /* write only */ + ...> + } + +@ patch_w depends on patch @ +identifier fops_w.openfunc; +@@ + openfunc(...) { + <... +- nonseekable_open ++ stream_open + ...> + } + + +// no read, no write - don't change anything
Commit 9c225f2655 (vfs: atomic f_pos accesses as per POSIX) added locking for file.f_pos access and in particular made concurrent read and write not possible - now both those functions take f_pos lock for the whole run, and so if e.g. a read is blocked waiting for data, write will deadlock waiting for that read to complete. This caused regression for stream-like files where previously read and write could run simultaneously, but after that patch could not do so anymore. See e.g. 581d21a2d0 (xenbus: fix deadlock on writes to /proc/xen/xenbus) which fixes such regression for particular case of /proc/xen/xenbus. The patch that added f_pos lock in 2014 (see https://lkml.org/lkml/2014/2/17/324 for background discussion) did so to guarantee POSIX thread safety for read/write/lseek and added the locking to file descriptors of all regular files. In 2014 that thread-safety problem was not new as it was already discussed earlier in 2006: https://lwn.net/Articles/180387. However even though 2006'th version of Linus's patch (https://lwn.net/Articles/180396) was adding f_pos locking "only for files that are marked seekable with FMODE_LSEEK (thus avoiding the stream-like objects like pipes and sockets)", 2014'th version - the one that actually made it into the tree as 9c225f2655 - is doing so irregardless of whether a file is seekable or not. The reason that it did so is, probably, that there are many files that are marked non-seekable, but e.g. their read implementation actually depends on knowing current position to correctly handle the read. Some examples: kernel/power/user.c snapshot_read fs/debugfs/file.c u32_array_read fs/fuse/control.c fuse_conn_waiting_read + ... drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read arch/s390/hypfs/inode.c hypfs_read_iter ... In despite that, many nonseekable_open users implement read and write with pure stream semantics - they don't depend on passed ppos at all. And for those cases where read could wait for something inside, it creates a situation similar to xenbus - the write could be never made to go until read is done, and read is waiting for some, potentially external, event, for potentially unbounded time -> deadlock. Besides xenbus, there are 14 such places in the kernel that I've found with semantic patch (see below): drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write() drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write() drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write() drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write() net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write() drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write() drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write() drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write() net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write() drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write() drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write() drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write() drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write() drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write() In addition to the cases above another regression caused by f_pos locking is that now FUSE filesystems that implement open with FOPEN_NONSEEKABLE flag, can no longer implement bidirectional stream-like files - for the same reason as above e.g. read can deadlock write locking on file.f_pos in the kernel. FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f7 (fuse: implement nonseekable open) to support OSSPD (https://github.com/libfuse/osspd; https://lwn.net/Articles/308445). OSSPD implements /dev/dsp in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and write routines not depending on current position at all, and with both read and write being potentially blocking operations: https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510 Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as "somewhat pipe-like files ..." with read handler not using offset. However that test implements only read without write and cannot exercise the deadlock scenario: https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131 https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163 https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216 I've actually hit the read vs write deadlock for real while implementing my FUSE filesystem where there is /head/watch file, for which open creates separate bidirectional socket-like stream in between filesystem and its user with both read and write being later performed simultaneously. And there it is semantically not easy to split the stream into two separate read-only and write-only channels: https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169 Let's fix this regression. The plan is: 1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS - doing so would break many in-kernel nonseekable_open users which actually use ppos in read/write handlers. 2. Add stream_open() to kernel to open stream-like non-seekable file descriptors. Read and write on such file descriptors would never use nor change ppos. And with that property on stream-like files read and write will be running without taking f_pos lock - i.e. read and write could be running simultaneously. 3. With semantic patch search and convert to stream_open all in-kernel nonseekable_open users for which read and write actually do not depend on ppos and where there is no other methods in file_operations which assume @offset access. 4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via steam_open if that bit is present in filesystem open reply. It was tempting to change fs/fuse/ open handler to use stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually uses offset in its read and write handlers https://codesearch.debian.net/search?q=-%3Enonseekable+%3D https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481 so if we would do such a change it will break a real user. 5. Add stream_open and FOPEN_STREAM handling to stable kernels starting from v3.14+ (the kernel where 9c225f2655 first appeared). This will allow to patch OSSPD and other FUSE filesystems that provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on all kernel versions. This should work because fs/fuse/ ignores unknown open flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that is not aware of this flag cannot hurt. In turn the kernel that is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to implement streams without read vs write deadlock. This patch: adds stream_open, converts /proc/xen/xenbus to it and adds semantic patch to automatically locate in-kernel places that are either required to be converted due to read vs write deadlock, or that are just safe to be converted because read and write do not use ppos and there are no other funky methods in file_operations. Followup patches are: - apply the result of semantic patch; - add FOPEN_STREAM to fs/fuse. Regarding semantic patch I've verified each generated change manually - that it is correct to convert - and each other nonseekable_open instance left - that it is either not correct to convert there, or that it is not converted due to current stream_open.cocci limitations. The script also does not convert files that should be valid to convert, but that currently have .llseek = noop_llseek or generic_file_llseek for unknown reason despite file being opened with nonseekable_open (e.g. drivers/input/mousedev.c) Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Yongzhi Pan <panyongzhi@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Juergen Gross <jgross@suse.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Tejun Heo <tj@kernel.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Nikolaus Rath <Nikolaus@rath.org> Cc: Han-Wen Nienhuys <hanwen@google.com> Signed-off-by: Kirill Smelkov <kirr@nexedi.com> --- drivers/xen/xenbus/xenbus_dev_frontend.c | 4 +- fs/open.c | 18 ++ fs/read_write.c | 5 +- include/linux/fs.h | 4 + scripts/coccinelle/api/stream_open.cocci | 363 +++++++++++++++++++++++ 5 files changed, 389 insertions(+), 5 deletions(-) create mode 100644 scripts/coccinelle/api/stream_open.cocci