diff mbox series

[1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock

Message ID c44fcf87d4c9d417b6cdced787091300fd45a3e4.1553637461.git.kirr@nexedi.com (mailing list archive)
State New, archived
Headers show
Series [1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock | expand

Commit Message

Kirill Smelkov March 26, 2019, 10:20 p.m. UTC
Commit 9c225f2655 (vfs: atomic f_pos accesses as per POSIX) added locking for
file.f_pos access and in particular made concurrent read and write not possible
- now both those functions take f_pos lock for the whole run, and so if e.g. a
read is blocked waiting for data, write will deadlock waiting for that read to
complete. This caused regression for stream-like files where previously read
and write could run simultaneously, but after that patch could not do so
anymore. See e.g. 581d21a2d0 (xenbus: fix deadlock on writes to /proc/xen/xenbus)
which fixes such regression for particular case of /proc/xen/xenbus.

The patch that added f_pos lock in 2014 (see https://lkml.org/lkml/2014/2/17/324
for background discussion) did so to guarantee POSIX thread safety for
read/write/lseek and added the locking to file descriptors of all regular
files. In 2014 that thread-safety problem was not new as it was already discussed
earlier in 2006: https://lwn.net/Articles/180387. However even though 2006'th
version of Linus's patch (https://lwn.net/Articles/180396) was adding f_pos
locking "only for files that are marked seekable with FMODE_LSEEK (thus avoiding
the stream-like objects like pipes and sockets)", 2014'th version - the one that
actually made it into the tree as 9c225f2655 - is doing so irregardless of whether
a file is seekable or not. The reason that it did so is, probably, that there are
many files that are marked non-seekable, but e.g. their read implementation
actually depends on knowing current position to correctly handle the read. Some
examples:

	kernel/power/user.c		snapshot_read
	fs/debugfs/file.c		u32_array_read
	fs/fuse/control.c		fuse_conn_waiting_read + ...
	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
	arch/s390/hypfs/inode.c		hypfs_read_iter
	...

In despite that, many nonseekable_open users implement read and write with pure
stream semantics - they don't depend on passed ppos at all. And for those cases
where read could wait for something inside, it creates a situation similar to
xenbus - the write could be never made to go until read is done, and read is
waiting for some, potentially external, event, for potentially unbounded time
-> deadlock. Besides xenbus, there are 14 such places in the kernel that I've
found with semantic patch (see below):

	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()

In addition to the cases above another regression caused by f_pos locking is
that now FUSE filesystems that implement open with FOPEN_NONSEEKABLE flag, can
no longer implement bidirectional stream-like files - for the same reason
as above e.g. read can deadlock write locking on file.f_pos in the kernel.
FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f7 (fuse: implement
nonseekable open) to support OSSPD (https://github.com/libfuse/osspd;
https://lwn.net/Articles/308445). OSSPD implements /dev/dsp in userspace with
FOPEN_NONSEEKABLE flag, with corresponding read and write routines not
depending on current position at all, and with both read and write being
potentially blocking operations:

	https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
	https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
	https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510

Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
"somewhat pipe-like files ..." with read handler not using offset. However
that test implements only read without write and cannot exercise the deadlock
scenario:

	https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
	https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
	https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216

I've actually hit the read vs write deadlock for real while implementing my
FUSE filesystem where there is /head/watch file, for which open creates
separate bidirectional socket-like stream in between filesystem and its user
with both read and write being later performed simultaneously. And there it is
semantically not easy to split the stream into two separate read-only and
write-only channels:

	https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169

Let's fix this regression. The plan is:

1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS - doing so would
   break many in-kernel nonseekable_open users which actually use ppos in
   read/write handlers.

2. Add stream_open() to kernel to open stream-like non-seekable file descriptors.
   Read and write on such file descriptors would never use nor change ppos. And
   with that property on stream-like files read and write will be running without
   taking f_pos lock - i.e. read and write could be running simultaneously.

3. With semantic patch search and convert to stream_open all in-kernel
   nonseekable_open users for which read and write actually do not depend on ppos and
   where there is no other methods in file_operations which assume @offset access.

4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via steam_open
   if that bit is present in filesystem open reply.

   It was tempting to change fs/fuse/ open handler to use stream_open instead of
   nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian
   codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually
   uses offset in its read and write handlers

	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481

   so if we would do such a change it will break a real user.

5. Add stream_open and FOPEN_STREAM handling to stable kernels starting from
   v3.14+ (the kernel where 9c225f2655 first appeared). This will allow to patch
   OSSPD and other FUSE filesystems that provide stream-like files to return
   FOPEN_STREAM | FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on
   all kernel versions. This should work because fs/fuse/ ignores unknown open
   flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that
   is not aware of this flag cannot hurt. In turn the kernel that is not aware of
   FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to
   implement streams without read vs write deadlock.

This patch: adds stream_open, converts /proc/xen/xenbus to it and adds semantic
patch to automatically locate in-kernel places that are either required to be
converted due to read vs write deadlock, or that are just safe to be converted
because read and write do not use ppos and there are no other funky methods in
file_operations.

Followup patches are:

- apply the result of semantic patch;
- add FOPEN_STREAM to fs/fuse.

Regarding semantic patch I've verified each generated change manually - that it is
correct to convert - and each other nonseekable_open instance left - that it is
either not correct to convert there, or that it is not converted due to current
stream_open.cocci limitations. The script also does not convert files that should
be valid to convert, but that currently have .llseek = noop_llseek or
generic_file_llseek for unknown reason despite file being opened with
nonseekable_open (e.g. drivers/input/mousedev.c)

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Yongzhi Pan <panyongzhi@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Nikolaus Rath <Nikolaus@rath.org>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 drivers/xen/xenbus/xenbus_dev_frontend.c |   4 +-
 fs/open.c                                |  18 ++
 fs/read_write.c                          |   5 +-
 include/linux/fs.h                       |   4 +
 scripts/coccinelle/api/stream_open.cocci | 363 +++++++++++++++++++++++
 5 files changed, 389 insertions(+), 5 deletions(-)
 create mode 100644 scripts/coccinelle/api/stream_open.cocci

Comments

Lubomir Rintel March 27, 2019, 6:54 a.m. UTC | #1
On Tue, 2019-03-26 at 23:23 +0000, Kirill Smelkov wrote:
> Using scripts/coccinelle/api/stream_open.cocci added in the previous
> patch, search and convert to stream_open all in-kernel nonseekable_open
> users for which read and write actually do not depend on ppos and where
> there is no other methods in file_operations which assume @offset
> access.
> 
> I've verified each generated change manually - that it is correct to convert -
> and each other nonseekable_open instance left - that it is either not correct
> to convert there, or that it is not converted due to current stream_open.cocci
> limitations. The script also does not convert files that should be valid to
> convert, but that currently have .llseek = noop_llseek or generic_file_llseek
> for unknown reason despite file being opened with nonseekable_open (e.g.
> drivers/input/mousedev.c)
> 
> Among cases converted 14 were potentially vulnerable to read vs write deadlock
> (see details in the previous patch):

...

> and the reset were just safe to convert to stream_open because their
> read and write do not use ppos at all and corresponding file_operations
> do not have methods that assume @offset file access:

...

> 	drivers/char/pcmcia/scr24x_cs.c:95:8-24: WARNING: scr24x_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.

...

> diff --git a/drivers/char/pcmcia/scr24x_cs.c b/drivers/char/pcmcia/scr24x_cs.c
> index f6b43d9350f0..04b39c3596cc 100644
> --- a/drivers/char/pcmcia/scr24x_cs.c
> +++ b/drivers/char/pcmcia/scr24x_cs.c
> @@ -92,7 +92,7 @@ static int scr24x_open(struct inode *inode, struct file *filp)
>  	kref_get(&dev->refcnt);
>  	filp->private_data = dev;
>  
> -	return nonseekable_open(inode, filp);
> +	return stream_open(inode, filp);
>  }
>  
>  static int scr24x_release(struct inode *inode, struct file *filp)

Acked-by: Lubomir Rintel <lkundrak@v3.sk> [scr24x_cs]

Thanks,
Lubo
Jürgen Groß March 27, 2019, 4:58 p.m. UTC | #2
On 26/03/2019 23:20, Kirill Smelkov wrote:
> Commit 9c225f2655 (vfs: atomic f_pos accesses as per POSIX) added locking for
> file.f_pos access and in particular made concurrent read and write not possible
> - now both those functions take f_pos lock for the whole run, and so if e.g. a
> read is blocked waiting for data, write will deadlock waiting for that read to
> complete. This caused regression for stream-like files where previously read
> and write could run simultaneously, but after that patch could not do so
> anymore. See e.g. 581d21a2d0 (xenbus: fix deadlock on writes to /proc/xen/xenbus)
> which fixes such regression for particular case of /proc/xen/xenbus.
> 
> The patch that added f_pos lock in 2014 (see https://lkml.org/lkml/2014/2/17/324
> for background discussion) did so to guarantee POSIX thread safety for
> read/write/lseek and added the locking to file descriptors of all regular
> files. In 2014 that thread-safety problem was not new as it was already discussed
> earlier in 2006: https://lwn.net/Articles/180387. However even though 2006'th
> version of Linus's patch (https://lwn.net/Articles/180396) was adding f_pos
> locking "only for files that are marked seekable with FMODE_LSEEK (thus avoiding
> the stream-like objects like pipes and sockets)", 2014'th version - the one that
> actually made it into the tree as 9c225f2655 - is doing so irregardless of whether
> a file is seekable or not. The reason that it did so is, probably, that there are
> many files that are marked non-seekable, but e.g. their read implementation
> actually depends on knowing current position to correctly handle the read. Some
> examples:
> 
> 	kernel/power/user.c		snapshot_read
> 	fs/debugfs/file.c		u32_array_read
> 	fs/fuse/control.c		fuse_conn_waiting_read + ...
> 	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
> 	arch/s390/hypfs/inode.c		hypfs_read_iter
> 	...
> 
> In despite that, many nonseekable_open users implement read and write with pure
> stream semantics - they don't depend on passed ppos at all. And for those cases
> where read could wait for something inside, it creates a situation similar to
> xenbus - the write could be never made to go until read is done, and read is
> waiting for some, potentially external, event, for potentially unbounded time
> -> deadlock. Besides xenbus, there are 14 such places in the kernel that I've
> found with semantic patch (see below):
> 
> 	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
> 	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
> 	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
> 	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
> 	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
> 	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
> 	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
> 	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
> 	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
> 	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
> 	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
> 	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
> 	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
> 	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
> 
> In addition to the cases above another regression caused by f_pos locking is
> that now FUSE filesystems that implement open with FOPEN_NONSEEKABLE flag, can
> no longer implement bidirectional stream-like files - for the same reason
> as above e.g. read can deadlock write locking on file.f_pos in the kernel.
> FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f7 (fuse: implement
> nonseekable open) to support OSSPD (https://github.com/libfuse/osspd;
> https://lwn.net/Articles/308445). OSSPD implements /dev/dsp in userspace with
> FOPEN_NONSEEKABLE flag, with corresponding read and write routines not
> depending on current position at all, and with both read and write being
> potentially blocking operations:
> 
> 	https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
> 	https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
> 	https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
> 
> Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
> "somewhat pipe-like files ..." with read handler not using offset. However
> that test implements only read without write and cannot exercise the deadlock
> scenario:
> 
> 	https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
> 	https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
> 	https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
> 
> I've actually hit the read vs write deadlock for real while implementing my
> FUSE filesystem where there is /head/watch file, for which open creates
> separate bidirectional socket-like stream in between filesystem and its user
> with both read and write being later performed simultaneously. And there it is
> semantically not easy to split the stream into two separate read-only and
> write-only channels:
> 
> 	https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
> 
> Let's fix this regression. The plan is:
> 
> 1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS - doing so would
>    break many in-kernel nonseekable_open users which actually use ppos in
>    read/write handlers.
> 
> 2. Add stream_open() to kernel to open stream-like non-seekable file descriptors.
>    Read and write on such file descriptors would never use nor change ppos. And
>    with that property on stream-like files read and write will be running without
>    taking f_pos lock - i.e. read and write could be running simultaneously.
> 
> 3. With semantic patch search and convert to stream_open all in-kernel
>    nonseekable_open users for which read and write actually do not depend on ppos and
>    where there is no other methods in file_operations which assume @offset access.
> 
> 4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via steam_open
>    if that bit is present in filesystem open reply.
> 
>    It was tempting to change fs/fuse/ open handler to use stream_open instead of
>    nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian
>    codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually
>    uses offset in its read and write handlers
> 
> 	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
> 	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
> 	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
> 	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
> 
>    so if we would do such a change it will break a real user.
> 
> 5. Add stream_open and FOPEN_STREAM handling to stable kernels starting from
>    v3.14+ (the kernel where 9c225f2655 first appeared). This will allow to patch
>    OSSPD and other FUSE filesystems that provide stream-like files to return
>    FOPEN_STREAM | FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on
>    all kernel versions. This should work because fs/fuse/ ignores unknown open
>    flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that
>    is not aware of this flag cannot hurt. In turn the kernel that is not aware of
>    FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to
>    implement streams without read vs write deadlock.
> 
> This patch: adds stream_open, converts /proc/xen/xenbus to it and adds semantic
> patch to automatically locate in-kernel places that are either required to be
> converted due to read vs write deadlock, or that are just safe to be converted
> because read and write do not use ppos and there are no other funky methods in
> file_operations.
> 
> Followup patches are:
> 
> - apply the result of semantic patch;
> - add FOPEN_STREAM to fs/fuse.
> 
> Regarding semantic patch I've verified each generated change manually - that it is
> correct to convert - and each other nonseekable_open instance left - that it is
> either not correct to convert there, or that it is not converted due to current
> stream_open.cocci limitations. The script also does not convert files that should
> be valid to convert, but that currently have .llseek = noop_llseek or
> generic_file_llseek for unknown reason despite file being opened with
> nonseekable_open (e.g. drivers/input/mousedev.c)
> 
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Yongzhi Pan <panyongzhi@gmail.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: David Vrabel <david.vrabel@citrix.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Julia Lawall <Julia.Lawall@lip6.fr>
> Cc: Nikolaus Rath <Nikolaus@rath.org>
> Cc: Han-Wen Nienhuys <hanwen@google.com>
> Signed-off-by: Kirill Smelkov <kirr@nexedi.com>

For the Xen changes: Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen
Linus Torvalds April 6, 2019, 5:07 p.m. UTC | #3
On Tue, Mar 26, 2019 at 12:20 PM Kirill Smelkov <kirr@nexedi.com> wrote:
>
> Commit 9c225f2655 (vfs: atomic f_pos accesses as per POSIX) added locking for
> file.f_pos access and in particular made concurrent read and write not possible
> [...]

Ok, I have applied this patch - but this patch only - as a
well-researched preparatory patch.

The actual conversion patch looks fine to me too, and I see a few
acks, but I think I'd like to see it during a merge window just
because it's large and does significant changes, while this one is
purely preparatory.

Al, comments?

One small note: please don't use lkml.org references since they tend
to be slow and flaky - use lore.kernel.org instead. Also, fix your git
config to use 12-character git hashes (best done by just removing the
explicit hash size entirely, and letting modern git shorten hashes
appropriately automatically).

                  Linus
Kirill Smelkov April 7, 2019, 8:04 p.m. UTC | #4
On Sat, Apr 06, 2019 at 07:07:14AM -1000, Linus Torvalds wrote:
> On Tue, Mar 26, 2019 at 12:20 PM Kirill Smelkov <kirr@nexedi.com> wrote:
> >
> > Commit 9c225f2655 (vfs: atomic f_pos accesses as per POSIX) added locking for
> > file.f_pos access and in particular made concurrent read and write not possible
> > [...]
> 
> Ok, I have applied this patch - but this patch only - as a
> well-researched preparatory patch.
> 
> The actual conversion patch looks fine to me too, and I see a few
> acks, but I think I'd like to see it during a merge window just
> because it's large and does significant changes, while this one is
> purely preparatory.
> 
> Al, comments?

Linus, thanks a lot for feedback and for accepting the patch.

I agree that it is better to merge the bulk-conversion at the time of
next merge window. I hope Al will pick that patch up and send to you
when the next merge window time comes.

Could we please clarify one thing: is the general complete plan to deal
with this regression, as outlined in the merged patch, ok? I mean do we
agree to further proceed with fs/fuse/ FOPEN_STREAM (patch 3/3) and
backport it + stream_open to v3.14+ stable kernels? Unfortunately it
requires a bit of cooperation from userspace, but this scheme to fix the
regression is the best I could come up with.

If FUSE bits are ok, how should we go? Are we ok to merge that patch
now, or should we wait for the next merge window? I understand that
generally it should be merge window from one side, but since we are
trying to fix regression here and if the plan for regression fix is
accepted it should be merge now. If we merge now, what should be the way
for that patch to get merged?

Fixing regression on FUSE side is my reason to do this whole work -
that's why I care about it the most and ask.

> One small note: please don't use lkml.org references since they tend
> to be slow and flaky - use lore.kernel.org instead. Also, fix your git
> config to use 12-character git hashes (best done by just removing the
> explicit hash size entirely, and letting modern git shorten hashes
> appropriately automatically).

Thanks for pointing to lore.kernel.org - I'm not very keen to kernel
development and did not knew it exists. I was seeing lkml.org being
referenced here and there and that's why I used it. I will use
lore.kernel.org from now.

Short hashes were not due to git config, but due to me trimming them to
10 characters in vim manually. I will use longer hashes as you ask.

Kirill
Linus Torvalds April 8, 2019, 12:09 a.m. UTC | #5
On Sun, Apr 7, 2019 at 10:04 AM Kirill Smelkov <kirr@nexedi.com> wrote:
>
> Fixing regression on FUSE side is my reason to do this whole work -
> that's why I care about it the most and ask.

Yeah, we can do the actual FUSE fix, I think. Preferably through the
FUSE tree. Miklos?

                 Linus
Kirill Smelkov April 11, 2019, 12:38 p.m. UTC | #6
+Linus, +Al, +linux-fsdevel, +linux-kernel

On Tue, Apr 09, 2019 at 11:50:23PM +0200, Rasmus Villemoes wrote:
> On 09/04/2019 22.38, Kirill Smelkov wrote:
> > On Tue, Apr 09, 2019 at 09:43:37AM +0200, Rasmus Villemoes wrote:
> >> On 26/03/2019 23.20, Kirill Smelkov wrote:
> >>
> >>> 2. Add stream_open() to kernel to open stream-like non-seekable file descriptors.
> >>>    Read and write on such file descriptors would never use nor change ppos. And
> >>>    with that property on stream-like files read and write will be running without
> >>>    taking f_pos lock - i.e. read and write could be running simultaneously.
> >>>
> >>
> >>> diff --git a/fs/read_write.c b/fs/read_write.c
> >>> index d83003d856a0..91cb375f9840 100644
> >>> --- a/fs/read_write.c
> >>> +++ b/fs/read_write.c
> >>> @@ -560,12 +560,13 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
> >>>  
> >>>  static inline loff_t file_pos_read(struct file *file)
> >>>  {
> >>> -	return file->f_pos;
> >>> +	return file->f_mode & FMODE_STREAM ? 0 : file->f_pos;
> >>>  }
> >>>  
> >>>  static inline void file_pos_write(struct file *file, loff_t pos)
> >>>  {
> >>> -	file->f_pos = pos;
> >>> +	if ((file->f_mode & FMODE_STREAM) == 0)
> >>> +		file->f_pos = pos;
> >>>  }
> >>
> >> Perhaps leave file_pos_read alone (to avoid the conditional load), and
> >> do WARN_ON_ONCE((file->f_mode & FMODE_STREAM) && pos != 0) or maybe !=
> >> file->f_pos in _write to catch wrongly converted drivers - once the dust
> >> has settled and patches backported to -stable, that WARN_ON_ONCE can
> >> also be removed.
> >>
> >> Or, a more brutal way to detect .read or .write callbacks that do use
> >> ppos despite FMODE_STREAM is to lift the FMODE_STREAM handling to the
> >> callers of file_pos_{read,write} and pass on (file->f_mode &
> >> FMODE_STREAM) ? NULL : &pos instead of &pos. That would also just add
> >> one instead of two conditionals to the read/write paths.
> > 
> > Good idea to pass NULL as ppos into FMODE_STREAM users as this will show
> > as oops if a driver is wrongly converted and uses the position. Do you mean
> > something like this:
> > 
> 
> Not that invasive. I was more thinking of reverting the
> file_pos_read/write to their original state, then do this one-liner once
> per caller of those
> 
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -575,7 +575,7 @@ ssize_t ksys_read(unsigned int fd, char __user *buf,
> size_t count)
> 
>         if (f.file) {
>                 loff_t pos = file_pos_read(f.file);
> -               ret = vfs_read(f.file, buf, count, &pos);
> +               ret = vfs_read(f.file, buf, count, (file->f_mode & FMODE_STREAM) ? NULL : &pos);
>                 if (ret >= 0)
>                         file_pos_write(f.file, pos);
>                 fdput_pos(f);
> 
> The idea being to minimize the number of extra branches introduced in
> the read and write syscalls due to the FMODE_STREAM. So the above
> catches wrongly done conversions (or backports - perhaps an earlier
> version of some ->read callback did use ppos...) in a rather blunt way.

I checked generated x86_64 assembly and from the point of view of adding
minimum extra branches this is indeed the best version - no branches are
added at all as `(file->f_mode & FMODE_STREAM) ? NULL : &pos` is
translated to cmov instruction.

However file->f_pos writing is still there and it will bug under race
detector, e.g. under KTSAN because read and write can be running
simultaneously. Maybe it is not only race bug, but also a bit of
slowdown as read and write code paths write to the same memory thus
needing inter-cpu synchronization if read and write handlers are on
different cpus. However for this I'm not sure.

> The alternative was to revert file_pos_read() to just return file->f_pos
> as it used to, and just do
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 177ccc3d405a..97180d61a918 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -565,6 +565,7 @@ static inline loff_t file_pos_read(struct file *file)
> 
>  static inline void file_pos_write(struct file *file, loff_t pos)
>  {
> +       WARN_ON_ONCE((file->f_mode & FMODE_STREAM) && pos != 0);
>         file->f_pos = pos;
>  }
> 
> which is, essentially, also just one more branch (at least for all the
> ordinary, non-STREAM, files); this has the advantage of not oopsing, and
> just means that ->f_pos gets updated racily until somebody reacts to the
> WARN.

I prefer we pass NULL for ppos to FOPEN_STREAM files, because above
WARN, even if it triggers, will not tell which file implementation is
guilty - it will show the stack trace only till generic VFS code, if I
understand correctly.

> In either case, the current code papers over bugs introduced by
> conversion/backporting and has more branches.

Yes, thanks for brining this up. It is indeed better that we pass NULL
into converted drivers to catch correctness. Please see below for
updated patch and branches analysis.

> > but I wonder why we are discussing in private?
> 
> Well, the cc list looked too big for some mostly micro-optimzation
> comments, and I couldn't figure out a proper subset to trim it to. But
> if you think the sanity-checking (either variant) is worth it before the
> mass-conversion starts, I'm happy to take the discussion back to lkml.

Added relevant (I think) people and lists to Cc.

I was going to propose attached patch, but it turned out that if we
start to pass ppos=NULL, we also have to update at least new_sync_read,
new_sync_write, do_iter_writev_writev, rw_verify_area etc - functions
that can be called from under vfs_read/vfs_write/..., at least this way:

diff --git a/fs/read_write.c b/fs/read_write.c
index 4bdd1731859c..80724e9791bd 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -371,7 +371,7 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
        inode = file_inode(file);
        if (unlikely((ssize_t) count < 0))
                return retval;
-       pos = *ppos;
+       pos = (ppos ? *ppos : 0);
        if (unlikely(pos < 0)) {
                if (!unsigned_offsets(file))
                        return retval;
@@ -400,12 +400,13 @@ static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, lo
        ssize_t ret;

        init_sync_kiocb(&kiocb, filp);
-       kiocb.ki_pos = *ppos;
+       kiocb.ki_pos = (ppos ? *ppos : 0);
        iov_iter_init(&iter, READ, &iov, 1, len);

        ret = call_read_iter(filp, &kiocb, &iter);
        BUG_ON(ret == -EIOCBQUEUED);
-       *ppos = kiocb.ki_pos;
+       if (ppos)
+               *ppos = kiocb.ki_pos;
        return ret;
 }

@@ -468,12 +469,12 @@ static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t
        ssize_t ret;

        init_sync_kiocb(&kiocb, filp);
-       kiocb.ki_pos = *ppos;
+       kiocb.ki_pos = (ppos ? *ppos : 0);
        iov_iter_init(&iter, WRITE, &iov, 1, len);

        ret = call_write_iter(filp, &kiocb, &iter);
        BUG_ON(ret == -EIOCBQUEUED);
-       if (ret > 0)
+       if (ret > 0 && ppos)
                *ppos = kiocb.ki_pos;
        return ret;
 }
@@ -666,14 +667,15 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
        ret = kiocb_set_rw_flags(&kiocb, flags);
        if (ret)
                return ret;
-       kiocb.ki_pos = *ppos;
+       kiocb.ki_pos = (ppos ? *ppos : 0);

        if (type == READ)
                ret = call_read_iter(filp, &kiocb, iter);
        else
                ret = call_write_iter(filp, &kiocb, iter);
        BUG_ON(ret == -EIOCBQUEUED);
-       *ppos = kiocb.ki_pos;
+       if (ppos)
+               *ppos = kiocb.ki_pos;
        return ret;
 }

There is no notion of NULL ppos in kiocb, and I'm not familiar with that
subsystem, so it would be good to first get feedback before deciding on
how/where to move on.

Linus, anyone, comments?

Thanks beforehand,
Kirill

---- 8< ----
From facf84dfc75bf314c5f19160551750001d2d513b Mon Sep 17 00:00:00 2001
From: Kirill Smelkov <kirr@nexedi.com>
Date: Thu, 11 Apr 2019 13:07:03 +0300
Subject: [BUGGY PATCH] fs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files

( The patch is buggy - it will fail at NULL pointer dereference in e.g.
  rw_verify_area )

This amends commit 10dce8af3422 ("fs: stream_open - opener for
stream-like files so that read and write can run simultaneously without
deadlock") in how position is passed into .read()/.write() handler for
stream-like files:

Rasmus noticed that we currently pass 0 as position and ignore any position
change if that is done by a file implementation. This papers over bugs if ppos
is used in files that declare themselves as being stream-like as such bugs will
go unnoticed. Even if a file implementation is correctly converted into using
stream_open, its read/write later could be changed to use ppos and even though
that won't be working correctly, that bug might go unnoticed without someone
doing wrong behaviour analysis. It is thus better to pass ppos=NULL into
read/write for stream-like files as that don't give any chance for ppos usage
bugs because it will oops if ppos is ever used inside .read() or .write().

Rasmus also made the point that FMODE_STREAM patch added 2 branches into
ksys_read/ksys_write path which is too much unnecessary overhead and
could be reduces.

Let's rework the code to pass ppos=NULL and have less branches.

The first approach could be to revert file_pos_read and file_pos_write
into their pre-FMODE_STREAM version and just do

	--- a/fs/read_write.c
	+++ b/fs/read_write.c
	@@ -575,7 +575,7 @@ ssize_t ksys_read(unsigned int fd, char __user *buf,
	size_t count)

	        if (f.file) {
	                loff_t pos = file_pos_read(f.file);
	-               ret = vfs_read(f.file, buf, count, &pos);
	+               ret = vfs_read(f.file, buf, count, (file->f_mode & FMODE_STREAM) ? NULL : &pos);
	                if (ret >= 0)
	                        file_pos_write(f.file, pos);
	                fdput_pos(f);

this adds no branches at all as `(file->f_mode & FMODE_STREAM) ? NULL : &pos)`
translates to single cmov instruction on x86_64:

        if (f.file) {
    18e5:       48 89 c3                mov    %rax,%rbx
    18e8:       48 83 e3 fc             and    $0xfffffffffffffffc,%rbx
    18ec:       74 79                   je     1967 <ksys_read+0xa7>
    18ee:       48 89 c5                mov    %rax,%rbp
                loff_t pos = f.file->f_pos;
    18f1:       48 8b 43 68             mov    0x68(%rbx),%rax
                ret = vfs_read(f.file, buf, count, (f.file->f_mode & FMODE_STREAM) ? NULL : &pos);
    18f5:       48 89 e1                mov    %rsp,%rcx
    18f8:       f6 43 46 20             testb  $0x20,0x46(%rbx)
    18fc:       4c 89 e6                mov    %r12,%rsi
    18ff:       4c 89 ea                mov    %r13,%rdx
    1902:       48 89 df                mov    %rbx,%rdi
                loff_t pos = f.file->f_pos;
    1905:       48 89 04 24             mov    %rax,(%rsp)
                ret = vfs_read(f.file, buf, count, (f.file->f_mode & FMODE_STREAM) ? NULL : &pos);
    1909:       b8 00 00 00 00          mov    $0x0,%eax
    190e:       48 0f 45 c8             cmovne %rax,%rcx		<-- NOTE
    1912:       e8 00 00 00 00          callq  1917 <ksys_read+0x57>
                        1913: R_X86_64_PLT32    vfs_read-0x4

However this leaves on unconditional write into file->f_pos after
vfs_read call, and even though it should not be affecting semantic, it
will bug under race detector (e.g. KTSAN) and probably it might add some
inter-CPU overhead if read and write handlers are running on different
cores.

If we instead move `file->f_mode & FMODE_STREAM` checking into vfs
functions that actually dispatch IO and care to load/store file->f_fpos
only for non-stream files, even though it is 3 branches if looking at C
source, gcc generates only 1 more branch compared to how it was
pre-FMODE_STREAM. For example consider updated ksys_read:

	 ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
	 {
	 	struct fd f = fdget_pos(fd);
	 	ssize_t ret = -EBADF;

	 	if (f.file) {
	-		loff_t pos = file_pos_read(f.file);
	-		ret = vfs_read(f.file, buf, count, &pos);
	-		if (ret >= 0)
	-			file_pos_write(f.file, pos);
	+		loff_t pos, *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &pos;
	+		if (ppos)
	+			pos = f.file->f_pos;
	+		ret = vfs_read(f.file, buf, count, ppos);
	+		if (ret >= 0 && ppos)
	+			f.file->f_pos = pos;
	 		fdput_pos(f);
	 	}
	 	return ret;

The code that gcc generates here is essentially as if the source was:

	if (f.file->f_mode & FMODE_STREAM) {
		ret = vfs_read(f.file, buf, count, NULL);
	} else {
		loff_t pos = f.file->f_pos;
		ret = vfs_read(f.file, buf, count, ppos);
		if (ret >= 0)
			f.file->f_pos = pos;
	}
	fdput_pos(f);

i.e. it branches only once after checking file->f_mode and then has two
distinct codepaths each with its own vfs_read call:

        if (f.file) {
    18e5:       48 89 c5                mov    %rax,%rbp
    18e8:       48 83 e5 fc             and    $0xfffffffffffffffc,%rbp
    18ec:       0f 84 89 00 00 00       je     197b <ksys_read+0xbb>
    18f2:       48 89 c3                mov    %rax,%rbx
                loff_t pos, *ppos = (f.file->f_mode & FMODE_STREAM) ? NULL : &pos;
    18f5:       f6 45 46 20             testb  $0x20,0x46(%rbp)
    18f9:       74 3b                   je     1936 <ksys_read+0x76>	<-- branch pos/no-pos
                ret = vfs_read(f.file, buf, count, ppos);
    18fb:       4c 89 e6                mov    %r12,%rsi		<-- start of IO without pos
    18fe:       31 c9                   xor    %ecx,%ecx
    1900:       4c 89 ea                mov    %r13,%rdx
    1903:       48 89 ef                mov    %rbp,%rdi
    1906:       e8 00 00 00 00          callq  190b <ksys_read+0x4b>
                        1907: R_X86_64_PLT32    vfs_read-0x4		<-- vfs_read(..., ppos=NULL)
    190b:       49 89 c4                mov    %rax,%r12
        if (f.flags & FDPUT_POS_UNLOCK)
    190e:       f6 c3 02                test   $0x2,%bl
    1911:       75 51                   jne    1964 <ksys_read+0xa4>
        if (fd.flags & FDPUT_FPUT)
    1913:       83 e3 01                and    $0x1,%ebx
    1916:       75 59                   jne    1971 <ksys_read+0xb1>
}
    1918:       48 8b 4c 24 08          mov    0x8(%rsp),%rcx
    191d:       65 48 33 0c 25 28 00    xor    %gs:0x28,%rcx
    1924:       00 00
    1926:       4c 89 e0                mov    %r12,%rax
    1929:       75 59                   jne    1984 <ksys_read+0xc4>
    192b:       48 83 c4 10             add    $0x10,%rsp
    192f:       5b                      pop    %rbx
    1930:       5d                      pop    %rbp
    1931:       41 5c                   pop    %r12
    1933:       41 5d                   pop    %r13
    1935:       c3                      retq
                        pos = f.file->f_pos;
    1936:       48 8b 45 68             mov    0x68(%rbp),%rax		<-- start of IO with pos
                ret = vfs_read(f.file, buf, count, ppos);
    193a:       4c 89 e6                mov    %r12,%rsi
    193d:       48 89 e1                mov    %rsp,%rcx
    1940:       4c 89 ea                mov    %r13,%rdx
    1943:       48 89 ef                mov    %rbp,%rdi
                        pos = f.file->f_pos;
    1946:       48 89 04 24             mov    %rax,(%rsp)
                ret = vfs_read(f.file, buf, count, ppos);
    194a:       e8 00 00 00 00          callq  194f <ksys_read+0x8f>
                        194b: R_X86_64_PLT32    vfs_read-0x4		<-- vfs_read(..., ppos=&pos)
    194f:       49 89 c4                mov    %rax,%r12
                if (ret >= 0 && ppos)
    1952:       48 85 c0                test   %rax,%rax
    1955:       78 b7                   js     190e <ksys_read+0x4e>
                        f.file->f_pos = pos;
    1957:       48 8b 04 24             mov    (%rsp),%rax
    195b:       48 89 45 68             mov    %rax,0x68(%rbp)
        if (f.flags & FDPUT_POS_UNLOCK)
    ...

If adding one branch to handle FMODE_STREAM is acceptable, this approach should
be ok. Not duplicating vfs_read calls at C level is better as C-level code is
more clear without such duplication, and gcc cares to optimize the function
properly to have only 1 branch depending on file->f_mode.

If even 1 extra branch is unacceptable, we should indeed go with the first
option described in this patch but be prepared for race-detector bug reports
and probably some inter-CPU overhead.

Overall I would suggest to use simpler approach presented by hereby patch unless
1-extra jump could be shown to cause noticeable slowdowns in practice.

Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 fs/open.c       |  5 +++--
 fs/read_write.c | 51 +++++++++++++++++++++++--------------------------
 2 files changed, 27 insertions(+), 29 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index a00350018a47..9c7d724a6f67 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1219,8 +1219,9 @@ EXPORT_SYMBOL(nonseekable_open);
 /*
  * stream_open is used by subsystems that want stream-like file descriptors.
  * Such file descriptors are not seekable and don't have notion of position
- * (file.f_pos is always 0). Contrary to file descriptors of other regular
- * files, .read() and .write() can run simultaneously.
+ * (file.f_pos is always 0 and ppos passed to .read()/.write() is always NULL).
+ * Contrary to file descriptors of other regular files, .read() and .write()
+ * can run simultaneously.
  *
  * stream_open never fails and is marked to return int so that it could be
  * directly used as file_operations.open .
diff --git a/fs/read_write.c b/fs/read_write.c
index 61b43ad7608e..21d350df8109 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -558,27 +558,18 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
 	return ret;
 }
 
-static inline loff_t file_pos_read(struct file *file)
-{
-	return file->f_mode & FMODE_STREAM ? 0 : file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
-{
-	if ((file->f_mode & FMODE_STREAM) == 0)
-		file->f_pos = pos;
-}
-
 ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_read(f.file, buf, count, &pos);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &pos;
+		if (ppos)
+			pos = f.file->f_pos;
+		ret = vfs_read(f.file, buf, count, ppos);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 	return ret;
@@ -595,10 +586,12 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_write(f.file, buf, count, &pos);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &pos;
+		if (ppos)
+			pos = f.file->f_pos;
+		ret = vfs_write(f.file, buf, count, ppos);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
@@ -1013,10 +1006,12 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos, flags);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &pos;
+		if (ppos)
+			pos = f.file->f_pos;
+		ret = vfs_readv(f.file, vec, vlen, ppos, flags);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
@@ -1033,10 +1028,12 @@ static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos, flags);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &pos;
+		if (ppos)
+			pos = f.file->f_pos;
+		ret = vfs_writev(f.file, vec, vlen, ppos, flags);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
Linus Torvalds April 11, 2019, 4:22 p.m. UTC | #7
On Thu, Apr 11, 2019 at 5:38 AM Kirill Smelkov <kirr@nexedi.com> wrote:
>
> However file->f_pos writing is still there and it will bug under race
> detector, e.g. under KTSAN because read and write can be running
> simultaneously. Maybe it is not only race bug, but also a bit of
> slowdown as read and write code paths write to the same memory thus
> needing inter-cpu synchronization if read and write handlers are on
> different cpus. However for this I'm not sure.

I doubt it's noticeable, but it would probably be good to simply not
do the write for streams.

That *could* be done somewhat similarly, with just changing th address
to be updated, or course.

In fact, we *used* to (long ago) pass in the address of "file->f_pos"
itself to the low-level read/write routines. We then changed it to do
that indirection through a local copy of pos (and
file_pos_read/file_pos_write) because we didn't do the proper locking,
so different read/write versions could mess with each other (and with
lseek).

But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos
accesses as per POSIX") did was to add the proper locking at least for
the cases that we care about deeply, so we *could* say that we have
three cases:

 - FMODE_ATOMIC_POS: properly locked,
 - FMODE_STREAM: no pos at all
 - otherwise a "mostly don't care - don't mix!"

and so we could go back to not copying the pos at all, and instead do
something like

        loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
        ret = vfs_write(f.file, buf, count, ppos);

and perhaps have a long-term plan to try to get rid of the "don't mix"
case entirely (ie "if you use f_pos, then we'll do the proper
locking")

(The above is obviously surrounded by the fdget_pos()/fdput_pos() that
implements the locking decision).

           Linus
Kirill Smelkov April 12, 2019, 12:42 p.m. UTC | #8
On Thu, Apr 11, 2019 at 09:22:56AM -0700, Linus Torvalds wrote:
> On Thu, Apr 11, 2019 at 5:38 AM Kirill Smelkov <kirr@nexedi.com> wrote:
> >
> > However file->f_pos writing is still there and it will bug under race
> > detector, e.g. under KTSAN because read and write can be running
> > simultaneously. Maybe it is not only race bug, but also a bit of
> > slowdown as read and write code paths write to the same memory thus
> > needing inter-cpu synchronization if read and write handlers are on
> > different cpus. However for this I'm not sure.
> 
> I doubt it's noticeable, but it would probably be good to simply not
> do the write for streams.
> 
> That *could* be done somewhat similarly, with just changing th address
> to be updated, or course.
> 
> In fact, we *used* to (long ago) pass in the address of "file->f_pos"
> itself to the low-level read/write routines. We then changed it to do
> that indirection through a local copy of pos (and
> file_pos_read/file_pos_write) because we didn't do the proper locking,
> so different read/write versions could mess with each other (and with
> lseek).
> 
> But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos
> accesses as per POSIX") did was to add the proper locking at least for
> the cases that we care about deeply, so we *could* say that we have
> three cases:
> 
>  - FMODE_ATOMIC_POS: properly locked,
>  - FMODE_STREAM: no pos at all
>  - otherwise a "mostly don't care - don't mix!"
> 
> and so we could go back to not copying the pos at all, and instead do
> something like
> 
>         loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
>         ret = vfs_write(f.file, buf, count, ppos);
> 
> and perhaps have a long-term plan to try to get rid of the "don't mix"
> case entirely (ie "if you use f_pos, then we'll do the proper
> locking")
> 
> (The above is obviously surrounded by the fdget_pos()/fdput_pos() that
> implements the locking decision).

Ok Linus, thanks for feedback. Please consider applying or queueing the
following patches.

If the patches are ok, the first one is probably better to be applied
now - to catch wrongly converted drivers right from start, while the
second one could be probably better delayed till merge window. However
please do as what you prefer is best.

Kirill

---- 8< ----
From 328d4cc5dbb1d6fdbff1f80f1eed9f04b9a5de1c Mon Sep 17 00:00:00 2001
From: Kirill Smelkov <kirr@nexedi.com>
Date: Fri, 12 Apr 2019 12:31:57 +0300
Subject: [PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files

This amends commit 10dce8af3422 ("fs: stream_open - opener for
stream-like files so that read and write can run simultaneously without
deadlock") in how position is passed into .read()/.write() handler for
stream-like files:

Rasmus noticed that we currently pass 0 as position and ignore any position
change if that is done by a file implementation. This papers over bugs if ppos
is used in files that declare themselves as being stream-like as such bugs will
go unnoticed. Even if a file implementation is correctly converted into using
stream_open, its read/write later could be changed to use ppos and even though
that won't be working correctly, that bug might go unnoticed without someone
doing wrong behaviour analysis. It is thus better to pass ppos=NULL into
read/write for stream-like files as that don't give any chance for ppos usage
bugs because it will oops if ppos is ever used inside .read() or .write().

Note 1: rw_verify_area, new_sync_{read,write} needs to be updated
because they are called by vfs_read/vfs_write & friends before
file_operations .read/.write .

Note 2: if file backend uses new-style .read_iter/.write_iter, position
is still passed into there as non-pointer kiocb.ki_pos . Currently
stream_open.cocci (semantic patch added by 10dce8af3422) ignores files
whose file_operations has *_iter methods.

Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 fs/open.c       |  5 ++--
 fs/read_write.c | 75 +++++++++++++++++++++++++++++--------------------
 2 files changed, 47 insertions(+), 33 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index a00350018a47..9c7d724a6f67 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1219,8 +1219,9 @@ EXPORT_SYMBOL(nonseekable_open);
 /*
  * stream_open is used by subsystems that want stream-like file descriptors.
  * Such file descriptors are not seekable and don't have notion of position
- * (file.f_pos is always 0). Contrary to file descriptors of other regular
- * files, .read() and .write() can run simultaneously.
+ * (file.f_pos is always 0 and ppos passed to .read()/.write() is always NULL).
+ * Contrary to file descriptors of other regular files, .read() and .write()
+ * can run simultaneously.
  *
  * stream_open never fails and is marked to return int so that it could be
  * directly used as file_operations.open .
diff --git a/fs/read_write.c b/fs/read_write.c
index 61b43ad7608e..d62556be6848 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -371,7 +371,7 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
 	inode = file_inode(file);
 	if (unlikely((ssize_t) count < 0))
 		return retval;
-	pos = *ppos;
+	pos = (ppos ? *ppos : 0);
 	if (unlikely(pos < 0)) {
 		if (!unsigned_offsets(file))
 			return retval;
@@ -400,12 +400,13 @@ static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, lo
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 	iov_iter_init(&iter, READ, &iov, 1, len);
 
 	ret = call_read_iter(filp, &kiocb, &iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	*ppos = kiocb.ki_pos;
+	if (ppos)
+		*ppos = kiocb.ki_pos;
 	return ret;
 }
 
@@ -468,12 +469,12 @@ static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 	iov_iter_init(&iter, WRITE, &iov, 1, len);
 
 	ret = call_write_iter(filp, &kiocb, &iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	if (ret > 0)
+	if (ret > 0 && ppos)
 		*ppos = kiocb.ki_pos;
 	return ret;
 }
@@ -558,15 +559,10 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
 	return ret;
 }
 
-static inline loff_t file_pos_read(struct file *file)
-{
-	return file->f_mode & FMODE_STREAM ? 0 : file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
+/* file_ppos returns &file->f_pos or NULL if file is stream */
+static inline loff_t *file_ppos(struct file *file)
 {
-	if ((file->f_mode & FMODE_STREAM) == 0)
-		file->f_pos = pos;
+	return file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
 }
 
 ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
@@ -575,10 +571,14 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_read(f.file, buf, count, &pos);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_read(f.file, buf, count, ppos);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 	return ret;
@@ -595,10 +595,14 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_write(f.file, buf, count, &pos);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_write(f.file, buf, count, ppos);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
@@ -673,14 +677,15 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
 	ret = kiocb_set_rw_flags(&kiocb, flags);
 	if (ret)
 		return ret;
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 
 	if (type == READ)
 		ret = call_read_iter(filp, &kiocb, iter);
 	else
 		ret = call_write_iter(filp, &kiocb, iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	*ppos = kiocb.ki_pos;
+	if (ppos)
+		*ppos = kiocb.ki_pos;
 	return ret;
 }
 
@@ -1013,10 +1018,14 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos, flags);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_readv(f.file, vec, vlen, ppos, flags);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
@@ -1033,10 +1042,14 @@ static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos, flags);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_writev(f.file, vec, vlen, ppos, flags);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
Kirill Smelkov April 13, 2019, 4:54 p.m. UTC | #9
On Fri, Apr 12, 2019 at 03:41:44PM +0300, Kirill Smelkov wrote:
> On Thu, Apr 11, 2019 at 09:22:56AM -0700, Linus Torvalds wrote:
> > On Thu, Apr 11, 2019 at 5:38 AM Kirill Smelkov <kirr@nexedi.com> wrote:
> > >
> > > However file->f_pos writing is still there and it will bug under race
> > > detector, e.g. under KTSAN because read and write can be running
> > > simultaneously. Maybe it is not only race bug, but also a bit of
> > > slowdown as read and write code paths write to the same memory thus
> > > needing inter-cpu synchronization if read and write handlers are on
> > > different cpus. However for this I'm not sure.
> > 
> > I doubt it's noticeable, but it would probably be good to simply not
> > do the write for streams.
> > 
> > That *could* be done somewhat similarly, with just changing th address
> > to be updated, or course.
> > 
> > In fact, we *used* to (long ago) pass in the address of "file->f_pos"
> > itself to the low-level read/write routines. We then changed it to do
> > that indirection through a local copy of pos (and
> > file_pos_read/file_pos_write) because we didn't do the proper locking,
> > so different read/write versions could mess with each other (and with
> > lseek).
> > 
> > But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos
> > accesses as per POSIX") did was to add the proper locking at least for
> > the cases that we care about deeply, so we *could* say that we have
> > three cases:
> > 
> >  - FMODE_ATOMIC_POS: properly locked,
> >  - FMODE_STREAM: no pos at all
> >  - otherwise a "mostly don't care - don't mix!"
> > 
> > and so we could go back to not copying the pos at all, and instead do
> > something like
> > 
> >         loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
> >         ret = vfs_write(f.file, buf, count, ppos);
> > 
> > and perhaps have a long-term plan to try to get rid of the "don't mix"
> > case entirely (ie "if you use f_pos, then we'll do the proper
> > locking")
> > 
> > (The above is obviously surrounded by the fdget_pos()/fdput_pos() that
> > implements the locking decision).
> 
> Ok Linus, thanks for feedback. Please consider applying or queueing the
> following patches.

Resending those patches one by one in separate emails, if maybe 2
patches inline was problematic to handle...
Kirill Smelkov April 14, 2019, 7:11 a.m. UTC | #10
On Sun, Apr 07, 2019 at 02:09:08PM -1000, Linus Torvalds wrote:
> On Sun, Apr 7, 2019 at 10:04 AM Kirill Smelkov <kirr@nexedi.com> wrote:
> >
> > Fixing regression on FUSE side is my reason to do this whole work -
> > that's why I care about it the most and ask.
> 
> Yeah, we can do the actual FUSE fix, I think. Preferably through the
> FUSE tree. Miklos?

Linus, thanks for feedback.

I'm not sure what is going on, but I'm afraid we can wait very long
here. It's been already 1 week, and I was also struggling to get
feedback from Miklos on my FUSE patches starting from end of February
without any luck:

https://lore.kernel.org/linux-fsdevel/cover.1553680185.git.kirr@nexedi.com/
	https://lore.kernel.org/linux-fsdevel/12f7d0d98555ee0d174d04bb47644f65c07f035a.1553680185.git.kirr@nexedi.com/
	https://lore.kernel.org/linux-fsdevel/d74b17b9d33c3dcc7a1f2fa2914fb3c4e7cda127.1553680185.git.kirr@nexedi.com/

https://lore.kernel.org/linux-fsdevel/cover.1553677194.git.kirr@nexedi.com/
	https://lore.kernel.org/linux-fsdevel/d916d401a80e9834c95970d86ca71c0154e988a6.1553677194.git.kirr@nexedi.com/
	https://lore.kernel.org/linux-fsdevel/e0b43507976d6ea9010f1bacaef067f18de49f1f.1553677194.git.kirr@nexedi.com/

https://lore.kernel.org/linux-fsdevel/dc47c061f20c464ccf46b43822b062dca6486e90.1553637462.git.kirr@nexedi.com/

I'm not sure what to do with my FUSE patches. It feels like knocking
into closed door...

Kirill
diff mbox series

Patch

diff --git a/drivers/xen/xenbus/xenbus_dev_frontend.c b/drivers/xen/xenbus/xenbus_dev_frontend.c
index c3e201025ef0..0782ff3c2273 100644
--- a/drivers/xen/xenbus/xenbus_dev_frontend.c
+++ b/drivers/xen/xenbus/xenbus_dev_frontend.c
@@ -622,9 +622,7 @@  static int xenbus_file_open(struct inode *inode, struct file *filp)
 	if (xen_store_evtchn == 0)
 		return -ENOENT;
 
-	nonseekable_open(inode, filp);
-
-	filp->f_mode &= ~FMODE_ATOMIC_POS; /* cdev-style semantics */
+	stream_open(inode, filp);
 
 	u = kzalloc(sizeof(*u), GFP_KERNEL);
 	if (u == NULL)
diff --git a/fs/open.c b/fs/open.c
index 0285ce7dbd51..b6bf04a2ef8c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1209,3 +1209,21 @@  int nonseekable_open(struct inode *inode, struct file *filp)
 }
 
 EXPORT_SYMBOL(nonseekable_open);
+
+/*
+ * stream_open is used by subsystems that want stream-like file descriptors.
+ * Such file descriptors are not seekable and don't have notion of position
+ * (file.f_pos is always 0). Contrary to file descriptors of other regular
+ * files, .read() and .write() can run simultaneously.
+ *
+ * stream_open never fails and is marked to return int so that it could be
+ * directly used as file_operations.open .
+ */
+int stream_open(struct inode *inode, struct file *filp)
+{
+	filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE | FMODE_ATOMIC_POS);
+	filp->f_mode |= FMODE_STREAM;
+	return 0;
+}
+
+EXPORT_SYMBOL(stream_open);
diff --git a/fs/read_write.c b/fs/read_write.c
index d83003d856a0..91cb375f9840 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -560,12 +560,13 @@  ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
 
 static inline loff_t file_pos_read(struct file *file)
 {
-	return file->f_pos;
+	return file->f_mode & FMODE_STREAM ? 0 : file->f_pos;
 }
 
 static inline void file_pos_write(struct file *file, loff_t pos)
 {
-	file->f_pos = pos;
+	if ((file->f_mode & FMODE_STREAM) == 0)
+		file->f_pos = pos;
 }
 
 ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3e85cb8e8c20..cd59946bc3a4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -155,6 +155,9 @@  typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define FMODE_OPENED		((__force fmode_t)0x80000)
 #define FMODE_CREATED		((__force fmode_t)0x100000)
 
+/* File is stream-like */
+#define FMODE_STREAM		((__force fmode_t)0x200000)
+
 /* File was opened by fanotify and shouldn't generate fanotify events */
 #define FMODE_NONOTIFY		((__force fmode_t)0x4000000)
 
@@ -3075,6 +3078,7 @@  extern loff_t no_seek_end_llseek_size(struct file *, loff_t, int, loff_t);
 extern loff_t no_seek_end_llseek(struct file *, loff_t, int);
 extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
+extern int stream_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(struct bio *bio, struct inode *inode,
diff --git a/scripts/coccinelle/api/stream_open.cocci b/scripts/coccinelle/api/stream_open.cocci
new file mode 100644
index 000000000000..350145da7669
--- /dev/null
+++ b/scripts/coccinelle/api/stream_open.cocci
@@ -0,0 +1,363 @@ 
+// SPDX-License-Identifier: GPL-2.0
+// Author: Kirill Smelkov (kirr@nexedi.com)
+//
+// Search for stream-like files that are using nonseekable_open and convert
+// them to stream_open. A stream-like file is a file that does not use ppos in
+// its read and write. Rationale for the conversion is to avoid deadlock in
+// between read and write.
+
+virtual report
+virtual patch
+virtual explain  // explain decisions in the patch (SPFLAGS="-D explain")
+
+// stream-like reader & writer - ones that do not depend on f_pos.
+@ stream_reader @
+identifier readstream, ppos;
+identifier f, buf, len;
+type loff_t;
+@@
+  ssize_t readstream(struct file *f, char *buf, size_t len, loff_t *ppos)
+  {
+    ... when != ppos
+  }
+
+@ stream_writer @
+identifier writestream, ppos;
+identifier f, buf, len;
+type loff_t;
+@@
+  ssize_t writestream(struct file *f, const char *buf, size_t len, loff_t *ppos)
+  {
+    ... when != ppos
+  }
+
+
+// a function that blocks
+@ blocks @
+identifier block_f;
+identifier wait_event =~ "^wait_event_.*";
+@@
+  block_f(...) {
+    ... when exists
+    wait_event(...)
+    ... when exists
+  }
+
+// stream_reader that can block inside.
+//
+// XXX wait_* can be called not directly from current function (e.g. func -> f -> g -> wait())
+// XXX currently reader_blocks supports only direct and 1-level indirect cases.
+@ reader_blocks_direct @
+identifier stream_reader.readstream;
+identifier wait_event =~ "^wait_event_.*";
+@@
+  readstream(...)
+  {
+    ... when exists
+    wait_event(...)
+    ... when exists
+  }
+
+@ reader_blocks_1 @
+identifier stream_reader.readstream;
+identifier blocks.block_f;
+@@
+  readstream(...)
+  {
+    ... when exists
+    block_f(...)
+    ... when exists
+  }
+
+@ reader_blocks depends on reader_blocks_direct || reader_blocks_1 @
+identifier stream_reader.readstream;
+@@
+  readstream(...) {
+    ...
+  }
+
+
+// file_operations + whether they have _any_ .read, .write, .llseek ... at all.
+//
+// XXX add support for file_operations xxx[N] = ...	(sound/core/pcm_native.c)
+@ fops0 @
+identifier fops;
+@@
+  struct file_operations fops = {
+    ...
+  };
+
+@ has_read @
+identifier fops0.fops;
+identifier read_f;
+@@
+  struct file_operations fops = {
+    .read = read_f,
+  };
+
+@ has_read_iter @
+identifier fops0.fops;
+identifier read_iter_f;
+@@
+  struct file_operations fops = {
+    .read_iter = read_iter_f,
+  };
+
+@ has_write @
+identifier fops0.fops;
+identifier write_f;
+@@
+  struct file_operations fops = {
+    .write = write_f,
+  };
+
+@ has_write_iter @
+identifier fops0.fops;
+identifier write_iter_f;
+@@
+  struct file_operations fops = {
+    .write_iter = write_iter_f,
+  };
+
+@ has_llseek @
+identifier fops0.fops;
+identifier llseek_f;
+@@
+  struct file_operations fops = {
+    .llseek = llseek_f,
+  };
+
+@ has_no_llseek @
+identifier fops0.fops;
+@@
+  struct file_operations fops = {
+    .llseek = no_llseek,
+  };
+
+@ has_mmap @
+identifier fops0.fops;
+identifier mmap_f;
+@@
+  struct file_operations fops = {
+    .mmap = mmap_f,
+  };
+
+@ has_copy_file_range @
+identifier fops0.fops;
+identifier copy_file_range_f;
+@@
+  struct file_operations fops = {
+    .copy_file_range = copy_file_range_f,
+  };
+
+@ has_remap_file_range @
+identifier fops0.fops;
+identifier remap_file_range_f;
+@@
+  struct file_operations fops = {
+    .remap_file_range = remap_file_range_f,
+  };
+
+@ has_splice_read @
+identifier fops0.fops;
+identifier splice_read_f;
+@@
+  struct file_operations fops = {
+    .splice_read = splice_read_f,
+  };
+
+@ has_splice_write @
+identifier fops0.fops;
+identifier splice_write_f;
+@@
+  struct file_operations fops = {
+    .splice_write = splice_write_f,
+  };
+
+
+// file_operations that is candidate for stream_open conversion - it does not
+// use mmap and other methods that assume @offset access to file.
+//
+// XXX for simplicity require no .{read/write}_iter and no .splice_{read/write} for now.
+// XXX maybe_steam.fops cannot be used in other rules - it gives "bad rule maybe_stream or bad variable fops".
+@ maybe_stream depends on (!has_llseek || has_no_llseek) && !has_mmap && !has_copy_file_range && !has_remap_file_range && !has_read_iter && !has_write_iter && !has_splice_read && !has_splice_write @
+identifier fops0.fops;
+@@
+  struct file_operations fops = {
+  };
+
+
+// ---- conversions ----
+
+// XXX .open = nonseekable_open -> .open = stream_open
+// XXX .open = func -> openfunc -> nonseekable_open
+
+// read & write
+//
+// if both are used in the same file_operations together with an opener -
+// under that conditions we can use stream_open instead of nonseekable_open.
+@ fops_rw depends on maybe_stream @
+identifier fops0.fops, openfunc;
+identifier stream_reader.readstream;
+identifier stream_writer.writestream;
+@@
+  struct file_operations fops = {
+      .open  = openfunc,
+      .read  = readstream,
+      .write = writestream,
+  };
+
+@ report_rw depends on report @
+identifier fops_rw.openfunc;
+position p1;
+@@
+  openfunc(...) {
+    <...
+     nonseekable_open@p1
+    ...>
+  }
+
+@ script:python depends on report && reader_blocks @
+fops << fops0.fops;
+p << report_rw.p1;
+@@
+coccilib.report.print_report(p[0],
+  "ERROR: %s: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix." % (fops,))
+
+@ script:python depends on report && !reader_blocks @
+fops << fops0.fops;
+p << report_rw.p1;
+@@
+coccilib.report.print_report(p[0],
+  "WARNING: %s: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open." % (fops,))
+
+
+@ explain_rw_deadlocked depends on explain && reader_blocks @
+identifier fops_rw.openfunc;
+@@
+  openfunc(...) {
+    <...
+-    nonseekable_open
++    nonseekable_open /* read & write (was deadlock) */
+    ...>
+  }
+
+
+@ explain_rw_nodeadlock depends on explain && !reader_blocks @
+identifier fops_rw.openfunc;
+@@
+  openfunc(...) {
+    <...
+-    nonseekable_open
++    nonseekable_open /* read & write (no direct deadlock) */
+    ...>
+  }
+
+@ patch_rw depends on patch @
+identifier fops_rw.openfunc;
+@@
+  openfunc(...) {
+    <...
+-   nonseekable_open
++   stream_open
+    ...>
+  }
+
+
+// read, but not write
+@ fops_r depends on maybe_stream && !has_write @
+identifier fops0.fops, openfunc;
+identifier stream_reader.readstream;
+@@
+  struct file_operations fops = {
+      .open  = openfunc,
+      .read  = readstream,
+  };
+
+@ report_r depends on report @
+identifier fops_r.openfunc;
+position p1;
+@@
+  openfunc(...) {
+    <...
+    nonseekable_open@p1
+    ...>
+  }
+
+@ script:python depends on report @
+fops << fops0.fops;
+p << report_r.p1;
+@@
+coccilib.report.print_report(p[0],
+  "WARNING: %s: .read() has stream semantic; safe to change nonseekable_open -> stream_open." % (fops,))
+
+@ explain_r depends on explain @
+identifier fops_r.openfunc;
+@@
+  openfunc(...) {
+    <...
+-   nonseekable_open
++   nonseekable_open /* read only */
+    ...>
+  }
+
+@ patch_r depends on patch @
+identifier fops_r.openfunc;
+@@
+  openfunc(...) {
+    <...
+-   nonseekable_open
++   stream_open
+    ...>
+  }
+
+
+// write, but not read
+@ fops_w depends on maybe_stream && !has_read @
+identifier fops0.fops, openfunc;
+identifier stream_writer.writestream;
+@@
+  struct file_operations fops = {
+      .open  = openfunc,
+      .write = writestream,
+  };
+
+@ report_w depends on report @
+identifier fops_w.openfunc;
+position p1;
+@@
+  openfunc(...) {
+    <...
+    nonseekable_open@p1
+    ...>
+  }
+
+@ script:python depends on report @
+fops << fops0.fops;
+p << report_w.p1;
+@@
+coccilib.report.print_report(p[0],
+  "WARNING: %s: .write() has stream semantic; safe to change nonseekable_open -> stream_open." % (fops,))
+
+@ explain_w depends on explain @
+identifier fops_w.openfunc;
+@@
+  openfunc(...) {
+    <...
+-   nonseekable_open
++   nonseekable_open /* write only */
+    ...>
+  }
+
+@ patch_w depends on patch @
+identifier fops_w.openfunc;
+@@
+  openfunc(...) {
+    <...
+-   nonseekable_open
++   stream_open
+    ...>
+  }
+
+
+// no read, no write - don't change anything