diff mbox series

Pending splice(file -> FIFO) always blocks read(FIFO), regardless of O_NONBLOCK on read side?

Message ID qk6hjuam54khlaikf2ssom6custxf5is2ekkaequf4hvode3ls@zgf7j5j4ubvw (mailing list archive)
State New, archived
Headers show
Series Pending splice(file -> FIFO) always blocks read(FIFO), regardless of O_NONBLOCK on read side? | expand

Commit Message

наб June 26, 2023, 1:12 a.m. UTC
Hi! (starting with get_maintainers.pl fs/splice.c,
     idk if that's right though)

Per fs/splice.c:
 * The traditional unix read/write is extended with a "splice()" operation
 * that transfers data buffers to or from a pipe buffer.
so I expect splice() to work just about the same as read()/write()
(and, to a large extent, it does so).

Thus, a refresher on pipe read() semantics
(quoting Issue 8 Draft 3; Linux when writing with write()):
60746  When attempting to read from an empty pipe or FIFO:
60747  • If no process has the pipe open for writing, read( ) shall return 0 to indicate end-of-file.
60748  • If some process has the pipe open for writing and O_NONBLOCK is set, read( ) shall return
60749    −1 and set errno to [EAGAIN].
60750  • If some process has the pipe open for writing and O_NONBLOCK is clear, read( ) shall
60751    block the calling thread until some data is written or the pipe is closed by all processes that
60752    had the pipe open for writing.

However, I've observed that this is not the case when splicing from
something that sleeps on read to a pipe, and that in that case all
readers block, /including/ ones that are reading from fds with
O_NONBLOCK set!

As an example, consider these two programs:
-- >8 --
// wr.c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
int main() {
  while (splice(0, 0, 1, 0, 128 * 1024 * 1024, 0) > 0)
    ;
  fprintf(stderr, "wr: %m\n");
}
-- >8 --

-- >8 --
// rd.c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int main() {
  fcntl(0, F_SETFL, fcntl(0, F_GETFL) | O_NONBLOCK);

  char buf[64 * 1024] = {};
  for (ssize_t rd;;) {
#if 1
    while ((rd = read(0, buf, sizeof(buf))) == -1 && errno == EINTR)
      ;
#else
    while ((rd = splice(0, 0, 1, 0, 128 * 1024 * 1024, 0)) == -1 &&
           errno == EINTR)
      ;
#endif
    fprintf(stderr, "rd=%zd: %m\n", rd);
    write(1, buf, rd);

    errno = 0;
    sleep(1);
  }
}
-- >8 --

Thus:
-- >8 --
a$ make rd wr
a$ mkfifo fifo
a$ ./rd < fifo                           b$ echo qwe > fifo
rd=4: Success
qwe
rd=0: Success
rd=0: Success                            b$ sleep 2 > fifo
rd=-1: Resource temporarily unavailable
rd=-1: Resource temporarily unavailable
rd=0: Success
rd=0: Success                            
rd=-1: Resource temporarily unavailable  b$ /bin/cat > fifo
rd=-1: Resource temporarily unavailable
rd=4: Success                            abc
abc
rd=-1: Resource temporarily unavailable
rd=4: Success                            def
def
rd=0: Success                            ^D
rd=0: Success
rd=0: Success                            b$ ./wr > fifo
-- >8 --
and nothing. Until you actually type a line (or a few) into teletype b
so that the splice completes, at which point so does the read.

An even simpler case is 
-- >8 --
$ ./wr | ./rd
abc
def
rd=8: Success
abc
def
ghi
jkl
rd=8: Success
ghi
jkl
^D
wr: Success
rd=-1: Resource temporarily unavailable
rd=0: Success
rd=0: Success
-- >8 --

splice flags don't do anything.
Tested on bookworm (6.1.27-1) and Linus' HEAD (v6.4-rc7-234-g547cc9be86f4).

You could say this is a "denial of service", since this is a valid
way of following pipes (and, sans SIGIO, the only portable one),
and this makes it so the reader is completely blocked,
and impervious to all deadly signals (incl. SIGKILL).
I've also observed strace hanging (but it responded to SIGKILL).


Rudimentary analysis shows that sys_splice() -> __do_splice() ->
do_splice() -> {!(ipipe && opipe) -> !(ipipe) -> (ipipe)}:
splice_file_to_pipe which then does
-- >8 --
	pipe_lock(opipe);
	ret = wait_for_space(opipe, flags);
	if (!ret)
		ret = do_splice_to(in, offset, opipe, len, flags);
	pipe_unlock(opipe);
-- >8 --
conversely:
-- >8 --
static ssize_t
pipe_read(struct kiocb *iocb, struct iov_iter *to)
{
        size_t total_len = iov_iter_count(to);
        struct file *filp = iocb->ki_filp;
        struct pipe_inode_info *pipe = filp->private_data;
        bool was_full, wake_next_reader = false;
        ssize_t ret;

        /* Null read succeeds. */
        if (unlikely(total_len == 0))
                return 0;

        ret = 0;
        __pipe_lock(pipe);
-- >8 --
so, naturally(?), all non-empty reads wait for the splice to end.

To validate this, I've applied the following
(which I'm 100% sure is wrong and breaks unrelated stuff):
-- >8 --
(naturally this now means it always EAGAINs even if there is data to be
 read since the splice() loop keeps it locked, but you get the picture).

Comments

Christian Brauner June 26, 2023, 9:32 a.m. UTC | #1
On Mon, Jun 26, 2023 at 03:12:09AM +0200, Ahelenia Ziemiańska wrote:
> Hi! (starting with get_maintainers.pl fs/splice.c,
>      idk if that's right though)
> 
> Per fs/splice.c:
>  * The traditional unix read/write is extended with a "splice()" operation
>  * that transfers data buffers to or from a pipe buffer.
> so I expect splice() to work just about the same as read()/write()
> (and, to a large extent, it does so).
> 
> Thus, a refresher on pipe read() semantics
> (quoting Issue 8 Draft 3; Linux when writing with write()):
> 60746  When attempting to read from an empty pipe or FIFO:
> 60747  • If no process has the pipe open for writing, read( ) shall return 0 to indicate end-of-file.
> 60748  • If some process has the pipe open for writing and O_NONBLOCK is set, read( ) shall return
> 60749    −1 and set errno to [EAGAIN].
> 60750  • If some process has the pipe open for writing and O_NONBLOCK is clear, read( ) shall
> 60751    block the calling thread until some data is written or the pipe is closed by all processes that
> 60752    had the pipe open for writing.
> 
> However, I've observed that this is not the case when splicing from
> something that sleeps on read to a pipe, and that in that case all
> readers block, /including/ ones that are reading from fds with
> O_NONBLOCK set!
> 
> As an example, consider these two programs:
> -- >8 --
> // wr.c
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <stdio.h>
> int main() {
>   while (splice(0, 0, 1, 0, 128 * 1024 * 1024, 0) > 0)
>     ;
>   fprintf(stderr, "wr: %m\n");
> }
> -- >8 --
> 
> -- >8 --
> // rd.c
> #define _GNU_SOURCE
> #include <errno.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <unistd.h>
> int main() {
>   fcntl(0, F_SETFL, fcntl(0, F_GETFL) | O_NONBLOCK);
> 
>   char buf[64 * 1024] = {};
>   for (ssize_t rd;;) {
> #if 1
>     while ((rd = read(0, buf, sizeof(buf))) == -1 && errno == EINTR)
>       ;
> #else
>     while ((rd = splice(0, 0, 1, 0, 128 * 1024 * 1024, 0)) == -1 &&
>            errno == EINTR)
>       ;
> #endif
>     fprintf(stderr, "rd=%zd: %m\n", rd);
>     write(1, buf, rd);
> 
>     errno = 0;
>     sleep(1);
>   }
> }
> -- >8 --
> 
> Thus:
> -- >8 --
> a$ make rd wr
> a$ mkfifo fifo
> a$ ./rd < fifo                           b$ echo qwe > fifo
> rd=4: Success
> qwe
> rd=0: Success
> rd=0: Success                            b$ sleep 2 > fifo
> rd=-1: Resource temporarily unavailable
> rd=-1: Resource temporarily unavailable
> rd=0: Success
> rd=0: Success                            
> rd=-1: Resource temporarily unavailable  b$ /bin/cat > fifo
> rd=-1: Resource temporarily unavailable
> rd=4: Success                            abc
> abc
> rd=-1: Resource temporarily unavailable
> rd=4: Success                            def
> def
> rd=0: Success                            ^D
> rd=0: Success
> rd=0: Success                            b$ ./wr > fifo
> -- >8 --
> and nothing. Until you actually type a line (or a few) into teletype b
> so that the splice completes, at which point so does the read.
> 
> An even simpler case is 
> -- >8 --
> $ ./wr | ./rd
> abc
> def
> rd=8: Success
> abc
> def
> ghi
> jkl
> rd=8: Success
> ghi
> jkl
> ^D
> wr: Success
> rd=-1: Resource temporarily unavailable
> rd=0: Success
> rd=0: Success
> -- >8 --
> 
> splice flags don't do anything.
> Tested on bookworm (6.1.27-1) and Linus' HEAD (v6.4-rc7-234-g547cc9be86f4).
> 
> You could say this is a "denial of service", since this is a valid
> way of following pipes (and, sans SIGIO, the only portable one),

splice() may block for any of the two file descriptors if they don't
have O_NONBLOCK set even if SPLICE_F_NONBLOCK is raised.

SPLICE_F_NONBLOCK in splice_file_to_pipe() is only relevant if the pipe
is full. If the pipe isn't full then the write is attempted. That of
course involves reading the data to splice from the source file. If the
source file isn't O_NONBLOCK that read may block holding pipe_lock().

If you raise O_NONBLOCK on the source fd in wr.c then your problems go
away. This is pretty long-standing behavior. Splice would have to be
refactored to not rely on pipe_lock(). That's likely major work with a
good portion of regressions if the past is any indication.

If you need that ability to fully async read from a pipe with splice
rn then io_uring will at least allow you to punt that read into an async
worker thread afaict.
наб June 26, 2023, 11:59 a.m. UTC | #2
On Mon, Jun 26, 2023 at 11:32:16AM +0200, Christian Brauner wrote:
> On Mon, Jun 26, 2023 at 03:12:09AM +0200, Ahelenia Ziemiańska wrote:
> > Hi! (starting with get_maintainers.pl fs/splice.c,
> >      idk if that's right though)
> > 
> > Per fs/splice.c:
> >  * The traditional unix read/write is extended with a "splice()" operation
> >  * that transfers data buffers to or from a pipe buffer.
> > so I expect splice() to work just about the same as read()/write()
> > (and, to a large extent, it does so).
> > 
> > Thus, a refresher on pipe read() semantics
> > (quoting Issue 8 Draft 3; Linux when writing with write()):
> > 60746  When attempting to read from an empty pipe or FIFO:
> > 60747  • If no process has the pipe open for writing, read( ) shall return 0 to indicate end-of-file.
> > 60748  • If some process has the pipe open for writing and O_NONBLOCK is set, read( ) shall return
> > 60749    −1 and set errno to [EAGAIN].
> > 60750  • If some process has the pipe open for writing and O_NONBLOCK is clear, read( ) shall
> > 60751    block the calling thread until some data is written or the pipe is closed by all processes that
> > 60752    had the pipe open for writing.
> > 
> > However, I've observed that this is not the case when splicing from
> > something that sleeps on read to a pipe, and that in that case all
> > readers block, /including/ ones that are reading from fds with
> > O_NONBLOCK set!
> > 
> > As an example, consider these two programs:
> > -- >8 --
> > // wr.c
> > #define _GNU_SOURCE
> > #include <fcntl.h>
> > #include <stdio.h>
> > int main() {
> >   while (splice(0, 0, 1, 0, 128 * 1024 * 1024, 0) > 0)
> >     ;
> >   fprintf(stderr, "wr: %m\n");
> > }
> > -- >8 --
> > 
> > -- >8 --
> > // rd.c
> > #define _GNU_SOURCE
> > #include <errno.h>
> > #include <fcntl.h>
> > #include <stdio.h>
> > #include <unistd.h>
> > int main() {
> >   fcntl(0, F_SETFL, fcntl(0, F_GETFL) | O_NONBLOCK);
> > 
> >   char buf[64 * 1024] = {};
> >   for (ssize_t rd;;) {
> > #if 1
> >     while ((rd = read(0, buf, sizeof(buf))) == -1 && errno == EINTR)
> >       ;
> > #else
> >     while ((rd = splice(0, 0, 1, 0, 128 * 1024 * 1024, 0)) == -1 &&
> >            errno == EINTR)
> >       ;
> > #endif
> >     fprintf(stderr, "rd=%zd: %m\n", rd);
> >     write(1, buf, rd);
> > 
> >     errno = 0;
> >     sleep(1);
> >   }
> > }
> > -- >8 --
> > 
> > Thus:
> > -- >8 --
> > a$ make rd wr
> > a$ mkfifo fifo
> > a$ ./rd < fifo                           b$ echo qwe > fifo
> > rd=4: Success
> > qwe
> > rd=0: Success
> > rd=0: Success                            b$ sleep 2 > fifo
> > rd=-1: Resource temporarily unavailable
> > rd=-1: Resource temporarily unavailable
> > rd=0: Success
> > rd=0: Success                            
> > rd=-1: Resource temporarily unavailable  b$ /bin/cat > fifo
> > rd=-1: Resource temporarily unavailable
> > rd=4: Success                            abc
> > abc
> > rd=-1: Resource temporarily unavailable
> > rd=4: Success                            def
> > def
> > rd=0: Success                            ^D
> > rd=0: Success
> > rd=0: Success                            b$ ./wr > fifo
> > -- >8 --
> > and nothing. Until you actually type a line (or a few) into teletype b
> > so that the splice completes, at which point so does the read.
> > 
> > An even simpler case is 
> > -- >8 --
> > $ ./wr | ./rd
> > abc
> > def
> > rd=8: Success
> > abc
> > def
> > ghi
> > jkl
> > rd=8: Success
> > ghi
> > jkl
> > ^D
> > wr: Success
> > rd=-1: Resource temporarily unavailable
> > rd=0: Success
> > rd=0: Success
> > -- >8 --
> > 
> > splice flags don't do anything.
> > Tested on bookworm (6.1.27-1) and Linus' HEAD (v6.4-rc7-234-g547cc9be86f4).
> > 
> > You could say this is a "denial of service", since this is a valid
> > way of following pipes (and, sans SIGIO, the only portable one),
> splice() may block for any of the two file descriptors if they don't
> have O_NONBLOCK set even if SPLICE_F_NONBLOCK is raised.
> 
> SPLICE_F_NONBLOCK in splice_file_to_pipe() is only relevant if the pipe
> is full. If the pipe isn't full then the write is attempted. That of
> course involves reading the data to splice from the source file. If the
> source file isn't O_NONBLOCK that read may block holding pipe_lock().
> 
> If you raise O_NONBLOCK on the source fd in wr.c then your problems go
> away. This is pretty long-standing behavior.
I don't see how this is relevant here. Whether the writer splice blocks
‒ or how it behaves at all ‒ doesn't matter.

The /reader/ demands non-blocking reads. Just by running a splice()
we've managed to permanently hang the reader in a way that's fully
impervious to everything.

Actually, hold that: in testing this on an actual program that relies on
this (nullmailer), I've found that trying to /open the FIFO/ also hangs
forever, in that same signal-impervious state.

To wit:
  $ ps 3766
    PID TTY      STAT   TIME COMMAND
   3766 ?        Ss     0:01 /usr/sbin/nullmailer-send
  $ ls -l /proc/3766/fd
  total 0
  lr-x------ 1 mail mail 64 Jun 14 15:03 0 -> /dev/null
  lrwx------ 1 mail mail 64 Jun 14 15:03 1 -> 'socket:[81721760]'
  lrwx------ 1 mail mail 64 Jun 14 15:03 2 -> 'socket:[81721760]'
  lr-x------ 1 mail mail 64 Apr 28 15:38 3 -> 'pipe:[81721763]'
  l-wx------ 1 mail mail 64 Jun 14 15:03 4 -> 'pipe:[81721763]'
  lr-x------ 1 mail mail 64 Jun 14 15:03 5 -> /var/spool/nullmailer/trigger
  lrwx------ 1 mail mail 64 Jun 14 15:03 9 -> /dev/null
  # cat /proc/3766/fdinfo/5
  pos:    0
  flags:  0104000
  mnt_id: 64
  ino:    393969
  # < /proc/3766/fdinfo/5 fdinfo
  O_RDONLY        O_NONBLOCK O_LARGEFILE
  # strace -yp 3766 &
  strace: Process 3766 attached
  $ strace out/cmd/cat > /var/spool/nullmailer/trigger
  [cat] (normal libc setup)
  [cat] splice(0, NULL, 1, NULL, 134217728, SPLICE_F_MOVE|SPLICE_F_MOREa
  [cat] ) = 2
  [cat] splice(0, NULL, 1, NULL, 134217728, SPLICE_F_MOVE|SPLICE_F_MORE
  [nullmailer] pselect6(6, [5</var/spool/nullmailer/trigger>], NULL, NULL, {tv_sec=86397, tv_nsec=624894145}, NULL) = 1 (in [5], left {tv_sec=86394, tv_nsec=841299215})
  [nullmailer] write(1<socket:[81721760]>, "Trigger pulled.\n", 16) = 16
  [nullmailer] read(5</var/spool/nullmailer/trigger>,
and
  $ strace -y sh -c 'echo zupa > /var/spool/nullmailer/trigger'
  (...whatever shell setup)
  rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0xf7d21bb0}, NULL, 8) = 0
  openat(AT_FDCWD, "/var/spool/nullmailer/trigger", O_WRONLY|O_CREAT|O_TRUNC, 0666

This is a "you've lost" situation to me. This system will /never/
send mail now, and any mailer program will also hang forever
(again, to wit:
   # echo zupa | strace -yfo /tmp/ss mail root
 does hang forever and /tmp/ss ends in
   16915 close(6</var/spool/nullmailer/queue>) = 0
   16915 unlink("/var/spool/nullmailer/tmp/16915") = 0
   16915 openat(AT_FDCWD, "/var/spool/nullmailer/trigger", O_WRONLY|O_NONBLOCK
 )
which means that, on this system, I will never get events from smartd
or ZED, so fuck me if I wanted to get "scrub errored" or "disk
will die soon" notifications (in pre-2.0.0 ZED this would also have
 broken autoreplace=on since it waited synchronously),
or from other monitoring, so again fuck me if I wanted to get
overheating/packet drops/whatever notifications,
or again fuck me if I wanted to get cron mail.
In many ways I've brought the system down (or will have done in like a
day once some mails go out) by sending a mail weird.


Naturally systemd stopping nullmailer failed after a few minutes with
  × nullmailer.service - Nullmailer relay-only MTA
       Loaded: loaded (/lib/systemd/system/nullmailer.service; enabled; preset: enabled)
       Active: failed (Result: timeout) since Mon 2023-06-26 13:10:02 CEST; 6min ago
     Duration: 1month 4w 10h 55min 29.666s
         Docs: man:nullmailer(7)
     Main PID: 3766
        Tasks: 1 (limit: 4673)
       Memory: 3.1M
          CPU: 1min 26.893s
       CGroup: /system.slice/nullmailer.service
               └─3766 /usr/sbin/nullmailer-send
  
  Jun 26 13:05:32 szarotka systemd[1]: nullmailer.service: State 'stop-sigterm' timed out. Killing.
  Jun 26 13:05:32 szarotka systemd[1]: nullmailer.service: Killing process 3766 (nullmailer-send) with signal SIGKILL.
  Jun 26 13:07:02 szarotka systemd[1]: nullmailer.service: Processes still around after SIGKILL. Ignoring.
  Jun 26 13:08:32 szarotka systemd[1]: nullmailer.service: State 'final-sigterm' timed out. Killing.
  Jun 26 13:08:32 szarotka systemd[1]: nullmailer.service: Killing process 3766 (nullmailer-send) with signal SIGKILL.
  Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Processes still around after final SIGKILL. Entering failed mode.
  Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Failed with result 'timeout'.
  Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Unit process 3766 (nullmailer-send) remains running after unit s>
  Jun 26 13:10:02 szarotka systemd[1]: Stopped nullmailer.service - Nullmailer relay-only MTA.
  Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Consumed 1min 26.893s CPU time.

But not to fret! Maybe we can still kill it with the cgroup! No:
  # strace -y sh -c 'echo 1 > /sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill'
  ...
  dup2(3</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>, 1) = 1</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>
  close(3</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>) = 0
  write(1</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>, "1\n", 2) = 2
  ...
This completes, sure, but doesn't do anything at all
(admittedly, I'm not a cgroup expert, but it did work on other,
 non-poisoned, cgroups, so I'd expect it to work).

Opening the FIFO with O_NONBLOCK also hangs, obviously.
Killing the splicer restores order, as expected.

> Splice would have to be
> refactored to not rely on pipe_lock(). That's likely major work with a
> good portion of regressions if the past is any indication.
That's likely; however, it ‒ or an equivalent solution ‒ would
probably be a good idea to do, on balance of all my points above,
I think.

> If you need that ability to fully async read from a pipe with splice
> rn then io_uring will at least allow you to punt that read into an async
> worker thread afaict.
I need my system to not be hanged by any user with a magic syscall.
I think you've confused the splice bit as being contentious ‒ I don't
care, and couldn't care less about how this is triggered ‒ the issue is
that a splice fully excludes /ALL OTHER/ operations on a pipe,
and zombifies all processes that try.

Consider the case where messages arrive at a collection pipe
(this is well-specified and well-used with O_DIRECT and <=PIPE_MAX,
 I don't have a demo off-hand),
or, hell, even a case where logs do:
giving any user with append capability effectively exclusive control
for as long as they want is, well, suboptimal;
you could analyse this as a stronger variant of
  https://lore.kernel.org/linux-fsdevel/fa6de786ee1241c6ba54c3cce0b980aa@AcuMS.aculab.com/t/#e74be7131861099a7f3d82d51dfc96645d26e9a94
and indeed, my original use-case was this broke tail -f,
but you know how it be.
Christian Brauner June 26, 2023, 3:56 p.m. UTC | #3
On Mon, Jun 26, 2023 at 01:59:07PM +0200, Ahelenia Ziemiańska wrote:
> On Mon, Jun 26, 2023 at 11:32:16AM +0200, Christian Brauner wrote:
> > On Mon, Jun 26, 2023 at 03:12:09AM +0200, Ahelenia Ziemiańska wrote:
> > > Hi! (starting with get_maintainers.pl fs/splice.c,
> > >      idk if that's right though)
> > > 
> > > Per fs/splice.c:
> > >  * The traditional unix read/write is extended with a "splice()" operation
> > >  * that transfers data buffers to or from a pipe buffer.
> > > so I expect splice() to work just about the same as read()/write()
> > > (and, to a large extent, it does so).
> > > 
> > > Thus, a refresher on pipe read() semantics
> > > (quoting Issue 8 Draft 3; Linux when writing with write()):
> > > 60746  When attempting to read from an empty pipe or FIFO:
> > > 60747  • If no process has the pipe open for writing, read( ) shall return 0 to indicate end-of-file.
> > > 60748  • If some process has the pipe open for writing and O_NONBLOCK is set, read( ) shall return
> > > 60749    −1 and set errno to [EAGAIN].
> > > 60750  • If some process has the pipe open for writing and O_NONBLOCK is clear, read( ) shall
> > > 60751    block the calling thread until some data is written or the pipe is closed by all processes that
> > > 60752    had the pipe open for writing.
> > > 
> > > However, I've observed that this is not the case when splicing from
> > > something that sleeps on read to a pipe, and that in that case all
> > > readers block, /including/ ones that are reading from fds with
> > > O_NONBLOCK set!
> > > 
> > > As an example, consider these two programs:
> > > -- >8 --
> > > // wr.c
> > > #define _GNU_SOURCE
> > > #include <fcntl.h>
> > > #include <stdio.h>
> > > int main() {
> > >   while (splice(0, 0, 1, 0, 128 * 1024 * 1024, 0) > 0)
> > >     ;
> > >   fprintf(stderr, "wr: %m\n");
> > > }
> > > -- >8 --
> > > 
> > > -- >8 --
> > > // rd.c
> > > #define _GNU_SOURCE
> > > #include <errno.h>
> > > #include <fcntl.h>
> > > #include <stdio.h>
> > > #include <unistd.h>
> > > int main() {
> > >   fcntl(0, F_SETFL, fcntl(0, F_GETFL) | O_NONBLOCK);
> > > 
> > >   char buf[64 * 1024] = {};
> > >   for (ssize_t rd;;) {
> > > #if 1
> > >     while ((rd = read(0, buf, sizeof(buf))) == -1 && errno == EINTR)
> > >       ;
> > > #else
> > >     while ((rd = splice(0, 0, 1, 0, 128 * 1024 * 1024, 0)) == -1 &&
> > >            errno == EINTR)
> > >       ;
> > > #endif
> > >     fprintf(stderr, "rd=%zd: %m\n", rd);
> > >     write(1, buf, rd);
> > > 
> > >     errno = 0;
> > >     sleep(1);
> > >   }
> > > }
> > > -- >8 --
> > > 
> > > Thus:
> > > -- >8 --
> > > a$ make rd wr
> > > a$ mkfifo fifo
> > > a$ ./rd < fifo                           b$ echo qwe > fifo
> > > rd=4: Success
> > > qwe
> > > rd=0: Success
> > > rd=0: Success                            b$ sleep 2 > fifo
> > > rd=-1: Resource temporarily unavailable
> > > rd=-1: Resource temporarily unavailable
> > > rd=0: Success
> > > rd=0: Success                            
> > > rd=-1: Resource temporarily unavailable  b$ /bin/cat > fifo
> > > rd=-1: Resource temporarily unavailable
> > > rd=4: Success                            abc
> > > abc
> > > rd=-1: Resource temporarily unavailable
> > > rd=4: Success                            def
> > > def
> > > rd=0: Success                            ^D
> > > rd=0: Success
> > > rd=0: Success                            b$ ./wr > fifo
> > > -- >8 --
> > > and nothing. Until you actually type a line (or a few) into teletype b
> > > so that the splice completes, at which point so does the read.
> > > 
> > > An even simpler case is 
> > > -- >8 --
> > > $ ./wr | ./rd
> > > abc
> > > def
> > > rd=8: Success
> > > abc
> > > def
> > > ghi
> > > jkl
> > > rd=8: Success
> > > ghi
> > > jkl
> > > ^D
> > > wr: Success
> > > rd=-1: Resource temporarily unavailable
> > > rd=0: Success
> > > rd=0: Success
> > > -- >8 --
> > > 
> > > splice flags don't do anything.
> > > Tested on bookworm (6.1.27-1) and Linus' HEAD (v6.4-rc7-234-g547cc9be86f4).
> > > 
> > > You could say this is a "denial of service", since this is a valid
> > > way of following pipes (and, sans SIGIO, the only portable one),
> > splice() may block for any of the two file descriptors if they don't
> > have O_NONBLOCK set even if SPLICE_F_NONBLOCK is raised.
> > 
> > SPLICE_F_NONBLOCK in splice_file_to_pipe() is only relevant if the pipe
> > is full. If the pipe isn't full then the write is attempted. That of
> > course involves reading the data to splice from the source file. If the
> > source file isn't O_NONBLOCK that read may block holding pipe_lock().
> > 
> > If you raise O_NONBLOCK on the source fd in wr.c then your problems go
> > away. This is pretty long-standing behavior.
> I don't see how this is relevant here. Whether the writer splice blocks
> ‒ or how it behaves at all ‒ doesn't matter.
> 
> The /reader/ demands non-blocking reads. Just by running a splice()
> we've managed to permanently hang the reader in a way that's fully
> impervious to everything.
> 
> Actually, hold that: in testing this on an actual program that relies on
> this (nullmailer), I've found that trying to /open the FIFO/ also hangs
> forever, in that same signal-impervious state.
> 
> To wit:
>   $ ps 3766
>     PID TTY      STAT   TIME COMMAND
>    3766 ?        Ss     0:01 /usr/sbin/nullmailer-send
>   $ ls -l /proc/3766/fd
>   total 0
>   lr-x------ 1 mail mail 64 Jun 14 15:03 0 -> /dev/null
>   lrwx------ 1 mail mail 64 Jun 14 15:03 1 -> 'socket:[81721760]'
>   lrwx------ 1 mail mail 64 Jun 14 15:03 2 -> 'socket:[81721760]'
>   lr-x------ 1 mail mail 64 Apr 28 15:38 3 -> 'pipe:[81721763]'
>   l-wx------ 1 mail mail 64 Jun 14 15:03 4 -> 'pipe:[81721763]'
>   lr-x------ 1 mail mail 64 Jun 14 15:03 5 -> /var/spool/nullmailer/trigger
>   lrwx------ 1 mail mail 64 Jun 14 15:03 9 -> /dev/null
>   # cat /proc/3766/fdinfo/5
>   pos:    0
>   flags:  0104000
>   mnt_id: 64
>   ino:    393969
>   # < /proc/3766/fdinfo/5 fdinfo
>   O_RDONLY        O_NONBLOCK O_LARGEFILE
>   # strace -yp 3766 &
>   strace: Process 3766 attached
>   $ strace out/cmd/cat > /var/spool/nullmailer/trigger
>   [cat] (normal libc setup)
>   [cat] splice(0, NULL, 1, NULL, 134217728, SPLICE_F_MOVE|SPLICE_F_MOREa
>   [cat] ) = 2
>   [cat] splice(0, NULL, 1, NULL, 134217728, SPLICE_F_MOVE|SPLICE_F_MORE
>   [nullmailer] pselect6(6, [5</var/spool/nullmailer/trigger>], NULL, NULL, {tv_sec=86397, tv_nsec=624894145}, NULL) = 1 (in [5], left {tv_sec=86394, tv_nsec=841299215})
>   [nullmailer] write(1<socket:[81721760]>, "Trigger pulled.\n", 16) = 16
>   [nullmailer] read(5</var/spool/nullmailer/trigger>,
> and
>   $ strace -y sh -c 'echo zupa > /var/spool/nullmailer/trigger'
>   (...whatever shell setup)
>   rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0xf7d21bb0}, NULL, 8) = 0
>   openat(AT_FDCWD, "/var/spool/nullmailer/trigger", O_WRONLY|O_CREAT|O_TRUNC, 0666
> 
> This is a "you've lost" situation to me. This system will /never/
> send mail now, and any mailer program will also hang forever
> (again, to wit:
>    # echo zupa | strace -yfo /tmp/ss mail root
>  does hang forever and /tmp/ss ends in
>    16915 close(6</var/spool/nullmailer/queue>) = 0
>    16915 unlink("/var/spool/nullmailer/tmp/16915") = 0
>    16915 openat(AT_FDCWD, "/var/spool/nullmailer/trigger", O_WRONLY|O_NONBLOCK
>  )
> which means that, on this system, I will never get events from smartd
> or ZED, so fuck me if I wanted to get "scrub errored" or "disk
> will die soon" notifications (in pre-2.0.0 ZED this would also have
>  broken autoreplace=on since it waited synchronously),
> or from other monitoring, so again fuck me if I wanted to get
> overheating/packet drops/whatever notifications,
> or again fuck me if I wanted to get cron mail.
> In many ways I've brought the system down (or will have done in like a
> day once some mails go out) by sending a mail weird.
> 
> 
> Naturally systemd stopping nullmailer failed after a few minutes with
>   × nullmailer.service - Nullmailer relay-only MTA
>        Loaded: loaded (/lib/systemd/system/nullmailer.service; enabled; preset: enabled)
>        Active: failed (Result: timeout) since Mon 2023-06-26 13:10:02 CEST; 6min ago
>      Duration: 1month 4w 10h 55min 29.666s
>          Docs: man:nullmailer(7)
>      Main PID: 3766
>         Tasks: 1 (limit: 4673)
>        Memory: 3.1M
>           CPU: 1min 26.893s
>        CGroup: /system.slice/nullmailer.service
>                └─3766 /usr/sbin/nullmailer-send
>   
>   Jun 26 13:05:32 szarotka systemd[1]: nullmailer.service: State 'stop-sigterm' timed out. Killing.
>   Jun 26 13:05:32 szarotka systemd[1]: nullmailer.service: Killing process 3766 (nullmailer-send) with signal SIGKILL.
>   Jun 26 13:07:02 szarotka systemd[1]: nullmailer.service: Processes still around after SIGKILL. Ignoring.
>   Jun 26 13:08:32 szarotka systemd[1]: nullmailer.service: State 'final-sigterm' timed out. Killing.
>   Jun 26 13:08:32 szarotka systemd[1]: nullmailer.service: Killing process 3766 (nullmailer-send) with signal SIGKILL.
>   Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Processes still around after final SIGKILL. Entering failed mode.
>   Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Failed with result 'timeout'.
>   Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Unit process 3766 (nullmailer-send) remains running after unit s>
>   Jun 26 13:10:02 szarotka systemd[1]: Stopped nullmailer.service - Nullmailer relay-only MTA.
>   Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Consumed 1min 26.893s CPU time.
> 
> But not to fret! Maybe we can still kill it with the cgroup! No:
>   # strace -y sh -c 'echo 1 > /sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill'
>   ...
>   dup2(3</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>, 1) = 1</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>
>   close(3</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>) = 0
>   write(1</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>, "1\n", 2) = 2
>   ...
> This completes, sure, but doesn't do anything at all
> (admittedly, I'm not a cgroup expert, but it did work on other,
>  non-poisoned, cgroups, so I'd expect it to work).
> 
> Opening the FIFO with O_NONBLOCK also hangs, obviously.
> Killing the splicer restores order, as expected.
> 
> > Splice would have to be
> > refactored to not rely on pipe_lock(). That's likely major work with a
> > good portion of regressions if the past is any indication.
> That's likely; however, it ‒ or an equivalent solution ‒ would
> probably be a good idea to do, on balance of all my points above,
> I think.

In-kernel consumers already have a way of detecting when the pipe isn't
safe for non-blocking read anymore because splice has been called and
cleared FMODE_NOWAIT.

I mean, one workaround would probably be poll() even with O_NONBLOCK but
I get why that's annoying and half of a solution.

So there are three options afaict:

(1) rewrite splice.c to kill its reliance on pipe_lock()
    Very involved and would need a splice + pipe expert.
(2) Add pipe_lock_interruptible() to stop the bleeding and give
    userspace the ability to at least kill a hanging reader.
    Also a potentially sensitive change probably regression prone.
(3) Somehow factor in FMODE_NOWAIT when acquiring pipe_lock().
    If FMODE_NOWAIT is set, try to acquire the lock and if not report
    EAGAIN otherwise proceed as before. I think Jens proposed a version
    of this a while back.

Adding Linus as well since he probably has thoughts on this.
tl;dr it by splicing from a regular file to a pipe where the regular
file in splice isn't O_NONBLOCK we can hold pipe_lock() as long as we
want and hang pipe_read() even with O_NONBLOCK unkillable trying to
acquire pipe_lock().
наб June 26, 2023, 4:14 p.m. UTC | #4
On Mon, Jun 26, 2023 at 05:56:28PM +0200, Christian Brauner wrote:
> I mean, one workaround would probably be poll() even with O_NONBLOCK but
> I get why that's annoying and half of a solution.
poll() doesn't really change anything since it turns into a hotloop of
.revents=POLLHUP with a disconnected writer, cf.
  https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#m747e2bbd0c5cffb6baaf1955f6f8b0d97e216839
The only apparently-supported way to poll pipes under linux is to
sleep()/read(O_NONBLOCK) like in the bad old days.

And even if that was a working work-around, the fundamental problem of
./spl>fifo excluding all other access to fifo is quite unfortunate too.

> tl;dr it by splicing from a regular file to a pipe where the regular
> file in splice isn't O_NONBLOCK
(Most noticeable when the "regular file" is a tty and thus the I/O never
 completes by design.)
Linus Torvalds July 6, 2023, 9:56 p.m. UTC | #5
On Mon, 26 Jun 2023 at 09:14, Ahelenia Ziemiańska
<nabijaczleweli@nabijaczleweli.xyz> wrote:
>
> And even if that was a working work-around, the fundamental problem of
> ./spl>fifo excluding all other access to fifo is quite unfortunate too.

So I already sent an earlier broken version of this patch to Ahelenia
and Christian, but here's an actually slightly tested version with the
obvious bugs fixed.

The keyword here being "obvious". It's probably broken in many
non-obvious ways, but I'm sending it out in case anybody wants to play
around with it.

It boots for me. It even changes behavior of programs that splice()
and used to keep the pipe lock over the IO and no longer do.

But it might do unspeakable things to your pets, so caveat emptor. It
"feels" right to me. But it's a rather quick hack, and really needs
more eyes and more thought. I mention O_NDELAY in the (preliminary)
commit message, but there are probably other issues that need thinking
about.

                Linus
Christian Brauner July 7, 2023, 5:21 p.m. UTC | #6
On Thu, Jul 06, 2023 at 02:56:45PM -0700, Linus Torvalds wrote:
> On Mon, 26 Jun 2023 at 09:14, Ahelenia Ziemiańska
> <nabijaczleweli@nabijaczleweli.xyz> wrote:
> >
> > And even if that was a working work-around, the fundamental problem of
> > ./spl>fifo excluding all other access to fifo is quite unfortunate too.
> 
> So I already sent an earlier broken version of this patch to Ahelenia
> and Christian, but here's an actually slightly tested version with the
> obvious bugs fixed.
> 
> The keyword here being "obvious". It's probably broken in many
> non-obvious ways, but I'm sending it out in case anybody wants to play
> around with it.
> 
> It boots for me. It even changes behavior of programs that splice()
> and used to keep the pipe lock over the IO and no longer do.
> 
> But it might do unspeakable things to your pets, so caveat emptor. It
> "feels" right to me. But it's a rather quick hack, and really needs
> more eyes and more thought. I mention O_NDELAY in the (preliminary)
> commit message, but there are probably other issues that need thinking
> about.

Forgot to say, fwiw, I've been running this through the LTP splice,
pipe, and ipc tests without issues. A hanging reader can be signaled
away cleanly with this.
Linus Torvalds July 7, 2023, 7:10 p.m. UTC | #7
On Fri, 7 Jul 2023 at 10:21, Christian Brauner <brauner@kernel.org> wrote:
>
> Forgot to say, fwiw, I've been running this through the LTP splice,
> pipe, and ipc tests without issues. A hanging reader can be signaled
> away cleanly with this.

So that patch still has a couple of "wait for this" cases remaining.

In particular, when we do a read, and we do have pipe buffers, both
the read() system call and a number of internal splice functions will
go "Ahh, I have data", and then do pipe_buf_confirm() and read it.

Which then results in pipe_buf_confirm() blocking. It now blocks
interruptibly, which is much nicer, but several of these users *could*
just do a non-blocking confirmation instead, and wait for pipe
readability.

HOWEVER, that's slightly less trivial than you'd expect, because the
"wait for readability" needs to be done without the pipe lock held -
so you can't actually check the pipe buffer state at that point (since
you need the pipe lock to look up the buffer).

That's true even of "trivial" cases like actual user-space "read()
with O_NONBLOCK and poll()" situations.

Now, the solution to all this is *fairly* straightforward:

 (a) don't use "!pipe_empty()" for a readability check.

     We already have "pipe_readable()", but it's hidden in fs/pipe.c,
so all the splice() code ended up writing the "does this pipe have
data" using "!pipe_empty()" instead.

 (b) make "pipe_buf_confirm()" take a "non-blocking" boolean argument,
and if it is non-blocking but hits one of those blocked pages, set
"pipe->not_ready", and return -EAGAIN.

     This is ok, because "pipe_buf_confirm()" is always under the pipe
lock, and we'll just clear "pipe->not_ready" under the pipe lock after
finalizing all those pages (and before waking up readers)

 (c) make "pipe_wait_readable()" and "poll()" know about this all, so
that we wait properly for a pipe that was not ready to become ready

This all makes *most* users deal properly with these blocking events.
In particular, things like splice_to_socket() can now do the whole
proper "wait without holding the pipe lock" sequence, even when the
pipe is not empty, just in this blocked state.

This *may* also make all the cases Jens had with io_uring and splicing
JustWork(tm).

NOTE! NOTE! NOTE! Once more, this "feels right to me", and I'd argue
that the basic approach is fairly straightfoward. The patch is also
not horrendous. It all makes a fair amount of sense. BUT! I haven't
tested this, and like the previous patch, I really would want people
to think about this a lot.

Comments? Jens?

                Linus
Jens Axboe July 7, 2023, 7:57 p.m. UTC | #8
On 7/7/23 1:10?PM, Linus Torvalds wrote:
> On Fri, 7 Jul 2023 at 10:21, Christian Brauner <brauner@kernel.org> wrote:
>>
>> Forgot to say, fwiw, I've been running this through the LTP splice,
>> pipe, and ipc tests without issues. A hanging reader can be signaled
>> away cleanly with this.
> 
> So that patch still has a couple of "wait for this" cases remaining.
> 
> In particular, when we do a read, and we do have pipe buffers, both
> the read() system call and a number of internal splice functions will
> go "Ahh, I have data", and then do pipe_buf_confirm() and read it.
> 
> Which then results in pipe_buf_confirm() blocking. It now blocks
> interruptibly, which is much nicer, but several of these users *could*
> just do a non-blocking confirmation instead, and wait for pipe
> readability.
> 
> HOWEVER, that's slightly less trivial than you'd expect, because the
> "wait for readability" needs to be done without the pipe lock held -
> so you can't actually check the pipe buffer state at that point (since
> you need the pipe lock to look up the buffer).
> 
> That's true even of "trivial" cases like actual user-space "read()
> with O_NONBLOCK and poll()" situations.
> 
> Now, the solution to all this is *fairly* straightforward:
> 
>  (a) don't use "!pipe_empty()" for a readability check.
> 
>      We already have "pipe_readable()", but it's hidden in fs/pipe.c,
> so all the splice() code ended up writing the "does this pipe have
> data" using "!pipe_empty()" instead.
> 
>  (b) make "pipe_buf_confirm()" take a "non-blocking" boolean argument,
> and if it is non-blocking but hits one of those blocked pages, set
> "pipe->not_ready", and return -EAGAIN.
> 
>      This is ok, because "pipe_buf_confirm()" is always under the pipe
> lock, and we'll just clear "pipe->not_ready" under the pipe lock after
> finalizing all those pages (and before waking up readers)
> 
>  (c) make "pipe_wait_readable()" and "poll()" know about this all, so
> that we wait properly for a pipe that was not ready to become ready
> 
> This all makes *most* users deal properly with these blocking events.
> In particular, things like splice_to_socket() can now do the whole
> proper "wait without holding the pipe lock" sequence, even when the
> pipe is not empty, just in this blocked state.
> 
> This *may* also make all the cases Jens had with io_uring and splicing
> JustWork(tm).

Exactly! I was reading this thread with excitement just now, would be
nice to get rid of that kludge.

> NOTE! NOTE! NOTE! Once more, this "feels right to me", and I'd argue
> that the basic approach is fairly straightfoward. The patch is also
> not horrendous. It all makes a fair amount of sense. BUT! I haven't
> tested this, and like the previous patch, I really would want people
> to think about this a lot.
> 
> Comments? Jens?

I'll take a closer look at this, but won't be until Monday most likely.
But the approach seems sane, and going in a more idiomatic direction
than before. So seems promising.
наб July 7, 2023, 10:41 p.m. UTC | #9
On Fri, Jul 07, 2023 at 12:10:36PM -0700, Linus Torvalds wrote:
> On Fri, 7 Jul 2023 at 10:21, Christian Brauner <brauner@kernel.org> wrote:
> > Forgot to say, fwiw, I've been running this through the LTP splice,
> > pipe, and ipc tests without issues. A hanging reader can be signaled
> > away cleanly with this.
> NOTE! NOTE! NOTE! Once more, this "feels right to me", and I'd argue
> that the basic approach is fairly straightfoward. The patch is also
> not horrendous. It all makes a fair amount of sense. BUT! I haven't
> tested this, and like the previous patch, I really would want people
> to think about this a lot.
> 
> Comments? Jens?
I applied the patch upthread + this diff to 4f6b6c2b2f86b7878a770736bf478d8a263ff0bc;
during test setup I got a null deref (building defconfig minus graphics).
Reproducible, full BUG dump attached; trace of
[  149.878931]  <TASK>
[  149.879533]  ? __die+0x1e/0x60
[  149.880309]  ? page_fault_oops+0x17c/0x470
[  149.881313]  ? search_module_extables+0x14/0x50
[  149.882422]  ? exc_page_fault+0x67/0x150
[  149.883397]  ? asm_exc_page_fault+0x26/0x30
[  149.884426]  ? __pfx_pipe_to_null+0x10/0x10
[  149.885451]  ? splice_from_pipe_next+0x129/0x150
[  149.886580]  __splice_from_pipe+0x39/0x1c0
[  149.887594]  ? __pfx_pipe_to_null+0x10/0x10
[  149.888615]  ? __pfx_pipe_to_null+0x10/0x10
[  149.889635]  splice_from_pipe+0x5c/0x90
[  149.890579]  do_splice+0x35c/0x840
[  149.891407]  __do_splice+0x1eb/0x210
[  149.892176]  __x64_sys_splice+0xad/0x120
[  149.893019]  do_syscall_64+0x3e/0x90
[  149.893798]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8

$ scripts/faddr2line vmlinux splice_from_pipe_next+0x129
splice_from_pipe_next+0x129/0x150:
pipe_buf_release at include/linux/pipe_fs_i.h:221
(inlined by) eat_empty_buffer at fs/splice.c:594
(inlined by) splice_from_pipe_next at fs/splice.c:640

I gamed this down to 
  echo c | grep c >/dev/null
where grep is
  ii  grep           3.8-5        amd64        GNU grep, egrep and fgrep
and strace of the same invocation (on the host) ends with
  newfstatat(1, "", {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x3), ...}, AT_EMPTY_PATH) = 0
  newfstatat(AT_FDCWD, "/dev/null", {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x3), ...}, 0) = 0
  newfstatat(0, "", {st_mode=S_IFIFO|0600, st_size=0, ...}, AT_EMPTY_PATH) = 0
  lseek(0, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
  read(0, "c\n", 98304)                   = 2
  splice(0, NULL, 1, NULL, 98304, SPLICE_F_MOVE) = 0
  close(1)                                = 0
  close(2)                                = 0
  exit_group(0)                           = ?
  +++ exited with 0 +++

And can also reproduce it with
  echo | { read -r _; exec ./wr; } > /dev/null
(where ./wr is "while (splice(0, 0, 1, 0, 128 * 1024 * 1024, 0) > 0) {}").
However:
  echo | ./wr > /dev/null
does NOT crash.


Besides that, this doesn't solve the original issue, inasmuch as
  ./v > fifo &
  head fifo &
  echo zupa > fifo
(where ./v splices from an empty pty to stdout; v.c attached)
echo still sleeps until ./v dies, though it also succumbs to ^C now.

"OTOH, on 4f6b6c2b2f86b7878a770736bf478d8a263ff0bc,
"timeout 10 ./v > fifo &" (then lines 2 and 3 as above) does
  kill ./v -> unblock echo -> head copies "zupa",
i.e. life resumes as normal after the splicer went away.

With the patches, echo zupa is stuck forever (until you signal it)!
This is kinda worse.
[  149.843966] BUG: kernel NULL pointer dereference, address: 0000000000000008
[  149.845820] #PF: supervisor read access in kernel mode
[  149.847190] #PF: error_code(0x0000) - not-present page
[  149.848540] PGD 0 P4D 0
[  149.849231] Oops: 0000 [#1] PREEMPT SMP PTI
[  149.850345] CPU: 0 PID: 230 Comm: grep Not tainted 6.4.0-12317-gabf530ed3e36-dirty #3
[  149.852411] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[  149.854900] RIP: 0010:splice_from_pipe_next+0x129/0x150
[  149.856328] Code: ff c6 45 38 00 eb af 5b b8 00 fe ff ff 5d 41 5c 41 5d c3 cc cc cc cc 48 8b 46 10 41 83 c5 01 48 89 df 48 c7 46 10 00 00 00 00 <48> 8b 40 08 e8 ce a5 9a
[  149.861118] RSP: 0018:ffffb2ed40347d70 EFLAGS: 00010202
[  149.862488] RAX: 0000000000000000 RBX: ffff8c06c1d9a0c0 RCX: 0000000000000000
[  149.864357] RDX: 0000000000000005 RSI: ffff8c06c8c98028 RDI: ffff8c06c1d9a0c0
[  149.866217] RBP: ffffb2ed40347de0 R08: 0000000000000001 R09: ffffffffaa428db0
[  149.868088] R10: 0000000000018000 R11: 0000000000000000 R12: ffff8c06c2625580
[  149.869950] R13: 0000000000000002 R14: ffff8c06c1d9a0c0 R15: ffffb2ed40347de0
[  149.871828] FS:  00007fa5a6b3e740(0000) GS:ffff8c06dd800000(0000) knlGS:0000000000000000
[  149.873937] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  149.875459] CR2: 0000000000000008 CR3: 000000000269a000 CR4: 00000000000006f0
[  149.877327] Call Trace:
[  149.878931]  <TASK>
[  149.879533]  ? __die+0x1e/0x60
[  149.880309]  ? page_fault_oops+0x17c/0x470
[  149.881313]  ? search_module_extables+0x14/0x50
[  149.882422]  ? exc_page_fault+0x67/0x150
[  149.883397]  ? asm_exc_page_fault+0x26/0x30
[  149.884426]  ? __pfx_pipe_to_null+0x10/0x10
[  149.885451]  ? splice_from_pipe_next+0x129/0x150
[  149.886580]  __splice_from_pipe+0x39/0x1c0
[  149.887594]  ? __pfx_pipe_to_null+0x10/0x10
[  149.888615]  ? __pfx_pipe_to_null+0x10/0x10
[  149.889635]  splice_from_pipe+0x5c/0x90
[  149.890579]  do_splice+0x35c/0x840
[  149.891407]  __do_splice+0x1eb/0x210
[  149.892176]  __x64_sys_splice+0xad/0x120
[  149.893019]  do_syscall_64+0x3e/0x90
[  149.893798]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  149.894881] RIP: 0033:0x7fa5a6c49dd3
[  149.895682] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 66 2e 0f 1f 84 00 00 00 00 00 90 80 3d 11 18 0d 00 00 49 89 ca 74 14 b8 13 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 74
[  149.899538] RSP: 002b:00007ffc83d77768 EFLAGS: 00000202 ORIG_RAX: 0000000000000113
[  149.901116] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fa5a6c49dd3
[  149.902602] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
[  149.904048] RBP: 0000564d8aaeb000 R08: 0000000000018000 R09: 0000000000000001
[  149.905439] R10: 0000000000000000 R11: 0000000000000202 R12: 000000000000000a
[  149.906832] R13: 0000564d8aaeb010 R14: 0000564d8aaeb000 R15: 0000000000000000
[  149.908239]  </TASK>
[  149.908692] Modules linked in:
[  149.909326] CR2: 0000000000000008
[  149.910050] ---[ end trace 0000000000000000 ]---
[  149.910986] RIP: 0010:splice_from_pipe_next+0x129/0x150
[  149.912063] Code: ff c6 45 38 00 eb af 5b b8 00 fe ff ff 5d 41 5c 41 5d c3 cc cc cc cc 48 8b 46 10 41 83 c5 01 48 89 df 48 c7 46 10 00 00 00 00 <48> 8b 40 08 e8 ce a5 9a
[  149.915639] RSP: 0018:ffffb2ed40347d70 EFLAGS: 00010202
[  149.916589] RAX: 0000000000000000 RBX: ffff8c06c1d9a0c0 RCX: 0000000000000000
[  149.917877] RDX: 0000000000000005 RSI: ffff8c06c8c98028 RDI: ffff8c06c1d9a0c0
[  149.919172] RBP: ffffb2ed40347de0 R08: 0000000000000001 R09: ffffffffaa428db0
[  149.920457] R10: 0000000000018000 R11: 0000000000000000 R12: ffff8c06c2625580
[  149.921737] R13: 0000000000000002 R14: ffff8c06c1d9a0c0 R15: ffffb2ed40347de0
[  149.923021] FS:  00007fa5a6b3e740(0000) GS:ffff8c06dd800000(0000) knlGS:0000000000000000
[  149.924481] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  149.925529] CR2: 0000000000000008 CR3: 000000000269a000 CR4: 00000000000006f0
Linus Torvalds July 7, 2023, 10:57 p.m. UTC | #10
On Fri, 7 Jul 2023 at 15:41, Ahelenia Ziemiańska
<nabijaczleweli@nabijaczleweli.xyz> wrote:
>
> (inlined by) eat_empty_buffer at fs/splice.c:594

Ahh, eat_empty_buffer() ends up releasing the buffer without waiting for it.

And the reason for that is actually somewhat interesting: we do have that

        while (!pipe_readable(pipe)) {
             ..

above it, but the logic for this all is that pipes with pipe buffers
are by *default* considered readable until they try to actually
confirm the buffer, and at that point they might say "oh, I have to
return -EAGAIN and set 'not_ready'".

And that splice_from_pipe_next() doesn't do that.

End result: it will happily free that pipe buffer that is still in the
process of being filled.

The good news is that I think the fix is probably trivial. Something
like the attached?

Again - NOT TESTED.

> Besides that, this doesn't solve the original issue, inasmuch as
>   ./v > fifo &
>   head fifo &
>   echo zupa > fifo
> (where ./v splices from an empty pty to stdout; v.c attached)
> echo still sleeps until ./v dies, though it also succumbs to ^C now.

Yeah, I concentrated on just making everything interruptible,

But the fact that the echo has to wait for the previous write to
finish is kind of fundamental. We can't just magically do writes out
of order. 'v' is busy writing to the fifo, we can't let some other
write just come in.

(We *could* make the splice in ./v not fill the whole pipe buffer, and
allow some other writes to fill in buffers afterwards, but at _some_
point you'll hit the "pipe buffers are full and busy, can't add any
more without waiting for them to empty").

One thing we could possibly do is to say that we just don't accept any
new writes if there are old busy splices in process. So we could make
new writes return -EBUSY or something, I guess.

             Linus
Matthew Wilcox (Oracle) July 8, 2023, midnight UTC | #11
On Thu, Jul 06, 2023 at 02:56:45PM -0700, Linus Torvalds wrote:
> +static int busy_pipe_buf_confirm(struct pipe_inode_info *pipe,
> +				 struct pipe_buffer *buf)
> +{
> +	struct page *page = buf->page;
> +
> +	if (folio_wait_bit_interruptible(page_folio(page), PG_locked))
> +		return -EINTR;

Do we really want interruptible here rather than killable?  That is,
do we want SIGWINCH or SIGALRM to result in a short read?  I assume
it's OK to return a short read because userspace has explicitly asked
for O_NONBLOCK and can therefore be expected to actually check the
return value from read().
Linus Torvalds July 8, 2023, 12:07 a.m. UTC | #12
On Fri, 7 Jul 2023 at 17:00, Matthew Wilcox <willy@infradead.org> wrote:
>
> Do we really want interruptible here rather than killable?

Yes, we actually do need to be just regular interruptible,

This is a bog-standard "IO to/from pipe" situation, which is interruptible.

> That is, do we want SIGWINCH or SIGALRM to result in a short read?

Now, that's a different issue, and is actually handled by the signal
layer: a signal that is ignored (where "ignored" includes the case of
"default handler") will be dropped early, exactly because we don't
want to interrupt things like tty or pipe reads when you resize the
window.

Of course, if you actually *catch* SIGWINCH, then you will get that
short read on a window change.

           Linus
Linus Torvalds July 8, 2023, 12:21 a.m. UTC | #13
On Fri, 7 Jul 2023 at 17:07, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> > That is, do we want SIGWINCH or SIGALRM to result in a short read?
>
> Now, that's a different issue, and is actually handled by the signal
> layer: a signal that is ignored (where "ignored" includes the case of
> "default handler") will be dropped early, exactly because we don't
> want to interrupt things like tty or pipe reads when you resize the
> window.

In case you care, it's prepare_signal() -> sig_ignored() ->
sig_task_ignored() -> sig_handler_ignored() logic that does this
short-circuiting.

And while I don't think it's required by POSIX, this was definitely a
case where lots of programs that *don't* use any signal handlers at
all are very much not expecting to see -EINTR as a return value, and
used to break exactly on things like SIGWINCH when reading from a tty
or a pipe.

But that logic goes back to before linux-1.0.

In fact, I think it goes back to at least 0.99.10 (June '93).

                 Linus
наб July 8, 2023, 12:30 a.m. UTC | #14
On Fri, Jul 07, 2023 at 03:57:40PM -0700, Linus Torvalds wrote:
> On Fri, 7 Jul 2023 at 15:41, Ahelenia Ziemiańska
> <nabijaczleweli@nabijaczleweli.xyz> wrote:
> > (inlined by) eat_empty_buffer at fs/splice.c:594
> Ahh, eat_empty_buffer() ends up releasing the buffer without waiting for it.
> 
> And the reason for that is actually somewhat interesting: we do have that
> 
>         while (!pipe_readable(pipe)) {
>              ..
> 
> above it, but the logic for this all is that pipes with pipe buffers
> are by *default* considered readable until they try to actually
> confirm the buffer, and at that point they might say "oh, I have to
> return -EAGAIN and set 'not_ready'".
> 
> And that splice_from_pipe_next() doesn't do that.
> 
> End result: it will happily free that pipe buffer that is still in the
> process of being filled.
> 
> The good news is that I think the fix is probably trivial. Something
> like the attached?
> 
> Again - NOT TESTED
Same reproducer, backtrace attached:
$ scripts/faddr2line vmlinux splice_from_pipe_next+0x6e
splice_from_pipe_next+0x6e/0x180:
pipe_buf_confirm at include/linux/pipe_fs_i.h:233
(inlined by) eat_empty_buffer at fs/splice.c:597
(inlined by) splice_from_pipe_next at fs/splice.c:647

Looks like the same place.

> > Besides that, this doesn't solve the original issue, inasmuch as
> >   ./v > fifo &
> >   head fifo &
> >   echo zupa > fifo
> > (where ./v splices from an empty pty to stdout; v.c attached)
> > echo still sleeps until ./v dies, though it also succumbs to ^C now.
> Yeah, I concentrated on just making everything interruptible,
> 
> But the fact that the echo has to wait for the previous write to
> finish is kind of fundamental. We can't just magically do writes out
> of order. 'v' is busy writing to the fifo, we can't let some other
> write just come in.
(It's no longer busy writing to it when it gets killed by timeout in my
 second example, but echo still doesn't wake up.)

> (We *could* make the splice in ./v not fill the whole pipe buffer, and
> allow some other writes to fill in buffers afterwards, but at _some_
> point you'll hit the "pipe buffers are full and busy, can't add any
> more without waiting for them to empty").
You are, but, well, that's also the case when the pipe is full.
As it stands, the pipe is /empty/ and yet /no-one can write to it/.
This is the crux of the issue at hand.

(Coincidentally, you're describing what looks quite similar to pt 1.
 from <naljsvzzemr6pjiwwcdpdsdh5vtycdr6fmi2fk2dlr4nn4kq6o@ycanbgxhslti>.)

I think we got away with it for so long because most files behave
like regular files/blockdevs and the read is always guaranteed to
complete ~instantly, but splice is, fundamentally, /not/ writing
the whole time because it's not /reading/ the whole time when it's
reading an empty socket/a chardev/whatever.

Or, rather: splice() from a non-seekable (non-mmap()pable?)
fundamentally doesn't really make much sense, because you're not
gluing a bit of the pro-verbial page cache (forgive me my term
abuse here, you see what I'm getting at tho) to the end of the pipe,
you're just telling the kernel to run a read()/write() loop for you.

sendfile() works around this by reading and then separately writing
to its special buffer (in the form of an anonymous process-local pipe).
splice() just raw-dogs the read with the write lock held,
but just doesn't check if it can do it.

That's how it's, honestly, shaking out to me: someone just forgot
a check the first time they wrote it.
Because the precondition for "does reading directly into the pipe
buffer make sense" is "is it directly equivalent to read(f)/write(p)",
and that holds only for seekables.

/Maybe/ for, like, sockets if there's already data, or as a special
case similar to pipe->pipe. But then for sockets you're already
using sendfile(), so?

To that end, I'm including a patch based on
  4f6b6c2b2f86b7878a770736bf478d8a263ff0bc
that does just that: EINVAL.

Maybe if you're worried about breaking compatibility
(which idk if it's an issue since splice and sendfile
 are temperamental w.r.t. what they take anyway across versions;
 you need a read()/write() fallback anyway)
that case could become sendfile-like by first reading into a buffer
once then pipe->pipe splicing out of it separately.
Though, even the usual sendfile read/write loop only works on seekables.

> One thing we could possibly do is to say that we just don't accept any
> new writes if there are old busy splices in process. So we could make
> new writes return -EBUSY or something, I guess.
Same logic as above applies + no-one's really expecting EBUSY +
what /would/ you do about EBUSY in this case?
-- >8 --
From 552d93bee8e1b8333ce84ed0ca8490f6712c5b8b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ahelenia=20Ziemia=C5=84ska?=
 <nabijaczleweli@nabijaczleweli.xyz>
Date: Sat, 8 Jul 2023 01:47:59 +0200
Subject: [PATCH] splice: file->pipe: -EINVAL if file non-seekable
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Both the logical semantic of "tie a page from the page cache to the pipe"
and the implemented semantic of "lock the pipe, then read into it"
(thus holding the write lock for as as long as the read is outstanding)
only make sense if the read is guaranteed to complete instantly.

This has been a long-standing omission and, when the read doesn't
immediately complete (because it's a tty, socket, &c.), the write lock
‒ which for pipes is a pipe-global mutex ‒ is held until,
thus excluding all mutual users of the pipe, until it does.

Refuse it. Use read()/write() in userspace instead of getting the kernel
to do it for you, badly, when there's no point to doing so.

Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wjEC_Rh8+-rtEi8C45upO-Ffw=M_i1211qS_3AvWZCbOg@mail.gmail.com/t/#u
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
---
 fs/splice.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index 004eb1c4ce31..14cf3cea1091 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1309,6 +1309,8 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out,
 	if (opipe) {
 		if (off_out)
 			return -ESPIPE;
+		if (!(in->f_mode & FMODE_LSEEK))
+			return -EINVAL;
 		if (off_in) {
 			if (!(in->f_mode & FMODE_PREAD))
 				return -EINVAL;
Linus Torvalds July 8, 2023, 8:06 p.m. UTC | #15
On Fri, 7 Jul 2023 at 17:30, Ahelenia Ziemiańska
<nabijaczleweli@nabijaczleweli.xyz> wrote:
>
> Same reproducer, backtrace attached:
> $ scripts/faddr2line vmlinux splice_from_pipe_next+0x6e
> splice_from_pipe_next+0x6e/0x180:
> pipe_buf_confirm at include/linux/pipe_fs_i.h:233

Bah. I should have looked more closely at your case.

This is a buffer without an 'ops' pointer. So it looks like was
already released.

And the reason is that the pipe was readable because there were no
writers, and I had put the

        if (!pipe->writers)
                return 0;

check in splice_from_pipe_next() in the wrong place. It needs to be
*before* the eat_empty_buffer() call.

Duh.

Anyway, while I think that fixes your NULL pointer thing, having
looked at my original patch I realized that keeping the pointer to the
pipe buffer in copy_splice_read() across the dropping of the pipe lock
is completely broken.

I thought (because I didn't remember the code) that pipe resizing will
just copy the pipe buffer pointers around. That would have made it ok
to remember a pipe buffer pointer. But it is not so. It will actually
create new pipe buffers and copy the buffer contents around.

So while fixing your NULL pointer check should be trivial, I think
that first patch is actually fundamentally broken wrt pipe resizing,
and I see no really sane way to fix it. We could add a new lock just
for that, but I don't think it's worth it.

> You are, but, well, that's also the case when the pipe is full.
> As it stands, the pipe is /empty/ and yet /no-one can write to it/.
> This is the crux of the issue at hand.

No, I think you are mis-representing things. The pipe isn't empty.
It's full of things that just aren't finalized yet.

> Or, rather: splice() from a non-seekable (non-mmap()pable?)

Please stop with the non-seekable nonsense.

Any time I see a patch like this:

> +               if (!(in->f_mode & FMODE_LSEEK))
> +                       return -EINVAL;

I will just go "that person is not competent".

This has absolutely nothing to do with seekability.

But it is possible that we need to just bite the bullet and say
"copy_splice_read() needs to use a non-blocking kiocb for the IO".

Of course, that then doesn't work, because while doing this is trivial:

  --- a/fs/splice.c
  +++ b/fs/splice.c
  @@ -364,6 +364,7 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos,
        iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
        init_sync_kiocb(&kiocb, in);
        kiocb.ki_pos = *ppos;
  +     kiocb.ki_flags |= IOCB_NOWAIT;
        ret = call_read_iter(in, &kiocb, &to);

        if (ret > 0) {

I suspectr you'll find that it makes no difference, because the tty
layer doesn't actually honor the IOCB_NOWAIT flag for various
historical reasons. In fact, the kiocb isn't even passed down to the
low-level routine, which only gets the 'struct file *', and instead it
looks at tty_io_nonblock(), which just does that legacy

        file->f_flags & O_NONBLOCK

test.

I guess combined with something like

        if (!(in->f_mode & FMODE_NOWAIT))
                return -EINVAL;

it might all work.

                Linus
наб July 9, 2023, 1:03 a.m. UTC | #16
On Sat, Jul 08, 2023 at 01:06:56PM -0700, Linus Torvalds wrote:
> On Fri, 7 Jul 2023 at 17:30, Ahelenia Ziemiańska
> <nabijaczleweli@nabijaczleweli.xyz> wrote:
> >
> > Same reproducer, backtrace attached:
> > $ scripts/faddr2line vmlinux splice_from_pipe_next+0x6e
> > splice_from_pipe_next+0x6e/0x180:
> > pipe_buf_confirm at include/linux/pipe_fs_i.h:233
> Bah. I should have looked more closely at your case.
> 
> This is a buffer without an 'ops' pointer. So it looks like was
> already released.
> 
> And the reason is that the pipe was readable because there were no
> writers, and I had put the
> 
>         if (!pipe->writers)
>                 return 0;
> 
> check in splice_from_pipe_next() in the wrong place. It needs to be
> *before* the eat_empty_buffer() call.
>
> Anyway, while I think that fixes your NULL pointer thing,
It does.

> So while fixing your NULL pointer check should be trivial, I think
> that first patch is actually fundamentally broken wrt pipe resizing,
> and I see no really sane way to fix it. We could add a new lock just
> for that, but I don't think it's worth it.
> 
> > You are, but, well, that's also the case when the pipe is full.
> > As it stands, the pipe is /empty/ and yet /no-one can write to it/.
> > This is the crux of the issue at hand.
> No, I think you are mis-representing things. The pipe isn't empty.
> It's full of things that just aren't finalized yet.
Being full of no data (as part of some hidden state)
doesn't make it any less empty, but meh; neither here not there.

> > Or, rather: splice() from a non-seekable (non-mmap()pable?)
> Please stop with the non-seekable nonsense.
> 
> Any time I see a patch like this:
> 
> > +               if (!(in->f_mode & FMODE_LSEEK))
> > +                       return -EINVAL;
> 
> I will just go "that person is not competent".
Accurate assessment.

> This has absolutely nothing to do with seekability.
Yes, and as noted, I was using it as a stand-in for "I/O won't
block" due to the above (and splice_direct_to_actor() already uses
it). Glad to see you've managed to synthesise my drivel into something
workable.

> But it is possible that we need to just bite the bullet and say
> "copy_splice_read() needs to use a non-blocking kiocb for the IO".
> 
> Of course, that then doesn't work, because while doing this is trivial:
> 
>   --- a/fs/splice.c
>   +++ b/fs/splice.c
>   @@ -364,6 +364,7 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos,
>         iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
>         init_sync_kiocb(&kiocb, in);
>         kiocb.ki_pos = *ppos;
>   +     kiocb.ki_flags |= IOCB_NOWAIT;
>         ret = call_read_iter(in, &kiocb, &to);
> 
>         if (ret > 0) {
> 
> I suspectr you'll find that it makes no difference, because the tty
> layer doesn't actually honor the IOCB_NOWAIT flag for various
> historical reasons.
Indeed, neither when splicing from a tty,
nor from a socket (same setup but socketpair(AF_UNIX, SOCK_STREAM, 0); w.c).

> In fact, the kiocb isn't even passed down to the
> low-level routine, which only gets the 'struct file *', and instead it
> looks at tty_io_nonblock(), which just does that legacy
> 
>         file->f_flags & O_NONBLOCK
> 
> test.
> 
> I guess combined with something like
> 
>         if (!(in->f_mode & FMODE_NOWAIT))
>                 return -EINVAL;
> 
> it might all work.
Yes, that makes the splice instantly -EINVAL for ttys, but doesn't
affect the socketpair() case above, which still blocks forever.

This appears to be because vfs_splice_read() does
	if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host))
		return copy_splice_read(in, ppos, pipe, len, flags);
	return in->f_op->splice_read(in, ppos, pipe, len, flags);
so the c_s_r() isn't even called for the socket, which uses
unix_stream_splice_read().

Thus,
  diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
  index 123b35ddfd71..384d5a479b4a 100644
  --- a/net/unix/af_unix.c
  +++ b/net/unix/af_unix.c
  @@ -2880,15 +2880,12 @@ static ssize_t unix_stream_splice_read(struct socket *sock,  loff_t *ppos,
                  .pipe = pipe,
                  .size = size,
                  .splice_flags = flags,
  +               .flags = MSG_DONTWAIT,
          };
  
          if (unlikely(*ppos))
                  return -ESPIPE;
  
  -       if (sock->file->f_flags & O_NONBLOCK ||
  -           flags & SPLICE_F_NONBLOCK)
  -               state.flags = MSG_DONTWAIT;
  -
          return unix_stream_read_generic(&state, false);
   }
makes the splice -EAGAIN w/o data and splice whatever's in the socket
when there is data.

git grep '\.splice_read.*=' | cut -d= -f2 | tr -s ',;[:space:]' '\n' | sort -u
gives me 27 distinct splice_read implementations and 1 cocci config.

These are simple delegations:
  nfs_file_splice_read               filemap_splice_read
  afs_file_splice_read               filemap_splice_read
  ceph_splice_read                   filemap_splice_read
  ecryptfs_splice_read_update_atime  filemap_splice_read
  ext4_file_splice_read              filemap_splice_read
  f2fs_file_splice_read              filemap_splice_read
  ntfs_file_splice_read              filemap_splice_read
  ocfs2_file_splice_read             filemap_splice_read
  orangefs_file_splice_read          filemap_splice_read
  v9fs_file_splice_read              filemap_splice_read
  xfs_file_splice_read               filemap_splice_read
  zonefs_file_splice_read            filemap_splice_read
  sock_splice_read                   copy_splice_read or a socket-specific one
  coda_file_splice_read              vfs_splice_read
  ovl_splice_read                    vfs_splice_read

vfs_splice_read calls copy_splice_read (not used as a .splice_read).

And the rest are:
01. copy_splice_read you've fixed
02. filemap_splice_read is, as I understand it, only applicable to
    files/blockdevs and already has the required semantics?

03. unix_stream_splice_read can be fixed as above

04. fuse_dev_splice_read allocates a buffer and fuse_dev_do_read()s into
    it, fuse_dev_do_read does
                 if (file->f_flags & O_NONBLOCK)
                         return -EAGAIN;
    so this is likely a similarly easy fix
05. tracing_buffers_splice_read, when it doesn't read anything does
                 ret = -EAGAIN;
                 if ((file->f_flags & O_NONBLOCK) || (flags & SPLICE_F_NONBLOCK))
                         goto out;
    and waits for at least one thing to read;
    can probably just goto out instantly
06. tracing_splice_read_pipe delegates to whatever it's tracing, and
    errors if that errored, so it's fine(?)

07. shmem_file_splice_read is always nonblocking
08. relay_file_splice_read only checks SPLICE_F_NONBLOCK to turn 0 into
    -EAGAIN so I think it also just doesn't block

09. smc_splice_read falls back to an embedded socket's splice_read,
    or uses 
                if (flags & SPLICE_F_NONBLOCK)
                        flags = MSG_DONTWAIT;
                else
                        flags = 0;
                SMC_STAT_INC(smc, splice_cnt);
                rc = smc_rx_recvmsg(smc, NULL, pipe, len, flags);
    so also probably remove the condition
10. kcm_splice_read:
10a. passes flags (SPLICE_F_...) to skb_recv_datagram(), which does
        timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
     verbatim!? and forwards them to try_recv which also checks
     for socket-style bits
10b. give it MSG_DONTWAIT, call it a day?
10c. it also passes flags to skb_splice_bits() to actually splice to the
     pipe, but that discards flags, so no change needed

11. tls_sw_splice_read I don't really understand but it does
        err = tls_rx_reader_lock(sk, ctx, flags & SPLICE_F_NONBLOCK);
    and
                err = tls_rx_rec_wait(sk, NULL, flags & SPLICE_F_NONBLOCK,
                                      true);
    (and skb_splice_bits()) so make them both true?

12. tcp_splice_read uses skb_splice_bits() and timeout governed by
        timeo = sock_rcvtimeo(sk, sock->file->f_flags & O_NONBLOCK);
    and re-tries on empty read if timeo!=0 (i.e. !(sock->file->f_flags & O_NONBLOCK));
    so turning that into true (timeo = 0) and propagating that would
	make it behave accd'g to the new semantic

Would that make sense?
наб July 9, 2023, 10:33 p.m. UTC | #17
On Sun, Jul 09, 2023 at 03:03:22AM +0200, Ahelenia Ziemiańska wrote:
> On Sat, Jul 08, 2023 at 01:06:56PM -0700, Linus Torvalds wrote:
> > I guess combined with something like
> > 
> >         if (!(in->f_mode & FMODE_NOWAIT))
> >                 return -EINVAL;
> > 
> > it might all work.
> Yes, that makes the splice instantly -EINVAL for ttys, but doesn't
> affect the socketpair() case above, which still blocks forever.

This also triggers for regular file -> pipe splices,
which is probably a no-go.


Attaching a summary diff that does all I said in the previous mail.

filemap_get_pages() does use and inspect IOCB_NOWAIT if set in
filemap_splice_read(), but it appears to not really make much sense,
inasmuch it returns EAGAIN for the first splice() from a
blockdev and then instantly return with data on the next go-around.

This doesn't really make much sense ‒ and open(2) says blockdevs
don't have O_NONBLOCK semantics, so I'm assuming this is not supposed
to be exposed to userspace ‒ so I'm not setting it in the diff.


I've tested this for:
  * tty: -EINVAL
  * socketpair(AF_UNIX, SOCK_STREAM): works as expected
	$ wc -c fifo &
	[1] 261
	$ ./af_unix ./s > fifo
	5       Success
	6454    Resource temporarily unavailable
	5       Success
	6445    Resource temporarily unavailable
	0       Success
	10 fifo
  * socket(AF_INET, SOCK_STREAM, 0): works as expected
	$ wc fifo &
	[1] 249
	$ ./tcp ./s > fifo
	23099   Resource temporarily unavailable
	7       Success
	2099    Resource temporarily unavailable
	4       Success
	1751    Resource temporarily unavailable
	3       Success
	21655   Resource temporarily unavailable
	95      Success
	19589   Resource temporarily unavailable
	0       Success
		  4      15     109 fifo
    corresponding to
	host$ nc 127.0.0.1 14640
	żupan
	asd
	ad
	asdda dasj aiojd askdl; jasiopdij as[pkdo[p askd9p ias90dk aso[pjd 890pasid90[ jaskl;dj il;asd
	^C
  * blockdev (/dev/vda): as expected (with filemap_splice_read() unchanged), copies it all
  * regular file: -EINVAL(!)

Splicers still sleep if the pipe's full, of course,
unless SPLICE_F_NONBLOCK.

Test drivers attached as well.
наб July 10, 2023, 1:22 p.m. UTC | #18
On Mon, Jul 10, 2023 at 12:33:07AM +0200, Ahelenia Ziemiańska wrote:
> On Sun, Jul 09, 2023 at 03:03:22AM +0200, Ahelenia Ziemiańska wrote:
> > On Sat, Jul 08, 2023 at 01:06:56PM -0700, Linus Torvalds wrote:
> > > I guess combined with something like
> > > 
> > >         if (!(in->f_mode & FMODE_NOWAIT))
> > >                 return -EINVAL;
> > > 
> > > it might all work.
> > Yes, that makes the splice instantly -EINVAL for ttys, but doesn't
> > affect the socketpair() case above, which still blocks forever.
> This also triggers for regular file -> pipe splices,
> which is probably a no-go.
Actually, that's only the case for regular files on some filesystems?
I originally tested on tmpfs, and now on vfat, ramfs, procfs, and sysfs,
and none have FMODE_NOWAIT set.

Conversely, ext4 and XFS files both have FMODE_NOWAIT set,
and behave like blockdevs incl. the filemap_splice_read() oddities below.

Indeed, it looks like Some filesystems
(btrfs/ext4/f2fs/ocfs2/xfs, blockdevs, /dev/{null,zero,random,urandom},
 pipes, tun/tap)
set FMODE_NOWAIT, but that's by far not All of them, so maybe
  /* File is capable of returning -EAGAIN if I/O will block */
is not the right check for regular files.

> filemap_get_pages() does use and inspect IOCB_NOWAIT if set in
> filemap_splice_read(), but it appears to not really make much sense,
> inasmuch it returns EAGAIN for the first splice() from a
> blockdev and then instantly return with data on the next go-around.
Indeed, this is inconsistent to both:
  * readv2(off=-1, RWF_NOWAIT), which always returns EAGAIN, and
  * fcntl(0, F_SETFL, O_NONBLOCK), read(), which always reads.

Thus,
> This doesn't really make much sense ‒ and open(2) says blockdevs
> don't have O_NONBLOCK semantics, so I'm assuming this is not supposed
> to be exposed to userspace ‒ so I'm not setting it in the diff.
not specifying IOCB_NOWAIT in filemap_splice_read() provides consistent
semantics to "file is read as-if it had O_NONBLOCK set".


I've tentatively updated the check to
		if (!((in->f_mode & FMODE_NOWAIT) || S_ISREG(in->f_inode->i_mode)))
(with the reasoning, as previously, that regular files don't /have/ a
 distinct O_NONBLOCK mode, because they always behave non-blockingly)
and with that
> I've tested this for:
  * regular file: works as expected
diff mbox series

Patch

diff --git a/fs/pipe.c b/fs/pipe.c
index 2d88f73f585a..a76535839d32 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -240,6 +240,11 @@  pipe_read(struct kiocb *iocb, struct iov_iter *to)
 	if (unlikely(total_len == 0))
 		return 0;
 
+	if ((filp->f_flags & O_NONBLOCK) || (iocb->ki_flags & IOCB_NOWAIT)) {
+		if (mutex_is_locked(&pipe->mutex))
+			return -EAGAIN;
+	}
+
 	ret = 0;
 	__pipe_lock(pipe);
-- >8 --
which does make the aforementioned cases less pathological