[v2,2/2] iomap: partially revert 4721a601099 (simulated directio short read on EFAULT)

Message ID 20181202181045.GS8125@magnolia
State New
Series
  • [v2,1/2] splice: don't read more than available pipe space

Commit Message

Darrick J. Wong Dec. 2, 2018, 6:10 p.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

In commit 4721a601099, we tried to fix a problem wherein directio reads
into a splice pipe will bounce EFAULT/EAGAIN all the way out to
userspace by simulating a zero-byte short read.  This happens because
some directio read implementations (xfs) will call
bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
reads, but as soon as we run out of pipe buffers that _get_pages call
returns EFAULT, which the splice code translates to EAGAIN and bounces
out to userspace.

In that commit, the iomap code catches the EFAULT and simulates a
zero-byte read, but that causes assertion errors on regular splice reads
because xfs doesn't allow short directio reads.  This causes infinite
splice() loops and assertion failures on generic/095 on overlayfs
because xfs only permits total success or total failure of a directio
operation.  The underlying issue in the pipe splice code has now been
fixed by changing the pipe splice loop to avoid reading more data
than there is space in the pipe.

Therefore, it's no longer necessary to simulate the short directio, so
remove the hack from iomap.

Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill")
Reported-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: split into two patches per hch request
---
 fs/iomap.c |    9 ---------
 1 file changed, 9 deletions(-)
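
To illustrate the splice-side fix described above (not reading more data
than the pipe has room for), here is a minimal userspace sketch.  It is not
the kernel code: the function name clamp_to_pipe_space and its parameters
are hypothetical, and it models the pipe as a fixed number of equally sized
buffer slots.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not kernel code: cap a requested read length to the
 * number of bytes that still fit into the pipe's free buffer slots.
 */
static size_t clamp_to_pipe_space(size_t len, size_t total_slots,
                                  size_t used_slots, size_t slot_size)
{
    size_t free_bytes;

    if (used_slots >= total_slots)
        return 0;   /* pipe is full: nothing can be read right now */

    free_bytes = (total_slots - used_slots) * slot_size;
    return len < free_bytes ? len : free_bytes;
}
```

With 16 slots of 4096 bytes and an empty pipe, a 76800-byte request is
capped to 65536 bytes; with all 16 slots in use it is capped to 0.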

Comments

Amir Goldstein Dec. 2, 2018, 7:37 p.m. UTC | #1
On Sun, Dec 2, 2018 at 8:10 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> In commit 4721a601099, we tried to fix a problem wherein directio reads
> into a splice pipe will bounce EFAULT/EAGAIN all the way out to
> userspace by simulating a zero-byte short read.  This happens because
> some directio read implementations (xfs) will call
> bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
> reads, but as soon as we run out of pipe buffers that _get_pages call
> returns EFAULT, which the splice code translates to EAGAIN and bounces
> out to userspace.
>
> In that commit, the iomap code catches the EFAULT and simulates a
> zero-byte read, but that causes assertion errors on regular splice reads
> because xfs doesn't allow short directio reads.  This causes infinite
> splice() loops and assertion failures on generic/095 on overlayfs
> because xfs only permits total success or total failure of a directio
> operation.  The underlying issue in the pipe splice code has now been
> fixed by changing the pipe splice loop to avoid reading more data
> than there is space in the pipe.
>
> Therefore, it's no longer necessary to simulate the short directio, so
> remove the hack from iomap.
>
> Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill")
> Reported-by: Amir Goldstein <amir73il@gmail.com>

Wasn't me. I believe it was Murphy Zhou <jencce.kernel@gmail.com>.
If you want you can add Ranted-by Amir ;-)

Anyway, looks fine.

> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v2: split into two patches per hch request
> ---
>  fs/iomap.c |    9 ---------
>  1 file changed, 9 deletions(-)
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 3ffb776fbebe..d6bc98ae8d35 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1877,15 +1877,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>                                 dio->wait_for_completion = true;
>                                 ret = 0;
>                         }
> -
> -                       /*
> -                        * Splicing to pipes can fail on a full pipe. We have to
> -                        * swallow this to make it look like a short IO
> -                        * otherwise the higher splice layers will completely
> -                        * mishandle the error and stop moving data.
> -                        */
> -                       if (ret == -EFAULT)
> -                               ret = 0;
>                         break;
>                 }
>                 pos += ret;
Andreas Grünbacher Aug. 21, 2019, 8:23 p.m. UTC | #2
Hi Darrick,

On Sun, Dec 2, 2018 at 7:13 PM Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> In commit 4721a601099, we tried to fix a problem wherein directio reads
> into a splice pipe will bounce EFAULT/EAGAIN all the way out to
> userspace by simulating a zero-byte short read.  This happens because
> some directio read implementations (xfs) will call
> bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
> reads, but as soon as we run out of pipe buffers that _get_pages call
> returns EFAULT, which the splice code translates to EAGAIN and bounces
> out to userspace.
>
> In that commit, the iomap code catches the EFAULT and simulates a
> zero-byte read, but that causes assertion errors on regular splice reads
> because xfs doesn't allow short directio reads.  This causes infinite
> splice() loops and assertion failures on generic/095 on overlayfs
> because xfs only permits total success or total failure of a directio
> operation.  The underlying issue in the pipe splice code has now been
> fixed by changing the pipe splice loop to avoid reading more data
> than there is space in the pipe.
>
> Therefore, it's no longer necessary to simulate the short directio, so
> remove the hack from iomap.
>
> Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill")
> Reported-by: Amir Goldstein <amir73il@gmail.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v2: split into two patches per hch request
> ---
>  fs/iomap.c |    9 ---------
>  1 file changed, 9 deletions(-)
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 3ffb776fbebe..d6bc98ae8d35 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1877,15 +1877,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>                                 dio->wait_for_completion = true;
>                                 ret = 0;
>                         }
> -
> -                       /*
> -                        * Splicing to pipes can fail on a full pipe. We have to
> -                        * swallow this to make it look like a short IO
> -                        * otherwise the higher splice layers will completely
> -                        * mishandle the error and stop moving data.
> -                        */
> -                       if (ret == -EFAULT)
> -                               ret = 0;
>                         break;
>                 }
>                 pos += ret;

I'm afraid this breaks the following test case on xfs and gfs2, the
two current users of iomap_dio_rw.

Here, the splice system call fails with errno = EAGAIN when trying to
"move data" from a file opened with O_DIRECT into a pipe.

The test case can be run with option -d to not use O_DIRECT, which
makes the test succeed.

The -r option switches from reading from the pipe sequentially to
reading concurrently with the splice, which doesn't change the
behavior.

Any thoughts?

Thanks,
Andreas

=================================== 8< ===================================
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>
#include <fcntl.h>
#include <err.h>

#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <string.h>
#include <errno.h>

#define SECTOR_SIZE 512
#define BUFFER_SIZE (150 * SECTOR_SIZE)

void read_from_pipe(int fd, const char *filename, size_t size)
{
    char buffer[SECTOR_SIZE];
    size_t sz;
    ssize_t ret;

    while (size) {
        sz = size;
        if (sz > sizeof buffer)
            sz = sizeof buffer;
        ret = read(fd, buffer, sz);
        if (ret < 0)
            err(1, "read: %s", filename);
        if (ret == 0) {
            fprintf(stderr, "read: %s: unexpected EOF\n", filename);
            exit(1);
        }
        size -= ret;  /* read() may return fewer than sz bytes */
    }
}

void do_splice1(int fd, const char *filename, size_t size)
{
    bool retried = false;
    int pipefd[2];

    if (pipe(pipefd) == -1)
        err(1, "pipe");
    while (size) {
        ssize_t spliced;

        spliced = splice(fd, NULL, pipefd[1], NULL, size, SPLICE_F_MOVE);
        if (spliced == -1) {
            if (errno == EAGAIN && !retried) {
                retried = true;
                fprintf(stderr, "retrying splice\n");
                sleep(1);
                continue;
            }
            err(1, "splice");
        }
        read_from_pipe(pipefd[0], filename, spliced);
        size -= spliced;
    }
    close(pipefd[0]);
    close(pipefd[1]);
}

void do_splice2(int fd, const char *filename, size_t size)
{
    bool retried = false;
    int pipefd[2];
    int pid;

    if (pipe(pipefd) == -1)
        err(1, "pipe");

    pid = fork();
    if (pid == -1)
        err(1, "fork");
    if (pid == 0) {
        close(pipefd[1]);
        read_from_pipe(pipefd[0], filename, size);
        exit(0);
    } else {
        close(pipefd[0]);
        while (size) {
            ssize_t spliced;

            spliced = splice(fd, NULL, pipefd[1], NULL, size, SPLICE_F_MOVE);
            if (spliced == -1) {
                if (errno == EAGAIN && !retried) {
                    retried = true;
                    fprintf(stderr, "retrying splice\n");
                    sleep(1);
                    continue;
                }
                err(1, "splice");
            }
            size -= spliced;
        }
        close(pipefd[1]);
        waitpid(pid, NULL, 0);
    }
}

void usage(const char *argv0)
{
    fprintf(stderr, "USAGE: %s [-rd] {filename}\n", basename(argv0));
    exit(2);
}

int main(int argc, char *argv[])
{
    void (*do_splice)(int fd, const char *filename, size_t size);
    const char *filename;
    char *buffer;
    int opt, open_flags, fd;
    ssize_t ret;

    do_splice = do_splice1;
    open_flags = O_CREAT | O_TRUNC | O_RDWR | O_DIRECT;

    while ((opt = getopt(argc, argv, "rd")) != -1) {
        switch(opt) {
        case 'r':
            do_splice = do_splice2;
            break;
        case 'd':
            open_flags &= ~O_DIRECT;
            break;
        default:  /* '?' */
            usage(argv[0]);
        }
    }

    if (optind >= argc)
        usage(argv[0]);
    filename = argv[optind];

    printf("%s reader %s O_DIRECT\n",
           do_splice == do_splice1 ? "sequential" : "concurrent",
           (open_flags & O_DIRECT) ? "with" : "without");

    buffer = aligned_alloc(SECTOR_SIZE, BUFFER_SIZE);
    if (buffer == NULL)
        err(1, "aligned_alloc");

    fd = open(filename, open_flags, 0666);
    if (fd == -1)
        err(1, "open: %s", filename);

    memset(buffer, 'x', BUFFER_SIZE);
    ret = write(fd, buffer, BUFFER_SIZE);
    if (ret < 0)
        err(1, "write: %s", filename);
    if (ret != BUFFER_SIZE) {
        fprintf(stderr, "%s: short write\n", filename);
        exit(1);
    }

    ret = lseek(fd, 0, SEEK_SET);
    if (ret != 0)
        err(1, "lseek: %s", filename);

    do_splice(fd, filename, BUFFER_SIZE);

    if (unlink(filename) == -1)
        err(1, "unlink: %s", filename);

    return 0;
}
=================================== 8< ===================================
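
A note on draining the pipe in a test like the one above: read() on a pipe
may return fewer bytes than requested, so the remaining count should be
reduced by the value read() returned rather than by the requested size.  A
minimal sketch, using a hypothetical helper name (drain_fd) that is not part
of the test program:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: read exactly `size` bytes from fd and discard them.
 * Returns 0 on success, -1 on a read error (errno set by read()) or on
 * premature EOF (errno set to 0).
 */
static int drain_fd(int fd, size_t size)
{
    char buffer[512];

    while (size > 0) {
        size_t want = size < sizeof(buffer) ? size : sizeof(buffer);
        ssize_t got = read(fd, buffer, want);

        if (got < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data: retry */
            return -1;
        }
        if (got == 0) {
            errno = 0;          /* EOF before `size` bytes arrived */
            return -1;
        }
        size -= (size_t)got;    /* count what was read, not what was asked */
    }
    return 0;
}
```

The helper only consumes data; a caller would still need to compare the
total against what splice() reported having moved.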
Darrick J. Wong Aug. 28, 2019, 2:23 p.m. UTC | #3
On Wed, Aug 21, 2019 at 10:23:49PM +0200, Andreas Grünbacher wrote:
> Hi Darrick,
> 
> On Sun, Dec 2, 2018 at 7:13 PM Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > In commit 4721a601099, we tried to fix a problem wherein directio reads
> > into a splice pipe will bounce EFAULT/EAGAIN all the way out to
> > userspace by simulating a zero-byte short read.  This happens because
> > some directio read implementations (xfs) will call
> > bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
> > reads, but as soon as we run out of pipe buffers that _get_pages call
> > returns EFAULT, which the splice code translates to EAGAIN and bounces
> > out to userspace.
> >
> > In that commit, the iomap code catches the EFAULT and simulates a
> > zero-byte read, but that causes assertion errors on regular splice reads
> > because xfs doesn't allow short directio reads.  This causes infinite
> > splice() loops and assertion failures on generic/095 on overlayfs
> > because xfs only permits total success or total failure of a directio
> > operation.  The underlying issue in the pipe splice code has now been
> > fixed by changing the pipe splice loop to avoid reading more data
> > than there is space in the pipe.
> >
> > Therefore, it's no longer necessary to simulate the short directio, so
> > remove the hack from iomap.
> >
> > Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill")
> > Reported-by: Amir Goldstein <amir73il@gmail.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> > v2: split into two patches per hch request
> > ---
> >  fs/iomap.c |    9 ---------
> >  1 file changed, 9 deletions(-)
> >
> > diff --git a/fs/iomap.c b/fs/iomap.c
> > index 3ffb776fbebe..d6bc98ae8d35 100644
> > --- a/fs/iomap.c
> > +++ b/fs/iomap.c
> > @@ -1877,15 +1877,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >                                 dio->wait_for_completion = true;
> >                                 ret = 0;
> >                         }
> > -
> > -                       /*
> > -                        * Splicing to pipes can fail on a full pipe. We have to
> > -                        * swallow this to make it look like a short IO
> > -                        * otherwise the higher splice layers will completely
> > -                        * mishandle the error and stop moving data.
> > -                        */
> > -                       if (ret == -EFAULT)
> > -                               ret = 0;
> >                         break;
> >                 }
> >                 pos += ret;
> 
> I'm afraid this breaks the following test case on xfs and gfs2, the
> two current users of iomap_dio_rw.

Hmm, I had kinda wondered if regular pipes still needed this help.
Evidently we don't have a lot of splice tests in fstests. :(

> Here, the splice system call fails with errno = EAGAIN when trying to
> "move data" from a file opened with O_DIRECT into a pipe.
> 
> The test case can be run with option -d to not use O_DIRECT, which
> makes the test succeed.
> 
> The -r option switches from reading from the pipe sequentially to
> reading concurrently with the splice, which doesn't change the
> behavior.
> 
> Any thoughts?

This would be great as an xfstest! :)

Do you have one ready to go, or should I just make one from the source
code?

--D

Andreas Grünbacher Aug. 28, 2019, 2:37 p.m. UTC | #4
On Wed, Aug 28, 2019 at 4:23 PM Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Wed, Aug 21, 2019 at 10:23:49PM +0200, Andreas Grünbacher wrote:
> > Hi Darrick,
> >
> > On Sun, Dec 2, 2018 at 7:13 PM Darrick J. Wong
> > <darrick.wong@oracle.com> wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > >
> > > In commit 4721a601099, we tried to fix a problem wherein directio reads
> > > into a splice pipe will bounce EFAULT/EAGAIN all the way out to
> > > userspace by simulating a zero-byte short read.  This happens because
> > > some directio read implementations (xfs) will call
> > > bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
> > > reads, but as soon as we run out of pipe buffers that _get_pages call
> > > returns EFAULT, which the splice code translates to EAGAIN and bounces
> > > out to userspace.
> > >
> > > In that commit, the iomap code catches the EFAULT and simulates a
> > > zero-byte read, but that causes assertion errors on regular splice reads
> > > because xfs doesn't allow short directio reads.  This causes infinite
> > > splice() loops and assertion failures on generic/095 on overlayfs
> > > because xfs only permits total success or total failure of a directio
> > > operation.  The underlying issue in the pipe splice code has now been
> > > fixed by changing the pipe splice loop to avoid reading more data
> > > than there is space in the pipe.
> > >
> > > Therefore, it's no longer necessary to simulate the short directio, so
> > > remove the hack from iomap.
> > >
> > > Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill")
> > > Reported-by: Amir Goldstein <amir73il@gmail.com>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > > v2: split into two patches per hch request
> > > ---
> > >  fs/iomap.c |    9 ---------
> > >  1 file changed, 9 deletions(-)
> > >
> > > diff --git a/fs/iomap.c b/fs/iomap.c
> > > index 3ffb776fbebe..d6bc98ae8d35 100644
> > > --- a/fs/iomap.c
> > > +++ b/fs/iomap.c
> > > @@ -1877,15 +1877,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >                                 dio->wait_for_completion = true;
> > >                                 ret = 0;
> > >                         }
> > > -
> > > -                       /*
> > > -                        * Splicing to pipes can fail on a full pipe. We have to
> > > -                        * swallow this to make it look like a short IO
> > > -                        * otherwise the higher splice layers will completely
> > > -                        * mishandle the error and stop moving data.
> > > -                        */
> > > -                       if (ret == -EFAULT)
> > > -                               ret = 0;
> > >                         break;
> > >                 }
> > >                 pos += ret;
> >
> > I'm afraid this breaks the following test case on xfs and gfs2, the
> > two current users of iomap_dio_rw.
>
> Hmm, I had kinda wondered if regular pipes still needed this help.
> Evidently we don't have a lot of splice tests in fstests. :(

So what do you suggest as a fix?

> > Here, the splice system call fails with errno = EAGAIN when trying to
> > "move data" from a file opened with O_DIRECT into a pipe.
> >
> > The test case can be run with option -d to not use O_DIRECT, which
> > makes the test succeed.
> >
> > The -r option switches from reading from the pipe sequentially to
> > reading concurrently with the splice, which doesn't change the
> > behavior.
> >
> > Any thoughts?
>
> This would be great as an xfstest! :)

Or perhaps something generalized from it.

> Do you have one ready to go, or should I just make one from the source
> code?

The bug was originally triggered in our internal cluster test system, and
I recreated the included test case from the strace. That's all
I have for now; feel free to take it, of course.

It could be that the same condition can be triggered with one of the
existing utilities (fio/fsstress/...).

Thanks,
Andreas
Zorro Lang Aug. 29, 2019, 1:36 a.m. UTC | #5
On Wed, Aug 28, 2019 at 07:23:32AM -0700, Darrick J. Wong wrote:
> On Wed, Aug 21, 2019 at 10:23:49PM +0200, Andreas Grünbacher wrote:
> > Hi Darrick,
> > 
> > On Sun, Dec 2, 2018 at 7:13 PM Darrick J. Wong
> > <darrick.wong@oracle.com> wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > >
> > > In commit 4721a601099, we tried to fix a problem wherein directio reads
> > > into a splice pipe will bounce EFAULT/EAGAIN all the way out to
> > > userspace by simulating a zero-byte short read.  This happens because
> > > some directio read implementations (xfs) will call
> > > bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
> > > reads, but as soon as we run out of pipe buffers that _get_pages call
> > > returns EFAULT, which the splice code translates to EAGAIN and bounces
> > > out to userspace.
> > >
> > > In that commit, the iomap code catches the EFAULT and simulates a
> > > zero-byte read, but that causes assertion errors on regular splice reads
> > > because xfs doesn't allow short directio reads.  This causes infinite
> > > splice() loops and assertion failures on generic/095 on overlayfs
> > > because xfs only permits total success or total failure of a directio
> > > operation.  The underlying issue in the pipe splice code has now been
> > > fixed by changing the pipe splice loop to avoid reading more data
> > > than there is space in the pipe.
> > >
> > > Therefore, it's no longer necessary to simulate the short directio, so
> > > remove the hack from iomap.
> > >
> > > Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill")
> > > Reported-by: Amir Goldstein <amir73il@gmail.com>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > > v2: split into two patches per hch request
> > > ---
> > >  fs/iomap.c |    9 ---------
> > >  1 file changed, 9 deletions(-)
> > >
> > > diff --git a/fs/iomap.c b/fs/iomap.c
> > > index 3ffb776fbebe..d6bc98ae8d35 100644
> > > --- a/fs/iomap.c
> > > +++ b/fs/iomap.c
> > > @@ -1877,15 +1877,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >                                 dio->wait_for_completion = true;
> > >                                 ret = 0;
> > >                         }
> > > -
> > > -                       /*
> > > -                        * Splicing to pipes can fail on a full pipe. We have to
> > > -                        * swallow this to make it look like a short IO
> > > -                        * otherwise the higher splice layers will completely
> > > -                        * mishandle the error and stop moving data.
> > > -                        */
> > > -                       if (ret == -EFAULT)
> > > -                               ret = 0;
> > >                         break;
> > >                 }
> > >                 pos += ret;
> > 
> > I'm afraid this breaks the following test case on xfs and gfs2, the
> > two current users of iomap_dio_rw.
> 
> Hmm, I had kinda wondered if regular pipes still needed this help.
> Evidently we don't have a lot of splice tests in fstests. :(
> 
> > Here, the splice system call fails with errno = EAGAIN when trying to
> > "move data" from a file opened with O_DIRECT into a pipe.
> > 
> > The test case can be run with option -d to not use O_DIRECT, which
> > makes the test succeed.
> > 
> > The -r option switches from reading from the pipe sequentially to
> > reading concurrently with the splice, which doesn't change the
> > behavior.
> > 
> > Any thoughts?
> 
> This would be great as an xfstest! :)

JFYI, I added a splice operation to fsstress, and a long time ago I tried
to add a splice operation to xfs_io:

https://marc.info/?l=linux-xfs&m=155828702128047&w=2

But it hasn't been merged. If you have any suggestions, please feel free to
review it :)

Thanks,
Zorro

Darrick J. Wong Aug. 29, 2019, 3:12 a.m. UTC | #6
On Wed, Aug 28, 2019 at 04:37:59PM +0200, Andreas Grünbacher wrote:
> On Wed, Aug 28, 2019 at 4:23 PM Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Wed, Aug 21, 2019 at 10:23:49PM +0200, Andreas Grünbacher wrote:
> > > Hi Darrick,
> > >
> > > On Sun, Dec 2, 2018 at 7:13 PM Darrick J. Wong
> > > <darrick.wong@oracle.com> wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > >
> > > > In commit 4721a601099, we tried to fix a problem wherein directio reads
> > > > into a splice pipe will bounce EFAULT/EAGAIN all the way out to
> > > > userspace by simulating a zero-byte short read.  This happens because
> > > > some directio read implementations (xfs) will call
> > > > bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
> > > > reads, but as soon as we run out of pipe buffers that _get_pages call
> > > > returns EFAULT, which the splice code translates to EAGAIN and bounces
> > > > out to userspace.
> > > >
> > > > In that commit, the iomap code catches the EFAULT and simulates a
> > > > zero-byte read, but that causes assertion errors on regular splice reads
> > > > because xfs doesn't allow short directio reads.  This causes infinite
> > > > splice() loops and assertion failures on generic/095 on overlayfs
> > > > because xfs only permits total success or total failure of a directio
> > > > operation.  The underlying issue in the pipe splice code has now been
> > > > fixed by changing the pipe splice loop to avoid reading more data
> > > > than there is space in the pipe.
> > > >
> > > > Therefore, it's no longer necessary to simulate the short directio, so
> > > > remove the hack from iomap.
> > > >
> > > > Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill")
> > > > Reported-by: Amir Goldstein <amir73il@gmail.com>
> > > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > > v2: split into two patches per hch request
> > > > ---
> > > >  fs/iomap.c |    9 ---------
> > > >  1 file changed, 9 deletions(-)
> > > >
> > > > diff --git a/fs/iomap.c b/fs/iomap.c
> > > > index 3ffb776fbebe..d6bc98ae8d35 100644
> > > > --- a/fs/iomap.c
> > > > +++ b/fs/iomap.c
> > > > @@ -1877,15 +1877,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > >                                 dio->wait_for_completion = true;
> > > >                                 ret = 0;
> > > >                         }
> > > > -
> > > > -                       /*
> > > > -                        * Splicing to pipes can fail on a full pipe. We have to
> > > > -                        * swallow this to make it look like a short IO
> > > > -                        * otherwise the higher splice layers will completely
> > > > -                        * mishandle the error and stop moving data.
> > > > -                        */
> > > > -                       if (ret == -EFAULT)
> > > > -                               ret = 0;
> > > >                         break;
> > > >                 }
> > > >                 pos += ret;
> > >
> > > I'm afraid this breaks the following test case on xfs and gfs2, the
> > > two current users of iomap_dio_rw.
> >
> > Hmm, I had kinda wondered if regular pipes still needed this help.
> > Evidently we don't have a lot of splice tests in fstests. :(
> 
> So what do you suggest as a fix?

(See below)

> > > Here, the splice system call fails with errno = EAGAIN when trying to
> > > "move data" from a file opened with O_DIRECT into a pipe.
> > >
> > > The test case can be run with option -d to not use O_DIRECT, which
> > > makes the test succeed.
> > >
> > > The -r option switches from reading from the pipe sequentially to
> > > reading concurrently with the splice, which doesn't change the
> > > behavior.
> > >
> > > Any thoughts?
> >
> > This would be great as an xfstest! :)
> 
> Or perhaps something generalized from it.
> 
> > Do you have one ready to go, or should I just make one from the source
> > code?
> 
> The bug originally triggered in our internal cluster test system and
> I've recreated the test case I've included from the strace. That's all
> I have for now; feel free to take it, of course.
> 
> It could be that the same condition can be triggered with one of the
> existing utilities (fio/fsstress/...).

Hm, so I made an xfstest out of the program you sent me, and indeed
reverting that chunk makes the failure go away, but that got me
wondering -- that iomap kludge was a workaround for the splice code
telling iomap to try to stuff XXXX bytes into a pipe that only has X
bytes of free buffer space.  We fixed splice_direct_to_actor to clamp
the length parameter to the available pipe space, but we never did the
same to do_splice:

	/* Don't try to read more the pipe has space for. */
	read_len = min_t(size_t, len,
			 (pipe->buffers - pipe->nrbufs) << PAGE_SHIFT);
	ret = do_splice_to(in, &pos, pipe, read_len, flags);

Applying similar logic to the two (opipe != NULL) cases of do_splice()
seems to make the EAGAIN problem go away too.  So why don't we teach
do_splice to only ask for as many bytes as the pipe has space for here
too?
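
To make the clamping concrete, here is a minimal userspace sketch of the
arithmetic only; it is not the kernel code.  It assumes 4 KiB pages, and
the names SKETCH_PAGE_SHIFT, pipe_free_bytes and clamp_splice_len are
hypothetical, with buffers and nrbufs standing in for pipe->buffers and
pipe->nrbufs:

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: 4 KiB pages, i.e. a page shift of 12. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Hypothetical helper: free space, in bytes, of a pipe that has
 * `buffers` page-sized slots of which `nrbufs` are currently in use.
 */
static size_t pipe_free_bytes(unsigned int buffers, unsigned int nrbufs)
{
	assert(nrbufs <= buffers);
	/* Widen before shifting so the byte count cannot wrap in 32 bits. */
	return (size_t)(buffers - nrbufs) << SKETCH_PAGE_SHIFT;
}

/*
 * Hypothetical helper: never ask the reader for more bytes than the
 * pipe can currently hold, so a full pipe yields a short read rather
 * than an error.
 */
static size_t clamp_splice_len(size_t len, unsigned int buffers,
			       unsigned int nrbufs)
{
	size_t space = pipe_free_bytes(buffers, nrbufs);

	return len < space ? len : space;
}
```

For a pipe with 16 buffers of which 14 are in use,
clamp_splice_len(1048576, 16, 14) yields 8192, so the read would be asked
for two pages instead of 1 MiB.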

Does the following patch fix it for you?

--D

From: Darrick J. Wong <darrick.wong@oracle.com>
Subject: [PATCH] splice: only read in as much information as there is pipe buffer space

Andreas Gruenbacher reports that on the two filesystems that support
iomap directio, it's possible for splice() to return -EAGAIN (instead of
a short splice) if the pipe being written to has less space available in
its pipe buffers than the length supplied by the calling process.

Months ago we fixed splice_direct_to_actor to clamp the length of the
read request to the size of the splice pipe.  Do the same to do_splice.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/splice.c |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/splice.c b/fs/splice.c
index 98412721f056..50335515d7c1 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1101,6 +1101,7 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 	struct pipe_inode_info *ipipe;
 	struct pipe_inode_info *opipe;
 	loff_t offset;
+	unsigned int pipe_pages;
 	long ret;
 
 	ipipe = get_pipe_info(in);
@@ -1123,6 +1124,10 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 		if ((in->f_flags | out->f_flags) & O_NONBLOCK)
 			flags |= SPLICE_F_NONBLOCK;
 
+		/* Don't try to read more the pipe has space for. */
+		pipe_pages = opipe->buffers - opipe->nrbufs;
+		len = min_t(size_t, len, pipe_pages << PAGE_SHIFT);
+
 		return splice_pipe_to_pipe(ipipe, opipe, len, flags);
 	}
 
@@ -1180,8 +1185,13 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 
 		pipe_lock(opipe);
 		ret = wait_for_space(opipe, flags);
-		if (!ret)
+		if (!ret) {
+			/* Don't try to read more the pipe has space for. */
+			pipe_pages = opipe->buffers - opipe->nrbufs;
+			len = min_t(size_t, len, pipe_pages << PAGE_SHIFT);
+
 			ret = do_splice_to(in, &offset, opipe, len, flags);
+		}
 		pipe_unlock(opipe);
 		if (ret > 0)
 			wakeup_pipe_readers(opipe);
Andreas Grünbacher Aug. 29, 2019, 11:49 a.m. UTC | #7
Hi Darrick,

On Thu, Aug 29, 2019 at 5:12 AM Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> Hm, so I made an xfstest out of the program you sent me, and indeed
> reverting that chunk makes the failure go away, but that got me
> wondering -- that iomap kludge was a workaround for the splice code
> telling iomap to try to stuff XXXX bytes into a pipe that only has X
> bytes of free buffer space.  We fixed splice_direct_to_actor to clamp
> the length parameter to the available pipe space, but we never did the
> same to do_splice:
>
>         /* Don't try to read more the pipe has space for. */
>         read_len = min_t(size_t, len,
>                          (pipe->buffers - pipe->nrbufs) << PAGE_SHIFT);
>         ret = do_splice_to(in, &pos, pipe, read_len, flags);
>
> Applying similar logic to the two (opipe != NULL) cases of do_splice()
> seems to make the EAGAIN problem go away too.  So why don't we teach
> do_splice to only ask for as many bytes as the pipe has space for here
> too?
>
> Does the following patch fix it for you?

Yes, that works, thank you.

> From: Darrick J. Wong <darrick.wong@oracle.com>
> Subject: [PATCH] splice: only read in as much information as there is pipe buffer space
>
> Andreas Gruenbacher reports that on the two filesystems that support
> iomap directio, it's possible for splice() to return -EAGAIN (instead of
> a short splice) if the pipe being written to has less space available in
> its pipe buffers than the length supplied by the calling process.
>
> Months ago we fixed splice_direct_to_actor to clamp the length of the
> read request to the size of the splice pipe.  Do the same to do_splice.

Can you add a reference to that commit here (17614445576b6)?

> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/splice.c |   12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/fs/splice.c b/fs/splice.c
> index 98412721f056..50335515d7c1 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -1101,6 +1101,7 @@ static long do_splice(struct file *in, loff_t __user *off_in,
>         struct pipe_inode_info *ipipe;
>         struct pipe_inode_info *opipe;
>         loff_t offset;
> +       unsigned int pipe_pages;
>         long ret;
>
>         ipipe = get_pipe_info(in);
> @@ -1123,6 +1124,10 @@ static long do_splice(struct file *in, loff_t __user *off_in,
>                 if ((in->f_flags | out->f_flags) & O_NONBLOCK)
>                         flags |= SPLICE_F_NONBLOCK;
>
> +               /* Don't try to read more the pipe has space for. */
> +               pipe_pages = opipe->buffers - opipe->nrbufs;
> +               len = min_t(size_t, len, pipe_pages << PAGE_SHIFT);

This should probably be min(len, (size_t)pipe_pages << PAGE_SHIFT).
Same for the second min_t here and the one added by commit
17614445576b6.
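
A likely motivation for this suggestion (an assumption; the mail does not
spell it out): min_t(size_t, a, b) casts each argument to size_t only
after the argument has been evaluated, so pipe_pages << PAGE_SHIFT is
still computed in unsigned int and could wrap for a very large page count
before it is widened.  The userspace sketch below illustrates the
difference.  It assumes a 32-bit unsigned int and 4 KiB pages, uses
uint64_t in place of a 64-bit size_t, and all names in it are
hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages, i.e. a page shift of 12. */
#define SKETCH_PAGE_SHIFT 12

static_assert(sizeof(unsigned int) == 4,
	      "sketch assumes a 32-bit unsigned int");

/*
 * Mirrors min_t(size_t, len, pipe_pages << PAGE_SHIFT): the shift is
 * done in unsigned int and only the (possibly wrapped) result is widened.
 */
static uint64_t clamp_shift_then_widen(uint64_t len, unsigned int pipe_pages)
{
	unsigned int narrow = pipe_pages << SKETCH_PAGE_SHIFT;
	uint64_t space = narrow;

	return len < space ? len : space;
}

/*
 * Mirrors min(len, (size_t)pipe_pages << PAGE_SHIFT): the page count is
 * widened first, so the shift itself happens in 64 bits.
 */
static uint64_t clamp_widen_then_shift(uint64_t len, unsigned int pipe_pages)
{
	uint64_t space = (uint64_t)pipe_pages << SKETCH_PAGE_SHIFT;

	return len < space ? len : space;
}
```

With pipe_pages = 2 and len = 65536 both helpers return 8192.  With a
hypothetical pipe_pages of 1u << 20, the first helper returns 0 because
the 32-bit shift wraps, while the second returns 65536.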

> +
>                 return splice_pipe_to_pipe(ipipe, opipe, len, flags);
>         }
>
> @@ -1180,8 +1185,13 @@ static long do_splice(struct file *in, loff_t __user *off_in,
>
>                 pipe_lock(opipe);
>                 ret = wait_for_space(opipe, flags);
> -               if (!ret)
> +               if (!ret) {
> +                       /* Don't try to read more the pipe has space for. */
> +                       pipe_pages = opipe->buffers - opipe->nrbufs;
> +                       len = min_t(size_t, len, pipe_pages << PAGE_SHIFT);
> +
>                         ret = do_splice_to(in, &offset, opipe, len, flags);
> +               }
>                 pipe_unlock(opipe);
>                 if (ret > 0)
>                         wakeup_pipe_readers(opipe);

Thanks,
Andreas

Patch

diff --git a/fs/iomap.c b/fs/iomap.c
index 3ffb776fbebe..d6bc98ae8d35 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1877,15 +1877,6 @@  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 				dio->wait_for_completion = true;
 				ret = 0;
 			}
-
-			/*
-			 * Splicing to pipes can fail on a full pipe. We have to
-			 * swallow this to make it look like a short IO
-			 * otherwise the higher splice layers will completely
-			 * mishandle the error and stop moving data.
-			 */
-			if (ret == -EFAULT)
-				ret = 0;
 			break;
 		}
 		pos += ret;