Message ID | 20181202181045.GS8125@magnolia (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2,1/2] splice: don't read more than available pipe space | expand |
On Sun, Dec 2, 2018 at 8:10 PM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > From: Darrick J. Wong <darrick.wong@oracle.com> > > In commit 4721a601099, we tried to fix a problem wherein directio reads > into a splice pipe will bounce EFAULT/EAGAIN all the way out to > userspace by simulating a zero-byte short read. This happens because > some directio read implementations (xfs) will call > bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous > reads, but as soon as we run out of pipe buffers that _get_pages call > returns EFAULT, which the splice code translates to EAGAIN and bounces > out to userspace. > > In that commit, the iomap code catches the EFAULT and simulates a > zero-byte read, but that causes assertion errors on regular splice reads > because xfs doesn't allow short directio reads. This causes infinite > splice() loops and assertion failures on generic/095 on overlayfs > because xfs only permit total success or total failure of a directio > operation. The underlying issue in the pipe splice code has now been > fixed by changing the pipe splice loop to avoid avoid reading more data > than there is space in the pipe. > > Therefore, it's no longer necessary to simulate the short directio, so > remove the hack from iomap. > > Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill") > Reported-by: Amir Goldstein <amir73il@gmail.com> Wasn't me. I believe it was Murphy Zhou <jencce.kernel@gmail.com>. If you want you can add Ranted-by Amir ;-) Anyway, looks fine. > Reviewed-by: Christoph Hellwig <hch@lst.de> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > --- > v2: split into two patches per hch request > --- > fs/iomap.c | 9 --------- > 1 file changed, 9 deletions(-) > > diff --git a/fs/iomap.c b/fs/iomap.c > index 3ffb776fbebe..d6bc98ae8d35 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -1877,15 +1877,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > dio->wait_for_completion = true; > ret = 0; > } > - > - /* > - * Splicing to pipes can fail on a full pipe. We have to > - * swallow this to make it look like a short IO > - * otherwise the higher splice layers will completely > - * mishandle the error and stop moving data. > - */ > - if (ret == -EFAULT) > - ret = 0; > break; > } > pos += ret;
Hi Darrick, Am So., 2. Dez. 2018 um 19:13 Uhr schrieb Darrick J. Wong <darrick.wong@oracle.com>: > From: Darrick J. Wong <darrick.wong@oracle.com> > > In commit 4721a601099, we tried to fix a problem wherein directio reads > into a splice pipe will bounce EFAULT/EAGAIN all the way out to > userspace by simulating a zero-byte short read. This happens because > some directio read implementations (xfs) will call > bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous > reads, but as soon as we run out of pipe buffers that _get_pages call > returns EFAULT, which the splice code translates to EAGAIN and bounces > out to userspace. > > In that commit, the iomap code catches the EFAULT and simulates a > zero-byte read, but that causes assertion errors on regular splice reads > because xfs doesn't allow short directio reads. This causes infinite > splice() loops and assertion failures on generic/095 on overlayfs > because xfs only permit total success or total failure of a directio > operation. The underlying issue in the pipe splice code has now been > fixed by changing the pipe splice loop to avoid avoid reading more data > than there is space in the pipe. > > Therefore, it's no longer necessary to simulate the short directio, so > remove the hack from iomap. > > Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill") > Reported-by: Amir Goldstein <amir73il@gmail.com> > Reviewed-by: Christoph Hellwig <hch@lst.de> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > --- > v2: split into two patches per hch request > --- > fs/iomap.c | 9 --------- > 1 file changed, 9 deletions(-) > > diff --git a/fs/iomap.c b/fs/iomap.c > index 3ffb776fbebe..d6bc98ae8d35 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -1877,15 +1877,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > dio->wait_for_completion = true; > ret = 0; > } > - > - /* > - * Splicing to pipes can fail on a full pipe. We have to > - * swallow this to make it look like a short IO > - * otherwise the higher splice layers will completely > - * mishandle the error and stop moving data. > - */ > - if (ret == -EFAULT) > - ret = 0; > break; > } > pos += ret; I'm afraid this breaks the following test case on xfs and gfs2, the two current users of iomap_dio_rw. Here, the splice system call fails with errno = EAGAIN when trying to "move data" from a file opened with O_DIRECT into a pipe. The test case can be run with option -d to not use O_DIRECT, which makes the test succeed. The -r option switches from reading from the pipe sequentially to reading concurrently with the splice, which doesn't change the behavior. Any thoughts? Thanks, Andreas =================================== 8< =================================== #define _GNU_SOURCE #include <sys/types.h> #include <sys/stat.h> #include <sys/wait.h> #include <unistd.h> #include <fcntl.h> #include <err.h> #include <stdlib.h> #include <stdio.h> #include <stdbool.h> #include <string.h> #include <errno.h> #define SECTOR_SIZE 512 #define BUFFER_SIZE (150 * SECTOR_SIZE) void read_from_pipe(int fd, const char *filename, size_t size) { char buffer[SECTOR_SIZE]; size_t sz; ssize_t ret; while (size) { sz = size; if (sz > sizeof buffer) sz = sizeof buffer; ret = read(fd, buffer, sz); if (ret < 0) err(1, "read: %s", filename); if (ret == 0) { fprintf(stderr, "read: %s: unexpected EOF\n", filename); exit(1); } size -= sz; } } void do_splice1(int fd, const char *filename, size_t size) { bool retried = false; int pipefd[2]; if (pipe(pipefd) == -1) err(1, "pipe"); while (size) { ssize_t spliced; spliced = splice(fd, NULL, pipefd[1], NULL, size, SPLICE_F_MOVE); if (spliced == -1) { if (errno == EAGAIN && !retried) { retried = true; fprintf(stderr, "retrying splice\n"); sleep(1); continue; } err(1, "splice"); } read_from_pipe(pipefd[0], filename, spliced); size -= spliced; } close(pipefd[0]); close(pipefd[1]); } void do_splice2(int fd, const char *filename, size_t size) { bool retried = false; int pipefd[2]; int pid; if (pipe(pipefd) == -1) err(1, "pipe"); pid = fork(); if (pid == 0) { close(pipefd[1]); read_from_pipe(pipefd[0], filename, size); exit(0); } else { close(pipefd[0]); while (size) { ssize_t spliced; spliced = splice(fd, NULL, pipefd[1], NULL, size, SPLICE_F_MOVE); if (spliced == -1) { if (errno == EAGAIN && !retried) { retried = true; fprintf(stderr, "retrying splice\n"); sleep(1); continue; } err(1, "splice"); } size -= spliced; } close(pipefd[1]); waitpid(pid, NULL, 0); } } void usage(const char *argv0) { fprintf(stderr, "USAGE: %s [-rd] {filename}\n", basename(argv0)); exit(2); } int main(int argc, char *argv[]) { void (*do_splice)(int fd, const char *filename, size_t size); const char *filename; char *buffer; int opt, open_flags, fd; ssize_t ret; do_splice = do_splice1; open_flags = O_CREAT | O_TRUNC | O_RDWR | O_DIRECT; while ((opt = getopt(argc, argv, "rd")) != -1) { switch(opt) { case 'r': do_splice = do_splice2; break; case 'd': open_flags &= ~O_DIRECT; break; default: /* '?' */ usage(argv[0]); } } if (optind >= argc) usage(argv[0]); filename = argv[optind]; printf("%s reader %s O_DIRECT\n", do_splice == do_splice1 ? "sequential" : "concurrent", (open_flags & O_DIRECT) ? "with" : "without"); buffer = aligned_alloc(SECTOR_SIZE, BUFFER_SIZE); if (buffer == NULL) err(1, "aligned_alloc"); fd = open(filename, open_flags, 0666); if (fd == -1) err(1, "open: %s", filename); memset(buffer, 'x', BUFFER_SIZE); ret = write(fd, buffer, BUFFER_SIZE); if (ret < 0) err(1, "write: %s", filename); if (ret != BUFFER_SIZE) { fprintf(stderr, "%s: short write\n", filename); exit(1); } ret = lseek(fd, 0, SEEK_SET); if (ret != 0) err(1, "lseek: %s", filename); do_splice(fd, filename, BUFFER_SIZE); if (unlink(filename) == -1) err(1, "unlink: %s", filename); return 0; } =================================== 8< ===================================
On Wed, Aug 21, 2019 at 10:23:49PM +0200, Andreas Grünbacher wrote: > Hi Darrick, > > Am So., 2. Dez. 2018 um 19:13 Uhr schrieb Darrick J. Wong > <darrick.wong@oracle.com>: > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > In commit 4721a601099, we tried to fix a problem wherein directio reads > > into a splice pipe will bounce EFAULT/EAGAIN all the way out to > > userspace by simulating a zero-byte short read. This happens because > > some directio read implementations (xfs) will call > > bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous > > reads, but as soon as we run out of pipe buffers that _get_pages call > > returns EFAULT, which the splice code translates to EAGAIN and bounces > > out to userspace. > > > > In that commit, the iomap code catches the EFAULT and simulates a > > zero-byte read, but that causes assertion errors on regular splice reads > > because xfs doesn't allow short directio reads. This causes infinite > > splice() loops and assertion failures on generic/095 on overlayfs > > because xfs only permit total success or total failure of a directio > > operation. The underlying issue in the pipe splice code has now been > > fixed by changing the pipe splice loop to avoid avoid reading more data > > than there is space in the pipe. > > > > Therefore, it's no longer necessary to simulate the short directio, so > > remove the hack from iomap. > > > > Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill") > > Reported-by: Amir Goldstein <amir73il@gmail.com> > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > --- > > v2: split into two patches per hch request > > --- > > fs/iomap.c | 9 --------- > > 1 file changed, 9 deletions(-) > > > > diff --git a/fs/iomap.c b/fs/iomap.c > > index 3ffb776fbebe..d6bc98ae8d35 100644 > > --- a/fs/iomap.c > > +++ b/fs/iomap.c > > @@ -1877,15 +1877,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > > dio->wait_for_completion = true; > > ret = 0; > > } > > - > > - /* > > - * Splicing to pipes can fail on a full pipe. We have to > > - * swallow this to make it look like a short IO > > - * otherwise the higher splice layers will completely > > - * mishandle the error and stop moving data. > > - */ > > - if (ret == -EFAULT) > > - ret = 0; > > break; > > } > > pos += ret; > > I'm afraid this breaks the following test case on xfs and gfs2, the > two current users of iomap_dio_rw. Hmm, I had kinda wondered if regular pipes still needed this help. Evidently we don't have a lot of splice tests in fstests. :( > Here, the splice system call fails with errno = EAGAIN when trying to > "move data" from a file opened with O_DIRECT into a pipe. > > The test case can be run with option -d to not use O_DIRECT, which > makes the test succeed. > > The -r option switches from reading from the pipe sequentially to > reading concurrently with the splice, which doesn't change the > behavior. > > Any thoughts? This would be great as an xfstest! :) Do you have one ready to go, or should I just make one from the source code? --D > Thanks, > Andreas > > =================================== 8< =================================== > #define _GNU_SOURCE > #include <sys/types.h> > #include <sys/stat.h> > #include <sys/wait.h> > #include <unistd.h> > #include <fcntl.h> > #include <err.h> > > #include <stdlib.h> > #include <stdio.h> > #include <stdbool.h> > #include <string.h> > #include <errno.h> > > #define SECTOR_SIZE 512 > #define BUFFER_SIZE (150 * SECTOR_SIZE) > > void read_from_pipe(int fd, const char *filename, size_t size) > { > char buffer[SECTOR_SIZE]; > size_t sz; > ssize_t ret; > > while (size) { > sz = size; > if (sz > sizeof buffer) > sz = sizeof buffer; > ret = read(fd, buffer, sz); > if (ret < 0) > err(1, "read: %s", filename); > if (ret == 0) { > fprintf(stderr, "read: %s: unexpected EOF\n", filename); > exit(1); > } > size -= sz; > } > } > > void do_splice1(int fd, const char *filename, size_t size) > { > bool retried = false; > int pipefd[2]; > > if (pipe(pipefd) == -1) > err(1, "pipe"); > while (size) { > ssize_t spliced; > > spliced = splice(fd, NULL, pipefd[1], NULL, size, SPLICE_F_MOVE); > if (spliced == -1) { > if (errno == EAGAIN && !retried) { > retried = true; > fprintf(stderr, "retrying splice\n"); > sleep(1); > continue; > } > err(1, "splice"); > } > read_from_pipe(pipefd[0], filename, spliced); > size -= spliced; > } > close(pipefd[0]); > close(pipefd[1]); > } > > void do_splice2(int fd, const char *filename, size_t size) > { > bool retried = false; > int pipefd[2]; > int pid; > > if (pipe(pipefd) == -1) > err(1, "pipe"); > > pid = fork(); > if (pid == 0) { > close(pipefd[1]); > read_from_pipe(pipefd[0], filename, size); > exit(0); > } else { > close(pipefd[0]); > while (size) { > ssize_t spliced; > > spliced = splice(fd, NULL, pipefd[1], NULL, size, SPLICE_F_MOVE); > if (spliced == -1) { > if (errno == EAGAIN && !retried) { > retried = true; > fprintf(stderr, "retrying splice\n"); > sleep(1); > continue; > } > err(1, "splice"); > } > size -= spliced; > } > close(pipefd[1]); > waitpid(pid, NULL, 0); > } > } > > void usage(const char *argv0) > { > fprintf(stderr, "USAGE: %s [-rd] {filename}\n", basename(argv0)); > exit(2); > } > > int main(int argc, char *argv[]) > { > void (*do_splice)(int fd, const char *filename, size_t size); > const char *filename; > char *buffer; > int opt, open_flags, fd; > ssize_t ret; > > do_splice = do_splice1; > open_flags = O_CREAT | O_TRUNC | O_RDWR | O_DIRECT; > > while ((opt = getopt(argc, argv, "rd")) != -1) { > switch(opt) { > case 'r': > do_splice = do_splice2; > break; > case 'd': > open_flags &= ~O_DIRECT; > break; > default: /* '?' */ > usage(argv[0]); > } > } > > if (optind >= argc) > usage(argv[0]); > filename = argv[optind]; > > printf("%s reader %s O_DIRECT\n", > do_splice == do_splice1 ? "sequential" : "concurrent", > (open_flags & O_DIRECT) ? "with" : "without"); > > buffer = aligned_alloc(SECTOR_SIZE, BUFFER_SIZE); > if (buffer == NULL) > err(1, "aligned_alloc"); > > fd = open(filename, open_flags, 0666); > if (fd == -1) > err(1, "open: %s", filename); > > memset(buffer, 'x', BUFFER_SIZE); > ret = write(fd, buffer, BUFFER_SIZE); > if (ret < 0) > err(1, "write: %s", filename); > if (ret != BUFFER_SIZE) { > fprintf(stderr, "%s: short write\n", filename); > exit(1); > } > > ret = lseek(fd, 0, SEEK_SET); > if (ret != 0) > err(1, "lseek: %s", filename); > > do_splice(fd, filename, BUFFER_SIZE); > > if (unlink(filename) == -1) > err(1, "unlink: %s", filename); > > return 0; > } > =================================== 8< ===================================
Am Mi., 28. Aug. 2019 um 16:23 Uhr schrieb Darrick J. Wong <darrick.wong@oracle.com>: > On Wed, Aug 21, 2019 at 10:23:49PM +0200, Andreas Grünbacher wrote: > > Hi Darrick, > > > > Am So., 2. Dez. 2018 um 19:13 Uhr schrieb Darrick J. Wong > > <darrick.wong@oracle.com>: > > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > In commit 4721a601099, we tried to fix a problem wherein directio reads > > > into a splice pipe will bounce EFAULT/EAGAIN all the way out to > > > userspace by simulating a zero-byte short read. This happens because > > > some directio read implementations (xfs) will call > > > bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous > > > reads, but as soon as we run out of pipe buffers that _get_pages call > > > returns EFAULT, which the splice code translates to EAGAIN and bounces > > > out to userspace. > > > > > > In that commit, the iomap code catches the EFAULT and simulates a > > > zero-byte read, but that causes assertion errors on regular splice reads > > > because xfs doesn't allow short directio reads. This causes infinite > > > splice() loops and assertion failures on generic/095 on overlayfs > > > because xfs only permit total success or total failure of a directio > > > operation. The underlying issue in the pipe splice code has now been > > > fixed by changing the pipe splice loop to avoid avoid reading more data > > > than there is space in the pipe. > > > > > > Therefore, it's no longer necessary to simulate the short directio, so > > > remove the hack from iomap. > > > > > > Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill") > > > Reported-by: Amir Goldstein <amir73il@gmail.com> > > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > --- > > > v2: split into two patches per hch request > > > --- > > > fs/iomap.c | 9 --------- > > > 1 file changed, 9 deletions(-) > > > > > > diff --git a/fs/iomap.c b/fs/iomap.c > > > index 3ffb776fbebe..d6bc98ae8d35 100644 > > > --- a/fs/iomap.c > > > +++ b/fs/iomap.c > > > @@ -1877,15 +1877,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > > > dio->wait_for_completion = true; > > > ret = 0; > > > } > > > - > > > - /* > > > - * Splicing to pipes can fail on a full pipe. We have to > > > - * swallow this to make it look like a short IO > > > - * otherwise the higher splice layers will completely > > > - * mishandle the error and stop moving data. > > > - */ > > > - if (ret == -EFAULT) > > > - ret = 0; > > > break; > > > } > > > pos += ret; > > > > I'm afraid this breaks the following test case on xfs and gfs2, the > > two current users of iomap_dio_rw. > > Hmm, I had kinda wondered if regular pipes still needed this help. > Evidently we don't have a lot of splice tests in fstests. :( So what do you suggest as a fix? > > Here, the splice system call fails with errno = EAGAIN when trying to > > "move data" from a file opened with O_DIRECT into a pipe. > > > > The test case can be run with option -d to not use O_DIRECT, which > > makes the test succeed. > > > > The -r option switches from reading from the pipe sequentially to > > reading concurrently with the splice, which doesn't change the > > behavior. > > > > Any thoughts? > > This would be great as an xfstest! :) Or perhaps something generalized from it. > Do you have one ready to go, or should I just make one from the source > code? The bug originally triggered in our internal cluster test system and I've recreated the test case I've included from the strace. That's all I have for now; feel free to take it, of course. It could be that the same condition can be triggered with one of the existing utilities (fio/fsstress/...). Thanks, Andreas
On Wed, Aug 28, 2019 at 07:23:32AM -0700, Darrick J. Wong wrote: > On Wed, Aug 21, 2019 at 10:23:49PM +0200, Andreas Grünbacher wrote: > > Hi Darrick, > > > > Am So., 2. Dez. 2018 um 19:13 Uhr schrieb Darrick J. Wong > > <darrick.wong@oracle.com>: > > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > In commit 4721a601099, we tried to fix a problem wherein directio reads > > > into a splice pipe will bounce EFAULT/EAGAIN all the way out to > > > userspace by simulating a zero-byte short read. This happens because > > > some directio read implementations (xfs) will call > > > bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous > > > reads, but as soon as we run out of pipe buffers that _get_pages call > > > returns EFAULT, which the splice code translates to EAGAIN and bounces > > > out to userspace. > > > > > > In that commit, the iomap code catches the EFAULT and simulates a > > > zero-byte read, but that causes assertion errors on regular splice reads > > > because xfs doesn't allow short directio reads. This causes infinite > > > splice() loops and assertion failures on generic/095 on overlayfs > > > because xfs only permit total success or total failure of a directio > > > operation. The underlying issue in the pipe splice code has now been > > > fixed by changing the pipe splice loop to avoid avoid reading more data > > > than there is space in the pipe. > > > > > > Therefore, it's no longer necessary to simulate the short directio, so > > > remove the hack from iomap. > > > > > > Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill") > > > Reported-by: Amir Goldstein <amir73il@gmail.com> > > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > --- > > > v2: split into two patches per hch request > > > --- > > > fs/iomap.c | 9 --------- > > > 1 file changed, 9 deletions(-) > > > > > > diff --git a/fs/iomap.c b/fs/iomap.c > > > index 3ffb776fbebe..d6bc98ae8d35 100644 > > > --- a/fs/iomap.c > > > +++ b/fs/iomap.c > > > @@ -1877,15 +1877,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > > > dio->wait_for_completion = true; > > > ret = 0; > > > } > > > - > > > - /* > > > - * Splicing to pipes can fail on a full pipe. We have to > > > - * swallow this to make it look like a short IO > > > - * otherwise the higher splice layers will completely > > > - * mishandle the error and stop moving data. > > > - */ > > > - if (ret == -EFAULT) > > > - ret = 0; > > > break; > > > } > > > pos += ret; > > > > I'm afraid this breaks the following test case on xfs and gfs2, the > > two current users of iomap_dio_rw. > > Hmm, I had kinda wondered if regular pipes still needed this help. > Evidently we don't have a lot of splice tests in fstests. :( > > > Here, the splice system call fails with errno = EAGAIN when trying to > > "move data" from a file opened with O_DIRECT into a pipe. > > > > The test case can be run with option -d to not use O_DIRECT, which > > makes the test succeed. > > > > The -r option switches from reading from the pipe sequentially to > > reading concurrently with the splice, which doesn't change the > > behavior. > > > > Any thoughts? > > This would be great as an xfstest! :) JFYI, I added splice operation into fsstress, and I tried to add splice operation into xfs_io long time ago: https://marc.info/?l=linux-xfs&m=155828702128047&w=2 But it haven't been merged. If you have any suggestion, please feel free to review it:) Thanks, Zorro > > Do you have one ready to go, or should I just make one from the source > code? > > --D > > > Thanks, > > Andreas > > > > =================================== 8< =================================== > > #define _GNU_SOURCE > > #include <sys/types.h> > > #include <sys/stat.h> > > #include <sys/wait.h> > > #include <unistd.h> > > #include <fcntl.h> > > #include <err.h> > > > > #include <stdlib.h> > > #include <stdio.h> > > #include <stdbool.h> > > #include <string.h> > > #include <errno.h> > > > > #define SECTOR_SIZE 512 > > #define BUFFER_SIZE (150 * SECTOR_SIZE) > > > > void read_from_pipe(int fd, const char *filename, size_t size) > > { > > char buffer[SECTOR_SIZE]; > > size_t sz; > > ssize_t ret; > > > > while (size) { > > sz = size; > > if (sz > sizeof buffer) > > sz = sizeof buffer; > > ret = read(fd, buffer, sz); > > if (ret < 0) > > err(1, "read: %s", filename); > > if (ret == 0) { > > fprintf(stderr, "read: %s: unexpected EOF\n", filename); > > exit(1); > > } > > size -= sz; > > } > > } > > > > void do_splice1(int fd, const char *filename, size_t size) > > { > > bool retried = false; > > int pipefd[2]; > > > > if (pipe(pipefd) == -1) > > err(1, "pipe"); > > while (size) { > > ssize_t spliced; > > > > spliced = splice(fd, NULL, pipefd[1], NULL, size, SPLICE_F_MOVE); > > if (spliced == -1) { > > if (errno == EAGAIN && !retried) { > > retried = true; > > fprintf(stderr, "retrying splice\n"); > > sleep(1); > > continue; > > } > > err(1, "splice"); > > } > > read_from_pipe(pipefd[0], filename, spliced); > > size -= spliced; > > } > > close(pipefd[0]); > > close(pipefd[1]); > > } > > > > void do_splice2(int fd, const char *filename, size_t size) > > { > > bool retried = false; > > int pipefd[2]; > > int pid; > > > > if (pipe(pipefd) == -1) > > err(1, "pipe"); > > > > pid = fork(); > > if (pid == 0) { > > close(pipefd[1]); > > read_from_pipe(pipefd[0], filename, size); > > exit(0); > > } else { > > close(pipefd[0]); > > while (size) { > > ssize_t spliced; > > > > spliced = splice(fd, NULL, pipefd[1], NULL, size, SPLICE_F_MOVE); > > if (spliced == -1) { > > if (errno == EAGAIN && !retried) { > > retried = true; > > fprintf(stderr, "retrying splice\n"); > > sleep(1); > > continue; > > } > > err(1, "splice"); > > } > > size -= spliced; > > } > > close(pipefd[1]); > > waitpid(pid, NULL, 0); > > } > > } > > > > void usage(const char *argv0) > > { > > fprintf(stderr, "USAGE: %s [-rd] {filename}\n", basename(argv0)); > > exit(2); > > } > > > > int main(int argc, char *argv[]) > > { > > void (*do_splice)(int fd, const char *filename, size_t size); > > const char *filename; > > char *buffer; > > int opt, open_flags, fd; > > ssize_t ret; > > > > do_splice = do_splice1; > > open_flags = O_CREAT | O_TRUNC | O_RDWR | O_DIRECT; > > > > while ((opt = getopt(argc, argv, "rd")) != -1) { > > switch(opt) { > > case 'r': > > do_splice = do_splice2; > > break; > > case 'd': > > open_flags &= ~O_DIRECT; > > break; > > default: /* '?' */ > > usage(argv[0]); > > } > > } > > > > if (optind >= argc) > > usage(argv[0]); > > filename = argv[optind]; > > > > printf("%s reader %s O_DIRECT\n", > > do_splice == do_splice1 ? "sequential" : "concurrent", > > (open_flags & O_DIRECT) ? "with" : "without"); > > > > buffer = aligned_alloc(SECTOR_SIZE, BUFFER_SIZE); > > if (buffer == NULL) > > err(1, "aligned_alloc"); > > > > fd = open(filename, open_flags, 0666); > > if (fd == -1) > > err(1, "open: %s", filename); > > > > memset(buffer, 'x', BUFFER_SIZE); > > ret = write(fd, buffer, BUFFER_SIZE); > > if (ret < 0) > > err(1, "write: %s", filename); > > if (ret != BUFFER_SIZE) { > > fprintf(stderr, "%s: short write\n", filename); > > exit(1); > > } > > > > ret = lseek(fd, 0, SEEK_SET); > > if (ret != 0) > > err(1, "lseek: %s", filename); > > > > do_splice(fd, filename, BUFFER_SIZE); > > > > if (unlink(filename) == -1) > > err(1, "unlink: %s", filename); > > > > return 0; > > } > > =================================== 8< ===================================
On Wed, Aug 28, 2019 at 04:37:59PM +0200, Andreas Grünbacher wrote: > Am Mi., 28. Aug. 2019 um 16:23 Uhr schrieb Darrick J. Wong > <darrick.wong@oracle.com>: > > On Wed, Aug 21, 2019 at 10:23:49PM +0200, Andreas Grünbacher wrote: > > > Hi Darrick, > > > > > > Am So., 2. Dez. 2018 um 19:13 Uhr schrieb Darrick J. Wong > > > <darrick.wong@oracle.com>: > > > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > > > In commit 4721a601099, we tried to fix a problem wherein directio reads > > > > into a splice pipe will bounce EFAULT/EAGAIN all the way out to > > > > userspace by simulating a zero-byte short read. This happens because > > > > some directio read implementations (xfs) will call > > > > bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous > > > > reads, but as soon as we run out of pipe buffers that _get_pages call > > > > returns EFAULT, which the splice code translates to EAGAIN and bounces > > > > out to userspace. > > > > > > > > In that commit, the iomap code catches the EFAULT and simulates a > > > > zero-byte read, but that causes assertion errors on regular splice reads > > > > because xfs doesn't allow short directio reads. This causes infinite > > > > splice() loops and assertion failures on generic/095 on overlayfs > > > > because xfs only permit total success or total failure of a directio > > > > operation. The underlying issue in the pipe splice code has now been > > > > fixed by changing the pipe splice loop to avoid avoid reading more data > > > > than there is space in the pipe. > > > > > > > > Therefore, it's no longer necessary to simulate the short directio, so > > > > remove the hack from iomap. > > > > > > > > Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill") > > > > Reported-by: Amir Goldstein <amir73il@gmail.com> > > > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > --- > > > > v2: split into two patches per hch request > > > > --- > > > > fs/iomap.c | 9 --------- > > > > 1 file changed, 9 deletions(-) > > > > > > > > diff --git a/fs/iomap.c b/fs/iomap.c > > > > index 3ffb776fbebe..d6bc98ae8d35 100644 > > > > --- a/fs/iomap.c > > > > +++ b/fs/iomap.c > > > > @@ -1877,15 +1877,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > > > > dio->wait_for_completion = true; > > > > ret = 0; > > > > } > > > > - > > > > - /* > > > > - * Splicing to pipes can fail on a full pipe. We have to > > > > - * swallow this to make it look like a short IO > > > > - * otherwise the higher splice layers will completely > > > > - * mishandle the error and stop moving data. > > > > - */ > > > > - if (ret == -EFAULT) > > > > - ret = 0; > > > > break; > > > > } > > > > pos += ret; > > > > > > I'm afraid this breaks the following test case on xfs and gfs2, the > > > two current users of iomap_dio_rw. > > > > Hmm, I had kinda wondered if regular pipes still needed this help. > > Evidently we don't have a lot of splice tests in fstests. :( > > So what do you suggest as a fix? (See below) > > > Here, the splice system call fails with errno = EAGAIN when trying to > > > "move data" from a file opened with O_DIRECT into a pipe. > > > > > > The test case can be run with option -d to not use O_DIRECT, which > > > makes the test succeed. > > > > > > The -r option switches from reading from the pipe sequentially to > > > reading concurrently with the splice, which doesn't change the > > > behavior. > > > > > > Any thoughts? > > > > This would be great as an xfstest! :) > > Or perhaps something generalized from it. > > > Do you have one ready to go, or should I just make one from the source > > code? > > The bug originally triggered in our internal cluster test system and > I've recreated the test case I've included from the strace. That's all > I have for now; feel free to take it, of course. > > It could be that the same condition can be triggered with one of the > existing utilities (fio/fsstress/...). Hm, so I made an xfstest out of the program you sent me, and indeed reverting that chunk makes the failure go away, but that got me wondering -- that iomap kludge was a workaround for the splice code telling iomap to try to stuff XXXX bytes into a pipe that only has X bytes of free buffer space. We fixed splice_direct_to_actor to clamp the length parameter to the available pipe space, but we never did the same to do_splice: /* Don't try to read more the pipe has space for. */ read_len = min_t(size_t, len, (pipe->buffers - pipe->nrbufs) << PAGE_SHIFT); ret = do_splice_to(in, &pos, pipe, read_len, flags); Applying similar logic to the two (opipe != NULL) cases of do_splice() seem to make the EAGAIN problem go away too. So why don't we teach do_splice to only ask for as many bytes as the pipe has space here too? Does the following patch fix it for you? --D From: Darrick J. Wong <darrick.wong@oracle.com> Subject: [PATCH] splice: only read in as much information as there is pipe buffer space Andreas Gruenbacher reports that on the two filesystems that support iomap directio, it's possible for splice() to return -EAGAIN (instead of a short splice) if the pipe being written to has less space available in its pipe buffers than the length supplied by the calling process. Months ago we fixed splice_direct_to_actor to clamp the length of the read request to the size of the splice pipe. Do the same to do_splice. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- fs/splice.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/fs/splice.c b/fs/splice.c index 98412721f056..50335515d7c1 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1101,6 +1101,7 @@ static long do_splice(struct file *in, loff_t __user *off_in, struct pipe_inode_info *ipipe; struct pipe_inode_info *opipe; loff_t offset; + unsigned int pipe_pages; long ret; ipipe = get_pipe_info(in); @@ -1123,6 +1124,10 @@ static long do_splice(struct file *in, loff_t __user *off_in, if ((in->f_flags | out->f_flags) & O_NONBLOCK) flags |= SPLICE_F_NONBLOCK; + /* Don't try to read more the pipe has space for. */ + pipe_pages = opipe->buffers - opipe->nrbufs; + len = min_t(size_t, len, pipe_pages << PAGE_SHIFT); + return splice_pipe_to_pipe(ipipe, opipe, len, flags); } @@ -1180,8 +1185,13 @@ static long do_splice(struct file *in, loff_t __user *off_in, pipe_lock(opipe); ret = wait_for_space(opipe, flags); - if (!ret) + if (!ret) { + /* Don't try to read more the pipe has space for. */ + pipe_pages = opipe->buffers - opipe->nrbufs; + len = min_t(size_t, len, pipe_pages << PAGE_SHIFT); + ret = do_splice_to(in, &offset, opipe, len, flags); + } pipe_unlock(opipe); if (ret > 0) wakeup_pipe_readers(opipe);
Hi Darrick, Am Do., 29. Aug. 2019 um 05:12 Uhr schrieb Darrick J. Wong <darrick.wong@oracle.com>: > Hm, so I made an xfstest out of the program you sent me, and indeed > reverting that chunk makes the failure go away, but that got me > wondering -- that iomap kludge was a workaround for the splice code > telling iomap to try to stuff XXXX bytes into a pipe that only has X > bytes of free buffer space. We fixed splice_direct_to_actor to clamp > the length parameter to the available pipe space, but we never did the > same to do_splice: > > /* Don't try to read more the pipe has space for. */ > read_len = min_t(size_t, len, > (pipe->buffers - pipe->nrbufs) << PAGE_SHIFT); > ret = do_splice_to(in, &pos, pipe, read_len, flags); > > Applying similar logic to the two (opipe != NULL) cases of do_splice() > seem to make the EAGAIN problem go away too. So why don't we teach > do_splice to only ask for as many bytes as the pipe has space here too? > > Does the following patch fix it for you? Yes, that works, thank you. > From: Darrick J. Wong <darrick.wong@oracle.com> > Subject: [PATCH] splice: only read in as much information as there is pipe buffer space > > Andreas Gruenbacher reports that on the two filesystems that support > iomap directio, it's possible for splice() to return -EAGAIN (instead of > a short splice) if the pipe being written to has less space available in > its pipe buffers than the length supplied by the calling process. > > Months ago we fixed splice_direct_to_actor to clamp the length of the > read request to the size of the splice pipe. Do the same to do_splice. Can you add a reference to that commit here (17614445576b6)? > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > --- > fs/splice.c | 12 +++++++++++- > 1 file changed, 11 insertions(+), 1 deletion(-) > > diff --git a/fs/splice.c b/fs/splice.c > index 98412721f056..50335515d7c1 100644 > --- a/fs/splice.c > +++ b/fs/splice.c > @@ -1101,6 +1101,7 @@ static long do_splice(struct file *in, loff_t __user *off_in, > struct pipe_inode_info *ipipe; > struct pipe_inode_info *opipe; > loff_t offset; > + unsigned int pipe_pages; > long ret; > > ipipe = get_pipe_info(in); > @@ -1123,6 +1124,10 @@ static long do_splice(struct file *in, loff_t __user *off_in, > if ((in->f_flags | out->f_flags) & O_NONBLOCK) > flags |= SPLICE_F_NONBLOCK; > > + /* Don't try to read more the pipe has space for. */ > + pipe_pages = opipe->buffers - opipe->nrbufs; > + len = min_t(size_t, len, pipe_pages << PAGE_SHIFT); This should probably be min(len, (size_t)pipe_pages << PAGE_SHIFT). Same for the second min_t here and the one added by commit 17614445576b6. > + > return splice_pipe_to_pipe(ipipe, opipe, len, flags); > } > > @@ -1180,8 +1185,13 @@ static long do_splice(struct file *in, loff_t __user *off_in, > > pipe_lock(opipe); > ret = wait_for_space(opipe, flags); > - if (!ret) > + if (!ret) { > + /* Don't try to read more the pipe has space for. */ > + pipe_pages = opipe->buffers - opipe->nrbufs; > + len = min_t(size_t, len, pipe_pages << PAGE_SHIFT); > + > ret = do_splice_to(in, &offset, opipe, len, flags); > + } > pipe_unlock(opipe); > if (ret > 0) > wakeup_pipe_readers(opipe); Thanks, Andreas
diff --git a/fs/iomap.c b/fs/iomap.c index 3ffb776fbebe..d6bc98ae8d35 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -1877,15 +1877,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, dio->wait_for_completion = true; ret = 0; } - - /* - * Splicing to pipes can fail on a full pipe. We have to - * swallow this to make it look like a short IO - * otherwise the higher splice layers will completely - * mishandle the error and stop moving data. - */ - if (ret == -EFAULT) - ret = 0; break; } pos += ret;