
iomap: fix incomplete async dio reads when using IOMAP_DIO_PARTIAL

Message ID 1f34c8435fed21e9583492661ceb20d642a75699.1646058596.git.fdmanana@suse.com (mailing list archive)
State New, archived

Commit Message

Filipe Manana Feb. 28, 2022, 2:32 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

Some users recently reported that MariaDB was getting a read corruption
when using io_uring on top of btrfs. This started to happen in 5.16,
after commit 51bd9563b6783d ("btrfs: fix deadlock due to page faults
during direct IO reads and writes"). That changed btrfs to use the new
iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
corresponds to a memory mapped file region. That type of scenario is
exercised by test case generic/647 from fstests, and it also affected
gfs2, which, besides btrfs, is the only user of IOMAP_DIO_PARTIAL.

For this MariaDB scenario, we attempt to read 16K from file offset X
using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
with a size of 4K, and what happens is the following:

1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();

2) iomap creates a struct iomap_dio object, its reference count is
   initialized to 1 and its ->size field is initialized to 0;

3) iomap calls btrfs_iomap_begin() with file offset X, which finds the
   first 4K extent, and sets up an iomap for this extent consisting of
   a single page;

4) At iomap_dio_bio_iter(), we are able to access the first page of the
   buffer (struct iov_iter) with bio_iov_iter_get_pages() without
   triggering a page fault;

5) iomap submits a bio for this 4K extent
   (iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments
   the refcount on the struct iomap_dio object to 2; The ->size field
   of the struct iomap_dio object is incremented to 4K;

6) iomap calls btrfs_iomap_begin() again, this time with a file
   offset of X + 4K. There we set up an iomap for the next extent
   that also has a size of 4K;

7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(),
   which tries to access the next page (2nd page) of the buffer.
   This triggers a page fault and returns -EFAULT;

8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error
   to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and
   the struct iomap_dio object has a ->size value of 4K (we submitted
   a bio for an extent already). The 'wait_for_completion' variable
   is not set to true, because our iocb has IOCB_NOWAIT set;

9) At the bottom of __iomap_dio_rw(), we decrement the reference count
   of the struct iomap_dio object from 2 to 1. Because we were not
   the only ones holding a reference on it and 'wait_for_completion' is
   set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which
   just returns it up the callchain, up to io_uring;

10) The bio submitted for the first extent (step 5) completes and its
    bio endio function, iomap_dio_bio_end_io(), decrements the last
    reference on the struct iomap_dio object, resulting in calling
    iomap_dio_complete_work() -> iomap_dio_complete().

11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
    and return 4K (the amount of io done) to iomap_dio_complete_work();

12) iomap_dio_complete_work() calls the iocb completion callback,
    iocb->ki_complete() with a second argument value of 4K (total io
    done) and the iocb with the adjusted ki_pos of X + 4K. This results
    in completing the read request for io_uring, leaving it with a
    result of 4K bytes read, and only the first page of the buffer
    filled in, while the remaining 3 pages, corresponding to the other
    3 extents, were not filled;

13) For the application, the result is unexpected because if we ask
    to read N bytes, it expects to get N bytes read as long as those
    N bytes don't cross the EOF (i_size).
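
The interaction described in steps 8 and 9 corresponds to the logic at the
end of __iomap_dio_rw(). The following is a simplified sketch of that
logic before this patch, not the verbatim kernel code:

  if (ret == -EFAULT && dio->size && (dio_flags & IOMAP_DIO_PARTIAL)) {
          /* Step 8: with IOCB_NOWAIT set we never wait, even though we
           * only submitted a bio for part of the requested read range. */
          if (!(iocb->ki_flags & IOCB_NOWAIT))
                  wait_for_completion = true;
          ret = 0;
  }
  (...)
  /* Step 9: the in-flight bio still holds a reference, and since
   * 'wait_for_completion' is false we return -EIOCBQUEUED, so the
   * partial read is completed asynchronously as in steps 10-12. */
  if (!atomic_dec_and_test(&dio->ref)) {
          if (!wait_for_completion)
                  return ERR_PTR(-EIOCBQUEUED);
          /* otherwise wait here for all submitted bios to complete */
  }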

So fix this by making __iomap_dio_rw() assign true to the boolean variable
'wait_for_completion' when we have IOMAP_DIO_PARTIAL set, we have made
some progress on a read and we have not crossed the EOF boundary. Do this
even if the read has IOCB_NOWAIT set, as it's the only way to avoid
returning an unexpected result to the application. This results in
returning a positive value to the iomap caller, which tells it to fault in
the remaining pages associated with the buffer (struct iov_iter), followed
by another call to iomap_dio_rw() with IOMAP_DIO_PARTIAL set, in order to
continue the rest of the read operation.
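
For reference, this is roughly how the iomap caller is expected to drive
IOMAP_DIO_PARTIAL. The following is only a simplified sketch of the btrfs
read path (the function name is made up, and the exact retry condition and
locking in the upstream sources differ slightly):

  static ssize_t btrfs_direct_read_sketch(struct kiocb *iocb, struct iov_iter *to)
  {
          ssize_t read = 0;
          ssize_t ret;
  again:
          /* Disable page faults so iomap never blocks faulting in the buffer. */
          pagefault_disable();
          ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
                             IOMAP_DIO_PARTIAL, read);
          pagefault_enable();
          if (ret > 0)
                  read += ret;
          /*
           * A positive return with data still left in the iov_iter means
           * iomap stopped early on a page fault. Fault in the remaining
           * pages of the buffer and continue the read where it stopped.
           */
          if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
                  if (fault_in_iov_iter_writeable(to, iov_iter_count(to)) == 0)
                          goto again;
          }
          return ret < 0 ? ret : read;
  }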

The problem can also be triggered with the following simple program:

  /* Get O_DIRECT */
  #ifndef _GNU_SOURCE
  #define _GNU_SOURCE
  #endif

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <errno.h>
  #include <string.h>
  #include <liburing.h>

  int main(int argc, char *argv[])
  {
      char *foo_path;
      struct io_uring ring;
      struct io_uring_sqe *sqe;
      struct io_uring_cqe *cqe;
      struct iovec iovec;
      int fd;
      long pagesize;
      void *write_buf;
      void *read_buf;
      ssize_t ret;
      int i;

      if (argc != 2) {
          fprintf(stderr, "Use: %s <directory>\n", argv[0]);
          return 1;
      }

      foo_path = malloc(strlen(argv[1]) + 5);
      if (!foo_path) {
          fprintf(stderr, "Failed to allocate memory for file path\n");
          return 1;
      }
      strcpy(foo_path, argv[1]);
      strcat(foo_path, "/foo");

      /*
       * Create file foo with 2 extents, each with a size matching
       * the page size. Then allocate a buffer to read both extents
       * with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing
       * the read with io_uring, access the first page of the buffer
       * to fault it in, so that during the read we only trigger a
       * page fault when accessing the second page of the buffer.
       */
       fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY |
                O_DIRECT, 0666);
       if (fd == -1) {
           fprintf(stderr,
                   "Failed to create file 'foo': %s (errno %d)",
                   strerror(errno), errno);
           return 1;
       }

       pagesize = sysconf(_SC_PAGE_SIZE);
       ret = posix_memalign(&write_buf, pagesize, 2 * pagesize);
       if (ret) {
           fprintf(stderr, "Failed to allocate write buffer\n");
           return 1;
       }

       memset(write_buf, 0xab, pagesize);
       memset(write_buf + pagesize, 0xcd, pagesize);

       /* Create 2 extents, each with a size matching page size. */
       for (i = 0; i < 2; i++) {
           ret = pwrite(fd, write_buf + i * pagesize, pagesize,
	                i * pagesize);
           if (ret != pagesize) {
               fprintf(stderr,
                     "Failed to write to file, ret = %ld errno %d (%s)\n",
                      ret, errno, strerror(errno));
               return 1;
           }
           ret = fsync(fd);
           if (ret != 0) {
               fprintf(stderr, "Failed to fsync file\n");
               return 1;
           }
       }

       close(fd);
       fd = open(foo_path, O_RDONLY | O_DIRECT);
       if (fd == -1) {
           fprintf(stderr,
                   "Failed to open file 'foo': %s (errno %d)",
                   strerror(errno), errno);
           return 1;
       }

       ret = posix_memalign(&read_buf, pagesize, 2 * pagesize);
       if (ret) {
           fprintf(stderr, "Failed to allocate read buffer\n");
           return 1;
       }

       /*
        * Fault in only the first page of the read buffer.
        * We want to trigger a page fault for the 2nd page of the
        * read buffer during the read operation with io_uring
        * (O_DIRECT and IOCB_NOWAIT).
        */
       memset(read_buf, 0, 1);

       ret = io_uring_queue_init(1, &ring, 0);
       if (ret != 0) {
           fprintf(stderr, "Failed to create io_uring queue\n");
           return 1;
       }

       sqe = io_uring_get_sqe(&ring);
       if (!sqe) {
           fprintf(stderr, "Failed to get io_uring sqe\n");
           return 1;
       }

       iovec.iov_base = read_buf;
       iovec.iov_len = 2 * pagesize;
       io_uring_prep_readv(sqe, fd, &iovec, 1, 0);

       ret = io_uring_submit_and_wait(&ring, 1);
       if (ret != 1) {
           fprintf(stderr,
                   "Failed at io_uring_submit_and_wait()\n");
           return 1;
       }

       ret = io_uring_wait_cqe(&ring, &cqe);
       if (ret < 0) {
           fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
           return 1;
       }

       printf("io_uring read result for file foo:\n\n");
       printf("  cqe->res = %d (expected %d)\n", cqe->res, 2 * pagesize);
       printf("  memcmp(read_buf, write_buf) == %d (expected 0)\n",
              memcmp(read_buf, write_buf, 2 * pagesize));

       io_uring_cqe_seen(&ring, cqe);
       io_uring_queue_exit(&ring);

       return 0;
  }

When running it on an unpatched kernel:

  $ gcc io_uring_test.c -luring
  $ mkfs.btrfs -f /dev/sda
  $ mount /dev/sda /mnt/sda
  $ ./a.out /mnt/sda
  io_uring read result for file foo:

    cqe->res = 4096 (expected 8192)
    memcmp(read_buf, write_buf) == -205 (expected 0)

After this patch, the read always returns 8192 bytes, with the buffer
filled with the correct data. Although that reproducer always triggers
the bug in my test VMs, it's possible that it will not be as reliable
in other environments, since the bug is not triggered if the bio for
the first extent completes and decrements the reference on the struct
iomap_dio object before we do the atomic_dec_and_test() on the
reference at __iomap_dio_rw().

A test case for fstests will follow later.

Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/iomap/direct-io.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Comments

Dave Chinner Feb. 28, 2022, 10:38 p.m. UTC | #1
On Mon, Feb 28, 2022 at 02:32:03PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Some users recently reported that MariaDB was getting a read corruption
> when using io_uring on top of btrfs. This started to happen in 5.16,
> after commit 51bd9563b6783d ("btrfs: fix deadlock due to page faults
> during direct IO reads and writes"). That changed btrfs to use the new
> iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
> iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
> corresponds to a memory mapped file region. That type of scenario is
> exercised by test case generic/647 from fstests, and it also affected
> gfs2, which, besides btrfs, is the only user of IOMAP_DIO_PARTIAL.
> 
> For this MariaDB scenario, we attempt to read 16K from file offset X
> using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
> with a size of 4K, and what happens is the following:
> 
> 1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
> 
> 2) iomap creates a struct iomap_dio object, its reference count is
>    initialized to 1 and its ->size field is initialized to 0;
> 
> 3) iomap calls btrfs_iomap_begin() with file offset X, which finds the

You mean btrfs_dio_iomap_begin()?

>    first 4K extent, and setups an iomap for this extent consisting of
>    a single page;

So we have IOCB_NOWAIT, which means btrfs_dio_iomap_begin() is being
passed IOMAP_NOWAIT and so knows it is being asked
to map an extent for an IO that is on a non-blocking path.

btrfs_dio_iomap_begin() doesn't appear to support NOWAIT semantics
at all - it will block doing writeback IO, memory allocation, extent
locking, transaction reservations, extent allocation, etc....

That, to me, looks like the root cause of the problem here -
btrfs_dio_iomap_begin() is not guaranteeing non-blocking atomic IO
semantics for IOCB_NOWAIT IO.

In the case above, given that the extent lookup only found a 4kB
extent, we know that it doesn't span the entire requested IO range.
We also known that we cannot tell if we'll block on subsequent
mappings of the IO range, and hence no guarantee can be given that
IOCB_NOWAIT IO will not block when it is too late to back out with a
-EAGAIN error.

Hence this whole set of problems could be avoided if
btrfs_dio_iomap_begin() returns -EAGAIN if it can't map the entire
IO into a single extent without blocking when IOMAP_NOWAIT is set?
That's exactly what XFS does in xfs_direct_write_iomap_begin():

        /*
         * NOWAIT and OVERWRITE I/O needs to span the entire requested I/O with
         * a single map so that we avoid partial IO failures due to the rest of
         * the I/O range not covered by this map triggering an EAGAIN condition
         * when it is subsequently mapped and aborting the I/O.
         */
        if (flags & (IOMAP_NOWAIT | IOMAP_OVERWRITE_ONLY)) {
                error = -EAGAIN;
                if (!imap_spans_range(&imap, offset_fsb, end_fsb))
                        goto out_unlock;
        }

Basically, I'm thinking that IOMAP_NOWAIT and IOMAP_DIO_PARTIAL
should be exclusive functionality - if you are doing IOMAP_NOWAIT
then the entire IO must succeed without blocking, and if it doesn't
then we return -EAGAIN and the caller retries without IOCB_NOWAIT
set and so then we run with IOMAP_DIO_PARTIAL semantics in a thread
that can actually block....
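
As a purely hypothetical illustration (this is not actual btrfs code; the
extent map variable names and the error label are made up), the equivalent
check on the btrfs side could look something like this:

        if (flags & IOMAP_NOWAIT) {
                /*
                 * 'em' is the extent map found for the requested range
                 * [start, start + length). If it does not span the whole
                 * range, bail out so the IOCB_NOWAIT attempt is retried
                 * from a context that is allowed to block.
                 */
                if (em->start + em->len < start + length) {
                        ret = -EAGAIN;
                        goto unlock_err;
                }
        }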

.....

> 11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
>     and return 4K (the amount of io done) to iomap_dio_complete_work();
> 
> 12) iomap_dio_complete_work() calls the iocb completion callback,
>     iocb->ki_complete() with a second argument value of 4K (total io
>     done) and the iocb with the adjust ki_pos of X + 4K. This results
>     in completing the read request for io_uring, leaving it with a
>     result of 4K bytes read, and only the first page of the buffer
>     filled in, while the remaining 3 pages, corresponding to the other
>     3 extents, were not filled;
> 
> 13) For the application, the result is unexpected because if we ask
>     to read N bytes, it expects to get N bytes read as long as those
>     N bytes don't cross the EOF (i_size).

Yeah, that's exactly the sort of problem we were having with XFS
with partial DIO completions due to needing multiple iomap iteration
loops to complete a single IO. Hence IOMAP_NOWAIT now triggers the
above range check and aborts before we start...

> 
> So fix this by making __iomap_dio_rw() assign true to the boolean variable
> 'wait_for_completion' when we have IOMAP_DIO_PARTIAL set, we did some
> progress for a read and we have not crossed the EOF boundary. Do this even
> if the read has IOCB_NOWAIT set, as it's the only way to avoid providing
> an unexpected result to an application.

That's highly specific and ultimately will be fragile, IMO. I'd much
prefer that *_iomap_begin_write() implementations simply follow
IOMAP_NOWAIT requirements to ensure that any DIO that needs multiple
mappings is punted to a context that can block...

Cheers,

Dave.
Darrick J. Wong March 1, 2022, 4:42 p.m. UTC | #2
On Mon, Feb 28, 2022 at 02:32:03PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Some users recently reported that MariaDB was getting a read corruption
> when using io_uring on top of btrfs. This started to happen in 5.16,
> after commit 51bd9563b6783d ("btrfs: fix deadlock due to page faults
> during direct IO reads and writes"). That changed btrfs to use the new
> iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
> iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
> corresponds to a memory mapped file region. That type of scenario is
> exercised by test case generic/647 from fstests, and it also affected
> gfs2, which, besides btrfs, is the only user of IOMAP_DIO_PARTIAL.
> 
> For this MariaDB scenario, we attempt to read 16K from file offset X
> using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
> with a size of 4K, and what happens is the following:
> 
> 1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
> 
> 2) iomap creates a struct iomap_dio object, its reference count is
>    initialized to 1 and its ->size field is initialized to 0;
> 
> 3) iomap calls btrfs_iomap_begin() with file offset X, which finds the
>    first 4K extent, and setups an iomap for this extent consisting of
>    a single page;
> 
> 4) At iomap_dio_bio_iter(), we are able to access the first page of the
>    buffer (struct iov_iter) with bio_iov_iter_get_pages() without
>    triggering a page fault;
> 
> 5) iomap submits a bio for this 4K extent
>    (iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments
>    the refcount on the struct iomap_dio object to 2; The ->size field
>    of the struct iomap_dio object is incremented to 4K;
> 
> 6) iomap calls btrfs_iomap_begin() again, this time with a file
>    offset of X + 4K. There we setup an iomap for the next extent
>    that also has a size of 4K;
> 
> 7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(),
>    which tries to access the next page (2nd page) of the buffer.
>    This triggers a page fault and returns -EFAULT;
> 
> 8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error
>    to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and
>    the struct iomap_dio object has a ->size value of 4K (we submitted
>    a bio for an extent already). The 'wait_for_completion' variable
>    is not set to true, because our iocb has IOCB_NOWAIT set;
> 
> 9) At the bottom of __iomap_dio_rw(), we decrement the reference count
>    of the struct iomap_dio object from 2 to 1. Because we were not
>    the only ones holding a reference on it and 'wait_for_completion' is
>    set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which
>    just returns it up the callchain, up to io_uring;
> 
> 10) The bio submitted for the first extent (step 5) completes and its
>     bio endio function, iomap_dio_bio_end_io(), decrements the last
>     reference on the struct iomap_dio object, resulting in calling
>     iomap_dio_complete_work() -> iomap_dio_complete().
> 
> 11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
>     and return 4K (the amount of io done) to iomap_dio_complete_work();
> 
> 12) iomap_dio_complete_work() calls the iocb completion callback,
>     iocb->ki_complete() with a second argument value of 4K (total io
>     done) and the iocb with the adjust ki_pos of X + 4K. This results
>     in completing the read request for io_uring, leaving it with a
>     result of 4K bytes read, and only the first page of the buffer
>     filled in, while the remaining 3 pages, corresponding to the other
>     3 extents, were not filled;

Just checking my understanding of the problem here -- the first page is
filled, the directio read returns that it read $pagesize bytes, and the
remaining 3 pages in the reader's buffer are untouched?  And it's not
the case that there's some IO running in the background that will
eventually fill those three pages, nor is it the case that io_uring will
say that the read completed 4k before the contents actually reach the
page?

> 13) For the application, the result is unexpected because if we ask
>     to read N bytes, it expects to get N bytes read as long as those
>     N bytes don't cross the EOF (i_size).

Hmm.  Is the problem here that directio readers expect that a read
request for N bytes either (reads N bytes and returns N) or (returns the
usual negative errno), and nobody's expecting a short direct read?

> So fix this by making __iomap_dio_rw() assign true to the boolean variable
> 'wait_for_completion' when we have IOMAP_DIO_PARTIAL set, we did some
> progress for a read and we have not crossed the EOF boundary. Do this even
> if the read has IOCB_NOWAIT set, as it's the only way to avoid providing
> an unexpected result to an application. This results in returning a
> positive value to the iomap caller, which tells it to fault in the
> remaining pages associated to the buffer (struct iov_iter), followed by
> another call to iomap_dio_rw() with IOMAP_DIO_PARTIAL set, in order to
> continue the rest of the read operation.

I would have thought that a NOWAIT read wouldn't wait for anything,
including page faults...

> The problem can also be triggered with the following simple program:
> 
>   /* Get O_DIRECT */
>   #ifndef _GNU_SOURCE
>   #define _GNU_SOURCE
>   #endif
> 
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <unistd.h>
>   #include <fcntl.h>
>   #include <errno.h>
>   #include <string.h>
>   #include <liburing.h>
> 
>   int main(int argc, char *argv[])
>   {
>       char *foo_path;
>       struct io_uring ring;
>       struct io_uring_sqe *sqe;
>       struct io_uring_cqe *cqe;
>       struct iovec iovec;
>       int fd;
>       long pagesize;
>       void *write_buf;
>       void *read_buf;
>       ssize_t ret;
>       int i;
> 
>       if (argc != 2) {
>           fprintf(stderr, "Use: %s <directory>\n", argv[0]);
>           return 1;
>       }
> 
>       foo_path = malloc(strlen(argv[1]) + 5);
>       if (!foo_path) {
>           fprintf(stderr, "Failed to allocate memory for file path\n");
>           return 1;
>       }
>       strcpy(foo_path, argv[1]);
>       strcat(foo_path, "/foo");
> 
>       /*
>        * Create file foo with 2 extents, each with a size matching
>        * the page size. Then allocate a buffer to read both extents
>        * with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing
>        * the read with io_uring, access the first page of the buffer
>        * to fault it in, so that during the read we only trigger a
>        * page fault when accessing the second page of the buffer.
>        */
>        fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY |
>                 O_DIRECT, 0666);
>        if (fd == -1) {
>            fprintf(stderr,
>                    "Failed to create file 'foo': %s (errno %d)",
>                    strerror(errno), errno);
>            return 1;
>        }
> 
>        pagesize = sysconf(_SC_PAGE_SIZE);
>        ret = posix_memalign(&write_buf, pagesize, 2 * pagesize);
>        if (ret) {
>            fprintf(stderr, "Failed to allocate write buffer\n");
>            return 1;
>        }
> 
>        memset(write_buf, 0xab, pagesize);
>        memset(write_buf + pagesize, 0xcd, pagesize);
> 
>        /* Create 2 extents, each with a size matching page size. */
>        for (i = 0; i < 2; i++) {
>            ret = pwrite(fd, write_buf + i * pagesize, pagesize,
> 	                i * pagesize);
>            if (ret != pagesize) {
>                fprintf(stderr,
>                      "Failed to write to file, ret = %ld errno %d (%s)\n",
>                       ret, errno, strerror(errno));
>                return 1;
>            }
>            ret = fsync(fd);
>            if (ret != 0) {
>                fprintf(stderr, "Failed to fsync file\n");
>                return 1;
>            }
>        }
> 
>        close(fd);
>        fd = open(foo_path, O_RDONLY | O_DIRECT);
>        if (fd == -1) {
>            fprintf(stderr,
>                    "Failed to open file 'foo': %s (errno %d)",
>                    strerror(errno), errno);
>            return 1;
>        }
> 
>        ret = posix_memalign(&read_buf, pagesize, 2 * pagesize);
>        if (ret) {
>            fprintf(stderr, "Failed to allocate read buffer\n");
>            return 1;
>        }
> 
>        /*
>         * Fault in only the first page of the read buffer.
>         * We want to trigger a page fault for the 2nd page of the
>         * read buffer during the read operation with io_uring
>         * (O_DIRECT and IOCB_NOWAIT).
>         */
>        memset(read_buf, 0, 1);
> 
>        ret = io_uring_queue_init(1, &ring, 0);
>        if (ret != 0) {
>            fprintf(stderr, "Failed to create io_uring queue\n");
>            return 1;
>        }
> 
>        sqe = io_uring_get_sqe(&ring);
>        if (!sqe) {
>            fprintf(stderr, "Failed to get io_uring sqe\n");
>            return 1;
>        }
> 
>        iovec.iov_base = read_buf;
>        iovec.iov_len = 2 * pagesize;
>        io_uring_prep_readv(sqe, fd, &iovec, 1, 0);
> 
>        ret = io_uring_submit_and_wait(&ring, 1);
>        if (ret != 1) {
>            fprintf(stderr,
>                    "Failed at io_uring_submit_and_wait()\n");
>            return 1;
>        }
> 
>        ret = io_uring_wait_cqe(&ring, &cqe);
>        if (ret < 0) {
>            fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
>            return 1;
>        }
> 
>        printf("io_uring read result for file foo:\n\n");
>        printf("  cqe->res = %d (expected %d)\n", cqe->res, 2 * pagesize);
>        printf("  memcmp(read_buf, write_buf) == %d (expected 0)\n",
>               memcmp(read_buf, write_buf, 2 * pagesize));
> 
>        io_uring_cqe_seen(&ring, cqe);
>        io_uring_queue_exit(&ring);
> 
>        return 0;
>   }
> 
> When running it on an unpatched kernel:
> 
>   $ gcc io_uring_test.c -luring
>   $ mkfs.btrfs -f /dev/sda
>   $ mount /dev/sda /mnt/sda
>   $ ./a.out /mnt/sda
>   io_uring read result for file foo:
> 
>     cqe->res = 4096 (expected 8192)
>     memcmp(read_buf, write_buf) == -205 (expected 0)

Does memcmp(read_buf, write_buf, pagesize) == 0?  To me it looks like
you requested a NOWAIT direct read of 8k and it returned 4k, which takes
us back to the question of whether or not nowait-directio readers ought
to expect to handle short reads?

--D

> After this patch, the read always returns 8192 bytes, with the buffer
> filled with the correct data. Although that reproducer always triggers
> the bug in my test vms, it's possible that it will not be so reliable
> on other environments, as that can happen if the bio for the first
> extent completes and decrements the reference on the struct iomap_dio
> object before we do the atomic_dec_and_test() on the reference at
> __iomap_dio_rw().
> 
> A test case for fstests will followup later.
> 
> Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/
> Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/iomap/direct-io.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 03ea367df19a..9a6fdefa34f3 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -606,7 +606,19 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		iov_iter_revert(iter, iomi.pos - dio->i_size);
>  
>  	if (ret == -EFAULT && dio->size && (dio_flags & IOMAP_DIO_PARTIAL)) {
> -		if (!(iocb->ki_flags & IOCB_NOWAIT))
> +		/*
> +		 * If we are a NOWAIT request we don't want to wait for the
> +		 * completion of any previously submitted bios. However there
> +		 * is one exception: if we are doing a read and we have not
> +		 * submitted bios for everything we are supposed to read, then
> +		 * we have to wait for completion - otherwise we may end up
> +		 * returning -EIOCBQUEUED without having read everything we
> +		 * can read, making our caller think we have reached EOF.
> +		 */
> +		if (!(iocb->ki_flags & IOCB_NOWAIT) ||
> +		    (iov_iter_rw(iter) == READ &&
> +		     iomi.len > 0 &&
> +		     iomi.pos < dio->i_size))
>  			wait_for_completion = true;
>  		ret = 0;
>  	}
> -- 
> 2.33.0
>
Filipe Manana March 2, 2022, 10:17 a.m. UTC | #3
On Tue, Mar 01, 2022 at 09:38:30AM +1100, Dave Chinner wrote:
> On Mon, Feb 28, 2022 at 02:32:03PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> > 
> > Some users recently reported that MariaDB was getting a read corruption
> > when using io_uring on top of btrfs. This started to happen in 5.16,
> > after commit 51bd9563b6783d ("btrfs: fix deadlock due to page faults
> > during direct IO reads and writes"). That changed btrfs to use the new
> > iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
> > iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
> > corresponds to a memory mapped file region. That type of scenario is
> > exercised by test case generic/647 from fstests, and it also affected
> > gfs2, which, besides btrfs, is the only user of IOMAP_DIO_PARTIAL.
> > 
> > For this MariaDB scenario, we attempt to read 16K from file offset X
> > using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
> > with a size of 4K, and what happens is the following:
> > 
> > 1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
> > 
> > 2) iomap creates a struct iomap_dio object, its reference count is
> >    initialized to 1 and its ->size field is initialized to 0;
> > 
> > 3) iomap calls btrfs_iomap_begin() with file offset X, which finds the
> 
> You mean btrfs_dio_iomap_begin()?

Yes, correct.

> 
> >    first 4K extent, and setups an iomap for this extent consisting of
> >    a single page;
> 
> So we have IOCB_NOWAIT, which means btrfs_dio_iomap_begin() is being
> passed IOMAP_NOWAIT and so knows it is being asked
> to map an extent for an IO that is on a non-blocking path.
> 
> btrfs_dio_iomap_begin() doesn't appear to support NOWAIT semantics
> at all - it will block doing writeback IO, memory allocation, extent
> locking, transaction reservations, extent allocation, etc....

We do have some checks for NOWAIT before getting into btrfs_dio_iomap_begin(),
but they are only for the write path, and they are incomplete. Some are a bit
tricky to deal with, but yes, there are several cases that are either missing
or need to be improved.

> 
> That, to me, looks like the root cause of the problem here -
> btrfs_dio_iomap_begin() is not guaranteeing non-blocking atomic IO
> semantics for IOCB_NOWAIT IO.
> 
> In the case above, given that the extent lookup only found a 4kB
> extent, we know that it doesn't span the entire requested IO range.
> We also known that we cannot tell if we'll block on subsequent
> mappings of the IO range, and hence no guarantee can be given that
> IOCB_NOWAIT IO will not block when it is too late to back out with a
> -EAGAIN error.
> 
> Hence this whole set of problems could be avoided if
> btrfs_dio_iomap_begin() returns -EAGAIN if it can't map the entire
> IO into a single extent without blocking when IOMAP_NOWAIT is set?
> That's exactly what XFS does in xfs_direct_iomap_write_begin():
> 
>         /*
>          * NOWAIT and OVERWRITE I/O needs to span the entire requested I/O with
>          * a single map so that we avoid partial IO failures due to the rest of
>          * the I/O range not covered by this map triggering an EAGAIN condition
>          * when it is subsequently mapped and aborting the I/O.
>          */
>         if (flags & (IOMAP_NOWAIT | IOMAP_OVERWRITE_ONLY)) {
>                 error = -EAGAIN;
>                 if (!imap_spans_range(&imap, offset_fsb, end_fsb))
>                         goto out_unlock;
>         }
> 
> Basically, I'm thinking that IOMAP_NOWAIT and IOMAP_DIO_PARTIAL
> should be exclusive functionality - if you are doing IOMAP_NOWAIT
> then the entire IO must succeed without blocking, and if it doesn't
> then we return -EAGAIN and the caller retries without IOCB_NOWAIT
> set and so then we run with IOMAP_DIO_PARTIAL semantics in a thread
> that can actually block....

Indeed, I had not considered that. It is simple and effective, plus
it can be done exclusively in btrfs code, with no need to change iomap.

> 
> .....
> 
> > 11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
> >     and return 4K (the amount of io done) to iomap_dio_complete_work();
> > 
> > 12) iomap_dio_complete_work() calls the iocb completion callback,
> >     iocb->ki_complete() with a second argument value of 4K (total io
> >     done) and the iocb with the adjust ki_pos of X + 4K. This results
> >     in completing the read request for io_uring, leaving it with a
> >     result of 4K bytes read, and only the first page of the buffer
> >     filled in, while the remaining 3 pages, corresponding to the other
> >     3 extents, were not filled;
> > 
> > 13) For the application, the result is unexpected because if we ask
> >     to read N bytes, it expects to get N bytes read as long as those
> >     N bytes don't cross the EOF (i_size).
> 
> Yeah, that's exactly the sort of problem we were having with XFS
> with partial DIO completions due to needing multiple iomap iteration
> loops to complete a single IO. Hence IOMAP_NOWAIT now triggers the
> above range check and aborts before we start...

Interesting.

> 
> > 
> > So fix this by making __iomap_dio_rw() assign true to the boolean variable
> > 'wait_for_completion' when we have IOMAP_DIO_PARTIAL set, we did some
> > progress for a read and we have not crossed the EOF boundary. Do this even
> > if the read has IOCB_NOWAIT set, as it's the only way to avoid providing
> > an unexpected result to an application.
> 
> That's highly specific and ultimately will be fragile, IMO. I'd much
> prefer that *_iomap_begin_write() implementations simply follow
> IOMAP_NOWAIT requirements to ensure that any DIO that needs multiple
> mappings if punted to a context that can block...

Yes, agreed.

Thanks for your feedback Dave, it provided a really good insight into this
problem (and others).

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Filipe Manana March 2, 2022, 10:30 a.m. UTC | #4
On Tue, Mar 01, 2022 at 08:42:03AM -0800, Darrick J. Wong wrote:
> On Mon, Feb 28, 2022 at 02:32:03PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> > 
> > Some users recently reported that MariaDB was getting a read corruption
> > when using io_uring on top of btrfs. This started to happen in 5.16,
> > after commit 51bd9563b6783d ("btrfs: fix deadlock due to page faults
> > during direct IO reads and writes"). That changed btrfs to use the new
> > iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
> > iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
> > corresponds to a memory mapped file region. That type of scenario is
> > exercised by test case generic/647 from fstests, and it also affected
> > gfs2, which, besides btrfs, is the only user of IOMAP_DIO_PARTIAL.
> > 
> > For this MariaDB scenario, we attempt to read 16K from file offset X
> > using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
> > with a size of 4K, and what happens is the following:
> > 
> > 1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
> > 
> > 2) iomap creates a struct iomap_dio object, its reference count is
> >    initialized to 1 and its ->size field is initialized to 0;
> > 
> > 3) iomap calls btrfs_iomap_begin() with file offset X, which finds the
> >    first 4K extent, and setups an iomap for this extent consisting of
> >    a single page;
> > 
> > 4) At iomap_dio_bio_iter(), we are able to access the first page of the
> >    buffer (struct iov_iter) with bio_iov_iter_get_pages() without
> >    triggering a page fault;
> > 
> > 5) iomap submits a bio for this 4K extent
> >    (iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments
> >    the refcount on the struct iomap_dio object to 2; The ->size field
> >    of the struct iomap_dio object is incremented to 4K;
> > 
> > 6) iomap calls btrfs_iomap_begin() again, this time with a file
> >    offset of X + 4K. There we setup an iomap for the next extent
> >    that also has a size of 4K;
> > 
> > 7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(),
> >    which tries to access the next page (2nd page) of the buffer.
> >    This triggers a page fault and returns -EFAULT;
> > 
> > 8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error
> >    to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and
> >    the struct iomap_dio object has a ->size value of 4K (we submitted
> >    a bio for an extent already). The 'wait_for_completion' variable
> >    is not set to true, because our iocb has IOCB_NOWAIT set;
> > 
> > 9) At the bottom of __iomap_dio_rw(), we decrement the reference count
> >    of the struct iomap_dio object from 2 to 1. Because we were not
> >    the only ones holding a reference on it and 'wait_for_completion' is
> >    set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which
> >    just returns it up the callchain, up to io_uring;
> > 
> > 10) The bio submitted for the first extent (step 5) completes and its
> >     bio endio function, iomap_dio_bio_end_io(), decrements the last
> >     reference on the struct iomap_dio object, resulting in calling
> >     iomap_dio_complete_work() -> iomap_dio_complete().
> > 
> > 11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
> >     and return 4K (the amount of io done) to iomap_dio_complete_work();
> > 
> > 12) iomap_dio_complete_work() calls the iocb completion callback,
> >     iocb->ki_complete() with a second argument value of 4K (total io
> >     done) and the iocb with the adjust ki_pos of X + 4K. This results
> >     in completing the read request for io_uring, leaving it with a
> >     result of 4K bytes read, and only the first page of the buffer
> >     filled in, while the remaining 3 pages, corresponding to the other
> >     3 extents, were not filled;
> 
> Just checking my understanding of the problem here -- the first page is
> filled, the directio read returns that it read $pagesize bytes, and the
> remaining 3 pages in the reader's buffer are untouched?

Yep, that's it.

> And it's not
> the case that there's some IO running in the background that will
> eventually fill those three pages, nor is it the case that iouring will
> say that the read completed 4k before the contents actually reach the
> page?

Nope, not the case. There's only one bio submitted, and it's for the first
extent only. So the remaining 12K of the read buffer are not touched at
all by the read operation.

> 
> > 13) For the application, the result is unexpected because if we ask
> >     to read N bytes, it expects to get N bytes read as long as those
> >     N bytes don't cross the EOF (i_size).
> 
> Hmm.  Is the problem here that directio readers expect that a read
> request for N bytes either (reads N bytes and returns N) or (returns the
> usual negative errno), and nobody's expecting a short direct read?

Right, MariaDB is not expecting a short read here. Having worked on databases
before (not on MariaDB however), not handling short reads is normal because
the database knows that it never attempts reads that cross EOF, as it knows the
exact length and position (file offset) of each database block/page. So a short
read isn't expected.

It's probably debatable if it should keep doing reads until a read returns 0 (meaning
EOF) or an error happens.

> 
> > So fix this by making __iomap_dio_rw() assign true to the boolean variable
> > 'wait_for_completion' when we have IOMAP_DIO_PARTIAL set, we did some
> > progress for a read and we have not crossed the EOF boundary. Do this even
> > if the read has IOCB_NOWAIT set, as it's the only way to avoid providing
> > an unexpected result to an application. This results in returning a
> > positive value to the iomap caller, which tells it to fault in the
> > remaining pages associated to the buffer (struct iov_iter), followed by
> > another call to iomap_dio_rw() with IOMAP_DIO_PARTIAL set, in order to
> > continue the rest of the read operation.
> 
> I would have thought that a NOWAIT read wouldn't wait for anything,
> including page faults...
> 
> > The problem can also be triggered with the following simple program:
> > 
> >   /* Get O_DIRECT */
> >   #ifndef _GNU_SOURCE
> >   #define _GNU_SOURCE
> >   #endif
> > 
> >   #include <stdio.h>
> >   #include <stdlib.h>
> >   #include <unistd.h>
> >   #include <fcntl.h>
> >   #include <errno.h>
> >   #include <string.h>
> >   #include <liburing.h>
> > 
> >   int main(int argc, char *argv[])
> >   {
> >       char *foo_path;
> >       struct io_uring ring;
> >       struct io_uring_sqe *sqe;
> >       struct io_uring_cqe *cqe;
> >       struct iovec iovec;
> >       int fd;
> >       long pagesize;
> >       void *write_buf;
> >       void *read_buf;
> >       ssize_t ret;
> >       int i;
> > 
> >       if (argc != 2) {
> >           fprintf(stderr, "Use: %s <directory>\n", argv[0]);
> >           return 1;
> >       }
> > 
> >       foo_path = malloc(strlen(argv[1]) + 5);
> >       if (!foo_path) {
> >           fprintf(stderr, "Failed to allocate memory for file path\n");
> >           return 1;
> >       }
> >       strcpy(foo_path, argv[1]);
> >       strcat(foo_path, "/foo");
> > 
> >       /*
> >        * Create file foo with 2 extents, each with a size matching
> >        * the page size. Then allocate a buffer to read both extents
> >        * with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing
> >        * the read with io_uring, access the first page of the buffer
> >        * to fault it in, so that during the read we only trigger a
> >        * page fault when accessing the second page of the buffer.
> >        */
> >        fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY |
> >                 O_DIRECT, 0666);
> >        if (fd == -1) {
> >            fprintf(stderr,
> >                    "Failed to create file 'foo': %s (errno %d)",
> >                    strerror(errno), errno);
> >            return 1;
> >        }
> > 
> >        pagesize = sysconf(_SC_PAGE_SIZE);
> >        ret = posix_memalign(&write_buf, pagesize, 2 * pagesize);
> >        if (ret) {
> >            fprintf(stderr, "Failed to allocate write buffer\n");
> >            return 1;
> >        }
> > 
> >        memset(write_buf, 0xab, pagesize);
> >        memset(write_buf + pagesize, 0xcd, pagesize);
> > 
> >        /* Create 2 extents, each with a size matching page size. */
> >        for (i = 0; i < 2; i++) {
> >            ret = pwrite(fd, write_buf + i * pagesize, pagesize,
> > 	                i * pagesize);
> >            if (ret != pagesize) {
> >                fprintf(stderr,
> >                      "Failed to write to file, ret = %ld errno %d (%s)\n",
> >                       ret, errno, strerror(errno));
> >                return 1;
> >            }
> >            ret = fsync(fd);
> >            if (ret != 0) {
> >                fprintf(stderr, "Failed to fsync file\n");
> >                return 1;
> >            }
> >        }
> > 
> >        close(fd);
> >        fd = open(foo_path, O_RDONLY | O_DIRECT);
> >        if (fd == -1) {
> >            fprintf(stderr,
> >                    "Failed to open file 'foo': %s (errno %d)",
> >                    strerror(errno), errno);
> >            return 1;
> >        }
> > 
> >        ret = posix_memalign(&read_buf, pagesize, 2 * pagesize);
> >        if (ret) {
> >            fprintf(stderr, "Failed to allocate read buffer\n");
> >            return 1;
> >        }
> > 
> >        /*
> >         * Fault in only the first page of the read buffer.
> >         * We want to trigger a page fault for the 2nd page of the
> >         * read buffer during the read operation with io_uring
> >         * (O_DIRECT and IOCB_NOWAIT).
> >         */
> >        memset(read_buf, 0, 1);
> > 
> >        ret = io_uring_queue_init(1, &ring, 0);
> >        if (ret != 0) {
> >            fprintf(stderr, "Failed to create io_uring queue\n");
> >            return 1;
> >        }
> > 
> >        sqe = io_uring_get_sqe(&ring);
> >        if (!sqe) {
> >            fprintf(stderr, "Failed to get io_uring sqe\n");
> >            return 1;
> >        }
> > 
> >        iovec.iov_base = read_buf;
> >        iovec.iov_len = 2 * pagesize;
> >        io_uring_prep_readv(sqe, fd, &iovec, 1, 0);
> > 
> >        ret = io_uring_submit_and_wait(&ring, 1);
> >        if (ret != 1) {
> >            fprintf(stderr,
> >                    "Failed at io_uring_submit_and_wait()\n");
> >            return 1;
> >        }
> > 
> >        ret = io_uring_wait_cqe(&ring, &cqe);
> >        if (ret < 0) {
> >            fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
> >            return 1;
> >        }
> > 
> >        printf("io_uring read result for file foo:\n\n");
> >        printf("  cqe->res = %d (expected %d)\n", cqe->res, 2 * pagesize);
> >        printf("  memcmp(read_buf, write_buf) == %d (expected 0)\n",
> >               memcmp(read_buf, write_buf, 2 * pagesize));
> > 
> >        io_uring_cqe_seen(&ring, cqe);
> >        io_uring_queue_exit(&ring);
> > 
> >        return 0;
> >   }
> > 
> > When running it on an unpatched kernel:
> > 
> >   $ gcc io_uring_test.c -luring
> >   $ mkfs.btrfs -f /dev/sda
> >   $ mount /dev/sda /mnt/sda
> >   $ ./a.out /mnt/sda
> >   io_uring read result for file foo:
> > 
> >     cqe->res = 4096 (expected 8192)
> >     memcmp(read_buf, write_buf) == -205 (expected 0)
> 
> Does memcmp(read_buf, write_buf, pagesize) == 0?

Yes - as mentioned before, the read submitted only one bio, and it was for the
first extent only, which was read correctly.

> To me it looks like
> you requested a NOWAIT direct read of 8k and it returned 4k, which takes
> us back to the question of whether or not nowait-directio readers ought
> to expect to handle short reads?

As mentioned before, that is probably debatable. But many applications don't
expect less data to be returned, unless the requested range crosses the EOF
boundary.

Either way, after reading Dave's previous suggestions, this can be addressed
entirely in btrfs code, and it also avoids having a nowait request actually
block in some cases.

Thanks.

> 
> --D
> 
> > After this patch, the read always returns 8192 bytes, with the buffer
> > filled with the correct data. Although that reproducer always triggers
> > the bug in my test vms, it's possible that it will not be so reliable
> > on other environments, as that can happen if the bio for the first
> > extent completes and decrements the reference on the struct iomap_dio
> > object before we do the atomic_dec_and_test() on the reference at
> > __iomap_dio_rw().
> > 
> > A test case for fstests will followup later.
> > 
> > Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/
> > Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/iomap/direct-io.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index 03ea367df19a..9a6fdefa34f3 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -606,7 +606,19 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >  		iov_iter_revert(iter, iomi.pos - dio->i_size);
> >  
> >  	if (ret == -EFAULT && dio->size && (dio_flags & IOMAP_DIO_PARTIAL)) {
> > -		if (!(iocb->ki_flags & IOCB_NOWAIT))
> > +		/*
> > +		 * If we are a NOWAIT request we don't want to wait for the
> > +		 * completion of any previously submitted bios. However there
> > +		 * is one exception: if we are doing a read and we have not
> > +		 * submitted bios for everything we are supposed to read, then
> > +		 * we have to wait for completion - otherwise we may end up
> > +		 * returning -EIOCBQUEUED without having read everything we
> > +		 * can read, making our caller think we have reached EOF.
> > +		 */
> > +		if (!(iocb->ki_flags & IOCB_NOWAIT) ||
> > +		    (iov_iter_rw(iter) == READ &&
> > +		     iomi.len > 0 &&
> > +		     iomi.pos < dio->i_size))
> >  			wait_for_completion = true;
> >  		ret = 0;
> >  	}
> > -- 
> > 2.33.0
> >
Andreas Gruenbacher March 2, 2022, 1:03 p.m. UTC | #5
On Wed, Mar 2, 2022 at 11:17 AM Filipe Manana <fdmanana@kernel.org> wrote:
> On Tue, Mar 01, 2022 at 09:38:30AM +1100, Dave Chinner wrote:
> > On Mon, Feb 28, 2022 at 02:32:03PM +0000, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > Some users recently reported that MariaDB was getting a read corruption
> > > when using io_uring on top of btrfs. This started to happen in 5.16,
> > > after commit 51bd9563b6783d ("btrfs: fix deadlock due to page faults
> > > during direct IO reads and writes"). That changed btrfs to use the new
> > > iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
> > > iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
> > > corresponds to a memory mapped file region. That type of scenario is
> > > exercised by test case generic/647 from fstests, and it also affected
> > > gfs2, which, besides btrfs, is the only user of IOMAP_DIO_PARTIAL.
> > >
> > > For this MariaDB scenario, we attempt to read 16K from file offset X
> > > using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
> > > with a size of 4K, and what happens is the following:
> > >
> > > 1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
> > >
> > > 2) iomap creates a struct iomap_dio object, its reference count is
> > >    initialized to 1 and its ->size field is initialized to 0;
> > >
> > > 3) iomap calls btrfs_iomap_begin() with file offset X, which finds the
> >
> > You mean btrfs_dio_iomap_begin()?
>
> Yes, correct.
>
> >
> > >    first 4K extent, and setups an iomap for this extent consisting of
> > >    a single page;
> >
> > So we have IOCB_NOWAIT, which means btrfs_dio_iomap_begin() is being
> > passed IOMAP_NOWAIT and so knows it is being asked
> > to map an extent for an IO that is on a non-blocking path.
> >
> > btrfs_dio_iomap_begin() doesn't appear to support NOWAIT semantics
> > at all - it will block doing writeback IO, memory allocation, extent
> > locking, transaction reservations, extent allocation, etc....
>
> We do have some checks for NOWAIT before getting into btrfs_dio_iomap_begin(),
> but they are only for the write path, and they are incomplete. Some are a bit
> tricky to deal with, but yes, there's several cases that are either missing
> or need to be improved.
>
> >
> > That, to me, looks like the root cause of the problem here -
> > btrfs_dio_iomap_begin() is not guaranteeing non-blocking atomic IO
> > semantics for IOCB_NOWAIT IO.
> >
> > In the case above, given that the extent lookup only found a 4kB
> > extent, we know that it doesn't span the entire requested IO range.
> > We also known that we cannot tell if we'll block on subsequent
> > mappings of the IO range, and hence no guarantee can be given that
> > IOCB_NOWAIT IO will not block when it is too late to back out with a
> > -EAGAIN error.
> >
> > Hence this whole set of problems could be avoided if
> > btrfs_dio_iomap_begin() returns -EAGAIN if it can't map the entire
> > IO into a single extent without blocking when IOMAP_NOWAIT is set?
> > That's exactly what XFS does in xfs_direct_iomap_write_begin():
> >
> >         /*
> >          * NOWAIT and OVERWRITE I/O needs to span the entire requested I/O with
> >          * a single map so that we avoid partial IO failures due to the rest of
> >          * the I/O range not covered by this map triggering an EAGAIN condition
> >          * when it is subsequently mapped and aborting the I/O.
> >          */
> >         if (flags & (IOMAP_NOWAIT | IOMAP_OVERWRITE_ONLY)) {
> >                 error = -EAGAIN;
> >                 if (!imap_spans_range(&imap, offset_fsb, end_fsb))
> >                         goto out_unlock;
> >         }
> >
> > Basically, I'm thinking that IOMAP_NOWAIT and IOMAP_DIO_PARTIAL
> > should be exclusive functionality - if you are doing IOMAP_NOWAIT
> > then the entire IO must succeed without blocking, and if it doesn't
> > then we return -EAGAIN and the caller retries without IOCB_NOWAIT
> > set and so then we run with IOMAP_DIO_PARTIAL semantics in a thread
> > that can actually block....
>
> Indeed, I had not considered that, that is simple and effective, plus
> it can be done exclusively in btrfs code, no need to change iomap.

This will work for btrfs, but short buffered reads can still occur on
gfs2 due to the following conflicting requirements:

* On the one hand, buffered reads and writes are expected to be atomic
with respect to each other [*].

* On the other hand, to prevent deadlocks, we must allow the
cluster-wide inode lock to be stolen while faulting in pages. That's
the lock that provides the atomicity, however.

Direct I/O isn't affected because it doesn't have this atomicity requirement.

A non-solution to this dilemma is to lock the entire buffer into
memory: those buffers can be extremely large, so we would eventually
run out of memory.

So we return short reads instead. This only happens rarely, which
doesn't make debugging any easier. It also doesn't help that the
read(2) and write(2) manual pages don't document that short reads as
well as writes must be expected. (The atomicity requirement [*] also
isn't actually documented there.)

[*] https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07

> >
> > .....
> >
> > > 11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
> > >     and return 4K (the amount of io done) to iomap_dio_complete_work();
> > >
> > > 12) iomap_dio_complete_work() calls the iocb completion callback,
> > >     iocb->ki_complete() with a second argument value of 4K (total io
> > >     done) and the iocb with the adjust ki_pos of X + 4K. This results
> > >     in completing the read request for io_uring, leaving it with a
> > >     result of 4K bytes read, and only the first page of the buffer
> > >     filled in, while the remaining 3 pages, corresponding to the other
> > >     3 extents, were not filled;
> > >
> > > 13) For the application, the result is unexpected because if we ask
> > >     to read N bytes, it expects to get N bytes read as long as those
> > >     N bytes don't cross the EOF (i_size).
> >
> > Yeah, that's exactly the sort of problem we were having with XFS
> > with partial DIO completions due to needing multiple iomap iteration
> > loops to complete a single IO. Hence IOMAP_NOWAIT now triggers the
> > above range check and aborts before we start...
>
> Interesting.

Dave, this seems to affect all users of iomap_dio_rw in the same way,
so would it make sense to move this check there?

Thanks,
Andreas

> > > So fix this by making __iomap_dio_rw() assign true to the boolean variable
> > > 'wait_for_completion' when we have IOMAP_DIO_PARTIAL set, we did some
> > > progress for a read and we have not crossed the EOF boundary. Do this even
> > > if the read has IOCB_NOWAIT set, as it's the only way to avoid providing
> > > an unexpected result to an application.
> >
> > That's highly specific and ultimately will be fragile, IMO. I'd much
> > prefer that *_iomap_begin_write() implementations simply follow
> > IOMAP_NOWAIT requirements to ensure that any DIO that needs multiple
> > mappings if punted to a context that can block...
>
> Yes, agreed.
>
> Thanks for your feedback Dave, it provided a really good insight into this
> problem (and others).
>
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
>
Dave Chinner March 2, 2022, 11:12 p.m. UTC | #6
On Wed, Mar 02, 2022 at 02:03:28PM +0100, Andreas Gruenbacher wrote:
> On Wed, Mar 2, 2022 at 11:17 AM Filipe Manana <fdmanana@kernel.org> wrote:
> > On Tue, Mar 01, 2022 at 09:38:30AM +1100, Dave Chinner wrote:
> > > On Mon, Feb 28, 2022 at 02:32:03PM +0000, fdmanana@kernel.org wrote:
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > .....
> > >
> > > > 11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
> > > >     and return 4K (the amount of io done) to iomap_dio_complete_work();
> > > >
> > > > 12) iomap_dio_complete_work() calls the iocb completion callback,
> > > >     iocb->ki_complete() with a second argument value of 4K (total io
> > > >     done) and the iocb with the adjust ki_pos of X + 4K. This results
> > > >     in completing the read request for io_uring, leaving it with a
> > > >     result of 4K bytes read, and only the first page of the buffer
> > > >     filled in, while the remaining 3 pages, corresponding to the other
> > > >     3 extents, were not filled;
> > > >
> > > > 13) For the application, the result is unexpected because if we ask
> > > >     to read N bytes, it expects to get N bytes read as long as those
> > > >     N bytes don't cross the EOF (i_size).
> > >
> > > Yeah, that's exactly the sort of problem we were having with XFS
> > > with partial DIO completions due to needing multiple iomap iteration
> > > loops to complete a single IO. Hence IOMAP_NOWAIT now triggers the
> > > above range check and aborts before we start...
> >
> > Interesting.
> 
> Dave, this seems to affect all users of iomap_dio_rw in the same way,
> so would it make sense to move this check there?

Perhaps, but I'm not sure it makes sense because filesystems need to
abort ->iomap_begin with -EAGAIN in IOMAP_NOWAIT contexts before
they make any changes.

Hence detecting short extents in the generic code becomes ...
difficult because we might now need to undo changes that have been
made in ->iomap_begin. e.g. if the filesystem allocates space and
the iomap core says "not long enough" because IOMAP_NOWAIT is set,
then we may have to punch out that allocation in ->iomap_end before
returning -EAGAIN.

That means filesystems like XFS may now need to supply a ->iomap_end
function to undo stuff the core decides it shouldn't have done,
instead of the filesystem ensuring it never does the operation in
the first place...

IOWs, the correct behaviour here is for the filesystem ->iomap_begin
method to see that it needs to allocate and return -EAGAIN if
IOMAP_NOWAIT is set, not do the allocation and hope that it ends
up being long enough to cover the entire IO we have to do.
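
Schematically (the 'needs_alloc' predicate below is only illustrative, not
an existing helper):

        /*
         * In a filesystem's ->iomap_begin: back out with -EAGAIN before
         * allocating or reserving anything, so there is nothing for
         * ->iomap_end to undo later.
         */
        if ((flags & IOMAP_NOWAIT) && needs_alloc)
                return -EAGAIN;
        /* only now allocate blocks, reserve a transaction, etc. */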

Cheers,

Dave.

Patch

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 03ea367df19a..9a6fdefa34f3 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -606,7 +606,19 @@  __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		iov_iter_revert(iter, iomi.pos - dio->i_size);
 
 	if (ret == -EFAULT && dio->size && (dio_flags & IOMAP_DIO_PARTIAL)) {
-		if (!(iocb->ki_flags & IOCB_NOWAIT))
+		/*
+		 * If we are a NOWAIT request we don't want to wait for the
+		 * completion of any previously submitted bios. However there
+		 * is one exception: if we are doing a read and we have not
+		 * submitted bios for everything we are supposed to read, then
+		 * we have to wait for completion - otherwise we may end up
+		 * returning -EIOCBQUEUED without having read everything we
+		 * can read, making our caller think we have reached EOF.
+		 */
+		if (!(iocb->ki_flags & IOCB_NOWAIT) ||
+		    (iov_iter_rw(iter) == READ &&
+		     iomi.len > 0 &&
+		     iomi.pos < dio->i_size))
 			wait_for_completion = true;
 		ret = 0;
 	}