diff mbox

btrfs: Fix locking during DIO read

Message ID 1519213303-2802-1-git-send-email-nborisov@suse.com (mailing list archive)
State New, archived
Headers show

Commit Message

Nikolay Borisov Feb. 21, 2018, 11:41 a.m. UTC
Currently the DIO read cases uses a botched idea from ext4 to ensure
that DIO reads don't race with truncate. The idea is that if we have a
pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn
forces the dio read case to fallback to inode_locking to prevent
read/truncate races. Unfortunately this is subtly broken for at least
2 reasons:

1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
(for the read case). This means that there is no ordering guarantee
between the invocation of inode_dio_wait and the increment of
i_dio_count in btrfs_direct_IO in the tread case.

2. The memory barriers used in btrfs_inode_(block|resume)_unlocked_dio
are not really paired with the reader side - the test_bit in
btrfs_direct_IO, since the latter is missing a memory barrier. Furthermore,
the actual sleeping condition that needs ordering to prevent live-locks/
missed wakeups is the modification/read of i_dio_count. So in this case
the waker(T2) needs to make the condition false _BEFORE_ doing a test.

The interraction between the two threads roughly looks like:

T1(truncate):		                         T2(btrfs_direct_IO):
set_bit(BTRFS_INODE_READDIO_NEED_LOCK)             if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK))
if (atomic_read())                                  if (atomic_dec_and_test(&inode->i_dio_count)
  schedule()			                        wake_up_bit
clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)

Without the ordering between the test_bit in T2 and setting the bit in
T1 (due to a missing pairing barrier in T2) it's possible that T1 goes
to sleep in schedule and T2 misses the bit set, resulting in missing the
wake up.

In any case all of this is VERY subtle. So fix it by simply making
the DIO READ case take inode_lock_shared. This ensure that we can have
DIO reads in parallel at the same time we are protected against
concurrent modification of the target file. This way we closely mimic
what ext4 codes does and simplify this mess.

Multiple xfstest runs didn't show any regressions.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/btrfs_inode.h | 17 -----------------
 fs/btrfs/inode.c       | 34 ++++++++++++++++++++--------------
 2 files changed, 20 insertions(+), 31 deletions(-)

Comments

Filipe Manana Feb. 21, 2018, 1:06 p.m. UTC | #1
On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nborisov@suse.com> wrote:
> Currently the DIO read cases uses a botched idea from ext4 to ensure
> that DIO reads don't race with truncate. The idea is that if we have a
> pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn
> forces the dio read case to fallback to inode_locking to prevent
> read/truncate races. Unfortunately this is subtly broken for at least
> 2 reasons:
>
> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
> (for the read case). This means that there is no ordering guarantee
> between the invocation of inode_dio_wait and the increment of
> i_dio_count in btrfs_direct_IO in the tread case.
>
> 2. The memory barriers used in btrfs_inode_(block|resume)_unlocked_dio
> are not really paired with the reader side - the test_bit in
> btrfs_direct_IO, since the latter is missing a memory barrier. Furthermore,
> the actual sleeping condition that needs ordering to prevent live-locks/
> missed wakeups is the modification/read of i_dio_count. So in this case
> the waker(T2) needs to make the condition false _BEFORE_ doing a test.
>
> The interraction between the two threads roughly looks like:
>
> T1(truncate):                                    T2(btrfs_direct_IO):
> set_bit(BTRFS_INODE_READDIO_NEED_LOCK)             if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK))
> if (atomic_read())                                  if (atomic_dec_and_test(&inode->i_dio_count)
>   schedule()                                            wake_up_bit
> clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)
>
> Without the ordering between the test_bit in T2 and setting the bit in
> T1 (due to a missing pairing barrier in T2) it's possible that T1 goes
> to sleep in schedule and T2 misses the bit set, resulting in missing the
> wake up.
>
> In any case all of this is VERY subtle. So fix it by simply making
> the DIO READ case take inode_lock_shared. This ensure that we can have
> DIO reads in parallel at the same time we are protected against
> concurrent modification of the target file.

And that prevents writes and reads against different (i.e. not
overlapping) ranges from happening in parallel.
That has a big impact on applications (databases for e.g.) that
operate on large files serving multiple requests.
Now all reads are serialized against all writes and vice versa.

Unless I missed something, a big NAK to this change as it is.


> This way we closely mimic
> what ext4 codes does and simplify this mess.
>
> Multiple xfstest runs didn't show any regressions.
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> ---
>  fs/btrfs/btrfs_inode.h | 17 -----------------
>  fs/btrfs/inode.c       | 34 ++++++++++++++++++++--------------
>  2 files changed, 20 insertions(+), 31 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index f527e99c9f8d..3519e49d4ef0 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -329,23 +329,6 @@ struct btrfs_dio_private {
>                         blk_status_t);
>  };
>
> -/*
> - * Disable DIO read nolock optimization, so new dio readers will be forced
> - * to grab i_mutex. It is used to avoid the endless truncate due to
> - * nonlocked dio read.
> - */
> -static inline void btrfs_inode_block_unlocked_dio(struct btrfs_inode *inode)
> -{
> -       set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
> -       smp_mb();
> -}
> -
> -static inline void btrfs_inode_resume_unlocked_dio(struct btrfs_inode *inode)
> -{
> -       smp_mb__before_atomic();
> -       clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
> -}
> -
>  static inline void btrfs_print_data_csum_error(struct btrfs_inode *inode,
>                 u64 logical_start, u32 csum, u32 csum_expected, int mirror_num)
>  {
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 491a7397f6fa..9c43257e6e11 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5149,10 +5149,13 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
>                 /* we don't support swapfiles, so vmtruncate shouldn't fail */
>                 truncate_setsize(inode, newsize);
>
> -               /* Disable nonlocked read DIO to avoid the end less truncate */
> -               btrfs_inode_block_unlocked_dio(BTRFS_I(inode));
> +               /*
> +                * Truncate after all in-flight dios are finished, new ones
> +                * will block on inode_lock. This only matters for AIO requests
> +                * since DIO READ is performed under inode_shared_lock and
> +                * write under exclusive lock.
> +                */
>                 inode_dio_wait(inode);
> -               btrfs_inode_resume_unlocked_dio(BTRFS_I(inode));
>
>                 ret = btrfs_truncate(inode);
>                 if (ret && inode->i_nlink) {
> @@ -8669,15 +8672,12 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>         loff_t offset = iocb->ki_pos;
>         size_t count = 0;
>         int flags = 0;
> -       bool wakeup = true;
>         bool relock = false;
>         ssize_t ret;
>
>         if (check_direct_IO(fs_info, iter, offset))
>                 return 0;
>
> -       inode_dio_begin(inode);
> -
>         /*
>          * The generic stuff only does filemap_write_and_wait_range, which
>          * isn't enough if we've written compressed pages to this area, so
> @@ -8691,6 +8691,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>                                          offset + count - 1);
>
>         if (iov_iter_rw(iter) == WRITE) {
> +
> +               inode_dio_begin(inode);
> +
>                 /*
>                  * If the write DIO is beyond the EOF, we need update
>                  * the isize, but it is protected by i_mutex. So we can
> @@ -8720,11 +8723,13 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>                 dio_data.unsubmitted_oe_range_end = (u64)offset;
>                 current->journal_info = &dio_data;
>                 down_read(&BTRFS_I(inode)->dio_sem);
> -       } else if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
> -                                    &BTRFS_I(inode)->runtime_flags)) {
> -               inode_dio_end(inode);
> -               flags = DIO_LOCKING | DIO_SKIP_HOLES;
> -               wakeup = false;
> +       } else {
> +               /*
> +                * In DIO READ case locking the inode in shared mode ensures
> +                * we are protected against parallel writes/truncates
> +                */
> +               inode_lock_shared(inode);
> +               inode_dio_begin(inode);
>         }
>
>         ret = __blockdev_direct_IO(iocb, inode,
> @@ -8755,10 +8760,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>                         btrfs_delalloc_release_space(inode, data_reserved,
>                                         offset, count - (size_t)ret);
>                 btrfs_delalloc_release_extents(BTRFS_I(inode), count);
> -       }
> +       } else
> +               inode_unlock_shared(inode);
>  out:
> -       if (wakeup)
> -               inode_dio_end(inode);
> +       inode_dio_end(inode);
> +
>         if (relock)
>                 inode_lock(inode);
>
> --
> 2.7.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nikolay Borisov Feb. 21, 2018, 1:10 p.m. UTC | #2
On 21.02.2018 15:06, Filipe Manana wrote:
> On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nborisov@suse.com> wrote:
>> Currently the DIO read cases uses a botched idea from ext4 to ensure
>> that DIO reads don't race with truncate. The idea is that if we have a
>> pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn
>> forces the dio read case to fallback to inode_locking to prevent
>> read/truncate races. Unfortunately this is subtly broken for at least
>> 2 reasons:
>>
>> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
>> (for the read case). This means that there is no ordering guarantee
>> between the invocation of inode_dio_wait and the increment of
>> i_dio_count in btrfs_direct_IO in the tread case.
>>
>> 2. The memory barriers used in btrfs_inode_(block|resume)_unlocked_dio
>> are not really paired with the reader side - the test_bit in
>> btrfs_direct_IO, since the latter is missing a memory barrier. Furthermore,
>> the actual sleeping condition that needs ordering to prevent live-locks/
>> missed wakeups is the modification/read of i_dio_count. So in this case
>> the waker(T2) needs to make the condition false _BEFORE_ doing a test.
>>
>> The interraction between the two threads roughly looks like:
>>
>> T1(truncate):                                    T2(btrfs_direct_IO):
>> set_bit(BTRFS_INODE_READDIO_NEED_LOCK)             if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK))
>> if (atomic_read())                                  if (atomic_dec_and_test(&inode->i_dio_count)
>>   schedule()                                            wake_up_bit
>> clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)
>>
>> Without the ordering between the test_bit in T2 and setting the bit in
>> T1 (due to a missing pairing barrier in T2) it's possible that T1 goes
>> to sleep in schedule and T2 misses the bit set, resulting in missing the
>> wake up.
>>
>> In any case all of this is VERY subtle. So fix it by simply making
>> the DIO READ case take inode_lock_shared. This ensure that we can have
>> DIO reads in parallel at the same time we are protected against
>> concurrent modification of the target file.
> 
> And that prevents writes and reads against different (i.e. not
> overlapping) ranges from happening in parallel.
> That has a big impact on applications (databases for e.g.) that
> operate on large files serving multiple requests.
> Now all reads are serialized against all writes and vice versa.
> 

Correct, but I'd prefer correctness over performance! And I assume other
people as well, since as is the code atm it's not providing full
protection between racing reads and truncate.

> Unless I missed something, a big NAK to this change as it is.



> 
> 
>> This way we closely mimic
>> what ext4 codes does and simplify this mess.
>>
>> Multiple xfstest runs didn't show any regressions.
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> ---
>>  fs/btrfs/btrfs_inode.h | 17 -----------------
>>  fs/btrfs/inode.c       | 34 ++++++++++++++++++++--------------
>>  2 files changed, 20 insertions(+), 31 deletions(-)
>>
>> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
>> index f527e99c9f8d..3519e49d4ef0 100644
>> --- a/fs/btrfs/btrfs_inode.h
>> +++ b/fs/btrfs/btrfs_inode.h
>> @@ -329,23 +329,6 @@ struct btrfs_dio_private {
>>                         blk_status_t);
>>  };
>>
>> -/*
>> - * Disable DIO read nolock optimization, so new dio readers will be forced
>> - * to grab i_mutex. It is used to avoid the endless truncate due to
>> - * nonlocked dio read.
>> - */
>> -static inline void btrfs_inode_block_unlocked_dio(struct btrfs_inode *inode)
>> -{
>> -       set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
>> -       smp_mb();
>> -}
>> -
>> -static inline void btrfs_inode_resume_unlocked_dio(struct btrfs_inode *inode)
>> -{
>> -       smp_mb__before_atomic();
>> -       clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
>> -}
>> -
>>  static inline void btrfs_print_data_csum_error(struct btrfs_inode *inode,
>>                 u64 logical_start, u32 csum, u32 csum_expected, int mirror_num)
>>  {
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 491a7397f6fa..9c43257e6e11 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -5149,10 +5149,13 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
>>                 /* we don't support swapfiles, so vmtruncate shouldn't fail */
>>                 truncate_setsize(inode, newsize);
>>
>> -               /* Disable nonlocked read DIO to avoid the end less truncate */
>> -               btrfs_inode_block_unlocked_dio(BTRFS_I(inode));
>> +               /*
>> +                * Truncate after all in-flight dios are finished, new ones
>> +                * will block on inode_lock. This only matters for AIO requests
>> +                * since DIO READ is performed under inode_shared_lock and
>> +                * write under exclusive lock.
>> +                */
>>                 inode_dio_wait(inode);
>> -               btrfs_inode_resume_unlocked_dio(BTRFS_I(inode));
>>
>>                 ret = btrfs_truncate(inode);
>>                 if (ret && inode->i_nlink) {
>> @@ -8669,15 +8672,12 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>         loff_t offset = iocb->ki_pos;
>>         size_t count = 0;
>>         int flags = 0;
>> -       bool wakeup = true;
>>         bool relock = false;
>>         ssize_t ret;
>>
>>         if (check_direct_IO(fs_info, iter, offset))
>>                 return 0;
>>
>> -       inode_dio_begin(inode);
>> -
>>         /*
>>          * The generic stuff only does filemap_write_and_wait_range, which
>>          * isn't enough if we've written compressed pages to this area, so
>> @@ -8691,6 +8691,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>                                          offset + count - 1);
>>
>>         if (iov_iter_rw(iter) == WRITE) {
>> +
>> +               inode_dio_begin(inode);
>> +
>>                 /*
>>                  * If the write DIO is beyond the EOF, we need update
>>                  * the isize, but it is protected by i_mutex. So we can
>> @@ -8720,11 +8723,13 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>                 dio_data.unsubmitted_oe_range_end = (u64)offset;
>>                 current->journal_info = &dio_data;
>>                 down_read(&BTRFS_I(inode)->dio_sem);
>> -       } else if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
>> -                                    &BTRFS_I(inode)->runtime_flags)) {
>> -               inode_dio_end(inode);
>> -               flags = DIO_LOCKING | DIO_SKIP_HOLES;
>> -               wakeup = false;
>> +       } else {
>> +               /*
>> +                * In DIO READ case locking the inode in shared mode ensures
>> +                * we are protected against parallel writes/truncates
>> +                */
>> +               inode_lock_shared(inode);
>> +               inode_dio_begin(inode);
>>         }
>>
>>         ret = __blockdev_direct_IO(iocb, inode,
>> @@ -8755,10 +8760,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>                         btrfs_delalloc_release_space(inode, data_reserved,
>>                                         offset, count - (size_t)ret);
>>                 btrfs_delalloc_release_extents(BTRFS_I(inode), count);
>> -       }
>> +       } else
>> +               inode_unlock_shared(inode);
>>  out:
>> -       if (wakeup)
>> -               inode_dio_end(inode);
>> +       inode_dio_end(inode);
>> +
>>         if (relock)
>>                 inode_lock(inode);
>>
>> --
>> 2.7.4
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Filipe Manana Feb. 21, 2018, 1:27 p.m. UTC | #3
On Wed, Feb 21, 2018 at 1:10 PM, Nikolay Borisov <nborisov@suse.com> wrote:
>
>
> On 21.02.2018 15:06, Filipe Manana wrote:
>> On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nborisov@suse.com> wrote:
>>> Currently the DIO read cases uses a botched idea from ext4 to ensure
>>> that DIO reads don't race with truncate. The idea is that if we have a
>>> pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn
>>> forces the dio read case to fallback to inode_locking to prevent
>>> read/truncate races. Unfortunately this is subtly broken for at least
>>> 2 reasons:
>>>
>>> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
>>> (for the read case). This means that there is no ordering guarantee
>>> between the invocation of inode_dio_wait and the increment of
>>> i_dio_count in btrfs_direct_IO in the tread case.
>>>
>>> 2. The memory barriers used in btrfs_inode_(block|resume)_unlocked_dio
>>> are not really paired with the reader side - the test_bit in
>>> btrfs_direct_IO, since the latter is missing a memory barrier. Furthermore,
>>> the actual sleeping condition that needs ordering to prevent live-locks/
>>> missed wakeups is the modification/read of i_dio_count. So in this case
>>> the waker(T2) needs to make the condition false _BEFORE_ doing a test.
>>>
>>> The interraction between the two threads roughly looks like:
>>>
>>> T1(truncate):                                    T2(btrfs_direct_IO):
>>> set_bit(BTRFS_INODE_READDIO_NEED_LOCK)             if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK))
>>> if (atomic_read())                                  if (atomic_dec_and_test(&inode->i_dio_count)
>>>   schedule()                                            wake_up_bit
>>> clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)
>>>
>>> Without the ordering between the test_bit in T2 and setting the bit in
>>> T1 (due to a missing pairing barrier in T2) it's possible that T1 goes
>>> to sleep in schedule and T2 misses the bit set, resulting in missing the
>>> wake up.
>>>
>>> In any case all of this is VERY subtle. So fix it by simply making
>>> the DIO READ case take inode_lock_shared. This ensure that we can have
>>> DIO reads in parallel at the same time we are protected against
>>> concurrent modification of the target file.
>>
>> And that prevents writes and reads against different (i.e. not
>> overlapping) ranges from happening in parallel.
>> That has a big impact on applications (databases for e.g.) that
>> operate on large files serving multiple requests.
>> Now all reads are serialized against all writes and vice versa.
>>
>
> Correct, but I'd prefer correctness over performance! And I assume other
> people as well, since as is the code atm it's not providing full
> protection between racing reads and truncate.

And even more people prefer correctness without significant impact on
performance.

So fix it differently please.

>
>> Unless I missed something, a big NAK to this change as it is.
>
>
>
>>
>>
>>> This way we closely mimic
>>> what ext4 codes does and simplify this mess.
>>>
>>> Multiple xfstest runs didn't show any regressions.
>>>
>>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>>> ---
>>>  fs/btrfs/btrfs_inode.h | 17 -----------------
>>>  fs/btrfs/inode.c       | 34 ++++++++++++++++++++--------------
>>>  2 files changed, 20 insertions(+), 31 deletions(-)
>>>
>>> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
>>> index f527e99c9f8d..3519e49d4ef0 100644
>>> --- a/fs/btrfs/btrfs_inode.h
>>> +++ b/fs/btrfs/btrfs_inode.h
>>> @@ -329,23 +329,6 @@ struct btrfs_dio_private {
>>>                         blk_status_t);
>>>  };
>>>
>>> -/*
>>> - * Disable DIO read nolock optimization, so new dio readers will be forced
>>> - * to grab i_mutex. It is used to avoid the endless truncate due to
>>> - * nonlocked dio read.
>>> - */
>>> -static inline void btrfs_inode_block_unlocked_dio(struct btrfs_inode *inode)
>>> -{
>>> -       set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
>>> -       smp_mb();
>>> -}
>>> -
>>> -static inline void btrfs_inode_resume_unlocked_dio(struct btrfs_inode *inode)
>>> -{
>>> -       smp_mb__before_atomic();
>>> -       clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
>>> -}
>>> -
>>>  static inline void btrfs_print_data_csum_error(struct btrfs_inode *inode,
>>>                 u64 logical_start, u32 csum, u32 csum_expected, int mirror_num)
>>>  {
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index 491a7397f6fa..9c43257e6e11 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -5149,10 +5149,13 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
>>>                 /* we don't support swapfiles, so vmtruncate shouldn't fail */
>>>                 truncate_setsize(inode, newsize);
>>>
>>> -               /* Disable nonlocked read DIO to avoid the end less truncate */
>>> -               btrfs_inode_block_unlocked_dio(BTRFS_I(inode));
>>> +               /*
>>> +                * Truncate after all in-flight dios are finished, new ones
>>> +                * will block on inode_lock. This only matters for AIO requests
>>> +                * since DIO READ is performed under inode_shared_lock and
>>> +                * write under exclusive lock.
>>> +                */
>>>                 inode_dio_wait(inode);
>>> -               btrfs_inode_resume_unlocked_dio(BTRFS_I(inode));
>>>
>>>                 ret = btrfs_truncate(inode);
>>>                 if (ret && inode->i_nlink) {
>>> @@ -8669,15 +8672,12 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>>         loff_t offset = iocb->ki_pos;
>>>         size_t count = 0;
>>>         int flags = 0;
>>> -       bool wakeup = true;
>>>         bool relock = false;
>>>         ssize_t ret;
>>>
>>>         if (check_direct_IO(fs_info, iter, offset))
>>>                 return 0;
>>>
>>> -       inode_dio_begin(inode);
>>> -
>>>         /*
>>>          * The generic stuff only does filemap_write_and_wait_range, which
>>>          * isn't enough if we've written compressed pages to this area, so
>>> @@ -8691,6 +8691,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>>                                          offset + count - 1);
>>>
>>>         if (iov_iter_rw(iter) == WRITE) {
>>> +
>>> +               inode_dio_begin(inode);
>>> +
>>>                 /*
>>>                  * If the write DIO is beyond the EOF, we need update
>>>                  * the isize, but it is protected by i_mutex. So we can
>>> @@ -8720,11 +8723,13 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>>                 dio_data.unsubmitted_oe_range_end = (u64)offset;
>>>                 current->journal_info = &dio_data;
>>>                 down_read(&BTRFS_I(inode)->dio_sem);
>>> -       } else if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
>>> -                                    &BTRFS_I(inode)->runtime_flags)) {
>>> -               inode_dio_end(inode);
>>> -               flags = DIO_LOCKING | DIO_SKIP_HOLES;
>>> -               wakeup = false;
>>> +       } else {
>>> +               /*
>>> +                * In DIO READ case locking the inode in shared mode ensures
>>> +                * we are protected against parallel writes/truncates
>>> +                */
>>> +               inode_lock_shared(inode);
>>> +               inode_dio_begin(inode);
>>>         }
>>>
>>>         ret = __blockdev_direct_IO(iocb, inode,
>>> @@ -8755,10 +8760,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>>                         btrfs_delalloc_release_space(inode, data_reserved,
>>>                                         offset, count - (size_t)ret);
>>>                 btrfs_delalloc_release_extents(BTRFS_I(inode), count);
>>> -       }
>>> +       } else
>>> +               inode_unlock_shared(inode);
>>>  out:
>>> -       if (wakeup)
>>> -               inode_dio_end(inode);
>>> +       inode_dio_end(inode);
>>> +
>>>         if (relock)
>>>                 inode_lock(inode);
>>>
>>> --
>>> 2.7.4
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
Filipe Manana Feb. 21, 2018, 1:51 p.m. UTC | #4
On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nborisov@suse.com> wrote:
> Currently the DIO read cases uses a botched idea from ext4 to ensure
> that DIO reads don't race with truncate. The idea is that if we have a
> pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn
> forces the dio read case to fallback to inode_locking to prevent
> read/truncate races. Unfortunately this is subtly broken for at least
> 2 reasons:
>
> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
> (for the read case). This means that there is no ordering guarantee
> between the invocation of inode_dio_wait and the increment of
> i_dio_count in btrfs_direct_IO in the tread case.

Also, looking at this changelog, the diff and the code, why is it a
problem not calling inode_dio_begin without the inode lock in the dio
read path?
The truncate path calls inode_dio_wait after setting the bit
BTRFS_INODE_READDIO_NEED_LOCK and before clearing it.
Assuming the functions to set and clear that bit are correct, I don't
see what problem this brings.

Have you observed any issue in practice? Or can you show a diagram
that points a sequence of concurrent steps leading to a problem?

>
> 2. The memory barriers used in btrfs_inode_(block|resume)_unlocked_dio
> are not really paired with the reader side - the test_bit in
> btrfs_direct_IO, since the latter is missing a memory barrier.

Which memory barrier is missing?
Both btrfs_inode_block_unlocked_dio() and
btrfs_inode_resume_unlocked_dio() have memory barriers, and I think
they are even not needed.
Why do you think they are needed?

> Furthermore,
> the actual sleeping condition that needs ordering to prevent live-locks/
> missed wakeups is the modification/read of i_dio_count. So in this case
> the waker(T2) needs to make the condition false _BEFORE_ doing a test.
>
> The interraction between the two threads roughly looks like:
>
> T1(truncate):                                    T2(btrfs_direct_IO):
> set_bit(BTRFS_INODE_READDIO_NEED_LOCK)             if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK))
> if (atomic_read())                                  if (atomic_dec_and_test(&inode->i_dio_count)
>   schedule()                                            wake_up_bit
> clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)
>
> Without the ordering between the test_bit in T2 and setting the bit in
> T1 (due to a missing pairing barrier in T2) it's possible that T1 goes
> to sleep in schedule and T2 misses the bit set, resulting in missing the
> wake up.

set_bit and test_bit are atomic. I don't understand from that sequence
diagram how the wakeup is missed.

>
> In any case all of this is VERY subtle. So fix it by simply making
> the DIO READ case take inode_lock_shared. This ensure that we can have
> DIO reads in parallel at the same time we are protected against
> concurrent modification of the target file. This way we closely mimic
> what ext4 codes does and simplify this mess.
>
> Multiple xfstest runs didn't show any regressions.
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> ---
>  fs/btrfs/btrfs_inode.h | 17 -----------------
>  fs/btrfs/inode.c       | 34 ++++++++++++++++++++--------------
>  2 files changed, 20 insertions(+), 31 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index f527e99c9f8d..3519e49d4ef0 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -329,23 +329,6 @@ struct btrfs_dio_private {
>                         blk_status_t);
>  };
>
> -/*
> - * Disable DIO read nolock optimization, so new dio readers will be forced
> - * to grab i_mutex. It is used to avoid the endless truncate due to
> - * nonlocked dio read.
> - */
> -static inline void btrfs_inode_block_unlocked_dio(struct btrfs_inode *inode)
> -{
> -       set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
> -       smp_mb();
> -}
> -
> -static inline void btrfs_inode_resume_unlocked_dio(struct btrfs_inode *inode)
> -{
> -       smp_mb__before_atomic();
> -       clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
> -}
> -
>  static inline void btrfs_print_data_csum_error(struct btrfs_inode *inode,
>                 u64 logical_start, u32 csum, u32 csum_expected, int mirror_num)
>  {
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 491a7397f6fa..9c43257e6e11 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5149,10 +5149,13 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
>                 /* we don't support swapfiles, so vmtruncate shouldn't fail */
>                 truncate_setsize(inode, newsize);
>
> -               /* Disable nonlocked read DIO to avoid the end less truncate */
> -               btrfs_inode_block_unlocked_dio(BTRFS_I(inode));
> +               /*
> +                * Truncate after all in-flight dios are finished, new ones
> +                * will block on inode_lock. This only matters for AIO requests
> +                * since DIO READ is performed under inode_shared_lock and
> +                * write under exclusive lock.
> +                */
>                 inode_dio_wait(inode);
> -               btrfs_inode_resume_unlocked_dio(BTRFS_I(inode));
>
>                 ret = btrfs_truncate(inode);
>                 if (ret && inode->i_nlink) {
> @@ -8669,15 +8672,12 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>         loff_t offset = iocb->ki_pos;
>         size_t count = 0;
>         int flags = 0;
> -       bool wakeup = true;
>         bool relock = false;
>         ssize_t ret;
>
>         if (check_direct_IO(fs_info, iter, offset))
>                 return 0;
>
> -       inode_dio_begin(inode);
> -
>         /*
>          * The generic stuff only does filemap_write_and_wait_range, which
>          * isn't enough if we've written compressed pages to this area, so
> @@ -8691,6 +8691,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>                                          offset + count - 1);
>
>         if (iov_iter_rw(iter) == WRITE) {
> +
> +               inode_dio_begin(inode);
> +
>                 /*
>                  * If the write DIO is beyond the EOF, we need update
>                  * the isize, but it is protected by i_mutex. So we can
> @@ -8720,11 +8723,13 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>                 dio_data.unsubmitted_oe_range_end = (u64)offset;
>                 current->journal_info = &dio_data;
>                 down_read(&BTRFS_I(inode)->dio_sem);
> -       } else if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
> -                                    &BTRFS_I(inode)->runtime_flags)) {
> -               inode_dio_end(inode);
> -               flags = DIO_LOCKING | DIO_SKIP_HOLES;
> -               wakeup = false;
> +       } else {
> +               /*
> +                * In DIO READ case locking the inode in shared mode ensures
> +                * we are protected against parallel writes/truncates
> +                */
> +               inode_lock_shared(inode);
> +               inode_dio_begin(inode);
>         }
>
>         ret = __blockdev_direct_IO(iocb, inode,
> @@ -8755,10 +8760,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>                         btrfs_delalloc_release_space(inode, data_reserved,
>                                         offset, count - (size_t)ret);
>                 btrfs_delalloc_release_extents(BTRFS_I(inode), count);
> -       }
> +       } else
> +               inode_unlock_shared(inode);
>  out:
> -       if (wakeup)
> -               inode_dio_end(inode);
> +       inode_dio_end(inode);
> +
>         if (relock)
>                 inode_lock(inode);
>
> --
> 2.7.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nikolay Borisov Feb. 21, 2018, 2:15 p.m. UTC | #5
On 21.02.2018 15:51, Filipe Manana wrote:
> On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nborisov@suse.com> wrote:
>> Currently the DIO read cases uses a botched idea from ext4 to ensure
>> that DIO reads don't race with truncate. The idea is that if we have a
>> pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn
>> forces the dio read case to fallback to inode_locking to prevent
>> read/truncate races. Unfortunately this is subtly broken for at least
>> 2 reasons:
>>
>> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
>> (for the read case). This means that there is no ordering guarantee
>> between the invocation of inode_dio_wait and the increment of
>> i_dio_count in btrfs_direct_IO in the tread case.
> 
> Also, looking at this changelog, the diff and the code, why is it a
> problem not calling inode_dio_begin without the inode lock in the dio
> read path?
> The truncate path calls inode_dio_wait after setting the bit
> BTRFS_INODE_READDIO_NEED_LOCK and before clearing it.
> Assuming the functions to set and clear that bit are correct, I don't
> see what problem this brings.

Assume you have a truncate and a dio READ in parallel. So the following 
execution is possible: 
 
T1:                                                           T2:                                                                              
btrfs_setattr                                            
 set_bit(BTRFS_INODE_READDIO_NEED_LOCK)
 inode_dio_wait (reads i_dio_count)                        btrfs_direct_IO   						 
 clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)                  inode_dio_begin (inc's i_dio_count) 

Since we have no ordering between beginning a dio and waiting for it then 
truncate can assume there isn't any pending dio. At the same time 
btrfs_direct_IO will increment i_dio_count but won't see BTRFS_INODE_READDIO_NEED_LOCK
ever being set and so will proceed servicing the read.  


> 
> Have you observed any issue in practice? Or can you show a diagram
> that points a sequence of concurrent steps leading to a problem?
> 
>>
>> 2. The memory barriers used in btrfs_inode_(block|resume)_unlocked_dio
>> are not really paired with the reader side - the test_bit in
>> btrfs_direct_IO, since the latter is missing a memory barrier.
> 
> Which memory barrier is missing?
> Both btrfs_inode_block_unlocked_dio() and
> btrfs_inode_resume_unlocked_dio() have memory barriers, and I think
> they are even not needed.

I discussed this piece of code with Peter Zijlstra and so if we take the 
2 memory barriers in btrfs_setattr size they *could* be optimised as 
follows: 

1. btrfs_inode_block_unlocked_dio() - it can lose it's smp_mb by virtue
of set_task_state (called from inode_dio_wait->prepare_to_wait) already including a full 
smp_mb. 

2. btrfs_inode_resume_unlocked_dio() - the clear bit in that function can be 
replaced by clear_bit_unlock. 

However, as you can see those barriers only order the setting of the 
flag respective to inode_dio_wait, they have no implication about 
when those writes are going to be seen by the test_bit in btrfs_direct_IO.

The situation we must avoid is T1 (truncate) sitting in schedule() and T2
not seeing the BTRFS_INODE_READDIO_NEED_LOCK bit set, thus missing a wake up.


Let alone the following comment of inode_dio_wait:

* Must be called under a lock that serializes taking new references            
* to i_dio_count, usually by inode->i_mutex. 





> Why do you think they are needed?
> 
>> Furthermore,
>> the actual sleeping condition that needs ordering to prevent live-locks/
>> missed wakeups is the modification/read of i_dio_count. So in this case
>> the waker(T2) needs to make the condition false _BEFORE_ doing a test.
>>
>> The interraction between the two threads roughly looks like:
>>
>> T1(truncate):                                    T2(btrfs_direct_IO):
>> set_bit(BTRFS_INODE_READDIO_NEED_LOCK)             if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK))
>> if (atomic_read())                                  if (atomic_dec_and_test(&inode->i_dio_count)
>>   schedule()                                            wake_up_bit
>> clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)
>>
>> Without the ordering between the test_bit in T2 and setting the bit in
>> T1 (due to a missing pairing barrier in T2) it's possible that T1 goes
>> to sleep in schedule and T2 misses the bit set, resulting in missing the
>> wake up.
> 
> set_bit and test_bit are atomic. I don't understand from that sequence
> diagram how the wakeup is missed.

They may be atomic but atomicity says absolutely nothing about ordering. And the 
memory barriers are only on the write side (truncate) and the test_bit on the 
read side (btrfs_diect_IO) doesn't have any code which says anything about 
the order in which i_dio_count is read and test_bis is executed. 

> 
>>
>> In any case all of this is VERY subtle. So fix it by simply making
>> the DIO READ case take inode_lock_shared. This ensure that we can have
>> DIO reads in parallel at the same time we are protected against
>> concurrent modification of the target file. This way we closely mimic
>> what ext4 codes does and simplify this mess.
>>
>> Multiple xfstest runs didn't show any regressions.
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> ---
>>  fs/btrfs/btrfs_inode.h | 17 -----------------
>>  fs/btrfs/inode.c       | 34 ++++++++++++++++++++--------------
>>  2 files changed, 20 insertions(+), 31 deletions(-)
>>
>> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
>> index f527e99c9f8d..3519e49d4ef0 100644
>> --- a/fs/btrfs/btrfs_inode.h
>> +++ b/fs/btrfs/btrfs_inode.h
>> @@ -329,23 +329,6 @@ struct btrfs_dio_private {
>>                         blk_status_t);
>>  };
>>
>> -/*
>> - * Disable DIO read nolock optimization, so new dio readers will be forced
>> - * to grab i_mutex. It is used to avoid the endless truncate due to
>> - * nonlocked dio read.
>> - */
>> -static inline void btrfs_inode_block_unlocked_dio(struct btrfs_inode *inode)
>> -{
>> -       set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
>> -       smp_mb();
>> -}
>> -
>> -static inline void btrfs_inode_resume_unlocked_dio(struct btrfs_inode *inode)
>> -{
>> -       smp_mb__before_atomic();
>> -       clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
>> -}
>> -
>>  static inline void btrfs_print_data_csum_error(struct btrfs_inode *inode,
>>                 u64 logical_start, u32 csum, u32 csum_expected, int mirror_num)
>>  {
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 491a7397f6fa..9c43257e6e11 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -5149,10 +5149,13 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
>>                 /* we don't support swapfiles, so vmtruncate shouldn't fail */
>>                 truncate_setsize(inode, newsize);
>>
>> -               /* Disable nonlocked read DIO to avoid the end less truncate */
>> -               btrfs_inode_block_unlocked_dio(BTRFS_I(inode));
>> +               /*
>> +                * Truncate after all in-flight dios are finished, new ones
>> +                * will block on inode_lock. This only matters for AIO requests
>> +                * since DIO READ is performed under inode_shared_lock and
>> +                * write under exclusive lock.
>> +                */
>>                 inode_dio_wait(inode);
>> -               btrfs_inode_resume_unlocked_dio(BTRFS_I(inode));
>>
>>                 ret = btrfs_truncate(inode);
>>                 if (ret && inode->i_nlink) {
>> @@ -8669,15 +8672,12 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>         loff_t offset = iocb->ki_pos;
>>         size_t count = 0;
>>         int flags = 0;
>> -       bool wakeup = true;
>>         bool relock = false;
>>         ssize_t ret;
>>
>>         if (check_direct_IO(fs_info, iter, offset))
>>                 return 0;
>>
>> -       inode_dio_begin(inode);
>> -
>>         /*
>>          * The generic stuff only does filemap_write_and_wait_range, which
>>          * isn't enough if we've written compressed pages to this area, so
>> @@ -8691,6 +8691,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>                                          offset + count - 1);
>>
>>         if (iov_iter_rw(iter) == WRITE) {
>> +
>> +               inode_dio_begin(inode);
>> +
>>                 /*
>>                  * If the write DIO is beyond the EOF, we need update
>>                  * the isize, but it is protected by i_mutex. So we can
>> @@ -8720,11 +8723,13 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>                 dio_data.unsubmitted_oe_range_end = (u64)offset;
>>                 current->journal_info = &dio_data;
>>                 down_read(&BTRFS_I(inode)->dio_sem);
>> -       } else if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
>> -                                    &BTRFS_I(inode)->runtime_flags)) {
>> -               inode_dio_end(inode);
>> -               flags = DIO_LOCKING | DIO_SKIP_HOLES;
>> -               wakeup = false;
>> +       } else {
>> +               /*
>> +                * In DIO READ case locking the inode in shared mode ensures
>> +                * we are protected against parallel writes/truncates
>> +                */
>> +               inode_lock_shared(inode);
>> +               inode_dio_begin(inode);
>>         }
>>
>>         ret = __blockdev_direct_IO(iocb, inode,
>> @@ -8755,10 +8760,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>                         btrfs_delalloc_release_space(inode, data_reserved,
>>                                         offset, count - (size_t)ret);
>>                 btrfs_delalloc_release_extents(BTRFS_I(inode), count);
>> -       }
>> +       } else
>> +               inode_unlock_shared(inode);
>>  out:
>> -       if (wakeup)
>> -               inode_dio_end(inode);
>> +       inode_dio_end(inode);
>> +
>>         if (relock)
>>                 inode_lock(inode);
>>
>> --
>> 2.7.4
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Filipe Manana Feb. 21, 2018, 2:42 p.m. UTC | #6
On Wed, Feb 21, 2018 at 2:15 PM, Nikolay Borisov <nborisov@suse.com> wrote:
>
>
> On 21.02.2018 15:51, Filipe Manana wrote:
>> On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nborisov@suse.com> wrote:
>>> Currently the DIO read cases uses a botched idea from ext4 to ensure
>>> that DIO reads don't race with truncate. The idea is that if we have a
>>> pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn
>>> forces the dio read case to fallback to inode_locking to prevent
>>> read/truncate races. Unfortunately this is subtly broken for at least
>>> 2 reasons:
>>>
>>> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
>>> (for the read case). This means that there is no ordering guarantee
>>> between the invocation of inode_dio_wait and the increment of
>>> i_dio_count in btrfs_direct_IO in the tread case.
>>
>> Also, looking at this changelog, the diff and the code, why is it a
>> problem not calling inode_dio_begin without the inode lock in the dio
>> read path?
>> The truncate path calls inode_dio_wait after setting the bit
>> BTRFS_INODE_READDIO_NEED_LOCK and before clearing it.
>> Assuming the functions to set and clear that bit are correct, I don't
>> see what problem this brings.
>
> Assume you have a truncate and a dio READ in parallel. So the following
> execution is possible:
>
> T1:                                                           T2:
> btrfs_setattr
>  set_bit(BTRFS_INODE_READDIO_NEED_LOCK)
>  inode_dio_wait (reads i_dio_count)                        btrfs_direct_IO
>  clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)                  inode_dio_begin (inc's i_dio_count)
>
> Since we have no ordering between beginning a dio and waiting for it then
> truncate can assume there isn't any pending dio. At the same time
> btrfs_direct_IO will increment i_dio_count but won't see BTRFS_INODE_READDIO_NEED_LOCK
> ever being set and so will proceed servicing the read.

So what you are saying, is that you are concerned with a dio read
starting after clearing the BTRFS_INODE_READDIO_NEED_LOCK.
I don't think that is a problem, because the truncate path has already
started a transaction before, which means blocks/extents deallocated
by the truncation can not be reused and allocated to other inodes or
the same inode (only after the transaction is committed).

And considering that, commit 2e60a51e62185cce48758e596ae7cb2da673b58f
("Btrfs: serialize unlocked dio reads with truncate"), which
introduced all this protection logic, is completely bogus. Looking at
its changelog:

    Btrfs: serialize unlocked dio reads with truncate

    Currently, we can do unlocked dio reads, but the following race
    is possible:

    dio_read_task                   truncate_task
                                    ->btrfs_setattr()
    ->btrfs_direct_IO
        ->__blockdev_direct_IO
          ->btrfs_get_block
                                      ->btrfs_truncate()
                                     #alloc truncated blocks
                                     #to other inode
          ->submit_io()
         #INFORMATION LEAK

    In order to avoid this problem, we must serialize unlocked dio reads with
    truncate. There are two approaches:
    - use extent lock to protect the extent that we truncate
    - use inode_dio_wait() to make sure the truncating task will wait for
      the read DIO.

    If we use the 1st one, we will meet the endless truncation problem due to
    the nonlocked read DIO after we implement the nonlocked write DIO. It is
    because we still need invoke inode_dio_wait() avoid the race between write
    DIO and truncation. By that time, we have to introduce

      btrfs_inode_{block, resume}_nolock_dio()

    again. That is we have to implement this patch again, so I choose the 2nd
    way to fix the problem.

It's concerned with extents deallocated during the truncate operation
being leaked through concurrent reads from other inodes that got that
those extents allocated to them in the meanwhile (and the dio reads
complete after the re-allocations and before the extents get written
with new data) - but that can't happen because truncate is holding a
transaction open. Further all that code that it introduced, can only
prevent concurrent reads from the same inode, not from other inodes.
So I think that commit does absolutely nothing and we should revert
it.

>
>
>>
>> Have you observed any issue in practice? Or can you show a diagram
>> that points a sequence of concurrent steps leading to a problem?
>>
>>>
>>> 2. The memory barriers used in btrfs_inode_(block|resume)_unlocked_dio
>>> are not really paired with the reader side - the test_bit in
>>> btrfs_direct_IO, since the latter is missing a memory barrier.
>>
>> Which memory barrier is missing?
>> Both btrfs_inode_block_unlocked_dio() and
>> btrfs_inode_resume_unlocked_dio() have memory barriers, and I think
>> they are even not needed.
>
> I discussed this piece of code with Peter Zijlstra and so if we take the
> 2 memory barriers in btrfs_setattr size they *could* be optimised as
> follows:
>
> 1. btrfs_inode_block_unlocked_dio() - it can lose it's smp_mb by virtue
> of set_task_state (called from inode_dio_wait->prepare_to_wait) already including a full
> smp_mb.
>
> 2. btrfs_inode_resume_unlocked_dio() - the clear bit in that function can be
> replaced by clear_bit_unlock.
>
> However, as you can see those barriers only order the setting of the
> flag respective to inode_dio_wait, they have no implication about
> when those writes are going to be seen by the test_bit in btrfs_direct_IO.
>
> The situation we must avoid is T1 (truncate) sitting in schedule() and T2
> not seeing the BTRFS_INODE_READDIO_NEED_LOCK bit set, thus missing a wake up.
>
>
> Let alone the following comment of inode_dio_wait:
>
> * Must be called under a lock that serializes taking new references
> * to i_dio_count, usually by inode->i_mutex.
>
>
>
>
>
>> Why do you think they are needed?
>>
>>> Furthermore,
>>> the actual sleeping condition that needs ordering to prevent live-locks/
>>> missed wakeups is the modification/read of i_dio_count. So in this case
>>> the waker(T2) needs to make the condition false _BEFORE_ doing a test.
>>>
>>> The interraction between the two threads roughly looks like:
>>>
>>> T1(truncate):                                    T2(btrfs_direct_IO):
>>> set_bit(BTRFS_INODE_READDIO_NEED_LOCK)             if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK))
>>> if (atomic_read())                                  if (atomic_dec_and_test(&inode->i_dio_count)
>>>   schedule()                                            wake_up_bit
>>> clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)
>>>
>>> Without the ordering between the test_bit in T2 and setting the bit in
>>> T1 (due to a missing pairing barrier in T2) it's possible that T1 goes
>>> to sleep in schedule and T2 misses the bit set, resulting in missing the
>>> wake up.
>>
>> set_bit and test_bit are atomic. I don't understand from that sequence
>> diagram how the wakeup is missed.
>
> They may be atomic but atomicity says absolutely nothing about ordering. And the
> memory barriers are only on the write side (truncate) and the test_bit on the
> read side (btrfs_diect_IO) doesn't have any code which says anything about
> the order in which i_dio_count is read and test_bis is executed.
>
>>
>>>
>>> In any case all of this is VERY subtle. So fix it by simply making
>>> the DIO READ case take inode_lock_shared. This ensure that we can have
>>> DIO reads in parallel at the same time we are protected against
>>> concurrent modification of the target file. This way we closely mimic
>>> what ext4 codes does and simplify this mess.
>>>
>>> Multiple xfstest runs didn't show any regressions.
>>>
>>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>>> ---
>>>  fs/btrfs/btrfs_inode.h | 17 -----------------
>>>  fs/btrfs/inode.c       | 34 ++++++++++++++++++++--------------
>>>  2 files changed, 20 insertions(+), 31 deletions(-)
>>>
>>> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
>>> index f527e99c9f8d..3519e49d4ef0 100644
>>> --- a/fs/btrfs/btrfs_inode.h
>>> +++ b/fs/btrfs/btrfs_inode.h
>>> @@ -329,23 +329,6 @@ struct btrfs_dio_private {
>>>                         blk_status_t);
>>>  };
>>>
>>> -/*
>>> - * Disable DIO read nolock optimization, so new dio readers will be forced
>>> - * to grab i_mutex. It is used to avoid the endless truncate due to
>>> - * nonlocked dio read.
>>> - */
>>> -static inline void btrfs_inode_block_unlocked_dio(struct btrfs_inode *inode)
>>> -{
>>> -       set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
>>> -       smp_mb();
>>> -}
>>> -
>>> -static inline void btrfs_inode_resume_unlocked_dio(struct btrfs_inode *inode)
>>> -{
>>> -       smp_mb__before_atomic();
>>> -       clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
>>> -}
>>> -
>>>  static inline void btrfs_print_data_csum_error(struct btrfs_inode *inode,
>>>                 u64 logical_start, u32 csum, u32 csum_expected, int mirror_num)
>>>  {
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index 491a7397f6fa..9c43257e6e11 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -5149,10 +5149,13 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
>>>                 /* we don't support swapfiles, so vmtruncate shouldn't fail */
>>>                 truncate_setsize(inode, newsize);
>>>
>>> -               /* Disable nonlocked read DIO to avoid the end less truncate */
>>> -               btrfs_inode_block_unlocked_dio(BTRFS_I(inode));
>>> +               /*
>>> +                * Truncate after all in-flight dios are finished, new ones
>>> +                * will block on inode_lock. This only matters for AIO requests
>>> +                * since DIO READ is performed under inode_shared_lock and
>>> +                * write under exclusive lock.
>>> +                */
>>>                 inode_dio_wait(inode);
>>> -               btrfs_inode_resume_unlocked_dio(BTRFS_I(inode));
>>>
>>>                 ret = btrfs_truncate(inode);
>>>                 if (ret && inode->i_nlink) {
>>> @@ -8669,15 +8672,12 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>>         loff_t offset = iocb->ki_pos;
>>>         size_t count = 0;
>>>         int flags = 0;
>>> -       bool wakeup = true;
>>>         bool relock = false;
>>>         ssize_t ret;
>>>
>>>         if (check_direct_IO(fs_info, iter, offset))
>>>                 return 0;
>>>
>>> -       inode_dio_begin(inode);
>>> -
>>>         /*
>>>          * The generic stuff only does filemap_write_and_wait_range, which
>>>          * isn't enough if we've written compressed pages to this area, so
>>> @@ -8691,6 +8691,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>>                                          offset + count - 1);
>>>
>>>         if (iov_iter_rw(iter) == WRITE) {
>>> +
>>> +               inode_dio_begin(inode);
>>> +
>>>                 /*
>>>                  * If the write DIO is beyond the EOF, we need update
>>>                  * the isize, but it is protected by i_mutex. So we can
>>> @@ -8720,11 +8723,13 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>>                 dio_data.unsubmitted_oe_range_end = (u64)offset;
>>>                 current->journal_info = &dio_data;
>>>                 down_read(&BTRFS_I(inode)->dio_sem);
>>> -       } else if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
>>> -                                    &BTRFS_I(inode)->runtime_flags)) {
>>> -               inode_dio_end(inode);
>>> -               flags = DIO_LOCKING | DIO_SKIP_HOLES;
>>> -               wakeup = false;
>>> +       } else {
>>> +               /*
>>> +                * In DIO READ case locking the inode in shared mode ensures
>>> +                * we are protected against parallel writes/truncates
>>> +                */
>>> +               inode_lock_shared(inode);
>>> +               inode_dio_begin(inode);
>>>         }
>>>
>>>         ret = __blockdev_direct_IO(iocb, inode,
>>> @@ -8755,10 +8760,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>>                         btrfs_delalloc_release_space(inode, data_reserved,
>>>                                         offset, count - (size_t)ret);
>>>                 btrfs_delalloc_release_extents(BTRFS_I(inode), count);
>>> -       }
>>> +       } else
>>> +               inode_unlock_shared(inode);
>>>  out:
>>> -       if (wakeup)
>>> -               inode_dio_end(inode);
>>> +       inode_dio_end(inode);
>>> +
>>>         if (relock)
>>>                 inode_lock(inode);
>>>
>>> --
>>> 2.7.4
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
Liu Bo Feb. 21, 2018, 6:14 p.m. UTC | #7
On Wed, Feb 21, 2018 at 04:15:38PM +0200, Nikolay Borisov wrote:
> 
> 
> On 21.02.2018 15:51, Filipe Manana wrote:
> > On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nborisov@suse.com> wrote:
> >> Currently the DIO read cases uses a botched idea from ext4 to ensure
> >> that DIO reads don't race with truncate. The idea is that if we have a
> >> pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn
> >> forces the dio read case to fallback to inode_locking to prevent
> >> read/truncate races. Unfortunately this is subtly broken for at least
> >> 2 reasons:
> >>
> >> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
> >> (for the read case). This means that there is no ordering guarantee
> >> between the invocation of inode_dio_wait and the increment of
> >> i_dio_count in btrfs_direct_IO in the tread case.
> > 
> > Also, looking at this changelog, the diff and the code, why is it a
> > problem not calling inode_dio_begin without the inode lock in the dio
> > read path?
> > The truncate path calls inode_dio_wait after setting the bit
> > BTRFS_INODE_READDIO_NEED_LOCK and before clearing it.
> > Assuming the functions to set and clear that bit are correct, I don't
> > see what problem this brings.
> 
> Assume you have a truncate and a dio READ in parallel. So the following 
> execution is possible: 
>  
> T1:                                                           T2:                                                                              
> btrfs_setattr                                            
>  set_bit(BTRFS_INODE_READDIO_NEED_LOCK)
>  inode_dio_wait (reads i_dio_count)                        btrfs_direct_IO   						 
>  clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)                  inode_dio_begin (inc's i_dio_count) 
> 
> Since we have no ordering between beginning a dio and waiting for it then 
> truncate can assume there isn't any pending dio. At the same time 
> btrfs_direct_IO will increment i_dio_count but won't see BTRFS_INODE_READDIO_NEED_LOCK
> ever being set and so will proceed servicing the read.  
>

->setattr here has truncated the inode's isize and
do_blockdev_direct_IO() then checks if the offset to read from is
within the isize, if true, dio read is fine with this truncate, if
not, it returns.

Apart from this, do you have other concerns?

thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo Feb. 21, 2018, 6:28 p.m. UTC | #8
On Wed, Feb 21, 2018 at 02:42:08PM +0000, Filipe Manana wrote:
> On Wed, Feb 21, 2018 at 2:15 PM, Nikolay Borisov <nborisov@suse.com> wrote:
> >
> >
> > On 21.02.2018 15:51, Filipe Manana wrote:
> >> On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nborisov@suse.com> wrote:
> >>> Currently the DIO read cases uses a botched idea from ext4 to ensure
> >>> that DIO reads don't race with truncate. The idea is that if we have a
> >>> pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn
> >>> forces the dio read case to fallback to inode_locking to prevent
> >>> read/truncate races. Unfortunately this is subtly broken for at least
> >>> 2 reasons:
> >>>
> >>> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
> >>> (for the read case). This means that there is no ordering guarantee
> >>> between the invocation of inode_dio_wait and the increment of
> >>> i_dio_count in btrfs_direct_IO in the tread case.
> >>
> >> Also, looking at this changelog, the diff and the code, why is it a
> >> problem not calling inode_dio_begin without the inode lock in the dio
> >> read path?
> >> The truncate path calls inode_dio_wait after setting the bit
> >> BTRFS_INODE_READDIO_NEED_LOCK and before clearing it.
> >> Assuming the functions to set and clear that bit are correct, I don't
> >> see what problem this brings.
> >
> > Assume you have a truncate and a dio READ in parallel. So the following
> > execution is possible:
> >
> > T1:                                                           T2:
> > btrfs_setattr
> >  set_bit(BTRFS_INODE_READDIO_NEED_LOCK)
> >  inode_dio_wait (reads i_dio_count)                        btrfs_direct_IO
> >  clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)                  inode_dio_begin (inc's i_dio_count)
> >
> > Since we have no ordering between beginning a dio and waiting for it then
> > truncate can assume there isn't any pending dio. At the same time
> > btrfs_direct_IO will increment i_dio_count but won't see BTRFS_INODE_READDIO_NEED_LOCK
> > ever being set and so will proceed servicing the read.
> 
> So what you are saying, is that you are concerned with a dio read
> starting after clearing the BTRFS_INODE_READDIO_NEED_LOCK.
> I don't think that is a problem, because the truncate path has already
> started a transaction before, which means blocks/extents deallocated
> by the truncation can not be reused and allocated to other inodes or
> the same inode (only after the transaction is committed).
> 
> And considering that, commit 2e60a51e62185cce48758e596ae7cb2da673b58f
> ("Btrfs: serialize unlocked dio reads with truncate"), which
> introduced all this protection logic, is completely bogus. Looking at
> its changelog:
> 
>     Btrfs: serialize unlocked dio reads with truncate
> 
>     Currently, we can do unlocked dio reads, but the following race
>     is possible:
> 
>     dio_read_task                   truncate_task
>                                     ->btrfs_setattr()
>     ->btrfs_direct_IO
>         ->__blockdev_direct_IO
>           ->btrfs_get_block
>                                       ->btrfs_truncate()
>                                      #alloc truncated blocks
>                                      #to other inode
>           ->submit_io()
>          #INFORMATION LEAK
> 
>     In order to avoid this problem, we must serialize unlocked dio reads with
>     truncate. There are two approaches:
>     - use extent lock to protect the extent that we truncate
>     - use inode_dio_wait() to make sure the truncating task will wait for
>       the read DIO.
> 
>     If we use the 1st one, we will meet the endless truncation problem due to
>     the nonlocked read DIO after we implement the nonlocked write DIO. It is
>     because we still need invoke inode_dio_wait() avoid the race between write
>     DIO and truncation. By that time, we have to introduce
> 
>       btrfs_inode_{block, resume}_nolock_dio()
> 
>     again. That is we have to implement this patch again, so I choose the 2nd
>     way to fix the problem.
> 
> It's concerned with extents deallocated during the truncate operation
> being leaked through concurrent reads from other inodes that got that
> those extents allocated to them in the meanwhile (and the dio reads
> complete after the re-allocations and before the extents get written
> with new data) - but that can't happen because truncate is holding a
> transaction open. Further all that code that it introduced, can only
> prevent concurrent reads from the same inode, not from other inodes.
> So I think that commit does absolutely nothing and we should revert
> it.
>

Well...make sense, but still dio read can read stale data past isize
if this inode_dio_wait() is removed.

thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nikolay Borisov Feb. 21, 2018, 6:38 p.m. UTC | #9
On 21.02.2018 20:28, Liu Bo wrote:
> On Wed, Feb 21, 2018 at 02:42:08PM +0000, Filipe Manana wrote:
>> On Wed, Feb 21, 2018 at 2:15 PM, Nikolay Borisov <nborisov@suse.com> wrote:
>>>
>>>
>>> On 21.02.2018 15:51, Filipe Manana wrote:
>>>> On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nborisov@suse.com> wrote:
>>>>> Currently the DIO read cases uses a botched idea from ext4 to ensure
>>>>> that DIO reads don't race with truncate. The idea is that if we have a
>>>>> pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn
>>>>> forces the dio read case to fallback to inode_locking to prevent
>>>>> read/truncate races. Unfortunately this is subtly broken for at least
>>>>> 2 reasons:
>>>>>
>>>>> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
>>>>> (for the read case). This means that there is no ordering guarantee
>>>>> between the invocation of inode_dio_wait and the increment of
>>>>> i_dio_count in btrfs_direct_IO in the tread case.
>>>>
>>>> Also, looking at this changelog, the diff and the code, why is it a
>>>> problem not calling inode_dio_begin without the inode lock in the dio
>>>> read path?
>>>> The truncate path calls inode_dio_wait after setting the bit
>>>> BTRFS_INODE_READDIO_NEED_LOCK and before clearing it.
>>>> Assuming the functions to set and clear that bit are correct, I don't
>>>> see what problem this brings.
>>>
>>> Assume you have a truncate and a dio READ in parallel. So the following
>>> execution is possible:
>>>
>>> T1:                                                           T2:
>>> btrfs_setattr
>>>  set_bit(BTRFS_INODE_READDIO_NEED_LOCK)
>>>  inode_dio_wait (reads i_dio_count)                        btrfs_direct_IO
>>>  clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)                  inode_dio_begin (inc's i_dio_count)
>>>
>>> Since we have no ordering between beginning a dio and waiting for it then
>>> truncate can assume there isn't any pending dio. At the same time
>>> btrfs_direct_IO will increment i_dio_count but won't see BTRFS_INODE_READDIO_NEED_LOCK
>>> ever being set and so will proceed servicing the read.
>>
>> So what you are saying, is that you are concerned with a dio read
>> starting after clearing the BTRFS_INODE_READDIO_NEED_LOCK.
>> I don't think that is a problem, because the truncate path has already
>> started a transaction before, which means blocks/extents deallocated
>> by the truncation can not be reused and allocated to other inodes or
>> the same inode (only after the transaction is committed).
>>
>> And considering that, commit 2e60a51e62185cce48758e596ae7cb2da673b58f
>> ("Btrfs: serialize unlocked dio reads with truncate"), which
>> introduced all this protection logic, is completely bogus. Looking at
>> its changelog:
>>
>>     Btrfs: serialize unlocked dio reads with truncate
>>
>>     Currently, we can do unlocked dio reads, but the following race
>>     is possible:
>>
>>     dio_read_task                   truncate_task
>>                                     ->btrfs_setattr()
>>     ->btrfs_direct_IO
>>         ->__blockdev_direct_IO
>>           ->btrfs_get_block
>>                                       ->btrfs_truncate()
>>                                      #alloc truncated blocks
>>                                      #to other inode
>>           ->submit_io()
>>          #INFORMATION LEAK
>>
>>     In order to avoid this problem, we must serialize unlocked dio reads with
>>     truncate. There are two approaches:
>>     - use extent lock to protect the extent that we truncate
>>     - use inode_dio_wait() to make sure the truncating task will wait for
>>       the read DIO.
>>
>>     If we use the 1st one, we will meet the endless truncation problem due to
>>     the nonlocked read DIO after we implement the nonlocked write DIO. It is
>>     because we still need invoke inode_dio_wait() avoid the race between write
>>     DIO and truncation. By that time, we have to introduce
>>
>>       btrfs_inode_{block, resume}_nolock_dio()
>>
>>     again. That is we have to implement this patch again, so I choose the 2nd
>>     way to fix the problem.
>>
>> It's concerned with extents deallocated during the truncate operation
>> being leaked through concurrent reads from other inodes that got that
>> those extents allocated to them in the meanwhile (and the dio reads
>> complete after the re-allocations and before the extents get written
>> with new data) - but that can't happen because truncate is holding a
>> transaction open. Further all that code that it introduced, can only
>> prevent concurrent reads from the same inode, not from other inodes.
>> So I think that commit does absolutely nothing and we should revert
>> it.
>>
> 
> Well...make sense, but still dio read can read stale data past isize
> if this inode_dio_wait() is removed.

inode_dio_wait not being synchronized at all with inode_dio_begin in dio
read case means bad stuff can happen anyway.

FWIW I'm very much in favor of reverting Miao's patch.

> 
> thanks,
> 
> -liubo
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Filipe Manana Feb. 21, 2018, 7:05 p.m. UTC | #10
On Wed, Feb 21, 2018 at 6:28 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Wed, Feb 21, 2018 at 02:42:08PM +0000, Filipe Manana wrote:
>> On Wed, Feb 21, 2018 at 2:15 PM, Nikolay Borisov <nborisov@suse.com> wrote:
>> >
>> >
>> > On 21.02.2018 15:51, Filipe Manana wrote:
>> >> On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nborisov@suse.com> wrote:
>> >>> Currently the DIO read cases uses a botched idea from ext4 to ensure
>> >>> that DIO reads don't race with truncate. The idea is that if we have a
>> >>> pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn
>> >>> forces the dio read case to fallback to inode_locking to prevent
>> >>> read/truncate races. Unfortunately this is subtly broken for at least
>> >>> 2 reasons:
>> >>>
>> >>> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
>> >>> (for the read case). This means that there is no ordering guarantee
>> >>> between the invocation of inode_dio_wait and the increment of
>> >>> i_dio_count in btrfs_direct_IO in the tread case.
>> >>
>> >> Also, looking at this changelog, the diff and the code, why is it a
>> >> problem not calling inode_dio_begin without the inode lock in the dio
>> >> read path?
>> >> The truncate path calls inode_dio_wait after setting the bit
>> >> BTRFS_INODE_READDIO_NEED_LOCK and before clearing it.
>> >> Assuming the functions to set and clear that bit are correct, I don't
>> >> see what problem this brings.
>> >
>> > Assume you have a truncate and a dio READ in parallel. So the following
>> > execution is possible:
>> >
>> > T1:                                                           T2:
>> > btrfs_setattr
>> >  set_bit(BTRFS_INODE_READDIO_NEED_LOCK)
>> >  inode_dio_wait (reads i_dio_count)                        btrfs_direct_IO
>> >  clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)                  inode_dio_begin (inc's i_dio_count)
>> >
>> > Since we have no ordering between beginning a dio and waiting for it then
>> > truncate can assume there isn't any pending dio. At the same time
>> > btrfs_direct_IO will increment i_dio_count but won't see BTRFS_INODE_READDIO_NEED_LOCK
>> > ever being set and so will proceed servicing the read.
>>
>> So what you are saying, is that you are concerned with a dio read
>> starting after clearing the BTRFS_INODE_READDIO_NEED_LOCK.
>> I don't think that is a problem, because the truncate path has already
>> started a transaction before, which means blocks/extents deallocated
>> by the truncation can not be reused and allocated to other inodes or
>> the same inode (only after the transaction is committed).
>>
>> And considering that, commit 2e60a51e62185cce48758e596ae7cb2da673b58f
>> ("Btrfs: serialize unlocked dio reads with truncate"), which
>> introduced all this protection logic, is completely bogus. Looking at
>> its changelog:
>>
>>     Btrfs: serialize unlocked dio reads with truncate
>>
>>     Currently, we can do unlocked dio reads, but the following race
>>     is possible:
>>
>>     dio_read_task                   truncate_task
>>                                     ->btrfs_setattr()
>>     ->btrfs_direct_IO
>>         ->__blockdev_direct_IO
>>           ->btrfs_get_block
>>                                       ->btrfs_truncate()
>>                                      #alloc truncated blocks
>>                                      #to other inode
>>           ->submit_io()
>>          #INFORMATION LEAK
>>
>>     In order to avoid this problem, we must serialize unlocked dio reads with
>>     truncate. There are two approaches:
>>     - use extent lock to protect the extent that we truncate
>>     - use inode_dio_wait() to make sure the truncating task will wait for
>>       the read DIO.
>>
>>     If we use the 1st one, we will meet the endless truncation problem due to
>>     the nonlocked read DIO after we implement the nonlocked write DIO. It is
>>     because we still need invoke inode_dio_wait() avoid the race between write
>>     DIO and truncation. By that time, we have to introduce
>>
>>       btrfs_inode_{block, resume}_nolock_dio()
>>
>>     again. That is we have to implement this patch again, so I choose the 2nd
>>     way to fix the problem.
>>
>> It's concerned with extents deallocated during the truncate operation
>> being leaked through concurrent reads from other inodes that got that
>> those extents allocated to them in the meanwhile (and the dio reads
>> complete after the re-allocations and before the extents get written
>> with new data) - but that can't happen because truncate is holding a
>> transaction open. Further all that code that it introduced, can only
>> prevent concurrent reads from the same inode, not from other inodes.
>> So I think that commit does absolutely nothing and we should revert
>> it.
>>
>
> Well...make sense, but still dio read can read stale data past isize
> if this inode_dio_wait() is removed.

Yes, the inode_dio_wait() would remain, to prevent a dio read from
submitting the bio before truncate drops an extent and the bio finish
after the transaction from truncate commits (at which point the pinned
extents could have been allocated for someone else and be partially,
fully rewritten or discarded). All that stuff with the bit
BTRFS_INODE_READDIO_NEED_LOCK would go away.
If the transaction commits after the dio read, then everything is fine
as for the cases where it reads data past the isize set by truncate,
that data is preserved since the dropped extents are pinned, there's
no chance for the application to read partial contents or garbage from
the dropped extents.

>
> thanks,
>
> -liubo
Liu Bo Feb. 21, 2018, 10:38 p.m. UTC | #11
On Wed, Feb 21, 2018 at 07:05:13PM +0000, Filipe Manana wrote:
> On Wed, Feb 21, 2018 at 6:28 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> > On Wed, Feb 21, 2018 at 02:42:08PM +0000, Filipe Manana wrote:
> >> On Wed, Feb 21, 2018 at 2:15 PM, Nikolay Borisov <nborisov@suse.com> wrote:
> >> >
> >> >
> >> > On 21.02.2018 15:51, Filipe Manana wrote:
> >> >> On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nborisov@suse.com> wrote:
> >> >>> Currently the DIO read cases uses a botched idea from ext4 to ensure
> >> >>> that DIO reads don't race with truncate. The idea is that if we have a
> >> >>> pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn
> >> >>> forces the dio read case to fallback to inode_locking to prevent
> >> >>> read/truncate races. Unfortunately this is subtly broken for at least
> >> >>> 2 reasons:
> >> >>>
> >> >>> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
> >> >>> (for the read case). This means that there is no ordering guarantee
> >> >>> between the invocation of inode_dio_wait and the increment of
> >> >>> i_dio_count in btrfs_direct_IO in the tread case.
> >> >>
> >> >> Also, looking at this changelog, the diff and the code, why is it a
> >> >> problem not calling inode_dio_begin without the inode lock in the dio
> >> >> read path?
> >> >> The truncate path calls inode_dio_wait after setting the bit
> >> >> BTRFS_INODE_READDIO_NEED_LOCK and before clearing it.
> >> >> Assuming the functions to set and clear that bit are correct, I don't
> >> >> see what problem this brings.
> >> >
> >> > Assume you have a truncate and a dio READ in parallel. So the following
> >> > execution is possible:
> >> >
> >> > T1:                                                           T2:
> >> > btrfs_setattr
> >> >  set_bit(BTRFS_INODE_READDIO_NEED_LOCK)
> >> >  inode_dio_wait (reads i_dio_count)                        btrfs_direct_IO
> >> >  clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)                  inode_dio_begin (inc's i_dio_count)
> >> >
> >> > Since we have no ordering between beginning a dio and waiting for it then
> >> > truncate can assume there isn't any pending dio. At the same time
> >> > btrfs_direct_IO will increment i_dio_count but won't see BTRFS_INODE_READDIO_NEED_LOCK
> >> > ever being set and so will proceed servicing the read.
> >>
> >> So what you are saying, is that you are concerned with a dio read
> >> starting after clearing the BTRFS_INODE_READDIO_NEED_LOCK.
> >> I don't think that is a problem, because the truncate path has already
> >> started a transaction before, which means blocks/extents deallocated
> >> by the truncation can not be reused and allocated to other inodes or
> >> the same inode (only after the transaction is committed).
> >>
> >> And considering that, commit 2e60a51e62185cce48758e596ae7cb2da673b58f
> >> ("Btrfs: serialize unlocked dio reads with truncate"), which
> >> introduced all this protection logic, is completely bogus. Looking at
> >> its changelog:
> >>
> >>     Btrfs: serialize unlocked dio reads with truncate
> >>
> >>     Currently, we can do unlocked dio reads, but the following race
> >>     is possible:
> >>
> >>     dio_read_task                   truncate_task
> >>                                     ->btrfs_setattr()
> >>     ->btrfs_direct_IO
> >>         ->__blockdev_direct_IO
> >>           ->btrfs_get_block
> >>                                       ->btrfs_truncate()
> >>                                      #alloc truncated blocks
> >>                                      #to other inode
> >>           ->submit_io()
> >>          #INFORMATION LEAK
> >>
> >>     In order to avoid this problem, we must serialize unlocked dio reads with
> >>     truncate. There are two approaches:
> >>     - use extent lock to protect the extent that we truncate
> >>     - use inode_dio_wait() to make sure the truncating task will wait for
> >>       the read DIO.
> >>
> >>     If we use the 1st one, we will meet the endless truncation problem due to
> >>     the nonlocked read DIO after we implement the nonlocked write DIO. It is
> >>     because we still need invoke inode_dio_wait() avoid the race between write
> >>     DIO and truncation. By that time, we have to introduce
> >>
> >>       btrfs_inode_{block, resume}_nolock_dio()
> >>
> >>     again. That is we have to implement this patch again, so I choose the 2nd
> >>     way to fix the problem.
> >>
> >> It's concerned with extents deallocated during the truncate operation
> >> being leaked through concurrent reads from other inodes that got that
> >> those extents allocated to them in the meanwhile (and the dio reads
> >> complete after the re-allocations and before the extents get written
> >> with new data) - but that can't happen because truncate is holding a
> >> transaction open. Further all that code that it introduced, can only
> >> prevent concurrent reads from the same inode, not from other inodes.
> >> So I think that commit does absolutely nothing and we should revert
> >> it.
> >>
> >
> > Well...make sense, but still dio read can read stale data past isize
> > if this inode_dio_wait() is removed.
> 
> Yes, the inode_dio_wait() would remain, to prevent a dio read from
> submitting the bio before truncate drops an extent and the bio finish
> after the transaction from truncate commits (at which point the pinned
> extents could have been allocated for someone else and be partially,
> fully rewritten or discarded). All that stuff with the bit
> BTRFS_INODE_READDIO_NEED_LOCK would go away.

The commit description doesn't point it out but the code has the
necessary comment, BTRFS_INODE_READDIO_NEED_LOCK is used to prevent a
livelock if there are enough agreesive dio readers rushing in.

> If the transaction commits after the dio read, then everything is fine
> as for the cases where it reads data past the isize set by truncate,
> that data is preserved since the dropped extents are pinned, there's
> no chance for the application to read partial contents or garbage from
> the dropped extents.

Not even that far, isize is truncated before calling inode_dio_wait()
and a memory barrier is set to ensure the correct order, so dio read
would simply return if it's reading past isize.

The code is very subtle, but so far it looks reasonable to me.

thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nikolay Borisov Feb. 22, 2018, 6:49 a.m. UTC | #12
On 22.02.2018 00:38, Liu Bo wrote:
> On Wed, Feb 21, 2018 at 07:05:13PM +0000, Filipe Manana wrote:
>> On Wed, Feb 21, 2018 at 6:28 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
>>> On Wed, Feb 21, 2018 at 02:42:08PM +0000, Filipe Manana wrote:
>>>> On Wed, Feb 21, 2018 at 2:15 PM, Nikolay Borisov <nborisov@suse.com> wrote:
>>>>>
>>>>>
>>>>> On 21.02.2018 15:51, Filipe Manana wrote:
>>>>>> On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nborisov@suse.com> wrote:
>>>>>>> Currently the DIO read cases uses a botched idea from ext4 to ensure
>>>>>>> that DIO reads don't race with truncate. The idea is that if we have a
>>>>>>> pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn
>>>>>>> forces the dio read case to fallback to inode_locking to prevent
>>>>>>> read/truncate races. Unfortunately this is subtly broken for at least
>>>>>>> 2 reasons:
>>>>>>>
>>>>>>> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
>>>>>>> (for the read case). This means that there is no ordering guarantee
>>>>>>> between the invocation of inode_dio_wait and the increment of
>>>>>>> i_dio_count in btrfs_direct_IO in the tread case.
>>>>>>
>>>>>> Also, looking at this changelog, the diff and the code, why is it a
>>>>>> problem not calling inode_dio_begin without the inode lock in the dio
>>>>>> read path?
>>>>>> The truncate path calls inode_dio_wait after setting the bit
>>>>>> BTRFS_INODE_READDIO_NEED_LOCK and before clearing it.
>>>>>> Assuming the functions to set and clear that bit are correct, I don't
>>>>>> see what problem this brings.
>>>>>
>>>>> Assume you have a truncate and a dio READ in parallel. So the following
>>>>> execution is possible:
>>>>>
>>>>> T1:                                                           T2:
>>>>> btrfs_setattr
>>>>>  set_bit(BTRFS_INODE_READDIO_NEED_LOCK)
>>>>>  inode_dio_wait (reads i_dio_count)                        btrfs_direct_IO
>>>>>  clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)                  inode_dio_begin (inc's i_dio_count)
>>>>>
>>>>> Since we have no ordering between beginning a dio and waiting for it then
>>>>> truncate can assume there isn't any pending dio. At the same time
>>>>> btrfs_direct_IO will increment i_dio_count but won't see BTRFS_INODE_READDIO_NEED_LOCK
>>>>> ever being set and so will proceed servicing the read.
>>>>
>>>> So what you are saying, is that you are concerned with a dio read
>>>> starting after clearing the BTRFS_INODE_READDIO_NEED_LOCK.
>>>> I don't think that is a problem, because the truncate path has already
>>>> started a transaction before, which means blocks/extents deallocated
>>>> by the truncation can not be reused and allocated to other inodes or
>>>> the same inode (only after the transaction is committed).
>>>>
>>>> And considering that, commit 2e60a51e62185cce48758e596ae7cb2da673b58f
>>>> ("Btrfs: serialize unlocked dio reads with truncate"), which
>>>> introduced all this protection logic, is completely bogus. Looking at
>>>> its changelog:
>>>>
>>>>     Btrfs: serialize unlocked dio reads with truncate
>>>>
>>>>     Currently, we can do unlocked dio reads, but the following race
>>>>     is possible:
>>>>
>>>>     dio_read_task                   truncate_task
>>>>                                     ->btrfs_setattr()
>>>>     ->btrfs_direct_IO
>>>>         ->__blockdev_direct_IO
>>>>           ->btrfs_get_block
>>>>                                       ->btrfs_truncate()
>>>>                                      #alloc truncated blocks
>>>>                                      #to other inode
>>>>           ->submit_io()
>>>>          #INFORMATION LEAK
>>>>
>>>>     In order to avoid this problem, we must serialize unlocked dio reads with
>>>>     truncate. There are two approaches:
>>>>     - use extent lock to protect the extent that we truncate
>>>>     - use inode_dio_wait() to make sure the truncating task will wait for
>>>>       the read DIO.
>>>>
>>>>     If we use the 1st one, we will meet the endless truncation problem due to
>>>>     the nonlocked read DIO after we implement the nonlocked write DIO. It is
>>>>     because we still need invoke inode_dio_wait() avoid the race between write
>>>>     DIO and truncation. By that time, we have to introduce
>>>>
>>>>       btrfs_inode_{block, resume}_nolock_dio()
>>>>
>>>>     again. That is we have to implement this patch again, so I choose the 2nd
>>>>     way to fix the problem.
>>>>
>>>> It's concerned with extents deallocated during the truncate operation
>>>> being leaked through concurrent reads from other inodes that got that
>>>> those extents allocated to them in the meanwhile (and the dio reads
>>>> complete after the re-allocations and before the extents get written
>>>> with new data) - but that can't happen because truncate is holding a
>>>> transaction open. Further all that code that it introduced, can only
>>>> prevent concurrent reads from the same inode, not from other inodes.
>>>> So I think that commit does absolutely nothing and we should revert
>>>> it.
>>>>
>>>
>>> Well...make sense, but still dio read can read stale data past isize
>>> if this inode_dio_wait() is removed.
>>
>> Yes, the inode_dio_wait() would remain, to prevent a dio read from
>> submitting the bio before truncate drops an extent and the bio finish
>> after the transaction from truncate commits (at which point the pinned
>> extents could have been allocated for someone else and be partially,
>> fully rewritten or discarded). All that stuff with the bit
>> BTRFS_INODE_READDIO_NEED_LOCK would go away.
> 
> The commit description doesn't point it out but the code has the
> necessary comment, BTRFS_INODE_READDIO_NEED_LOCK is used to prevent a
> livelock if there are enough agreesive dio readers rushing in.
> 
>> If the transaction commits after the dio read, then everything is fine
>> as for the cases where it reads data past the isize set by truncate,
>> that data is preserved since the dropped extents are pinned, there's
>> no chance for the application to read partial contents or garbage from
>> the dropped extents.
> 
> Not even that far, isize is truncated before calling inode_dio_wait()
> and a memory barrier is set to ensure the correct order, so dio read
> would simply return if it's reading past isize.

Please, describe concretely which meory barriers pairs with chich in
order to ensure proper visibility. Because I'm of the opinion there are
insufficient memoru barriers to ensure setting READDIO_LOCK is properly
visible in btrfs_direct_IO. Since the latter has no barriers whatsoever.

> 
> The code is very subtle, but so far it looks reasonable to me.
> 
> thanks,
> 
> -liubo
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Filipe Manana Feb. 22, 2018, 10:05 a.m. UTC | #13
On Wed, Feb 21, 2018 at 10:38 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Wed, Feb 21, 2018 at 07:05:13PM +0000, Filipe Manana wrote:
>> On Wed, Feb 21, 2018 at 6:28 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
>> > On Wed, Feb 21, 2018 at 02:42:08PM +0000, Filipe Manana wrote:
>> >> On Wed, Feb 21, 2018 at 2:15 PM, Nikolay Borisov <nborisov@suse.com> wrote:
>> >> >
>> >> >
>> >> > On 21.02.2018 15:51, Filipe Manana wrote:
>> >> >> On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nborisov@suse.com> wrote:
>> >> >>> Currently the DIO read cases uses a botched idea from ext4 to ensure
>> >> >>> that DIO reads don't race with truncate. The idea is that if we have a
>> >> >>> pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn
>> >> >>> forces the dio read case to fallback to inode_locking to prevent
>> >> >>> read/truncate races. Unfortunately this is subtly broken for at least
>> >> >>> 2 reasons:
>> >> >>>
>> >> >>> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
>> >> >>> (for the read case). This means that there is no ordering guarantee
>> >> >>> between the invocation of inode_dio_wait and the increment of
>> >> >>> i_dio_count in btrfs_direct_IO in the tread case.
>> >> >>
>> >> >> Also, looking at this changelog, the diff and the code, why is it a
>> >> >> problem not calling inode_dio_begin without the inode lock in the dio
>> >> >> read path?
>> >> >> The truncate path calls inode_dio_wait after setting the bit
>> >> >> BTRFS_INODE_READDIO_NEED_LOCK and before clearing it.
>> >> >> Assuming the functions to set and clear that bit are correct, I don't
>> >> >> see what problem this brings.
>> >> >
>> >> > Assume you have a truncate and a dio READ in parallel. So the following
>> >> > execution is possible:
>> >> >
>> >> > T1:                                                           T2:
>> >> > btrfs_setattr
>> >> >  set_bit(BTRFS_INODE_READDIO_NEED_LOCK)
>> >> >  inode_dio_wait (reads i_dio_count)                        btrfs_direct_IO
>> >> >  clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)                  inode_dio_begin (inc's i_dio_count)
>> >> >
>> >> > Since we have no ordering between beginning a dio and waiting for it then
>> >> > truncate can assume there isn't any pending dio. At the same time
>> >> > btrfs_direct_IO will increment i_dio_count but won't see BTRFS_INODE_READDIO_NEED_LOCK
>> >> > ever being set and so will proceed servicing the read.
>> >>
>> >> So what you are saying, is that you are concerned with a dio read
>> >> starting after clearing the BTRFS_INODE_READDIO_NEED_LOCK.
>> >> I don't think that is a problem, because the truncate path has already
>> >> started a transaction before, which means blocks/extents deallocated
>> >> by the truncation can not be reused and allocated to other inodes or
>> >> the same inode (only after the transaction is committed).
>> >>
>> >> And considering that, commit 2e60a51e62185cce48758e596ae7cb2da673b58f
>> >> ("Btrfs: serialize unlocked dio reads with truncate"), which
>> >> introduced all this protection logic, is completely bogus. Looking at
>> >> its changelog:
>> >>
>> >>     Btrfs: serialize unlocked dio reads with truncate
>> >>
>> >>     Currently, we can do unlocked dio reads, but the following race
>> >>     is possible:
>> >>
>> >>     dio_read_task                   truncate_task
>> >>                                     ->btrfs_setattr()
>> >>     ->btrfs_direct_IO
>> >>         ->__blockdev_direct_IO
>> >>           ->btrfs_get_block
>> >>                                       ->btrfs_truncate()
>> >>                                      #alloc truncated blocks
>> >>                                      #to other inode
>> >>           ->submit_io()
>> >>          #INFORMATION LEAK
>> >>
>> >>     In order to avoid this problem, we must serialize unlocked dio reads with
>> >>     truncate. There are two approaches:
>> >>     - use extent lock to protect the extent that we truncate
>> >>     - use inode_dio_wait() to make sure the truncating task will wait for
>> >>       the read DIO.
>> >>
>> >>     If we use the 1st one, we will meet the endless truncation problem due to
>> >>     the nonlocked read DIO after we implement the nonlocked write DIO. It is
>> >>     because we still need invoke inode_dio_wait() avoid the race between write
>> >>     DIO and truncation. By that time, we have to introduce
>> >>
>> >>       btrfs_inode_{block, resume}_nolock_dio()
>> >>
>> >>     again. That is we have to implement this patch again, so I choose the 2nd
>> >>     way to fix the problem.
>> >>
>> >> It's concerned with extents deallocated during the truncate operation
>> >> being leaked through concurrent reads from other inodes that got that
>> >> those extents allocated to them in the meanwhile (and the dio reads
>> >> complete after the re-allocations and before the extents get written
>> >> with new data) - but that can't happen because truncate is holding a
>> >> transaction open. Further all that code that it introduced, can only
>> >> prevent concurrent reads from the same inode, not from other inodes.
>> >> So I think that commit does absolutely nothing and we should revert
>> >> it.
>> >>
>> >
>> > Well...make sense, but still dio read can read stale data past isize
>> > if this inode_dio_wait() is removed.
>>
>> Yes, the inode_dio_wait() would remain, to prevent a dio read from
>> submitting the bio before truncate drops an extent and the bio finish
>> after the transaction from truncate commits (at which point the pinned
>> extents could have been allocated for someone else and be partially,
>> fully rewritten or discarded). All that stuff with the bit
>> BTRFS_INODE_READDIO_NEED_LOCK would go away.
>
> The commit description doesn't point it out but the code has the
> necessary comment, BTRFS_INODE_READDIO_NEED_LOCK is used to prevent a
> livelock if there are enough agreesive dio readers rushing in.

Now I see it. Well the comment doesn't say why, that's it's due to dio
counter being constantly incremented and never getting down to zero
due to too many parallel dio readers.

>
>> If the transaction commits after the dio read, then everything is fine
>> as for the cases where it reads data past the isize set by truncate,
>> that data is preserved since the dropped extents are pinned, there's
>> no chance for the application to read partial contents or garbage from
>> the dropped extents.
>
> Not even that far, isize is truncated before calling inode_dio_wait()
> and a memory barrier is set to ensure the correct order, so dio read
> would simply return if it's reading past isize.
>
> The code is very subtle, but so far it looks reasonable to me.

To me too now.

The commit's changelog is terrible however, it mentions a problem that
it doesn't solve, of leaking data to dio readers of other inodes - it
can't happen for all the reasons mentioned before. Why the hell is it
there in the changelog?
And there's no information whatsoever about why this "endless
truncation" can happen.

>
> thanks,
>
> -liubo
Liu Bo Feb. 22, 2018, 7:09 p.m. UTC | #14
On Thu, Feb 22, 2018 at 08:49:30AM +0200, Nikolay Borisov wrote:
> 
> 
> On 22.02.2018 00:38, Liu Bo wrote:
> > On Wed, Feb 21, 2018 at 07:05:13PM +0000, Filipe Manana wrote:
> >> On Wed, Feb 21, 2018 at 6:28 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> >>> On Wed, Feb 21, 2018 at 02:42:08PM +0000, Filipe Manana wrote:
> >>>> On Wed, Feb 21, 2018 at 2:15 PM, Nikolay Borisov <nborisov@suse.com> wrote:
> >>>>>
> >>>>>
> >>>>> On 21.02.2018 15:51, Filipe Manana wrote:
> >>>>>> On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nborisov@suse.com> wrote:
> >>>>>>> Currently the DIO read cases uses a botched idea from ext4 to ensure
> >>>>>>> that DIO reads don't race with truncate. The idea is that if we have a
> >>>>>>> pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn
> >>>>>>> forces the dio read case to fallback to inode_locking to prevent
> >>>>>>> read/truncate races. Unfortunately this is subtly broken for at least
> >>>>>>> 2 reasons:
> >>>>>>>
> >>>>>>> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
> >>>>>>> (for the read case). This means that there is no ordering guarantee
> >>>>>>> between the invocation of inode_dio_wait and the increment of
> >>>>>>> i_dio_count in btrfs_direct_IO in the tread case.
> >>>>>>
> >>>>>> Also, looking at this changelog, the diff and the code, why is it a
> >>>>>> problem not calling inode_dio_begin without the inode lock in the dio
> >>>>>> read path?
> >>>>>> The truncate path calls inode_dio_wait after setting the bit
> >>>>>> BTRFS_INODE_READDIO_NEED_LOCK and before clearing it.
> >>>>>> Assuming the functions to set and clear that bit are correct, I don't
> >>>>>> see what problem this brings.
> >>>>>
> >>>>> Assume you have a truncate and a dio READ in parallel. So the following
> >>>>> execution is possible:
> >>>>>
> >>>>> T1:                                                           T2:
> >>>>> btrfs_setattr
> >>>>>  set_bit(BTRFS_INODE_READDIO_NEED_LOCK)
> >>>>>  inode_dio_wait (reads i_dio_count)                        btrfs_direct_IO
> >>>>>  clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)                  inode_dio_begin (inc's i_dio_count)
> >>>>>
> >>>>> Since we have no ordering between beginning a dio and waiting for it then
> >>>>> truncate can assume there isn't any pending dio. At the same time
> >>>>> btrfs_direct_IO will increment i_dio_count but won't see BTRFS_INODE_READDIO_NEED_LOCK
> >>>>> ever being set and so will proceed servicing the read.
> >>>>
> >>>> So what you are saying, is that you are concerned with a dio read
> >>>> starting after clearing the BTRFS_INODE_READDIO_NEED_LOCK.
> >>>> I don't think that is a problem, because the truncate path has already
> >>>> started a transaction before, which means blocks/extents deallocated
> >>>> by the truncation can not be reused and allocated to other inodes or
> >>>> the same inode (only after the transaction is committed).
> >>>>
> >>>> And considering that, commit 2e60a51e62185cce48758e596ae7cb2da673b58f
> >>>> ("Btrfs: serialize unlocked dio reads with truncate"), which
> >>>> introduced all this protection logic, is completely bogus. Looking at
> >>>> its changelog:
> >>>>
> >>>>     Btrfs: serialize unlocked dio reads with truncate
> >>>>
> >>>>     Currently, we can do unlocked dio reads, but the following race
> >>>>     is possible:
> >>>>
> >>>>     dio_read_task                   truncate_task
> >>>>                                     ->btrfs_setattr()
> >>>>     ->btrfs_direct_IO
> >>>>         ->__blockdev_direct_IO
> >>>>           ->btrfs_get_block
> >>>>                                       ->btrfs_truncate()
> >>>>                                      #alloc truncated blocks
> >>>>                                      #to other inode
> >>>>           ->submit_io()
> >>>>          #INFORMATION LEAK
> >>>>
> >>>>     In order to avoid this problem, we must serialize unlocked dio reads with
> >>>>     truncate. There are two approaches:
> >>>>     - use extent lock to protect the extent that we truncate
> >>>>     - use inode_dio_wait() to make sure the truncating task will wait for
> >>>>       the read DIO.
> >>>>
> >>>>     If we use the 1st one, we will meet the endless truncation problem due to
> >>>>     the nonlocked read DIO after we implement the nonlocked write DIO. It is
> >>>>     because we still need invoke inode_dio_wait() avoid the race between write
> >>>>     DIO and truncation. By that time, we have to introduce
> >>>>
> >>>>       btrfs_inode_{block, resume}_nolock_dio()
> >>>>
> >>>>     again. That is we have to implement this patch again, so I choose the 2nd
> >>>>     way to fix the problem.
> >>>>
> >>>> It's concerned with extents deallocated during the truncate operation
> >>>> being leaked through concurrent reads from other inodes that got that
> >>>> those extents allocated to them in the meanwhile (and the dio reads
> >>>> complete after the re-allocations and before the extents get written
> >>>> with new data) - but that can't happen because truncate is holding a
> >>>> transaction open. Further all that code that it introduced, can only
> >>>> prevent concurrent reads from the same inode, not from other inodes.
> >>>> So I think that commit does absolutely nothing and we should revert
> >>>> it.
> >>>>
> >>>
> >>> Well...make sense, but still dio read can read stale data past isize
> >>> if this inode_dio_wait() is removed.
> >>
> >> Yes, the inode_dio_wait() would remain, to prevent a dio read from
> >> submitting the bio before truncate drops an extent and the bio finish
> >> after the transaction from truncate commits (at which point the pinned
> >> extents could have been allocated for someone else and be partially,
> >> fully rewritten or discarded). All that stuff with the bit
> >> BTRFS_INODE_READDIO_NEED_LOCK would go away.
> > 
> > The commit description doesn't point it out but the code has the
> > necessary comment, BTRFS_INODE_READDIO_NEED_LOCK is used to prevent a
> > livelock if there are enough agreesive dio readers rushing in.
> > 
> >> If the transaction commits after the dio read, then everything is fine
> >> as for the cases where it reads data past the isize set by truncate,
> >> that data is preserved since the dropped extents are pinned, there's
> >> no chance for the application to read partial contents or garbage from
> >> the dropped extents.
> > 
> > Not even that far, isize is truncated before calling inode_dio_wait()
> > and a memory barrier is set to ensure the correct order, so dio read
> > would simply return if it's reading past isize.
> 
> Please, describe concretely which meory barriers pairs with chich in
> order to ensure proper visibility. Because I'm of the opinion there are
> insufficient memoru barriers to ensure setting READDIO_LOCK is properly
> visible in btrfs_direct_IO. Since the latter has no barriers whatsoever.

smp_mb() is supposed to be paired, so there is one missing, I agree.

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo Feb. 22, 2018, 7:24 p.m. UTC | #15
On Thu, Feb 22, 2018 at 12:09:45PM -0700, Liu Bo wrote:
> On Thu, Feb 22, 2018 at 08:49:30AM +0200, Nikolay Borisov wrote:
> > 
> > 
> > On 22.02.2018 00:38, Liu Bo wrote:
> > > On Wed, Feb 21, 2018 at 07:05:13PM +0000, Filipe Manana wrote:
> > >> On Wed, Feb 21, 2018 at 6:28 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> > >>> On Wed, Feb 21, 2018 at 02:42:08PM +0000, Filipe Manana wrote:
> > >>>> On Wed, Feb 21, 2018 at 2:15 PM, Nikolay Borisov <nborisov@suse.com> wrote:
> > >>>>>
> > >>>>>
> > >>>>> On 21.02.2018 15:51, Filipe Manana wrote:
> > >>>>>> On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nborisov@suse.com> wrote:
> > >>>>>>> Currently the DIO read cases uses a botched idea from ext4 to ensure
> > >>>>>>> that DIO reads don't race with truncate. The idea is that if we have a
> > >>>>>>> pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK which in turn
> > >>>>>>> forces the dio read case to fallback to inode_locking to prevent
> > >>>>>>> read/truncate races. Unfortunately this is subtly broken for at least
> > >>>>>>> 2 reasons:
> > >>>>>>>
> > >>>>>>> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
> > >>>>>>> (for the read case). This means that there is no ordering guarantee
> > >>>>>>> between the invocation of inode_dio_wait and the increment of
> > >>>>>>> i_dio_count in btrfs_direct_IO in the tread case.
> > >>>>>>
> > >>>>>> Also, looking at this changelog, the diff and the code, why is it a
> > >>>>>> problem not calling inode_dio_begin without the inode lock in the dio
> > >>>>>> read path?
> > >>>>>> The truncate path calls inode_dio_wait after setting the bit
> > >>>>>> BTRFS_INODE_READDIO_NEED_LOCK and before clearing it.
> > >>>>>> Assuming the functions to set and clear that bit are correct, I don't
> > >>>>>> see what problem this brings.
> > >>>>>
> > >>>>> Assume you have a truncate and a dio READ in parallel. So the following
> > >>>>> execution is possible:
> > >>>>>
> > >>>>> T1:                                                           T2:
> > >>>>> btrfs_setattr
> > >>>>>  set_bit(BTRFS_INODE_READDIO_NEED_LOCK)
> > >>>>>  inode_dio_wait (reads i_dio_count)                        btrfs_direct_IO
> > >>>>>  clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)                  inode_dio_begin (inc's i_dio_count)
> > >>>>>
> > >>>>> Since we have no ordering between beginning a dio and waiting for it then
> > >>>>> truncate can assume there isn't any pending dio. At the same time
> > >>>>> btrfs_direct_IO will increment i_dio_count but won't see BTRFS_INODE_READDIO_NEED_LOCK
> > >>>>> ever being set and so will proceed servicing the read.
> > >>>>
> > >>>> So what you are saying, is that you are concerned with a dio read
> > >>>> starting after clearing the BTRFS_INODE_READDIO_NEED_LOCK.
> > >>>> I don't think that is a problem, because the truncate path has already
> > >>>> started a transaction before, which means blocks/extents deallocated
> > >>>> by the truncation can not be reused and allocated to other inodes or
> > >>>> the same inode (only after the transaction is committed).
> > >>>>
> > >>>> And considering that, commit 2e60a51e62185cce48758e596ae7cb2da673b58f
> > >>>> ("Btrfs: serialize unlocked dio reads with truncate"), which
> > >>>> introduced all this protection logic, is completely bogus. Looking at
> > >>>> its changelog:
> > >>>>
> > >>>>     Btrfs: serialize unlocked dio reads with truncate
> > >>>>
> > >>>>     Currently, we can do unlocked dio reads, but the following race
> > >>>>     is possible:
> > >>>>
> > >>>>     dio_read_task                   truncate_task
> > >>>>                                     ->btrfs_setattr()
> > >>>>     ->btrfs_direct_IO
> > >>>>         ->__blockdev_direct_IO
> > >>>>           ->btrfs_get_block
> > >>>>                                       ->btrfs_truncate()
> > >>>>                                      #alloc truncated blocks
> > >>>>                                      #to other inode
> > >>>>           ->submit_io()
> > >>>>          #INFORMATION LEAK
> > >>>>
> > >>>>     In order to avoid this problem, we must serialize unlocked dio reads with
> > >>>>     truncate. There are two approaches:
> > >>>>     - use extent lock to protect the extent that we truncate
> > >>>>     - use inode_dio_wait() to make sure the truncating task will wait for
> > >>>>       the read DIO.
> > >>>>
> > >>>>     If we use the 1st one, we will meet the endless truncation problem due to
> > >>>>     the nonlocked read DIO after we implement the nonlocked write DIO. It is
> > >>>>     because we still need invoke inode_dio_wait() avoid the race between write
> > >>>>     DIO and truncation. By that time, we have to introduce
> > >>>>
> > >>>>       btrfs_inode_{block, resume}_nolock_dio()
> > >>>>
> > >>>>     again. That is we have to implement this patch again, so I choose the 2nd
> > >>>>     way to fix the problem.
> > >>>>
> > >>>> It's concerned with extents deallocated during the truncate operation
> > >>>> being leaked through concurrent reads from other inodes that got that
> > >>>> those extents allocated to them in the meanwhile (and the dio reads
> > >>>> complete after the re-allocations and before the extents get written
> > >>>> with new data) - but that can't happen because truncate is holding a
> > >>>> transaction open. Further all that code that it introduced, can only
> > >>>> prevent concurrent reads from the same inode, not from other inodes.
> > >>>> So I think that commit does absolutely nothing and we should revert
> > >>>> it.
> > >>>>
> > >>>
> > >>> Well...make sense, but still dio read can read stale data past isize
> > >>> if this inode_dio_wait() is removed.
> > >>
> > >> Yes, the inode_dio_wait() would remain, to prevent a dio read from
> > >> submitting the bio before truncate drops an extent and the bio finish
> > >> after the transaction from truncate commits (at which point the pinned
> > >> extents could have been allocated for someone else and be partially,
> > >> fully rewritten or discarded). All that stuff with the bit
> > >> BTRFS_INODE_READDIO_NEED_LOCK would go away.
> > > 
> > > The commit description doesn't point it out but the code has the
> > > necessary comment, BTRFS_INODE_READDIO_NEED_LOCK is used to prevent a
> > > livelock if there are enough agreesive dio readers rushing in.
> > > 
> > >> If the transaction commits after the dio read, then everything is fine
> > >> as for the cases where it reads data past the isize set by truncate,
> > >> that data is preserved since the dropped extents are pinned, there's
> > >> no chance for the application to read partial contents or garbage from
> > >> the dropped extents.
> > > 
> > > Not even that far, isize is truncated before calling inode_dio_wait()
> > > and a memory barrier is set to ensure the correct order, so dio read
> > > would simply return if it's reading past isize.
> > 
> > Please, describe concretely which meory barriers pairs with chich in
> > order to ensure proper visibility. Because I'm of the opinion there are
> > insufficient memoru barriers to ensure setting READDIO_LOCK is properly
> > visible in btrfs_direct_IO. Since the latter has no barriers whatsoever.
> 
> smp_mb() is supposed to be paired, so there is one missing, I agree.
>

So the missing smp_mb() was there (commit
2e60a51e62185cce48758e596ae7cb2da673b58f), but was removed in some
cleanup I guess.

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba Feb. 22, 2018, 11:39 p.m. UTC | #16
On Thu, Feb 22, 2018 at 12:24:40PM -0700, Liu Bo wrote:
> > > > Not even that far, isize is truncated before calling inode_dio_wait()
> > > > and a memory barrier is set to ensure the correct order, so dio read
> > > > would simply return if it's reading past isize.
> > > 
> > > Please, describe concretely which meory barriers pairs with chich in
> > > order to ensure proper visibility. Because I'm of the opinion there are
> > > insufficient memoru barriers to ensure setting READDIO_LOCK is properly
> > > visible in btrfs_direct_IO. Since the latter has no barriers whatsoever.
> > 
> > smp_mb() is supposed to be paired, so there is one missing, I agree.
> >
> 
> So the missing smp_mb() was there (commit
> 2e60a51e62185cce48758e596ae7cb2da673b58f), but was removed in some
> cleanup I guess.

Not a cleanup, and not a single commit:

38851cc19adbfa1def2b47106d8050a80e0a3673
Btrfs: serialize unlocked dio reads with truncate
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>

at this time the i_dio_counter was still used directly and the barrier
was moved before the if-else block, technically still in the right
before the test_bit.

dc59215d4f42084ee13654bafe3e5130b146aeb7
btrfs: remove unnecessary memory barrier in btrfs_direct_IO
Signed-off-by: Nikolay Borisov <nborisov@suse.com>

... simply removes the barrier.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nikolay Borisov Feb. 23, 2018, 6:36 a.m. UTC | #17
On 23.02.2018 01:39, David Sterba wrote:
> On Thu, Feb 22, 2018 at 12:24:40PM -0700, Liu Bo wrote:
>>>>> Not even that far, isize is truncated before calling inode_dio_wait()
>>>>> and a memory barrier is set to ensure the correct order, so dio read
>>>>> would simply return if it's reading past isize.
>>>>
>>>> Please, describe concretely which meory barriers pairs with chich in
>>>> order to ensure proper visibility. Because I'm of the opinion there are
>>>> insufficient memoru barriers to ensure setting READDIO_LOCK is properly
>>>> visible in btrfs_direct_IO. Since the latter has no barriers whatsoever.
>>>
>>> smp_mb() is supposed to be paired, so there is one missing, I agree.
>>>
>>
>> So the missing smp_mb() was there (commit
>> 2e60a51e62185cce48758e596ae7cb2da673b58f), but was removed in some
>> cleanup I guess.
> 
> Not a cleanup, and not a single commit:
> 
> 38851cc19adbfa1def2b47106d8050a80e0a3673
> Btrfs: serialize unlocked dio reads with truncate
> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
> 
> at this time the i_dio_counter was still used directly and the barrier
> was moved before the if-else block, technically still in the right
> before the test_bit.
> 
> dc59215d4f42084ee13654bafe3e5130b146aeb7
> btrfs: remove unnecessary memory barrier in btrfs_direct_IO
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> 
> ... simply removes the barrier.

Right, I removed this barrier because indeed it wasn't paired with
anything in inode_dio_wait. The barrier that is missing here has to
order accessing test_bit(READDIO_LOCK) and not ordering the atomic_inc
vs inode_dio_wait.

> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index f527e99c9f8d..3519e49d4ef0 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -329,23 +329,6 @@  struct btrfs_dio_private {
 			blk_status_t);
 };
 
-/*
- * Disable DIO read nolock optimization, so new dio readers will be forced
- * to grab i_mutex. It is used to avoid the endless truncate due to
- * nonlocked dio read.
- */
-static inline void btrfs_inode_block_unlocked_dio(struct btrfs_inode *inode)
-{
-	set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
-	smp_mb();
-}
-
-static inline void btrfs_inode_resume_unlocked_dio(struct btrfs_inode *inode)
-{
-	smp_mb__before_atomic();
-	clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
-}
-
 static inline void btrfs_print_data_csum_error(struct btrfs_inode *inode,
 		u64 logical_start, u32 csum, u32 csum_expected, int mirror_num)
 {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 491a7397f6fa..9c43257e6e11 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5149,10 +5149,13 @@  static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 		/* we don't support swapfiles, so vmtruncate shouldn't fail */
 		truncate_setsize(inode, newsize);
 
-		/* Disable nonlocked read DIO to avoid the end less truncate */
-		btrfs_inode_block_unlocked_dio(BTRFS_I(inode));
+		/*
+		 * Truncate after all in-flight dios are finished, new ones
+		 * will block on inode_lock. This only matters for AIO requests
+		 * since DIO READ is performed under inode_shared_lock and
+		 * write under exclusive lock.
+		 */
 		inode_dio_wait(inode);
-		btrfs_inode_resume_unlocked_dio(BTRFS_I(inode));
 
 		ret = btrfs_truncate(inode);
 		if (ret && inode->i_nlink) {
@@ -8669,15 +8672,12 @@  static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	loff_t offset = iocb->ki_pos;
 	size_t count = 0;
 	int flags = 0;
-	bool wakeup = true;
 	bool relock = false;
 	ssize_t ret;
 
 	if (check_direct_IO(fs_info, iter, offset))
 		return 0;
 
-	inode_dio_begin(inode);
-
 	/*
 	 * The generic stuff only does filemap_write_and_wait_range, which
 	 * isn't enough if we've written compressed pages to this area, so
@@ -8691,6 +8691,9 @@  static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 					 offset + count - 1);
 
 	if (iov_iter_rw(iter) == WRITE) {
+
+		inode_dio_begin(inode);
+
 		/*
 		 * If the write DIO is beyond the EOF, we need update
 		 * the isize, but it is protected by i_mutex. So we can
@@ -8720,11 +8723,13 @@  static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 		dio_data.unsubmitted_oe_range_end = (u64)offset;
 		current->journal_info = &dio_data;
 		down_read(&BTRFS_I(inode)->dio_sem);
-	} else if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
-				     &BTRFS_I(inode)->runtime_flags)) {
-		inode_dio_end(inode);
-		flags = DIO_LOCKING | DIO_SKIP_HOLES;
-		wakeup = false;
+	} else {
+		/*
+		 * In DIO READ case locking the inode in shared mode ensures
+		 * we are protected against parallel writes/truncates
+		 */
+		inode_lock_shared(inode);
+		inode_dio_begin(inode);
 	}
 
 	ret = __blockdev_direct_IO(iocb, inode,
@@ -8755,10 +8760,11 @@  static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 			btrfs_delalloc_release_space(inode, data_reserved,
 					offset, count - (size_t)ret);
 		btrfs_delalloc_release_extents(BTRFS_I(inode), count);
-	}
+	} else
+		inode_unlock_shared(inode);
 out:
-	if (wakeup)
-		inode_dio_end(inode);
+	inode_dio_end(inode);
+
 	if (relock)
 		inode_lock(inode);