
zonefs: Always invalidate last cache page on append write

Message ID 20230329055823.1677193-1-damien.lemoal@opensource.wdc.com
State Mainlined, archived
Series zonefs: Always invalidate last cache page on append write

Commit Message

Damien Le Moal March 29, 2023, 5:58 a.m. UTC
When a direct append write is executed, the append offset may correspond
to the last page of an inode which might have been cached already by
buffered reads, page faults with mmap-read or non-direct readahead.
To ensure that the on-disk and cached data are consistent for such a last
cached page, make sure to always invalidate it in
zonefs_file_dio_append(). This invalidation will always be a no-op when
the device block size is equal to the page size (e.g. 4K).

Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com>
Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
---
 fs/zonefs/file.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
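
As an editorial illustration of the case being fixed (the numbers below are
assumptions, not taken from the patch): with a 512 B zone block size and 4 KiB
pages, data already present at the end of the file can sit in a cached page
that the direct append then overlaps:

	/* Assumed example: 512 B block size, 4 KiB pages (PAGE_SHIFT = 12). */
	loff_t ki_pos = 1536;			/* append offset, equal to i_size */
	pgoff_t idx = ki_pos >> PAGE_SHIFT;	/* = 0: the page that a buffered
						 * read of bytes 0..1535 already
						 * populated */

With a 4 KiB block size, ki_pos is always page aligned, so no previously cached
page can overlap the appended range and the invalidation finds nothing to drop.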

Comments

Johannes Thumshirn March 29, 2023, 6:14 a.m. UTC | #1
Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Christoph Hellwig March 29, 2023, 8:14 a.m. UTC | #2
On Wed, Mar 29, 2023 at 02:58:23PM +0900, Damien Le Moal wrote:
> +	/*
> +	 * If the inode block size (sector size) is smaller than the
> +	 * page size, we may be appending data belonging to an already
> +	 * cached last page of the inode. So make sure to invalidate that
> +	 * last cached page. This will always be a no-op for the case where
> +	 * the block size is equal to the page size.
> +	 */
> +	ret = invalidate_inode_pages2_range(inode->i_mapping,
> +					    iocb->ki_pos >> PAGE_SHIFT, -1);
> +	if (ret)
> +		return ret;

The missing truncate here obviously is a bug and needs fixing.

But why does this not follow the logic in __iomap_dio_rw to return
-ENOTBLK for any error so that the write falls back to buffered I/O.
Also as far as I can tell from reading the code, -1 is not a valid
end special case for invalidate_inode_pages2_range, so you'll actually
have to pass a valid end here.
Damien Le Moal March 29, 2023, 8:27 a.m. UTC | #3
On 3/29/23 17:14, Christoph Hellwig wrote:
> On Wed, Mar 29, 2023 at 02:58:23PM +0900, Damien Le Moal wrote:
>> +	/*
>> +	 * If the inode block size (sector size) is smaller than the
>> +	 * page size, we may be appending data belonging to an already
>> +	 * cached last page of the inode. So make sure to invalidate that
>> +	 * last cached page. This will always be a no-op for the case where
>> +	 * the block size is equal to the page size.
>> +	 */
>> +	ret = invalidate_inode_pages2_range(inode->i_mapping,
>> +					    iocb->ki_pos >> PAGE_SHIFT, -1);
>> +	if (ret)
>> +		return ret;
> 
> The missing truncate here obviously is a bug and needs fixing.
> 
> But why does this not follow the logic in __iomap_dio_rw to return
> -ENOTBLK for any error so that the write falls back to buffered I/O.

This is a write to sequential zones so we cannot use buffered writes. We have to
do a direct write to ensure ordering between writes.

Note that this is the special blocking write case where we issue a zone append.
For async regular writes, we use iomap so this bug does not exist. But then I
now realize that __iomap_dio_rw() falling back to buffered IOs could also create
an issue with write ordering.

> Also as far as I can tell from reading the code, -1 is not a valid
> end special case for invalidate_inode_pages2_range, so you'll actually
> have to pass a valid end here.

I wondered about that but then saw:

int invalidate_inode_pages2(struct address_space *mapping)
{
	return invalidate_inode_pages2_range(mapping, 0, -1);
}
EXPORT_SYMBOL_GPL(invalidate_inode_pages2);

which tends to indicate that "-1" is fine. The end is passed to
find_get_entries() -> find_get_entry() where it becomes a "max" pgoff_t, so
using -1 seems fine.
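
A quick userspace check of the type conversion involved (editorial sketch; it
only assumes that pgoff_t is an unsigned long, as defined in include/linux/types.h):

	#include <limits.h>
	#include <stdio.h>

	int main(void)
	{
		/* Same conversion that invalidate_inode_pages2() relies on:
		 * -1 wraps to ULONG_MAX, the largest possible page index,
		 * so the range [start, end] covers the rest of the mapping. */
		unsigned long end = -1;

		printf("end == ULONG_MAX? %s\n", end == ULONG_MAX ? "yes" : "no");
		return 0;
	}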
Damien Le Moal March 29, 2023, 9:49 a.m. UTC | #4
On 3/29/23 17:27, Damien Le Moal wrote:
> On 3/29/23 17:14, Christoph Hellwig wrote:
>> On Wed, Mar 29, 2023 at 02:58:23PM +0900, Damien Le Moal wrote:
>>> +	/*
>>> +	 * If the inode block size (sector size) is smaller than the
>>> +	 * page size, we may be appending data belonging to an already
>>> +	 * cached last page of the inode. So make sure to invalidate that
>>> +	 * last cached page. This will always be a no-op for the case where
>>> +	 * the block size is equal to the page size.
>>> +	 */
>>> +	ret = invalidate_inode_pages2_range(inode->i_mapping,
>>> +					    iocb->ki_pos >> PAGE_SHIFT, -1);
>>> +	if (ret)
>>> +		return ret;
>>
>> The missing truncate here obviously is a bug and needs fixing.
>>
>> But why does this not follow the logic in __iomap_dio_rw to return
>> -ENOTBLK for any error so that the write falls back to buffered I/O.
> 
> This is a write to sequential zones so we cannot use buffered writes. We have to
> do a direct write to ensure ordering between writes.
> 
> Note that this is the special blocking write case where we issue a zone append.
> For async regular writes, we use iomap so this bug does not exist. But then I
> now realize that __iomap_dio_rw() falling back to buffered IOs could also create
> an issue with write ordering.

Checking this, there are no issues as it is the FS caller of iomap_dio_rw() that
has to fall back to buffered IO if it wants to. But zonefs does not do that.
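
For contrast, a filesystem that does want the fallback handles the iomap_dio_rw()
return value roughly as sketched below (editorial illustration; the ops names and
the buffered-write helper are placeholders, the iomap_dio_rw() arguments follow
the ~v6.3 signature, and zonefs deliberately does not do this on sequential zones):

	ret = iomap_dio_rw(iocb, from, &fs_write_iomap_ops, &fs_write_dio_ops,
			   0, NULL, 0);
	if (ret == -ENOTBLK) {
		/* Direct I/O could not proceed (e.g. page cache invalidation
		 * failed); redo the write through the page cache. Buffered
		 * writeback does not preserve submission order, which is why
		 * zonefs cannot use this path for sequential zone files. */
		ret = fs_buffered_write(iocb, from);	/* hypothetical helper */
	}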

> 
>> Also as far as I can tell from reading the code, -1 is not a valid
>> end special case for invalidate_inode_pages2_range, so you'll actually
>> have to pass a valid end here.
> 
> I wondered about that but then saw:
> 
> int invalidate_inode_pages2(struct address_space *mapping)
> {
> 	return invalidate_inode_pages2_range(mapping, 0, -1);
> }
> EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
> 
> which tends to indicate that "-1" is fine. The end is passed to
> find_get_entries() -> find_get_entry() where it becomes a "max" pgoff_t, so
> using -1 seems fine.
> 
>
Hans Holmberg March 29, 2023, 11:04 a.m. UTC | #5
Applied the patch on top of 6.3.0-rc4 and it solves the corruption issue I saw,
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>

Cheers!

On Wed, Mar 29, 2023 at 7:58 AM Damien Le Moal
<damien.lemoal@opensource.wdc.com> wrote:
>
> When a direct append write is executed, the append offset may correspond
> to the last page of an inode which might have been cached already by
> buffered reads, page faults with mmap-read or non-direct readahead.
> To ensure that the on-disk and cached data are consistent for such a last
> cached page, make sure to always invalidate it in
> zonefs_file_dio_append(). This invalidation will always be a no-op when
> the device block size is equal to the page size (e.g. 4K).
>
> Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com>
> Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO")
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> ---
>  fs/zonefs/file.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
> index 617e4f9db42e..eeab8b93493b 100644
> --- a/fs/zonefs/file.c
> +++ b/fs/zonefs/file.c
> @@ -390,6 +390,18 @@ static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
>         max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize);
>         iov_iter_truncate(from, max);
>
> +       /*
> +        * If the inode block size (sector size) is smaller than the
> +        * page size, we may be appending data belonging to an already
> +        * cached last page of the inode. So make sure to invalidate that
> +        * last cached page. This will always be a no-op for the case where
> +        * the block size is equal to the page size.
> +        */
> +       ret = invalidate_inode_pages2_range(inode->i_mapping,
> +                                           iocb->ki_pos >> PAGE_SHIFT, -1);
> +       if (ret)
> +               return ret;
> +
>         nr_pages = iov_iter_npages(from, BIO_MAX_VECS);
>         if (!nr_pages)
>                 return 0;
> --
> 2.39.2
>
Christoph Hellwig March 29, 2023, 11:36 p.m. UTC | #6
On Wed, Mar 29, 2023 at 05:27:43PM +0900, Damien Le Moal wrote:
> > But why does this not follow the logic in __iomap_dio_rw to return
> > -ENOTBLK for any error so that the write falls back to buffered I/O.
> 
> This is a write to sequential zones so we cannot use buffered writes. We have to
> do a direct write to ensure ordering between writes.
> 
> Note that this is the special blocking write case where we issue a zone append.
> For async regular writes, we use iomap so this bug does not exist. But then I
> now realize that __iomap_dio_rw() falling back to buffered IOs could also create
> an issue with write ordering.

Can we add a comment please on why this is different?  And maybe bundle
the iomap-using path fix into the series while you're at it.

> > Also as far as I can tell from reading the code, -1 is not a valid
> > end special case for invalidate_inode_pages2_range, so you'll actually
> > have to pass a valid end here.
> 
> I wondered about that but then saw:
> 
> int invalidate_inode_pages2(struct address_space *mapping)
> {
> 	return invalidate_inode_pages2_range(mapping, 0, -1);
> }
> EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
> 
> which tends to indicate that "-1" is fine. The end is passed to
> find_get_entries() -> find_get_entry() where it becomes a "max" pgoff_t, so
> using -1 seems fine.

Oh, indeed.  There's a little magic involved.  Still, any reason not to
pass the real end like iomap?
Damien Le Moal March 29, 2023, 11:57 p.m. UTC | #7
On 3/30/23 08:36, Christoph Hellwig wrote:
> On Wed, Mar 29, 2023 at 05:27:43PM +0900, Damien Le Moal wrote:
>>> But why does this not follow the logic in __iomap_dio_rw to return
>>> -ENOTBLK for any error so that the write falls back to buffered I/O.
>>
>> This is a write to sequential zones so we cannot use buffered writes. We have to
>> do a direct write to ensure ordering between writes.
>>
>> Note that this is the special blocking write case where we issue a zone append.
>> For async regular writes, we use iomap so this bug does not exist. But then I
>> now realize that __iomap_dio_rw() falling back to buffered IOs could also create
>> an issue with write ordering.
> 
> Can we add a comment please on why this is different?  And maybe bundle
> the iomap-using path fix into the series while you're at it.

Not sure what you mean here. "iomap-using path fix"?
Do you mean adding a comment about the fact that zonefs does not fall back to
doing buffered writes if the iomap_dio_rw() or zonefs dio append direct write fails?

> 
>>> Also as far as I can tell from reading the code, -1 is not a valid
>>> end special case for invalidate_inode_pages2_range, so you'll actually
>>> have to pass a valid end here.
>>
>> I wondered about that but then saw:
>>
>> int invalidate_inode_pages2(struct address_space *mapping)
>> {
>> 	return invalidate_inode_pages2_range(mapping, 0, -1);
>> }
>> EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
>>
>> which tends to indicate that "-1" is fine. The end is passed to
>> find_get_entries() -> find_get_entry() where it becomes a "max" pgoff_t, so
>> using -1 seems fine.
> 
> Oh, indeed.  There's a little magic involved.  Still, any reason not to
> pass the real end like iomap?

Simplicity: we write append only and so we know that the only cached page we can
eventually hit is the one straddling inode->i_size. So invalidating everything
from that page is safe, and simple.
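
Put differently (editorial restatement of the argument above, using the names
from the patch): ki_pos equals i_size when zonefs_file_dio_append() runs, and
reads never instantiate pages beyond i_size, so the open-ended range can only
ever contain the single page straddling i_size:

	/* The only page that can exist at or after ki_pos >> PAGE_SHIFT is
	 * the one straddling i_size, which is exactly the page to drop. */
	ret = invalidate_inode_pages2_range(inode->i_mapping,
					    iocb->ki_pos >> PAGE_SHIFT, -1);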
Christoph Hellwig March 30, 2023, 12:07 a.m. UTC | #8
On Thu, Mar 30, 2023 at 08:57:56AM +0900, Damien Le Moal wrote:
> Not sure what you mean here. "iomap-using path fix"?
> Do you mean adding a comment about the fact that zonefs does not fall back to
> doing buffered writes if the iomap_dio_rw() or zonefs dio append direct write fails?

Making sure that the odd ENOTBLK error code does not leak to userspace
for this case on zonefs.
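
A minimal sketch of what that could look like in the zonefs write path (editorial
assumption; this is not necessarily the fix that was eventually merged, and the
choice of -EBUSY is arbitrary):

	if (ret == -ENOTBLK) {
		/* -ENOTBLK is iomap's internal "fall back to buffered I/O"
		 * signal. zonefs has no buffered fallback for sequential
		 * zone writes, so remap it to an error userspace can act on. */
		ret = -EBUSY;
	}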
Damien Le Moal March 30, 2023, 12:22 a.m. UTC | #9
On 3/30/23 09:07, Christoph Hellwig wrote:
> On Thu, Mar 30, 2023 at 08:57:56AM +0900, Damien Le Moal wrote:
>> Not sure what you mean here. "iomap-using path fix"?
>> Do you mean adding a comment about the fact that zonefs does not fall back to
>> doing buffered writes if the iomap_dio_rw() or zonefs dio append direct write fails?
> 
> Making sure that the odd ENOTBLK error code does not leak to userspace
> for this case on zonefs.

OK. Let me check that.

Patch

diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 617e4f9db42e..eeab8b93493b 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -390,6 +390,18 @@  static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
 	max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize);
 	iov_iter_truncate(from, max);
 
+	/*
+	 * If the inode block size (sector size) is smaller than the
+	 * page size, we may be appending data belonging to an already
+	 * cached last page of the inode. So make sure to invalidate that
+	 * last cached page. This will always be a no-op for the case where
+	 * the block size is equal to the page size.
+	 */
+	ret = invalidate_inode_pages2_range(inode->i_mapping,
+					    iocb->ki_pos >> PAGE_SHIFT, -1);
+	if (ret)
+		return ret;
+
 	nr_pages = iov_iter_npages(from, BIO_MAX_VECS);
 	if (!nr_pages)
 		return 0;