Message ID | 20230329055823.1677193-1-damien.lemoal@opensource.wdc.com (mailing list archive) |
---|---|
State | Mainlined, archived |
Headers | show |
Series | zonefs: Always invalidate last cache page on append write | expand |
Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
On Wed, Mar 29, 2023 at 02:58:23PM +0900, Damien Le Moal wrote: > + /* > + * If the inode block size (sector size) is smaller than the > + * page size, we may be appending data belonging to an already > + * cached last page of the inode. So make sure to invalidate that > + * last cached page. This will always be a no-op for the case where > + * the block size is equal to the page size. > + */ > + ret = invalidate_inode_pages2_range(inode->i_mapping, > + iocb->ki_pos >> PAGE_SHIFT, -1); > + if (ret) > + return ret; The missing truncate here obviously is a bug and needs fixing. But why does this not follow the logic in __iomap_dio_rw to to return -ENOTBLK for any error so that the write falls back to buffered I/O. Also as far as I can tell from reading the code, -1 is not a valid end special case for invalidate_inode_pages2_range, so you'll actually have to pass a valid end here.
On 3/29/23 17:14, Christoph Hellwig wrote: > On Wed, Mar 29, 2023 at 02:58:23PM +0900, Damien Le Moal wrote: >> + /* >> + * If the inode block size (sector size) is smaller than the >> + * page size, we may be appending data belonging to an already >> + * cached last page of the inode. So make sure to invalidate that >> + * last cached page. This will always be a no-op for the case where >> + * the block size is equal to the page size. >> + */ >> + ret = invalidate_inode_pages2_range(inode->i_mapping, >> + iocb->ki_pos >> PAGE_SHIFT, -1); >> + if (ret) >> + return ret; > > The missing truncate here obviously is a bug and needs fixing. > > But why does this not follow the logic in __iomap_dio_rw to to return > -ENOTBLK for any error so that the write falls back to buffered I/O. This is a write to sequential zones so we cannot use buffered writes. We have to do a direct write to ensure ordering between writes. Note that this is the special blocking write case where we issue a zone append. For async regular writes, we use iomap so this bug does not exist. But then I now realize that __iomap_dio_rw() falling back to buffered IOs could also create an issue with write ordering. > Also as far as I can tell from reading the code, -1 is not a valid > end special case for invalidate_inode_pages2_range, so you'll actually > have to pass a valid end here. I wondered about that but then saw: int invalidate_inode_pages2(struct address_space *mapping) { return invalidate_inode_pages2_range(mapping, 0, -1); } EXPORT_SYMBOL_GPL(invalidate_inode_pages2); which tend to indicate that "-1" is fine. The end is passed to find_get_entries() -> find_get_entry() where it becomes a "max" pgoff_t, so using -1 seems fine.
On 3/29/23 17:27, Damien Le Moal wrote: > On 3/29/23 17:14, Christoph Hellwig wrote: >> On Wed, Mar 29, 2023 at 02:58:23PM +0900, Damien Le Moal wrote: >>> + /* >>> + * If the inode block size (sector size) is smaller than the >>> + * page size, we may be appending data belonging to an already >>> + * cached last page of the inode. So make sure to invalidate that >>> + * last cached page. This will always be a no-op for the case where >>> + * the block size is equal to the page size. >>> + */ >>> + ret = invalidate_inode_pages2_range(inode->i_mapping, >>> + iocb->ki_pos >> PAGE_SHIFT, -1); >>> + if (ret) >>> + return ret; >> >> The missing truncate here obviously is a bug and needs fixing. >> >> But why does this not follow the logic in __iomap_dio_rw to to return >> -ENOTBLK for any error so that the write falls back to buffered I/O. > > This is a write to sequential zones so we cannot use buffered writes. We have to > do a direct write to ensure ordering between writes. > > Note that this is the special blocking write case where we issue a zone append. > For async regular writes, we use iomap so this bug does not exist. But then I > now realize that __iomap_dio_rw() falling back to buffered IOs could also create > an issue with write ordering. Checking this, there are no issues as it is the FS caller of iomap_dio_rw() who has to fallback to buffered IO if it wants to. But zonefs does not do that. > >> Also as far as I can tell from reading the code, -1 is not a valid >> end special case for invalidate_inode_pages2_range, so you'll actually >> have to pass a valid end here. > > I wondered about that but then saw: > > int invalidate_inode_pages2(struct address_space *mapping) > { > return invalidate_inode_pages2_range(mapping, 0, -1); > } > EXPORT_SYMBOL_GPL(invalidate_inode_pages2); > > which tend to indicate that "-1" is fine. The end is passed to > find_get_entries() -> find_get_entry() where it becomes a "max" pgoff_t, so > using -1 seems fine. > >
Applied the patch on top of 6.3.0-rc4 and it solves the corruption issue I saw, Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Cheers! On Wed, Mar 29, 2023 at 7:58 AM Damien Le Moal <damien.lemoal@opensource.wdc.com> wrote: > > When a direct append write is executed, the append offset may correspond > to the last page of an inode which might have been cached already by > buffered reads, page faults with mmap-read or non-direct readahead. > To ensure that the on-disk and cached data is consistant for such last > cached page, make sure to always invalidate it in > zonefs_file_dio_append(). This invalidation will always be a no-op when > the device block size is equal to the page size (e.g. 4K). > > Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com> > Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO") > Cc: stable@vger.kernel.org > Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> > --- > fs/zonefs/file.c | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c > index 617e4f9db42e..eeab8b93493b 100644 > --- a/fs/zonefs/file.c > +++ b/fs/zonefs/file.c > @@ -390,6 +390,18 @@ static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from) > max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize); > iov_iter_truncate(from, max); > > + /* > + * If the inode block size (sector size) is smaller than the > + * page size, we may be appending data belonging to an already > + * cached last page of the inode. So make sure to invalidate that > + * last cached page. This will always be a no-op for the case where > + * the block size is equal to the page size. > + */ > + ret = invalidate_inode_pages2_range(inode->i_mapping, > + iocb->ki_pos >> PAGE_SHIFT, -1); > + if (ret) > + return ret; > + > nr_pages = iov_iter_npages(from, BIO_MAX_VECS); > if (!nr_pages) > return 0; > -- > 2.39.2 >
On Wed, Mar 29, 2023 at 05:27:43PM +0900, Damien Le Moal wrote: > > But why does this not follow the logic in __iomap_dio_rw to to return > > -ENOTBLK for any error so that the write falls back to buffered I/O. > > This is a write to sequential zones so we cannot use buffered writes. We have to > do a direct write to ensure ordering between writes. > > Note that this is the special blocking write case where we issue a zone append. > For async regular writes, we use iomap so this bug does not exist. But then I > now realize that __iomap_dio_rw() falling back to buffered IOs could also create > an issue with write ordering. Can we add a comment please on why this is different? And maybe bundle the iomap-using path fix into the series while you're at it. > > Also as far as I can tell from reading the code, -1 is not a valid > > end special case for invalidate_inode_pages2_range, so you'll actually > > have to pass a valid end here. > > I wondered about that but then saw: > > int invalidate_inode_pages2(struct address_space *mapping) > { > return invalidate_inode_pages2_range(mapping, 0, -1); > } > EXPORT_SYMBOL_GPL(invalidate_inode_pages2); > > which tend to indicate that "-1" is fine. The end is passed to > find_get_entries() -> find_get_entry() where it becomes a "max" pgoff_t, so > using -1 seems fine. Oh, indeed. There's a little magic involved. Still, any reason not to pass the real end like iomap?
On 3/30/23 08:36, Christoph Hellwig wrote: > On Wed, Mar 29, 2023 at 05:27:43PM +0900, Damien Le Moal wrote: >>> But why does this not follow the logic in __iomap_dio_rw to to return >>> -ENOTBLK for any error so that the write falls back to buffered I/O. >> >> This is a write to sequential zones so we cannot use buffered writes. We have to >> do a direct write to ensure ordering between writes. >> >> Note that this is the special blocking write case where we issue a zone append. >> For async regular writes, we use iomap so this bug does not exist. But then I >> now realize that __iomap_dio_rw() falling back to buffered IOs could also create >> an issue with write ordering. > > Can we add a comment please on why this is different? And maybe bundle > the iomap-using path fix into the series while you're at it. Not sure what you mean here. "iomap-using path fix" ? Do you mean adding a comment about the fact that zonefs does not fallback to doing buffered writes if the iomap_dio_rw() or zonefs dio append direct write fail ? > >>> Also as far as I can tell from reading the code, -1 is not a valid >>> end special case for invalidate_inode_pages2_range, so you'll actually >>> have to pass a valid end here. >> >> I wondered about that but then saw: >> >> int invalidate_inode_pages2(struct address_space *mapping) >> { >> return invalidate_inode_pages2_range(mapping, 0, -1); >> } >> EXPORT_SYMBOL_GPL(invalidate_inode_pages2); >> >> which tend to indicate that "-1" is fine. The end is passed to >> find_get_entries() -> find_get_entry() where it becomes a "max" pgoff_t, so >> using -1 seems fine. > > Oh, indeed. There's a little magic involved. Still, any reason not to > pass the real end like iomap? Simplicity: we write append only and so we know that the only cached page we can eventually hit is the one straddling inode->i_size. So invalidating everything from that page is safe, and simple.
On Thu, Mar 30, 2023 at 08:57:56AM +0900, Damien Le Moal wrote: > Not sure what you mean here. "iomap-using path fix" ? > Do you mean adding a comment about the fact that zonefs does not fallback to > doing buffered writes if the iomap_dio_rw() or zonefs dio append direct write fail ? Making sure that the odd ENOTBLK error code does not leak to userspace for this case on zonefs.
On 3/30/23 09:07, Christoph Hellwig wrote: > On Thu, Mar 30, 2023 at 08:57:56AM +0900, Damien Le Moal wrote: >> Not sure what you mean here. "iomap-using path fix" ? >> Do you mean adding a comment about the fact that zonefs does not fallback to >> doing buffered writes if the iomap_dio_rw() or zonefs dio append direct write fail ? > > Making sure that the odd ENOTBLK error code does not leak to userspace > for this case on zonefs. OK. Let me check that.
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c index 617e4f9db42e..eeab8b93493b 100644 --- a/fs/zonefs/file.c +++ b/fs/zonefs/file.c @@ -390,6 +390,18 @@ static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from) max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize); iov_iter_truncate(from, max); + /* + * If the inode block size (sector size) is smaller than the + * page size, we may be appending data belonging to an already + * cached last page of the inode. So make sure to invalidate that + * last cached page. This will always be a no-op for the case where + * the block size is equal to the page size. + */ + ret = invalidate_inode_pages2_range(inode->i_mapping, + iocb->ki_pos >> PAGE_SHIFT, -1); + if (ret) + return ret; + nr_pages = iov_iter_npages(from, BIO_MAX_VECS); if (!nr_pages) return 0;
When a direct append write is executed, the append offset may correspond to the last page of an inode which might have been cached already by buffered reads, page faults with mmap-read or non-direct readahead. To ensure that the on-disk and cached data is consistant for such last cached page, make sure to always invalidate it in zonefs_file_dio_append(). This invalidation will always be a no-op when the device block size is equal to the page size (e.g. 4K). Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com> Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> --- fs/zonefs/file.c | 12 ++++++++++++ 1 file changed, 12 insertions(+)