diff mbox series

btrfs: remove btrfs_writepage_cow_fixup

Message ID 20220624122334.80603-1-hch@lst.de (mailing list archive)
State New, archived
Headers show
Series btrfs: remove btrfs_writepage_cow_fixup | expand

Commit Message

Christoph Hellwig June 24, 2022, 12:23 p.m. UTC
Since the page_mkwrite address space operation was added, starting with
commit 9637a5efd4fb ("[PATCH] add page_mkwrite() vm_operations method")
in 2006, the kernel does not just dirty random pages without telling
the file system.  Remove the code that handles this now-impossible case
and replace it with a WARN_ON_ONCE.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/ctree.h     |   8 --
 fs/btrfs/disk-io.c   |   6 +-
 fs/btrfs/extent_io.c |   9 +--
 fs/btrfs/inode.c     | 188 -------------------------------------------
 4 files changed, 3 insertions(+), 208 deletions(-)

Comments

Qu Wenruo June 24, 2022, 12:30 p.m. UTC | #1
On 2022/6/24 20:23, Christoph Hellwig wrote:
> Since the page_mkwrite address space operation was added, starting with
> commit 9637a5efd4fb ("[PATCH] add page_mkwrite() vm_operations method")
> in 2006, the kernel does not just dirty random pages without telling
> the file system.  Remove the code to handles this now impossible case
> and replace it with a WARN_ON_ONCE.

I'm super happy to remove that.

In fact, if we can get that removed, we can also remove the
PageOrdered() related code, which takes extra space for subpage support.


But from my previous feedback on subpage code, it looks like there are some
hardware archs (S390?) that cannot update page flags atomically.

I have tested a similar thing, with an extra ASSERT() to make sure the cow
fixup code never gets triggered.

At least for x86_64 and aarch64 it's OK here.

So I hope this time we can get a concrete reason why we need the
extra page Private2 bit in the first place.

Thanks,
Qu
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   fs/btrfs/ctree.h     |   8 --
>   fs/btrfs/disk-io.c   |   6 +-
>   fs/btrfs/extent_io.c |   9 +--
>   fs/btrfs/inode.c     | 188 -------------------------------------------
>   4 files changed, 3 insertions(+), 208 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 12f59e35755fa..39a1235eda48b 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -882,13 +882,6 @@ struct btrfs_fs_info {
>   	struct btrfs_workqueue *endio_write_workers;
>   	struct btrfs_workqueue *endio_freespace_worker;
>   	struct btrfs_workqueue *caching_workers;
> -
> -	/*
> -	 * fixup workers take dirty pages that didn't properly go through
> -	 * the cow mechanism and make them safe to write.  It happens
> -	 * for the sys_munmap function call path
> -	 */
> -	struct btrfs_workqueue *fixup_workers;
>   	struct btrfs_workqueue *delayed_workers;
>
>   	struct task_struct *transaction_kthread;
> @@ -3390,7 +3383,6 @@ int btrfs_prealloc_file_range_trans(struct inode *inode,
>   int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page,
>   		u64 start, u64 end, int *page_started, unsigned long *nr_written,
>   		struct writeback_control *wbc);
> -int btrfs_writepage_cow_fixup(struct page *page);
>   void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
>   					  struct page *page, u64 start,
>   					  u64 end, bool uptodate);
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 71d92f5dfcb2e..bf908e178ee59 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2163,7 +2163,6 @@ static int read_backup_root(struct btrfs_fs_info *fs_info, u8 priority)
>   /* helper to cleanup workers */
>   static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
>   {
> -	btrfs_destroy_workqueue(fs_info->fixup_workers);
>   	btrfs_destroy_workqueue(fs_info->delalloc_workers);
>   	btrfs_destroy_workqueue(fs_info->hipri_workers);
>   	btrfs_destroy_workqueue(fs_info->workers);
> @@ -2358,9 +2357,6 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
>   	fs_info->caching_workers =
>   		btrfs_alloc_workqueue(fs_info, "cache", flags, max_active, 0);
>
> -	fs_info->fixup_workers =
> -		btrfs_alloc_workqueue(fs_info, "fixup", flags, 1, 0);
> -
>   	fs_info->endio_workers =
>   		alloc_workqueue("btrfs-endio", flags, max_active);
>   	fs_info->endio_meta_workers =
> @@ -2390,7 +2386,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
>   	      fs_info->compressed_write_workers &&
>   	      fs_info->endio_write_workers && fs_info->endio_raid56_workers &&
>   	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
> -	      fs_info->caching_workers && fs_info->fixup_workers &&
> +	      fs_info->caching_workers &&
>   	      fs_info->delayed_workers && fs_info->qgroup_rescan_workers &&
>   	      fs_info->discard_ctl.discard_workers)) {
>   		return -ENOMEM;
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 587d2ba20b53b..232a858b99a0a 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3944,13 +3944,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
>   	bool has_error = false;
>   	bool compressed;
>
> -	ret = btrfs_writepage_cow_fixup(page);
> -	if (ret) {
> -		/* Fixup worker will requeue */
> -		redirty_page_for_writepage(wbc, page);
> -		unlock_page(page);
> -		return 1;
> -	}
> +	if (WARN_ON_ONCE(!PageOrdered(page)))
> +		return -EIO;
>
>   	/*
>   	 * we don't want to touch the inode after unlocking the page,
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index eea351216db33..5cf314a1b5d17 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2815,194 +2815,6 @@ int btrfs_set_extent_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
>   				   cached_state);
>   }
>
> -/* see btrfs_writepage_start_hook for details on why this is required */
> -struct btrfs_writepage_fixup {
> -	struct page *page;
> -	struct inode *inode;
> -	struct btrfs_work work;
> -};
> -
> -static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
> -{
> -	struct btrfs_writepage_fixup *fixup;
> -	struct btrfs_ordered_extent *ordered;
> -	struct extent_state *cached_state = NULL;
> -	struct extent_changeset *data_reserved = NULL;
> -	struct page *page;
> -	struct btrfs_inode *inode;
> -	u64 page_start;
> -	u64 page_end;
> -	int ret = 0;
> -	bool free_delalloc_space = true;
> -
> -	fixup = container_of(work, struct btrfs_writepage_fixup, work);
> -	page = fixup->page;
> -	inode = BTRFS_I(fixup->inode);
> -	page_start = page_offset(page);
> -	page_end = page_offset(page) + PAGE_SIZE - 1;
> -
> -	/*
> -	 * This is similar to page_mkwrite, we need to reserve the space before
> -	 * we take the page lock.
> -	 */
> -	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
> -					   PAGE_SIZE);
> -again:
> -	lock_page(page);
> -
> -	/*
> -	 * Before we queued this fixup, we took a reference on the page.
> -	 * page->mapping may go NULL, but it shouldn't be moved to a different
> -	 * address space.
> -	 */
> -	if (!page->mapping || !PageDirty(page) || !PageChecked(page)) {
> -		/*
> -		 * Unfortunately this is a little tricky, either
> -		 *
> -		 * 1) We got here and our page had already been dealt with and
> -		 *    we reserved our space, thus ret == 0, so we need to just
> -		 *    drop our space reservation and bail.  This can happen the
> -		 *    first time we come into the fixup worker, or could happen
> -		 *    while waiting for the ordered extent.
> -		 * 2) Our page was already dealt with, but we happened to get an
> -		 *    ENOSPC above from the btrfs_delalloc_reserve_space.  In
> -		 *    this case we obviously don't have anything to release, but
> -		 *    because the page was already dealt with we don't want to
> -		 *    mark the page with an error, so make sure we're resetting
> -		 *    ret to 0.  This is why we have this check _before_ the ret
> -		 *    check, because we do not want to have a surprise ENOSPC
> -		 *    when the page was already properly dealt with.
> -		 */
> -		if (!ret) {
> -			btrfs_delalloc_release_extents(inode, PAGE_SIZE);
> -			btrfs_delalloc_release_space(inode, data_reserved,
> -						     page_start, PAGE_SIZE,
> -						     true);
> -		}
> -		ret = 0;
> -		goto out_page;
> -	}
> -
> -	/*
> -	 * We can't mess with the page state unless it is locked, so now that
> -	 * it is locked bail if we failed to make our space reservation.
> -	 */
> -	if (ret)
> -		goto out_page;
> -
> -	lock_extent_bits(&inode->io_tree, page_start, page_end, &cached_state);
> -
> -	/* already ordered? We're done */
> -	if (PageOrdered(page))
> -		goto out_reserved;
> -
> -	ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_SIZE);
> -	if (ordered) {
> -		unlock_extent_cached(&inode->io_tree, page_start, page_end,
> -				     &cached_state);
> -		unlock_page(page);
> -		btrfs_start_ordered_extent(ordered, 1);
> -		btrfs_put_ordered_extent(ordered);
> -		goto again;
> -	}
> -
> -	ret = btrfs_set_extent_delalloc(inode, page_start, page_end, 0,
> -					&cached_state);
> -	if (ret)
> -		goto out_reserved;
> -
> -	/*
> -	 * Everything went as planned, we're now the owner of a dirty page with
> -	 * delayed allocation bits set and space reserved for our COW
> -	 * destination.
> -	 *
> -	 * The page was dirty when we started, nothing should have cleaned it.
> -	 */
> -	BUG_ON(!PageDirty(page));
> -	free_delalloc_space = false;
> -out_reserved:
> -	btrfs_delalloc_release_extents(inode, PAGE_SIZE);
> -	if (free_delalloc_space)
> -		btrfs_delalloc_release_space(inode, data_reserved, page_start,
> -					     PAGE_SIZE, true);
> -	unlock_extent_cached(&inode->io_tree, page_start, page_end,
> -			     &cached_state);
> -out_page:
> -	if (ret) {
> -		/*
> -		 * We hit ENOSPC or other errors.  Update the mapping and page
> -		 * to reflect the errors and clean the page.
> -		 */
> -		mapping_set_error(page->mapping, ret);
> -		end_extent_writepage(page, ret, page_start, page_end);
> -		clear_page_dirty_for_io(page);
> -		SetPageError(page);
> -	}
> -	btrfs_page_clear_checked(inode->root->fs_info, page, page_start, PAGE_SIZE);
> -	unlock_page(page);
> -	put_page(page);
> -	kfree(fixup);
> -	extent_changeset_free(data_reserved);
> -	/*
> -	 * As a precaution, do a delayed iput in case it would be the last iput
> -	 * that could need flushing space. Recursing back to fixup worker would
> -	 * deadlock.
> -	 */
> -	btrfs_add_delayed_iput(&inode->vfs_inode);
> -}
> -
> -/*
> - * There are a few paths in the higher layers of the kernel that directly
> - * set the page dirty bit without asking the filesystem if it is a
> - * good idea.  This causes problems because we want to make sure COW
> - * properly happens and the data=ordered rules are followed.
> - *
> - * In our case any range that doesn't have the ORDERED bit set
> - * hasn't been properly setup for IO.  We kick off an async process
> - * to fix it up.  The async helper will wait for ordered extents, set
> - * the delalloc bit and make it safe to write the page.
> - */
> -int btrfs_writepage_cow_fixup(struct page *page)
> -{
> -	struct inode *inode = page->mapping->host;
> -	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> -	struct btrfs_writepage_fixup *fixup;
> -
> -	/* This page has ordered extent covering it already */
> -	if (PageOrdered(page))
> -		return 0;
> -
> -	/*
> -	 * PageChecked is set below when we create a fixup worker for this page,
> -	 * don't try to create another one if we're already PageChecked()
> -	 *
> -	 * The extent_io writepage code will redirty the page if we send back
> -	 * EAGAIN.
> -	 */
> -	if (PageChecked(page))
> -		return -EAGAIN;
> -
> -	fixup = kzalloc(sizeof(*fixup), GFP_NOFS);
> -	if (!fixup)
> -		return -EAGAIN;
> -
> -	/*
> -	 * We are already holding a reference to this inode from
> -	 * write_cache_pages.  We need to hold it because the space reservation
> -	 * takes place outside of the page lock, and we can't trust
> -	 * page->mapping outside of the page lock.
> -	 */
> -	ihold(inode);
> -	btrfs_page_set_checked(fs_info, page, page_offset(page), PAGE_SIZE);
> -	get_page(page);
> -	btrfs_init_work(&fixup->work, btrfs_writepage_fixup_worker, NULL, NULL);
> -	fixup->page = page;
> -	fixup->inode = inode;
> -	btrfs_queue_work(fs_info->fixup_workers, &fixup->work);
> -
> -	return -EAGAIN;
> -}
> -
>   static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
>   				       struct btrfs_inode *inode, u64 file_pos,
>   				       struct btrfs_file_extent_item *stack_fi,
David Sterba June 24, 2022, 12:49 p.m. UTC | #2
On Fri, Jun 24, 2022 at 02:23:34PM +0200, Christoph Hellwig wrote:
> Since the page_mkwrite address space operation was added, starting with
> commit 9637a5efd4fb ("[PATCH] add page_mkwrite() vm_operations method")
> in 2006, the kernel does not just dirty random pages without telling
> the file system.

It does and there's a history behind the fixup worker. tl;dr it can't be
removed, though every now and then somebody comes and tries to.

On s390 the page status is tracked in two places, hw and in memory and
this needs to be synchronized manually.

On x86_64 it's not a simple reason but it happens as well in some edge
case where the mappings get removed and dirty page is set deep in the
arch mm code.  We've been chasing it long time ago, I don't recall exact
details and it's been a painful experience.

If there's been any change on the s390 side or in arch/x86/mm code I
don't know but to be on the safe side, I strongly assume the fixup code
is needed unless proven otherwise.
Christoph Hellwig June 24, 2022, 12:51 p.m. UTC | #3
On Fri, Jun 24, 2022 at 08:30:00PM +0800, Qu Wenruo wrote:
> But from my previous feedback on subpage code, it looks like it's some
> hardware archs (S390?) that can not do page flags update atomically.
>
> I have tested similar thing, with extra ASSERT() to make sure the cow
> fixup code never get triggered.
>
> At least for x86_64 and aarch64 it's OK here.
>
> So I hope this time we can get a concrete reason on why we need the
> extra page Private2 bit in the first place.

I don't think atomic page flags are a thing here.  I remember Jan
had chased a bug where we'd get into trouble into this area in
ext4 due to the way pages are locked down for direct I/O, but I
don't even remember seeing that on XFS.  Either way the PageOrdered
check prevents a crash in that case and we really can't expect
data to properly be written back in that case.

>
> Thanks,
> Qu
>>
>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>> ---
>>   fs/btrfs/ctree.h     |   8 --
>>   fs/btrfs/disk-io.c   |   6 +-
>>   fs/btrfs/extent_io.c |   9 +--
>>   fs/btrfs/inode.c     | 188 -------------------------------------------
>>   4 files changed, 3 insertions(+), 208 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 12f59e35755fa..39a1235eda48b 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -882,13 +882,6 @@ struct btrfs_fs_info {
>>   	struct btrfs_workqueue *endio_write_workers;
>>   	struct btrfs_workqueue *endio_freespace_worker;
>>   	struct btrfs_workqueue *caching_workers;
>> -
>> -	/*
>> -	 * fixup workers take dirty pages that didn't properly go through
>> -	 * the cow mechanism and make them safe to write.  It happens
>> -	 * for the sys_munmap function call path
>> -	 */
>> -	struct btrfs_workqueue *fixup_workers;
>>   	struct btrfs_workqueue *delayed_workers;
>>
>>   	struct task_struct *transaction_kthread;
>> @@ -3390,7 +3383,6 @@ int btrfs_prealloc_file_range_trans(struct inode *inode,
>>   int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page,
>>   		u64 start, u64 end, int *page_started, unsigned long *nr_written,
>>   		struct writeback_control *wbc);
>> -int btrfs_writepage_cow_fixup(struct page *page);
>>   void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
>>   					  struct page *page, u64 start,
>>   					  u64 end, bool uptodate);
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 71d92f5dfcb2e..bf908e178ee59 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -2163,7 +2163,6 @@ static int read_backup_root(struct btrfs_fs_info *fs_info, u8 priority)
>>   /* helper to cleanup workers */
>>   static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
>>   {
>> -	btrfs_destroy_workqueue(fs_info->fixup_workers);
>>   	btrfs_destroy_workqueue(fs_info->delalloc_workers);
>>   	btrfs_destroy_workqueue(fs_info->hipri_workers);
>>   	btrfs_destroy_workqueue(fs_info->workers);
>> @@ -2358,9 +2357,6 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
>>   	fs_info->caching_workers =
>>   		btrfs_alloc_workqueue(fs_info, "cache", flags, max_active, 0);
>>
>> -	fs_info->fixup_workers =
>> -		btrfs_alloc_workqueue(fs_info, "fixup", flags, 1, 0);
>> -
>>   	fs_info->endio_workers =
>>   		alloc_workqueue("btrfs-endio", flags, max_active);
>>   	fs_info->endio_meta_workers =
>> @@ -2390,7 +2386,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
>>   	      fs_info->compressed_write_workers &&
>>   	      fs_info->endio_write_workers && fs_info->endio_raid56_workers &&
>>   	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
>> -	      fs_info->caching_workers && fs_info->fixup_workers &&
>> +	      fs_info->caching_workers &&
>>   	      fs_info->delayed_workers && fs_info->qgroup_rescan_workers &&
>>   	      fs_info->discard_ctl.discard_workers)) {
>>   		return -ENOMEM;
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 587d2ba20b53b..232a858b99a0a 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -3944,13 +3944,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
>>   	bool has_error = false;
>>   	bool compressed;
>>
>> -	ret = btrfs_writepage_cow_fixup(page);
>> -	if (ret) {
>> -		/* Fixup worker will requeue */
>> -		redirty_page_for_writepage(wbc, page);
>> -		unlock_page(page);
>> -		return 1;
>> -	}
>> +	if (WARN_ON_ONCE(!PageOrdered(page)))
>> +		return -EIO;
>>
>>   	/*
>>   	 * we don't want to touch the inode after unlocking the page,
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index eea351216db33..5cf314a1b5d17 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -2815,194 +2815,6 @@ int btrfs_set_extent_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
>>   				   cached_state);
>>   }
>>
>> -/* see btrfs_writepage_start_hook for details on why this is required */
>> -struct btrfs_writepage_fixup {
>> -	struct page *page;
>> -	struct inode *inode;
>> -	struct btrfs_work work;
>> -};
>> -
>> -static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
>> -{
>> -	struct btrfs_writepage_fixup *fixup;
>> -	struct btrfs_ordered_extent *ordered;
>> -	struct extent_state *cached_state = NULL;
>> -	struct extent_changeset *data_reserved = NULL;
>> -	struct page *page;
>> -	struct btrfs_inode *inode;
>> -	u64 page_start;
>> -	u64 page_end;
>> -	int ret = 0;
>> -	bool free_delalloc_space = true;
>> -
>> -	fixup = container_of(work, struct btrfs_writepage_fixup, work);
>> -	page = fixup->page;
>> -	inode = BTRFS_I(fixup->inode);
>> -	page_start = page_offset(page);
>> -	page_end = page_offset(page) + PAGE_SIZE - 1;
>> -
>> -	/*
>> -	 * This is similar to page_mkwrite, we need to reserve the space before
>> -	 * we take the page lock.
>> -	 */
>> -	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
>> -					   PAGE_SIZE);
>> -again:
>> -	lock_page(page);
>> -
>> -	/*
>> -	 * Before we queued this fixup, we took a reference on the page.
>> -	 * page->mapping may go NULL, but it shouldn't be moved to a different
>> -	 * address space.
>> -	 */
>> -	if (!page->mapping || !PageDirty(page) || !PageChecked(page)) {
>> -		/*
>> -		 * Unfortunately this is a little tricky, either
>> -		 *
>> -		 * 1) We got here and our page had already been dealt with and
>> -		 *    we reserved our space, thus ret == 0, so we need to just
>> -		 *    drop our space reservation and bail.  This can happen the
>> -		 *    first time we come into the fixup worker, or could happen
>> -		 *    while waiting for the ordered extent.
>> -		 * 2) Our page was already dealt with, but we happened to get an
>> -		 *    ENOSPC above from the btrfs_delalloc_reserve_space.  In
>> -		 *    this case we obviously don't have anything to release, but
>> -		 *    because the page was already dealt with we don't want to
>> -		 *    mark the page with an error, so make sure we're resetting
>> -		 *    ret to 0.  This is why we have this check _before_ the ret
>> -		 *    check, because we do not want to have a surprise ENOSPC
>> -		 *    when the page was already properly dealt with.
>> -		 */
>> -		if (!ret) {
>> -			btrfs_delalloc_release_extents(inode, PAGE_SIZE);
>> -			btrfs_delalloc_release_space(inode, data_reserved,
>> -						     page_start, PAGE_SIZE,
>> -						     true);
>> -		}
>> -		ret = 0;
>> -		goto out_page;
>> -	}
>> -
>> -	/*
>> -	 * We can't mess with the page state unless it is locked, so now that
>> -	 * it is locked bail if we failed to make our space reservation.
>> -	 */
>> -	if (ret)
>> -		goto out_page;
>> -
>> -	lock_extent_bits(&inode->io_tree, page_start, page_end, &cached_state);
>> -
>> -	/* already ordered? We're done */
>> -	if (PageOrdered(page))
>> -		goto out_reserved;
>> -
>> -	ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_SIZE);
>> -	if (ordered) {
>> -		unlock_extent_cached(&inode->io_tree, page_start, page_end,
>> -				     &cached_state);
>> -		unlock_page(page);
>> -		btrfs_start_ordered_extent(ordered, 1);
>> -		btrfs_put_ordered_extent(ordered);
>> -		goto again;
>> -	}
>> -
>> -	ret = btrfs_set_extent_delalloc(inode, page_start, page_end, 0,
>> -					&cached_state);
>> -	if (ret)
>> -		goto out_reserved;
>> -
>> -	/*
>> -	 * Everything went as planned, we're now the owner of a dirty page with
>> -	 * delayed allocation bits set and space reserved for our COW
>> -	 * destination.
>> -	 *
>> -	 * The page was dirty when we started, nothing should have cleaned it.
>> -	 */
>> -	BUG_ON(!PageDirty(page));
>> -	free_delalloc_space = false;
>> -out_reserved:
>> -	btrfs_delalloc_release_extents(inode, PAGE_SIZE);
>> -	if (free_delalloc_space)
>> -		btrfs_delalloc_release_space(inode, data_reserved, page_start,
>> -					     PAGE_SIZE, true);
>> -	unlock_extent_cached(&inode->io_tree, page_start, page_end,
>> -			     &cached_state);
>> -out_page:
>> -	if (ret) {
>> -		/*
>> -		 * We hit ENOSPC or other errors.  Update the mapping and page
>> -		 * to reflect the errors and clean the page.
>> -		 */
>> -		mapping_set_error(page->mapping, ret);
>> -		end_extent_writepage(page, ret, page_start, page_end);
>> -		clear_page_dirty_for_io(page);
>> -		SetPageError(page);
>> -	}
>> -	btrfs_page_clear_checked(inode->root->fs_info, page, page_start, PAGE_SIZE);
>> -	unlock_page(page);
>> -	put_page(page);
>> -	kfree(fixup);
>> -	extent_changeset_free(data_reserved);
>> -	/*
>> -	 * As a precaution, do a delayed iput in case it would be the last iput
>> -	 * that could need flushing space. Recursing back to fixup worker would
>> -	 * deadlock.
>> -	 */
>> -	btrfs_add_delayed_iput(&inode->vfs_inode);
>> -}
>> -
>> -/*
>> - * There are a few paths in the higher layers of the kernel that directly
>> - * set the page dirty bit without asking the filesystem if it is a
>> - * good idea.  This causes problems because we want to make sure COW
>> - * properly happens and the data=ordered rules are followed.
>> - *
>> - * In our case any range that doesn't have the ORDERED bit set
>> - * hasn't been properly setup for IO.  We kick off an async process
>> - * to fix it up.  The async helper will wait for ordered extents, set
>> - * the delalloc bit and make it safe to write the page.
>> - */
>> -int btrfs_writepage_cow_fixup(struct page *page)
>> -{
>> -	struct inode *inode = page->mapping->host;
>> -	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>> -	struct btrfs_writepage_fixup *fixup;
>> -
>> -	/* This page has ordered extent covering it already */
>> -	if (PageOrdered(page))
>> -		return 0;
>> -
>> -	/*
>> -	 * PageChecked is set below when we create a fixup worker for this page,
>> -	 * don't try to create another one if we're already PageChecked()
>> -	 *
>> -	 * The extent_io writepage code will redirty the page if we send back
>> -	 * EAGAIN.
>> -	 */
>> -	if (PageChecked(page))
>> -		return -EAGAIN;
>> -
>> -	fixup = kzalloc(sizeof(*fixup), GFP_NOFS);
>> -	if (!fixup)
>> -		return -EAGAIN;
>> -
>> -	/*
>> -	 * We are already holding a reference to this inode from
>> -	 * write_cache_pages.  We need to hold it because the space reservation
>> -	 * takes place outside of the page lock, and we can't trust
>> -	 * page->mapping outside of the page lock.
>> -	 */
>> -	ihold(inode);
>> -	btrfs_page_set_checked(fs_info, page, page_offset(page), PAGE_SIZE);
>> -	get_page(page);
>> -	btrfs_init_work(&fixup->work, btrfs_writepage_fixup_worker, NULL, NULL);
>> -	fixup->page = page;
>> -	fixup->inode = inode;
>> -	btrfs_queue_work(fs_info->fixup_workers, &fixup->work);
>> -
>> -	return -EAGAIN;
>> -}
>> -
>>   static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
>>   				       struct btrfs_inode *inode, u64 file_pos,
>>   				       struct btrfs_file_extent_item *stack_fi,
---end quoted text---
Jan Kara June 24, 2022, 1:07 p.m. UTC | #4
On Fri 24-06-22 14:51:18, Christoph Hellwig wrote:
> On Fri, Jun 24, 2022 at 08:30:00PM +0800, Qu Wenruo wrote:
> > But from my previous feedback on subpage code, it looks like it's some
> > hardware archs (S390?) that can not do page flags update atomically.
> >
> > I have tested similar thing, with extra ASSERT() to make sure the cow
> > fixup code never get triggered.
> >
> > At least for x86_64 and aarch64 it's OK here.
> >
> > So I hope this time we can get a concrete reason on why we need the
> > extra page Private2 bit in the first place.
> 
> I don't think atomic page flags are a thing here.  I remember Jan
> had chased a bug where we'd get into trouble into this area in
> ext4 due to the way pages are locked down for direct I/O, but I
> don't even remember seeing that on XFS.  Either way the PageOrdered
> check prevents a crash in that case and we really can't expect
> data to properly be written back in that case.

I'm not sure I get the context 100% right but pages getting randomly dirty
behind filesystem's back can still happen - most commonly with RDMA and
similar stuff which calls set_page_dirty() on pages it has got from
pin_user_pages() once the transfer is done. page_maybe_dma_pinned() should
be usable within filesystems to detect such cases and protect the
filesystem but so far neither I nor John Hubbard has got to implement this
in the generic writeback infrastructure + some filesystem as a sample case
others could copy...

								Honza
Qu Wenruo June 24, 2022, 1:12 p.m. UTC | #5
On 2022/6/24 20:49, David Sterba wrote:
> On Fri, Jun 24, 2022 at 02:23:34PM +0200, Christoph Hellwig wrote:
>> Since the page_mkwrite address space operation was added, starting with
>> commit 9637a5efd4fb ("[PATCH] add page_mkwrite() vm_operations method")
>> in 2006, the kernel does not just dirty random pages without telling
>> the file system.
>
> It does and there's a history behind the fixup worker. tl;dr it can't be
> removed, though every now and then somebody comes and tries to.
>
> On s390 the page status is tracked in two places, hw and in memory and
> this needs to be synchronized manually.
>
> On x86_64 it's not a simple reason but it happens as well in some edge
> case where the mappings get removed and dirty page is set deep in the
> arch mm code.  We've been chasing it long time ago, I don't recall exact
> details and it's been a painful experience.
>
> If there's been any change on the s390 side or in arch/x86/mm code I
> don't know but to be on the safe side, I strongly assume the fixup code
> is needed unless proven otherwise.

I'd say, if this can be a problem to btrfs, then all fs supporting COW
should also be affected, and should have similar workaround.


Furthermore, this means we can get a page dirtied without us knowing.

This is a super big surprise to any fs, and should be properly
documented, not just leaving some seemly dead and special code in some
random fs.

Furthermore, I'm not sure even if handling this in a fs level is correct.
This looks like more a MM problem to me then.


I totally understand it's a pain to debug such lowlevel bug, but
shouldn't we have a proper regression for it then?

Instead of just keeping what we know works, I really want to handle this
old case/bug in a more modern way.

Thanks,
Qu
Qu Wenruo June 24, 2022, 1:19 p.m. UTC | #6
On 2022/6/24 21:07, Jan Kara wrote:
> On Fri 24-06-22 14:51:18, Christoph Hellwig wrote:
>> On Fri, Jun 24, 2022 at 08:30:00PM +0800, Qu Wenruo wrote:
>>> But from my previous feedback on subpage code, it looks like it's some
>>> hardware archs (S390?) that can not do page flags update atomically.
>>>
>>> I have tested similar thing, with extra ASSERT() to make sure the cow
>>> fixup code never get triggered.
>>>
>>> At least for x86_64 and aarch64 it's OK here.
>>>
>>> So I hope this time we can get a concrete reason on why we need the
>>> extra page Private2 bit in the first place.
>>
>> I don't think atomic page flags are a thing here.  I remember Jan
>> had chased a bug where we'd get into trouble into this area in
>> ext4 due to the way pages are locked down for direct I/O, but I
>> don't even remember seeing that on XFS.  Either way the PageOrdered
>> check prevents a crash in that case and we really can't expect
>> data to properly be written back in that case.
>
> I'm not sure I get the context 100% right but pages getting randomly dirty
> behind filesystem's back can still happen - most commonly with RDMA and
> similar stuff which calls set_page_dirty() on pages it has got from
> pin_user_pages() once the transfer is done.

Just curious, things like RDMA can mark those pages dirty even without
letting the kernel know, but how could those pages be from page cache? By
mmap()?

> page_maybe_dma_pinned() should
> be usable within filesystems to detect such cases and protect the
> filesystem but so far neither me nor John Hubbart has got to implement this
> in the generic writeback infrastructure + some filesystem as a sample case
> others could copy...

So the generic idea is just to detect if the page is marked dirty by
traditional means, and if not, skip the writeback for them, and wait for
proper notification to fs?

Thanks,
Qu

>
> 								Honza
David Sterba June 24, 2022, 1:27 p.m. UTC | #7
On Fri, Jun 24, 2022 at 09:12:44PM +0800, Qu Wenruo wrote:
> On 2022/6/24 20:49, David Sterba wrote:
> > On Fri, Jun 24, 2022 at 02:23:34PM +0200, Christoph Hellwig wrote:
> >> Since the page_mkwrite address space operation was added, starting with
> >> commit 9637a5efd4fb ("[PATCH] add page_mkwrite() vm_operations method")
> >> in 2006, the kernel does not just dirty random pages without telling
> >> the file system.
> >
> > It does and there's a history behind the fixup worker. tl;dr it can't be
> > removed, though every now and then somebody comes and tries to.
> >
> > On s390 the page status is tracked in two places, hw and in memory and
> > this needs to be synchronized manually.
> >
> > On x86_64 it's not a simple reason but it happens as well in some edge
> > case where the mappings get removed and dirty page is set deep in the
> > arch mm code.  We've been chasing it long time ago, I don't recall exact
> > details and it's been a painful experience.
> >
> > If there's been any change on the s390 side or in arch/x86/mm code I
> > don't know but to be on the safe side, I strongly assume the fixup code
> > is needed unless proven otherwise.
> 
> I'd say, if this can be a problem to btrfs, then all fs supporting COW
> should also be affected, and should have similar workaround.

Probably yes.

> Furthermore, this means we can get a page dirtied without us knowing.

This should not happen because we do have the detection of the page and
extent state mismatch and the fixup worker makes things right again.

> This is a super big surprise to any fs, and should be properly
> documented, not just leaving some seemly dead and special code in some
> random fs.

You seem to be a non-believer that the bug is real and calling the code
dead. Each filesystem should validate the implementation against the
platform where it is and btrfs once found the hard way that there are
some corner cases where structures get out of sync.

> Furthermore, I'm not sure even if handling this in a fs level is correct.
> This looks like more a MM problem to me then.
> 
> 
> I totally understand it's a pain to debug such lowlevel bug, but
> shouldn't we have a proper regression for it then?

The regression test is generic/208 and it was not reliable at all, it
fired randomly once a week or month, there used to be a BUG() in the
fixup worker callback.

> Instead of just keeping what we know works, I really want to handle this
> old case/bug in a more modern way.

As long as the guarantees stay the same, then fine. We need to be able
to detect the unexpected dirty bit and have a way to react to it.

f4b1363cae43 ("btrfs: do not do delalloc reservation under page lock")
25f3c5021985 ("Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker")
1d53c9e67230 ("Btrfs: only associate the locked page with one async_chunk struct")

And the commit that fixed it:

87826df0ec36 ("btrfs: delalloc for page dirtied out-of-band in fixup worker")

You can find several reports in the mailing list archives (search term
btrfs_writepage_fixup_worker):

https://lore.kernel.org/linux-btrfs/1295053074.15265.6.camel@mercury.localdomain

https://lore.kernel.org/linux-btrfs/20110701174436.GA8352@yahoo.fr

https://lore.kernel.org/linux-btrfs/j0k65i$29a$1@dough.gmane.org

https://lore.kernel.org/linux-btrfs/CAO47_--H0+6bu4qQ2QA9gZcHvGVWO4QUGCAb3+9a5Kg3+23UiQ@mail.gmail.com

https://lore.kernel.org/linux-btrfs/vqfmv8-9ch.ln1@hurikhan.ath.cx
Jan Kara June 24, 2022, 1:40 p.m. UTC | #8
On Fri 24-06-22 21:19:04, Qu Wenruo wrote:
> 
> 
> On 2022/6/24 21:07, Jan Kara wrote:
> > On Fri 24-06-22 14:51:18, Christoph Hellwig wrote:
> > > On Fri, Jun 24, 2022 at 08:30:00PM +0800, Qu Wenruo wrote:
> > > > But from my previous feedback on subpage code, it looks like it's some
> > > > hardware archs (S390?) that can not do page flags update atomically.
> > > > 
> > > > I have tested similar thing, with extra ASSERT() to make sure the cow
> > > > fixup code never get triggered.
> > > > 
> > > > At least for x86_64 and aarch64 it's OK here.
> > > > 
> > > > So I hope this time we can get a concrete reason on why we need the
> > > > extra page Private2 bit in the first place.
> > > 
> > > I don't think atomic page flags are a thing here.  I remember Jan
> > > had chased a bug where we'd get into trouble into this area in
> > > ext4 due to the way pages are locked down for direct I/O, but I
> > > don't even remember seeing that on XFS.  Either way the PageOrdered
> > > check prevents a crash in that case and we really can't expect
> > > data to properly be written back in that case.
> > 
> > I'm not sure I get the context 100% right but pages getting randomly dirty
> > behind filesystem's back can still happen - most commonly with RDMA and
> > similar stuff which calls set_page_dirty() on pages it has got from
> > pin_user_pages() once the transfer is done.
> 
> Just curious, things like RDMA can mark those pages dirty even without
> letting kernel know, but how could those pages be from page cache? By
> mmap()?

Yes, you pass virtual address to RDMA ioctl and it uses memory at that
address as a target buffer for RDMA. If the target address happens to be
mmapped file, filesystem has problems...

> > page_maybe_dma_pinned() should
> > be usable within filesystems to detect such cases and protect the
> > filesystem but so far neither me nor John Hubbart has got to implement this
> > in the generic writeback infrastructure + some filesystem as a sample case
> > others could copy...
> 
> So the generic idea is just to detect if the page is marked dirty by
> traditional means, and if not, skip the writeback for them, and wait for
> proper notification to fs?

Kind of. The idea is to treat page_maybe_dma_pinned() pages as permanently
dirty (because we have no control over when the hardware decides to modify
the page contents by DMA). So skip the writeback if we can (e.g. memory
cleaning type of writeback) and use bounce pages to do data integrity
writeback (which cannot be skipped).

								Honza
Qu Wenruo June 24, 2022, 1:50 p.m. UTC | #9
On 2022/6/24 21:27, David Sterba wrote:
> On Fri, Jun 24, 2022 at 09:12:44PM +0800, Qu Wenruo wrote:
>> On 2022/6/24 20:49, David Sterba wrote:
>>> On Fri, Jun 24, 2022 at 02:23:34PM +0200, Christoph Hellwig wrote:
>>>> Since the page_mkwrite address space operation was added, starting with
>>>> commit 9637a5efd4fb ("[PATCH] add page_mkwrite() vm_operations method")
>>>> in 2006, the kernel does not just dirty random pages without telling
>>>> the file system.
>>>
>>> It does and there's a history behind the fixup worker. tl;dr it can't be
>>> removed, though every now and then somebody comes and tries to.
>>>
>>> On s390 the page status is tracked in two places, hw and in memory and
>>> this needs to be synchronized manually.
>>>
>>> On x86_64 it's not a simple reason but it happens as well in some edge
>>> case where the mappings get removed and dirty page is set deep in the
>>> arch mm code.  We've been chasing it long time ago, I don't recall exact
>>> details and it's been a painful experience.
>>>
>>> If there's been any change on the s390 side or in arch/x86/mm code I
>>> don't know but to be on the safe side, I strongly assume the fixup code
>>> is needed unless proven otherwise.
>>
>> I'd say, if this can be a problem to btrfs, then all fs supporting COW
>> should also be affected, and should have similar workaround.
>
> Probably yes.
>
>> Furthermore, this means we can get a page dirtied without us knowing.
>
> This should not happen because we do have the detection of the page and
> extent state mismatch and the fixup worker makes things right again.
>
>> This is a super big surprise to any fs, and should be properly
>> documented, not just leaving some seemly dead and special code in some
>> random fs.
>
> You seem to be a non-believer that the bug is real and calling the code
> dead.

Nope, as Jan mentioned RDMA, it immediately lit the bulb in my head.

Now I'm totally convinced we can have page marked dirty without proper
notification to fs.

So now I think this is real bug.

But unfortunately the code itself has no concrete reasons on which cases
this can happen; it just mentions the kernel can mark pages dirty (seemingly
randomly).

Thus it "looks like" a dead code.


> Each filesystem should validate the implementation against the
> platform where it is and btrfs once found the hard way that there are
> some corner cases where structures get out of sync.

In fact, from the fs point of view, there are quite some expectations on
its interfaces; if there is a surprise and such a problem is no longer
really specific to btrfs, then it should be addressed more generically.

>
>> Furthermore, I'm not sure even if handling this in a fs level is correct.
>> This looks like more a MM problem to me then.
>>
>>
>> I totally understand it's a pain to debug such lowlevel bug, but
>> shouldn't we have a proper regression for it then?
>
> The regression test is generic/208 and it was not reliable at all, it
> fired randomly once a week or month, there used to be a BUG() in the
> fixup worker callback.

And it doesn't have any comment even related to this unexpected dirty pages.

>
>> Instead of just keeping what we know works, I really want to handle this
>> old case/bug in a more modern way.
>
> As long as the guarantees stay the same, then fine. We need to be able
> to detect the unexpected dirty bit and have a way to react to it.
>
> f4b1363cae43 ("btrfs: do not do delalloc reservation under page lock")
> 25f3c5021985 ("Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker")
> 1d53c9e67230 ("Btrfs: only associate the locked page with one async_chunk struct")
>
> And the commit that fixed it:
>
> 87826df0ec36 ("btrfs: delalloc for page dirtied out-of-band in fixup worker")
>
> You can find several reports in the mailing list archives (search term
> btrfs_writepage_fixup_worker):

To me, a proper and modern solution is not to rely on super old reports
(although they are definitely helpful as a record), but proper explanation.

Thanks to Jan, RDMA would be a very direct example for this.

Although personally speaking, I still think we should limit on who can
set a page from page cache dirty.
(AKA, ensuring fs receives notification on every dirtied page)

Thanks,
Qu

>
> https://lore.kernel.org/linux-btrfs/1295053074.15265.6.camel@mercury.localdomain
>
> https://lore.kernel.org/linux-btrfs/20110701174436.GA8352@yahoo.fr
>
> https://lore.kernel.org/linux-btrfs/j0k65i$29a$1@dough.gmane.org
>
> https://lore.kernel.org/linux-btrfs/CAO47_--H0+6bu4qQ2QA9gZcHvGVWO4QUGCAb3+9a5Kg3+23UiQ@mail.gmail.com
>
> https://lore.kernel.org/linux-btrfs/vqfmv8-9ch.ln1@hurikhan.ath.cx
Qu Wenruo June 24, 2022, 1:56 p.m. UTC | #10
On 2022/6/24 21:40, Jan Kara wrote:
> On Fri 24-06-22 21:19:04, Qu Wenruo wrote:
>>
>>
>> On 2022/6/24 21:07, Jan Kara wrote:
>>> On Fri 24-06-22 14:51:18, Christoph Hellwig wrote:
>>>> On Fri, Jun 24, 2022 at 08:30:00PM +0800, Qu Wenruo wrote:
>>>>> But from my previous feedback on subpage code, it looks like it's some
>>>>> hardware archs (S390?) that can not do page flags update atomically.
>>>>>
>>>>> I have tested similar thing, with extra ASSERT() to make sure the cow
>>>>> fixup code never get triggered.
>>>>>
>>>>> At least for x86_64 and aarch64 it's OK here.
>>>>>
>>>>> So I hope this time we can get a concrete reason on why we need the
>>>>> extra page Private2 bit in the first place.
>>>>
>>>> I don't think atomic page flags are a thing here.  I remember Jan
>>>> had chased a bug where we'd get into trouble into this area in
>>>> ext4 due to the way pages are locked down for direct I/O, but I
>>>> don't even remember seeing that on XFS.  Either way the PageOrdered
>>>> check prevents a crash in that case and we really can't expect
>>>> data to properly be written back in that case.
>>>
>>> I'm not sure I get the context 100% right but pages getting randomly dirty
>>> behind filesystem's back can still happen - most commonly with RDMA and
>>> similar stuff which calls set_page_dirty() on pages it has got from
>>> pin_user_pages() once the transfer is done.
>>
>> Just curious, things like RDMA can mark those pages dirty even without
>> letting kernel know, but how could those pages be from page cache? By
>> mmap()?
>
> Yes, you pass virtual address to RDMA ioctl and it uses memory at that
> address as a target buffer for RDMA. If the target address happens to be
> mmapped file, filesystem has problems...

Oh my god, this is going to be disaster.

RDMA is really almost a blackbox which can do anything to the pages.

If some RDMA drivers choose to screw up with Private2, the btrfs
workaround is also screwed up.

Another problem is related to subpage.

Btrfs (and iomap) all uses page->private to store extra bitmaps for
subpage usage.
If RDMA is changing page flags, it can easily lead to de-sync between
subpage bitmaps with real page flags.

I can no longer sleep well knowing this...

Thanks,
Qu
>
>>> page_maybe_dma_pinned() should
>>> be usable within filesystems to detect such cases and protect the
>>> filesystem but so far neither me nor John Hubbart has got to implement this
>>> in the generic writeback infrastructure + some filesystem as a sample case
>>> others could copy...
>>
>> So the generic idea is just to detect if the page is marked dirty by
>> traditional means, and if not, skip the writeback for them, and wait for
>> proper notification to fs?
>
> Kind of. The idea is to treat page_maybe_dma_pinned() pages as permanently
> dirty (because we have no control over when the hardware decides to modify
> the page contents by DMA). So skip the writeback if we can (e.g. memory
> cleaning type of writeback) and use bounce pages to do data integrity
> writeback (which cannot be skipped).
>
> 								Honza
Christoph Hellwig June 25, 2022, 9:11 a.m. UTC | #11
On Fri, Jun 24, 2022 at 03:07:50PM +0200, Jan Kara wrote:
> I'm not sure I get the context 100% right but pages getting randomly dirty
> behind filesystem's back can still happen - most commonly with RDMA and
> similar stuff which calls set_page_dirty() on pages it has got from
> pin_user_pages() once the transfer is done. page_maybe_dma_pinned() should
> be usable within filesystems to detect such cases and protect the
> filesystem but so far neither me nor John Hubbart has got to implement this
> in the generic writeback infrastructure + some filesystem as a sample case
> others could copy...

Well, so far the strategy elsewhere seems to be to just ignore pages
only dirtied through get_user_pages.  E.g. iomap skips over pages
reported as holes, and ext4_writepage complains about pages without
buffers and then clears the dirty bit and continues.

I'm kinda surprised that btrfs wants to treat this so special
especially as more of the btrfs page and sub-page status will be out
of date as well.
Jan Kara June 27, 2022, 10:15 a.m. UTC | #12
On Fri 24-06-22 21:56:57, Qu Wenruo wrote:
> On 2022/6/24 21:40, Jan Kara wrote:
> > On Fri 24-06-22 21:19:04, Qu Wenruo wrote:
> > > 
> > > 
> > > On 2022/6/24 21:07, Jan Kara wrote:
> > > > On Fri 24-06-22 14:51:18, Christoph Hellwig wrote:
> > > > > On Fri, Jun 24, 2022 at 08:30:00PM +0800, Qu Wenruo wrote:
> > > > > > But from my previous feedback on subpage code, it looks like it's some
> > > > > > hardware archs (S390?) that can not do page flags update atomically.
> > > > > > 
> > > > > > I have tested similar thing, with extra ASSERT() to make sure the cow
> > > > > > fixup code never get triggered.
> > > > > > 
> > > > > > At least for x86_64 and aarch64 it's OK here.
> > > > > > 
> > > > > > So I hope this time we can get a concrete reason on why we need the
> > > > > > extra page Private2 bit in the first place.
> > > > > 
> > > > > I don't think atomic page flags are a thing here.  I remember Jan
> > > > > had chased a bug where we'd get into trouble into this area in
> > > > > ext4 due to the way pages are locked down for direct I/O, but I
> > > > > don't even remember seeing that on XFS.  Either way the PageOrdered
> > > > > check prevents a crash in that case and we really can't expect
> > > > > data to properly be written back in that case.
> > > > 
> > > > I'm not sure I get the context 100% right but pages getting randomly dirty
> > > > behind filesystem's back can still happen - most commonly with RDMA and
> > > > similar stuff which calls set_page_dirty() on pages it has got from
> > > > pin_user_pages() once the transfer is done.
> > > 
> > > Just curious, things like RMDA can mark those pages dirty even without
> > > letting kernel know, but how could those pages be from page cache? By
> > > mmap()?
> > 
> > Yes, you pass virtual address to RDMA ioctl and it uses memory at that
> > address as a target buffer for RDMA. If the target address happens to be
> > mmapped file, filesystem has problems...
> 
> Oh my god, this is going to be disaster.
> 
> RDMA is really almost a blackbox which can do anything to the pages.
> 
> If some RDMA drivers choose to screw up with Private2, the btrfs
> workaround is also screwed up.
> 
> Another problem is related to subpage.
> 
> Btrfs (and iomap) all uses page->private to store extra bitmaps for
> subpage usage.
> If RDMA is changing page flags, it can easily lead to de-sync between
> subpage bitmaps with real page flags.

Well, RDMA could do this (in fact any kernel code can do this, can't it? ;)
but RDMA is not expected to mess with page state arbitrarily. The only
thing it should be doing (and that is kind of the whole point of RDMA) is
that it allows RDMA card to alter page contents through DMA and then
dirties those pages to tell the rest of the kernel that page contents
changed.

So practically we need to treat pages pinned by RDMA drivers as "writeably
mapped to userspace" without a chance to unmap them.

								Honza
Jan Kara June 27, 2022, 10:19 a.m. UTC | #13
On Sat 25-06-22 11:11:43, Christoph Hellwig wrote:
> On Fri, Jun 24, 2022 at 03:07:50PM +0200, Jan Kara wrote:
> > I'm not sure I get the context 100% right but pages getting randomly dirty
> > behind filesystem's back can still happen - most commonly with RDMA and
> > similar stuff which calls set_page_dirty() on pages it has got from
> > pin_user_pages() once the transfer is done. page_maybe_dma_pinned() should
> > be usable within filesystems to detect such cases and protect the
> > filesystem but so far neither me nor John Hubbart has got to implement this
> > in the generic writeback infrastructure + some filesystem as a sample case
> > others could copy...
> 
> Well, so far the strategy elsewhere seems to be to just ignore pages
> only dirtied through get_user_pages.  E.g. iomap skips over pages
> reported as holes, and ext4_writepage complains about pages without
> buffers and then clears the dirty bit and continues.
> 
> I'm kinda surprised that btrfs wants to treat this so special
> especially as more of the btrfs page and sub-page status will be out
> of date as well.

I agree btrfs probably needs a different solution than what it is currently
doing if they want to get things right. I just wanted to make it clear that
the code you are ripping out may be a wrong solution but to a real problem.

								Honza
Qu Wenruo June 28, 2022, 12:24 a.m. UTC | #14
On 2022/6/27 18:19, Jan Kara wrote:
> On Sat 25-06-22 11:11:43, Christoph Hellwig wrote:
>> On Fri, Jun 24, 2022 at 03:07:50PM +0200, Jan Kara wrote:
>>> I'm not sure I get the context 100% right but pages getting randomly dirty
>>> behind filesystem's back can still happen - most commonly with RDMA and
>>> similar stuff which calls set_page_dirty() on pages it has got from
>>> pin_user_pages() once the transfer is done. page_maybe_dma_pinned() should
>>> be usable within filesystems to detect such cases and protect the
>>> filesystem but so far neither me nor John Hubbart has got to implement this
>>> in the generic writeback infrastructure + some filesystem as a sample case
>>> others could copy...
>>
>> Well, so far the strategy elsewhere seems to be to just ignore pages
>> only dirtied through get_user_pages.  E.g. iomap skips over pages
>> reported as holes, and ext4_writepage complains about pages without
>> buffers and then clears the dirty bit and continues.
>>
>> I'm kinda surprised that btrfs wants to treat this so special
>> especially as more of the btrfs page and sub-page status will be out
>> of date as well.
>
> I agree btrfs probably needs a different solution than what it is currently
> doing if they want to get things right. I just wanted to make it clear that
> the code you are ripping out may be a wrong solution but to a real problem.

IHMO I believe btrfs should also ignore such dirty but not managed by fs
pages.

But I still have a small concern here.

Is it ensured that, after RDMA dirties the pages, we will finally get
a proper notification to the fs that those pages are marked written?

If not, I would guess those pages would never get a chance to be written
back.

If yes, then I'm totally fine to go the ignoring path.

Thanks,
Qu
>
> 								Honza
Jan Kara June 28, 2022, 8 a.m. UTC | #15
On Tue 28-06-22 08:24:07, Qu Wenruo wrote:
> On 2022/6/27 18:19, Jan Kara wrote:
> > On Sat 25-06-22 11:11:43, Christoph Hellwig wrote:
> > > On Fri, Jun 24, 2022 at 03:07:50PM +0200, Jan Kara wrote:
> > > > I'm not sure I get the context 100% right but pages getting randomly dirty
> > > > behind filesystem's back can still happen - most commonly with RDMA and
> > > > similar stuff which calls set_page_dirty() on pages it has got from
> > > > pin_user_pages() once the transfer is done. page_maybe_dma_pinned() should
> > > > be usable within filesystems to detect such cases and protect the
> > > > filesystem but so far neither me nor John Hubbart has got to implement this
> > > > in the generic writeback infrastructure + some filesystem as a sample case
> > > > others could copy...
> > > 
> > > Well, so far the strategy elsewhere seems to be to just ignore pages
> > > only dirtied through get_user_pages.  E.g. iomap skips over pages
> > > reported as holes, and ext4_writepage complains about pages without
> > > buffers and then clears the dirty bit and continues.
> > > 
> > > I'm kinda surprised that btrfs wants to treat this so special
> > > especially as more of the btrfs page and sub-page status will be out
> > > of date as well.
> > 
> > I agree btrfs probably needs a different solution than what it is currently
> > doing if they want to get things right. I just wanted to make it clear that
> > the code you are ripping out may be a wrong solution but to a real problem.
> 
> IHMO I believe btrfs should also ignore such dirty but not managed by fs
> pages.
> 
> But I still have a small concern here.
> 
> Is it ensured that, after RDMA dirtying the pages, would we finally got
> a proper notification to fs that those pages are marked written?

So there is ->page_mkwrite() notification happening when RDMA code calls
pin_user_pages() when preparing buffers. The trouble is that although later
page_mkclean() makes page not writeable from page tables, it may be still
written by RDMA code (even hours after ->page_mkwrite() notification, RDMA
buffers are really long-lived) and that's what eventually confuses the
filesystem.  Otherwise set_page_dirty() is the notification that page
contents was changed and needs writing out...

								Honza
David Sterba June 28, 2022, 11:46 a.m. UTC | #16
On Sat, Jun 25, 2022 at 11:11:43AM +0200, Christoph Hellwig wrote:
> On Fri, Jun 24, 2022 at 03:07:50PM +0200, Jan Kara wrote:
> > I'm not sure I get the context 100% right but pages getting randomly dirty
> > behind filesystem's back can still happen - most commonly with RDMA and
> > similar stuff which calls set_page_dirty() on pages it has got from
> > pin_user_pages() once the transfer is done. page_maybe_dma_pinned() should
> > be usable within filesystems to detect such cases and protect the
> > filesystem but so far neither me nor John Hubbart has got to implement this
> > in the generic writeback infrastructure + some filesystem as a sample case
> > others could copy...
> 
> Well, so far the strategy elsewhere seems to be to just ignore pages
> only dirtied through get_user_pages.  E.g. iomap skips over pages
> reported as holes, and ext4_writepage complains about pages without
> buffers and then clears the dirty bit and continues.
> 
> I'm kinda surprised that btrfs wants to treat this so special
> especially as more of the btrfs page and sub-page status will be out
> of date as well.

I'm not sure it's safe to ignore that in btrfs, that's sounds quite
risky and potentially breaking something. It's not only that page does
not have buffers but that there's page with dirty bit set but without
the associated extents updated. So this needs a COW to get it back to
sync. It's special because it's for COW, can't compare that to ext4.
David Sterba June 28, 2022, 11:53 a.m. UTC | #17
On Tue, Jun 28, 2022 at 08:24:07AM +0800, Qu Wenruo wrote:
> On 2022/6/27 18:19, Jan Kara wrote:
> > On Sat 25-06-22 11:11:43, Christoph Hellwig wrote:
> >> On Fri, Jun 24, 2022 at 03:07:50PM +0200, Jan Kara wrote:
> >>> I'm not sure I get the context 100% right but pages getting randomly dirty
> >>> behind filesystem's back can still happen - most commonly with RDMA and
> >>> similar stuff which calls set_page_dirty() on pages it has got from
> >>> pin_user_pages() once the transfer is done. page_maybe_dma_pinned() should
> >>> be usable within filesystems to detect such cases and protect the
> >>> filesystem but so far neither me nor John Hubbart has got to implement this
> >>> in the generic writeback infrastructure + some filesystem as a sample case
> >>> others could copy...
> >>
> >> Well, so far the strategy elsewhere seems to be to just ignore pages
> >> only dirtied through get_user_pages.  E.g. iomap skips over pages
> >> reported as holes, and ext4_writepage complains about pages without
> >> buffers and then clears the dirty bit and continues.
> >>
> >> I'm kinda surprised that btrfs wants to treat this so special
> >> especially as more of the btrfs page and sub-page status will be out
> >> of date as well.
> >
> > I agree btrfs probably needs a different solution than what it is currently
> > doing if they want to get things right. I just wanted to make it clear that
> > the code you are ripping out may be a wrong solution but to a real problem.
> 
> IHMO I believe btrfs should also ignore such dirty but not managed by fs
> pages.
> 
> But I still have a small concern here.
> 
> Is it ensured that, after RDMA dirtying the pages, would we finally got
> a proper notification to fs that those pages are marked written?
> 
> If not, I would guess those pages would never got a chance to be written
> back.
> 
> If yes, then I'm totally fine to go the ignoring path.

This would work only for the higher level API where eg. RDMA notifies
the filesystem, but there's still the s390 case that is part of the
hardware architecture. The fixup worker is there as a safety for all
other cases, I'm not fine removing or ignoring it.
Chris Mason June 28, 2022, 2:29 p.m. UTC | #18
On 6/25/22 5:11 AM, Christoph Hellwig wrote:
> On Fri, Jun 24, 2022 at 03:07:50PM +0200, Jan Kara wrote:
>> I'm not sure I get the context 100% right but pages getting randomly dirty
>> behind filesystem's back can still happen - most commonly with RDMA and
>> similar stuff which calls set_page_dirty() on pages it has got from
>> pin_user_pages() once the transfer is done. page_maybe_dma_pinned() should
>> be usable within filesystems to detect such cases and protect the
>> filesystem but so far neither me nor John Hubbart has got to implement this
>> in the generic writeback infrastructure + some filesystem as a sample case
>> others could copy...
> 
> Well, so far the strategy elsewhere seems to be to just ignore pages
> only dirtied through get_user_pages.  E.g. iomap skips over pages
> reported as holes, and ext4_writepage complains about pages without
> buffers and then clears the dirty bit and continues.
> 
> I'm kinda surprised that btrfs wants to treat this so special
> especially as more of the btrfs page and sub-page status will be out
> of date as well.

As Sterba points out later in the thread, btrfs cares more because of 
stable page requirements to protect data during COW and to make sure the 
crcs we write to disk are correct.

The fixup worker path is pretty easy to trigger if you O_DIRECT reads 
into mmap'd pages.  You need some memory pressure to power through 
get_user_pages trying to do the right thing, but it does happen.

I'd love a proper fix for this on the *_user_pages() side where 
page_mkwrite() style notifications are used all the time.  It's just a 
huge change, and my answer so far has always been that using btrfs 
mmap'd memory for this kind of thing isn't a great choice either way.

-chris
Qu Wenruo June 29, 2022, 1:20 a.m. UTC | #19
On 2022/6/28 22:29, Chris Mason wrote:
> On 6/25/22 5:11 AM, Christoph Hellwig wrote:
>> On Fri, Jun 24, 2022 at 03:07:50PM +0200, Jan Kara wrote:
>>> I'm not sure I get the context 100% right but pages getting randomly
>>> dirty
>>> behind filesystem's back can still happen - most commonly with RDMA and
>>> similar stuff which calls set_page_dirty() on pages it has got from
>>> pin_user_pages() once the transfer is done. page_maybe_dma_pinned()
>>> should
>>> be usable within filesystems to detect such cases and protect the
>>> filesystem but so far neither me nor John Hubbart has got to
>>> implement this
>>> in the generic writeback infrastructure + some filesystem as a sample
>>> case
>>> others could copy...
>>
>> Well, so far the strategy elsewhere seems to be to just ignore pages
>> only dirtied through get_user_pages.  E.g. iomap skips over pages
>> reported as holes, and ext4_writepage complains about pages without
>> buffers and then clears the dirty bit and continues.
>>
>> I'm kinda surprised that btrfs wants to treat this so special
>> especially as more of the btrfs page and sub-page status will be out
>> of date as well.
>
> As Sterba points out later in the thread, btrfs cares more because of
> stable page requirements to protect data during COW and to make sure the
> crcs we write to disk are correct.

In fact, COW is not that special, even before btrfs or all the other
fses supporting COW, all those old fses had to do something like COW,
when they are writing into holes.

What makes btrfs special is its csum, and in fact csum requires more
stable page status.

If someone can modify a page without waiting for its writeback to
finish, btrfs csum can easily be stale and cause -EIO for future read.

Thus unless we can ensure the procedure marking page dirty to respect
page writeback, going fixup path would be more dangerous than ignoring it.

>
> The fixup worker path is pretty easy to trigger if you O_DIRECT reads
> into mmap'd pages.  You need some memory pressure to power through
> get_user_pages trying to do the right thing, but it does happen.
>
> I'd love a proper fix for this on the *_user_pages() side where
> page_mkwrite() style notifications are used all the time.  It's just a
> huge change, and my answer so far has always been that using btrfs
> mmap'd memory for this kind of thing isn't a great choice either way.

The same here.

But for now I'm still wondering if the fixup is really the correct
workaround other than ignoring.

Thanks,
Qu

>
> -chris
Qu Wenruo June 29, 2022, 1:33 a.m. UTC | #20
On 2022/6/28 16:00, Jan Kara wrote:
> On Tue 28-06-22 08:24:07, Qu Wenruo wrote:
>> On 2022/6/27 18:19, Jan Kara wrote:
>>> On Sat 25-06-22 11:11:43, Christoph Hellwig wrote:
>>>> On Fri, Jun 24, 2022 at 03:07:50PM +0200, Jan Kara wrote:
>>>>> I'm not sure I get the context 100% right but pages getting randomly dirty
>>>>> behind filesystem's back can still happen - most commonly with RDMA and
>>>>> similar stuff which calls set_page_dirty() on pages it has got from
>>>>> pin_user_pages() once the transfer is done. page_maybe_dma_pinned() should
>>>>> be usable within filesystems to detect such cases and protect the
>>>>> filesystem but so far neither me nor John Hubbart has got to implement this
>>>>> in the generic writeback infrastructure + some filesystem as a sample case
>>>>> others could copy...
>>>>
>>>> Well, so far the strategy elsewhere seems to be to just ignore pages
>>>> only dirtied through get_user_pages.  E.g. iomap skips over pages
>>>> reported as holes, and ext4_writepage complains about pages without
>>>> buffers and then clears the dirty bit and continues.
>>>>
>>>> I'm kinda surprised that btrfs wants to treat this so special
>>>> especially as more of the btrfs page and sub-page status will be out
>>>> of date as well.
>>>
>>> I agree btrfs probably needs a different solution than what it is currently
>>> doing if they want to get things right. I just wanted to make it clear that
>>> the code you are ripping out may be a wrong solution but to a real problem.
>>
>> IHMO I believe btrfs should also ignore such dirty but not managed by fs
>> pages.
>>
>> But I still have a small concern here.
>>
>> Is it ensured that, after RDMA dirtying the pages, would we finally got
>> a proper notification to fs that those pages are marked written?
>
> So there is ->page_mkwrite() notification happening when RDMA code calls
> pin_user_pages() when preparing buffers.

I'm wondering why page_mkwrite() is only called when preparing the buffer?

Wouldn't it make more sense to call page_mkwrite() when the buffer is
released from RDMA?

Sorry for all these dumb questions, as the core-api/pin_user_pages.rst
still doesn't explain thing to my dumb brain...



Another thing is, RDMA doesn't really need to respect things like page
locked/writeback, right?
As to RDMA calls, all pages should be pinned and seemingly exclusive to
them.

And in that case, I think btrfs should ignore writing back those pages,
rather than doing fixups.

As the btrfs csum requires everyone modifying the page to wait for
writeback, or the written data will be out-of-sync with the calculated
csum and cause future -EIO when reading it from disk.


> The trouble is that although later
> page_mkclean() makes page not writeable from page tables, it may be still
> written by RDMA code (even hours after ->page_mkwrite() notification, RDMA
> buffers are really long-lived) and that's what eventually confuses the
> filesystem.  Otherwise set_page_dirty() is the notification that page
> contents was changed and needs writing out...

Another thing I still didn't get is, are there any explicit
mkwrite()/set_page_dirty() calls when those pages are unpinned?

If there are no such explicit calls, these dirty pages caused by RDMA would
always be ignored by fses (except btrfs), and would never get properly written back.

Thanks,
Qu

>
> 								Honza
Christoph Hellwig June 29, 2022, 7:58 a.m. UTC | #21
On Tue, Jun 28, 2022 at 01:53:56PM +0200, David Sterba wrote:
> This would work only for the higher level API where eg. RDMA notifies
> the filesystem, but there's still the s390 case that is part of the
> hardware architecture. The fixup worker is there as a safety for all
> other cases, I'm not fine removing or ignoring it.

I'd really like to have a confirmation of this whole s390 theory.
s390 does treat some dirtying different than the other architectures,
but none of that should leak into the file system API if any way that
bypasses ->page_mkwrite.

Because if it did most file systems would be completely broken on
s390.
Christoph Hellwig June 29, 2022, 8:38 a.m. UTC | #22
On Tue, Jun 28, 2022 at 10:29:00AM -0400, Chris Mason wrote:
> As Sterba points out later in the thread, btrfs cares more because of 
> stable page requirements to protect data during COW and to make sure the 
> crcs we write to disk are correct.

I don't think this matters here.  What the other file systems do is to
simply not ever write a page that has the dirty bit set, but never had
->page_mkwrite called on it, which is the case that is getting fixed up
here.

I did a little research and this post from Jan describes the problem
best:

https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/

So the problem is that while get_user_pages takes a write fault and
marks the page dirty, the page could have been cleaned just after
that, and then receive a set_page/folio_dirty after that.  The
canonical example would be the direct I/O read completion calling
into that.

> I'd love a proper fix for this on the *_user_pages() side where 
> page_mkwrite() style notifications are used all the time.  It's just a huge 
> change, and my answer so far has always been that using btrfs mmap'd memory 
> for this kind of thing isn't a great choice either way.

Everyone else has the same problem, but decided that you can't get
full data integrity out of this workload.

I think the sane answers are:  simply don't writeback pages that
are held by a get_user_pages with writable pages, or try to dirty
the pages from set_page_dirty.  The set_page_dirty contexts are
somewhat iffy, but would probably be a better place to kick off the
btrfs writepage fixup.
Christoph Hellwig June 29, 2022, 8:40 a.m. UTC | #23
On Wed, Jun 29, 2022 at 09:20:59AM +0800, Qu Wenruo wrote:
> In fact, COW is not that special, even before btrfs or all the other
> fses supporting COW, all those old fses has to do something like COW,
> when they are writing into holes.
>
> What makes btrfs special is its csum, and in fact csum requires more
> stable page status.
>
> If someone can modify a page without waiting for its writeback to
> finish, btrfs csum can easily be stale and cause -EIO for future read.

And the writepage time fixup does not help with this at all, as it
just allocates the ordered extent at writepage time, long after the
data has been modified.
Jan Kara June 29, 2022, 9:45 a.m. UTC | #24
On Tue 28-06-22 10:29:00, Chris Mason wrote:
> I'd love a proper fix for this on the *_user_pages() side where
> page_mkwrite() style notifications are used all the time.  It's just a huge
> change, and my answer so far has always been that using btrfs mmap'd memory
> for this kind of thing isn't a great choice either way.

As Christoph wrote, it isn't a problem that you would not get a
page_mkwrite() notification. That happens just fine. But the problem is
that after that, the page can still get modified after you've removed all
writeable mappings of the page (e.g. by calling page_mkclean() in
clear_page_dirty_for_io()). And there's no way a kernel can provide further
notification for such writes because we've simply handed off the page
physical address to the HW to DMA into.

So the only viable solution is really for the filesystem to detect such
unprotectable pages if it cares and somehow deal with them (skip writeback,
use bounce pages, ...). The good news is that we already have
page_maybe_dma_pinned() call that at least allows detection of such
unprotectable pages.

								Honza
Jan Kara June 29, 2022, 10:03 a.m. UTC | #25
On Wed 29-06-22 09:33:23, Qu Wenruo wrote:
> 
> 
> On 2022/6/28 16:00, Jan Kara wrote:
> > On Tue 28-06-22 08:24:07, Qu Wenruo wrote:
> > > On 2022/6/27 18:19, Jan Kara wrote:
> > > > On Sat 25-06-22 11:11:43, Christoph Hellwig wrote:
> > > > > On Fri, Jun 24, 2022 at 03:07:50PM +0200, Jan Kara wrote:
> > > > > > I'm not sure I get the context 100% right but pages getting randomly dirty
> > > > > > behind filesystem's back can still happen - most commonly with RDMA and
> > > > > > similar stuff which calls set_page_dirty() on pages it has got from
> > > > > > pin_user_pages() once the transfer is done. page_maybe_dma_pinned() should
> > > > > > be usable within filesystems to detect such cases and protect the
> > > > > > filesystem but so far neither me nor John Hubbart has got to implement this
> > > > > > in the generic writeback infrastructure + some filesystem as a sample case
> > > > > > others could copy...
> > > > > 
> > > > > Well, so far the strategy elsewhere seems to be to just ignore pages
> > > > > only dirtied through get_user_pages.  E.g. iomap skips over pages
> > > > > reported as holes, and ext4_writepage complains about pages without
> > > > > buffers and then clears the dirty bit and continues.
> > > > > 
> > > > > I'm kinda surprised that btrfs wants to treat this so special
> > > > > especially as more of the btrfs page and sub-page status will be out
> > > > > of date as well.
> > > > 
> > > > I agree btrfs probably needs a different solution than what it is currently
> > > > doing if they want to get things right. I just wanted to make it clear that
> > > > the code you are ripping out may be a wrong solution but to a real problem.
> > > 
> > > IHMO I believe btrfs should also ignore such dirty but not managed by fs
> > > pages.
> > > 
> > > But I still have a small concern here.
> > > 
> > > Is it ensured that, after RDMA dirtying the pages, would we finally got
> > > a proper notification to fs that those pages are marked written?
> > 
> > So there is ->page_mkwrite() notification happening when RDMA code calls
> > pin_user_pages() when preparing buffers.
> 
> I'm wondering why page_mkwrite() is only called when preparing the buffer?

Because that's the moment when the page fault happens. After this moment we
simply give the page physical address to the HW card and the card is free
to modify that memory as it wishes without telling the kernel about it.
That is simply how the HW is designed.

> Wouldn't it make more sense to call page_mkwrite() when the buffered is
> released from RDMA?

Well, but this is long after the page contents have been modified and in
fact the page need not be mapped to process' virtual address space anymore
by that time (it is perfectly fine to do: addr = mmap(file), pass addr to
HW, munmap(addr)). So we don't have enough context for page_mkwrite()
callback anymore. Essentially all we can provide is already provided in the
->set_page_dirty() callback the filesystem gets.

> Sorry for all these dumb questions, as the core-api/pin_user_pages.rst
> still doesn't explain thing to my dumb brain...

Yeah, these things are subtle and somewhat hard to grasp...

> Another thing is, RDMA doesn't really need to respect things like page
> locked/writeback, right?

Correct.

> As to RDMA calls, all pages should be pinned and seemingly exclusive to
> them.
> 
> And in that case, I think btrfs should ignore writing back those pages,
> other than doing fixing ups.
> 
> As the btrfs csum requires everyone modifying the page to wait for
> writeback, or the written data will be out-of-sync with the calculated
> csum and cause future -EIO when reading it from disk.

Yes, I know. Ignoring writeback of page_maybe_dma_pinned() pages is a
reasonable choice the fs can do. The only exception tends to be data
integrity writeback - stuff like fsync(2) or sync(2). There the filesystem
might need to writeback the page to make sure everything is consistent on
disk (and stale data is not exposed) in case of a crash. So in these
special cases it may be necessary to use bounce pages for submitting the IO
(and computing checksums etc.) so that inconsistencies you mention above
are not possible.

> > The trouble is that although later
> > page_mkclean() makes page not writeable from page tables, it may be still
> > written by RDMA code (even hours after ->page_mkwrite() notification, RDMA
> > buffers are really long-lived) and that's what eventually confuses the
> > filesystem.  Otherwise set_page_dirty() is the notification that page
> > contents was changed and needs writing out...
> 
> Another thing I still didn't get is, is there any explicit
> mkwrite()/set_page_dirty() calls when those page are unpinned.
> 
> If no such explicit calls, these dirty pages caused by RDMA would always
> be ignored by fses (except btrfs), and would never got proper written back.

When the pages are unpinned the holder must call set_page_dirty() to let
the rest of the kernel know that the hardware may have modified the page
contents. The filesystem can hook there with ->set_page_dirty() hook if it
needs to do some action.

								Honza
Christoph Hellwig June 29, 2022, 1:59 p.m. UTC | #26
On Wed, Jun 29, 2022 at 11:45:47AM +0200, Jan Kara wrote:
> So the only viable solution is really for the filesystem to detect such
> unprotectable pages if it cares and somehow deal with them (skip writeback,
> use bounce pages, ...). The good news is that we already have
> page_maybe_dma_pinned() call that at least allows detection of such
> unprotectable pages.

The bad news is that page_maybe_dma_pinned only accounts for FOLL_PIN
memory, and we till have a lot of users of FOLL_GET including direct I/O.

Now for FOLL_PIN I think most problems would be solved by
marking pages that are DMA pinned when writeback completes (that
is when PG_writeback is cleared) dirty again, and making sure the
equivalent of page_mkwrite is called for them again as well.  The
latter might be a bit ugly as PG_writeback could be cleared from
interrupt context, even if most modern file systems especially if
they do anything fancy like out of place writes or unwritten extents
defer it to a workqueue.

To reduce the overhead of this it would make sense to skip writing
these pages back at all unless it is a data integrity write.
Gerald Schaefer July 5, 2022, 2:21 p.m. UTC | #27
On Wed, 29 Jun 2022 09:58:37 +0200
Christoph Hellwig <hch@lst.de> wrote:

> On Tue, Jun 28, 2022 at 01:53:56PM +0200, David Sterba wrote:
> > This would work only for the higher level API where eg. RDMA notifies
> > the filesystem, but there's still the s390 case that is part of the
> > hardware architecture. The fixup worker is there as a safety for all
> > other cases, I'm not fine removing or ignoring it.  
> 
> I'd really like to have a confirmation of this whole s390 theory.
> s390 does treat some dirtying different than the other architectures,
> but none of that should leak into the file system API if any way that
> bypasses ->page_mkwrite.
> 
> Because if it did most file systems would be completely broken on
> s390.

Could you please be more specific about what exactly you mean with
"the s390 case that is part of the hardware architecture"?

One thing that s390 might handle different from others, is that it
is not using a HW dirty bit in the PTE, but instead a fault-triggered
SW dirty bit.

E.g. pte_mkwrite() will mark a PTE as writable (via another SW bit),
but not clear the HW protection bit, which would then generate a
fault on first write access. In handle_pte_fault(), the PTE would
then be marked as dirty via pte_mkdirty(), which also clears the HW
protection bit, at least for pte_write() PTEs.
For the !pte_write() COW case, we would go through do_wp_page() like
everybody else, but probably still end up in some pte_mkdirty()
eventually, to avoid getting another fault.

Not being familiar with either btrfs, any other fs, or RDMA, I cannot
really follow the discussion here. Still it seems to me that you are
not talking about special s390 HW architecture regarding PTE, but
rather about some (struct) page dirtying on the COW path, which should
be completely common code and not subject to any s390 special case.

Somewhere in this thread it was also mentioned that "s390 can not do
page flags update atomically", which I can not confirm, in case this
was the question. The code in include/linux/page-flags.h seems to
use normal (arch)_test/set/clear_bit operations, which should always be
atomic on s390.
diff mbox series

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 12f59e35755fa..39a1235eda48b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -882,13 +882,6 @@  struct btrfs_fs_info {
 	struct btrfs_workqueue *endio_write_workers;
 	struct btrfs_workqueue *endio_freespace_worker;
 	struct btrfs_workqueue *caching_workers;
-
-	/*
-	 * fixup workers take dirty pages that didn't properly go through
-	 * the cow mechanism and make them safe to write.  It happens
-	 * for the sys_munmap function call path
-	 */
-	struct btrfs_workqueue *fixup_workers;
 	struct btrfs_workqueue *delayed_workers;
 
 	struct task_struct *transaction_kthread;
@@ -3390,7 +3383,6 @@  int btrfs_prealloc_file_range_trans(struct inode *inode,
 int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page,
 		u64 start, u64 end, int *page_started, unsigned long *nr_written,
 		struct writeback_control *wbc);
-int btrfs_writepage_cow_fixup(struct page *page);
 void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
 					  struct page *page, u64 start,
 					  u64 end, bool uptodate);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 71d92f5dfcb2e..bf908e178ee59 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2163,7 +2163,6 @@  static int read_backup_root(struct btrfs_fs_info *fs_info, u8 priority)
 /* helper to cleanup workers */
 static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 {
-	btrfs_destroy_workqueue(fs_info->fixup_workers);
 	btrfs_destroy_workqueue(fs_info->delalloc_workers);
 	btrfs_destroy_workqueue(fs_info->hipri_workers);
 	btrfs_destroy_workqueue(fs_info->workers);
@@ -2358,9 +2357,6 @@  static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
 	fs_info->caching_workers =
 		btrfs_alloc_workqueue(fs_info, "cache", flags, max_active, 0);
 
-	fs_info->fixup_workers =
-		btrfs_alloc_workqueue(fs_info, "fixup", flags, 1, 0);
-
 	fs_info->endio_workers =
 		alloc_workqueue("btrfs-endio", flags, max_active);
 	fs_info->endio_meta_workers =
@@ -2390,7 +2386,7 @@  static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
 	      fs_info->compressed_write_workers &&
 	      fs_info->endio_write_workers && fs_info->endio_raid56_workers &&
 	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
-	      fs_info->caching_workers && fs_info->fixup_workers &&
+	      fs_info->caching_workers &&
 	      fs_info->delayed_workers && fs_info->qgroup_rescan_workers &&
 	      fs_info->discard_ctl.discard_workers)) {
 		return -ENOMEM;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 587d2ba20b53b..232a858b99a0a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3944,13 +3944,8 @@  static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 	bool has_error = false;
 	bool compressed;
 
-	ret = btrfs_writepage_cow_fixup(page);
-	if (ret) {
-		/* Fixup worker will requeue */
-		redirty_page_for_writepage(wbc, page);
-		unlock_page(page);
-		return 1;
-	}
+	if (WARN_ON_ONCE(!PageOrdered(page)))
+		return -EIO;
 
 	/*
 	 * we don't want to touch the inode after unlocking the page,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index eea351216db33..5cf314a1b5d17 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2815,194 +2815,6 @@  int btrfs_set_extent_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
 				   cached_state);
 }
 
-/* see btrfs_writepage_start_hook for details on why this is required */
-struct btrfs_writepage_fixup {
-	struct page *page;
-	struct inode *inode;
-	struct btrfs_work work;
-};
-
-static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
-{
-	struct btrfs_writepage_fixup *fixup;
-	struct btrfs_ordered_extent *ordered;
-	struct extent_state *cached_state = NULL;
-	struct extent_changeset *data_reserved = NULL;
-	struct page *page;
-	struct btrfs_inode *inode;
-	u64 page_start;
-	u64 page_end;
-	int ret = 0;
-	bool free_delalloc_space = true;
-
-	fixup = container_of(work, struct btrfs_writepage_fixup, work);
-	page = fixup->page;
-	inode = BTRFS_I(fixup->inode);
-	page_start = page_offset(page);
-	page_end = page_offset(page) + PAGE_SIZE - 1;
-
-	/*
-	 * This is similar to page_mkwrite, we need to reserve the space before
-	 * we take the page lock.
-	 */
-	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
-					   PAGE_SIZE);
-again:
-	lock_page(page);
-
-	/*
-	 * Before we queued this fixup, we took a reference on the page.
-	 * page->mapping may go NULL, but it shouldn't be moved to a different
-	 * address space.
-	 */
-	if (!page->mapping || !PageDirty(page) || !PageChecked(page)) {
-		/*
-		 * Unfortunately this is a little tricky, either
-		 *
-		 * 1) We got here and our page had already been dealt with and
-		 *    we reserved our space, thus ret == 0, so we need to just
-		 *    drop our space reservation and bail.  This can happen the
-		 *    first time we come into the fixup worker, or could happen
-		 *    while waiting for the ordered extent.
-		 * 2) Our page was already dealt with, but we happened to get an
-		 *    ENOSPC above from the btrfs_delalloc_reserve_space.  In
-		 *    this case we obviously don't have anything to release, but
-		 *    because the page was already dealt with we don't want to
-		 *    mark the page with an error, so make sure we're resetting
-		 *    ret to 0.  This is why we have this check _before_ the ret
-		 *    check, because we do not want to have a surprise ENOSPC
-		 *    when the page was already properly dealt with.
-		 */
-		if (!ret) {
-			btrfs_delalloc_release_extents(inode, PAGE_SIZE);
-			btrfs_delalloc_release_space(inode, data_reserved,
-						     page_start, PAGE_SIZE,
-						     true);
-		}
-		ret = 0;
-		goto out_page;
-	}
-
-	/*
-	 * We can't mess with the page state unless it is locked, so now that
-	 * it is locked bail if we failed to make our space reservation.
-	 */
-	if (ret)
-		goto out_page;
-
-	lock_extent_bits(&inode->io_tree, page_start, page_end, &cached_state);
-
-	/* already ordered? We're done */
-	if (PageOrdered(page))
-		goto out_reserved;
-
-	ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_SIZE);
-	if (ordered) {
-		unlock_extent_cached(&inode->io_tree, page_start, page_end,
-				     &cached_state);
-		unlock_page(page);
-		btrfs_start_ordered_extent(ordered, 1);
-		btrfs_put_ordered_extent(ordered);
-		goto again;
-	}
-
-	ret = btrfs_set_extent_delalloc(inode, page_start, page_end, 0,
-					&cached_state);
-	if (ret)
-		goto out_reserved;
-
-	/*
-	 * Everything went as planned, we're now the owner of a dirty page with
-	 * delayed allocation bits set and space reserved for our COW
-	 * destination.
-	 *
-	 * The page was dirty when we started, nothing should have cleaned it.
-	 */
-	BUG_ON(!PageDirty(page));
-	free_delalloc_space = false;
-out_reserved:
-	btrfs_delalloc_release_extents(inode, PAGE_SIZE);
-	if (free_delalloc_space)
-		btrfs_delalloc_release_space(inode, data_reserved, page_start,
-					     PAGE_SIZE, true);
-	unlock_extent_cached(&inode->io_tree, page_start, page_end,
-			     &cached_state);
-out_page:
-	if (ret) {
-		/*
-		 * We hit ENOSPC or other errors.  Update the mapping and page
-		 * to reflect the errors and clean the page.
-		 */
-		mapping_set_error(page->mapping, ret);
-		end_extent_writepage(page, ret, page_start, page_end);
-		clear_page_dirty_for_io(page);
-		SetPageError(page);
-	}
-	btrfs_page_clear_checked(inode->root->fs_info, page, page_start, PAGE_SIZE);
-	unlock_page(page);
-	put_page(page);
-	kfree(fixup);
-	extent_changeset_free(data_reserved);
-	/*
-	 * As a precaution, do a delayed iput in case it would be the last iput
-	 * that could need flushing space. Recursing back to fixup worker would
-	 * deadlock.
-	 */
-	btrfs_add_delayed_iput(&inode->vfs_inode);
-}
-
-/*
- * There are a few paths in the higher layers of the kernel that directly
- * set the page dirty bit without asking the filesystem if it is a
- * good idea.  This causes problems because we want to make sure COW
- * properly happens and the data=ordered rules are followed.
- *
- * In our case any range that doesn't have the ORDERED bit set
- * hasn't been properly setup for IO.  We kick off an async process
- * to fix it up.  The async helper will wait for ordered extents, set
- * the delalloc bit and make it safe to write the page.
- */
-int btrfs_writepage_cow_fixup(struct page *page)
-{
-	struct inode *inode = page->mapping->host;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct btrfs_writepage_fixup *fixup;
-
-	/* This page has ordered extent covering it already */
-	if (PageOrdered(page))
-		return 0;
-
-	/*
-	 * PageChecked is set below when we create a fixup worker for this page,
-	 * don't try to create another one if we're already PageChecked()
-	 *
-	 * The extent_io writepage code will redirty the page if we send back
-	 * EAGAIN.
-	 */
-	if (PageChecked(page))
-		return -EAGAIN;
-
-	fixup = kzalloc(sizeof(*fixup), GFP_NOFS);
-	if (!fixup)
-		return -EAGAIN;
-
-	/*
-	 * We are already holding a reference to this inode from
-	 * write_cache_pages.  We need to hold it because the space reservation
-	 * takes place outside of the page lock, and we can't trust
-	 * page->mapping outside of the page lock.
-	 */
-	ihold(inode);
-	btrfs_page_set_checked(fs_info, page, page_offset(page), PAGE_SIZE);
-	get_page(page);
-	btrfs_init_work(&fixup->work, btrfs_writepage_fixup_worker, NULL, NULL);
-	fixup->page = page;
-	fixup->inode = inode;
-	btrfs_queue_work(fs_info->fixup_workers, &fixup->work);
-
-	return -EAGAIN;
-}
-
 static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 				       struct btrfs_inode *inode, u64 file_pos,
 				       struct btrfs_file_extent_item *stack_fi,