
[RFC] mm: retry writepages() on ENOMEM when doing a data integrity writeback

Message ID 20170315050743.5539-1-tytso@mit.edu (mailing list archive)
State New, archived

Commit Message

Theodore Ts'o March 15, 2017, 5:07 a.m. UTC
Currently, a file system's writepages() function must not fail with
ENOMEM, since if it does, it's possible for buffered data to be lost.
This is because on a data integrity writeback writepages() gets called
but once, and if it returns ENOMEM and you're lucky the error will get
reflected back to the userspace process calling fsync() --- at which
point the application may or may not be properly checking error codes.
If you aren't lucky, the user is unmounting the file system, and the
dirty pages will simply be lost.

For this reason, file system code generally will use GFP_NOFS, and in
some cases, will retry the allocation in a loop, on the theory that
"kernel livelocks are temporary; data loss is forever".
Unfortunately, this can indeed cause livelocks, since inside the
writepages() call, the file system is holding various mutexes, and
these mutexes may prevent the OOM killer from killing its targeted
victim if it is also holding on to those mutexes.

A better solution would be to allow writepages() to call the memory
allocator with flags that give greater latitude to the allocator to
fail, and then release its locks and return ENOMEM.  In the case of
background writeback, the writes can be retried at a later time.  In
the case of data-integrity writeback, do_writepages() retries after
waiting a brief amount of time.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---

As we had discussed in an e-mail thread last week, I'm interested in
allowing ext4_writepages() to return ENOMEM without causing dirty
pages from buffered writes getting lost.  It looks like doing so
should be fairly straightforward.   What do folks think?

 mm/page-writeback.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)
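
To make the intended semantics concrete, here is a minimal user-space
sketch of the retry loop. It is not the kernel code: fake_wbc,
fake_mapping, fake_writepages() and model_do_writepages() are
hypothetical stand-ins, and the cond_resched()/congestion_wait() pause
is omitted. The comparison uses -ENOMEM, following the kernel
convention that functions return negative errno values.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for WB_SYNC_NONE / WB_SYNC_ALL. */
enum sync_mode { SYNC_NONE, SYNC_ALL };

struct fake_wbc {
	enum sync_mode sync_mode;
};

struct fake_mapping {
	int enomem_left;	/* number of upcoming calls that will fail */
	int calls;		/* total calls made to fake_writepages() */
};

/* Fails with -ENOMEM a fixed number of times, then succeeds. */
static int fake_writepages(struct fake_mapping *m, struct fake_wbc *wbc)
{
	(void)wbc;
	m->calls++;
	if (m->enomem_left > 0) {
		m->enomem_left--;
		return -ENOMEM;
	}
	return 0;
}

/* Models the proposed loop: only integrity writeback retries on -ENOMEM. */
static int model_do_writepages(struct fake_mapping *m, struct fake_wbc *wbc)
{
	int ret;

	assert(m && wbc);
	while (1) {
		ret = fake_writepages(m, wbc);
		if (ret != -ENOMEM || wbc->sync_mode != SYNC_ALL)
			break;
		/* The patch pauses here before retrying; the model just loops. */
	}
	return ret;
}
```

In this model a background (SYNC_NONE) caller sees -ENOMEM after a
single call, while an integrity (SYNC_ALL) caller keeps calling until
the fake writepages() succeeds.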

Comments

Jan Kara March 15, 2017, 11:59 a.m. UTC | #1
On Wed 15-03-17 01:07:43, Ted Tso wrote:
> Currently, file system's writepages() function must not fail with an
> ENOMEM, since if they do, it's possible for buffered data to be lost.
> This is because on a data integrity writeback writepages() gets called
> but once, and if it returns ENOMEM and you're lucky the error will get
> reflected back to the userspace process calling fsync() --- at which
> point the application may or may not be properly checking error codes.
> If you aren't lucky, the user is unmounting the file system, and the
> dirty pages will simply be lost.
> 
> For this reason, file system code generally will use GFP_NOFS, and in
> some cases, will retry the allocation in a loop, on the theory that
> "kernel livelocks are temporary; data loss is forever".
> Unfortunately, this can indeed cause livelocks, since inside the
> writepages() call, the file system is holding various mutexes, and
> these mutexes may prevent the OOM killer from killing its targeted
> victim if it is also holding on to those mutexes.
> 
> A better solution would be to allow writepages() to call the memory
> allocator with flags that give greater latitude to the allocator to
> fail, and then release its locks and return ENOMEM, and in the case of
> background writeback, the writes can be retried at a later time.  In
> the case of data-integrity writeback retry after waiting a brief
> amount of time.
> 
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> ---
> 
> As we had discussed in an e-mail thread last week, I'm interested in
> allowing ext4_writepages() to return ENOMEM without causing dirty
> pages from buffered writes getting lost.  It looks like doing so
> should be fairly straightforward.   What do folks think?

Makes sense to me. One comment below:


> +	while (1) {
> +		if (mapping->a_ops->writepages)
> +			ret = mapping->a_ops->writepages(mapping, wbc);
> +		else
> +			ret = generic_writepages(mapping, wbc);
> +		if ((ret != ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL))

-ENOMEM I guess...


> +			break;
> +		cond_resched();
> +		congestion_wait(BLK_RW_ASYNC, HZ/50);
> +	}
>  	return ret;
>  }

								Honza
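
The sign matters because kernel functions report errors as negative
errno values. A small user-space sketch of the two exit conditions
(the helper names stops_as_posted(), stops_with_fix() and
check_enomem_case() are made up for illustration) shows that the
posted comparison never retries:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Loop-exit condition as posted: compares against positive ENOMEM. */
static bool stops_as_posted(int ret, bool sync_all)
{
	return (ret != ENOMEM) || !sync_all;
}

/* Loop-exit condition with the sign corrected to -ENOMEM. */
static bool stops_with_fix(int ret, bool sync_all)
{
	return (ret != -ENOMEM) || !sync_all;
}

/* An integrity writeback whose writepages() returned -ENOMEM. */
static void check_enomem_case(void)
{
	/* As posted: -ENOMEM != ENOMEM, so the loop exits without a retry. */
	assert(stops_as_posted(-ENOMEM, true));
	/* With the fix: the loop does not exit, so the write is retried. */
	assert(!stops_with_fix(-ENOMEM, true));
}
```

With the positive constant, the new loop behaves exactly like the old
single call whenever writepages() returns -ENOMEM.
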
Michal Hocko March 15, 2017, 1:03 p.m. UTC | #2
On Wed 15-03-17 01:07:43, Theodore Ts'o wrote:
> Currently, file system's writepages() function must not fail with an
> ENOMEM, since if they do, it's possible for buffered data to be lost.
> This is because on a data integrity writeback writepages() gets called
> but once, and if it returns ENOMEM and you're lucky the error will get
> reflected back to the userspace process calling fsync() --- at which
> point the application may or may not be properly checking error codes.
> If you aren't lucky, the user is unmounting the file system, and the
> dirty pages will simply be lost.
> 
> For this reason, file system code generally will use GFP_NOFS, and in
> some cases, will retry the allocation in a loop, on the theory that
> "kernel livelocks are temporary; data loss is forever".
> Unfortunately, this can indeed cause livelocks, since inside the
> writepages() call, the file system is holding various mutexes, and
> these mutexes may prevent the OOM killer from killing its targeted
> victim if it is also holding on to those mutexes.

The victim might be looping inside do_writepages now instead (especially
when the memory reserves are depleted), though. On the other hand the
recent OOM killer changes do not rely on the oom victim exiting anymore.
We try to reap as much memory from its address space as possible
which alone should help us to move on. Even if that is not sufficient we
will move on to another victim. So unless everything is in this path and
all the memory is sitting unreachable from the reapable address space we
should be safe.

> A better solution would be to allow writepages() to call the memory
> allocator with flags that give greater latitude to the allocator to
> fail, and then release its locks and return ENOMEM, and in the case of
> background writeback, the writes can be retried at a later time.  In
> the case of data-integrity writeback retry after waiting a brief
> amount of time.

Yes, that sounds reasonable to me. By the way, I recently proposed
__GFP_RETRY_MAYFAIL [1], which sounds like a good fit here.

[1] http://lkml.kernel.org/r/20170307154843.32516-1-mhocko@kernel.org
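
As a rough user-space illustration of the "allow the allocation to
fail, drop the locks, return ENOMEM" pattern: the sketch below does not
use __GFP_RETRY_MAYFAIL or any kernel API. fake_fs, alloc_may_fail()
and model_writepages() are hypothetical, and the lock is a plain flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct fake_fs {
	bool locked;		/* stands in for the file system's mutexes */
	int alloc_failures;	/* injected: how many allocations will fail */
};

/* Hypothetical allocator that is allowed to fail instead of looping. */
static void *alloc_may_fail(struct fake_fs *fs, size_t size)
{
	if (fs->alloc_failures > 0) {
		fs->alloc_failures--;
		return NULL;
	}
	return malloc(size);
}

/* Takes the "lock", tries to allocate, and never fails while holding it. */
static int model_writepages(struct fake_fs *fs)
{
	void *buf;

	assert(fs && !fs->locked);
	fs->locked = true;
	buf = alloc_may_fail(fs, 4096);
	if (!buf) {
		fs->locked = false;	/* release before reporting failure */
		return -ENOMEM;
	}
	/* Real code would build and submit I/O here. */
	free(buf);
	fs->locked = false;
	return 0;
}
```

The point of the pattern is that the caller gets -ENOMEM with no locks
held, so a later retry (or the OOM killer) is not blocked by this path.
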

> Signed-off-by: Theodore Ts'o <tytso@mit.edu>

The patch looks good to me, but I am not familiar enough with all the
callers to be fully qualified to give my Acked-by.

> ---
> 
> As we had discussed in an e-mail thread last week, I'm interested in
> allowing ext4_writepages() to return ENOMEM without causing dirty
> pages from buffered writes getting lost.  It looks like doing so
> should be fairly straightforward.   What do folks think?
> 
>  mm/page-writeback.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 290e8b7d3181..8666d3f3c57a 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2352,10 +2352,16 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
>  
>  	if (wbc->nr_to_write <= 0)
>  		return 0;
> -	if (mapping->a_ops->writepages)
> -		ret = mapping->a_ops->writepages(mapping, wbc);
> -	else
> -		ret = generic_writepages(mapping, wbc);
> +	while (1) {
> +		if (mapping->a_ops->writepages)
> +			ret = mapping->a_ops->writepages(mapping, wbc);
> +		else
> +			ret = generic_writepages(mapping, wbc);
> +		if ((ret != ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL))
> +			break;
> +		cond_resched();
> +		congestion_wait(BLK_RW_ASYNC, HZ/50);
> +	}
>  	return ret;
>  }
>  
> -- 
> 2.11.0.rc0.7.gbe5a750
Tetsuo Handa March 16, 2017, 10:18 a.m. UTC | #3
On 2017/03/15 22:03, Michal Hocko wrote:
> On Wed 15-03-17 01:07:43, Theodore Ts'o wrote:
>> Unfortunately, this can indeed cause livelocks, since inside the
>> writepages() call, the file system is holding various mutexes, and
>> these mutexes may prevent the OOM killer from killing its targeted
>> victim if it is also holding on to those mutexes.
> 
> The victim might be looping inside do_writepages now instead (especially
> when the memory reserves are depleted), though. On the other hand the
> recent OOM killer changes do not rely on the oom victim exiting anymore.

True only if CONFIG_MMU=y.

> We try to reap as much memory from its address space as possible
> which alone should help us to move on. Even if that is not sufficient we
> will move on to another victim. So unless everything is in this path and
> all the memory is sitting unreachable from the reapable address space we
> should be safe.

If the caller is doing a sync() or umount() syscall, isn't it reasonable
to bail out if fatal_signal_pending() is true, because it is the caller's
responsibility to check whether sync() or umount() succeeded? Though,
I don't know whether writepages() can preserve data for later retry by
other callers.
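
A user-space sketch of what such a bail-out could look like in the
retry loop. Everything here is an assumption made for illustration:
fake_state, fake_writepages() and model_sync_writeback() are
hypothetical, the fatal_pending flag merely stands in for
fatal_signal_pending(current), and returning -EINTR is one possible
choice that the thread does not specify. The sketch says nothing about
whether the dirty data survives for a later retry.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct fake_state {
	int calls;		/* calls made to fake_writepages() */
	int enomem_left;	/* upcoming calls that fail with -ENOMEM */
	int kill_at;		/* call number at which a fatal signal arrives */
	bool fatal_pending;	/* models fatal_signal_pending(current) */
};

static int fake_writepages(struct fake_state *s)
{
	s->calls++;
	if (s->calls >= s->kill_at)
		s->fatal_pending = true;
	if (s->enomem_left > 0) {
		s->enomem_left--;
		return -ENOMEM;
	}
	return 0;
}

/* Retry on -ENOMEM, but give up once the caller has been killed. */
static int model_sync_writeback(struct fake_state *s)
{
	int ret;

	assert(s);
	for (;;) {
		ret = fake_writepages(s);
		if (ret != -ENOMEM)
			return ret;
		if (s->fatal_pending)
			return -EINTR;	/* assumed error code */
	}
}
```

In the model, a caller that is killed while memory stays tight leaves
the loop instead of spinning, and a caller that is not killed still
retries until writepages() succeeds.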

Patch

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 290e8b7d3181..8666d3f3c57a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2352,10 +2352,16 @@  int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 
 	if (wbc->nr_to_write <= 0)
 		return 0;
-	if (mapping->a_ops->writepages)
-		ret = mapping->a_ops->writepages(mapping, wbc);
-	else
-		ret = generic_writepages(mapping, wbc);
+	while (1) {
+		if (mapping->a_ops->writepages)
+			ret = mapping->a_ops->writepages(mapping, wbc);
+		else
+			ret = generic_writepages(mapping, wbc);
+		if ((ret != ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL))
+			break;
+		cond_resched();
+		congestion_wait(BLK_RW_ASYNC, HZ/50);
+	}
 	return ret;
 }