mm,shmem: Fix a typo in shmem_swapin_page()

Message ID	20210723080000.93953-1-ying.huang@intel.com (mailing list archive)
State	New
Headers	show Return-Path: <SRS0=htka=MP=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org E0E9E60EC0 From: Huang Ying <ying.huang@intel.com> To: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying <ying.huang@intel.com>, Miaohe Lin <linmiaohe@huawei.com>, Hugh Dickins <hughd@google.com>, Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@suse.com>, Joonsoo Kim <iamjoonsoo.kim@lge.com>, Matthew Wilcox <willy@infradead.org>, Minchan Kim <minchan@kernel.org> Subject: [PATCH] mm,shmem: Fix a typo in shmem_swapin_page() Date: Fri, 23 Jul 2021 16:00:00 +0800 Message-Id: <20210723080000.93953-1-ying.huang@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm,shmem: Fix a typo in shmem_swapin_page() \| expand mm,shmem: Fix a typo in shmem_swapin_page()

Huang, Ying July 23, 2021, 8 a.m. UTC

"-" is missing before "EINVAL".

Fixes: 2efa33fc7f6e ("mm/shmem: fix shmem_swapin() race with swapoff")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

David Hildenbrand July 23, 2021, 11:10 a.m. UTC | #1

On 23.07.21 10:00, Huang Ying wrote:
> "-" is missing before "EINVAL".
> 

Ehm, that's not a typo, that's a real BUG.

What are the symptoms?

Hugh Dickins July 23, 2021, 8:23 p.m. UTC | #2

On Fri, 23 Jul 2021, Huang Ying wrote:

> "-" is missing before "EINVAL".
> 
> Fixes: 2efa33fc7f6e ("mm/shmem: fix shmem_swapin() race with swapoff")
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Miaohe Lin <linmiaohe@huawei.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Minchan Kim <minchan@kernel.org>
> ---
>  mm/shmem.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 9af4b2173fe9..e201a3ba12fa 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1708,7 +1708,7 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
>  	/* Prevent swapoff from happening to us. */
>  	si = get_swap_device(swap);
>  	if (!si) {
> -		error = EINVAL;
> +		error = -EINVAL;
>  		goto failed;
>  	}
>  	/* Look it up and read it in.. */
> -- 
> 2.30.2

Thanks for catching that; and as David says, it's worse than a typo.

But this is not the right fix:
2efa33fc7f6e ("mm/shmem: fix shmem_swapin() race with swapoff")
needs to be reverted.

It's been on my pile to look at for weeks: now I look at it and see
it's just a bad patch.  Over-enthusiastic stablehands already rushed
it out, I was wary, and reverts are already in -rc for 5.13 and 5.10,
phew, but 5.12.19 EOL is stuck with it unfortunately, oh well.

I was wary because, if the (never observed) race to be fixed is in
swap_cluster_readahead(), why was shmem_swapin_page() being patched?
Not explained in its commit message, probably a misunderstanding of
how mm/shmem.c already manages races (and prefers not to be involved
in swap_info_struct stuff).

But why do I now say it's bad?  Because even if you correct the EINVAL
to -EINVAL, that's an unexpected error: -EEXIST is common, -ENOMEM is
not surprising, -ENOSPC can need consideration, but -EIO and anything
else just end up as SIGBUS when faulting (or as error from syscall).
So, 2efa33fc7f6e converts a race with swapoff to SIGBUS: not good,
and I think much more likely than the race to be fixed (since
swapoff's percpu_ref_kill() rightly comes before synchronize_rcu()).

2efa33fc7f6e was intending to fix a race introduced by two-year-old
8fd2e0b505d1 ("mm: swap: check if swap backing device is congested
or not"), which added a call to inode_read_congested().  Certainly
relying on si->swap_file->f_mapping->host there was new territory:
whether actually racy I'm not sure offhand - I've forgotten whether
synchronize_rcu() waits for preempted tasks or not.

But if it is racy, then I wonder if the right fix might be to revert
8fd2e0b505d1 too. Convincing numbers were offered for it, but I'm
puzzled: because Matthew has in the past noted that the block layer
broke and further broke bdi congestion tracking (I don't know the
relevant release numbers), so I don't understand how checking
inode_read_congested() is actually useful there nowadays.

No need to hurry to a conclusion on 8fd2e0b505d1;
but 2efa33fc7f6e should definitely be reverted.

Thanks,
Hugh

Matthew Wilcox July 23, 2021, 9:53 p.m. UTC | #3

On Fri, Jul 23, 2021 at 01:23:07PM -0700, Hugh Dickins wrote:
> I was wary because, if the (never observed) race to be fixed is in
> swap_cluster_readahead(), why was shmem_swapin_page() being patched?
> Not explained in its commit message, probably a misunderstanding of
> how mm/shmem.c already manages races (and prefers not to be involved
> in swap_info_struct stuff).
> 
> But why do I now say it's bad?  Because even if you correct the EINVAL
> to -EINVAL, that's an unexpected error: -EEXIST is common, -ENOMEM is
> not surprising, -ENOSPC can need consideration, but -EIO and anything
> else just end up as SIGBUS when faulting (or as error from syscall).
> So, 2efa33fc7f6e converts a race with swapoff to SIGBUS: not good,
> and I think much more likely than the race to be fixed (since
> swapoff's percpu_ref_kill() rightly comes before synchronize_rcu()).

Yes, I think a lot more thought was needed here.  And I would have
preferred to start with a reproducer instead of "hey, this could
happen".  Maybe something like booting a 1GB VM, adding two 2GB swap
partitions, swapon(partition A); run a 2GB memhog and then

loop:
	swapon(part B);
	swapoff(part A);
	swapon(part A);
	swapoff(part B);

to make this happen.

but if it does happen, why would returning EINVAL be the right thing
to do?  We've swapped it out.  It must be on swap somewhere, or we've
really messed up.  So I could see there being a race where we get
preempted between looking up the swap entry and calling get_swap_device().
But if that does happen, then the page gets brought in, and potentially
reswapped to the other swap device.

So returning -EEXIST here would actually work.  That forces a re-lookup
in the page cache, so we'll get the new swap entry that tells us which
swap device the page is now on.

But I REALLY REALLY REALLY want a reproducer.  Right now, I have a hard
time believing this, or any of the other races can really happen.

> 2efa33fc7f6e was intending to fix a race introduced by two-year-old
> 8fd2e0b505d1 ("mm: swap: check if swap backing device is congested
> or not"), which added a call to inode_read_congested().  Certainly
> relying on si->swap_file->f_mapping->host there was new territory:
> whether actually racy I'm not sure offhand - I've forgotten whether
> synchronize_rcu() waits for preempted tasks or not.
> 
> But if it is racy, then I wonder if the right fix might be to revert
> 8fd2e0b505d1 too. Convincing numbers were offered for it, but I'm
> puzzled: because Matthew has in the past noted that the block layer
> broke and further broke bdi congestion tracking (I don't know the
> relevant release numbers), so I don't understand how checking
> inode_read_congested() is actually useful there nowadays.

It might be useful for NFS?  I don't think congestion is broken there
(except how does the NFS client have any idea whether the server is
congested or not?)

huang ying July 28, 2021, 1:03 p.m. UTC | #4

Hi, Hugh,

Thanks for your comments.

On Sat, Jul 24, 2021 at 4:23 AM Hugh Dickins <hughd@google.com> wrote:
>
> On Fri, 23 Jul 2021, Huang Ying wrote:
>
> > "-" is missing before "EINVAL".
> >
> > Fixes: 2efa33fc7f6e ("mm/shmem: fix shmem_swapin() race with swapoff")
> > Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> > Cc: Miaohe Lin <linmiaohe@huawei.com>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Cc: Minchan Kim <minchan@kernel.org>
> > ---
> >  mm/shmem.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 9af4b2173fe9..e201a3ba12fa 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1708,7 +1708,7 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
> >       /* Prevent swapoff from happening to us. */
> >       si = get_swap_device(swap);
> >       if (!si) {
> > -             error = EINVAL;
> > +             error = -EINVAL;
> >               goto failed;
> >       }
> >       /* Look it up and read it in.. */
> > --
> > 2.30.2
>
> Thanks for catching that; and as David says, it's worse than a typo.
>
> But this is not the right fix:
> 2efa33fc7f6e ("mm/shmem: fix shmem_swapin() race with swapoff")
> needs to be reverted.
>
> It's been on my pile to look at for weeks: now I look at it and see
> it's just a bad patch.  Over-enthusiastic stablehands already rushed
> it out, I was wary, and reverts are already in -rc for 5.13 and 5.10,
> phew, but 5.12.19 EOL is stuck with it unfortunately, oh well.
>
> I was wary because, if the (never observed) race to be fixed is in
> swap_cluster_readahead(), why was shmem_swapin_page() being patched?

When we get a swap entry from the page table or shmem xarray, and no
necessary lock is held to prevent the swap device to be swapoff (e.g.
page table lock, page lock, etc.), it's possible that the swap device
has been swapoff when we operate on the swap entry (e.g. swapin).  So
we need to find a way to prevent the swap device to be swapoff,
get_swap_device() based on percpu_ref is used for that.  To avoid to
call get_swap_device() here and there (e.g. now it is called in many
different places), I think it's better to call get_swap_device() when
we just get a swap entry without holding the necessary lock, that is,
in do_swap_page() and shmem_swapin_page(), etc.  So that we can delete
the get_swap_device() call in lookup_swap_cache(),
__read_swap_cache_async(), etc.  This will make it easier to
understand when to use get_swap_device() and clean up the code.  Do
you agree?

> Not explained in its commit message, probably a misunderstanding of
> how mm/shmem.c already manages races (and prefers not to be involved
> in swap_info_struct stuff).

Yes.  The commit message isn't clean enough about why we do that.

> But why do I now say it's bad?  Because even if you correct the EINVAL
> to -EINVAL, that's an unexpected error: -EEXIST is common, -ENOMEM is
> not surprising, -ENOSPC can need consideration, but -EIO and anything
> else just end up as SIGBUS when faulting (or as error from syscall).

Yes.  -EINVAL isn't a good choice.  If it's the swapoff race, then
retrying can fix the race, so -EAGAIN may be a choice.  But if the
swap entry is really invalid (almost impossible in theory), we may
need something else, for example, WARN_ON_ONCE() and SIGBUS?  This
reminds me that we may need to distinguish the two possibilities in
get_swap_device()?

Best Regards,
Huang, Ying

Huang, Ying Aug. 3, 2021, 8:06 a.m. UTC | #5

Hi, Andrew,

huang ying <huang.ying.caritas@gmail.com> writes:
>>
>> Thanks for catching that; and as David says, it's worse than a typo.
>>
>> But this is not the right fix:
>> 2efa33fc7f6e ("mm/shmem: fix shmem_swapin() race with swapoff")
>> needs to be reverted.
>>
>> It's been on my pile to look at for weeks: now I look at it and see
>> it's just a bad patch.  Over-enthusiastic stablehands already rushed
>> it out, I was wary, and reverts are already in -rc for 5.13 and 5.10,
>> phew, but 5.12.19 EOL is stuck with it unfortunately, oh well.
>>
>> I was wary because, if the (never observed) race to be fixed is in
>> swap_cluster_readahead(), why was shmem_swapin_page() being patched?
>
> When we get a swap entry from the page table or shmem xarray, and no
> necessary lock is held to prevent the swap device to be swapoff (e.g.
> page table lock, page lock, etc.), it's possible that the swap device
> has been swapoff when we operate on the swap entry (e.g. swapin).  So
> we need to find a way to prevent the swap device to be swapoff,
> get_swap_device() based on percpu_ref is used for that.  To avoid to
> call get_swap_device() here and there (e.g. now it is called in many
> different places), I think it's better to call get_swap_device() when
> we just get a swap entry without holding the necessary lock, that is,
> in do_swap_page() and shmem_swapin_page(), etc.  So that we can delete
> the get_swap_device() call in lookup_swap_cache(),
> __read_swap_cache_async(), etc.  This will make it easier to
> understand when to use get_swap_device() and clean up the code.  Do
> you agree?
>
>> Not explained in its commit message, probably a misunderstanding of
>> how mm/shmem.c already manages races (and prefers not to be involved
>> in swap_info_struct stuff).
>
> Yes.  The commit message isn't clean enough about why we do that.
>
>> But why do I now say it's bad?  Because even if you correct the EINVAL
>> to -EINVAL, that's an unexpected error: -EEXIST is common, -ENOMEM is
>> not surprising, -ENOSPC can need consideration, but -EIO and anything
>> else just end up as SIGBUS when faulting (or as error from syscall).
>
> Yes.  -EINVAL isn't a good choice.  If it's the swapoff race, then
> retrying can fix the race, so -EAGAIN may be a choice.  But if the
> swap entry is really invalid (almost impossible in theory), we may
> need something else, for example, WARN_ON_ONCE() and SIGBUS?  This
> reminds me that we may need to distinguish the two possibilities in
> get_swap_device()?

As Hugh pointed out, EINVAL isn't an appropriate error code for race
condition.  After checking the code, I found that EEXIST is the error
code used for race condition.  So I revise the patch as below.  If Hugh
doesn't object, can you help to replace the patch with the below one?

Best Regards,
Huang, Ying

-----------------------------8<---------------------------------------
From e2b281a0b09d34d6463942e214e577ed9357c213 Mon Sep 17 00:00:00 2001
From: Huang Ying <ying.huang@intel.com>
Date: Tue, 3 Aug 2021 10:51:16 +0800
Subject: [PATCH] shmem_swapin_page(): fix error processing for
 get_swap_device()

Firstly, "-" is missing before the error code.  Secondly, EINVAL isn't
the proper error code for the race condition.  EEXIST is used in
shmem_swapin_page() for that.  So the error code is changed to EEXIST
too.

Link: https://lkml.kernel.org/r/20210723080000.93953-1-ying.huang@intel.com
Fixes: 2efa33fc7f6e ("mm/shmem: fix shmem_swapin() race with swapoff")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
---
 mm/shmem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index dcc07d14162e..ba925baa4404 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1711,8 +1711,8 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
 	/* Prevent swapoff from happening to us. */
 	si = get_swap_device(swap);
 	if (!si) {
-		error = EINVAL;
-		goto failed;
+		error = -EEXIST;
+		goto unlock;
 	}
 	/* Look it up and read it in.. */
 	page = lookup_swap_cache(swap, NULL, 0);

Huang, Ying Aug. 3, 2021, 8:14 a.m. UTC | #6

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Jul 23, 2021 at 01:23:07PM -0700, Hugh Dickins wrote:
>> I was wary because, if the (never observed) race to be fixed is in
>> swap_cluster_readahead(), why was shmem_swapin_page() being patched?
>> Not explained in its commit message, probably a misunderstanding of
>> how mm/shmem.c already manages races (and prefers not to be involved
>> in swap_info_struct stuff).
>> 
>> But why do I now say it's bad?  Because even if you correct the EINVAL
>> to -EINVAL, that's an unexpected error: -EEXIST is common, -ENOMEM is
>> not surprising, -ENOSPC can need consideration, but -EIO and anything
>> else just end up as SIGBUS when faulting (or as error from syscall).
>> So, 2efa33fc7f6e converts a race with swapoff to SIGBUS: not good,
>> and I think much more likely than the race to be fixed (since
>> swapoff's percpu_ref_kill() rightly comes before synchronize_rcu()).
>
> Yes, I think a lot more thought was needed here.  And I would have
> preferred to start with a reproducer instead of "hey, this could
> happen".  Maybe something like booting a 1GB VM, adding two 2GB swap
> partitions, swapon(partition A); run a 2GB memhog and then
>
> loop:
> 	swapon(part B);
> 	swapoff(part A);
> 	swapon(part A);
> 	swapoff(part B);
>
> to make this happen.
>
> but if it does happen, why would returning EINVAL be the right thing
> to do?  We've swapped it out.  It must be on swap somewhere, or we've
> really messed up.  So I could see there being a race where we get
> preempted between looking up the swap entry and calling get_swap_device().
> But if that does happen, then the page gets brought in, and potentially
> reswapped to the other swap device.
>
> So returning -EEXIST here would actually work.  That forces a re-lookup
> in the page cache, so we'll get the new swap entry that tells us which
> swap device the page is now on.

Yes.  -EEXIST is the right error code.  We use that in
shmem_swapin_page() to deal with race condition.

> But I REALLY REALLY REALLY want a reproducer.  Right now, I have a hard
> time believing this, or any of the other races can really happen.

I think the race is only theoretical too.  Firstly, swapoff is a rare
operations in practice; secondly, the race window is really small.

Best Regards,
Huang, Ying

>> 2efa33fc7f6e was intending to fix a race introduced by two-year-old
>> 8fd2e0b505d1 ("mm: swap: check if swap backing device is congested
>> or not"), which added a call to inode_read_congested().  Certainly
>> relying on si->swap_file->f_mapping->host there was new territory:
>> whether actually racy I'm not sure offhand - I've forgotten whether
>> synchronize_rcu() waits for preempted tasks or not.
>> 
>> But if it is racy, then I wonder if the right fix might be to revert
>> 8fd2e0b505d1 too. Convincing numbers were offered for it, but I'm
>> puzzled: because Matthew has in the past noted that the block layer
>> broke and further broke bdi congestion tracking (I don't know the
>> relevant release numbers), so I don't understand how checking
>> inode_read_congested() is actually useful there nowadays.
>
> It might be useful for NFS?  I don't think congestion is broken there
> (except how does the NFS client have any idea whether the server is
> congested or not?)

Matthew Wilcox Aug. 3, 2021, 11:59 a.m. UTC | #7

On Tue, Aug 03, 2021 at 04:06:47PM +0800, Huang, Ying wrote:
> As Hugh pointed out, EINVAL isn't an appropriate error code for race
> condition.  After checking the code, I found that EEXIST is the error
> code used for race condition.  So I revise the patch as below.  If Hugh
> doesn't object, can you help to replace the patch with the below one?
> 
> Best Regards,
> Huang, Ying
> 
> -----------------------------8<---------------------------------------
> >From e2b281a0b09d34d6463942e214e577ed9357c213 Mon Sep 17 00:00:00 2001
> From: Huang Ying <ying.huang@intel.com>
> Date: Tue, 3 Aug 2021 10:51:16 +0800
> Subject: [PATCH] shmem_swapin_page(): fix error processing for
>  get_swap_device()
> 
> Firstly, "-" is missing before the error code.  Secondly, EINVAL isn't
> the proper error code for the race condition.  EEXIST is used in
> shmem_swapin_page() for that.  So the error code is changed to EEXIST
> too.
> 
> Link: https://lkml.kernel.org/r/20210723080000.93953-1-ying.huang@intel.com
> Fixes: 2efa33fc7f6e ("mm/shmem: fix shmem_swapin() race with swapoff")
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Also, the description is poor.  How about:

If we hit this rare race, returning EINVAL (or even -EINVAL) would cause
the page fault to be handled as a SIGBUS.  This is not correct; the page
is not missing or unreadable, it has simply changed location.  Returning
-EEXIST here will cause the lookup to be retried by the caller.

Matthew Wilcox Aug. 3, 2021, 12:01 p.m. UTC | #8

On Tue, Aug 03, 2021 at 04:14:38PM +0800, Huang, Ying wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> > But I REALLY REALLY REALLY want a reproducer.  Right now, I have a hard
> > time believing this, or any of the other races can really happen.
> 
> I think the race is only theoretical too.  Firstly, swapoff is a rare
> operations in practice; secondly, the race window is really small.

So do something to provoke it.  Widen the window.  Put an msleep(1000)
between *pagep = NULL and the call to get_swap_device().  That's assuming
that the swapon/swapoff loop that I proposed doesn't work.  Did you
try it?

Hugh Dickins Aug. 4, 2021, 4:28 a.m. UTC | #9

On Tue, 3 Aug 2021, Huang, Ying wrote:
> 
> As Hugh pointed out, EINVAL isn't an appropriate error code for race
> condition.  After checking the code, I found that EEXIST is the error
> code used for race condition.  So I revise the patch as below.  If Hugh
> doesn't object, can you help to replace the patch with the below one?

(I'm sorry that it's so hard to extract responses from me...)

Yes, I'll go along with this version, or Matthew's better commented
version, which Andrew has now taken into his tree.

I won't go so far as to Ack this, because I still want to revert the
original commit; but this will not do actual harm, and I'm too slow
to mess you around further for 5.14.  I'll just have to work through
it and argue it later when/if I have time.

I'll say more on that in answering your earlier mail in this thread.

But should admit right now that I think have somewhat misled us all.
Neither the EINVAL nor the -EINVAL were as dangerous as they looked:
because they were followed immediately by "goto failed", and

failed:
	if (!shmem_confirm_swap(mapping, index, swap))
		error = -EEXIST;

and in the case that get_swap_device() fails, all the swapping off
has been done, so shmem_confirm_swap() will return false, and the
error then be set to -EEXIST anyway.

But let's pretend that I hadn't realized that: what's in Andrew's
tree is better than what was there before.

(And let's pretend that in writing those paragraphs, I did not
realize that get_swap_device() could also fail if entry had got
corrupted - should never happen, of course - and shmem then get
stuck in a repeating -EEXIST loop.  Maybe I'll want to do better
for that case too, but not this time.)

Hugh

Hugh Dickins Aug. 4, 2021, 5:34 a.m. UTC | #10

On Tue, 3 Aug 2021, Matthew Wilcox wrote:
> On Tue, Aug 03, 2021 at 04:14:38PM +0800, Huang, Ying wrote:
> > Matthew Wilcox <willy@infradead.org> writes:
> > > But I REALLY REALLY REALLY want a reproducer.  Right now, I have a hard
> > > time believing this, or any of the other races can really happen.
> > 
> > I think the race is only theoretical too.  Firstly, swapoff is a rare
> > operations in practice; secondly, the race window is really small.
> 
> So do something to provoke it.  Widen the window.  Put an msleep(1000)
> between *pagep = NULL and the call to get_swap_device().  That's assuming
> that the swapon/swapoff loop that I proposed doesn't work.  Did you
> try it?

I've been doing that swapon/swapoff loop for years, while running kernel
builds on tmpfs going out to swap; for better or worse on baremetal not VM.

You're right that few will ever need that level of reliability; but it
has caught problems from time to time, and I do insist on fixing them.

I'm not as insistent as you on wanting a reproducer; and we all take pride
sometimes in fixing ever more inconceivable bugs.  I'm not against that,
but it's easy to end up with a fix more dangerous than what it claims to
fix, rather like with random newbie cleanups.

I've never seen the swapoff race claimed by Miaohe, and don't expect to;
but he's probably right, given the current code.  I just dislike adding
unnecessary complexity, and siting it in the wrong place (mm/shmem.c).

Yang, is it possible that 5.1 commit 8fd2e0b505d1 ("mm: swap: check if
swap backing device is congested or not") was actually developed and
measured on 4.1 or earlier, which still had blk_set_queue_congested()?

I cannot explain its usefulness nowadays, on congested HDD anyway:
Matthew is right that NFS and a few others may still be setting
congested flags, but they're not what that commit was proposed for.

If it is still useful, then I contend (but Huang Ying will disagree)
that the get_swap_device() and put_swap_device() should be around
8fd2e0b505d1's inode_read_congested() block in swap_cluster_readahead(),
not encroaching into mm/shmem.c.

But if that block is not useful, then it should simply be removed (later).

Hugh

Hugh Dickins Aug. 4, 2021, 6:42 a.m. UTC | #11

On Wed, 28 Jul 2021, huang ying wrote:
> On Sat, Jul 24, 2021 at 4:23 AM Hugh Dickins <hughd@google.com> wrote:
> >
> > I was wary because, if the (never observed) race to be fixed is in
> > swap_cluster_readahead(), why was shmem_swapin_page() being patched?
> 
> When we get a swap entry from the page table or shmem xarray, and no
> necessary lock is held to prevent the swap device to be swapoff (e.g.
> page table lock, page lock, etc.), it's possible that the swap device
> has been swapoff when we operate on the swap entry (e.g. swapin).

Yes.  And even without any swapoff, the swap entry may no longer be
right by the time we go to swap it in, or after it has been swapped in.

Both mm/memory.c and mm/shmem.c have done an unlocked lookup to get
the swap entry, and have to do a pte_same() or shmem_confirm_swap()
later, to ensure that what they've got is still right.  That lockless
lookup and raciness is intentional, and has been working for years.

> So we need to find a way to prevent the swap device to be swapoff,
> get_swap_device() based on percpu_ref is used for that.

To prevent swapoff, no, it would be bad if swapoff could be prevented
indefinitely.  But I think you're saying to prevent swapoff from
completing - reaching the point of freeing structures which might
still be being examined.

Yes, though I thought that was already prevented.  But I may well have
missed the inode_read_congested() case which came in two years ago
(affecting shmem swapin only).  And I'll admit now to never(?) having
studied or tested the SWP_SYNCHRONOUS_IO case in do_swap_page() (with
suspiciously no equivalent in shmem swapin): I'd better start.

I do dislike the way SWP_SYNCHRONOUS_IO imported low-level swap business
into do_swap_page(): I'd love to try to move that into swap_state.c, and
in doing so hopefully get to appreciate it better (and the suggested
swapoff race: I presume it comes from skipping swapcache_prepare()).
But I have no time for that at present.

> To avoid to
> call get_swap_device() here and there (e.g. now it is called in many
> different places), I think it's better to call get_swap_device() when
> we just get a swap entry without holding the necessary lock, that is,
> in do_swap_page() and shmem_swapin_page(), etc.  So that we can delete
> the get_swap_device() call in lookup_swap_cache(),
> __read_swap_cache_async(), etc.  This will make it easier to
> understand when to use get_swap_device() and clean up the code.  Do
> you agree?

Offhand I'd say that I do not agree: but I can easily imagine coming to
agree with you, once I have tried to do better and discovered I cannot.

I dislike the way swap internals are being pushed out to the higher
levels.  Particularly since those higher levels already have to deal
with similar races which are not protected by get_swap_device().

I'd forgotten how earlier you found you had to add get_swap_device()
into lookup_swap_cache(), and friends: and can see the attraction of
doing it once instead of here and there.  But it is still there in
lookup_swap_cache() etc, so that's a poor argument for these commits.

> 
> > Not explained in its commit message, probably a misunderstanding of
> > how mm/shmem.c already manages races (and prefers not to be involved
> > in swap_info_struct stuff).
> 
> Yes.  The commit message isn't clean enough about why we do that.

Thanks for clearing that up.  In intervening days I did read about 50
of the ~100 mails on these commits, April and later (yes, I was Cc'ed
throughout, thanks, but that didn't give me the time).  I've not yet
reached any that explain the get_swap_device() placement, perhaps
I'll change my mind when eventually I read those.

> 
> > But why do I now say it's bad?  Because even if you correct the EINVAL
> > to -EINVAL, that's an unexpected error: -EEXIST is common, -ENOMEM is
> > not surprising, -ENOSPC can need consideration, but -EIO and anything
> > else just end up as SIGBUS when faulting (or as error from syscall).
> 
> Yes.  -EINVAL isn't a good choice.  If it's the swapoff race, then
> retrying can fix the race, so -EAGAIN may be a choice.  But if the
> swap entry is really invalid (almost impossible in theory), we may
> need something else, for example, WARN_ON_ONCE() and SIGBUS?  This
> reminds me that we may need to distinguish the two possibilities in
> get_swap_device()?

Ah, I guess in that last sentence you're thinking of what I realized
in writing previous mail, behaviour when given a corrupted swap entry.

Hugh

Huang, Ying Aug. 4, 2021, 8:59 a.m. UTC | #12

Hugh Dickins <hughd@google.com> writes:

> On Wed, 28 Jul 2021, huang ying wrote:
>> On Sat, Jul 24, 2021 at 4:23 AM Hugh Dickins <hughd@google.com> wrote:
>> >
>> > I was wary because, if the (never observed) race to be fixed is in
>> > swap_cluster_readahead(), why was shmem_swapin_page() being patched?
>> 
>> When we get a swap entry from the page table or shmem xarray, and no
>> necessary lock is held to prevent the swap device to be swapoff (e.g.
>> page table lock, page lock, etc.), it's possible that the swap device
>> has been swapoff when we operate on the swap entry (e.g. swapin).
>
> Yes.  And even without any swapoff, the swap entry may no longer be
> right by the time we go to swap it in, or after it has been swapped in.

Yes.

> Both mm/memory.c and mm/shmem.c have done an unlocked lookup to get
> the swap entry, and have to do a pte_same() or shmem_confirm_swap()
> later, to ensure that what they've got is still right.  That lockless
> lookup and raciness is intentional, and has been working for years.

Yes.

>> So we need to find a way to prevent the swap device to be swapoff,
>> get_swap_device() based on percpu_ref is used for that.
>
> To prevent swapoff, no, it would be bad if swapoff could be prevented
> indefinitely.  But I think you're saying to prevent swapoff from
> completing - reaching the point of freeing structures which might
> still be being examined.

Yes.

> Yes, though I thought that was already prevented.  But I may well have
> missed the inode_read_congested() case which came in two years ago
> (affecting shmem swapin only).  And I'll admit now to never(?) having
> studied or tested the SWP_SYNCHRONOUS_IO case in do_swap_page() (with
> suspiciously no equivalent in shmem swapin): I'd better start.
>
> I do dislike the way SWP_SYNCHRONOUS_IO imported low-level swap business
> into do_swap_page(): I'd love to try to move that into swap_state.c, and
> in doing so hopefully get to appreciate it better (and the suggested
> swapoff race: I presume it comes from skipping swapcache_prepare()).

Yes.  I think so too.

> But I have no time for that at present.
>
>> To avoid to
>> call get_swap_device() here and there (e.g. now it is called in many
>> different places), I think it's better to call get_swap_device() when
>> we just get a swap entry without holding the necessary lock, that is,
>> in do_swap_page() and shmem_swapin_page(), etc.  So that we can delete
>> the get_swap_device() call in lookup_swap_cache(),
>> __read_swap_cache_async(), etc.  This will make it easier to
>> understand when to use get_swap_device() and clean up the code.  Do
>> you agree?
>
> Offhand I'd say that I do not agree: but I can easily imagine coming to
> agree with you, once I have tried to do better and discovered I cannot.
>
> I dislike the way swap internals are being pushed out to the higher
> levels.  Particularly since those higher levels already have to deal
> with similar races which are not protected by get_swap_device().
>
> I'd forgotten how earlier you found you had to add get_swap_device()
> into lookup_swap_cache(), and friends: and can see the attraction of
> doing it once instead of here and there.  But it is still there in
> lookup_swap_cache() etc, so that's a poor argument for these commits.

I have a patch in hand to delete get_swap_device() in
lookup_swap_cache() and other places.  In fact, the bug fixed in this
patch is found when developing that patch.

An untested initial version of that cleanup patch is as below, I am
still working on it.  In that patch, 6 get_swap_device() are deleted,
while 4 get_swap_device() are added.  What matters more is that it's
more clear about when to call get_swap_device().  That is, when you get
a swap entry without necessary lock and want to operate on it.

>> > Not explained in its commit message, probably a misunderstanding of
>> > how mm/shmem.c already manages races (and prefers not to be involved
>> > in swap_info_struct stuff).
>> 
>> Yes.  The commit message isn't clean enough about why we do that.
>
> Thanks for clearing that up.  In intervening days I did read about 50
> of the ~100 mails on these commits, April and later (yes, I was Cc'ed
> throughout, thanks, but that didn't give me the time).  I've not yet
> reached any that explain the get_swap_device() placement, perhaps
> I'll change my mind when eventually I read those.

I will write about that in the patch to cleanup get_swap_device() calling.

>> 
>> > But why do I now say it's bad?  Because even if you correct the EINVAL
>> > to -EINVAL, that's an unexpected error: -EEXIST is common, -ENOMEM is
>> > not surprising, -ENOSPC can need consideration, but -EIO and anything
>> > else just end up as SIGBUS when faulting (or as error from syscall).
>> 
>> Yes.  -EINVAL isn't a good choice.  If it's the swapoff race, then
>> retrying can fix the race, so -EAGAIN may be a choice.  But if the
>> swap entry is really invalid (almost impossible in theory), we may
>> need something else, for example, WARN_ON_ONCE() and SIGBUS?  This
>> reminds me that we may need to distinguish the two possibilities in
>> get_swap_device()?
>
> Ah, I guess in that last sentence you're thinking of what I realized
> in writing previous mail, behaviour when given a corrupted swap entry.

Yes.  But because that should be impossible even in theory, I think we
can ignore it for now to avoid to make the code even more complicated?
Now, some corrupted swap entry can be identified and a error message can
be printed in get_swap_device().

Best Regards,
Huang, Ying

----------------------------------8<--------------------------------------
From 8c83cf945436b98222b72903b4bd3063ad41e51a Mon Sep 17 00:00:00 2001
From: Huang Ying <ying.huang@intel.com>
Date: Fri, 23 Jul 2021 15:56:26 +0800
Subject: [PATCH] swap: cleanup get/_put_swap_device()

---
 mm/madvise.c    | 10 ++++++++++
 mm/memory.c     | 12 +++++++++++-
 mm/swap_state.c | 11 -----------
 mm/swapfile.c   | 36 ++++++------------------------------
 mm/zswap.c      |  5 +++++
 5 files changed, 32 insertions(+), 42 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 012129fbfaf8..5e38e888645d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -199,6 +199,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 		swp_entry_t entry;
 		struct page *page;
 		spinlock_t *ptl;
+		struct swap_info_struct *si;
 
 		orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
 		pte = *(orig_pte + ((index - start) / PAGE_SIZE));
@@ -209,11 +210,15 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 		entry = pte_to_swp_entry(pte);
 		if (unlikely(non_swap_entry(entry)))
 			continue;
+		si = get_swap_device(entry);
+		if (!si)
+			continue;
 
 		page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
 							vma, index, false);
 		if (page)
 			put_page(page);
+		put_swap_device(si);
 	}
 
 	return 0;
@@ -234,6 +239,7 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
 	rcu_read_lock();
 	xas_for_each(&xas, page, end_index) {
 		swp_entry_t swap;
+		struct swap_info_struct *si;
 
 		if (!xa_is_value(page))
 			continue;
@@ -241,10 +247,14 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
 		rcu_read_unlock();
 
 		swap = radix_to_swp_entry(page);
+		si = get_swap_device(swap);
+		if (!si)
+			continue;
 		page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
 							NULL, 0, false);
 		if (page)
 			put_page(page);
+		put_swap_device(si);
 
 		rcu_read_lock();
 	}
diff --git a/mm/memory.c b/mm/memory.c
index 39e7a1495c3c..b150d1d577cd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1096,8 +1096,18 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	cond_resched();
 
 	if (ret == -EIO) {
+		int err;
+		struct swap_info_struct *si;
+
 		VM_WARN_ON_ONCE(!entry.val);
-		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
+		si = get_swap_device(entry);
+		if (!si) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		err = add_swap_count_continuation(entry, GFP_KERNEL);
+		put_swap_device(si);
+		if (err < 0) {
 			ret = -ENOMEM;
 			goto out;
 		}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 1a29b4f98208..7a800f6783fc 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -335,14 +335,8 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
 			       unsigned long addr)
 {
 	struct page *page;
-	struct swap_info_struct *si;
 
-	si = get_swap_device(entry);
-	if (!si)
-		return NULL;
 	page = find_get_page(swap_address_space(entry), swp_offset(entry));
-	put_swap_device(si);
-
 	INC_CACHE_INFO(find_total);
 	if (page) {
 		bool vma_ra = swap_use_vma_readahead();
@@ -418,7 +412,6 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr,
 			bool *new_page_allocated)
 {
-	struct swap_info_struct *si;
 	struct page *page;
 	void *shadow = NULL;
 
@@ -431,12 +424,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * called after lookup_swap_cache() failed, re-calling
 		 * that would confuse statistics.
 		 */
-		si = get_swap_device(entry);
-		if (!si)
-			return NULL;
 		page = find_get_page(swap_address_space(entry),
 				     swp_offset(entry));
-		put_swap_device(si);
 		if (page)
 			return page;
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e3dcaeecc50f..d7cd7fe2eaf9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1472,14 +1472,9 @@ int __swap_count(swp_entry_t entry)
 {
 	struct swap_info_struct *si;
 	pgoff_t offset = swp_offset(entry);
-	int count = 0;
 
-	si = get_swap_device(entry);
-	if (si) {
-		count = swap_count(si->swap_map[offset]);
-		put_swap_device(si);
-	}
-	return count;
+	si = swp_swap_info(entry);
+	return swap_count(si->swap_map[offset]);
 }
 
 static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
@@ -1501,15 +1496,10 @@ static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
  */
 int __swp_swapcount(swp_entry_t entry)
 {
-	int count = 0;
 	struct swap_info_struct *si;
 
-	si = get_swap_device(entry);
-	if (si) {
-		count = swap_swapcount(si, entry);
-		put_swap_device(si);
-	}
-	return count;
+	si = swp_swap_info(entry);
+	return swap_swapcount(si, entry);
 }
 
 /*
@@ -3430,10 +3420,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 	unsigned char has_cache;
 	int err;
 
-	p = get_swap_device(entry);
-	if (!p)
-		return -EINVAL;
-
+	p = swp_swap_info(entry);
 	offset = swp_offset(entry);
 	ci = lock_cluster_or_swap_info(p, offset);
 
@@ -3479,8 +3466,6 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 
 unlock_out:
 	unlock_cluster_or_swap_info(p, ci);
-	if (p)
-		put_swap_device(p);
 	return err;
 }
 
@@ -3581,14 +3566,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	 */
 	page = alloc_page(gfp_mask | __GFP_HIGHMEM);
 
-	si = get_swap_device(entry);
-	if (!si) {
-		/*
-		 * An acceptable race has occurred since the failing
-		 * __swap_duplicate(): the swap device may be swapoff
-		 */
-		goto outer;
-	}
+	si = swp_swap_info(entry);
 	spin_lock(&si->lock);
 
 	offset = swp_offset(entry);
@@ -3660,8 +3638,6 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 out:
 	unlock_cluster(ci);
 	spin_unlock(&si->lock);
-	put_swap_device(si);
-outer:
 	if (page)
 		__free_page(page);
 	return ret;
diff --git a/mm/zswap.c b/mm/zswap.c
index 7944e3e57e78..f707a73e35aa 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -902,9 +902,14 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
 				struct page **retpage)
 {
 	bool page_was_allocated;
+	struct swap_info_struct *si;
 
+	si = get_swap_device(entry);
+	if (!si)
+		return ZSWAP_SWAPCACHE_FAIL;
 	*retpage = __read_swap_cache_async(entry, GFP_KERNEL,
 			NULL, 0, &page_was_allocated);
+	put_swap_device(si);
 	if (page_was_allocated)
 		return ZSWAP_SWAPCACHE_NEW;
 	if (!*retpage)

Huang, Ying Aug. 4, 2021, 9:01 a.m. UTC | #13

Matthew Wilcox <willy@infradead.org> writes:

> On Tue, Aug 03, 2021 at 04:14:38PM +0800, Huang, Ying wrote:
>> Matthew Wilcox <willy@infradead.org> writes:
>> > But I REALLY REALLY REALLY want a reproducer.  Right now, I have a hard
>> > time believing this, or any of the other races can really happen.
>> 
>> I think the race is only theoretical too.  Firstly, swapoff is a rare
>> operations in practice; secondly, the race window is really small.
>
> So do something to provoke it.  Widen the window.  Put an msleep(1000)
> between *pagep = NULL and the call to get_swap_device().  That's assuming
> that the swapon/swapoff loop that I proposed doesn't work.  Did you
> try it?

I haven't tried it.  Do you agree that the race is possible in theory?
But if you still really want it, I can try to do that.

Best Regards,
Huang, Ying

Huang, Ying Aug. 4, 2021, 9:04 a.m. UTC | #14

Hugh Dickins <hughd@google.com> writes:

> On Tue, 3 Aug 2021, Matthew Wilcox wrote:
>> On Tue, Aug 03, 2021 at 04:14:38PM +0800, Huang, Ying wrote:
>> > Matthew Wilcox <willy@infradead.org> writes:
>> > > But I REALLY REALLY REALLY want a reproducer.  Right now, I have a hard
>> > > time believing this, or any of the other races can really happen.
>> > 
>> > I think the race is only theoretical too.  Firstly, swapoff is a rare
>> > operations in practice; secondly, the race window is really small.
>> 
>> So do something to provoke it.  Widen the window.  Put an msleep(1000)
>> between *pagep = NULL and the call to get_swap_device().  That's assuming
>> that the swapon/swapoff loop that I proposed doesn't work.  Did you
>> try it?
>
> I've been doing that swapon/swapoff loop for years, while running kernel
> builds on tmpfs going out to swap; for better or worse on baremetal not VM.
>
> You're right that few will ever need that level of reliability; but it
> has caught problems from time to time, and I do insist on fixing them.
>
> I'm not as insistent as you on wanting a reproducer; and we all take pride
> sometimes in fixing ever more inconceivable bugs.  I'm not against that,
> but it's easy to end up with a fix more dangerous than what it claims to
> fix, rather like with random newbie cleanups.

Yes.  I totally agree, bug fixing is hard.

Best Regards,
Huang, Ying

Yang Shi Aug. 5, 2021, 11:08 p.m. UTC | #15

On Tue, Aug 3, 2021 at 10:34 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Tue, 3 Aug 2021, Matthew Wilcox wrote:
> > On Tue, Aug 03, 2021 at 04:14:38PM +0800, Huang, Ying wrote:
> > > Matthew Wilcox <willy@infradead.org> writes:
> > > > But I REALLY REALLY REALLY want a reproducer.  Right now, I have a hard
> > > > time believing this, or any of the other races can really happen.
> > >
> > > I think the race is only theoretical too.  Firstly, swapoff is a rare
> > > operations in practice; secondly, the race window is really small.
> >
> > So do something to provoke it.  Widen the window.  Put an msleep(1000)
> > between *pagep = NULL and the call to get_swap_device().  That's assuming
> > that the swapon/swapoff loop that I proposed doesn't work.  Did you
> > try it?
>
> I've been doing that swapon/swapoff loop for years, while running kernel
> builds on tmpfs going out to swap; for better or worse on baremetal not VM.
>
> You're right that few will ever need that level of reliability; but it
> has caught problems from time to time, and I do insist on fixing them.
>
> I'm not as insistent as you on wanting a reproducer; and we all take pride
> sometimes in fixing ever more inconceivable bugs.  I'm not against that,
> but it's easy to end up with a fix more dangerous than what it claims to
> fix, rather like with random newbie cleanups.
>
> I've never seen the swapoff race claimed by Miaohe, and don't expect to;
> but he's probably right, given the current code.  I just dislike adding
> unnecessary complexity, and siting it in the wrong place (mm/shmem.c).
>
> Yang, is it possible that 5.1 commit 8fd2e0b505d1 ("mm: swap: check if
> swap backing device is congested or not") was actually developed and
> measured on 4.1 or earlier, which still had blk_set_queue_congested()?

I forgot the exact version, but definitely not 4.1 or earlier. Maybe
4.19 or earlier. I'm not familiar with how block layer detect
congestion, if the logic was changed, hence the optimization doesn't
stand anymore nowadays, I'm totally fine to remove it.

>
> I cannot explain its usefulness nowadays, on congested HDD anyway:
> Matthew is right that NFS and a few others may still be setting
> congested flags, but they're not what that commit was proposed for.
>
> If it is still useful, then I contend (but Huang Ying will disagree)
> that the get_swap_device() and put_swap_device() should be around
> 8fd2e0b505d1's inode_read_congested() block in swap_cluster_readahead(),
> not encroaching into mm/shmem.c.
>
> But if that block is not useful, then it should simply be removed (later).
>
> Hugh

Hugh Dickins Aug. 6, 2021, 6:01 a.m. UTC | #16

On Thu, 5 Aug 2021, Yang Shi wrote:
> On Tue, Aug 3, 2021 at 10:34 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > I've never seen the swapoff race claimed by Miaohe, and don't expect to;
> > but he's probably right, given the current code.  I just dislike adding
> > unnecessary complexity, and siting it in the wrong place (mm/shmem.c).
> >
> > Yang, is it possible that 5.1 commit 8fd2e0b505d1 ("mm: swap: check if
> > swap backing device is congested or not") was actually developed and
> > measured on 4.1 or earlier, which still had blk_set_queue_congested()?
> 
> I forgot the exact version, but definitely not 4.1 or earlier. Maybe
> 4.19 or earlier. I'm not familiar with how block layer detect
> congestion, if the logic was changed, hence the optimization doesn't
> stand anymore nowadays, I'm totally fine to remove it.

You drove me back to look more closely.  blk_set_queue_congested()
vanished from include/linux/blkdev.h in 4.2, but blk_set_congested()
appeared then in block/blk-core.c to replace it.  blk_set_congested()
vanished (along with all references to "congested" in blk-core.c) in
5.0, then your commit (most probably tested on 4.19) went into 5.1 -
just after it had become redundant!

Thanks, yes, let's revert that and Miaohe's and Huang's, later on.

Hugh

Yang Shi Aug. 6, 2021, 8:37 p.m. UTC | #17

On Thu, Aug 5, 2021 at 11:01 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Thu, 5 Aug 2021, Yang Shi wrote:
> > On Tue, Aug 3, 2021 at 10:34 PM Hugh Dickins <hughd@google.com> wrote:
> > >
> > > I've never seen the swapoff race claimed by Miaohe, and don't expect to;
> > > but he's probably right, given the current code.  I just dislike adding
> > > unnecessary complexity, and siting it in the wrong place (mm/shmem.c).
> > >
> > > Yang, is it possible that 5.1 commit 8fd2e0b505d1 ("mm: swap: check if
> > > swap backing device is congested or not") was actually developed and
> > > measured on 4.1 or earlier, which still had blk_set_queue_congested()?
> >
> > I forgot the exact version, but definitely not 4.1 or earlier. Maybe
> > 4.19 or earlier. I'm not familiar with how block layer detect
> > congestion, if the logic was changed, hence the optimization doesn't
> > stand anymore nowadays, I'm totally fine to remove it.
>
> You drove me back to look more closely.  blk_set_queue_congested()
> vanished from include/linux/blkdev.h in 4.2, but blk_set_congested()
> appeared then in block/blk-core.c to replace it.  blk_set_congested()
> vanished (along with all references to "congested" in blk-core.c) in
> 5.0, then your commit (most probably tested on 4.19) went into 5.1 -
> just after it had become redundant!
>
> Thanks, yes, let's revert that and Miaohe's and Huang's, later on.

It should be easier to revert Huang Ying's , then Miaohe's, then mine.

>
> Hugh

Yang Shi Aug. 9, 2021, 9:26 p.m. UTC | #18

On Fri, Aug 6, 2021 at 1:37 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Aug 5, 2021 at 11:01 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > On Thu, 5 Aug 2021, Yang Shi wrote:
> > > On Tue, Aug 3, 2021 at 10:34 PM Hugh Dickins <hughd@google.com> wrote:
> > > >
> > > > I've never seen the swapoff race claimed by Miaohe, and don't expect to;
> > > > but he's probably right, given the current code.  I just dislike adding
> > > > unnecessary complexity, and siting it in the wrong place (mm/shmem.c).
> > > >
> > > > Yang, is it possible that 5.1 commit 8fd2e0b505d1 ("mm: swap: check if
> > > > swap backing device is congested or not") was actually developed and
> > > > measured on 4.1 or earlier, which still had blk_set_queue_congested()?
> > >
> > > I forgot the exact version, but definitely not 4.1 or earlier. Maybe
> > > 4.19 or earlier. I'm not familiar with how block layer detect
> > > congestion, if the logic was changed, hence the optimization doesn't
> > > stand anymore nowadays, I'm totally fine to remove it.
> >
> > You drove me back to look more closely.  blk_set_queue_congested()
> > vanished from include/linux/blkdev.h in 4.2, but blk_set_congested()
> > appeared then in block/blk-core.c to replace it.  blk_set_congested()
> > vanished (along with all references to "congested" in blk-core.c) in
> > 5.0, then your commit (most probably tested on 4.19) went into 5.1 -
> > just after it had become redundant!
> >
> > Thanks, yes, let's revert that and Miaohe's and Huang's, later on.
>
> It should be easier to revert Huang Ying's , then Miaohe's, then mine.

Hi Ying,

I just prepared the reverts since I need to revert yours and Miaohe's
in order to revert my problematic commit.

>
> >
> > Hugh

Huang, Ying Aug. 9, 2021, 11:43 p.m. UTC | #19

Yang Shi <shy828301@gmail.com> writes:

> On Fri, Aug 6, 2021 at 1:37 PM Yang Shi <shy828301@gmail.com> wrote:
>>
>> On Thu, Aug 5, 2021 at 11:01 PM Hugh Dickins <hughd@google.com> wrote:
>> >
>> > On Thu, 5 Aug 2021, Yang Shi wrote:
>> > > On Tue, Aug 3, 2021 at 10:34 PM Hugh Dickins <hughd@google.com> wrote:
>> > > >
>> > > > I've never seen the swapoff race claimed by Miaohe, and don't expect to;
>> > > > but he's probably right, given the current code.  I just dislike adding
>> > > > unnecessary complexity, and siting it in the wrong place (mm/shmem.c).
>> > > >
>> > > > Yang, is it possible that 5.1 commit 8fd2e0b505d1 ("mm: swap: check if
>> > > > swap backing device is congested or not") was actually developed and
>> > > > measured on 4.1 or earlier, which still had blk_set_queue_congested()?
>> > >
>> > > I forgot the exact version, but definitely not 4.1 or earlier. Maybe
>> > > 4.19 or earlier. I'm not familiar with how block layer detect
>> > > congestion, if the logic was changed, hence the optimization doesn't
>> > > stand anymore nowadays, I'm totally fine to remove it.
>> >
>> > You drove me back to look more closely.  blk_set_queue_congested()
>> > vanished from include/linux/blkdev.h in 4.2, but blk_set_congested()
>> > appeared then in block/blk-core.c to replace it.  blk_set_congested()
>> > vanished (along with all references to "congested" in blk-core.c) in
>> > 5.0, then your commit (most probably tested on 4.19) went into 5.1 -
>> > just after it had become redundant!
>> >
>> > Thanks, yes, let's revert that and Miaohe's and Huang's, later on.
>>
>> It should be easier to revert Huang Ying's , then Miaohe's, then mine.
>
> Hi Ying,
>
> I just prepared the reverts since I need to revert yours and Miaohe's
> in order to revert my problematic commit.

If your original commit will be reverted, then mine and Miaohe's can be
reverted from the race condition point of view.

Although I still think it's better to call get/put_swap_device() in
shmem_swapin_page(), that can be discussed with another patch.

Best Regards,
Huang, Ying

Yang Shi Aug. 10, 2021, 1:11 a.m. UTC | #20

On Mon, Aug 9, 2021 at 4:43 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yang Shi <shy828301@gmail.com> writes:
>
> > On Fri, Aug 6, 2021 at 1:37 PM Yang Shi <shy828301@gmail.com> wrote:
> >>
> >> On Thu, Aug 5, 2021 at 11:01 PM Hugh Dickins <hughd@google.com> wrote:
> >> >
> >> > On Thu, 5 Aug 2021, Yang Shi wrote:
> >> > > On Tue, Aug 3, 2021 at 10:34 PM Hugh Dickins <hughd@google.com> wrote:
> >> > > >
> >> > > > I've never seen the swapoff race claimed by Miaohe, and don't expect to;
> >> > > > but he's probably right, given the current code.  I just dislike adding
> >> > > > unnecessary complexity, and siting it in the wrong place (mm/shmem.c).
> >> > > >
> >> > > > Yang, is it possible that 5.1 commit 8fd2e0b505d1 ("mm: swap: check if
> >> > > > swap backing device is congested or not") was actually developed and
> >> > > > measured on 4.1 or earlier, which still had blk_set_queue_congested()?
> >> > >
> >> > > I forgot the exact version, but definitely not 4.1 or earlier. Maybe
> >> > > 4.19 or earlier. I'm not familiar with how block layer detect
> >> > > congestion, if the logic was changed, hence the optimization doesn't
> >> > > stand anymore nowadays, I'm totally fine to remove it.
> >> >
> >> > You drove me back to look more closely.  blk_set_queue_congested()
> >> > vanished from include/linux/blkdev.h in 4.2, but blk_set_congested()
> >> > appeared then in block/blk-core.c to replace it.  blk_set_congested()
> >> > vanished (along with all references to "congested" in blk-core.c) in
> >> > 5.0, then your commit (most probably tested on 4.19) went into 5.1 -
> >> > just after it had become redundant!
> >> >
> >> > Thanks, yes, let's revert that and Miaohe's and Huang's, later on.
> >>
> >> It should be easier to revert Huang Ying's , then Miaohe's, then mine.
> >
> > Hi Ying,
> >
> > I just prepared the reverts since I need to revert yours and Miaohe's
> > in order to revert my problematic commit.
>
> If your original commit will be reverted, then mine and Miaohe's can be
> reverted from the race condition point of view.

Yes, this is exactly what I did.

>
> Although I still think it's better to call get/put_swap_device() in
> shmem_swapin_page(), that can be discussed with another patch.

Any additional change could be based off the reverts.

>
> Best Regards,
> Huang, Ying

mm,shmem: Fix a typo in shmem_swapin_page()

Commit Message

Comments

Patch