
mm: disable preemption before swapcache_free

Message ID 2018072514375722198958@wingtech.com (mailing list archive)
State New, archived

Commit Message

zhaowuyun@wingtech.com July 25, 2018, 6:37 a.m. UTC
From: zhaowuyun <zhaowuyun@wingtech.com>
 
The issue is that there are two processes A and B: A is kworker/u16:8 at
normal priority, B is AudioTrack at RT priority, and they are on the
same CPU 3.
 
Task A is preempted by task B in the window
after __delete_from_swap_cache(page) and before swapcache_free(swap).
 
Task B runs __read_swap_cache_async() in its do {} while loop. It
will never find the page in swapper_space, because the page was already
removed by task A, and it will never succeed in swapcache_prepare(),
because the entry still has SWAP_HAS_CACHE set and the call returns -EEXIST.
 
Task B is then stuck in the loop forever: it is an RT task,
so nothing can preempt it.
 
So we need to disable preemption until swapcache_free() has executed.
 
TASK A (patch context, __remove_mapping() in mm/vmscan.c):

 		__delete_from_swap_cache(page);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 		swapcache_free(swap);
+		preempt_enable();
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
@@ -740,6 +747,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		shadow = workingset_eviction(mapping, page);
 		__delete_from_page_cache(page, shadow);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		preempt_enable();
 		if (freepage != NULL)
 			freepage(page);
@@ -749,6 +757,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 cannot_free:
 	spin_unlock_irqrestore(&mapping->tree_lock, flags);
+	preempt_enable();
 	return 0;
 }

Comments

Michal Hocko July 25, 2018, 7:40 a.m. UTC | #1
On Wed 25-07-18 14:37:58, zhaowuyun@wingtech.com wrote:
[...]
> Change-Id: I36d9df7ccff77c589b7157225410269c675a8504

What is this?

> Signed-off-by: zhaowuyun <zhaowuyun@wingtech.com>
> ---
> mm/vmscan.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2740973..acede002 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -674,6 +674,12 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
> BUG_ON(!PageLocked(page));
> BUG_ON(mapping != page_mapping(page));
> + /*
> + * preemption must be disabled to protect current task preempted before
> + * swapcache_free(swap) invoked by the task which do the
> + * __read_swap_cache_async job on the same page
> + */
> + preempt_disable();
> spin_lock_irqsave(&mapping->tree_lock, flags);

Hmm, but spin_lock_irqsave already implies the disabled preemption.
zhaowuyun@wingtech.com July 25, 2018, 7:57 a.m. UTC | #2
That is a BUG we found in mm/vmscan.c at kernel version 4.9.82.
Summary: TASK A (normal priority) doing __remove_mapping() is preempted by TASK B (RT priority) doing __read_swap_cache_async().
TASK A is preempted before swapcache_free(), leaving the SWAP_HAS_CACHE flag set in the swap_map entry;
TASK B, in __read_swap_cache_async(), will not succeed at swapcache_prepare(entry) because the flag is still set, and then loops forever because it is an RT thread...
The spin lock is dropped before swapcache_free(), so we disable preemption until swapcache_free() has executed ...

struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
            struct vm_area_struct *vma, unsigned long addr,
            bool *new_page_allocated)
{
    struct page *found_page, *new_page = NULL;
    struct address_space *swapper_space = swap_address_space(entry);
    int err;
    *new_page_allocated = false;

    do {
        /*
         * First check the swap cache.  Since this is normally
         * called after lookup_swap_cache() failed, re-calling
         * that would confuse statistics.
         */
        found_page = find_get_page(swapper_space, swp_offset(entry));
        if (found_page)
            break;

        /*
         * Get a new page to read into from swap.
         */
        if (!new_page) {
            new_page = alloc_page_vma(gfp_mask, vma, addr);
            if (!new_page)
                break;        /* Out of memory */
        }

        /*
         * call radix_tree_preload() while we can wait.
         */
        err = radix_tree_maybe_preload(gfp_mask & GFP_KERNEL);
        if (err)
            break;

        /*
         * Swap entry may have been freed since our caller observed it.
         */
        err = swapcache_prepare(entry);
        if (err == -EEXIST) {
            radix_tree_preload_end();
            /*
             * We might race against get_swap_page() and stumble
             * across a SWAP_HAS_CACHE swap_map entry whose page
             * has not been brought into the swapcache yet, while
             * the other end is scheduled away waiting on discard
             * I/O completion at scan_swap_map().
             *
             * In order to avoid turning this transitory state
             * into a permanent loop around this -EEXIST case
             * if !CONFIG_PREEMPT and the I/O completion happens
             * to be waiting on the CPU waitqueue where we are now
             * busy looping, we just conditionally invoke the
             * scheduler here, if there are some more important
             * tasks to run.
             */
            cond_resched();
            continue;        /* will loop infinitely */
        }

zhaowuyun@wingtech.com
 
From: Michal Hocko
Date: 2018-07-25 15:40
To: zhaowuyun@wingtech.com
CC: mgorman; akpm; minchan; vinmenon; hannes; hillf.zj; linux-mm; linux-kernel
Subject: Re: [PATCH] [PATCH] mm: disable preemption before swapcache_free
On Wed 25-07-18 14:37:58, zhaowuyun@wingtech.com wrote:
[...]
> Change-Id: I36d9df7ccff77c589b7157225410269c675a8504
 
What is this?
 
> Signed-off-by: zhaowuyun <zhaowuyun@wingtech.com>
> ---
> mm/vmscan.c | 9 +++++++++
> 1 file changed, 9 insertions(+)

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2740973..acede002 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -674,6 +674,12 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
> BUG_ON(!PageLocked(page));
> BUG_ON(mapping != page_mapping(page));
> + /*
> + * preemption must be disabled to protect current task preempted before
> + * swapcache_free(swap) invoked by the task which do the
> + * __read_swap_cache_async job on the same page
> + */
> + preempt_disable();
> spin_lock_irqsave(&mapping->tree_lock, flags);
 
Hmm, but spin_lock_irqsave already implies the disabled preemption.
--
Michal Hocko
SUSE Labs
Michal Hocko July 25, 2018, 8:21 a.m. UTC | #3
[Please do not top post - thank you]
[CC Hugh - the original patch was http://lkml.kernel.org/r/2018072514375722198958@wingtech.com]

On Wed 25-07-18 15:57:55, zhaowuyun@wingtech.com wrote:
> That is a BUG we found in mm/vmscan.c at KERNEL VERSION 4.9.82

The code is quite similar in the current tree as well.

> Sumary is TASK A (normal priority) doing __remove_mapping page preempted by TASK B (RT priority) doing __read_swap_cache_async,
> the TASK A preempted before swapcache_free, left SWAP_HAS_CACHE flag in the swap cache,
> the TASK B which doing __read_swap_cache_async, will not success at swapcache_prepare(entry) because the swap cache was exist, then it will loop forever because it is a RT thread...
> the spin lock unlocked before swapcache_free, so disable preemption until swapcache_free executed ...

OK, I see your point now. I have missed the lock is dropped before
swapcache_free. How can preemption disabling prevent this race to happen
while the code is preempted by an IRQ?
zhaowuyun@wingtech.com July 25, 2018, 9:53 a.m. UTC | #4
>[Please do not top post - thank you]
>[CC Hugh - the original patch was http://lkml.kernel.org/r/2018072514375722198958@wingtech.com]
>
>On Wed 25-07-18 15:57:55, zhaowuyun@wingtech.com wrote:
>> That is a BUG we found in mm/vmscan.c at KERNEL VERSION 4.9.82
>
>The code is quite similar in the current tree as well.
>
>> Sumary is TASK A (normal priority) doing __remove_mapping page preempted by TASK B (RT priority) doing __read_swap_cache_async,
>> the TASK A preempted before swapcache_free, left SWAP_HAS_CACHE flag in the swap cache,
>> the TASK B which doing __read_swap_cache_async, will not success at swapcache_prepare(entry) because the swap cache was exist, then it will loop forever because it is a RT thread...
>> the spin lock unlocked before swapcache_free, so disable preemption until swapcache_free executed ...
>
>OK, I see your point now. I have missed the lock is dropped before
>swapcache_free. How can preemption disabling prevent this race to happen
>while the code is preempted by an IRQ?
>--
>Michal Hocko
>SUSE Labs 

Hi Michal,

The code that does the __read_swap_cache_async() job runs in process context, so I think disabling preemption is enough.

--------------
zhaowuyun@wingtech.com
Michal Hocko July 25, 2018, 10:32 a.m. UTC | #5
On Wed 25-07-18 10:21:00, Michal Hocko wrote:
> [Please do not top post - thank you]
> [CC Hugh - the original patch was http://lkml.kernel.org/r/2018072514375722198958@wingtech.com]

now for real

> On Wed 25-07-18 15:57:55, zhaowuyun@wingtech.com wrote:
> > That is a BUG we found in mm/vmscan.c at KERNEL VERSION 4.9.82
> 
> The code is quite similar in the current tree as well.
> 
> > Sumary is TASK A (normal priority) doing __remove_mapping page preempted by TASK B (RT priority) doing __read_swap_cache_async,
> > the TASK A preempted before swapcache_free, left SWAP_HAS_CACHE flag in the swap cache,
> > the TASK B which doing __read_swap_cache_async, will not success at swapcache_prepare(entry) because the swap cache was exist, then it will loop forever because it is a RT thread...
> > the spin lock unlocked before swapcache_free, so disable preemption until swapcache_free executed ...
> 
> OK, I see your point now. I have missed the lock is dropped before
> swapcache_free. How can preemption disabling prevent this race to happen
> while the code is preempted by an IRQ?
Michal Hocko July 25, 2018, 10:34 a.m. UTC | #6
On Wed 25-07-18 17:53:07, zhaowuyun@wingtech.com wrote:
> >[Please do not top post - thank you]
> >[CC Hugh - the original patch was http://lkml.kernel.org/r/2018072514375722198958@wingtech.com]
> >
> >On Wed 25-07-18 15:57:55, zhaowuyun@wingtech.com wrote:
> >> That is a BUG we found in mm/vmscan.c at KERNEL VERSION 4.9.82
> >
> >The code is quite similar in the current tree as well.
> >
> >> Sumary is TASK A (normal priority) doing __remove_mapping page preempted by TASK B (RT priority) doing __read_swap_cache_async,
> >> the TASK A preempted before swapcache_free, left SWAP_HAS_CACHE flag in the swap cache,
> >> the TASK B which doing __read_swap_cache_async, will not success at swapcache_prepare(entry) because the swap cache was exist, then it will loop forever because it is a RT thread...
> >> the spin lock unlocked before swapcache_free, so disable preemption until swapcache_free executed ...
> >
> >OK, I see your point now. I have missed the lock is dropped before
> >swapcache_free. How can preemption disabling prevent this race to happen
> >while the code is preempted by an IRQ?
> >--
> >Michal Hocko
> >SUSE Labs 
> 
> Hi Michal,
> 
> The action what processes __read_swap_cache_async is on the process context, so I think disable preemption is enough.

So what you are saying is that no IRQ or other non-process context will
loop in __read_swap_cache_async(), so the livelock is not possible?
zhaowuyun@wingtech.com July 25, 2018, 11:17 a.m. UTC | #7
>On Wed 25-07-18 17:53:07, zhaowuyun@wingtech.com wrote:
>> >[Please do not top post - thank you]
>> >[CC Hugh - the original patch was http://lkml.kernel.org/r/2018072514375722198958@wingtech.com]
>> >
>> >On Wed 25-07-18 15:57:55, zhaowuyun@wingtech.com wrote:
>> >> That is a BUG we found in mm/vmscan.c at KERNEL VERSION 4.9.82
>> >
>> >The code is quite similar in the current tree as well.
>> >
>> >> Sumary is TASK A (normal priority) doing __remove_mapping page preempted by TASK B (RT priority) doing __read_swap_cache_async,
>> >> the TASK A preempted before swapcache_free, left SWAP_HAS_CACHE flag in the swap cache,
>> >> the TASK B which doing __read_swap_cache_async, will not success at swapcache_prepare(entry) because the swap cache was exist, then it will loop forever because it is a RT thread...
>> >> the spin lock unlocked before swapcache_free, so disable preemption until swapcache_free executed ...
>> >
>> >OK, I see your point now. I have missed the lock is dropped before
>> >swapcache_free. How can preemption disabling prevent this race to happen
>> >while the code is preempted by an IRQ?
>> >--
>> >Michal Hocko
>> >SUSE Labs
>>
>> Hi Michal,
>>
>> The action what processes __read_swap_cache_async is on the process context, so I think disable preemption is enough.
>
>So what you are saying is that no IRQ or other non-process contexts will
>not loop in __read_swap_cache_async so the live lock is not possible?
>--
>Michal Hocko
>SUSE Labs 


I think that __read_swap_cache_async() will not run in IRQ context.
If it runs in some other non-process context, I think it must be on another CPU, so it will not encounter this dead loop.

--------------
zhaowuyun@wingtech.com
Andrew Morton July 25, 2018, 9:16 p.m. UTC | #8
On Wed, 25 Jul 2018 14:37:58 +0800 "zhaowuyun@wingtech.com" <zhaowuyun@wingtech.com> wrote:

> From: zhaowuyun <zhaowuyun@wingtech.com>
>  
> issue is that there are two processes A and B, A is kworker/u16:8
> normal priority, B is AudioTrack, RT priority, they are on the
> same CPU 3.
>  
> The task A preempted by task B in the moment
> after __delete_from_swap_cache(page) and before swapcache_free(swap).
>  
> The task B does __read_swap_cache_async in the do {} while loop, it
> will never find the page from swapper_space because the page is removed
> by the task A, and it will never sucessfully in swapcache_prepare because
> the entry is EEXIST.
>  
> The task B then stuck in the loop infinitely because it is a RT task,
> no one can preempt it.
>  
> so need to disable preemption until the swapcache_free executed.

Yes, right, sorry, I must have merged cbab0e4eec299 in my sleep.
cond_resched() is a no-op in the presence of realtime policy threads,
and using it to attempt to yield to a different thread in this fashion
is broken.

Disabling preemption on the other side of the race should fix things,
but it's using a bandaid to plug the leakage from the earlier bandaid. 
The proper way to coordinate threads is to use a sleeping lock, such
as a mutex, or some other wait/wakeup mechanism.

And once that's done, we can hopefully eliminate the do loop from
__read_swap_cache_async().  That also services ENOMEM from
radix_tree_insert(), but __add_to_swap_cache() appears to handle that
OK and we shouldn't just loop around retrying the insert and the
radix_tree_preload() should ensure that radix_tree_insert() never fails
anyway.  Unless we're calling __read_swap_cache_async() with screwy
gfp_flags from somewhere.
zhaowuyun@wingtech.com July 26, 2018, 2:21 a.m. UTC | #9
>On Wed, 25 Jul 2018 14:37:58 +0800 "zhaowuyun@wingtech.com" <zhaowuyun@wingtech.com> wrote:
>
>> From: zhaowuyun <zhaowuyun@wingtech.com>
>> 
>> issue is that there are two processes A and B, A is kworker/u16:8
>> normal priority, B is AudioTrack, RT priority, they are on the
>> same CPU 3.
>> 
>> The task A preempted by task B in the moment
>> after __delete_from_swap_cache(page) and before swapcache_free(swap).
>> 
>> The task B does __read_swap_cache_async in the do {} while loop, it
>> will never find the page from swapper_space because the page is removed
>> by the task A, and it will never sucessfully in swapcache_prepare because
>> the entry is EEXIST.
>> 
>> The task B then stuck in the loop infinitely because it is a RT task,
>> no one can preempt it.
>> 
>> so need to disable preemption until the swapcache_free executed.
>
>Yes, right, sorry, I must have merged cbab0e4eec299 in my sleep.
>cond_resched() is a no-op in the presence of realtime policy threads
>and using to attempt to yield to a different thread it in this fashion
>is broken.
>
>Disabling preemption on the other side of the race should fix things,
>but it's using a bandaid to plug the leakage from the earlier bandaid.
>The proper way to coordinate threads is to use a sleeping lock, such
>as a mutex, or some other wait/wakeup mechanism.
>
>And once that's done, we can hopefully eliminate the do loop from
>__read_swap_cache_async().  That also services ENOMEM from
>radix_tree_insert(), but __add_to_swap_cache() appears to handle that
>OK and we shouldn't just loop around retrying the insert and the
>radix_tree_preload() should ensure that radix_tree_insert() never fails
>anyway.  Unless we're calling __read_swap_cache_async() with screwy
>gfp_flags from somewhere.
>
> 


You are right, it is a bandaid ...
Could you provide a more specific suggestion on how to use a sleeping lock or some other wait/wakeup mechanism to fix this issue? Thanks very much!
Our project really needs a fix for this issue ...


--------------
zhaowuyun@wingtech.com
Michal Hocko July 26, 2018, 6:06 a.m. UTC | #10
On Thu 26-07-18 10:21:40, zhaowuyun@wingtech.com wrote:
[...]
> Our project really needs a fix to this issue

Could you be more specific why? My understanding is that RT tasks
usually have all the memory mlocked otherwise all the real time
expectations are gone already.
zhaowuyun@wingtech.com July 26, 2018, 7:03 a.m. UTC | #11
>On Thu 26-07-18 10:21:40, zhaowuyun@wingtech.com wrote:
>[...]
>> Our project really needs a fix to this issue
>
>Could you be more specific why? My understanding is that RT tasks
>usually have all the memory mlocked otherwise all the real time
>expectations are gone already.
>--
>Michal Hocko
>SUSE Labs 


The RT thread is created by a process with normal priority, and that process was asleep;
then some task needed the RT thread to do something, so the process created this thread and set it to RT policy.
I think that is the reason why an RT task would read from swap.


--------------
zhaowuyun@wingtech.com
Michal Hocko July 26, 2018, 7:44 a.m. UTC | #12
On Thu 26-07-18 15:03:23, zhaowuyun@wingtech.com wrote:
> >On Thu 26-07-18 10:21:40, zhaowuyun@wingtech.com wrote:
> >[...]
> >> Our project really needs a fix to this issue
> >
> >Could you be more specific why? My understanding is that RT tasks
> >usually have all the memory mlocked otherwise all the real time
> >expectations are gone already.
> >--
> >Michal Hocko
> >SUSE Labs 
> 
> 
> The RT thread is created by a process with normal priority, and the process was sleep, 
> then some task needs the RT thread to do something, so the process create this thread, and set it to RT policy.
> I think that is the reason why RT task would read the swap.

OK I see. This design is quite fragile though. You are opening the door to
priority inversions and what not.

Anyway, the underlying swap issue should be fixed. Unfortunatelly I do
not have a great idea how to do that properly.
Andrew Morton July 26, 2018, 10:11 p.m. UTC | #13
On Thu, 26 Jul 2018 15:03:23 +0800 "zhaowuyun@wingtech.com" <zhaowuyun@wingtech.com> wrote:

> >On Thu 26-07-18 10:21:40, zhaowuyun@wingtech.com wrote:
> >[...]
> >> Our project really needs a fix to this issue
> >
> >Could you be more specific why? My understanding is that RT tasks
> >usually have all the memory mlocked otherwise all the real time
> >expectations are gone already.
> >--
> >Michal Hocko
> >SUSE Labs 
> 
> 
> The RT thread is created by a process with normal priority, and the process was sleep, 
> then some task needs the RT thread to do something, so the process create this thread, and set it to RT policy.
> I think that is the reason why RT task would read the swap.

A simpler bandaid might be to replace the cond_resched() with msleep(1).
zhaowuyun@wingtech.com July 27, 2018, 6:07 a.m. UTC | #14
>On Thu, 26 Jul 2018 15:03:23 +0800 "zhaowuyun@wingtech.com" <zhaowuyun@wingtech.com> wrote:
>
>> >On Thu 26-07-18 10:21:40, zhaowuyun@wingtech.com wrote:
>> >[...]
>> >> Our project really needs a fix to this issue
>> >
>> >Could you be more specific why? My understanding is that RT tasks
>> >usually have all the memory mlocked otherwise all the real time
>> >expectations are gone already.
>> >--
>> >Michal Hocko
>> >SUSE Labs
>>
>>
>> The RT thread is created by a process with normal priority, and the process was sleep,
>> then some task needs the RT thread to do something, so the process create this thread, and set it to RT policy.
>> I think that is the reason why RT task would read the swap.
>
>A simpler bandaid might be to replace the cond_resched() with msleep(1). 


Thanks for the suggestion, we will try that.


--------------
zhaowuyun@wingtech.com
Hugh Dickins Aug. 4, 2018, 11:07 p.m. UTC | #15
On Fri, 27 Jul 2018, zhaowuyun@wingtech.com wrote:
> >On Thu, 26 Jul 2018 15:03:23 +0800 "zhaowuyun@wingtech.com" <zhaowuyun@wingtech.com> wrote:
> >
> >> >On Thu 26-07-18 10:21:40, zhaowuyun@wingtech.com wrote:
> >> >[...]
> >> >> Our project really needs a fix to this issue
> >> >
> >> >Could you be more specific why? My understanding is that RT tasks
> >> >usually have all the memory mlocked otherwise all the real time
> >> >expectations are gone already.
> >> >--
> >> >Michal Hocko
> >> >SUSE Labs
> >>
> >>
> >> The RT thread is created by a process with normal priority, and the process was sleep,
> >> then some task needs the RT thread to do something, so the process create this thread, and set it to RT policy.
> >> I think that is the reason why RT task would read the swap.
> >
> >A simpler bandaid might be to replace the cond_resched() with msleep(1). 
> 
> 
> Thanks for the suggestion, we will try that.

Andrew's msleep(1) may be a good enough bandaid for you. And I share
Michal's doubt about your design, in which an RT thread meets swap:
this may not be the last problem you have with that.

But this is a real bug when CONFIG_PREEMPT=y, RT threads or not: we
just didn't notice, because it's usually hidden by the cond_resched().
(I think that was added in 3.10, because in 2.6.29 I had been guilty of
inserting a discard, and wait for completion, in between allocating swap
and adding to swap cache; but Shao Hua fixed my discard in 3.12.) Thanks
a lot for making us aware of this bug.

After reminding myself of the issues here, I disagree with much of what
has been said: we shall "always" want the loop in __read_swap_cache_async()
(though some of its error handling is probably superfluous now, agreed);
and your disabling of preemption is not just a bandaid, it's exactly the
right approach.

We could do something heavier, perhaps rearranging the swapcache tree work
to be done under swap_lock as well as tree_lock (I'm talking 4.9-language),
but that's very unlikely to be an improvement. Disabling preemption yokes
the two spinlocks together in an efficient way, without overhead on other
paths; on rare occasions we spin around __read_swap_cache_async() instead
of spinning around to acquire a spinlock.

But your patch is incomplete. The same needs to be done in delete_from_
swap_cache(), and we would also need to protect against preemption between
the get_swap_page() and the add_to_swap_cache(), in add_to_swap() and in
shmem_writepage(). The latter gets messy, but 4.11 (where Tim Chen uses
SWAP_HAS_CACHE more widely) gives a good hint: __read_swap_cache_async()
callers are only interested in swap entries that are already in use and
still in use. (Though swapoff has to be more careful, partly because one
of its jobs is to clear out swap-cache-only entries, partly because the
current interface would mistake a NULL for no-entry as out-of-memory.)

Below is the patch I would do for 4.9 (diffed against 4.9.117), and I'm
showing that because it's the simplest case. Although the principles stay
the same, the codebase here has gone through several shifts, and 4.19 will
probably be different again. So I'll test and post a patch against 4.19-rc
in a few weeks time, and that can then be backported to stable: but will
need several manual backports because of the intervening changes.

I did wonder whether just to extend the irq-disabled section in
delete_from_swap_cache() etc: that's easy, and works, and is even better
protection against spinning too long; but it's not absolutely necessary,
so all in all, probably better avoided. I did wonder whether to remove
the cond_resched(), but it's not a bad place for one, so I've left it in.

When checking worst cases of looping around __read_swap_cache_async(),
after the patch, I was worried for a while. I had naively imagined that
going more than twice around the loop should be vanishingly rare, but
that is not so at all. But the bad cases I looked into were all the same:
after forking, two processes, on HT siblings, each serving do_swap_page(),
trying to bring the same swap into its mm, with a sparse swapper_space
tree: one process gets to do all the work of allocating new radix-tree
nodes and bringing them into D-cache, while the other just spins around
__read_swap_cache_async() seeing SWAP_HAS_CACHE but not yet the page in
the radix-tree. That's okay, that makes sense.

Hugh
---

 mm/swap_state.c |   26 +++++++++++++-------------
 mm/swapfile.c   |    8 +++++++-
 mm/vmscan.c     |    3 +++
 3 files changed, 23 insertions(+), 14 deletions(-)

--- 4.9.117/mm/swap_state.c	2016-12-11 11:17:54.000000000 -0800
+++ linux/mm/swap_state.c	2018-08-04 11:50:46.577788766 -0700
@@ -225,9 +225,11 @@ void delete_from_swap_cache(struct page
 	address_space = swap_address_space(entry);
 	spin_lock_irq(&address_space->tree_lock);
 	__delete_from_swap_cache(page);
+	/* Expedite swapcache_free() to help __read_swap_cache_async() */
+	preempt_disable();
 	spin_unlock_irq(&address_space->tree_lock);
-
 	swapcache_free(entry);
+	preempt_enable();
 	put_page(page);
 }
 
@@ -337,19 +339,17 @@ struct page *__read_swap_cache_async(swp
 		if (err == -EEXIST) {
 			radix_tree_preload_end();
 			/*
-			 * We might race against get_swap_page() and stumble
-			 * across a SWAP_HAS_CACHE swap_map entry whose page
-			 * has not been brought into the swapcache yet, while
-			 * the other end is scheduled away waiting on discard
-			 * I/O completion at scan_swap_map().
+			 * We might race against __delete_from_swap_cache() and
+			 * stumble across a swap_map entry whose SWAP_HAS_CACHE
+			 * has not yet been cleared: hence preempt_disable()
+			 * in __remove_mapping() and delete_from_swap_cache(),
+			 * so they cannot schedule away before clearing it.
 			 *
-			 * In order to avoid turning this transitory state
-			 * into a permanent loop around this -EEXIST case
-			 * if !CONFIG_PREEMPT and the I/O completion happens
-			 * to be waiting on the CPU waitqueue where we are now
-			 * busy looping, we just conditionally invoke the
-			 * scheduler here, if there are some more important
-			 * tasks to run.
+			 * We need similar protection against racing calls to
+			 * __read_swap_cache_async(): preempt_disable() before
+			 * swapcache_prepare() above, preempt_enable() after
+			 * __add_to_swap_cache() below: which are already in
+			 * radix_tree_maybe_preload(), radix_tree_preload_end().
 			 */
 			cond_resched();
 			continue;
--- 4.9.117/mm/swapfile.c	2018-08-04 11:40:08.463504848 -0700
+++ linux/mm/swapfile.c	2018-08-04 11:50:46.577788766 -0700
@@ -2670,7 +2670,13 @@ static int __swap_duplicate(swp_entry_t
 		/* set SWAP_HAS_CACHE if there is no cache and entry is used */
 		if (!has_cache && count)
 			has_cache = SWAP_HAS_CACHE;
-		else if (has_cache)		/* someone else added cache */
+		/*
+		 * __read_swap_cache_async() can usually skip entries without
+		 * real usage (including those in between being allocated and
+		 * added to swap cache); but swapoff (!SWP_WRITEOK) must not.
+		 */
+		else if (has_cache &&
+			 (count || !(p->flags & SWP_WRITEOK)))
 			err = -EEXIST;
 		else				/* no users remaining */
 			err = -ENOENT;
--- 4.9.117/mm/vmscan.c	2018-08-04 11:40:08.471504902 -0700
+++ linux/mm/vmscan.c	2018-08-04 11:50:46.577788766 -0700
@@ -709,8 +709,11 @@ static int __remove_mapping(struct addre
 		swp_entry_t swap = { .val = page_private(page) };
 		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page);
+		/* Expedite swapcache_free() for __read_swap_cache_async() */
+		preempt_disable();
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 		swapcache_free(swap);
+		preempt_enable();
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
zhaowuyun@wingtech.com Aug. 7, 2018, 2:15 a.m. UTC | #16
>+	* __add_to_swap_cache() below: which are already in
>+	* radix_tree_maybe_preload(), radix_tree_preload_end().
> */
> cond_resched();
> continue;
>--- 4.9.117/mm/swapfile.c	2018-08-04 11:40:08.463504848 -0700
>+++ linux/mm/swapfile.c	2018-08-04 11:50:46.577788766 -0700
>@@ -2670,7 +2670,13 @@ static int __swap_duplicate(swp_entry_t
> /* set SWAP_HAS_CACHE if there is no cache and entry is used */
> if (!has_cache && count)
> has_cache = SWAP_HAS_CACHE;
>-	else if (has_cache)	/* someone else added cache */
>+	/*
>+	* __read_swap_cache_async() can usually skip entries without
>+	* real usage (including those in between being allocated and
>+	* added to swap cache); but swapoff (!SWP_WRITEOK) must not.
>+	*/
>+	else if (has_cache &&
>+	(count || !(p->flags & SWP_WRITEOK)))
> err = -EEXIST;
> else	/* no users remaining */
> err = -ENOENT;
>--- 4.9.117/mm/vmscan.c	2018-08-04 11:40:08.471504902 -0700
>+++ linux/mm/vmscan.c	2018-08-04 11:50:46.577788766 -0700
>@@ -709,8 +709,11 @@ static int __remove_mapping(struct addre
> swp_entry_t swap = { .val = page_private(page) };
> mem_cgroup_swapout(page, swap);
> __delete_from_swap_cache(page);
>+	/* Expedite swapcache_free() for __read_swap_cache_async() */
>+	preempt_disable();
> spin_unlock_irqrestore(&mapping->tree_lock, flags);
> swapcache_free(swap);
>+	preempt_enable();
> } else {
> void (*freepage)(struct page *);
> void *shadow = NULL; 


Hi Hugh,

Thanks for confirming the approach of disabling preemption, and for pointing
out the incompleteness: delete_from_swap_cache() needs the same protection.
I'm curious, though: why not put swapcache_free(swap) under the protection
of mapping->tree_lock?

Hugh Dickins Aug. 7, 2018, 3:23 a.m. UTC | #17
On Tue, 7 Aug 2018, zhaowuyun@wingtech.com wrote:
> 
> Thanks for confirming the approach of disabling preemption, and for
> pointing out the incompleteness: delete_from_swap_cache() needs the same
> protection. I'm curious, though: why not put swapcache_free(swap) under
> the protection of mapping->tree_lock?

That would violate the long-established lock ordering (see not-always-
kept-up-to-date comments at the head of mm/rmap.c). In particular,
swap_lock (and its more recent descendants, such as swap_info->lock)
can be held with interrupts enabled, whereas taking tree_lock (later
called i_pages lock) involves disabling interrupts. So: there would
be quite a lot of modifications required to do swapcache_free(swap)
under mapping->tree_lock.

Generally easier would be to take tree_lock under swap lock: that fits
the established lock ordering, and is already done in just a few places
- or am I thinking of free_swap_and_cache() in the old days before
find_get_page() did lockless lookup? But you didn't suggest that way,
because it's more awkward in the __remove_mapping() case: I expect
that could be worked around with an initial PageSwapCache check,
taking swap locks there first (not inside swapcache_free()) -
__remove_mapping()'s BUG_ON(!PageLocked) implies that won't be racy.

But either way round, why? What would be the advantage in doing so?
A more conventional nesting of locks, easier to describe and understand,
yes. But from a performance point of view, thinking of lock contention,
nothing but disadvantage. And don't forget the get_swap_page() end:
there it would be harder to deal with both locks together (at least
in the shmem case).

Hugh
Patch

=====================================================
Process: kworker/u16:8, cpu: 3 pid: 20289 start: 0xffffffc0385f8e00
=====================================================
    Task name: kworker/u16:8 pid: 20289 cpu: 3 start: ffffffc0385f8e00
    state: 0x0 exit_state: 0x0 stack base: 0xffffffc012ba0000 Prio: 120
    Stack:
    [<ffffff80bca861a4>] __switch_to+0x90
    [<ffffff80bd83eddc>] __schedule+0x29c
    [<ffffff80bd83f600>] preempt_schedule_common+0x24
    [<ffffff80bd83f63c>] preempt_schedule.part.169+0x1c
    [<ffffff80bd83f664>] preempt_schedule+0x20
    [<ffffff80bd84396c>] _raw_spin_unlock_irqrestore+0x40
    [<ffffff80bcbc4710>] __remove_mapping+0x174
    [<ffffff80bcbc7698>] shrink_page_list+0x894
    [<ffffff80bcbc7d7c>] reclaim_pages_from_list+0xc8
    [<ffffff80bcc7b910>] reclaim_pte_range+0x158
    [<ffffff80bcbf45d4>] walk_pgd_range+0xd4
    [<ffffff80bcbf476c>] walk_page_range+0x74
    [<ffffff80bcc7cd64>] reclaim_task_anon+0xdc
    [<ffffff80bcc0a4c4>] swap_fn+0x1b8
    [<ffffff80bcac2e88>] process_one_work+0x168
    [<ffffff80bcac33a0>] worker_thread+0x224
    [<ffffff80bcac9864>] kthread+0xe0
    [<ffffff80bca836e0>] ret_from_fork+0x10
 
TASK B:
[535478.724249] CPU: 3 PID: 4645 Comm: AudioTrack Tainted: GF    UD W  O 4.9.82-perf+ #1
[535478.724385] Hardware name: Qualcomm Technologies, Inc. SDM450 PMI632 MTP S3 (DT)
[535478.724479] task: ffffffc026ce2a00 task.stack: ffffffc012e14000
[535478.724537] PC is at __read_swap_cache_async+0x154/0x25c
[535478.724630] LR is at __read_swap_cache_async+0x9c/0x25c
...
[535478.735546] [<ffffff80bcbf9970>] __read_swap_cache_async+0x154/0x25c
[535478.735599] [<ffffff80bcbf9a98>] read_swap_cache_async+0x20/0x54
[535478.735697] [<ffffff80bcbf9b24>] swapin_readahead+0x58/0x218
[535478.735797] [<ffffff80bcbe5240>] do_swap_page+0x3c4/0x4d0
[535478.735850] [<ffffff80bcbe6bf8>] handle_mm_fault+0x364/0xba4
[535478.735949] [<ffffff80bca9b5a8>] do_page_fault+0x2a0/0x38c
[535478.736003] [<ffffff80bca9b79c>] do_translation_fault+0x40/0x48
[535478.736100] [<ffffff80bca81340>] do_mem_abort+0x50/0xc8
 
Change-Id: I36d9df7ccff77c589b7157225410269c675a8504
Signed-off-by: zhaowuyun <zhaowuyun@wingtech.com>
---
mm/vmscan.c | 9 +++++++++
1 file changed, 9 insertions(+)
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2740973..acede002 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -674,6 +674,12 @@  static int __remove_mapping(struct address_space *mapping, struct page *page,
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
+ /*
+ * Disable preemption so that this task cannot be preempted between
+ * __delete_from_swap_cache() and swapcache_free(), which would leave
+ * another task spinning on this entry in __read_swap_cache_async()
+ */
+ preempt_disable();
spin_lock_irqsave(&mapping->tree_lock, flags);
/*
* The non racy check for a busy page.
@@ -714,6 +720,7 @@  static int __remove_mapping(struct address_space *mapping, struct page *page,