
[-mm,-V7] mm, swap: fix race between swapoff and some swap operations

Message ID 20190211083846.18888-1-ying.huang@intel.com (mailing list archive)
State New, archived

Commit Message

Huang, Ying Feb. 11, 2019, 8:38 a.m. UTC
From: Huang Ying <ying.huang@intel.com>

When swapin is performed, after getting the swap entry information from
the page table, the system will swap in the swap entry without any lock
held to prevent the swap device from being swapped off.  This may cause
a race like the one below:

CPU 1				CPU 2
-----				-----
				do_swap_page
				  swapin_readahead
				    __read_swap_cache_async
swapoff				      swapcache_prepare
  p->swap_map = NULL		        __swap_duplicate
					  p->swap_map[?] /* !!! NULL pointer access */

Because swapoff is usually done only at system shutdown, the race may
not hit many people in practice.  But it is still a race that needs to
be fixed.

To fix the race, get_swap_device() is added to check whether the
specified swap entry is valid in its swap device.  If so, it will keep
the swap entry valid by preventing the swap device from being swapped
off, until put_swap_device() is called.
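
For illustration, a typical caller wraps its access to the swap data
structures like the reworked __swap_count() in the diff below (condensed
sketch):

int __swap_count(swp_entry_t entry)
{
	struct swap_info_struct *si;
	int count = 0;

	si = get_swap_device(entry);
	if (si) {
		/* swapoff cannot free si->swap_map while we hold si */
		count = swap_count(si->swap_map[swp_offset(entry)]);
		put_swap_device(si);
	}
	return count;
}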

Because swapoff() is a very rare code path, to make the normal path run
as fast as possible, disabling preemption + stop_machine() instead of a
reference count is used to implement get/put_swap_device().  From
get_swap_device() to put_swap_device(), preemption is disabled, so
stop_machine() in swapoff() will wait until put_swap_device() is called.
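
Condensed, the scheme is as follows (a sketch; the full implementation,
including error reporting, is in the diff below):

struct swap_info_struct *get_swap_device(swp_entry_t entry)
{
	struct swap_info_struct *si = swp_swap_info(entry);

	if (!si)
		return NULL;
	preempt_disable();
	if (!(si->flags & SWP_VALID) || swp_offset(entry) >= si->max) {
		preempt_enable();
		return NULL;
	}
	return si;
}

static inline void put_swap_device(struct swap_info_struct *si)
{
	preempt_enable();
}

/*
 * swapoff() side (sketch): after clearing SWP_VALID under swap_lock and
 * p->lock, swapoff() does
 *
 *	stop_machine(swap_onoff_stop, NULL, cpu_online_mask);
 *
 * after which no CPU can be between get_swap_device() and
 * put_swap_device() while still seeing SWP_VALID set, so swap_map,
 * cluster_info and the swap cache can be freed safely.
 */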

In addition to the swap_map, cluster_info, etc. data structures in
struct swap_info_struct, the swap cache radix tree will be freed after
swapoff, so this patch fixes the race between swap cache lookup and
swapoff too.

Races between swapoff and some other swap cache usages that are
protected by disabling preemption are fixed too, by calling
stop_machine() between clearing PageSwapCache() and freeing the swap
cache data structure.

An alternative implementation could be to replace disabling preemption
with rcu_read_lock_sched() and stop_machine() with synchronize_sched().

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Not-Nacked-by: Hugh Dickins <hughd@google.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>

Changelog:

v7:

- Rebased on patch: "mm, swap: bounds check swap_info accesses to avoid NULL derefs"

v6:

- Add more comments to get_swap_device() to make it more clear about
  possible swapoff or swapoff+swapon.

v5:

- Replace RCU with stop_machine()

v4:

- Use synchronize_rcu() in enable_swap_info() to reduce overhead of
  normal paths further.

v3:

- Re-implemented with RCU to reduce the overhead of normal paths

v2:

- Re-implemented with SRCU to reduce the overhead of normal paths.

- Avoid checking whether the swap device has been swapped off in
  get_swap_device(), because we can check the origin of the swap
  entry to make sure the swap device hasn't been swapped off.
---
 include/linux/swap.h |  13 +++-
 mm/memory.c          |   2 +-
 mm/swap_state.c      |  16 ++++-
 mm/swapfile.c        | 150 ++++++++++++++++++++++++++++++++++---------
 4 files changed, 145 insertions(+), 36 deletions(-)

Comments

Daniel Jordan Feb. 11, 2019, 7:06 p.m. UTC | #1
On Mon, Feb 11, 2019 at 04:38:46PM +0800, Huang, Ying wrote:
> +struct swap_info_struct *get_swap_device(swp_entry_t entry)
> +{
> +	struct swap_info_struct *si;
> +	unsigned long type, offset;
> +
> +	if (!entry.val)
> +		goto out;

> +	type = swp_type(entry);
> +	si = swap_type_to_swap_info(type);

These lines can be collapsed into swp_swap_info if you want.

> +	if (!si)
> +		goto bad_nofile;
> +
> +	preempt_disable();
> +	if (!(si->flags & SWP_VALID))
> +		goto unlock_out;

After Hugh alluded to barriers, it seems the read of SWP_VALID could be
reordered with the write in preempt_disable at runtime.  Without smp_mb()
between the two, couldn't this happen, however unlikely a race it is?

CPU0                                CPU1

__swap_duplicate()
    get_swap_device()
        // sees SWP_VALID set
                                   swapoff
                                       p->flags &= ~SWP_VALID;
                                       spin_unlock(&p->lock); // pair w/ smp_mb
                                       ...
                                       stop_machine(...)
                                       p->swap_map = NULL;
        preempt_disable()
    read NULL p->swap_map
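
Concretely, the smp_mb() being asked about would sit between the two,
i.e. (hypothetical placement, not in the patch):

	preempt_disable();
	smp_mb();	/* keep the SWP_VALID read below from being
			 * reordered before the preempt_disable()? */
	if (!(si->flags & SWP_VALID))
		goto unlock_out;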
Andrea Parri Feb. 12, 2019, 3:21 a.m. UTC | #2
> > +	if (!si)
> > +		goto bad_nofile;
> > +
> > +	preempt_disable();
> > +	if (!(si->flags & SWP_VALID))
> > +		goto unlock_out;
> 
> After Hugh alluded to barriers, it seems the read of SWP_VALID could be
> reordered with the write in preempt_disable at runtime.  Without smp_mb()
> between the two, couldn't this happen, however unlikely a race it is?
> 
> CPU0                                CPU1
> 
> __swap_duplicate()
>     get_swap_device()
>         // sees SWP_VALID set
>                                    swapoff
>                                        p->flags &= ~SWP_VALID;
>                                        spin_unlock(&p->lock); // pair w/ smp_mb
>                                        ...
>                                        stop_machine(...)
>                                        p->swap_map = NULL;
>         preempt_disable()
>     read NULL p->swap_map


I don't think that that smp_mb() is necessary.  I elaborate:

An important piece of information, I think, that is missing in the
diagram above is the stopper thread which executes the work queued
by stop_machine().  We have two cases to consider, that is,

  1) the stopper is "executed before" the preempt-disable section

	CPU0

	cpu_stopper_thread()
	...
	preempt_disable()
	...
	preempt_enable()

  2) the stopper is "executed after" the preempt-disable section

	CPU0

	preempt_disable()
	...
	preempt_enable()
	...
	cpu_stopper_thread()

Notice that the reads from p->flags and p->swap_map in CPU0 cannot
cross cpu_stopper_thread().  The claim is that CPU0 sees SWP_VALID
unset in (1) and that it sees a non-NULL p->swap_map in (2).

I consider the two cases separately:

  1) CPU1 unsets SWP_VALID, it locks the stopper's lock, and it
     queues the stopper work; CPU0 locks the stopper's lock, it
     dequeues this work, and it reads from p->flags.

     Diagrammatically, we have the following MP-like pattern:

	CPU0				CPU1

	lock(stopper->lock)		p->flags &= ~SWP_VALID
	get @work			lock(stopper->lock)
	unlock(stopper->lock)		add @work
	reads p->flags 			unlock(stopper->lock)

     where CPU0 must see SWP_VALID unset (if CPU0 sees the work
     added by CPU1).

  2) CPU0 reads from p->swap_map, it locks the completion lock,
     and it signals completion; CPU1 locks the completion lock,
     it checks for completion, and it writes to p->swap_map.

     (If CPU0 doesn't signal the completion, or CPU1 doesn't see
     the completion, then CPU1 will have to iterate the read and
     to postpone the control-dependent write to p->swap_map.)

     Diagrammatically, we have the following LB-like pattern:

	CPU0				CPU1

	reads p->swap_map		lock(completion)
	lock(completion)		read completion->done
	completion->done++		unlock(completion)
	unlock(completion)		p->swap_map = NULL

     where CPU0 must see a non-NULL p->swap_map if CPU1 sees the
     completion from CPU0.
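
As a cross-check, case (1) is essentially the MP+polocks pattern from
the LKMM litmus tests.  A litmus-style sketch of it (variable names are
illustrative, not the kernel's) could be:

C swapoff-stopper-mp

(*
 * P0 models swapoff: it invalidates the device (the clearing of
 * SWP_VALID) and then queues the stopper work under the stopper lock.
 * P1 models the stopper side: it dequeues the work under the same lock,
 * and the fast path's flags read follows.  The "exists" outcome -- the
 * work was seen but the invalidation was not -- must be forbidden.
 *)

{}

P0(int *invalidated, int *work, spinlock_t *s)
{
	WRITE_ONCE(*invalidated, 1);
	spin_lock(s);
	WRITE_ONCE(*work, 1);
	spin_unlock(s);
}

P1(int *invalidated, int *work, spinlock_t *s)
{
	int r0;
	int r1;

	spin_lock(s);
	r0 = READ_ONCE(*work);
	spin_unlock(s);
	r1 = READ_ONCE(*invalidated);
}

exists (1:r0=1 /\ 1:r1=0)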

Does this make sense?

  Andrea
Huang, Ying Feb. 12, 2019, 6:40 a.m. UTC | #3
Daniel Jordan <daniel.m.jordan@oracle.com> writes:

> On Mon, Feb 11, 2019 at 04:38:46PM +0800, Huang, Ying wrote:
>> +struct swap_info_struct *get_swap_device(swp_entry_t entry)
>> +{
>> +	struct swap_info_struct *si;
>> +	unsigned long type, offset;
>> +
>> +	if (!entry.val)
>> +		goto out;
>
>> +	type = swp_type(entry);
>> +	si = swap_type_to_swap_info(type);
>
> These lines can be collapsed into swp_swap_info if you want.

Yes.  I can use that function to reduce another line from the patch.
Thanks!  Will do that.

>> +	if (!si)
>> +		goto bad_nofile;
>> +
>> +	preempt_disable();
>> +	if (!(si->flags & SWP_VALID))
>> +		goto unlock_out;
>
> After Hugh alluded to barriers, it seems the read of SWP_VALID could be
> reordered with the write in preempt_disable at runtime.  Without smp_mb()
> between the two, couldn't this happen, however unlikely a race it is?
>
> CPU0                                CPU1
>
> __swap_duplicate()
>     get_swap_device()
>         // sees SWP_VALID set
>                                    swapoff
>                                        p->flags &= ~SWP_VALID;
>                                        spin_unlock(&p->lock); // pair w/ smp_mb
>                                        ...
>                                        stop_machine(...)
>                                        p->swap_map = NULL;
>         preempt_disable()
>     read NULL p->swap_map

Andrea has helped to explain this.

Best Regards,
Huang, Ying
Huang, Ying Feb. 12, 2019, 6:47 a.m. UTC | #4
Andrea Parri <andrea.parri@amarulasolutions.com> writes:

>> > +	if (!si)
>> > +		goto bad_nofile;
>> > +
>> > +	preempt_disable();
>> > +	if (!(si->flags & SWP_VALID))
>> > +		goto unlock_out;
>> 
>> After Hugh alluded to barriers, it seems the read of SWP_VALID could be
>> reordered with the write in preempt_disable at runtime.  Without smp_mb()
>> between the two, couldn't this happen, however unlikely a race it is?
>> 
>> CPU0                                CPU1
>> 
>> __swap_duplicate()
>>     get_swap_device()
>>         // sees SWP_VALID set
>>                                    swapoff
>>                                        p->flags &= ~SWP_VALID;
>>                                        spin_unlock(&p->lock); // pair w/ smp_mb
>>                                        ...
>>                                        stop_machine(...)
>>                                        p->swap_map = NULL;
>>         preempt_disable()
>>     read NULL p->swap_map
>
>
> I don't think that that smp_mb() is necessary.  I elaborate:
>
> An important piece of information, I think, that is missing in the
> diagram above is the stopper thread which executes the work queued
> by stop_machine().  We have two cases to consider, that is,
>
>   1) the stopper is "executed before" the preempt-disable section
>
> 	CPU0
>
> 	cpu_stopper_thread()
> 	...
> 	preempt_disable()
> 	...
> 	preempt_enable()
>
>   2) the stopper is "executed after" the preempt-disable section
>
> 	CPU0
>
> 	preempt_disable()
> 	...
> 	preempt_enable()
> 	...
> 	cpu_stopper_thread()
>
> Notice that the reads from p->flags and p->swap_map in CPU0 cannot
> cross cpu_stopper_thread().  The claim is that CPU0 sees SWP_VALID
> unset in (1) and that it sees a non-NULL p->swap_map in (2).
>
> I consider the two cases separately:
>
>   1) CPU1 unsets SWP_VALID, it locks the stopper's lock, and it
>      queues the stopper work; CPU0 locks the stopper's lock, it
>      dequeues this work, and it reads from p->flags.
>
>      Diagrammatically, we have the following MP-like pattern:
>
> 	CPU0				CPU1
>
> 	lock(stopper->lock)		p->flags &= ~SWP_VALID
> 	get @work			lock(stopper->lock)
> 	unlock(stopper->lock)		add @work
> 	reads p->flags 			unlock(stopper->lock)
>
>      where CPU0 must see SWP_VALID unset (if CPU0 sees the work
>      added by CPU1).
>
>   2) CPU0 reads from p->swap_map, it locks the completion lock,
>      and it signals completion; CPU1 locks the completion lock,
>      it checks for completion, and it writes to p->swap_map.
>
>      (If CPU0 doesn't signal the completion, or CPU1 doesn't see
>      the completion, then CPU1 will have to iterate the read and
>      to postpone the control-dependent write to p->swap_map.)
>
>      Diagrammatically, we have the following LB-like pattern:
>
> 	CPU0				CPU1
>
> 	reads p->swap_map		lock(completion)
> 	lock(completion)		read completion->done
> 	completion->done++		unlock(completion)
> 	unlock(completion)		p->swap_map = NULL
>
>      where CPU0 must see a non-NULL p->swap_map if CPU1 sees the
>      completion from CPU0.
>
> Does this make sense?

Thanks a lot for detailed explanation!

Best Regards,
Huang, Ying

>   Andrea
Andrea Parri Feb. 12, 2019, 10:13 a.m. UTC | #5
> Alternative implementation could be replacing disable preemption with
> rcu_read_lock_sched and stop_machine() with synchronize_sched().

JFYI, starting with v4.20-rc1, synchronize_rcu{,expedited}() also wait
for preempt-disable sections (the intent seems to retire the RCU-sched
update-side API), so that here you could instead use preempt-disable +
synchronize_rcu{,expedited}().  This LWN article gives an overview of
the latest RCU API/semantics changes: https://lwn.net/Articles/777036/.

  Andrea
Tim Chen Feb. 12, 2019, 5:58 p.m. UTC | #6
On 2/11/19 10:47 PM, Huang, Ying wrote:
> Andrea Parri <andrea.parri@amarulasolutions.com> writes:
> 
>>>> +	if (!si)
>>>> +		goto bad_nofile;
>>>> +
>>>> +	preempt_disable();
>>>> +	if (!(si->flags & SWP_VALID))
>>>> +		goto unlock_out;
>>>
>>> After Hugh alluded to barriers, it seems the read of SWP_VALID could be
>>> reordered with the write in preempt_disable at runtime.  Without smp_mb()
>>> between the two, couldn't this happen, however unlikely a race it is?
>>>
>>> CPU0                                CPU1
>>>
>>> __swap_duplicate()
>>>     get_swap_device()
>>>         // sees SWP_VALID set
>>>                                    swapoff
>>>                                        p->flags &= ~SWP_VALID;
>>>                                        spin_unlock(&p->lock); // pair w/ smp_mb
>>>                                        ...
>>>                                        stop_machine(...)
>>>                                        p->swap_map = NULL;
>>>         preempt_disable()
>>>     read NULL p->swap_map
>>
>>
>> I don't think that that smp_mb() is necessary.  I elaborate:
>>
>> An important piece of information, I think, that is missing in the
>> diagram above is the stopper thread which executes the work queued
>> by stop_machine().  We have two cases to consider, that is,
>>
>>   1) the stopper is "executed before" the preempt-disable section
>>
>> 	CPU0
>>
>> 	cpu_stopper_thread()
>> 	...
>> 	preempt_disable()
>> 	...
>> 	preempt_enable()
>>
>>   2) the stopper is "executed after" the preempt-disable section
>>
>> 	CPU0
>>
>> 	preempt_disable()
>> 	...
>> 	preempt_enable()
>> 	...
>> 	cpu_stopper_thread()
>>
>> Notice that the reads from p->flags and p->swap_map in CPU0 cannot
>> cross cpu_stopper_thread().  The claim is that CPU0 sees SWP_VALID
>> unset in (1) and that it sees a non-NULL p->swap_map in (2).
>>
>> I consider the two cases separately:
>>
>>   1) CPU1 unsets SWP_VALID, it locks the stopper's lock, and it
>>      queues the stopper work; CPU0 locks the stopper's lock, it
>>      dequeues this work, and it reads from p->flags.
>>
>>      Diagrammatically, we have the following MP-like pattern:
>>
>> 	CPU0				CPU1
>>
>> 	lock(stopper->lock)		p->flags &= ~SWP_VALID
>> 	get @work			lock(stopper->lock)
>> 	unlock(stopper->lock)		add @work
>> 	reads p->flags 			unlock(stopper->lock)
>>
>>      where CPU0 must see SWP_VALID unset (if CPU0 sees the work
>>      added by CPU1).
>>
>>   2) CPU0 reads from p->swap_map, it locks the completion lock,
>>      and it signals completion; CPU1 locks the completion lock,
>>      it checks for completion, and it writes to p->swap_map.
>>
>>      (If CPU0 doesn't signal the completion, or CPU1 doesn't see
>>      the completion, then CPU1 will have to iterate the read and
>>      to postpone the control-dependent write to p->swap_map.)
>>
>>      Diagrammatically, we have the following LB-like pattern:
>>
>> 	CPU0				CPU1
>>
>> 	reads p->swap_map		lock(completion)
>> 	lock(completion)		read completion->done
>> 	completion->done++		unlock(completion)
>> 	unlock(completion)		p->swap_map = NULL
>>
>>      where CPU0 must see a non-NULL p->swap_map if CPU1 sees the
>>      completion from CPU0.
>>
>> Does this make sense?
> 
> Thanks a lot for detailed explanation!

This is certainly a non-trivial explanation of why memory barrier is not
needed.  Can we put it in the commit log and mention something in
comments on why we don't need memory barrier?

Thanks.

Tim
Daniel Jordan Feb. 12, 2019, 8:06 p.m. UTC | #7
On Tue, Feb 12, 2019 at 04:21:21AM +0100, Andrea Parri wrote:
> > > +	if (!si)
> > > +		goto bad_nofile;
> > > +
> > > +	preempt_disable();
> > > +	if (!(si->flags & SWP_VALID))
> > > +		goto unlock_out;
> > 
> > After Hugh alluded to barriers, it seems the read of SWP_VALID could be
> > reordered with the write in preempt_disable at runtime.  Without smp_mb()
> > between the two, couldn't this happen, however unlikely a race it is?
> > 
> > CPU0                                CPU1
> > 
> > __swap_duplicate()
> >     get_swap_device()
> >         // sees SWP_VALID set
> >                                    swapoff
> >                                        p->flags &= ~SWP_VALID;
> >                                        spin_unlock(&p->lock); // pair w/ smp_mb
> >                                        ...
> >                                        stop_machine(...)
> >                                        p->swap_map = NULL;
> >         preempt_disable()
> >     read NULL p->swap_map
> 
> 
> I don't think that that smp_mb() is necessary.  I elaborate:
> 
> An important piece of information, I think, that is missing in the
> diagram above is the stopper thread which executes the work queued
> by stop_machine().  We have two cases to consider, that is,
> 
>   1) the stopper is "executed before" the preempt-disable section
> 
> 	CPU0
> 
> 	cpu_stopper_thread()
> 	...
> 	preempt_disable()
> 	...
> 	preempt_enable()
> 
>   2) the stopper is "executed after" the preempt-disable section
> 
> 	CPU0
> 
> 	preempt_disable()
> 	...
> 	preempt_enable()
> 	...
> 	cpu_stopper_thread()
> 
> Notice that the reads from p->flags and p->swap_map in CPU0 cannot
> cross cpu_stopper_thread().  The claim is that CPU0 sees SWP_VALID
> unset in (1) and that it sees a non-NULL p->swap_map in (2).
> 
> I consider the two cases separately:
> 
>   1) CPU1 unsets SWP_VALID, it locks the stopper's lock, and it
>      queues the stopper work; CPU0 locks the stopper's lock, it
>      dequeues this work, and it reads from p->flags.
> 
>      Diagrammatically, we have the following MP-like pattern:
> 
> 	CPU0				CPU1
> 
> 	lock(stopper->lock)		p->flags &= ~SWP_VALID
> 	get @work			lock(stopper->lock)
> 	unlock(stopper->lock)		add @work
> 	reads p->flags 			unlock(stopper->lock)
> 
>      where CPU0 must see SWP_VALID unset (if CPU0 sees the work
>      added by CPU1).
> 
>   2) CPU0 reads from p->swap_map, it locks the completion lock,
>      and it signals completion; CPU1 locks the completion lock,
>      it checks for completion, and it writes to p->swap_map.
> 
>      (If CPU0 doesn't signal the completion, or CPU1 doesn't see
>      the completion, then CPU1 will have to iterate the read and
>      to postpone the control-dependent write to p->swap_map.)
> 
>      Diagrammatically, we have the following LB-like pattern:
> 
> 	CPU0				CPU1
> 
> 	reads p->swap_map		lock(completion)
> 	lock(completion)		read completion->done
> 	completion->done++		unlock(completion)
> 	unlock(completion)		p->swap_map = NULL
> 
>      where CPU0 must see a non-NULL p->swap_map if CPU1 sees the
>      completion from CPU0.
> 
> Does this make sense?

Yes, thanks for this, Andrea!  Good that smp_mb isn't needed.
Huang, Ying Feb. 13, 2019, 3:23 a.m. UTC | #8
Tim Chen <tim.c.chen@linux.intel.com> writes:

> On 2/11/19 10:47 PM, Huang, Ying wrote:
>> Andrea Parri <andrea.parri@amarulasolutions.com> writes:
>> 
>>>>> +	if (!si)
>>>>> +		goto bad_nofile;
>>>>> +
>>>>> +	preempt_disable();
>>>>> +	if (!(si->flags & SWP_VALID))
>>>>> +		goto unlock_out;
>>>>
>>>> After Hugh alluded to barriers, it seems the read of SWP_VALID could be
>>>> reordered with the write in preempt_disable at runtime.  Without smp_mb()
>>>> between the two, couldn't this happen, however unlikely a race it is?
>>>>
>>>> CPU0                                CPU1
>>>>
>>>> __swap_duplicate()
>>>>     get_swap_device()
>>>>         // sees SWP_VALID set
>>>>                                    swapoff
>>>>                                        p->flags &= ~SWP_VALID;
>>>>                                        spin_unlock(&p->lock); // pair w/ smp_mb
>>>>                                        ...
>>>>                                        stop_machine(...)
>>>>                                        p->swap_map = NULL;
>>>>         preempt_disable()
>>>>     read NULL p->swap_map
>>>
>>>
>>> I don't think that that smp_mb() is necessary.  I elaborate:
>>>
>>> An important piece of information, I think, that is missing in the
>>> diagram above is the stopper thread which executes the work queued
>>> by stop_machine().  We have two cases to consider, that is,
>>>
>>>   1) the stopper is "executed before" the preempt-disable section
>>>
>>> 	CPU0
>>>
>>> 	cpu_stopper_thread()
>>> 	...
>>> 	preempt_disable()
>>> 	...
>>> 	preempt_enable()
>>>
>>>   2) the stopper is "executed after" the preempt-disable section
>>>
>>> 	CPU0
>>>
>>> 	preempt_disable()
>>> 	...
>>> 	preempt_enable()
>>> 	...
>>> 	cpu_stopper_thread()
>>>
>>> Notice that the reads from p->flags and p->swap_map in CPU0 cannot
>>> cross cpu_stopper_thread().  The claim is that CPU0 sees SWP_VALID
>>> unset in (1) and that it sees a non-NULL p->swap_map in (2).
>>>
>>> I consider the two cases separately:
>>>
>>>   1) CPU1 unsets SWP_VALID, it locks the stopper's lock, and it
>>>      queues the stopper work; CPU0 locks the stopper's lock, it
>>>      dequeues this work, and it reads from p->flags.
>>>
>>>      Diagrammatically, we have the following MP-like pattern:
>>>
>>> 	CPU0				CPU1
>>>
>>> 	lock(stopper->lock)		p->flags &= ~SWP_VALID
>>> 	get @work			lock(stopper->lock)
>>> 	unlock(stopper->lock)		add @work
>>> 	reads p->flags 			unlock(stopper->lock)
>>>
>>>      where CPU0 must see SWP_VALID unset (if CPU0 sees the work
>>>      added by CPU1).
>>>
>>>   2) CPU0 reads from p->swap_map, it locks the completion lock,
>>>      and it signals completion; CPU1 locks the completion lock,
>>>      it checks for completion, and it writes to p->swap_map.
>>>
>>>      (If CPU0 doesn't signal the completion, or CPU1 doesn't see
>>>      the completion, then CPU1 will have to iterate the read and
>>>      to postpone the control-dependent write to p->swap_map.)
>>>
>>>      Diagrammatically, we have the following LB-like pattern:
>>>
>>> 	CPU0				CPU1
>>>
>>> 	reads p->swap_map		lock(completion)
>>> 	lock(completion)		read completion->done
>>> 	completion->done++		unlock(completion)
>>> 	unlock(completion)		p->swap_map = NULL
>>>
>>>      where CPU0 must see a non-NULL p->swap_map if CPU1 sees the
>>>      completion from CPU0.
>>>
>>> Does this make sense?
>> 
>> Thanks a lot for detailed explanation!
>
> This is certainly a non-trivial explanation of why memory barrier is not
> needed.  Can we put it in the commit log and mention something in
> comments on why we don't need memory barrier?

Good idea!  Will do this.

Best Regards,
Huang, Ying

> Thanks.
>
> Tim
Andrea Arcangeli Feb. 14, 2019, 2:38 a.m. UTC | #9
Hello everyone,

On Mon, Feb 11, 2019 at 04:38:46PM +0800, Huang, Ying wrote:
> @@ -2386,7 +2463,17 @@ static void enable_swap_info(struct swap_info_struct *p, int prio,
>  	frontswap_init(p->type, frontswap_map);
>  	spin_lock(&swap_lock);
>  	spin_lock(&p->lock);
> -	 _enable_swap_info(p, prio, swap_map, cluster_info);
> +	setup_swap_info(p, prio, swap_map, cluster_info);
> +	spin_unlock(&p->lock);
> +	spin_unlock(&swap_lock);
> +	/*
> +	 * Guarantee swap_map, cluster_info, etc. fields are used
> +	 * between get/put_swap_device() only if SWP_VALID bit is set
> +	 */
> +	stop_machine(swap_onoff_stop, NULL, cpu_online_mask);

Should cpu_online_mask be read while holding cpus_read_lock?

	cpus_read_lock();
	err = __stop_machine(swap_onoff_stop, NULL, cpu_online_mask);
	cpus_read_unlock();

I missed what the exact motivation was for the switch from
rcu_read_lock()/synchronize_rcu() to preempt_disable()/stop_machine().

It looks like the above stop_machine all it does is to reach a
quiescent point, when you've RCU that already can reach the quiescent
point without an explicit stop_machine.

The reason both implementations look basically the same is that the
stop_machine() dummy call of swap_onoff_stop() { /* noop */ } will only
reach a quiescent point faster than RCU; it's otherwise functionally
identical to RCU, but much more expensive.  If it wasn't functionally
identical, stop_machine() couldn't be used as a drop-in replacement of
synchronize_sched() in the previous patch.

I don't see the point of worrying about the synchronize_rcu latency in
swapoff when RCU is basically identical and not more complex.

So to be clear, I'm not against stop_machine() but with stop_machine()
method invoked in all CPUs, you can actually do more than RCU and you
can remove real locking not just reach a quiescent point.

With stop_machine() the code would need reshuffling around so that the
actual p->swap_map = NULL happens inside stop_machine, not outside
like with RCU.

With RCU all code stays concurrent at all times, simply the race is
controlled, as opposed with stop_machine() you can make fully
serialize and run like in UP temporarily (vs all preempt_disable()
section at least).

For example nr_swapfiles could in theory become a constant under
preempt_disable() with stop_machine() without having to take a
swap_lock.

swap_onoff_stop can be implemented like this:

enum {
	FIRST_STOP_MACHINE_INIT,
	FIRST_STOP_MACHINE_START,
	FIRST_STOP_MACHINE_END,
};
static int first_stop_machine;
static int swap_onoff_stop(void *data)
{
	struct swap_stop_machine *swsm = (struct swap_stop_machine *)data;
	int first;

	first = cmpxchg(&first_stop_machine, FIRST_STOP_MACHINE_INIT,
			FIRST_STOP_MACHINE_START);
	if (first == FIRST_STOP_MACHINE_INIT) {
		swsm->p->swap_map = NULL;
		/* add more stuff here until swap_lock goes away */
		smp_wmb();
		WRITE_ONCE(first_stop_machine, FIRST_STOP_MACHINE_END);
	} else {
		do {
			cpu_relax();
		} while (READ_ONCE(first_stop_machine) !=
			 FIRST_STOP_MACHINE_END);
		smp_rmb();
	}

	return 0;
}

stop_machine() invoked with a method like the above will guarantee that
while we set p->swap_map to NULL (and while we do nr_swapfiles++)
nothing else can run, not even interrupts, so some locks may just
disappear.  Only NMI and SMI could possibly run concurrently with the
swsm->p->swap_map = NULL operation.

If we have to keep swap_onoff_stop() a dummy function run on all CPUs
just to reach a quiescent point, then I don't see why the
synchronize_rcu() (or synchronize_sched() or synchronize_kernel() or
whatever it is called right now, but still RCU) solution isn't
preferable.

Thanks,
Andrea
Huang, Ying Feb. 14, 2019, 8:07 a.m. UTC | #10
Hi, Andrea,

Andrea Arcangeli <aarcange@redhat.com> writes:

> Hello everyone,
>
> On Mon, Feb 11, 2019 at 04:38:46PM +0800, Huang, Ying wrote:
>> @@ -2386,7 +2463,17 @@ static void enable_swap_info(struct swap_info_struct *p, int prio,
>>  	frontswap_init(p->type, frontswap_map);
>>  	spin_lock(&swap_lock);
>>  	spin_lock(&p->lock);
>> -	 _enable_swap_info(p, prio, swap_map, cluster_info);
>> +	setup_swap_info(p, prio, swap_map, cluster_info);
>> +	spin_unlock(&p->lock);
>> +	spin_unlock(&swap_lock);
>> +	/*
>> +	 * Guarantee swap_map, cluster_info, etc. fields are used
>> +	 * between get/put_swap_device() only if SWP_VALID bit is set
>> +	 */
>> +	stop_machine(swap_onoff_stop, NULL, cpu_online_mask);
>
> Should cpu_online_mask be read while holding cpus_read_lock?
>
> 	cpus_read_lock();
> 	err = __stop_machine(swap_onoff_stop, NULL, cpu_online_mask);
> 	cpus_read_unlock();

Thanks for pointing this out.  Because swap_onoff_stop() is just a dummy
function, something like the below should be sufficient.

stop_machine(swap_onoff_stop, NULL, NULL);

> I missed what the exact motivation was for the switch from
> rcu_read_lock()/synchronize_rcu() to preempt_disable()/stop_machine().
>
> It looks like the above stop_machine all it does is to reach a
> quiescent point, when you've RCU that already can reach the quiescent
> point without an explicit stop_machine.
>
> The reason both implementations look basically the same is that the
> stop_machine() dummy call of swap_onoff_stop() { /* noop */ } will only
> reach a quiescent point faster than RCU; it's otherwise functionally
> identical to RCU, but much more expensive.  If it wasn't functionally
> identical, stop_machine() couldn't be used as a drop-in replacement of
> synchronize_sched() in the previous patch.
>
> I don't see the point of worrying about the synchronize_rcu latency in
> swapoff when RCU is basically identical and not more complex.
>
> So to be clear, I'm not against stop_machine() but with stop_machine()
> method invoked in all CPUs, you can actually do more than RCU and you
> can remove real locking not just reach a quiescent point.
>
> With stop_machine() the code would need reshuffling around so that the
> actual p->swap_map = NULL happens inside stop_machine, not outside
> like with RCU.
>
> With RCU all code stays concurrent at all times, simply the race is
> controlled, as opposed with stop_machine() you can make fully
> serialize and run like in UP temporarily (vs all preempt_disable()
> section at least).
>
> For example nr_swapfiles could in theory become a constant under
> preempt_disable() with stop_machine() without having to take a
> swap_lock.
>
> swap_onoff_stop can be implemented like this:
>
> enum {
> 	FIRST_STOP_MACHINE_INIT,
> 	FIRST_STOP_MACHINE_START,
> 	FIRST_STOP_MACHINE_END,
> };
> static int first_stop_machine;
> static int swap_onoff_stop(void *data)
> {
> 	struct swap_stop_machine *swsm = (struct swap_stop_machine *)data;
> 	int first;
>
> 	first = cmpxchg(&first_stop_machine, FIRST_STOP_MACHINE_INIT,
> 			FIRST_STOP_MACHINE_START);
> 	if (first == FIRST_STOP_MACHINE_INIT) {
> 		swsm->p->swap_map = NULL;
> 		/* add more stuff here until swap_lock goes away */
> 		smp_wmb();
> 		WRITE_ONCE(first_stop_machine, FIRST_STOP_MACHINE_END);
> 	} else {
> 		do {
> 			cpu_relax();
> 		} while (READ_ONCE(first_stop_machine) !=
> 			 FIRST_STOP_MACHINE_END);
> 		smp_rmb();
> 	}
>
> 	return 0;
> }
>
> stop_machine() invoked with a method like the above will guarantee that
> while we set p->swap_map to NULL (and while we do nr_swapfiles++)
> nothing else can run, not even interrupts, so some locks may just
> disappear.  Only NMI and SMI could possibly run concurrently with the
> swsm->p->swap_map = NULL operation.
>
> If we have to keep swap_onoff_stop() a dummy function run on all CPUs
> just to reach a quiescent point, then I don't see why the
> synchronize_rcu() (or synchronize_sched() or synchronize_kernel() or
> whatever it is called right now, but still RCU) solution isn't
> preferable.

We only need to keep swap_onoff_stop() a dummy function as above.  So
from a functionality point of view, RCU works for us perfectly too.
Paul pointed that out before too.

Before, we chose to use stop_machine() to reduce the overhead of the hot
path (the page fault handler) as much as possible.  But now, I found
that rcu_read_lock_sched() is just a wrapper of preempt_disable().  So
maybe we can switch to the RCU version now.

Best Regards,
Huang, Ying

> Thanks,
> Andrea
Michal Hocko Feb. 14, 2019, 2:33 p.m. UTC | #11
On Mon 11-02-19 16:38:46, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> When swapin is performed, after getting the swap entry information from
> the page table, the system will swap in the swap entry without any lock
> held to prevent the swap device from being swapped off.  This may cause
> a race like the one below:
> 
> CPU 1				CPU 2
> -----				-----
> 				do_swap_page
> 				  swapin_readahead
> 				    __read_swap_cache_async
> swapoff				      swapcache_prepare
>   p->swap_map = NULL		        __swap_duplicate
> 					  p->swap_map[?] /* !!! NULL pointer access */
> 
> Because swapoff is usually done only at system shutdown, the race may
> not hit many people in practice.  But it is still a race that needs to
> be fixed.
> 
> To fix the race, get_swap_device() is added to check whether the
> specified swap entry is valid in its swap device.  If so, it will keep
> the swap entry valid by preventing the swap device from being swapped
> off, until put_swap_device() is called.
> 
> Because swapoff() is a very rare code path, to make the normal path run
> as fast as possible, disabling preemption + stop_machine() instead of a
> reference count is used to implement get/put_swap_device().  From
> get_swap_device() to put_swap_device(), preemption is disabled, so
> stop_machine() in swapoff() will wait until put_swap_device() is called.
> 
> In addition to the swap_map, cluster_info, etc. data structures in
> struct swap_info_struct, the swap cache radix tree will be freed after
> swapoff, so this patch fixes the race between swap cache lookup and
> swapoff too.
> 
> Races between swapoff and some other swap cache usages that are
> protected by disabling preemption are fixed too, by calling
> stop_machine() between clearing PageSwapCache() and freeing the swap
> cache data structure.
> 
> An alternative implementation could be to replace disabling preemption
> with rcu_read_lock_sched() and stop_machine() with synchronize_sched().

using stop_machine is generally discouraged. It is a gross
synchronization.

Besides that, since when do we have this problem?

> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Not-Nacked-by: Hugh Dickins <hughd@google.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
> Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
> 
> Changelog:
> 
> v7:
> 
> - Rebased on patch: "mm, swap: bounds check swap_info accesses to avoid NULL derefs"
> 
> v6:
> 
> - Add more comments to get_swap_device() to make it more clear about
>   possible swapoff or swapoff+swapon.
> 
> v5:
> 
> - Replace RCU with stop_machine()
> 
> v4:
> 
> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
>   normal paths further.
> 
> v3:
> 
> - Re-implemented with RCU to reduce the overhead of normal paths
> 
> v2:
> 
> - Re-implemented with SRCU to reduce the overhead of normal paths.
> 
> - Avoid checking whether the swap device has been swapped off in
>   get_swap_device(), because we can check the origin of the swap
>   entry to make sure the swap device hasn't been swapped off.
> ---
>  include/linux/swap.h |  13 +++-
>  mm/memory.c          |   2 +-
>  mm/swap_state.c      |  16 ++++-
>  mm/swapfile.c        | 150 ++++++++++++++++++++++++++++++++++---------
>  4 files changed, 145 insertions(+), 36 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 649529be91f2..aecd1430d0a9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -175,8 +175,9 @@ enum {
>  	SWP_PAGE_DISCARD = (1 << 10),	/* freed swap page-cluster discards */
>  	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
>  	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
> +	SWP_VALID	= (1 << 13),	/* swap is valid to be operated on? */
>  					/* add others here before... */
> -	SWP_SCANNING	= (1 << 13),	/* refcount in scan_swap_map */
> +	SWP_SCANNING	= (1 << 14),	/* refcount in scan_swap_map */
>  };
>  
>  #define SWAP_CLUSTER_MAX 32UL
> @@ -460,7 +461,7 @@ extern unsigned int count_swap_pages(int, int);
>  extern sector_t map_swap_page(struct page *, struct block_device **);
>  extern sector_t swapdev_block(int, pgoff_t);
>  extern int page_swapcount(struct page *);
> -extern int __swap_count(struct swap_info_struct *si, swp_entry_t entry);
> +extern int __swap_count(swp_entry_t entry);
>  extern int __swp_swapcount(swp_entry_t entry);
>  extern int swp_swapcount(swp_entry_t entry);
>  extern struct swap_info_struct *page_swap_info(struct page *);
> @@ -470,6 +471,12 @@ extern int try_to_free_swap(struct page *);
>  struct backing_dev_info;
>  extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
>  extern void exit_swap_address_space(unsigned int type);
> +extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
> +
> +static inline void put_swap_device(struct swap_info_struct *si)
> +{
> +	preempt_enable();
> +}
>  
>  #else /* CONFIG_SWAP */
>  
> @@ -576,7 +583,7 @@ static inline int page_swapcount(struct page *page)
>  	return 0;
>  }
>  
> -static inline int __swap_count(struct swap_info_struct *si, swp_entry_t entry)
> +static inline int __swap_count(swp_entry_t entry)
>  {
>  	return 0;
>  }
> diff --git a/mm/memory.c b/mm/memory.c
> index 34ced1369883..9c0743c17c6c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2719,7 +2719,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		struct swap_info_struct *si = swp_swap_info(entry);
>  
>  		if (si->flags & SWP_SYNCHRONOUS_IO &&
> -				__swap_count(si, entry) == 1) {
> +				__swap_count(entry) == 1) {
>  			/* skip swapcache */
>  			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>  							vmf->address);
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 85245fdec8d9..61453f1faf72 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -310,8 +310,13 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
>  			       unsigned long addr)
>  {
>  	struct page *page;
> +	struct swap_info_struct *si;
>  
> +	si = get_swap_device(entry);
> +	if (!si)
> +		return NULL;
>  	page = find_get_page(swap_address_space(entry), swp_offset(entry));
> +	put_swap_device(si);
>  
>  	INC_CACHE_INFO(find_total);
>  	if (page) {
> @@ -354,8 +359,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  			struct vm_area_struct *vma, unsigned long addr,
>  			bool *new_page_allocated)
>  {
> -	struct page *found_page, *new_page = NULL;
> -	struct address_space *swapper_space = swap_address_space(entry);
> +	struct page *found_page = NULL, *new_page = NULL;
> +	struct swap_info_struct *si;
>  	int err;
>  	*new_page_allocated = false;
>  
> @@ -365,7 +370,12 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  		 * called after lookup_swap_cache() failed, re-calling
>  		 * that would confuse statistics.
>  		 */
> -		found_page = find_get_page(swapper_space, swp_offset(entry));
> +		si = get_swap_device(entry);
> +		if (!si)
> +			break;
> +		found_page = find_get_page(swap_address_space(entry),
> +					   swp_offset(entry));
> +		put_swap_device(si);
>  		if (found_page)
>  			break;
>  
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index cca8420b12db..8f92f36814fb 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -38,6 +38,7 @@
>  #include <linux/export.h>
>  #include <linux/swap_slots.h>
>  #include <linux/sort.h>
> +#include <linux/stop_machine.h>
>  
>  #include <asm/pgtable.h>
>  #include <asm/tlbflush.h>
> @@ -1186,6 +1187,64 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
>  	return usage;
>  }
>  
> +/*
> + * Check whether swap entry is valid in the swap device.  If so,
> + * return pointer to swap_info_struct, and keep the swap entry valid
> + * via preventing the swap device from being swapoff, until
> + * put_swap_device() is called.  Otherwise return NULL.
> + *
> + * Notice that swapoff or swapoff+swapon can still happen before the
> + * preempt_disable() in get_swap_device() or after the
> + * preempt_enable() in put_swap_device() if there isn't any other way
> + * to prevent swapoff, such as page lock, page table lock, etc.  The
> + * caller must be prepared for that.  For example, the following
> + * situation is possible.
> + *
> + *   CPU1				CPU2
> + *   do_swap_page()
> + *     ...				swapoff+swapon
> + *     __read_swap_cache_async()
> + *       swapcache_prepare()
> + *         __swap_duplicate()
> + *           // check swap_map
> + *     // verify PTE not changed
> + *
> + * In __swap_duplicate(), the swap_map need to be checked before
> + * changing partly because the specified swap entry may be for another
> + * swap device which has been swapoff.  And in do_swap_page(), after
> + * the page is read from the swap device, the PTE is verified not
> + * changed with the page table locked to check whether the swap device
> + * has been swapoff or swapoff+swapon.
> + */
> +struct swap_info_struct *get_swap_device(swp_entry_t entry)
> +{
> +	struct swap_info_struct *si;
> +	unsigned long type, offset;
> +
> +	if (!entry.val)
> +		goto out;
> +	type = swp_type(entry);
> +	si = swap_type_to_swap_info(type);
> +	if (!si)
> +		goto bad_nofile;
> +
> +	preempt_disable();
> +	if (!(si->flags & SWP_VALID))
> +		goto unlock_out;
> +	offset = swp_offset(entry);
> +	if (offset >= si->max)
> +		goto unlock_out;
> +
> +	return si;
> +bad_nofile:
> +	pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
> +out:
> +	return NULL;
> +unlock_out:
> +	preempt_enable();
> +	return NULL;
> +}
> +
>  static unsigned char __swap_entry_free(struct swap_info_struct *p,
>  				       swp_entry_t entry, unsigned char usage)
>  {
> @@ -1357,11 +1416,18 @@ int page_swapcount(struct page *page)
>  	return count;
>  }
>  
> -int __swap_count(struct swap_info_struct *si, swp_entry_t entry)
> +int __swap_count(swp_entry_t entry)
>  {
> +	struct swap_info_struct *si;
>  	pgoff_t offset = swp_offset(entry);
> +	int count = 0;
>  
> -	return swap_count(si->swap_map[offset]);
> +	si = get_swap_device(entry);
> +	if (si) {
> +		count = swap_count(si->swap_map[offset]);
> +		put_swap_device(si);
> +	}
> +	return count;
>  }
>  
>  static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
> @@ -1386,9 +1452,11 @@ int __swp_swapcount(swp_entry_t entry)
>  	int count = 0;
>  	struct swap_info_struct *si;
>  
> -	si = __swap_info_get(entry);
> -	if (si)
> +	si = get_swap_device(entry);
> +	if (si) {
>  		count = swap_swapcount(si, entry);
> +		put_swap_device(si);
> +	}
>  	return count;
>  }
>  
> @@ -2332,9 +2400,9 @@ static int swap_node(struct swap_info_struct *p)
>  	return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
>  }
>  
> -static void _enable_swap_info(struct swap_info_struct *p, int prio,
> -				unsigned char *swap_map,
> -				struct swap_cluster_info *cluster_info)
> +static void setup_swap_info(struct swap_info_struct *p, int prio,
> +			    unsigned char *swap_map,
> +			    struct swap_cluster_info *cluster_info)
>  {
>  	int i;
>  
> @@ -2359,7 +2427,11 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
>  	}
>  	p->swap_map = swap_map;
>  	p->cluster_info = cluster_info;
> -	p->flags |= SWP_WRITEOK;
> +}
> +
> +static void _enable_swap_info(struct swap_info_struct *p)
> +{
> +	p->flags |= SWP_WRITEOK | SWP_VALID;
>  	atomic_long_add(p->pages, &nr_swap_pages);
>  	total_swap_pages += p->pages;
>  
> @@ -2378,6 +2450,11 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
>  	add_to_avail_list(p);
>  }
>  
> +static int swap_onoff_stop(void *arg)
> +{
> +	return 0;
> +}
> +
>  static void enable_swap_info(struct swap_info_struct *p, int prio,
>  				unsigned char *swap_map,
>  				struct swap_cluster_info *cluster_info,
> @@ -2386,7 +2463,17 @@ static void enable_swap_info(struct swap_info_struct *p, int prio,
>  	frontswap_init(p->type, frontswap_map);
>  	spin_lock(&swap_lock);
>  	spin_lock(&p->lock);
> -	 _enable_swap_info(p, prio, swap_map, cluster_info);
> +	setup_swap_info(p, prio, swap_map, cluster_info);
> +	spin_unlock(&p->lock);
> +	spin_unlock(&swap_lock);
> +	/*
> +	 * Guarantee swap_map, cluster_info, etc. fields are used
> +	 * between get/put_swap_device() only if SWP_VALID bit is set
> +	 */
> +	stop_machine(swap_onoff_stop, NULL, cpu_online_mask);
> +	spin_lock(&swap_lock);
> +	spin_lock(&p->lock);
> +	_enable_swap_info(p);
>  	spin_unlock(&p->lock);
>  	spin_unlock(&swap_lock);
>  }
> @@ -2395,7 +2482,8 @@ static void reinsert_swap_info(struct swap_info_struct *p)
>  {
>  	spin_lock(&swap_lock);
>  	spin_lock(&p->lock);
> -	_enable_swap_info(p, p->prio, p->swap_map, p->cluster_info);
> +	setup_swap_info(p, p->prio, p->swap_map, p->cluster_info);
> +	_enable_swap_info(p);
>  	spin_unlock(&p->lock);
>  	spin_unlock(&swap_lock);
>  }
> @@ -2498,6 +2586,17 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  
>  	reenable_swap_slots_cache_unlock();
>  
> +	spin_lock(&swap_lock);
> +	spin_lock(&p->lock);
> +	p->flags &= ~SWP_VALID;		/* mark swap device as invalid */
> +	spin_unlock(&p->lock);
> +	spin_unlock(&swap_lock);
> +	/*
> +	 * wait for swap operations protected by get/put_swap_device()
> +	 * to complete
> +	 */
> +	stop_machine(swap_onoff_stop, NULL, cpu_online_mask);
> +
>  	flush_work(&p->discard_work);
>  
>  	destroy_swap_extents(p);
> @@ -3263,17 +3362,11 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
>  	unsigned char has_cache;
>  	int err = -EINVAL;
>  
> -	if (non_swap_entry(entry))
> -		goto out;
> -
> -	p = swp_swap_info(entry);
> +	p = get_swap_device(entry);
>  	if (!p)
> -		goto bad_file;
> -
> -	offset = swp_offset(entry);
> -	if (unlikely(offset >= p->max))
>  		goto out;
>  
> +	offset = swp_offset(entry);
>  	ci = lock_cluster_or_swap_info(p, offset);
>  
>  	count = p->swap_map[offset];
> @@ -3319,11 +3412,9 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
>  unlock_out:
>  	unlock_cluster_or_swap_info(p, ci);
>  out:
> +	if (p)
> +		put_swap_device(p);
>  	return err;
> -
> -bad_file:
> -	pr_err("swap_dup: %s%08lx\n", Bad_file, entry.val);
> -	goto out;
>  }
>  
>  /*
> @@ -3415,6 +3506,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
>  	struct page *list_page;
>  	pgoff_t offset;
>  	unsigned char count;
> +	int ret = 0;
>  
>  	/*
>  	 * When debugging, it's easier to use __GFP_ZERO here; but it's better
> @@ -3422,15 +3514,15 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
>  	 */
>  	page = alloc_page(gfp_mask | __GFP_HIGHMEM);
>  
> -	si = swap_info_get(entry);
> +	si = get_swap_device(entry);
>  	if (!si) {
>  		/*
>  		 * An acceptable race has occurred since the failing
> -		 * __swap_duplicate(): the swap entry has been freed,
> -		 * perhaps even the whole swap_map cleared for swapoff.
> +		 * __swap_duplicate(): the swap device may be swapoff
>  		 */
>  		goto outer;
>  	}
> +	spin_lock(&si->lock);
>  
>  	offset = swp_offset(entry);
>  
> @@ -3448,9 +3540,8 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
>  	}
>  
>  	if (!page) {
> -		unlock_cluster(ci);
> -		spin_unlock(&si->lock);
> -		return -ENOMEM;
> +		ret = -ENOMEM;
> +		goto out;
>  	}
>  
>  	/*
> @@ -3502,10 +3593,11 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
>  out:
>  	unlock_cluster(ci);
>  	spin_unlock(&si->lock);
> +	put_swap_device(si);
>  outer:
>  	if (page)
>  		__free_page(page);
> -	return 0;
> +	return ret;
>  }
>  
>  /*
> -- 
> 2.20.1
>
Andrew Morton Feb. 14, 2019, 8:30 p.m. UTC | #12
On Thu, 14 Feb 2019 15:33:18 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> > Because swapoff() is a very rare code path, to make the normal path
> > run as fast as possible, disabling preemption + stop_machine() instead
> > of a reference count is used to implement get/put_swap_device().  From
> > get_swap_device() to put_swap_device(), preemption is disabled, so
> > stop_machine() in swapoff() will wait until put_swap_device() is called.
> > 
> > In addition to the swap_map, cluster_info, etc. data structures in
> > struct swap_info_struct, the swap cache radix tree will be freed after
> > swapoff, so this patch fixes the race between swap cache lookup and
> > swapoff too.
> > 
> > Races between swapoff and some other swap cache usages that are
> > protected by disabling preemption are fixed too, by calling
> > stop_machine() between clearing PageSwapCache() and freeing the swap
> > cache data structure.
> > 
> > An alternative implementation could be to replace disabling preemption
> > with rcu_read_lock_sched() and stop_machine() with synchronize_sched().
> 
> using stop_machine is generally discouraged. It is a gross
> synchronization.

This was discussed to death and I think the changelog explains the
conclusions adequately.  swapoff is super-rare so a stop_machine() in
that path is appropriate if its use permits more efficiency in the
regular swap code paths.  

> Besides that, since when do we have this problem?

What problem??
Andrea Arcangeli Feb. 14, 2019, 9:22 p.m. UTC | #13
Hello,

On Thu, Feb 14, 2019 at 12:30:02PM -0800, Andrew Morton wrote:
> This was discussed to death and I think the changelog explains the
> conclusions adequately.  swapoff is super-rare so a stop_machine() in
> that path is appropriate if its use permits more efficiency in the
> regular swap code paths.  

The problem is precisely that the way the stop_machine callback is
implemented right now (a dummy noop) makes the stop_machine() solution
fully equivalent to RCU from the fast path point of view.  It does not
permit more efficiency in the fast path, which is all we care about.

From the slow path point of view, the only difference is possibly that
stop_machine will reach the quiescent state faster (i.e. swapoff may
return a few dozen milliseconds faster), but nobody cares about the
latency of swapoff, and it's actually nicer if swapoff doesn't stop all
CPUs on large systems and uses less CPU overall.

This is why I suggested that if we keep using stop_machine() we should
not use a dummy function whose only purpose is to reach a quiescent
state (which is something that is more efficiently achieved with the
synchronize_rcu/sched/kernel RCU API of the day) but should instead try
to leverage the UP-like serialization to remove more spinlocks from the
fast path and convert them to preempt_disable().  However, the current
dummy callback cannot achieve that higher efficiency in the fast paths;
the code would need to be reshuffled to try to remove at least the
swap_lock.

If no spinlock is converted to preempt_disable() RCU I don't see the
point of stop_machine().

On a side note, the cmpxchg machinery I posted to run the function
simultaneously on all CPUs may actually be superfluous if using
cpus=NULL like Ying suggested.  Implementation details aside, the idea
of stop_machine() would still be that those p->swap_map = NULL
assignments and everything protected by the swap_lock should be
executed inside the callback, which runs like in a UP system, to speed
up the fast path further.

Thanks,
Andrea
Andrea Arcangeli Feb. 14, 2019, 9:47 p.m. UTC | #14
On Thu, Feb 14, 2019 at 04:07:37PM +0800, Huang, Ying wrote:
> Before, we chose to use stop_machine() to reduce the overhead of the hot
> path (the page fault handler) as much as possible.  But now, I found
> that rcu_read_lock_sched() is just a wrapper of preempt_disable().  So
> maybe we can switch to the RCU version now.

rcu_read_lock looks more efficient than rcu_read_lock_sched. So for
this purpose in the fast path rcu_read_lock()/unlock() should be the
preferred methods, no need to force preempt_disable() (except for
debug purposes if sleep debug is enabled). Server builds are done with
voluntary preempt (no preempt shouldn't even exist as config option)
and there rcu_read_lock might be just a noop.

Against a fast path rcu_read_lock/unlock before the consolidation
synchronize_rcu would have been enough, now after the consolidation
even more certain that it's enough because it's equivalent to _mult.
Huang, Ying Feb. 15, 2019, 6:34 a.m. UTC | #15
Andrea Parri <andrea.parri@amarulasolutions.com> writes:

>> Alternative implementation could be replacing disable preemption with
>> rcu_read_lock_sched and stop_machine() with synchronize_sched().
>
> JFYI, starting with v4.20-rc1, synchronize_rcu{,expedited}() also wait
> for preempt-disable sections (the intent seems to retire the RCU-sched
> update-side API), so that here you could instead use preempt-disable +
> synchronize_rcu{,expedited}().  This LWN article gives an overview of
> the latest RCU API/semantics changes: https://lwn.net/Articles/777036/.

Thanks a lot!  The article is really helpful!

Best Regards,
Huang, Ying

>   Andrea
Huang, Ying Feb. 15, 2019, 7:08 a.m. UTC | #16
Michal Hocko <mhocko@kernel.org> writes:

> On Mon 11-02-19 16:38:46, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> When swapin is performed, after getting the swap entry information from
>> the page table, the system will swap in the swap entry without any lock
>> held to prevent the swap device from being swapped off.  This may cause
>> a race like the one below:
>> 
>> CPU 1				CPU 2
>> -----				-----
>> 				do_swap_page
>> 				  swapin_readahead
>> 				    __read_swap_cache_async
>> swapoff				      swapcache_prepare
>>   p->swap_map = NULL		        __swap_duplicate
>> 					  p->swap_map[?] /* !!! NULL pointer access */
>> 
>> Because swapoff is usually done only at system shutdown, the race may
>> not hit many people in practice.  But it is still a race that needs to
>> be fixed.
>> 
>> To fix the race, get_swap_device() is added to check whether the
>> specified swap entry is valid in its swap device.  If so, it will keep
>> the swap entry valid by preventing the swap device from being swapped
>> off, until put_swap_device() is called.
>> 
>> Because swapoff() is a very rare code path, to make the normal path run
>> as fast as possible, disabling preemption + stop_machine() instead of a
>> reference count is used to implement get/put_swap_device().  From
>> get_swap_device() to put_swap_device(), preemption is disabled, so
>> stop_machine() in swapoff() will wait until put_swap_device() is called.
>> 
>> In addition to the swap_map, cluster_info, etc. data structures in
>> struct swap_info_struct, the swap cache radix tree will be freed after
>> swapoff, so this patch fixes the race between swap cache lookup and
>> swapoff too.
>> 
>> Races between swapoff and some other swap cache usages that are
>> protected by disabling preemption are fixed too, by calling
>> stop_machine() between clearing PageSwapCache() and freeing the swap
>> cache data structure.
>> 
>> An alternative implementation could be to replace disabling preemption
>> with rcu_read_lock_sched() and stop_machine() with synchronize_sched().
>
> using stop_machine is generally discouraged. It is a gross
> synchronization.
>
> Besides that, since when do we have this problem?

By "problem", you mean the race between swapoff and the page fault
handler?  The problem was introduced in v4.11, when we started to use
swap_cluster_info->lock instead of swap_info_struct->lock in
__swap_duplicate() where possible, to improve the scalability of swap
operations.  But because swapoff is a really rare operation, I don't
think it's necessary to backport the fix.

Best Regards,
Huang, Ying
Huang, Ying Feb. 15, 2019, 7:50 a.m. UTC | #17
Andrea Arcangeli <aarcange@redhat.com> writes:

> On Thu, Feb 14, 2019 at 04:07:37PM +0800, Huang, Ying wrote:
>> Before, we chose to use stop_machine() to reduce the overhead of the hot
>> path (the page fault handler) as much as possible.  But now, I found
>> that rcu_read_lock_sched() is just a wrapper of preempt_disable().  So
>> maybe we can switch to the RCU version now.
>
> rcu_read_lock looks more efficient than rcu_read_lock_sched. So for
> this purpose in the fast path rcu_read_lock()/unlock() should be the
> preferred methods, no need to force preempt_disable() (except for
> debug purposes if sleep debug is enabled). Server builds are done with
> voluntary preempt (no preempt shouldn't even exist as config option)
> and there rcu_read_lock might be just a noop.

If

CONFIG_PREEMPT_VOLUNTARY=y
CONFIG_PREEMPT=n
CONFIG_PREEMPT_COUNT=n

which is common for servers,

rcu_read_lock() will be a noop, rcu_read_lock_sched() and
preempt_disable() will be barrier().  So rcu_read_lock() is a little
better.

> Against a fast path rcu_read_lock/unlock before the consolidation
> synchronize_rcu would have been enough, now after the consolidation
> even more certain that it's enough because it's equivalent to _mult.

Yes.  Will change to rcu_read_lock/unlock based method.
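
For illustration, a minimal sketch of how the rcu_read_lock() based
variant could look, assuming the stop_machine() calls in the
swapon/swapoff paths become synchronize_rcu() (a sketch of the
direction discussed here, not the final patch):

struct swap_info_struct *get_swap_device(swp_entry_t entry)
{
	struct swap_info_struct *si;
	unsigned long type, offset;

	if (!entry.val)
		goto out;
	type = swp_type(entry);
	si = swap_type_to_swap_info(type);
	if (!si)
		goto bad_nofile;

	rcu_read_lock();		/* instead of preempt_disable() */
	if (!(si->flags & SWP_VALID))
		goto unlock_out;
	offset = swp_offset(entry);
	if (offset >= si->max)
		goto unlock_out;

	return si;
bad_nofile:
	pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
out:
	return NULL;
unlock_out:
	rcu_read_unlock();
	return NULL;
}

static inline void put_swap_device(struct swap_info_struct *si)
{
	rcu_read_unlock();		/* instead of preempt_enable() */
}

/* and on the swapon/swapoff side:
 *	- stop_machine(swap_onoff_stop, NULL, cpu_online_mask);
 *	+ synchronize_rcu();
 */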

Best Regards,
Huang, Ying
Michal Hocko Feb. 15, 2019, 1:11 p.m. UTC | #18
On Fri 15-02-19 15:08:36, Huang, Ying wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Mon 11-02-19 16:38:46, Huang, Ying wrote:
> >> From: Huang Ying <ying.huang@intel.com>
> >> 
> >> When swapin is performed, after getting the swap entry information from
> >> the page table, system will swap in the swap entry, without any lock held
> >> to prevent the swap device from being swapoff.  This may cause the race
> >> like below,
> >> 
> >> CPU 1				CPU 2
> >> -----				-----
> >> 				do_swap_page
> >> 				  swapin_readahead
> >> 				    __read_swap_cache_async
> >> swapoff				      swapcache_prepare
> >>   p->swap_map = NULL		        __swap_duplicate
> >> 					  p->swap_map[?] /* !!! NULL pointer access */
> >> 
> >> Because swapoff is usually done when system shutdown only, the race may
> >> not hit many people in practice.  But it is still a race need to be fixed.
> >> 
> >> To fix the race, get_swap_device() is added to check whether the specified
> >> swap entry is valid in its swap device.  If so, it will keep the swap
> >> entry valid via preventing the swap device from being swapoff, until
> >> put_swap_device() is called.
> >> 
> >> Because swapoff() is very rare code path, to make the normal path runs as
> >> fast as possible, disabling preemption + stop_machine() instead of
> >> reference count is used to implement get/put_swap_device().  From
> >> get_swap_device() to put_swap_device(), the preemption is disabled, so
> >> stop_machine() in swapoff() will wait until put_swap_device() is called.
> >> 
> >> In addition to swap_map, cluster_info, etc.  data structure in the struct
> >> swap_info_struct, the swap cache radix tree will be freed after swapoff,
> >> so this patch fixes the race between swap cache looking up and swapoff
> >> too.
> >> 
> >> Races between some other swap cache usages protected via disabling
> >> preemption and swapoff are fixed too via calling stop_machine() between
> >> clearing PageSwapCache() and freeing swap cache data structure.
> >> 
> >> Alternative implementation could be replacing disable preemption with
> >> rcu_read_lock_sched and stop_machine() with synchronize_sched().
> >
> > using stop_machine is generally discouraged. It is a gross
> > synchronization.
> >
> > Besides that, since when do we have this problem?
> 
> By "problem", you mean the race between swapoff and the page fault
> handler?

yes

> The problem was introduced in v4.11, when we replaced
> swap_info_struct->lock with swap_cluster_info->lock in
> __swap_duplicate() where possible to improve the scalability of swap
> operations.  But because swapoff is a really rare operation, I don't
> think it's necessary to backport the fix.

Well, a lack of any bug reports would support your theory that this is
unlikely to hit in practice. Fixes tag would be nice to have regardless
though.

Thanks!
Huang, Ying Feb. 18, 2019, 12:51 a.m. UTC | #19
Michal Hocko <mhocko@kernel.org> writes:

> On Fri 15-02-19 15:08:36, Huang, Ying wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Mon 11-02-19 16:38:46, Huang, Ying wrote:
>> >> From: Huang Ying <ying.huang@intel.com>
>> >> 
>> >> When swapin is performed, after getting the swap entry information from
>> >> the page table, system will swap in the swap entry, without any lock held
>> >> to prevent the swap device from being swapoff.  This may cause the race
>> >> like below,
>> >> 
>> >> CPU 1				CPU 2
>> >> -----				-----
>> >> 				do_swap_page
>> >> 				  swapin_readahead
>> >> 				    __read_swap_cache_async
>> >> swapoff				      swapcache_prepare
>> >>   p->swap_map = NULL		        __swap_duplicate
>> >> 					  p->swap_map[?] /* !!! NULL pointer access */
>> >> 
>> >> Because swapoff is usually done when system shutdown only, the race may
>> >> not hit many people in practice.  But it is still a race need to be fixed.
>> >> 
>> >> To fix the race, get_swap_device() is added to check whether the specified
>> >> swap entry is valid in its swap device.  If so, it will keep the swap
>> >> entry valid via preventing the swap device from being swapoff, until
>> >> put_swap_device() is called.
>> >> 
>> >> Because swapoff() is very rare code path, to make the normal path runs as
>> >> fast as possible, disabling preemption + stop_machine() instead of
>> >> reference count is used to implement get/put_swap_device().  From
>> >> get_swap_device() to put_swap_device(), the preemption is disabled, so
>> >> stop_machine() in swapoff() will wait until put_swap_device() is called.
>> >> 
>> >> In addition to swap_map, cluster_info, etc.  data structure in the struct
>> >> swap_info_struct, the swap cache radix tree will be freed after swapoff,
>> >> so this patch fixes the race between swap cache looking up and swapoff
>> >> too.
>> >> 
>> >> Races between some other swap cache usages protected via disabling
>> >> preemption and swapoff are fixed too via calling stop_machine() between
>> >> clearing PageSwapCache() and freeing swap cache data structure.
>> >> 
>> >> Alternative implementation could be replacing disable preemption with
>> >> rcu_read_lock_sched and stop_machine() with synchronize_sched().
>> >
>> > using stop_machine is generally discouraged. It is a gross
>> > synchronization.
>> >
>> > Besides that, since when do we have this problem?
>> 
>> By "problem", you mean the race between swapoff and the page fault
>> handler?
>
> yes
>
>> The problem was introduced in v4.11, when we replaced
>> swap_info_struct->lock with swap_cluster_info->lock in
>> __swap_duplicate() where possible to improve the scalability of swap
>> operations.  But because swapoff is a really rare operation, I don't
>> think it's necessary to backport the fix.
>
> Well, a lack of any bug reports would support your theory that this is
> unlikely to hit in practice. Fixes tag would be nice to have regardless
> though.

Sure.  Will add "Fixes" tag.

Best Regards,
Huang, Ying

> Thanks!
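
As a quick illustration of the calling convention introduced by the
patch below (a sketch modeled on its lookup_swap_cache() hunk, not
extra patch content), a caller that touches swap_map or the swap cache
for an entry brackets the access like this:

	struct swap_info_struct *si;
	struct page *page;

	si = get_swap_device(entry);
	if (!si)
		return NULL;	/* entry's device is being/has been swapoff */
	/*
	 * Until put_swap_device(), si->swap_map, si->cluster_info and
	 * the swap cache radix tree for this device stay allocated.
	 */
	page = find_get_page(swap_address_space(entry), swp_offset(entry));
	put_swap_device(si);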
diff mbox series

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 649529be91f2..aecd1430d0a9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -175,8 +175,9 @@  enum {
 	SWP_PAGE_DISCARD = (1 << 10),	/* freed swap page-cluster discards */
 	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
 	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
+	SWP_VALID	= (1 << 13),	/* swap is valid to be operated on? */
 					/* add others here before... */
-	SWP_SCANNING	= (1 << 13),	/* refcount in scan_swap_map */
+	SWP_SCANNING	= (1 << 14),	/* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL
@@ -460,7 +461,7 @@  extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
-extern int __swap_count(struct swap_info_struct *si, swp_entry_t entry);
+extern int __swap_count(swp_entry_t entry);
 extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
@@ -470,6 +471,12 @@  extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
 extern void exit_swap_address_space(unsigned int type);
+extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
+
+static inline void put_swap_device(struct swap_info_struct *si)
+{
+	preempt_enable();
+}
 
 #else /* CONFIG_SWAP */
 
@@ -576,7 +583,7 @@  static inline int page_swapcount(struct page *page)
 	return 0;
 }
 
-static inline int __swap_count(struct swap_info_struct *si, swp_entry_t entry)
+static inline int __swap_count(swp_entry_t entry)
 {
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 34ced1369883..9c0743c17c6c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2719,7 +2719,7 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 		struct swap_info_struct *si = swp_swap_info(entry);
 
 		if (si->flags & SWP_SYNCHRONOUS_IO &&
-				__swap_count(si, entry) == 1) {
+				__swap_count(entry) == 1) {
 			/* skip swapcache */
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
 							vmf->address);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 85245fdec8d9..61453f1faf72 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -310,8 +310,13 @@  struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
 			       unsigned long addr)
 {
 	struct page *page;
+	struct swap_info_struct *si;
 
+	si = get_swap_device(entry);
+	if (!si)
+		return NULL;
 	page = find_get_page(swap_address_space(entry), swp_offset(entry));
+	put_swap_device(si);
 
 	INC_CACHE_INFO(find_total);
 	if (page) {
@@ -354,8 +359,8 @@  struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr,
 			bool *new_page_allocated)
 {
-	struct page *found_page, *new_page = NULL;
-	struct address_space *swapper_space = swap_address_space(entry);
+	struct page *found_page = NULL, *new_page = NULL;
+	struct swap_info_struct *si;
 	int err;
 	*new_page_allocated = false;
 
@@ -365,7 +370,12 @@  struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * called after lookup_swap_cache() failed, re-calling
 		 * that would confuse statistics.
 		 */
-		found_page = find_get_page(swapper_space, swp_offset(entry));
+		si = get_swap_device(entry);
+		if (!si)
+			break;
+		found_page = find_get_page(swap_address_space(entry),
+					   swp_offset(entry));
+		put_swap_device(si);
 		if (found_page)
 			break;
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index cca8420b12db..8f92f36814fb 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -38,6 +38,7 @@ 
 #include <linux/export.h>
 #include <linux/swap_slots.h>
 #include <linux/sort.h>
+#include <linux/stop_machine.h>
 
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
@@ -1186,6 +1187,64 @@  static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
 	return usage;
 }
 
+/*
+ * Check whether swap entry is valid in the swap device.  If so,
+ * return pointer to swap_info_struct, and keep the swap entry valid
+ * via preventing the swap device from being swapoff, until
+ * put_swap_device() is called.  Otherwise return NULL.
+ *
+ * Notice that swapoff or swapoff+swapon can still happen before the
+ * preempt_disable() in get_swap_device() or after the
+ * preempt_enable() in put_swap_device() if there isn't any other way
+ * to prevent swapoff, such as page lock, page table lock, etc.  The
+ * caller must be prepared for that.  For example, the following
+ * situation is possible.
+ *
+ *   CPU1				CPU2
+ *   do_swap_page()
+ *     ...				swapoff+swapon
+ *     __read_swap_cache_async()
+ *       swapcache_prepare()
+ *         __swap_duplicate()
+ *           // check swap_map
+ *     // verify PTE not changed
+ *
+ * In __swap_duplicate(), the swap_map need to be checked before
+ * changing partly because the specified swap entry may be for another
+ * swap device which has been swapoff.  And in do_swap_page(), after
+ * the page is read from the swap device, the PTE is verified not
+ * changed with the page table locked to check whether the swap device
+ * has been swapoff or swapoff+swapon.
+ */
+struct swap_info_struct *get_swap_device(swp_entry_t entry)
+{
+	struct swap_info_struct *si;
+	unsigned long type, offset;
+
+	if (!entry.val)
+		goto out;
+	type = swp_type(entry);
+	si = swap_type_to_swap_info(type);
+	if (!si)
+		goto bad_nofile;
+
+	preempt_disable();
+	if (!(si->flags & SWP_VALID))
+		goto unlock_out;
+	offset = swp_offset(entry);
+	if (offset >= si->max)
+		goto unlock_out;
+
+	return si;
+bad_nofile:
+	pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
+out:
+	return NULL;
+unlock_out:
+	preempt_enable();
+	return NULL;
+}
+
 static unsigned char __swap_entry_free(struct swap_info_struct *p,
 				       swp_entry_t entry, unsigned char usage)
 {
@@ -1357,11 +1416,18 @@  int page_swapcount(struct page *page)
 	return count;
 }
 
-int __swap_count(struct swap_info_struct *si, swp_entry_t entry)
+int __swap_count(swp_entry_t entry)
 {
+	struct swap_info_struct *si;
 	pgoff_t offset = swp_offset(entry);
+	int count = 0;
 
-	return swap_count(si->swap_map[offset]);
+	si = get_swap_device(entry);
+	if (si) {
+		count = swap_count(si->swap_map[offset]);
+		put_swap_device(si);
+	}
+	return count;
 }
 
 static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
@@ -1386,9 +1452,11 @@  int __swp_swapcount(swp_entry_t entry)
 	int count = 0;
 	struct swap_info_struct *si;
 
-	si = __swap_info_get(entry);
-	if (si)
+	si = get_swap_device(entry);
+	if (si) {
 		count = swap_swapcount(si, entry);
+		put_swap_device(si);
+	}
 	return count;
 }
 
@@ -2332,9 +2400,9 @@  static int swap_node(struct swap_info_struct *p)
 	return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
 }
 
-static void _enable_swap_info(struct swap_info_struct *p, int prio,
-				unsigned char *swap_map,
-				struct swap_cluster_info *cluster_info)
+static void setup_swap_info(struct swap_info_struct *p, int prio,
+			    unsigned char *swap_map,
+			    struct swap_cluster_info *cluster_info)
 {
 	int i;
 
@@ -2359,7 +2427,11 @@  static void _enable_swap_info(struct swap_info_struct *p, int prio,
 	}
 	p->swap_map = swap_map;
 	p->cluster_info = cluster_info;
-	p->flags |= SWP_WRITEOK;
+}
+
+static void _enable_swap_info(struct swap_info_struct *p)
+{
+	p->flags |= SWP_WRITEOK | SWP_VALID;
 	atomic_long_add(p->pages, &nr_swap_pages);
 	total_swap_pages += p->pages;
 
@@ -2378,6 +2450,11 @@  static void _enable_swap_info(struct swap_info_struct *p, int prio,
 	add_to_avail_list(p);
 }
 
+static int swap_onoff_stop(void *arg)
+{
+	return 0;
+}
+
 static void enable_swap_info(struct swap_info_struct *p, int prio,
 				unsigned char *swap_map,
 				struct swap_cluster_info *cluster_info,
@@ -2386,7 +2463,17 @@  static void enable_swap_info(struct swap_info_struct *p, int prio,
 	frontswap_init(p->type, frontswap_map);
 	spin_lock(&swap_lock);
 	spin_lock(&p->lock);
-	 _enable_swap_info(p, prio, swap_map, cluster_info);
+	setup_swap_info(p, prio, swap_map, cluster_info);
+	spin_unlock(&p->lock);
+	spin_unlock(&swap_lock);
+	/*
+	 * Guarantee swap_map, cluster_info, etc. fields are used
+	 * between get/put_swap_device() only if SWP_VALID bit is set
+	 */
+	stop_machine(swap_onoff_stop, NULL, cpu_online_mask);
+	spin_lock(&swap_lock);
+	spin_lock(&p->lock);
+	_enable_swap_info(p);
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
 }
@@ -2395,7 +2482,8 @@  static void reinsert_swap_info(struct swap_info_struct *p)
 {
 	spin_lock(&swap_lock);
 	spin_lock(&p->lock);
-	_enable_swap_info(p, p->prio, p->swap_map, p->cluster_info);
+	setup_swap_info(p, p->prio, p->swap_map, p->cluster_info);
+	_enable_swap_info(p);
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
 }
@@ -2498,6 +2586,17 @@  SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 
 	reenable_swap_slots_cache_unlock();
 
+	spin_lock(&swap_lock);
+	spin_lock(&p->lock);
+	p->flags &= ~SWP_VALID;		/* mark swap device as invalid */
+	spin_unlock(&p->lock);
+	spin_unlock(&swap_lock);
+	/*
+	 * wait for swap operations protected by get/put_swap_device()
+	 * to complete
+	 */
+	stop_machine(swap_onoff_stop, NULL, cpu_online_mask);
+
 	flush_work(&p->discard_work);
 
 	destroy_swap_extents(p);
@@ -3263,17 +3362,11 @@  static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 	unsigned char has_cache;
 	int err = -EINVAL;
 
-	if (non_swap_entry(entry))
-		goto out;
-
-	p = swp_swap_info(entry);
+	p = get_swap_device(entry);
 	if (!p)
-		goto bad_file;
-
-	offset = swp_offset(entry);
-	if (unlikely(offset >= p->max))
 		goto out;
 
+	offset = swp_offset(entry);
 	ci = lock_cluster_or_swap_info(p, offset);
 
 	count = p->swap_map[offset];
@@ -3319,11 +3412,9 @@  static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 unlock_out:
 	unlock_cluster_or_swap_info(p, ci);
 out:
+	if (p)
+		put_swap_device(p);
 	return err;
-
-bad_file:
-	pr_err("swap_dup: %s%08lx\n", Bad_file, entry.val);
-	goto out;
 }
 
 /*
@@ -3415,6 +3506,7 @@  int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	struct page *list_page;
 	pgoff_t offset;
 	unsigned char count;
+	int ret = 0;
 
 	/*
 	 * When debugging, it's easier to use __GFP_ZERO here; but it's better
@@ -3422,15 +3514,15 @@  int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	 */
 	page = alloc_page(gfp_mask | __GFP_HIGHMEM);
 
-	si = swap_info_get(entry);
+	si = get_swap_device(entry);
 	if (!si) {
 		/*
 		 * An acceptable race has occurred since the failing
-		 * __swap_duplicate(): the swap entry has been freed,
-		 * perhaps even the whole swap_map cleared for swapoff.
+		 * __swap_duplicate(): the swap device may be swapoff
 		 */
 		goto outer;
 	}
+	spin_lock(&si->lock);
 
 	offset = swp_offset(entry);
 
@@ -3448,9 +3540,8 @@  int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	}
 
 	if (!page) {
-		unlock_cluster(ci);
-		spin_unlock(&si->lock);
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto out;
 	}
 
 	/*
@@ -3502,10 +3593,11 @@  int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 out:
 	unlock_cluster(ci);
 	spin_unlock(&si->lock);
+	put_swap_device(si);
 outer:
 	if (page)
 		__free_page(page);
-	return 0;
+	return ret;
 }
 
 /*