
mm, swap: bounds check swap_info accesses to avoid NULL derefs

Message ID 20190115002305.15402-1-daniel.m.jordan@oracle.com
State New, archived
Series mm, swap: bounds check swap_info accesses to avoid NULL derefs

Commit Message

Daniel Jordan Jan. 15, 2019, 12:23 a.m. UTC
Dan Carpenter reports a potential NULL dereference in
get_swap_page_of_type:

  Smatch complains that the NULL checks on "si" aren't consistent.  This
  seems like a real bug because we have not ensured that the type is
  valid and so "si" can be NULL.

Add the missing check for NULL, taking care to use a read barrier to
ensure CPU1 observes CPU0's updates in the correct order:

        CPU0                           CPU1
        alloc_swap_info()              if (type >= nr_swapfiles)
          swap_info[type] = p              /* handle invalid entry */
          smp_wmb()                    smp_rmb()
          ++nr_swapfiles               p = swap_info[type]

Without smp_rmb, CPU1 might observe CPU0's write to nr_swapfiles before
CPU0's write to swap_info[type] and read NULL from swap_info[type].

Ying Huang noticed that other places don't order these reads properly.
Introduce swap_type_to_swap_info to encourage correct usage.

Use READ_ONCE and WRITE_ONCE to follow the Linux Kernel Memory Model
(see tools/memory-model/Documentation/explanation.txt).
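The protocol above can be modeled in userspace. The sketch below is an illustrative C11 analogue, not kernel code: relaxed atomics stand in for READ_ONCE/WRITE_ONCE and explicit fences stand in for smp_wmb/smp_rmb; the names mirror the patch but the harness around them is hypothetical.

```c
/* Userspace analogue of the swap_info publication protocol: the writer
 * publishes the pointer, fences, then bumps the count; the reader checks
 * the count, fences, then loads the pointer.  Illustrative only. */
#include <stdatomic.h>
#include <stddef.h>

#define MAX_SWAPFILES 32

struct swap_info_struct { int type; };

static struct swap_info_struct *_Atomic swap_info[MAX_SWAPFILES];
static atomic_int nr_swapfiles;

/* Writer side (alloc_swap_info in the patch). */
static void publish_swap_info(struct swap_info_struct *p)
{
	int type = atomic_load_explicit(&nr_swapfiles, memory_order_relaxed);

	p->type = type;
	atomic_store_explicit(&swap_info[type], p, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* smp_wmb() */
	atomic_store_explicit(&nr_swapfiles, type + 1, memory_order_relaxed);
}

/* Reader side (swap_type_to_swap_info in the patch). */
static struct swap_info_struct *swap_type_to_swap_info(int type)
{
	if (type >= atomic_load_explicit(&nr_swapfiles, memory_order_relaxed))
		return NULL;
	atomic_thread_fence(memory_order_acquire);	/* smp_rmb() */
	return atomic_load_explicit(&swap_info[type], memory_order_relaxed);
}
```

A reader that observes the incremented count is therefore guaranteed, by the fence pairing, to observe the earlier pointer store as well, which is exactly the property the NULL check relies on.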

This ordering need not be enforced in places where swap_lock is held
(e.g. si_swapinfo) because swap_lock serializes updates to nr_swapfiles
and the swap_info array.

This is a theoretical problem; there are no actual reports of it.

Fixes: ec8acf20afb8 ("swap: add per-partition lock for swapfile")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>

---

I'd appreciate it if someone more familiar with memory barriers could
check this over.  Thanks.

Probably no need for stable; this is all theoretical.

Against linux-mmotm tag v5.0-rc1-mmotm-2019-01-09-13-40

 mm/swapfile.c | 43 +++++++++++++++++++++++++++----------------
 1 file changed, 27 insertions(+), 16 deletions(-)

Comments

Andrea Parri Jan. 15, 2019, 1:17 a.m. UTC | #1
On Mon, Jan 14, 2019 at 07:23:05PM -0500, Daniel Jordan wrote:
> Dan Carpenter reports a potential NULL dereference in
> get_swap_page_of_type:
> 
>   Smatch complains that the NULL checks on "si" aren't consistent.  This
>   seems like a real bug because we have not ensured that the type is
>   valid and so "si" can be NULL.
> 
> Add the missing check for NULL, taking care to use a read barrier to
> ensure CPU1 observes CPU0's updates in the correct order:
> 
>         CPU0                           CPU1
>         alloc_swap_info()              if (type >= nr_swapfiles)
>           swap_info[type] = p              /* handle invalid entry */
>           smp_wmb()                    smp_rmb()
>           ++nr_swapfiles               p = swap_info[type]
> 
> Without smp_rmb, CPU1 might observe CPU0's write to nr_swapfiles before
> CPU0's write to swap_info[type] and read NULL from swap_info[type].
> 
> Ying Huang noticed that other places don't order these reads properly.
> Introduce swap_type_to_swap_info to encourage correct usage.
> 
> Use READ_ONCE and WRITE_ONCE to follow the Linux Kernel Memory Model
> (see tools/memory-model/Documentation/explanation.txt).
> 
> This ordering need not be enforced in places where swap_lock is held
> (e.g. si_swapinfo) because swap_lock serializes updates to nr_swapfiles
> and the swap_info array.
> 
> This is a theoretical problem, no actual reports of it exist.
> 
> Fixes: ec8acf20afb8 ("swap: add per-partition lock for swapfile")
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Dan Carpenter <dan.carpenter@oracle.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Shaohua Li <shli@kernel.org>
> Cc: Stephen Rothwell <sfr@canb.auug.org.au>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Will Deacon <will.deacon@arm.com>

Please feel free to add:

Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>

  Andrea


> 
> ---
> 
> I'd appreciate it if someone more familiar with memory barriers could
> check this over.  Thanks.
> 
> Probably no need for stable, this is all theoretical.
> 
> Against linux-mmotm tag v5.0-rc1-mmotm-2019-01-09-13-40
> 
>  mm/swapfile.c | 43 +++++++++++++++++++++++++++----------------
>  1 file changed, 27 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f0edf7244256..dad52fc67045 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -99,6 +99,15 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
>  
>  atomic_t nr_rotate_swap = ATOMIC_INIT(0);
>  
> +static struct swap_info_struct *swap_type_to_swap_info(int type)
> +{
> +	if (type >= READ_ONCE(nr_swapfiles))
> +		return NULL;
> +
> +	smp_rmb();	/* Pairs with smp_wmb in alloc_swap_info. */
> +	return READ_ONCE(swap_info[type]);
> +}
> +
>  static inline unsigned char swap_count(unsigned char ent)
>  {
>  	return ent & ~SWAP_HAS_CACHE;	/* may include COUNT_CONTINUED flag */
> @@ -1045,12 +1054,14 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>  /* The only caller of this function is now suspend routine */
>  swp_entry_t get_swap_page_of_type(int type)
>  {
> -	struct swap_info_struct *si;
> +	struct swap_info_struct *si = swap_type_to_swap_info(type);
>  	pgoff_t offset;
>  
> -	si = swap_info[type];
> +	if (!si)
> +		goto fail;
> +
>  	spin_lock(&si->lock);
> -	if (si && (si->flags & SWP_WRITEOK)) {
> +	if (si->flags & SWP_WRITEOK) {
>  		atomic_long_dec(&nr_swap_pages);
>  		/* This is called for allocating swap entry, not cache */
>  		offset = scan_swap_map(si, 1);
> @@ -1061,6 +1072,7 @@ swp_entry_t get_swap_page_of_type(int type)
>  		atomic_long_inc(&nr_swap_pages);
>  	}
>  	spin_unlock(&si->lock);
> +fail:
>  	return (swp_entry_t) {0};
>  }
>  
> @@ -1072,9 +1084,9 @@ static struct swap_info_struct *__swap_info_get(swp_entry_t entry)
>  	if (!entry.val)
>  		goto out;
>  	type = swp_type(entry);
> -	if (type >= nr_swapfiles)
> +	p = swap_type_to_swap_info(type);
> +	if (!p)
>  		goto bad_nofile;
> -	p = swap_info[type];
>  	if (!(p->flags & SWP_USED))
>  		goto bad_device;
>  	offset = swp_offset(entry);
> @@ -1212,9 +1224,9 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>  	if (!entry.val)
>  		goto out;
>  	type = swp_type(entry);
> -	if (type >= nr_swapfiles)
> +	si = swap_type_to_swap_info(type);
> +	if (!si)
>  		goto bad_nofile;
> -	si = swap_info[type];
>  
>  	preempt_disable();
>  	if (!(si->flags & SWP_VALID))
> @@ -1765,10 +1777,9 @@ int swap_type_of(dev_t device, sector_t offset, struct block_device **bdev_p)
>  sector_t swapdev_block(int type, pgoff_t offset)
>  {
>  	struct block_device *bdev;
> +	struct swap_info_struct *si = swap_type_to_swap_info(type);
>  
> -	if ((unsigned int)type >= nr_swapfiles)
> -		return 0;
> -	if (!(swap_info[type]->flags & SWP_WRITEOK))
> +	if (!si || !(si->flags & SWP_WRITEOK))
>  		return 0;
>  	return map_swap_entry(swp_entry(type, offset), &bdev);
>  }
> @@ -2799,9 +2810,9 @@ static void *swap_start(struct seq_file *swap, loff_t *pos)
>  	if (!l)
>  		return SEQ_START_TOKEN;
>  
> -	for (type = 0; type < nr_swapfiles; type++) {
> +	for (type = 0; type < READ_ONCE(nr_swapfiles); type++) {
>  		smp_rmb();	/* read nr_swapfiles before swap_info[type] */
> -		si = swap_info[type];
> +		si = READ_ONCE(swap_info[type]);
>  		if (!(si->flags & SWP_USED) || !si->swap_map)
>  			continue;
>  		if (!--l)
> @@ -2821,9 +2832,9 @@ static void *swap_next(struct seq_file *swap, void *v, loff_t *pos)
>  	else
>  		type = si->type + 1;
>  
> -	for (; type < nr_swapfiles; type++) {
> +	for (; type < READ_ONCE(nr_swapfiles); type++) {
>  		smp_rmb();	/* read nr_swapfiles before swap_info[type] */
> -		si = swap_info[type];
> +		si = READ_ONCE(swap_info[type]);
>  		if (!(si->flags & SWP_USED) || !si->swap_map)
>  			continue;
>  		++*pos;
> @@ -2930,14 +2941,14 @@ static struct swap_info_struct *alloc_swap_info(void)
>  	}
>  	if (type >= nr_swapfiles) {
>  		p->type = type;
> -		swap_info[type] = p;
> +		WRITE_ONCE(swap_info[type], p);
>  		/*
>  		 * Write swap_info[type] before nr_swapfiles, in case a
>  		 * racing procfs swap_start() or swap_next() is reading them.
>  		 * (We never shrink nr_swapfiles, we never free this entry.)
>  		 */
>  		smp_wmb();
> -		nr_swapfiles++;
> +		WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
>  	} else {
>  		kvfree(p);
>  		p = swap_info[type];
> -- 
> 2.20.0
>
Andrew Morton Jan. 30, 2019, 6:26 a.m. UTC | #2
On Mon, 14 Jan 2019 19:23:05 -0500 Daniel Jordan <daniel.m.jordan@oracle.com> wrote:

> Dan Carpenter reports a potential NULL dereference in
> get_swap_page_of_type:
> 
>   Smatch complains that the NULL checks on "si" aren't consistent.  This
>   seems like a real bug because we have not ensured that the type is
>   valid and so "si" can be NULL.
> 
> Add the missing check for NULL, taking care to use a read barrier to
> ensure CPU1 observes CPU0's updates in the correct order:
> 
>         CPU0                           CPU1
>         alloc_swap_info()              if (type >= nr_swapfiles)
>           swap_info[type] = p              /* handle invalid entry */
>           smp_wmb()                    smp_rmb()
>           ++nr_swapfiles               p = swap_info[type]
> 
> Without smp_rmb, CPU1 might observe CPU0's write to nr_swapfiles before
> CPU0's write to swap_info[type] and read NULL from swap_info[type].
> 
> Ying Huang noticed that other places don't order these reads properly.
> Introduce swap_type_to_swap_info to encourage correct usage.
> 
> Use READ_ONCE and WRITE_ONCE to follow the Linux Kernel Memory Model
> (see tools/memory-model/Documentation/explanation.txt).
> 
> This ordering need not be enforced in places where swap_lock is held
> (e.g. si_swapinfo) because swap_lock serializes updates to nr_swapfiles
> and the swap_info array.
> 
> This is a theoretical problem, no actual reports of it exist.
> 

LGTM, but like most people I'm afraid to ack it ;)

mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch is very
stuck so can you please redo this against mainline?
Dan Carpenter Jan. 30, 2019, 7:28 a.m. UTC | #3
On Mon, Jan 14, 2019 at 07:23:05PM -0500, Daniel Jordan wrote:
> Dan Carpenter reports a potential NULL dereference in
> get_swap_page_of_type:
> 
>   Smatch complains that the NULL checks on "si" aren't consistent.  This
>   seems like a real bug because we have not ensured that the type is
>   valid and so "si" can be NULL.
> 
> Add the missing check for NULL, taking care to use a read barrier to
> ensure CPU1 observes CPU0's updates in the correct order:
> 
>         CPU0                           CPU1
>         alloc_swap_info()              if (type >= nr_swapfiles)
>           swap_info[type] = p              /* handle invalid entry */
>           smp_wmb()                    smp_rmb()
>           ++nr_swapfiles               p = swap_info[type]
> 
> Without smp_rmb, CPU1 might observe CPU0's write to nr_swapfiles before
> CPU0's write to swap_info[type] and read NULL from swap_info[type].
> 
> Ying Huang noticed that other places don't order these reads properly.
> Introduce swap_type_to_swap_info to encourage correct usage.
> 
> Use READ_ONCE and WRITE_ONCE to follow the Linux Kernel Memory Model
> (see tools/memory-model/Documentation/explanation.txt).
> 
> This ordering need not be enforced in places where swap_lock is held
> (e.g. si_swapinfo) because swap_lock serializes updates to nr_swapfiles
> and the swap_info array.
> 
> This is a theoretical problem, no actual reports of it exist.
> 
> Fixes: ec8acf20afb8 ("swap: add per-partition lock for swapfile")
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Dan Carpenter <dan.carpenter@oracle.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Shaohua Li <shli@kernel.org>
> Cc: Stephen Rothwell <sfr@canb.auug.org.au>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Will Deacon <will.deacon@arm.com>
> 
> ---
> 
> I'd appreciate it if someone more familiar with memory barriers could
> check this over.  Thanks.
> 
> Probably no need for stable, this is all theoretical.
> 

The NULL dereference part is not theoretical.  It requires CAP_SYS_ADMIN
so it's not a huge deal, but you could trigger it from snapshot_ioctl()
with SNAPSHOT_ALLOC_SWAP_PAGE.

regards,
dan carpenter
Peter Zijlstra Jan. 30, 2019, 9:13 a.m. UTC | #4
On Mon, Jan 14, 2019 at 07:23:05PM -0500, Daniel Jordan wrote:
> Dan Carpenter reports a potential NULL dereference in
> get_swap_page_of_type:
> 
>   Smatch complains that the NULL checks on "si" aren't consistent.  This
>   seems like a real bug because we have not ensured that the type is
>   valid and so "si" can be NULL.
> 
> Add the missing check for NULL, taking care to use a read barrier to
> ensure CPU1 observes CPU0's updates in the correct order:
> 
>         CPU0                           CPU1
>         alloc_swap_info()              if (type >= nr_swapfiles)
>           swap_info[type] = p              /* handle invalid entry */
>           smp_wmb()                    smp_rmb()
>           ++nr_swapfiles               p = swap_info[type]
> 
> Without smp_rmb, CPU1 might observe CPU0's write to nr_swapfiles before
> CPU0's write to swap_info[type] and read NULL from swap_info[type].
> 
> Ying Huang noticed that other places don't order these reads properly.
> Introduce swap_type_to_swap_info to encourage correct usage.
> 
> Use READ_ONCE and WRITE_ONCE to follow the Linux Kernel Memory Model
> (see tools/memory-model/Documentation/explanation.txt).
> 
> This ordering need not be enforced in places where swap_lock is held
> (e.g. si_swapinfo) because swap_lock serializes updates to nr_swapfiles
> and the swap_info array.
> 
> This is a theoretical problem, no actual reports of it exist.
> 
> Fixes: ec8acf20afb8 ("swap: add per-partition lock for swapfile")
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>

A few comments below, but:

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

> +static struct swap_info_struct *swap_type_to_swap_info(int type)
> +{
> +	if (type >= READ_ONCE(nr_swapfiles))
> +		return NULL;
> +
> +	smp_rmb();	/* Pairs with smp_wmb in alloc_swap_info. */
> +	return READ_ONCE(swap_info[type]);
> +}

> @@ -2799,9 +2810,9 @@ static void *swap_start(struct seq_file *swap, loff_t *pos)
>  	if (!l)
>  		return SEQ_START_TOKEN;
>  
> -	for (type = 0; type < nr_swapfiles; type++) {
> +	for (type = 0; type < READ_ONCE(nr_swapfiles); type++) {
>  		smp_rmb();	/* read nr_swapfiles before swap_info[type] */
> -		si = swap_info[type];
> +		si = READ_ONCE(swap_info[type]);
>  		if (!(si->flags & SWP_USED) || !si->swap_map)
>  			continue;
>  		if (!--l)
> @@ -2821,9 +2832,9 @@ static void *swap_next(struct seq_file *swap, void *v, loff_t *pos)
>  	else
>  		type = si->type + 1;
>  
> -	for (; type < nr_swapfiles; type++) {
> +	for (; type < READ_ONCE(nr_swapfiles); type++) {
>  		smp_rmb();	/* read nr_swapfiles before swap_info[type] */
> -		si = swap_info[type];
> +		si = READ_ONCE(swap_info[type]);
>  		if (!(si->flags & SWP_USED) || !si->swap_map)
>  			continue;
>  		++*pos;

You could write those like:

	for (; (si = swap_type_to_swap_info(type)); type++)
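To make the suggested loop shape concrete, here is a userspace sketch of how it would read: the helper's NULL return doubles as the loop bound, so the explicit nr_swapfiles comparison and the per-iteration smp_rmb() both drop out. The stub table and the next_used() wrapper are assumptions for illustration; only the for-loop itself is Peter's suggestion.

```c
/* Control-flow model of the proposed swap_next()-style loop: iteration
 * stops when swap_type_to_swap_info() returns NULL for an out-of-range
 * type.  Kernel types are stubbed; this models the loop shape only. */
#include <stddef.h>

#define SWP_USED 0x01

struct swap_info_struct {
	unsigned long flags;
	void *swap_map;
};

/* Stand-ins for the kernel's swap_info[] and nr_swapfiles. */
static struct swap_info_struct *stub_table[8];
static int stub_count;

static struct swap_info_struct *swap_type_to_swap_info(int type)
{
	return (type >= 0 && type < stub_count) ? stub_table[type] : NULL;
}

/* Find the next in-use entry at or after @type, NULL if none. */
static struct swap_info_struct *next_used(int type)
{
	struct swap_info_struct *si;

	for (; (si = swap_type_to_swap_info(type)); type++) {
		if (!(si->flags & SWP_USED) || !si->swap_map)
			continue;
		return si;
	}
	return NULL;
}
```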

> @@ -2930,14 +2941,14 @@ static struct swap_info_struct *alloc_swap_info(void)
>  	}
>  	if (type >= nr_swapfiles) {
>  		p->type = type;
> -		swap_info[type] = p;
> +		WRITE_ONCE(swap_info[type], p);
>  		/*
>  		 * Write swap_info[type] before nr_swapfiles, in case a
>  		 * racing procfs swap_start() or swap_next() is reading them.
>  		 * (We never shrink nr_swapfiles, we never free this entry.)
>  		 */
>  		smp_wmb();
> -		nr_swapfiles++;
> +		WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
>  	} else {
>  		kvfree(p);
>  		p = swap_info[type];

It is also possible to write this with smp_load_acquire() /
smp_store_release(). ARM64/RISC-V might benefit from that, OTOH ARM
won't like that much.

Dunno what would be better.
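For comparison, the smp_load_acquire()/smp_store_release() alternative mentioned above would fold both explicit fences into the accesses on nr_swapfiles itself. The following is a hedged userspace C11 sketch of that shape, not the posted patch:

```c
/* Acquire/release variant: the release store to the count orders the
 * earlier swap_info[] store before the count becomes visible, and the
 * acquire load of the count makes that store visible to the reader,
 * with no separate smp_wmb()/smp_rmb().  Illustrative only. */
#include <stdatomic.h>
#include <stddef.h>

#define MAX_SWAPFILES 32

struct swap_info_struct { int type; };

static struct swap_info_struct *swap_info[MAX_SWAPFILES];
static atomic_int nr_swapfiles;

static void publish_swap_info(struct swap_info_struct *p)
{
	int type = atomic_load_explicit(&nr_swapfiles, memory_order_relaxed);

	p->type = type;
	swap_info[type] = p;
	/* smp_store_release() analogue. */
	atomic_store_explicit(&nr_swapfiles, type + 1, memory_order_release);
}

static struct swap_info_struct *swap_type_to_swap_info(int type)
{
	/* smp_load_acquire() analogue. */
	if (type >= atomic_load_explicit(&nr_swapfiles, memory_order_acquire))
		return NULL;
	return swap_info[type];
}
```

The trade-off Peter alludes to is architectural: on ARM64/RISC-V the acquire/release instructions can be cheaper than full read/write barriers, while on 32-bit ARM smp_store_release() falls back to a heavier barrier.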
Daniel Jordan Jan. 31, 2019, 1:52 a.m. UTC | #5
On Tue, Jan 29, 2019 at 10:26:22PM -0800, Andrew Morton wrote:
> LGTM, but like most people I'm afraid to ack it ;)
> 
> mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch is very
> stuck so can you please redo this against mainline?

Yep, I can do that.
Daniel Jordan Jan. 31, 2019, 1:55 a.m. UTC | #6
On Wed, Jan 30, 2019 at 10:28:46AM +0300, Dan Carpenter wrote:
> On Mon, Jan 14, 2019 at 07:23:05PM -0500, Daniel Jordan wrote:
> > Probably no need for stable, this is all theoretical.
> 
> The NULL dereference part is not theoretical.  It require CAP_SYS_ADMIN
> so it's not a huge deal, but you could trigger it from snapshot_ioctl()
> with SNAPSHOT_ALLOC_SWAP_PAGE.

Ok, I'll amend the changelog for v2.
Daniel Jordan Jan. 31, 2019, 2 a.m. UTC | #7
On Wed, Jan 30, 2019 at 10:13:16AM +0100, Peter Zijlstra wrote:
> On Mon, Jan 14, 2019 at 07:23:05PM -0500, Daniel Jordan wrote:
> 
> A few comments below, but:
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Thanks.

> > @@ -2799,9 +2810,9 @@ static void *swap_start(struct seq_file *swap, loff_t *pos)
> >  	if (!l)
> >  		return SEQ_START_TOKEN;
> >  
> > -	for (type = 0; type < nr_swapfiles; type++) {
> > +	for (type = 0; type < READ_ONCE(nr_swapfiles); type++) {
> >  		smp_rmb();	/* read nr_swapfiles before swap_info[type] */
> > -		si = swap_info[type];
> > +		si = READ_ONCE(swap_info[type]);
> >  		if (!(si->flags & SWP_USED) || !si->swap_map)
> >  			continue;
> >  		if (!--l)
> > @@ -2821,9 +2832,9 @@ static void *swap_next(struct seq_file *swap, void *v, loff_t *pos)
> >  	else
> >  		type = si->type + 1;
> >  
> > -	for (; type < nr_swapfiles; type++) {
> > +	for (; type < READ_ONCE(nr_swapfiles); type++) {
> >  		smp_rmb();	/* read nr_swapfiles before swap_info[type] */
> > -		si = swap_info[type];
> > +		si = READ_ONCE(swap_info[type]);
> >  		if (!(si->flags & SWP_USED) || !si->swap_map)
> >  			continue;
> >  		++*pos;
> 
> You could write those like:
> 
> 	for (; (si = swap_type_to_swap_info(type)); type++)

That's clever, and way better than the ugly iterator macro I wrote and then
deleted in disgust.

> > @@ -2930,14 +2941,14 @@ static struct swap_info_struct *alloc_swap_info(void)
> >  	}
> >  	if (type >= nr_swapfiles) {
> >  		p->type = type;
> > -		swap_info[type] = p;
> > +		WRITE_ONCE(swap_info[type], p);
> >  		/*
> >  		 * Write swap_info[type] before nr_swapfiles, in case a
> >  		 * racing procfs swap_start() or swap_next() is reading them.
> >  		 * (We never shrink nr_swapfiles, we never free this entry.)
> >  		 */
> >  		smp_wmb();
> > -		nr_swapfiles++;
> > +		WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
> >  	} else {
> >  		kvfree(p);
> >  		p = swap_info[type];
> 
> It is also possible to write this with smp_load_acquire() /
> smp_store_release(). ARM64/RISC-V might benefit from that, OTOH ARM
> won't like that much.
> 
> Dunno what would be better.

I just left it as-is for now.
Huang, Ying Jan. 31, 2019, 2:48 a.m. UTC | #8
Andrew Morton <akpm@linux-foundation.org> writes:
> mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch is very
> stuck so can you please redo this against mainline?

Allow me to be off topic: this patch has been in the -mm tree for quite some
time; what can I do to help it be merged upstream?

Best Regards,
Huang, Ying
Andrew Morton Jan. 31, 2019, 8:46 p.m. UTC | #9
On Thu, 31 Jan 2019 10:48:29 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:

> Andrew Morton <akpm@linux-foundation.org> writes:
> > mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch is very
> > stuck so can you please redo this against mainline?
> 
> Allow me to be off topic, this patch has been in mm tree for quite some
> time, what can I do to help this be merged upstream?

I have no evidence that it has been reviewed, for a start.  I've asked
Hugh to look at it.
Huang, Ying Feb. 2, 2019, 7:14 a.m. UTC | #10
Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 31 Jan 2019 10:48:29 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:
>
>> Andrew Morton <akpm@linux-foundation.org> writes:
>> > mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch is very
>> > stuck so can you please redo this against mainline?
>> 
>> Allow me to be off topic, this patch has been in mm tree for quite some
>> time, what can I do to help this be merged upstream?
>
> I have no evidence that it has been reviewed, for a start.  I've asked
> Hugh to look at it.

Got it!  I will try to find some people to review it.

Best Regards,
Huang, Ying
Hugh Dickins Feb. 4, 2019, 9:37 p.m. UTC | #11
On Thu, 31 Jan 2019, Andrew Morton wrote:
> On Thu, 31 Jan 2019 10:48:29 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:
> > Andrew Morton <akpm@linux-foundation.org> writes:
> > > mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch is very
> > > stuck so can you please redo this against mainline?
> > 
> > Allow me to be off topic, this patch has been in mm tree for quite some
> > time, what can I do to help this be merged upstream?

Wow, yes, it's about a year old.

> 
> I have no evidence that it has been reviewed, for a start.  I've asked
> Hugh to look at it.

I tried at the weekend.  Usual story: I don't like it at all, the
ever-increasing complexity there, but certainly understand the need
for that fix, and have not managed to think up anything better -
and now I need to switch away, sorry.

The multiple dynamically allocated and freed swapper address spaces
have indeed broken what used to make it safe.  If those imaginary
address spaces did not have to be virtually contiguous, I'd say
cache them and reuse them, instead of freeing.  But I don't see
how to do that as it stands.

find_get_page(swapper_address_space(entry), swp_offset(entry)) has
become an unsafe construct, where it used to be safe against corrupted
page tables.  Maybe we don't care so much about crashing on corrupted
page tables nowadays (I haven't heard recent complaints), and I think
Huang is correct that lookup_swap_cache() and __read_swap_cache_async()
happen to be the only instances that need to be guarded against swapoff
(the others are working with page table locked).

The array of arrays of swapper spaces is all just to get a separate
lock for separate extents of the swapfile: I wonder whether Matthew has
anything in mind for that in XArray (I think Peter once got it working
in radix-tree, but the overhead not so good).

(I was originally horrified by the stop_machine() added in swapon and
swapoff, but perhaps I'm remembering a distant past of really stopping
the machine: stop_machine() today looked reasonable, something to avoid
generally like lru_add_drain_all(), but not as shameful as I thought.)

Hugh
Matthew Wilcox (Oracle) Feb. 4, 2019, 10:26 p.m. UTC | #12
On Mon, Feb 04, 2019 at 01:37:00PM -0800, Hugh Dickins wrote:
> On Thu, 31 Jan 2019, Andrew Morton wrote:
> > On Thu, 31 Jan 2019 10:48:29 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:
> > > Andrew Morton <akpm@linux-foundation.org> writes:
> > > > mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch is very
> > > > stuck so can you please redo this against mainline?
> > 
> > I have no evidence that it has been reviewed, for a start.  I've asked
> > Hugh to look at it.
> 
> I tried at the weekend.  Usual story: I don't like it at all, the
> ever-increasing complexity there, but certainly understand the need
> for that fix, and have not managed to think up anything better -
> and now I need to switch away, sorry.
> 
> The multiple dynamically allocated and freed swapper address spaces
> have indeed broken what used to make it safe.  If those imaginary
> address spaces did not have to be virtually contiguous, I'd say
> cache them and reuse them, instead of freeing.  But I don't see
> how to do that as it stands.
> 
> find_get_page(swapper_address_space(entry), swp_offset(entry)) has
> become an unsafe construct, where it used to be safe against corrupted
> page tables.  Maybe we don't care so much about crashing on corrupted
> page tables nowadays (I haven't heard recent complaints), and I think
> Huang is correct that lookup_swap_cache() and __read_swap_cache_async()
> happen to be the only instances that need to be guarded against swapoff
> (the others are working with page table locked).
> 
> The array of arrays of swapper spaces is all just to get a separate
> lock for separate extents of the swapfile: I wonder whether Matthew has
> anything in mind for that in XArray (I think Peter once got it working
> in radix-tree, but the overhead not so good).

Hi Hugh, thanks for putting me on the cc.

I've certainly noticed what's been going on with the swapper code, but
I've generally had a lack of tuits (round or otherwise) to really dig in
and figure out what's going on.  I've had some ideas about embedding a
spinlock in each leaf node (giving one lock per 64 slots), but I know I've
got about 800 things that I've actually promised to do ahead of looking
at doing that.

I have a suspicion that the swapper code could probably be replaced with
an allocating XArray (like the IDR) and it doesn't really need to be a
full on address_space, but I'm probably wrong because I haven't studied
the swap code in depth.
Huang, Ying Feb. 6, 2019, 12:14 a.m. UTC | #13
Hi, Hugh,

Hugh Dickins <hughd@google.com> writes:

> On Thu, 31 Jan 2019, Andrew Morton wrote:
>> On Thu, 31 Jan 2019 10:48:29 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:
>> > Andrew Morton <akpm@linux-foundation.org> writes:
>> > > mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch is very
>> > > stuck so can you please redo this against mainline?
>> > 
>> > Allow me to be off topic, this patch has been in mm tree for quite some
>> > time, what can I do to help this be merged upstream?
>
> Wow, yes, it's about a year old.
>
>> 
>> I have no evidence that it has been reviewed, for a start.  I've asked
>> Hugh to look at it.
>
> I tried at the weekend.  Usual story: I don't like it at all, the
> ever-increasing complexity there, but certainly understand the need
> for that fix, and have not managed to think up anything better -
> and now I need to switch away, sorry.
>
> The multiple dynamically allocated and freed swapper address spaces
> have indeed broken what used to make it safe.  If those imaginary
> address spaces did not have to be virtually contiguous, I'd say
> cache them and reuse them, instead of freeing.  But I don't see
> how to do that as it stands.
>
> find_get_page(swapper_address_space(entry), swp_offset(entry)) has
> become an unsafe construct, where it used to be safe against corrupted
> page tables.  Maybe we don't care so much about crashing on corrupted
> page tables nowadays (I haven't heard recent complaints), and I think
> Huang is correct that lookup_swap_cache() and __read_swap_cache_async()
> happen to be the only instances that need to be guarded against swapoff
> (the others are working with page table locked).
>
> The array of arrays of swapper spaces is all just to get a separate
> lock for separate extents of the swapfile: I wonder whether Matthew has
> anything in mind for that in XArray (I think Peter once got it working
> in radix-tree, but the overhead not so good).
>
> (I was originally horrified by the stop_machine() added in swapon and
> swapoff, but perhaps I'm remembering a distant past of really stopping
> the machine: stop_machine() today looked reasonable, something to avoid
> generally like lru_add_drain_all(), but not as shameful as I thought.)

Thanks a lot for your review and comments!

It appears that you have no strong objection to this patch?  Could I
have your "Acked-by"?

Best Regards,
Huang, Ying
Hugh Dickins Feb. 6, 2019, 12:36 a.m. UTC | #14
On Wed, 6 Feb 2019, Huang, Ying wrote:
> 
> Thanks a lot for your review and comments!
> 
> It appears that you have no strong objection for this patch?

That much is correct.

> Could I have your "Acked-by"?

Sorry to be so begrudging, but I have to save my Acks for when I feel
more confident in my opinion.  Here I don't think I can get beyond

Not-Nacked-by: Hugh Dickins <hughd@google.com>

I imagine Daniel would ask for some barriers in there: maybe you can
get a more generous response from him when he looks over the result.

Warmly but meanly,
Hugh
Huang, Ying Feb. 6, 2019, 12:58 a.m. UTC | #15
Hugh Dickins <hughd@google.com> writes:

> On Wed, 6 Feb 2019, Huang, Ying wrote:
>> 
>> Thanks a lot for your review and comments!
>> 
>> It appears that you have no strong objection for this patch?
>
> That much is correct.
>
>> Could I have your "Acked-by"?
>
> Sorry to be so begrudging, but I have to save my Acks for when I feel
> more confident in my opinion.  Here I don't think I can get beyond
>
> Not-Nacked-by: Hugh Dickins <hughd@google.com>
>
> I imagine Daniel would ask for some barriers in there: maybe you can
> get a more generous response from him when he looks over the result.

Thanks a lot for your help!  Will ask him help too.

Best Regards,
Huang, Ying

> Warmly but meanly,
> Hugh
Andrea Parri Feb. 8, 2019, 12:28 a.m. UTC | #16
Hi Huang, Ying,

On Mon, Feb 04, 2019 at 01:37:00PM -0800, Hugh Dickins wrote:
> On Thu, 31 Jan 2019, Andrew Morton wrote:
> > On Thu, 31 Jan 2019 10:48:29 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:
> > > Andrew Morton <akpm@linux-foundation.org> writes:
> > > > mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch is very
> > > > stuck so can you please redo this against mainline?
> > > 
> > > Allow me to be off topic, this patch has been in mm tree for quite some
> > > time, what can I do to help this be merged upstream?

[...]


> 
> Wow, yes, it's about a year old.
> 
> > 
> > I have no evidence that it has been reviewed, for a start.  I've asked
> > Hugh to look at it.
> 
> I tried at the weekend.  Usual story: I don't like it at all, the
> ever-increasing complexity there, but certainly understand the need
> for that fix, and have not managed to think up anything better -
> and now I need to switch away, sorry.

FWIW, I do agree with Hugh about "the need for that fix": AFAIU, that
(mainline) code is naively buggy _and_ "this patch":

  http://lkml.kernel.org/r/20180223060010.954-1-ying.huang@intel.com

"redone on top of mainline" seems both correct and appropriate to me.


> (I was originally horrified by the stop_machine() added in swapon and
> swapoff, but perhaps I'm remembering a distant past of really stopping
> the machine: stop_machine() today looked reasonable, something to avoid
> generally like lru_add_drain_all(), but not as shameful as I thought.)

As far as I can find on LKML, we have (at least!) three different fixes:

  1. refcount(-based),
  2. RCU,
  3. stop_machine();

(3) appears to be the least documented/relied-upon/tested among these;
I'm not aware of definitive reasons forcing us to reject (1) and (2).

  Andrea


> 
> Hugh
Huang, Ying Feb. 11, 2019, 1:02 a.m. UTC | #17
Andrea Parri <andrea.parri@amarulasolutions.com> writes:

> Hi Huang, Ying,
>
> On Mon, Feb 04, 2019 at 01:37:00PM -0800, Hugh Dickins wrote:
>> On Thu, 31 Jan 2019, Andrew Morton wrote:
>> > On Thu, 31 Jan 2019 10:48:29 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:
>> > > Andrew Morton <akpm@linux-foundation.org> writes:
>> > > > mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch is very
>> > > > stuck so can you please redo this against mainline?
>> > > 
>> > > Allow me to be off topic, this patch has been in mm tree for quite some
>> > > time, what can I do to help this be merged upstream?
>
> [...]
>
>
>> 
>> Wow, yes, it's about a year old.
>> 
>> > 
>> > I have no evidence that it has been reviewed, for a start.  I've asked
>> > Hugh to look at it.
>> 
>> I tried at the weekend.  Usual story: I don't like it at all, the
>> ever-increasing complexity there, but certainly understand the need
>> for that fix, and have not managed to think up anything better -
>> and now I need to switch away, sorry.
>
> FWIW, I do agree with Hugh about "the need for that fix": AFAIU, that
> (mainline) code is naively buggy _and_ "this patch":
>
>   http://lkml.kernel.org/r/20180223060010.954-1-ying.huang@intel.com
>
> "redone on top of mainline" seems both correct and appropriate to me.

Thanks!  Since the patch needs to go through the -mm tree, I will
rebase it on top of the head of the -mm tree.
>
>> (I was originally horrified by the stop_machine() added in swapon and
>> swapoff, but perhaps I'm remembering a distant past of really stopping
>> the machine: stop_machine() today looked reasonable, something to avoid
>> generally like lru_add_drain_all(), but not as shameful as I thought.)
>
> As far as I can find on LKML, we have (at least!) three different fixes:
>
>   1. refcount(-based),
>   2. RCU,
>   3. stop_machine();
>
> (3) appears to be the least documented/relied-upon/tested among these;
> I'm not aware of definitive reasons forcing us to reject (1) and (2).

Because swapoff() is a really cold path, while the page fault handler is
a really hot path, (3) can minimize the overhead of the hot path.
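
A loose userspace sketch of that trade-off (C11 atomics, invented names;
real stop_machine() works differently, by running a thread on every CPU):
readers pay only a cheap per-thread flag on entry/exit, while the cold-path
writer absorbs all the waiting, the way swapoff would.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NREADERS 4
static atomic_bool in_cs[NREADERS];	/* "reader inside critical section" */

/* Hot path: one atomic store on entry and one on exit, nothing more. */
void reader_enter(int id)
{
	atomic_store_explicit(&in_cs[id], true, memory_order_seq_cst);
}

void reader_exit(int id)
{
	atomic_store_explicit(&in_cs[id], false, memory_order_release);
}

/* Cold path (the "swapoff" side): spin until every reader has left. */
void writer_wait_readers(void)
{
	for (int i = 0; i < NREADERS; i++)
		while (atomic_load_explicit(&in_cs[i], memory_order_acquire))
			;	/* busy-wait is acceptable on a cold path */
}
```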

Best Regards,
Huang, Ying

>   Andrea
>
>
>> 
>> Hugh
Patch

diff --git a/mm/swapfile.c b/mm/swapfile.c
index f0edf7244256..dad52fc67045 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -99,6 +99,15 @@  static atomic_t proc_poll_event = ATOMIC_INIT(0);
 
 atomic_t nr_rotate_swap = ATOMIC_INIT(0);
 
+static struct swap_info_struct *swap_type_to_swap_info(int type)
+{
+	if (type >= READ_ONCE(nr_swapfiles))
+		return NULL;
+
+	smp_rmb();	/* Pairs with smp_wmb in alloc_swap_info. */
+	return READ_ONCE(swap_info[type]);
+}
+
 static inline unsigned char swap_count(unsigned char ent)
 {
 	return ent & ~SWAP_HAS_CACHE;	/* may include COUNT_CONTINUED flag */
@@ -1045,12 +1054,14 @@  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 /* The only caller of this function is now suspend routine */
 swp_entry_t get_swap_page_of_type(int type)
 {
-	struct swap_info_struct *si;
+	struct swap_info_struct *si = swap_type_to_swap_info(type);
 	pgoff_t offset;
 
-	si = swap_info[type];
+	if (!si)
+		goto fail;
+
 	spin_lock(&si->lock);
-	if (si && (si->flags & SWP_WRITEOK)) {
+	if (si->flags & SWP_WRITEOK) {
 		atomic_long_dec(&nr_swap_pages);
 		/* This is called for allocating swap entry, not cache */
 		offset = scan_swap_map(si, 1);
@@ -1061,6 +1072,7 @@  swp_entry_t get_swap_page_of_type(int type)
 		atomic_long_inc(&nr_swap_pages);
 	}
 	spin_unlock(&si->lock);
+fail:
 	return (swp_entry_t) {0};
 }
 
@@ -1072,9 +1084,9 @@  static struct swap_info_struct *__swap_info_get(swp_entry_t entry)
 	if (!entry.val)
 		goto out;
 	type = swp_type(entry);
-	if (type >= nr_swapfiles)
+	p = swap_type_to_swap_info(type);
+	if (!p)
 		goto bad_nofile;
-	p = swap_info[type];
 	if (!(p->flags & SWP_USED))
 		goto bad_device;
 	offset = swp_offset(entry);
@@ -1212,9 +1224,9 @@  struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	if (!entry.val)
 		goto out;
 	type = swp_type(entry);
-	if (type >= nr_swapfiles)
+	si = swap_type_to_swap_info(type);
+	if (!si)
 		goto bad_nofile;
-	si = swap_info[type];
 
 	preempt_disable();
 	if (!(si->flags & SWP_VALID))
@@ -1765,10 +1777,9 @@  int swap_type_of(dev_t device, sector_t offset, struct block_device **bdev_p)
 sector_t swapdev_block(int type, pgoff_t offset)
 {
 	struct block_device *bdev;
+	struct swap_info_struct *si = swap_type_to_swap_info(type);
 
-	if ((unsigned int)type >= nr_swapfiles)
-		return 0;
-	if (!(swap_info[type]->flags & SWP_WRITEOK))
+	if (!si || !(si->flags & SWP_WRITEOK))
 		return 0;
 	return map_swap_entry(swp_entry(type, offset), &bdev);
 }
@@ -2799,9 +2810,9 @@  static void *swap_start(struct seq_file *swap, loff_t *pos)
 	if (!l)
 		return SEQ_START_TOKEN;
 
-	for (type = 0; type < nr_swapfiles; type++) {
+	for (type = 0; type < READ_ONCE(nr_swapfiles); type++) {
 		smp_rmb();	/* read nr_swapfiles before swap_info[type] */
-		si = swap_info[type];
+		si = READ_ONCE(swap_info[type]);
 		if (!(si->flags & SWP_USED) || !si->swap_map)
 			continue;
 		if (!--l)
@@ -2821,9 +2832,9 @@  static void *swap_next(struct seq_file *swap, void *v, loff_t *pos)
 	else
 		type = si->type + 1;
 
-	for (; type < nr_swapfiles; type++) {
+	for (; type < READ_ONCE(nr_swapfiles); type++) {
 		smp_rmb();	/* read nr_swapfiles before swap_info[type] */
-		si = swap_info[type];
+		si = READ_ONCE(swap_info[type]);
 		if (!(si->flags & SWP_USED) || !si->swap_map)
 			continue;
 		++*pos;
@@ -2930,14 +2941,14 @@  static struct swap_info_struct *alloc_swap_info(void)
 	}
 	if (type >= nr_swapfiles) {
 		p->type = type;
-		swap_info[type] = p;
+		WRITE_ONCE(swap_info[type], p);
 		/*
 		 * Write swap_info[type] before nr_swapfiles, in case a
 		 * racing procfs swap_start() or swap_next() is reading them.
 		 * (We never shrink nr_swapfiles, we never free this entry.)
 		 */
 		smp_wmb();
-		nr_swapfiles++;
+		WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
 	} else {
 		kvfree(p);
 		p = swap_info[type];
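
The smp_wmb()/smp_rmb() pairing in the patch above is the standard
publish pattern: write the slot, then the count, and read them in the
opposite order. A minimal userspace C11 analogue (names invented for
illustration), using a release store and an acquire load instead of the
kernel's explicit barriers:

```c
#include <stdatomic.h>
#include <stddef.h>

struct swap_info { int type; };

#define MAX_SWAPFILES 32
static struct swap_info *table[MAX_SWAPFILES];	/* analogue of swap_info[] */
static atomic_int nr_entries;			/* analogue of nr_swapfiles */

/* Publisher side, like alloc_swap_info: the release store orders the
 * table[] write before the count update (WRITE_ONCE + smp_wmb). */
void publish(struct swap_info *p)
{
	int n = atomic_load_explicit(&nr_entries, memory_order_relaxed);

	table[n] = p;
	atomic_store_explicit(&nr_entries, n + 1, memory_order_release);
}

/* Consumer side, like swap_type_to_swap_info: the acquire load orders
 * the count read before the table[] read (READ_ONCE + smp_rmb), so a
 * type that passes the bounds check cannot yield a NULL slot. */
struct swap_info *lookup(int type)
{
	if (type >= atomic_load_explicit(&nr_entries, memory_order_acquire))
		return NULL;
	return table[type];
}
```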