Message ID: 20190115002305.15402-1-daniel.m.jordan@oracle.com (mailing list archive)
State: New, archived
Series: mm, swap: bounds check swap_info accesses to avoid NULL derefs
On Mon, Jan 14, 2019 at 07:23:05PM -0500, Daniel Jordan wrote:
> Dan Carpenter reports a potential NULL dereference in
> get_swap_page_of_type:
>
>   Smatch complains that the NULL checks on "si" aren't consistent.  This
>   seems like a real bug because we have not ensured that the type is
>   valid and so "si" can be NULL.
>
[...]
>
> This is a theoretical problem, no actual reports of it exist.
>
> Fixes: ec8acf20afb8 ("swap: add per-partition lock for swapfile")
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
[...]

Please feel free to add:

Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>

  Andrea
On Mon, 14 Jan 2019 19:23:05 -0500 Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
> Dan Carpenter reports a potential NULL dereference in
> get_swap_page_of_type:
[...]
> This is a theoretical problem, no actual reports of it exist.

LGTM, but like most people I'm afraid to ack it ;)

mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch is very
stuck so can you please redo this against mainline?
On Mon, Jan 14, 2019 at 07:23:05PM -0500, Daniel Jordan wrote:
> Dan Carpenter reports a potential NULL dereference in
> get_swap_page_of_type:
[...]
> Probably no need for stable, this is all theoretical.

The NULL dereference part is not theoretical.  It requires CAP_SYS_ADMIN
so it's not a huge deal, but you could trigger it from snapshot_ioctl()
with SNAPSHOT_ALLOC_SWAP_PAGE.

regards,
dan carpenter
On Mon, Jan 14, 2019 at 07:23:05PM -0500, Daniel Jordan wrote:
> Dan Carpenter reports a potential NULL dereference in
> get_swap_page_of_type:
[...]
> Fixes: ec8acf20afb8 ("swap: add per-partition lock for swapfile")
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>

A few comments below, but:

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

> +static struct swap_info_struct *swap_type_to_swap_info(int type)
> +{
> +	if (type >= READ_ONCE(nr_swapfiles))
> +		return NULL;
> +
> +	smp_rmb();	/* Pairs with smp_wmb in alloc_swap_info. */
> +	return READ_ONCE(swap_info[type]);
> +}

> @@ -2799,9 +2810,9 @@ static void *swap_start(struct seq_file *swap, loff_t *pos)
> 	if (!l)
> 		return SEQ_START_TOKEN;
>
> -	for (type = 0; type < nr_swapfiles; type++) {
> +	for (type = 0; type < READ_ONCE(nr_swapfiles); type++) {
> 		smp_rmb();	/* read nr_swapfiles before swap_info[type] */
> -		si = swap_info[type];
> +		si = READ_ONCE(swap_info[type]);
> 		if (!(si->flags & SWP_USED) || !si->swap_map)
> 			continue;
> 		if (!--l)
> @@ -2821,9 +2832,9 @@ static void *swap_next(struct seq_file *swap, void *v, loff_t *pos)
> 	else
> 		type = si->type + 1;
>
> -	for (; type < nr_swapfiles; type++) {
> +	for (; type < READ_ONCE(nr_swapfiles); type++) {
> 		smp_rmb();	/* read nr_swapfiles before swap_info[type] */
> -		si = swap_info[type];
> +		si = READ_ONCE(swap_info[type]);
> 		if (!(si->flags & SWP_USED) || !si->swap_map)
> 			continue;
> 		++*pos;

You could write those like:

	for (; (si = swap_type_to_swap_info(type)); type++)

> @@ -2930,14 +2941,14 @@ static struct swap_info_struct *alloc_swap_info(void)
> 	}
> 	if (type >= nr_swapfiles) {
> 		p->type = type;
> -		swap_info[type] = p;
> +		WRITE_ONCE(swap_info[type], p);
> 		/*
> 		 * Write swap_info[type] before nr_swapfiles, in case a
> 		 * racing procfs swap_start() or swap_next() is reading them.
> 		 * (We never shrink nr_swapfiles, we never free this entry.)
> 		 */
> 		smp_wmb();
> -		nr_swapfiles++;
> +		WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
> 	} else {
> 		kvfree(p);
> 		p = swap_info[type];

It is also possible to write this with smp_load_acquire() /
smp_store_release().  ARM64/RISC-V might benefit from that, OTOH ARM
won't like that much.

Dunno what would be better.
On Tue, Jan 29, 2019 at 10:26:22PM -0800, Andrew Morton wrote:
> LGTM, but like most people I'm afraid to ack it ;)
>
> mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch is very
> stuck so can you please redo this against mainline?

Yep, I can do that.
On Wed, Jan 30, 2019 at 10:28:46AM +0300, Dan Carpenter wrote:
> On Mon, Jan 14, 2019 at 07:23:05PM -0500, Daniel Jordan wrote:
> > Probably no need for stable, this is all theoretical.
>
> The NULL dereference part is not theoretical.  It requires CAP_SYS_ADMIN
> so it's not a huge deal, but you could trigger it from snapshot_ioctl()
> with SNAPSHOT_ALLOC_SWAP_PAGE.

Ok, I'll amend the changelog for v2.
On Wed, Jan 30, 2019 at 10:13:16AM +0100, Peter Zijlstra wrote:
> A few comments below, but:
>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Thanks.

> You could write those like:
>
> 	for (; (si = swap_type_to_swap_info(type)); type++)

That's clever, and way better than the ugly iterator macro I wrote and
then deleted in disgust.

> It is also possible to write this with smp_load_acquire() /
> smp_store_release().  ARM64/RISC-V might benefit from that, OTOH ARM
> won't like that much.
>
> Dunno what would be better.

I just left it as-is for now.
Andrew Morton <akpm@linux-foundation.org> writes:
> mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch is very
> stuck so can you please redo this against mainline?

Allow me to be off topic, this patch has been in mm tree for quite some
time, what can I do to help this be merged upstream?

Best Regards,
Huang, Ying
On Thu, 31 Jan 2019 10:48:29 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
> > mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch is very
> > stuck so can you please redo this against mainline?
>
> Allow me to be off topic, this patch has been in mm tree for quite some
> time, what can I do to help this be merged upstream?

I have no evidence that it has been reviewed, for a start.  I've asked
Hugh to look at it.
Andrew Morton <akpm@linux-foundation.org> writes:
> On Thu, 31 Jan 2019 10:48:29 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:
>> Allow me to be off topic, this patch has been in mm tree for quite some
>> time, what can I do to help this be merged upstream?
>
> I have no evidence that it has been reviewed, for a start.  I've asked
> Hugh to look at it.

Got it!  I will try to find some people to review it.

Best Regards,
Huang, Ying
On Thu, 31 Jan 2019, Andrew Morton wrote:
> On Thu, 31 Jan 2019 10:48:29 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:
> > Allow me to be off topic, this patch has been in mm tree for quite some
> > time, what can I do to help this be merged upstream?

Wow, yes, it's about a year old.

> I have no evidence that it has been reviewed, for a start.  I've asked
> Hugh to look at it.

I tried at the weekend.  Usual story: I don't like it at all, the
ever-increasing complexity there, but certainly understand the need
for that fix, and have not managed to think up anything better -
and now I need to switch away, sorry.

The multiple dynamically allocated and freed swapper address spaces
have indeed broken what used to make it safe.  If those imaginary
address spaces did not have to be virtually contiguous, I'd say
cache them and reuse them, instead of freeing.  But I don't see
how to do that as it stands.

find_get_page(swapper_address_space(entry), swp_offset(entry)) has
become an unsafe construct, where it used to be safe against corrupted
page tables.  Maybe we don't care so much about crashing on corrupted
page tables nowadays (I haven't heard recent complaints), and I think
Huang is correct that lookup_swap_cache() and __read_swap_cache_async()
happen to be the only instances that need to be guarded against swapoff
(the others are working with page table locked).

The array of arrays of swapper spaces is all just to get a separate
lock for separate extents of the swapfile: I wonder whether Matthew has
anything in mind for that in XArray (I think Peter once got it working
in radix-tree, but the overhead not so good).

(I was originally horrified by the stop_machine() added in swapon and
swapoff, but perhaps I'm remembering a distant past of really stopping
the machine: stop_machine() today looked reasonable, something to avoid
generally like lru_add_drain_all(), but not as shameful as I thought.)

Hugh
On Mon, Feb 04, 2019 at 01:37:00PM -0800, Hugh Dickins wrote:
[...]
> The array of arrays of swapper spaces is all just to get a separate
> lock for separate extents of the swapfile: I wonder whether Matthew has
> anything in mind for that in XArray (I think Peter once got it working
> in radix-tree, but the overhead not so good).

Hi Hugh, thanks for putting me on the cc.

I've certainly noticed what's been going on with the swapper code, but
I've generally had a lack of tuits (round or otherwise) to really dig in
and figure out what's going on.  I've had some ideas about embedding a
spinlock in each leaf node (giving one lock per 64 slots), but I know
I've got about 800 things that I've actually promised to do ahead of
looking at doing that.

I have a suspicion that the swapper code could probably be replaced with
an allocating XArray (like the IDR) and it doesn't really need to be a
full on address_space, but I'm probably wrong because I haven't studied
the swap code in depth.
Hi, Hugh,

Hugh Dickins <hughd@google.com> writes:
> I tried at the weekend.  Usual story: I don't like it at all, the
> ever-increasing complexity there, but certainly understand the need
> for that fix, and have not managed to think up anything better -
> and now I need to switch away, sorry.
[...]
> (I was originally horrified by the stop_machine() added in swapon and
> swapoff, but perhaps I'm remembering a distant past of really stopping
> the machine: stop_machine() today looked reasonable, something to avoid
> generally like lru_add_drain_all(), but not as shameful as I thought.)

Thanks a lot for your review and comments!

It appears that you have no strong objection for this patch?  Could I
have your "Acked-by"?

Best Regards,
Huang, Ying
On Wed, 6 Feb 2019, Huang, Ying wrote:
>
> Thanks a lot for your review and comments!
>
> It appears that you have no strong objection for this patch?

That much is correct.

> Could I have your "Acked-by"?

Sorry to be so begrudging, but I have to save my Acks for when I feel
more confident in my opinion.  Here I don't think I can get beyond

Not-Nacked-by: Hugh Dickins <hughd@google.com>

I imagine Daniel would ask for some barriers in there: maybe you can
get a more generous response from him when he looks over the result.

Warmly but meanly,
Hugh
Hugh Dickins <hughd@google.com> writes:
> Sorry to be so begrudging, but I have to save my Acks for when I feel
> more confident in my opinion.  Here I don't think I can get beyond
>
> Not-Nacked-by: Hugh Dickins <hughd@google.com>
>
> I imagine Daniel would ask for some barriers in there: maybe you can
> get a more generous response from him when he looks over the result.

Thanks a lot for your help!  Will ask him for help too.

Best Regards,
Huang, Ying
Hi Huang, Ying,

On Mon, Feb 04, 2019 at 01:37:00PM -0800, Hugh Dickins wrote:
> I tried at the weekend.  Usual story: I don't like it at all, the
> ever-increasing complexity there, but certainly understand the need
> for that fix, and have not managed to think up anything better -
> and now I need to switch away, sorry.

FWIW, I do agree with Hugh about "the need for that fix": AFAIU, that
(mainline) code is naively buggy _and_ "this patch":

  http://lkml.kernel.org/r/20180223060010.954-1-ying.huang@intel.com

"redone on top of mainline" seems both correct and appropriate to me.

> (I was originally horrified by the stop_machine() added in swapon and
> swapoff, but perhaps I'm remembering a distant past of really stopping
> the machine: stop_machine() today looked reasonable, something to avoid
> generally like lru_add_drain_all(), but not as shameful as I thought.)

AFAIC_find_on_LKML, we have three different fixes (at least!): resp.,

  1. refcount(-based),
  2. RCU,
  3. stop_machine();

(3) appears to be the less documented/relied-upon/tested among these;
I'm not aware of definitive reasons forcing us to reject (1) and (2).

  Andrea
Andrea Parri <andrea.parri@amarulasolutions.com> writes:
> FWIW, I do agree with Hugh about "the need for that fix": AFAIU, that
> (mainline) code is naively buggy _and_ "this patch":
>
>   http://lkml.kernel.org/r/20180223060010.954-1-ying.huang@intel.com
>
> "redone on top of mainline" seems both correct and appropriate to me.

Thanks!  Because the patch needs to go through the -mm tree, I will
rebase the patch on top of the head of the -mm tree.

> AFAIC_find_on_LKML, we have three different fixes (at least!): resp.,
>
>   1. refcount(-based),
>   2. RCU,
>   3. stop_machine();
>
> (3) appears to be the less documented/relied-upon/tested among these;
> I'm not aware of definitive reasons forcing us to reject (1) and (2).

Because swapoff() is a really cold path, while the page fault handler is
a really hot path, (3) can minimize the overhead of the hot path.

Best Regards,
Huang, Ying
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f0edf7244256..dad52fc67045 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -99,6 +99,15 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
 
 atomic_t nr_rotate_swap = ATOMIC_INIT(0);
 
+static struct swap_info_struct *swap_type_to_swap_info(int type)
+{
+	if (type >= READ_ONCE(nr_swapfiles))
+		return NULL;
+
+	smp_rmb();	/* Pairs with smp_wmb in alloc_swap_info. */
+	return READ_ONCE(swap_info[type]);
+}
+
 static inline unsigned char swap_count(unsigned char ent)
 {
 	return ent & ~SWAP_HAS_CACHE;	/* may include COUNT_CONTINUED flag */
@@ -1045,12 +1054,14 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 /* The only caller of this function is now suspend routine */
 swp_entry_t get_swap_page_of_type(int type)
 {
-	struct swap_info_struct *si;
+	struct swap_info_struct *si = swap_type_to_swap_info(type);
 	pgoff_t offset;
 
-	si = swap_info[type];
+	if (!si)
+		goto fail;
+
 	spin_lock(&si->lock);
-	if (si && (si->flags & SWP_WRITEOK)) {
+	if (si->flags & SWP_WRITEOK) {
 		atomic_long_dec(&nr_swap_pages);
 		/* This is called for allocating swap entry, not cache */
 		offset = scan_swap_map(si, 1);
@@ -1061,6 +1072,7 @@ swp_entry_t get_swap_page_of_type(int type)
 		atomic_long_inc(&nr_swap_pages);
 	}
 	spin_unlock(&si->lock);
+fail:
 	return (swp_entry_t) {0};
 }
 
@@ -1072,9 +1084,9 @@ static struct swap_info_struct *__swap_info_get(swp_entry_t entry)
 	if (!entry.val)
 		goto out;
 	type = swp_type(entry);
-	if (type >= nr_swapfiles)
+	p = swap_type_to_swap_info(type);
+	if (!p)
 		goto bad_nofile;
-	p = swap_info[type];
 	if (!(p->flags & SWP_USED))
 		goto bad_device;
 	offset = swp_offset(entry);
@@ -1212,9 +1224,9 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	if (!entry.val)
 		goto out;
 	type = swp_type(entry);
-	if (type >= nr_swapfiles)
+	si = swap_type_to_swap_info(type);
+	if (!si)
 		goto bad_nofile;
-	si = swap_info[type];
 
 	preempt_disable();
 	if (!(si->flags & SWP_VALID))
@@ -1765,10 +1777,9 @@ int swap_type_of(dev_t device, sector_t offset, struct block_device **bdev_p)
 sector_t swapdev_block(int type, pgoff_t offset)
 {
 	struct block_device *bdev;
+	struct swap_info_struct *si = swap_type_to_swap_info(type);
 
-	if ((unsigned int)type >= nr_swapfiles)
-		return 0;
-	if (!(swap_info[type]->flags & SWP_WRITEOK))
+	if (!si || !(si->flags & SWP_WRITEOK))
 		return 0;
 	return map_swap_entry(swp_entry(type, offset), &bdev);
 }
@@ -2799,9 +2810,9 @@ static void *swap_start(struct seq_file *swap, loff_t *pos)
 	if (!l)
 		return SEQ_START_TOKEN;
 
-	for (type = 0; type < nr_swapfiles; type++) {
+	for (type = 0; type < READ_ONCE(nr_swapfiles); type++) {
 		smp_rmb();	/* read nr_swapfiles before swap_info[type] */
-		si = swap_info[type];
+		si = READ_ONCE(swap_info[type]);
 		if (!(si->flags & SWP_USED) || !si->swap_map)
 			continue;
 		if (!--l)
@@ -2821,9 +2832,9 @@ static void *swap_next(struct seq_file *swap, void *v, loff_t *pos)
 	else
 		type = si->type + 1;
 
-	for (; type < nr_swapfiles; type++) {
+	for (; type < READ_ONCE(nr_swapfiles); type++) {
 		smp_rmb();	/* read nr_swapfiles before swap_info[type] */
-		si = swap_info[type];
+		si = READ_ONCE(swap_info[type]);
 		if (!(si->flags & SWP_USED) || !si->swap_map)
 			continue;
 		++*pos;
@@ -2930,14 +2941,14 @@ static struct swap_info_struct *alloc_swap_info(void)
 	}
 	if (type >= nr_swapfiles) {
 		p->type = type;
-		swap_info[type] = p;
+		WRITE_ONCE(swap_info[type], p);
 		/*
 		 * Write swap_info[type] before nr_swapfiles, in case a
 		 * racing procfs swap_start() or swap_next() is reading them.
 		 * (We never shrink nr_swapfiles, we never free this entry.)
 		 */
 		smp_wmb();
-		nr_swapfiles++;
+		WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
 	} else {
 		kvfree(p);
 		p = swap_info[type];
Dan Carpenter reports a potential NULL dereference in
get_swap_page_of_type:

    Smatch complains that the NULL checks on "si" aren't consistent.  This
    seems like a real bug because we have not ensured that the type is
    valid and so "si" can be NULL.

Add the missing check for NULL, taking care to use a read barrier to
ensure CPU1 observes CPU0's updates in the correct order:

        CPU0                           CPU1
        alloc_swap_info()              if (type >= nr_swapfiles)
          swap_info[type] = p              /* handle invalid entry */
          smp_wmb()                    smp_rmb()
          ++nr_swapfiles                 p = swap_info[type]

Without smp_rmb, CPU1 might observe CPU0's write to nr_swapfiles before
CPU0's write to swap_info[type] and read NULL from swap_info[type].

Ying Huang noticed that other places don't order these reads properly.
Introduce swap_type_to_swap_info to encourage correct usage.

Use READ_ONCE and WRITE_ONCE to follow the Linux Kernel Memory Model
(see tools/memory-model/Documentation/explanation.txt).

This ordering need not be enforced in places where swap_lock is held
(e.g. si_swapinfo) because swap_lock serializes updates to nr_swapfiles
and the swap_info array.

This is a theoretical problem; no actual reports of it exist.
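The writer/reader pairing in the diagram above can be sketched outside the kernel with C11 atomics, where atomic_thread_fence(memory_order_release) and atomic_thread_fence(memory_order_acquire) stand in for smp_wmb()/smp_rmb(), and relaxed atomic accesses stand in for WRITE_ONCE()/READ_ONCE(). This is only an illustrative analog, not kernel code; the names publish_entry, lookup_entry, table, and nr_entries are invented for the example.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_ENTRIES 32

struct entry { int type; };

static struct entry *_Atomic table[MAX_ENTRIES];
static atomic_int nr_entries;

/* Writer (cf. alloc_swap_info): fill the slot, then advertise it.
 * The release fence orders the slot store before the count store,
 * like smp_wmb() between swap_info[type] = p and ++nr_swapfiles. */
static void publish_entry(struct entry *p)
{
	int n = atomic_load_explicit(&nr_entries, memory_order_relaxed);

	p->type = n;
	atomic_store_explicit(&table[n], p, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* like smp_wmb() */
	atomic_store_explicit(&nr_entries, n + 1, memory_order_relaxed);
}

/* Reader (cf. swap_type_to_swap_info): bounds-check the count first,
 * then read the slot.  The acquire fence prevents the slot read from
 * being ordered before the count read, like smp_rmb(). */
static struct entry *lookup_entry(int type)
{
	if (type < 0 ||
	    type >= atomic_load_explicit(&nr_entries, memory_order_relaxed))
		return NULL;

	atomic_thread_fence(memory_order_acquire);	/* like smp_rmb() */
	return atomic_load_explicit(&table[type], memory_order_relaxed);
}
```

A reader that observes the incremented count is thus guaranteed to also observe the published pointer, so lookup_entry never returns a stale NULL for an index it accepted as in-bounds.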
Fixes: ec8acf20afb8 ("swap: add per-partition lock for swapfile")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>

---

I'd appreciate it if someone more familiar with memory barriers could
check this over.  Thanks.

Probably no need for stable, this is all theoretical.

Against linux-mmotm tag v5.0-rc1-mmotm-2019-01-09-13-40

 mm/swapfile.c | 43 +++++++++++++++++++++++++++----------------
 1 file changed, 27 insertions(+), 16 deletions(-)