
mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

Message ID 20200922184838.978540-1-aquini@redhat.com (mailing list archive)
State New, archived

Commit Message

Rafael Aquini Sept. 22, 2020, 6:48 p.m. UTC
The swap area descriptor only gets struct swap_cluster_info *cluster_info
allocated if the swapfile is backed by non-rotational storage.
When the swap area is laid on top of ordinary disk spindles, lock_cluster()
will naturally return NULL.

CONFIG_THP_SWAP exposes the cluster_info infrastructure to a broader set of
use cases, and split_swap_cluster(), which is the counterpart of split_huge_page()
for THPs in the swapcache, misses checking the return value of lock_cluster()
before operating on the cluster_info pointer.

This patch addresses that issue by adding a proper check for the pointer
not being NULL in the wrappers cluster_{is,clear}_huge(), in order to avoid
crashes similar to the one below:

[ 5758.157556] BUG: kernel NULL pointer dereference, address: 0000000000000007
[ 5758.165331] #PF: supervisor write access in kernel mode
[ 5758.171161] #PF: error_code(0x0002) - not-present page
[ 5758.176894] PGD 0 P4D 0
[ 5758.179721] Oops: 0002 [#1] SMP PTI
[ 5758.183614] CPU: 10 PID: 316 Comm: kswapd1 Kdump: loaded Tainted: G S               --------- ---  5.9.0-0.rc3.1.tst.el8.x86_64 #1
[ 5758.196717] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.01.0002.082220131453 08/22/2013
[ 5758.208176] RIP: 0010:split_swap_cluster+0x47/0x60
[ 5758.213522] Code: c1 e3 06 48 c1 eb 0f 48 8d 1c d8 48 89 df e8 d0 20 6a 00 80 63 07 fb 48 85 db 74 16 48 89 df c6 07 00 66 66 66 90 31 c0 5b c3 <80> 24 25 07 00 00 00 fb 31 c0 5b c3 b8 f0 ff ff ff 5b c3 66 0f 1f
[ 5758.234478] RSP: 0018:ffffb147442d7af0 EFLAGS: 00010246
[ 5758.240309] RAX: 0000000000000000 RBX: 000000000014b217 RCX: ffffb14779fd9000
[ 5758.248281] RDX: 000000000014b217 RSI: ffff9c52f2ab1400 RDI: 000000000014b217
[ 5758.256246] RBP: ffffe00c51168080 R08: ffffe00c5116fe08 R09: ffff9c52fffd3000
[ 5758.264208] R10: ffffe00c511537c8 R11: ffff9c52fffd3c90 R12: 0000000000000000
[ 5758.272172] R13: ffffe00c51170000 R14: ffffe00c51170000 R15: ffffe00c51168040
[ 5758.280134] FS:  0000000000000000(0000) GS:ffff9c52f2a80000(0000) knlGS:0000000000000000
[ 5758.289163] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5758.295575] CR2: 0000000000000007 CR3: 0000000022a0e003 CR4: 00000000000606e0
[ 5758.303538] Call Trace:
[ 5758.306273]  split_huge_page_to_list+0x88b/0x950
[ 5758.311433]  deferred_split_scan+0x1ca/0x310
[ 5758.316202]  do_shrink_slab+0x12c/0x2a0
[ 5758.320491]  shrink_slab+0x20f/0x2c0
[ 5758.324482]  shrink_node+0x240/0x6c0
[ 5758.328469]  balance_pgdat+0x2d1/0x550
[ 5758.332652]  kswapd+0x201/0x3c0
[ 5758.336157]  ? finish_wait+0x80/0x80
[ 5758.340147]  ? balance_pgdat+0x550/0x550
[ 5758.344525]  kthread+0x114/0x130
[ 5758.348126]  ? kthread_park+0x80/0x80
[ 5758.352214]  ret_from_fork+0x22/0x30
[ 5758.356203] Modules linked in: fuse zram rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp mgag200 iTCO_wdt crct10dif_pclmul iTCO_vendor_support drm_kms_helper crc32_pclmul ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops cec rapl joydev intel_cstate ipmi_si ipmi_devintf drm intel_uncore i2c_i801 ipmi_msghandler pcspkr lpc_ich mei_me i2c_smbus mei ioatdma ip_tables xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg igb ahci libahci i2c_algo_bit crc32c_intel libata dca wmi dm_mirror dm_region_hash dm_log dm_mod
[ 5758.412673] CR2: 0000000000000007
[    0.000000] Linux version 5.9.0-0.rc3.1.tst.el8.x86_64 (mockbuild@x86-vm-15.build.eng.bos.redhat.com) (gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), GNU ld version 2.30-79.el8) #1 SMP Wed Sep 9 16:03:34 EDT 2020

Fixes: 59807685a7e77 ("mm, THP, swap: support splitting THP for THP swap out")
Signed-off-by: Rafael Aquini <aquini@redhat.com>
---
 mm/swapfile.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Andrew Morton Sept. 22, 2020, 7:47 p.m. UTC | #1
On Tue, 22 Sep 2020 14:48:38 -0400 Rafael Aquini <aquini@redhat.com> wrote:

> The swap area descriptor only gets struct swap_cluster_info *cluster_info
> allocated if the swapfile is backed by non-rotational storage.
> When the swap area is laid on top of ordinary disk spindles, lock_cluster()
> will naturally return NULL.
> 
> CONFIG_THP_SWAP exposes cluster_info infrastructure to a broader number of
> use cases, and split_swap_cluster(), which is the counterpart of split_huge_page()
> for the THPs in the swapcache, misses checking the return of lock_cluster before
> operating on the cluster_info pointer.
> 
> This patch addresses that issue by adding a proper check for the pointer
> not being NULL in the wrappers cluster_{is,clear}_huge(), in order to avoid
> crashes similar to the one below:
> 
> ...
>
> Fixes: 59807685a7e77 ("mm, THP, swap: support splitting THP for THP swap out")
> Signed-off-by: Rafael Aquini <aquini@redhat.com>

Did you consider cc:stable?
Huang, Ying Sept. 23, 2020, 2:21 a.m. UTC | #2
Hi, Rafael,

Rafael Aquini <aquini@redhat.com> writes:

> The swap area descriptor only gets struct swap_cluster_info *cluster_info
> allocated if the swapfile is backed by non-rotational storage.
> When the swap area is laid on top of ordinary disk spindles, lock_cluster()
> will naturally return NULL.

Thanks for reporting.  But the bug looks strange.  Because in a system
with only HDD swap devices, during THP swap out, the swap cluster
shouldn't be allocated, as in

shrink_page_list()
  add_to_swap()
    get_swap_page()
      get_swap_pages()
        swap_alloc_cluster()

Where si->free_clusters is checked, and it should be empty for HDD.  So
in shrink_page_list(), the THP should have been split.  While in
split_huge_page_to_list(), PageSwapCache() is checked before calling
split_swap_cluster().  So this appears strange.
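
For reference, this is roughly how that check looks (paraphrased from
swap_alloc_cluster() in mm/swapfile.c, allocation details elided):

	static bool swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
	{
		...
		/* free_clusters is only populated when cluster_info is set up (SSD) */
		if (cluster_list_empty(&si->free_clusters))
			return false;
		...
		return true;
	}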

All in all, it appears that we need to find the real root cause of the
bug.

Did you test with the latest upstream kernel?  Can you help trace the
return value of swap_alloc_cluster()?  Can you share the swap device
information?

Best Regards,
Huang, Ying

> CONFIG_THP_SWAP exposes cluster_info infrastructure to a broader number of
> use cases, and split_swap_cluster(), which is the counterpart of split_huge_page()
> for the THPs in the swapcache, misses checking the return of lock_cluster before
> operating on the cluster_info pointer.
>
> This patch addresses that issue by adding a proper check for the pointer
> not being NULL in the wrappers cluster_{is,clear}_huge(), in order to avoid
> crashes similar to the one below:
>
> [ 5758.157556] BUG: kernel NULL pointer dereference, address: 0000000000000007
> [ 5758.165331] #PF: supervisor write access in kernel mode
> [ 5758.171161] #PF: error_code(0x0002) - not-present page
> [ 5758.176894] PGD 0 P4D 0
> [ 5758.179721] Oops: 0002 [#1] SMP PTI
> [ 5758.183614] CPU: 10 PID: 316 Comm: kswapd1 Kdump: loaded Tainted: G S               --------- ---  5.9.0-0.rc3.1.tst.el8.x86_64 #1
> [ 5758.196717] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.01.0002.082220131453 08/22/2013
> [ 5758.208176] RIP: 0010:split_swap_cluster+0x47/0x60
> [ 5758.213522] Code: c1 e3 06 48 c1 eb 0f 48 8d 1c d8 48 89 df e8 d0 20 6a 00 80 63 07 fb 48 85 db 74 16 48 89 df c6 07 00 66 66 66 90 31 c0 5b c3 <80> 24 25 07 00 00 00 fb 31 c0 5b c3 b8 f0 ff ff ff 5b c3 66 0f 1f
> [ 5758.234478] RSP: 0018:ffffb147442d7af0 EFLAGS: 00010246
> [ 5758.240309] RAX: 0000000000000000 RBX: 000000000014b217 RCX: ffffb14779fd9000
> [ 5758.248281] RDX: 000000000014b217 RSI: ffff9c52f2ab1400 RDI: 000000000014b217
> [ 5758.256246] RBP: ffffe00c51168080 R08: ffffe00c5116fe08 R09: ffff9c52fffd3000
> [ 5758.264208] R10: ffffe00c511537c8 R11: ffff9c52fffd3c90 R12: 0000000000000000
> [ 5758.272172] R13: ffffe00c51170000 R14: ffffe00c51170000 R15: ffffe00c51168040
> [ 5758.280134] FS:  0000000000000000(0000) GS:ffff9c52f2a80000(0000) knlGS:0000000000000000
> [ 5758.289163] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 5758.295575] CR2: 0000000000000007 CR3: 0000000022a0e003 CR4: 00000000000606e0
> [ 5758.303538] Call Trace:
> [ 5758.306273]  split_huge_page_to_list+0x88b/0x950
> [ 5758.311433]  deferred_split_scan+0x1ca/0x310
> [ 5758.316202]  do_shrink_slab+0x12c/0x2a0
> [ 5758.320491]  shrink_slab+0x20f/0x2c0
> [ 5758.324482]  shrink_node+0x240/0x6c0
> [ 5758.328469]  balance_pgdat+0x2d1/0x550
> [ 5758.332652]  kswapd+0x201/0x3c0
> [ 5758.336157]  ? finish_wait+0x80/0x80
> [ 5758.340147]  ? balance_pgdat+0x550/0x550
> [ 5758.344525]  kthread+0x114/0x130
> [ 5758.348126]  ? kthread_park+0x80/0x80
> [ 5758.352214]  ret_from_fork+0x22/0x30
> [ 5758.356203] Modules linked in: fuse zram rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp mgag200 iTCO_wdt crct10dif_pclmul iTCO_vendor_support drm_kms_helper crc32_pclmul ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops cec rapl joydev intel_cstate ipmi_si ipmi_devintf drm intel_uncore i2c_i801 ipmi_msghandler pcspkr lpc_ich mei_me i2c_smbus mei ioatdma ip_tables xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg igb ahci libahci i2c_algo_bit crc32c_intel libata dca wmi dm_mirror dm_region_hash dm_log dm_mod
> [ 5758.412673] CR2: 0000000000000007
> [    0.000000] Linux version 5.9.0-0.rc3.1.tst.el8.x86_64 (mockbuild@x86-vm-15.build.eng.bos.redhat.com) (gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), GNU ld version 2.30-79.el8) #1 SMP Wed Sep 9 16:03:34 EDT 2020
>
> Fixes: 59807685a7e77 ("mm, THP, swap: support splitting THP for THP swap out")
> Signed-off-by: Rafael Aquini <aquini@redhat.com>
> ---
>  mm/swapfile.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 12f59e641b5e..37ddf5e5c53b 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -324,14 +324,15 @@ static inline void cluster_set_null(struct swap_cluster_info *info)
>  
>  static inline bool cluster_is_huge(struct swap_cluster_info *info)
>  {
> -	if (IS_ENABLED(CONFIG_THP_SWAP))
> +	if (IS_ENABLED(CONFIG_THP_SWAP) && info)
>  		return info->flags & CLUSTER_FLAG_HUGE;
>  	return false;
>  }
>  
>  static inline void cluster_clear_huge(struct swap_cluster_info *info)
>  {
> -	info->flags &= ~CLUSTER_FLAG_HUGE;
> +	if (IS_ENABLED(CONFIG_THP_SWAP) && info)
> +		info->flags &= ~CLUSTER_FLAG_HUGE;
>  }
>  
>  static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
Rafael Aquini Sept. 23, 2020, 4:34 a.m. UTC | #3
On Wed, Sep 23, 2020 at 10:21:36AM +0800, Huang, Ying wrote:
> Hi, Rafael,
> 
> Rafael Aquini <aquini@redhat.com> writes:
> 
> > The swap area descriptor only gets struct swap_cluster_info *cluster_info
> > allocated if the swapfile is backed by non-rotational storage.
> > When the swap area is laid on top of ordinary disk spindles, lock_cluster()
> > will naturally return NULL.
> 
> Thanks for reporting.  But the bug looks strange.  Because in a system
> with only HDD swap devices, during THP swap out, the swap cluster
> shouldn't be allocated, as in
> 
> shrink_page_list()
>   add_to_swap()
>     get_swap_page()
>       get_swap_pages()
>         swap_alloc_cluster()
>

The underlying problem is that swap_info_struct.cluster_info is always NULL
in the rotational storage case. So it's easy to see that constructions
like this one, in split_swap_cluster(),

...
        ci = lock_cluster(si, offset);
        cluster_clear_huge(ci);
...

will go for a NULL pointer dereference, in that case, given that lock_cluster 
reads:

...
	struct swap_cluster_info *ci;
        ci = si->cluster_info;
        if (ci) {
                ci += offset / SWAPFILE_CLUSTER;
                spin_lock(&ci->lock);
        }
        return ci;
...
Huang, Ying Sept. 23, 2020, 5:13 a.m. UTC | #4
Rafael Aquini <aquini@redhat.com> writes:

> On Wed, Sep 23, 2020 at 10:21:36AM +0800, Huang, Ying wrote:
>> Hi, Rafael,
>> 
>> Rafael Aquini <aquini@redhat.com> writes:
>> 
>> > The swap area descriptor only gets struct swap_cluster_info *cluster_info
>> > allocated if the swapfile is backed by non-rotational storage.
>> > When the swap area is laid on top of ordinary disk spindles, lock_cluster()
>> > will naturally return NULL.
>> 
>> Thanks for reporting.  But the bug looks strange.  Because in a system
>> with only HDD swap devices, during THP swap out, the swap cluster
>> shouldn't be allocated, as in
>> 
>> shrink_page_list()
>>   add_to_swap()
>>     get_swap_page()
>>       get_swap_pages()
>>         swap_alloc_cluster()
>>
>
> The underlying problem is that swap_info_struct.cluster_info is always NULL 
> on the rotational storage case.

Yes.

> So, it's very easy to follow that constructions 
> like this one, in split_swap_cluster 
>
> ...
>         ci = lock_cluster(si, offset);
>         cluster_clear_huge(ci);
> ...
>
> will go for a NULL pointer dereference, in that case, given that lock_cluster 
> reads:
>
> ...
> 	struct swap_cluster_info *ci;
>         ci = si->cluster_info;
>         if (ci) {
>                 ci += offset / SWAPFILE_CLUSTER;
>                 spin_lock(&ci->lock);
>         }
>         return ci;
> ...

But on HDD, we shouldn't call split_swap_cluster() at all, because we
will not allocate swap cluster firstly.  So, if we run into this,
there should be some other bug, we need to figure it out.

Best Regards,
Huang, Ying
Rafael Aquini Sept. 23, 2020, 1:01 p.m. UTC | #5
On Wed, Sep 23, 2020 at 01:13:49PM +0800, Huang, Ying wrote:
> Rafael Aquini <aquini@redhat.com> writes:
> 
> > On Wed, Sep 23, 2020 at 10:21:36AM +0800, Huang, Ying wrote:
> >> Hi, Rafael,
> >> 
> >> Rafael Aquini <aquini@redhat.com> writes:
> >> 
> >> > The swap area descriptor only gets struct swap_cluster_info *cluster_info
> >> > allocated if the swapfile is backed by non-rotational storage.
> >> > When the swap area is laid on top of ordinary disk spindles, lock_cluster()
> >> > will naturally return NULL.
> >> 
> >> Thanks for reporting.  But the bug looks strange.  Because in a system
> >> with only HDD swap devices, during THP swap out, the swap cluster
> >> shouldn't be allocated, as in
> >> 
> >> shrink_page_list()
> >>   add_to_swap()
> >>     get_swap_page()
> >>       get_swap_pages()
> >>         swap_alloc_cluster()
> >>
> >
> > The underlying problem is that swap_info_struct.cluster_info is always NULL 
> > on the rotational storage case.
> 
> Yes.
> 
> > So, it's very easy to follow that constructions 
> > like this one, in split_swap_cluster 
> >
> > ...
> >         ci = lock_cluster(si, offset);
> >         cluster_clear_huge(ci);
> > ...
> >
> > will go for a NULL pointer dereference, in that case, given that lock_cluster 
> > reads:
> >
> > ...
> > 	struct swap_cluster_info *ci;
> >         ci = si->cluster_info;
> >         if (ci) {
> >                 ci += offset / SWAPFILE_CLUSTER;
> >                 spin_lock(&ci->lock);
> >         }
> >         return ci;
> > ...
> 
> But on HDD, we shouldn't call split_swap_cluster() at all, because we
> will not allocate swap cluster firstly.  So, if we run into this,
> there should be some other bug, we need to figure it out.
>

split_swap_cluster() gets called by split_huge_page_to_list()
if the page happens to be in the swapcache, and it will always
go that way, regardless of the backing storage type:

...
            __split_huge_page(page, list, end, flags);
            if (PageSwapCache(head)) {
                    swp_entry_t entry = { .val = page_private(head) };

                    ret = split_swap_cluster(entry);
            } else
                    ret = 0;
...

The problem is not about allocating the swap_cluster -- it's obviously
not allocated in these cases. The problem is that on rotational
storage you don't even have the base structure that allows you to
keep the swap clusters (cluster_info does not get allocated at all,
so si->cluster_info is always NULL).

You can argue about other bugs all you want, it doesn't change
the fact that this code is incomplete as it sits, because it 
misses checking for a real case where lock_cluster() will return NULL.
Rafael Aquini Sept. 23, 2020, 1:42 p.m. UTC | #6
On Tue, Sep 22, 2020 at 12:47:50PM -0700, Andrew Morton wrote:
> On Tue, 22 Sep 2020 14:48:38 -0400 Rafael Aquini <aquini@redhat.com> wrote:
> 
> > The swap area descriptor only gets struct swap_cluster_info *cluster_info
> > allocated if the swapfile is backed by non-rotational storage.
> > When the swap area is laid on top of ordinary disk spindles, lock_cluster()
> > will naturally return NULL.
> > 
> > CONFIG_THP_SWAP exposes cluster_info infrastructure to a broader number of
> > use cases, and split_swap_cluster(), which is the counterpart of split_huge_page()
> > for the THPs in the swapcache, misses checking the return of lock_cluster before
> > operating on the cluster_info pointer.
> > 
> > This patch addresses that issue by adding a proper check for the pointer
> > not being NULL in the wrappers cluster_{is,clear}_huge(), in order to avoid
> > crashes similar to the one below:
> > 
> > ...
> >
> > Fixes: 59807685a7e77 ("mm, THP, swap: support splitting THP for THP swap out")
> > Signed-off-by: Rafael Aquini <aquini@redhat.com>
> 
> Did you consider cc:stable?
>

UGH! I missed adding it to my cc list. Shall I just forward it, now, or
do you prefer a fresh repost?
Huang, Ying Sept. 24, 2020, 12:59 a.m. UTC | #7
Rafael Aquini <aquini@redhat.com> writes:

> On Wed, Sep 23, 2020 at 01:13:49PM +0800, Huang, Ying wrote:
>> Rafael Aquini <aquini@redhat.com> writes:
>> 
>> > On Wed, Sep 23, 2020 at 10:21:36AM +0800, Huang, Ying wrote:
>> >> Hi, Rafael,
>> >> 
>> >> Rafael Aquini <aquini@redhat.com> writes:
>> >> 
>> >> > The swap area descriptor only gets struct swap_cluster_info *cluster_info
>> >> > allocated if the swapfile is backed by non-rotational storage.
>> >> > When the swap area is laid on top of ordinary disk spindles, lock_cluster()
>> >> > will naturally return NULL.
>> >> 
>> >> Thanks for reporting.  But the bug looks strange.  Because in a system
>> >> with only HDD swap devices, during THP swap out, the swap cluster
>> >> shouldn't be allocated, as in
>> >> 
>> >> shrink_page_list()
>> >>   add_to_swap()
>> >>     get_swap_page()
>> >>       get_swap_pages()
>> >>         swap_alloc_cluster()
>> >>
>> >
>> > The underlying problem is that swap_info_struct.cluster_info is always NULL 
>> > on the rotational storage case.
>> 
>> Yes.
>> 
>> > So, it's very easy to follow that constructions 
>> > like this one, in split_swap_cluster 
>> >
>> > ...
>> >         ci = lock_cluster(si, offset);
>> >         cluster_clear_huge(ci);
>> > ...
>> >
>> > will go for a NULL pointer dereference, in that case, given that lock_cluster 
>> > reads:
>> >
>> > ...
>> > 	struct swap_cluster_info *ci;
>> >         ci = si->cluster_info;
>> >         if (ci) {
>> >                 ci += offset / SWAPFILE_CLUSTER;
>> >                 spin_lock(&ci->lock);
>> >         }
>> >         return ci;
>> > ...
>> 
>> But on HDD, we shouldn't call split_swap_cluster() at all, because we
>> will not allocate swap cluster firstly.  So, if we run into this,
>> there should be some other bug, we need to figure it out.
>>
>
> split_swap_cluster() gets called by split_huge_page_to_list(),
> if the page happens to be in the swapcache, and it will always
> go that way, regardless the backing storage type:
>
> ...
>             __split_huge_page(page, list, end, flags);
>             if (PageSwapCache(head)) {
>                     swp_entry_t entry = { .val = page_private(head) };
>
>                     ret = split_swap_cluster(entry);
>             } else
>                     ret = 0;
> ...
>
> The problem is not about allocating the swap_cluster -- it's obviously
> not allocated in these cases. The problem is that on rotational
> storage you don't even have the base structure that allows you to
> keep the swap clusters (cluster_info does not get allocated, at all,
> so si->cluster_info is always NULL)
>
> You can argue about other bugs all you want, it doesn't change
> the fact that this code is incomplete as it sits, because it 
> misses checking for a real case where lock_cluster() will return NULL.

I don't want to argue about anything.  I just want to fix the bug.  The
fix here will hide the real bug instead of fixing it.  For the situation
you described (PageSwapCache() returns true for a THP backed by a normal
swap entry (not swap cluster)), we will run into other troubles too.  So
we need to find the root cause and fix it.

Can you help me to collect more information to fix the real bug?  Or,
how to reproduce it?

Best Regards,
Huang, Ying
Rafael Aquini Sept. 24, 2020, 2:09 a.m. UTC | #8
On Thu, Sep 24, 2020 at 08:59:40AM +0800, Huang, Ying wrote:
> Rafael Aquini <aquini@redhat.com> writes:
> 
> > On Wed, Sep 23, 2020 at 01:13:49PM +0800, Huang, Ying wrote:
> >> Rafael Aquini <aquini@redhat.com> writes:
> >> 
> >> > On Wed, Sep 23, 2020 at 10:21:36AM +0800, Huang, Ying wrote:
> >> >> Hi, Rafael,
> >> >> 
> >> >> Rafael Aquini <aquini@redhat.com> writes:
> >> >> 
> >> >> > The swap area descriptor only gets struct swap_cluster_info *cluster_info
> >> >> > allocated if the swapfile is backed by non-rotational storage.
> >> >> > When the swap area is laid on top of ordinary disk spindles, lock_cluster()
> >> >> > will naturally return NULL.
> >> >> 
> >> >> Thanks for reporting.  But the bug looks strange.  Because in a system
> >> >> with only HDD swap devices, during THP swap out, the swap cluster
> >> >> shouldn't be allocated, as in
> >> >> 
> >> >> shrink_page_list()
> >> >>   add_to_swap()
> >> >>     get_swap_page()
> >> >>       get_swap_pages()
> >> >>         swap_alloc_cluster()
> >> >>
> >> >
> >> > The underlying problem is that swap_info_struct.cluster_info is always NULL 
> >> > on the rotational storage case.
> >> 
> >> Yes.
> >> 
> >> > So, it's very easy to follow that constructions 
> >> > like this one, in split_swap_cluster 
> >> >
> >> > ...
> >> >         ci = lock_cluster(si, offset);
> >> >         cluster_clear_huge(ci);
> >> > ...
> >> >
> >> > will go for a NULL pointer dereference, in that case, given that lock_cluster 
> >> > reads:
> >> >
> >> > ...
> >> > 	struct swap_cluster_info *ci;
> >> >         ci = si->cluster_info;
> >> >         if (ci) {
> >> >                 ci += offset / SWAPFILE_CLUSTER;
> >> >                 spin_lock(&ci->lock);
> >> >         }
> >> >         return ci;
> >> > ...
> >> 
> >> But on HDD, we shouldn't call split_swap_cluster() at all, because we
> >> will not allocate swap cluster firstly.  So, if we run into this,
> >> there should be some other bug, we need to figure it out.
> >>
> >
> > split_swap_cluster() gets called by split_huge_page_to_list(),
> > if the page happens to be in the swapcache, and it will always
> > go that way, regardless the backing storage type:
> >
> > ...
> >             __split_huge_page(page, list, end, flags);
> >             if (PageSwapCache(head)) {
> >                     swp_entry_t entry = { .val = page_private(head) };
> >
> >                     ret = split_swap_cluster(entry);
> >             } else
> >                     ret = 0;
> > ...
> >
> > The problem is not about allocating the swap_cluster -- it's obviously
> > not allocated in these cases. The problem is that on rotational
> > storage you don't even have the base structure that allows you to
> > keep the swap clusters (cluster_info does not get allocated, at all,
> > so si->cluster_info is always NULL)
> >
> > You can argue about other bugs all you want, it doesn't change
> > the fact that this code is incomplete as it sits, because it 
> > misses checking for a real case where lock_cluster() will return NULL.
> 
> I don't want to argue about anything.  I just want to fix the bug.  The
> fix here will hide the real bug instead of fixing it.  For the situation
> you described (PageSwapCache() returns true for a THP backed by a normal
> swap entry (not swap cluster)), we will run into other troubles too.  So
> we need to find the root cause and fix it.
>

The bug here is quite simple: split_swap_cluster() misses checking for 
lock_cluster() returning NULL before committing to change cluster_info->flags.

The fundamental problem has nothing to do with allocating, or not allocating 
a swap cluster, but it has to do with the fact that the THP deferred split scan
can transiently race with swapcache insertion, and the fact that when you run 
your swap area on rotational storage cluster_info is _always_ NULL. 
split_swap_cluster() needs to check for lock_cluster() returning NULL because 
that's one possible case, and it clearly fails to do so. 

Run a workload that causes multiple THP COWs, and add a memory hogger to create
memory pressure so you'll force the reclaimers to kick the registered
shrinkers. The trigger is not heavy swapping, and that's probably why
most swap test cases don't hit it. The window is tight, but you will get the 
NULL pointer dereference.
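
Just to illustrate the shape of such a trigger, a rough, hypothetical
userspace sketch (not our internal test; names and sizes are made up)
would look something like this, with a separate memory hog (a stress-ng
vm stressor, for instance) running alongside it:

	/* thp-cow-trigger.c: hypothetical sketch of the kind of workload
	 * described above; not the internal regression test.
	 */
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	#define REGION	(512UL << 20)	/* 512 MiB of anon memory */

	int main(void)
	{
		char *buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			return 1;
		madvise(buf, REGION, MADV_HUGEPAGE);	/* ask for THPs */
		memset(buf, 0xaa, REGION);		/* fault them in */

		for (int i = 0; i < 64; i++) {
			pid_t pid = fork();
			if (pid == 0) {
				/* sparse child writes COW the THPs, which ends
				 * up feeding the deferred split queue */
				for (size_t off = 0; off < REGION; off += 2UL << 20)
					buf[off] ^= 1;
				_exit(0);
			}
			wait(NULL);
		}
		pause();	/* keep the mappings around while the hog runs */
		return 0;
	}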

Regardless of whether you find further bugs or not, this patch is needed to
correct a blunt coding mistake.
Huang, Ying Sept. 24, 2020, 3:51 a.m. UTC | #9
Rafael Aquini <aquini@redhat.com> writes:
> The bug here is quite simple: split_swap_cluster() misses checking for
> lock_cluster() returning NULL before committing to change cluster_info->flags.

I don't think so.  We shouldn't run into this situation in the first place.  So the
"fix" hides the real bug instead of fixing it.  Just like we call
VM_BUG_ON_PAGE(!PageLocked(head), head) in split_huge_page_to_list()
instead of returning if !PageLocked(head) silently.

> The fundamental problem has nothing to do with allocating, or not allocating
> a swap cluster, but it has to do with the fact that the THP deferred split scan
> can transiently race with swapcache insertion, and the fact that when you run
> your swap area on rotational storage cluster_info is _always_ NULL.
> split_swap_cluster() needs to check for lock_cluster() returning NULL because
> that's one possible case, and it clearly fails to do so.

If there's a race, we should fix the race.  But the code path for
swapcache insertion is,

add_to_swap()
  get_swap_page() /* Return if fails to allocate */
  add_to_swap_cache()
    SetPageSwapCache()

While the code path to split THP is,

split_huge_page_to_list()
  if PageSwapCache()
    split_swap_cluster()

Both code paths are protected by the page lock.  So there should be some
other reasons to trigger the bug.
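
For reference, add_to_swap() checks the page lock up front (roughly
paraphrased from mm/swap_state.c, the rest elided):

	int add_to_swap(struct page *page)
	{
		swp_entry_t entry;

		VM_BUG_ON_PAGE(!PageLocked(page), page);
		VM_BUG_ON_PAGE(!PageUptodate(page), page);

		entry = get_swap_page(page);
		if (!entry.val)
			return 0;
		...
	}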

And again, for HDD, a THP shouldn't have PageSwapCache() set in the
first place.  If so, the bug is that the flag is set and we should fix
the setting.

> Run a workload that cause multiple THP COW, and add a memory hogger to create
> memory pressure so you'll force the reclaimers to kick the registered
> shrinkers. The trigger is not heavy swapping, and that's probably why
> most swap test cases don't hit it. The window is tight, but you will get the
> NULL pointer dereference.

Do you have a script to reproduce the bug?

> Regardless you find furhter bugs, or not, this patch is needed to correct a
> blunt coding mistake.

As above.  I don't agree with that.

Best Regards,
Huang, Ying
Rafael Aquini Sept. 24, 2020, 6:30 a.m. UTC | #10
On Thu, Sep 24, 2020 at 11:51:17AM +0800, Huang, Ying wrote:
> Rafael Aquini <aquini@redhat.com> writes:
> > The bug here is quite simple: split_swap_cluster() misses checking for
> > lock_cluster() returning NULL before committing to change cluster_info->flags.
> 
> I don't think so.  We shouldn't run into this situation firstly.  So the
> "fix" hides the real bug instead of fixing it.  Just like we call
> VM_BUG_ON_PAGE(!PageLocked(head), head) in split_huge_page_to_list()
> instead of returning if !PageLocked(head) silently.
>

Not the same thing, obviously, as you are going for an apples-to-carrots
comparison, but since you mentioned:

split_huge_page_to_list() asserts (in debug builds) *page is locked, 
and later checks if *head bears the SwapCache flag. 
deferred_split_scan(), OTOH, doesn't hand down the compound head locked, 
but the 2nd page in the group instead. 
This doesn't necessarily mean it's a problem, though, but it might help
in hitting the issue.
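
For reference, the split loop in deferred_split_scan() looks roughly like
this (paraphrased from mm/huge_memory.c, pinning and bookkeeping elided):

	list_for_each_safe(pos, next, &list) {
		/* deferred_list lives in a tail page, not in the head */
		page = list_entry((void *)pos, struct page, deferred_list);
		if (!trylock_page(page))
			goto next;
		/* split_huge_page() removes page from list on success */
		if (!split_huge_page(page))
			split++;
		unlock_page(page);
	next:
		put_page(page);
	}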

 
> > The fundamental problem has nothing to do with allocating, or not allocating
> > a swap cluster, but it has to do with the fact that the THP deferred split scan
> > can transiently race with swapcache insertion, and the fact that when you run
> > your swap area on rotational storage cluster_info is _always_ NULL.
> > split_swap_cluster() needs to check for lock_cluster() returning NULL because
> > that's one possible case, and it clearly fails to do so.
> 
> If there's a race, we should fix the race.  But the code path for
> swapcache insertion is,
> 
> add_to_swap()
>   get_swap_page() /* Return if fails to allocate */
>   add_to_swap_cache()
>     SetPageSwapCache()
> 
> While the code path to split THP is,
> 
> split_huge_page_to_list()
>   if PageSwapCache()
>     split_swap_cluster()
> 
> Both code paths are protected by the page lock.  So there should be some
> other reasons to trigger the bug.

As mentioned above, no, they do not seem to be protected (at least, not by the
same page lock, depending on the case). While add_to_swap() will ensure a
page lock on the compound head, split_huge_page_to_list() does not.


> And again, for HDD, a THP shouldn't have PageSwapCache() set at the
> first place.  If so, the bug is that the flag is set and we should fix
> the setting.
> 

I fail to follow your claim here. Where is the guarantee, in the code, that 
you'll never have a compound head in the swapcache? 

> > Run a workload that cause multiple THP COW, and add a memory hogger to create
> > memory pressure so you'll force the reclaimers to kick the registered
> > shrinkers. The trigger is not heavy swapping, and that's probably why
> > most swap test cases don't hit it. The window is tight, but you will get the
> > NULL pointer dereference.
> 
> Do you have a script to reproduce the bug?
> 

Nope, a convoluted set of internal regression tests we have usually
triggers it. In the wild, customers running HANNA are seeing it,
occasionally.

> > Regardless you find furhter bugs, or not, this patch is needed to correct a
> > blunt coding mistake.
> 
> As above.  I don't agree with that.
> 

It's OK to disagree; split_swap_cluster() still misses the cluster_info NULL check,
though.
Huang, Ying Sept. 24, 2020, 6:57 a.m. UTC | #11
Rafael Aquini <aquini@redhat.com> writes:

>> 
>> If there's a race, we should fix the race.  But the code path for
>> swapcache insertion is,
>> 
>> add_to_swap()
>>   get_swap_page() /* Return if fails to allocate */
>>   add_to_swap_cache()
>>     SetPageSwapCache()
>> 
>> While the code path to split THP is,
>> 
>> split_huge_page_to_list()
>>   if PageSwapCache()
>>     split_swap_cluster()
>> 
>> Both code paths are protected by the page lock.  So there should be some
>> other reasons to trigger the bug.
>
> As mentioned above, no they seem to not be protected (at least, not the
> same page, depending on the case). While add_to_swap() will assure a 
> page_lock on the compound head, split_huge_page_to_list() does not.
>

int split_huge_page_to_list(struct page *page, struct list_head *list)
{
	struct page *head = compound_head(page);
	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
	struct deferred_split *ds_queue = get_deferred_split_queue(head);
	struct anon_vma *anon_vma = NULL;
	struct address_space *mapping = NULL;
	int count, mapcount, extra_pins, ret;
	unsigned long flags;
	pgoff_t end;

	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
	VM_BUG_ON_PAGE(!PageLocked(head), head);

I found there's page lock checking in split_huge_page_to_list().

Best Regards,
Huang, Ying
Huang, Ying Sept. 24, 2020, 7:45 a.m. UTC | #12
Rafael Aquini <aquini@redhat.com> writes:

> On Thu, Sep 24, 2020 at 11:51:17AM +0800, Huang, Ying wrote:
>> Rafael Aquini <aquini@redhat.com> writes:
>> > The bug here is quite simple: split_swap_cluster() misses checking for
>> > lock_cluster() returning NULL before committing to change cluster_info->flags.
>> 
>> I don't think so.  We shouldn't run into this situation firstly.  So the
>> "fix" hides the real bug instead of fixing it.  Just like we call
>> VM_BUG_ON_PAGE(!PageLocked(head), head) in split_huge_page_to_list()
>> instead of returning if !PageLocked(head) silently.
>>
>
> Not the same thing, obviously, as you are going for an apples-to-carrots
> comparison, but since you mentioned:
>
> split_huge_page_to_list() asserts (in debug builds) *page is locked,

	VM_BUG_ON_PAGE(!PageLocked(head), head);

It asserts *head instead of *page.

> and later checks if *head bears the SwapCache flag. 
> deferred_split_scan(), OTOH, doesn't hand down the compound head locked, 
> but the 2nd page in the group instead.

No.  deferred_split_scan() will call trylock_page() on the 2nd page in
the group, but

static inline int trylock_page(struct page *page)
{
	page = compound_head(page);
	return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
}

So the head page will be locked instead.

> This doesn't necessarely means it's a problem, though, but might help
> on hitting the issue. 
>  
>> > The fundamental problem has nothing to do with allocating, or not allocating
>> > a swap cluster, but it has to do with the fact that the THP deferred split scan
>> > can transiently race with swapcache insertion, and the fact that when you run
>> > your swap area on rotational storage cluster_info is _always_ NULL.
>> > split_swap_cluster() needs to check for lock_cluster() returning NULL because
>> > that's one possible case, and it clearly fails to do so.
>> 
>> If there's a race, we should fix the race.  But the code path for
>> swapcache insertion is,
>> 
>> add_to_swap()
>>   get_swap_page() /* Return if fails to allocate */
>>   add_to_swap_cache()
>>     SetPageSwapCache()
>> 
>> While the code path to split THP is,
>> 
>> split_huge_page_to_list()
>>   if PageSwapCache()
>>     split_swap_cluster()
>> 
>> Both code paths are protected by the page lock.  So there should be some
>> other reasons to trigger the bug.
>
> As mentioned above, no they seem to not be protected (at least, not the
> same page, depending on the case). While add_to_swap() will assure a 
> page_lock on the compound head, split_huge_page_to_list() does not.
>
>
>> And again, for HDD, a THP shouldn't have PageSwapCache() set at the
>> first place.  If so, the bug is that the flag is set and we should fix
>> the setting.
>> 
>
> I fail to follow your claim here. Where is the guarantee, in the code, that 
> you'll never have a compound head in the swapcache? 

We may have a THP in the swap cache only if a non-rotational disk is used
as the swap device.  This is the design assumption of the THP swap support.
And this is guaranteed because swap space allocation for a THP will fail for
HDD.  If the implementation doesn't guarantee this, we will fix the
implementation to guarantee this.

>> > Run a workload that cause multiple THP COW, and add a memory hogger to create
>> > memory pressure so you'll force the reclaimers to kick the registered
>> > shrinkers. The trigger is not heavy swapping, and that's probably why
>> > most swap test cases don't hit it. The window is tight, but you will get the
>> > NULL pointer dereference.
>> 
>> Do you have a script to reproduce the bug?
>> 
>
> Nope, a convoluted set of internal regression tests we have usually
> triggers it. In the wild, customers running HANNA are seeing it,
> occasionally.

So you haven't reproduced the bug on an upstream kernel?

Or, can you help run the test with a debug kernel based on the upstream
kernel?  I can provide a debug patch.

>> > Regardless you find furhter bugs, or not, this patch is needed to correct a
>> > blunt coding mistake.
>> 
>> As above.  I don't agree with that.
>> 
>
> It's OK to disagree, split_swap_cluster still misses the cluster_info NULL check,
> though.

In contrast, if the checking is necessary, we shouldn't ignore it, but
use something like

        ci = lock_cluster(si, offset);
+       VM_BUG_ON(!ci);
	cluster_clear_huge(ci);

in split_swap_cluster() to enforce the check and report the bug as early
as possible.  But this appears unnecessary now, because the NULL access in
cluster_clear_huge() already does that.

Best Regards,
Huang, Ying
Rafael Aquini Sept. 24, 2020, 3:08 p.m. UTC | #13
On Thu, Sep 24, 2020 at 03:45:52PM +0800, Huang, Ying wrote:
> Rafael Aquini <aquini@redhat.com> writes:
> 
> > On Thu, Sep 24, 2020 at 11:51:17AM +0800, Huang, Ying wrote:
> >> Rafael Aquini <aquini@redhat.com> writes:
> >> > The bug here is quite simple: split_swap_cluster() misses checking for
> >> > lock_cluster() returning NULL before committing to change cluster_info->flags.
> >> 
> >> I don't think so.  We shouldn't run into this situation firstly.  So the
> >> "fix" hides the real bug instead of fixing it.  Just like we call
> >> VM_BUG_ON_PAGE(!PageLocked(head), head) in split_huge_page_to_list()
> >> instead of returning if !PageLocked(head) silently.
> >>
> >
> > Not the same thing, obviously, as you are going for an apples-to-carrots
> > comparison, but since you mentioned:
> >
> > split_huge_page_to_list() asserts (in debug builds) *page is locked,
> 
> 	VM_BUG_ON_PAGE(!PageLocked(head), head);
> 
> It asserts *head instead of *page.
>
> > and later checks if *head bears the SwapCache flag. 
> > deferred_split_scan(), OTOH, doesn't hand down the compound head locked, 
> > but the 2nd page in the group instead.
> 
> No.  deferred_split_scan() will can trylock_page() on the 2nd page in
> the group, but
> 
> static inline int trylock_page(struct page *page)
> {
> 	page = compound_head(page);
> 	return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
> }
> 
> So the head page will be locked instead.
> 

Yep, missed that. Thanks for straightening me out on this one.


> > This doesn't necessarely means it's a problem, though, but might help
> > on hitting the issue. 
> >  
> >> > The fundamental problem has nothing to do with allocating, or not allocating
> >> > a swap cluster, but it has to do with the fact that the THP deferred split scan
> >> > can transiently race with swapcache insertion, and the fact that when you run
> >> > your swap area on rotational storage cluster_info is _always_ NULL.
> >> > split_swap_cluster() needs to check for lock_cluster() returning NULL because
> >> > that's one possible case, and it clearly fails to do so.
> >> 
> >> If there's a race, we should fix the race.  But the code path for
> >> swapcache insertion is,
> >> 
> >> add_to_swap()
> >>   get_swap_page() /* Return if fails to allocate */
> >>   add_to_swap_cache()
> >>     SetPageSwapCache()
> >> 
> >> While the code path to split THP is,
> >> 
> >> split_huge_page_to_list()
> >>   if PageSwapCache()
> >>     split_swap_cluster()
> >> 
> >> Both code paths are protected by the page lock.  So there should be some
> >> other reasons to trigger the bug.
> >
> > As mentioned above, no they seem to not be protected (at least, not the
> > same page, depending on the case). While add_to_swap() will assure a 
> > page_lock on the compound head, split_huge_page_to_list() does not.
> >
> >
> >> And again, for HDD, a THP shouldn't have PageSwapCache() set at the
> >> first place.  If so, the bug is that the flag is set and we should fix
> >> the setting.
> >> 
> >
> > I fail to follow your claim here. Where is the guarantee, in the code, that 
> > you'll never have a compound head in the swapcache? 
> 
> We may have a THP in the swap cache, only if non-rotational disk is used
> as swap device.  This is the design assumption of the THP swap support.
> And this is guaranteed via swap space allocation for THP will fail for
> HDD.  If the implementation doesn't guarantee this, we will fix the
> implementation to guarantee this.
> 
> >> > Run a workload that cause multiple THP COW, and add a memory hogger to create
> >> > memory pressure so you'll force the reclaimers to kick the registered
> >> > shrinkers. The trigger is not heavy swapping, and that's probably why
> >> > most swap test cases don't hit it. The window is tight, but you will get the
> >> > NULL pointer dereference.
> >> 
> >> Do you have a script to reproduce the bug?
> >> 
> >
> > Nope, a convoluted set of internal regression tests we have usually
> > triggers it. In the wild, customers running HANNA are seeing it,
> > occasionally.
> 
> So you haven't reproduce the bug on upstream kernel?
> 

Have you seen the stack dump in the patch? It still reproduces with v5.9,
even though the rate is a lot lower than with earlier kernels.


> Or, can you help to run the test with a debug kernel based on upstream
> kernel.  I can provide some debug patch.
> 

Sure, I can set your patches to run with the test cases we have that tend to 
reproduce the issue with some degree of success.


> >> > Regardless you find furhter bugs, or not, this patch is needed to correct a
> >> > blunt coding mistake.
> >> 
> >> As above.  I don't agree with that.
> >> 
> >
> > It's OK to disagree, split_swap_cluster still misses the cluster_info NULL check,
> > though.
> 
> In contrast, if the checking is necessary, we shouldn't ignore it, but
> use something like
> 
>         ci = lock_cluster(si, offset);
> +       VM_BUG_ON(!ci);

Wrong. This will still allow a NULL pointer dereference on non-debug builds.
If ci can be NULL -- and it clearly can -- we need to protect
cluster_clear_huge(ci) against that.
Andrew Morton Sept. 25, 2020, 2:59 a.m. UTC | #14
On Wed, 23 Sep 2020 09:42:51 -0400 Rafael Aquini <aquini@redhat.com> wrote:

> On Tue, Sep 22, 2020 at 12:47:50PM -0700, Andrew Morton wrote:
> > On Tue, 22 Sep 2020 14:48:38 -0400 Rafael Aquini <aquini@redhat.com> wrote:
> > 
> > > The swap area descriptor only gets struct swap_cluster_info *cluster_info
> > > allocated if the swapfile is backed by non-rotational storage.
> > > When the swap area is laid on top of ordinary disk spindles, lock_cluster()
> > > will naturally return NULL.
> > > 
> > > CONFIG_THP_SWAP exposes cluster_info infrastructure to a broader number of
> > > use cases, and split_swap_cluster(), which is the counterpart of split_huge_page()
> > > for the THPs in the swapcache, misses checking the return of lock_cluster before
> > > operating on the cluster_info pointer.
> > > 
> > > This patch addresses that issue by adding a proper check for the pointer
> > > not being NULL in the wrappers cluster_{is,clear}_huge(), in order to avoid
> > > crashes similar to the one below:
> > > 
> > > ...
> > >
> > > Fixes: 59807685a7e77 ("mm, THP, swap: support splitting THP for THP swap out")
> > > Signed-off-by: Rafael Aquini <aquini@redhat.com>
> > 
> > Did you consider cc:stable?
> >
> 
> UGH! I missed adding it to my cc list. Shall I just forward it, now, or
> do you prefer a fresh repost?

I added the cc:stable to my copy.
Huang, Ying Sept. 25, 2020, 3:06 a.m. UTC | #15
Hi, Andrew,

Andrew Morton <akpm@linux-foundation.org> writes:

> On Wed, 23 Sep 2020 09:42:51 -0400 Rafael Aquini <aquini@redhat.com> wrote:
>
>> On Tue, Sep 22, 2020 at 12:47:50PM -0700, Andrew Morton wrote:
>> > On Tue, 22 Sep 2020 14:48:38 -0400 Rafael Aquini <aquini@redhat.com> wrote:
>> > 
>> > > The swap area descriptor only gets struct swap_cluster_info *cluster_info
>> > > allocated if the swapfile is backed by non-rotational storage.
>> > > When the swap area is laid on top of ordinary disk spindles, lock_cluster()
>> > > will naturally return NULL.
>> > > 
>> > > CONFIG_THP_SWAP exposes cluster_info infrastructure to a broader number of
>> > > use cases, and split_swap_cluster(), which is the counterpart of split_huge_page()
>> > > for the THPs in the swapcache, misses checking the return of lock_cluster before
>> > > operating on the cluster_info pointer.
>> > > 
>> > > This patch addresses that issue by adding a proper check for the pointer
>> > > not being NULL in the wrappers cluster_{is,clear}_huge(), in order to avoid
>> > > crashes similar to the one below:
>> > > 
>> > > ...
>> > >
>> > > Fixes: 59807685a7e77 ("mm, THP, swap: support splitting THP for THP swap out")
>> > > Signed-off-by: Rafael Aquini <aquini@redhat.com>
>> > 
>> > Did you consider cc:stable?
>> >
>> 
>> UGH! I missed adding it to my cc list. Shall I just forward it, now, or
>> do you prefer a fresh repost?
>
> I added the cc:stable to my copy.

Please don't merge this patch.  This patch doesn't fix the bug, but hides
the real bug.  I will work with Rafael on root causing and fixing it.

Best Regards,
Huang, Ying
Andrew Morton Sept. 25, 2020, 3:10 a.m. UTC | #16
On Fri, 25 Sep 2020 11:06:53 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:

> >> UGH! I missed adding it to my cc list. Shall I just forward it, now, or
> >> do you prefer a fresh repost?
> >
> > I added the cc:stable to my copy.
> 
> Please don't merge this patch.  This patch doesn't fix the bug, but hide
> the real bug.  I will work with Rafael on root causing and fixing.

Yup, I saw the discussion and have put the patch on hold pending
the outcome.
Huang, Ying Sept. 25, 2020, 3:21 a.m. UTC | #17
Rafael Aquini <aquini@redhat.com> writes:
>> Or, can you help to run the test with a debug kernel based on upstream
>> kernel.  I can provide some debug patch.
>> 
>
> Sure, I can set your patches to run with the test cases we have that tend to 
> reproduce the issue with some degree of success.

Thanks!

I found a race condition.  During THP splitting, "head" may be unlocked
before calling split_swap_cluster(), because head != page during
deferred splitting.  So we should call split_swap_cluster() before
unlocking.  The debug patch to do that is as below.  Can you help to
test it?

Best Regards,
Huang, Ying

------------------------8<----------------------------
From 24ce0736a9f587d2dba12f12491c88d3e296a491 Mon Sep 17 00:00:00 2001
From: Huang Ying <ying.huang@intel.com>
Date: Fri, 25 Sep 2020 11:10:56 +0800
Subject: [PATCH] dbg: Call split_swap_clsuter() before unlock page during
 split THP

---
 mm/huge_memory.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index faadc449cca5..8d79e5e6b46e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2444,6 +2444,12 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 	remap_page(head);
 
+	if (PageSwapCache(head)) {
+		swp_entry_t entry = { .val = page_private(head) };
+
+		split_swap_cluster(entry);
+	}
+
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		struct page *subpage = head + i;
 		if (subpage == page)
@@ -2678,12 +2684,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		}
 
 		__split_huge_page(page, list, end, flags);
-		if (PageSwapCache(head)) {
-			swp_entry_t entry = { .val = page_private(head) };
-
-			ret = split_swap_cluster(entry);
-		} else
-			ret = 0;
+		ret = 0;
 	} else {
 		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
 			pr_alert("total_mapcount: %u, page_count(): %u\n",
Rafael Aquini Sept. 26, 2020, 3:16 p.m. UTC | #18
On Fri, Sep 25, 2020 at 11:21:58AM +0800, Huang, Ying wrote:
> Rafael Aquini <aquini@redhat.com> writes:
> >> Or, can you help to run the test with a debug kernel based on upstream
> >> kernel.  I can provide some debug patch.
> >> 
> >
> > Sure, I can set your patches to run with the test cases we have that tend to 
> > reproduce the issue with some degree of success.
> 
> Thanks!
> 
> I found a race condition.  During THP splitting, "head" may be unlocked
> before calling split_swap_cluster(), because head != page during
> deferred splitting.  So we should call split_swap_cluster() before
> unlocking.  The debug patch to do that is as below.  Can you help to
> test it?
>


I finally could grab a good crashdump and confirm that head is really
not locked. I still need to dig into it to figure out more about the
crash. I guess your patch will guarantee the lock on head, but
it still doesn't help explain how we got the THP marked as
PG_swapcache, given that it should fail in add_to_swap()->get_swap_page(),
right?

I'll give your patch a run over the weekend, hopefully we'll have more
info on this next week.

  
> Best Regards,
> Huang, Ying
> 
> ------------------------8<----------------------------
> From 24ce0736a9f587d2dba12f12491c88d3e296a491 Mon Sep 17 00:00:00 2001
> From: Huang Ying <ying.huang@intel.com>
> Date: Fri, 25 Sep 2020 11:10:56 +0800
> Subject: [PATCH] dbg: Call split_swap_clsuter() before unlock page during
>  split THP
> 
> ---
>  mm/huge_memory.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index faadc449cca5..8d79e5e6b46e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2444,6 +2444,12 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  
>  	remap_page(head);
>  
> +	if (PageSwapCache(head)) {
> +		swp_entry_t entry = { .val = page_private(head) };
> +
> +		split_swap_cluster(entry);
> +	}
> +
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
>  		struct page *subpage = head + i;
>  		if (subpage == page)
> @@ -2678,12 +2684,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  		}
>  
>  		__split_huge_page(page, list, end, flags);
> -		if (PageSwapCache(head)) {
> -			swp_entry_t entry = { .val = page_private(head) };
> -
> -			ret = split_swap_cluster(entry);
> -		} else
> -			ret = 0;
> +		ret = 0;
>  	} else {
>  		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
>  			pr_alert("total_mapcount: %u, page_count(): %u\n",
> -- 
> 2.28.0
>
Huang, Ying Sept. 27, 2020, 5:33 a.m. UTC | #19
Rafael Aquini <aquini@redhat.com> writes:

> On Fri, Sep 25, 2020 at 11:21:58AM +0800, Huang, Ying wrote:
>> Rafael Aquini <aquini@redhat.com> writes:
>> >> Or, can you help to run the test with a debug kernel based on upstream
>> >> kernel.  I can provide some debug patch.
>> >> 
>> >
>> > Sure, I can set your patches to run with the test cases we have that tend to 
>> > reproduce the issue with some degree of success.
>> 
>> Thanks!
>> 
>> I found a race condition.  During THP splitting, "head" may be unlocked
>> before calling split_swap_cluster(), because head != page during
>> deferred splitting.  So we should call split_swap_cluster() before
>> unlocking.  The debug patch to do that is as below.  Can you help to
>> test it?
>>
>
>
> I finally could grab a good crashdump and confirm that head is really
> not locked.

Thanks!  That's really helpful for us to root cause the bug.

> I still need to dig into it to figure out more about the
> crash. I guess that your patch will guarantee that lock on head, but
> it still doesn't help on explaining how did we get the THP marked as 
> PG_swapcache, given that it should fail add_to_swap()->get_swap_page()
> right? 

Because ClearPageCompound(head) is called in __split_huge_page(),
all subpages except "page" are unlocked afterwards.  So previously, when
split_swap_cluster() was called in split_huge_page_to_list(), the THP had
already been split and "head" may have been unlocked.  Then the normal page
"head" could be added to the swap cache.

CPU1                                                             CPU2
----                                                             ----
deferred_split_scan()
  split_huge_page(page) /* page isn't compound head */
    split_huge_page_to_list(page, NULL)
      __split_huge_page(page, )
        ClearPageCompound(head)
        /* unlock all subpages except page (not head) */
                                                                 add_to_swap(head)  /* not THP */
                                                                   get_swap_page(head)
                                                                   add_to_swap_cache(head, )
                                                                     SetPageSwapCache(head)
     if PageSwapCache(head)
       split_swap_cluster(/* swap entry of head */)
         /* Deref sis->cluster_info: NULL accessing! */

> I'll give your patch a run over the weekend, hopefully we'll have more
> info on this next week.

Thanks!

Best Regards,
Huang, Ying

>> Best Regards,
>> Huang, Ying
>> 
>> ------------------------8<----------------------------
>> From 24ce0736a9f587d2dba12f12491c88d3e296a491 Mon Sep 17 00:00:00 2001
>> From: Huang Ying <ying.huang@intel.com>
>> Date: Fri, 25 Sep 2020 11:10:56 +0800
>> Subject: [PATCH] dbg: Call split_swap_clsuter() before unlock page during
>>  split THP
>> 
>> ---
>>  mm/huge_memory.c | 13 +++++++------
>>  1 file changed, 7 insertions(+), 6 deletions(-)
>> 
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index faadc449cca5..8d79e5e6b46e 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2444,6 +2444,12 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>  
>>  	remap_page(head);
>>  
>> +	if (PageSwapCache(head)) {
>> +		swp_entry_t entry = { .val = page_private(head) };
>> +
>> +		split_swap_cluster(entry);
>> +	}
>> +
>>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
>>  		struct page *subpage = head + i;
>>  		if (subpage == page)
>> @@ -2678,12 +2684,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>  		}
>>  
>>  		__split_huge_page(page, list, end, flags);
>> -		if (PageSwapCache(head)) {
>> -			swp_entry_t entry = { .val = page_private(head) };
>> -
>> -			ret = split_swap_cluster(entry);
>> -		} else
>> -			ret = 0;
>> +		ret = 0;
>>  	} else {
>>  		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
>>  			pr_alert("total_mapcount: %u, page_count(): %u\n",
>> -- 
>> 2.28.0
>>
Rafael Aquini Oct. 1, 2020, 2:31 p.m. UTC | #20
On Fri, Sep 25, 2020 at 11:21:58AM +0800, Huang, Ying wrote:
> Rafael Aquini <aquini@redhat.com> writes:
> >> Or, can you help to run the test with a debug kernel based on upstream
> >> kernel.  I can provide some debug patch.
> >> 
> >
> > Sure, I can set your patches to run with the test cases we have that tend to 
> > reproduce the issue with some degree of success.
> 
> Thanks!
> 
> I found a race condition.  During THP splitting, "head" may be unlocked
> before calling split_swap_cluster(), because head != page during
> deferred splitting.  So we should call split_swap_cluster() before
> unlocking.  The debug patch to do that is as below.  Can you help to
> test it?
> 
> Best Regards,
> Huang, Ying
> 
> ------------------------8<----------------------------
> From 24ce0736a9f587d2dba12f12491c88d3e296a491 Mon Sep 17 00:00:00 2001
> From: Huang Ying <ying.huang@intel.com>
> Date: Fri, 25 Sep 2020 11:10:56 +0800
> Subject: [PATCH] dbg: Call split_swap_clsuter() before unlock page during
>  split THP
> 
> ---
>  mm/huge_memory.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index faadc449cca5..8d79e5e6b46e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2444,6 +2444,12 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  
>  	remap_page(head);
>  
> +	if (PageSwapCache(head)) {
> +		swp_entry_t entry = { .val = page_private(head) };
> +
> +		split_swap_cluster(entry);
> +	}
> +
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
>  		struct page *subpage = head + i;
>  		if (subpage == page)
> @@ -2678,12 +2684,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  		}
>  
>  		__split_huge_page(page, list, end, flags);
> -		if (PageSwapCache(head)) {
> -			swp_entry_t entry = { .val = page_private(head) };
> -
> -			ret = split_swap_cluster(entry);
> -		} else
> -			ret = 0;
> +		ret = 0;
>  	} else {
>  		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
>  			pr_alert("total_mapcount: %u, page_count(): %u\n",
> -- 
> 2.28.0
> 

I left it running for several days, on several systems that had seen the
crash hitting before, and no crashes were observed on either the upstream
kernel or the distro 4.18-based kernel.

I guess we can comfortably go with your patch. Thanks!
Rafael Aquini Oct. 5, 2020, 1:39 p.m. UTC | #21
On Thu, Oct 01, 2020 at 10:31:57AM -0400, Rafael Aquini wrote:
> On Fri, Sep 25, 2020 at 11:21:58AM +0800, Huang, Ying wrote:
> > Rafael Aquini <aquini@redhat.com> writes:
> > >> Or, can you help to run the test with a debug kernel based on upstream
> > >> kernel.  I can provide some debug patch.
> > >> 
> > >
> > > Sure, I can set your patches to run with the test cases we have that tend to 
> > > reproduce the issue with some degree of success.
> > 
> > Thanks!
> > 
> > I found a race condition.  During THP splitting, "head" may be unlocked
> > before calling split_swap_cluster(), because head != page during
> > deferred splitting.  So we should call split_swap_cluster() before
> > unlocking.  The debug patch to do that is as below.  Can you help to
> > test it?
> > 
> > Best Regards,
> > Huang, Ying
> > 
> > ------------------------8<----------------------------
> > From 24ce0736a9f587d2dba12f12491c88d3e296a491 Mon Sep 17 00:00:00 2001
> > From: Huang Ying <ying.huang@intel.com>
> > Date: Fri, 25 Sep 2020 11:10:56 +0800
> > Subject: [PATCH] dbg: Call split_swap_clsuter() before unlock page during
> >  split THP
> > 
> > ---
> >  mm/huge_memory.c | 13 +++++++------
> >  1 file changed, 7 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index faadc449cca5..8d79e5e6b46e 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2444,6 +2444,12 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> >  
> >  	remap_page(head);
> >  
> > +	if (PageSwapCache(head)) {
> > +		swp_entry_t entry = { .val = page_private(head) };
> > +
> > +		split_swap_cluster(entry);
> > +	}
> > +
> >  	for (i = 0; i < HPAGE_PMD_NR; i++) {
> >  		struct page *subpage = head + i;
> >  		if (subpage == page)
> > @@ -2678,12 +2684,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
> >  		}
> >  
> >  		__split_huge_page(page, list, end, flags);
> > -		if (PageSwapCache(head)) {
> > -			swp_entry_t entry = { .val = page_private(head) };
> > -
> > -			ret = split_swap_cluster(entry);
> > -		} else
> > -			ret = 0;
> > +		ret = 0;
> >  	} else {
> >  		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
> >  			pr_alert("total_mapcount: %u, page_count(): %u\n",
> > -- 
> > 2.28.0
> > 
> 
> I left it running for several days, on several systems that had seen the
> crash hitting before, and no crashes were observed for either the upstream
> kernel nor the distro build 4.18-based kernel.
> 
> I guess we can comfortably go with your patch. Thanks!
> 
>
Ping

Are you going to post this fix soon? Or would you rather have me
post it?

regards,
Huang, Ying Oct. 9, 2020, 12:18 a.m. UTC | #22
Rafael Aquini <aquini@redhat.com> writes:

> On Thu, Oct 01, 2020 at 10:31:57AM -0400, Rafael Aquini wrote:
>> On Fri, Sep 25, 2020 at 11:21:58AM +0800, Huang, Ying wrote:
>> > Rafael Aquini <aquini@redhat.com> writes:
>> > >> Or, can you help to run the test with a debug kernel based on upstream
>> > >> kernel.  I can provide some debug patch.
>> > >> 
>> > >
>> > > Sure, I can set your patches to run with the test cases we have that tend to 
>> > > reproduce the issue with some degree of success.
>> > 
>> > Thanks!
>> > 
>> > I found a race condition.  During THP splitting, "head" may be unlocked
>> > before calling split_swap_cluster(), because head != page during
>> > deferred splitting.  So we should call split_swap_cluster() before
>> > unlocking.  The debug patch to do that is as below.  Can you help to
>> > test it?
>> > 
>> > Best Regards,
>> > Huang, Ying
>> > 
>> > ------------------------8<----------------------------
>> > From 24ce0736a9f587d2dba12f12491c88d3e296a491 Mon Sep 17 00:00:00 2001
>> > From: Huang Ying <ying.huang@intel.com>
>> > Date: Fri, 25 Sep 2020 11:10:56 +0800
>> > Subject: [PATCH] dbg: Call split_swap_clsuter() before unlock page during
>> >  split THP
>> > 
>> > ---
>> >  mm/huge_memory.c | 13 +++++++------
>> >  1 file changed, 7 insertions(+), 6 deletions(-)
>> > 
>> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> > index faadc449cca5..8d79e5e6b46e 100644
>> > --- a/mm/huge_memory.c
>> > +++ b/mm/huge_memory.c
>> > @@ -2444,6 +2444,12 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>> >  
>> >  	remap_page(head);
>> >  
>> > +	if (PageSwapCache(head)) {
>> > +		swp_entry_t entry = { .val = page_private(head) };
>> > +
>> > +		split_swap_cluster(entry);
>> > +	}
>> > +
>> >  	for (i = 0; i < HPAGE_PMD_NR; i++) {
>> >  		struct page *subpage = head + i;
>> >  		if (subpage == page)
>> > @@ -2678,12 +2684,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>> >  		}
>> >  
>> >  		__split_huge_page(page, list, end, flags);
>> > -		if (PageSwapCache(head)) {
>> > -			swp_entry_t entry = { .val = page_private(head) };
>> > -
>> > -			ret = split_swap_cluster(entry);
>> > -		} else
>> > -			ret = 0;
>> > +		ret = 0;
>> >  	} else {
>> >  		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
>> >  			pr_alert("total_mapcount: %u, page_count(): %u\n",
>> > -- 
>> > 2.28.0
>> > 
>> 
>> I left it running for several days, on several systems that had seen the
>> crash hitting before, and no crashes were observed for either the upstream
>> kernel nor the distro build 4.18-based kernel.
>> 
>> I guess we can comfortably go with your patch. Thanks!
>> 
>>
> Ping
>
> Are you going to post this patchfix soon? Or do you rather have me
> posting it?

Sorry for the late reply.  I just came back from a long local holiday.
Thanks a lot for testing!  I will prepare the formal fixing patch.

Best Regards,
Huang, Ying

Patch

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 12f59e641b5e..37ddf5e5c53b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -324,14 +324,15 @@  static inline void cluster_set_null(struct swap_cluster_info *info)
 
 static inline bool cluster_is_huge(struct swap_cluster_info *info)
 {
-	if (IS_ENABLED(CONFIG_THP_SWAP))
+	if (IS_ENABLED(CONFIG_THP_SWAP) && info)
 		return info->flags & CLUSTER_FLAG_HUGE;
 	return false;
 }
 
 static inline void cluster_clear_huge(struct swap_cluster_info *info)
 {
-	info->flags &= ~CLUSTER_FLAG_HUGE;
+	if (IS_ENABLED(CONFIG_THP_SWAP) && info)
+		info->flags &= ~CLUSTER_FLAG_HUGE;
 }
 
 static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,