mm/page_isolation: fix a deadlock with printk()
diff mbox series

Message ID 1570207346-30477-1-git-send-email-cai@lca.pw
State New
Headers show
Series
  • mm/page_isolation: fix a deadlock with printk()
Related show

Commit Message

Qian Cai Oct. 4, 2019, 4:42 p.m. UTC
It is unsafe to call printk() while zone->lock was held, i.e.,

zone->lock --> console_sem

because the console could always allocate some memory in different code
paths and form locking chains in an opposite order,

console_sem --> * --> zone->lock

As the result, it triggers lockdep splats like below and in [1]. It is
fine to take zone->lock after has_unmovable_pages() (which has
dump_stack()) in set_migratetype_isolate(). While at it, remove a
problematic printk() in __offline_isolated_pages() only for debugging as
well which will always disable lockdep on debug kernels.

The problem is probably there forever, but neither many developers will
run memory offline with the lockdep enabled nor admins in the field are
lucky enough yet to hit a perfect timing which required to trigger a
real deadlock. In addition, there aren't many places that call printk()
while zone->lock was held.

WARNING: possible circular locking dependency detected
------------------------------------------------------
test.sh/1724 is trying to acquire lock:
0000000052059ec0 (console_owner){-...}, at: console_unlock+0x
01: 328/0xa30

but task is already holding lock:
000000006ffd89c8 (&(&zone->lock)->rlock){-.-.}, at: start_iso
01: late_page_range+0x216/0x538

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&(&zone->lock)->rlock){-.-.}:
       lock_acquire+0x21a/0x468
       _raw_spin_lock+0x54/0x68
       get_page_from_freelist+0x8b6/0x2d28
       __alloc_pages_nodemask+0x246/0x658
       __get_free_pages+0x34/0x78
       sclp_init+0x106/0x690
       sclp_register+0x2e/0x248
       sclp_rw_init+0x4a/0x70
       sclp_console_init+0x4a/0x1b8
       console_init+0x2c8/0x410
       start_kernel+0x530/0x6a0
       startup_continue+0x70/0xd0

-> #1 (sclp_lock){-.-.}:
       lock_acquire+0x21a/0x468
       _raw_spin_lock_irqsave+0xcc/0xe8
       sclp_add_request+0x34/0x308
       sclp_conbuf_emit+0x100/0x138
       sclp_console_write+0x96/0x3b8
       console_unlock+0x6dc/0xa30
       vprintk_emit+0x184/0x3c8
       vprintk_default+0x44/0x50
       printk+0xa8/0xc0
       iommu_debugfs_setup+0xf2/0x108
       iommu_init+0x6c/0x78
       do_one_initcall+0x162/0x680
       kernel_init_freeable+0x4e8/0x5a8
       kernel_init+0x2a/0x188
       ret_from_fork+0x30/0x34
       kernel_thread_starter+0x0/0xc

-> #0 (console_owner){-...}:
       check_noncircular+0x338/0x3e0
       __lock_acquire+0x1e66/0x2d88
       lock_acquire+0x21a/0x468
       console_unlock+0x3a6/0xa30
       vprintk_emit+0x184/0x3c8
       vprintk_default+0x44/0x50
       printk+0xa8/0xc0
       __dump_page+0x1dc/0x710
       dump_page+0x2e/0x58
       has_unmovable_pages+0x2e8/0x470
       start_isolate_page_range+0x404/0x538
       __offline_pages+0x22c/0x1338
       memory_subsys_offline+0xa6/0xe8
       device_offline+0xe6/0x118
       state_store+0xf0/0x110
       kernfs_fop_write+0x1bc/0x270
       vfs_write+0xce/0x220
       ksys_write+0xea/0x190
       system_call+0xd8/0x2b4

other info that might help us debug this:

Chain exists of:
  console_owner --> sclp_lock --> &(&zone->lock)->rlock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&(&zone->lock)->rlock);
                               lock(sclp_lock);
                               lock(&(&zone->lock)->rlock);
  lock(console_owner);

 *** DEADLOCK ***

9 locks held by test.sh/1724:
 #0: 000000000e925408 (sb_writers#4){.+.+}, at: vfs_write+0x201:
 #1: 0000000050aa4280 (&of->mutex){+.+.}, at: kernfs_fop_write:
 #2: 0000000062e5c628 (kn->count#198){.+.+}, at: kernfs_fop_write
 #3: 00000000523236a0 (device_hotplug_lock){+.+.}, at:
lock_device_hotplug_sysfs+0x30/0x80
 #4: 0000000062e70990 (&dev->mutex){....}, at: device_offline
 #5: 0000000051fd36b0 (cpu_hotplug_lock.rw_sem){++++}, at:
__offline_pages+0xec/0x1338
 #6: 00000000521ca470 (mem_hotplug_lock.rw_sem){++++}, at:
percpu_down_write+0x38/0x210
 #7: 000000006ffd89c8 (&(&zone->lock)->rlock){-.-.}, at:
start_isolate_page_range+0x216/0x538
 #8: 000000005205a100 (console_lock){+.+.}, at: vprintk_emit

stack backtrace:
Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
Call Trace:
([<00000000512ae218>] show_stack+0x110/0x1b0)
 [<0000000051b6d506>] dump_stack+0x126/0x178
 [<00000000513a4b08>] check_noncircular+0x338/0x3e0
 [<00000000513aaaf6>] __lock_acquire+0x1e66/0x2d88
 [<00000000513a7e12>] lock_acquire+0x21a/0x468
 [<00000000513bb2fe>] console_unlock+0x3a6/0xa30
 [<00000000513bde2c>] vprintk_emit+0x184/0x3c8
 [<00000000513be0b4>] vprintk_default+0x44/0x50
 [<00000000513beb60>] printk+0xa8/0xc0
 [<000000005158c364>] __dump_page+0x1dc/0x710
 [<000000005158c8c6>] dump_page+0x2e/0x58
 [<00000000515d87c8>] has_unmovable_pages+0x2e8/0x470
 [<000000005167072c>] start_isolate_page_range+0x404/0x538
 [<0000000051b96de4>] __offline_pages+0x22c/0x1338
 [<0000000051908586>] memory_subsys_offline+0xa6/0xe8
 [<00000000518e561e>] device_offline+0xe6/0x118
 [<0000000051908170>] state_store+0xf0/0x110
 [<0000000051796384>] kernfs_fop_write+0x1bc/0x270
 [<000000005168972e>] vfs_write+0xce/0x220
 [<0000000051689b9a>] ksys_write+0xea/0x190
 [<0000000051ba9990>] system_call+0xd8/0x2b4
INFO: lockdep is turned off.

[1] https://lore.kernel.org/lkml/1568817579.5576.172.camel@lca.pw/

Signed-off-by: Qian Cai <cai@lca.pw>
---
 mm/page_alloc.c     |  4 ----
 mm/page_isolation.c | 10 +++++-----
 2 files changed, 5 insertions(+), 9 deletions(-)

Comments

David Hildenbrand Oct. 4, 2019, 5:47 p.m. UTC | #1
On 04.10.19 18:42, Qian Cai wrote:
> It is unsafe to call printk() while zone->lock was held, i.e.,
> 
> zone->lock --> console_sem
> 
> because the console could always allocate some memory in different code
> paths and form locking chains in an opposite order,
> 
> console_sem --> * --> zone->lock
> 
> As the result, it triggers lockdep splats like below and in [1]. It is
> fine to take zone->lock after has_unmovable_pages() (which has
> dump_stack()) in set_migratetype_isolate(). While at it, remove a
> problematic printk() in __offline_isolated_pages() only for debugging as
> well which will always disable lockdep on debug kernels.
> 
> The problem is probably there forever, but neither many developers will
> run memory offline with the lockdep enabled nor admins in the field are
> lucky enough yet to hit a perfect timing which required to trigger a
> real deadlock. In addition, there aren't many places that call printk()
> while zone->lock was held.
> 
> WARNING: possible circular locking dependency detected
> ------------------------------------------------------
> test.sh/1724 is trying to acquire lock:
> 0000000052059ec0 (console_owner){-...}, at: console_unlock+0x
> 01: 328/0xa30
> 
> but task is already holding lock:
> 000000006ffd89c8 (&(&zone->lock)->rlock){-.-.}, at: start_iso
> 01: late_page_range+0x216/0x538
> 
> which lock already depends on the new lock.
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #2 (&(&zone->lock)->rlock){-.-.}:
>        lock_acquire+0x21a/0x468
>        _raw_spin_lock+0x54/0x68
>        get_page_from_freelist+0x8b6/0x2d28
>        __alloc_pages_nodemask+0x246/0x658
>        __get_free_pages+0x34/0x78
>        sclp_init+0x106/0x690
>        sclp_register+0x2e/0x248
>        sclp_rw_init+0x4a/0x70
>        sclp_console_init+0x4a/0x1b8
>        console_init+0x2c8/0x410
>        start_kernel+0x530/0x6a0
>        startup_continue+0x70/0xd0
> 
> -> #1 (sclp_lock){-.-.}:
>        lock_acquire+0x21a/0x468
>        _raw_spin_lock_irqsave+0xcc/0xe8
>        sclp_add_request+0x34/0x308
>        sclp_conbuf_emit+0x100/0x138
>        sclp_console_write+0x96/0x3b8
>        console_unlock+0x6dc/0xa30
>        vprintk_emit+0x184/0x3c8
>        vprintk_default+0x44/0x50
>        printk+0xa8/0xc0
>        iommu_debugfs_setup+0xf2/0x108
>        iommu_init+0x6c/0x78
>        do_one_initcall+0x162/0x680
>        kernel_init_freeable+0x4e8/0x5a8
>        kernel_init+0x2a/0x188
>        ret_from_fork+0x30/0x34
>        kernel_thread_starter+0x0/0xc
> 
> -> #0 (console_owner){-...}:
>        check_noncircular+0x338/0x3e0
>        __lock_acquire+0x1e66/0x2d88
>        lock_acquire+0x21a/0x468
>        console_unlock+0x3a6/0xa30
>        vprintk_emit+0x184/0x3c8
>        vprintk_default+0x44/0x50
>        printk+0xa8/0xc0
>        __dump_page+0x1dc/0x710
>        dump_page+0x2e/0x58
>        has_unmovable_pages+0x2e8/0x470
>        start_isolate_page_range+0x404/0x538
>        __offline_pages+0x22c/0x1338
>        memory_subsys_offline+0xa6/0xe8
>        device_offline+0xe6/0x118
>        state_store+0xf0/0x110
>        kernfs_fop_write+0x1bc/0x270
>        vfs_write+0xce/0x220
>        ksys_write+0xea/0x190
>        system_call+0xd8/0x2b4
> 
> other info that might help us debug this:
> 
> Chain exists of:
>   console_owner --> sclp_lock --> &(&zone->lock)->rlock
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(&(&zone->lock)->rlock);
>                                lock(sclp_lock);
>                                lock(&(&zone->lock)->rlock);
>   lock(console_owner);
> 
>  *** DEADLOCK ***
> 
> 9 locks held by test.sh/1724:
>  #0: 000000000e925408 (sb_writers#4){.+.+}, at: vfs_write+0x201:
>  #1: 0000000050aa4280 (&of->mutex){+.+.}, at: kernfs_fop_write:
>  #2: 0000000062e5c628 (kn->count#198){.+.+}, at: kernfs_fop_write
>  #3: 00000000523236a0 (device_hotplug_lock){+.+.}, at:
> lock_device_hotplug_sysfs+0x30/0x80
>  #4: 0000000062e70990 (&dev->mutex){....}, at: device_offline
>  #5: 0000000051fd36b0 (cpu_hotplug_lock.rw_sem){++++}, at:
> __offline_pages+0xec/0x1338
>  #6: 00000000521ca470 (mem_hotplug_lock.rw_sem){++++}, at:
> percpu_down_write+0x38/0x210
>  #7: 000000006ffd89c8 (&(&zone->lock)->rlock){-.-.}, at:
> start_isolate_page_range+0x216/0x538
>  #8: 000000005205a100 (console_lock){+.+.}, at: vprintk_emit
> 
> stack backtrace:
> Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
> Call Trace:
> ([<00000000512ae218>] show_stack+0x110/0x1b0)
>  [<0000000051b6d506>] dump_stack+0x126/0x178
>  [<00000000513a4b08>] check_noncircular+0x338/0x3e0
>  [<00000000513aaaf6>] __lock_acquire+0x1e66/0x2d88
>  [<00000000513a7e12>] lock_acquire+0x21a/0x468
>  [<00000000513bb2fe>] console_unlock+0x3a6/0xa30
>  [<00000000513bde2c>] vprintk_emit+0x184/0x3c8
>  [<00000000513be0b4>] vprintk_default+0x44/0x50
>  [<00000000513beb60>] printk+0xa8/0xc0
>  [<000000005158c364>] __dump_page+0x1dc/0x710
>  [<000000005158c8c6>] dump_page+0x2e/0x58
>  [<00000000515d87c8>] has_unmovable_pages+0x2e8/0x470
>  [<000000005167072c>] start_isolate_page_range+0x404/0x538
>  [<0000000051b96de4>] __offline_pages+0x22c/0x1338
>  [<0000000051908586>] memory_subsys_offline+0xa6/0xe8
>  [<00000000518e561e>] device_offline+0xe6/0x118
>  [<0000000051908170>] state_store+0xf0/0x110
>  [<0000000051796384>] kernfs_fop_write+0x1bc/0x270
>  [<000000005168972e>] vfs_write+0xce/0x220
>  [<0000000051689b9a>] ksys_write+0xea/0x190
>  [<0000000051ba9990>] system_call+0xd8/0x2b4
> INFO: lockdep is turned off.
> 
> [1] https://lore.kernel.org/lkml/1568817579.5576.172.camel@lca.pw/
> 
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
>  mm/page_alloc.c     |  4 ----
>  mm/page_isolation.c | 10 +++++-----
>  2 files changed, 5 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 15c2050c629b..232bbb1dc521 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8588,10 +8588,6 @@ void zone_pcp_reset(struct zone *zone)
>  		BUG_ON(!PageBuddy(page));
>  		order = page_order(page);
>  		offlined_pages += 1 << order;
> -#ifdef CONFIG_DEBUG_VM
> -		pr_info("remove from free list %lx %d %lx\n",
> -			pfn, 1 << order, end_pfn);
> -#endif
>  		del_page_from_free_area(page, &zone->free_area[order]);
>  		for (i = 0; i < (1 << order); i++)
>  			SetPageReserved((page+i));
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 89c19c0feadb..8682ccb5fbd1 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -25,8 +25,6 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
>  
>  	zone = page_zone(page);
>  
> -	spin_lock_irqsave(&zone->lock, flags);
> -
>  	/*
>  	 * We assume the caller intended to SET migrate type to isolate.
>  	 * If it is already set, then someone else must have raced and
> @@ -74,16 +72,18 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
>  		int mt = get_pageblock_migratetype(page);
>  
>  		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> +
> +		spin_lock_irqsave(&zone->lock, flags);

The migratetype has to be tested and set under lock, otherwise two
clients could race. I don't like such severe locking changes just to
make some printing we only need for debugging work.

Can't we somehow return some information (page / cause) from
has_unmovable_pages() and print from a save place instead?

To fix the BUG, I would much rather want to see all printing getting
ripped out instead. That's easy to backort.

We can then come back and think about how to log such stuff in order to
debug it.
Qian Cai Oct. 4, 2019, 6:53 p.m. UTC | #2
On Fri, 2019-10-04 at 19:47 +0200, David Hildenbrand wrote:
> On 04.10.19 18:42, Qian Cai wrote:
> > It is unsafe to call printk() while zone->lock was held, i.e.,
> > 
> > zone->lock --> console_sem
> > 
> > because the console could always allocate some memory in different code
> > paths and form locking chains in an opposite order,
> > 
> > console_sem --> * --> zone->lock
> > 
> > As the result, it triggers lockdep splats like below and in [1]. It is
> > fine to take zone->lock after has_unmovable_pages() (which has
> > dump_stack()) in set_migratetype_isolate(). While at it, remove a
> > problematic printk() in __offline_isolated_pages() only for debugging as
> > well which will always disable lockdep on debug kernels.
> > 
> > The problem is probably there forever, but neither many developers will
> > run memory offline with the lockdep enabled nor admins in the field are
> > lucky enough yet to hit a perfect timing which required to trigger a
> > real deadlock. In addition, there aren't many places that call printk()
> > while zone->lock was held.
> > 
> > WARNING: possible circular locking dependency detected
> > ------------------------------------------------------
> > test.sh/1724 is trying to acquire lock:
> > 0000000052059ec0 (console_owner){-...}, at: console_unlock+0x
> > 01: 328/0xa30
> > 
> > but task is already holding lock:
> > 000000006ffd89c8 (&(&zone->lock)->rlock){-.-.}, at: start_iso
> > 01: late_page_range+0x216/0x538
> > 
> > which lock already depends on the new lock.
> > 
> > the existing dependency chain (in reverse order) is:
> > 
> > -> #2 (&(&zone->lock)->rlock){-.-.}:
> >        lock_acquire+0x21a/0x468
> >        _raw_spin_lock+0x54/0x68
> >        get_page_from_freelist+0x8b6/0x2d28
> >        __alloc_pages_nodemask+0x246/0x658
> >        __get_free_pages+0x34/0x78
> >        sclp_init+0x106/0x690
> >        sclp_register+0x2e/0x248
> >        sclp_rw_init+0x4a/0x70
> >        sclp_console_init+0x4a/0x1b8
> >        console_init+0x2c8/0x410
> >        start_kernel+0x530/0x6a0
> >        startup_continue+0x70/0xd0
> > 
> > -> #1 (sclp_lock){-.-.}:
> >        lock_acquire+0x21a/0x468
> >        _raw_spin_lock_irqsave+0xcc/0xe8
> >        sclp_add_request+0x34/0x308
> >        sclp_conbuf_emit+0x100/0x138
> >        sclp_console_write+0x96/0x3b8
> >        console_unlock+0x6dc/0xa30
> >        vprintk_emit+0x184/0x3c8
> >        vprintk_default+0x44/0x50
> >        printk+0xa8/0xc0
> >        iommu_debugfs_setup+0xf2/0x108
> >        iommu_init+0x6c/0x78
> >        do_one_initcall+0x162/0x680
> >        kernel_init_freeable+0x4e8/0x5a8
> >        kernel_init+0x2a/0x188
> >        ret_from_fork+0x30/0x34
> >        kernel_thread_starter+0x0/0xc
> > 
> > -> #0 (console_owner){-...}:
> >        check_noncircular+0x338/0x3e0
> >        __lock_acquire+0x1e66/0x2d88
> >        lock_acquire+0x21a/0x468
> >        console_unlock+0x3a6/0xa30
> >        vprintk_emit+0x184/0x3c8
> >        vprintk_default+0x44/0x50
> >        printk+0xa8/0xc0
> >        __dump_page+0x1dc/0x710
> >        dump_page+0x2e/0x58
> >        has_unmovable_pages+0x2e8/0x470
> >        start_isolate_page_range+0x404/0x538
> >        __offline_pages+0x22c/0x1338
> >        memory_subsys_offline+0xa6/0xe8
> >        device_offline+0xe6/0x118
> >        state_store+0xf0/0x110
> >        kernfs_fop_write+0x1bc/0x270
> >        vfs_write+0xce/0x220
> >        ksys_write+0xea/0x190
> >        system_call+0xd8/0x2b4
> > 
> > other info that might help us debug this:
> > 
> > Chain exists of:
> >   console_owner --> sclp_lock --> &(&zone->lock)->rlock
> > 
> >  Possible unsafe locking scenario:
> > 
> >        CPU0                    CPU1
> >        ----                    ----
> >   lock(&(&zone->lock)->rlock);
> >                                lock(sclp_lock);
> >                                lock(&(&zone->lock)->rlock);
> >   lock(console_owner);
> > 
> >  *** DEADLOCK ***
> > 
> > 9 locks held by test.sh/1724:
> >  #0: 000000000e925408 (sb_writers#4){.+.+}, at: vfs_write+0x201:
> >  #1: 0000000050aa4280 (&of->mutex){+.+.}, at: kernfs_fop_write:
> >  #2: 0000000062e5c628 (kn->count#198){.+.+}, at: kernfs_fop_write
> >  #3: 00000000523236a0 (device_hotplug_lock){+.+.}, at:
> > lock_device_hotplug_sysfs+0x30/0x80
> >  #4: 0000000062e70990 (&dev->mutex){....}, at: device_offline
> >  #5: 0000000051fd36b0 (cpu_hotplug_lock.rw_sem){++++}, at:
> > __offline_pages+0xec/0x1338
> >  #6: 00000000521ca470 (mem_hotplug_lock.rw_sem){++++}, at:
> > percpu_down_write+0x38/0x210
> >  #7: 000000006ffd89c8 (&(&zone->lock)->rlock){-.-.}, at:
> > start_isolate_page_range+0x216/0x538
> >  #8: 000000005205a100 (console_lock){+.+.}, at: vprintk_emit
> > 
> > stack backtrace:
> > Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
> > Call Trace:
> > ([<00000000512ae218>] show_stack+0x110/0x1b0)
> >  [<0000000051b6d506>] dump_stack+0x126/0x178
> >  [<00000000513a4b08>] check_noncircular+0x338/0x3e0
> >  [<00000000513aaaf6>] __lock_acquire+0x1e66/0x2d88
> >  [<00000000513a7e12>] lock_acquire+0x21a/0x468
> >  [<00000000513bb2fe>] console_unlock+0x3a6/0xa30
> >  [<00000000513bde2c>] vprintk_emit+0x184/0x3c8
> >  [<00000000513be0b4>] vprintk_default+0x44/0x50
> >  [<00000000513beb60>] printk+0xa8/0xc0
> >  [<000000005158c364>] __dump_page+0x1dc/0x710
> >  [<000000005158c8c6>] dump_page+0x2e/0x58
> >  [<00000000515d87c8>] has_unmovable_pages+0x2e8/0x470
> >  [<000000005167072c>] start_isolate_page_range+0x404/0x538
> >  [<0000000051b96de4>] __offline_pages+0x22c/0x1338
> >  [<0000000051908586>] memory_subsys_offline+0xa6/0xe8
> >  [<00000000518e561e>] device_offline+0xe6/0x118
> >  [<0000000051908170>] state_store+0xf0/0x110
> >  [<0000000051796384>] kernfs_fop_write+0x1bc/0x270
> >  [<000000005168972e>] vfs_write+0xce/0x220
> >  [<0000000051689b9a>] ksys_write+0xea/0x190
> >  [<0000000051ba9990>] system_call+0xd8/0x2b4
> > INFO: lockdep is turned off.
> > 
> > [1] https://lore.kernel.org/lkml/1568817579.5576.172.camel@lca.pw/
> > 
> > Signed-off-by: Qian Cai <cai@lca.pw>
> > ---
> >  mm/page_alloc.c     |  4 ----
> >  mm/page_isolation.c | 10 +++++-----
> >  2 files changed, 5 insertions(+), 9 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 15c2050c629b..232bbb1dc521 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -8588,10 +8588,6 @@ void zone_pcp_reset(struct zone *zone)
> >  		BUG_ON(!PageBuddy(page));
> >  		order = page_order(page);
> >  		offlined_pages += 1 << order;
> > -#ifdef CONFIG_DEBUG_VM
> > -		pr_info("remove from free list %lx %d %lx\n",
> > -			pfn, 1 << order, end_pfn);
> > -#endif
> >  		del_page_from_free_area(page, &zone->free_area[order]);
> >  		for (i = 0; i < (1 << order); i++)
> >  			SetPageReserved((page+i));
> > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > index 89c19c0feadb..8682ccb5fbd1 100644
> > --- a/mm/page_isolation.c
> > +++ b/mm/page_isolation.c
> > @@ -25,8 +25,6 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
> >  
> >  	zone = page_zone(page);
> >  
> > -	spin_lock_irqsave(&zone->lock, flags);
> > -
> >  	/*
> >  	 * We assume the caller intended to SET migrate type to isolate.
> >  	 * If it is already set, then someone else must have raced and
> > @@ -74,16 +72,18 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
> >  		int mt = get_pageblock_migratetype(page);
> >  
> >  		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> > +
> > +		spin_lock_irqsave(&zone->lock, flags);
> 
> The migratetype has to be tested and set under lock, otherwise two
> clients could race. I don't like such severe locking changes just to
> make some printing we only need for debugging work.

Ah, it could use a different lock for that then rather than reuse zone->lock
which is kind of confusion to begin with.

> 
> Can't we somehow return some information (page / cause) from
> has_unmovable_pages() and print from a save place instead?

Possible too, but using a different lock to protect migratetype looks simpler.

> 
> To fix the BUG, I would much rather want to see all printing getting
> ripped out instead. That's easy to backort.

I am not even sure if this worth backporting at all as it is rather difficult to
hit in the field and probably exist there for many years without anyone noticed
it.

> 
> We can then come back and think about how to log such stuff in order to
> debug it.
>
Andrew Morton Oct. 5, 2019, 11:29 p.m. UTC | #3
On Fri,  4 Oct 2019 12:42:26 -0400 Qian Cai <cai@lca.pw> wrote:

> It is unsafe to call printk() while zone->lock was held, i.e.,
> 
> zone->lock --> console_sem
> 
> because the console could always allocate some memory in different code
> paths and form locking chains in an opposite order,
> 
> console_sem --> * --> zone->lock
> 
> As the result, it triggers lockdep splats like below and in [1]. It is
> fine to take zone->lock after has_unmovable_pages() (which has
> dump_stack()) in set_migratetype_isolate(). While at it, remove a
> problematic printk() in __offline_isolated_pages() only for debugging as
> well which will always disable lockdep on debug kernels.
> 
> The problem is probably there forever, but neither many developers will
> run memory offline with the lockdep enabled nor admins in the field are
> lucky enough yet to hit a perfect timing which required to trigger a
> real deadlock. In addition, there aren't many places that call printk()
> while zone->lock was held.
> 
> WARNING: possible circular locking dependency detected
> ------------------------------------------------------
> test.sh/1724 is trying to acquire lock:
> 0000000052059ec0 (console_owner){-...}, at: console_unlock+0x
> 01: 328/0xa30
> 
> but task is already holding lock:
> 000000006ffd89c8 (&(&zone->lock)->rlock){-.-.}, at: start_iso
> 01: late_page_range+0x216/0x538
> 
> which lock already depends on the new lock.
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #2 (&(&zone->lock)->rlock){-.-.}:
>        lock_acquire+0x21a/0x468
>        _raw_spin_lock+0x54/0x68
>        get_page_from_freelist+0x8b6/0x2d28
>        __alloc_pages_nodemask+0x246/0x658
>        __get_free_pages+0x34/0x78
>        sclp_init+0x106/0x690
>        sclp_register+0x2e/0x248
>        sclp_rw_init+0x4a/0x70
>        sclp_console_init+0x4a/0x1b8
>        console_init+0x2c8/0x410
>        start_kernel+0x530/0x6a0
>        startup_continue+0x70/0xd0

This appears to be the core of our problem?  At initialization time,
the sclp driver registers an inappropriate dependency with lockdep.  It
does this by calling into the page allocator while holding sclp_lock. 
But we don't *want* to teach lockdep that sclp_lock nests outside
zone->lock.  We want the opposite.

So can we address this class of problem by declaring "thou shalt not
call the page allocator while holding a lock which can be taken on the
prink path?".  And then declare sclp to be defective.


And I think sclp is kinda buggy-but-lucky anyway: if console output is
directed to sclp device #0 and we're then trying to initialize sclp
device #1 then any printk which happens during that initialization will
deadlock.  The driver escapes this by only supporting a single device
system-wide but it's not a model which drivers should generally follow.

(And if sclp will only ever support a single device system-wide, why
the heck does it need to take sclp_lock() on the device initialization
path??)
Qian Cai Oct. 6, 2019, 12:10 a.m. UTC | #4
> On Oct 5, 2019, at 7:29 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> ´╗┐On Fri,  4 Oct 2019 12:42:26 -0400 Qian Cai <cai@lca.pw> wrote:
> 
>> It is unsafe to call printk() while zone->lock was held, i.e.,
>> 
>> zone->lock --> console_sem
>> 
>> because the console could always allocate some memory in different code
>> paths and form locking chains in an opposite order,
>> 
>> console_sem --> * --> zone->lock
>> 
>> As the result, it triggers lockdep splats like below and in [1]. It is
>> fine to take zone->lock after has_unmovable_pages() (which has
>> dump_stack()) in set_migratetype_isolate(). While at it, remove a
>> problematic printk() in __offline_isolated_pages() only for debugging as
>> well which will always disable lockdep on debug kernels.
>> 
>> The problem is probably there forever, but neither many developers will
>> run memory offline with the lockdep enabled nor admins in the field are
>> lucky enough yet to hit a perfect timing which required to trigger a
>> real deadlock. In addition, there aren't many places that call printk()
>> while zone->lock was held.
>> 
>> WARNING: possible circular locking dependency detected
>> ------------------------------------------------------
>> test.sh/1724 is trying to acquire lock:
>> 0000000052059ec0 (console_owner){-...}, at: console_unlock+0x
>> 01: 328/0xa30
>> 
>> but task is already holding lock:
>> 000000006ffd89c8 (&(&zone->lock)->rlock){-.-.}, at: start_iso
>> 01: late_page_range+0x216/0x538
>> 
>> which lock already depends on the new lock.
>> 
>> the existing dependency chain (in reverse order) is:
>> 
>> -> #2 (&(&zone->lock)->rlock){-.-.}:
>>       lock_acquire+0x21a/0x468
>>       _raw_spin_lock+0x54/0x68
>>       get_page_from_freelist+0x8b6/0x2d28
>>       __alloc_pages_nodemask+0x246/0x658
>>       __get_free_pages+0x34/0x78
>>       sclp_init+0x106/0x690
>>       sclp_register+0x2e/0x248
>>       sclp_rw_init+0x4a/0x70
>>       sclp_console_init+0x4a/0x1b8
>>       console_init+0x2c8/0x410
>>       start_kernel+0x530/0x6a0
>>       startup_continue+0x70/0xd0
> 
> This appears to be the core of our problem?

No, that is just one of those many places could form the lock chain. 

console_lock -> other locks -> zone_lock

Another example is,

https://lore.kernel.org/lkml/1568823006.5576.178.camel@lca.pw/

It is easier to avoid,

zone_lock -> console_lock

rather than fixing the opposite.

>  At initialization time,
> the sclp driver registers an inappropriate dependency with lockdep.  It
> does this by calling into the page allocator while holding sclp_lock. 
> But we don't *want* to teach lockdep that sclp_lock nests outside
> zone->lock.  We want the opposite.
> 
> So can we address this class of problem by declaring "thou shalt not
> call the page allocator while holding a lock which can be taken on the
> prink path?".  And then declare sclp to be defective.
> 
> 
> And I think sclp is kinda buggy-but-lucky anyway: if console output is
> directed to sclp device #0 and we're then trying to initialize sclp
> device #1 then any printk which happens during that initialization will
> deadlock.  The driver escapes this by only supporting a single device
> system-wide but it's not a model which drivers should generally follow.
> 
> (And if sclp will only ever support a single device system-wide, why
> the heck does it need to take sclp_lock() on the device initialization
> path??)
> 
>
Andrew Morton Oct. 6, 2019, 12:44 a.m. UTC | #5
On Sat, 5 Oct 2019 20:10:47 -0400 Qian Cai <cai@lca.pw> wrote:

> 
> 
> >> the existing dependency chain (in reverse order) is:
> >> 
> >> -> #2 (&(&zone->lock)->rlock){-.-.}:
> >>       lock_acquire+0x21a/0x468
> >>       _raw_spin_lock+0x54/0x68
> >>       get_page_from_freelist+0x8b6/0x2d28
> >>       __alloc_pages_nodemask+0x246/0x658
> >>       __get_free_pages+0x34/0x78
> >>       sclp_init+0x106/0x690
> >>       sclp_register+0x2e/0x248
> >>       sclp_rw_init+0x4a/0x70
> >>       sclp_console_init+0x4a/0x1b8
> >>       console_init+0x2c8/0x410
> >>       start_kernel+0x530/0x6a0
> >>       startup_continue+0x70/0xd0
> > 
> > This appears to be the core of our problem?
> 
> No, that is just one of those many places could form the lock chain. 
> 
> console_lock -> other locks -> zone_lock
> 
> Another example is,
> 
> https://lore.kernel.org/lkml/1568823006.5576.178.camel@lca.pw/

There is no "console_lock".  Please be much more specific.

> It is easier to avoid,
> 
> zone_lock -> console_lock
> 
> rather than fixing the opposite.

"ease" isn't the main objective.  A more important question is "what
makes sense".  We should be able to call printk() from anywhere, any
time under any conditions.  That can't be done 100% but it is the
objective.  printk() should be robust and not being able to call
printk() while holding zone->lock isn't robust!

btw, this:

: It is unsafe to call printk() while zone->lock was held, i.e.,
:
:	zone->lock --> console_sem

doesn't make a lot of sense.  console_sem is a sleeping lock so
attempting to acquire it (with down()!) under spinlock is a huge bug. 
Again, please be careful with the descriptions.
Qian Cai Oct. 6, 2019, 12:59 a.m. UTC | #6
> On Oct 5, 2019, at 7:29 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
>> -> #2 (&(&zone->lock)->rlock){-.-.}:
>>       lock_acquire+0x21a/0x468
>>       _raw_spin_lock+0x54/0x68
>>       get_page_from_freelist+0x8b6/0x2d28
>>       __alloc_pages_nodemask+0x246/0x658
>>       __get_free_pages+0x34/0x78
>>       sclp_init+0x106/0x690
>>       sclp_register+0x2e/0x248
>>       sclp_rw_init+0x4a/0x70
>>       sclp_console_init+0x4a/0x1b8
>>       console_init+0x2c8/0x410
>>       start_kernel+0x530/0x6a0
>>       startup_continue+0x70/0xd0
> 
> This appears to be the core of our problem?

There are more.

[  297.425908] WARNING: possible circular locking dependency detected
[  297.425908] 5.3.0-next-20190917 #8 Not tainted
[  297.425909] ------------------------------------------------------
[  297.425910] test.sh/8653 is trying to acquire lock:
[  297.425911] ffffffff865a4460 (console_owner){-.-.}, at:
console_unlock+0x207/0x750

[  297.425914] but task is already holding lock:
[  297.425915] ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
__offline_isolated_pages+0x179/0x3e0

[  297.425919] which lock already depends on the new lock.


[  297.425920] the existing dependency chain (in reverse order) is:

[  297.425922] -> #3 (&(&zone->lock)->rlock){-.-.}:
[  297.425925]        __lock_acquire+0x5b3/0xb40
[  297.425925]        lock_acquire+0x126/0x280
[  297.425926]        _raw_spin_lock+0x2f/0x40
[  297.425927]        rmqueue_bulk.constprop.21+0xb6/0x1160
[  297.425928]        get_page_from_freelist+0x898/0x22c0
[  297.425928]        __alloc_pages_nodemask+0x2f3/0x1cd0
[  297.425929]        alloc_pages_current+0x9c/0x110
[  297.425930]        allocate_slab+0x4c6/0x19c0
[  297.425931]        new_slab+0x46/0x70
[  297.425931]        ___slab_alloc+0x58b/0x960
[  297.425932]        __slab_alloc+0x43/0x70
[  297.425933]        __kmalloc+0x3ad/0x4b0
[  297.425933]        __tty_buffer_request_room+0x100/0x250
[  297.425934]        tty_insert_flip_string_fixed_flag+0x67/0x110
[  297.425935]        pty_write+0xa2/0xf0
[  297.425936]        n_tty_write+0x36b/0x7b0
[  297.425936]        tty_write+0x284/0x4c0
[  297.425937]        __vfs_write+0x50/0xa0
[  297.425938]        vfs_write+0x105/0x290
[  297.425939]        redirected_tty_write+0x6a/0xc0
[  297.425939]        do_iter_write+0x248/0x2a0
[  297.425940]        vfs_writev+0x106/0x1e0
[  297.425941]        do_writev+0xd4/0x180
[  297.425941]        __x64_sys_writev+0x45/0x50
[  297.425942]        do_syscall_64+0xcc/0x76c
[  297.425943]        entry_SYSCALL_64_after_hwframe+0x49/0xbe

[  297.425944] -> #2 (&(&port->lock)->rlock){-.-.}:
[  297.425946]        __lock_acquire+0x5b3/0xb40
[  297.425947]        lock_acquire+0x126/0x280
[  297.425948]        _raw_spin_lock_irqsave+0x3a/0x50
[  297.425949]        tty_port_tty_get+0x20/0x60
[  297.425949]        tty_port_default_wakeup+0xf/0x30
[  297.425950]        tty_port_tty_wakeup+0x39/0x40
[  297.425951]        uart_write_wakeup+0x2a/0x40
[  297.425952]        serial8250_tx_chars+0x22e/0x440
[  297.425952]        serial8250_handle_irq.part.8+0x14a/0x170
[  297.425953]        serial8250_default_handle_irq+0x5c/0x90
[  297.425954]        serial8250_interrupt+0xa6/0x130
[  297.425955]        __handle_irq_event_percpu+0x78/0x4f0
[  297.425955]        handle_irq_event_percpu+0x70/0x100
[  297.425956]        handle_irq_event+0x5a/0x8b
[  297.425957]        handle_edge_irq+0x117/0x370
[  297.425958]        do_IRQ+0x9e/0x1e0
[  297.425958]        ret_from_intr+0x0/0x2a
[  297.425959]        cpuidle_enter_state+0x156/0x8e0
[  297.425960]        cpuidle_enter+0x41/0x70
[  297.425960]        call_cpuidle+0x5e/0x90
[  297.425961]        do_idle+0x333/0x370
[  297.425962]        cpu_startup_entry+0x1d/0x1f
[  297.425962]        start_secondary+0x290/0x330
[  297.425963]        secondary_startup_64+0xb6/0xc0

[  297.425964] -> #1 (&port_lock_key){-.-.}:
[  297.425967]        __lock_acquire+0x5b3/0xb40
[  297.425967]        lock_acquire+0x126/0x280
[  297.425968]        _raw_spin_lock_irqsave+0x3a/0x50
[  297.425969]        serial8250_console_write+0x3e4/0x450
[  297.425970]        univ8250_console_write+0x4b/0x60
[  297.425970]        console_unlock+0x501/0x750
[  297.425971]        vprintk_emit+0x10d/0x340
[  297.425972]        vprintk_default+0x1f/0x30
[  297.425972]        vprintk_func+0x44/0xd4
[  297.425973]        printk+0x9f/0xc5
[  297.425974]        register_console+0x39c/0x520
[  297.425975]        univ8250_console_init+0x23/0x2d
[  297.425975]        console_init+0x338/0x4cd
[  297.425976]        start_kernel+0x534/0x724
[  297.425977]        x86_64_start_reservations+0x24/0x26
[  297.425977]        x86_64_start_kernel+0xf4/0xfb
[  297.425978]        secondary_startup_64+0xb6/0xc0

[  297.425979] -> #0 (console_owner){-.-.}:
[  297.425982]        check_prev_add+0x107/0xea0
[  297.425982]        validate_chain+0x8fc/0x1200
[  297.425983]        __lock_acquire+0x5b3/0xb40
[  297.425984]        lock_acquire+0x126/0x280
[  297.425984]        console_unlock+0x269/0x750
[  297.425985]        vprintk_emit+0x10d/0x340
[  297.425986]        vprintk_default+0x1f/0x30
[  297.425987]        vprintk_func+0x44/0xd4
[  297.425987]        printk+0x9f/0xc5
[  297.425988]        __offline_isolated_pages.cold.52+0x2f/0x30a
[  297.425989]        offline_isolated_pages_cb+0x17/0x30
[  297.425990]        walk_system_ram_range+0xda/0x160
[  297.425990]        __offline_pages+0x79c/0xa10
[  297.425991]        offline_pages+0x11/0x20
[  297.425992]        memory_subsys_offline+0x7e/0xc0
[  297.425992]        device_offline+0xd5/0x110
[  297.425993]        state_store+0xc6/0xe0
[  297.425994]        dev_attr_store+0x3f/0x60
[  297.425995]        sysfs_kf_write+0x89/0xb0
[  297.425995]        kernfs_fop_write+0x188/0x240
[  297.425996]        __vfs_write+0x50/0xa0
[  297.425997]        vfs_write+0x105/0x290
[  297.425997]        ksys_write+0xc6/0x160
[  297.425998]        __x64_sys_write+0x43/0x50
[  297.425999]        do_syscall_64+0xcc/0x76c
[  297.426000]        entry_SYSCALL_64_after_hwframe+0x49/0xbe

[  297.426001] other info that might help us debug this:

[  297.426002] Chain exists of:
[  297.426002]   console_owner --> &(&port->lock)->rlock --> &(&zone->lock)-
>rlock

[  297.426007]  Possible unsafe locking scenario:

[  297.426008]        CPU0                    CPU1
[  297.426009]        ----                    ----
[  297.426009]   lock(&(&zone->lock)->rlock);
[  297.426011]                                lock(&(&port->lock)->rlock);
[  297.426013]                                lock(&(&zone->lock)->rlock);
[  297.426014]   lock(console_owner);

[  297.426016]  *** DEADLOCK ***

[  297.426017] 9 locks held by test.sh/8653:
[  297.426018]  #0: ffff88839ba7d408 (sb_writers#4){.+.+}, at:
vfs_write+0x25f/0x290
[  297.426021]  #1: ffff888277618880 (&of->mutex){+.+.}, at:
kernfs_fop_write+0x128/0x240
[  297.426024]  #2: ffff8898131fc218 (kn->count#115){.+.+}, at:
kernfs_fop_write+0x138/0x240
[  297.426028]  #3: ffffffff86962a80 (device_hotplug_lock){+.+.}, at:
lock_device_hotplug_sysfs+0x16/0x50
[  297.426031]  #4: ffff8884374f4990 (&dev->mutex){....}, at:
device_offline+0x70/0x110
[  297.426034]  #5: ffffffff86515250 (cpu_hotplug_lock.rw_sem){++++}, at:
__offline_pages+0xbf/0xa10
[  297.426037]  #6: ffffffff867405f0 (mem_hotplug_lock.rw_sem){++++}, at:
percpu_down_write+0x87/0x2f0
[  297.426040]  #7: ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
__offline_isolated_pages+0x179/0x3e0
[  297.426043]  #8: ffffffff865a4920 (console_lock){+.+.}, at:
vprintk_emit+0x100/0x340

[  297.426047] stack backtrace:
[  297.426048] CPU: 1 PID: 8653 Comm: test.sh Not tainted 5.3.0-next-20190917 #8
[  297.426049] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10,
BIOS U34 05/21/2019
[  297.426049] Call Trace:
[  297.426050]  dump_stack+0x86/0xca
[  297.426051]  print_circular_bug.cold.31+0x243/0x26e
[  297.426051]  check_noncircular+0x29e/0x2e0
[  297.426052]  ? stack_trace_save+0x87/0xb0
[  297.426053]  ? print_circular_bug+0x120/0x120
[  297.426053]  check_prev_add+0x107/0xea0
[  297.426054]  validate_chain+0x8fc/0x1200
[  297.426055]  ? check_prev_add+0xea0/0xea0
[  297.426055]  __lock_acquire+0x5b3/0xb40
[  297.426056]  lock_acquire+0x126/0x280
[  297.426057]  ? console_unlock+0x207/0x750
[  297.426057]  ? __kasan_check_read+0x11/0x20
[  297.426058]  console_unlock+0x269/0x750
[  297.426059]  ? console_unlock+0x207/0x750
[  297.426059]  vprintk_emit+0x10d/0x340
[  297.426060]  vprintk_default+0x1f/0x30
[  297.426061]  vprintk_func+0x44/0xd4
[  297.426061]  ? do_raw_spin_lock+0x118/0x1d0
[  297.426062]  printk+0x9f/0xc5
[  297.426063]  ? kmsg_dump_rewind_nolock+0x64/0x64
[  297.426064]  ? __offline_isolated_pages+0x179/0x3e0
[  297.426064]  __offline_isolated_pages.cold.52+0x2f/0x30a
[  297.426065]  ? online_memory_block+0x20/0x20
[  297.426066]  offline_isolated_pages_cb+0x17/0x30
[  297.426067]  walk_system_ram_range+0xda/0x160
[  297.426067]  ? walk_mem_res+0x30/0x30
[  297.426068]  ? dissolve_free_huge_page+0x1e/0x2b0
[  297.426069]  __offline_pages+0x79c/0xa10
[  297.426069]  ? __add_memory+0xc0/0xc0
[  297.426070]  ? __kasan_check_write+0x14/0x20
[  297.426071]  ? __mutex_lock+0x344/0xcd0
[  297.426071]  ? _raw_spin_unlock_irqrestore+0x49/0x50
[  297.426072]  ? device_offline+0x70/0x110
[  297.426073]  ? klist_next+0x1c1/0x1e0
[  297.426073]  ? __mutex_add_waiter+0xc0/0xc0
[  297.426074]  ? klist_next+0x10b/0x1e0
[  297.426075]  ? klist_iter_exit+0x16/0x40
[  297.426076]  ? device_for_each_child+0xd0/0x110
[  297.426076]  offline_pages+0x11/0x20
[  297.426077]  memory_subsys_offline+0x7e/0xc0
[  297.426078]  device_offline+0xd5/0x110
[  297.426078]  ? auto_online_blocks_show+0x70/0x70
[  297.426079]  state_store+0xc6/0xe0
[  297.426080]  dev_attr_store+0x3f/0x60
[  297.426080]  ? device_match_name+0x40/0x40
[  297.426081]  sysfs_kf_write+0x89/0xb0
[  297.426082]  ? sysfs_file_ops+0xa0/0xa0
[  297.426082]  kernfs_fop_write+0x188/0x240
[  297.426083]  __vfs_write+0x50/0xa0
[  297.426084]  vfs_write+0x105/0x290
[  297.426084]  ksys_write+0xc6/0x160
[  297.426085]  ? __x64_sys_read+0x50/0x50
[  297.426086]  ? do_syscall_64+0x79/0x76c
[  297.426086]  ? do_syscall_64+0x79/0x76c
[  297.426087]  __x64_sys_write+0x43/0x50
[  297.426088]  do_syscall_64+0xcc/0x76c
[  297.426088]  ? trace_hardirqs_on_thunk+0x1a/0x20
[  297.426089]  ? syscall_return_slowpath+0x210/0x210
[  297.426090]  ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
[  297.426091]  ? trace_hardirqs_off_caller+0x3a/0x150
[  297.426092]  ? trace_hardirqs_off_thunk+0x1a/0x20
[  297.426092]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  297.426093] RIP: 0033:0x7fd7336b4e18
[  297.426095] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f
1e fa 48 8d 05 05 59 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0
ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[  297.426096] RSP: 002b:00007ffc58c7b258 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[  297.426098] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007fd7336b4e18
[  297.426098] RDX: 0000000000000008 RSI: 000055ad6d519c70 RDI: 0000000000000001
[  297.426099] RBP: 000055ad6d519c70 R08: 000000000000000a R09: 00007fd733746300
[  297.426100] R10: 000000000000000a R11: 0000000000000246 R12: 00007fd733986780
[  297.426101] R13: 0000000000000008 R14: 00007fd733981740 R15: 0000000000000008


>  At initialization time,
> the sclp driver registers an inappropriate dependency with lockdep.  It
> does this by calling into the page allocator while holding sclp_lock. 
> But we don't *want* to teach lockdep that sclp_lock nests outside
> zone->lock.  We want the opposite.
> 
> So can we address this class of problem by declaring "thou shalt not
> call the page allocator while holding a lock which can be taken on the
> prink path?".  And then declare sclp to be defective.
> 
> 
> And I think sclp is kinda buggy-but-lucky anyway: if console output is
> directed to sclp device #0 and we're then trying to initialize sclp
> device #1 then any printk which happens during that initialization will
> deadlock.  The driver escapes this by only

printk() is such a beast that basically any lock taken after console_lock held  like sclp_lock, port_lock, rq_lock or even indirectly like pi_lock as in below,

console_lock -> rq_lock -> pi_lock

would have a same problem where if another CPU take one of those locks in a different path and call printk() could trigger a deadlock.

[  763.659221][    C6] -> #2 (&rq->lock){-.-.}:
[  763.659222][    C6]        __lock_acquire+0x44e/0x8c0
[  763.659223][    C6]        lock_acquire+0xc0/0x1c0
[  763.659223][    C6]        _raw_spin_lock+0x2f/0x40
[  763.659223][    C6]        task_fork_fair+0x37/0x150
[  763.659224][    C6]        sched_fork+0x126/0x230
[  763.659224][    C6]        copy_process+0xafc/0x1e90
[  763.659224][    C6]        _do_fork+0x89/0x720
[  763.659225][    C6]        kernel_thread+0x58/0x70
[  763.659225][    C6]        rest_init+0x28/0x302
[  763.659225][    C6]        arch_call_rest_init+0xe/0x1b
[  763.659226][    C6]        start_kernel+0x581/0x5a0
[  763.659226][    C6]        x86_64_start_reservations+0x24/0x26
[  763.659227][    C6]        x86_64_start_kernel+0xef/0xf6
[  763.659227][    C6]        secondary_startup_64+0xb6/0xc0
[  763.659227][    C6] 
[  763.659227][    C6] -> #1 (&p->pi_lock){-.-.}:
[  763.659229][    C6]        __lock_acquire+0x44e/0x8c0
[  763.659229][    C6]        lock_acquire+0xc0/0x1c0
[  763.659230][    C6]        _raw_spin_lock_irqsave+0x3a/0x50
[  763.659230][    C6]        try_to_wake_up+0x5c/0xbc0
[  763.659230][    C6]        wake_up_process+0x15/0x20
[  763.659231][    C6]        __up+0x4a/0x50
[  763.659231][    C6]        up+0x45/0x50
[  763.659231][    C6]        __up_console_sem+0x37/0x60
[  763.659232][    C6]        console_unlock+0x357/0x600
[  763.659232][    C6]        vprintk_emit+0x101/0x320
[  763.659232][    C6]        vprintk_default+0x1f/0x30
[  763.659233][    C6]        vprintk_func+0x44/0xd4
[  763.659233][    C6]        printk+0x58/0x6f
[  763.659234][    C6]        do_exit+0xd73/0xd80
[  763.659234][    C6]        do_group_exit+0x41/0xd0
[  763.659234][    C6]        __x64_sys_exit_group+0x18/0x20
[  763.659235][    C6]        do_syscall_64+0x6d/0x488
[  763.659235][    C6]        entry_SYSCALL_64_after_hwframe+0x49/0xbe

> supporting a single device
> system-wide but it's not a model which drivers should generally follow.
> 
> (And if sclp will only ever support a single device system-wide, why
> the heck does it need to take sclp_lock() on the device initialization
> path??)
Qian Cai Oct. 6, 2019, 1:10 a.m. UTC | #7
> On Oct 5, 2019, at 8:44 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> There is no "console_lock".  Please be much more specific.
> 
>> It is easier to avoid,
>> 
>> zone_lock -> console_lock
>> 
>> rather than fixing the opposite.
> 
> "ease" isn't the main objective.  A more important question is "what
> makes sense".  We should be able to call printk() from anywhere, any
> time under any conditions.  That can't be done 100% but it is the
> objective.  printk() should be robust and not being able to call
> printk() while holding zone->lock isn't robust!
> 
> btw, this:
> 
> : It is unsafe to call printk() while zone->lock was held, i.e.,
> :
> :    zone->lock --> console_sem
> 
> doesn't make a lot of sense.  console_sem is a sleeping lock so
> attempting to acquire it (with down()!) under spinlock is a huge bug. 
> Again, please be careful with the descriptions.

Sorry, It is console_owner_lock.
Qian Cai Oct. 6, 2019, 1:56 a.m. UTC | #8
> On Oct 5, 2019, at 8:44 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> "ease" isn't the main objective.  A more important question is "what
> makes sense".  We should be able to call printk() from anywhere, any
> time under any conditions.  That can't be done 100% but it is the
> objective.  printk() should be robust and not being able to call
> printk() while holding zone->lock isn't robust!

https://github.com/torvalds/linux/commit/dbdda842fe96f8932bae554f0adf463c27c42bc7

If you look at the above commit, it explicitly call out the possible deadlock scenarios and recommend to call printk_deferred() instead in those areas of code where zone->lock looks like one of those candidates.

Patch
diff mbox series

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 15c2050c629b..232bbb1dc521 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8588,10 +8588,6 @@  void zone_pcp_reset(struct zone *zone)
 		BUG_ON(!PageBuddy(page));
 		order = page_order(page);
 		offlined_pages += 1 << order;
-#ifdef CONFIG_DEBUG_VM
-		pr_info("remove from free list %lx %d %lx\n",
-			pfn, 1 << order, end_pfn);
-#endif
 		del_page_from_free_area(page, &zone->free_area[order]);
 		for (i = 0; i < (1 << order); i++)
 			SetPageReserved((page+i));
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 89c19c0feadb..8682ccb5fbd1 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -25,8 +25,6 @@  static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 
 	zone = page_zone(page);
 
-	spin_lock_irqsave(&zone->lock, flags);
-
 	/*
 	 * We assume the caller intended to SET migrate type to isolate.
 	 * If it is already set, then someone else must have raced and
@@ -74,16 +72,18 @@  static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 		int mt = get_pageblock_migratetype(page);
 
 		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
+
+		spin_lock_irqsave(&zone->lock, flags);
 		zone->nr_isolate_pageblock++;
 		nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE,
 									NULL);
 
 		__mod_zone_freepage_state(zone, -nr_pages, mt);
-	}
+		spin_unlock_irqrestore(&zone->lock, flags);
 
-	spin_unlock_irqrestore(&zone->lock, flags);
-	if (!ret)
 		drain_all_pages(zone);
+	}
+
 	return ret;
 }