[v2] kmemleak: Turn kmemleak_lock to raw spinlock on RT

Message ID	1542877459-144382-1-git-send-email-zhe.he@windriver.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <owner-linux-mm@kvack.org> Received-SPF: pass (google.com: domain of zhe.he@windriver.com designates 147.11.1.11 as permitted sender) client-ip=147.11.1.11; From: <zhe.he@windriver.com> To: <catalin.marinas@arm.com>, <bigeasy@linutronix.de>, <tglx@linutronix.de>, <rostedt@goodmis.org>, <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>, <linux-rt-users@vger.kernel.org> Subject: [PATCH v2] kmemleak: Turn kmemleak_lock to raw spinlock on RT Date: Thu, 22 Nov 2018 17:04:19 +0800 Message-ID: <1542877459-144382-1-git-send-email-zhe.he@windriver.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	[v2] kmemleak: Turn kmemleak_lock to raw spinlock on RT \| expand [v2] kmemleak: Turn kmemleak_lock to raw spinlock on RT

He Zhe Nov. 22, 2018, 9:04 a.m. UTC

From: He Zhe <zhe.he@windriver.com>

kmemleak_lock, as a rwlock on RT, can possibly be held in atomic context and
causes the follow BUG.

BUG: scheduling while atomic: migration/15/132/0x00000002
Modules linked in: iTCO_wdt iTCO_vendor_support intel_rapl pcc_cpufreq
pnd2_edac intel_powerclamp coretemp crct10dif_pclmul crct10dif_common
aesni_intel matroxfb_base aes_x86_64 matroxfb_g450 matroxfb_accel
crypto_simd matroxfb_DAC1064 cryptd glue_helper g450_pll matroxfb_misc
i2c_ismt i2c_i801 acpi_cpufreq
Preemption disabled at:
[<ffffffff8c927c11>] cpu_stopper_thread+0x71/0x100
CPU: 15 PID: 132 Comm: migration/15 Not tainted 4.19.0-rt1-preempt-rt #1
Hardware name: Intel Corp. Harcuvar/Server, BIOS HAVLCRB1.X64.0015.D62.1708310404 08/31/2017
Call Trace:
 dump_stack+0x4f/0x6a
 ? cpu_stopper_thread+0x71/0x100
 __schedule_bug.cold.16+0x38/0x55
 __schedule+0x484/0x6c0
 schedule+0x3d/0xe0
 rt_spin_lock_slowlock_locked+0x118/0x2a0
 rt_spin_lock_slowlock+0x57/0x90
 __rt_spin_lock+0x26/0x30
 __write_rt_lock+0x23/0x1a0
 ? intel_pmu_cpu_dying+0x67/0x70
 rt_write_lock+0x2a/0x30
 find_and_remove_object+0x1e/0x80
 delete_object_full+0x10/0x20
 kmemleak_free+0x32/0x50
 kfree+0x104/0x1f0
 ? x86_pmu_starting_cpu+0x30/0x30
 intel_pmu_cpu_dying+0x67/0x70
 x86_pmu_dying_cpu+0x1a/0x30
 cpuhp_invoke_callback+0x92/0x700
 take_cpu_down+0x70/0xa0
 multi_cpu_stop+0x62/0xc0
 ? cpu_stop_queue_work+0x130/0x130
 cpu_stopper_thread+0x79/0x100
 smpboot_thread_fn+0x20f/0x2d0
 kthread+0x121/0x140
 ? sort_range+0x30/0x30
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40

And on v4.18 stable tree the following call trace, caused by grabbing
kmemleak_lock again, is also observed.

kernel BUG at kernel/locking/rtmutex.c:1048! 
invalid opcode: 0000 [#1] PREEMPT SMP PTI 
CPU: 5 PID: 689 Comm: mkfs.ext4 Not tainted 4.18.16-rt9-preempt-rt #1 
Hardware name: Intel Corp. Harcuvar/Server, BIOS HAVLCRB1.X64.0015.D62.1708310404 08/31/2017 
RIP: 0010:rt_spin_lock_slowlock_locked+0x277/0x2a0 
Code: e8 5e 64 61 ff e9 bc fe ff ff e8 54 64 61 ff e9 b7 fe ff ff 0f 0b e8 98 57 53 ff e9 43 fe ff ff e8 8e 57 53 ff e9 74 ff ff ff <0f> 0b 0f 0b 0f 0b 48 8b 43 10 48 85 c0 74 06 48 3b 58 38 75 0b 49 
RSP: 0018:ffff936846d4f3b0 EFLAGS: 00010046 
RAX: ffff8e3680361e00 RBX: ffffffff83a8b240 RCX: 0000000000000001 
RDX: 0000000000000000 RSI: ffff8e3680361e00 RDI: ffffffff83a8b258 
RBP: ffff936846d4f3e8 R08: ffff8e3680361e01 R09: ffffffff82adfdf0 
R10: ffffffff827ede18 R11: 0000000000000000 R12: ffff936846d4f3f8 
R13: ffff8e3680361e00 R14: ffff936846d4f3f8 R15: 0000000000000246 
FS: 00007fc8b6bfd780(0000) GS:ffff8e369f340000(0000) knlGS:0000000000000000 
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
CR2: 000055fb5659e000 CR3: 00000007fdd14000 CR4: 00000000003406e0 
Call Trace: 
 ? preempt_count_add+0x74/0xc0 
 rt_spin_lock_slowlock+0x57/0x90 
 ? __kernel_text_address+0x12/0x40 
 ? __save_stack_trace+0x75/0x100 
 __rt_spin_lock+0x26/0x30 
 __write_rt_lock+0x23/0x1a0 
 rt_write_lock+0x2a/0x30 
 create_object+0x17d/0x2b0 
 kmemleak_alloc+0x34/0x50 
 kmem_cache_alloc+0x146/0x220 
 ? mempool_alloc_slab+0x15/0x20 
 mempool_alloc_slab+0x15/0x20 
 mempool_alloc+0x65/0x170 
 sg_pool_alloc+0x21/0x60 
 __sg_alloc_table+0x101/0x160 
 ? sg_free_table_chained+0x30/0x30 
 sg_alloc_table_chained+0x8b/0xb0 
 scsi_init_sgtable+0x31/0x90 
 scsi_init_io+0x44/0x130 
 sd_setup_write_same16_cmnd+0xef/0x150 
 sd_init_command+0x6bf/0xaa0 
 ? cgroup_base_stat_cputime_account_end.isra.0+0x26/0x60 
 ? elv_rb_del+0x2a/0x40 
 scsi_setup_cmnd+0x8e/0x140 
 scsi_prep_fn+0x5d/0x140 
 blk_peek_request+0xda/0x2f0 
 scsi_request_fn+0x33/0x550 
 ? cfq_rb_erase+0x23/0x40 
 __blk_run_queue+0x43/0x60 
 cfq_insert_request+0x2f3/0x5d0 
 __elv_add_request+0x160/0x290 
 blk_flush_plug_list+0x204/0x230 
 schedule+0x87/0xe0 
 __write_rt_lock+0x18b/0x1a0 
 rt_write_lock+0x2a/0x30 
 create_object+0x17d/0x2b0 
 kmemleak_alloc+0x34/0x50 
 __kmalloc_node+0x1cd/0x340 
 alloc_request_size+0x30/0x70 
 mempool_alloc+0x65/0x170 
 ? ioc_lookup_icq+0x54/0x70 
 get_request+0x4e3/0x8d0 
 ? wait_woken+0x80/0x80 
 blk_queue_bio+0x153/0x470 
 generic_make_request+0x1dc/0x3f0 
 submit_bio+0x49/0x140 
 ? next_bio+0x38/0x40 
 submit_bio_wait+0x59/0x90 
 blkdev_issue_discard+0x7a/0xd0 
 ? _raw_spin_unlock_irqrestore+0x18/0x50 
 blk_ioctl_discard+0xc7/0x110 
 blkdev_ioctl+0x57e/0x960 
 ? __wake_up+0x13/0x20 
 block_ioctl+0x3d/0x50 
 do_vfs_ioctl+0xa8/0x610 
 ? vfs_write+0x166/0x1b0 
 ksys_ioctl+0x67/0x90 
 __x64_sys_ioctl+0x1a/0x20 
 do_syscall_64+0x4d/0xf0 
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

kmemleak is an error detecting feature. We would not expect as good performance
as without it. As there is no raw rwlock defining helpers, we turn kmemleak_lock
to a raw spinlock.

Signed-off-by: He Zhe <zhe.he@windriver.com>
Cc: catalin.marinas@arm.com
Cc: bigeasy@linutronix.de
Cc: tglx@linutronix.de
Cc: rostedt@goodmis.org
---
v2: Remove stable tag as this is only for preempt-rt patchset

 mm/kmemleak.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

Sebastian Sewior Nov. 22, 2018, 10:16 a.m. UTC | #1

On 2018-11-22 17:04:19 [+0800], zhe.he@windriver.com wrote:
> From: He Zhe <zhe.he@windriver.com>
> 
> kmemleak_lock, as a rwlock on RT, can possibly be held in atomic context and
> causes the follow BUG.

please use
 [PATCH RT … ]

in future while posting for RT. And this was (and is) on my TODO list.

Sebastian

Sebastian Sewior Nov. 23, 2018, 9:53 a.m. UTC | #2

On 2018-11-22 17:04:19 [+0800], zhe.he@windriver.com wrote:
> From: He Zhe <zhe.he@windriver.com>
> 
> kmemleak_lock, as a rwlock on RT, can possibly be held in atomic context and
> causes the follow BUG.
> 
> BUG: scheduling while atomic: migration/15/132/0x00000002
…
> Preemption disabled at:
> [<ffffffff8c927c11>] cpu_stopper_thread+0x71/0x100
> CPU: 15 PID: 132 Comm: migration/15 Not tainted 4.19.0-rt1-preempt-rt #1
> Hardware name: Intel Corp. Harcuvar/Server, BIOS HAVLCRB1.X64.0015.D62.1708310404 08/31/2017
> Call Trace:
>  dump_stack+0x4f/0x6a
>  ? cpu_stopper_thread+0x71/0x100
>  __schedule_bug.cold.16+0x38/0x55
>  __schedule+0x484/0x6c0
>  schedule+0x3d/0xe0
>  rt_spin_lock_slowlock_locked+0x118/0x2a0
>  rt_spin_lock_slowlock+0x57/0x90
>  __rt_spin_lock+0x26/0x30
>  __write_rt_lock+0x23/0x1a0
>  ? intel_pmu_cpu_dying+0x67/0x70
>  rt_write_lock+0x2a/0x30
>  find_and_remove_object+0x1e/0x80
>  delete_object_full+0x10/0x20
>  kmemleak_free+0x32/0x50
>  kfree+0x104/0x1f0
>  ? x86_pmu_starting_cpu+0x30/0x30
>  intel_pmu_cpu_dying+0x67/0x70
>  x86_pmu_dying_cpu+0x1a/0x30
>  cpuhp_invoke_callback+0x92/0x700
>  take_cpu_down+0x70/0xa0
>  multi_cpu_stop+0x62/0xc0
>  ? cpu_stop_queue_work+0x130/0x130
>  cpu_stopper_thread+0x79/0x100
>  smpboot_thread_fn+0x20f/0x2d0
>  kthread+0x121/0x140
>  ? sort_range+0x30/0x30
>  ? kthread_park+0x90/0x90
>  ret_from_fork+0x35/0x40

If this is the only problem? kfree() from a preempt-disabled section
should cause a warning even without kmemleak.

> And on v4.18 stable tree the following call trace, caused by grabbing
> kmemleak_lock again, is also observed.
> 
> kernel BUG at kernel/locking/rtmutex.c:1048! 
> invalid opcode: 0000 [#1] PREEMPT SMP PTI 
> CPU: 5 PID: 689 Comm: mkfs.ext4 Not tainted 4.18.16-rt9-preempt-rt #1 
…
> Call Trace: 
>  ? preempt_count_add+0x74/0xc0 
>  rt_spin_lock_slowlock+0x57/0x90 
>  ? __kernel_text_address+0x12/0x40 
>  ? __save_stack_trace+0x75/0x100 
>  __rt_spin_lock+0x26/0x30 
>  __write_rt_lock+0x23/0x1a0 
>  rt_write_lock+0x2a/0x30 
>  create_object+0x17d/0x2b0 
…

is this an RT-only problem? Because mainline should not allow read->read
locking or read->write locking for reader-writer locks. If this only
happens on v4.18 and not on v4.19 then something must have fixed it.
 

Sebastian

Andrea Parri Nov. 23, 2018, 11:02 a.m. UTC | #3

> is this an RT-only problem? Because mainline should not allow read->read
> locking or read->write locking for reader-writer locks. If this only
> happens on v4.18 and not on v4.19 then something must have fixed it.

Probably misunderstanding, but I'd say that read->read locking is "the
norm"...?

If you don't use qrwlock, readers are also "recursive", in part.,

  P0			P1
  read_lock(l)
			write_lock(l)
  read_lock(l)

won't block P0 on the second read_lock().  (qrwlock somehow complicate
the analysis; IIUC, they are recursive if and only if in_interrupt().).

  Andrea


>  
> 
> Sebastian

Sebastian Sewior Nov. 23, 2018, 11:06 a.m. UTC | #4

On 2018-11-23 12:02:55 [+0100], Andrea Parri wrote:
> > is this an RT-only problem? Because mainline should not allow read->read
> > locking or read->write locking for reader-writer locks. If this only
> > happens on v4.18 and not on v4.19 then something must have fixed it.
> 
> Probably misunderstanding, but I'd say that read->read locking is "the
> norm"...?
> 
> If you don't use qrwlock, readers are also "recursive", in part.,
> 
>   P0			P1
>   read_lock(l)
> 			write_lock(l)
>   read_lock(l)
> 
> won't block P0 on the second read_lock().  (qrwlock somehow complicate
> the analysis; IIUC, they are recursive if and only if in_interrupt().).

ehm, peterz, is that true? My memory on that is that all readers will
block if there is a writer pending.

>   Andrea

Sebastian

Catalin Marinas Nov. 23, 2018, 11:31 a.m. UTC | #5

On Fri, Nov 23, 2018 at 12:06:11PM +0100, Sebastian Andrzej Siewior wrote:
> On 2018-11-23 12:02:55 [+0100], Andrea Parri wrote:
> > > is this an RT-only problem? Because mainline should not allow read->read
> > > locking or read->write locking for reader-writer locks. If this only
> > > happens on v4.18 and not on v4.19 then something must have fixed it.
> > 
> > Probably misunderstanding, but I'd say that read->read locking is "the
> > norm"...?
> > 
> > If you don't use qrwlock, readers are also "recursive", in part.,
> > 
> >   P0			P1
> >   read_lock(l)
> > 			write_lock(l)
> >   read_lock(l)
> > 
> > won't block P0 on the second read_lock().  (qrwlock somehow complicate
> > the analysis; IIUC, they are recursive if and only if in_interrupt().).
> 
> ehm, peterz, is that true? My memory on that is that all readers will
> block if there is a writer pending.

With qwrlocks, the readers will normally block if there is a pending
writer (to avoid starving the writer), unless in_interrupt() when the
readers are allowed to starve a pending writer.

TLA+/PlusCal model here:  ;)

https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/tree/qrwlock.tla

Steven Rostedt Nov. 23, 2018, 3:51 p.m. UTC | #6

On Fri, 23 Nov 2018 11:31:31 +0000
Catalin Marinas <catalin.marinas@arm.com> wrote:

> With qwrlocks, the readers will normally block if there is a pending
> writer (to avoid starving the writer), unless in_interrupt() when the
> readers are allowed to starve a pending writer.
> 
> TLA+/PlusCal model here:  ;)
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/tree/qrwlock.tla
> 


And the code appears to confirm it too:

void queued_read_lock_slowpath(struct qrwlock *lock)
{
	/*
	 * Readers come here when they cannot get the lock without waiting
	 */
	if (unlikely(in_interrupt())) {
		/*
		 * Readers in interrupt context will get the lock immediately
		 * if the writer is just waiting (not holding the lock yet),
		 * so spin with ACQUIRE semantics until the lock is available
		 * without waiting in the queue.
		 */
		atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
		return;
	}
	atomic_sub(_QR_BIAS, &lock->cnts);

	/*
	 * Put the reader into the wait queue
	 */
	arch_spin_lock(&lock->wait_lock);
	atomic_add(_QR_BIAS, &lock->cnts);



-- Steve

He Zhe Nov. 24, 2018, 2:26 p.m. UTC | #7

On 2018/11/23 17:53, Sebastian Andrzej Siewior wrote:
> On 2018-11-22 17:04:19 [+0800], zhe.he@windriver.com wrote:
>> From: He Zhe <zhe.he@windriver.com>
>>
>> kmemleak_lock, as a rwlock on RT, can possibly be held in atomic context and
>> causes the follow BUG.
>>
>> BUG: scheduling while atomic: migration/15/132/0x00000002
> …
>> Preemption disabled at:
>> [<ffffffff8c927c11>] cpu_stopper_thread+0x71/0x100
>> CPU: 15 PID: 132 Comm: migration/15 Not tainted 4.19.0-rt1-preempt-rt #1
>> Hardware name: Intel Corp. Harcuvar/Server, BIOS HAVLCRB1.X64.0015.D62.1708310404 08/31/2017
>> Call Trace:
>>  dump_stack+0x4f/0x6a
>>  ? cpu_stopper_thread+0x71/0x100
>>  __schedule_bug.cold.16+0x38/0x55
>>  __schedule+0x484/0x6c0
>>  schedule+0x3d/0xe0
>>  rt_spin_lock_slowlock_locked+0x118/0x2a0
>>  rt_spin_lock_slowlock+0x57/0x90
>>  __rt_spin_lock+0x26/0x30
>>  __write_rt_lock+0x23/0x1a0
>>  ? intel_pmu_cpu_dying+0x67/0x70
>>  rt_write_lock+0x2a/0x30
>>  find_and_remove_object+0x1e/0x80
>>  delete_object_full+0x10/0x20
>>  kmemleak_free+0x32/0x50
>>  kfree+0x104/0x1f0
>>  ? x86_pmu_starting_cpu+0x30/0x30
>>  intel_pmu_cpu_dying+0x67/0x70
>>  x86_pmu_dying_cpu+0x1a/0x30
>>  cpuhp_invoke_callback+0x92/0x700
>>  take_cpu_down+0x70/0xa0
>>  multi_cpu_stop+0x62/0xc0
>>  ? cpu_stop_queue_work+0x130/0x130
>>  cpu_stopper_thread+0x79/0x100
>>  smpboot_thread_fn+0x20f/0x2d0
>>  kthread+0x121/0x140
>>  ? sort_range+0x30/0x30
>>  ? kthread_park+0x90/0x90
>>  ret_from_fork+0x35/0x40
> If this is the only problem? kfree() from a preempt-disabled section
> should cause a warning even without kmemleak.

Thanks for your review. I just did some tests aginst the latest code.

On latest v4.19.1-rt3, both of the call traces can be reproduced with kmemleak
enabied. And none can be reproduced with kmemleak disabled.

On latest mainline tree, none can be reproduced no matter kmemleak is enabled
or disabled.

I don't get why kfree from a preempt-disabled section should cause a warning
without kmemleak, since kfree can't sleep.

If I understand correctly, the call trace above is caused by trying to schedule
after preemption is disabled, which cannot be reached in mainline kernel. So
we might need to turn to use raw lock to keep preemption disabled.

>
>> And on v4.18 stable tree the following call trace, caused by grabbing
>> kmemleak_lock again, is also observed.
>>
>> kernel BUG at kernel/locking/rtmutex.c:1048! 
>> invalid opcode: 0000 [#1] PREEMPT SMP PTI 
>> CPU: 5 PID: 689 Comm: mkfs.ext4 Not tainted 4.18.16-rt9-preempt-rt #1 
> …
>> Call Trace: 
>>  ? preempt_count_add+0x74/0xc0 
>>  rt_spin_lock_slowlock+0x57/0x90 
>>  ? __kernel_text_address+0x12/0x40 
>>  ? __save_stack_trace+0x75/0x100 
>>  __rt_spin_lock+0x26/0x30 
>>  __write_rt_lock+0x23/0x1a0 
>>  rt_write_lock+0x2a/0x30 
>>  create_object+0x17d/0x2b0 
> …
>
> is this an RT-only problem? Because mainline should not allow read->read
> locking or read->write locking for reader-writer locks. If this only
> happens on v4.18 and not on v4.19 then something must have fixed it.

From what I reached above, this is RT-only and happens on v4.18 and v4.19.

The call trace above is caused by grabbing kmemleak_lock and then getting
scheduled and then re-grabbing kmemleak_lock. Using raw lock can also solve
this problem.

Thanks,
Zhe

>  
>
> Sebastian
>

Peter Zijlstra Nov. 26, 2018, 8:40 a.m. UTC | #8

On Fri, Nov 23, 2018 at 12:06:11PM +0100, Sebastian Andrzej Siewior wrote:
> On 2018-11-23 12:02:55 [+0100], Andrea Parri wrote:
> > > is this an RT-only problem? Because mainline should not allow read->read
> > > locking or read->write locking for reader-writer locks. If this only
> > > happens on v4.18 and not on v4.19 then something must have fixed it.
> > 
> > Probably misunderstanding, but I'd say that read->read locking is "the
> > norm"...?
> > 
> > If you don't use qrwlock, readers are also "recursive", in part.,
> > 
> >   P0			P1
> >   read_lock(l)
> > 			write_lock(l)
> >   read_lock(l)
> > 
> > won't block P0 on the second read_lock().  (qrwlock somehow complicate
> > the analysis; IIUC, they are recursive if and only if in_interrupt().).
> 
> ehm, peterz, is that true? My memory on that is that all readers will
> block if there is a writer pending.

Since qrwlock is the more strict, all users should use its semantics.
Just like we cannot 'rely' on the unfairness of some lock
implementations.

Sebastian Sewior Nov. 30, 2018, 6:19 p.m. UTC | #9

On 2018-11-24 22:26:46 [+0800], He Zhe wrote:
> On latest v4.19.1-rt3, both of the call traces can be reproduced with kmemleak
> enabied. And none can be reproduced with kmemleak disabled.
okay. So it needs attention.

> On latest mainline tree, none can be reproduced no matter kmemleak is enabled
> or disabled.
> 
> I don't get why kfree from a preempt-disabled section should cause a warning
> without kmemleak, since kfree can't sleep.

it might. It will acquire a sleeping lock if it has go down to the
memory allocator to actually give memory back.

> If I understand correctly, the call trace above is caused by trying to schedule
> after preemption is disabled, which cannot be reached in mainline kernel. So
> we might need to turn to use raw lock to keep preemption disabled.

The buddy-allocator runs with spin locks so it is okay on !RT. So you
can use kfree() with disabled preemption or disabled interrupts.
I don't think that we want to use raw-locks in the buddy-allocator.

> >From what I reached above, this is RT-only and happens on v4.18 and v4.19.
> 
> The call trace above is caused by grabbing kmemleak_lock and then getting
> scheduled and then re-grabbing kmemleak_lock. Using raw lock can also solve
> this problem.

But this is a reader / writer lock. And if I understand the other part
of the thread then it needs multiple readers.
Couldn't we just get rid of that kfree() or move it somewhere else?
I mean if the free() memory on CPU-down and allocate it again CPU-up
then we could skip that, rigth? Just allocate it and don't free it
because the CPU will likely get up again.

> Thanks,
> Zhe

Sebastian

He Zhe Dec. 5, 2018, 1:53 p.m. UTC | #10

On 2018/12/1 02:19, Sebastian Andrzej Siewior wrote:
> On 2018-11-24 22:26:46 [+0800], He Zhe wrote:
>> On latest v4.19.1-rt3, both of the call traces can be reproduced with kmemleak
>> enabied. And none can be reproduced with kmemleak disabled.
> okay. So it needs attention.
>
>> On latest mainline tree, none can be reproduced no matter kmemleak is enabled
>> or disabled.
>>
>> I don't get why kfree from a preempt-disabled section should cause a warning
>> without kmemleak, since kfree can't sleep.
> it might. It will acquire a sleeping lock if it has go down to the
> memory allocator to actually give memory back.

Got it. Thanks.

>
>> If I understand correctly, the call trace above is caused by trying to schedule
>> after preemption is disabled, which cannot be reached in mainline kernel. So
>> we might need to turn to use raw lock to keep preemption disabled.
> The buddy-allocator runs with spin locks so it is okay on !RT. So you
> can use kfree() with disabled preemption or disabled interrupts.
> I don't think that we want to use raw-locks in the buddy-allocator.

For call trace 1:

I went through the calling paths inside kfree() and found that there have already
been things using raw lock in it as follow.

1) in the path of kfree() itself
kfree -> slab_free -> do_slab_free -> __slab_free -> raw_spin_lock_irqsave

2) in the path of CONFIG_DEBUG_OBJECTS_FREE
kfree -> slab_free -> slab_free_freelist_hook -> slab_free_hook -> debug_check_no_obj_freed -> __debug_check_no_obj_freed -> raw_spin_lock_irqsave

3) in the path of CONFIG_LOCKDEP
kfree -> __free_pages -> free_unref_page -> free_unref_page_prepare -> free_pcp_prepare -> free_pages_prepare -> debug_check_no_locks_freed -> debug_check_no_locks_freed -> raw_local_irq_save

Since kmemleak would most likely be used to debug in environments where
we would not expect as great performance as without it, and kfree() has raw locks
in its main path and other debug function paths, I suppose it wouldn't hurt that
we change to raw locks.

>
>> >From what I reached above, this is RT-only and happens on v4.18 and v4.19.
>>
>> The call trace above is caused by grabbing kmemleak_lock and then getting
>> scheduled and then re-grabbing kmemleak_lock. Using raw lock can also solve
>> this problem.
> But this is a reader / writer lock. And if I understand the other part
> of the thread then it needs multiple readers.

For call trace 2:

I don't get what "it needs multiple readers" exactly means here.

In this call trace, the kmemleak_lock is grabbed as write lock, and then scheduled
away, and then grabbed again as write lock from another path. It's a
write->write locking, compared to the discussion in the other part of the thread.

This is essentially because kmemleak hooks on the very low level memory
allocation and free operations. After scheduled away, it can easily re-enter itself.
We need raw locks to prevent this from happening.

> Couldn't we just get rid of that kfree() or move it somewhere else?
> I mean if the free() memory on CPU-down and allocate it again CPU-up
> then we could skip that, rigth? Just allocate it and don't free it
> because the CPU will likely get up again.

For call trace 1:

I went through the CPU hotplug code and found that the allocation of the
problematic data, cpuc->shared_regs, is done in intel_pmu_cpu_prepare. And
the free is done in intel_pmu_cpu_dying. They are handlers triggered by two
different perf events.

It seems we can hardly form a convincing method that holds the data while
CPUs are off and then uses it again. raw locks would be easy and good enough.

Thanks,
Zhe

>
>> Thanks,
>> Zhe
> Sebastian
>

Sebastian Sewior Dec. 5, 2018, 7:14 p.m. UTC | #11

On 2018-12-05 21:53:37 [+0800], He Zhe wrote:
> For call trace 1:
…
> Since kmemleak would most likely be used to debug in environments where
> we would not expect as great performance as without it, and kfree() has raw locks
> in its main path and other debug function paths, I suppose it wouldn't hurt that
> we change to raw locks.
okay.

> >> >From what I reached above, this is RT-only and happens on v4.18 and v4.19.
> >>
> >> The call trace above is caused by grabbing kmemleak_lock and then getting
> >> scheduled and then re-grabbing kmemleak_lock. Using raw lock can also solve
> >> this problem.
> > But this is a reader / writer lock. And if I understand the other part
> > of the thread then it needs multiple readers.
> 
> For call trace 2:
> 
> I don't get what "it needs multiple readers" exactly means here.
> 
> In this call trace, the kmemleak_lock is grabbed as write lock, and then scheduled
> away, and then grabbed again as write lock from another path. It's a
> write->write locking, compared to the discussion in the other part of the thread.
> 
> This is essentially because kmemleak hooks on the very low level memory
> allocation and free operations. After scheduled away, it can easily re-enter itself.
> We need raw locks to prevent this from happening.

With raw locks you wouldn't have multiple readers at the same time.
Maybe you wouldn't have recursion but since you can't have multiple
readers you would add lock contention where was none (because you could
have two readers at the same time).

> > Couldn't we just get rid of that kfree() or move it somewhere else?
> > I mean if the free() memory on CPU-down and allocate it again CPU-up
> > then we could skip that, rigth? Just allocate it and don't free it
> > because the CPU will likely get up again.
> 
> For call trace 1:
> 
> I went through the CPU hotplug code and found that the allocation of the
> problematic data, cpuc->shared_regs, is done in intel_pmu_cpu_prepare. And
> the free is done in intel_pmu_cpu_dying. They are handlers triggered by two
> different perf events.
> 
> It seems we can hardly form a convincing method that holds the data while
> CPUs are off and then uses it again. raw locks would be easy and good enough.

Why not allocate the memory in intel_pmu_cpu_prepare() if it is not
already there (otherwise skip the allocation) and in
intel_pmu_cpu_dying() not free it. It looks easy.

> Thanks,
> Zhe

Sebastian

He Zhe Dec. 18, 2018, 10:51 a.m. UTC | #12

On 2018/12/6 03:14, Sebastian Andrzej Siewior wrote:
> On 2018-12-05 21:53:37 [+0800], He Zhe wrote:
>> For call trace 1:
> …
>> Since kmemleak would most likely be used to debug in environments where
>> we would not expect as great performance as without it, and kfree() has raw locks
>> in its main path and other debug function paths, I suppose it wouldn't hurt that
>> we change to raw locks.
> okay.
>
>>>> >From what I reached above, this is RT-only and happens on v4.18 and v4.19.
>>>>
>>>> The call trace above is caused by grabbing kmemleak_lock and then getting
>>>> scheduled and then re-grabbing kmemleak_lock. Using raw lock can also solve
>>>> this problem.
>>> But this is a reader / writer lock. And if I understand the other part
>>> of the thread then it needs multiple readers.
>> For call trace 2:
>>
>> I don't get what "it needs multiple readers" exactly means here.
>>
>> In this call trace, the kmemleak_lock is grabbed as write lock, and then scheduled
>> away, and then grabbed again as write lock from another path. It's a
>> write->write locking, compared to the discussion in the other part of the thread.
>>
>> This is essentially because kmemleak hooks on the very low level memory
>> allocation and free operations. After scheduled away, it can easily re-enter itself.
>> We need raw locks to prevent this from happening.
> With raw locks you wouldn't have multiple readers at the same time.
> Maybe you wouldn't have recursion but since you can't have multiple
> readers you would add lock contention where was none (because you could
> have two readers at the same time).

Sorry for slow reply.

OK. I understand your concern finally. At the commit log said, I wanted to use raw
rwlock but didn't find the DEFINE helper for it. Thinking it would not be expected to
have great performance, I turn to use raw spinlock instead. And yes, this would
introduce worse performance.

Maybe I miss the reason, but why don't we have rwlock_types_raw.h to define raw
rwlock helper for RT? With that, we can cleanly replace kmemleak_lock with a raw
rwlock.

Or should we just define a raw rwlock using basic type, like arch_rwlock_t, only in
kmemleak?

>
>>> Couldn't we just get rid of that kfree() or move it somewhere else?
>>> I mean if the free() memory on CPU-down and allocate it again CPU-up
>>> then we could skip that, rigth? Just allocate it and don't free it
>>> because the CPU will likely get up again.
>> For call trace 1:
>>
>> I went through the CPU hotplug code and found that the allocation of the
>> problematic data, cpuc->shared_regs, is done in intel_pmu_cpu_prepare. And
>> the free is done in intel_pmu_cpu_dying. They are handlers triggered by two
>> different perf events.
>>
>> It seems we can hardly form a convincing method that holds the data while
>> CPUs are off and then uses it again. raw locks would be easy and good enough.
> Why not allocate the memory in intel_pmu_cpu_prepare() if it is not
> already there (otherwise skip the allocation) and in
> intel_pmu_cpu_dying() not free it. It looks easy.

Thanks for your suggestion. I've sent the change for call trace 1 to mainline
mailing list. Hopefully it can be accepted.

Zhe

>
>> Thanks,
>> Zhe
> Sebastian
>

Catalin Marinas Dec. 18, 2018, 3:07 p.m. UTC | #13

On Tue, Dec 18, 2018 at 06:51:41PM +0800, He Zhe wrote:
> On 2018/12/6 03:14, Sebastian Andrzej Siewior wrote:
> > With raw locks you wouldn't have multiple readers at the same time.
> > Maybe you wouldn't have recursion but since you can't have multiple
> > readers you would add lock contention where was none (because you could
> > have two readers at the same time).
> 
> OK. I understand your concern finally. At the commit log said, I wanted to use raw
> rwlock but didn't find the DEFINE helper for it. Thinking it would not be expected to
> have great performance, I turn to use raw spinlock instead. And yes, this would
> introduce worse performance.

Looking through the kmemleak code, I can't really find significant
reader contention. The longest holder of this lock (read) is the scan
thread which is also protected by a scan_mutex, so can't run
concurrently with another scanner (triggered through debugfs). The other
read_lock(&kmemleak_lock) user is find_and_get_object() called from a
few places. However, all such places normally follow a create_object()
call (kmemleak_alloc() and friends) which already performs a
write_lock(&kmemleak_lock), so it needs to wait for the scan thread to
release the kmemleak_lock.

It may be worth running some performance/latency tests during kmemleak
scanning (echo scan > /sys/kernel/debug/kmemleak) but at a quick look,
I don't think we'd see any difference with a raw_spin_lock_t.

With a bit more thinking (though I'll be off until the new year), we
could probably get rid of the kmemleak_lock entirely in scan_block() and
make lookup_object() and the related rbtree code in kmemleak RCU-safe.

Catalin Marinas Dec. 18, 2018, 3:12 p.m. UTC | #14

On Thu, Nov 22, 2018 at 05:04:19PM +0800, zhe.he@windriver.com wrote:
> From: He Zhe <zhe.he@windriver.com>
> 
> kmemleak_lock, as a rwlock on RT, can possibly be held in atomic context and
> causes the follow BUG.
> 
> BUG: scheduling while atomic: migration/15/132/0x00000002
> Modules linked in: iTCO_wdt iTCO_vendor_support intel_rapl pcc_cpufreq
> pnd2_edac intel_powerclamp coretemp crct10dif_pclmul crct10dif_common
> aesni_intel matroxfb_base aes_x86_64 matroxfb_g450 matroxfb_accel
> crypto_simd matroxfb_DAC1064 cryptd glue_helper g450_pll matroxfb_misc
> i2c_ismt i2c_i801 acpi_cpufreq
> Preemption disabled at:
> [<ffffffff8c927c11>] cpu_stopper_thread+0x71/0x100
> CPU: 15 PID: 132 Comm: migration/15 Not tainted 4.19.0-rt1-preempt-rt #1
> Hardware name: Intel Corp. Harcuvar/Server, BIOS HAVLCRB1.X64.0015.D62.1708310404 08/31/2017
> Call Trace:
>  dump_stack+0x4f/0x6a
>  ? cpu_stopper_thread+0x71/0x100
>  __schedule_bug.cold.16+0x38/0x55
>  __schedule+0x484/0x6c0
>  schedule+0x3d/0xe0
>  rt_spin_lock_slowlock_locked+0x118/0x2a0
>  rt_spin_lock_slowlock+0x57/0x90
>  __rt_spin_lock+0x26/0x30
>  __write_rt_lock+0x23/0x1a0
>  ? intel_pmu_cpu_dying+0x67/0x70
>  rt_write_lock+0x2a/0x30
>  find_and_remove_object+0x1e/0x80
>  delete_object_full+0x10/0x20
>  kmemleak_free+0x32/0x50
>  kfree+0x104/0x1f0
>  ? x86_pmu_starting_cpu+0x30/0x30
>  intel_pmu_cpu_dying+0x67/0x70
>  x86_pmu_dying_cpu+0x1a/0x30
>  cpuhp_invoke_callback+0x92/0x700
>  take_cpu_down+0x70/0xa0
>  multi_cpu_stop+0x62/0xc0
>  ? cpu_stop_queue_work+0x130/0x130
>  cpu_stopper_thread+0x79/0x100
>  smpboot_thread_fn+0x20f/0x2d0
>  kthread+0x121/0x140
>  ? sort_range+0x30/0x30
>  ? kthread_park+0x90/0x90
>  ret_from_fork+0x35/0x40
> 
> And on v4.18 stable tree the following call trace, caused by grabbing
> kmemleak_lock again, is also observed.
> 
> kernel BUG at kernel/locking/rtmutex.c:1048! 
> invalid opcode: 0000 [#1] PREEMPT SMP PTI 
> CPU: 5 PID: 689 Comm: mkfs.ext4 Not tainted 4.18.16-rt9-preempt-rt #1 
> Hardware name: Intel Corp. Harcuvar/Server, BIOS HAVLCRB1.X64.0015.D62.1708310404 08/31/2017 
> RIP: 0010:rt_spin_lock_slowlock_locked+0x277/0x2a0 
> Code: e8 5e 64 61 ff e9 bc fe ff ff e8 54 64 61 ff e9 b7 fe ff ff 0f 0b e8 98 57 53 ff e9 43 fe ff ff e8 8e 57 53 ff e9 74 ff ff ff <0f> 0b 0f 0b 0f 0b 48 8b 43 10 48 85 c0 74 06 48 3b 58 38 75 0b 49 
> RSP: 0018:ffff936846d4f3b0 EFLAGS: 00010046 
> RAX: ffff8e3680361e00 RBX: ffffffff83a8b240 RCX: 0000000000000001 
> RDX: 0000000000000000 RSI: ffff8e3680361e00 RDI: ffffffff83a8b258 
> RBP: ffff936846d4f3e8 R08: ffff8e3680361e01 R09: ffffffff82adfdf0 
> R10: ffffffff827ede18 R11: 0000000000000000 R12: ffff936846d4f3f8 
> R13: ffff8e3680361e00 R14: ffff936846d4f3f8 R15: 0000000000000246 
> FS: 00007fc8b6bfd780(0000) GS:ffff8e369f340000(0000) knlGS:0000000000000000 
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
> CR2: 000055fb5659e000 CR3: 00000007fdd14000 CR4: 00000000003406e0 
> Call Trace: 
>  ? preempt_count_add+0x74/0xc0 
>  rt_spin_lock_slowlock+0x57/0x90 
>  ? __kernel_text_address+0x12/0x40 
>  ? __save_stack_trace+0x75/0x100 
>  __rt_spin_lock+0x26/0x30 
>  __write_rt_lock+0x23/0x1a0 
>  rt_write_lock+0x2a/0x30 
>  create_object+0x17d/0x2b0 
>  kmemleak_alloc+0x34/0x50 
>  kmem_cache_alloc+0x146/0x220 
>  ? mempool_alloc_slab+0x15/0x20 
>  mempool_alloc_slab+0x15/0x20 
>  mempool_alloc+0x65/0x170 
>  sg_pool_alloc+0x21/0x60 
>  __sg_alloc_table+0x101/0x160 
>  ? sg_free_table_chained+0x30/0x30 
>  sg_alloc_table_chained+0x8b/0xb0 
>  scsi_init_sgtable+0x31/0x90 
>  scsi_init_io+0x44/0x130 
>  sd_setup_write_same16_cmnd+0xef/0x150 
>  sd_init_command+0x6bf/0xaa0 
>  ? cgroup_base_stat_cputime_account_end.isra.0+0x26/0x60 
>  ? elv_rb_del+0x2a/0x40 
>  scsi_setup_cmnd+0x8e/0x140 
>  scsi_prep_fn+0x5d/0x140 
>  blk_peek_request+0xda/0x2f0 
>  scsi_request_fn+0x33/0x550 
>  ? cfq_rb_erase+0x23/0x40 
>  __blk_run_queue+0x43/0x60 
>  cfq_insert_request+0x2f3/0x5d0 
>  __elv_add_request+0x160/0x290 
>  blk_flush_plug_list+0x204/0x230 
>  schedule+0x87/0xe0 
>  __write_rt_lock+0x18b/0x1a0 
>  rt_write_lock+0x2a/0x30 
>  create_object+0x17d/0x2b0 
>  kmemleak_alloc+0x34/0x50 
>  __kmalloc_node+0x1cd/0x340 
>  alloc_request_size+0x30/0x70 
>  mempool_alloc+0x65/0x170 
>  ? ioc_lookup_icq+0x54/0x70 
>  get_request+0x4e3/0x8d0 
>  ? wait_woken+0x80/0x80 
>  blk_queue_bio+0x153/0x470 
>  generic_make_request+0x1dc/0x3f0 
>  submit_bio+0x49/0x140 
>  ? next_bio+0x38/0x40 
>  submit_bio_wait+0x59/0x90 
>  blkdev_issue_discard+0x7a/0xd0 
>  ? _raw_spin_unlock_irqrestore+0x18/0x50 
>  blk_ioctl_discard+0xc7/0x110 
>  blkdev_ioctl+0x57e/0x960 
>  ? __wake_up+0x13/0x20 
>  block_ioctl+0x3d/0x50 
>  do_vfs_ioctl+0xa8/0x610 
>  ? vfs_write+0x166/0x1b0 
>  ksys_ioctl+0x67/0x90 
>  __x64_sys_ioctl+0x1a/0x20 
>  do_syscall_64+0x4d/0xf0 
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> kmemleak is an error detecting feature. We would not expect as good performance
> as without it. As there is no raw rwlock defining helpers, we turn kmemleak_lock
> to a raw spinlock.
> 
> Signed-off-by: He Zhe <zhe.he@windriver.com>
> Cc: catalin.marinas@arm.com
> Cc: bigeasy@linutronix.de
> Cc: tglx@linutronix.de
> Cc: rostedt@goodmis.org

As I replied already, I don't think this patch would increase the
kmemleak latency (or performance), although I haven't actually tested
it. FWIW:

Acked-by: Catalin Marinas <catalin.marinas@arm.com>

Sebastian Sewior Dec. 19, 2018, 3:30 p.m. UTC | #15

On 2018-12-18 15:07:45 [+0000], Catalin Marinas wrote:
…
> It may be worth running some performance/latency tests during kmemleak
> scanning (echo scan > /sys/kernel/debug/kmemleak) but at a quick look,
> I don't think we'd see any difference with a raw_spin_lock_t.
> 
> With a bit more thinking (though I'll be off until the new year), we
> could probably get rid of the kmemleak_lock entirely in scan_block() and
> make lookup_object() and the related rbtree code in kmemleak RCU-safe.

Okay. So let me apply that patch into my RT tree with your ack (from the
other email). And then I hope that it either shows up upstream or gets
replaced with RCU in the ende :)

Thanks.

Sebastian

He Zhe Dec. 20, 2018, 1:46 a.m. UTC | #16

On 2018/12/19 23:30, Sebastian Andrzej Siewior wrote:
> On 2018-12-18 15:07:45 [+0000], Catalin Marinas wrote:
> …
>> It may be worth running some performance/latency tests during kmemleak
>> scanning (echo scan > /sys/kernel/debug/kmemleak) but at a quick look,
>> I don't think we'd see any difference with a raw_spin_lock_t.
>>
>> With a bit more thinking (though I'll be off until the new year), we
>> could probably get rid of the kmemleak_lock entirely in scan_block() and
>> make lookup_object() and the related rbtree code in kmemleak RCU-safe.
> Okay. So let me apply that patch into my RT tree with your ack (from the
> other email). And then I hope that it either shows up upstream or gets
> replaced with RCU in the ende :)

I'd like to do the upstreaming or replacing. Thanks.

Zhe

>
> Thanks.
>
> Sebastian
>

[v2] kmemleak: Turn kmemleak_lock to raw spinlock on RT

Commit Message

Comments

Patch