diff mbox series

x86/MCE/AMD: Decrement threshold_bank refcount when removing threshold blocks

Message ID 20220614174346.3648305-1-yazen.ghannam@amd.com (mailing list archive)
State New, archived
Headers show
Series x86/MCE/AMD: Decrement threshold_bank refcount when removing threshold blocks | expand

Commit Message

Yazen Ghannam June 14, 2022, 5:43 p.m. UTC
AMD systems from Family 10h to 16h share MCA bank 4 across multiple CPUs.
Therefore, the threshold_bank structure for bank 4, and its threshold_block
structures, will be initialized once at boot time. And the kobject for the
shared bank will be added to each of the CPUs that share it. Furthermore,
the threshold_blocks for the shared bank will be added again to the bank's
kobject. These additions will increase the refcount for the bank's kobject.

For example, a shared bank with two blocks and shared across two CPUs will
be set up like this:

CPU0 init
  bank create and add; bank refcount = 1; threshold_create_bank()
    block 0 init and add; bank refcount = 2; allocate_threshold_blocks()
    block 1 init and add; bank refcount = 3; allocate_threshold_blocks()
CPU1 init
  bank add; bank refcount = 3; threshold_create_bank()
    block 0 add; bank refcount = 4; __threshold_add_blocks()
    block 1 add; bank refcount = 5; __threshold_add_blocks()

Currently in threshold_remove_bank(), if the bank is shared then
__threshold_remove_blocks() is called. Here the shared bank's kobject and
the bank's blocks' kobjects are deleted. This is done on the first call
even while the structures are still shared. Subsequent calls from other
CPUs that share the structures will attempt to delete the kobjects.

During kobject_del(), kobject->sd is removed. If the kobject is not part of
a kset with default_groups, then subsequent kobject_del() calls seem safe
even with kobject->sd == NULL.

Originally, the AMD MCA thresholding structures did not use default_groups.
And so the above behavior was not apparent.

However, a recent change implemented default_groups for the thresholding
structures. Therefore, kobject_del() will go down the sysfs_remove_groups()
code path. In this case, the first kobject_del() may succeed and remove
kobject->sd. But subsequent kobject_del() calls will give a WARNing in
kernfs_remove_by_name_ns() since kobject->sd == NULL.

Use kobject_put() on the shared bank's kobject when "removing" blocks. This
decrements the bank's refcount while keeping kobjects enabled until the
bank is no longer shared. At that point, kobject_put() will be called on
the blocks which drives their refcount to 0 and deletes them and also
decrementing the bank's refcount. And finally kobject_put() will be called
on the bank driving its refcount to 0 and deleting it.

With this patch and the example above:

CPU1 shutdown
  bank is shared; bank refcount = 5; threshold_remove_bank()
    block 0 put parent bank; bank refcount = 4; __threshold_remove_blocks()
    block 1 put parent bank; bank refcount = 3; __threshold_remove_blocks()
CPU0 shutdown
  bank is no longer shared; bank refcount = 3; threshold_remove_bank()
    block 0 put block; bank refcount = 2; deallocate_threshold_blocks()
    block 1 put block; bank refcount = 1; deallocate_threshold_blocks()
  put bank; bank refcount = 0; threshold_remove_bank()

Fixes: 7f99cb5e6039 ("x86/CPU/AMD: Use default_groups in kobj_type")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/alpine.LRH.2.02.2205301145540.25840@file01.intranet.prod.int.rdu2.redhat.com
---
 arch/x86/kernel/cpu/mce/amd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Greg Kroah-Hartman June 15, 2022, 6:33 a.m. UTC | #1
On Tue, Jun 14, 2022 at 05:43:46PM +0000, Yazen Ghannam wrote:
> AMD systems from Family 10h to 16h share MCA bank 4 across multiple CPUs.
> Therefore, the threshold_bank structure for bank 4, and its threshold_block
> structures, will be initialized once at boot time. And the kobject for the
> shared bank will be added to each of the CPUs that share it. Furthermore,
> the threshold_blocks for the shared bank will be added again to the bank's
> kobject. These additions will increase the refcount for the bank's kobject.
> 
> For example, a shared bank with two blocks and shared across two CPUs will
> be set up like this:
> 
> CPU0 init
>   bank create and add; bank refcount = 1; threshold_create_bank()
>     block 0 init and add; bank refcount = 2; allocate_threshold_blocks()
>     block 1 init and add; bank refcount = 3; allocate_threshold_blocks()
> CPU1 init
>   bank add; bank refcount = 3; threshold_create_bank()
>     block 0 add; bank refcount = 4; __threshold_add_blocks()
>     block 1 add; bank refcount = 5; __threshold_add_blocks()
> 
> Currently in threshold_remove_bank(), if the bank is shared then
> __threshold_remove_blocks() is called. Here the shared bank's kobject and
> the bank's blocks' kobjects are deleted. This is done on the first call
> even while the structures are still shared. Subsequent calls from other
> CPUs that share the structures will attempt to delete the kobjects.
> 
> During kobject_del(), kobject->sd is removed. If the kobject is not part of
> a kset with default_groups, then subsequent kobject_del() calls seem safe
> even with kobject->sd == NULL.
> 
> Originally, the AMD MCA thresholding structures did not use default_groups.
> And so the above behavior was not apparent.
> 
> However, a recent change implemented default_groups for the thresholding
> structures. Therefore, kobject_del() will go down the sysfs_remove_groups()
> code path. In this case, the first kobject_del() may succeed and remove
> kobject->sd. But subsequent kobject_del() calls will give a WARNing in
> kernfs_remove_by_name_ns() since kobject->sd == NULL.
> 
> Use kobject_put() on the shared bank's kobject when "removing" blocks. This
> decrements the bank's refcount while keeping kobjects enabled until the
> bank is no longer shared. At that point, kobject_put() will be called on
> the blocks which drives their refcount to 0 and deletes them and also
> decrementing the bank's refcount. And finally kobject_put() will be called
> on the bank driving its refcount to 0 and deleting it.
> 
> With this patch and the example above:
> 
> CPU1 shutdown
>   bank is shared; bank refcount = 5; threshold_remove_bank()
>     block 0 put parent bank; bank refcount = 4; __threshold_remove_blocks()
>     block 1 put parent bank; bank refcount = 3; __threshold_remove_blocks()
> CPU0 shutdown
>   bank is no longer shared; bank refcount = 3; threshold_remove_bank()
>     block 0 put block; bank refcount = 2; deallocate_threshold_blocks()
>     block 1 put block; bank refcount = 1; deallocate_threshold_blocks()
>   put bank; bank refcount = 0; threshold_remove_bank()
> 
> Fixes: 7f99cb5e6039 ("x86/CPU/AMD: Use default_groups in kobj_type")

This predates this fixup, this commit just exposed the root problem here
so odds are it should be backported further, right?

thanks,

greg k-h
Yazen Ghannam June 15, 2022, 1:51 p.m. UTC | #2
On Wed, Jun 15, 2022 at 08:33:50AM +0200, Greg KH wrote:
...
> > 
> > Fixes: 7f99cb5e6039 ("x86/CPU/AMD: Use default_groups in kobj_type")
> 
> This predates this fixup, this commit just exposed the root problem here
> so odds are it should be backported further, right?
>

Yes, I believe that's true based on code inspection. But I'm not aware of any
reported issues in this area before the commit listed above. So I decided to
switch the Fixes tag from what I had before (shown below). I can switch it
back if you think that's best.

Fixes: 019f34fccfd5 ("x86, MCE, AMD: Move shared bank to node descriptor")

Thanks,
Yazen
Mateusz Jończyk Aug. 12, 2022, 9:14 p.m. UTC | #3
W dniu 14.06.2022 o 19:43, Yazen Ghannam pisze:
> AMD systems from Family 10h to 16h share MCA bank 4 across multiple CPUs.
> Therefore, the threshold_bank structure for bank 4, and its threshold_block
> structures, will be initialized once at boot time. And the kobject for the
> shared bank will be added to each of the CPUs that share it. Furthermore,
> the threshold_blocks for the shared bank will be added again to the bank's
> kobject. These additions will increase the refcount for the bank's kobject.
>
> For example, a shared bank with two blocks and shared across two CPUs will
> be set up like this:
>
[snip]

Hello,

The problem this patch solves (multiple warnings with stacktraces on suspend)
also affects me, on a tri-core old AMD Phenom II X3 720 processor.

I have done some further debugging by adding printks to the code and to me it seems that your diagnosis
was correct. In the dmesg I can see:

[   66.094076] ACPI: PM: Preparing to enter system sleep state S3
[   66.510378] ACPI: PM: Saving platform NVS memory
[   66.510718] Disabling non-boot CPUs ...
[   66.511043] mce: ENTER mce_threshold_remove_device(cpu = 1)
[   66.511046] mce: ENTER __threshold_remove_device(); numbanks=6
[   66.511048] mce: ENTER threshold_remove_bank(bank name = northbridge, bank->cpus = 3, bank->shared = 1)
[   66.511049] mce: ENTER __threshold_remove_blocks(bank name = northbridge, bank->cpus = 2, bank->shared = 1)
[   66.511060] mce: EXIT __threshold_remove_blocks()
[   66.511061] mce: threshold_remove_bank: EXIT after __threshold_remove_blocks()
[   66.511062] mce: EXIT __threshold_remove_device()
[   66.511063] mce: EXIT mce_threshold_remove_device(cpu = 1)
[   66.512247] smpboot: CPU 1 is now offline
[   66.513915] mce: ENTER mce_threshold_remove_device(cpu = 2)
[   66.513919] mce: ENTER __threshold_remove_device(); numbanks=6
[   66.513922] mce: ENTER threshold_remove_bank(bank name = northbridge, bank->cpus = 2, bank->shared = 1)
[   66.513925] mce: ENTER __threshold_remove_blocks(bank name = northbridge, bank->cpus = 1, bank->shared = 1)
[   66.513929] ------------[ cut here ]------------
[   66.513930] kernfs: can not remove 'threshold_limit', no directory
[... stacktrace]
[   66.514246] kernfs: can not remove 'error_count', no directory
[...]
[   66.514485] kernfs: can not remove 'interrupt_enable', no directory
[...]
[   66.514719] kernfs: can not remove 'threshold_limit', no directory
[...]
[   66.514951] kernfs: can not remove 'error_count', no directory
[...]
[   66.515183] kernfs: can not remove 'interrupt_enable', no directory
[...]
[   66.515412] ---[ end trace 0000000000000000 ]---
[   66.515414] mce: EXIT __threshold_remove_blocks()
[   66.515415] mce: threshold_remove_bank: EXIT after __threshold_remove_blocks()
[   66.515417] mce: EXIT __threshold_remove_device()
[   66.515418] mce: EXIT mce_threshold_remove_device(cpu = 2)
[   66.516669] smpboot: CPU 2 is now offline


> Fixes: 7f99cb5e6039 ("x86/CPU/AMD: Use default_groups in kobj_type")
> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> Tested-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org
> Link: https://lore.kernel.org/r/alpine.LRH.2.02.2205301145540.25840@file01.intranet.prod.int.rdu2.redhat.com
> ---
>  arch/x86/kernel/cpu/mce/amd.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
> index 2b7ee4a6c6ba..680b75d23a03 100644
> --- a/arch/x86/kernel/cpu/mce/amd.c
> +++ b/arch/x86/kernel/cpu/mce/amd.c
> @@ -1260,10 +1260,10 @@ static void __threshold_remove_blocks(struct threshold_bank *b)
>  	struct threshold_block *pos = NULL;
>  	struct threshold_block *tmp = NULL;
>  
> -	kobject_del(b->kobj);
> +	kobject_put(b->kobj);
>  
>  	list_for_each_entry_safe(pos, tmp, &b->blocks->miscj, miscj)
> -		kobject_del(&pos->kobj);
> +		kobject_put(b->kobj);

Shouldn't there be "kobject_put(&pos->kobj)" here instead?

Also, it seems to me that "kobject_put(b->kobj);" before the loop may be relocated after
the loop - so that the refcounts on the child objects are decreased first, then the refcount on the parent
object.

Additionally, shouldn't there be a call to "kobject_put(&b->blocks->kobj);" in __threshold_remove_blocks()?
From what I understand, b->blocks is a list head, so we need to decrease the refcount on it too.

After these changes, the __threshold_remove_blocks() function looks very similar to deallocate_threshold_blocks()
function just above it.

>  }
>  
>  static void threshold_remove_bank(struct threshold_bank *bank)

Thank you for your work.

Greetings,

Mateusz


---------------------

commit 39df75c672d9620a2967f993befedabb36b2616d
Author: Mateusz Jończyk <mat.jonczyk@o2.pl>
Date:   Fri Aug 12 20:21:46 2022 +0200

    x86/mce/amd: debugging messages

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 1c87501e0fa3..10c37c0599e1 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -1238,6 +1238,10 @@ static int threshold_create_bank(struct threshold_bank **bp, unsigned int cpu,
 
 static void threshold_block_release(struct kobject *kobj)
 {
+    struct threshold_block *block = to_block(kobj);
+
+    pr_warn("threshold_block_release(block = %u, bank = %u, cpu = %u)\n",
+        block->block, block->bank, block->cpu);
     kfree(to_block(kobj));
 }
 
@@ -1245,12 +1249,19 @@ static void deallocate_threshold_blocks(struct threshold_bank *bank)
 {
     struct threshold_block *pos, *tmp;
 
+    pr_warn("ENTER deallocate_threshold_blocks(bank name = %s, bank->cpus = %u, bank->shared = %d)\n",
+        kobject_name(bank->kobj),
+        refcount_read(&bank->cpus),
+        bank->shared);
+
     list_for_each_entry_safe(pos, tmp, &bank->blocks->miscj, miscj) {
         list_del(&pos->miscj);
         kobject_put(&pos->kobj);
     }
 
     kobject_put(&bank->blocks->kobj);
+
+    pr_warn("EXIT deallocate_threshold_blocks()\n");
 }
 
 static void __threshold_remove_blocks(struct threshold_bank *b)
@@ -1258,24 +1269,40 @@ static void __threshold_remove_blocks(struct threshold_bank *b)
     struct threshold_block *pos = NULL;
     struct threshold_block *tmp = NULL;
 
+    pr_warn("ENTER __threshold_remove_blocks(bank name = %s, bank->cpus = %u, bank->shared = %d)\n",
+        kobject_name(b->kobj),
+        refcount_read(&b->cpus),
+        b->shared);
+
     kobject_del(b->kobj);
 
     list_for_each_entry_safe(pos, tmp, &b->blocks->miscj, miscj)
         kobject_del(&pos->kobj);
+
+    pr_warn("EXIT __threshold_remove_blocks()\n");
 }
 
 static void threshold_remove_bank(struct threshold_bank *bank)
 {
     struct amd_northbridge *nb;
+    pr_warn("ENTER threshold_remove_bank(bank name = %s, bank->cpus = %u, bank->shared = %d)\n",
+        kobject_name(bank->kobj),
+        refcount_read(&bank->cpus),
+        bank->shared);
 
-    if (!bank->blocks)
+    if (!bank->blocks) {
+        pr_warn("threshold_remove_bank: bank->blocks == NULL\n");
         goto out_free;
+        }
 
-    if (!bank->shared)
+    if (!bank->shared) {
+        pr_warn("threshold_remove_bank: bank->shared == NULL\n");
         goto out_dealloc;
+        }
 
     if (!refcount_dec_and_test(&bank->cpus)) {
         __threshold_remove_blocks(bank);
+        pr_warn("threshold_remove_bank: EXIT after __threshold_remove_blocks()\n");
         return;
     } else {
         /*
@@ -1283,6 +1310,7 @@ static void threshold_remove_bank(struct threshold_bank *bank)
          * away, remove that bank now.
          */
         nb = node_to_amd_nb(topology_die_id(smp_processor_id()));
+        pr_warn("threshold_remove_bank: setting nb->bank4 = NULL\n");
         nb->bank4 = NULL;
     }
 
@@ -1292,12 +1320,16 @@ static void threshold_remove_bank(struct threshold_bank *bank)
 out_free:
     kobject_put(bank->kobj);
     kfree(bank);
+
+    pr_warn("EXIT threshold_remove_bank(); \n");
 }
 
 static void __threshold_remove_device(struct threshold_bank **bp)
 {
     unsigned int bank, numbanks = this_cpu_read(mce_num_banks);
 
+    pr_warn("ENTER __threshold_remove_device(); numbanks=%u\n", numbanks);
+
     for (bank = 0; bank < numbanks; bank++) {
         if (!bp[bank])
             continue;
@@ -1306,14 +1338,19 @@ static void __threshold_remove_device(struct threshold_bank **bp)
         bp[bank] = NULL;
     }
     kfree(bp);
+    pr_warn("EXIT __threshold_remove_device()\n");
 }
 
 int mce_threshold_remove_device(unsigned int cpu)
 {
     struct threshold_bank **bp = this_cpu_read(threshold_banks);
 
-    if (!bp)
+    pr_warn("ENTER mce_threshold_remove_device(cpu = %u)\n", cpu);
+
+    if (!bp) {
+        pr_warn("EXIT mce_threshold_remove_device(cpu = %u): bp == NULL\n", cpu);
         return 0;
+    }
 
     /*
      * Clear the pointer before cleaning up, so that the interrupt won't
@@ -1322,6 +1359,8 @@ int mce_threshold_remove_device(unsigned int cpu)
     this_cpu_write(threshold_banks, NULL);
 
     __threshold_remove_device(bp);
+
+    pr_warn("EXIT mce_threshold_remove_device(cpu = %u)\n", cpu);
     return 0;
 }
Borislav Petkov Aug. 13, 2022, 10:09 a.m. UTC | #4
On Fri, Aug 12, 2022 at 11:14:44PM +0200, Mateusz Jończyk wrote:
> Shouldn't there be "kobject_put(&pos->kobj)" here instead?

Yes, it should.

> Also, it seems to me that "kobject_put(b->kobj);" before the loop
> may be relocated after the loop - so that the refcounts on the child
> objects are decreased first, then the refcount on the parent object.

Yes, I guess we can do that.

> Additionally, shouldn't there be a call to
> "kobject_put(&b->blocks->kobj);" in __threshold_remove_blocks()?

Makes sense, we do

	kobject_add(&b->blocks->kobj, ...

in __threshold_add_blocks().

> From what I understand, b->blocks is a list head, so we need to
> decrease the refcount on it too.

Not list_heads - we modify the refcount of kobjects. See what the arg of
kobject_put() is.

> After these changes, the __threshold_remove_blocks() function looks
> very similar to deallocate_threshold_blocks() function just above it.

Yes, minus the list_del(&pos->miscj); But that can be made conditional
with a bool arg to deallocate_threshold_blocks() and then remove
__threshold_remove_blocks().

Care to take Yazen's patch, fix it up, test it thoroughly (you should
enable KASAN to catch any potential memory leaks) and send it?

Thx.
Mateusz Jończyk Aug. 13, 2022, 12:04 p.m. UTC | #5
W dniu 13.08.2022 o 12:09, Borislav Petkov pisze:
> On Fri, Aug 12, 2022 at 11:14:44PM +0200, Mateusz Jończyk wrote:
>> After these changes, the __threshold_remove_blocks() function looks
>> very similar to deallocate_threshold_blocks() function just above it.
> Yes, minus the list_del(&pos->miscj); But that can be made conditional
> with a bool arg to deallocate_threshold_blocks() and then remove
> __threshold_remove_blocks().
>
> Care to take Yazen's patch, fix it up, test it thoroughly (you should
> enable KASAN to catch any potential memory leaks) and send it?
>
> Thx.
>
I'll try, but I don't understand the code well yet, and have access to only one
old computer with an AMD CPU.

Greetings,

Mateusz,
Borislav Petkov Oct. 26, 2022, 10:16 a.m. UTC | #6
On Wed, Jun 15, 2022 at 01:51:47PM +0000, Yazen Ghannam wrote:
> Yes, I believe that's true based on code inspection. But I'm not aware of any
> reported issues in this area before the commit listed above. So I decided to
> switch the Fixes tag from what I had before (shown below). I can switch it
> back if you think that's best.
> 
> Fixes: 019f34fccfd5 ("x86, MCE, AMD: Move shared bank to node descriptor")

Yeah, that is probably the culprit. I finally got to this and am able to
repro on my F10h box.

Here's what I think the fix should be, Greg, please check this
for no-nos, especially for doing a kobject_put() on the parent in
remove_shared_bank_kobjects(). But that is basically the reverse
operation of the kobject_add() I'm doing when sharing the bank, more to
that below.

I wonder why we see this now - maybe the kobject reference counting got
changed since then...

Anyway, thoughts?

---
From: Borislav Petkov <bp@suse.de>

x86/MCE/AMD: Correctly drop shared bank references

Old AMD machines have a shared MCA bank 4 which reports northbridge
error types. That bank has a bunch of controls which are exposed this
way in sysfs on CPU0:

  /sys/devices/system/machinecheck/machinecheck0/northbridge/
  ├── dram
  │   ├── error_count
  │   ├── interrupt_enable
  │   └── threshold_limit
  ├── ht_links
  │   ├── error_count
  │   ├── interrupt_enable
  │   └── threshold_limit
  └── l3_cache
      ├── error_count
      ├── interrupt_enable
      └── threshold_limit

In order to expose the exact same controls - the bank is shared
between all CPUs - threshold_create_bank() reuses the bank pointer and
kobject_add()s it to the parent of the other CPUs:

  mce: threshold_create_bank: CPU1, yes, use it, kref: 4, parent_kref: 3, name: northbridge
  mce: threshold_create_bank: CPU1, inc cpus: 2, bank ref: 4
  mce: __threshold_add_blocks: entry, kobj: 0xffff888100adb218, parent: 0xffff888100c10c00 ref: 1, parent_kref: 4, name: dram
  mce: __threshold_add_blocks: misc,  kobj: 0xffff888100adb418, parent: 0xffff888100c10c00,  kref: 1, parent_kref: 6, name: l3_cache
  mce: __threshold_add_blocks: misc,  kobj: 0xffff888100adb318, parent: 0xffff888100c10c00,  kref: 1, parent_kref: 7, name: ht_links
  ...

kobject_add() does a kobject_get() on the parent for each sysfs file it
adds.

Therefore, in order to unwind the same setup work when the CPU goes
offline and the bank *references* only are being removed - the other
CPUs still share it - do a kobject_put() on the parent.

Rename things more properly while at it and add comments.

Signed-off-by: Borislav Petkov <bp@suse.de>

---

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 1c87501e0fa3..b2bdee9e0bae 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -1241,31 +1241,40 @@ static void threshold_block_release(struct kobject *kobj)
 	kfree(to_block(kobj));
 }
 
+/*
+ * Drop refcounts and delete list heads in order to free the memory.
+ */
 static void deallocate_threshold_blocks(struct threshold_bank *bank)
 {
+	struct list_head *head = &bank->blocks->miscj;
 	struct threshold_block *pos, *tmp;
 
-	list_for_each_entry_safe(pos, tmp, &bank->blocks->miscj, miscj) {
-		list_del(&pos->miscj);
+	list_for_each_entry_safe(pos, tmp, head, miscj) {
 		kobject_put(&pos->kobj);
+		list_del(&pos->miscj);
 	}
 
 	kobject_put(&bank->blocks->kobj);
 }
 
-static void __threshold_remove_blocks(struct threshold_bank *b)
+/*
+ * Only put the parent kobject of each block. The inverse of  kobject_add()
+ * above in threshold_create_bank().
+ */
+static void remove_shared_bank_kobjects(struct threshold_bank *bank)
 {
-	struct threshold_block *pos = NULL;
-	struct threshold_block *tmp = NULL;
+	struct list_head *head = &bank->blocks->miscj;
+	struct threshold_block *pos, *tmp;
 
-	kobject_del(b->kobj);
+	list_for_each_entry_safe(pos, tmp, head, miscj)
+		kobject_put(pos->kobj.parent);
 
-	list_for_each_entry_safe(pos, tmp, &b->blocks->miscj, miscj)
-		kobject_del(&pos->kobj);
+	kobject_put(bank->kobj);
 }
 
 static void threshold_remove_bank(struct threshold_bank *bank)
 {
+	int cpu = smp_processor_id();
 	struct amd_northbridge *nb;
 
 	if (!bank->blocks)
@@ -1275,14 +1284,14 @@ static void threshold_remove_bank(struct threshold_bank *bank)
 		goto out_dealloc;
 
 	if (!refcount_dec_and_test(&bank->cpus)) {
-		__threshold_remove_blocks(bank);
+		remove_shared_bank_kobjects(bank);
 		return;
 	} else {
 		/*
 		 * The last CPU on this node using the shared bank is going
 		 * away, remove that bank now.
 		 */
-		nb = node_to_amd_nb(topology_die_id(smp_processor_id()));
+		nb = node_to_amd_nb(topology_die_id(cpu));
 		nb->bank4 = NULL;
 	}
Greg Kroah-Hartman Oct. 26, 2022, 12:04 p.m. UTC | #7
On Wed, Oct 26, 2022 at 12:16:40PM +0200, Borislav Petkov wrote:
> On Wed, Jun 15, 2022 at 01:51:47PM +0000, Yazen Ghannam wrote:
> > Yes, I believe that's true based on code inspection. But I'm not aware of any
> > reported issues in this area before the commit listed above. So I decided to
> > switch the Fixes tag from what I had before (shown below). I can switch it
> > back if you think that's best.
> > 
> > Fixes: 019f34fccfd5 ("x86, MCE, AMD: Move shared bank to node descriptor")
> 
> Yeah, that is probably the culprit. I finally got to this and am able to
> repro on my F10h box.
> 
> Here's what I think the fix should be, Greg, please check this
> for no-nos, especially for doing a kobject_put() on the parent in
> remove_shared_bank_kobjects(). But that is basically the reverse
> operation of the kobject_add() I'm doing when sharing the bank, more to
> that below.
> 
> I wonder why we see this now - maybe the kobject reference counting got
> changed since then...
> 
> Anyway, thoughts?
> 
> ---
> From: Borislav Petkov <bp@suse.de>
> 
> x86/MCE/AMD: Correctly drop shared bank references
> 
> Old AMD machines have a shared MCA bank 4 which reports northbridge
> error types. That bank has a bunch of controls which are exposed this
> way in sysfs on CPU0:
> 
>   /sys/devices/system/machinecheck/machinecheck0/northbridge/
>   ├── dram
>   │   ├── error_count
>   │   ├── interrupt_enable
>   │   └── threshold_limit
>   ├── ht_links
>   │   ├── error_count
>   │   ├── interrupt_enable
>   │   └── threshold_limit
>   └── l3_cache
>       ├── error_count
>       ├── interrupt_enable
>       └── threshold_limit
> 
> In order to expose the exact same controls - the bank is shared
> between all CPUs - threshold_create_bank() reuses the bank pointer and
> kobject_add()s it to the parent of the other CPUs:
> 
>   mce: threshold_create_bank: CPU1, yes, use it, kref: 4, parent_kref: 3, name: northbridge
>   mce: threshold_create_bank: CPU1, inc cpus: 2, bank ref: 4
>   mce: __threshold_add_blocks: entry, kobj: 0xffff888100adb218, parent: 0xffff888100c10c00 ref: 1, parent_kref: 4, name: dram
>   mce: __threshold_add_blocks: misc,  kobj: 0xffff888100adb418, parent: 0xffff888100c10c00,  kref: 1, parent_kref: 6, name: l3_cache
>   mce: __threshold_add_blocks: misc,  kobj: 0xffff888100adb318, parent: 0xffff888100c10c00,  kref: 1, parent_kref: 7, name: ht_links
>   ...
> 
> kobject_add() does a kobject_get() on the parent for each sysfs file it
> adds.
> 
> Therefore, in order to unwind the same setup work when the CPU goes
> offline and the bank *references* only are being removed - the other
> CPUs still share it - do a kobject_put() on the parent.

Eeek, no!

You can't decrement the reference on the parent, that could cause you to
get dropped.

And you can not reuse kobjects, is that the issue here?  When you are
done with one, you have to delete it.  Then create a new one.

No need to move anything around, just destory it all and then add new
ones.

> Rename things more properly while at it and add comments.
> 
> Signed-off-by: Borislav Petkov <bp@suse.de>
> 
> ---
> 
> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
> index 1c87501e0fa3..b2bdee9e0bae 100644
> --- a/arch/x86/kernel/cpu/mce/amd.c
> +++ b/arch/x86/kernel/cpu/mce/amd.c
> @@ -1241,31 +1241,40 @@ static void threshold_block_release(struct kobject *kobj)
>  	kfree(to_block(kobj));
>  }
>  
> +/*
> + * Drop refcounts and delete list heads in order to free the memory.
> + */
>  static void deallocate_threshold_blocks(struct threshold_bank *bank)
>  {
> +	struct list_head *head = &bank->blocks->miscj;
>  	struct threshold_block *pos, *tmp;
>  
> -	list_for_each_entry_safe(pos, tmp, &bank->blocks->miscj, miscj) {
> -		list_del(&pos->miscj);
> +	list_for_each_entry_safe(pos, tmp, head, miscj) {
>  		kobject_put(&pos->kobj);
> +		list_del(&pos->miscj);
>  	}
>  
>  	kobject_put(&bank->blocks->kobj);
>  }
>  
> -static void __threshold_remove_blocks(struct threshold_bank *b)
> +/*
> + * Only put the parent kobject of each block. The inverse of  kobject_add()
> + * above in threshold_create_bank().
> + */
> +static void remove_shared_bank_kobjects(struct threshold_bank *bank)
>  {
> -	struct threshold_block *pos = NULL;
> -	struct threshold_block *tmp = NULL;
> +	struct list_head *head = &bank->blocks->miscj;
> +	struct threshold_block *pos, *tmp;
>  
> -	kobject_del(b->kobj);
> +	list_for_each_entry_safe(pos, tmp, head, miscj)
> +		kobject_put(pos->kobj.parent);

No, please don't do that.

>  
> -	list_for_each_entry_safe(pos, tmp, &b->blocks->miscj, miscj)
> -		kobject_del(&pos->kobj);
> +	kobject_put(bank->kobj);

What changed to cause problems?  the kobject reference logic hasn't
changed, was it some topology stuff?

thanks,

greg k-h
Yazen Ghannam Oct. 26, 2022, 3:39 p.m. UTC | #8
On Wed, Oct 26, 2022 at 02:04:04PM +0200, Greg KH wrote:
...
> > kobject_add() does a kobject_get() on the parent for each sysfs file it
> > adds.
> > 
> > Therefore, in order to unwind the same setup work when the CPU goes
> > offline and the bank *references* only are being removed - the other
> > CPUs still share it - do a kobject_put() on the parent.
> 
> Eeek, no!
> 
> You can't decrement the reference on the parent, that could cause you to
> get dropped.
> 
> And you can not reuse kobjects, is that the issue here?  When you are
> done with one, you have to delete it.  Then create a new one.
> 
> No need to move anything around, just destory it all and then add new
> ones.
> 

Yes, it seems like we're messing up refcounts as we try to manually manage the
life of shared kobjects.

But I take it this is not allowed, correct? So maybe the best solution is to
not do this sharing at all.

...
> 
> What changed to cause problems?  the kobject reference logic hasn't
> changed, was it some topology stuff?
>

Here's a snip from the commit message at the top of the thread:

"During kobject_del(), kobject->sd is removed. If the kobject is not part of
a kset with default_groups, then subsequent kobject_del() calls seem safe
even with kobject->sd == NULL.

Originally, the AMD MCA thresholding structures did not use default_groups.
And so the above behavior was not apparent.

However, a recent change implemented default_groups for the thresholding
structures. Therefore, kobject_del() will go down the sysfs_remove_groups()
code path. In this case, the first kobject_del() may succeed and remove
kobject->sd. But subsequent kobject_del() calls will give a WARNing in
kernfs_remove_by_name_ns() since kobject->sd == NULL."

Basically, we're deleting a shared kobject. The first CPU gets to delete it,
and the following CPUs complain that we're deleting an already deleted
kobject.

Sorry Boris, it's been a while since I looked at this. What's the issue with
my original patch? I think this is the simplest way to fix the current
implementation. But we should probably get rid of this kobject sharing idea in
light of Greg's comments.

Thanks,
Yazen
Borislav Petkov Oct. 26, 2022, 6:29 p.m. UTC | #9
On Wed, Oct 26, 2022 at 03:39:15PM +0000, Yazen Ghannam wrote:
> What's the issue with my original patch?

Do you see it?

> @@ -1258,10 +1258,10 @@ static void __threshold_remove_blocks(struct threshold_bank *b)
>         struct threshold_block *pos = NULL;
>         struct threshold_block *tmp = NULL;
>  
> -       kobject_del(b->kobj);
> +       kobject_put(b->kobj);
>  
>         list_for_each_entry_safe(pos, tmp, &b->blocks->miscj, miscj)
> -               kobject_del(&pos->kobj);
> +               kobject_put(b->kobj);

You're basically putting the parent as many times as there are elements
on the ->miscj list.

Basically what Greg doesn't like.

Him and I need to talk it over first whether my gross hack of grafting
the bank4 kobject hierarchy from CPU0 onto the other CPUs on the node is
even viable so stay tuned...

> I think this is the simplest way to fix the current implementation.
> But we should probably get rid of this kobject sharing idea in light
> of Greg's comments.

You said it. :)

Or maybe do a better one.
Yazen Ghannam Oct. 26, 2022, 7:44 p.m. UTC | #10
On Wed, Oct 26, 2022 at 08:29:51PM +0200, Borislav Petkov wrote:
> On Wed, Oct 26, 2022 at 03:39:15PM +0000, Yazen Ghannam wrote:
> > What's the issue with my original patch?
> 
> Do you see it?
> 
> > @@ -1258,10 +1258,10 @@ static void __threshold_remove_blocks(struct threshold_bank *b)
> >         struct threshold_block *pos = NULL;
> >         struct threshold_block *tmp = NULL;
> >  
> > -       kobject_del(b->kobj);
> > +       kobject_put(b->kobj);
> >  
> >         list_for_each_entry_safe(pos, tmp, &b->blocks->miscj, miscj)
> > -               kobject_del(&pos->kobj);
> > +               kobject_put(b->kobj);
> 
> You're basically putting the parent as many times as there are elements
> on the ->miscj list.
> 
> Basically what Greg doesn't like.
> 
> Him and I need to talk it over first whether my gross hack of grafting
> the bank4 kobject hierarchy from CPU0 onto the other CPUs on the node is
> even viable so stay tuned...
> 
> > I think this is the simplest way to fix the current implementation.
> > But we should probably get rid of this kobject sharing idea in light
> > of Greg's comments.
> 
> You said it. :)
> 
> Or maybe do a better one.

Right, so can we do the following two things?

1) Apply the patch I submitted as a simple fix/workaround for the presented
symptom. I tried to keep it small and well described to be a stable backport.
Obviously I wrote it without knowing the shared kobject behavior isn't ideal.

2) Address the shared kobject thing.
   Here are some options:
   a. Only set up the thresholding kobject on a single CPU per "AMD Node".
   Technically MCA Bank 4 is "shared" on legacy systems. But AFAICT from
   looking at old BKDG docs, in practice only the "Node Base Core" can access
   the registers. This behavior is controlled by a bit in NB which BIOS is
   supposed to set. Maybe some BIOSes don't do this, but I think that's a
   "broken BIOS on legacy system" issue if so.
   b. Disable the MCA Thresholding interface for Families before 0x17. This is
   an undocumented interface, and I don't know if anyone is using it on older
   systems. The issue we're discussing here started because of a splat during
   suspend/resume/CPU hotplug. In disable_err_thresholding(), we disable MCA
   Thresholding for bank 4 on Family 15h, so there's some precedent.
   c. Do nothing at the moment. I *really* want to clean up the MCA
   Thresholding interface, and the shared kobject thing may get resolved in
   that.

What do you think?

Thanks,
Yazen
Borislav Petkov Oct. 26, 2022, 8:12 p.m. UTC | #11
On Wed, Oct 26, 2022 at 07:44:17PM +0000, Yazen Ghannam wrote:
> 1) Apply the patch I submitted as a simple fix/workaround for the presented
> symptom. I tried to keep it small and well described to be a stable backport.
> Obviously I wrote it without knowing the shared kobject behavior isn't ideal.

We'll see.

> 2) Address the shared kobject thing.
>    Here are some options:
>    a. Only set up the thresholding kobject on a single CPU per "AMD Node".
>    Technically MCA Bank 4 is "shared" on legacy systems. But AFAICT from
>    looking at old BKDG docs, in practice only the "Node Base Core" can access
>    the registers. This behavior is controlled by a bit in NB which BIOS is
>    supposed to set. Maybe some BIOSes don't do this, but I think that's a
>    "broken BIOS on legacy system" issue if so.

I guess we can do that. And I even think we have some code which finds
out which the NBC is...

/me greps a bit:

ah, there it is: get_nbc_for_node() in arch/x86/kernel/cpu/mce/inject.c.


>    b. Disable the MCA Thresholding interface for Families before 0x17.

Can't. It is user-visible and you don't know for sure whether someone is
using it or not.

Believe me, I have been wanting to disable this thing forever. I've
never heard of anyone using it and all the energy we put in it was for
nothing. :-\

We could try to deprecate it, though, make it default=n in Kconfig and
see who complains. And after a couple of releases, kill it.

>    This is an undocumented interface, 

Of course it is documented - it is in the old BKDGs.

> and I don't know if anyone is using it on older systems.

Yap.

> The issue we're discussing here started because of a splat during
> suspend/resume/CPU hotplug. In disable_err_thresholding(), we disable
> MCA Thresholding for bank 4 on Family 15h, so there's some precedent.
> c. Do nothing at the moment. I *really* want to clean up the MCA
> Thresholding interface, and the shared kobject thing may get resolved
> in that.

Clean it up how exactly?

Put it behind a Kconfig item, disable it and remove it after a while?

:-)

If so, I wouldn't mind. No one's using this. At least I haven't heard of
a single bug report or of a use case. Only when CPU hotplug explodes and
that thing is involved, only then.

Might as well remove it. And then remove it in the hardware too. RAS
folks would love to get rid of some of that crap which takes up verif
resources for no good reason.

:-)
Yazen Ghannam Nov. 2, 2022, 2:36 a.m. UTC | #12
On Wed, Oct 26, 2022 at 10:12:15PM +0200, Borislav Petkov wrote:
> On Wed, Oct 26, 2022 at 07:44:17PM +0000, Yazen Ghannam wrote:
> > 1) Apply the patch I submitted as a simple fix/workaround for the presented
> > symptom. I tried to keep it small and well described to be a stable backport.
> > Obviously I wrote it without knowing the shared kobject behavior isn't ideal.
> 
> We'll see.
> 
> > 2) Address the shared kobject thing.
> >    Here are some options:
> >    a. Only set up the thresholding kobject on a single CPU per "AMD Node".
> >    Technically MCA Bank 4 is "shared" on legacy systems. But AFAICT from
> >    looking at old BKDG docs, in practice only the "Node Base Core" can access
> >    the registers. This behavior is controlled by a bit in NB which BIOS is
> >    supposed to set. Maybe some BIOSes don't do this, but I think that's a
> >    "broken BIOS on legacy system" issue if so.
> 
> I guess we can do that. And I even think we have some code which finds
> out which the NBC is...
> 
> /me greps a bit:
> 
> ah, there it is: get_nbc_for_node() in arch/x86/kernel/cpu/mce/inject.c.
> 
> 
> >    b. Disable the MCA Thresholding interface for Families before 0x17.
> 
> Can't. It is user-visible and you don't know for sure whether someone is
> using it or not.
> 
> Believe me, I have been wanting to disable this thing forever. I've
> never heard of anyone using it and all the energy we put in it was for
> nothing. :-\
> 
> We could try to deprecate it, though, make it default=n in Kconfig and
> see who complains. And after a couple of releases, kill it.
> 
> >    This is an undocumented interface, 
> 
> Of course it is documented - it is in the old BKDGs.
> 
> > and I don't know if anyone is using it on older systems.
> 
> Yap.
> 
> > The issue we're discussing here started because of a splat during
> > suspend/resume/CPU hotplug. In disable_err_thresholding(), we disable
> > MCA Thresholding for bank 4 on Family 15h, so there's some precedent.
> > c. Do nothing at the moment. I *really* want to clean up the MCA
> > Thresholding interface, and the shared kobject thing may get resolved
> > in that.
> 
> Clean it up how exactly?
> 
> Put it behind a Kconfig item, disable it and remove it after a while?
> 
> :-)
> 
> If so, I wouldn't mind. No one's using this. At least I haven't heard of
> a single bug report or of a use case. Only when CPU hotplug explodes and
> that thing is involved, only then.
> 
> Might as well remove it. And then remove it in the hardware too. RAS
> folks would love to get rid of some of that crap which takes up verif
> resources for no good reason.
> 
> :-)
>

Cool beans. I think this'll be a long process, so let me start by removing the
shared bank stuff. Thanks!

-Yazen
diff mbox series

Patch

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 2b7ee4a6c6ba..680b75d23a03 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -1260,10 +1260,10 @@  static void __threshold_remove_blocks(struct threshold_bank *b)
 	struct threshold_block *pos = NULL;
 	struct threshold_block *tmp = NULL;
 
-	kobject_del(b->kobj);
+	kobject_put(b->kobj);
 
 	list_for_each_entry_safe(pos, tmp, &b->blocks->miscj, miscj)
-		kobject_del(&pos->kobj);
+		kobject_put(b->kobj);
 }
 
 static void threshold_remove_bank(struct threshold_bank *bank)