[v4] mm/memory_hotplug: Fix remove_memory() lockdep splat
diff mbox series

Message ID 157869128062.2451572.4093315441083744888.stgit@dwillia2-desk3.amr.corp.intel.com
State New
Headers show
Series
  • [v4] mm/memory_hotplug: Fix remove_memory() lockdep splat
Related show

Commit Message

Dan Williams Jan. 10, 2020, 9:22 p.m. UTC
The daxctl unit test for the dax_kmem driver currently triggers the
lockdep splat below. It results from the fact that
remove_memory_block_devices() is invoked under the mem_hotplug_lock()
causing lockdep entanglements with cpu_hotplug_lock().

The mem_hotplug_lock() is not needed to synchronize the memory block
device sysfs interface vs the page online state, that is already handled
by lock_device_hotplug(). Specifically lock_device_hotplug()
is sufficient to allow try_remove_memory() to check the offline
state of the memblocks and be assured that subsequent online attempts
will be blocked. The device_online() path checks mem->section_count
before allowing any state manipulations and mem->section_count is
cleared in remove_memory_block_devices().

The add_memory() path does create memblock devices under the lock, but
there is no lockdep report on that path, and it wants to unwind the
hot-add (via arch_remove_memory()) if the memblock device creation
fails, so it is left alone for now.

This change is only possible thanks to the recent change that refactored
memory block device removal out of arch_remove_memory() (commit
4c4b7f9ba948 mm/memory_hotplug: remove memory block devices before
arch_remove_memory()).

    ======================================================
    WARNING: possible circular locking dependency detected
    5.5.0-rc3+ #230 Tainted: G           OE
    ------------------------------------------------------
    lt-daxctl/6459 is trying to acquire lock:
    ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80

    but task is already holding lock:
    ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0

    which lock already depends on the new lock.


    the existing dependency chain (in reverse order) is:

    -> #2 (mem_hotplug_lock.rw_sem){++++}:
           __lock_acquire+0x39c/0x790
           lock_acquire+0xa2/0x1b0
           get_online_mems+0x3e/0xb0
           kmem_cache_create_usercopy+0x2e/0x260
           kmem_cache_create+0x12/0x20
           ptlock_cache_init+0x20/0x28
           start_kernel+0x243/0x547
           secondary_startup_64+0xb6/0xc0

    -> #1 (cpu_hotplug_lock.rw_sem){++++}:
           __lock_acquire+0x39c/0x790
           lock_acquire+0xa2/0x1b0
           cpus_read_lock+0x3e/0xb0
           online_pages+0x37/0x300
           memory_subsys_online+0x17d/0x1c0
           device_online+0x60/0x80
           state_store+0x65/0xd0
           kernfs_fop_write+0xcf/0x1c0
           vfs_write+0xdb/0x1d0
           ksys_write+0x65/0xe0
           do_syscall_64+0x5c/0xa0
           entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (kn->count#241){++++}:
           check_prev_add+0x98/0xa40
           validate_chain+0x576/0x860
           __lock_acquire+0x39c/0x790
           lock_acquire+0xa2/0x1b0
           __kernfs_remove+0x25f/0x2e0
           kernfs_remove_by_name_ns+0x41/0x80
           remove_files.isra.0+0x30/0x70
           sysfs_remove_group+0x3d/0x80
           sysfs_remove_groups+0x29/0x40
           device_remove_attrs+0x39/0x70
           device_del+0x16a/0x3f0
           device_unregister+0x16/0x60
           remove_memory_block_devices+0x82/0xb0
           try_remove_memory+0xb5/0x130
           remove_memory+0x26/0x40
           dev_dax_kmem_remove+0x44/0x6a [kmem]
           device_release_driver_internal+0xe4/0x1c0
           unbind_store+0xef/0x120
           kernfs_fop_write+0xcf/0x1c0
           vfs_write+0xdb/0x1d0
           ksys_write+0x65/0xe0
           do_syscall_64+0x5c/0xa0
           entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

    Chain exists of:
      kn->count#241 --> cpu_hotplug_lock.rw_sem --> mem_hotplug_lock.rw_sem

     Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(mem_hotplug_lock.rw_sem);
                                   lock(cpu_hotplug_lock.rw_sem);
                                   lock(mem_hotplug_lock.rw_sem);
      lock(kn->count#241);

     *** DEADLOCK ***

No fixes tag as this seems to have been a long standing issue that
likely predated the addition of kernfs lockdep annotations.

Cc: <stable@vger.kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Changes since v3 [1]:
- Actually fix the compile error that v3 was meant to do.

[1]: http://lore.kernel.org/r/157868988882.2387473.11703138480034171351.stgit@dwillia2-desk3.amr.corp.intel.com

 Documentation/core-api/memory-hotplug.rst |   15 +++++++++++----
 drivers/base/core.c                       |    5 +++++
 drivers/base/memory.c                     |    8 ++++++--
 include/linux/device.h                    |    1 +
 mm/memory_hotplug.c                       |   10 +++++++---
 5 files changed, 30 insertions(+), 9 deletions(-)

Comments

David Hildenbrand Jan. 10, 2020, 10:33 p.m. UTC | #1
On 10.01.20 22:22, Dan Williams wrote:
> The daxctl unit test for the dax_kmem driver currently triggers the
> lockdep splat below. It results from the fact that
> remove_memory_block_devices() is invoked under the mem_hotplug_lock()
> causing lockdep entanglements with cpu_hotplug_lock().
> 
> The mem_hotplug_lock() is not needed to synchronize the memory block
> device sysfs interface vs the page online state, that is already handled
> by lock_device_hotplug(). Specifically lock_device_hotplug()
> is sufficient to allow try_remove_memory() to check the offline
> state of the memblocks and be assured that subsequent online attempts
> will be blocked. The device_online() path checks mem->section_count
> before allowing any state manipulations and mem->section_count is
> cleared in remove_memory_block_devices().
> 
> The add_memory() path does create memblock devices under the lock, but
> there is no lockdep report on that path, and it wants to unwind the
> hot-add (via arch_remove_memory()) if the memblock device creation
> fails, so it is left alone for now.
> 
> This change is only possible thanks to the recent change that refactored
> memory block device removal out of arch_remove_memory() (commit
> 4c4b7f9ba948 mm/memory_hotplug: remove memory block devices before
> arch_remove_memory()).
> 
>     ======================================================
>     WARNING: possible circular locking dependency detected
>     5.5.0-rc3+ #230 Tainted: G           OE
>     ------------------------------------------------------
>     lt-daxctl/6459 is trying to acquire lock:
>     ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80
> 
>     but task is already holding lock:
>     ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0
> 
>     which lock already depends on the new lock.
> 
> 
>     the existing dependency chain (in reverse order) is:
> 
>     -> #2 (mem_hotplug_lock.rw_sem){++++}:
>            __lock_acquire+0x39c/0x790
>            lock_acquire+0xa2/0x1b0
>            get_online_mems+0x3e/0xb0
>            kmem_cache_create_usercopy+0x2e/0x260
>            kmem_cache_create+0x12/0x20
>            ptlock_cache_init+0x20/0x28
>            start_kernel+0x243/0x547
>            secondary_startup_64+0xb6/0xc0
> 
>     -> #1 (cpu_hotplug_lock.rw_sem){++++}:
>            __lock_acquire+0x39c/0x790
>            lock_acquire+0xa2/0x1b0
>            cpus_read_lock+0x3e/0xb0
>            online_pages+0x37/0x300
>            memory_subsys_online+0x17d/0x1c0
>            device_online+0x60/0x80
>            state_store+0x65/0xd0
>            kernfs_fop_write+0xcf/0x1c0
>            vfs_write+0xdb/0x1d0
>            ksys_write+0x65/0xe0
>            do_syscall_64+0x5c/0xa0
>            entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
>     -> #0 (kn->count#241){++++}:
>            check_prev_add+0x98/0xa40
>            validate_chain+0x576/0x860
>            __lock_acquire+0x39c/0x790
>            lock_acquire+0xa2/0x1b0
>            __kernfs_remove+0x25f/0x2e0
>            kernfs_remove_by_name_ns+0x41/0x80
>            remove_files.isra.0+0x30/0x70
>            sysfs_remove_group+0x3d/0x80
>            sysfs_remove_groups+0x29/0x40
>            device_remove_attrs+0x39/0x70
>            device_del+0x16a/0x3f0
>            device_unregister+0x16/0x60
>            remove_memory_block_devices+0x82/0xb0
>            try_remove_memory+0xb5/0x130
>            remove_memory+0x26/0x40
>            dev_dax_kmem_remove+0x44/0x6a [kmem]
>            device_release_driver_internal+0xe4/0x1c0
>            unbind_store+0xef/0x120
>            kernfs_fop_write+0xcf/0x1c0
>            vfs_write+0xdb/0x1d0
>            ksys_write+0x65/0xe0
>            do_syscall_64+0x5c/0xa0
>            entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
>     other info that might help us debug this:
> 
>     Chain exists of:
>       kn->count#241 --> cpu_hotplug_lock.rw_sem --> mem_hotplug_lock.rw_sem
> 
>      Possible unsafe locking scenario:
> 
>            CPU0                    CPU1
>            ----                    ----
>       lock(mem_hotplug_lock.rw_sem);
>                                    lock(cpu_hotplug_lock.rw_sem);
>                                    lock(mem_hotplug_lock.rw_sem);
>       lock(kn->count#241);
> 
>      *** DEADLOCK ***
> 
> No fixes tag as this seems to have been a long standing issue that
> likely predated the addition of kernfs lockdep annotations.
> 
> Cc: <stable@vger.kernel.org>

I am not convinced this can actually happen. I explained somewhere else
already why a similar locksplat (reported by Pavel IIRC) on the ordinary
memory removal path is a false positive (because the device hotplug lock
actually protects us from such conditions). Can you elaborate why this
is stable material (and explain my tired eyes how the issue will
actually happen in real life)?

[...]
>  
>  When adding/removing memory that uses memory block devices (i.e. ordinary RAM),
> -the device_hotplug_lock should be held to:
> +the device_hotplug_lock is held to:
>  
>  - synchronize against online/offline requests (e.g. via sysfs). This way, memory
>    block devices can only be accessed (.online/.state attributes) by user
> -  space once memory has been fully added. And when removing memory, we
> -  know nobody is in critical sections.
> +  space once memory has been fully added. And when removing memory, the
> +  memory block device is invalidated (mem->section count set to 0) under the
> +  lock to abort any in-flight online requests.

I don't think this is needed. See below.

>  - synchronize against CPU hotplug and similar (e.g. relevant for ACPI and PPC)
>  
>  Especially, there is a possible lock inversion that is avoided using
> @@ -112,7 +113,13 @@ can result in a lock inversion.
>  
>  onlining/offlining of memory should be done via device_online()/
>  device_offline() - to make sure it is properly synchronized to actions
> -via sysfs. Holding device_hotplug_lock is advised (to e.g. protect online_type)
> +via sysfs. Holding device_hotplug_lock is required to prevent online racing
> +removal. The device_hotplug_lock and memblock invalidation allows
> +remove_memory_block_devices() to run outside of mem_hotplug_lock to avoid lock
> +dependency conflicts with memblock-sysfs teardown. The add_memory() path
> +performs create_memory_block_devices() under mem_hotplug_lock so that if it
> +fails it can perform an arch_remove_memory() cleanup. There are no known lock
> +dependency problems with memblock-sysfs setup.
>  
>  When adding/removing/onlining/offlining memory or adding/removing
>  heterogeneous/device memory, we should always hold the mem_hotplug_lock in
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 42a672456432..5d5036370c92 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -1146,6 +1146,11 @@ void unlock_device_hotplug(void)
>  	mutex_unlock(&device_hotplug_lock);
>  }
>  
> +void assert_held_device_hotplug(void)
> +{
> +	lockdep_assert_held(&device_hotplug_lock);
> +}
> +
>  int lock_device_hotplug_sysfs(void)
>  {
>  	if (mutex_trylock(&device_hotplug_lock))
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 799b43191dea..91c6fbd2383e 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -280,6 +280,10 @@ static int memory_subsys_online(struct device *dev)
>  	if (mem->state == MEM_ONLINE)
>  		return 0;
>  
> +	/* online lost the race with hot-unplug, abort */
> +	if (!mem->section_count)
> +		return -ENXIO;
> +

Huh, why is that needed? There is pages_correctly_probed(), which checks
that all sections are present already (but I also have a patch to rework
that in my queue, because it looks like it's not needed in the current
state).

(Especially, I don't see why this is necessary in the context of this
patch - nothing changed in that regard. Also, checks against "device
already removed" should logically belong into
device_online()/device_offline(). Other subsystems should have similar
issues, no?)

>  	/*
>  	 * If we are called from state_store(), online_type will be
>  	 * set >= 0 Otherwise we were called from the device online
> @@ -736,8 +740,6 @@ int create_memory_block_devices(unsigned long start, unsigned long size)
>   * Remove memory block devices for the given memory area. Start and size
>   * have to be aligned to memory block granularity. Memory block devices
>   * have to be offline.
> - *
> - * Called under device_hotplug_lock.
>   */

Why is that change needed? Especially with the radix tree rework, this
lock is required on this call path. Removing this looks wrong to me.


>  void remove_memory_block_devices(unsigned long start, unsigned long size)
>  {
> @@ -746,6 +748,8 @@ void remove_memory_block_devices(unsigned long start, unsigned long size)
>  	struct memory_block *mem;
>  	unsigned long block_id;
>  
> +	assert_held_device_hotplug();
> +
>  	if (WARN_ON_ONCE(!IS_ALIGNED(start, memory_block_size_bytes()) ||
>  			 !IS_ALIGNED(size, memory_block_size_bytes())))
>  		return;
> diff --git a/include/linux/device.h b/include/linux/device.h
> index 96ff76731e93..a84654489c51 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -1553,6 +1553,7 @@ static inline bool device_supports_offline(struct device *dev)
>  extern void lock_device_hotplug(void);
>  extern void unlock_device_hotplug(void);
>  extern int lock_device_hotplug_sysfs(void);
> +extern void assert_held_device_hotplug(void);
>  extern int device_offline(struct device *dev);
>  extern int device_online(struct device *dev);
>  extern void set_primary_fwnode(struct device *dev, struct fwnode_handle *fwnode);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 55ac23ef11c1..0158cd4cca48 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1763,8 +1763,6 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
>  
>  	BUG_ON(check_hotplug_memory_range(start, size));
>  
> -	mem_hotplug_begin();
> -
>  	/*
>  	 * All memory blocks must be offlined before removing memory.  Check
>  	 * whether all memory blocks in question are offline and return error
> @@ -1777,9 +1775,15 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
>  	/* remove memmap entry */
>  	firmware_map_remove(start, start + size, "System RAM");
>  
> -	/* remove memory block devices before removing memory */
> +	/*
> +	 * Remove memory block devices before removing memory and before
> +	 * mem_hotplug_begin() (see Documentation/core-api/memory-hotplug.rst
> +	 * "Locking Internals").
> +	 */
>  	remove_memory_block_devices(start, size);
>  
> +	mem_hotplug_begin();
> +
>  	arch_remove_memory(nid, start, size, NULL);
>  	memblock_free(start, size);
>  	memblock_remove(start, size);
> 

That hunk looks good to me.
Dan Williams Jan. 10, 2020, 11:29 p.m. UTC | #2
On Fri, Jan 10, 2020 at 2:33 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 10.01.20 22:22, Dan Williams wrote:
> > The daxctl unit test for the dax_kmem driver currently triggers the
> > lockdep splat below. It results from the fact that
> > remove_memory_block_devices() is invoked under the mem_hotplug_lock()
> > causing lockdep entanglements with cpu_hotplug_lock().
> >
> > The mem_hotplug_lock() is not needed to synchronize the memory block
> > device sysfs interface vs the page online state, that is already handled
> > by lock_device_hotplug(). Specifically lock_device_hotplug()
> > is sufficient to allow try_remove_memory() to check the offline
> > state of the memblocks and be assured that subsequent online attempts
> > will be blocked. The device_online() path checks mem->section_count
> > before allowing any state manipulations and mem->section_count is
> > cleared in remove_memory_block_devices().
> >
> > The add_memory() path does create memblock devices under the lock, but
> > there is no lockdep report on that path, and it wants to unwind the
> > hot-add (via arch_remove_memory()) if the memblock device creation
> > fails, so it is left alone for now.
> >
> > This change is only possible thanks to the recent change that refactored
> > memory block device removal out of arch_remove_memory() (commit
> > 4c4b7f9ba948 mm/memory_hotplug: remove memory block devices before
> > arch_remove_memory()).
> >
> >     ======================================================
> >     WARNING: possible circular locking dependency detected
> >     5.5.0-rc3+ #230 Tainted: G           OE
> >     ------------------------------------------------------
> >     lt-daxctl/6459 is trying to acquire lock:
> >     ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80
> >
> >     but task is already holding lock:
> >     ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0
> >
> >     which lock already depends on the new lock.
> >
> >
> >     the existing dependency chain (in reverse order) is:
> >
> >     -> #2 (mem_hotplug_lock.rw_sem){++++}:
> >            __lock_acquire+0x39c/0x790
> >            lock_acquire+0xa2/0x1b0
> >            get_online_mems+0x3e/0xb0
> >            kmem_cache_create_usercopy+0x2e/0x260
> >            kmem_cache_create+0x12/0x20
> >            ptlock_cache_init+0x20/0x28
> >            start_kernel+0x243/0x547
> >            secondary_startup_64+0xb6/0xc0
> >
> >     -> #1 (cpu_hotplug_lock.rw_sem){++++}:
> >            __lock_acquire+0x39c/0x790
> >            lock_acquire+0xa2/0x1b0
> >            cpus_read_lock+0x3e/0xb0
> >            online_pages+0x37/0x300
> >            memory_subsys_online+0x17d/0x1c0
> >            device_online+0x60/0x80
> >            state_store+0x65/0xd0
> >            kernfs_fop_write+0xcf/0x1c0
> >            vfs_write+0xdb/0x1d0
> >            ksys_write+0x65/0xe0
> >            do_syscall_64+0x5c/0xa0
> >            entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> >     -> #0 (kn->count#241){++++}:
> >            check_prev_add+0x98/0xa40
> >            validate_chain+0x576/0x860
> >            __lock_acquire+0x39c/0x790
> >            lock_acquire+0xa2/0x1b0
> >            __kernfs_remove+0x25f/0x2e0
> >            kernfs_remove_by_name_ns+0x41/0x80
> >            remove_files.isra.0+0x30/0x70
> >            sysfs_remove_group+0x3d/0x80
> >            sysfs_remove_groups+0x29/0x40
> >            device_remove_attrs+0x39/0x70
> >            device_del+0x16a/0x3f0
> >            device_unregister+0x16/0x60
> >            remove_memory_block_devices+0x82/0xb0
> >            try_remove_memory+0xb5/0x130
> >            remove_memory+0x26/0x40
> >            dev_dax_kmem_remove+0x44/0x6a [kmem]
> >            device_release_driver_internal+0xe4/0x1c0
> >            unbind_store+0xef/0x120
> >            kernfs_fop_write+0xcf/0x1c0
> >            vfs_write+0xdb/0x1d0
> >            ksys_write+0x65/0xe0
> >            do_syscall_64+0x5c/0xa0
> >            entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> >     other info that might help us debug this:
> >
> >     Chain exists of:
> >       kn->count#241 --> cpu_hotplug_lock.rw_sem --> mem_hotplug_lock.rw_sem
> >
> >      Possible unsafe locking scenario:
> >
> >            CPU0                    CPU1
> >            ----                    ----
> >       lock(mem_hotplug_lock.rw_sem);
> >                                    lock(cpu_hotplug_lock.rw_sem);
> >                                    lock(mem_hotplug_lock.rw_sem);
> >       lock(kn->count#241);
> >
> >      *** DEADLOCK ***
> >
> > No fixes tag as this seems to have been a long standing issue that
> > likely predated the addition of kernfs lockdep annotations.
> >
> > Cc: <stable@vger.kernel.org>
>
> I am not convinced this can actually happen. I explained somewhere else
> already why a similar locksplat (reported by Pavel IIRC) on the ordinary
> memory removal path is a false positive (because the device hotplug lock
> actually protects us from such conditions). Can you elaborate why this
> is stable material (and explain my tired eyes how the issue will
> actually happen in real life)?

I don't mind waiting for it to soak a while upstream before heading
back to -stable, and it's possible the kn->count entanglement is on a
kobject in a different part of the sysfs hierarchy, but I haven't
proven that. So, it's a toss up. I think the backport risk is low, but
we can validate that with some upstream soak time.

>
> [...]
> >
> >  When adding/removing memory that uses memory block devices (i.e. ordinary RAM),
> > -the device_hotplug_lock should be held to:
> > +the device_hotplug_lock is held to:
> >
> >  - synchronize against online/offline requests (e.g. via sysfs). This way, memory
> >    block devices can only be accessed (.online/.state attributes) by user
> > -  space once memory has been fully added. And when removing memory, we
> > -  know nobody is in critical sections.
> > +  space once memory has been fully added. And when removing memory, the
> > +  memory block device is invalidated (mem->section count set to 0) under the
> > +  lock to abort any in-flight online requests.
>
> I don't think this is needed. See below.
>
> >  - synchronize against CPU hotplug and similar (e.g. relevant for ACPI and PPC)
> >
> >  Especially, there is a possible lock inversion that is avoided using
> > @@ -112,7 +113,13 @@ can result in a lock inversion.
> >
> >  onlining/offlining of memory should be done via device_online()/
> >  device_offline() - to make sure it is properly synchronized to actions
> > -via sysfs. Holding device_hotplug_lock is advised (to e.g. protect online_type)
> > +via sysfs. Holding device_hotplug_lock is required to prevent online racing
> > +removal. The device_hotplug_lock and memblock invalidation allows
> > +remove_memory_block_devices() to run outside of mem_hotplug_lock to avoid lock
> > +dependency conflicts with memblock-sysfs teardown. The add_memory() path
> > +performs create_memory_block_devices() under mem_hotplug_lock so that if it
> > +fails it can perform an arch_remove_memory() cleanup. There are no known lock
> > +dependency problems with memblock-sysfs setup.
> >
> >  When adding/removing/onlining/offlining memory or adding/removing
> >  heterogeneous/device memory, we should always hold the mem_hotplug_lock in
> > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > index 42a672456432..5d5036370c92 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -1146,6 +1146,11 @@ void unlock_device_hotplug(void)
> >       mutex_unlock(&device_hotplug_lock);
> >  }
> >
> > +void assert_held_device_hotplug(void)
> > +{
> > +     lockdep_assert_held(&device_hotplug_lock);
> > +}
> > +
> >  int lock_device_hotplug_sysfs(void)
> >  {
> >       if (mutex_trylock(&device_hotplug_lock))
> > diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> > index 799b43191dea..91c6fbd2383e 100644
> > --- a/drivers/base/memory.c
> > +++ b/drivers/base/memory.c
> > @@ -280,6 +280,10 @@ static int memory_subsys_online(struct device *dev)
> >       if (mem->state == MEM_ONLINE)
> >               return 0;
> >
> > +     /* online lost the race with hot-unplug, abort */
> > +     if (!mem->section_count)
> > +             return -ENXIO;
> > +
>
> Huh, why is that needed? There is pages_correctly_probed(), which checks
> that all sections are present already (but I also have a patch to rework
> that in my queue, because it looks like it's not needed in the current
> state).

I chose "mem->section_count = 0" as the invalidation event because
that attribute is tied to the memory block itself and for symmetry
with the offline path. More below...

>
> (Especially, I don't see why this is necessary in the context of this
> patch - nothing changed in that regard. Also, checks against "device
> already removed" should logically belong into
> device_online()/device_offline().

The scenario is that userspace races two threads one calling offline
and the other calling online. Likely no, possible yes. Offline thread
wins the race, but not before online thread gets to the lock in
state_store(). Offline thread completes the teardown and unlocks.
Online thread starts operating on a zombie memory-block-device until
it notices the memory associated with that device is not suitable to
online.

For symmetry with memory_subsys_offline() (that prevents the loser of
a race of 2 threads running offline from continuing to operate on a
zombie memory-block with a ->section_count check) I added a
->section_count check to memory_subsys_online().

>Other subsystems should have similar issues, no?)

Not many subsystems have a sysfs attribute that can trigger the
unregistration of the self same attribute, but yes, I've seen it cause
problems.

For example, async device probing and registration causes similar
issues and was only recently fixed:

3451a495ef24 driver core: Establish order of operations for device_add
and device_del via bitflag

>
> >       /*
> >        * If we are called from state_store(), online_type will be
> >        * set >= 0 Otherwise we were called from the device online
> > @@ -736,8 +740,6 @@ int create_memory_block_devices(unsigned long start, unsigned long size)
> >   * Remove memory block devices for the given memory area. Start and size
> >   * have to be aligned to memory block granularity. Memory block devices
> >   * have to be offline.
> > - *
> > - * Called under device_hotplug_lock.
> >   */
>
> Why is that change needed? Especially with the radix tree rework, this
> lock is required on this call path. Removing this looks wrong to me.

I'm not removing the dependency I'm trading the comment for a call to
assert_held_device_hotplug(). I'm ok with keeping the comment, but the
explicit lockdep assertion helps the future developer that refactors
and inadvertently misses the comment.
David Hildenbrand Jan. 11, 2020, 11:03 a.m. UTC | #3
>>> Cc: <stable@vger.kernel.org>
>>
>> I am not convinced this can actually happen. I explained somewhere else
>> already why a similar locksplat (reported by Pavel IIRC) on the ordinary
>> memory removal path is a false positive (because the device hotplug lock
>> actually protects us from such conditions). Can you elaborate why this
>> is stable material (and explain my tired eyes how the issue will
>> actually happen in real life)?
> 
> I don't mind waiting for it to soak a while upstream before heading
> back to -stable, and it's possible the kn->count entanglement is on a
> kobject in a different part of the sysfs hierarchy, but I haven't
> proven that. So, it's a toss up. I think the backport risk is low, but
> we can validate that with some upstream soak time.

So I just remember why I think this (and the previously reported done
for ACPI DIMMs) are false positives. The actual locking order is

onlining/offlining from user space:

kn->count -> device_hotplug_lock -> cpu_hotplug_lock -> mem_hotplug_lock

memory removal:

device_hotplug_lock -> cpu_hotplug_lock -> mem_hotplug_lock -> kn->count


This looks like a locking inversion - but it's not. Whenever we come via
user space we do a mutex_trylock(), which resolves this issue by backing
up. The device_hotplug_lock will prevent

I have no clue why the device_hotplug_lock does not pop up in the
lockdep report here. Sounds wrong to me.

I think this is a false positive and not stable material.

> 
>>
>> [...]
>>>
>>>  When adding/removing memory that uses memory block devices (i.e. ordinary RAM),
>>> -the device_hotplug_lock should be held to:
>>> +the device_hotplug_lock is held to:
>>>
>>>  - synchronize against online/offline requests (e.g. via sysfs). This way, memory
>>>    block devices can only be accessed (.online/.state attributes) by user
>>> -  space once memory has been fully added. And when removing memory, we
>>> -  know nobody is in critical sections.
>>> +  space once memory has been fully added. And when removing memory, the
>>> +  memory block device is invalidated (mem->section count set to 0) under the
>>> +  lock to abort any in-flight online requests.
>>
>> I don't think this is needed. See below.
>>
>>>  - synchronize against CPU hotplug and similar (e.g. relevant for ACPI and PPC)
>>>
>>>  Especially, there is a possible lock inversion that is avoided using
>>> @@ -112,7 +113,13 @@ can result in a lock inversion.
>>>
>>>  onlining/offlining of memory should be done via device_online()/
>>>  device_offline() - to make sure it is properly synchronized to actions
>>> -via sysfs. Holding device_hotplug_lock is advised (to e.g. protect online_type)
>>> +via sysfs. Holding device_hotplug_lock is required to prevent online racing
>>> +removal. The device_hotplug_lock and memblock invalidation allows
>>> +remove_memory_block_devices() to run outside of mem_hotplug_lock to avoid lock
>>> +dependency conflicts with memblock-sysfs teardown. The add_memory() path
>>> +performs create_memory_block_devices() under mem_hotplug_lock so that if it
>>> +fails it can perform an arch_remove_memory() cleanup. There are no known lock
>>> +dependency problems with memblock-sysfs setup.
>>>
>>>  When adding/removing/onlining/offlining memory or adding/removing
>>>  heterogeneous/device memory, we should always hold the mem_hotplug_lock in
>>> diff --git a/drivers/base/core.c b/drivers/base/core.c
>>> index 42a672456432..5d5036370c92 100644
>>> --- a/drivers/base/core.c
>>> +++ b/drivers/base/core.c
>>> @@ -1146,6 +1146,11 @@ void unlock_device_hotplug(void)
>>>       mutex_unlock(&device_hotplug_lock);
>>>  }
>>>
>>> +void assert_held_device_hotplug(void)
>>> +{
>>> +     lockdep_assert_held(&device_hotplug_lock);
>>> +}
>>> +
>>>  int lock_device_hotplug_sysfs(void)
>>>  {
>>>       if (mutex_trylock(&device_hotplug_lock))
>>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>>> index 799b43191dea..91c6fbd2383e 100644
>>> --- a/drivers/base/memory.c
>>> +++ b/drivers/base/memory.c
>>> @@ -280,6 +280,10 @@ static int memory_subsys_online(struct device *dev)
>>>       if (mem->state == MEM_ONLINE)
>>>               return 0;
>>>
>>> +     /* online lost the race with hot-unplug, abort */
>>> +     if (!mem->section_count)
>>> +             return -ENXIO;
>>> +
>>
>> Huh, why is that needed? There is pages_correctly_probed(), which checks
>> that all sections are present already (but I also have a patch to rework
>> that in my queue, because it looks like it's not needed in the current
>> state).
> 
> I chose "mem->section_count = 0" as the invalidation event because
> that attribute is tied to the memory block itself and for symmetry
> with the offline path. More below...
> 
>>
>> (Especially, I don't see why this is necessary in the context of this
>> patch - nothing changed in that regard. Also, checks against "device
>> already removed" should logically belong into
>> device_online()/device_offline().
> 
> The scenario is that userspace races two threads one calling offline
> and the other calling online. Likely no, possible yes. Offline thread
> wins the race, but not before online thread gets to the lock in
> state_store(). Offline thread completes the teardown and unlocks.
> Online thread starts operating on a zombie memory-block-device until
> it notices the memory associated with that device is not suitable to
> online.
> 
> For symmetry with memory_subsys_offline() (that prevents the loser of
> a race of 2 threads running offline from continuing to operate on a
> zombie memory-block with a ->section_count check) I added a
> ->section_count check to memory_subsys_online().

1. The section_count check is in place to disallow offlining memory
blocks with missing sections. (see 26bbe7ef6d5c ("drivers/base/memory.c:
prohibit offlining of memory blocks with missing sections")). Not to
deal with any races/zombie blocks.

2. offlining/onlining racing is completely irrelevant. state_store() and
online_store() do a lock_device_hotplug_sysfs(), which is a trylock. The
looser will simply back off. device_online() and device_offline()
properly deal with the block already being in the desired state.

3. zombie blocks are interesting, but I am not convinced yet this is an
actual issue - we've never seen it happening. I *think* it could work
due to the trylock correctly (pending requests will back off while
another thread is removing them), but I am not convinced there is no
tiny little race. Anyhow, we have pages_correctly_probed() which will
bail out properly already if the sections have been removed in the
meantime. Also, I think we should deal with zombie blocks differently if
required (more below).

I think this hunk does logically not belong into this patch and is also
not needed (due to pages_correctly_probed() in memory_block_action()).
Please drop it from this patch (including the documentation update
regarding this).

> 
>> Other subsystems should have similar issues, no?)
> 
> Not many subsystems have a sysfs attribute that can trigger the
> unregistration of the self same attribute, but yes, I've seen it cause
> problems.
> 
> For example, async device probing and registration causes similar
> issues and was only recently fixed:
> 
> 3451a495ef24 driver core: Establish order of operations for device_add
> and device_del via bitflag

I feel like we should have a -> removed property on devices and check
against that in device_online() and device_offline(). But only if there
isn't another magical mechanism that deals with zombie devices and
pending online/offline requests.

> 
>>
>>>       /*
>>>        * If we are called from state_store(), online_type will be
>>>        * set >= 0 Otherwise we were called from the device online
>>> @@ -736,8 +740,6 @@ int create_memory_block_devices(unsigned long start, unsigned long size)
>>>   * Remove memory block devices for the given memory area. Start and size
>>>   * have to be aligned to memory block granularity. Memory block devices
>>>   * have to be offline.
>>> - *
>>> - * Called under device_hotplug_lock.
>>>   */
>>
>> Why is that change needed? Especially with the radix tree rework, this
>> lock is required on this call path. Removing this looks wrong to me.
> 
> I'm not removing the dependency I'm trading the comment for a call to
> assert_held_device_hotplug(). I'm ok with keeping the comment, but the
> explicit lockdep assertion helps the future developer that refactors
> and inadvertently misses the comment.
> 

Ah, I missed that you placed the assert into that function, sorry
(thought it would be in the caller). We document it for most other
functions in that file now, so I'd prefer to just keep it. Whatever you
prefer.
Qian Cai Jan. 11, 2020, 1:55 p.m. UTC | #4
> On Jan 11, 2020, at 6:03 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> So I just remember why I think this (and the previously reported done
> for ACPI DIMMs) are false positives. The actual locking order is
> 
> onlining/offlining from user space:
> 
> kn->count -> device_hotplug_lock -> cpu_hotplug_lock -> mem_hotplug_lock
> 
> memory removal:
> 
> device_hotplug_lock -> cpu_hotplug_lock -> mem_hotplug_lock -> kn->count
> 
> 
> This looks like a locking inversion - but it's not. Whenever we come via
> user space we do a mutex_trylock(), which resolves this issue by backing
> up. The device_hotplug_lock will prevent
> 
> I have no clue why the device_hotplug_lock does not pop up in the
> lockdep report here. Sounds wrong to me.
> 
> I think this is a false positive and not stable material.

The point is that there are other paths does kn->count —> cpu_hotplug_lock without needing device_hotplug_lock to race with memory removal.

kmem_cache_shrink_all+0x50/0x100 (cpu_hotplug_lock.rw_sem/mem_hotplug_lock.rw_sem)
shrink_store+0x34/0x60
slab_attr_store+0x6c/0x170
sysfs_kf_write+0x70/0xb0
kernfs_fop_write+0x11c/0x270 ((kn->count)
 __vfs_write+0x3c/0x70
 vfs_write+0xcc/0x200
ksys_write+0x7c/0x140
system_call+0x5c/0x6
David Hildenbrand Jan. 11, 2020, 2:25 p.m. UTC | #5
> Am 11.01.2020 um 14:56 schrieb Qian Cai <cai@lca.pw>:
> 
> 
> 
>> On Jan 11, 2020, at 6:03 AM, David Hildenbrand <david@redhat.com> wrote:
>> 
>> So I just remember why I think this (and the previously reported done
>> for ACPI DIMMs) are false positives. The actual locking order is
>> 
>> onlining/offlining from user space:
>> 
>> kn->count -> device_hotplug_lock -> cpu_hotplug_lock -> mem_hotplug_lock
>> 
>> memory removal:
>> 
>> device_hotplug_lock -> cpu_hotplug_lock -> mem_hotplug_lock -> kn->count
>> 
>> 
>> This looks like a locking inversion - but it's not. Whenever we come via
>> user space we do a mutex_trylock(), which resolves this issue by backing
>> up. The device_hotplug_lock will prevent
>> 
>> I have no clue why the device_hotplug_lock does not pop up in the
>> lockdep report here. Sounds wrong to me.
>> 
>> I think this is a false positive and not stable material.
> 
> The point is that there are other paths does kn->count —> cpu_hotplug_lock without needing device_hotplug_lock to race with memory removal.
> 
> kmem_cache_shrink_all+0x50/0x100 (cpu_hotplug_lock.rw_sem/mem_hotplug_lock.rw_sem)
> shrink_store+0x34/0x60
> slab_attr_store+0x6c/0x170
> sysfs_kf_write+0x70/0xb0
> kernfs_fop_write+0x11c/0x270 ((kn->count)
> __vfs_write+0x3c/0x70
> vfs_write+0xcc/0x200
> ksys_write+0x7c/0x140
> system_call+0x5c/0x6
> 

But not the lock of the memory devices, or am I missing something?
David Hildenbrand Jan. 11, 2020, 2:52 p.m. UTC | #6
On 11.01.20 15:25, David Hildenbrand wrote:
> 
> 
>> Am 11.01.2020 um 14:56 schrieb Qian Cai <cai@lca.pw>:
>>
>> 
>>
>>> On Jan 11, 2020, at 6:03 AM, David Hildenbrand <david@redhat.com> wrote:
>>>
>>> So I just remember why I think this (and the previously reported done
>>> for ACPI DIMMs) are false positives. The actual locking order is
>>>
>>> onlining/offlining from user space:
>>>
>>> kn->count -> device_hotplug_lock -> cpu_hotplug_lock -> mem_hotplug_lock
>>>
>>> memory removal:
>>>
>>> device_hotplug_lock -> cpu_hotplug_lock -> mem_hotplug_lock -> kn->count
>>>
>>>
>>> This looks like a locking inversion - but it's not. Whenever we come via
>>> user space we do a mutex_trylock(), which resolves this issue by backing
>>> up. The device_hotplug_lock will prevent
>>>
>>> I have no clue why the device_hotplug_lock does not pop up in the
>>> lockdep report here. Sounds wrong to me.
>>>
>>> I think this is a false positive and not stable material.
>>
>> The point is that there are other paths does kn->count —> cpu_hotplug_lock without needing device_hotplug_lock to race with memory removal.
>>
>> kmem_cache_shrink_all+0x50/0x100 (cpu_hotplug_lock.rw_sem/mem_hotplug_lock.rw_sem)
>> shrink_store+0x34/0x60
>> slab_attr_store+0x6c/0x170
>> sysfs_kf_write+0x70/0xb0
>> kernfs_fop_write+0x11c/0x270 ((kn->count)
>> __vfs_write+0x3c/0x70
>> vfs_write+0xcc/0x200
>> ksys_write+0x7c/0x140
>> system_call+0x5c/0x6
>>
> 
> But not the lock of the memory devices, or am I missing something?
> 

To clarify:

memory unplug will remove e.g., /sys/devices/system/memory/memoryX/,
which has a dedicated kn->count AFAIK

If you do a "echo 1 > /sys/kernel/slab/X/shrink", you would not lock the
kn->count of /sys/devices/system/memory/memoryX/, but the one of some
slab thingy.

The only scenario I could see is if remove_memory_block_devices() will
not only remove /sys/devices/system/memory/memoryX/, but also implicitly
e.g., /sys/kernel/slab/X/. If that is the case, then this is indeed not
a false positive, but something rather hard to trigger (which would
still classify as stable material).
Dan Williams Jan. 11, 2020, 5:41 p.m. UTC | #7
On Sat, Jan 11, 2020 at 6:52 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 11.01.20 15:25, David Hildenbrand wrote:
> >
> >
> >> Am 11.01.2020 um 14:56 schrieb Qian Cai <cai@lca.pw>:
> >>
> >> 
> >>
> >>> On Jan 11, 2020, at 6:03 AM, David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>> So I just remember why I think this (and the previously reported done
> >>> for ACPI DIMMs) are false positives. The actual locking order is
> >>>
> >>> onlining/offlining from user space:
> >>>
> >>> kn->count -> device_hotplug_lock -> cpu_hotplug_lock -> mem_hotplug_lock
> >>>
> >>> memory removal:
> >>>
> >>> device_hotplug_lock -> cpu_hotplug_lock -> mem_hotplug_lock -> kn->count
> >>>
> >>>
> >>> This looks like a locking inversion - but it's not. Whenever we come via
> >>> user space we do a mutex_trylock(), which resolves this issue by backing
> >>> up. The device_hotplug_lock will prevent
> >>>
> >>> I have no clue why the device_hotplug_lock does not pop up in the
> >>> lockdep report here. Sounds wrong to me.
> >>>
> >>> I think this is a false positive and not stable material.
> >>
> >> The point is that there are other paths does kn->count —> cpu_hotplug_lock without needing device_hotplug_lock to race with memory removal.
> >>
> >> kmem_cache_shrink_all+0x50/0x100 (cpu_hotplug_lock.rw_sem/mem_hotplug_lock.rw_sem)
> >> shrink_store+0x34/0x60
> >> slab_attr_store+0x6c/0x170
> >> sysfs_kf_write+0x70/0xb0
> >> kernfs_fop_write+0x11c/0x270 ((kn->count)
> >> __vfs_write+0x3c/0x70
> >> vfs_write+0xcc/0x200
> >> ksys_write+0x7c/0x140
> >> system_call+0x5c/0x6
> >>
> >
> > But not the lock of the memory devices, or am I missing something?
> >
>
> To clarify:
>
> memory unplug will remove e.g., /sys/devices/system/memory/memoryX/,
> which has a dedicated kn->count AFAIK
>
> If you do a "echo 1 > /sys/kernel/slab/X/shrink", you would not lock the
> kn->count of /sys/devices/system/memory/memoryX/, but the one of some
> slab thingy.
>
> The only scenario I could see is if remove_memory_block_devices() will
> not only remove /sys/devices/system/memory/memoryX/, but also implicitly
> e.g., /sys/kernel/slab/X/. If that is the case, then this is indeed not
> a false positive, but something rather hard to trigger (which would
> still classify as stable material).

Yes, already agreed to drop stable.

However, the trylock does not solve the race it just turns the
blocking wait to a spin wait, but the subsequent 5ms sleep does make
the theoretical race nearly impossible, Thanks for pointing that out.

The theoretical race is still a problem because it hides future
lockdep violations, but I otherwise can't point to whether the
kn->count in question is a false positive concern for an actual
deadlock or not. Tracking that down is possible, but not something I
have time for at present.

Patch
diff mbox series

diff --git a/Documentation/core-api/memory-hotplug.rst b/Documentation/core-api/memory-hotplug.rst
index de7467e48067..637b467378b7 100644
--- a/Documentation/core-api/memory-hotplug.rst
+++ b/Documentation/core-api/memory-hotplug.rst
@@ -90,12 +90,13 @@  Locking Internals
 =================
 
 When adding/removing memory that uses memory block devices (i.e. ordinary RAM),
-the device_hotplug_lock should be held to:
+the device_hotplug_lock is held to:
 
 - synchronize against online/offline requests (e.g. via sysfs). This way, memory
   block devices can only be accessed (.online/.state attributes) by user
-  space once memory has been fully added. And when removing memory, we
-  know nobody is in critical sections.
+  space once memory has been fully added. And when removing memory, the
+  memory block device is invalidated (mem->section count set to 0) under the
+  lock to abort any in-flight online requests.
 - synchronize against CPU hotplug and similar (e.g. relevant for ACPI and PPC)
 
 Especially, there is a possible lock inversion that is avoided using
@@ -112,7 +113,13 @@  can result in a lock inversion.
 
 onlining/offlining of memory should be done via device_online()/
 device_offline() - to make sure it is properly synchronized to actions
-via sysfs. Holding device_hotplug_lock is advised (to e.g. protect online_type)
+via sysfs. Holding device_hotplug_lock is required to prevent online racing
+removal. The device_hotplug_lock and memblock invalidation allows
+remove_memory_block_devices() to run outside of mem_hotplug_lock to avoid lock
+dependency conflicts with memblock-sysfs teardown. The add_memory() path
+performs create_memory_block_devices() under mem_hotplug_lock so that if it
+fails it can perform an arch_remove_memory() cleanup. There are no known lock
+dependency problems with memblock-sysfs setup.
 
 When adding/removing/onlining/offlining memory or adding/removing
 heterogeneous/device memory, we should always hold the mem_hotplug_lock in
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 42a672456432..5d5036370c92 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1146,6 +1146,11 @@  void unlock_device_hotplug(void)
 	mutex_unlock(&device_hotplug_lock);
 }
 
+void assert_held_device_hotplug(void)
+{
+	lockdep_assert_held(&device_hotplug_lock);
+}
+
 int lock_device_hotplug_sysfs(void)
 {
 	if (mutex_trylock(&device_hotplug_lock))
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 799b43191dea..91c6fbd2383e 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -280,6 +280,10 @@  static int memory_subsys_online(struct device *dev)
 	if (mem->state == MEM_ONLINE)
 		return 0;
 
+	/* online lost the race with hot-unplug, abort */
+	if (!mem->section_count)
+		return -ENXIO;
+
 	/*
 	 * If we are called from state_store(), online_type will be
 	 * set >= 0 Otherwise we were called from the device online
@@ -736,8 +740,6 @@  int create_memory_block_devices(unsigned long start, unsigned long size)
  * Remove memory block devices for the given memory area. Start and size
  * have to be aligned to memory block granularity. Memory block devices
  * have to be offline.
- *
- * Called under device_hotplug_lock.
  */
 void remove_memory_block_devices(unsigned long start, unsigned long size)
 {
@@ -746,6 +748,8 @@  void remove_memory_block_devices(unsigned long start, unsigned long size)
 	struct memory_block *mem;
 	unsigned long block_id;
 
+	assert_held_device_hotplug();
+
 	if (WARN_ON_ONCE(!IS_ALIGNED(start, memory_block_size_bytes()) ||
 			 !IS_ALIGNED(size, memory_block_size_bytes())))
 		return;
diff --git a/include/linux/device.h b/include/linux/device.h
index 96ff76731e93..a84654489c51 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1553,6 +1553,7 @@  static inline bool device_supports_offline(struct device *dev)
 extern void lock_device_hotplug(void);
 extern void unlock_device_hotplug(void);
 extern int lock_device_hotplug_sysfs(void);
+extern void assert_held_device_hotplug(void);
 extern int device_offline(struct device *dev);
 extern int device_online(struct device *dev);
 extern void set_primary_fwnode(struct device *dev, struct fwnode_handle *fwnode);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 55ac23ef11c1..0158cd4cca48 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1763,8 +1763,6 @@  static int __ref try_remove_memory(int nid, u64 start, u64 size)
 
 	BUG_ON(check_hotplug_memory_range(start, size));
 
-	mem_hotplug_begin();
-
 	/*
 	 * All memory blocks must be offlined before removing memory.  Check
 	 * whether all memory blocks in question are offline and return error
@@ -1777,9 +1775,15 @@  static int __ref try_remove_memory(int nid, u64 start, u64 size)
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
 
-	/* remove memory block devices before removing memory */
+	/*
+	 * Remove memory block devices before removing memory and before
+	 * mem_hotplug_begin() (see Documentation/core-api/memory-hotplug.rst
+	 * "Locking Internals").
+	 */
 	remove_memory_block_devices(start, size);
 
+	mem_hotplug_begin();
+
 	arch_remove_memory(nid, start, size, NULL);
 	memblock_free(start, size);
 	memblock_remove(start, size);