
[v4] kernel/resource: Fix locking in request_free_mem_region

Message ID 20210416025745.8698-1-apopple@nvidia.com (mailing list archive)
State New, archived
Series [v4] kernel/resource: Fix locking in request_free_mem_region

Commit Message

Alistair Popple April 16, 2021, 2:57 a.m. UTC
request_free_mem_region() is used to find an empty range of physical
addresses for hotplugging ZONE_DEVICE memory. It does this by iterating
over the range of possible addresses, using region_intersects() to check
whether each candidate range is free.
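
Condensed from the pre-patch code (visible as removed lines in the diff
below), that search loop looks roughly like:

        for (; addr > size && addr >= base->start; addr -= size) {
                if (region_intersects(addr, size, 0, IORES_DESC_NONE) !=
                                REGION_DISJOINT)
                        continue;
                /* range looks free, try to reserve it */
                res = request_mem_region(addr, size, name);
                ...
        }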

region_intersects() obtains a read lock before walking the resource tree
to protect against concurrent changes. However, it drops the lock before
returning. This means that by the time request_mem_region() is called in
request_free_mem_region(), another thread may have already reserved the
requested region, resulting in unexpected failures and a kernel log
message from hitting this condition:

        /*
         * mm/hmm.c reserves physical addresses which then
         * become unavailable to other users.  Conflicts are
         * not expected.  Warn to aid debugging if encountered.
         */
        if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
                pr_warn("Unaddressable device %s %pR conflicts with %pR",
                        conflict->name, conflict, res);
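
Schematically, the race looks something like this (illustration only):

        Thread A                              Thread B
        --------                              --------
        region_intersects(addr, size, ...)
          read_lock(&resource_lock)
          /* finds the range free */
          read_unlock(&resource_lock)
                                              request_mem_region(addr, ...)
                                                /* reserves the same range */
        request_mem_region(addr, size, ...)
          /* conflict -> pr_warn() above */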

To fix this, create versions of region_intersects() and
request_mem_region() that allow the caller to take the appropriate lock
so that it can be held across the required calls.
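
With the locked variants the caller can hold resource_lock across both
the check and the reservation, roughly (simplified from the patch below):

        write_lock(&resource_lock);
        if (__region_intersects(addr, size, 0, IORES_DESC_NONE) ==
                        REGION_DISJOINT)
                request_region_locked(&iomem_resource, res, addr, size,
                                      name, 0);
        write_unlock(&resource_lock);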

Instead of creating another version of devm_request_mem_region() that
doesn't take the lock, open-code it so that the caller can pre-allocate
the required memory before taking the lock.
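
That is, the devres entry is allocated before the lock is taken and only
registered once the region has actually been claimed, roughly (simplified
from the patch below):

        dr = devres_alloc(devm_region_release,
                          sizeof(struct region_devres), GFP_KERNEL);
        ...
        write_lock(&resource_lock);
        /* find and reserve a free range under the lock */
        ...
        write_unlock(&resource_lock);
        devres_add(dev, dr);    /* region is now claimed */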

On some architectures and kernel configurations revoke_iomem() also
calls into resource code, so it cannot be called with the resource lock
held. Therefore, call it only after dropping the lock.
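
So the ordering at the end of __request_free_mem_region() becomes, roughly:

        write_unlock(&resource_lock);
        revoke_iomem(res);      /* may re-enter resource code */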

Fixes: 4ef589dc9b10c ("mm/hmm/devmem: device memory hotplug using ZONE_DEVICE")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Reported-by: kernel test robot <oliver.sang@intel.com>

---

Changes for v4:

- Update commit log
- Moved the revoke_iomem() call to before devres_add(). This shouldn't
  change anything, but it maintains the original ordering.
- Fixed freeing of devres in case of failure.
- Rebased onto linux-next
---
 kernel/resource.c | 144 ++++++++++++++++++++++++++++++----------------
 1 file changed, 94 insertions(+), 50 deletions(-)

Comments

Dan Williams April 16, 2021, 4:19 a.m. UTC | #1
On Thu, Apr 15, 2021 at 7:58 PM Alistair Popple <apopple@nvidia.com> wrote:
>
> request_free_mem_region() is used to find an empty range of physical
> addresses for hotplugging ZONE_DEVICE memory. It does this by iterating
> over the range of possible addresses, using region_intersects() to check
> whether each candidate range is free.
>
> region_intersects() obtains a read lock before walking the resource tree
> to protect against concurrent changes. However, it drops the lock before
> returning. This means that by the time request_mem_region() is called in
> request_free_mem_region(), another thread may have already reserved the
> requested region, resulting in unexpected failures and a kernel log
> message from hitting this condition:
>
>         /*
>          * mm/hmm.c reserves physical addresses which then
>          * become unavailable to other users.  Conflicts are
>          * not expected.  Warn to aid debugging if encountered.
>          */
>         if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
>                 pr_warn("Unaddressable device %s %pR conflicts with %pR",
>                         conflict->name, conflict, res);
>
> To fix this, create versions of region_intersects() and
> request_mem_region() that allow the caller to take the appropriate lock
> so that it can be held across the required calls.
>
> Instead of creating another version of devm_request_mem_region() that
> doesn't take the lock, open-code it so that the caller can pre-allocate
> the required memory before taking the lock.
>
> On some architectures and kernel configurations revoke_iomem() also
> calls into resource code, so it cannot be called with the resource lock
> held. Therefore, call it only after dropping the lock.

The patch is difficult to read because too many things are being
changed at once, and the changelog seems to confirm that. Can you try
breaking this down into a set of incremental changes? Not only will
this ease review, it will also distribute any regressions over multiple
bisection targets.

Something like:

* Refactor region_intersects() to allow external locking
* Refactor __request_region() to allow external locking
* Push revoke_iomem() down into...
* Fix resource_lock usage in [devm_]request_free_mem_region()

The revoke_iomem() change seems like something that should be moved
into a leaf helper and not called by __request_free_mem_region()
directly.
David Hildenbrand April 16, 2021, 8:18 a.m. UTC | #2
On 16.04.21 06:19, Dan Williams wrote:
> On Thu, Apr 15, 2021 at 7:58 PM Alistair Popple <apopple@nvidia.com> wrote:
>>
>> request_free_mem_region() is used to find an empty range of physical
>> addresses for hotplugging ZONE_DEVICE memory. It does this by iterating
>> over the range of possible addresses, using region_intersects() to check
>> whether each candidate range is free.
>>
>> region_intersects() obtains a read lock before walking the resource tree
>> to protect against concurrent changes. However, it drops the lock before
>> returning. This means that by the time request_mem_region() is called in
>> request_free_mem_region(), another thread may have already reserved the
>> requested region, resulting in unexpected failures and a kernel log
>> message from hitting this condition:
>>
>>          /*
>>           * mm/hmm.c reserves physical addresses which then
>>           * become unavailable to other users.  Conflicts are
>>           * not expected.  Warn to aid debugging if encountered.
>>           */
>>          if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
>>                  pr_warn("Unaddressable device %s %pR conflicts with %pR",
>>                          conflict->name, conflict, res);
>>
>> To fix this, create versions of region_intersects() and
>> request_mem_region() that allow the caller to take the appropriate lock
>> so that it can be held across the required calls.
>>
>> Instead of creating another version of devm_request_mem_region() that
>> doesn't take the lock, open-code it so that the caller can pre-allocate
>> the required memory before taking the lock.
>>
>> On some architectures and kernel configurations revoke_iomem() also
>> calls into resource code, so it cannot be called with the resource lock
>> held. Therefore, call it only after dropping the lock.
> 
> The patch is difficult to read because too many things are being
> changed at once, and the changelog seems to confirm that. Can you try
> breaking this down into a set of incremental changes? Not only will
> this ease review, it will also distribute any regressions over multiple
> bisection targets.
> 
> Something like:
> 
> * Refactor region_intersects() to allow external locking
> * Refactor __request_region() to allow external locking
> * Push revoke_iomem() down into...
> * Fix resource_lock usage in [devm_]request_free_mem_region()

+1
Alistair Popple April 19, 2021, 7:09 a.m. UTC | #3
On Friday, 16 April 2021 2:19:18 PM AEST Dan Williams wrote:
> The revoke_iomem() change seems like something that should be moved
> into a leaf helper and not called by __request_free_mem_region()
> directly.

Ok. I have split this up but left the call to revoke_iomem() in
__request_free_mem_region(). Perhaps I am missing something, but I wasn't
sure how moving it into a helper would be any better or different, as it
has to be called after dropping the lock.

 - Alistair

Patch

diff --git a/kernel/resource.c b/kernel/resource.c
index 7e00239a023a..f1f7fe089fc8 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -502,6 +502,34 @@ int __weak page_is_ram(unsigned long pfn)
 }
 EXPORT_SYMBOL_GPL(page_is_ram);
 
+static int __region_intersects(resource_size_t start, size_t size,
+			       unsigned long flags, unsigned long desc)
+{
+	struct resource res;
+	int type = 0; int other = 0;
+	struct resource *p;
+
+	res.start = start;
+	res.end = start + size - 1;
+
+	for (p = iomem_resource.child; p ; p = p->sibling) {
+		bool is_type = (((p->flags & flags) == flags) &&
+				((desc == IORES_DESC_NONE) ||
+				 (desc == p->desc)));
+
+		if (resource_overlaps(p, &res))
+			is_type ? type++ : other++;
+	}
+
+	if (type == 0)
+		return REGION_DISJOINT;
+
+	if (other == 0)
+		return REGION_INTERSECTS;
+
+	return REGION_MIXED;
+}
+
 /**
  * region_intersects() - determine intersection of region with known resources
  * @start: region start address
@@ -525,31 +553,12 @@ EXPORT_SYMBOL_GPL(page_is_ram);
 int region_intersects(resource_size_t start, size_t size, unsigned long flags,
 		      unsigned long desc)
 {
-	struct resource res;
-	int type = 0; int other = 0;
-	struct resource *p;
-
-	res.start = start;
-	res.end = start + size - 1;
+	int rc;
 
 	read_lock(&resource_lock);
-	for (p = iomem_resource.child; p ; p = p->sibling) {
-		bool is_type = (((p->flags & flags) == flags) &&
-				((desc == IORES_DESC_NONE) ||
-				 (desc == p->desc)));
-
-		if (resource_overlaps(p, &res))
-			is_type ? type++ : other++;
-	}
+	rc = __region_intersects(start, size, flags, desc);
 	read_unlock(&resource_lock);
-
-	if (type == 0)
-		return REGION_DISJOINT;
-
-	if (other == 0)
-		return REGION_INTERSECTS;
-
-	return REGION_MIXED;
+	return rc;
 }
 EXPORT_SYMBOL_GPL(region_intersects);
 
@@ -1150,31 +1159,16 @@ struct address_space *iomem_get_mapping(void)
 	return smp_load_acquire(&iomem_inode)->i_mapping;
 }
 
-/**
- * __request_region - create a new busy resource region
- * @parent: parent resource descriptor
- * @start: resource start address
- * @n: resource region size
- * @name: reserving caller's ID string
- * @flags: IO resource flags
- */
-struct resource * __request_region(struct resource *parent,
-				   resource_size_t start, resource_size_t n,
-				   const char *name, int flags)
+static bool request_region_locked(struct resource *parent,
+				    struct resource *res, resource_size_t start,
+				    resource_size_t n, const char *name, int flags)
 {
 	DECLARE_WAITQUEUE(wait, current);
-	struct resource *res = alloc_resource(GFP_KERNEL);
-	struct resource *orig_parent = parent;
-
-	if (!res)
-		return NULL;
 
 	res->name = name;
 	res->start = start;
 	res->end = start + n - 1;
 
-	write_lock(&resource_lock);
-
 	for (;;) {
 		struct resource *conflict;
 
@@ -1209,14 +1203,37 @@ struct resource * __request_region(struct resource *parent,
 			write_lock(&resource_lock);
 			continue;
 		}
+		return false;
+	}
+
+	return true;
+}
+
+/**
+ * __request_region - create a new busy resource region
+ * @parent: parent resource descriptor
+ * @start: resource start address
+ * @n: resource region size
+ * @name: reserving caller's ID string
+ * @flags: IO resource flags
+ */
+struct resource *__request_region(struct resource *parent,
+				  resource_size_t start, resource_size_t n,
+				  const char *name, int flags)
+{
+	struct resource *res = alloc_resource(GFP_KERNEL);
+
+	if (!res)
+		return NULL;
+
+	write_lock(&resource_lock);
+	if (!request_region_locked(parent, res, start, n, name, flags)) {
 		/* Uhhuh, that didn't work out.. */
 		free_resource(res);
 		res = NULL;
-		break;
 	}
 	write_unlock(&resource_lock);
-
-	if (res && orig_parent == &iomem_resource)
+	if (res && parent == &iomem_resource)
 		revoke_iomem(res);
 
 	return res;
@@ -1758,26 +1775,53 @@ static struct resource *__request_free_mem_region(struct device *dev,
 {
 	resource_size_t end, addr;
 	struct resource *res;
+	struct region_devres *dr = NULL;
+
+	res = alloc_resource(GFP_KERNEL);
+	if (!res)
+		return ERR_PTR(-ENOMEM);
+
+	if (dev) {
+		dr = devres_alloc(devm_region_release, sizeof(struct region_devres),
+				  GFP_KERNEL);
+		if (!dr) {
+			free_resource(res);
+			return ERR_PTR(-ENOMEM);
+		}
+	}
 
 	size = ALIGN(size, 1UL << PA_SECTION_SHIFT);
 	end = min_t(unsigned long, base->end, (1UL << MAX_PHYSMEM_BITS) - 1);
 	addr = end - size + 1UL;
 
+	write_lock(&resource_lock);
 	for (; addr > size && addr >= base->start; addr -= size) {
-		if (region_intersects(addr, size, 0, IORES_DESC_NONE) !=
+		if (__region_intersects(addr, size, 0, IORES_DESC_NONE) !=
 				REGION_DISJOINT)
 			continue;
 
-		if (dev)
-			res = devm_request_mem_region(dev, addr, size, name);
-		else
-			res = request_mem_region(addr, size, name);
-		if (!res)
-			return ERR_PTR(-ENOMEM);
+		if (!request_region_locked(&iomem_resource, res, addr,
+						   size, name, 0))
+			break;
+
+		write_unlock(&resource_lock);
+		revoke_iomem(res);
 		res->desc = IORES_DESC_DEVICE_PRIVATE_MEMORY;
+		if (dev) {
+			dr->parent = &iomem_resource;
+			dr->start = addr;
+			dr->n = size;
+			devres_add(dev, dr);
+		}
+
 		return res;
 	}
 
+	write_unlock(&resource_lock);
+	free_resource(res);
+	if (dev)
+		devres_free(dr);
+
 	return ERR_PTR(-ERANGE);
 }