
[2/5] mm, devm_memremap_pages: handle errors allocating final devres action

Message ID 152694212460.5484.13180030631810166467.stgit@dwillia2-desk3.amr.corp.intel.com
State New, archived

Commit Message

Dan Williams May 21, 2018, 10:35 p.m. UTC
The last step before devm_memremap_pages() returns success is to
allocate a release action to tear the entire setup down. However, the
result from devm_add_action() is not checked.

Checking the error also means that we need to handle the fact that the
percpu_ref may not be killed by the time devm_memremap_pages_release()
runs. Add a new state flag for this case.

Cc: <stable@vger.kernel.org>
Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/memremap.h |    1 +
 kernel/memremap.c        |    8 ++++++--
 2 files changed, 7 insertions(+), 2 deletions(-)

Comments

Andrew Morton May 21, 2018, 11:10 p.m. UTC | #1
On Mon, 21 May 2018 15:35:24 -0700 Dan Williams <dan.j.williams@intel.com> wrote:

> The last step before devm_memremap_pages() returns success is to
> allocate a release action to tear the entire setup down. However, the
> result from devm_add_action() is not checked.
> 
> Checking the error also means that we need to handle the fact that the
> percpu_ref may not be killed by the time devm_memremap_pages_release()
> runs. Add a new state flag for this case.
> 
> Cc: <stable@vger.kernel.org>
> Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Why the cc:stable?  The changelog doesn't describe the end-user-visible
impact of the bug (it always should, for this reason).

AFAICT we only go wrong when a small GFP_KERNEL allocation fails
(basically never happens), with undescribed results :(

Dan Williams May 22, 2018, 12:07 a.m. UTC | #2
[ resend as the last attempt dropped all the cc's ]

On Mon, May 21, 2018 at 4:10 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Mon, 21 May 2018 15:35:24 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
>
>> The last step before devm_memremap_pages() returns success is to
>> allocate a release action to tear the entire setup down. However, the
>> result from devm_add_action() is not checked.
>>
>> Checking the error also means that we need to handle the fact that the
>> percpu_ref may not be killed by the time devm_memremap_pages_release()
>> runs. Add a new state flag for this case.
>>
>> Cc: <stable@vger.kernel.org>
>> Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> Cc: Logan Gunthorpe <logang@deltatee.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>
> Why the cc:stable?  The changelog doesn't describe the end-user-visible
> impact of the bug (it always should, for this reason).

True, I should have included that; one of these years I'll stop making
this mistake.

>
> AFAICT we only go wrong when a small GFP_KERNEL allocation fails
> (basically never happens), with undescribed results :(
>

Here's a better changelog:

---
The last step before devm_memremap_pages() returns success is to
allocate a release action to tear the entire setup down. However, the
result from devm_add_action() is not checked.

Checking the error also means that we need to handle the fact that the
percpu_ref may not be killed by the time devm_memremap_pages_release()
runs. Add a new state flag for this case.

Without this change we could fail to register the teardown of
devm_memremap_pages(). The likelihood of hitting this failure is tiny
as small memory allocations almost always succeed. However, the impact
of the failure is large given any future reconfiguration, or
disable/enable, of an nvdimm namespace will fail forever as subsequent
calls to devm_memremap_pages() will fail to setup the pgmap_radix
since there will be stale entries for the physical address range.
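
To spell out that failure mode: if the release action is never
registered, nothing ever deletes this range's entries from pgmap_radix,
so the insert in the next devm_memremap_pages() call for the same
physical range trips over the stale entries. Conceptually (illustration
only, not the literal kernel/memremap.c code):

/* second call for the same range: the slot is still occupied */
error = radix_tree_insert(&pgmap_radix, PHYS_PFN(res->start), pgmap);
if (error)      /* -EEXIST from the stale, never-released mapping */
        goto err_radix;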

Christoph Hellwig May 22, 2018, 6:30 a.m. UTC | #3
On Mon, May 21, 2018 at 03:35:24PM -0700, Dan Williams wrote:
> The last step before devm_memremap_pages() returns success is to
> allocate a release action to tear the entire setup down. However, the
> result from devm_add_action() is not checked.
> 
> Checking the error also means that we need to handle the fact that the
> percpu_ref may not be killed by the time devm_memremap_pages_release()
> runs. Add a new state flag for this case.

Looks good (modulo any stable tag issues):

Reviewed-by: Christoph Hellwig <hch@lst.de>

Logan Gunthorpe May 22, 2018, 4:42 p.m. UTC | #4
Hey Dan,

On 21/05/18 06:07 PM, Dan Williams wrote:
> Without this change we could fail to register the teardown of
> devm_memremap_pages(). The likelihood of hitting this failure is tiny
> as small memory allocations almost always succeed. However, the impact
> of the failure is large given any future reconfiguration, or
> disable/enable, of an nvdimm namespace will fail forever as subsequent
> calls to devm_memremap_pages() will fail to setup the pgmap_radix
> since there will be stale entries for the physical address range.

Sorry, I don't follow this. The change only seems to prevent a warning
from occurring in this situation. Won't pgmap_radix_release() still be
called regardless of whether this patch is applied?

But it looks to me like this patch doesn't quite solve the issue -- at
least when looking at dax/pmem.c: If devm_add_action_or_reset() fails,
then dax_pmem_percpu_kill() won't be registered as an action and the
percpu_ref will never get killed. Thus, dax_pmem_percpu_release() would
not get called and dax_pmem_percpu_exit() will hang waiting for a
completion that will never occur. So we probably need to add a kill call
somewhere on the failing path...
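
For reference, the ordering in drivers/dax/pmem.c that I'm describing
looks roughly like this (simplified from memory; details may differ):

/* sketch of drivers/dax/pmem.c, heavily simplified */
static void dax_pmem_percpu_release(struct percpu_ref *ref)
{
        struct dax_pmem *dax_pmem = to_dax_pmem(ref);

        complete(&dax_pmem->cmp);
}

static void dax_pmem_percpu_exit(void *data)
{
        struct percpu_ref *ref = data;
        struct dax_pmem *dax_pmem = to_dax_pmem(ref);

        /* only completes after the ref has been killed and dropped */
        wait_for_completion(&dax_pmem->cmp);
        percpu_ref_exit(ref);
}

static void dax_pmem_percpu_kill(void *data)
{
        struct percpu_ref *ref = data;

        percpu_ref_kill(ref);
}

/* probe-time ordering, simplified: */
rc = percpu_ref_init(&dax_pmem->ref, dax_pmem_percpu_release, 0, GFP_KERNEL);
rc = devm_add_action_or_reset(dev, dax_pmem_percpu_exit, &dax_pmem->ref);
addr = devm_memremap_pages(dev, &dax_pmem->pgmap);
/* if this registration fails, nothing ever kills the ref and the
 * exit action above waits on the completion forever */
rc = devm_add_action_or_reset(dev, dax_pmem_percpu_kill, &dax_pmem->ref);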


Logan

Dan Williams May 22, 2018, 4:56 p.m. UTC | #5
On Tue, May 22, 2018 at 9:42 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
> Hey Dan,
>
> On 21/05/18 06:07 PM, Dan Williams wrote:
>> Without this change we could fail to register the teardown of
>> devm_memremap_pages(). The likelihood of hitting this failure is tiny
>> as small memory allocations almost always succeed. However, the impact
>> of the failure is large given any future reconfiguration, or
>> disable/enable, of an nvdimm namespace will fail forever as subsequent
>> calls to devm_memremap_pages() will fail to setup the pgmap_radix
>> since there will be stale entries for the physical address range.
>
> Sorry, I don't follow this. The change only seems to prevent a warning
> from occurring in this situation. Won't pgmap_radix_release() still be
> called regardless of whether this patch is applied?

devm_add_action() does not call the release function,
devm_add_action_or_reset() does.
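
For reference, devm_add_action_or_reset() is just a wrapper that runs
the action immediately when the devres registration fails, roughly:

/* roughly the definition in include/linux/device.h */
static inline int devm_add_action_or_reset(struct device *dev,
                                           void (*action)(void *), void *data)
{
        int ret;

        ret = devm_add_action(dev, action, data);
        if (ret)
                action(data);   /* run the release right away on failure */

        return ret;
}

So with the patch applied, a failed registration immediately runs
devm_memremap_pages_release(), which is why the new 'registered' flag
is needed to skip the percpu_ref liveness warning on that path.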

> But it looks to me like this patch doesn't quite solve the issue -- at
> least when looking at dax/pmem.c: If devm_add_action_or_reset() fails,
> then dax_pmem_percpu_kill() won't be registered as an action and the
> percpu_ref will never get killed. Thus, dax_pmem_percpu_release() would
> not get called and dax_pmem_percpu_exit() will hang waiting for a
> completion that will never occur. So we probably need to add a kill call
> somewhere on the failing path...

Ah, true, good catch!

We should manually kill in the !registered case. I think this means we
need to pass in the custom kill routine, because for the pmem driver
it's blk_freeze_queue_start().
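
Something like the following, purely as a sketch of the idea (not code
from this series):

/* hypothetical: let the pgmap carry its own kill routine */
struct dev_pagemap {
        /* ... existing fields ... */
        void (*kill)(struct percpu_ref *ref);   /* pmem would wrap blk_freeze_queue_start() */
        struct percpu_ref *ref;
};

/* then, in devm_memremap_pages(): */
error = devm_add_action_or_reset(dev, devm_memremap_pages_release, pgmap);
if (error) {
        /* teardown was never registered; make sure the ref still dies */
        pgmap->kill(pgmap->ref);
        return ERR_PTR(error);
}
pgmap->registered = true;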

Logan Gunthorpe May 22, 2018, 5:03 p.m. UTC | #6
On 22/05/18 10:56 AM, Dan Williams wrote:
> On Tue, May 22, 2018 at 9:42 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>> Hey Dan,
>>
>> On 21/05/18 06:07 PM, Dan Williams wrote:
>>> Without this change we could fail to register the teardown of
>>> devm_memremap_pages(). The likelihood of hitting this failure is tiny
>>> as small memory allocations almost always succeed. However, the impact
>>> of the failure is large given any future reconfiguration, or
>>> disable/enable, of an nvdimm namespace will fail forever as subsequent
>>> calls to devm_memremap_pages() will fail to setup the pgmap_radix
>>> since there will be stale entries for the physical address range.
>>
>> Sorry, I don't follow this. The change only seems to prevent a warning
>> from occurring in this situation. Won't pgmap_radix_release() still be
>> called regardless of whether this patch is applied?
> 
> devm_add_action() does not call the release function,
> devm_add_action_or_reset() does.

Oh, yes. Thanks I see that now.

> Ah, true, good catch!
> 
> We should manually kill in the !registered case. I think this means we
> need to pass in the custom kill routine, because for the pmem driver
> it's blk_freeze_queue_start().

It may be cleaner to just have the caller call the specific kill
function if devm_memremap_pages fails... Though, I don't fully
understand how the nvdimm pmem driver cleans up the percpu counter.

Logan

Dan Williams May 22, 2018, 5:25 p.m. UTC | #7
On Tue, May 22, 2018 at 10:03 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 22/05/18 10:56 AM, Dan Williams wrote:
>> On Tue, May 22, 2018 at 9:42 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>>> Hey Dan,
>>>
>>> On 21/05/18 06:07 PM, Dan Williams wrote:
>>>> Without this change we could fail to register the teardown of
>>>> devm_memremap_pages(). The likelihood of hitting this failure is tiny
>>>> as small memory allocations almost always succeed. However, the impact
>>>> of the failure is large given any future reconfiguration, or
>>>> disable/enable, of an nvdimm namespace will fail forever as subsequent
>>>> calls to devm_memremap_pages() will fail to setup the pgmap_radix
>>>> since there will be stale entries for the physical address range.
>>>
>>> Sorry, I don't follow this. The change only seems to prevent a warning
>>> from occurring in this situation. Won't pgmap_radix_release() still be
>>> called regardless of whether this patch is applied?
>>
>> devm_add_action() does not call the release function,
>> devm_add_action_or_reset() does.
>
> Oh, yes. Thanks I see that now.
>
>> Ah, true, good catch!
>>
>> We should manually kill in the !registered case. I think this means we
>> need to pass in the custom kill routine, because for the pmem driver
>> it's blk_freeze_queue_start().
>
> It may be cleaner to just have the caller call the specific kill
> function if devm_memremap_pages fails...

As far as I can see by then it's too late, or we need to expose
release details to the caller which defeats the purpose of devm
semantics.

> Though, I don't fully
> understand how the nvdimm pmem driver cleans up the percpu counter.

The dev_pagemap setup for pmem is entirely too subtle and arguably a
layering violation as it reuses the block layer q_usage_counter
percpu_ref. We arrange for that counter to be shutdown before the
blk_cleanup_queue() does the same.
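
Roughly, the hookup in drivers/nvdimm/pmem.c looks like this
(simplified; details approximate):

static void pmem_release_queue(void *q)
{
        blk_cleanup_queue(q);
}

static void pmem_freeze_queue(void *q)
{
        blk_freeze_queue_start(q);
}

/* probe path: the queue's q_usage_counter doubles as the pgmap ref */
if (devm_add_action_or_reset(dev, pmem_release_queue, q))
        return -ENOMEM;

pmem->pgmap.ref = &q->q_usage_counter;
addr = devm_memremap_pages(dev, &pmem->pgmap);

/* make sure the ref is dead (queue frozen) before blk_cleanup_queue() */
if (devm_add_action_or_reset(dev, pmem_freeze_queue, q))
        return -ENOMEM;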

Logan Gunthorpe May 22, 2018, 5:36 p.m. UTC | #8
On 22/05/18 11:25 AM, Dan Williams wrote:
> As far as I can see by then it's too late, or we need to expose
> release details to the caller which defeats the purpose of devm
> semantics.

In the dax/pmem case, I *think* it should be fine...

devm_add_action_or_reset() only calls the action it is passed on failure,
not the entire devm chain. In which case, it should drop all the
references to the percpu counter. Then, if on an error return from
devm_memremap_pages() we call dax_pmem_percpu_kill(), the rest of the
devm chain should be called when we return from a failed probe and it
should proceed as usual.

I think dax_pmem_percpu_kill() also must be called on any error return
from devm_memremap_pages(), and it is currently not...
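
i.e. something along these lines in dax/pmem.c (sketch only):

addr = devm_memremap_pages(dev, &dax_pmem->pgmap);
if (IS_ERR(addr)) {
        /* no kill action was registered for this ref, so drop it here
         * or dax_pmem_percpu_exit() waits on the completion forever */
        dax_pmem_percpu_kill(&dax_pmem->ref);
        return PTR_ERR(addr);
}

rc = devm_add_action_or_reset(dev, dax_pmem_percpu_kill, &dax_pmem->ref);
if (rc)
        return rc;      /* _or_reset already killed the ref for us */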

Logan

Patch

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..44a7ee517513 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -115,6 +115,7 @@  struct dev_pagemap {
 	dev_page_free_t page_free;
 	struct vmem_altmap altmap;
 	bool altmap_valid;
+	bool registered;
 	struct resource res;
 	struct percpu_ref *ref;
 	struct device *dev;
diff --git a/kernel/memremap.c b/kernel/memremap.c
index c614645227a7..30d96be5a965 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -296,7 +296,7 @@  static void devm_memremap_pages_release(void *data)
 	for_each_device_pfn(pfn, pgmap)
 		put_page(pfn_to_page(pfn));
 
-	if (percpu_ref_tryget_live(pgmap->ref)) {
+	if (pgmap->registered && percpu_ref_tryget_live(pgmap->ref)) {
 		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
 		percpu_ref_put(pgmap->ref);
 	}
@@ -418,7 +418,11 @@  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 		percpu_ref_get(pgmap->ref);
 	}
 
-	devm_add_action(dev, devm_memremap_pages_release, pgmap);
+	error = devm_add_action_or_reset(dev, devm_memremap_pages_release,
+			pgmap);
+	if (error)
+		return ERR_PTR(error);
+	pgmap->registered = true;
 
 	return __va(res->start);