
[RFT] x86/pat: Fix set_mce_nospec() for pmem

Message ID 162561960776.1149519.9267511644788011712.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State: New
Series: [RFT] x86/pat: Fix set_mce_nospec() for pmem

Commit Message

Dan Williams July 7, 2021, 1:01 a.m. UTC
When poison is discovered and triggers memory_failure() the physical
page is unmapped from all process address space. However, it is not
unmapped from kernel address space. Unlike a typical memory page that
can be retired from use in the page allocator and marked 'not present',
pmem needs to remain accessible given it cannot be physically remapped
or retired. set_memory_uc() tries to maintain consistent nominal memtype
mappings for a given pfn, but memory_failure() is an exceptional
condition.

For the same reason that set_memory_np() bypasses memtype checks
(they do not apply in the memory-failure case), memtype validation is
not applicable when marking the pmem pfn uncacheable. Use
_set_memory_uc().

Reported-by: Jane Chu <jane.chu@oracle.com>
Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set,clear}_mce_nospec()")
Cc: Luis Chamberlain <mcgrof@suse.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Jane, can you give this a try and see if it cleans up the error you are
seeing?

Thanks for the help.

 arch/x86/include/asm/set_memory.h |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Dan Williams Aug. 26, 2021, 7:08 p.m. UTC | #1
On Tue, Jul 6, 2021 at 6:01 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> When poison is discovered and triggers memory_failure() the physical
> page is unmapped from all process address space. However, it is not
> unmapped from kernel address space. Unlike a typical memory page that
> can be retired from use in the page allocator and marked 'not present',
> pmem needs to remain accessible given it cannot be physically remapped
> or retired. set_memory_uc() tries to maintain consistent nominal memtype
> mappings for a given pfn, but memory_failure() is an exceptional
> condition.
>
> For the same reason that set_memory_np() bypasses memtype checks
> (they do not apply in the memory-failure case), memtype validation is
> not applicable when marking the pmem pfn uncacheable. Use
> _set_memory_uc().
>
> Reported-by: Jane Chu <jane.chu@oracle.com>
> Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set,clear}_mce_nospec()")
> Cc: Luis Chamberlain <mcgrof@suse.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> Jane, can you give this a try and see if it cleans up the error you are
> seeing?
>
> Thanks for the help.

Jane, does this resolve the failure you reported [1]?

[1]: https://lore.kernel.org/r/327f9156-9b28-d20e-2850-21c125ece8c7@oracle.com

>
>  arch/x86/include/asm/set_memory.h |    9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
> index 43fa081a1adb..0bf2274c5186 100644
> --- a/arch/x86/include/asm/set_memory.h
> +++ b/arch/x86/include/asm/set_memory.h
> @@ -114,8 +114,13 @@ static inline int set_mce_nospec(unsigned long pfn, bool unmap)
>
>         if (unmap)
>                 rc = set_memory_np(decoy_addr, 1);
> -       else
> -               rc = set_memory_uc(decoy_addr, 1);
> +       else {
> +               /*
> +                * Bypass memtype checks since memory-failure has shot
> +                * down mappings.
> +                */
> +               rc = _set_memory_uc(decoy_addr, 1);
> +       }
>         if (rc)
>                 pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
>         return rc;
>
Jane Chu Aug. 27, 2021, 7:12 a.m. UTC | #2
Hi, Dan,

On 8/26/2021 12:08 PM, Dan Williams wrote:
> On Tue, Jul 6, 2021 at 6:01 PM Dan Williams <dan.j.williams@intel.com> wrote:
>> When poison is discovered and triggers memory_failure() the physical
>> page is unmapped from all process address space. However, it is not
>> unmapped from kernel address space. Unlike a typical memory page that
>> can be retired from use in the page allocator and marked 'not present',
>> pmem needs to remain accessible given it cannot be physically remapped
>> or retired. set_memory_uc() tries to maintain consistent nominal memtype
>> mappings for a given pfn, but memory_failure() is an exceptional
>> condition.
>>
>> For the same reason that set_memory_np() bypasses memtype checks
>> (they do not apply in the memory-failure case), memtype validation is
>> not applicable when marking the pmem pfn uncacheable. Use
>> _set_memory_uc().
>>
>> Reported-by: Jane Chu <jane.chu@oracle.com>
>> Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set,clear}_mce_nospec()")
>> Cc: Luis Chamberlain <mcgrof@suse.com>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Tony Luck <tony.luck@intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>> Jane, can you give this a try and see if it cleans up the error you are
>> seeing?
>>
>> Thanks for the help.
> Jane, does this resolve the failure you reported [1]?
> 
> [1]: https://lore.kernel.org/r/327f9156-9b28-d20e-2850-21c125ece8c7@oracle.com
> 

Sorry for taking so long.  With the patch applied, dmesg displays
[ 2111.282759] Memory failure: 0x1850600: recovery action for dax page: Recovered
[ 2112.415412] x86/PAT: fsdax_poison_v1:3214 freeing invalid memtype [mem 0x1850600000-0x1850600fff]

instead of the problematic

[10683.426147] x86/PAT: fsdax_poison_v1:5018 conflicting memory types 1850600000-1850601000  uncached-minus<->write-back

Please feel free to add Tested-by: Jane Chu <jane.chu@oracle.com>

Thanks for the fix!

-jane
Borislav Petkov Sept. 13, 2021, 10:29 a.m. UTC | #3
On Tue, Jul 06, 2021 at 06:01:05PM -0700, Dan Williams wrote:
> When poison is discovered and triggers memory_failure() the physical
> page is unmapped from all process address space. However, it is not
> unmapped from kernel address space. Unlike a typical memory page that
> can be retired from use in the page allocator and marked 'not present',
> pmem needs to remain accessible given it cannot be physically remapped
> or retired.

I'm surely missing something obvious but why does it need to remain
accessible? Spell it out please.

> set_memory_uc() tries to maintain consistent nominal memtype
> mappings for a given pfn, but memory_failure() is an exceptional
> condition.

That's not clear to me either. So looking at the failure:

[10683.426147] x86/PAT: fsdax_poison_v1:5018 conflicting memory types 1850600000-1850601000  uncached-minus<->write-back

set_memory_uc() marked it UC- but something? wants it to be WB. Why?

I guess I need some more info on the whole memory offlining for pmem and
why that should be done differently than with normal memory.

Thx.
Dan Williams Sept. 14, 2021, 6:08 p.m. UTC | #4
On Mon, Sep 13, 2021 at 3:29 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Jul 06, 2021 at 06:01:05PM -0700, Dan Williams wrote:
> > When poison is discovered and triggers memory_failure() the physical
> > page is unmapped from all process address space. However, it is not
> > unmapped from kernel address space. Unlike a typical memory page that
> > can be retired from use in the page allocator and marked 'not present',
> > pmem needs to remain accessible given it cannot be physically remapped
> > or retired.
>
> I'm surely missing something obvious but why does it need to remain
> accessible? Spell it out please.

Sure, I should probably include the following note in all patches
touching the DAX-memory_failure() path, because it is a frequently
asked question. The tl;dr is:

Typical memory_failure() does not assume the physical page can be
recovered and put back into circulation, whereas PMEM memory_failure()
allows for recovery of the page.

The longer description is:
Typical memory_failure() for anonymous or page-cache pages has the
flexibility to invalidate bad pages and trigger any users to request a
new page from the page allocator to replace the quarantined one. DAX
removes that flexibility. The page is a handle for a fixed storage
location, i.e. there is no mechanism to remap a physical page to a
different logical address. Software expects to be able to repair an
error in PMEM by reading around the poisoned cache lines and writing
zeros via fallocate(...FALLOC_FL_PUNCH_HOLE...) to overwrite the
poison. The page needs to remain accessible to enable recovery.
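
For illustration, a minimal repair sequence from userspace might look
like the below (a sketch only; the file path, offset, and length are
hypothetical):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical DAX-mounted file with a poisoned page at offset 0 */
	int fd = open("/mnt/pmem0/data", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Punch a hole over the poisoned page: the pmem driver clears
	 * the poison and zero-fills the blocks, after which the range
	 * is readable (as zeros) again. The offset/length of the bad
	 * page are assumptions for this example.
	 */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      0, 4096) < 0) {
		perror("fallocate");
		close(fd);
		return 1;
	}

	close(fd);
	return 0;
}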

>
> > set_memory_uc() tries to maintain consistent nominal memtype
> > mappings for a given pfn, but memory_failure() is an exceptional
> > condition.
>
> That's not clear to me too. So looking at the failure:
>
> [10683.426147] x86/PAT: fsdax_poison_v1:5018 conflicting memory types 1850600000-1850601000  uncached-minus<->write-back
>
> set_memory_uc() marked it UC- but something? wants it to be WB. Why?

PMEM is mapped WB at the beginning of time for nominal operation.
track_pfn_remap() records that driver setting and forwards it to any
track_pfn_insert() of the same pfn, i.e. this is how DAX mappings
inherit the WB cache mode. memory_failure() wants to arrange for
avoidance of speculative consumption of the poison; set_memory_uc()
checks against the track_pfn_remap() setting, but we know this is an
exceptional condition and it is ok to force the page UC against the
typical memtype expectation.
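
To make the distinction concrete, set_memory_uc() is essentially the
memtype bookkeeping plus the low-level attribute change (simplified,
paraphrasing arch/x86/mm/pat/set_memory.c from memory):

int set_memory_uc(unsigned long addr, int numpages)
{
	int ret;

	/*
	 * This is the memtype consistency check that conflicts with
	 * the driver's nominal WB mapping of the pmem pfn.
	 */
	ret = memtype_reserve(__pa(addr), __pa(addr) + numpages * PAGE_SIZE,
			      _PAGE_CACHE_MODE_UC_MINUS, NULL);
	if (ret)
		return ret;

	/*
	 * The page-table attribute change itself happens here, which
	 * is what the patch calls directly to skip the check above.
	 */
	ret = _set_memory_uc(addr, numpages);
	if (ret)
		memtype_free(__pa(addr), __pa(addr) + numpages * PAGE_SIZE);

	return ret;
}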

> I guess I need some more info on the whole memory offlining for pmem and
> why that should be done differently than with normal memory.

Short answer: PMEM never goes "offline" because it was never "online"
in the first place, where "online" in this context specifically refers
to pfns that are under the watchful eye of the core-mm page allocator.
Borislav Petkov Sept. 15, 2021, 10:41 a.m. UTC | #5
On Tue, Sep 14, 2021 at 11:08:00AM -0700, Dan Williams wrote:
> Sure, I should probably include this following note in all patches
> touching the DAX-memory_failure() path, because it is a frequently
> asked question. The tl;dr is:
> 
> Typical memory_failure() does not assume the physical page can be
> recovered and put back into circulation, whereas PMEM memory_failure()
> allows for recovery of the page.

Hmm, I think by "PMEM memory_failure()" you mean what pmem_do_write()
does with that poison clearing, no?

But then I have no clue what the "DAX-memory_failure()" path is.

> The longer description is:
> Typical memory_failure() for anonymous or page-cache pages has the
> flexibility to invalidate bad pages and trigger any users to request a
> new page from the page allocator to replace the quarantined one. DAX
> removes that flexibility. The page is a handle for a fixed storage
> location, i.e. there is no mechanism to remap a physical page to a
> different logical address. Software expects to be able to repair an
> error in PMEM by reading around the poisoned cache lines and writing
> zeros via fallocate(...FALLOC_FL_PUNCH_HOLE...) to overwrite the
> poison. The page needs to remain accessible to enable recovery.

Aha, so memory failure there is not offlining the 4K page but simply
zeroing the poison. What happens if the same physical location triggers
more read errors, i.e., the underlying hw is going bad? Don't you want
to exclude that location from ever being written again?

Or maybe such a recovery action doesn't make sense for pmem?

> Short answer: PMEM never goes "offline" because it was never "online"
> in the first place, where "online" in this context specifically refers
> to pfns that are under the watchful eye of the core-mm page allocator.

Ok, so pmem wants to be handled differently during memory failure.
Looking at the patch again, you change the !unmap case to do
_set_memory_uc().

That bool unmap thing gets *not* set in whole_page():

	return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;

so I'm guessing that "Recoverable Address LSB (bits 5:0): The lowest
valid recoverable address bit." thing is < 12 for pmem.
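
For reference, the surrounding code is roughly (quoting from memory):

/* arch/x86/kernel/cpu/mce/core.c */
static bool whole_page(struct mce *m)
{
	/* no valid MCi_MISC means no recoverable-address info at all */
	if (!(m->status & MCI_STATUS_MISCV))
		return true;

	return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
}

/* arch/x86/include/asm/mce.h: bits 5:0 of MCi_MISC */
#define MCI_MISC_ADDR_LSB(m)	((m) & 0x3f)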

But are you really saying that the hardware would never report a value
lower than 12 for normal memory?

If it does, then that is wrong here.

In any case, I'd prefer it if the code did an explicit check somewhere
saying "is this pmem" in order to do that special action only when it
really is pmem, and not rely on it implicitly.
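
Something like the below, say (purely illustrative: pfn_is_pmem() is
not an existing helper, just a placeholder for whatever shape that
check would take):

	if (unmap)
		rc = set_memory_np(decoy_addr, 1);
	else if (pfn_is_pmem(pfn)) {
		/* pmem stays mapped, so force UC without memtype checks */
		rc = _set_memory_uc(decoy_addr, 1);
	} else
		rc = set_memory_uc(decoy_addr, 1);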

Thx.

Patch

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 43fa081a1adb..0bf2274c5186 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -114,8 +114,13 @@ static inline int set_mce_nospec(unsigned long pfn, bool unmap)
 
 	if (unmap)
 		rc = set_memory_np(decoy_addr, 1);
-	else
-		rc = set_memory_uc(decoy_addr, 1);
+	else {
+		/*
+		 * Bypass memtype checks since memory-failure has shot
+		 * down mappings.
+		 */
+		rc = _set_memory_uc(decoy_addr, 1);
+	}
 	if (rc)
 		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
 	return rc;