
acpi/nfit: badrange report spill over to clean range

Message ID 20220711232658.536064-1-jane.chu@oracle.com (mailing list archive)
State New, archived
Series acpi/nfit: badrange report spill over to clean range

Commit Message

Jane Chu July 11, 2022, 11:26 p.m. UTC
Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
granularity") changed the nfit_handle_mce() callback to report badrange for
each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
instead of the hardcoded L1_CACHE_BYTES. However, recently on a server
populated with Intel DCPMEM v2 dimms, it turns out that
1UL << MCI_MISC_ADDR_LSB(mce->misc) is 4KiB, or 8 512-byte blocks.
Consequently, injecting 2 back-to-back poisons via ndctl results in
8 poisons being reported.

[29076.590281] {3}[Hardware Error]:   physical_address: 0x00000040a0602400
[..]
[29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
[29076.627519] mce: [Hardware Error]: Machine check events logged
[29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
[29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
[..]
[29078.634817] {4}[Hardware Error]:   physical_address: 0x00000040a0602600
[..]
[29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
[29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
[..]
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":33820770304,
  "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0",
  "badblock_count":8,
  "badblocks":[
    {
      "offset":8208,
      "length":8,
      "dimms":[
        "nmem0"
      ]
    }
  ]
}

So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator of poison
radius and shouldn't be used.  Moreover, as each injected poison is
reported independently, any alignment at or under 512 bytes appears to work:
L1_CACHE_BYTES (though inaccurate), 256 bytes (as ars->length reports),
or 512 bytes.

To get around this issue, 512 bytes is chosen as the alignment because
  a. it happens to be the badblock granularity,
  b. ndctl inject-error cannot inject more than one poison into a 512-byte block,
  c. it is architecture agnostic
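
For reference, a minimal userspace sketch of how that 4KiB span is decoded
from mce->misc; the MCI_MISC_ADDR_LSB() definition is mirrored from arch/x86
for illustration, and the sample misc value is hypothetical:

#include <stdio.h>

/* Low 6 bits of MCi_MISC encode the least significant valid address bit,
 * i.e. the poison granularity reported with the MCE (mirrored from x86).
 */
#define MCI_MISC_ADDR_LSB(m)	((m) & 0x3f)

int main(void)
{
	unsigned long long misc = 0x8c;	/* hypothetical: ADDR_MODE=physical, LSB=12 */
	unsigned long long span = 1ULL << MCI_MISC_ADDR_LSB(misc);

	/* prints: span = 4096 bytes = 8 512-byte blocks */
	printf("span = %llu bytes = %llu 512-byte blocks\n", span, span / 512);
	return 0;
}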

Fixes: 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison granularity")
Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 drivers/acpi/nfit/mce.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


base-commit: e35e5b6f695d241ffb1d223207da58a1fbcdff4b

Comments

Dan Williams July 13, 2022, 12:48 a.m. UTC | #1
Jane Chu wrote:
> Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
> granularity") changed nfit_handle_mce() callback to report badrange for
> each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
> instead of the hardcoded L1_CACHE_BYTES. However recently on a server
> populated with Intel DCPMEM v2 dimms, it appears that
> 1UL << MCI_MISC_ADDR_LSB(mce->misc) turns out is 4KiB, or 8 512-byte blocks.
> Consequently, injecting 2 back-to-back poisons via ndctl, and it reports
> 8 poisons.
> 
> [29076.590281] {3}[Hardware Error]:   physical_address: 0x00000040a0602400
> [..]
> [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
> [29076.627519] mce: [Hardware Error]: Machine check events logged
> [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
> [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
> [..]
> [29078.634817] {4}[Hardware Error]:   physical_address: 0x00000040a0602600
> [..]
> [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
> [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
> [..]
> {
>   "dev":"namespace0.0",
>   "mode":"fsdax",
>   "map":"dev",
>   "size":33820770304,
>   "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
>   "sector_size":512,
>   "align":2097152,
>   "blockdev":"pmem0",
>   "badblock_count":8,
>   "badblocks":[
>     {
>       "offset":8208,
>       "length":8,
>       "dimms":[
>         "nmem0"
>       ]
>     }
>   ]
> }
> 
> So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for poison
> radius and shouldn't be used.  More over, as each injected poison is being
> reported independently, any alignment under 512-byte appear works:
> L1_CACHE_BYTES (though inaccurate), or 256-bytes (as ars->length reports),
> or 512-byte.
> 
> To get around this issue, 512-bytes is chosen as the alignment because
>   a. it happens to be the badblock granularity,
>   b. ndctl inject-error cannot inject more than one poison to a 512-byte block,
>   c. architecture agnostic

I am failing to see the kernel bug? Yes, you injected less than 8
"badblocks" of poison and the hardware reported 8 blocks of poison, but
that's not the kernel's fault, that's the hardware. What happens when
hardware really does detect 8 blocks of consecutive poison and this
implementation decides to only record 1 at a time?

It seems the fix you want is for the hardware to report the precise
error bounds and that 1UL << MCI_MISC_ADDR_LSB(mce->misc) does not have
that precision in this case.

However, the ARS engine likely can return the precise error ranges so I
think the fix is to just use the address range indicated by 1UL <<
MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
scrub request to ask the device for the precise error list.
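
A rough sketch of that filtering idea, in standalone C; struct ars_record and
the add_badrange callback are hypothetical stand-ins for the ARS status
payload and nvdimm_bus_add_badrange(), not the actual nfit API:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical stand-in for one error record returned by a short ARS;
 * the real driver would pull these out of the ARS status output buffer.
 */
struct ars_record {
	uint64_t addr;
	uint64_t len;
};

/*
 * Given the (possibly over-wide) span implied by the MCE, report only the
 * portions that a short ARS confirms are actually poisoned. Falls back to
 * the full MCE span when no ARS data is available.
 */
static void report_badrange(uint64_t mce_addr, uint64_t mce_align,
			    const struct ars_record *recs, int nr_recs,
			    void (*add_badrange)(uint64_t addr, uint64_t len))
{
	uint64_t start = mce_addr & ~(mce_align - 1);
	uint64_t end = start + mce_align;
	bool found = false;

	for (int i = 0; i < nr_recs; i++) {
		uint64_t r_start = recs[i].addr;
		uint64_t r_end = recs[i].addr + recs[i].len;

		if (r_end <= start || r_start >= end)
			continue;	/* record outside the MCE span */
		if (r_start < start)
			r_start = start;
		if (r_end > end)
			r_end = end;
		add_badrange(r_start, r_end - r_start);
		found = true;
	}
	if (!found)		/* no ARS data: keep the MCE-reported span */
		add_badrange(start, end - start);
}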
Jane Chu July 13, 2022, 11:52 p.m. UTC | #2
On 7/12/2022 5:48 PM, Dan Williams wrote:
> Jane Chu wrote:
>> Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
>> granularity") changed nfit_handle_mce() callback to report badrange for
>> each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
>> instead of the hardcoded L1_CACHE_BYTES. However recently on a server
>> populated with Intel DCPMEM v2 dimms, it appears that
>> 1UL << MCI_MISC_ADDR_LSB(mce->misc) turns out is 4KiB, or 8 512-byte blocks.
>> Consequently, injecting 2 back-to-back poisons via ndctl, and it reports
>> 8 poisons.
>>
>> [29076.590281] {3}[Hardware Error]:   physical_address: 0x00000040a0602400
>> [..]
>> [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
>> [29076.627519] mce: [Hardware Error]: Machine check events logged
>> [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
>> [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
>> [..]
>> [29078.634817] {4}[Hardware Error]:   physical_address: 0x00000040a0602600
>> [..]
>> [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
>> [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
>> [..]
>> {
>>    "dev":"namespace0.0",
>>    "mode":"fsdax",
>>    "map":"dev",
>>    "size":33820770304,
>>    "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
>>    "sector_size":512,
>>    "align":2097152,
>>    "blockdev":"pmem0",
>>    "badblock_count":8,
>>    "badblocks":[
>>      {
>>        "offset":8208,
>>        "length":8,
>>        "dimms":[
>>          "nmem0"
>>        ]
>>      }
>>    ]
>> }
>>
>> So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for poison
>> radius and shouldn't be used.  More over, as each injected poison is being
>> reported independently, any alignment under 512-byte appear works:
>> L1_CACHE_BYTES (though inaccurate), or 256-bytes (as ars->length reports),
>> or 512-byte.
>>
>> To get around this issue, 512-bytes is chosen as the alignment because
>>    a. it happens to be the badblock granularity,
>>    b. ndctl inject-error cannot inject more than one poison to a 512-byte block,
>>    c. architecture agnostic
> 
> I am failing to see the kernel bug? Yes, you injected less than 8
> "badblocks" of poison and the hardware reported 8 blocks of poison, but
> that's not the kernel's fault, that's the hardware. What happens when
> hardware really does detect 8 blocks of consective poison and this
> implementation decides to only record 1 at a time?

In that case, there will be 8 reports of the poisons by APEI GHES, and
the ARS scan will also report 8 poisons; each will be added to the
badrange via nvdimm_bus_add_badrange(), so none of them will be missed.

In the above 2 poison example, the poison in 0x00000040a0602400 and in 
0x00000040a0602600 were separately reported.

> 
> It seems the fix you want is for the hardware to report the precise
> error bounds and that 1UL << MCI_MISC_ADDR_LSB(mce->misc) does not have
> that precision in this case.

That field describes a 4K range even for a single poison, which confuses 
people unnecessarily.

> 
> However, the ARS engine likely can return the precise error ranges so I
> think the fix is to just use the address range indicated by 1UL <<
> MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
> scrub request to ask the device for the precise error list.

You mean for the nfit_handle_mce() callback to issue a short ARS for each 
poison report over a 4K range in order to determine the precise range as a 
workaround for the hardware issue?  If there are 8 poisons detected, 
there will be 8 short ARS runs; are we sure we want to do that?  Also, for 
now, is it possible to log more than 1 poison per 512-byte block?

thanks!
-jane
Dan Williams July 14, 2022, 12:24 a.m. UTC | #3
Jane Chu wrote:
> On 7/12/2022 5:48 PM, Dan Williams wrote:
> > Jane Chu wrote:
> >> Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
> >> granularity") changed nfit_handle_mce() callback to report badrange for
> >> each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
> >> instead of the hardcoded L1_CACHE_BYTES. However recently on a server
> >> populated with Intel DCPMEM v2 dimms, it appears that
> >> 1UL << MCI_MISC_ADDR_LSB(mce->misc) turns out is 4KiB, or 8 512-byte blocks.
> >> Consequently, injecting 2 back-to-back poisons via ndctl, and it reports
> >> 8 poisons.
> >>
> >> [29076.590281] {3}[Hardware Error]:   physical_address: 0x00000040a0602400
> >> [..]
> >> [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
> >> [29076.627519] mce: [Hardware Error]: Machine check events logged
> >> [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
> >> [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
> >> [..]
> >> [29078.634817] {4}[Hardware Error]:   physical_address: 0x00000040a0602600
> >> [..]
> >> [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
> >> [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
> >> [..]
> >> {
> >>    "dev":"namespace0.0",
> >>    "mode":"fsdax",
> >>    "map":"dev",
> >>    "size":33820770304,
> >>    "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
> >>    "sector_size":512,
> >>    "align":2097152,
> >>    "blockdev":"pmem0",
> >>    "badblock_count":8,
> >>    "badblocks":[
> >>      {
> >>        "offset":8208,
> >>        "length":8,
> >>        "dimms":[
> >>          "nmem0"
> >>        ]
> >>      }
> >>    ]
> >> }
> >>
> >> So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for poison
> >> radius and shouldn't be used.  More over, as each injected poison is being
> >> reported independently, any alignment under 512-byte appear works:
> >> L1_CACHE_BYTES (though inaccurate), or 256-bytes (as ars->length reports),
> >> or 512-byte.
> >>
> >> To get around this issue, 512-bytes is chosen as the alignment because
> >>    a. it happens to be the badblock granularity,
> >>    b. ndctl inject-error cannot inject more than one poison to a 512-byte block,
> >>    c. architecture agnostic
> > 
> > I am failing to see the kernel bug? Yes, you injected less than 8
> > "badblocks" of poison and the hardware reported 8 blocks of poison, but
> > that's not the kernel's fault, that's the hardware. What happens when
> > hardware really does detect 8 blocks of consective poison and this
> > implementation decides to only record 1 at a time?
> 
> In that case, there will be 8 reports of the poisons by APEI GHES,

Why would there be 8 reports for just one poison consumption event?

> ARC scan will also report 8 poisons, each will get to be added to the 
> bad range via nvdimm_bus_add_badrange(), so none of them will be missed.

Right, that's what I'm saying about the proposed change, trim the
reported poison by what is returned from a "short" ARS. Recall that
short-ARS just reads from a staging buffer that the BIOS knows about, it
need not go all the way to hardware.

> In the above 2 poison example, the poison in 0x00000040a0602400 and in 
> 0x00000040a0602600 were separately reported.

Separately reported, each with a 4K alignment?

> > It seems the fix you want is for the hardware to report the precise
> > error bounds and that 1UL << MCI_MISC_ADDR_LSB(mce->misc) does not have
> > that precision in this case.
> 
> That field describes a 4K range even for a single poison, it confuses 
> people unnecessarily.

I agree with you on the problem statement, it's the fix where I have
questions.

> > However, the ARS engine likely can return the precise error ranges so I
> > think the fix is to just use the address range indicated by 1UL <<
> > MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
> > scrub request to ask the device for the precise error list.
> 
> You mean for nfit_handle_mce() callback to issue a short ARS per each 
> poison report over a 4K range 

Over a L1_CACHE_BYTES range...

> in order to decide the precise range as a workaround of the hardware
> issue?  if there are 8 poisoned detected, there will be 8 short ARS,
> sure we want to do that?

Seems ok to me, short ARS is meant to be cheap. I would hope there are
no latency concerns in this path.

> also, for now, is it possible to log more than 1 poison per 512byte
> block?

For the badrange tracking, no. So this would just be a check to say
"Yes, CPU I see you think the whole 4K is gone, but lets double check
with more precise information for what gets placed in the badrange
tracking".
Jane Chu July 14, 2022, 11:22 p.m. UTC | #4
On 7/13/2022 5:24 PM, Dan Williams wrote:
> Jane Chu wrote:
>> On 7/12/2022 5:48 PM, Dan Williams wrote:
>>> Jane Chu wrote:
>>>> Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
>>>> granularity") changed nfit_handle_mce() callback to report badrange for
>>>> each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
>>>> instead of the hardcoded L1_CACHE_BYTES. However recently on a server
>>>> populated with Intel DCPMEM v2 dimms, it appears that
>>>> 1UL << MCI_MISC_ADDR_LSB(mce->misc) turns out is 4KiB, or 8 512-byte blocks.
>>>> Consequently, injecting 2 back-to-back poisons via ndctl, and it reports
>>>> 8 poisons.
>>>>
>>>> [29076.590281] {3}[Hardware Error]:   physical_address: 0x00000040a0602400
>>>> [..]
>>>> [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
>>>> [29076.627519] mce: [Hardware Error]: Machine check events logged
>>>> [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
>>>> [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
>>>> [..]
>>>> [29078.634817] {4}[Hardware Error]:   physical_address: 0x00000040a0602600
>>>> [..]
>>>> [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
>>>> [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
>>>> [..]
>>>> {
>>>>     "dev":"namespace0.0",
>>>>     "mode":"fsdax",
>>>>     "map":"dev",
>>>>     "size":33820770304,
>>>>     "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
>>>>     "sector_size":512,
>>>>     "align":2097152,
>>>>     "blockdev":"pmem0",
>>>>     "badblock_count":8,
>>>>     "badblocks":[
>>>>       {
>>>>         "offset":8208,
>>>>         "length":8,
>>>>         "dimms":[
>>>>           "nmem0"
>>>>         ]
>>>>       }
>>>>     ]
>>>> }
>>>>
>>>> So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for poison
>>>> radius and shouldn't be used.  More over, as each injected poison is being
>>>> reported independently, any alignment under 512-byte appear works:
>>>> L1_CACHE_BYTES (though inaccurate), or 256-bytes (as ars->length reports),
>>>> or 512-byte.
>>>>
>>>> To get around this issue, 512-bytes is chosen as the alignment because
>>>>     a. it happens to be the badblock granularity,
>>>>     b. ndctl inject-error cannot inject more than one poison to a 512-byte block,
>>>>     c. architecture agnostic
>>>
>>> I am failing to see the kernel bug? Yes, you injected less than 8
>>> "badblocks" of poison and the hardware reported 8 blocks of poison, but
>>> that's not the kernel's fault, that's the hardware. What happens when
>>> hardware really does detect 8 blocks of consective poison and this
>>> implementation decides to only record 1 at a time?
>>
>> In that case, there will be 8 reports of the poisons by APEI GHES,
> 
> Why would there be 8 reports for just one poison consumption event?

I meant to say there would be 8 calls to the nfit_handle_mce() callback,
one call for each poison, each with an accurate address.

Also, short ARS would find 2 poisons.

I attached the console output; my annotations are prefixed with "<==".

It is from this information that I concluded that no poison will be missed
in terms of reporting.

> 
>> ARC scan will also report 8 poisons, each will get to be added to the
>> bad range via nvdimm_bus_add_badrange(), so none of them will be missed.
> 
> Right, that's what I'm saying about the proposed change, trim the
> reported poison by what is return from a "short" ARS. Recall that
> short-ARS just reads from a staging buffer that the BIOS knows about, it
> need not go all the way to hardware.

Okay, that confirms my understanding of your proposal. More below.

> 
>> In the above 2 poison example, the poison in 0x00000040a0602400 and in
>> 0x00000040a0602600 were separately reported.
> 
> Separately reported, each with a 4K alignment?

Yes, and so nfit_handle_mce() called nvdimm_bus_add_badrange() twice
with addr: 0x40a0602000, length: 0x1000, a complete overlap.

> 
>>> It seems the fix you want is for the hardware to report the precise
>>> error bounds and that 1UL << MCI_MISC_ADDR_LSB(mce->misc) does not have
>>> that precision in this case.
>>
>> That field describes a 4K range even for a single poison, it confuses
>> people unnecessarily.
> 
> I agree with you on the problem statement, it's the fix where I have
> questions.
> 
>>> However, the ARS engine likely can return the precise error ranges so I
>>> think the fix is to just use the address range indicated by 1UL <<
>>> MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
>>> scrub request to ask the device for the precise error list.
>>
>> You mean for nfit_handle_mce() callback to issue a short ARS per each
>> poison report over a 4K range
> 
> Over a L1_CACHE_BYTES range...
> 
>> in order to decide the precise range as a workaround of the hardware
>> issue?  if there are 8 poisoned detected, there will be 8 short ARS,
>> sure we want to do that?
> 
> Seems ok to me, short ARS is meant to be cheap. I would hope there are
> no latency concerns in this path.

Yeah, accumulated latency is the concern here. I know the upstream
user call stack has a timeout for getting a response to certain
database transactions.  And the test folks might inject dozens of
back-to-back poisons into adjacent pages...
> 
>> also, for now, is it possible to log more than 1 poison per 512byte
>> block?
> 
> For the badrange tracking, no. So this would just be a check to say
> "Yes, CPU I see you think the whole 4K is gone, but lets double check
> with more precise information for what gets placed in the badrange
> tracking".

Okay, process-wise, this is what I am seeing -

- for each poison, nfit_handle_mce() issues a short ARS given (addr, 
64 bytes)
- and the short ARS returns to say that's actually (addr, 256 bytes),
- and then nvdimm_bus_add_badrange() logs the poison as (addr, 512 bytes) 
anyway.

The precise badrange from the short ARS is lost in the process; given the
time spent visiting the BIOS, what's the gain?  Could we defer the
precise badrange until there is a consumer of the information?

thanks!
-jane
# dmesg
# ndctl inject-error namespace0.0 -n 2 -B 8210
[29076.551909] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[29076.561136] {3}[Hardware Error]: event severity: recoverable
[29076.567453] {3}[Hardware Error]:  Error 0, type: recoverable
[29076.573769] {3}[Hardware Error]:   section_type: memory error
[29076.580182] {3}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
[29076.590281] {3}[Hardware Error]:   physical_address: 0x00000040a0602400		<== 1st poison @ 0x400
[29076.597664] {3}[Hardware Error]:   physical_address_mask: 0xffffffffffffff00
[29076.605532] {3}[Hardware Error]:   node:0 card:0 module:1
[29076.611655] {3}[Hardware Error]:   error_type: 14, scrub uncorrected error
[29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
[29076.627519] mce: [Hardware Error]: Machine check events logged
[29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
[29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)	<== 1st call to nfit_handle_mce
[29077.877682] EDAC MC0: 1 UE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x40a0602 offset:0x400 grain:32 -  err_code:0x0000:0x009f  SystemAddress:0x40a0602400 DevicePhysicalAddress:0x30602400 ProcessorSocketId:0x0 MemoryControllerId:0x0 ChannelAddress:0x430602400 ChannelId:0x0 PhysicalRankId:0x0 DimmSlotId:0x1 ChipSelect:0x4)

[29078.596454] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[29078.605682] {4}[Hardware Error]: event severity: recoverable
[29078.611997] {4}[Hardware Error]:  Error 0, type: recoverable
[29078.618313] {4}[Hardware Error]:   section_type: memory error
[29078.624727] {4}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
[29078.634817] {4}[Hardware Error]:   physical_address: 0x00000040a0602600		<== 2nd poison @ 0x600
[29078.642200] {4}[Hardware Error]:   physical_address_mask: 0xffffffffffffff00
[29078.650069] {4}[Hardware Error]:   node:0 card:0 module:1
[29078.656184] {4}[Hardware Error]:   error_type: 14, scrub uncorrected error
[29078.663944] Memory failure: 0x40a0602: recovery action for dax page: Recovered
[29078.672011] mce: [Hardware Error]: Machine check events logged
[29079.595327] nfit ACPI0012:00: XXX addr in SPA 1 (0x4080000000, 0x1f80000000)
[29079.603204] nfit ACPI0012:00: XXX new code nvdimm_bus_add_badrange
[29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)	<== 2nd call to nfit_handle_mce 
[29079.949531] EDAC MC0: 1 UE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x40a0602 offset:0x600 grain:32 -  err_code:0x0000:0x009f  SystemAddress:0x40a0602600 DevicePhysicalAddress:0x30602600 ProcessorSocketId:0x0 MemoryControllerId:0x0 ChannelAddress:0x430602600 ChannelId:0x0 PhysicalRankId:0x0 DimmSlotId:0x1 ChipSelect:0x4)

[29102.630372] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602400, 0x100)	<== short ARS found 2 poisons
[29102.638341] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602600, 0x100)
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":33820770304,
  "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0",
  "badblock_count":8,
  "badblocks":[
    {
      "offset":8208,
      "length":8,
      "dimms":[
        "nmem0"
      ]
    }
  ]
}
Dan Williams July 15, 2022, 12:58 a.m. UTC | #5
Jane Chu wrote:
> On 7/13/2022 5:24 PM, Dan Williams wrote:
> > Jane Chu wrote:
> >> On 7/12/2022 5:48 PM, Dan Williams wrote:
> >>> Jane Chu wrote:
> >>>> Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
> >>>> granularity") changed nfit_handle_mce() callback to report badrange for
> >>>> each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
> >>>> instead of the hardcoded L1_CACHE_BYTES. However recently on a server
> >>>> populated with Intel DCPMEM v2 dimms, it appears that
> >>>> 1UL << MCI_MISC_ADDR_LSB(mce->misc) turns out is 4KiB, or 8 512-byte blocks.
> >>>> Consequently, injecting 2 back-to-back poisons via ndctl, and it reports
> >>>> 8 poisons.
> >>>>
> >>>> [29076.590281] {3}[Hardware Error]:   physical_address: 0x00000040a0602400
> >>>> [..]
> >>>> [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
> >>>> [29076.627519] mce: [Hardware Error]: Machine check events logged
> >>>> [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
> >>>> [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
> >>>> [..]
> >>>> [29078.634817] {4}[Hardware Error]:   physical_address: 0x00000040a0602600
> >>>> [..]
> >>>> [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
> >>>> [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
> >>>> [..]
> >>>> {
> >>>>     "dev":"namespace0.0",
> >>>>     "mode":"fsdax",
> >>>>     "map":"dev",
> >>>>     "size":33820770304,
> >>>>     "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
> >>>>     "sector_size":512,
> >>>>     "align":2097152,
> >>>>     "blockdev":"pmem0",
> >>>>     "badblock_count":8,
> >>>>     "badblocks":[
> >>>>       {
> >>>>         "offset":8208,
> >>>>         "length":8,
> >>>>         "dimms":[
> >>>>           "nmem0"
> >>>>         ]
> >>>>       }
> >>>>     ]
> >>>> }
> >>>>
> >>>> So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for poison
> >>>> radius and shouldn't be used.  More over, as each injected poison is being
> >>>> reported independently, any alignment under 512-byte appear works:
> >>>> L1_CACHE_BYTES (though inaccurate), or 256-bytes (as ars->length reports),
> >>>> or 512-byte.
> >>>>
> >>>> To get around this issue, 512-bytes is chosen as the alignment because
> >>>>     a. it happens to be the badblock granularity,
> >>>>     b. ndctl inject-error cannot inject more than one poison to a 512-byte block,
> >>>>     c. architecture agnostic
> >>>
> >>> I am failing to see the kernel bug? Yes, you injected less than 8
> >>> "badblocks" of poison and the hardware reported 8 blocks of poison, but
> >>> that's not the kernel's fault, that's the hardware. What happens when
> >>> hardware really does detect 8 blocks of consective poison and this
> >>> implementation decides to only record 1 at a time?
> >>
> >> In that case, there will be 8 reports of the poisons by APEI GHES,
> > 
> > Why would there be 8 reports for just one poison consumption event?
> 
> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
> one call for each poison with accurate address.
> 
> Also, short ARS would find 2 poisons.
> 
> I attached the console output, my annotation is prefixed with "<==".
> 
> It is from these information I concluded that no poison will be missed
> in terms of reporting.
> 
> > 
> >> ARC scan will also report 8 poisons, each will get to be added to the
> >> bad range via nvdimm_bus_add_badrange(), so none of them will be missed.
> > 
> > Right, that's what I'm saying about the proposed change, trim the
> > reported poison by what is return from a "short" ARS. Recall that
> > short-ARS just reads from a staging buffer that the BIOS knows about, it
> > need not go all the way to hardware.
> 
> Okay, that confirms my understanding of your proposal. More below.
> 
> > 
> >> In the above 2 poison example, the poison in 0x00000040a0602400 and in
> >> 0x00000040a0602600 were separately reported.
> > 
> > Separately reported, each with a 4K alignment?
> 
> Yes,  and so twice nfit_handle_mce() call
> nvdimm_bus_add_badrange() with addr: 0x40a0602000, length: 0x1000
> complete overlap.
> 
> > 
> >>> It seems the fix you want is for the hardware to report the precise
> >>> error bounds and that 1UL << MCI_MISC_ADDR_LSB(mce->misc) does not have
> >>> that precision in this case.
> >>
> >> That field describes a 4K range even for a single poison, it confuses
> >> people unnecessarily.
> > 
> > I agree with you on the problem statement, it's the fix where I have
> > questions.
> > 
> >>> However, the ARS engine likely can return the precise error ranges so I
> >>> think the fix is to just use the address range indicated by 1UL <<
> >>> MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
> >>> scrub request to ask the device for the precise error list.
> >>
> >> You mean for nfit_handle_mce() callback to issue a short ARS per each
> >> poison report over a 4K range
> > 
> > Over a L1_CACHE_BYTES range...
> > 
> >> in order to decide the precise range as a workaround of the hardware
> >> issue?  if there are 8 poisoned detected, there will be 8 short ARS,
> >> sure we want to do that?
> > 
> > Seems ok to me, short ARS is meant to be cheap. I would hope there are
> > no latency concerns in this path.
> 
> Yeah, accumulated latency is the concern here.

An actual concern in practice? I.e. you see the round trip call to the
BIOS violating application latency concerns?

> I know the upstream user call stack has timeout for getting a response
> for certain database transaction.

I would expect that's on the order of seconds, no?

> And the test folks might inject dozens of back-to-back poisons to
> adjacent pages...

I am not sure the extreme scenarios of the test folks should be guiding
kernel design; the interfaces need to make sense first, and then the
performance can be fixed up if the kernel architecture causes practical
problems.

> > 
> >> also, for now, is it possible to log more than 1 poison per 512byte
> >> block?
> > 
> > For the badrange tracking, no. So this would just be a check to say
> > "Yes, CPU I see you think the whole 4K is gone, but lets double check
> > with more precise information for what gets placed in the badrange
> > tracking".
> 
> Okay, process-wise, this is what I am seeing -
> 
> - for each poison, nfit_handle_mce() issues a short ARS given (addr, 
> 64bytes)

Why would the short-ARS be performed over a 64-byte span when the MCE
reported a 4K aligned event?

> - and short ARS returns to say that's actually (addr, 256bytes),
> - and then nvdimm_bus_add_badrange() logs the poison in (addr, 512bytes) 
> anyway.

Right, I am reacting to the fact that the patch is picking 512 as an
arbitrary blast radius. It's ok to expand the blast radius from
hardware when, for example, recording a 64-byte MCE in badrange which
only understands 512-byte records, but it's not ok to take a 4K MCE and
trim it to 512 bytes without asking hardware for a more precise report.
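
A minimal illustration of that "expand, don't trim" rule, with a hypothetical
helper (not the actual badrange code):

#include <stdint.h>

#define SECTOR_SIZE	512ULL

/* Round an error range outward to whole 512-byte badrange records: a
 * sub-sector event grows to cover its sector; a larger event is never
 * shrunk here (narrowing only happens from a more precise ARS answer).
 */
static void to_badrange(uint64_t addr, uint64_t len,
			uint64_t *out_addr, uint64_t *out_len)
{
	uint64_t start = addr & ~(SECTOR_SIZE - 1);
	uint64_t end = (addr + len + SECTOR_SIZE - 1) & ~(SECTOR_SIZE - 1);

	*out_addr = start;
	*out_len = end - start;
}

/* Example: a 64-byte MCE at 0x40a0602400 becomes the 512-byte record
 * (0x40a0602400, 0x200), while a 4K MCE stays a full 4K record.
 */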

Recall that the NFIT driver supports platforms that may not offer ARS.
In that case the 4K MCE from the CPU is all that the driver gets and
there is no data source for a more precise answer.

So the ask is to avoid trimming the blast radius of MCE reports unless
and until a short-ARS says otherwise.

> The precise badrange from short ARS is lost in the process, given the
> time spent visiting the BIOS, what's the gain?

Generic support for not under-recording poison on platforms that do not
support ARS.

> Could we defer the precise badrange until there is consumer of the
> information?

Ideally the consumer is immediate and this precise information can make
it to the filesystem which might be able to make a better decision about
what data got clobbered.

See dax_notify_failure() infrastructure currently in linux-next that can
convey poison events to filesystems. That might be a path to start
tracking and reporting precise failure information to address the
constraints of the badrange implementation.
Dan Williams July 15, 2022, 1:19 a.m. UTC | #6
Jane Chu wrote:
> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
> one call for each poison with accurate address.
> 
> Also, short ARS would find 2 poisons.
> 
> I attached the console output, my annotation is prefixed with "<==".

[29078.634817] {4}[Hardware Error]:   physical_address: 0x00000040a0602600		<== 2nd poison @ 0x600
[29078.642200] {4}[Hardware Error]:   physical_address_mask: 0xffffffffffffff00

Why is nfit_handle_mce() seeing a 4K address mask when the CPER record
is seeing a 256-byte address mask?

Sigh, is this "firmware-first" causing the kernel to get bad information
via the native mechanisms?

I would expect that if this test was truly worried about minimizing BIOS
latency it would disable firmware-first error reporting. I wonder if
that fixes the observed problem?
Jane Chu July 15, 2022, 5:26 p.m. UTC | #7
On 7/14/2022 6:19 PM, Dan Williams wrote:
> Jane Chu wrote:
>> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
>> one call for each poison with accurate address.
>>
>> Also, short ARS would find 2 poisons.
>>
>> I attached the console output, my annotation is prefixed with "<==".
> 
> [29078.634817] {4}[Hardware Error]:   physical_address: 0x00000040a0602600		<== 2nd poison @ 0x600
> [29078.642200] {4}[Hardware Error]:   physical_address_mask: 0xffffffffffffff00
> 
> Why is nfit_handle_mce() seeing a 4K address mask when the CPER record
> is seeing a 256-byte address mask?

Good question!  One would think both GHES reporting and 
nfit_handle_mce() are consuming the same mce record...
Who might know?

> 
> Sigh, is this "firmware-first" causing the kernel to get bad information
> via the native mechanisms?
> 
> I would expect that if this test was truly worried about minimizing BIOS
> latency it would disable firmware-first error reporting. I wonder if
> that fixes the observed problem?

Could you elaborate on firmware-first error handling, please?  What are the 
possible consequences of disabling it? And how can it be disabled?

thanks!
-jane
Jane Chu July 15, 2022, 5:38 p.m. UTC | #8
On 7/14/2022 5:58 PM, Dan Williams wrote:
[..]
>>>
>>>>> However, the ARS engine likely can return the precise error ranges so I
>>>>> think the fix is to just use the address range indicated by 1UL <<
>>>>> MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
>>>>> scrub request to ask the device for the precise error list.
>>>>
>>>> You mean for nfit_handle_mce() callback to issue a short ARS per each
>>>> poison report over a 4K range
>>>
>>> Over a L1_CACHE_BYTES range...
>>>
[..]
>>>
>>> For the badrange tracking, no. So this would just be a check to say
>>> "Yes, CPU I see you think the whole 4K is gone, but lets double check
>>> with more precise information for what gets placed in the badrange
>>> tracking".
>>
>> Okay, process-wise, this is what I am seeing -
>>
>> - for each poison, nfit_handle_mce() issues a short ARS given (addr,
>> 64bytes)
> 
> Why would the short-ARS be performed over a 64-byte span when the MCE
> reported a 4K aligned event?

Cuz you said so, see above. :)  Yes, 4K range as reported by the MCE 
makes sense.

> 
>> - and short ARS returns to say that's actually (addr, 256bytes),
>> - and then nvdimm_bus_add_badrange() logs the poison in (addr, 512bytes)
>> anyway.
> 
> Right, I am reacting to the fact that the patch is picking 512 as an
> arbtitrary blast radius. It's ok to expand the blast radius from
> hardware when, for example, recording a 64-byte MCE in badrange which
> only understands 512 byte records, but it's not ok to take a 4K MCE and
> trim it to 512 bytes without asking hardware for a more precise report.

Agreed.

> 
> Recall that the NFIT driver supports platforms that may not offer ARS.
> In that case the 4K MCE from the CPU is all that the driver gets and
> there is no data source for a more precise answer.
> 
> So the ask is to avoid trimming the blast radius of MCE reports unless
> and until a short-ARS says otherwise.
> 

What happens to short ARS on a platform that doesn't support ARS?
-EOPNOTSUPP?

>> The precise badrange from short ARS is lost in the process, given the
>> time spent visiting the BIOS, what's the gain?
> 
> Generic support for not under-recording poison on platforms that do not
> support ARS.
> 
>> Could we defer the precise badrange until there is consumer of the
>> information?
> 
> Ideally the consumer is immediate and this precise information can make
> it to the filesystem which might be able to make a better decision about
> what data got clobbered.
> 
> See dax_notify_failure() infrastructure currently in linux-next that can
> convey poison events to filesystems. That might be a path to start
> tracking and reporting precise failure information to address the
> constraints of the badrange implementation.

Yes, I'm aware of dax_notify_failure(), but I would appreciate it if you 
could elaborate on how that code path could be leveraged for a precise 
badrange implementation.
My understanding is that dax_notify_failure() is in the path of a 
synchronous fault accompanied by SIGBUS with BUS_MCEERR_AR.
But badrange could be recorded without poison being consumed, even 
without a DAX filesystem in the picture.

thanks,
-jane
Dan Williams July 15, 2022, 7:17 p.m. UTC | #9
[ add Tony ]

Jane Chu wrote:
> On 7/14/2022 6:19 PM, Dan Williams wrote:
> > Jane Chu wrote:
> >> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
> >> one call for each poison with accurate address.
> >>
> >> Also, short ARS would find 2 poisons.
> >>
> >> I attached the console output, my annotation is prefixed with "<==".
> > 
> > [29078.634817] {4}[Hardware Error]:   physical_address: 0x00000040a0602600		<== 2nd poison @ 0x600
> > [29078.642200] {4}[Hardware Error]:   physical_address_mask: 0xffffffffffffff00
> > 
> > Why is nfit_handle_mce() seeing a 4K address mask when the CPER record
> > is seeing a 256-byte address mask?
> 
> Good question!  One would think both GHES reporting and 
> nfit_handle_mce() are consuming the same mce record...
> Who might know?

Did some grepping...

Have a look at: apei_mce_report_mem_error()

"The call is coming from inside the house!"

Luckily we do not need to contact a BIOS engineer to get this fixed.

> > Sigh, is this "firmware-first" causing the kernel to get bad information
> > via the native mechanisms?
> > 
> > I would expect that if this test was truly worried about minimizing BIOS
> > latency it would disable firmware-first error reporting. I wonder if
> > that fixes the observed problem?
> 
> Could you elaborate on firmware-first error please?  What are the 
> possible consequences disabling it? and how to disable it?

With my Linux kernel developer hat on, firmware-first error handling is
really only useful for supporting legacy operating systems that do not
have native machine check handling, or for platforms that have bugs that
would otherwise cause OS native error handling to fail. Otherwise, for
modern Linux, firmware-first error handling is pure overhead and a
source of bugs.

In this case the bug is in the Linux code that translates the ACPI event
back into an MCE record.
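
For illustration, the gist of the translation fix being alluded to (a
userspace sketch, not the actual apei_mce_report_mem_error() change): derive
the MISC address LSB from the CPER record's physical_address_mask instead of
assuming page granularity.

#include <stdio.h>
#include <stdint.h>

/* A mask of 0xffffffffffffff00 means the low 8 address bits are not
 * valid, i.e. a 256-byte blast radius, so the synthesized MCE should
 * carry LSB = 8 rather than a hardcoded PAGE_SHIFT of 12.
 */
static unsigned int mask_to_lsb(uint64_t physical_addr_mask)
{
	if (!physical_addr_mask)
		return 12;	/* no information: fall back to page granularity */
	return (unsigned int)__builtin_ctzll(physical_addr_mask);
}

int main(void)
{
	uint64_t mask = 0xffffffffffffff00ULL;	/* from the CPER record above */
	unsigned int lsb = mask_to_lsb(mask);

	/* prints: lsb=8 -> span=256 bytes */
	printf("lsb=%u -> span=%llu bytes\n", lsb, 1ULL << lsb);
	return 0;
}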
Jane Chu July 15, 2022, 10:46 p.m. UTC | #10
On 7/15/2022 12:17 PM, Dan Williams wrote:
> [ add Tony ]
> 
> Jane Chu wrote:
>> On 7/14/2022 6:19 PM, Dan Williams wrote:
>>> Jane Chu wrote:
>>>> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
>>>> one call for each poison with accurate address.
>>>>
>>>> Also, short ARS would find 2 poisons.
>>>>
>>>> I attached the console output, my annotation is prefixed with "<==".
>>>
>>> [29078.634817] {4}[Hardware Error]:   physical_address: 0x00000040a0602600		<== 2nd poison @ 0x600
>>> [29078.642200] {4}[Hardware Error]:   physical_address_mask: 0xffffffffffffff00
>>>
>>> Why is nfit_handle_mce() seeing a 4K address mask when the CPER record
>>> is seeing a 256-byte address mask?
>>
>> Good question!  One would think both GHES reporting and
>> nfit_handle_mce() are consuming the same mce record...
>> Who might know?
> 
> Did some grepping...
> 
> Have a look at: apei_mce_report_mem_error()
> 
> "The call is coming from inside the house!"
> 
> Luckily we do not need to contact a BIOS engineer to get this fixed.

Great, thank you!
I'll just put together a quick fix for review after I've tested it.

> 
>>> Sigh, is this "firmware-first" causing the kernel to get bad information
>>> via the native mechanisms?
>>>
>>> I would expect that if this test was truly worried about minimizing BIOS
>>> latency it would disable firmware-first error reporting. I wonder if
>>> that fixes the observed problem?
>>
>> Could you elaborate on firmware-first error please?  What are the
>> possible consequences disabling it? and how to disable it?
> 
> With my Linux kernel developer hat on, firmware-first error handling is
> really only useful for supporting legacy operating systems that do not
> have native machine check handling, or for platforms that have bugs that
> would otherwise cause OS native error handling to fail. Otherwise, for
> modern Linux, firmware-first error handling is pure overhead and a
> source of bugs.
> 
> In this case the bug is in the Linux code that translates the ACPI event
> back into an MCE record.

Thanks!

-jane

Patch

diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index d48a388b796e..eeacc8eb807f 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -32,7 +32,6 @@  static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
 	 */
 	mutex_lock(&acpi_desc_lock);
 	list_for_each_entry(acpi_desc, &acpi_descs, list) {
-		unsigned int align = 1UL << MCI_MISC_ADDR_LSB(mce->misc);
 		struct device *dev = acpi_desc->dev;
 		int found_match = 0;
 
@@ -64,7 +63,8 @@  static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
 
 		/* If this fails due to an -ENOMEM, there is little we can do */
 		nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
-				ALIGN_DOWN(mce->addr, align), align);
+				ALIGN(mce->addr, SECTOR_SIZE),
+				SECTOR_SIZE);
 		nvdimm_region_notify(nfit_spa->nd_region,
 				NVDIMM_REVALIDATE_POISON);