
[v18,04/19] EDAC: Add memory repair control feature

Message ID 20250106121017.1620-5-shiju.jose@huawei.com (mailing list archive)
State Handled Elsewhere, archived
Series EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers

Commit Message

Shiju Jose Jan. 6, 2025, 12:10 p.m. UTC
From: Shiju Jose <shiju.jose@huawei.com>

Add a generic EDAC memory repair control driver to manage memory repairs
in the system, such as CXL Post Package Repair (PPR) and CXL memory sparing
features.

For example, a CXL device with DRAM components that support PPR features
may implement PPR maintenance operations. DRAM components may support two
types of PPR: hard PPR, for a permanent row repair, and soft PPR, for a
temporary row repair. Soft PPR is much faster than hard PPR, but the repair
is lost with a power cycle.
Similarly, a CXL memory device may support soft and hard memory sparing at
cacheline, row, bank and rank granularities. Memory sparing is defined as
a repair function that replaces a portion of memory with a portion of
functional memory at that same granularity.
When a CXL device detects an error in memory, it may inform the host of
the need for a repair maintenance operation by using an event record where
the "maintenance needed" flag is set. The event record contains the device
physical address (DPA) and other attributes of the memory to repair (such as
channel, sub-channel, bank group, bank, rank, row, column, etc.). The kernel
will report the corresponding CXL general media or DRAM trace event to
userspace, and userspace tools (e.g. rasdaemon) will initiate a repair
operation in response to the device request via the sysfs repair control.

A device with memory repair features registers with the EDAC device driver,
which retrieves the memory repair descriptor from the EDAC memory repair driver
and exposes the sysfs repair control attributes to userspace in
/sys/bus/edac/devices/<dev-name>/mem_repairX/.

The common memory repair control interface abstracts the control of
arbitrary memory repair functionality into a standardized set of functions.
The sysfs memory repair attribute nodes are only available if the client
driver has implemented the corresponding attribute callback function and
provided operations to the EDAC device driver during registration.
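
To make the registration flow concrete, here is a minimal sketch of a client
driver hooking in. The structure, callback and helper names below are
assumptions based on this series' style and may not match the final code
exactly:

/*
 * Sketch of a client driver registering a memory repair feature.
 * Only the callbacks a driver implements become visible sysfs
 * attributes under .../mem_repairX/. my_get_bank() etc. are the
 * driver's own accessors (not shown).
 */
static const struct edac_mem_repair_ops my_repair_ops = {
	.get_bank  = my_get_bank,	/* read back the staged bank */
	.set_bank  = my_set_bank,	/* stage the bank to repair */
	.do_repair = my_do_repair,	/* trigger the repair operation */
};

static int my_probe(struct platform_device *pdev)
{
	struct edac_dev_feature feat = {
		.ft_type	= RAS_FEAT_MEM_REPAIR,
		.instance	= 0,
		.mem_repair_ops	= &my_repair_ops,
		.ctx		= my_ctx,	/* driver private data */
	};

	/* creates /sys/bus/edac/devices/<dev-name>/mem_repair0/ */
	return edac_dev_register(&pdev->dev, "my_dev", NULL, 1, &feat);
}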

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 .../ABI/testing/sysfs-edac-memory-repair      | 244 +++++++++
 Documentation/edac/features.rst               |   3 +
 Documentation/edac/index.rst                  |   1 +
 Documentation/edac/memory_repair.rst          | 101 ++++
 drivers/edac/Makefile                         |   2 +-
 drivers/edac/edac_device.c                    |  33 ++
 drivers/edac/mem_repair.c                     | 492 ++++++++++++++++++
 include/linux/edac.h                          | 139 +++++
 8 files changed, 1014 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/ABI/testing/sysfs-edac-memory-repair
 create mode 100644 Documentation/edac/memory_repair.rst
 create mode 100755 drivers/edac/mem_repair.c

Comments

Borislav Petkov Jan. 9, 2025, 9:19 a.m. UTC | #1
On Mon, Jan 06, 2025 at 12:10:00PM +0000, shiju.jose@huawei.com wrote:
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_nibble_mask
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank_group
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_rank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_row
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_column
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_channel
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_sub_channel
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_nibble_mask
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank_group
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_rank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_row
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_column
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_channel
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_sub_channel

So this is new. I don't remember seeing that when I looked at your patches
the last time.

Looks like you have all those attributes and now you've decided to add a min
and max for each one, in addition. And UI-wise it is madness as there are a
gazillion single-value files now.

"Attributes should be ASCII text files, preferably with only one value per
file. It is noted that it may not be efficient to contain only one value per
file, so it is socially acceptable to express an array of values of the same
type."

So you don't need those - you can simply express each attribute as a range:

echo "1:2" > /sys/bus/edac/devices/<dev-name>/mem_repairX/bank

or if you wanna scrub only one bank:

echo "1:1" > /sys/bus/edac/devices/<dev-name>/mem_repairX/bank

What is the use case of that thing?

Someone might find it useful so let's add it preemptively?

Pfff.
Shiju Jose Jan. 9, 2025, 11 a.m. UTC | #2
>-----Original Message-----
>From: Borislav Petkov <bp@alien8.de>
>Sent: 09 January 2025 09:19
>To: Shiju Jose <shiju.jose@huawei.com>
>Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
>
>On Mon, Jan 06, 2025 at 12:10:00PM +0000, shiju.jose@huawei.com wrote:
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_nibble_mask
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank_group
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_rank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_row
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_column
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_channel
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_sub_channel
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_nibble_mask
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank_group
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_rank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_row
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_column
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_channel
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_sub_channel
>
>So this is new. I don't remember seeing that when I looked at your patches the
>last time.
>
>Looks like you have all those attributes and now you've decided to add a min and
>max for each one, in addition. And UI-wise it is madness as there are a
>gazillion single-value files now.
>

Thanks for the feedback.

The min_ and max_ attributes of the control attributes were added in response to your
feedback on v15, to expose the supported ranges of these control attributes to the user;
see the following links.
However, these min_ and max_ attributes are 'RO' rather than 'RW' as specified in the doc;
that is to be fixed in the doc.
https://lore.kernel.org/lkml/20241114133249.GEZzX8ATNyc_Xw1L52@fat_crate.local/
https://lore.kernel.org/lkml/fa5d6bdd08104cf1a09c4960a0f9bc46@huawei.com/
https://lore.kernel.org/lkml/20241119123657.GCZzyGaZIExvUHPLKL@fat_crate.local/

>"Attributes should be ASCII text files, preferably with only one value per file. It is
>noted that it may not be efficient to contain only one value per file, so it is
>socially acceptable to express an array of values of the same type."
>
>So you don't need those - you can simply express each attribute as a range:
>
>echo "1:2" > /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
>
>or if you wanna scrub only one bank:

After internal discussion, we think this is the source of the confusion.
This is not scrub, where a range would indeed make sense. It is repair.
We are not aware of a failure mechanism where a set of memory banks
would fail together but not the whole of the next level up in the memory topology.

In theory we might get a stupid device design where it reports coarse-level
errors but can only repair at fine levels, where a range might be appropriate.
We are not sure that makes sense in practice, and with a range interface we would
get a mess like running out of repair resources halfway through a list with
no visibility of what has been repaired.

However, given that the repair flow is driven by userspace receiving error records
that contain the only possible values to repair, we think these bounds on what can be
repaired are a nice-to-have rather than necessary, so we would propose we do not
add max_ and min_ for now and see how the use cases evolve.
>
>echo "1:1" > /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
>
>What is the use case of that thing?
>
>Someone might find it useful so let's add it preemptively?
>
>Pfff.
>
>--
>Regards/Gruss,
>    Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette

Thanks,
Shiju
Borislav Petkov Jan. 9, 2025, 12:32 p.m. UTC | #3
On Thu, Jan 09, 2025 at 11:00:43AM +0000, Shiju Jose wrote:
> The min_ and max_ attributes of the control attributes were added in response to your
> feedback on v15, to expose the supported ranges of these control attributes to the user;
> see the following links.

Sure, but you can do that differently:

cat /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
[x:y]

which is the allowed range.

echo ... 

then writes in the bank.

> ... so we would propose we do not add max_ and min_ for now and see how the
> use cases evolve.

Yes, you should apply that same methodology to the rest of the new features
you're adding: only add functionality for the stuff that is actually being
used now. You can always extend it later.

Changing an already user-visible API is a whole different story and a lot lot
harder, even impossible.

So I'd suggest you prune the EDAC patches from all the hypothetical usage and
then send only what remains so that I can try to queue them.

Thx.
Jonathan Cameron Jan. 9, 2025, 2:24 p.m. UTC | #4
On Thu, 9 Jan 2025 13:32:22 +0100
Borislav Petkov <bp@alien8.de> wrote:

Hi Boris,

> On Thu, Jan 09, 2025 at 11:00:43AM +0000, Shiju Jose wrote:
> > The min_ and max_ attributes of the control attributes were added in response to your
> > feedback on v15, to expose the supported ranges of these control attributes to the user;
> > see the following links.
> 
> Sure, but you can do that differently:
> 
> cat /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
> [x:y]
> 
> which is the allowed range.

To my thinking that would fail the test of being an intuitive interface.
To issue a repair command requires that multiple attributes be configured
before triggering the actual repair.

Think of it as setting the coordinates of the repair in a high dimensional
space.

In the extreme case of fine-grained repair (cacheline), to identify the
relevant subunit of memory (obtained from the error record that we are
basing the decision to repair on) we need to specify all of:

Channel, sub-channel, rank, bank group, row, column and nibble mask.
For coarser-granularity repair only a subset of these applies, and
only the relevant controls are exposed to userspace.

They are broken out as specific attributes to enable each to be set before
triggering the action with a write to the repair attribute.
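
For concreteness, the resulting userspace flow looks something like the
following sketch (in C; the device name, attribute values and the use of a
"repair" attribute as the final trigger are illustrative, and error handling
is trimmed):

/*
 * Stage the repair coordinates from an error record, then trigger
 * the repair with a final write.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void set_attr(const char *dir, const char *attr, const char *val)
{
	char path[256];
	int fd;

	snprintf(path, sizeof(path), "%s/%s", dir, attr);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return;
	write(fd, val, strlen(val));
	close(fd);
}

int main(void)
{
	const char *d = "/sys/bus/edac/devices/cxl_mem0/mem_repair0";

	/* coordinates copied from the CXL DRAM error trace event */
	set_attr(d, "channel", "1");
	set_attr(d, "rank", "0");
	set_attr(d, "bank_group", "2");
	set_attr(d, "bank", "3");
	set_attr(d, "row", "0x1a2b");
	set_attr(d, "column", "7");
	set_attr(d, "nibble_mask", "0x3");

	set_attr(d, "repair", "1");	/* the "commit"/"start" write */
	return 0;
}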

There are several possible alternatives:

Option 1

"A:B:C:D:E:F:G:H:I:J" - an opaque single write to trigger the repair, where
each number provides one of those coordinates and where a readback
lets us know what each number is.

That single attribute interface is very hard to extend in an intuitive way.

History tells us more levels will be introduced in the middle, not just
at the finest granularity, making such an interface hard to extend in
a backwards-compatible way.

Another alternative, a key-value list, would make for a nasty sysfs
interface.

Option 2 
There are sysfs interfaces that use a selection type presentation.

Write: "C", Read: "A, B, [C], D" but that only works well for discrete sets
of options and is a pain to parse if read back is necessary.

So in conclusion, I think the proposed multiple-sysfs-attribute style,
with the attributes reading back the most recent value written, is the least
bad solution to a complex control interface.

> 
> echo ... 
> 
> then writes in the bank.
> 
> > ... so we would propose we do not add max_ and min_ for now and see how the
> > use cases evolve.  
> 
> Yes, you should apply that same methodology to the rest of the new features
> you're adding: only add functionality for the stuff that is actually being
> used now. You can always extend it later.
> 
> Changing an already user-visible API is a whole different story and a lot lot
> harder, even impossible.
> 
> So I'd suggest you prune the EDAC patches from all the hypothetical usage and
> then send only what remains so that I can try to queue them.

Sure. In this case the addition of min/max was perhaps a wrong response to
your request for a way to discover those ranges, rather than just rejecting a write
of something out of range as earlier versions did.

We can revisit in future if range discovery becomes necessary.  Personally
I don't think it is, given we are only taking these actions in response to error
records that give us precisely what to write and hence are always in range.

Jonathan

> 
> Thx.
>
Borislav Petkov Jan. 9, 2025, 3:18 p.m. UTC | #5
On Thu, Jan 09, 2025 at 02:24:33PM +0000, Jonathan Cameron wrote:
> To my thinking that would fail the test of being an intuitive interface.
> To issue a repair command requires that multiple attributes be configured
> before triggering the actual repair.
> 
> Think of it as setting the coordinates of the repair in a high dimensional
> space.

Why?

You can write every attribute in its separate file and have a "commit" or
"start" file which does that.

Or you can designate a file which starts the process. This is how I'm
injecting errors on x86:

see readme_msg here: arch/x86/kernel/cpu/mce/inject.c

More specifically:

"flags:\t Injection type to be performed. Writing to this file will trigger a\n"
"\t real machine check, an APIC interrupt or invoke the error decoder routines\n"
"\t for AMD processors.\n"

So you set everything else, and as the last step you set the injection type
*and* you also trigger it with this one write.

> Sure. In this case the addition of min/max was perhaps a wrong response to
> your request for a way to discover those ranges, rather than just rejecting a write
> of something out of range as earlier versions did.
> 
> We can revisit in future if range discovery becomes necessary.  Personally
> I don't think it is, given we are only taking these actions in response to error
> records that give us precisely what to write and hence are always in range.

My goal here was to make this user-friendly. Because you need some way of
knowing what the valid ranges are in order to trigger the repair, if it needs
to happen for a range.

Or, you can teach the repair logic to ignore invalid ranges and "clamp" things
to whatever makes sense.

Again, I'm looking at it from the usability perspective. I haven't actually
needed this scrub+repair functionality yet to know whether the UI makes sense.
So yeah, collecting some feedback from real-life use cases would probably give
you a lot better understanding of how that UI should be designed... perhaps
you won't ever need the ranges, who knows.

So yes, preemptively designing stuff like that "in the dark" is kinda hard.
:-)
Jonathan Cameron Jan. 9, 2025, 4:01 p.m. UTC | #6
On Thu, 9 Jan 2025 16:18:54 +0100
Borislav Petkov <bp@alien8.de> wrote:

> On Thu, Jan 09, 2025 at 02:24:33PM +0000, Jonathan Cameron wrote:
> > To my thinking that would fail the test of being an intuitive interface.
> > To issue a repair command requires that multiple attributes be configured
> > before triggering the actual repair.
> > 
> > Think of it as setting the coordinates of the repair in a high dimensional
> > space.  
> 
> Why?

Ok. To me the fact it's not a single write was relevant. Seems not
in your mental model of how this works.  For me a single write
that you cannot query back is fine, setting lots of parameters and
being unable to query any of them less so.  I guess you disagree.
In the interests of progress I'm not going to argue further. No one is
going to use this interface by hand anyway, so the loss of usability
I'm seeing doesn't matter a lot.

> 
> You can write every attribute in its separate file and have a "commit" or
> "start" file which does that.

That's what we have.

> 
> Or you can designate a file which starts the process. This is how I'm
> injecting errors on x86:
> 
> see readme_msg here: arch/x86/kernel/cpu/mce/inject.c
> 
> More specifically:
> 
> "flags:\t Injection type to be performed. Writing to this file will trigger a\n"
> "\t real machine check, an APIC interrupt or invoke the error decoder routines\n"
> "\t for AMD processors.\n"
> 
> So you set everything else, and as the last step you set the injection type
> *and* you also trigger it with this one write.

Agreed. I'm not sure of the relevance though. This is how it works and
there is no proposal to change that.  

What I was trying to argue for was an interface that lets you set all the
coordinates and read back what they were before hitting go.

> 
> > Sure. In this case the addition of min/max was perhaps a wrong response to
> > your request for a way to those ranges rather than just rejecting a write
> > of something out of range as earlier version did.
> > 
> > We can revisit in future if range discovery becomes necessary.  Personally
> > I don't think it is given we are only taking these actions in response error
> > records that give us precisely what to write and hence are always in range.  
> 
> My goal here was to make this user-friendly. Because you need some way of
> knowing what the valid ranges are in order to trigger the repair, if it needs
> to happen for a range.

In at least the CXL case I'm fairly sure most of them are not discoverable.
Until you see errors you have no idea what the memory topology is.

> 
> Or, you can teach the repair logic to ignore invalid ranges and "clamp" things
> to whatever makes sense.

For that you'd need to have a path to read back what happened.

> 
> Again, I'm looking at it from the usability perspective. I haven't actually
> needed this scrub+repair functionality yet to know whether the UI makes sense.
> So yeah, collecting some feedback from real-life use cases would probably give
> you a lot better understanding of how that UI should be designed... perhaps
> you won't ever need the ranges, who knows.
> 
> So yes, preemptively designing stuff like that "in the dark" is kinda hard.
> :-)

The discoverability is unnecessary for any known use case.

Ok. Then can we just drop the range discoverability entirely, or go with
your suggestion and not support read back of what has been
requested, but instead have the reads return a range if known, or "" /
return -EOPNOTSUPP if simply not known?


I can live with that though to me we are heading in the direction of
a less intuitive interface to save a small number of additional files.

Jonathan

>
Borislav Petkov Jan. 9, 2025, 4:19 p.m. UTC | #7
On Thu, Jan 09, 2025 at 04:01:59PM +0000, Jonathan Cameron wrote:
> Ok. To me the fact it's not a single write was relevant. Seems not
> in your mental model of how this works.  For me a single write
> that you cannot query back is fine, setting lots of parameters and
> being unable to query any of them less so.  I guess you disagree.

Why can't you query it back?

grep -r . /sysfs/dir/

All files' values have been previously set and should still be there on
a read, I'd strongly hope. Your ->read routines should give the values back.

> In the interests of progress I'm not going to argue further. No one is
> going to use this interface by hand anyway, so the loss of usability
> I'm seeing doesn't matter a lot.

I had the suspicion that this user interface is not really going to be used by
a user but by a tool. But then if you don't have a tool, you're lost.

This is one of the reasons why you can control ftrace directly on the shell
too - without a tool. This is very useful in certain cases where you cannot
run some userspace tools.

> In at least the CXL case I'm fairly sure most of them are not discoverable.
> Until you see errors you have no idea what the memory topology is.

Ok.

> For that you'd need to have a path to read back what happened.

So how is this scrubbing going to work? You get an error, you parse it for all
the attributes and you go and write those attributes into the scrub interface
and it starts scrubbing?

But then why do you even need the interface at all?

Why can't the kernel automatically collect all those attributes and start the
scrubbing automatically - no need for any user interaction...?

So why do you *actually* even need user interaction here and why can't the
kernel be smart enough to start the scrub automatically?

> Ok. Then can we just drop the range discoverability entirely, or go with
> your suggestion and not support read back of what has been
> requested, but instead have the reads return a range if known, or "" /
> return -EOPNOTSUPP if simply not known?

Probably.

> I can live with that though to me we are heading in the direction of
> a less intuitive interface to save a small number of additional files.

This is not the point. I already alluded to this earlier - we're talking about
a user-visible interface which, once it goes out, is cast in stone forever.

So those files better have a good reason to exist...

And if we're not sure yet, we can upstream only those which are fine now and
then continue discussing the rest.

HTH.
Jonathan Cameron Jan. 9, 2025, 6:34 p.m. UTC | #8
On Thu, 9 Jan 2025 17:19:02 +0100
Borislav Petkov <bp@alien8.de> wrote:
	
> On Thu, Jan 09, 2025 at 04:01:59PM +0000, Jonathan Cameron wrote:
> > Ok. To me the fact it's not a single write was relevant. Seems not
> > in your mental model of how this works.  For me a single write
> > that you cannot query back is fine, setting lots of parameters and
> > being unable to query any of them less so.  I guess you disagree.  
> 
> Why can't you query it back?
> 
> grep -r . /sysfs/dir/
> 
> All files' values have been previously set and should still be there on
> a read, I'd strongly hope. Your ->read routines should give the values back.

Today you can.  Seems we are talking at cross purposes.

I'm confused. I thought your proposal was for the "bank" attribute to present an
allowed range on read.
The "bank" attribute is currently written to and read back as the value of the bank on which
to conduct a repair.  Maybe this disconnect is down to the fact that the max_ and min_
attributes should have been marked as RO in the docs. They aren't controls,
just a presentation of limits to userspace.

Was the intent a separate bank_range type attribute rather than max_bank and min_bank?
One of those would be absolutely fine (similar to the _available attributes
in IIO - I added those years ago to meet a similar need and we've never had
any issues with those).

> 
> > In the interests of progress I'm not going to argue further. No one is
> > going to use this interface by hand anyway, so the loss of usability
> > I'm seeing doesn't matter a lot.
> 
> I had the suspicion that this user interface is not really going to be used by
> a user but by a tool. But then if you don't have a tool, you're lost.
> 
> This is one of the reasons why you can control ftrace directly on the shell
> too - without a tool. This is very useful in certain cases where you cannot
> run some userspace tools.

I fully agree. What I was saying was in response to me thinking you wanted it
to not be possible to read back the user-set values (overlapping uses of a
single bank attribute, which wasn't what you meant). That is useful for a user
wanting to do the cat /sys/... that you mention above, but not vital if they are
directly reading the tracepoints for the error records and poking the
sysfs interface.

Given it seems I misunderstood that suggestion, ignore my reply to that
as irrelevant.
 
> 
> > In at least the CXL case I'm fairly sure most of them are not discoverable.
> > Until you see errors you have no idea what the memory topology is.  
> 
> Ok.
> 
> > For that you'd need to have a path to read back what happened.  
> 
> So how is this scrubbing going to work? You get an error, you parse it for all
> the attributes and you go and write those attributes into the scrub interface
> and it starts scrubbing?

Repair, not scrubbing. They are different things we should keep separate:
scrub corrects the value, if it can, but doesn't change the underlying memory to
new memory cells to avoid repeated errors. Replacing scrub with repair 
(which I think was the intent here)...

You get error records that describe the error seen in hardware, write back the
values into this interface and tell it to repair the memory.  This is not
necessarily a synchronous or immediate thing - instead typically based on
trend analysis.

As an example, the decision might be that a bit of RAM threw up 3 errors
over a month, including multiple system reboots (for other reasons), and
that is over some threshold, so we use a spare memory line to replace it.

> 
> But then why do you even need the interface at all?
> 
> Why can't the kernel automatically collect all those attributes and start the
> scrubbing automatically - no need for any user interaction...?
> 
> So why do you *actually* even need user interaction here and why can't the
> kernel be smart enough to start the scrub automatically?

Short answer: it needs to be very smart and there isn't a case of one size
fits all - hence the suggested approach of making it a userspace problem.

There are hardware autonomous solutions and ones handled by host firmware.
That is how repair is done in many servers - at most, software sees a slight
latency spike as the memory is repaired under the hood. Some CXL devices
will do this as well. Those CXL devices may provide an additional repair
interface for the less clear cut decisions that need more data processing
/ analysis than the device firmware is doing. Other CXL devices will take
the view the OS is best placed to make all the decisions - those sometimes
will give a 'maintenance needed' indication in the error records but that
is still a hint the host may or may not take any notice of.

Given that, in the systems being considered here, software is triggering the repair,
we want to allow for policy in the decision. In simple cases we could push
that policy into the kernel e.g. just repair the moment we see an error record.

These repair resources are very limited in number, so immediately repairing
may be a bad idea. We want to build up a history of errors before making
such a decision.  That can be done in kernel. 

The decision to repair memory is heavily influenced by policy and time considerations
against device resource constraints.

Some options that are hard to do in kernel.

1. Typical asynchronous error report for a corrected error.

   Tells us memory had an error (perhaps from a scrubbing engine on the device
   running checks). No need to take action immediately. Instead build up more data
   over time and if lots of errors occur make the decision to repair as now we are sure it
   is worth doing rather than a single random event. We may tune scrubbing engines
   to check this memory more frequently and adjust our data analysis to take that
   into account for setting thresholds etc.
   When an admin considers it a good time to take action, offline the memory and
   repair before bringing it back into use (sometimes by rebooting the machine).
   Sometimes repair can be triggered in a software transparent way, sometimes not.
   This also applies to uncorrectable errors though in that case you can't necessarily
   repair it without ever seeing a synchronous poison with all the impacts that has.

2. Soft repair across boots.  We are actually storing the error records, then only
   applying the fix on reboot before using the memory - so maintaining a list
   of bad memory and saving it to a file to read back on boot. We could provide
   another kernel interface to get this info and reinject it after reboot instead
   of doing it in userspace but that is another ABI to design.

3. Complex policy across fleets.  A lot of work is going on around prediction techniques
   that may change the local policy on each node dependent on the overall reliability
   patterns of a particular batch of devices and local characteristics, service guarantees
   etc. If it is hard repair, then once you've run out you need to schedule an engineer
   out to replace the DIMM. All complex inputs to the decision.

Similar cases like CPU offlining on repeated errors are done in userspace (e.g.
RAS Daemon) for similar reasons of long term data gathering and potentially
complex algorithms.
  
> 
> > Ok. Then can we just drop the range discoverability entirely or we go with
> > your suggestion and do not support read back of what has been
> > requested but instead have the reads return a range if known or "" /
> > return -EONOTSUPP if simply not known?  
> 
> Probably.

Too many options in the above paragraph so just to check...  Probably to which?
If it's a separate attribute from the one we write the control to, then
we do what is already done here and don't present the interface at all if
the range isn't discoverable.

> 
> > I can live with that though to me we are heading in the direction of
> > a less intuitive interface to save a small number of additional files.  
> 
> This is not the point. I already alluded to this earlier - we're talking about
> a user-visible interface which, once it goes out, is cast in stone forever.
> 
> So those files better have a good reason to exist...
> 
> And if we're not sure yet, we can upstream only those which are fine now and
> then continue discussing the rest.

Ok. Best path is drop the available range support then (so no min_ max_ or
anything to replace them for now).

Added bonus is we don't have to rush this conversation and can make sure we
come to the right solution driven by use cases.

Jonathan

> HTH.
>
Dan Williams Jan. 9, 2025, 11:51 p.m. UTC | #9
Jonathan Cameron wrote:
> Ok. Best path is drop the available range support then (so no min_ max_ or
> anything to replace them for now).

I think less is more in this case.

The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too
wide for userspace to have a chance at writing a competent tool. At
least I am struggling with where to even begin with those ABIs if I was
asked to write a tool. Does a tool already exist for those?

Some questions that read on those ABIs are:

1/ What if the platform has translation between HPA (CXL decode) and SPA
(physical addresses reported in trace points that PIO and DMA see)?

2/ What if memory is interleaved across repair domains? 

3/ What if the device does not use DDR terminology / topology terms for
repair?

I expect the flow rasdaemon would want is that the current PFA (leaky
bucket Pre-Failure Analysis) decides that the number of soft-offlines it
has performed exceeds some threshold and it wants to attempt to repair
memory.

However, what is missing today for volatile memory is that some failures
can be repaired with in-band writes and some failures need heavier
hammers like Post-Package-Repair to actively swap in whole new banks of
memory. So don't we need something like "soft-offline-undo" on the way
to PPR?

So, yes, +1 to simpler for now where software effectively just needs to
deal with a handful of "region repair" buttons and the semantics of
those are coarse and sub-optimal. Wait for a future where a tool author
says, "we have had good success getting bulk offlined pages back into
service, but now we need this specific finer grained kernel interface to
avoid wasting spare banks prematurely".

Anything more complex than "a set of /sys/devices/system/memory/
devices has a /sys/bus/edac/devices/devX/repair button" feels like a
generation ahead of where the initial sophistication needs to lie.

That said, I do not closely follow ras tooling to say whether someone
has already identified the critical need for a fine grained repair ABI?
Jonathan Cameron Jan. 10, 2025, 11:01 a.m. UTC | #10
On Thu, 9 Jan 2025 15:51:39 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan Cameron wrote:
> > Ok. Best path is drop the available range support then (so no min_ max_ or
> > anything to replace them for now).  
> 
> I think less is more in this case.

A few notes before I get to specific questions.

Key in the discussion that follows is that the 'repair' is separate
from the 'decision to repair'.  They mostly need different information
all of which is in the error trace points. A lot of the questions
are about the 'decision to repair' part not the repair itself.

The critical point is that on some devices (all CXL ones) there may
be no direct way to discover the mapping from HPA/SPA/DPA to bank, row, etc.
other than the error record. The memory mapping functions are an internal
detail not exposed on any interfaces. Odd though it may seem, those
mapping functions are considered confidential enough that manufacturers
don't always publish them (though I believe they are fairly easy to reverse
engineer) - I know a team whose job involves designing those.

Anyhow, short of the kernel or RAS Daemon carrying a lookup table
of all known devices (no support for new ones until they are added), we
can't do a reverse map from DPA etc. to bank. There are complex ways
round this, like storing the mappings when issuing an error record
to build up the necessary reverse map, but that would have to be
preserved across boot. These errors tend not to be frequent, so cross
reboot / kexec etc. need to be incorporated.

PPR on CXL does use DPA, but memory sparing commands are meant to
supersede that interface (the reason for that is perhaps bordering on
consortium confidential, but let's say it doesn't work well for some
cases). Memory sparing does not use DPA.

I'd advise mostly ignoring PPR and looking at memory sparing in
the CXL spec if you want to consider an example. For PPR DPA is used
(there is an HPA option that might be available). DPA is still needed
for on boot soft repair (or we could delay that until regions configured,
but then we'd need to do DPA to HPA mapping as that will be different
on a new config, and then write HPA for the kernel to map it back to DPA).

> 
> The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too
> wide for userspace to have a chance at writing a competent tool. At
> least I am struggling with where to even begin with those ABIs if I was
> asked to write a tool. Does a tool already exist for those?

There is little choice on that - those are the controls for this type
of repair. If we do something like a magic 'key' based on a concatenation
of those values we need to define that description to replace a clean
self describing interface. I'm not 100% against that but I think it would
be a terrible interface design and I don't think anyone is really in favor of it.

All a userspace tool does is read the error record fields of
exactly those names.  From that it will log data (already happening under
those names in RAS daemon alongside HPA/ DPA).  Then, in simplest case,
a threshold is passed and we write those values to the repair interface. 

There is zero need in that simple case for these to be understood at all.
You can think of them as a complex key but divided into well defined fields. 

For more complex decision algorithms, that structure info may be needed
to make the decision. As a dumb example, maybe certain banks are more
error prone on a batch of devices so we need a lower threshold before repairing.

Simplest case is maybe 20-30 lines of code looping over the result of an SQL
query on the RASDaemon DB and writing the values to the files.
Not the most challenging userspace tool.  The complexity is in
the analysis of the errors, not this part. I don't think we bothered
doing this one yet in rasdaemon because we considered it obvious enough
an example wasn't needed. (Mauro / Shiju, does that estimate sound reasonable?)
We would need a couple of variants but those map 1:1 with the variants of
error record parsing and logging RAS Daemon already has.
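
For illustration, a sketch of that loop in C (the DB schema, table and column
names are made up rather than rasdaemon's actual schema, and set_attr() is a
sysfs write helper along the lines of the earlier example):

/* Replay logged error coordinates into the repair controls. */
#include <sqlite3.h>
#include <stdio.h>

void set_attr(const char *dir, const char *attr, const char *val);

static void repair_from_db(const char *db_path, const char *sysfs_dir)
{
	static const char *attrs[] = {
		"channel", "rank", "bank_group", "bank", "row", "column",
	};
	const char *q = "SELECT channel, rank, bank_group, bank, row, col "
			"FROM cxl_dram_errors WHERE err_count > 3;";
	sqlite3_stmt *stmt;
	sqlite3 *db;

	sqlite3_open(db_path, &db);
	sqlite3_prepare_v2(db, q, -1, &stmt, NULL);
	while (sqlite3_step(stmt) == SQLITE_ROW) {
		char val[32];
		int i;

		for (i = 0; i < 6; i++) {
			snprintf(val, sizeof(val), "%lld",
				 (long long)sqlite3_column_int64(stmt, i));
			set_attr(sysfs_dir, attrs[i], val);
		}
		set_attr(sysfs_dir, "repair", "1");	/* trigger */
	}
	sqlite3_finalize(stmt);
	sqlite3_close(db);
}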

> 
> Some questions that read on those ABIs are:
> 
> 1/ What if the platform has translation between HPA (CXL decode) and SPA
> (physical addresses reported in trace points that PIO and DMA see)?

See later for discussion of other interfaces. This is assuming the
repair key is not HPA (it is on some systems / situations) - if it's
the repair key then that is easily dealt with.

HPA / SPA are more or less irrelevant for repair itself; they are relevant
for the decision to repair. In the 'on reboot' soft repair case they may
not even exist at the time of repair as it would be expected to happen
before we've brought up a region (to get the RAM into a good state at boot).

For cases where the memory decoders are configured and so there is an HPA
to DPA mapping:
The trace reports provide all these somewhat magic values and
the HPA.  Thus we can do the HPA aware stuff on that before then looking
up the other bit of the appropriate error reports to get the bank row etc.

> 
> 2/ What if memory is interleaved across repair domains? 

Also not relevant to a repair control and only a little relevant to the
decision to repair.  The domains would be handled separately but if
we have to offline a chunk of memory to do it (needed for repair
on some devices) that may be a bigger chunk if fine-grained interleave is
in use, and that may affect the decision.

> 
> 3/ What if the device does not use DDR terminology / topology terms for
> repair?

Then we provide the additional interfaces, assuming they correspond to
well-known terms.  If someone is using a magic key then we can get grumpy
with them, but that can also be supported.

Mostly I'd expect a new technology to overlap a lot of the existing
interface and maybe add one or two more; which layer in the stack for
HBM for instance.

The main alternative is where the device takes an HPA / SPA / DPA. We have one
driver that does that queued up behind this series that uses HPA. PPR uses
DPA.  In that case userspace first tries to see if it can repair by HPA then
DPA and if not moves on to see if it can use the fuller description.
We will see devices supporting HPA / DPA (which to use depends on when you
are doing the operation and what has been configured) but mostly I'd expect
either HPA/DPA or fine grained on a given repair instance.

HPA only works if the address decoders are always configured (so not on CXL).
What is actually happening in that case is typically that a firmware is
involved that can look up address decoders etc, and map the control HPA
to Bank / row etc to issue the actual low level commands.  This keeps
the memory mapping completely secret rather than exposing it in error
records.

> 
> I expect the flow rasdaemon would want is that the current PFA (leaky
> bucket Pre-Failure Analysis) decides that the number of soft-offlines it
> has performed exceeds some threshold and it wants to attempt to repair
> memory.

Sparing may happen prior to the point where we'd have done a soft offline
if non-disruptive (whether it is can be read from another bit of the
ABI).  Memory repair might be much less disruptive than soft-offline!
I rather hope memory manufacturers build that, but I'm aware of at least
one case where they didn't and the memory must be offline.

> 
> However, what is missing today for volatile memory is that some failures
> can be repaired with in-band writes and some failures need heavier
> hammers like Post-Package-Repair to actively swap in whole new banks of
> memory. So don't we need something like "soft-offline-undo" on the way
> to PPR?

Ultimately we may do. That discussion was in one of the earlier threads
on the more heavyweight case of recovery from poison (unfortunately I can't
find the thread) - the ask was for example code so that the complexity
could be weighed against the demand for this sort of live repair or a lesser
version where repair can only be done once a region is offline (and parts
poisoned).

However, there are other usecases where this isn't needed which is why
that isn't a precursor for this series.

Initial enablement targets two situations:
1) Repair can be done in a non-disruptive way - no need to soft offline at all.
2) Repair can be done at boot before memory is onlined or on admin
   action to take the whole region offline, then repair specific chunks of
   memory before bringing it back online.

> 
> So, yes, +1 to simpler for now where software effectively just needs to
> deal with a handful of "region repair" buttons and the semantics of
> those are coarse and sub-optimal. Wait for a future where a tool author
> says, "we have had good success getting bulk offlined pages back into
> service, but now we need this specific finer grained kernel interface to
> avoid wasting spare banks prematurely".

Depends on where you think that interface is.  I can absolutely see that
as a control to RAS Daemon.  Option 2 above, region is offline, repair
all dodgy looking fine grained buckets.

Note though that a suboptimal repair may mean permanent use of very rare
resources.  So there needs to be a control a the finest granularity as well.
Which order those get added to userspace tools doesn't matter to me.

If you mean that interface in the kernel, it brings some non-trivial requirements.
The kernel would need all of:
1) Tracking interface for all error records so the reverse map from region
   to specific bank / row etc is available for a subset of entries.  The
   kernel would need to know which of those are important (soft offline
   might help in that use case, otherwise that means decision algorithms
   are in kernel or we have a fine-grained queue for region repair in parallel
   with soft-offline).
2) A way to inject the reverse map information from a userspace store
  (to deal with reboot etc).

That sounds a lot harder to deal with than relying on the userspace program
that already does the tracking across boots.

> 
> Anything more complex than "a set of /sys/devices/system/memory/
> devices has a /sys/bus/edac/devices/devX/repair button" feels like a
> generation ahead of where the initial sophistication needs to lie.
> 
> That said, I do not closely follow ras tooling to say whether someone
> has already identified the critical need for a fine grained repair ABI?

It's not that we necessarily want to repair at fine grain, it's that
the control interface to hardware is fine grained and the reverse mapping
often unknown except for specific error records.

I'm fully on board with simple interfaces for common cases like repair
the bad memory in this region.  I'm just strongly against moving the
complexity of doing that into the kernel.

Jonathan

>
Dan Williams Jan. 10, 2025, 10:49 p.m. UTC | #11
Jonathan Cameron wrote:
> On Thu, 9 Jan 2025 15:51:39 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Jonathan Cameron wrote:
> > > Ok. Best path is drop the available range support then (so no min_ max_ or
> > > anything to replace them for now).  
> > 
> > I think less is more in this case.
> 
> A few notes before I get to specific questions.
> 
> Key in the discussion that follows is that the 'repair' is separate
> from the 'decision to repair'.  They mostly need different information
> all of which is in the error trace points. A lot of the questions
> are about the 'decision to repair' part not the repair itself.
> 
[snipped the parts I agree with]
> I'd advise mostly ignoring PPR and looking at memory sparing in
> the CXL spec if you want to consider an example. For PPR DPA is used
> (there is an HPA option that might be available). DPA is still needed
> for on boot soft repair (or we could delay that until regions configured,
> but then we'd need to do DPA to HPA mapping as that will be different
> on a new config, and then write HPA for the kernel to map it back to DPA).

This is helpful because I was indeed getting lost in what kind of
"repair" was being discussed in the thread. Ok, let's focus on sparing
commands.

> 
> > 
> > The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too
> > wide for userspace to have a chance at writing a competent tool. At
> > least I am struggling with where to even begin with those ABIs if I was
> > asked to write a tool. Does a tool already exist for those?
> 
> There is little choice on that - those are the controls for this type
> of repair. If we do something like a magic 'key' based on a concatenation
> of those values we need to define that description to replace a clean
> self describing interface. I'm not 100% against that but I think it would
> be a terrible interface design and I don't think anyone is really in favor of it.
> 
> All a userspace tool does is read the error record fields of
> exactly those names.  From that it will log data (already happening under
> those names in RAS daemon alongside HPA/ DPA).  Then, in simplest case,
> a threshold is passed and we write those values to the repair interface. 
> 
> There is zero need in that simple case for these to be understood at all.

This is where you lose me. The error record is a point-in-time snapshot
of the SPA:HPA:DPA:<proprietary internal "DIMM" mapping>. The security
model for memory operations is based on coordinating with the kernel's
understanding of how that SPA is *currently* being used.

The kernel cannot just take userspace's word for it that potentially
data-changing, or temporary loss-of-use, operations are safe to execute
just because once upon a time userspace saw an error record that
implicated a given SPA in the past, especially over reboot.

The SPA:HPA:DPA:DIMM tuple is invalidated on reconfiguration and reboot
events. It follows that the kernel has a security/integrity interest in
declining to act on invalidated tuples. This is solvable within a single
boot as the kernel can cache the error records and userspace can ask to
"trigger sparing based on cached record X".

For the reboot case when the error record cache is flushed the kernel
needs a reliable way to refill that cache, not an ABI for userspace to
say "take my word for it, this *should* be safe".

[snipped the explanation of replaying the old trace record parameters
data back through sysfs, because that is precisely the hang up I have
with the proposal]

> 
> > 
> > Some questions that read on those ABIs are:
> > 
> > 1/ What if the platform has translation between HPA (CXL decode) and SPA
> > (physical addresses reported in trace points that PIO and DMA see)?
> 
> See later for discussion of other interfaces. This is assuming the
> repair key is not HPA (it is on some systems / situations) - if it's
> the repair key then that is easily dealt with.
> 
> HPA / SPA are more or less irrelevant for repair itself; they are relevant
> for the decision to repair. In the 'on reboot' soft repair case they may
> not even exist at the time of repair as it would be expected to happen
> before we've brought up a region (to get the RAM into a good state at boot).
> 
> For cases where the memory decoders are configured and so there is an HPA
> to DPA mapping:
> The trace reports provide all these somewhat magic values and
> the HPA.  Thus we can do the HPA aware stuff on that before then looking
> up the other bit of the appropriate error reports to get the bank row etc.
> 
> > 
> > 2/ What if memory is interleaved across repair domains? 
> 
> Also not relevant to a repair control and only a little relevant to the
> decision to repair.  The domains would be handled separately but if
> we have to offline a chunk of memory to do it (needed for repair
> on some devices) that may be a bigger chunk if fine-grained interleave is
> in use, and that may affect the decision.

Again, the repair control assumes that the kernel can just trust
userspace to get it right. When the kernel knows the SPA implications it
can add safety like "you are going to issue sparing on deviceA that will
temporarily take deviceA offline. CXL subsystem tells me deviceA is
interleaved with deviceB in SPA so the whole SPA range needs to be
offline before this operation proceeds". That is not something that
userspace can reliably coordinate.

> > 3/ What if the device does not use DDR terminology / topology terms for
> > repair?
> 
> Then we provide the additional interfaces, assuming they correspond to
> well-known terms.  If someone is using a magic key then we can get grumpy
> with them, but that can also be supported.
> 
> Mostly I'd expect a new technology to overlap a lot of the existing
> interface and maybe add one or two more; which layer in the stack for
> HBM for instance.

The concern is the assertion that sysfs needs to care about all these
parameters vs an ABI that says "repair errorX". If persistence and
validity of error records is the concern, let's build an ABI for that and
not depend upon trust in userspace to properly coordinate memory
integrity concerns.

> 
> The main alternative is where the device takes an HPA / SPA / DPA. We have one
> driver that does that queued up behind this series that uses HPA. PPR uses
> DPA.  In that case userspace first tries to see if it can repair by HPA then
> DPA and if not moves on to see if it can use the fuller description.
> We will see devices supporting HPA / DPA (which to use depends on when you
> are doing the operation and what has been configured) but mostly I'd expect
> either HPA/DPA or fine grained on a given repair instance.
> 
> HPA only works if the address decoders are always configured (so not on CXL).
> What is actually happening in that case is typically that a firmware is
> involved that can look up address decoders etc, and map the control HPA
> to Bank / row etc to issue the actual low level commands.  This keeps
> the memory mapping completely secret rather than exposing it in error
> records.
> 
> > 
> > I expect the flow rasdaemon would want is that the current PFA (leaky
> > bucket Pre-Failure Analysis) decides that the number of soft-offlines it
> > has performed exceeds some threshold and it wants to attempt to repair
> > memory.
> 
> Sparing may happen prior to the point where we'd have done a soft offline
> if non-disruptive (whether it is can be read from another bit of the
> ABI).  Memory repair might be much less disruptive than soft-offline!
> I rather hope memory manufacturers build that, but I'm aware of at least
> one case where they didn't and the memory must be offline.

That's a good point, spare before offline makes sense.

[..]
> However, there are other usecases where this isn't needed which is why
> that isn't a precursor for this series.
> 
> Initial enablement targets two situations:
> 1) Repair can be done in a non-disruptive way - no need to soft offline at all.

Modulo needing to quiesce access over the sparing event?

> 2) Repair can be done at boot before memory is onlined or on admin
>    action to take the whole region offline, then repair specific chunks of
>    memory before bringing it back online.

Which is userspace racing the kernel to online memory?

> > So, yes, +1 to simpler for now where software effectively just needs to
> > deal with a handful of "region repair" buttons and the semantics of
> > those are coarse and sub-optimal. Wait for a future where a tool author
> > says, "we have had good success getting bulk offlined pages back into
> > service, but now we need this specific finer grained kernel interface to
> > avoid wasting spare banks prematurely".
> 
> Depends on where you think that interface is.  I can absolutely see that
> as a control to RAS Daemon.  Option 2 above, region is offline, repair
> all dodgy looking fine grained buckets.
> 
> Note though that a suboptimal repair may mean permanent use of very rare
> resources.  So there needs to be a control at the finest granularity as well.
> Which order those get added to userspace tools doesn't matter to me.
> 
> If you mean that interface in the kernel, it brings some non-trivial requirements.
> The kernel would need all of:
> 1) Tracking interface for all error records so the reverse map from region
>    to specific bank / row etc is available for a subset of entries.  The
>    kernel would need to know which of those are important (soft offline
>    might help in that use case, otherwise that means decision algorithms
>    are in kernel or we have a fine-grained queue for region repair in parallel
>    with soft-offline).
> 2) A way to inject the reverse map information from a userspace store
>   (to deal with reboot etc).

Not a way to inject the reverse map information, a way to inject the
error records and assert that memory topology changes have not
invalidated those records.

> That sounds a lot harder to deal with than relying on the userspace program
> that already does the tracking across boots.

I am stuck behind the barrier that userspace must not assume it knows
better than the kernel about the SPA impact of a DIMM sparing
event. The kernel needs evidence: either live records from within the
same kernel boot, or validated records from a previous boot.

...devices could also help us out here with a way to replay DIMM error
events. That would allow for refreshing error records even if the
memory topology changes, because the new record would generate a refreshed
SPA:HPA:DPA:DIMM tuple.

> > Anything more complex than "a set of /sys/devices/system/memory/
> > devices has a /sys/bus/edac/devices/devX/repair button" feels like a
> > generation ahead of where the initial sophistication needs to lie.
> > 
> > That said, I do not closely follow ras tooling to say whether someone
> > has already identified the critical need for a fine grained repair ABI?
> 
> It's not that we necessarily want to repair at fine grain, it's that
> the control interface to hardware is fine grained and the reverse mapping
> often unknown except for specific error records.
> 
> I'm fully on board with simple interfaces for common cases like repair
> the bad memory in this region.  I'm just strongly against moving the
> complexity of doing that into the kernel.

Yes, we are just caught up on where that "...but no simpler" line is
drawn.
Borislav Petkov Jan. 11, 2025, 5:12 p.m. UTC | #12
On Thu, Jan 09, 2025 at 06:34:48PM +0000, Jonathan Cameron wrote:
> Today you can.  Seems we are talking at cross purposes.
> 
> I'm confused. I thought your proposal was for the "bank" attribute to present an
> allowed range on read.
> The "bank" attribute is currently written to and read back as the value of the bank on which
> to conduct a repair.  Maybe this disconnect is down to the fact that the max_ and min_
> attributes should have been marked as RO in the docs. They aren't controls,
> just a presentation of limits to userspace.
> 
> Was the intent a separate bank_range type attribute rather than max_bank and min_bank?

I don't know - I'm just throwing ideas out there. You could do:

cat /sys/.../bank

and that gives you

[<low> <current_value> <high>]

So you have all the needed information. Dunno if this would be abusing sysfs
rules too much tho.

> > 
> > > In at least the CXL case I'm fairly sure most of them are not discoverable.
> > > Until you see errors you have no idea what the memory topology is.  
> > 
> > Ok.
> > 
> > > For that you'd need to have a path to read back what happened.  
> > 
> > So how is this scrubbing going to work? You get an error, you parse it for all
> > the attributes and you go and write those attributes into the scrub interface
> > and it starts scrubbing?
> 
> Repair not scrubbing. They are different things we should keep separate,
> scrub corrects the value, if it can, but doesn't change the underlying memory to
> new memory cells to avoid repeated errors. Replacing scrub with repair 
> (which I think was the intent here)...

Really?

So how is scrubbing defined for CXL? You read memory, do ECC check on it,
report any potential errors but write back the *original* wrong value?!

I thought the point of scrubbing is to repair it while at it too...

> You get error records that describe the error seen in hardware, write back the
> values into this interface and tell it to repair the memory.  This is not
> necessarily a synchronous or immediate thing - instead typically based on
> trend analysis.

This is just silly: I'm scrubbing, I found an error, I should simply fix it
while at it. Why would I need an additional command to repair it?!

> As an example, the decision might be that bit of ram threw up 3 errors
> over a month including multiple system reboots (for other reasons) and
> that is over some threshold so we use a spare memory line to replace it.

Right.

> Short answer, it needs to be very smart and there isn't a case of one size
> fits all - hence suggested approach of making it a user space problem.

Making it a userspace problem is pretty much always a sign that the hw design
failed.

> Given in the systems being considered here, software is triggering the repair,
> we want to allow for policy in the decision. 

Right, you can leave a high-level decision to userspace: repair only when
idle, repair only non-correctable errors, blabla but exposing every single
aspect of every single error... meh.

> In simple cases we could push that policy into the kernel e.g. just repair
> the moment we see an error record.
> 
> These repair resources are very limited in number, so immediately repairing
> may be a bad idea. We want to build up a history of errors before making
> such a decision.  That can be done in kernel. 

Yap, we are doing this now:

drivers/ras/cec.c

Userspace involvement is minimal, if at all. It is mostly controlling the
parameters of the leaky bucket.
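
Roughly, the idea is just a decaying error count per element - a hand-wavy
sketch, not the actual cec.c code (names and thresholds invented):

/*
 * Leaky-bucket sketch in the spirit of drivers/ras/cec.c; uses the
 * jiffies helpers from <linux/jiffies.h>.  Illustration only.
 */
struct leaky_bucket {
	unsigned int count;	/* errors charged to this element */
	unsigned long last;	/* jiffies at last fill/leak */
};

static bool should_act(struct leaky_bucket *b, unsigned int threshold,
		       unsigned long decay_interval)
{
	unsigned long now = jiffies;

	/* leak: age out old errors */
	while (b->count && time_after(now, b->last + decay_interval)) {
		b->count--;
		b->last += decay_interval;
	}

	/* fill: charge the new error */
	b->count++;
	b->last = now;

	/* overflow: element looks genuinely bad, act on it */
	return b->count >= threshold;
}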

> The decision to repair memory is heavily influenced by policy and time considerations
> against device resource constraints.
> 
> Some options that are hard to do in kernel.
> 
> 1. Typical asynchronous error report for a corrected error.
> 
>    Tells us memory had an error (perhaps from a scrubbing engine on the device
>    running checks). No need to take action immediately. Instead build up more data
>    over time and if lots of errors occur make decision to repair as no we are sure it
>    is worth doing rather than a single random event. We may tune scrubbing engines
>    to check this memory more frequently and adjust our data analysis to take that
>    into account for setting thresholds etc.

See above.

What happens when your daemon dies and loses all that collected data?

> 2. Soft repair across boots.  We are actually storing the error records, then only
>    applying the fix on reboot before using the memory - so maintaining a list
>    of bad memory and saving it to a file to read back on boot. We could provide
>    another kernel interface to get this info and reinject it after reboot instead
>    of doing it in userspace but that is another ABI to design.

We did something similar recently: drivers/ras/amd/fmpm.c. It basically
"replays" errors from persistent storage as that memory cannot be replaced.

> 3. Complex policy across fleets.  A lot of work is going on around prediction techniques
>    that may change the local policy on each node dependent on the overall reliability
>    patterns of a particular batch of devices and local characteristics, service guarantees
>    etc. If it is hard repair, then once you've run out you need to schedule an engineer
>    out to replace the DIMM. All complex inputs to the decision.

You probably could say here: "repair or report when this and that." or
"offline page and report error" and similar high-level decisions by leaving
the details to the kernel instead of looking at every possible error in
userspace and returning back to the kernel to state your decision.

> Similar cases like CPU offlining on repeated errors are done in userspace (e.g.
> RAS Daemon) for similar reasons of long term data gathering and potentially
> complex algorithms.
>   
> > 
> > > Ok. Then can we just drop the range discoverability entirely or we go with
> > > your suggestion and do not support read back of what has been
> > > requested but instead have the reads return a range if known or "" /
> > > return -EOPNOTSUPP if simply not known?  
> > 
> > Probably.
> 
> Too many options in the above paragraph so just to check...  Probably to which?
> If it's a separate attribute from the one we write the control so then
> we do what is already done here and don't present the interface at all if
> the range isn't discoverable.

Probably means I still don't get a warm and fuzzy feeling about this design.
As I've noted above.

> Ok. Best path is drop the available range support then (so no min_ max_ or
> anything to replace them for now).
> 
> Added bonus is we don't have to rush this conversation and can make sure we
> come to the right solution driven by use cases.

Yap, that sounds like a prudent idea.

What I'm trying to say, basically, is, this interface through sysfs is
a *lot* of attributes, there's no clear cut use case where we can judge how
useful it is and as I alluded to above, I really think that you should leave
the high-level decisions to userspace and let the kernel do its job.

This'll make your interface a lot simpler.

And if you really need to control every single aspect of scrubbing in
userspace, then you can always come later with proper design and use case.

But again, I really think you should keep as much recovery logic in the kernel
and as automatic as possible. Only when you really really need user input,
only then you should allow it...

I hope I'm making sense here...

Thx.
Jonathan Cameron Jan. 13, 2025, 11:07 a.m. UTC | #13
On Sat, 11 Jan 2025 18:12:43 +0100
Borislav Petkov <bp@alien8.de> wrote:

> On Thu, Jan 09, 2025 at 06:34:48PM +0000, Jonathan Cameron wrote:
> > Today you can.  Seems we are talking cross purposes.
> > 
> > I'm confused. I thought your proposal was for "bank" attribute to present an
> > allowed range on read.
> > "bank" attribute is currently written to and read back as the value of the bank on which
> > to conduct a repair.  Maybe this disconnect is down to the fact max_ and min_
> > attributes should have been marked as RO in the docs. They aren't controls,
> > just presentation of limits to userspace.
> > 
> > Was intent a separate bank_range type attribute rather than max_bank, min_bank?  
> 
> I don't know - I'm just throwing ideas out there. You could do:
> 
> cat /sys/.../bank
> 
> and that gives you
> 
> [<low> <current_value> <high>]
> 
> So you have all the needed information. Dunno if this would be abusing sysfs
> rules too much tho.

We could do that; it's a trade-off between the count of files and parsing
complexity, but given I don't expect anyone to parse the reads anyway, sure.

If we do go that way I guess we'd have things like
<0> 33 <123>
33

To allow for cases where the range is known vs not known? Or
<?> 33 <?> maybe?

We can do that if you prefer.  I'm not that fussed how this is handled
because, for tooling at least, I don't see why we'd ever read it.
It's for human parsing only and the above is fine.
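
If we did, the show() side is simple either way. A rough sketch (the
attribute struct and the to_repair_attr() wrapper are invented for
illustration):

static ssize_t bank_show(struct device *dev,
			 struct device_attribute *attr, char *buf)
{
	struct repair_attr *a = to_repair_attr(attr);	/* invented wrapper */

	if (a->range_known)
		return sysfs_emit(buf, "<%u> %u <%u>\n", a->min, a->val, a->max);

	return sysfs_emit(buf, "<?> %u <?>\n", a->val);
}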

> 
> > >   
> > > > In at least the CXL case I'm fairly sure most of them are not discoverable.
> > > > Until you see errors you have no idea what the memory topology is.    
> > > 
> > > Ok.
> > >   
> > > > For that you'd need to have a path to read back what happened.    
> > > 
> > > So how is this scrubbing going to work? You get an error, you parse it for all
> > > the attributes and you go and write those attributes into the scrub interface
> > > and it starts scrubbing?  
> > 
> > Repair not scrubbing. They are different things we should keep separate,
> > scrub corrects the value, if it can, but doesn't change the underlying memory to
> > new memory cells to avoid repeated errors. Replacing scrub with repair 
> > (which I think was the intent here)...  
> 
> Really?
> 
> So how is scrubbing defined for CXL? You read memory, do ECC check on it,
> report any potential errors but write back the *original* wrong value?!

What you describe is correction, not repair. Of course it corrects!
These two terms mean different things in CXL world and in DDR specs etc.
We should have made that distinction clearer as perhaps that is the root
of this struggle to reach agreement.

> 
> I thought the point of scrubbing is to repair it while at it too...

No. There is a major difference between correction (which scrubbing does)
and repair.

The point of scrubbing is to correct errors if it can, by using (typically)
the single error correction of SECDED ECC codes (single correct, double detect).
In many cases it reports the corrected errors and any uncorrectable errors it runs into.
This is common to all scrub implementations, not just CXL ones.

What happens with those error reports differs between systems. This is
not CXL specific. I'm aware of other systems that make repair an OS managed
thing - we have another driver queued up behind this set that does exactly
that.

In many existing servers (I know some of ours do this for example), the
firmware / BMC etc keep track of those errors and make the decision to
repair the memory. It does not do this on detection of one correctable
error, it does it after a repeated pattern.

Repair can be a feature of the DIMMs themselves or it can be a feature
of the memory controller. It is basically replacing the failing memory with
spare memory from somewhere else (usually elsewhere on the same DIMMs, which
have a bit of spare capacity for this).  Bit like a hot spare in a RAID setup.

In some other systems the OS gets the errors and is responsible for making
the decision. Sticking to the corrected error case (uncorrected handling
is going to require a lot more work given we've lost data, Dan asked about that
in the other branch of the thread), the OS as a whole (kernel + userspace)
gets the error records and makes the policy decision to repair based on
assessment of risk vs resource availability to make a repair.

Two reasons for this
1) Hardware isn't necessarily capable of repairing autonomously as
   other actions may be needed (memory traffic to some granularity of
   memory may need to be stopped to avoid timeouts). Note there are many
   graduations of this from A) can do it live with no visible effect, through
   B) offline a page, to C) offlining the whole device.
2) Policy can be a lot more sophisticated than a BMC can do.

> 
> > You get error records that describe the error seen in hardware, write back the
> > values into this interface and tell it to repair the memory.  This is not
> > necessarily a synchronous or immediate thing - instead typically based on
> > trend analysis.  
> 
> This is just silly: I'm scrubbing, I found an error, I should simply fix it
> while at it. Why would I need an additional command to repair it?!

This is confusing correction with repair.  They are different operations.
Correction of course happens (if we can - normally single bit errors only).
No OS involvement in scrub based correction.
Repair is normally only for repeated errors.   If the memory is fine and
we get a cosmic ray or similar flipping a bit there is no long term
increased likelihood of seeing another error.  If we get one every X
hours then it is highly likely the issue is something wrong with the memory.
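
To make the userspace half concrete, a rough sketch against the sysfs ABI
in this patch (the device name, the field values and the write_attr()
helper are invented; error handling trimmed):

#include <stdio.h>

/* write one mem_repairX attribute; illustration only */
static int write_attr(const char *dir, const char *attr, unsigned long val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/%s", dir, attr);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%lu\n", val);
	return fclose(f);
}

int main(void)
{
	/* example instance name - comes from enumerating /sys/bus/edac */
	const char *d = "/sys/bus/edac/devices/cxl_mem0/mem_repair0";

	/* all values below come from the logged error record for this DIMM */
	write_attr(d, "channel", 1);
	write_attr(d, "rank", 0);
	write_attr(d, "bank_group", 2);
	write_attr(d, "bank", 3);
	write_attr(d, "row", 0x1a2b);
	write_attr(d, "column", 7);
	write_attr(d, "repair", 1);	/* trigger the repair operation */
	return 0;
}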

> 
> > As an example, the decision might be that bit of ram threw up 3 errors
> > over a month including multiple system reboots (for other reasons) and
> > that is over some threshold so we use a spare memory line to replace it.  
> 
> Right.
> 
> > Short answer, it needs to be very smart and there isn't a case of one size
> > fits all - hence suggested approach of making it a user space problem.  
> 
> Making it a userspace problem is pretty much always a sign that the hw design
> failed.

In some cases perhaps, but another very strong driver is that policy is involved.

We can either try to put a complex design in firmware and poke it with N opaque
parameters from a userspace tool or via some out of band method or we can put
the algorithm in userspace where it can be designed to incorporate lessons learnt
over time.  We will start simple and see what is appropriate as this starts
to get used in large fleets.  This stuff is a reasonable target for AI type
algorithms etc that we aren't going to put in the kernel.

Doing this at all is a reliability optimization, normally it isn't required for
correct operation.

> 
> > Given in the systems being considered here, software is triggering the repair,
> > we want to allow for policy in the decision.   
> 
> Right, you can leave a high-level decision to userspace: repair only when
> idle, repair only non-correctable errors, blabla but exposing every single
> aspect of every single error... meh.

Those aspects identify the error location. As I said in my reply to Dan, a device
does not necessarily have a fixed mapping to a device physical address over
reboots etc.  There are solutions that scramble that mapping on reboot
or hotplug for various reasons (security and others).

For CXL at least these are all in userspace already (RAS-daemon has supported
them for quite a while).  The same data is coming in a CPER record for other
memory errors though I'm not sure it is currently surfaced to RAS Daemon.


> 
> > In simple cases we could push that policy into the kernel e.g. just repair
> > the moment we see an error record.
> > 
> > These repair resources are very limited in number, so immediately repairing
> > may be a bad idea. We want to build up a history of errors before making
> > such a decision.  That can be done in kernel.   
> 
> Yap, we are doing this now:
> 
> drivers/ras/cec.c
> 
> Userspace involvement is minimal, if at all. It is mostly controlling the
> parameters of the leaky bucket.

Offline has no permanent cost and no limit on number of times you can
do it. Repair is definitely a limited resource and may permanently use
up that resource (whether it does is discoverable, as a policy wants to know that too!)
In some cases once you run out of repair resources you have to send an
engineer to replace the memory before you can do it again.

So to me, these are different situations requiring different solutions.

> 
> > The decision to repair memory is heavily influenced by policy and time considerations
> > against device resource constraints.
> > 
> > Some options that are hard to do in kernel.
> > 
> > 1. Typical asynchronous error report for a corrected error.
> > 
> >    Tells us memory had an error (perhaps from a scrubbing engine on the device
> >    running checks). No need to take action immediately. Instead build up more data
> >    over time and if lots of errors occur make decision to repair as no we are sure it
> >    is worth doing rather than a single random event. We may tune scrubbing engines
> >    to check this memory more frequently and adjust our data analysis to take that
> >    into account for setting thresholds etc.  
> 
> See above.
> 
> What happens when your daemon dies and loses all that collected data?

It is always logged - in rasdaemon, either to a local sqlite DB, or people are
carrying patches to do fleet scale logging.  This works across reboots so a daemon
dying is part of the design rather than just an error case.
If the daemon is able to corrupt your error logs then we have bigger problems.

> 
> > 2. Soft repair across boots.  We are actually storing the error records, then only
> >    applying the fix on reboot before using the memory - so maintaining a list
> >    of bad memory and saving it to a file to read back on boot. We could provide
> >    another kernel interface to get this info and reinject it after reboot instead
> >    of doing it in userspace but that is another ABI to design.  
> 
> We did something similar recently: drivers/ras/amd/fmpm.c. It basically
> "replays" errors from persistent storage as that memory cannot be replaced.

Ok. I guess it is an option (I wasn't aware of that work).

I was thinking that was far more complex to deal with than just doing it in
userspace tooling. From a quick look that solution seems to rely on ACPI ERST
infrastructure to provide that persistence that we won't generally have but
I suppose we can read it from the filesystem or other persistent stores.
We'd need to be a lot more general about that as we can't make system assumptions
that can be made in AMD specific code.

So it could be done; I don't think it is a good idea in this case, but that
example does suggest it is possible.
 
> > 3. Complex policy across fleets.  A lot of work is going on around prediction techniques
> >    that may change the local policy on each node dependent on the overall reliability
> >    patterns of a particular batch of devices and local characteristics, service guarantees
> >    etc. If it is hard repair, then once you've run out you need to schedule an engineer
> >    out to replace the DIMM. All complex inputs to the decision.  
> 
> You probably could say here: "repair or report when this and that." or
> "offline page and report error" and similar high-level decisions by leaving
> the details to the kernel instead of looking at every possible error in
> userspace and returning back to the kernel to state your decision.

In the approach we are targeting, there is no round-trip situation.  We let the
kernel deal with any synchronous error just fine and run its existing logic
to offline problematic memory.  That needs to be timely and to carry on operating
exactly as it always has.

In parallel with that we gather the error reports that we will already be
gathering and run analysis on those.  From that we decide whether a piece of memory
is likely to fail again and perform a sparing operation if appropriate.
Effectively this is 'free'. All the information is already there in userspace
and already understood by tools like rasdaemon; we are not expanding that
reporting interface at all.

Or we don't decide to and the memory continues to be used as before.  Sure there
may be a higher chance of error but there isn't one right now.  (or there is and
we have offlined the memory anyway via normal mechanisms so no problem).

There is an interesting future exploration to be done on recovery from poison
(uncorrectable error) situations, but that is not in scope of this initial
code and that would likely be an asynchronous best effort process as well.


> 
> > Similar cases like CPU offlining on repeated errors are done in userspace (e.g.
> > RAS Daemon) for similar reasons of long term data gathering and potentially
> > complex algorithms.
> >     
> > >   
> > > > Ok. Then can we just drop the range discoverability entirely or we go with
> > > > your suggestion and do not support read back of what has been
> > > > requested but instead have the reads return a range if known or "" /
> > > > return -EOPNOTSUPP if simply not known?    
> > > 
> > > Probably.  
> > 
> > Too many options in the above paragraph so just to check...  Probably to which?
> > If it's a separate attribute from the one we write the control so then
> > we do what is already done here and don't present the interface at all if
> > the range isn't discoverable.  
> 
> Probably means I still don't get a warm and fuzzy feeling about this design.
> As I've noted above.

Ok. I was confused as the paragraph asked an (A) or (B) question.

> 
> > Ok. Best path is drop the available range support then (so no min_ max_ or
> > anything to replace them for now).
> > 
> > Added bonus is we don't have to rush this conversation and can make sure we
> > come to the right solution driven by use cases.  
> 
> Yap, that sounds like a prudent idea.
> 
> What I'm trying to say, basically, is, this interface through sysfs is
> a *lot* of attributes, there's no clear cut use case where we can judge how
> useful it is and as I alluded to above, I really think that you should leave
> the high-level decisions to userspace and let the kernel do its job.
> 
> This'll make your interface a lot simpler.

Ok.  It seems you correlate number of files with complexity.
I correlated difficulty of understanding those files with complexity.
Every one of the files is clearly defined and aligned with the long history
of how to describe DRAM (see how long CPER records have used these
fields for example - they go back to the beginning).

1) It is only 8 files (if we drop the range info).
   I don't consider that a big enough interface to worry about complexity.
2) Moving the control into the kernel will require a control interface as
   there will be multiple tuneables for those decisions. Like most such
   algorithm parameter questions we may spend years arguing about what those
   look like.  Sure that problem also occurs in userspace but we aren't
   obliging people to use what we put in RASDaemon.

> 
> And if you really need to control every single aspect of scrubbing in
> userspace, then you can always come later with proper design and use case.

I'm all in favor of building an interface up by providing the minimum first
and then adding to it, but here what is proposed is the minimum for basic
functionality and the alternative of doing the whole thing in kernel both
puts complexity in the wrong place and restricts us in what is possible.
See below for a proposal to maybe make some progress.

> 
> But again, I really think you should keep as much recovery logic in the kernel
> and as automatic as possible. Only when you really really need user input,
> only then you should allow it...

This is missing the key consideration of resource availability being
limited, and is not addressing the main use case here. This is not and has
never been about the logic to recover from an error. It is about reducing
the possibility of repeated errors.

> 
> I hope I'm making sense here...

To some degree but I think there is a major mismatch in what we think
this is for.

What I've asked Shiju to look at is splitting the repair infrastructure
into two cases so that maybe we can make partial progress:

1) Systems that support repair by Physical Address
 - Covers Post Package Repair for CXL

2) Systems that support repair by description of the underlying hardware
- Covers Memory Sparing interfaces for CXL. 

We need both longer term anyway, but maybe 1 is less controversial simply
on the basis that it has fewer control parameters.

This still fundamentally puts the policy in userspace where I
believe it belongs.

Jonathan



> 
> Thx.
>
Jonathan Cameron Jan. 13, 2025, 11:40 a.m. UTC | #14
On Fri, 10 Jan 2025 14:49:03 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan Cameron wrote:
> > On Thu, 9 Jan 2025 15:51:39 -0800
> > Dan Williams <dan.j.williams@intel.com> wrote:
> >   
> > > Jonathan Cameron wrote:  
> > > > Ok. Best path is drop the available range support then (so no min_ max_ or
> > > > anything to replace them for now).    
> > > 
> > > I think less is more in this case.  
> > 
> > A few notes before I get to specific questions.
> > 
> > Key in the discussion that follows is that the 'repair' is a separate
> > from the 'decision to repair'.  They mostly need different information
> > all of which is in the error trace points. A lot of the questions
> > are about the 'decision to repair' part not the repair itself.
> >   
> [snipped the parts I agree with]
> > I'd advise mostly ignoring PPR and looking at memory sparing in
> > the CXL spec if you want to consider an example. For PPR DPA is used
> > (there is an HPA option that might be available). DPA is still needed
> > for on boot soft repair (or we could delay that until regions configured,
> > but then we'd need to do DPA to HPA mapping as that will be different
> > on a new config, and then write HPA for the kernel to map it back to DPA.  
> 
> This is helpful because I was indeed getting lost in what kind of
> "repair" was being discussed in the thread. Ok, lets focus on sparing
> commands.
> 
> >   
> > > 
> > > The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too
> > > wide for userspace to have a chance at writing a competent tool. At
> > > least I am struggling with where to even begin with those ABIs if I was
> > > asked to write a tool. Does a tool already exist for those?  
> > 
> > There is little choice on that - those are the controls for this type
> > of repair. If we do something like a magic 'key' based on a concatenation
> > of those values we need to define that description to replace a clean
> > self describing interface. I'm not 100% against that but I think it would
> > be a terrible interface design and I don't think anyone is really in favor of it.
> > 
> > All a userspace tool does is read the error record fields of
> > exactly those names.  From that it will log data (already happening under
> > those names in RAS daemon alongside HPA/ DPA).  Then, in simplest case,
> > a threshold is passed and we write those values to the repair interface. 
> > 
> > There is zero need in that simple case for these to be understood at all.  
> 
> This is where you lose me. The error record is a point in time snapshot
> of the SPA:HPA:DPA:<proprietary internal "DIMM" mapping>. The security
> model for memory operations is based on coordinating with the kernel's
> understanding of how that SPA is *currently* being used.

Whilst it is being used I agree.  Key is to only do disruptive / data
changing actions when it is not being used.

> 
> The kernel can not just take userspace's word for it that potentially
> data changing, or temporary loss-of-use operations are safe to execute
> just because once upon a time userspace saw an error record that
> implicated a given SPA in the past, especially over reboot. 

There are two cases (discoverable from hardware)

1) Non disruptive.  No security concern as the device guarantees to
   not interrupt traffic and the memory contents is copied to the new
   location. Basically software never knows it happened.
2) Disruptive.  We only allow this if the memory is offline. In the CXL case
   the CXL specific code must check no memory on the device is online so
   we aren't disrupting anything.  The other implementation we have code
   for (will post after this lands) has finer granularity constraints and only
   the page needs to be offline.
   As it is offline the content is not preserved anyway. We may need to add extra
   constraints along with future support for temporal persistence / sharing but
   we can do that as part of adding that support in general.
   (Personally I think in those cases memory repair is a job for the out of
    band management anyway).

In neither case am I seeing a security concern.  Am I missing something?
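
Put another way, the gate the kernel has to apply is tiny. A conceptual
sketch (struct and helper names invented, not the real CXL code):

static int mem_repair_check_safe(struct mem_repair_ctx *ctx)
{
	/* case 1: device guarantees traffic and contents are unaffected */
	if (ctx->repair_safe_when_in_use)
		return 0;

	/* case 2: disruptive repair is only allowed on offline memory */
	if (mem_device_has_online_memory(ctx->dev))
		return -EBUSY;

	return 0;
}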

> 
> The SPA:HPA:DPA:DIMM tuple is invalidated on reconfiguration and reboot
> events. It follows that the kernel has a security/integrity interest in
> declining to act on invalidated tuples. This is solvable within a single
> boot as the kernel can cache the error records and userspace can ask to
> "trigger sparing based on cached record X".

The above rules remove this complexity.  Either it is always safe by
device design, or the memory is offline and we'll zero fill it anyway
so no security concern.

> 
> For the reboot case when the error record cache is flushed the kernel
> needs a reliable way to refill that cache, not an ABI for userspace to
> say "take my word for it, this *should* be safe".

It is safe because of 1 and 2 above: we are not editing data in use
except in a fashion that the device guarantees is safe.

If you don't trust the device on this you have bigger problems.

> 
> [snipped the explanation of replaying the old trace record parameters
> data back through sysfs, because that is precisely the hang up I have
> with the proposal]
> 
> >   
> > > 
> > > Some questions that read on those ABIs are:
> > > 
> > > 1/ What if the platform has translation between HPA (CXL decode) and SPA
> > > (physical addresses reported in trace points that PIO and DMA see)?  
> > 
> > See later for discussion of other interfaces.. This is assuming the
> > repair key is not HPA (it is on some systems / situations) - if it's
> > the repair key then that is easily dealt with.
> > 
> > HPA / SPA more or less irrelevant for repair itself, they are relevant
> > for the decision to repair. In the 'on reboot' soft repair case they may
> > not even exist at the time of repair as it would be expected to happen
> > before we've brought up a region (to get the RAM into a good state at boot).
> > 
> > For cases where the memory decoders are configured and so there is an HPA
> > to DPA mapping:
> > The trace reports provide both all these somewhat magic values and
> > the HPA.  Thus we can do the HPA aware stuff on that before then looking
> > up the other bit of the appropriate error reports to get the bank row etc.
> >   
> > > 
> > > 2/ What if memory is interleaved across repair domains?   
> > 
> > Also not relevant to a repair control and only a little relevant to the
> > decision to repair.  The domains would be handled separately but if
> > we have to offline a chunk of memory to do it (needed for repair
> > on some devices) that may be a bigger chunk if fine grained interleave
> > in use and that may affect the decision.  
> 
> Again, the repair control assumes that the kernel can just trust
> userspace to get it right. When the kernel knows the SPA implications it
> can add safety like "you are going to issue sparing on deviceA that will
> temporarily take deviceA offline. CXL subsystem tells me deviceA is
> interleaved with deviceB in SPA so the whole SPA range needs to be
> offline before this operation proceeds". That is not someting that
> userspace can reliably coordinate.

Absolutely the kernel has to enforce this. Same way we protect against
poison injection in some cases.  Right now the enforcement is slightly
wrong (Shiju is looking at it again) as we were enforcing at the wrong
granularity (specific DPA, not device). Identifying that hole is a good
outcome of this discussion making us take another look.

Enforcing this is one of the key jobs of the CXL specific driver.
We considered doing it in the core, but the granularity differences
between our initial few examples meant we decided on specific driver
implementations of the checks for now.

> 
> > > 3/ What if the device does not use DDR terminology / topology terms for
> > > repair?  
> > 
> > Then we provide the additional interfaces assuming they correspond to well
> > known terms.  If someone is using a magic key then we can get grumpy
> > with them, but that can also be supported.
> > 
> > Mostly I'd expect a new technology to overlap a lot of the existing
> > interface and maybe add one or two more; which layer in the stack for
> > HBM for instance.  
> 
> The concern is the assertion that sysfs needs to care about all these
> parameters vs an ABI that says "repair errorX". If persistence and
> validity of error records is the concern lets build an ABI for that and
> not depend upon trust in userspace to properly coordinate memory
> integrity concerns.

It doesn't have to.  It just has to ensure that the memory device is in the correct
state.  So check, not coordinate. At a larger scale, coordination is already doable
(subject to races that we must avoid by holding locks): tear down the regions
so there are no mappings on the device you want to repair.  Don't bring them
up again until after you are done.

The main use case is probably do it before you bring the mappings up, but
same result.

> 
> > 
> > The main alternative is where the device takes an HPA / SPA / DPA. We have one
> > driver that does that queued up behind this series that uses HPA. PPR uses
> > DPA.  In that case userspace first tries to see if it can repair by HPA then
> > DPA and if not moves on to see if it it can use the fuller description.
> > We will see devices supporting HPA / DPA (which to use depends on when you
> > are doing the operation and what has been configured) but mostly I'd expect
> > either HPA/DPA or fine grained on a given repair instance.
> > 
> > HPA only works if the address decoders are always configured (so not on CXL)
> > What is actually happening in that case is typically that a firmware is
> > involved that can look up address decoders etc, and map the control HPA
> > to Bank / row etc to issue the actual low level commands.  This keeps
> > the memory mapping completely secret rather than exposing it in error
> > records.
> >   
> > > 
> > > I expect the flow rasdaemon would want is that the current PFA (leaky
> > > bucket Pre-Failure Analysis) decides that the number of soft-offlines it
> > > has performed exceeds some threshold and it wants to attempt to repair
> > > memory.  
> > 
> > Sparing may happen prior to the point where we'd have done a soft offline
> > if non disruptive (whether it is can be read from another bit of the
> > ABI).  Memory repair might be much less disruptive than soft-offline!
> > I rather hope memory manufacturers build that, but I'm aware of at least
> > one case where they didn't and the memory must be offline.  
> 
> That's a good point, spare before offline makes sense.

If transparent and resources are not constrained.
Very much not if we have to tear down the memory first.

> 
> [..]
> > However, there are other usecases where this isn't needed which is why
> > that isn't a precursor for this series.
> > 
> > Initial enablement targets two situations:
> > 1) Repair can be done in non disruptive way - no need to soft offline at all.  
> 
> Modulo needing to quiesce access over the sparing event?

Absolutely.  This is only doable in devices that don't need to quiesce.

> 
> > 2) Repair can be done at boot before memory is onlined or on admin
> >    action to take the whole region offline, then repair specific chunks of
> >    memory before bringing it back online.  
> 
> Which is userspace racing the kernel to online memory?

If you are doing this scheme you don't automatically online memory. So
both are in userspace control and can be easily sequenced.
If you aren't auto onlining then buy devices with hard PPR and do it by offlining
manually, repairing and rebooting. Or buy devices that don't need to quiesce
and cross your fingers that the dodgy RAM doesn't throw an error before you get
that far.  Little choice if you decide to online right at the start as normal
memory.

> 
> > > So, yes, +1 to simpler for now where software effectively just needs to
> > > deal with a handful of "region repair" buttons and the semantics of
> > > those are coarse and sub-optimal. Wait for a future where a tool author
> > > says, "we have had good success getting bulk offlined pages back into
> > > service, but now we need this specific finer grained kernel interface to
> > > avoid wasting spare banks prematurely".  
> > 
> > Depends on where you think that interface is.  I can absolutely see that
> > as a control to RAS Daemon.  Option 2 above, region is offline, repair
> > all dodgy looking fine grained buckets.
> > 
> > Note though that a suboptimal repair may mean permanent use of very rare
> > resources.  So there needs to be a control at the finest granularity as well.
> > Which order those get added to userspace tools doesn't matter to me.
> > 
> > If you mean that interface in kernel it brings some non trivial requirements.
> > The kernel would need all of:
> > 1) Tracking interface for all error records so the reverse map from region
> >    to specific bank / row etc is available for a subset of entries.  The
> >    kernel would need to know which of those are important (soft offline
> >    might help in that use case, otherwise that means decision algorithms
> >    are in kernel or we have fine grained queue for region repair in parallel
> >    with soft-offline).
> > 2) A way to inject the reverse map information from a userspace store
> >   (to deal with reboot etc).  
> 
> Not a way to inject the reverse map information, a way to inject the
> error records and assert that memory topology changes have not
> invalidated those records.

There is no way to tell that the topology hasn't changed.
For the reasons above, I don't think we care. Instead of trying to stop
userspace repairing the wrong memory, make sure it is safe for it to do that.
(The kernel is rarely in the business of preventing the slightly stupid)

> 
> > That sounds a lot harder to deal with than relying on the userspace program
> > that already does the tracking across boots.  
> 
> I am stuck behind the barrier that userspace must not assume it knows
> better than the kernel about the SPA impact of a DIMM sparing
> event. The kernel needs evidence: either live records from within the
> same kernel boot or validated records from a previous boot.

I think this is the wrong approach.  The operation must be 'safe'.
With that in place we absolutely can let userspace assume it knows better than
the kernel. 

> 
> ...devices could also help us out here with a way to replay DIMM error
> events. That would allow for refreshing error records even if the
> memory topology changes because the new record would generate a refreshed
> SPA:HPA:DPA:DIMM tuple.

Maybe, but I'm not seeing the necessity.

> 
> > > Anything more complex than a set of /sys/devices/system/memory/
> > > devices has a /sys/bus/edac/devices/devX/repair button, feels like a
> > > generation ahead of where the initial sophistication needs to lie.
> > > 
> > > That said, I do not closely follow ras tooling to say whether someone
> > > has already identified the critical need for a fine grained repair ABI?  
> > 
> > It's not that we necessarily want to repair at fine grain, it's that
> > the control interface to hardware is fine grained and the reverse mapping
> > often unknown except for specific error records.
> > 
> > I'm fully on board with simple interfaces for common cases like repair
> > the bad memory in this region.  I'm just strongly against moving the
> > complexity of doing that into the kernel.  
> 
> Yes, we are just caught up on where that "...but no simpler" line is
> drawn.
> 
Sure.  For now, I've proposed we split the two cases.
1) HPA / DPA repair (PPR)
2) Memory topology based repair (Sparing)

If we can make progress on (1) perhaps we can come to a conclusion on what
is required.

Note that so far I see no reason why either should do any checking against
errors observed by the kernel given the security guarantees above.
Userspace can repair the wrong bit of memory. That's pointless and burning
limited resources, but not a security problem.

Jonathan
Mauro Carvalho Chehab Jan. 14, 2025, 11:47 a.m. UTC | #15
On Mon, 6 Jan 2025 12:10:00 +0000
<shiju.jose@huawei.com> wrote:

> From: Shiju Jose <shiju.jose@huawei.com>
> 
> Add a generic EDAC memory repair control driver to manage memory repairs
> in the system, such as CXL Post Package Repair (PPR) and CXL memory sparing
> features.
> 
> For example, a CXL device with DRAM components that support PPR features
> may implement PPR maintenance operations. DRAM components may support two
> types of PPR, hard PPR, for a permanent row repair, and soft PPR,  for a
> temporary row repair. Soft PPR is much faster than hard PPR, but the repair
> is lost with a power cycle.
> Similarly a CXL memory device may support soft and hard memory sparing at
> cacheline, row, bank and rank granularities. Memory sparing is defined as
> a repair function that replaces a portion of memory with a portion of
> functional memory at that same granularity.
> When a CXL device detects an error in a memory, it may report the host of
> the need for a repair maintenance operation by using an event record where
> the "maintenance needed" flag is set. The event records contains the device
> physical address(DPA) and other attributes of the memory to repair (such as
> channel, sub-channel, bank group, bank, rank, row, column etc). The kernel
> will report the corresponding CXL general media or DRAM trace event to
> userspace, and userspace tools (e.g. rasdaemon) will initiate a repair
> operation in response to the device request via the sysfs repair control.
> 
> Device with memory repair features registers with EDAC device driver,
> which retrieves memory repair descriptor from EDAC memory repair driver
> and exposes the sysfs repair control attributes to userspace in
> /sys/bus/edac/devices/<dev-name>/mem_repairX/.
> 
> The common memory repair control interface abstracts the control of
> arbitrary memory repair functionality into a standardized set of functions.
> The sysfs memory repair attribute nodes are only available if the client
> driver has implemented the corresponding attribute callback function and
> provided operations to the EDAC device driver during registration.
> 
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> ---
>  .../ABI/testing/sysfs-edac-memory-repair      | 244 +++++++++
>  Documentation/edac/features.rst               |   3 +
>  Documentation/edac/index.rst                  |   1 +
>  Documentation/edac/memory_repair.rst          | 101 ++++
>  drivers/edac/Makefile                         |   2 +-
>  drivers/edac/edac_device.c                    |  33 ++
>  drivers/edac/mem_repair.c                     | 492 ++++++++++++++++++
>  include/linux/edac.h                          | 139 +++++
>  8 files changed, 1014 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-edac-memory-repair
>  create mode 100644 Documentation/edac/memory_repair.rst
>  create mode 100755 drivers/edac/mem_repair.c
> 
> diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair b/Documentation/ABI/testing/sysfs-edac-memory-repair
> new file mode 100644
> index 000000000000..e9268f3780ed
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-edac-memory-repair
> @@ -0,0 +1,244 @@
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory
> +		pertains to the memory media repair features control, such as
> +		PPR (Post Package Repair), memory sparing etc, where<dev-name>
> +		directory corresponds to a device registered with the EDAC
> +		device driver for the memory repair features.
> +
> +		Post Package Repair is a maintenance operation that requests the memory
> +		device to perform a repair operation on its media; in detail, it is a
> +		memory self-healing feature that fixes a failing memory location by
> +		replacing it with a spare row in a DRAM device. For example, a
> +		CXL memory device with DRAM components that support PPR features may
> +		implement PPR maintenance operations. DRAM components may support
> +		two types of PPR functions: hard PPR, for a permanent row repair, and
> +		soft PPR, for a temporary row repair. soft PPR is much faster than
> +		hard PPR, but the repair is lost with a power cycle.
> +
> +		Memory sparing is a repair function that replaces a portion
> +		of memory with a portion of functional memory at that same
> +		sparing granularity. Memory sparing has cacheline/row/bank/rank
> +		sparing granularities. For example, in memory-sparing mode,
> +		one memory rank serves as a spare for other ranks on the same
> +		channel in case they fail. The spare rank is held in reserve and
> +		not used as active memory until a failure is indicated, with
> +		reserved capacity subtracted from the total available memory
> +		in the system. The DIMM installation order for memory sparing
> +		varies based on the number of processors and memory modules
> +		installed in the server. After an error threshold is surpassed
> +		in a system protected by memory sparing, the content of a failing
> +		rank of DIMMs is copied to the spare rank. The failing rank is
> +		then taken offline and the spare rank placed online for use as
> +		active memory in place of the failed rank.
> +
> +		The sysfs attributes nodes for a repair feature are only
> +		present if the parent driver has implemented the corresponding
> +		attr callback function and provided the necessary operations
> +		to the EDAC device driver during registration.
> +
> +		In some states of system configuration (e.g. before address
> +		decoders have been configured), memory devices (e.g. CXL)
> +		may not have an active mapping in the main host
> +		physical address map. As such, the memory to repair must be
> +		identified by a device specific physical addressing scheme
> +		using a device physical address(DPA). The DPA and other control
> +		attributes to use will be presented in related error records.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RO) Memory repair function type. For eg. post package repair,
> +		memory sparing etc.
> +		EDAC_SOFT_PPR - Soft post package repair
> +		EDAC_HARD_PPR - Hard post package repair
> +		EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
> +		EDAC_ROW_MEM_SPARING - Row memory sparing
> +		EDAC_BANK_MEM_SPARING - Bank memory sparing
> +		EDAC_RANK_MEM_SPARING - Rank memory sparing
> +		All other values are reserved.

Those strings are too big. Why are they in upper case? IMO:

	soft-ppr, hard-ppr, ... would be enough.

Also, is it mandatory that all types are supported? If not, you need a
way to report to userspace which of them are supported. One option
would be that reading /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
would return something like:

	soft-ppr [hard-ppr] row-mem-sparing
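
A rough sketch of such a show(), with invented fields, following the usual
"[current]" convention:

static const char * const repair_names[] = {
	"soft-ppr", "hard-ppr", "cacheline-mem-sparing",
	"row-mem-sparing", "bank-mem-sparing", "rank-mem-sparing",
};

static ssize_t repair_function_show(struct device *dev,
				    struct device_attribute *attr, char *buf)
{
	struct mem_repair_feat *r = to_mem_repair_feat(dev); /* invented */
	ssize_t len = 0;
	unsigned int i;

	for_each_set_bit(i, &r->supported, ARRAY_SIZE(repair_names))
		len += sysfs_emit_at(buf, len, i == r->current_fn ?
				     "[%s] " : "%s ", repair_names[i]);
	len += sysfs_emit_at(buf, len, "\n");
	return len;
}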

Also, as this will be parsed in ReST format, you need to change the
description to use bullets, otherwise the html/pdf version of the
document will place everything on a single line. E.g. something like:

Description:
		(RO) Memory repair function type. For eg. post package repair,
		memory sparing etc. Can be:

		- EDAC_SOFT_PPR - Soft post package repair
		- EDAC_HARD_PPR - Hard post package repair
		- EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
		- EDAC_ROW_MEM_SPARING - Row memory sparing
		- EDAC_BANK_MEM_SPARING - Bank memory sparing
		- EDAC_RANK_MEM_SPARING - Rank memory sparing
		- All other values are reserved.

Same applies to other sysfs nodes. See for instance:

	Documentation/ABI/stable/sysfs-class-backlight

And see how it is formatted after Sphinx processing at the Kernel
Admin guide:

	https://www.kernel.org/doc/html/latest/admin-guide/abi-stable.html#symbols-under-sys-class

Please fix it on all places you have a list of values.
	
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) Read/Write the current persist repair mode set for a
> +		repair function. The persist repair modes supported by the
> +		device depend on the memory repair function: the repair is
> +		either temporary, and lost with a power cycle, or permanent.
> +		EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
> +		EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
> +		All other values are reserved.

Same here: edac/ is already in the path. No need to place EDAC_ in the name.

> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa_support
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RO) True if memory device requires device physical
> +		address (DPA) of memory to repair.
> +		False if memory device requires host specific physical
> +                address (HPA) of memory to repair.

Please remove the extra spaces before "address", as otherwise conversion to
ReST may do the wrong thing or may produce doc warnings.

> +		In some states of system configuration (e.g. before address
> +		decoders have been configured), memory devices (e.g. CXL)
> +		may not have an active mapping in the main host
> +		physical address map. As such, the memory to repair must be
> +		identified by a device specific physical addressing scheme
> +		using a DPA. The device physical address(DPA) to use will be
> +		presented in related error records.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_safe_when_in_use
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RO) True if memory media is accessible and data is retained
> +		during the memory repair operation.
> +		The data may not be retained and memory requests may not be
> +		correctly processed during a repair operation. In such a case
> +		the repair operation should not be executed at runtime.

Please add an extra line before "The data" to ensure that the output at
the admin-guide won't merge the two paragraphs. The same applies in other places
in this patch series: paragraphs need a blank line between them in the description.

> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/hpa
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) Host Physical Address (HPA) of the memory to repair.
> +		See attribute 'dpa_support' for more details.
> +		The HPA to use will be provided in related error records.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) Device Physical Address (DPA) of the memory to repair.
> +		See attribute 'dpa_support' for more details.
> +		The specific DPA to use will be provided in related error
> +		records.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/nibble_mask
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) Read/Write Nibble mask of the memory to repair.
> +		Nibble mask identifies one or more nibbles in error on the
> +		memory bus that produced the error event. Nibble Mask bit 0
> +		shall be set if nibble 0 on the memory bus produced the
> +		event, etc. For example, CXL PPR and sparing, a nibble mask
> +		bit set to 1 indicates the request to perform repair
> +		operation in the specific device. All nibble mask bits set
> +		to 1 indicates the request to perform the operation in all
> +		devices. For CXL memory to repair, the specific value of
> +		nibble mask to use will be provided in related error records.
> +		For more details, See nibble mask field in CXL spec ver 3.1,
> +		section 8.2.9.7.1.2 Table 8-103 soft PPR and section
> +		8.2.9.7.1.3 Table 8-104 hard PPR, section 8.2.9.7.1.4
> +		Table 8-105 memory sparing.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank_group
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/rank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/row
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/column
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/channel
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/sub_channel
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) The control attributes associated with memory address
> +		that is to be repaired. The specific value of attributes to
> +		use depends on the portion of memory to repair and may be
> +		reported to host in related error records and may be
> +		available to userspace in trace events, such as in CXL
> +		memory devices.
> +
> +		channel - The channel of the memory to repair. Channel is
> +		defined as an interface that can be independently accessed
> +		for a transaction.
> +		rank - The rank of the memory to repair. Rank is defined as a
> +		set of memory devices on a channel that together execute a
> +		transaction.
> +		bank_group - The bank group of the memory to repair.
> +		bank - The bank number of the memory to repair.
> +		row - The row number of the memory to repair.
> +		column - The column number of the memory to repair.
> +		sub_channel - The sub-channel of the memory to repair.

Same problem here with regards to bad ReST input. I would do:

	channel
		The channel of the memory to repair. Channel is
		defined as an interface that can be independently accessed
		for a transaction.

	rank
		The rank of the memory to repair. Rank is defined as a
		set of memory devices on a channel that together execute a
		transaction.

as this would provide better output in the admin-guide while still being
nice to read as plain text.

> +
> +		The requirement to set these attributes varies based on the
> +		repair function. The attributes in sysfs are not present
> +		unless required for a repair function.
> +		For example, CXL spec ver 3.1, Section 8.2.9.7.1.2 Table 8-103
> +		soft PPR and Section 8.2.9.7.1.3 Table 8-104 hard PPR operations,
> +		these attributes are not required to be set.
> +		For example, CXL spec ver 3.1, Section 8.2.9.7.1.4 Table 8-105
> +		memory sparing, these attributes are required to be set based on
> +		memory sparing granularity as follows.
> +		Channel: Channel associated with the DPA that is to be spared
> +		and applies to all subclasses of sparing (cacheline, bank,
> +		row and rank sparing).
> +		Rank: Rank associated with the DPA that is to be spared and
> +		applies to all subclasses of sparing.
> +		Bank & Bank Group: Bank & bank group are associated with
> +		the DPA that is to be spared and applies to cacheline sparing,
> +		row sparing and bank sparing subclasses.
> +		Row: Row associated with the DPA that is to be spared and
> +		applies to cacheline sparing and row sparing subclasses.
> +		Column: column associated with the DPA that is to be spared
> +		and applies to cacheline sparing only.
> +		Sub-channel: sub-channel associated with the DPA that is to
> +		be spared and applies to cacheline sparing only.

Same here: this will all end up in a single paragraph, which would be really
weird.

> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_nibble_mask
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank_group
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_rank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_row
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_column
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_channel
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_sub_channel
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_nibble_mask
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank_group
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_rank
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_row
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_column
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_channel
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_sub_channel
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) The supported range of control attributes (optional)
> +		associated with a memory address that is to be repaired.
> +		The memory device may give the supported range of
> +		attributes to use and it will depend on the memory device
> +		and the portion of memory to repair.
> +		The userspace may receive the specific value of attributes
> +		to use for a repair operation from the memory device via
> +		related error records and trace events, such as in CXL
> +		memory devices.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(WO) Issue the memory repair operation for the specified
> +		memory repair attributes. The operation may fail if resources
> +		are insufficient based on the requirements of the memory
> +		device and repair function.
> +		EDAC_DO_MEM_REPAIR - issue repair operation.
> +		All other values are reserved.
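
FWIW, tying the ABI together: a minimal userspace sequence for a DPA-based
repair could look like the sketch below. The <dev-name>, instance, DPA and
nibble mask values are made up here; in practice they come from the platform
and from the CXL error record:

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	/* Hypothetical device and instance. */
	#define REPAIR_DIR "/sys/bus/edac/devices/cxl_mem0/mem_repair0/"

	static int write_attr(const char *attr, const char *val)
	{
		char path[256];
		ssize_t n;
		int fd;

		snprintf(path, sizeof(path), REPAIR_DIR "%s", attr);
		fd = open(path, O_WRONLY);
		if (fd < 0)
			return -1;
		n = write(fd, val, strlen(val));
		close(fd);
		return n == (ssize_t)strlen(val) ? 0 : -1;
	}

	int main(void)
	{
		/* Values as reported in the CXL DRAM event record. */
		if (write_attr("dpa", "0x4000") ||
		    write_attr("nibble_mask", "0x3") ||
		    write_attr("repair", "1"))	/* 1 == EDAC_DO_MEM_REPAIR */
			perror("mem_repair");
		return 0;
	}
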
> diff --git a/Documentation/edac/features.rst b/Documentation/edac/features.rst
> index ba3ab993ee4f..bfd5533b81b7 100644
> --- a/Documentation/edac/features.rst
> +++ b/Documentation/edac/features.rst
> @@ -97,3 +97,6 @@ RAS features
>  ------------
>  1. Memory Scrub
>  Memory scrub features are documented in `Documentation/edac/scrub.rst`.
> +
> +2. Memory Repair
> +Memory repair features are documented in `Documentation/edac/memory_repair.rst`.
> diff --git a/Documentation/edac/index.rst b/Documentation/edac/index.rst
> index dfb0c9fb9ab1..d6778f4562dd 100644
> --- a/Documentation/edac/index.rst
> +++ b/Documentation/edac/index.rst
> @@ -8,4 +8,5 @@ EDAC Subsystem
>     :maxdepth: 1
>  
>     features
> +   memory_repair
>     scrub
> diff --git a/Documentation/edac/memory_repair.rst b/Documentation/edac/memory_repair.rst
> new file mode 100644
> index 000000000000..2787a8a2d6ba
> --- /dev/null
> +++ b/Documentation/edac/memory_repair.rst
> @@ -0,0 +1,101 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==========================
> +EDAC Memory Repair Control
> +==========================
> +
> +Copyright (c) 2024 HiSilicon Limited.
> +
> +:Author:   Shiju Jose <shiju.jose@huawei.com>
> +:License:  The GNU Free Documentation License, Version 1.2
> +          (dual licensed under the GPL v2)
> +:Original Reviewers:
> +
> +- Written for: 6.14

See my comments with regards to license on the previous patches.

> +
> +Introduction
> +------------
> +Memory devices may support repair operations to address issues in their
> +memory media. Post Package Repair (PPR) and memory sparing are examples
> +of such features.
> +
> +Post Package Repair(PPR)
> +~~~~~~~~~~~~~~~~~~~~~~~~
> +Post Package Repair is a maintenance operation that requests the memory
> +device to perform a repair operation on its media. In detail, it is a memory
> +self-healing feature that fixes a failing memory location by replacing it
> +with a spare row in a DRAM device. For example, a CXL memory device with
> +DRAM components that support PPR features may implement PPR maintenance
> +operations. DRAM components may support two types of PPR functions: hard
> +PPR, for a permanent row repair, and soft PPR, for a temporary row repair.
> +Soft PPR is much faster than hard PPR, but the repair is lost with a power
> +cycle. The data may not be retained and memory requests may not be correctly
> +processed during a repair operation. In such a case, the repair operation
> +should not be executed at runtime.
> +For example, for CXL memory devices, soft PPR and hard PPR repair operations
> +may be supported. See CXL spec rev 3.1 sections 8.2.9.7.1.1 PPR Maintenance
> +Operations, 8.2.9.7.1.2 sPPR Maintenance Operation and 8.2.9.7.1.3 hPPR
> +Maintenance Operation for more details.

Paragraphs require blank lines in ReST. Also, please place a link to the
specs.

I strongly suggest looking at the output of all docs with make htmldocs
and make pdfdocs to be sure that the paragraphs and the final document
will be properly handled. You may use:

	SPHINXDIRS="<book name(s)>"

to speed-up documentation builds.

Please see Sphinx documentation for more details about what it is expected
there:

	https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html

> +
> +Memory Sparing
> +~~~~~~~~~~~~~~
> +Memory sparing is a repair function that replaces a portion of memory with
> +a portion of functional memory at that same sparing granularity. Memory
> +sparing has cacheline/row/bank/rank sparing granularities. For example, in
> +memory-sparing mode, one memory rank serves as a spare for other ranks on
> +the same channel in case they fail. The spare rank is held in reserve and
> +not used as active memory until a failure is indicated, with reserved
> +capacity subtracted from the total available memory in the system. The DIMM
> +installation order for memory sparing varies based on the number of processors
> +and memory modules installed in the server. After an error threshold is
> +surpassed in a system protected by memory sparing, the content of a failing
> +rank of DIMMs is copied to the spare rank. The failing rank is then taken
> +offline and the spare rank placed online for use as active memory in place
> +of the failed rank.
> +
> +For example, CXL memory devices may support various subclasses of sparing
> +operations that vary in terms of the scope of the sparing being performed.
> +Cacheline sparing subclass refers to a sparing action that can replace a
> +full cacheline. Row sparing is provided as an alternative to PPR sparing
> +functions and its scope is that of a single DDR row. Bank sparing allows
> +an entire bank to be replaced. Rank sparing is defined as an operation
> +in which an entire DDR rank is replaced. See CXL spec 3.1 section
> +8.2.9.7.1.4 Memory Sparing Maintenance Operations for more details.
> +
> +Use cases of generic memory repair features control
> +---------------------------------------------------
> +
> +1. The soft PPR, hard PPR and memory-sparing features share similar
> +control attributes. Therefore, there is a need for a standardized, generic
> +sysfs repair control that is exposed to userspace and used by
> +administrators, scripts and tools.
> +
> +2. When a CXL device detects an error in a memory component, it may inform
> +the host of the need for a repair maintenance operation by using an event
> +record where the "maintenance needed" flag is set. The event record
> +specifies the device physical address(DPA) and attributes of the memory that
> +requires repair. The kernel reports the corresponding CXL general media or
> +DRAM trace event to userspace, and userspace tools (e.g. rasdaemon) initiate
> +a repair maintenance operation in response to the device request using the
> +sysfs repair control.
> +
> +3. Userspace tools, such as rasdaemon, may request a PPR/sparing on a memory
> +region when an uncorrected memory error or an excess of corrected memory
> +errors is reported on that memory.
> +
> +4. Multiple PPR/sparing instances may be present per memory device.
> +
> +The File System
> +---------------
> +
> +The control attributes of a registered memory repair instance can be
> +accessed under:
> +
> +/sys/bus/edac/devices/<dev-name>/mem_repairX/
> +
> +sysfs
> +-----
> +
> +Sysfs files are documented in
> +
> +`Documentation/ABI/testing/sysfs-edac-memory-repair`.
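
For orientation, with one repair instance registered a device would end up
exposing nodes along these lines (which attributes actually appear depends on
the ops the client driver implements; paths are illustrative only):

	/sys/bus/edac/devices/<dev-name>/mem_repair0/repair_function
	/sys/bus/edac/devices/<dev-name>/mem_repair0/persist_mode
	/sys/bus/edac/devices/<dev-name>/mem_repair0/dpa
	/sys/bus/edac/devices/<dev-name>/mem_repair0/nibble_mask
	/sys/bus/edac/devices/<dev-name>/mem_repair0/repair
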
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index 3a49304860f0..1de9fe66ac6b 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -10,7 +10,7 @@ obj-$(CONFIG_EDAC)			:= edac_core.o
>  
>  edac_core-y	:= edac_mc.o edac_device.o edac_mc_sysfs.o
>  edac_core-y	+= edac_module.o edac_device_sysfs.o wq.o
> -edac_core-y	+= scrub.o ecs.o
> +edac_core-y	+= scrub.o ecs.o mem_repair.o
>  
>  edac_core-$(CONFIG_EDAC_DEBUG)		+= debugfs.o
>  
> diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
> index 1c1142a2e4e4..a401d81dad8a 100644
> --- a/drivers/edac/edac_device.c
> +++ b/drivers/edac/edac_device.c
> @@ -575,6 +575,7 @@ static void edac_dev_release(struct device *dev)
>  {
>  	struct edac_dev_feat_ctx *ctx = container_of(dev, struct edac_dev_feat_ctx, dev);
>  
> +	kfree(ctx->mem_repair);
>  	kfree(ctx->scrub);
>  	kfree(ctx->dev.groups);
>  	kfree(ctx);
> @@ -611,6 +612,7 @@ int edac_dev_register(struct device *parent, char *name,
>  	const struct attribute_group **ras_attr_groups;
>  	struct edac_dev_data *dev_data;
>  	struct edac_dev_feat_ctx *ctx;
> +	int mem_repair_cnt = 0;
>  	int attr_gcnt = 0;
>  	int scrub_cnt = 0;
>  	int ret, feat;
> @@ -628,6 +630,10 @@ int edac_dev_register(struct device *parent, char *name,
>  		case RAS_FEAT_ECS:
>  			attr_gcnt += ras_features[feat].ecs_info.num_media_frus;
>  			break;
> +		case RAS_FEAT_MEM_REPAIR:
> +			attr_gcnt++;
> +			mem_repair_cnt++;
> +			break;
>  		default:
>  			return -EINVAL;
>  		}
> @@ -651,8 +657,17 @@ int edac_dev_register(struct device *parent, char *name,
>  		}
>  	}
>  
> +	if (mem_repair_cnt) {
> +		ctx->mem_repair = kcalloc(mem_repair_cnt, sizeof(*ctx->mem_repair), GFP_KERNEL);
> +		if (!ctx->mem_repair) {
> +			ret = -ENOMEM;
> +			goto data_mem_free;
> +		}
> +	}
> +
>  	attr_gcnt = 0;
>  	scrub_cnt = 0;
> +	mem_repair_cnt = 0;
>  	for (feat = 0; feat < num_features; feat++, ras_features++) {
>  		switch (ras_features->ft_type) {
>  		case RAS_FEAT_SCRUB:
> @@ -686,6 +701,23 @@ int edac_dev_register(struct device *parent, char *name,
>  
>  			attr_gcnt += ras_features->ecs_info.num_media_frus;
>  			break;
> +		case RAS_FEAT_MEM_REPAIR:
> +			if (!ras_features->mem_repair_ops ||
> +			    mem_repair_cnt != ras_features->instance)
> +				goto data_mem_free;
> +
> +			dev_data = &ctx->mem_repair[mem_repair_cnt];
> +			dev_data->instance = mem_repair_cnt;
> +			dev_data->mem_repair_ops = ras_features->mem_repair_ops;
> +			dev_data->private = ras_features->ctx;
> +			ret = edac_mem_repair_get_desc(parent, &ras_attr_groups[attr_gcnt],
> +						       ras_features->instance);
> +			if (ret)
> +				goto data_mem_free;
> +
> +			mem_repair_cnt++;
> +			attr_gcnt++;
> +			break;
>  		default:
>  			ret = -EINVAL;
>  			goto data_mem_free;
> @@ -712,6 +744,7 @@ int edac_dev_register(struct device *parent, char *name,
>  	return devm_add_action_or_reset(parent, edac_dev_unreg, &ctx->dev);
>  
>  data_mem_free:
> +	kfree(ctx->mem_repair);
>  	kfree(ctx->scrub);
>  groups_free:
>  	kfree(ras_attr_groups);
> diff --git a/drivers/edac/mem_repair.c b/drivers/edac/mem_repair.c
> new file mode 100755
> index 000000000000..e7439fd26c41
> --- /dev/null
> +++ b/drivers/edac/mem_repair.c
> @@ -0,0 +1,492 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * The generic EDAC memory repair driver is designed to control the memory
> + * devices with memory repair features, such as Post Package Repair (PPR),
> + * memory sparing etc. The common sysfs memory repair interface abstracts
> + * the control of various arbitrary memory repair functionalities into a
> + * unified set of functions.
> + *
> + * Copyright (c) 2024 HiSilicon Limited.
> + */
> +
> +#include <linux/edac.h>
> +
> +enum edac_mem_repair_attributes {
> +	MEM_REPAIR_FUNCTION,
> +	MEM_REPAIR_PERSIST_MODE,
> +	MEM_REPAIR_DPA_SUPPORT,
> +	MEM_REPAIR_SAFE_IN_USE,
> +	MEM_REPAIR_HPA,
> +	MEM_REPAIR_MIN_HPA,
> +	MEM_REPAIR_MAX_HPA,
> +	MEM_REPAIR_DPA,
> +	MEM_REPAIR_MIN_DPA,
> +	MEM_REPAIR_MAX_DPA,
> +	MEM_REPAIR_NIBBLE_MASK,
> +	MEM_REPAIR_MIN_NIBBLE_MASK,
> +	MEM_REPAIR_MAX_NIBBLE_MASK,
> +	MEM_REPAIR_BANK_GROUP,
> +	MEM_REPAIR_MIN_BANK_GROUP,
> +	MEM_REPAIR_MAX_BANK_GROUP,
> +	MEM_REPAIR_BANK,
> +	MEM_REPAIR_MIN_BANK,
> +	MEM_REPAIR_MAX_BANK,
> +	MEM_REPAIR_RANK,
> +	MEM_REPAIR_MIN_RANK,
> +	MEM_REPAIR_MAX_RANK,
> +	MEM_REPAIR_ROW,
> +	MEM_REPAIR_MIN_ROW,
> +	MEM_REPAIR_MAX_ROW,
> +	MEM_REPAIR_COLUMN,
> +	MEM_REPAIR_MIN_COLUMN,
> +	MEM_REPAIR_MAX_COLUMN,
> +	MEM_REPAIR_CHANNEL,
> +	MEM_REPAIR_MIN_CHANNEL,
> +	MEM_REPAIR_MAX_CHANNEL,
> +	MEM_REPAIR_SUB_CHANNEL,
> +	MEM_REPAIR_MIN_SUB_CHANNEL,
> +	MEM_REPAIR_MAX_SUB_CHANNEL,
> +	MEM_DO_REPAIR,
> +	MEM_REPAIR_MAX_ATTRS
> +};
> +
> +struct edac_mem_repair_dev_attr {
> +	struct device_attribute dev_attr;
> +	u8 instance;
> +};
> +
> +struct edac_mem_repair_context {
> +	char name[EDAC_FEAT_NAME_LEN];
> +	struct edac_mem_repair_dev_attr mem_repair_dev_attr[MEM_REPAIR_MAX_ATTRS];
> +	struct attribute *mem_repair_attrs[MEM_REPAIR_MAX_ATTRS + 1];
> +	struct attribute_group group;
> +};
> +
> +#define TO_MEM_REPAIR_DEV_ATTR(_dev_attr)      \
> +		container_of(_dev_attr, struct edac_mem_repair_dev_attr, dev_attr)
> +
> +#define EDAC_MEM_REPAIR_ATTR_SHOW(attrib, cb, type, format)			\
> +static ssize_t attrib##_show(struct device *ras_feat_dev,			\
> +			     struct device_attribute *attr, char *buf)		\
> +{										\
> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;			\
> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
> +	const struct edac_mem_repair_ops *ops =					\
> +				ctx->mem_repair[inst].mem_repair_ops;		\
> +	type data;								\
> +	int ret;								\
> +										\
> +	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private,	\
> +		      &data);							\
> +	if (ret)								\
> +		return ret;							\
> +										\
> +	return sysfs_emit(buf, format, data);					\
> +}
> +
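
Side note for readers following along: each instantiation below expands to a
plain sysfs show() routine. For instance, EDAC_MEM_REPAIR_ATTR_SHOW(dpa,
get_dpa, u64, "0x%llx\n") becomes roughly:

	static ssize_t dpa_show(struct device *ras_feat_dev,
				struct device_attribute *attr, char *buf)
	{
		u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;
		struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);
		const struct edac_mem_repair_ops *ops =
					ctx->mem_repair[inst].mem_repair_ops;
		u64 data;
		int ret;

		/* Forward to the client driver's get_dpa() callback. */
		ret = ops->get_dpa(ras_feat_dev->parent,
				   ctx->mem_repair[inst].private, &data);
		if (ret)
			return ret;

		return sysfs_emit(buf, "0x%llx\n", data);
	}
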
> +EDAC_MEM_REPAIR_ATTR_SHOW(repair_function, get_repair_function, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(persist_mode, get_persist_mode, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(dpa_support, get_dpa_support, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(repair_safe_when_in_use, get_repair_safe_when_in_use, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(hpa, get_hpa, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_hpa, get_min_hpa, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_hpa, get_max_hpa, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(dpa, get_dpa, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_dpa, get_min_dpa, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_dpa, get_max_dpa, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(nibble_mask, get_nibble_mask, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_nibble_mask, get_min_nibble_mask, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_nibble_mask, get_max_nibble_mask, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(bank_group, get_bank_group, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank_group, get_min_bank_group, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank_group, get_max_bank_group, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(bank, get_bank, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank, get_min_bank, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank, get_max_bank, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(rank, get_rank, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_rank, get_min_rank, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_rank, get_max_rank, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(row, get_row, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_row, get_min_row, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_row, get_max_row, u64, "0x%llx\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(column, get_column, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_column, get_min_column, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_column, get_max_column, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(channel, get_channel, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_channel, get_min_channel, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_channel, get_max_channel, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(sub_channel, get_sub_channel, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(min_sub_channel, get_min_sub_channel, u32, "%u\n")
> +EDAC_MEM_REPAIR_ATTR_SHOW(max_sub_channel, get_max_sub_channel, u32, "%u\n")
> +
> +#define EDAC_MEM_REPAIR_ATTR_STORE(attrib, cb, type, conv_func)			\
> +static ssize_t attrib##_store(struct device *ras_feat_dev,			\
> +			      struct device_attribute *attr,			\
> +			      const char *buf, size_t len)			\
> +{										\
> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;			\
> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
> +	const struct edac_mem_repair_ops *ops =					\
> +				ctx->mem_repair[inst].mem_repair_ops;		\
> +	type data;								\
> +	int ret;								\
> +										\
> +	ret = conv_func(buf, 0, &data);						\
> +	if (ret < 0)								\
> +		return ret;							\
> +										\
> +	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private,	\
> +		      data);							\
> +	if (ret)								\
> +		return ret;							\
> +										\
> +	return len;								\
> +}
> +
> +EDAC_MEM_REPAIR_ATTR_STORE(persist_mode, set_persist_mode, unsigned long, kstrtoul)
> +EDAC_MEM_REPAIR_ATTR_STORE(hpa, set_hpa, u64, kstrtou64)
> +EDAC_MEM_REPAIR_ATTR_STORE(dpa, set_dpa, u64, kstrtou64)
> +EDAC_MEM_REPAIR_ATTR_STORE(nibble_mask, set_nibble_mask, u64, kstrtou64)
> +EDAC_MEM_REPAIR_ATTR_STORE(bank_group, set_bank_group, unsigned long, kstrtoul)
> +EDAC_MEM_REPAIR_ATTR_STORE(bank, set_bank, unsigned long, kstrtoul)
> +EDAC_MEM_REPAIR_ATTR_STORE(rank, set_rank, unsigned long, kstrtoul)
> +EDAC_MEM_REPAIR_ATTR_STORE(row, set_row, u64, kstrtou64)
> +EDAC_MEM_REPAIR_ATTR_STORE(column, set_column, unsigned long, kstrtoul)
> +EDAC_MEM_REPAIR_ATTR_STORE(channel, set_channel, unsigned long, kstrtoul)
> +EDAC_MEM_REPAIR_ATTR_STORE(sub_channel, set_sub_channel, unsigned long, kstrtoul)
> +
> +#define EDAC_MEM_REPAIR_DO_OP(attrib, cb)						\
> +static ssize_t attrib##_store(struct device *ras_feat_dev,				\
> +			      struct device_attribute *attr,				\
> +			      const char *buf, size_t len)				\
> +{											\
> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;				\
> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);			\
> +	const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops;	\
> +	unsigned long data;								\
> +	int ret;									\
> +											\
> +	ret = kstrtoul(buf, 0, &data);							\
> +	if (ret < 0)									\
> +		return ret;								\
> +											\
> +	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, data);	\
> +	if (ret)									\
> +		return ret;								\
> +											\
> +	return len;									\
> +}
> +
> +EDAC_MEM_REPAIR_DO_OP(repair, do_repair)
> +
> +static umode_t mem_repair_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id)
> +{
> +	struct device *ras_feat_dev = kobj_to_dev(kobj);
> +	struct device_attribute *dev_attr = container_of(a, struct device_attribute, attr);
> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);
> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(dev_attr)->instance;
> +	const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops;
> +
> +	switch (attr_id) {
> +	case MEM_REPAIR_FUNCTION:
> +		if (ops->get_repair_function)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_PERSIST_MODE:
> +		if (ops->get_persist_mode) {
> +			if (ops->set_persist_mode)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_DPA_SUPPORT:
> +		if (ops->get_dpa_support)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_SAFE_IN_USE:
> +		if (ops->get_repair_safe_when_in_use)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_HPA:
> +		if (ops->get_hpa) {
> +			if (ops->set_hpa)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_HPA:
> +		if (ops->get_min_hpa)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_HPA:
> +		if (ops->get_max_hpa)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_DPA:
> +		if (ops->get_dpa) {
> +			if (ops->set_dpa)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_DPA:
> +		if (ops->get_min_dpa)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_DPA:
> +		if (ops->get_max_dpa)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_NIBBLE_MASK:
> +		if (ops->get_nibble_mask) {
> +			if (ops->set_nibble_mask)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_NIBBLE_MASK:
> +		if (ops->get_min_nibble_mask)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_NIBBLE_MASK:
> +		if (ops->get_max_nibble_mask)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_BANK_GROUP:
> +		if (ops->get_bank_group) {
> +			if (ops->set_bank_group)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_BANK_GROUP:
> +		if (ops->get_min_bank_group)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_BANK_GROUP:
> +		if (ops->get_max_bank_group)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_BANK:
> +		if (ops->get_bank) {
> +			if (ops->set_bank)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_BANK:
> +		if (ops->get_min_bank)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_BANK:
> +		if (ops->get_max_bank)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_RANK:
> +		if (ops->get_rank) {
> +			if (ops->set_rank)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_RANK:
> +		if (ops->get_min_rank)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_RANK:
> +		if (ops->get_max_rank)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_ROW:
> +		if (ops->get_row) {
> +			if (ops->set_row)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_ROW:
> +		if (ops->get_min_row)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_ROW:
> +		if (ops->get_max_row)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_COLUMN:
> +		if (ops->get_column) {
> +			if (ops->set_column)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_COLUMN:
> +		if (ops->get_min_column)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_COLUMN:
> +		if (ops->get_max_column)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_CHANNEL:
> +		if (ops->get_channel) {
> +			if (ops->set_channel)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_CHANNEL:
> +		if (ops->get_min_channel)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_CHANNEL:
> +		if (ops->get_max_channel)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_SUB_CHANNEL:
> +		if (ops->get_sub_channel) {
> +			if (ops->set_sub_channel)
> +				return a->mode;
> +			else
> +				return 0444;
> +		}
> +		break;
> +	case MEM_REPAIR_MIN_SUB_CHANNEL:
> +		if (ops->get_min_sub_channel)
> +			return a->mode;
> +		break;
> +	case MEM_REPAIR_MAX_SUB_CHANNEL:
> +		if (ops->get_max_sub_channel)
> +			return a->mode;
> +		break;
> +	case MEM_DO_REPAIR:
> +		if (ops->do_repair)
> +			return a->mode;
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	return 0;
> +}
> +
> +#define EDAC_MEM_REPAIR_ATTR_RO(_name, _instance)       \
> +	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RO(_name), \
> +					     .instance = _instance })
> +
> +#define EDAC_MEM_REPAIR_ATTR_WO(_name, _instance)       \
> +	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_WO(_name), \
> +					     .instance = _instance })
> +
> +#define EDAC_MEM_REPAIR_ATTR_RW(_name, _instance)       \
> +	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RW(_name), \
> +					     .instance = _instance })
> +
> +static int mem_repair_create_desc(struct device *dev,
> +				  const struct attribute_group **attr_groups,
> +				  u8 instance)
> +{
> +	struct edac_mem_repair_context *ctx;
> +	struct attribute_group *group;
> +	int i;
> +	struct edac_mem_repair_dev_attr dev_attr[] = {
> +		[MEM_REPAIR_FUNCTION] = EDAC_MEM_REPAIR_ATTR_RO(repair_function,
> +							    instance),
> +		[MEM_REPAIR_PERSIST_MODE] =
> +				EDAC_MEM_REPAIR_ATTR_RW(persist_mode, instance),
> +		[MEM_REPAIR_DPA_SUPPORT] =
> +				EDAC_MEM_REPAIR_ATTR_RO(dpa_support, instance),
> +		[MEM_REPAIR_SAFE_IN_USE] =
> +				EDAC_MEM_REPAIR_ATTR_RO(repair_safe_when_in_use,
> +							instance),
> +		[MEM_REPAIR_HPA] = EDAC_MEM_REPAIR_ATTR_RW(hpa, instance),
> +		[MEM_REPAIR_MIN_HPA] = EDAC_MEM_REPAIR_ATTR_RO(min_hpa, instance),
> +		[MEM_REPAIR_MAX_HPA] = EDAC_MEM_REPAIR_ATTR_RO(max_hpa, instance),
> +		[MEM_REPAIR_DPA] = EDAC_MEM_REPAIR_ATTR_RW(dpa, instance),
> +		[MEM_REPAIR_MIN_DPA] = EDAC_MEM_REPAIR_ATTR_RO(min_dpa, instance),
> +		[MEM_REPAIR_MAX_DPA] = EDAC_MEM_REPAIR_ATTR_RO(max_dpa, instance),
> +		[MEM_REPAIR_NIBBLE_MASK] =
> +				EDAC_MEM_REPAIR_ATTR_RW(nibble_mask, instance),
> +		[MEM_REPAIR_MIN_NIBBLE_MASK] =
> +				EDAC_MEM_REPAIR_ATTR_RO(min_nibble_mask, instance),
> +		[MEM_REPAIR_MAX_NIBBLE_MASK] =
> +				EDAC_MEM_REPAIR_ATTR_RO(max_nibble_mask, instance),
> +		[MEM_REPAIR_BANK_GROUP] =
> +				EDAC_MEM_REPAIR_ATTR_RW(bank_group, instance),
> +		[MEM_REPAIR_MIN_BANK_GROUP] =
> +				EDAC_MEM_REPAIR_ATTR_RO(min_bank_group, instance),
> +		[MEM_REPAIR_MAX_BANK_GROUP] =
> +				EDAC_MEM_REPAIR_ATTR_RO(max_bank_group, instance),
> +		[MEM_REPAIR_BANK] = EDAC_MEM_REPAIR_ATTR_RW(bank, instance),
> +		[MEM_REPAIR_MIN_BANK] = EDAC_MEM_REPAIR_ATTR_RO(min_bank, instance),
> +		[MEM_REPAIR_MAX_BANK] = EDAC_MEM_REPAIR_ATTR_RO(max_bank, instance),
> +		[MEM_REPAIR_RANK] = EDAC_MEM_REPAIR_ATTR_RW(rank, instance),
> +		[MEM_REPAIR_MIN_RANK] = EDAC_MEM_REPAIR_ATTR_RO(min_rank, instance),
> +		[MEM_REPAIR_MAX_RANK] = EDAC_MEM_REPAIR_ATTR_RO(max_rank, instance),
> +		[MEM_REPAIR_ROW] = EDAC_MEM_REPAIR_ATTR_RW(row, instance),
> +		[MEM_REPAIR_MIN_ROW] = EDAC_MEM_REPAIR_ATTR_RO(min_row, instance),
> +		[MEM_REPAIR_MAX_ROW] = EDAC_MEM_REPAIR_ATTR_RO(max_row, instance),
> +		[MEM_REPAIR_COLUMN] = EDAC_MEM_REPAIR_ATTR_RW(column, instance),
> +		[MEM_REPAIR_MIN_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(min_column, instance),
> +		[MEM_REPAIR_MAX_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(max_column, instance),
> +		[MEM_REPAIR_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RW(channel, instance),
> +		[MEM_REPAIR_MIN_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(min_channel, instance),
> +		[MEM_REPAIR_MAX_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(max_channel, instance),
> +		[MEM_REPAIR_SUB_CHANNEL] =
> +				EDAC_MEM_REPAIR_ATTR_RW(sub_channel, instance),
> +		[MEM_REPAIR_MIN_SUB_CHANNEL] =
> +				EDAC_MEM_REPAIR_ATTR_RO(min_sub_channel, instance),
> +		[MEM_REPAIR_MAX_SUB_CHANNEL] =
> +				EDAC_MEM_REPAIR_ATTR_RO(max_sub_channel, instance),
> +		[MEM_DO_REPAIR] = EDAC_MEM_REPAIR_ATTR_WO(repair, instance)
> +	};
> +
> +	ctx = devm_kzalloc(dev, sizeof(*ctx), GFP_KERNEL);
> +	if (!ctx)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < MEM_REPAIR_MAX_ATTRS; i++) {
> +		memcpy(&ctx->mem_repair_dev_attr[i].dev_attr,
> +		       &dev_attr[i], sizeof(dev_attr[i]));
> +		ctx->mem_repair_attrs[i] =
> +				&ctx->mem_repair_dev_attr[i].dev_attr.attr;
> +	}
> +
> +	sprintf(ctx->name, "%s%d", "mem_repair", instance);
> +	group = &ctx->group;
> +	group->name = ctx->name;
> +	group->attrs = ctx->mem_repair_attrs;
> +	group->is_visible  = mem_repair_attr_visible;
> +	attr_groups[0] = group;
> +
> +	return 0;
> +}
> +
> +/**
> + * edac_mem_repair_get_desc - get EDAC memory repair descriptors
> + * @dev: client device with memory repair feature
> + * @attr_groups: pointer to attribute group container
> + * @instance: device's memory repair instance number.
> + *
> + * Return:
> + *  * %0	- Success.
> + *  * %-EINVAL	- Invalid parameters passed.
> + *  * %-ENOMEM	- Dynamic memory allocation failed.
> + */
> +int edac_mem_repair_get_desc(struct device *dev,
> +			     const struct attribute_group **attr_groups, u8 instance)
> +{
> +	if (!dev || !attr_groups)
> +		return -EINVAL;
> +
> +	return mem_repair_create_desc(dev, attr_groups, instance);
> +}
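
To make the intended usage concrete, a minimal client of this API might look
roughly like the sketch below. The ops/feature structures are from this patch
and the edac_dev_register() call assumes the registration signature introduced
earlier in this series; the my_* names are made up:

	static int my_get_repair_function(struct device *dev, void *drv_data,
					  u32 *val)
	{
		*val = EDAC_SOFT_PPR;
		return 0;
	}

	static int my_do_repair(struct device *dev, void *drv_data, u32 val)
	{
		/* Kick off the device-specific maintenance operation here. */
		return 0;
	}

	static const struct edac_mem_repair_ops my_repair_ops = {
		.get_repair_function	= my_get_repair_function,
		.do_repair		= my_do_repair,
	};

	static int my_probe(struct device *parent)
	{
		struct edac_dev_feature feat = {
			.ft_type	= RAS_FEAT_MEM_REPAIR,
			.instance	= 0,
			.mem_repair_ops	= &my_repair_ops,
		};

		return edac_dev_register(parent, "my_mem_dev", NULL, 1, &feat);
	}

With only those two callbacks implemented, mem_repair_attr_visible() hides
everything except repair_function and repair, which is a nice property of
this design.
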
> diff --git a/include/linux/edac.h b/include/linux/edac.h
> index 979e91426701..5d07192bf1a7 100644
> --- a/include/linux/edac.h
> +++ b/include/linux/edac.h
> @@ -668,6 +668,7 @@ static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci,
>  enum edac_dev_feat {
>  	RAS_FEAT_SCRUB,
>  	RAS_FEAT_ECS,
> +	RAS_FEAT_MEM_REPAIR,
>  	RAS_FEAT_MAX
>  };
>  
> @@ -729,11 +730,147 @@ int edac_ecs_get_desc(struct device *ecs_dev,
>  		      const struct attribute_group **attr_groups,
>  		      u16 num_media_frus);
>  
> +enum edac_mem_repair_function {
> +	EDAC_SOFT_PPR,
> +	EDAC_HARD_PPR,
> +	EDAC_CACHELINE_MEM_SPARING,
> +	EDAC_ROW_MEM_SPARING,
> +	EDAC_BANK_MEM_SPARING,
> +	EDAC_RANK_MEM_SPARING,
> +};
> +
> +enum edac_mem_repair_persist_mode {
> +	EDAC_MEM_REPAIR_SOFT, /* soft memory repair */
> +	EDAC_MEM_REPAIR_HARD, /* hard memory repair */
> +};
> +
> +enum edac_mem_repair_cmd {
> +	EDAC_DO_MEM_REPAIR = 1,
> +};
> +
> +/**
> + * struct edac_mem_repair_ops - memory repair operations
> + * (all elements are optional except do_repair, set_hpa/set_dpa)
> + * @get_repair_function: get the memory repair function, listed in
> + *			 enum edac_mem_repair_function.
> + * @get_persist_mode: get the current persist mode. Which persist repair
> + *		      modes the device supports is based on the memory repair
> + *		      function; a repair is either temporary or permanent, and
> + *		      a temporary repair is lost with a power cycle.
> + *		      EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
> + *		      EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
> + * All other values are reserved.
> + * @set_persist_mode: set the persist mode of the memory repair instance.
> + * @get_dpa_support: get dpa support flag. In some states of system configuration
> + *		     (e.g. before address decoders have been configured), memory devices
> + *		     (e.g. CXL) may not have an active mapping in the main host address
> + *		     physical address map. As such, the memory to repair must be identified
> + *		     by a device specific physical addressing scheme using a device physical
> + *		     address(DPA). The DPA and other control attributes to use for the
> + *		     dry_run and repair operations will be presented in related error records.
> + * @get_repair_safe_when_in_use: get whether memory media is accessible and
> + *				 data is retained during repair operation.
> + * @get_hpa: get current host physical address (HPA).
> + * @set_hpa: set host physical address (HPA) of memory to repair.
> + * @get_min_hpa: get the minimum supported host physical address (HPA).
> + * @get_max_hpa: get the maximum supported host physical address (HPA).
> + * @get_dpa: get current device physical address (DPA).
> + * @set_dpa: set device physical address (DPA) of memory to repair.
> + * @get_min_dpa: get the minimum supported device physical address (DPA).
> + * @get_max_dpa: get the maximum supported device physical address (DPA).
> + * @get_nibble_mask: get current nibble mask.
> + * @set_nibble_mask: set nibble mask of memory to repair.
> + * @get_min_nibble_mask: get the minimum supported nibble mask.
> + * @get_max_nibble_mask: get the maximum supported nibble mask.
> + * @get_bank_group: get current bank group.
> + * @set_bank_group: set bank group of memory to repair.
> + * @get_min_bank_group: get the minimum supported bank group.
> + * @get_max_bank_group: get the maximum supported bank group.
> + * @get_bank: get current bank.
> + * @set_bank: set bank of memory to repair.
> + * @get_min_bank: get the minimum supported bank.
> + * @get_max_bank: get the maximum supported bank.
> + * @get_rank: get current rank.
> + * @set_rank: set rank of memory to repair.
> + * @get_min_rank: get the minimum supported rank.
> + * @get_max_rank: get the maximum supported rank.
> + * @get_row: get current row.
> + * @set_row: set row of memory to repair.
> + * @get_min_row: get the minimum supported row.
> + * @get_max_row: get the maximum supported row.
> + * @get_column: get current column.
> + * @set_column: set column of memory to repair.
> + * @get_min_column: get the minimum supported column.
> + * @get_max_column: get the maximum supported column.
> + * @get_channel: get current channel.
> + * @set_channel: set channel of memory to repair.
> + * @get_min_channel: get the minimum supported channel.
> + * @get_max_channel: get the maximum supported channel.
> + * @get_sub_channel: get current sub channel.
> + * @set_sub_channel: set sub channel of memory to repair.
> + * @get_min_sub_channel: get the minimum supported sub channel.
> + * @get_max_sub_channel: get the maximum supported sub channel.
> + * @do_repair: Issue memory repair operation for the HPA/DPA and
> + *	       other control attributes set for the memory to repair.
> + */
> +struct edac_mem_repair_ops {
> +	int (*get_repair_function)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_persist_mode)(struct device *dev, void *drv_data, u32 *mode);
> +	int (*set_persist_mode)(struct device *dev, void *drv_data, u32 mode);
> +	int (*get_dpa_support)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_repair_safe_when_in_use)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_hpa)(struct device *dev, void *drv_data, u64 *hpa);
> +	int (*set_hpa)(struct device *dev, void *drv_data, u64 hpa);
> +	int (*get_min_hpa)(struct device *dev, void *drv_data, u64 *hpa);
> +	int (*get_max_hpa)(struct device *dev, void *drv_data, u64 *hpa);
> +	int (*get_dpa)(struct device *dev, void *drv_data, u64 *dpa);
> +	int (*set_dpa)(struct device *dev, void *drv_data, u64 dpa);
> +	int (*get_min_dpa)(struct device *dev, void *drv_data, u64 *dpa);
> +	int (*get_max_dpa)(struct device *dev, void *drv_data, u64 *dpa);
> +	int (*get_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
> +	int (*set_nibble_mask)(struct device *dev, void *drv_data, u64 val);
> +	int (*get_min_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
> +	int (*get_max_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
> +	int (*get_bank_group)(struct device *dev, void *drv_data, u32 *val);
> +	int (*set_bank_group)(struct device *dev, void *drv_data, u32 val);
> +	int (*get_min_bank_group)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_max_bank_group)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_bank)(struct device *dev, void *drv_data, u32 *val);
> +	int (*set_bank)(struct device *dev, void *drv_data, u32 val);
> +	int (*get_min_bank)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_max_bank)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_rank)(struct device *dev, void *drv_data, u32 *val);
> +	int (*set_rank)(struct device *dev, void *drv_data, u32 val);
> +	int (*get_min_rank)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_max_rank)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_row)(struct device *dev, void *drv_data, u64 *val);
> +	int (*set_row)(struct device *dev, void *drv_data, u64 val);
> +	int (*get_min_row)(struct device *dev, void *drv_data, u64 *val);
> +	int (*get_max_row)(struct device *dev, void *drv_data, u64 *val);
> +	int (*get_column)(struct device *dev, void *drv_data, u32 *val);
> +	int (*set_column)(struct device *dev, void *drv_data, u32 val);
> +	int (*get_min_column)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_max_column)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_channel)(struct device *dev, void *drv_data, u32 *val);
> +	int (*set_channel)(struct device *dev, void *drv_data, u32 val);
> +	int (*get_min_channel)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_max_channel)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_sub_channel)(struct device *dev, void *drv_data, u32 *val);
> +	int (*set_sub_channel)(struct device *dev, void *drv_data, u32 val);
> +	int (*get_min_sub_channel)(struct device *dev, void *drv_data, u32 *val);
> +	int (*get_max_sub_channel)(struct device *dev, void *drv_data, u32 *val);
> +	int (*do_repair)(struct device *dev, void *drv_data, u32 val);
> +};
> +
> +int edac_mem_repair_get_desc(struct device *dev,
> +			     const struct attribute_group **attr_groups,
> +			     u8 instance);
> +
>  /* EDAC device feature information structure */
>  struct edac_dev_data {
>  	union {
>  		const struct edac_scrub_ops *scrub_ops;
>  		const struct edac_ecs_ops *ecs_ops;
> +		const struct edac_mem_repair_ops *mem_repair_ops;
>  	};
>  	u8 instance;
>  	void *private;
> @@ -744,6 +881,7 @@ struct edac_dev_feat_ctx {
>  	void *private;
>  	struct edac_dev_data *scrub;
>  	struct edac_dev_data ecs;
> +	struct edac_dev_data *mem_repair;
>  };
>  
>  struct edac_dev_feature {
> @@ -752,6 +890,7 @@ struct edac_dev_feature {
>  	union {
>  		const struct edac_scrub_ops *scrub_ops;
>  		const struct edac_ecs_ops *ecs_ops;
> +		const struct edac_mem_repair_ops *mem_repair_ops;
>  	};
>  	void *ctx;
>  	struct edac_ecs_ex_info ecs_info;

Thanks,
Mauro
Shiju Jose Jan. 14, 2025, 12:31 p.m. UTC | #16
Hi Mauro,

Thanks for the comments.

>-----Original Message-----
>From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
>Sent: 14 January 2025 11:48
>To: Shiju Jose <shiju.jose@huawei.com>
>Cc: linux-edac@vger.kernel.org; linux-cxl@vger.kernel.org; linux-acpi@vger.kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org; [...]
>Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
>
>Em Mon, 6 Jan 2025 12:10:00 +0000
><shiju.jose@huawei.com> escreveu:
>
>> [...]
>> diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair b/Documentation/ABI/testing/sysfs-edac-memory-repair
>> new file mode 100644
>> index 000000000000..e9268f3780ed
>> --- /dev/null
>> +++ b/Documentation/ABI/testing/sysfs-edac-memory-repair
>> @@ -0,0 +1,244 @@
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory
>> +		pertains to the memory media repair features control, such as
>> +		PPR (Post Package Repair), memory sparing etc, where <dev-name>
>> +		directory corresponds to a device registered with the EDAC
>> +		device driver for the memory repair features.
>> +
>> +		Post Package Repair is a maintenance operation that requests the
>> +		memory device to perform a repair operation on its media. In detail,
>> +		it is a memory self-healing feature that fixes a failing memory location by
>> +		replacing it with a spare row in a DRAM device. For example, a
>> +		CXL memory device with DRAM components that support PPR features may
>> +		implement PPR maintenance operations. DRAM components may support
>> +		two types of PPR functions: hard PPR, for a permanent row repair, and
>> +		soft PPR, for a temporary row repair. Soft PPR is much faster than
>> +		hard PPR, but the repair is lost with a power cycle.
>> +
>> +		Memory sparing is a repair function that replaces a portion
>> +		of memory with a portion of functional memory at that same
>> +		sparing granularity. Memory sparing has cacheline/row/bank/rank
>> +		sparing granularities. For example, in memory-sparing mode,
>> +		one memory rank serves as a spare for other ranks on the same
>> +		channel in case they fail. The spare rank is held in reserve and
>> +		not used as active memory until a failure is indicated, with
>> +		reserved capacity subtracted from the total available memory
>> +		in the system. The DIMM installation order for memory sparing
>> +		varies based on the number of processors and memory modules
>> +		installed in the server. After an error threshold is surpassed
>> +		in a system protected by memory sparing, the content of a failing
>> +		rank of DIMMs is copied to the spare rank. The failing rank is
>> +		then taken offline and the spare rank placed online for use as
>> +		active memory in place of the failed rank.
>> +
>> +		The sysfs attributes nodes for a repair feature are only
>> +		present if the parent driver has implemented the corresponding
>> +		attr callback function and provided the necessary operations
>> +		to the EDAC device driver during registration.
>> +
>> +		In some states of system configuration (e.g. before address
>> +		decoders have been configured), memory devices (e.g. CXL)
>> +		may not have an active mapping in the main host address
>> +		physical address map. As such, the memory to repair must be
>> +		identified by a device specific physical addressing scheme
>> +		using a device physical address(DPA). The DPA and other control
>> +		attributes to use will be presented in related error records.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RO) Memory repair function type. For eg. post package repair,
>> +		memory sparing etc.
>> +		EDAC_SOFT_PPR - Soft post package repair
>> +		EDAC_HARD_PPR - Hard post package repair
>> +		EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
>> +		EDAC_ROW_MEM_SPARING - Row memory sparing
>> +		EDAC_BANK_MEM_SPARING - Bank memory sparing
>> +		EDAC_RANK_MEM_SPARING - Rank memory sparing
>> +		All other values are reserved.
>
>Too big strings. Why are them in upper cases? IMO:
>
>	soft-ppr, hard-ppr, ... would be enough.
>
Here we return the repair type of the memory repair instance as a single value
(such as 0, 1 or 2), not as a decoded string (e.g. "EDAC_SOFT_PPR"). The values
are defined as enums (EDAC_SOFT_PPR, EDAC_HARD_PPR, etc.) for the memory repair
interface in include/linux/edac.h.

enum edac_mem_repair_function {
	EDAC_SOFT_PPR,
	EDAC_HARD_PPR,
	EDAC_CACHELINE_MEM_SPARING,
	EDAC_ROW_MEM_SPARING,
	EDAC_BANK_MEM_SPARING,
	EDAC_RANK_MEM_SPARING,
};
  
I documented the return value in terms of the above enums.

>Also, is it mandatory that all types are supported? If not, you need a
>way to report to userspace which of them are supported. One option
>would be that reading /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
>would return something like:
>
>	soft-ppr [hard-ppr] row-mem-sparing
>
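>A sketch of the decoding I have in mind, purely illustrative (reusing the
>enum from this patch):
>
>	static const char * const repair_fn_names[] = {
>		[EDAC_SOFT_PPR]			= "soft-ppr",
>		[EDAC_HARD_PPR]			= "hard-ppr",
>		[EDAC_CACHELINE_MEM_SPARING]	= "cacheline-mem-sparing",
>		[EDAC_ROW_MEM_SPARING]		= "row-mem-sparing",
>		[EDAC_BANK_MEM_SPARING]		= "bank-mem-sparing",
>		[EDAC_RANK_MEM_SPARING]		= "rank-mem-sparing",
>	};
>
>	/* in the show() routine, instead of printing the raw enum value: */
>	return sysfs_emit(buf, "%s\n", repair_fn_names[val]);
>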
Same as above. It is not returned in the decoded string format.
 
>Also, as this will be parsed in ReST format, you need to change the
>description to use bullets, otherwise the html/pdf version of the
>document will place everything on a single line. E.g. something like:
Sure.

>
>Description:
>		(RO) Memory repair function type. For eg. post package repair,
>		memory sparing etc. Can be:
>
>		- EDAC_SOFT_PPR - Soft post package repair
>		- EDAC_HARD_PPR - Hard post package repair
>		- EDAC_CACHELINE_MEM_SPARING - Cacheline memory
>sparing
>		- EDAC_ROW_MEM_SPARING - Row memory sparing
>		- EDAC_BANK_MEM_SPARING - Bank memory sparing
>		- EDAC_RANK_MEM_SPARING - Rank memory sparing
>		- All other values are reserved.
>
>Same applies to other sysfs nodes. See for instance:
>
>	Documentation/ABI/stable/sysfs-class-backlight
>
>And see how it is formatted after Sphinx processing at the Kernel
>Admin guide:
>
>	https://www.kernel.org/doc/html/latest/admin-guide/abi-stable.html#symbols-under-sys-class
>
>Please fix it on all places you have a list of values.
Sure.
>
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RW) Read/Write the current persist repair mode set for a
>> +		repair function. Which persist repair modes the device
>> +		supports depends on the memory repair function: a repair is
>> +		either temporary or permanent, and a temporary repair is
>> +		lost with a power cycle.
>> +		EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
>> +		EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
>> +		All other values are reserved.
>
>Same here: edac/ is already in the path. No need to place EDAC_ at the name.
>
Same as above. We return a single value, not a decoded string, but it is
documented in terms of the enums defined for the interface in include/linux/edac.h.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa_support
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RO) True if the memory device requires the device physical
>> +		address (DPA) of the memory to repair.
>> +		False if the memory device requires the host specific physical
>> +                address (HPA) of the memory to repair.
>
>Please remove the extra spaces before "address", as otherwise conversion to
>ReST may do the wrong thing or may produce doc warnings.
Will fix.
>
>> +		In some states of system configuration (e.g. before address
>> +		decoders have been configured), memory devices (e.g. CXL)
>> +		may not have an active mapping in the main host address
>> +		physical address map. As such, the memory to repair must be
>> +		identified by a device specific physical addressing scheme
>> +		using a DPA. The device physical address(DPA) to use will be
>> +		presented in related error records.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_safe_when_in_use
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RO) True if memory media is accessible and data is retained
>> +		during the memory repair operation.
>> +		The data may not be retained and memory requests may not be
>> +		correctly processed during a repair operation. In such a case,
>> +		the repair operation should not be executed at runtime.
>
>Please add an extra line before "The data" to ensure that the output at
>the admin-guide won't merge the two paragraphs. Same on other places along
>this patch series: paragraphs need a blank line at the description.
>
Sure.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/hpa
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RW) Host Physical Address (HPA) of the memory to repair.
>> +		See attribute 'dpa_support' for more details.
>> +		The HPA to use will be provided in related error records.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RW) Device Physical Address (DPA) of the memory to repair.
>> +		See attribute 'dpa_support' for more details.
>> +		The specific DPA to use will be provided in related error
>> +		records.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/nibble_mask
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RW) Read/Write Nibble mask of the memory to repair.
>> +		Nibble mask identifies one or more nibbles in error on the
>> +		memory bus that produced the error event. Nibble Mask bit 0
>> +		shall be set if nibble 0 on the memory bus produced the
>> +		event, etc. For example, CXL PPR and sparing, a nibble mask
>> +		bit set to 1 indicates the request to perform repair
>> +		operation in the specific device. All nibble mask bits set
>> +		to 1 indicates the request to perform the operation in all
>> +		devices. For CXL memory to repair, the specific value of
>> +		nibble mask to use will be provided in related error records.
>> +		For more details, See nibble mask field in CXL spec ver 3.1,
>> +		section 8.2.9.7.1.2 Table 8-103 soft PPR and section
>> +		8.2.9.7.1.3 Table 8-104 hard PPR, section 8.2.9.7.1.4
>> +		Table 8-105 memory sparing.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank_group
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/rank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/row
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/column
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/channel
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/sub_channel
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RW) The control attributes associated with memory address
>> +		that is to be repaired. The specific value of attributes to
>> +		use depends on the portion of memory to repair and may be
>> +		reported to host in related error records and may be
>> +		available to userspace in trace events, such as in CXL
>> +		memory devices.
>> +
>> +		channel - The channel of the memory to repair. Channel is
>> +		defined as an interface that can be independently accessed
>> +		for a transaction.
>> +		rank - The rank of the memory to repair. Rank is defined as a
>> +		set of memory devices on a channel that together execute a
>> +		transaction.
>> +		bank_group - The bank group of the memory to repair.
>> +		bank - The bank number of the memory to repair.
>> +		row - The row number of the memory to repair.
>> +		column - The column number of the memory to repair.
>> +		sub_channel - The sub-channel of the memory to repair.
>
>Same problem here with regards to bad ReST input. I would do:
>
>	channel
>		The channel of the memory to repair. Channel is
>		defined as an interface that can be independently accessed
>		for a transaction.
>
>	rank
>		The rank of the memory to repair. Rank is defined as a
>		set of memory devices on a channel that together execute a
>		transaction.
>
Sure. Will fix.
>as this would provide a better output at admin-guide while still being
>nicer to read as text.
>
>> +
>> +		The requirement to set these attributes varies based on the
>> +		repair function. The attributes in sysfs are not present
>> +		unless required for a repair function.
>> +		For example, for the CXL spec ver 3.1, Section 8.2.9.7.1.2 Table 8-103
>> +		soft PPR and Section 8.2.9.7.1.3 Table 8-104 hard PPR operations,
>> +		these attributes do not need to be set.
>> +		For the CXL spec ver 3.1, Section 8.2.9.7.1.4 Table 8-105
>> +		memory sparing, these attributes must be set based on the
>> +		memory sparing granularity as follows.
>> +		Channel: Channel associated with the DPA that is to be spared
>> +		and applies to all subclasses of sparing (cacheline, bank,
>> +		row and rank sparing).
>> +		Rank: Rank associated with the DPA that is to be spared and
>> +		applies to all subclasses of sparing.
>> +		Bank & Bank Group: Bank & bank group are associated with
>> +		the DPA that is to be spared and applies to cacheline sparing,
>> +		row sparing and bank sparing subclasses.
>> +		Row: Row associated with the DPA that is to be spared and
>> +		applies to cacheline sparing and row sparing subclasses.
>> +		Column: column associated with the DPA that is to be spared
>> +		and applies to cacheline sparing only.
>> +		Sub-channel: sub-channel associated with the DPA that is to
>> +		be spared and applies to cacheline sparing only.
>
>Same here: this will all be on a single paragraph which would be really
>weird.
Will fix.
>
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_nibble_mask
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank_group
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_rank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_row
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_column
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_channel
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_sub_channel
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_nibble_mask
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank_group
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_rank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_row
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_column
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_channel
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_sub_channel
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RO) The supported range of control attributes (optional)
>> +		associated with a memory address that is to be repaired.
>> +		The memory device may expose the supported range for each
>> +		attribute; the range depends on the memory device and the
>> +		portion of memory to repair.
>> +		Userspace may receive the specific value of attributes
>> +		to use for a repair operation from the memory device via
>> +		related error records and trace events, such as in CXL
>> +		memory devices.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(WO) Issue the memory repair operation for the specified
>> +		memory repair attributes. The operation may fail if resources
>> +		are insufficient based on the requirements of the memory
>> +		device and repair function.
>> +		EDAC_DO_MEM_REPAIR - issue repair operation.
>> +		All other values are reserved.
>> diff --git a/Documentation/edac/features.rst b/Documentation/edac/features.rst
>> index ba3ab993ee4f..bfd5533b81b7 100644
>> --- a/Documentation/edac/features.rst
>> +++ b/Documentation/edac/features.rst
>> @@ -97,3 +97,6 @@ RAS features
>>  ------------
>>  1. Memory Scrub
>>  Memory scrub features are documented in `Documentation/edac/scrub.rst`.
>> +
>> +2. Memory Repair
>> +Memory repair features are documented in `Documentation/edac/memory_repair.rst`.
>> diff --git a/Documentation/edac/index.rst b/Documentation/edac/index.rst
>> index dfb0c9fb9ab1..d6778f4562dd 100644
>> --- a/Documentation/edac/index.rst
>> +++ b/Documentation/edac/index.rst
>> @@ -8,4 +8,5 @@ EDAC Subsystem
>>     :maxdepth: 1
>>
>>     features
>> +   memory_repair
>>     scrub
>> diff --git a/Documentation/edac/memory_repair.rst b/Documentation/edac/memory_repair.rst
>> new file mode 100644
>> index 000000000000..2787a8a2d6ba
>> --- /dev/null
>> +++ b/Documentation/edac/memory_repair.rst
>> @@ -0,0 +1,101 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +==========================
>> +EDAC Memory Repair Control
>> +==========================
>> +
>> +Copyright (c) 2024 HiSilicon Limited.
>> +
>> +:Author:   Shiju Jose <shiju.jose@huawei.com>
>> +:License:  The GNU Free Documentation License, Version 1.2
>> +          (dual licensed under the GPL v2)
>> +:Original Reviewers:
>> +
>> +- Written for: 6.14
>
>See my comments with regards to license on the previous patches.
Ok.
>
>> +
>> +Introduction
>> +------------
>> +Memory devices may support repair operations to address issues in their
>> +memory media. Post Package Repair (PPR) and memory sparing are examples
>> +of such features.
>> +
>> +Post Package Repair(PPR)
>> +~~~~~~~~~~~~~~~~~~~~~~~~
>> +Post Package Repair is a maintenance operation requests the memory device
>> +to perform repair operation on its media, in detail is a memory self-healing
>> +feature that fixes a failing memory location by replacing it with a spare
>> +row in a DRAM device. For example, a CXL memory device with DRAM components
>> +that support PPR features may implement PPR maintenance operations. DRAM
>> +components may support types of PPR functions, hard PPR, for a permanent row
>> +repair, and soft PPR, for a temporary row repair. Soft PPR is much faster
>> +than hard PPR, but the repair is lost with a power cycle.  The data may not
>> +be retained and memory requests may not be correctly processed during a
>> +repair operation. In such case, the repair operation should not executed
>> +at runtime.
>> +For example, CXL memory devices, soft PPR and hard PPR repair operations
>> +may be supported. See CXL spec rev 3.1 sections 8.2.9.7.1.1 PPR Maintenance
>> +Operations, 8.2.9.7.1.2 sPPR Maintenance Operation and 8.2.9.7.1.3 hPPR
>> +Maintenance Operation for more details.
>
>Paragraphs require blank lines in ReST. Also, please place a link to the
>specs.
>
>I strongly suggest looking at the output of all docs with make htmldocs
>and make pdfdocs to be sure that the paragraphs and the final document
>will be properly handled. You may use:
>
>	SPHINXDIRS="<book name(s)>"
>
>to speed-up documentation builds.
>
>Please see Sphinx documentation for more details about what it is expected
>there:
>
>	https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html
Thanks for the information.  I will check and fix.
I had fixed the blank line requirements in most of the main documentation,
but was not aware of the location of the output for the ABI docs and missed it.
>
>> +
>> +Memory Sparing
>> +~~~~~~~~~~~~~~
>> +Memory sparing is a repair function that replaces a portion of memory with
>> +a portion of functional memory at that same sparing granularity. Memory
>> +sparing has cacheline/row/bank/rank sparing granularities. For example, in
>> +memory-sparing mode, one memory rank serves as a spare for other ranks on
>> +the same channel in case they fail. The spare rank is held in reserve and
>> +not used as active memory until a failure is indicated, with reserved
>> +capacity subtracted from the total available memory in the system. The DIMM
>> +installation order for memory sparing varies based on the number of processors
>> +and memory modules installed in the server. After an error threshold is
>> +surpassed in a system protected by memory sparing, the content of a failing
>> +rank of DIMMs is copied to the spare rank. The failing rank is then taken
>> +offline and the spare rank placed online for use as active memory in place
>> +of the failed rank.
>> +
>> +For example, CXL memory devices may support various subclasses for sparing
>> +operation vary in terms of the scope of the sparing being performed.
>> +Cacheline sparing subclass refers to a sparing action that can replace a
>> +full cacheline. Row sparing is provided as an alternative to PPR sparing
>> +functions and its scope is that of a single DDR row. Bank sparing allows
>> +an entire bank to be replaced. Rank sparing is defined as an operation
>> +in which an entire DDR rank is replaced. See CXL spec 3.1 section
>> +8.2.9.7.1.4 Memory Sparing Maintenance Operations for more details.
>> +
>> +Use cases of generic memory repair features control
>> +---------------------------------------------------
>> +
>> +1. The soft PPR , hard PPR and memory-sparing features share similar
>> +control attributes. Therefore, there is a need for a standardized, generic
>> +sysfs repair control that is exposed to userspace and used by
>> +administrators, scripts and tools.
>> +
>> +2. When a CXL device detects an error in a memory component, it may inform
>> +the host of the need for a repair maintenance operation by using an event
>> +record where the "maintenance needed" flag is set. The event record
>> +specifies the device physical address(DPA) and attributes of the memory that
>> +requires repair. The kernel reports the corresponding CXL general media or
>> +DRAM trace event to userspace, and userspace tools (e.g. rasdaemon) initiate
>> +a repair maintenance operation in response to the device request using the
>> +sysfs repair control.
>> +
>> +3. Userspace tools, such as rasdaemon, may request a PPR/sparing on a memory
>> +region when an uncorrected memory error or an excess of corrected memory
>> +errors is reported on that memory.
>> +
>> +4. Multiple PPR/sparing instances may be present per memory device.
>> +
>> +The File System
>> +---------------
>> +
>> +The control attributes of a registered memory repair instance could be
>> +accessed in the
>> +
>> +/sys/bus/edac/devices/<dev-name>/mem_repairX/
>> +
>> +sysfs
>> +-----
>> +
>> +Sysfs files are documented in
>> +
>> +`Documentation/ABI/testing/sysfs-edac-memory-repair`.
>> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
>> index 3a49304860f0..1de9fe66ac6b 100644
>> --- a/drivers/edac/Makefile
>> +++ b/drivers/edac/Makefile
>> @@ -10,7 +10,7 @@ obj-$(CONFIG_EDAC)			:= edac_core.o
>>
>>  edac_core-y	:= edac_mc.o edac_device.o edac_mc_sysfs.o
>>  edac_core-y	+= edac_module.o edac_device_sysfs.o wq.o
>> -edac_core-y	+= scrub.o ecs.o
>> +edac_core-y	+= scrub.o ecs.o mem_repair.o
>>
>>  edac_core-$(CONFIG_EDAC_DEBUG)		+= debugfs.o
>>
>> diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
>> index 1c1142a2e4e4..a401d81dad8a 100644
>> --- a/drivers/edac/edac_device.c
>> +++ b/drivers/edac/edac_device.c
>> @@ -575,6 +575,7 @@ static void edac_dev_release(struct device *dev)
>>  {
>>  	struct edac_dev_feat_ctx *ctx = container_of(dev, struct edac_dev_feat_ctx, dev);
>>
>> +	kfree(ctx->mem_repair);
>>  	kfree(ctx->scrub);
>>  	kfree(ctx->dev.groups);
>>  	kfree(ctx);
>> @@ -611,6 +612,7 @@ int edac_dev_register(struct device *parent, char *name,
>>  	const struct attribute_group **ras_attr_groups;
>>  	struct edac_dev_data *dev_data;
>>  	struct edac_dev_feat_ctx *ctx;
>> +	int mem_repair_cnt = 0;
>>  	int attr_gcnt = 0;
>>  	int scrub_cnt = 0;
>>  	int ret, feat;
>> @@ -628,6 +630,10 @@ int edac_dev_register(struct device *parent, char *name,
>>  		case RAS_FEAT_ECS:
>>  			attr_gcnt += ras_features[feat].ecs_info.num_media_frus;
>>  			break;
>> +		case RAS_FEAT_MEM_REPAIR:
>> +			attr_gcnt++;
>> +			mem_repair_cnt++;
>> +			break;
>>  		default:
>>  			return -EINVAL;
>>  		}
>> @@ -651,8 +657,17 @@ int edac_dev_register(struct device *parent, char *name,
>>  		}
>>  	}
>>
>> +	if (mem_repair_cnt) {
>> +		ctx->mem_repair = kcalloc(mem_repair_cnt, sizeof(*ctx->mem_repair), GFP_KERNEL);
>> +		if (!ctx->mem_repair) {
>> +			ret = -ENOMEM;
>> +			goto data_mem_free;
>> +		}
>> +	}
>> +
>>  	attr_gcnt = 0;
>>  	scrub_cnt = 0;
>> +	mem_repair_cnt = 0;
>>  	for (feat = 0; feat < num_features; feat++, ras_features++) {
>>  		switch (ras_features->ft_type) {
>>  		case RAS_FEAT_SCRUB:
>> @@ -686,6 +701,23 @@ int edac_dev_register(struct device *parent, char *name,
>>
>>  			attr_gcnt += ras_features->ecs_info.num_media_frus;
>>  			break;
>> +		case RAS_FEAT_MEM_REPAIR:
>> +			if (!ras_features->mem_repair_ops ||
>> +			    mem_repair_cnt != ras_features->instance)
>> +				goto data_mem_free;
>> +
>> +			dev_data = &ctx->mem_repair[mem_repair_cnt];
>> +			dev_data->instance = mem_repair_cnt;
>> +			dev_data->mem_repair_ops = ras_features->mem_repair_ops;
>> +			dev_data->private = ras_features->ctx;
>> +			ret = edac_mem_repair_get_desc(parent, &ras_attr_groups[attr_gcnt],
>> +						       ras_features->instance);
>> +			if (ret)
>> +				goto data_mem_free;
>> +
>> +			mem_repair_cnt++;
>> +			attr_gcnt++;
>> +			break;
>>  		default:
>>  			ret = -EINVAL;
>>  			goto data_mem_free;
>> @@ -712,6 +744,7 @@ int edac_dev_register(struct device *parent, char *name,
>>  	return devm_add_action_or_reset(parent, edac_dev_unreg, &ctx->dev);
>>
>>  data_mem_free:
>> +	kfree(ctx->mem_repair);
>>  	kfree(ctx->scrub);
>>  groups_free:
>>  	kfree(ras_attr_groups);
>> diff --git a/drivers/edac/mem_repair.c b/drivers/edac/mem_repair.c
>> new file mode 100755
>> index 000000000000..e7439fd26c41
>> --- /dev/null
>> +++ b/drivers/edac/mem_repair.c
>> @@ -0,0 +1,492 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * The generic EDAC memory repair driver is designed to control the memory
>> + * devices with memory repair features, such as Post Package Repair (PPR),
>> + * memory sparing etc. The common sysfs memory repair interface abstracts
>> + * the control of various arbitrary memory repair functionalities into a
>> + * unified set of functions.
>> + *
>> + * Copyright (c) 2024 HiSilicon Limited.
>> + */
>> +
>> +#include <linux/edac.h>
>> +
>> +enum edac_mem_repair_attributes {
>> +	MEM_REPAIR_FUNCTION,
>> +	MEM_REPAIR_PERSIST_MODE,
>> +	MEM_REPAIR_DPA_SUPPORT,
>> +	MEM_REPAIR_SAFE_IN_USE,
>> +	MEM_REPAIR_HPA,
>> +	MEM_REPAIR_MIN_HPA,
>> +	MEM_REPAIR_MAX_HPA,
>> +	MEM_REPAIR_DPA,
>> +	MEM_REPAIR_MIN_DPA,
>> +	MEM_REPAIR_MAX_DPA,
>> +	MEM_REPAIR_NIBBLE_MASK,
>> +	MEM_REPAIR_MIN_NIBBLE_MASK,
>> +	MEM_REPAIR_MAX_NIBBLE_MASK,
>> +	MEM_REPAIR_BANK_GROUP,
>> +	MEM_REPAIR_MIN_BANK_GROUP,
>> +	MEM_REPAIR_MAX_BANK_GROUP,
>> +	MEM_REPAIR_BANK,
>> +	MEM_REPAIR_MIN_BANK,
>> +	MEM_REPAIR_MAX_BANK,
>> +	MEM_REPAIR_RANK,
>> +	MEM_REPAIR_MIN_RANK,
>> +	MEM_REPAIR_MAX_RANK,
>> +	MEM_REPAIR_ROW,
>> +	MEM_REPAIR_MIN_ROW,
>> +	MEM_REPAIR_MAX_ROW,
>> +	MEM_REPAIR_COLUMN,
>> +	MEM_REPAIR_MIN_COLUMN,
>> +	MEM_REPAIR_MAX_COLUMN,
>> +	MEM_REPAIR_CHANNEL,
>> +	MEM_REPAIR_MIN_CHANNEL,
>> +	MEM_REPAIR_MAX_CHANNEL,
>> +	MEM_REPAIR_SUB_CHANNEL,
>> +	MEM_REPAIR_MIN_SUB_CHANNEL,
>> +	MEM_REPAIR_MAX_SUB_CHANNEL,
>> +	MEM_DO_REPAIR,
>> +	MEM_REPAIR_MAX_ATTRS
>> +};
>> +
>> +struct edac_mem_repair_dev_attr {
>> +	struct device_attribute dev_attr;
>> +	u8 instance;
>> +};
>> +
>> +struct edac_mem_repair_context {
>> +	char name[EDAC_FEAT_NAME_LEN];
>> +	struct edac_mem_repair_dev_attr mem_repair_dev_attr[MEM_REPAIR_MAX_ATTRS];
>> +	struct attribute *mem_repair_attrs[MEM_REPAIR_MAX_ATTRS + 1];
>> +	struct attribute_group group;
>> +};
>> +
>> +#define TO_MEM_REPAIR_DEV_ATTR(_dev_attr)      \
>> +		container_of(_dev_attr, struct edac_mem_repair_dev_attr, dev_attr)
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_SHOW(attrib, cb, type, format)			\
>> +static ssize_t attrib##_show(struct device *ras_feat_dev,			\
>> +			     struct device_attribute *attr, char *buf)		\
>> +{										\
>> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;			\
>> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
>> +	const struct edac_mem_repair_ops *ops =				\
>> +				ctx->mem_repair[inst].mem_repair_ops;		\
>> +	type data;								\
>> +	int ret;								\
>> +										\
>> +	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private,	\
>> +		      &data);							\
>> +	if (ret)								\
>> +		return ret;							\
>> +										\
>> +	return sysfs_emit(buf, format, data);					\
>> +}
>> +
>> +EDAC_MEM_REPAIR_ATTR_SHOW(repair_function, get_repair_function, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(persist_mode, get_persist_mode, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(dpa_support, get_dpa_support, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(repair_safe_when_in_use, get_repair_safe_when_in_use, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(hpa, get_hpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_hpa, get_min_hpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_hpa, get_max_hpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(dpa, get_dpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_dpa, get_min_dpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_dpa, get_max_dpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(nibble_mask, get_nibble_mask, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_nibble_mask, get_min_nibble_mask, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_nibble_mask, get_max_nibble_mask, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(bank_group, get_bank_group, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank_group, get_min_bank_group, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank_group, get_max_bank_group, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(bank, get_bank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank, get_min_bank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank, get_max_bank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(rank, get_rank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_rank, get_min_rank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_rank, get_max_rank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(row, get_row, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_row, get_min_row, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_row, get_max_row, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(column, get_column, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_column, get_min_column, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_column, get_max_column, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(channel, get_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_channel, get_min_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_channel, get_max_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(sub_channel, get_sub_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_sub_channel, get_min_sub_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_sub_channel, get_max_sub_channel, u32, "%u\n")
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_STORE(attrib, cb, type, conv_func)		\
>> +static ssize_t attrib##_store(struct device *ras_feat_dev,			\
>> +			      struct device_attribute *attr,			\
>> +			      const char *buf, size_t len)			\
>> +{										\
>> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;			\
>> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
>> +	const struct edac_mem_repair_ops *ops =				\
>> +				ctx->mem_repair[inst].mem_repair_ops;		\
>> +	type data;								\
>> +	int ret;								\
>> +										\
>> +	ret = conv_func(buf, 0, &data);						\
>> +	if (ret < 0)								\
>> +		return ret;							\
>> +										\
>> +	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private,	\
>> +		      data);							\
>> +	if (ret)								\
>> +		return ret;							\
>> +										\
>> +	return len;								\
>> +}
>> +
>> +EDAC_MEM_REPAIR_ATTR_STORE(persist_mode, set_persist_mode, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(hpa, set_hpa, u64, kstrtou64)
>> +EDAC_MEM_REPAIR_ATTR_STORE(dpa, set_dpa, u64, kstrtou64)
>> +EDAC_MEM_REPAIR_ATTR_STORE(nibble_mask, set_nibble_mask, u64, kstrtou64)
>> +EDAC_MEM_REPAIR_ATTR_STORE(bank_group, set_bank_group, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(bank, set_bank, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(rank, set_rank, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(row, set_row, u64, kstrtou64)
>> +EDAC_MEM_REPAIR_ATTR_STORE(column, set_column, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(channel, set_channel, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(sub_channel, set_sub_channel, unsigned long, kstrtoul)
>> +
>> +#define EDAC_MEM_REPAIR_DO_OP(attrib, cb)					\
>> +static ssize_t attrib##_store(struct device *ras_feat_dev,			\
>> +			      struct device_attribute *attr,			\
>> +			      const char *buf, size_t len)			\
>> +{										\
>> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;			\
>> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
>> +	const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops;	\
>> +	unsigned long data;							\
>> +	int ret;								\
>> +										\
>> +	ret = kstrtoul(buf, 0, &data);						\
>> +	if (ret < 0)								\
>> +		return ret;							\
>> +										\
>> +	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, data);	\
>> +	if (ret)								\
>> +		return ret;							\
>> +										\
>> +	return len;								\
>> +}
>> +
>> +EDAC_MEM_REPAIR_DO_OP(repair, do_repair)
>> +
>> +static umode_t mem_repair_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id)
>> +{
>> +	struct device *ras_feat_dev = kobj_to_dev(kobj);
>> +	struct device_attribute *dev_attr = container_of(a, struct device_attribute, attr);
>> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);
>> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(dev_attr)->instance;
>> +	const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops;
>> +
>> +	switch (attr_id) {
>> +	case MEM_REPAIR_FUNCTION:
>> +		if (ops->get_repair_function)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_PERSIST_MODE:
>> +		if (ops->get_persist_mode) {
>> +			if (ops->set_persist_mode)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_DPA_SUPPORT:
>> +		if (ops->get_dpa_support)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_SAFE_IN_USE:
>> +		if (ops->get_repair_safe_when_in_use)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_HPA:
>> +		if (ops->get_hpa) {
>> +			if (ops->set_hpa)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_HPA:
>> +		if (ops->get_min_hpa)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_HPA:
>> +		if (ops->get_max_hpa)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_DPA:
>> +		if (ops->get_dpa) {
>> +			if (ops->set_dpa)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_DPA:
>> +		if (ops->get_min_dpa)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_DPA:
>> +		if (ops->get_max_dpa)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_NIBBLE_MASK:
>> +		if (ops->get_nibble_mask) {
>> +			if (ops->set_nibble_mask)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_NIBBLE_MASK:
>> +		if (ops->get_min_nibble_mask)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_NIBBLE_MASK:
>> +		if (ops->get_max_nibble_mask)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_BANK_GROUP:
>> +		if (ops->get_bank_group) {
>> +			if (ops->set_bank_group)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_BANK_GROUP:
>> +		if (ops->get_min_bank_group)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_BANK_GROUP:
>> +		if (ops->get_max_bank_group)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_BANK:
>> +		if (ops->get_bank) {
>> +			if (ops->set_bank)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_BANK:
>> +		if (ops->get_min_bank)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_BANK:
>> +		if (ops->get_max_bank)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_RANK:
>> +		if (ops->get_rank) {
>> +			if (ops->set_rank)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_RANK:
>> +		if (ops->get_min_rank)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_RANK:
>> +		if (ops->get_max_rank)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_ROW:
>> +		if (ops->get_row) {
>> +			if (ops->set_row)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_ROW:
>> +		if (ops->get_min_row)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_ROW:
>> +		if (ops->get_max_row)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_COLUMN:
>> +		if (ops->get_column) {
>> +			if (ops->set_column)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_COLUMN:
>> +		if (ops->get_min_column)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_COLUMN:
>> +		if (ops->get_max_column)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_CHANNEL:
>> +		if (ops->get_channel) {
>> +			if (ops->set_channel)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_CHANNEL:
>> +		if (ops->get_min_channel)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_CHANNEL:
>> +		if (ops->get_max_channel)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_SUB_CHANNEL:
>> +		if (ops->get_sub_channel) {
>> +			if (ops->set_sub_channel)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_SUB_CHANNEL:
>> +		if (ops->get_min_sub_channel)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_SUB_CHANNEL:
>> +		if (ops->get_max_sub_channel)
>> +			return a->mode;
>> +		break;
>> +	case MEM_DO_REPAIR:
>> +		if (ops->do_repair)
>> +			return a->mode;
>> +		break;
>> +	default:
>> +		break;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_RO(_name, _instance)       \
>> +	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RO(_name), \
>> +					     .instance = _instance })
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_WO(_name, _instance)       \
>> +	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_WO(_name), \
>> +					     .instance = _instance })
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_RW(_name, _instance)       \
>> +	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RW(_name), \
>> +					     .instance = _instance })
>> +
>> +static int mem_repair_create_desc(struct device *dev,
>> +				  const struct attribute_group **attr_groups,
>> +				  u8 instance)
>> +{
>> +	struct edac_mem_repair_context *ctx;
>> +	struct attribute_group *group;
>> +	int i;
>> +	struct edac_mem_repair_dev_attr dev_attr[] = {
>> +		[MEM_REPAIR_FUNCTION] = EDAC_MEM_REPAIR_ATTR_RO(repair_function,
>> +							    instance),
>> +		[MEM_REPAIR_PERSIST_MODE] =
>> +				EDAC_MEM_REPAIR_ATTR_RW(persist_mode, instance),
>> +		[MEM_REPAIR_DPA_SUPPORT] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(dpa_support, instance),
>> +		[MEM_REPAIR_SAFE_IN_USE] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(repair_safe_when_in_use,
>> +							instance),
>> +		[MEM_REPAIR_HPA] = EDAC_MEM_REPAIR_ATTR_RW(hpa, instance),
>> +		[MEM_REPAIR_MIN_HPA] = EDAC_MEM_REPAIR_ATTR_RO(min_hpa, instance),
>> +		[MEM_REPAIR_MAX_HPA] = EDAC_MEM_REPAIR_ATTR_RO(max_hpa, instance),
>> +		[MEM_REPAIR_DPA] = EDAC_MEM_REPAIR_ATTR_RW(dpa, instance),
>> +		[MEM_REPAIR_MIN_DPA] = EDAC_MEM_REPAIR_ATTR_RO(min_dpa, instance),
>> +		[MEM_REPAIR_MAX_DPA] = EDAC_MEM_REPAIR_ATTR_RO(max_dpa, instance),
>> +		[MEM_REPAIR_NIBBLE_MASK] =
>> +				EDAC_MEM_REPAIR_ATTR_RW(nibble_mask, instance),
>> +		[MEM_REPAIR_MIN_NIBBLE_MASK] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(min_nibble_mask, instance),
>> +		[MEM_REPAIR_MAX_NIBBLE_MASK] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(max_nibble_mask, instance),
>> +		[MEM_REPAIR_BANK_GROUP] =
>> +				EDAC_MEM_REPAIR_ATTR_RW(bank_group, instance),
>> +		[MEM_REPAIR_MIN_BANK_GROUP] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(min_bank_group, instance),
>> +		[MEM_REPAIR_MAX_BANK_GROUP] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(max_bank_group, instance),
>> +		[MEM_REPAIR_BANK] = EDAC_MEM_REPAIR_ATTR_RW(bank, instance),
>> +		[MEM_REPAIR_MIN_BANK] = EDAC_MEM_REPAIR_ATTR_RO(min_bank, instance),
>> +		[MEM_REPAIR_MAX_BANK] = EDAC_MEM_REPAIR_ATTR_RO(max_bank, instance),
>> +		[MEM_REPAIR_RANK] = EDAC_MEM_REPAIR_ATTR_RW(rank, instance),
>> +		[MEM_REPAIR_MIN_RANK] = EDAC_MEM_REPAIR_ATTR_RO(min_rank, instance),
>> +		[MEM_REPAIR_MAX_RANK] = EDAC_MEM_REPAIR_ATTR_RO(max_rank, instance),
>> +		[MEM_REPAIR_ROW] = EDAC_MEM_REPAIR_ATTR_RW(row, instance),
>> +		[MEM_REPAIR_MIN_ROW] = EDAC_MEM_REPAIR_ATTR_RO(min_row, instance),
>> +		[MEM_REPAIR_MAX_ROW] = EDAC_MEM_REPAIR_ATTR_RO(max_row, instance),
>> +		[MEM_REPAIR_COLUMN] = EDAC_MEM_REPAIR_ATTR_RW(column, instance),
>> +		[MEM_REPAIR_MIN_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(min_column, instance),
>> +		[MEM_REPAIR_MAX_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(max_column, instance),
>> +		[MEM_REPAIR_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RW(channel, instance),
>> +		[MEM_REPAIR_MIN_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(min_channel, instance),
>> +		[MEM_REPAIR_MAX_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(max_channel, instance),
>> +		[MEM_REPAIR_SUB_CHANNEL] =
>> +				EDAC_MEM_REPAIR_ATTR_RW(sub_channel, instance),
>> +		[MEM_REPAIR_MIN_SUB_CHANNEL] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(min_sub_channel, instance),
>> +		[MEM_REPAIR_MAX_SUB_CHANNEL] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(max_sub_channel, instance),
>> +		[MEM_DO_REPAIR] = EDAC_MEM_REPAIR_ATTR_WO(repair, instance)
>> +	};
>> +
>> +	ctx = devm_kzalloc(dev, sizeof(*ctx), GFP_KERNEL);
>> +	if (!ctx)
>> +		return -ENOMEM;
>> +
>> +	for (i = 0; i < MEM_REPAIR_MAX_ATTRS; i++) {
>> +		memcpy(&ctx->mem_repair_dev_attr[i].dev_attr,
>> +		       &dev_attr[i], sizeof(dev_attr[i]));
>> +		ctx->mem_repair_attrs[i] =
>> +				&ctx->mem_repair_dev_attr[i].dev_attr.attr;
>> +	}
>> +
>> +	sprintf(ctx->name, "%s%d", "mem_repair", instance);
>> +	group = &ctx->group;
>> +	group->name = ctx->name;
>> +	group->attrs = ctx->mem_repair_attrs;
>> +	group->is_visible  = mem_repair_attr_visible;
>> +	attr_groups[0] = group;
>> +
>> +	return 0;
>> +}
>> +
>> +/**
>> + * edac_mem_repair_get_desc - get EDAC memory repair descriptors
>> + * @dev: client device with memory repair feature
>> + * @attr_groups: pointer to attribute group container
>> + * @instance: device's memory repair instance number.
>> + *
>> + * Return:
>> + *  * %0	- Success.
>> + *  * %-EINVAL	- Invalid parameters passed.
>> + *  * %-ENOMEM	- Dynamic memory allocation failed.
>> + */
>> +int edac_mem_repair_get_desc(struct device *dev,
>> +			     const struct attribute_group **attr_groups, u8 instance)
>> +{
>> +	if (!dev || !attr_groups)
>> +		return -EINVAL;
>> +
>> +	return mem_repair_create_desc(dev, attr_groups, instance);
>> +}
>> diff --git a/include/linux/edac.h b/include/linux/edac.h
>> index 979e91426701..5d07192bf1a7 100644
>> --- a/include/linux/edac.h
>> +++ b/include/linux/edac.h
>> @@ -668,6 +668,7 @@ static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci,
>>  enum edac_dev_feat {
>>  	RAS_FEAT_SCRUB,
>>  	RAS_FEAT_ECS,
>> +	RAS_FEAT_MEM_REPAIR,
>>  	RAS_FEAT_MAX
>>  };
>>
>> @@ -729,11 +730,147 @@ int edac_ecs_get_desc(struct device *ecs_dev,
>>  		      const struct attribute_group **attr_groups,
>>  		      u16 num_media_frus);
>>
>> +enum edac_mem_repair_function {
>> +	EDAC_SOFT_PPR,
>> +	EDAC_HARD_PPR,
>> +	EDAC_CACHELINE_MEM_SPARING,
>> +	EDAC_ROW_MEM_SPARING,
>> +	EDAC_BANK_MEM_SPARING,
>> +	EDAC_RANK_MEM_SPARING,
>> +};
>> +
>> +enum edac_mem_repair_persist_mode {
>> +	EDAC_MEM_REPAIR_SOFT, /* soft memory repair */
>> +	EDAC_MEM_REPAIR_HARD, /* hard memory repair */
>> +};
>> +
>> +enum edac_mem_repair_cmd {
>> +	EDAC_DO_MEM_REPAIR = 1,
>> +};
>> +
>> +/**
>> + * struct edac_mem_repair_ops - memory repair operations
>> + * (all elements are optional except do_repair, set_hpa/set_dpa)
>> + * @get_repair_function: get the memory repair function, listed in
>> + *			 enum edac_mem_repair_function.
>> + * @get_persist_mode: get the current persist mode. Persist repair modes supported
>> + *		      in the device is based on the memory repair function which is
>> + *		      temporary or permanent and is lost with a power cycle.
>> + *		      EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
>> + *		      EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
>> + * All other values are reserved.
>> + * @set_persist_mode: set the persist mode of the memory repair instance.
>> + * @get_dpa_support: get dpa support flag. In some states of system configuration
>> + *		     (e.g. before address decoders have been configured), memory devices
>> + *		     (e.g. CXL) may not have an active mapping in the main host address
>> + *		     physical address map. As such, the memory to repair must be identified
>> + *		     by a device specific physical addressing scheme using a device physical
>> + *		     address(DPA). The DPA and other control attributes to use for the
>> + *		     dry_run and repair operations will be presented in related error records.
>> + * @get_repair_safe_when_in_use: get whether memory media is accessible and
>> + *				 data is retained during repair operation.
>> + * @get_hpa: get current host physical address (HPA).
>> + * @set_hpa: set host physical address (HPA) of memory to repair.
>> + * @get_min_hpa: get the minimum supported host physical address (HPA).
>> + * @get_max_hpa: get the maximum supported host physical address (HPA).
>> + * @get_dpa: get current device physical address (DPA).
>> + * @set_dpa: set device physical address (DPA) of memory to repair.
>> + * @get_min_dpa: get the minimum supported device physical address (DPA).
>> + * @get_max_dpa: get the maximum supported device physical address (DPA).
>> + * @get_nibble_mask: get current nibble mask.
>> + * @set_nibble_mask: set nibble mask of memory to repair.
>> + * @get_min_nibble_mask: get the minimum supported nibble mask.
>> + * @get_max_nibble_mask: get the maximum supported nibble mask.
>> + * @get_bank_group: get current bank group.
>> + * @set_bank_group: set bank group of memory to repair.
>> + * @get_min_bank_group: get the minimum supported bank group.
>> + * @get_max_bank_group: get the maximum supported bank group.
>> + * @get_bank: get current bank.
>> + * @set_bank: set bank of memory to repair.
>> + * @get_min_bank: get the minimum supported bank.
>> + * @get_max_bank: get the maximum supported bank.
>> + * @get_rank: get current rank.
>> + * @set_rank: set rank of memory to repair.
>> + * @get_min_rank: get the minimum supported rank.
>> + * @get_max_rank: get the maximum supported rank.
>> + * @get_row: get current row.
>> + * @set_row: set row of memory to repair.
>> + * @get_min_row: get the minimum supported row.
>> + * @get_max_row: get the maximum supported row.
>> + * @get_column: get current column.
>> + * @set_column: set column of memory to repair.
>> + * @get_min_column: get the minimum supported column.
>> + * @get_max_column: get the maximum supported column.
>> + * @get_channel: get current channel.
>> + * @set_channel: set channel of memory to repair.
>> + * @get_min_channel: get the minimum supported channel.
>> + * @get_max_channel: get the maximum supported channel.
>> + * @get_sub_channel: get current sub channel.
>> + * @set_sub_channel: set sub channel of memory to repair.
>> + * @get_min_sub_channel: get the minimum supported sub channel.
>> + * @get_max_sub_channel: get the maximum supported sub channel.
>> + * @do_repair: Issue memory repair operation for the HPA/DPA and
>> + *	       other control attributes set for the memory to repair.
>> + */
>> +struct edac_mem_repair_ops {
>> +	int (*get_repair_function)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_persist_mode)(struct device *dev, void *drv_data, u32 *mode);
>> +	int (*set_persist_mode)(struct device *dev, void *drv_data, u32 mode);
>> +	int (*get_dpa_support)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_repair_safe_when_in_use)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_hpa)(struct device *dev, void *drv_data, u64 *hpa);
>> +	int (*set_hpa)(struct device *dev, void *drv_data, u64 hpa);
>> +	int (*get_min_hpa)(struct device *dev, void *drv_data, u64 *hpa);
>> +	int (*get_max_hpa)(struct device *dev, void *drv_data, u64 *hpa);
>> +	int (*get_dpa)(struct device *dev, void *drv_data, u64 *dpa);
>> +	int (*set_dpa)(struct device *dev, void *drv_data, u64 dpa);
>> +	int (*get_min_dpa)(struct device *dev, void *drv_data, u64 *dpa);
>> +	int (*get_max_dpa)(struct device *dev, void *drv_data, u64 *dpa);
>> +	int (*get_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
>> +	int (*set_nibble_mask)(struct device *dev, void *drv_data, u64 val);
>> +	int (*get_min_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
>> +	int (*get_max_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
>> +	int (*get_bank_group)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*set_bank_group)(struct device *dev, void *drv_data, u32 val);
>> +	int (*get_min_bank_group)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_max_bank_group)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_bank)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*set_bank)(struct device *dev, void *drv_data, u32 val);
>> +	int (*get_min_bank)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_max_bank)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_rank)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*set_rank)(struct device *dev, void *drv_data, u32 val);
>> +	int (*get_min_rank)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_max_rank)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_row)(struct device *dev, void *drv_data, u64 *val);
>> +	int (*set_row)(struct device *dev, void *drv_data, u64 val);
>> +	int (*get_min_row)(struct device *dev, void *drv_data, u64 *val);
>> +	int (*get_max_row)(struct device *dev, void *drv_data, u64 *val);
>> +	int (*get_column)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*set_column)(struct device *dev, void *drv_data, u32 val);
>> +	int (*get_min_column)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_max_column)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_channel)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*set_channel)(struct device *dev, void *drv_data, u32 val);
>> +	int (*get_min_channel)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_max_channel)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_sub_channel)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*set_sub_channel)(struct device *dev, void *drv_data, u32 val);
>> +	int (*get_min_sub_channel)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_max_sub_channel)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*do_repair)(struct device *dev, void *drv_data, u32 val);
>> +};
>> +
>> +int edac_mem_repair_get_desc(struct device *dev,
>> +			     const struct attribute_group **attr_groups,
>> +			     u8 instance);
>> +
>>  /* EDAC device feature information structure */
>>  struct edac_dev_data {
>>  	union {
>>  		const struct edac_scrub_ops *scrub_ops;
>>  		const struct edac_ecs_ops *ecs_ops;
>> +		const struct edac_mem_repair_ops *mem_repair_ops;
>>  	};
>>  	u8 instance;
>>  	void *private;
>> @@ -744,6 +881,7 @@ struct edac_dev_feat_ctx {
>>  	void *private;
>>  	struct edac_dev_data *scrub;
>>  	struct edac_dev_data ecs;
>> +	struct edac_dev_data *mem_repair;
>>  };
>>
>>  struct edac_dev_feature {
>> @@ -752,6 +890,7 @@ struct edac_dev_feature {
>>  	union {
>>  		const struct edac_scrub_ops *scrub_ops;
>>  		const struct edac_ecs_ops *ecs_ops;
>> +		const struct edac_mem_repair_ops *mem_repair_ops;
>>  	};
>>  	void *ctx;
>>  	struct edac_ecs_ex_info ecs_info;
>
>Thanks,
>Mauro

Thanks,
Shiju
Mauro Carvalho Chehab Jan. 14, 2025, 12:38 p.m. UTC | #17
Em Thu, 9 Jan 2025 14:24:33 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu:

> On Thu, 9 Jan 2025 13:32:22 +0100
> Borislav Petkov <bp@alien8.de> wrote:
> 
> Hi Boris,
> 
> > On Thu, Jan 09, 2025 at 11:00:43AM +0000, Shiju Jose wrote:  
> > > The min_ and max_ attributes of the control attributes are added  for your
> > > feedback on V15 to expose supported ranges of these control attributes to the user, 
> > > in the following links.      
> > 
> > Sure, but you can make that differently:
> > 
> > cat /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
> > [x:y]
> > 
> > which is the allowed range.  
> 
> To my thinking that would fail the test of being an intuitive interface.
> To issue a repair command requires that multiple attributes be configured
> before triggering the actual repair.
> 
> Think of it as setting the coordinates of the repair in a high dimensional
> space.
> 
> In the extreme case of fine grained repair (Cacheline), to identify the
> relevant subunit of memory (obtained from the error record that we are
> basing the decision to repair on) we need to specify all of:
> 
> Channel, sub-channel, rank,  bank group, row, column and nibble mask.
> For coarser granularity repair only a subset of these applies and
> only the relevant controls are exposed to userspace.
> 
> They are broken out as specific attributes to enable each to be set before
> triggering the action with a write to the repair attribute.
> 
> There are several possible alternatives:
> 
> Option 1
> 
> "A:B:C:D:E:F:G:H:I:J" opaque single write to trigger the repair where
> each number provides one of those coordinates and where a readback
> lets us know what each number is.
> 
> That single attribute interface is very hard to extend in an intuitive way.
> 
> History tells us more levels will be introduced in the middle, not just
> at the finest granularity, making such an interface hard to extend in
> a backwards compatible way.
> 
> Another alternative of a key value list would make for a nasty sysfs
> interface.
> 
> Option 2 
> There are sysfs interfaces that use a selection type presentation.
> 
> Write: "C", Read: "A, B, [C], D" but that only works well for discrete sets
> of options and is a pain to parse if read back is necessary.

Writing it as:

	a b [c] d

or even:
	a, b, [c], d

doesn't make it hard to parse in userspace. Adding a comma makes the
Kernel code a little bigger, as it needs an extra check at the loop
to see if the line is empty or not:

	if (tmp != buf)
		tmp += sprintf(tmp, ", ");

Btw, we have an implementation like that on kernelspace/userspace for
the RC API:

- Kernelspace:
  https://github.com/torvalds/linux/blob/master/drivers/media/rc/rc-main.c#L1125
  6 lines of code + a const table with names/values, if we use the same example
  for EDAC:

	static const struct {
		const char *name;
		u64 type;
	} names[] = { { "foo", BIT(EDAC_FOO) }, { "bar", BIT(EDAC_BAR) } };

	for (i = 0; i < ARRAY_SIZE(names); i++) {
		if (enabled & names[i].type)
			tmp += sprintf(tmp, "[%s] ", names[i].name);
		else if (allowed & names[i].type)
			tmp += sprintf(tmp, "%s ", names[i].name);
	}


- Userspace:
  https://git.linuxtv.org/v4l-utils.git/tree/utils/keytable/keytable.c#n197
  5 lines of code + a const table, if we use the same example
  for rasdaemon:

		const char *name[] = { 
			[EDAC_FOO] = "[foo]",
			[EDAC_BAR] = "[bar]",
		};

		for (p = strtok(arg, " ,"); p; p = strtok(NULL, " ,"))
			for (i = 0; i < ARRAY_SIZE(name); i++)
				if (!strcasecmp(p, name[i]))
					return i;
		return -1;

	(strtok handles both space and commas at the above example)

IMO, this is a lot better, as the alternative would be to have separate
sysfs nodes to describe what values are valid for a given edac devnode.

See, userspace needs to know what values are valid for a given
device and support for it may vary depending on the Kernel and
device version. So, we need to have the information about what values
are valid stored on some sysfs devnode, to allow backward compatibility.

> 
> So in conclusion, I think the proposed multiple sysfs attribute style
> with them reading back the most recent value written is the least bad
> solution to a complex control interface.
> 
> > 
> > echo ... 
> > 
> > then writes in the bank.
> >   
> > > ... so we would propose we do not add max_ and min_ for now and see how the
> > > use cases evolve.    
> > 
> > Yes, you should apply that same methodology to the rest of the new features
> > you're adding: only add functionality for the stuff that is actually being
> > used now. You can always extend it later.
> > 
> > Changing an already user-visible API is a whole different story and a lot lot
> > harder, even impossible.
> > 
> > So I'd suggest you prune the EDAC patches from all the hypothetical usage and
> > then send only what remains so that I can try to queue them.  
> 
> Sure. In this case the addition of min/max was perhaps a wrong response to
> your request for a way to expose those ranges rather than just rejecting a write
> of something out of range as earlier versions did.
> 
> We can revisit in future if range discovery becomes necessary.  Personally
> I don't think it is, given we are only taking these actions in response to
> error records that give us precisely what to write and hence are always in range.

For RO devnodes, there's no need for ranges, but those are likely needed for
RW, as otherwise userspace may try to write invalid requests and/or have
backward-compatibility issues.
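
To make the RW case concrete, with ranges known the failure mode becomes
visible to the user (a sketch only; the device name and value are made up,
and it assumes the driver rejects out-of-range input):

	$ echo 99 > /sys/bus/edac/devices/cxl_mem0/mem_repair0/rank
	-bash: echo: write error: Invalid argument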


Thanks,
Mauro
Mauro Carvalho Chehab Jan. 14, 2025, 12:57 p.m. UTC | #18
Em Thu, 9 Jan 2025 16:01:59 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu:


> > My goal here was to make this user-friendly. Because you need some way of
> > knowing what valid ranges are and in order to trigger the repair, if it needs
> > to happen for a range.  

IMO, user-friendly is important, as it allows people to manually use the
feature. This is interesting for debugging purposes and also to test if
some hardware is doing the right thing.

Ok, in practice, production will use a userspace tool like rasdaemon,
and/or some scripts [1].

[1] I'd say that rasdaemon should have an initialization phase to
discover whatever capabilities can be discovered up front.

As an example, rasdaemon could reserve some sparing memory at init
time, if the hardware (partially) supports it. For instance, a CXL
device may not be able to handle rank-mem-sparing, but it can handle
bank-mem-sparing.

> In at least the CXL case I'm fairly sure most of them are not discoverable.
> Until you see errors you have no idea what the memory topology is.

Sure, but some things can be discovered in advance, like what CXL
scrubbing features are supported by a given hardware.

If the hardware supports detecting ranges for row/bank/rank sparing,
it would be nice to have this reported in a way that userspace can
properly set it at OS init time, if desired by the sysadmins.

> > Or, you can teach the repair logic to ignore invalid ranges and "clamp" things
> > to whatever makes sense.  
> 
> For that you'd need to have a path to read back what happened.

If sysfs is RW, you have it there already after committing the value set.

> > Again, I'm looking at it from the usability perspective. I haven't actually
> > needed this scrub+repair functionality yet to know whether the UI makes sense.
> > So yeah, collecting some feedback from real-life use cases would probably give
> > you a lot better understanding of how that UI should be designed... perhaps
> > you won't ever need the ranges, whow knows.
> > 
> > So yes, preemptively designing stuff like that "in the dark" is kinda hard.
> > :-)  
> 
> The discoverability is unnecessary for any known usecase.
> 
> Ok. Then can we just drop the range discoverability entirely or we go with
> your suggestion and do not support read back of what has been
> requested but instead have the reads return a range if known or "" /
> return -EOPNOTSUPP if simply not known?

It sounds to me that ranges are needed at least to set up memory sparing.

> I can live with that though to me we are heading in the direction of
> a less intuitive interface to save a small number of additional files.
> 
> Jonathan
> 
> >   
> 



Thanks,
Mauro
Jonathan Cameron Jan. 14, 2025, 1:05 p.m. UTC | #19
On Tue, 14 Jan 2025 13:38:31 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> Em Thu, 9 Jan 2025 14:24:33 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu:
> 
> > On Thu, 9 Jan 2025 13:32:22 +0100
> > Borislav Petkov <bp@alien8.de> wrote:
> > 
> > Hi Boris,
> >   
> > > On Thu, Jan 09, 2025 at 11:00:43AM +0000, Shiju Jose wrote:    
> > > > The min_ and max_ attributes of the control attributes are added  for your
> > > > feedback on V15 to expose supported ranges of these control attributes to the user, 
> > > > in the following links.        
> > > 
> > > Sure, but you can make that differently:
> > > 
> > > cat /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
> > > [x:y]
> > > 
> > > which is the allowed range.    
> > 
> > To my thinking that would fail the test of being an intuitive interface.
> > To issue a repair command requires that multiple attributes be configured
> > before triggering the actual repair.
> > 
> > Think of it as setting the coordinates of the repair in a high dimensional
> > space.
> > 
> > In the extreme case of fine grained repair (Cacheline), to identify the
> > relevant subunit of memory (obtained from the error record that we are
> > basing the decision to repair on) we need to specify all of:
> > 
> > Channel, sub-channel, rank,  bank group, row, column and nibble mask.
> > For coarser granularity repair only a subset of these applies and
> > only the relevant controls are exposed to userspace.
> > 
> > They are broken out as specific attributes to enable each to be set before
> > triggering the action with a write to the repair attribute.
> > 
> > There are several possible alternatives:
> > 
> > Option 1
> > 
> > "A:B:C:D:E:F:G:H:I:J" opaque single write to trigger the repair where
> > each number provides one of those coordinates and where a readback
> > lets us know what each number is.
> > 
> > That single attribute interface is very hard to extend in an intuitive way.
> > 
> > History tells us more levels will be introduced in the middle, not just
> > at the finest granularity, making such an interface hard to extend in
> > a backwards compatible way.
> > 
> > Another alternative of a key value list would make for a nasty sysfs
> > interface.
> > 
> > Option 2 
> > There are sysfs interfaces that use a selection type presentation.
> > 
> > Write: "C", Read: "A, B, [C], D" but that only works well for discrete sets
> > of options and is a pain to parse if read back is necessary.  
> 
> Writing it as:
> 
> 	a b [c] d
> 
> or even:
> 	a, b, [c], d
> 
> doesn't make it hard to parse in userspace. Adding a comma makes the
> Kernel code a little bigger, as it needs an extra check at the loop
> to see if the line is empty or not:
> 
> 	if (tmp != buf)
> 		tmp += sprintf(tmp, ", ");
> 
> Btw, we have an implementation like that on kernelspace/userspace for
> the RC API:
> 
> - Kernelspace:
>   https://github.com/torvalds/linux/blob/master/drivers/media/rc/rc-main.c#L1125
>   6 lines of code + a const table with names/values, if we use the same example
>   for EDAC:
> 
> 	static const struct {
> 		const char *name;
> 		u64 type;
> 	} names[] = { { "foo", BIT(EDAC_FOO) }, { "bar", BIT(EDAC_BAR) } };
> 
> 	for (i = 0; i < ARRAY_SIZE(names); i++) {
> 		if (enabled & names[i].type)
> 			tmp += sprintf(tmp, "[%s] ", names[i].name);
> 		else if (allowed & names[i].type)
> 			tmp += sprintf(tmp, "%s ", names[i].name);
> 	}
> 
> 
> - Userspace:
>   https://git.linuxtv.org/v4l-utils.git/tree/utils/keytable/keytable.c#n197
>   5 lines of code + a const table, if we use the same example
>   for rasdaemon:
> 
> 		const char *name[] = { 
> 			[EDAC_FOO] = "[foo]",
> 			[EDAC_BAR] = "[bar]",
> 		};
> 
> 		for (p = strtok(arg, " ,"); p; p = strtok(NULL, " ,"))
> 			for (i = 0; i < ARRAY_SIZE(name); i++)
> 				if (!strcasecmp(p, name[i]))
> 					return i;
> 		return -1;
> 
> 	(strtok handles both space and commas at the above example)
> 
> IMO, this is a lot better, as the alternative would be to have separate
> sysfs nodes to describe what values are valid for a given edac devnode.
> 
> See, userspace needs to know what values are valid for a given
> device and support for it may vary depending on the Kernel and
> device version. So, we need to have the information about what values
> are valid stored on some sysfs devnode, to allow backward compatibility.

These aren't selectors from a discrete list, so the question is more
whether a syntax of
<min> value <max>
is intuitive or not.  I'm not aware of a precedent for this one.

There was another branch of the thread where Boris mentioned this as an
option. It isn't bad to deal with and an easy change to the code,
but I have an open question on what choice we make for representing
unknown min / max.  For separate files the absence of the file
indicates we don't have any information.
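
To make it concrete, a read under that syntax might look like this
(a sketch only; the device name and numbers are made up):

	$ cat /sys/bus/edac/devices/cxl_mem0/mem_repair0/bank
	0 3 31

i.e. min, last written value and max on one line, with some convention
still needed for the case where min/max are unknown.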


> 
> > 
> > So in conclusion, I think the proposed multiple sysfs attribute style
> > with them reading back the most recent value written is the least bad
> > solution to a complex control interface.
> >   
> > > 
> > > echo ... 
> > > 
> > > then writes in the bank.
> > >     
> > > > ... so we would propose we do not add max_ and min_ for now and see how the
> > > > use cases evolve.      
> > > 
> > > Yes, you should apply that same methodology to the rest of the new features
> > > you're adding: only add functionality for the stuff that is actually being
> > > used now. You can always extend it later.
> > > 
> > > Changing an already user-visible API is a whole different story and a lot lot
> > > harder, even impossible.
> > > 
> > > So I'd suggest you prune the EDAC patches from all the hypothetical usage and
> > > then send only what remains so that I can try to queue them.    
> > 
> > Sure. In this case the addition of min/max was perhaps a wrong response to
> > your request for a way to expose those ranges rather than just rejecting a write
> > of something out of range as earlier versions did.
> > 
> > We can revisit in future if range discovery becomes necessary.  Personally
> > I don't think it is, given we are only taking these actions in response to
> > error records that give us precisely what to write and hence are always in range.
> 
> For RO devnodes, there's no need for ranges, but those are likely needed for
> RW, as otherwise userspace may try to write invalid requests and/or have
> backward-compatibility issues.

Given these parameters are only meaningfully written with values coming
ultimately from error records, userspace should never consider writing
something that is out of range except during testing.

I don't mind presenting the range where known (in the CXL case it is not
discoverable for most of them) but I wouldn't expect tooling to ever
read it, as known correct values to write come from the error records.
Checking those values against provided limits seems an unnecessary step
given an invalid parameter that slips through will be rejected by the
hardware anyway.
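
For reference, the flow a tool such as rasdaemon would follow is just
(a sketch; the device name and coordinates are made up - the real
values come from the error record):

	R=/sys/bus/edac/devices/cxl_mem0/mem_repair0
	echo 2    > $R/channel
	echo 1    > $R/sub_channel
	echo 0    > $R/rank
	echo 3    > $R/bank_group
	echo 0x2a > $R/row
	echo 9    > $R/column
	echo 0x5  > $R/nibble_mask
	echo 1    > $R/repair		# EDAC_DO_MEM_REPAIR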

Jonathan

> 
> 
> Thanks,
> Mauro
Mauro Carvalho Chehab Jan. 14, 2025, 1:10 p.m. UTC | #20
Em Thu, 9 Jan 2025 18:34:48 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu:

> On Thu, 9 Jan 2025 17:19:02 +0100
> Borislav Petkov <bp@alien8.de> wrote:

> > 
> > But then why do you even need the interface at all?
> > 
> > Why can't the kernel automatically collect all those attributes and start the
> > scrubbing automatically - no need for any user interaction...?

Implementing policies in kernelspace is a very bad idea.

See, to properly implement scrubbing and memory sparing policies, one
needs to have knowledge not only about the current Kernel lifetime (which
may be a recent boot due to a Kernel upgrade), but also about events that
happened in the past months/years.

In other words, a database is needed to know if a memory error
was just a random issue due to high cosmic ray activities (so, a
soft PPR would be used - just in case), or if it is due to some memory
region that is known to have had past problems, probably an indication
of a hardware issue - in which case a hard PPR would be used instead.

If this were meant to be done automatically, CXL wouldn't need to send
events about that to the OSPM.

Also, different usecases may require different policies. So, it is
better to let a userspace daemon handle policies, and use sysfs for
such a daemon to set up the hardware.
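
As a sketch of such a daemon-side policy (not actual rasdaemon code;
the device name and the seen_before() history lookup are hypothetical,
and the persist_mode values follow enum edac_mem_repair_persist_mode):

	R=/sys/bus/edac/devices/cxl_mem0/mem_repair0
	if seen_before "$DPA"; then
		echo 1 > $R/persist_mode	# EDAC_MEM_REPAIR_HARD
	else
		echo 0 > $R/persist_mode	# EDAC_MEM_REPAIR_SOFT
	fi
	echo "$DPA" > $R/dpa
	echo 1 > $R/repair			# EDAC_DO_MEM_REPAIR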

Thanks,
Mauro
Mauro Carvalho Chehab Jan. 14, 2025, 1:47 p.m. UTC | #21
Em Mon, 6 Jan 2025 12:10:00 +0000
<shiju.jose@huawei.com> escreveu:

> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RO) Memory repair function type. For eg. post package repair,
> +		memory sparing etc.
> +		EDAC_SOFT_PPR - Soft post package repair
> +		EDAC_HARD_PPR - Hard post package repair
> +		EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
> +		EDAC_ROW_MEM_SPARING - Row memory sparing
> +		EDAC_BANK_MEM_SPARING - Bank memory sparing
> +		EDAC_RANK_MEM_SPARING - Rank memory sparing
> +		All other values are reserved.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
> +Date:		Jan 2025
> +KernelVersion:	6.14
> +Contact:	linux-edac@vger.kernel.org
> +Description:
> +		(RW) Read/Write the current persist repair mode set for a
> +		repair function. Persist repair modes supported in the
> +		device, based on the memory repair function is temporary
> +		or permanent and is lost with a power cycle.
> +		EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
> +		EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
> +		All other values are reserved.
> +

After re-reading some things, I suspect that the above can be simplified
a little bit by folding soft/hard PPR into a single element at
/repair_function, and making it clearer that persist_mode is valid only
for PPR (I think this is the case, right?), e.g. something like:

	What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
	...
	Description:
			(RO) Memory repair function type, e.g. post package repair,
			memory sparing etc. Valid values are:

			- ppr - post package repair.
			  Please define its mode via
			  /sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
			- cacheline-sparing - Cacheline memory sparing
			- row-sparing - Row memory sparing
			- bank-sparing - Bank memory sparing
			- rank-sparing - Rank memory sparing
			- All other values are reserved.

and define persist_mode in a different way:

	What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/ppr_persist_mode
	...
	Description:
		(RW) Read/Write the current persist repair mode set for a
		post package repair (PPR) function. Depending on the mode,
		the repair is either temporary (lost with a power cycle) or
		permanent. Valid values are:

		- repair-soft - Soft PPR function (temporary repair).
		- repair-hard - Hard memory repair function (permanent repair).
		- All other values are reserved.
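
With that naming, a session would read naturally (a sketch; the
device name is made up):

	$ cat /sys/bus/edac/devices/cxl_mem0/mem_repair0/repair_function
	ppr
	$ echo repair-soft > /sys/bus/edac/devices/cxl_mem0/mem_repair0/ppr_persist_mode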

Thanks,
Mauro
Mauro Carvalho Chehab Jan. 14, 2025, 2:26 p.m. UTC | #22
Em Tue, 14 Jan 2025 12:31:44 +0000
Shiju Jose <shiju.jose@huawei.com> escreveu:

> Hi Mauro,
> 
> Thanks for the comments.
> 
> >-----Original Message-----
> >From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> >Sent: 14 January 2025 11:48
> >To: Shiju Jose <shiju.jose@huawei.com>
> >Cc: linux-edac@vger.kernel.org; linux-cxl@vger.kernel.org; linux-
> >acpi@vger.kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org;
> >bp@alien8.de; tony.luck@intel.com; rafael@kernel.org; lenb@kernel.org;
> >mchehab@kernel.org; dan.j.williams@intel.com; dave@stgolabs.net; Jonathan
> >Cameron <jonathan.cameron@huawei.com>; dave.jiang@intel.com;
> >alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com;
> >david@redhat.com; Vilas.Sridharan@amd.com; leo.duran@amd.com;
> >Yazen.Ghannam@amd.com; rientjes@google.com; jiaqiyan@google.com;
> >Jon.Grimm@amd.com; dave.hansen@linux.intel.com;
> >naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com;
> >somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com;
> >duenwen@google.com; gthelen@google.com;
> >wschwartz@amperecomputing.com; dferguson@amperecomputing.com;
> >wbs@os.amperecomputing.com; nifan.cxl@gmail.com; tanxiaofei
> ><tanxiaofei@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Roberto
> >Sassu <roberto.sassu@huawei.com>; kangkang.shen@futurewei.com;
> >wanghuiqiang <wanghuiqiang@huawei.com>; Linuxarm
> ><linuxarm@huawei.com>
> >Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
> >
> >Em Mon, 6 Jan 2025 12:10:00 +0000
> ><shiju.jose@huawei.com> escreveu:
> >  
> >> From: Shiju Jose <shiju.jose@huawei.com>
> >>
> >> Add a generic EDAC memory repair control driver to manage memory repairs
> >> in the system, such as CXL Post Package Repair (PPR) and CXL memory sparing
> >> features.
> >>
> >> For example, a CXL device with DRAM components that support PPR features
> >> may implement PPR maintenance operations. DRAM components may support two
> >> types of PPR, hard PPR, for a permanent row repair, and soft PPR, for a
> >> temporary row repair. Soft PPR is much faster than hard PPR, but the repair
> >> is lost with a power cycle.
> >> Similarly a CXL memory device may support soft and hard memory sparing at
> >> cacheline, row, bank and rank granularities. Memory sparing is defined as
> >> a repair function that replaces a portion of memory with a portion of
> >> functional memory at that same granularity.
> >> When a CXL device detects an error in a memory, it may report the host of
> >> the need for a repair maintenance operation by using an event record where
> >> the "maintenance needed" flag is set. The event records contains the device
> >> physical address(DPA) and other attributes of the memory to repair (such as
> >> channel, sub-channel, bank group, bank, rank, row, column etc). The kernel
> >> will report the corresponding CXL general media or DRAM trace event to
> >> userspace, and userspace tools (e.g. rasdaemon) will initiate a repair
> >> operation in response to the device request via the sysfs repair control.
> >>
> >> Device with memory repair features registers with EDAC device driver,
> >> which retrieves memory repair descriptor from EDAC memory repair driver
> >> and exposes the sysfs repair control attributes to userspace in
> >> /sys/bus/edac/devices/<dev-name>/mem_repairX/.
> >>
> >> The common memory repair control interface abstracts the control of
> >> arbitrary memory repair functionality into a standardized set of functions.
> >> The sysfs memory repair attribute nodes are only available if the client
> >> driver has implemented the corresponding attribute callback function and
> >> provided operations to the EDAC device driver during registration.
> >>
> >> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> >> ---
> >>  .../ABI/testing/sysfs-edac-memory-repair      | 244 +++++++++
> >>  Documentation/edac/features.rst               |   3 +
> >>  Documentation/edac/index.rst                  |   1 +
> >>  Documentation/edac/memory_repair.rst          | 101 ++++
> >>  drivers/edac/Makefile                         |   2 +-
> >>  drivers/edac/edac_device.c                    |  33 ++
> >>  drivers/edac/mem_repair.c                     | 492 ++++++++++++++++++
> >>  include/linux/edac.h                          | 139 +++++
> >>  8 files changed, 1014 insertions(+), 1 deletion(-)
> >>  create mode 100644 Documentation/ABI/testing/sysfs-edac-memory-repair
> >>  create mode 100644 Documentation/edac/memory_repair.rst
> >>  create mode 100755 drivers/edac/mem_repair.c
> >>
> >> diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair b/Documentation/ABI/testing/sysfs-edac-memory-repair
> >> new file mode 100644
> >> index 000000000000..e9268f3780ed
> >> --- /dev/null
> >> +++ b/Documentation/ABI/testing/sysfs-edac-memory-repair
> >> @@ -0,0 +1,244 @@
> >> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX
> >> +Date:		Jan 2025
> >> +KernelVersion:	6.14
> >> +Contact:	linux-edac@vger.kernel.org
> >> +Description:
> >> +		The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory
> >> +		pertains to the memory media repair features control, such as
> >> +		PPR (Post Package Repair), memory sparing etc., where <dev-name>
> >> +		directory corresponds to a device registered with the EDAC
> >> +		device driver for the memory repair features.
> >> +
> >> +		Post Package Repair is a maintenance operation that requests the
> >> +		memory device to perform a repair operation on its media. In
> >> +		detail, it is a memory self-healing feature that fixes a failing
> >> +		memory location by replacing it with a spare row in a DRAM
> >> +		device. For example, a CXL memory device with DRAM components
> >> +		that support PPR features may implement PPR maintenance
> >> +		operations. DRAM components may support two types of PPR
> >> +		functions: hard PPR, for a permanent row repair, and soft PPR,
> >> +		for a temporary row repair. Soft PPR is much faster than hard
> >> +		PPR, but the repair is lost with a power cycle.
> >> +
> >> +		Memory sparing is a repair function that replaces a portion
> >> +		of memory with a portion of functional memory at that same
> >> +		sparing granularity. Memory sparing has cacheline/row/bank/rank
> >> +		sparing granularities. For example, in memory-sparing mode,
> >> +		one memory rank serves as a spare for other ranks on the same
> >> +		channel in case they fail. The spare rank is held in reserve and
> >> +		not used as active memory until a failure is indicated, with
> >> +		reserved capacity subtracted from the total available memory
> >> +		in the system. The DIMM installation order for memory sparing
> >> +		varies based on the number of processors and memory modules
> >> +		installed in the server. After an error threshold is surpassed
> >> +		in a system protected by memory sparing, the content of a failing
> >> +		rank of DIMMs is copied to the spare rank. The failing rank is
> >> +		then taken offline and the spare rank placed online for use as
> >> +		active memory in place of the failed rank.
> >> +
> >> +		The sysfs attributes nodes for a repair feature are only
> >> +		present if the parent driver has implemented the corresponding
> >> +		attr callback function and provided the necessary operations
> >> +		to the EDAC device driver during registration.
> >> +
> >> +		In some states of system configuration (e.g. before address
> >> +		decoders have been configured), memory devices (e.g. CXL)
> >> +		may not have an active mapping in the main host address
> >> +		physical address map. As such, the memory to repair must be
> >> +		identified by a device specific physical addressing scheme
> >> +		using a device physical address(DPA). The DPA and other control
> >> +		attributes to use will be presented in related error records.
> >> +
> >> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
> >> +Date:		Jan 2025
> >> +KernelVersion:	6.14
> >> +Contact:	linux-edac@vger.kernel.org
> >> +Description:
> >> +		(RO) Memory repair function type, e.g. post package repair,
> >> +		memory sparing etc.
> >> +		EDAC_SOFT_PPR - Soft post package repair
> >> +		EDAC_HARD_PPR - Hard post package repair
> >> +		EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
> >> +		EDAC_ROW_MEM_SPARING - Row memory sparing
> >> +		EDAC_BANK_MEM_SPARING - Bank memory sparing
> >> +		EDAC_RANK_MEM_SPARING - Rank memory sparing
> >> +		All other values are reserved.  
> >
> >The strings are too long. Why are they in upper case? IMO:
> >
> >	soft-ppr, hard-ppr, ... would be enough.
> >  
> Here we return the repair type of the memory repair instance as a single
> value (such as 0, 1 or 2, not as a decoded string, e.g. "EDAC_SOFT_PPR"),
> defined as an enum (EDAC_SOFT_PPR, EDAC_HARD_PPR, etc.) for the memory
> repair interface in include/linux/edac.h.
> 
> enum edac_mem_repair_function {
> 	EDAC_SOFT_PPR,
> 	EDAC_HARD_PPR,
> 	EDAC_CACHELINE_MEM_SPARING,
> 	EDAC_ROW_MEM_SPARING,
> 	EDAC_BANK_MEM_SPARING,
> 	EDAC_RANK_MEM_SPARING,
> };
>   
> I documented return value in terms of the above enums.

The ABI documentation describes exactly what numeric/string values will be there.
So, if you place:

	EDAC_SOFT_PPR

It means a string with EDAC_SOFT_PPR, not a numeric zero value.

Also, as I explained at:
	 https://lore.kernel.org/linux-edac/1bf421f9d1924d68860d08c70829a705@huawei.com/T/#m1e60da13198b47701a4c2f740d4b78701f912d2d

it doesn't make sense to report soft/hard PPR, as the persist mode
is designed to be on a different sysfs devnode (/persist_mode on your
proposal).

So, here you need to fold EDAC_SOFT_PPR and EDAC_HARD_PPR into a single
value ("ppr").

-

Btw, very few sysfs nodes use numbers for things that can be mapped with 
enums:

	$ git grep -l "\- 0" Documentation/ABI|wc -l
	20
	(several of those are actually false-positives)

and this is done mostly when it reports what the hardware actually
outputs when reading some register.

Thanks,
Mauro
Shiju Jose Jan. 14, 2025, 2:30 p.m. UTC | #23
>-----Original Message-----
>From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
>Sent: 14 January 2025 13:47
>To: Shiju Jose <shiju.jose@huawei.com>
>Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
>
>Em Mon, 6 Jan 2025 12:10:00 +0000
><shiju.jose@huawei.com> escreveu:
>
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RO) Memory repair function type, e.g. post package repair,
>> +		memory sparing etc.
>> +		EDAC_SOFT_PPR - Soft post package repair
>> +		EDAC_HARD_PPR - Hard post package repair
>> +		EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
>> +		EDAC_ROW_MEM_SPARING - Row memory sparing
>> +		EDAC_BANK_MEM_SPARING - Bank memory sparing
>> +		EDAC_RANK_MEM_SPARING - Rank memory sparing
>> +		All other values are reserved.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@vger.kernel.org
>> +Description:
>> +		(RW) Read/Write the current persist repair mode set for a
>> +		repair function. The persist repair modes supported in the
>> +		device depend on the memory repair function; a repair is
>> +		either temporary, and lost with a power cycle, or permanent.
>> +		EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
>> +		EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
>> +		All other values are reserved.
>> +
>
>After re-reading some things, I suspect that the above can be simplified a little
>bit by folding soft/hard PPR into a single element at /repair_function, and making
>it clearer that persist_mode is valid only for PPR (I think this is the case, right?),
>e.g. something like:
persist_mode is valid for memory sparing features (at least in CXL) as well.
In the case of CXL memory sparing, the host has the option to request either
soft or hard sparing via a flag when issuing a memory sparing operation.

>
>	What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
>	...
>	Description:
>			(RO) Memory repair function type, e.g. post package repair,
>			memory sparing etc. Valid values are:
>
>			- ppr - post package repair.
>			  Please define its mode via
>			  /sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
>			- cacheline-sparing - Cacheline memory sparing
>			- row-sparing - Row memory sparing
>			- bank-sparing - Bank memory sparing
>			- rank-sparing - Rank memory sparing
>			- All other values are reserved.
>
>and define persist_mode in a different way:
Note: to return decoded strings instead of raw values, I need to add some extra
callback functions in drivers/edac/mem_repair.c for these attributes, which will
reduce the current level of optimization done to minimize the code size.
>
>	What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/ppr_persist_mode
Same as above.  persist_mode is needed for memory sparing feature too.
>	...
>	Description:
>		(RW) Read/Write the current persist repair (PPR) mode set for a
>		post package repair function. The persist repair modes supported
>		in the device depend on the memory repair function; a repair is
>		either temporary, and lost with a power cycle, or permanent.
>		Valid values are:
>
>		- repair-soft - Soft PPR function (temporary repair).
>		- repair-hard - Hard memory repair function (permanent repair).
>		- All other values are reserved.
>
>Thanks,
>Mauro

Thanks,
Shiju
Mauro Carvalho Chehab Jan. 14, 2025, 2:39 p.m. UTC | #24
Em Tue, 14 Jan 2025 13:05:37 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu:

> On Tue, 14 Jan 2025 13:38:31 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> 
> > Em Thu, 9 Jan 2025 14:24:33 +0000
> > Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu:
> >   
> > > On Thu, 9 Jan 2025 13:32:22 +0100
> > > Borislav Petkov <bp@alien8.de> wrote:
> > > 
> > > Hi Boris,
> > >     
> > > > On Thu, Jan 09, 2025 at 11:00:43AM +0000, Shiju Jose wrote:      
> > > > > The min_ and max_ attributes of the control attributes were added in response
> > > > > to your feedback on V15 to expose the supported ranges of these control
> > > > > attributes to the user, in the following links.
> > > > 
> > > > Sure, but you can make that differently:
> > > > 
> > > > cat /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
> > > > [x:y]
> > > > 
> > > > which is the allowed range.      
> > > 
> > > To my thinking that would fail the test of being an intuitive interface.
> > > To issue a repair command requires that multiple attributes be configured
> > > before triggering the actual repair.
> > > 
> > > Think of it as setting the coordinates of the repair in a high dimensional
> > > space.
> > > 
> > > In the extreme case of fine grained repair (Cacheline), to identify the
> > > relevant subunit of memory (obtained from the error record that we are
> > > basing the decision to repair on) we need to specify all of:
> > > 
> > > Channel, sub-channel, rank,  bank group, row, column and nibble mask.
> > > For coarser granularity repair only specify a subset of these applies and
> > > only the relevant controls are exposed to userspace.
> > > 
> > > They are broken out as specific attributes to enable each to be set before
> > > triggering the action with a write to the repair attribute.
> > > 
> > > There are several possible alternatives:
> > > 
> > > Option 1
> > > 
> > > "A:B:C:D:E:F:G:H:I:J" opaque single write to trigger the repair where
> > > each number is providing one of those coordinates and where a readback
> > > lets us know what each number is.
> > > 
> > > That single attribute interface is very hard to extend in an intuitive way.
> > > 
> > > History tells us more levels will be introduced in the middle, not just
> > > at the finest granularity, making such an interface hard to extend in
> > > a backwards compatible way.
> > > 
> > > Another alternative of a key value list would make for a nasty sysfs
> > > interface.
> > > 
> > > Option 2 
> > > There are sysfs interfaces that use a selection type presentation.
> > > 
> > > Write: "C", Read: "A, B, [C], D" but that only works well for discrete sets
> > > of options and is a pain to parse if read back is necessary.    
> > 
> > Writing it as:
> > 
> > 	a b [c] d
> > 
> > or even:
> > 	a, b, [c], d
> > 
> > doesn't make it hard to be parse on userspace. Adding a comma makes
> > Kernel code a little bigger, as it needs an extra check at the loop
> > to check if the line is empty or not:
> > 
> > 	if (*tmp != '\0')
> > 		tmp += sprintf(tmp, ", ");
> > 
> > Btwm we have an implementation like that on kernelspace/userspace for
> > the RC API:
> > 
> > - Kernelspace:
> >   https://github.com/torvalds/linux/blob/master/drivers/media/rc/rc-main.c#L1125
> >   6 lines of code + a const table with names/values, if we use the same example
> >   for EDAC:
> > 
> > 	static const struct { u64 type; const char *name; } names[] = {
> > 		{ EDAC_FOO, "foo" }, { EDAC_BAR, "bar" },
> > 	};
> > 
> > 	for (i = 0; i < ARRAY_SIZE(names); i++) {
> > 		if (enabled & names[i].type)
> > 			tmp += sprintf(tmp, "[%s] ", names[i].name);
> > 		else if (allowed & names[i].type)
> > 			tmp += sprintf(tmp, "%s ", names[i].name);
> > 	}
> > 
> > 
> > - Userspace:
> >   https://git.linuxtv.org/v4l-utils.git/tree/utils/keytable/keytable.c#n197
> >   5 lines of code + a const table, if we use the same example
> >   for ras-daemon:
> > 
> > 		const char *name[] = { 
> > 			[EDAC_FOO] = "[foo]",
> > 			[EDAC_BAR] = "[bar]",
> > 		};
> > 
> > 		for (p = strtok(arg, " ,"); p; p = strtok(NULL, " ,"))
> > 			for (i = 0; i < ARRAY_SIZE(name); i++)
> > 				if (!strcasecmp(p, name[i]))
> > 					return i;
> > 		return -1;
> > 
> > 	(strtok handles both spaces and commas in the above example)
> > 
> > IMO, this is a lot better, as the alternative would be to have separate
> > sysfs nodes to describe what values are valid for a given edac devnode.
> > 
> > See, userspace needs to know what values are valid for a given
> > device and support for it may vary depending on the Kernel and
> > device version. So, we need to have the information about what values
> > are valid stored on some sysfs devnode, to allow backward compatibility.  
> 
> These aren't selectors from a discrete list so the question is more
> whether a syntax of
> <min> value <max> 
> is intuitive or not.  I'm not aware of precedent for this one.

From my side, I prefer having 3 separate sysfs nodes, as this is a
very common practice. Doing it a different way sounds like an API violation,
but if someone insists on dropping min/max, this can be argued at
https://lore.kernel.org/linux-api/.

On a very quick search:

	$ ./scripts/get_abi.pl search "\bmin.*max"

I can't see any place using min and max at the same devnode.

	$ ./scripts/get_abi.pl search "\b(min|max)"|grep /sys/ |wc -l
	234

So, it sounds to me that merging those into a single devnode is an
API violation.

> 
> There was another branch of the thread where Boris mentioned this as an
> option. It isn't bad to deal with and an easy change to the code,
> but I have an open question on what choice we make for representing
> unknown min / max.  For separate files the absence of the file
> indicates we don't have any information.
> 
> 
> >   
> > > 
> > > So in conclusion, I think the proposed multiple sysfs attribute style
> > > with them reading back the most recent value written is the least bad
> > > solution to a complex control interface.
> > >     
> > > > 
> > > > echo ... 
> > > > 
> > > > then writes in the bank.
> > > >       
> > > > > ... so we would propose we do not add max_ and min_ for now and see how the
> > > > > use cases evolve.        
> > > > 
> > > > Yes, you should apply that same methodology to the rest of the new features
> > > > you're adding: only add functionality for the stuff that is actually being
> > > > used now. You can always extend it later.
> > > > 
> > > > Changing an already user-visible API is a whole different story and a lot lot
> > > > harder, even impossible.
> > > > 
> > > > So I'd suggest you prune the EDAC patches from all the hypothetical usage and
> > > > then send only what remains so that I can try to queue them.      
> > > 
> > > Sure. In this case the addition of min/max was perhaps a wrong response to
> > > your request for a way to discover those ranges rather than just rejecting
> > > a write of something out of range as earlier versions did.
> > > 
> > > We can revisit in future if range discovery becomes necessary.  Personally
> > > I don't think it is, given we are only taking these actions in response to
> > > error records that give us precisely what to write and hence are always in range.
> > 
> > For RO devnodes, there's no need for ranges, but those are likely needed for
> > RW, as otherwise userspace may try to write invalid requests and/or have
> > backward-compatibility issues.  
> 
> Given these parameters are only meaningfully written with values coming
> ultimately from error records, userspace should never consider writing
> something that is out of range except during testing.
> 
> I don't mind presenting the range where known (in CXL case it is not
> discoverable for most of them) but I wouldn't expect tooling to ever
> read it as known correct values to write come from the error records.
> Checking those values against provided limits seems an unnecessary step
> given an invalid parameter that slips through will be rejected by the
> hardware anyway.

I'm fine starting without min/max if there's no current usecase, provided
that:

1. when needed, we add min/max as separate devnodes;
2. there won't be any backward issues when min/max gets added.

Regards,
Mauro
Dan Williams Jan. 14, 2025, 7:35 p.m. UTC | #25
Jonathan Cameron wrote:
> On Fri, 10 Jan 2025 14:49:03 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
[..]
> > This is where you lose me. The error record is a point in time snapshot
> > of the SPA:HPA:DPA:<proprietary internal "DIMM" mapping>. The security
> > model for memory operations is based on coordinating with the kernel's
> > understanding of how that SPA is *currently* being used.
> 
> Whilst it is being used I agree.  Key is to only do disruptive / data
> changing actions when it is not being used.

Sure.

> > The kernel can not just take userspace's word for it that potentially
> > data changing, or temporary loss-of-use operations are safe to execute
> > just because once upon a time userspace saw an error record that
> > implicated a given SPA in the past, especially over reboot. 
> 
> There are two cases (discoverable from hardware)
> 
> 1) Non disruptive.  No security concern as the device guarantees to
>    not interrupt traffic and the memory contents is copied to the new
>    location. Basically software never knows it happened.
> 2) Disruptive.  We only allow this if the memory is offline. In the CXL case
>    the CXL specific code must check no memory on the device is online so
>    we aren't disrupting anything.  The other implementation we have code
>    for (will post after this lands) has finer granularity constraints and only
>    the page needs to be offline.
>    As it is offline the content is not preserved anyway. We may need to add extra
>    constraints along with future support for temporal persistence / sharing but
>    we can do that as part of adding that support in general.
>    (Personally I think in those cases memory repair is a job for the out of
>     band management anyway).
> 
> In neither case am I seeing a security concern.  Am I missing something?

s/security/system-integrity/

1/ Hardware engineers may have a different definition of "non-disruptive"
than software. See the history around hibernate_quiet_exec() to work
around the disruption of latency spikes. If this is poorly specified
across vendors we are going to wish that we did not build a "take
userspace's word for it" interface.

2/ Yes, if the device is not actively decoding any in use memory feel
free to run destructive operations on the device. However, is sysfs the
right interface for "submit multi-parameter atomic operation with
transient result"? I lean heavily into sysfs, but ioctl and netlink have
a role to play in scenarios like this. Otherwise userspace can inject
error records back into the kernel with the expectation that the kernel
can only accept the DIMM address and not any of the translation data
which might be stale.

[..]
> > Again, the repair control assumes that the kernel can just trust
> > userspace to get it right. When the kernel knows the SPA implications it
> > can add safety like "you are going to issue sparing on deviceA that will
> > temporarily take deviceA offline. CXL subsystem tells me deviceA is
> > interleaved with deviceB in SPA so the whole SPA range needs to be
> > offline before this operation proceeds". That is not someting that
> > userspace can reliably coordinate.
> 
> Absolutely he kernel has to enforce this. Same way we protect against
> poison injection in some cases.  Right now the enforcement is slightly
> wrong (Shiju is looking at it again) as we were enforcing at wrong
> granularity (specific dpa, not device). Identifying that hole is a good
> outcome of this discussion making us take another look.
> 
> Enforcing this is one of the key jobs of the CXL specific driver.
> We considered doing it in the core, but the granularity differences
> between our initial few examples meant we decided on specific driver
> implementations of the checks for now.

Which specific driver? Is this not just a callback provided via the EDAC
registration interface to say "sparing allowed"?

Yes, this needs to avoid the midlayer mistake, but I expect more CXL
memory exporting devices can live with the CXL core's determination that
HDM decode is live or not.

> > > > 3/ What if the device does not use DDR terminology / topology terms for
> > > > repair?  
> > > 
> > > Then we provide the additional interfaces assuming they correspond to well
> > > known terms.  If someone is using a magic key then we can get grumpy
> > > with them, but that can also be supported.
> > > 
> > > Mostly I'd expect a new technology to overlap a lot of the existing
> > > interface and maybe add one or two more; which layer in the stack for
> > > HBM for instance.  
> > 
> > The concern is the assertion that sysfs needs to care about all these
> > parameters vs an ABI that says "repair errorX". If persistence and
> > validity of error records is the concern lets build an ABI for that and
> > not depend upon trust in userspace to properly coordinate memory
> > integrity concerns.
> 
> It doesn't have to.  It just has to ensure that the memory device is in the correct
> state.  So check, not coordinate. At a larger scale, coordination is already doable
> (subject to races that we must avoid by holding locks), tear down the regions
> so there are no mappings on the device you want to repair.  Don't bring them
> up again until after you are done.
> 
> The main use case is probably do it before you bring the mappings up, but
> same result.

Agree.

> 
> > 
> > > 
> > > The main alternative is where the device takes an HPA / SPA / DPA. We have one
> > > driver that does that queued up behind this series that uses HPA. PPR uses
> > > DPA.  In that case userspace first tries to see if it can repair by HPA then
> > > DPA and if not moves on to see if it can use the fuller description.
> > > We will see devices supporting HPA / DPA (which to use depends on when you
> > > are doing the operation and what has been configured) but mostly I'd expect
> > > either HPA/DPA or fine grained on a given repair instance.
> > > 
> > > HPA only works if the address decoders are always configured (so not on CXL)
> > > What is actually happening in that case is typically that a firmware is
> > > involved that can look up address decoders etc, and map the control HPA
> > > to Bank / row etc to issue the actual low level commands.  This keeps
> > > the memory mapping completely secret rather than exposing it in error
> > > records.
> > >   
> > > > 
> > > > I expect the flow rasdaemon would want is that the current PFA (leaky
> > > > bucket Pre-Failure Analysis) decides that the number of soft-offlines it
> > > > has performed exceeds some threshold and it wants to attempt to repair
> > > > memory.  
> > > 
> > > Sparing may happen prior to point where we'd have done a soft offline
> > > if non disruptive (whether it is can be read from another bit of the
> > > ABI).  Memory repair might be much less disruptive than soft-offline!
> > > I rather hope memory manufacturers build that, but I'm aware of at least
> > > one case where they didn't and the memory must be offline.  
> > 
> > That's a good point, spare before offline makes sense.
> 
> > If transparent and resources are not constrained.
> Very much not if we have to tear down the memory first.
> 
> > 
> > [..]
> > > However, there are other usecases where this isn't needed which is why
> > > that isn't a precursor for this series.
> > > 
> > > Initial enablement targets two situations:
> > > 1) Repair can be done in non disruptive way - no need to soft offline at all.  
> > 
> > Modulo needing to quiesce access over the sparing event?
> 
> Absolutely.  This is only doable in devices that don't need to quiesce.
> 
> > 
> > > 2) Repair can be done at boot before memory is onlined or on admin
> > >    action to take the whole region offline, then repair specific chunks of
> > >    memory before bringing it back online.  
> > 
> > Which is userspace racing the kernel to online memory?
> 
> If you are doing this scheme you don't automatically online memory. So
> both are in userspace control and can be easily sequenced.
> If you aren't auto onlining then buy devices with hard PPR and do it by offlining
> manually, repairing and rebooting. Or buy devices that don't need to quiecse
> and cross your fingers the dodgy ram doesn't throw an error before you get
> that far.  Little choice if you decide to online right at the start as normal
> memory.
> 
> > 
> > > > So, yes, +1 to simpler for now where software effectively just needs to
> > > > deal with a handful of "region repair" buttons and the semantics of
> > > > those are coarse and sub-optimal. Wait for a future where a tool author
> > > > says, "we have had good success getting bulk offlined pages back into
> > > > service, but now we need this specific finer grained kernel interface to
> > > > avoid wasting spare banks prematurely".  
> > > 
> > > Depends on where you think that interface is.  I can absolutely see that
> > > as a control to RAS Daemon.  Option 2 above, region is offline, repair
> > > all dodgy looking fine grained buckets.
> > > 
> > > Note though that a suboptimal repair may mean permanent use of very rare
> > > resources.  So there needs to be a control at the finest granularity as well.
> > > Which order those get added to userspace tools doesn't matter to me.
> > > 
> > > If you mean that interface in kernel it brings some non trivial requirements.
> > > The kernel would need all of:
> > > 1) Tracking interface for all error records so the reverse map from region
> > >    to specific bank / row etc is available for a subset of entries.  The
> > >    kernel would need to know which of those are important (soft offline
> > >    might help in that use case, otherwise that means decision algorithms
> > >    are in kernel or we have fine grained queue for region repair in parallel
> > >    with soft-offline).
> > > 2) A way to inject the reverse map information from a userspace store
> > >   (to deal with reboot etc).  
> > 
> > Not a way to inject the reverse map information, a way to inject the
> > error records and assert that memory topology changes have not
> > invalidated those records.
> 
> There is no way to tell that the topology hasn't changed.
> For the reasons above, I don't think we care. Instead of trying to stop
> userspace repairing the wrong memory, make sure it is safe for it to do that.
> (The kernel is rarely in the business of preventing the slightly stupid)

If the policy is "error records with SPA from the current boot can be
trusted" and "userspace requests outside of current boot error records
must only be submitted to known offline" then I think we are aligned.

> > > That sounds a lot harder to deal with than relying on the usespace program
> > > that already does the tracking across boots.  
> > 
> > I am stuck behind the barrier of userspace must not assume it knows
> > better than the kernel about the SPA impact of a DIMM sparing
> > event. The kernel needs evidence either live records from within the
> > same kernel boot or validated records from a previous boot.
> 
> I think this is the wrong approach.  The operation must be 'safe'.
> With that in place we absolutely can let userspace assume it knows better than
> the kernel. 

Violent agreement? Operation must be safe, yes, next what are the criteria
for kernel management of safety. Offline-only repair is great place to
be.
Jonathan Cameron Jan. 15, 2025, 10:07 a.m. UTC | #26
On Tue, 14 Jan 2025 11:35:21 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan Cameron wrote:
> > On Fri, 10 Jan 2025 14:49:03 -0800
> > Dan Williams <dan.j.williams@intel.com> wrote:  
> [..]
> > > This is where you lose me. The error record is a point in time snapshot
> > > of the SPA:HPA:DPA:<proprietary internal "DIMM" mapping>. The security
> > > model for memory operations is based on coordinating with the kernel's
> > > understanding of how that SPA is *currently* being used.  
> > 
> > Whilst it is being used I agree.  Key is to only do disruptive / data
> > changing actions when it is not being used.  
> 
> Sure.
> 
> > > The kernel can not just take userspace's word for it that potentially
> > > data changing, or temporary loss-of-use operations are safe to execute
> > > just because once upon a time userspace saw an error record that
> > > implicated a given SPA in the past, especially over reboot.   
> > 
> > There are two cases (discoverable from hardware)
> > 
> > 1) Non disruptive.  No security concern as the device guarantees to
> >    not interrupt traffic and the memory contents is copied to the new
> >    location. Basically software never knows it happened.
> > 2) Disruptive.  We only allow this if the memory is offline. In the CXL case
> >    the CXL specific code must check no memory on the device is online so
> >    we aren't disrupting anything.  The other implementation we have code
> >    for (will post after this lands) has finer granularity constraints and only
> >    the page needs to be offline.
> >    As it is offline the content is not preserved anyway. We may need to add extra
> >    constraints along with future support for temporal persistence / sharing but
> >    we can do that as part of adding that support in general.
> >    (Personally I think in those cases memory repair is a job for the out of
> >     band management anyway).
> > 
> > In neither case am I seeing a security concern.  Am I missing something?  
> 
> s/security/system-integrity/
> 
> 1/ Hardware engineers may have a different definition of "non-disruptive"
> than software. See the history around hibernate_quiet_exec() to work
> around the disruption of latency spikes. If this is poorly specified
> across vendors we are going to wish that we did not build a "take
> userspace's word for it" interface.

Sure, but given the spec is fairly specific "CXL.mem requests are correctly
preserved and data is retained" they will have to stay within the latency
requirements of the CXL system.  If a device breaks these admittedly soft
rules then we can add allow / deny lists or let userspace tooling handle
that (which I'd prefer).  Note that this stuff happens on some devices
under the hood anyway, so I'd not expect the command driven case to be
much worse (but I accept it might be).

I'd also expect some very grumpy large purchasers of that device to get
the firmware changed to not advertise that it is fine to do it live if
the performance drop is large.

Note this happens today under the hood on at least some (I think most?)
servers.  Have you ever seen a latency bug report from memory sparing?
How often?  The memory and controllers on CXL devices are going to be very
similar in technology to host memory controllers so I can't see why they'd
be much worse (subject of course to cheap and nasty maybe turning up - I doubt
that will be common given the cost of the RAM behind these things).

This applies just as much however we implement this interface. So a concern
but not one specific to the 'how' part.

> 
> 2/ Yes, if the device is not actively decoding any in use memory feel
> free to run destructive operations on the device. However, is sysfs the
> right interface for "submit multi-parameter atomic operation with
> transient result"? I lean heavily into sysfs, but ioctl and netlink have
> a role to play in scenarios like this. Otherwise userspace can inject
> error records back into the kernel with the expectation that the kernel
> can only accept the DIMM address and not any of the translation data
> which might be stale.

I'd argue yes to the sysfs question.  This is common enough for other
subsystems (e.g. CXL - I still do all tests with bash scripts!) and this
really isn't a particularly complex interface. I'd expect very
light weight tooling to be able to use it (though solutions to work out
whether to do it are way more complex). It is a simple more or less self
describing interface that directly exposes what controls are available.

If we end up reinjecting error records (which I'm still pushing back
strongly on because I am yet to see any reason to do so) then maybe
sysfs is not the right interface.  Pushing it to netlink or ioctl
doesn't change the complexity of the interface, so I'm not sure there
is any strong reason to do so.  Files are cheap and easy to use from
all manner of simple scripting.  I think from later discussion below
that we don't need to reinject which is nice!

> 
> [..]
> > > Again, the repair control assumes that the kernel can just trust
> > > userspace to get it right. When the kernel knows the SPA implications it
> > > can add safety like "you are going to issue sparing on deviceA that will
> > > temporarily take deviceA offline. CXL subsystem tells me deviceA is
> > > interleaved with deviceB in SPA so the whole SPA range needs to be
> > > offline before this operation proceeds". That is not someting that
> > > userspace can reliably coordinate.  
> > 
> > Absolutely he kernel has to enforce this. Same way we protect against
> > poison injection in some cases.  Right now the enforcement is slightly
> > wrong (Shiju is looking at it again) as we were enforcing at wrong
> > granularity (specific dpa, not device). Identifying that hole is a good
> > outcome of this discussion making us take another look.
> > 
> > Enforcing this is one of the key jobs of the CXL specific driver.
> > We considered doing it in the core, but the granularity differences
> > between our initial few examples meant we decided on specific driver
> > implementations of the checks for now.  
> 
> Which specific driver? Is this not just a callback provided via the EDAC
> registration interface to say "sparing allowed"?

That would be a reasonable interface to add. There was push back in earlier
reviews on an explicit query so we backed away from this sort of thing.
The question IIRC was what was the point in querying given we could just
try it and rapidly see an error if it can't. There is a callback to say
it is disruptive but it is an exercise for userspace to figure out what
it needs to do to allow the request to succeed.

The granularity varies across the two sparing implementations I have access
to, so this interface would just be a 'can I do it now?'  Whether that
is helpful given userspace will have to be involved in getting that answer
to be 'yes' is not entirely clear to me.

So interesting thought, but I'm not yet convinced this isn't a userspace
problem + the suck it and see call to find out the answer.

> 
> Yes, this needs to avoid the midlayer mistake, but I expect more CXL
> memory exporting devices can live with the CXL core's determination that
> HDM decode is live or not.

I think this is valid for the ones that advertise that they need the
traffic to stop. From later in discussion I think we are aligned on that.

> 
> > > > > 3/ What if the device does not use DDR terminology / topology terms for
> > > > > repair?    
> > > > 
> > > > Then we provide the additional interfaces assuming they correspond to well
> > > > known terms.  If someone is using a magic key then we can get grumpy
> > > > with them, but that can also be supported.
> > > > 
> > > > Mostly I'd expect a new technology to overlap a lot of the existing
> > > > interface and maybe add one or two more; which layer in the stack for
> > > > HBM for instance.    
> > > 
> > > The concern is the assertion that sysfs needs to care about all these
> > > parameters vs an ABI that says "repair errorX". If persistence and
> > > validity of error records is the concern lets build an ABI for that and
> > > not depend upon trust in userspace to properly coordinate memory
> > > integrity concerns.  
> > 
> > It doesn't have to.  It just has to ensure that the memory device is in the correct
> > state.  So check, not coordinate. At a larger scale, coordination is already doable
> > (subject to races that we must avoid by holding locks), tear down the regions
> > so there are no mappings on the device you want to repair.  Don't bring them
> > up again until after you are done.
> > 
> > The main use case is probably do it before you bring the mappings up, but
> > same result.  
> 
> Agree.
> 
> >   
> > >   
> > > > 
> > > > The main alternative is where the device takes an HPA / SPA / DPA. We have one
> > > > driver that does that queued up behind this series that uses HPA. PPR uses
> > > > DPA.  In that case userspace first tries to see if it can repair by HPA then
> > > > DPA and if not moves on to see if it can use the fuller description.
> > > > We will see devices supporting HPA / DPA (which to use depends on when you
> > > > are doing the operation and what has been configured) but mostly I'd expect
> > > > either HPA/DPA or fine grained on a given repair instance.
> > > > 
> > > > HPA only works if the address decoders are always configured (so not on CXL)
> > > > What is actually happening in that case is typically that a firmware is
> > > > involved that can look up address decoders etc, and map the control HPA
> > > > to Bank / row etc to issue the actual low level commands.  This keeps
> > > > the memory mapping completely secret rather than exposing it in error
> > > > records.
> > > >     
> > > > > 
> > > > > I expect the flow rasdaemon would want is that the current PFA (leaky
> > > > > bucket Pre-Failure Analysis) decides that the number of soft-offlines it
> > > > > has performed exceeds some threshold and it wants to attempt to repair
> > > > > memory.    
> > > > 
> > > > Sparing may happen prior to point where we'd have done a soft offline
> > > > if non disruptive (whether it is can be read from another bit of the
> > > > ABI).  Memory repair might be much less disruptive than soft-offline!
> > > > I rather hope memory manufacturers build that, but I'm aware of at least
> > > > one case where they didn't and the memory must be offline.    
> > > 
> > > That's a good point, spare before offline makes sense.  
> > 
> > If transparent and resources are not constrained.
> > Very much not if we have to tear down the memory first.
> >   
> > > 
> > > [..]  
> > > > However, there are other usecases where this isn't needed which is why
> > > > that isn't a precursor for this series.
> > > > 
> > > > Initial enablement targets two situations:
> > > > 1) Repair can be done in non disruptive way - no need to soft offline at all.    
> > > 
> > > Modulo needing to quiesce access over the sparing event?  
> > 
> > Absolutely.  This is only doable in devices that don't need to quiesce.
> >   
> > >   
> > > > 2) Repair can be done at boot before memory is onlined or on admin
> > > >    action to take the whole region offline, then repair specific chunks of
> > > >    memory before bringing it back online.    
> > > 
> > > Which is userspace racing the kernel to online memory?  
> > 
> > If you are doing this scheme you don't automatically online memory. So
> > both are in userspace control and can be easily sequenced.
> > If you aren't auto onlining then buy devices with hard PPR and do it by offlining
> > manually, repairing and rebooting. Or buy devices that don't need to quiesce
> > and cross your fingers the dodgy RAM doesn't throw an error before you get
> > that far.  Little choice if you decide to online right at the start as normal
> > memory.
> >   
> > >   
> > > > > So, yes, +1 to simpler for now where software effectively just needs to
> > > > > deal with a handful of "region repair" buttons and the semantics of
> > > > > those are coarse and sub-optimal. Wait for a future where a tool author
> > > > > says, "we have had good success getting bulk offlined pages back into
> > > > > service, but now we need this specific finer grained kernel interface to
> > > > > avoid wasting spare banks prematurely".    
> > > > 
> > > > Depends on where you think that interface is.  I can absolutely see that
> > > > as a control to RAS Daemon.  Option 2 above, region is offline, repair
> > > > all dodgy looking fine grained buckets.
> > > > 
> > > > Note though that a suboptimal repair may mean permanent use of very rare
> > > > resources.  So there needs to be a control at the finest granularity as well.
> > > > Which order those get added to userspace tools doesn't matter to me.
> > > > 
> > > > If you mean that interface in kernel it brings some non trivial requirements.
> > > > The kernel would need all of:
> > > > 1) Tracking interface for all error records so the reverse map from region
> > > >    to specific bank / row etc is available for a subset of entries.  The
> > > >    kernel would need to know which of those are important (soft offline
> > > >    might help in that use case, otherwise that means decision algorithms
> > > >    are in kernel or we have fine grained queue for region repair in parallel
> > > >    with soft-offline).
> > > > 2) A way to inject the reverse map information from a userspace store
> > > >   (to deal with reboot etc).    
> > > 
> > > Not a way to inject the reverse map information, a way to inject the
> > > error records and assert that memory topology changes have not
> > > invalidated those records.  
> > 
> > There is no way to tell that the topology hasn't changed.
> > For the reasons above, I don't think we care. Instead of trying to stop
> > userspace repairing the wrong memory, make sure it is safe for it to do that.
> > (The kernel is rarely in the business of preventing the slightly stupid)  
> 
> If the policy is "error records with SPA from the current boot can be
> trusted" and "userspace requests outside of current boot error records
> must only be submitted to known offline" then I think we are aligned.

Ok, I'm not sure the requirement is clear, but given that is probably not
too bad to do and we can rip it out later if it turns out to be overkill.
I'll discuss this with Shiju and others later today.

For now I think this belongs in the specific drivers, not the core
as they will be tracking different things and getting the data
from different paths.

How we make this work with other error records is going to be more painful but
should be easy enough for CXL non firmware first reporting. Other sources
of records can come later.

Eventually we might end up with a top level PA check in the EDAC core code,
and additional optional checks in the drivers, or factor out the shared bit
as library code. Not sure yet.

> 
> > > > That sounds a lot harder to deal with than relying on the usespace program
> > > > that already does the tracking across boots.    
> > > 
> > > I am stuck behind the barrier of userspace must not assume it knows
> > > better than the kernel about the SPA impact of a DIMM sparing
> > > event. The kernel needs evidence either live records from within the
> > > same kernel boot or validated records from a previous boot.  
> > 
> > I think this is the wrong approach.  The operation must be 'safe'.
> > With that in place we absolutely can let userspace assume it knows better than
> > the kernel.   
> 
> Violent agreement? Operation must be safe, yes, next what are the criteria
> for kernel management of safety. Offline-only repair is great place to
> be.

Offline makes life easy but removes a significant use case - the above extra
checks bring that back so may give a way forwards.  The combination
of 'only previous records' and online isn't an important one as you can
act on them before it is online.

So 'maybe' a path forwards that adds extra guarantees.  I'm not convinced
we need them, but it seems to not restrict real usecases so not too bad.

Jonathan
diff mbox series

Patch

diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair b/Documentation/ABI/testing/sysfs-edac-memory-repair
new file mode 100644
index 000000000000..e9268f3780ed
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-edac-memory-repair
@@ -0,0 +1,244 @@ 
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory
+		pertains to the memory media repair features control, such as
+		PPR (Post Package Repair), memory sparing etc., where <dev-name>
+		directory corresponds to a device registered with the EDAC
+		device driver for the memory repair features.
+
+		Post Package Repair is a maintenance operation that requests the
+		memory device to perform a repair operation on its media. In
+		detail, it is a memory self-healing feature that fixes a failing
+		memory location by
+		replacing it with a spare row in a DRAM device. For example, a
+		CXL memory device with DRAM components that support PPR features may
+		implement PPR maintenance operations. DRAM components may support
+		two types of PPR functions: hard PPR, for a permanent row repair, and
+		soft PPR, for a temporary row repair. Soft PPR is much faster than
+		hard PPR, but the repair is lost with a power cycle.
+
+		Memory sparing is a repair function that replaces a portion
+		of memory with a portion of functional memory at that same
+		sparing granularity. Memory sparing has cacheline/row/bank/rank
+		sparing granularities. For example, in memory-sparing mode,
+		one memory rank serves as a spare for other ranks on the same
+		channel in case they fail. The spare rank is held in reserve and
+		not used as active memory until a failure is indicated, with
+		reserved capacity subtracted from the total available memory
+		in the system. The DIMM installation order for memory sparing
+		varies based on the number of processors and memory modules
+		installed in the server. After an error threshold is surpassed
+		in a system protected by memory sparing, the content of a failing
+		rank of DIMMs is copied to the spare rank. The failing rank is
+		then taken offline and the spare rank placed online for use as
+		active memory in place of the failed rank.
+
+		The sysfs attribute nodes for a repair feature are only
+		present if the parent driver has implemented the corresponding
+		attr callback function and provided the necessary operations
+		to the EDAC device driver during registration.
+
+		In some states of system configuration (e.g. before address
+		decoders have been configured), memory devices (e.g. CXL)
+		may not have an active mapping in the main host address
+		physical address map. As such, the memory to repair must be
+		identified by a device specific physical addressing scheme
+		using a device physical address (DPA). The DPA and other control
+		attributes to use will be presented in related error records.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RO) Memory repair function type, e.g. post package repair,
+		memory sparing etc.
+		EDAC_SOFT_PPR - Soft post package repair
+		EDAC_HARD_PPR - Hard post package repair
+		EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
+		EDAC_ROW_MEM_SPARING - Row memory sparing
+		EDAC_BANK_MEM_SPARING - Bank memory sparing
+		EDAC_RANK_MEM_SPARING - Rank memory sparing
+		All other values are reserved.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) Read/Write the current persist repair mode set for a
+		repair function. The persist repair modes supported in the
+		device depend on the memory repair function; a repair is
+		either temporary, and lost with a power cycle, or permanent.
+		EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
+		EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
+		All other values are reserved.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa_support
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RO) True if the memory device requires the device physical
+		address (DPA) of the memory to repair.
+		False if the memory device requires the host physical
+		address (HPA) of the memory to repair.
+		In some states of system configuration (e.g. before address
+		decoders have been configured), memory devices (e.g. CXL)
+		may not have an active mapping in the main host address
+		physical address map. As such, the memory to repair must be
+		identified by a device specific physical addressing scheme
+		using a DPA. The device physical address (DPA) to use will be
+		presented in related error records.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_safe_when_in_use
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RO) True if memory media is accessible and data is retained
+		during the memory repair operation.
+		The data may not be retained and memory requests may not be
+		correctly processed during a repair operation. In such a case
+		the repair operation should not be executed at runtime.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/hpa
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) Host Physical Address (HPA) of the memory to repair.
+		See attribute 'dpa_support' for more details.
+		The HPA to use will be provided in related error records.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) Device Physical Address (DPA) of the memory to repair.
+		See attribute 'dpa_support' for more details.
+		The specific DPA to use will be provided in related error
+		records.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/nibble_mask
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) Read/Write Nibble mask of the memory to repair.
+		Nibble mask identifies one or more nibbles in error on the
+		memory bus that produced the error event. Nibble Mask bit 0
+		shall be set if nibble 0 on the memory bus produced the
+		event, etc. For example, for CXL PPR and sparing, a nibble
+		mask bit set to 1 indicates the request to perform the repair
+		operation in the specific device. All nibble mask bits set
+		to 1 indicate the request to perform the operation in all
+		devices. For CXL memory to repair, the specific value of the
+		nibble mask to use will be provided in related error records.
+		For more details, see the nibble mask field in CXL spec ver 3.1,
+		section 8.2.9.7.1.2 Table 8-103 soft PPR and section
+		8.2.9.7.1.3 Table 8-104 hard PPR, section 8.2.9.7.1.4
+		Table 8-105 memory sparing.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank_group
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/rank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/row
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/column
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/channel
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/sub_channel
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) The control attributes associated with memory address
+		that is to be repaired. The specific value of attributes to
+		use depends on the portion of memory to repair and may be
+		reported to host in related error records and may be
+		available to userspace in trace events, such as in CXL
+		memory devices.
+
+		channel - The channel of the memory to repair. Channel is
+		defined as an interface that can be independently accessed
+		for a transaction.
+		rank - The rank of the memory to repair. Rank is defined as a
+		set of memory devices on a channel that together execute a
+		transaction.
+		bank_group - The bank group of the memory to repair.
+		bank - The bank number of the memory to repair.
+		row - The row number of the memory to repair.
+		column - The column number of the memory to repair.
+		sub_channel - The sub-channel of the memory to repair.
+
+		The requirement to set these attributes varies based on the
+		repair function. The attributes in sysfs are not present
+		unless required for a repair function.
+		For example, for the CXL spec ver 3.1, Section 8.2.9.7.1.2
+		Table 8-103 soft PPR and Section 8.2.9.7.1.3 Table 8-104
+		hard PPR operations, these attributes do not need to be set.
+		For the CXL spec ver 3.1, Section 8.2.9.7.1.4 Table 8-105
+		memory sparing operation, these attributes must be set based
+		on the memory sparing granularity as follows.
+		Channel: Channel associated with the DPA that is to be
+		spared; applies to all subclasses of sparing (cacheline,
+		row, bank and rank sparing).
+		Rank: Rank associated with the DPA that is to be spared;
+		applies to all subclasses of sparing.
+		Bank & Bank Group: Bank and bank group associated with the
+		DPA that is to be spared; apply to the cacheline sparing,
+		row sparing and bank sparing subclasses.
+		Row: Row associated with the DPA that is to be spared;
+		applies to the cacheline sparing and row sparing subclasses.
+		Column: Column associated with the DPA that is to be spared;
+		applies to cacheline sparing only.
+		Sub-channel: Sub-channel associated with the DPA that is to
+		be spared; applies to cacheline sparing only.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_nibble_mask
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank_group
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_rank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_row
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_column
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_channel
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_sub_channel
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_nibble_mask
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank_group
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_rank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_row
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_column
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_channel
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_sub_channel
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RO) The supported range of the control attribute (optional)
+		associated with a memory address that is to be repaired.
+		The memory device may expose the supported range of an
+		attribute, which depends on the memory device and the
+		portion of memory to repair.
+		Userspace may receive the specific attribute values to use
+		for a repair operation from the memory device via related
+		error records and trace events, such as for CXL memory
+		devices.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(WO) Issue the memory repair operation for the specified
+		memory repair attributes. The operation may fail if resources
+		are insufficient based on the requirements of the memory
+		device and repair function.
+		EDAC_DO_MEM_REPAIR (1) - issue the repair operation.
+		All other values are reserved.
diff --git a/Documentation/edac/features.rst b/Documentation/edac/features.rst
index ba3ab993ee4f..bfd5533b81b7 100644
--- a/Documentation/edac/features.rst
+++ b/Documentation/edac/features.rst
@@ -97,3 +97,6 @@  RAS features
 ------------
 1. Memory Scrub
 Memory scrub features are documented in `Documentation/edac/scrub.rst`.
+
+2. Memory Repair
+Memory repair features are documented in `Documentation/edac/memory_repair.rst`.
diff --git a/Documentation/edac/index.rst b/Documentation/edac/index.rst
index dfb0c9fb9ab1..d6778f4562dd 100644
--- a/Documentation/edac/index.rst
+++ b/Documentation/edac/index.rst
@@ -8,4 +8,5 @@  EDAC Subsystem
    :maxdepth: 1
 
    features
+   memory_repair
    scrub
diff --git a/Documentation/edac/memory_repair.rst b/Documentation/edac/memory_repair.rst
new file mode 100644
index 000000000000..2787a8a2d6ba
--- /dev/null
+++ b/Documentation/edac/memory_repair.rst
@@ -0,0 +1,101 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+EDAC Memory Repair Control
+==========================
+
+Copyright (c) 2024 HiSilicon Limited.
+
+:Author:   Shiju Jose <shiju.jose@huawei.com>
+:License:  The GNU Free Documentation License, Version 1.2
+          (dual licensed under the GPL v2)
+:Original Reviewers:
+
+- Written for: 6.14
+
+Introduction
+------------
+Memory devices may support repair operations to address issues in their
+memory media. Post Package Repair (PPR) and memory sparing are examples
+of such features.
+
+Post Package Repair(PPR)
+~~~~~~~~~~~~~~~~~~~~~~~~
+Post Package Repair is a maintenance operation which requests the memory
+device to perform a repair operation on its media. It is a memory
+self-healing feature that fixes a failing memory location by replacing it
+with a spare row in a DRAM device. For example, a CXL memory device with
+DRAM components that support PPR features may implement PPR maintenance
+operations. DRAM components may support two types of PPR function: hard
+PPR, for a permanent row repair, and soft PPR, for a temporary row repair.
+Soft PPR is much faster than hard PPR, but the repair is lost with a power
+cycle. The data may not be retained and memory requests may not be
+correctly processed during a repair operation. In such a case, the repair
+operation should not be executed at runtime.
+For example, CXL memory devices may support both soft PPR and hard PPR
+operations. See CXL spec rev 3.1 sections 8.2.9.7.1.1 PPR Maintenance
+Operations, 8.2.9.7.1.2 sPPR Maintenance Operation and 8.2.9.7.1.3 hPPR
+Maintenance Operation for more details.
+
+Memory Sparing
+~~~~~~~~~~~~~~
+Memory sparing is a repair function that replaces a portion of memory with
+a portion of functional memory at that same sparing granularity. Memory
+sparing has cacheline/row/bank/rank sparing granularities. For example, in
+memory-sparing mode, one memory rank serves as a spare for other ranks on
+the same channel in case they fail. The spare rank is held in reserve and
+not used as active memory until a failure is indicated, with reserved
+capacity subtracted from the total available memory in the system. The DIMM
+installation order for memory sparing varies based on the number of processors
+and memory modules installed in the server. After an error threshold is
+surpassed in a system protected by memory sparing, the content of a failing
+rank of DIMMs is copied to the spare rank. The failing rank is then taken
+offline and the spare rank placed online for use as active memory in place
+of the failed rank.
+
+For example, CXL memory devices may support various subclasses of sparing
+operations, which vary in terms of the scope of the sparing being performed.
+Cacheline sparing subclass refers to a sparing action that can replace a
+full cacheline. Row sparing is provided as an alternative to PPR sparing
+functions and its scope is that of a single DDR row. Bank sparing allows
+an entire bank to be replaced. Rank sparing is defined as an operation
+in which an entire DDR rank is replaced. See CXL spec 3.1 section
+8.2.9.7.1.4 Memory Sparing Maintenance Operations for more details.
+
+Use cases of generic memory repair features control
+---------------------------------------------------
+
+1. The soft PPR, hard PPR and memory-sparing features share similar
+control attributes. Therefore, there is a need for a standardized, generic
+sysfs repair control interface that is exposed to userspace and used by
+administrators, scripts and tools.
+
+2. When a CXL device detects an error in a memory component, it may inform
+the host of the need for a repair maintenance operation by using an event
+record where the "maintenance needed" flag is set. The event record
+specifies the device physical address (DPA) and attributes of the memory
+that requires repair. The kernel reports the corresponding CXL general
+media or DRAM trace event to userspace, and userspace tools (e.g.
+rasdaemon) initiate a repair maintenance operation in response to the
+device request using the sysfs repair control (see the sketch after this
+list).
+
+3. Userspace tools, such as rasdaemon, may request a PPR/sparing on a memory
+region when an uncorrected memory error or an excess of corrected memory
+errors is reported on that memory.
+
+4. Multiple PPR/sparing instances may be present per memory device.
+
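+For instance, a userspace tool may implement the flow of use case 2 by
+writing the attribute values received from a trace event into the
+mem_repairX control files and then triggering the repair. A minimal
+userspace C sketch, with error handling trimmed; the device name and the
+DPA value below are examples only, the real values come from the error
+records::
+
+    #include <fcntl.h>
+    #include <stdio.h>
+    #include <string.h>
+    #include <unistd.h>
+
+    /* Write a single sysfs attribute of a mem_repairX instance. */
+    static int repair_attr_write(const char *base, const char *attr,
+                                 const char *val)
+    {
+            char path[256];
+            int fd, ret;
+
+            snprintf(path, sizeof(path), "%s/%s", base, attr);
+            fd = open(path, O_WRONLY);
+            if (fd < 0)
+                    return -1;
+            ret = write(fd, val, strlen(val));
+            close(fd);
+            return ret < 0 ? -1 : 0;
+    }
+
+    int main(void)
+    {
+            /* Example repair instance of an example device. */
+            const char *base =
+                    "/sys/bus/edac/devices/cxl_mem0/mem_repair0";
+
+            /* DPA taken from a CXL DRAM trace event (example value). */
+            repair_attr_write(base, "dpa", "0x8a2d00");
+            /* EDAC_DO_MEM_REPAIR (1) issues the repair. */
+            return repair_attr_write(base, "repair", "1");
+    }
+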
+The File System
+---------------
+
+The control attributes of a registered memory repair instance can be
+accessed under
+
+/sys/bus/edac/devices/<dev-name>/mem_repairX/
+
+sysfs
+-----
+
+Sysfs files are documented in
+
+`Documentation/ABI/testing/sysfs-edac-memory-repair`.
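+
+Example
+-------
+
+A client driver registers its repair feature by filling a struct
+edac_mem_repair_ops and passing it to edac_dev_register(), assuming the
+edac_dev_register(parent, name, private, num_features, ras_features)
+signature used elsewhere in this series. The sketch below assumes a
+driver implementing soft PPR addressed by DPA; the my_* names and the
+driver context are hypothetical, and only the callbacks needed to expose
+the repair_function, dpa and repair attributes are shown::
+
+    static int my_get_repair_function(struct device *dev, void *drv_data,
+                                      u32 *val)
+    {
+            *val = EDAC_SOFT_PPR;
+            return 0;
+    }
+
+    static int my_get_dpa(struct device *dev, void *drv_data, u64 *dpa)
+    {
+            struct my_dev_ctx *ctx = drv_data;  /* hypothetical context */
+
+            *dpa = ctx->repair_dpa;
+            return 0;
+    }
+
+    static int my_set_dpa(struct device *dev, void *drv_data, u64 dpa)
+    {
+            struct my_dev_ctx *ctx = drv_data;
+
+            ctx->repair_dpa = dpa;
+            return 0;
+    }
+
+    static int my_do_repair(struct device *dev, void *drv_data, u32 val)
+    {
+            if (val != EDAC_DO_MEM_REPAIR)
+                    return -EINVAL;
+            /* Issue the device specific maintenance command here. */
+            return 0;
+    }
+
+    static const struct edac_mem_repair_ops my_repair_ops = {
+            .get_repair_function = my_get_repair_function,
+            .get_dpa = my_get_dpa,
+            .set_dpa = my_set_dpa,
+            .do_repair = my_do_repair,
+    };
+
+    static int my_probe(struct device *dev, struct my_dev_ctx *my_ctx)
+    {
+            struct edac_dev_feature feat = {
+                    .ft_type = RAS_FEAT_MEM_REPAIR,
+                    .instance = 0,
+                    .mem_repair_ops = &my_repair_ops,
+                    .ctx = my_ctx,
+            };
+
+            return edac_dev_register(dev, "cxl_mem0", NULL, 1, &feat);
+    }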
diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
index 3a49304860f0..1de9fe66ac6b 100644
--- a/drivers/edac/Makefile
+++ b/drivers/edac/Makefile
@@ -10,7 +10,7 @@  obj-$(CONFIG_EDAC)			:= edac_core.o
 
 edac_core-y	:= edac_mc.o edac_device.o edac_mc_sysfs.o
 edac_core-y	+= edac_module.o edac_device_sysfs.o wq.o
-edac_core-y	+= scrub.o ecs.o
+edac_core-y	+= scrub.o ecs.o mem_repair.o
 
 edac_core-$(CONFIG_EDAC_DEBUG)		+= debugfs.o
 
diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
index 1c1142a2e4e4..a401d81dad8a 100644
--- a/drivers/edac/edac_device.c
+++ b/drivers/edac/edac_device.c
@@ -575,6 +575,7 @@  static void edac_dev_release(struct device *dev)
 {
 	struct edac_dev_feat_ctx *ctx = container_of(dev, struct edac_dev_feat_ctx, dev);
 
+	kfree(ctx->mem_repair);
 	kfree(ctx->scrub);
 	kfree(ctx->dev.groups);
 	kfree(ctx);
@@ -611,6 +612,7 @@  int edac_dev_register(struct device *parent, char *name,
 	const struct attribute_group **ras_attr_groups;
 	struct edac_dev_data *dev_data;
 	struct edac_dev_feat_ctx *ctx;
+	int mem_repair_cnt = 0;
 	int attr_gcnt = 0;
 	int scrub_cnt = 0;
 	int ret, feat;
@@ -628,6 +630,10 @@  int edac_dev_register(struct device *parent, char *name,
 		case RAS_FEAT_ECS:
 			attr_gcnt += ras_features[feat].ecs_info.num_media_frus;
 			break;
+		case RAS_FEAT_MEM_REPAIR:
+			attr_gcnt++;
+			mem_repair_cnt++;
+			break;
 		default:
 			return -EINVAL;
 		}
@@ -651,8 +657,17 @@  int edac_dev_register(struct device *parent, char *name,
 		}
 	}
 
+	if (mem_repair_cnt) {
+		ctx->mem_repair = kcalloc(mem_repair_cnt, sizeof(*ctx->mem_repair), GFP_KERNEL);
+		if (!ctx->mem_repair) {
+			ret = -ENOMEM;
+			goto data_mem_free;
+		}
+	}
+
 	attr_gcnt = 0;
 	scrub_cnt = 0;
+	mem_repair_cnt = 0;
 	for (feat = 0; feat < num_features; feat++, ras_features++) {
 		switch (ras_features->ft_type) {
 		case RAS_FEAT_SCRUB:
@@ -686,6 +701,23 @@  int edac_dev_register(struct device *parent, char *name,
 
 			attr_gcnt += ras_features->ecs_info.num_media_frus;
 			break;
+		case RAS_FEAT_MEM_REPAIR:
+			if (!ras_features->mem_repair_ops ||
+			    mem_repair_cnt != ras_features->instance)
+				goto data_mem_free;
+
+			dev_data = &ctx->mem_repair[mem_repair_cnt];
+			dev_data->instance = mem_repair_cnt;
+			dev_data->mem_repair_ops = ras_features->mem_repair_ops;
+			dev_data->private = ras_features->ctx;
+			ret = edac_mem_repair_get_desc(parent, &ras_attr_groups[attr_gcnt],
+						       ras_features->instance);
+			if (ret)
+				goto data_mem_free;
+
+			mem_repair_cnt++;
+			attr_gcnt++;
+			break;
 		default:
 			ret = -EINVAL;
 			goto data_mem_free;
@@ -712,6 +744,7 @@  int edac_dev_register(struct device *parent, char *name,
 	return devm_add_action_or_reset(parent, edac_dev_unreg, &ctx->dev);
 
 data_mem_free:
+	kfree(ctx->mem_repair);
 	kfree(ctx->scrub);
 groups_free:
 	kfree(ras_attr_groups);
diff --git a/drivers/edac/mem_repair.c b/drivers/edac/mem_repair.c
new file mode 100755
index 000000000000..e7439fd26c41
--- /dev/null
+++ b/drivers/edac/mem_repair.c
@@ -0,0 +1,492 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * The generic EDAC memory repair driver is designed to control memory
+ * devices with memory repair features, such as Post Package Repair (PPR)
+ * and memory sparing. The common sysfs memory repair interface abstracts
+ * the control of arbitrary memory repair functionality into a unified
+ * set of functions.
+ *
+ * Copyright (c) 2024 HiSilicon Limited.
+ */
+
+#include <linux/edac.h>
+
+enum edac_mem_repair_attributes {
+	MEM_REPAIR_FUNCTION,
+	MEM_REPAIR_PERSIST_MODE,
+	MEM_REPAIR_DPA_SUPPORT,
+	MEM_REPAIR_SAFE_IN_USE,
+	MEM_REPAIR_HPA,
+	MEM_REPAIR_MIN_HPA,
+	MEM_REPAIR_MAX_HPA,
+	MEM_REPAIR_DPA,
+	MEM_REPAIR_MIN_DPA,
+	MEM_REPAIR_MAX_DPA,
+	MEM_REPAIR_NIBBLE_MASK,
+	MEM_REPAIR_MIN_NIBBLE_MASK,
+	MEM_REPAIR_MAX_NIBBLE_MASK,
+	MEM_REPAIR_BANK_GROUP,
+	MEM_REPAIR_MIN_BANK_GROUP,
+	MEM_REPAIR_MAX_BANK_GROUP,
+	MEM_REPAIR_BANK,
+	MEM_REPAIR_MIN_BANK,
+	MEM_REPAIR_MAX_BANK,
+	MEM_REPAIR_RANK,
+	MEM_REPAIR_MIN_RANK,
+	MEM_REPAIR_MAX_RANK,
+	MEM_REPAIR_ROW,
+	MEM_REPAIR_MIN_ROW,
+	MEM_REPAIR_MAX_ROW,
+	MEM_REPAIR_COLUMN,
+	MEM_REPAIR_MIN_COLUMN,
+	MEM_REPAIR_MAX_COLUMN,
+	MEM_REPAIR_CHANNEL,
+	MEM_REPAIR_MIN_CHANNEL,
+	MEM_REPAIR_MAX_CHANNEL,
+	MEM_REPAIR_SUB_CHANNEL,
+	MEM_REPAIR_MIN_SUB_CHANNEL,
+	MEM_REPAIR_MAX_SUB_CHANNEL,
+	MEM_DO_REPAIR,
+	MEM_REPAIR_MAX_ATTRS
+};
+
+struct edac_mem_repair_dev_attr {
+	struct device_attribute dev_attr;
+	u8 instance;
+};
+
+struct edac_mem_repair_context {
+	char name[EDAC_FEAT_NAME_LEN];
+	struct edac_mem_repair_dev_attr mem_repair_dev_attr[MEM_REPAIR_MAX_ATTRS];
+	struct attribute *mem_repair_attrs[MEM_REPAIR_MAX_ATTRS + 1];
+	struct attribute_group group;
+};
+
+#define TO_MEM_REPAIR_DEV_ATTR(_dev_attr)      \
+		container_of(_dev_attr, struct edac_mem_repair_dev_attr, dev_attr)
+
+#define EDAC_MEM_REPAIR_ATTR_SHOW(attrib, cb, type, format)			\
+static ssize_t attrib##_show(struct device *ras_feat_dev,			\
+			     struct device_attribute *attr, char *buf)		\
+{										\
+	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;			\
+	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
+	const struct edac_mem_repair_ops *ops =					\
+				ctx->mem_repair[inst].mem_repair_ops;		\
+	type data;								\
+	int ret;								\
+										\
+	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private,	\
+		      &data);							\
+	if (ret)								\
+		return ret;							\
+										\
+	return sysfs_emit(buf, format, data);					\
+}
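+
+/*
+ * For example, EDAC_MEM_REPAIR_ATTR_SHOW(dpa, get_dpa, u64, "0x%llx\n")
+ * expands to a dpa_show() helper that forwards to the instance's
+ * ->get_dpa() callback and emits the value in hex.
+ */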
+
+EDAC_MEM_REPAIR_ATTR_SHOW(repair_function, get_repair_function, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(persist_mode, get_persist_mode, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(dpa_support, get_dpa_support, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(repair_safe_when_in_use, get_repair_safe_when_in_use, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(hpa, get_hpa, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_hpa, get_min_hpa, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_hpa, get_max_hpa, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(dpa, get_dpa, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_dpa, get_min_dpa, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_dpa, get_max_dpa, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(nibble_mask, get_nibble_mask, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_nibble_mask, get_min_nibble_mask, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_nibble_mask, get_max_nibble_mask, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(bank_group, get_bank_group, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_bank_group, get_min_bank_group, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_bank_group, get_max_bank_group, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(bank, get_bank, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_bank, get_min_bank, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_bank, get_max_bank, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(rank, get_rank, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_rank, get_min_rank, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_rank, get_max_rank, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(row, get_row, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_row, get_min_row, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_row, get_max_row, u64, "0x%llx\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(column, get_column, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_column, get_min_column, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_column, get_max_column, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(channel, get_channel, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_channel, get_min_channel, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_channel, get_max_channel, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(sub_channel, get_sub_channel, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(min_sub_channel, get_min_sub_channel, u32, "%u\n")
+EDAC_MEM_REPAIR_ATTR_SHOW(max_sub_channel, get_max_sub_channel, u32, "%u\n")
+
+#define EDAC_MEM_REPAIR_ATTR_STORE(attrib, cb, type, conv_func)			\
+static ssize_t attrib##_store(struct device *ras_feat_dev,			\
+			      struct device_attribute *attr,			\
+			      const char *buf, size_t len)			\
+{										\
+	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;			\
+	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
+	const struct edac_mem_repair_ops *ops =					\
+				ctx->mem_repair[inst].mem_repair_ops;		\
+	type data;								\
+	int ret;								\
+										\
+	ret = conv_func(buf, 0, &data);						\
+	if (ret < 0)								\
+		return ret;							\
+										\
+	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private,	\
+		      data);							\
+	if (ret)								\
+		return ret;							\
+										\
+	return len;								\
+}
+
+EDAC_MEM_REPAIR_ATTR_STORE(persist_mode, set_persist_mode, unsigned long, kstrtoul)
+EDAC_MEM_REPAIR_ATTR_STORE(hpa, set_hpa, u64, kstrtou64)
+EDAC_MEM_REPAIR_ATTR_STORE(dpa, set_dpa, u64, kstrtou64)
+EDAC_MEM_REPAIR_ATTR_STORE(nibble_mask, set_nibble_mask, u64, kstrtou64)
+EDAC_MEM_REPAIR_ATTR_STORE(bank_group, set_bank_group, unsigned long, kstrtoul)
+EDAC_MEM_REPAIR_ATTR_STORE(bank, set_bank, unsigned long, kstrtoul)
+EDAC_MEM_REPAIR_ATTR_STORE(rank, set_rank, unsigned long, kstrtoul)
+EDAC_MEM_REPAIR_ATTR_STORE(row, set_row, u64, kstrtou64)
+EDAC_MEM_REPAIR_ATTR_STORE(column, set_column, unsigned long, kstrtoul)
+EDAC_MEM_REPAIR_ATTR_STORE(channel, set_channel, unsigned long, kstrtoul)
+EDAC_MEM_REPAIR_ATTR_STORE(sub_channel, set_sub_channel, unsigned long, kstrtoul)
+
+#define EDAC_MEM_REPAIR_DO_OP(attrib, cb)						\
+static ssize_t attrib##_store(struct device *ras_feat_dev,				\
+			      struct device_attribute *attr,				\
+			      const char *buf, size_t len)				\
+{											\
+	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;				\
+	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);			\
+	const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops;	\
+	unsigned long data;								\
+	int ret;									\
+											\
+	ret = kstrtoul(buf, 0, &data);							\
+	if (ret < 0)									\
+		return ret;								\
+											\
+	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, data);	\
+	if (ret)									\
+		return ret;								\
+											\
+	return len;									\
+}
+
+EDAC_MEM_REPAIR_DO_OP(repair, do_repair)
+
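+/*
+ * An attribute is visible only when the client driver implements the
+ * corresponding get callback; a read-write attribute is downgraded to
+ * read-only (0444) when the matching set callback is missing.
+ */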
+static umode_t mem_repair_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id)
+{
+	struct device *ras_feat_dev = kobj_to_dev(kobj);
+	struct device_attribute *dev_attr = container_of(a, struct device_attribute, attr);
+	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);
+	u8 inst = TO_MEM_REPAIR_DEV_ATTR(dev_attr)->instance;
+	const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops;
+
+	switch (attr_id) {
+	case MEM_REPAIR_FUNCTION:
+		if (ops->get_repair_function)
+			return a->mode;
+		break;
+	case MEM_REPAIR_PERSIST_MODE:
+		if (ops->get_persist_mode) {
+			if (ops->set_persist_mode)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_DPA_SUPPORT:
+		if (ops->get_dpa_support)
+			return a->mode;
+		break;
+	case MEM_REPAIR_SAFE_IN_USE:
+		if (ops->get_repair_safe_when_in_use)
+			return a->mode;
+		break;
+	case MEM_REPAIR_HPA:
+		if (ops->get_hpa) {
+			if (ops->set_hpa)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_HPA:
+		if (ops->get_min_hpa)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_HPA:
+		if (ops->get_max_hpa)
+			return a->mode;
+		break;
+	case MEM_REPAIR_DPA:
+		if (ops->get_dpa) {
+			if (ops->set_dpa)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_DPA:
+		if (ops->get_min_dpa)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_DPA:
+		if (ops->get_max_dpa)
+			return a->mode;
+		break;
+	case MEM_REPAIR_NIBBLE_MASK:
+		if (ops->get_nibble_mask) {
+			if (ops->set_nibble_mask)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_NIBBLE_MASK:
+		if (ops->get_min_nibble_mask)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_NIBBLE_MASK:
+		if (ops->get_max_nibble_mask)
+			return a->mode;
+		break;
+	case MEM_REPAIR_BANK_GROUP:
+		if (ops->get_bank_group) {
+			if (ops->set_bank_group)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_BANK_GROUP:
+		if (ops->get_min_bank_group)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_BANK_GROUP:
+		if (ops->get_max_bank_group)
+			return a->mode;
+		break;
+	case MEM_REPAIR_BANK:
+		if (ops->get_bank) {
+			if (ops->set_bank)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_BANK:
+		if (ops->get_min_bank)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_BANK:
+		if (ops->get_max_bank)
+			return a->mode;
+		break;
+	case MEM_REPAIR_RANK:
+		if (ops->get_rank) {
+			if (ops->set_rank)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_RANK:
+		if (ops->get_min_rank)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_RANK:
+		if (ops->get_max_rank)
+			return a->mode;
+		break;
+	case MEM_REPAIR_ROW:
+		if (ops->get_row) {
+			if (ops->set_row)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_ROW:
+		if (ops->get_min_row)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_ROW:
+		if (ops->get_max_row)
+			return a->mode;
+		break;
+	case MEM_REPAIR_COLUMN:
+		if (ops->get_column) {
+			if (ops->set_column)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_COLUMN:
+		if (ops->get_min_column)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_COLUMN:
+		if (ops->get_max_column)
+			return a->mode;
+		break;
+	case MEM_REPAIR_CHANNEL:
+		if (ops->get_channel) {
+			if (ops->set_channel)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_CHANNEL:
+		if (ops->get_min_channel)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_CHANNEL:
+		if (ops->get_max_channel)
+			return a->mode;
+		break;
+	case MEM_REPAIR_SUB_CHANNEL:
+		if (ops->get_sub_channel) {
+			if (ops->set_sub_channel)
+				return a->mode;
+			else
+				return 0444;
+		}
+		break;
+	case MEM_REPAIR_MIN_SUB_CHANNEL:
+		if (ops->get_min_sub_channel)
+			return a->mode;
+		break;
+	case MEM_REPAIR_MAX_SUB_CHANNEL:
+		if (ops->get_max_sub_channel)
+			return a->mode;
+		break;
+	case MEM_DO_REPAIR:
+		if (ops->do_repair)
+			return a->mode;
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+#define EDAC_MEM_REPAIR_ATTR_RO(_name, _instance)       \
+	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RO(_name), \
+					     .instance = _instance })
+
+#define EDAC_MEM_REPAIR_ATTR_WO(_name, _instance)       \
+	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_WO(_name), \
+					     .instance = _instance })
+
+#define EDAC_MEM_REPAIR_ATTR_RW(_name, _instance)       \
+	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RW(_name), \
+					     .instance = _instance })
+
+static int mem_repair_create_desc(struct device *dev,
+				  const struct attribute_group **attr_groups,
+				  u8 instance)
+{
+	struct edac_mem_repair_context *ctx;
+	struct attribute_group *group;
+	int i;
+	struct edac_mem_repair_dev_attr dev_attr[] = {
+		[MEM_REPAIR_FUNCTION] = EDAC_MEM_REPAIR_ATTR_RO(repair_function,
+							    instance),
+		[MEM_REPAIR_PERSIST_MODE] =
+				EDAC_MEM_REPAIR_ATTR_RW(persist_mode, instance),
+		[MEM_REPAIR_DPA_SUPPORT] =
+				EDAC_MEM_REPAIR_ATTR_RO(dpa_support, instance),
+		[MEM_REPAIR_SAFE_IN_USE] =
+				EDAC_MEM_REPAIR_ATTR_RO(repair_safe_when_in_use,
+							instance),
+		[MEM_REPAIR_HPA] = EDAC_MEM_REPAIR_ATTR_RW(hpa, instance),
+		[MEM_REPAIR_MIN_HPA] = EDAC_MEM_REPAIR_ATTR_RO(min_hpa, instance),
+		[MEM_REPAIR_MAX_HPA] = EDAC_MEM_REPAIR_ATTR_RO(max_hpa, instance),
+		[MEM_REPAIR_DPA] = EDAC_MEM_REPAIR_ATTR_RW(dpa, instance),
+		[MEM_REPAIR_MIN_DPA] = EDAC_MEM_REPAIR_ATTR_RO(min_dpa, instance),
+		[MEM_REPAIR_MAX_DPA] = EDAC_MEM_REPAIR_ATTR_RO(max_dpa, instance),
+		[MEM_REPAIR_NIBBLE_MASK] =
+				EDAC_MEM_REPAIR_ATTR_RW(nibble_mask, instance),
+		[MEM_REPAIR_MIN_NIBBLE_MASK] =
+				EDAC_MEM_REPAIR_ATTR_RO(min_nibble_mask, instance),
+		[MEM_REPAIR_MAX_NIBBLE_MASK] =
+				EDAC_MEM_REPAIR_ATTR_RO(max_nibble_mask, instance),
+		[MEM_REPAIR_BANK_GROUP] =
+				EDAC_MEM_REPAIR_ATTR_RW(bank_group, instance),
+		[MEM_REPAIR_MIN_BANK_GROUP] =
+				EDAC_MEM_REPAIR_ATTR_RO(min_bank_group, instance),
+		[MEM_REPAIR_MAX_BANK_GROUP] =
+				EDAC_MEM_REPAIR_ATTR_RO(max_bank_group, instance),
+		[MEM_REPAIR_BANK] = EDAC_MEM_REPAIR_ATTR_RW(bank, instance),
+		[MEM_REPAIR_MIN_BANK] = EDAC_MEM_REPAIR_ATTR_RO(min_bank, instance),
+		[MEM_REPAIR_MAX_BANK] = EDAC_MEM_REPAIR_ATTR_RO(max_bank, instance),
+		[MEM_REPAIR_RANK] = EDAC_MEM_REPAIR_ATTR_RW(rank, instance),
+		[MEM_REPAIR_MIN_RANK] = EDAC_MEM_REPAIR_ATTR_RO(min_rank, instance),
+		[MEM_REPAIR_MAX_RANK] = EDAC_MEM_REPAIR_ATTR_RO(max_rank, instance),
+		[MEM_REPAIR_ROW] = EDAC_MEM_REPAIR_ATTR_RW(row, instance),
+		[MEM_REPAIR_MIN_ROW] = EDAC_MEM_REPAIR_ATTR_RO(min_row, instance),
+		[MEM_REPAIR_MAX_ROW] = EDAC_MEM_REPAIR_ATTR_RO(max_row, instance),
+		[MEM_REPAIR_COLUMN] = EDAC_MEM_REPAIR_ATTR_RW(column, instance),
+		[MEM_REPAIR_MIN_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(min_column, instance),
+		[MEM_REPAIR_MAX_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(max_column, instance),
+		[MEM_REPAIR_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RW(channel, instance),
+		[MEM_REPAIR_MIN_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(min_channel, instance),
+		[MEM_REPAIR_MAX_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(max_channel, instance),
+		[MEM_REPAIR_SUB_CHANNEL] =
+				EDAC_MEM_REPAIR_ATTR_RW(sub_channel, instance),
+		[MEM_REPAIR_MIN_SUB_CHANNEL] =
+				EDAC_MEM_REPAIR_ATTR_RO(min_sub_channel, instance),
+		[MEM_REPAIR_MAX_SUB_CHANNEL] =
+				EDAC_MEM_REPAIR_ATTR_RO(max_sub_channel, instance),
+		[MEM_DO_REPAIR] = EDAC_MEM_REPAIR_ATTR_WO(repair, instance)
+	};
+
+	ctx = devm_kzalloc(dev, sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
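+	/*
+	 * Copy the on-stack attribute templates into the devm-allocated
+	 * context so that each instance keeps its own device_attribute
+	 * with the instance number embedded.
+	 */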
+	for (i = 0; i < MEM_REPAIR_MAX_ATTRS; i++) {
+		memcpy(&ctx->mem_repair_dev_attr[i].dev_attr,
+		       &dev_attr[i], sizeof(dev_attr[i]));
+		ctx->mem_repair_attrs[i] =
+				&ctx->mem_repair_dev_attr[i].dev_attr.attr;
+	}
+
+	snprintf(ctx->name, sizeof(ctx->name), "mem_repair%d", instance);
+	group = &ctx->group;
+	group->name = ctx->name;
+	group->attrs = ctx->mem_repair_attrs;
+	group->is_visible = mem_repair_attr_visible;
+	attr_groups[0] = group;
+
+	return 0;
+}
+
+/**
+ * edac_mem_repair_get_desc - get EDAC memory repair descriptors
+ * @dev: client device with memory repair feature
+ * @attr_groups: pointer to attribute group container
+ * @instance: device's memory repair instance number.
+ *
+ * Return:
+ *  * %0	- Success.
+ *  * %-EINVAL	- Invalid parameters passed.
+ *  * %-ENOMEM	- Dynamic memory allocation failed.
+ */
+int edac_mem_repair_get_desc(struct device *dev,
+			     const struct attribute_group **attr_groups, u8 instance)
+{
+	if (!dev || !attr_groups)
+		return -EINVAL;
+
+	return mem_repair_create_desc(dev, attr_groups, instance);
+}
diff --git a/include/linux/edac.h b/include/linux/edac.h
index 979e91426701..5d07192bf1a7 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -668,6 +668,7 @@  static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci,
 enum edac_dev_feat {
 	RAS_FEAT_SCRUB,
 	RAS_FEAT_ECS,
+	RAS_FEAT_MEM_REPAIR,
 	RAS_FEAT_MAX
 };
 
@@ -729,11 +730,147 @@  int edac_ecs_get_desc(struct device *ecs_dev,
 		      const struct attribute_group **attr_groups,
 		      u16 num_media_frus);
 
+enum edac_mem_repair_function {
+	EDAC_SOFT_PPR,
+	EDAC_HARD_PPR,
+	EDAC_CACHELINE_MEM_SPARING,
+	EDAC_ROW_MEM_SPARING,
+	EDAC_BANK_MEM_SPARING,
+	EDAC_RANK_MEM_SPARING,
+};
+
+enum edac_mem_repair_persist_mode {
+	EDAC_MEM_REPAIR_SOFT, /* soft memory repair */
+	EDAC_MEM_REPAIR_HARD, /* hard memory repair */
+};
+
+enum edac_mem_repair_cmd {
+	EDAC_DO_MEM_REPAIR = 1,
+};
+
+/**
+ * struct edac_mem_repair_ops - memory repair operations
+ * (all elements are optional except do_repair, set_hpa/set_dpa)
+ * @get_repair_function: get the memory repair function, listed in
+ *			 enum edac_mem_repair_function.
+ * @get_persist_mode: get the current persist mode. The persist repair modes
+ *		      supported by the device depend on the memory repair function,
+ *		      which is either temporary (the repair is lost with a power
+ *		      cycle) or permanent.
+ *		      EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
+ *		      EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
+ *		      All other values are reserved.
+ * @set_persist_mode: set the persist mode of the memory repair instance.
+ * @get_dpa_support: get dpa support flag. In some states of system configuration
+ *		     (e.g. before address decoders have been configured), memory
+ *		     devices (e.g. CXL) may not have an active mapping in the main
+ *		     host physical address map. As such, the memory to repair must
+ *		     be identified by a device specific physical addressing scheme
+ *		     using a device physical address (DPA). The DPA and other control
+ *		     attributes to use for the repair operation will be presented in
+ *		     related error records.
+ * @get_repair_safe_when_in_use: get whether memory media is accessible and
+ *				 data is retained during repair operation.
+ * @get_hpa: get current host physical address (HPA).
+ * @set_hpa: set host physical address (HPA) of memory to repair.
+ * @get_min_hpa: get the minimum supported host physical address (HPA).
+ * @get_max_hpa: get the maximum supported host physical address (HPA).
+ * @get_dpa: get current device physical address (DPA).
+ * @set_dpa: set device physical address (DPA) of memory to repair.
+ * @get_min_dpa: get the minimum supported device physical address (DPA).
+ * @get_max_dpa: get the maximum supported device physical address (DPA).
+ * @get_nibble_mask: get current nibble mask.
+ * @set_nibble_mask: set nibble mask of memory to repair.
+ * @get_min_nibble_mask: get the minimum supported nibble mask.
+ * @get_max_nibble_mask: get the maximum supported nibble mask.
+ * @get_bank_group: get current bank group.
+ * @set_bank_group: set bank group of memory to repair.
+ * @get_min_bank_group: get the minimum supported bank group.
+ * @get_max_bank_group: get the maximum supported bank group.
+ * @get_bank: get current bank.
+ * @set_bank: set bank of memory to repair.
+ * @get_min_bank: get the minimum supported bank.
+ * @get_max_bank: get the maximum supported bank.
+ * @get_rank: get current rank.
+ * @set_rank: set rank of memory to repair.
+ * @get_min_rank: get the minimum supported rank.
+ * @get_max_rank: get the maximum supported rank.
+ * @get_row: get current row.
+ * @set_row: set row of memory to repair.
+ * @get_min_row: get the minimum supported row.
+ * @get_max_row: get the maximum supported row.
+ * @get_column: get current column.
+ * @set_column: set column of memory to repair.
+ * @get_min_column: get the minimum supported column.
+ * @get_max_column: get the maximum supported column.
+ * @get_channel: get current channel.
+ * @set_channel: set channel of memory to repair.
+ * @get_min_channel: get the minimum supported channel.
+ * @get_max_channel: get the maximum supported channel.
+ * @get_sub_channel: get current sub channel.
+ * @set_sub_channel: set sub channel of memory to repair.
+ * @get_min_sub_channel: get the minimum supported sub channel.
+ * @get_max_sub_channel: get the maximum supported sub channel.
+ * @do_repair: Issue memory repair operation for the HPA/DPA and
+ *	       other control attributes set for the memory to repair.
+ */
+struct edac_mem_repair_ops {
+	int (*get_repair_function)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_persist_mode)(struct device *dev, void *drv_data, u32 *mode);
+	int (*set_persist_mode)(struct device *dev, void *drv_data, u32 mode);
+	int (*get_dpa_support)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_repair_safe_when_in_use)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_hpa)(struct device *dev, void *drv_data, u64 *hpa);
+	int (*set_hpa)(struct device *dev, void *drv_data, u64 hpa);
+	int (*get_min_hpa)(struct device *dev, void *drv_data, u64 *hpa);
+	int (*get_max_hpa)(struct device *dev, void *drv_data, u64 *hpa);
+	int (*get_dpa)(struct device *dev, void *drv_data, u64 *dpa);
+	int (*set_dpa)(struct device *dev, void *drv_data, u64 dpa);
+	int (*get_min_dpa)(struct device *dev, void *drv_data, u64 *dpa);
+	int (*get_max_dpa)(struct device *dev, void *drv_data, u64 *dpa);
+	int (*get_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
+	int (*set_nibble_mask)(struct device *dev, void *drv_data, u64 val);
+	int (*get_min_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
+	int (*get_max_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
+	int (*get_bank_group)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_bank_group)(struct device *dev, void *drv_data, u32 val);
+	int (*get_min_bank_group)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_max_bank_group)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_bank)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_bank)(struct device *dev, void *drv_data, u32 val);
+	int (*get_min_bank)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_max_bank)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_rank)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_rank)(struct device *dev, void *drv_data, u32 val);
+	int (*get_min_rank)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_max_rank)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_row)(struct device *dev, void *drv_data, u64 *val);
+	int (*set_row)(struct device *dev, void *drv_data, u64 val);
+	int (*get_min_row)(struct device *dev, void *drv_data, u64 *val);
+	int (*get_max_row)(struct device *dev, void *drv_data, u64 *val);
+	int (*get_column)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_column)(struct device *dev, void *drv_data, u32 val);
+	int (*get_min_column)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_max_column)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_channel)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_channel)(struct device *dev, void *drv_data, u32 val);
+	int (*get_min_channel)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_max_channel)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_sub_channel)(struct device *dev, void *drv_data, u32 *val);
+	int (*set_sub_channel)(struct device *dev, void *drv_data, u32 val);
+	int (*get_min_sub_channel)(struct device *dev, void *drv_data, u32 *val);
+	int (*get_max_sub_channel)(struct device *dev, void *drv_data, u32 *val);
+	int (*do_repair)(struct device *dev, void *drv_data, u32 val);
+};
+
+int edac_mem_repair_get_desc(struct device *dev,
+			     const struct attribute_group **attr_groups,
+			     u8 instance);
+
 /* EDAC device feature information structure */
 struct edac_dev_data {
 	union {
 		const struct edac_scrub_ops *scrub_ops;
 		const struct edac_ecs_ops *ecs_ops;
+		const struct edac_mem_repair_ops *mem_repair_ops;
 	};
 	u8 instance;
 	void *private;
@@ -744,6 +881,7 @@  struct edac_dev_feat_ctx {
 	void *private;
 	struct edac_dev_data *scrub;
 	struct edac_dev_data ecs;
+	struct edac_dev_data *mem_repair;
 };
 
 struct edac_dev_feature {
@@ -752,6 +890,7 @@  struct edac_dev_feature {
 	union {
 		const struct edac_scrub_ops *scrub_ops;
 		const struct edac_ecs_ops *ecs_ops;
+		const struct edac_mem_repair_ops *mem_repair_ops;
 	};
 	void *ctx;
 	struct edac_ecs_ex_info ecs_info;