Message ID | 20250106121017.1620-5-shiju.jose@huawei.com (mailing list archive) |
---|---|
State | Handled Elsewhere, archived |
Headers | show |
Series | EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers | expand |
On Mon, Jan 06, 2025 at 12:10:00PM +0000, shiju.jose@huawei.com wrote: > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_nibble_mask > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank_group > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_rank > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_row > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_column > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_channel > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_sub_channel > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_nibble_mask > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank_group > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_rank > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_row > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_column > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_channel > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_sub_channel So this is new. I don't remember seeing that when I looked at your patches the last time. Looks like you have all those attributes and now you've decided to add a min and max for each one, in addition. And UI-wise it is a madness as there are gazillion single-value files now. "Attributes should be ASCII text files, preferably with only one value per file. It is noted that it may not be efficient to contain only one value per file, so it is socially acceptable to express an array of values of the same type." So you don't need those - you can simply express each attribute as a range: echo "1:2" > /sys/bus/edac/devices/<dev-name>/mem_repairX/bank or if you wanna scrub only one bank: echo "1:1" > /sys/bus/edac/devices/<dev-name>/mem_repairX/bank What is the use case of that thing? Someone might find it useful so let's add it preemptively? Pfff.
>-----Original Message----- >From: Borislav Petkov <bp@alien8.de> >Sent: 09 January 2025 09:19 >To: Shiju Jose <shiju.jose@huawei.com> >Cc: linux-edac@vger.kernel.org; linux-cxl@vger.kernel.org; linux- >acpi@vger.kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org; >tony.luck@intel.com; rafael@kernel.org; lenb@kernel.org; >mchehab@kernel.org; dan.j.williams@intel.com; dave@stgolabs.net; Jonathan >Cameron <jonathan.cameron@huawei.com>; dave.jiang@intel.com; >alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com; >david@redhat.com; Vilas.Sridharan@amd.com; leo.duran@amd.com; >Yazen.Ghannam@amd.com; rientjes@google.com; jiaqiyan@google.com; >Jon.Grimm@amd.com; dave.hansen@linux.intel.com; >naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com; >somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com; >duenwen@google.com; gthelen@google.com; >wschwartz@amperecomputing.com; dferguson@amperecomputing.com; >wbs@os.amperecomputing.com; nifan.cxl@gmail.com; tanxiaofei ><tanxiaofei@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Roberto >Sassu <roberto.sassu@huawei.com>; kangkang.shen@futurewei.com; >wanghuiqiang <wanghuiqiang@huawei.com>; Linuxarm ><linuxarm@huawei.com> >Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature > >On Mon, Jan 06, 2025 at 12:10:00PM +0000, shiju.jose@huawei.com wrote: >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_hpa >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_dpa >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_nibble_mask >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_bank_group >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_bank >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_rank >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_row >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_column >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_channel >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_sub_channel >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_hpa >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_dpa >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_nibble_mask >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_bank_group >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_bank >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_rank >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_row >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_column >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_channel >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_sub_channel > >So this is new. I don't remember seeing that when I looked at your patches the >last time. > >Looks like you have all those attributes and now you've decided to add a min and >max for each one, in addition. And UI-wise it is a madness as there are gazillion >single-value files now. > Thanks for the feedbacks. The min_ and max_ attributes of the control attributes are added for your feedback on V15 to expose supported ranges of these control attributes to the user, in the following links. However these min_ and max_ attributes are 'RO' instead of 'RW' as specified in the doc, which to be fixed in the doc. https://lore.kernel.org/lkml/20241114133249.GEZzX8ATNyc_Xw1L52@fat_crate.local/ https://lore.kernel.org/lkml/fa5d6bdd08104cf1a09c4960a0f9bc46@huawei.com/ https://lore.kernel.org/lkml/20241119123657.GCZzyGaZIExvUHPLKL@fat_crate.local/ >"Attributes should be ASCII text files, preferably with only one value per file. It is >noted that it may not be efficient to contain only one value per file, so it is >socially acceptable to express an array of values of the same type." > >So you don't need those - you can simply express each attribute as a range: > >echo "1:2" > /sys/bus/edac/devices/<dev-name>/mem_repairX/bank > >or if you wanna scrub only one bank: After internal discussion, we think this is the source of the confusion. This is not scrub where a range would indeed make sense. It is repair. We are not aware of a failure mechanism where a set of memory banks would fail together but not the whole of the next level up in the memory topology. In theory we might get a stupid device design where it reports coarse level errors but can only repair at fine levels where a range might be appropriate. We are not sure that makes sense in practice and with a range interface we will get mess like running out of repair resources half way through a list with no visibility of what is repaired. However, given the repair flow is driven by userspace receiving error records that will only possible values to repair, we think these bounds on what can be repaired are a nice to have rather than necessary so we would propose we do not add max_ and min_ for now and see how the use cases evolve. > >echo "1:1" > /sys/bus/edac/devices/<dev-name>/mem_repairX/bank > >What is the use case of that thing? > >Someone might find it useful so let's add it preemptively? > >Pfff. > >-- >Regards/Gruss, > Boris. > >https://people.kernel.org/tglx/notes-about-netiquette Thanks, Shiju
On Thu, Jan 09, 2025 at 11:00:43AM +0000, Shiju Jose wrote: > The min_ and max_ attributes of the control attributes are added for your > feedback on V15 to expose supported ranges of these control attributes to the user, > in the following links. Sure, but you can make that differently: cat /sys/bus/edac/devices/<dev-name>/mem_repairX/bank [x:y] which is the allowed range. echo ... then writes in the bank. > ... so we would propose we do not add max_ and min_ for now and see how the > use cases evolve. Yes, you should apply that same methodology to the rest of the new features you're adding: only add functionality for the stuff that is actually being used now. You can always extend it later. Changing an already user-visible API is a whole different story and a lot lot harder, even impossible. So I'd suggest you prune the EDAC patches from all the hypothetical usage and then send only what remains so that I can try to queue them. Thx.
On Thu, 9 Jan 2025 13:32:22 +0100 Borislav Petkov <bp@alien8.de> wrote: Hi Boris, > On Thu, Jan 09, 2025 at 11:00:43AM +0000, Shiju Jose wrote: > > The min_ and max_ attributes of the control attributes are added for your > > feedback on V15 to expose supported ranges of these control attributes to the user, > > in the following links. > > Sure, but you can make that differently: > > cat /sys/bus/edac/devices/<dev-name>/mem_repairX/bank > [x:y] > > which is the allowed range. To my thinking that would fail the test of being an intuitive interface. To issue a repair command requires that multiple attributes be configured before triggering the actual repair. Think of it as setting the coordinates of the repair in a high dimensional space. In the extreme case of fine grained repair (Cacheline), to identify the relevant subunit of memory (obtained from the error record that we are basing the decision to repair on) we need to specify all of: Channel, sub-channel, rank, bank group, row, column and nibble mask. For coarser granularity repair only specify a subset of these applies and only the relevant controls are exposed to userspace. They are broken out as specific attributes to enable each to be set before triggering the action with a write to the repair attribute. There are several possible alternatives: Option 1 "A:B:C:D:E:F:G:H:I:J" opaque single write to trigger the repair where each number is providing one of those coordinates and where a readback let's us known what each number is. That single attribute interface is very hard to extend in an intuitive way. History tell us more levels will be introduced in the middle, not just at the finest granularity, making such an interface hard to extend in a backwards compatible way. Another alternative of a key value list would make for a nasty sysfs interface. Option 2 There are sysfs interfaces that use a selection type presentation. Write: "C", Read: "A, B, [C], D" but that only works well for discrete sets of options and is a pain to parse if read back is necessary. So in conclusion, I think the proposed multiple sysfs attribute style with them reading back the most recent value written is the least bad solution to a complex control interface. > > echo ... > > then writes in the bank. > > > ... so we would propose we do not add max_ and min_ for now and see how the > > use cases evolve. > > Yes, you should apply that same methodology to the rest of the new features > you're adding: only add functionality for the stuff that is actually being > used now. You can always extend it later. > > Changing an already user-visible API is a whole different story and a lot lot > harder, even impossible. > > So I'd suggest you prune the EDAC patches from all the hypothetical usage and > then send only what remains so that I can try to queue them. Sure. In this case the addition of min/max was perhaps a wrong response to your request for a way to those ranges rather than just rejecting a write of something out of range as earlier version did. We can revisit in future if range discovery becomes necessary. Personally I don't think it is given we are only taking these actions in response error records that give us precisely what to write and hence are always in range. Jonathan > > Thx. >
On Thu, Jan 09, 2025 at 02:24:33PM +0000, Jonathan Cameron wrote: > To my thinking that would fail the test of being an intuitive interface. > To issue a repair command requires that multiple attributes be configured > before triggering the actual repair. > > Think of it as setting the coordinates of the repair in a high dimensional > space. Why? You can write every attribute in its separate file and have a "commit" or "start" file which does that. Or you can designate a file which starts the process. This is how I'm injecting errors on x86: see readme_msg here: arch/x86/kernel/cpu/mce/inject.c More specifically: "flags:\t Injection type to be performed. Writing to this file will trigger a\n" "\t real machine check, an APIC interrupt or invoke the error decoder routines\n" "\t for AMD processors.\n" So you set everything else, and as the last step you set the injection type *and* you also trigger it with this one write. > Sure. In this case the addition of min/max was perhaps a wrong response to > your request for a way to those ranges rather than just rejecting a write > of something out of range as earlier version did. > > We can revisit in future if range discovery becomes necessary. Personally > I don't think it is given we are only taking these actions in response error > records that give us precisely what to write and hence are always in range. My goal here was to make this user-friendly. Because you need some way of knowing what valid ranges are and in order to trigger the repair, if it needs to happen for a range. Or, you can teach the repair logic to ignore invalid ranges and "clamp" things to whatever makes sense. Again, I'm looking at it from the usability perspective. I haven't actually needed this scrub+repair functionality yet to know whether the UI makes sense. So yeah, collecting some feedback from real-life use cases would probably give you a lot better understanding of how that UI should be designed... perhaps you won't ever need the ranges, whow knows. So yes, preemptively designing stuff like that "in the dark" is kinda hard. :-)
On Thu, 9 Jan 2025 16:18:54 +0100 Borislav Petkov <bp@alien8.de> wrote: > On Thu, Jan 09, 2025 at 02:24:33PM +0000, Jonathan Cameron wrote: > > To my thinking that would fail the test of being an intuitive interface. > > To issue a repair command requires that multiple attributes be configured > > before triggering the actual repair. > > > > Think of it as setting the coordinates of the repair in a high dimensional > > space. > > Why? Ok. To me the fact it's not a single write was relevant. Seems not in your mental model of how this works. For me a single write that you cannot query back is fine, setting lots of parameters and being unable to query any of them less so. I guess you disagree. In interests of progress I'm not going to argue further. No one is going to use this interface by hand anyway so the lost of useability I'm seeing doesn't matter a lot. > > You can write every attribute in its separate file and have a "commit" or > "start" file which does that. That's what we have. > > Or you can designate a file which starts the process. This is how I'm > injecting errors on x86: > > see readme_msg here: arch/x86/kernel/cpu/mce/inject.c > > More specifically: > > "flags:\t Injection type to be performed. Writing to this file will trigger a\n" > "\t real machine check, an APIC interrupt or invoke the error decoder routines\n" > "\t for AMD processors.\n" > > So you set everything else, and as the last step you set the injection type > *and* you also trigger it with this one write. Agreed. I'm not sure of the relevance though. This is how it works and there is no proposal to change that. What I was trying to argue was for an interface that let you set all the coordinates and read back what they were before hitting go. > > > Sure. In this case the addition of min/max was perhaps a wrong response to > > your request for a way to those ranges rather than just rejecting a write > > of something out of range as earlier version did. > > > > We can revisit in future if range discovery becomes necessary. Personally > > I don't think it is given we are only taking these actions in response error > > records that give us precisely what to write and hence are always in range. > > My goal here was to make this user-friendly. Because you need some way of > knowing what valid ranges are and in order to trigger the repair, if it needs > to happen for a range. In at least the CXL case I'm fairly sure most of them are not discoverable. Until you see errors you have no idea what the memory topology is. > > Or, you can teach the repair logic to ignore invalid ranges and "clamp" things > to whatever makes sense. For that you'd need to have a path to read back what happened. > > Again, I'm looking at it from the usability perspective. I haven't actually > needed this scrub+repair functionality yet to know whether the UI makes sense. > So yeah, collecting some feedback from real-life use cases would probably give > you a lot better understanding of how that UI should be designed... perhaps > you won't ever need the ranges, whow knows. > > So yes, preemptively designing stuff like that "in the dark" is kinda hard. > :-) The discoverability is unnecessary for any known usecase. Ok. Then can we just drop the range discoverability entirely or we go with your suggestion and do not support read back of what has been requested but instead have the reads return a range if known or "" / return -EONOTSUPP if simply not known? I can live with that though to me we are heading in the direction of a less intuitive interface to save a small number of additional files. Jonathan >
On Thu, Jan 09, 2025 at 04:01:59PM +0000, Jonathan Cameron wrote: > Ok. To me the fact it's not a single write was relevant. Seems not > in your mental model of how this works. For me a single write > that you cannot query back is fine, setting lots of parameters and > being unable to query any of them less so. I guess you disagree. Why can't you query it back? grep -r . /sysfs/dir/ All files' values have been previously set and should still be there on a read, I'd strongly hope. Your ->read routines should give the values back. > In interests of progress I'm not going to argue further. No one is > going to use this interface by hand anyway so the lost of useability > I'm seeing doesn't matter a lot. I had the suspicion that this user interface is not really going to be used by a user but by a tool. But then if you don't have a tool, you're lost. This is one of the reasons why you can control ftrace directly on the shell too - without a tool. This is very useful in certain cases where you cannot run some userspace tools. > In at least the CXL case I'm fairly sure most of them are not discoverable. > Until you see errors you have no idea what the memory topology is. Ok. > For that you'd need to have a path to read back what happened. So how is this scrubbing going to work? You get an error, you parse it for all the attributes and you go and write those attributes into the scrub interface and it starts scrubbing? But then why do you even need the interface at all? Why can't the kernel automatically collect all those attributes and start the scrubbing automatically - no need for any user interaction...? So why do you *actually* even need user interaction here and why can't the kernel be smart enough to start the scrub automatically? > Ok. Then can we just drop the range discoverability entirely or we go with > your suggestion and do not support read back of what has been > requested but instead have the reads return a range if known or "" / > return -EONOTSUPP if simply not known? Probably. > I can live with that though to me we are heading in the direction of > a less intuitive interface to save a small number of additional files. This is not the point. I already alluded to this earlier - we're talking about a user visible interface which, once it goes out, it is cast in stone forever. So those files better have a good reason to exist... And if we're not sure yet, we can upstream only those which are fine now and then continue discussing the rest. HTH.
On Thu, 9 Jan 2025 17:19:02 +0100 Borislav Petkov <bp@alien8.de> wrote: > On Thu, Jan 09, 2025 at 04:01:59PM +0000, Jonathan Cameron wrote: > > Ok. To me the fact it's not a single write was relevant. Seems not > > in your mental model of how this works. For me a single write > > that you cannot query back is fine, setting lots of parameters and > > being unable to query any of them less so. I guess you disagree. > > Why can't you query it back? > > grep -r . /sysfs/dir/ > > All files' values have been previously set and should still be there on > a read, I'd strongly hope. Your ->read routines should give the values back. Today you can. Seems we are talking cross purposes. I'm confused. I thought your proposal was for "bank" attribute to present an allowed range on read. "bank" attribute is currently written to and read back as the value of the bank on which to conduct a repair. Maybe this disconnect is down to the fact max_ and min_ attributes should have been marked as RO in the docs. They aren't controls, just presentation of limits to userspace. Was intent a separate bank_range type attribute rather than max_bank, min_bank? One of those would be absolutely fine (similar to the _available attributes in IIO - I added those years ago to meet a similar need and we've never had any issues with those). > > > In interests of progress I'm not going to argue further. No one is > > going to use this interface by hand anyway so the lost of useability > > I'm seeing doesn't matter a lot. > > I had the suspicion that this user interface is not really going to be used by > a user but by a tool. But then if you don't have a tool, you're lost. > > This is one of the reasons why you can control ftrace directly on the shell > too - without a tool. This is very useful in certain cases where you cannot > run some userspace tools. I fully agree. What I was saying was in response to me thinking you wanted it to not be possible to read back the user set values (overlapping uses of single bank attribute which wasn't what you meant). That is useful for a user wanting to do the cat /sys/... that you mention above, but not vital if they are directly reading the tracepoints for the error records and poking the sysfs interface. Given it seems I misunderstood that suggestion, ignore my reply to that as irrelevant. > > > In at least the CXL case I'm fairly sure most of them are not discoverable. > > Until you see errors you have no idea what the memory topology is. > > Ok. > > > For that you'd need to have a path to read back what happened. > > So how is this scrubbing going to work? You get an error, you parse it for all > the attributes and you go and write those attributes into the scrub interface > and it starts scrubbing? Repair not scrubbing. They are different things we should keep separate, scrub corrects the value, if it can, but doesn't change the underlying memory to new memory cells to avoid repeated errors. Replacing scrub with repair (which I think was the intent here)... You get error records that describe the error seen in hardware, write back the values into this interface and tell it to repair the memory. This is not necessarily a synchronous or immediate thing - instead typically based on trend analysis. As an example, the decision might be that bit of ram threw up 3 errors over a month including multiple system reboots (for other reasons) and that is over some threshold so we use a spare memory line to replace it. > > But then why do you even need the interface at all? > > Why can't the kernel automatically collect all those attributes and start the > scrubbing automatically - no need for any user interaction...? > > So why do you *actually* even need user interaction here and why can't the > kernel be smart enough to start the scrub automatically? Short answer, it needs to be very smart and there isn't a case of one size fits all - hence suggested approach of making it a user space problem. There are hardware autonomous solutions and ones handled by host firmware. That is how repair is done in many servers - at most software sees a slightly latency spike as the memory is repaired under the hood. Some CXL devices will do this as well. Those CXL devices may provide an additional repair interface for the less clear cut decisions that need more data processing / analysis than the device firmware is doing. Other CXL devices will take the view the OS is best placed to make all the decisions - those sometimes will give a 'maintenance needed' indication in the error records but that is still a hint the host may or may not take any notice of. Given in the systems being considered here, software is triggering the repair, we want to allow for policy in the decision. In simple cases we could push that policy into the kernel e.g. just repair the moment we see an error record. These repair resources are very limited in number, so immediately repairing may a bad idea. We want to build up a history of errors before making such a decision. That can be done in kernel. The decision to repair memory is heavily influenced by policy and time considerations against device resource constraints. Some options that are hard to do in kernel. 1. Typical asynchronous error report for a corrected error. Tells us memory had an error (perhaps from a scrubbing engine on the device running checks). No need to take action immediately. Instead build up more data over time and if lots of errors occur make decision to repair as no we are sure it is worth doing rather than a single random event. We may tune scrubbing engines to check this memory more frequently and adjust our data analysis to take that into account for setting thresholds etc. When an admin considers it a good time to take action, offline the memory and repair before bringing it back into use (sometimes by rebooting the machine). Sometimes repair can be triggered in a software transparent way, sometimes not. This also applies to uncorrectable errors though in that case you can't necessarily repair it without ever seeing a synchronous poison with all the impacts that has. 2. Soft repair across boots. We are actually storing the error records, then only applying the fix on reboot before using the memory - so maintaining a list of bad memory and saving it to a file to read back on boot. We could provide another kernel interface to get this info and reinject it after reboot instead of doing it in userspace but that is another ABI to design. 3. Complex policy across fleets. A lot of work is going on around prediction techniques that may change the local policy on each node dependent on the overall reliability patterns of a particular batch of devices and local characteristics, service guarantees etc. If it is hard repair, then once you've run out you need schedule an engineer out to replace the DIMM. All complex inputs to the decision. Similar cases like CPU offlining on repeated errors are done in userspace (e.g. RAS Daemon) for similar reasons of long term data gathering and potentially complex algorithms. > > > Ok. Then can we just drop the range discoverability entirely or we go with > > your suggestion and do not support read back of what has been > > requested but instead have the reads return a range if known or "" / > > return -EONOTSUPP if simply not known? > > Probably. Too many options in the above paragraph so just to check... Probably to which? If it's a separate attribute from the one we write the control so then we do what is already done here and don't present the interface at all if the range isn't discoverable. > > > I can live with that though to me we are heading in the direction of > > a less intuitive interface to save a small number of additional files. > > This is not the point. I already alluded to this earlier - we're talking about > a user visible interface which, once it goes out, it is cast in stone forever. > > So those files better have a good reason to exist... > > And if we're not sure yet, we can upstream only those which are fine now and > then continue discussing the rest. Ok. Best path is drop the available range support then (so no min_ max_ or anything to replace them for now). Added bonus is we don't have to rush this conversation and can make sure we come to the right solution driven by use cases. Jonathan > HTH. >
Jonathan Cameron wrote: > Ok. Best path is drop the available range support then (so no min_ max_ or > anything to replace them for now). I think less is more in this case. The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too wide for userspace to have a chance at writing a competent tool. At least I am struggling with where to even begin with those ABIs if I was asked to write a tool. Does a tool already exist for those? Some questions that read on those ABIs are: 1/ What if the platform has translation between HPA (CXL decode) and SPA (physical addresses reported in trace points that PIO and DMA see)? 2/ What if memory is interleaved across repair domains? 3/ What if the device does not use DDR terminology / topology terms for repair? I expect the flow rasdaemon would want is that the current PFA (leaky bucket Pre-Failure Analysis) decides that the number of soft-offlines it has performed exceeds some threshold and it wants to attempt to repair memory. However, what is missing today for volatile memory is that some failures can be repaired with in-band writes and some failures need heavier hammers like Post-Package-Repair to actively swap in whole new banks of memory. So don't we need something like "soft-offline-undo" on the way to PPR? So, yes, +1 to simpler for now where software effectively just needs to deal with a handful of "region repair" buttons and the semantics of those are coarse and sub-optimal. Wait for a future where a tool author says, "we have had good success getting bulk offlined pages back into service, but now we need this specific finer grained kernel interface to avoid wasting spare banks prematurely". Anything more complex than a set of /sys/devices/system/memory/ devices has a /sys/bus/edac/devices/devX/repair button, feels like a generation ahead of where the initial sophistication needs to lie. That said, I do not closely follow ras tooling to say whether someone has already identified the critical need for a fine grained repair ABI?
On Thu, 9 Jan 2025 15:51:39 -0800 Dan Williams <dan.j.williams@intel.com> wrote: > Jonathan Cameron wrote: > > Ok. Best path is drop the available range support then (so no min_ max_ or > > anything to replace them for now). > > I think less is more in this case. A few notes before I get to specific questions. Key in the discussion that follows is that the 'repair' is a separate from the 'decision to repair'. They mostly need different information all of which is in the error trace points. A lot of the questions are about the 'decision to repair' part not the repair itself. Critical point is there is on some devices (all CXL ones), there may be no direct way to discover the mapping from HPA/SPA/DPA to bank row etc other than the error record. The memory mapping functions are an internal detail not exposed on any interfaces. Odd though it may seem, those mapping functions are considered confidential enough that manufacturers don't always publish them (though I believe they are fairly easy to reverse engineer) - I know a team whose job involves designing those. Anyhow, short of the kernel or RAS Daemon carrying a look up table of all known devices (no support for new ones until they are added) we can't do a reverse map from DPA etc to bank. There are complex ways round this like storing the mappings when issuing an error record, to build up the necessary reverse map, but that would have to be preserved across boot. These error tend not to be frequent, so cross reboot /kexec etc need to be incorporated. PPR on CXL does use DPA, but memory sparing commands are meant to supersede that interface (the reason for that is perhaps bordering on consortium confidential, but lets say it doesn't work well for some cases). Memory sparing does not use DPA. I'd advise mostly ignoring PPR and looking at memory sparing in the CXL spec if you want to consider an example. For PPR DPA is used (there is an HPA option that might be available). DPA is still needed for on boot soft repair (or we could delay that until regions configured, but then we'd need to do DPA to HPA mapping as that will be different on a new config, and then write HPA for the kernel to map it back to DPA. > > The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too > wide for userspace to have a chance at writing a competent tool. At > least I am struggling with where to even begin with those ABIs if I was > asked to write a tool. Does a tool already exist for those? There is little choice on that - those are the controls for this type of repair. If we do something like a magic 'key' based on a concatenation of those values we need to define that description to replace a clean self describing interface. I'm not 100% against that but I think it would be a terrible interface design and I don't think anyone is really in favor of it. All a userspace tool does is read the error record fields of exactly those names. From that it will log data (already happening under those names in RAS daemon alongside HPA/ DPA). Then, in simplest case, a threshold is passed and we write those values to the repair interface. There is zero need in that simple case for these to be understood at all. You can think of them as a complex key but divided into well defined fields. For more complex decision algorithms, that structure info may be needed to make the decision. As a dumb example, maybe certain banks are more error prone on a batch of devices so we need a lower threshold before repairing. Simplest case is maybe 20-30 lines of code looping over result of an SQL query on the RASDaemon DB and writing the values to the files. Not the most challenging userspace tool. The complexity is in the analysis of the errors, not this part. I don't think we bothered doing this one yet in rasdaemon because we considered it obvious enough an example wasn't needed. (Mauro / Shiju, does that estimate sound reasonable?) We would need a couple of variants but those map 1:1 with the variants of error record parsing and logging RAS Daemon already has. > > Some questions that read on those ABIs are: > > 1/ What if the platform has translation between HPA (CXL decode) and SPA > (physical addresses reported in trace points that PIO and DMA see)? See later for discussion of other interfaces.. This is assuming the repair key is not HPA (it is on some systems / situations) - if it's the repair key then that is easily dealt with. HPA / SPA more or less irrelevant for repair itself, they are relevant for the decision to repair. In the 'on reboot' soft repair case they may not even exist at the time of repair as it would be expected to happen before we've brought up a region (to get the RAM into a good state at boot). For cases where the memory decoders are configured and so there is an HPA to DPA mapping: The trace reports provide both all these somewhat magic values and the HPA. Thus we can do the HPA aware stuff on that before then looking up the other bit of the appropriate error reports to get the bank row etc. > > 2/ What if memory is interleaved across repair domains? Also not relevant to a repair control and only a little relevant to the decision to repair. The domains would be handled separately but if we are have to offline a chunk of memory to do it (needed for repair on some devices) that may be a bigger chunk if fine grained interleave in use and that may affect the decision. > > 3/ What if the device does not use DDR terminology / topology terms for > repair? Then we provide the additional interfaces assuming the correspond to well known terms. If someone is using a magic key then we can get grumpy with them, but that can also be supported. Mostly I'd expect a new technology to overlap a lot of the existing interface and maybe add one or two more; which layer in the stack for HBM for instance. The main alternative is where the device takes an HPA / SPA / DPA. We have one driver that does that queued up behind this series that uses HPA. PPR uses DPA. In that case userspace first tries to see if it can repair by HPA then DPA and if not moves on to see if it it can use the fuller description. We will see devices supporting HPA / DPA (which to use depends on when you are doing the operation and what has been configured) but mostly I'd expect either HPA/DPA or fine grained on a given repair instance. HPA only works if the address decoders are always configured (so not on CXL) What is actually happening in that case is typically that a firmware is involved that can look up address decoders etc, and map the control HPA to Bank / row etc to issue the actual low level commands. This keeps the memory mapping completely secret rather than exposing it in error records. > > I expect the flow rasdaemon would want is that the current PFA (leaky > bucket Pre-Failure Analysis) decides that the number of soft-offlines it > has performed exceeds some threshold and it wants to attempt to repair > memory. Sparing may happen prior to point where we'd have done a soft offline if non disruptive (whether it is can be read from another bit of the ABI). Memory repair might be much less disruptive than soft-offline! I rather hope memory manufacturers build that, but I'm aware of at least one case where they didn't and the memory must be offline. > > However, what is missing today for volatile memory is that some failures > can be repaired with in-band writes and some failures need heavier > hammers like Post-Package-Repair to actively swap in whole new banks of > memory. So don't we need something like "soft-offline-undo" on the way > to PPR? Ultimately we may do. That discussion was in one of the earlier threads on more heavy weight case of recovery from poison (unfortunately I can't find the thread) - the ask was for example code so that the complexity could be weighed against the demand for this sort of live repair or a lesser version where repair can only be done once a region is offline (and parts poisoned). However, there are other usecases where this isn't needed which is why that isn't a precursor for this series. Initial enablement targets two situations: 1) Repair can be done in non disruptive way - no need to soft offline at all. 2) Repair can be done at boot before memory is onlined or on admin action to take the whole region offline, then repair specific chunks of memory before bringing it back online. > > So, yes, +1 to simpler for now where software effectively just needs to > deal with a handful of "region repair" buttons and the semantics of > those are coarse and sub-optimal. Wait for a future where a tool author > says, "we have had good success getting bulk offlined pages back into > service, but now we need this specific finer grained kernel interface to > avoid wasting spare banks prematurely". Depends on where you think that interface is. I can absolutely see that as a control to RAS Daemon. Option 2 above, region is offline, repair all dodgy looking fine grained buckets. Note though that a suboptimal repair may mean permanent use of very rare resources. So there needs to be a control a the finest granularity as well. Which order those get added to userspace tools doesn't matter to me. If you mean that interface in kernel it brings some non trivial requirements. The kernel would need all of: 1) Tracking interface for all error records so the reverse map from region to specific bank / row etc is available for a subset of entries. The kernel would need to know which of those are important (soft offline might help in that use case, otherwise that means decision algorithms are in kernel or we have fine grained queue for region repair in parallel with soft-offline). 2) A way to inject the reverse map information from a userspace store (to deal with reboot etc). That sounds a lot harder to deal with than relying on the usespace program that already does the tracking across boots. > > Anything more complex than a set of /sys/devices/system/memory/ > devices has a /sys/bus/edac/devices/devX/repair button, feels like a > generation ahead of where the initial sophistication needs to lie. > > That said, I do not closely follow ras tooling to say whether someone > has already identified the critical need for a fine grained repair ABI? It's not that we necessarily want to repair at fine grain, it's that the control interface to hardware is fine grained and the reverse mapping often unknown except for specific error records. I'm fully on board with simple interfaces for common cases like repair the bad memory in this region. I'm just strongly against moving the complexity of doing that into the kernel. Jonathan >
Jonathan Cameron wrote: > On Thu, 9 Jan 2025 15:51:39 -0800 > Dan Williams <dan.j.williams@intel.com> wrote: > > > Jonathan Cameron wrote: > > > Ok. Best path is drop the available range support then (so no min_ max_ or > > > anything to replace them for now). > > > > I think less is more in this case. > > A few notes before I get to specific questions. > > Key in the discussion that follows is that the 'repair' is a separate > from the 'decision to repair'. They mostly need different information > all of which is in the error trace points. A lot of the questions > are about the 'decision to repair' part not the repair itself. > [snipped the parts I agree with] > I'd advise mostly ignoring PPR and looking at memory sparing in > the CXL spec if you want to consider an example. For PPR DPA is used > (there is an HPA option that might be available). DPA is still needed > for on boot soft repair (or we could delay that until regions configured, > but then we'd need to do DPA to HPA mapping as that will be different > on a new config, and then write HPA for the kernel to map it back to DPA. This is helpful because I was indeed getting lost in what kind of "repair" was being discussed in the thread. Ok, lets focus on sparing commands. > > > > > The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too > > wide for userspace to have a chance at writing a competent tool. At > > least I am struggling with where to even begin with those ABIs if I was > > asked to write a tool. Does a tool already exist for those? > > There is little choice on that - those are the controls for this type > of repair. If we do something like a magic 'key' based on a concatenation > of those values we need to define that description to replace a clean > self describing interface. I'm not 100% against that but I think it would > be a terrible interface design and I don't think anyone is really in favor of it. > > All a userspace tool does is read the error record fields of > exactly those names. From that it will log data (already happening under > those names in RAS daemon alongside HPA/ DPA). Then, in simplest case, > a threshold is passed and we write those values to the repair interface. > > There is zero need in that simple case for these to be understood at all. This is where you lose me. The error record is a point in time snapshot of the SPA:HPA:DPA:<proprietary internal "DIMM" mapping>. The security model for memory operations is based on coordinating with the kernel's understanding of how that SPA is *currently* being used. The kernel can not just take userspace's word for it that potentially data changing, or temporary loss-of-use operations are safe to execute just because once upon a time userspace saw an error record that implicated a given SPA in the past, especially over reboot. The SPA:HPA:DPA:DIMM tuple is invalidated on reconfiguration and reboot events. It follows that the kernel has a security/integrity interest in declining to act on invalidated tuples. This is solvable within a single boot as the kernel can cache the error records and userspace can ask to "trigger sparing based on cached record X". For the reboot case when the error record cache is flushed the kernel needs a reliable way to refill that cache, not an ABI for userspace to say "take my word for it, this *should* be safe". [snipped the explanation of replaying the old trace record parameters data back through sysfs, because that is precisely the hang up I have with the proposal] > > > > > Some questions that read on those ABIs are: > > > > 1/ What if the platform has translation between HPA (CXL decode) and SPA > > (physical addresses reported in trace points that PIO and DMA see)? > > See later for discussion of other interfaces.. This is assuming the > repair key is not HPA (it is on some systems / situations) - if it's > the repair key then that is easily dealt with. > > HPA / SPA more or less irrelevant for repair itself, they are relevant > for the decision to repair. In the 'on reboot' soft repair case they may > not even exist at the time of repair as it would be expected to happen > before we've brought up a region (to get the RAM into a good state at boot). > > For cases where the memory decoders are configured and so there is an HPA > to DPA mapping: > The trace reports provide both all these somewhat magic values and > the HPA. Thus we can do the HPA aware stuff on that before then looking > up the other bit of the appropriate error reports to get the bank row etc. > > > > > 2/ What if memory is interleaved across repair domains? > > Also not relevant to a repair control and only a little relevant to the > decision to repair. The domains would be handled separately but if > we are have to offline a chunk of memory to do it (needed for repair > on some devices) that may be a bigger chunk if fine grained interleave > in use and that may affect the decision. Again, the repair control assumes that the kernel can just trust userspace to get it right. When the kernel knows the SPA implications it can add safety like "you are going to issue sparing on deviceA that will temporarily take deviceA offline. CXL subsystem tells me deviceA is interleaved with deviceB in SPA so the whole SPA range needs to be offline before this operation proceeds". That is not someting that userspace can reliably coordinate. > > 3/ What if the device does not use DDR terminology / topology terms for > > repair? > > Then we provide the additional interfaces assuming the correspond to well > known terms. If someone is using a magic key then we can get grumpy > with them, but that can also be supported. > > Mostly I'd expect a new technology to overlap a lot of the existing > interface and maybe add one or two more; which layer in the stack for > HBM for instance. The concern is the assertion that sysfs needs to care about all these parameters vs an ABI that says "repair errorX". If persistence and validity of error records is the concern lets build an ABI for that and not depend upon trust in userspace to properly coordinate memory integrity concerns. > > The main alternative is where the device takes an HPA / SPA / DPA. We have one > driver that does that queued up behind this series that uses HPA. PPR uses > DPA. In that case userspace first tries to see if it can repair by HPA then > DPA and if not moves on to see if it it can use the fuller description. > We will see devices supporting HPA / DPA (which to use depends on when you > are doing the operation and what has been configured) but mostly I'd expect > either HPA/DPA or fine grained on a given repair instance. > > HPA only works if the address decoders are always configured (so not on CXL) > What is actually happening in that case is typically that a firmware is > involved that can look up address decoders etc, and map the control HPA > to Bank / row etc to issue the actual low level commands. This keeps > the memory mapping completely secret rather than exposing it in error > records. > > > > > I expect the flow rasdaemon would want is that the current PFA (leaky > > bucket Pre-Failure Analysis) decides that the number of soft-offlines it > > has performed exceeds some threshold and it wants to attempt to repair > > memory. > > Sparing may happen prior to point where we'd have done a soft offline > if non disruptive (whether it is can be read from another bit of the > ABI). Memory repair might be much less disruptive than soft-offline! > I rather hope memory manufacturers build that, but I'm aware of at least > one case where they didn't and the memory must be offline. That's a good point, spare before offline makes sense. [..] > However, there are other usecases where this isn't needed which is why > that isn't a precursor for this series. > > Initial enablement targets two situations: > 1) Repair can be done in non disruptive way - no need to soft offline at all. Modulo needing to quiesce access over the sparing event? > 2) Repair can be done at boot before memory is onlined or on admin > action to take the whole region offline, then repair specific chunks of > memory before bringing it back online. Which is userspace racing the kernel to online memory? > > So, yes, +1 to simpler for now where software effectively just needs to > > deal with a handful of "region repair" buttons and the semantics of > > those are coarse and sub-optimal. Wait for a future where a tool author > > says, "we have had good success getting bulk offlined pages back into > > service, but now we need this specific finer grained kernel interface to > > avoid wasting spare banks prematurely". > > Depends on where you think that interface is. I can absolutely see that > as a control to RAS Daemon. Option 2 above, region is offline, repair > all dodgy looking fine grained buckets. > > Note though that a suboptimal repair may mean permanent use of very rare > resources. So there needs to be a control a the finest granularity as well. > Which order those get added to userspace tools doesn't matter to me. > > If you mean that interface in kernel it brings some non trivial requirements. > The kernel would need all of: > 1) Tracking interface for all error records so the reverse map from region > to specific bank / row etc is available for a subset of entries. The > kernel would need to know which of those are important (soft offline > might help in that use case, otherwise that means decision algorithms > are in kernel or we have fine grained queue for region repair in parallel > with soft-offline). > 2) A way to inject the reverse map information from a userspace store > (to deal with reboot etc). Not a way to inject the reverse map information, a way to inject the error records and assert that memory topology changes have not invalidated those records. > That sounds a lot harder to deal with than relying on the usespace program > that already does the tracking across boots. I am stuck behind the barrier of userspace must not assume it knows better than the kernel about the SPA impact of a DIMM sparing event. The kernel needs evidence either live records from within the same kernel boot or validated records from a previous boot. ...devices could also help us out here with a way to replay DIMM error events. That would allow for refreshing error records even if the memory topology change because the new record would generate a refreshed SPA:HPA:DPA:DIMM tuple. > > Anything more complex than a set of /sys/devices/system/memory/ > > devices has a /sys/bus/edac/devices/devX/repair button, feels like a > > generation ahead of where the initial sophistication needs to lie. > > > > That said, I do not closely follow ras tooling to say whether someone > > has already identified the critical need for a fine grained repair ABI? > > It's not that we necessarily want to repair at fine grain, it's that > the control interface to hardware is fine grained and the reverse mapping > often unknown except for specific error records. > > I'm fully on board with simple interfaces for common cases like repair > the bad memory in this region. I'm just strongly against moving the > complexity of doing that into the kernel. Yes, we are just caught up on where that "...but no simpler" line is drawn.
On Thu, Jan 09, 2025 at 06:34:48PM +0000, Jonathan Cameron wrote: > Today you can. Seems we are talking cross purposes. > > I'm confused. I thought your proposal was for "bank" attribute to present an > allowed range on read. > "bank" attribute is currently written to and read back as the value of the bank on which > to conduct a repair. Maybe this disconnect is down to the fact max_ and min_ > attributes should have been marked as RO in the docs. They aren't controls, > just presentation of limits to userspace. > > Was intent a separate bank_range type attribute rather than max_bank, min_bank? I don't know - I'm just throwing ideas out there. You could do: cat /sys/.../bank and that gives you [<low> <current_value> <high>] So you have all the needed information. Dunno if this would be abusing sysfs rules too much tho. > > > > > In at least the CXL case I'm fairly sure most of them are not discoverable. > > > Until you see errors you have no idea what the memory topology is. > > > > Ok. > > > > > For that you'd need to have a path to read back what happened. > > > > So how is this scrubbing going to work? You get an error, you parse it for all :> > the attributes and you go and write those attributes into the scrub interface > > and it starts scrubbing? > > Repair not scrubbing. They are different things we should keep separate, > scrub corrects the value, if it can, but doesn't change the underlying memory to > new memory cells to avoid repeated errors. Replacing scrub with repair > (which I think was the intent here)... Really? So how is scrubbing defined for CXL? You read memory, do ECC check on it, report any potential errors but write back the *original* wrong value?! I thought the point of scrubbing is to repair it while at it too... > You get error records that describe the error seen in hardware, write back the > values into this interface and tell it to repair the memory. This is not > necessarily a synchronous or immediate thing - instead typically based on > trend analysis. This is just silly: I'm scrubbing, I found an error, I should simply fix it while at it. Why would I need an additional command to repair it?! > As an example, the decision might be that bit of ram threw up 3 errors > over a month including multiple system reboots (for other reasons) and > that is over some threshold so we use a spare memory line to replace it. Right. > Short answer, it needs to be very smart and there isn't a case of one size > fits all - hence suggested approach of making it a user space problem. Making it a userspace problem is pretty much always a sign that the hw design failed. > Given in the systems being considered here, software is triggering the repair, > we want to allow for policy in the decision. Right, you can leave a high-level decision to userspace: repair only when idle, repair only non-correctable errors, blabla but exposing every single aspect of every single error... meh. > In simple cases we could push that policy into the kernel e.g. just repair > the moment we see an error record. > > These repair resources are very limited in number, so immediately repairing > may a bad idea. We want to build up a history of errors before making > such a decision. That can be done in kernel. Yap, we are doing this now: drivers/ras/cec.c Userspace involvement is minimal, if at all. It is mostly controlling the parameters of the leaky bucket. > The decision to repair memory is heavily influenced by policy and time considerations > against device resource constraints. > > Some options that are hard to do in kernel. > > 1. Typical asynchronous error report for a corrected error. > > Tells us memory had an error (perhaps from a scrubbing engine on the device > running checks). No need to take action immediately. Instead build up more data > over time and if lots of errors occur make decision to repair as no we are sure it > is worth doing rather than a single random event. We may tune scrubbing engines > to check this memory more frequently and adjust our data analysis to take that > into account for setting thresholds etc. See above. What happens when your daemon dies and loses all that collected data? > 2. Soft repair across boots. We are actually storing the error records, then only > applying the fix on reboot before using the memory - so maintaining a list > of bad memory and saving it to a file to read back on boot. We could provide > another kernel interface to get this info and reinject it after reboot instead > of doing it in userspace but that is another ABI to design. We did something similar recently: drivers/ras/amd/fmpm.c. It basically "replays" errors from persistent storage as that memory cannot be replaced. > 3. Complex policy across fleets. A lot of work is going on around prediction techniques > that may change the local policy on each node dependent on the overall reliability > patterns of a particular batch of devices and local characteristics, service guarantees > etc. If it is hard repair, then once you've run out you need schedule an engineer > out to replace the DIMM. All complex inputs to the decision. You probably could say here: "repair or report when this and that." or "offline page and report error" and similar high-level decisions by leaving the details to the kernel instead of looking at every possible error in userspace and returning back to the kernel to state your decision. > Similar cases like CPU offlining on repeated errors are done in userspace (e.g. > RAS Daemon) for similar reasons of long term data gathering and potentially > complex algorithms. > > > > > > Ok. Then can we just drop the range discoverability entirely or we go with > > > your suggestion and do not support read back of what has been > > > requested but instead have the reads return a range if known or "" / > > > return -EONOTSUPP if simply not known? > > > > Probably. > > Too many options in the above paragraph so just to check... Probably to which? > If it's a separate attribute from the one we write the control so then > we do what is already done here and don't present the interface at all if > the range isn't discoverable. Probably means I still don't get a warm and fuzzy feeling about this design. As I've noted above. > Ok. Best path is drop the available range support then (so no min_ max_ or > anything to replace them for now). > > Added bonus is we don't have to rush this conversation and can make sure we > come to the right solution driven by use cases. Yap, that sounds like a prudent idea. What I'm trying to say, basically, is, this interface through sysfs is a *lot* of attributes, there's no clear cut use case where we can judge how useful it is and as I alluded to above, I really think that you should leave the high-level decisions to userspace and let the kernel do its job. This'll make your interface a lot simpler. And if you really need to control every single aspect of scrubbing in userspace, then you can always come later with proper design and use case. But again, I really think you should keep as much recovery logic in the kernel and as automatic as possible. Only when you really really need user input, only then you should allow it... I hope I'm making sense here... Thx.
On Sat, 11 Jan 2025 18:12:43 +0100 Borislav Petkov <bp@alien8.de> wrote: > On Thu, Jan 09, 2025 at 06:34:48PM +0000, Jonathan Cameron wrote: > > Today you can. Seems we are talking cross purposes. > > > > I'm confused. I thought your proposal was for "bank" attribute to present an > > allowed range on read. > > "bank" attribute is currently written to and read back as the value of the bank on which > > to conduct a repair. Maybe this disconnect is down to the fact max_ and min_ > > attributes should have been marked as RO in the docs. They aren't controls, > > just presentation of limits to userspace. > > > > Was intent a separate bank_range type attribute rather than max_bank, min_bank? > > I don't know - I'm just throwing ideas out there. You could do: > > cat /sys/.../bank > > and that gives you > > [<low> <current_value> <high>] > > So you have all the needed information. Dunno if this would be abusing sysfs > rules too much tho. We could do that, trade off between count of files and parsing complexity, but given I don't expect anyone to parse the reads anyway, sure. If we do go that way I guess we'd have things like <0> 33 <123> 33 To allow for cases where the range is known vs not known? Or <?> 33 <?> maybe? We can do that if you prefer. I'm not that fussed how this is handled because, for tooling at least, I don't see why we'd ever read it. It's for human parsing only and the above is fine. > > > > > > > > In at least the CXL case I'm fairly sure most of them are not discoverable. > > > > Until you see errors you have no idea what the memory topology is. > > > > > > Ok. > > > > > > > For that you'd need to have a path to read back what happened. > > > > > > So how is this scrubbing going to work? You get an error, you parse it for all > :> > the attributes and you go and write those attributes into the scrub interface > > > and it starts scrubbing? > > > > Repair not scrubbing. They are different things we should keep separate, > > scrub corrects the value, if it can, but doesn't change the underlying memory to > > new memory cells to avoid repeated errors. Replacing scrub with repair > > (which I think was the intent here)... > > Really? > > So how is scrubbing defined for CXL? You read memory, do ECC check on it, > report any potential errors but write back the *original* wrong value?! What you describe is correction, not repair. Of course it corrects! These two terms mean different things in CXL world and in DDR specs etc. We should have made that distinction clearer as perhaps that is the root of this struggle to reach agreement. > > I thought the point of scrubbing is to repair it while at it too... No. There is a major difference between correction (which scrubbing does) and repair. Point of scrubbing is to correct errors if it can by using (typically) the single error correction of SECDED ECC codes (single correct, double detect). In many cases it reports the error and uncorrectable errors it runs into. This is common for all scrub implementations not just CXL ones. What happens with those error reports differs between systems. This is not CXL specific. I'm aware of other systems that make repair an OS managed thing - we have another driver queued up behind this set that does exactly that. In many existing servers (I know some of ours do this for example), the firmware / BMC etc keep track of those errors and make the decision to repair the memory. It does not do this on detection of one correctable error, it does it after a repeated pattern. Repair cam be a feature of the DIMMs themselves or it can be a feature of the memory controller. It is basically replacing them with spare memory from somewhere else (usually elsewhere on same DIMMs that have a bit of spare capacity for this). Bit like a hot spare in a RAID setup. In some other systems the OS gets the errors and is responsible for making the decision. Sticking to the corrected error case (uncorrected handling is going to require a lot more work given we've lost data, Dan asked about that in the other branch of the thread), the OS as a whole (kernel + userspace) gets the error records and makes the policy decision to repair based on assessment of risk vs resource availability to make a repair. Two reasons for this 1) Hardware isn't necessarily capable of repairing autonomously as other actions may be needed (memory traffic to some granularity of memory may need to be stopped to avoid timeouts). Note there are many graduations of this from A) can do it live with no visible effect, through B) offline a page, to C) offlining the whole device. 2) Policy can be a lot more sophisticated than a BMC can do. > > > You get error records that describe the error seen in hardware, write back the > > values into this interface and tell it to repair the memory. This is not > > necessarily a synchronous or immediate thing - instead typically based on > > trend analysis. > > This is just silly: I'm scrubbing, I found an error, I should simply fix it > while at it. Why would I need an additional command to repair it?! This is confusing correction with repair. They are different operations. Correction of course happens (if we can- normally single bit errors only). No OS involvement in scrub based correction. Repair is normally only for repeated errors. If the memory is fine and we get a cosmic ray or similar flipping a bit there is no long term increased likelihood of seeing another error. If we get one every X hours then it is highly likely the issue is something wrong with the memory. > > > As an example, the decision might be that bit of ram threw up 3 errors > > over a month including multiple system reboots (for other reasons) and > > that is over some threshold so we use a spare memory line to replace it. > > Right. > > > Short answer, it needs to be very smart and there isn't a case of one size > > fits all - hence suggested approach of making it a user space problem. > > Making it a userspace problem is pretty much always a sign that the hw design > failed. In some cases perhaps, but another very strong driver is that policy is involved. We can either try put a complex design in firmware and poke it with N opaque parameters from a userspace tool or via some out of band method or we can put the algorithm in userspace where it can be designed to incorporate lessons learnt over time. We will start simple and see what is appropriate as this starts to get used in large fleets. This stuff is a reasonable target for AI type algorithms etc that we aren't going to put in the kernel. Doing this at all is a reliability optimization, normally it isn't required for correct operation. > > > Given in the systems being considered here, software is triggering the repair, > > we want to allow for policy in the decision. > > Right, you can leave a high-level decision to userspace: repair only when > idle, repair only non-correctable errors, blabla but exposing every single > aspect of every single error... meh. Those aspects identify the error location. As I put in reply to Dan a device does not necessarily have a fixed mapping to a device physical address over reboots etc. There are solutions that scramble that mapping on reboot or hotplug for various reasons (security and others). For CXL at least these are all in userspace already (RAS-daemon has supported them for quite a while). The same data is coming in a CPER record for other memory errors though I'm not sure it is currently surfaced to RAS Daemon. > > > In simple cases we could push that policy into the kernel e.g. just repair > > the moment we see an error record. > > > > These repair resources are very limited in number, so immediately repairing > > may a bad idea. We want to build up a history of errors before making > > such a decision. That can be done in kernel. > > Yap, we are doing this now: > > drivers/ras/cec.c > > Userspace involvement is minimal, if at all. It is mostly controlling the > parameters of the leaky bucket. Offline has no permanent cost and no limit on number of times you can do it. Repair is definitely a limited resource and may permanently use up that resource (discoverable as a policy wants to know that too!) In some cases once you run out of repair resources you have to send an engineer to replace the memory before you can do it again. So to me, these are different situations requiring different solutions. > > > The decision to repair memory is heavily influenced by policy and time considerations > > against device resource constraints. > > > > Some options that are hard to do in kernel. > > > > 1. Typical asynchronous error report for a corrected error. > > > > Tells us memory had an error (perhaps from a scrubbing engine on the device > > running checks). No need to take action immediately. Instead build up more data > > over time and if lots of errors occur make decision to repair as no we are sure it > > is worth doing rather than a single random event. We may tune scrubbing engines > > to check this memory more frequently and adjust our data analysis to take that > > into account for setting thresholds etc. > > See above. > > What happens when your daemon dies and loses all that collected data? It is logged always - in rasdaemon, either local sqlite DB or people are carrying patches to do fleet scale logging. This works across reboot so a daemon dying is part of the design rather than just an error case. If the daemon is able to corrupt your error logs then we have bigger problems. > > > 2. Soft repair across boots. We are actually storing the error records, then only > > applying the fix on reboot before using the memory - so maintaining a list > > of bad memory and saving it to a file to read back on boot. We could provide > > another kernel interface to get this info and reinject it after reboot instead > > of doing it in userspace but that is another ABI to design. > > We did something similar recently: drivers/ras/amd/fmpm.c. It basically > "replays" errors from persistent storage as that memory cannot be replaced. Ok. I guess it is an option (I wasn't aware of that work). I was thinking that was far more complex to deal with than just doing it in userspace tooling. From a quick look that solution seems to rely on ACPI ERSR infrastructure to provide that persistence that we won't generally have but I suppose we can read it from the filesystem or other persistent stores. We'd need to be a lot more general about that as can't make system assumptions that can be made in AMD specific code. So could be done, I don't think it is a good idea in this case, but that example does suggest it is possible. > > 3. Complex policy across fleets. A lot of work is going on around prediction techniques > > that may change the local policy on each node dependent on the overall reliability > > patterns of a particular batch of devices and local characteristics, service guarantees > > etc. If it is hard repair, then once you've run out you need schedule an engineer > > out to replace the DIMM. All complex inputs to the decision. > > You probably could say here: "repair or report when this and that." or > "offline page and report error" and similar high-level decisions by leaving > the details to the kernel instead of looking at every possible error in > userspace and returning back to the kernel to state your decision. In approach we are targetting, there is no round trip situation. We let the kernel deal with any synchronous error just fine and run it's existing logic to offline problematic memory. That needs to be timely and to carry on operating exactly as it always has. In parallel with that we gather the error reports that we will already be gathering and run analysis on those. From that we decide if a memory is likely to fail again and perform a sparing operation if appropriate. Effectively this is 'free'. All the information is already there in userspace and already understood by tools like rasdaemon, we are not expanding that reporting interface at all. Or we don't decide to and the memory continues to be used as before. Sure there may be a higher chance of error but there isn't one right now. (or there is and we have offlined the memory anyway via normal mechanisms so no problem). There is an interesting future exploration to be done on recovery from poison (uncorrectable error) situations, but that is not in scope of this initial code and that would likely be an asynchronous best effort process as well. > > > Similar cases like CPU offlining on repeated errors are done in userspace (e.g. > > RAS Daemon) for similar reasons of long term data gathering and potentially > > complex algorithms. > > > > > > > > > Ok. Then can we just drop the range discoverability entirely or we go with > > > > your suggestion and do not support read back of what has been > > > > requested but instead have the reads return a range if known or "" / > > > > return -EONOTSUPP if simply not known? > > > > > > Probably. > > > > Too many options in the above paragraph so just to check... Probably to which? > > If it's a separate attribute from the one we write the control so then > > we do what is already done here and don't present the interface at all if > > the range isn't discoverable. > > Probably means I still don't get a warm and fuzzy feeling about this design. > As I've noted above. Ok. I was confused as the paragraph asked and (A) or (B) question. > > > Ok. Best path is drop the available range support then (so no min_ max_ or > > anything to replace them for now). > > > > Added bonus is we don't have to rush this conversation and can make sure we > > come to the right solution driven by use cases. > > Yap, that sounds like a prudent idea. > > What I'm trying to say, basically, is, this interface through sysfs is > a *lot* of attributes, there's no clear cut use case where we can judge how > useful it is and as I alluded to above, I really think that you should leave > the high-level decisions to userspace and let the kernel do its job. > > This'll make your interface a lot simpler. Ok. It seems you correlate number of files with complexity. I correlated difficulty of understanding those files with complexity. Everyone one of the files is clearly defined and aligned with long history of how to describe DRAM (see how long CPER records have used these fields for example - they go back to the beginning). 1) It is only 8 files (if we drop the range info). I don't consider that a complex enough interface to worry about complexity. 2) Moving the control into the kernel will require a control interface as there will be multiple tuneables for those decisions. Like most such algorithm parameter questions we may spend years arguing about what those look like. Sure that problem also occurs in userspace but we aren't obliging people to use what we put in RASDaemon. > > And if you really need to control every single aspect of scrubbing in > userspace, then you can always come later with proper design and use case. I'm all in favor of building an interface up by providing minimum first and then adding to it, but here what is proposed is the minimum for basic functionality and the alternative of doing the whole thing in kernel both puts complexity in the wrong place and restricts us in what is possible. See below for a proposal to maybe make some progress. > > But again, I really think you should keep as much recovery logic in the kernel > and as automatic as possible. Only when you really really need user input, > only then you should allow it... This is missing the key considerations of resource availability being limited and not addressing the main usecase here. This is not and has never been about the logic to recover form an error. It is about reducing the possibility of repeated errors. > > I hope I'm making sense here... To some degree but I think there is a major mismatch in what we think this is for. What I've asked Shiju to look at is splitting the repair infrastructure into two cases so that maybe we can make partial progress: 1) Systems that support repair by Physical Address - Covers Post Package Repair for CXL 2) Systems that support repair by description of the underlying hardware - Covers Memory Sparing interfaces for CXL. We need both longer term anyway, but maybe 1 is less controversial simply on basis it has fewer control parameters This still fundamentally puts the policy in userspace where I believe it belongs. Jonathan > > Thx. >
On Fri, 10 Jan 2025 14:49:03 -0800 Dan Williams <dan.j.williams@intel.com> wrote: > Jonathan Cameron wrote: > > On Thu, 9 Jan 2025 15:51:39 -0800 > > Dan Williams <dan.j.williams@intel.com> wrote: > > > > > Jonathan Cameron wrote: > > > > Ok. Best path is drop the available range support then (so no min_ max_ or > > > > anything to replace them for now). > > > > > > I think less is more in this case. > > > > A few notes before I get to specific questions. > > > > Key in the discussion that follows is that the 'repair' is a separate > > from the 'decision to repair'. They mostly need different information > > all of which is in the error trace points. A lot of the questions > > are about the 'decision to repair' part not the repair itself. > > > [snipped the parts I agree with] > > I'd advise mostly ignoring PPR and looking at memory sparing in > > the CXL spec if you want to consider an example. For PPR DPA is used > > (there is an HPA option that might be available). DPA is still needed > > for on boot soft repair (or we could delay that until regions configured, > > but then we'd need to do DPA to HPA mapping as that will be different > > on a new config, and then write HPA for the kernel to map it back to DPA. > > This is helpful because I was indeed getting lost in what kind of > "repair" was being discussed in the thread. Ok, lets focus on sparing > commands. > > > > > > > > > The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too > > > wide for userspace to have a chance at writing a competent tool. At > > > least I am struggling with where to even begin with those ABIs if I was > > > asked to write a tool. Does a tool already exist for those? > > > > There is little choice on that - those are the controls for this type > > of repair. If we do something like a magic 'key' based on a concatenation > > of those values we need to define that description to replace a clean > > self describing interface. I'm not 100% against that but I think it would > > be a terrible interface design and I don't think anyone is really in favor of it. > > > > All a userspace tool does is read the error record fields of > > exactly those names. From that it will log data (already happening under > > those names in RAS daemon alongside HPA/ DPA). Then, in simplest case, > > a threshold is passed and we write those values to the repair interface. > > > > There is zero need in that simple case for these to be understood at all. > > This is where you lose me. The error record is a point in time snapshot > of the SPA:HPA:DPA:<proprietary internal "DIMM" mapping>. The security > model for memory operations is based on coordinating with the kernel's > understanding of how that SPA is *currently* being used. Whilst it is being used I agree. Key is to only do disruptive / data changing actions when it is not being used. > > The kernel can not just take userspace's word for it that potentially > data changing, or temporary loss-of-use operations are safe to execute > just because once upon a time userspace saw an error record that > implicated a given SPA in the past, especially over reboot. There are two cases (discoverable from hardware) 1) Non disruptive. No security concern as the device guarantees to not interrupt traffic and the memory contents is copied to the new location. Basically software never knows it happened. 2) Disruptive. We only allow this if the memory is offline. In the CXL case the CXL specific code must check no memory on the device is online so we aren't disrupting anything. The other implementation we have code for (will post after this lands) has finer granularity constraints and only the page needs to be offline. As it is offline the content is not preserved anyway. We may need to add extra constraints along with future support for temporal persistence / sharing but we can do that as part of adding that support in general. (Personally I think in those cases memory repair is a job for the out of band management anyway). In neither case am I seeing a security concern. Am I missing something? > > The SPA:HPA:DPA:DIMM tuple is invalidated on reconfiguration and reboot > events. It follows that the kernel has a security/integrity interest in > declining to act on invalidated tuples. This is solvable within a single > boot as the kernel can cache the error records and userspace can ask to > "trigger sparing based on cached record X". The above rules remove this complexity. Either it is always safe by device design, or the memory is offline and we'll zero fill it anyway so no security concern. > > For the reboot case when the error record cache is flushed the kernel > needs a reliable way to refill that cache, not an ABI for userspace to > say "take my word for it, this *should* be safe". It is safe because of 1 and 2 above we are not editing data in use except in a fashion that the device guarantees is safe. If you don't trust the device on this you have bigger problems. > > [snipped the explanation of replaying the old trace record parameters > data back through sysfs, because that is precisely the hang up I have > with the proposal] > > > > > > > > > Some questions that read on those ABIs are: > > > > > > 1/ What if the platform has translation between HPA (CXL decode) and SPA > > > (physical addresses reported in trace points that PIO and DMA see)? > > > > See later for discussion of other interfaces.. This is assuming the > > repair key is not HPA (it is on some systems / situations) - if it's > > the repair key then that is easily dealt with. > > > > HPA / SPA more or less irrelevant for repair itself, they are relevant > > for the decision to repair. In the 'on reboot' soft repair case they may > > not even exist at the time of repair as it would be expected to happen > > before we've brought up a region (to get the RAM into a good state at boot). > > > > For cases where the memory decoders are configured and so there is an HPA > > to DPA mapping: > > The trace reports provide both all these somewhat magic values and > > the HPA. Thus we can do the HPA aware stuff on that before then looking > > up the other bit of the appropriate error reports to get the bank row etc. > > > > > > > > 2/ What if memory is interleaved across repair domains? > > > > Also not relevant to a repair control and only a little relevant to the > > decision to repair. The domains would be handled separately but if > > we are have to offline a chunk of memory to do it (needed for repair > > on some devices) that may be a bigger chunk if fine grained interleave > > in use and that may affect the decision. > > Again, the repair control assumes that the kernel can just trust > userspace to get it right. When the kernel knows the SPA implications it > can add safety like "you are going to issue sparing on deviceA that will > temporarily take deviceA offline. CXL subsystem tells me deviceA is > interleaved with deviceB in SPA so the whole SPA range needs to be > offline before this operation proceeds". That is not someting that > userspace can reliably coordinate. Absolutely he kernel has to enforce this. Same way we protect against poison injection in some cases. Right now the enforcement is slightly wrong (Shiju is looking at it again) as we were enforcing at wrong granularity (specific dpa, not device). Identifying that hole is a good outcome of this discussion making us take another look. Enforcing this is one of the key jobs of the CXL specific driver. We considered doing it in the core, but the granularity differences between our initial few examples meant we decided on specific driver implementations of the checks for now. > > > > 3/ What if the device does not use DDR terminology / topology terms for > > > repair? > > > > Then we provide the additional interfaces assuming the correspond to well > > known terms. If someone is using a magic key then we can get grumpy > > with them, but that can also be supported. > > > > Mostly I'd expect a new technology to overlap a lot of the existing > > interface and maybe add one or two more; which layer in the stack for > > HBM for instance. > > The concern is the assertion that sysfs needs to care about all these > parameters vs an ABI that says "repair errorX". If persistence and > validity of error records is the concern lets build an ABI for that and > not depend upon trust in userspace to properly coordinate memory > integrity concerns. It doesn't have to. It just has to ensure that the memory device is in the correct state. So check, not coordinate. At a larger scale, coordination is already doable (subject to races that we must avoid by holding locks), tear down the regions so there are no mappings on the device you want to repair. Don't bring them up again until after you are done. The main use case is probably do it before you bring the mappings up, but same result. > > > > > The main alternative is where the device takes an HPA / SPA / DPA. We have one > > driver that does that queued up behind this series that uses HPA. PPR uses > > DPA. In that case userspace first tries to see if it can repair by HPA then > > DPA and if not moves on to see if it it can use the fuller description. > > We will see devices supporting HPA / DPA (which to use depends on when you > > are doing the operation and what has been configured) but mostly I'd expect > > either HPA/DPA or fine grained on a given repair instance. > > > > HPA only works if the address decoders are always configured (so not on CXL) > > What is actually happening in that case is typically that a firmware is > > involved that can look up address decoders etc, and map the control HPA > > to Bank / row etc to issue the actual low level commands. This keeps > > the memory mapping completely secret rather than exposing it in error > > records. > > > > > > > > I expect the flow rasdaemon would want is that the current PFA (leaky > > > bucket Pre-Failure Analysis) decides that the number of soft-offlines it > > > has performed exceeds some threshold and it wants to attempt to repair > > > memory. > > > > Sparing may happen prior to point where we'd have done a soft offline > > if non disruptive (whether it is can be read from another bit of the > > ABI). Memory repair might be much less disruptive than soft-offline! > > I rather hope memory manufacturers build that, but I'm aware of at least > > one case where they didn't and the memory must be offline. > > That's a good point, spare before offline makes sense. If transparent an resources not constrained. Very much not if we have to tear down the memory first. > > [..] > > However, there are other usecases where this isn't needed which is why > > that isn't a precursor for this series. > > > > Initial enablement targets two situations: > > 1) Repair can be done in non disruptive way - no need to soft offline at all. > > Modulo needing to quiesce access over the sparing event? Absolutely. This is only doable in devices that don't need to quiesce. > > > 2) Repair can be done at boot before memory is onlined or on admin > > action to take the whole region offline, then repair specific chunks of > > memory before bringing it back online. > > Which is userspace racing the kernel to online memory? If you are doing this scheme you don't automatically online memory. So both are in userspace control and can be easily sequenced. If you aren't auto onlining then buy devices with hard PPR and do it by offlining manually, repairing and rebooting. Or buy devices that don't need to quiecse and cross your fingers the dodgy ram doesn't throw an error before you get that far. Little choice if you decide to online right at the start as normal memory. > > > > So, yes, +1 to simpler for now where software effectively just needs to > > > deal with a handful of "region repair" buttons and the semantics of > > > those are coarse and sub-optimal. Wait for a future where a tool author > > > says, "we have had good success getting bulk offlined pages back into > > > service, but now we need this specific finer grained kernel interface to > > > avoid wasting spare banks prematurely". > > > > Depends on where you think that interface is. I can absolutely see that > > as a control to RAS Daemon. Option 2 above, region is offline, repair > > all dodgy looking fine grained buckets. > > > > Note though that a suboptimal repair may mean permanent use of very rare > > resources. So there needs to be a control a the finest granularity as well. > > Which order those get added to userspace tools doesn't matter to me. > > > > If you mean that interface in kernel it brings some non trivial requirements. > > The kernel would need all of: > > 1) Tracking interface for all error records so the reverse map from region > > to specific bank / row etc is available for a subset of entries. The > > kernel would need to know which of those are important (soft offline > > might help in that use case, otherwise that means decision algorithms > > are in kernel or we have fine grained queue for region repair in parallel > > with soft-offline). > > 2) A way to inject the reverse map information from a userspace store > > (to deal with reboot etc). > > Not a way to inject the reverse map information, a way to inject the > error records and assert that memory topology changes have not > invalidated those records. There is no way to tell that the topology hasn't changed. For the reasons above, I don't think we care. Instead of trying to stop userspace reparing the wrong memory, make sure it is safe for it to do that. (The kernel is rarely in the business of preventing the slightly stupid) > > > That sounds a lot harder to deal with than relying on the usespace program > > that already does the tracking across boots. > > I am stuck behind the barrier of userspace must not assume it knows > better than the kernel about the SPA impact of a DIMM sparing > event. The kernel needs evidence either live records from within the > same kernel boot or validated records from a previous boot. I think this is the wrong approach. The operation must be 'safe'. With that in place we absolutely can let userspace assume it knows better than the kernel. > > ...devices could also help us out here with a way to replay DIMM error > events. That would allow for refreshing error records even if the > memory topology change because the new record would generate a refreshed > SPA:HPA:DPA:DIMM tuple. Maybe, but I'm not seeing necessity. > > > > Anything more complex than a set of /sys/devices/system/memory/ > > > devices has a /sys/bus/edac/devices/devX/repair button, feels like a > > > generation ahead of where the initial sophistication needs to lie. > > > > > > That said, I do not closely follow ras tooling to say whether someone > > > has already identified the critical need for a fine grained repair ABI? > > > > It's not that we necessarily want to repair at fine grain, it's that > > the control interface to hardware is fine grained and the reverse mapping > > often unknown except for specific error records. > > > > I'm fully on board with simple interfaces for common cases like repair > > the bad memory in this region. I'm just strongly against moving the > > complexity of doing that into the kernel. > > Yes, we are just caught up on where that "...but no simpler" line is > drawn. > Sure. For now, I've proposed we split the two cases. 1) HPA / DPA repair (PPR) 2) Memory topology based repair (Sparing) If we can make progress on (1) perhaps we can come to a conclusion on what is required. Note that so far I see no reason why either should do any checking against errors observed by the kernel given the security guarantees above. Userspace can repair the wrong bit of memory. That's pointless and burning limited resources, but not a security problem. Jonathan
Em Mon, 6 Jan 2025 12:10:00 +0000 <shiju.jose@huawei.com> escreveu: > From: Shiju Jose <shiju.jose@huawei.com> > > Add a generic EDAC memory repair control driver to manage memory repairs > in the system, such as CXL Post Package Repair (PPR) and CXL memory sparing > features. > > For example, a CXL device with DRAM components that support PPR features > may implement PPR maintenance operations. DRAM components may support two > types of PPR, hard PPR, for a permanent row repair, and soft PPR, for a > temporary row repair. Soft PPR is much faster than hard PPR, but the repair > is lost with a power cycle. > Similarly a CXL memory device may support soft and hard memory sparing at > cacheline, row, bank and rank granularities. Memory sparing is defined as > a repair function that replaces a portion of memory with a portion of > functional memory at that same granularity. > When a CXL device detects an error in a memory, it may report the host of > the need for a repair maintenance operation by using an event record where > the "maintenance needed" flag is set. The event records contains the device > physical address(DPA) and other attributes of the memory to repair (such as > channel, sub-channel, bank group, bank, rank, row, column etc). The kernel > will report the corresponding CXL general media or DRAM trace event to > userspace, and userspace tools (e.g. rasdaemon) will initiate a repair > operation in response to the device request via the sysfs repair control. > > Device with memory repair features registers with EDAC device driver, > which retrieves memory repair descriptor from EDAC memory repair driver > and exposes the sysfs repair control attributes to userspace in > /sys/bus/edac/devices/<dev-name>/mem_repairX/. > > The common memory repair control interface abstracts the control of > arbitrary memory repair functionality into a standardized set of functions. > The sysfs memory repair attribute nodes are only available if the client > driver has implemented the corresponding attribute callback function and > provided operations to the EDAC device driver during registration. > > Signed-off-by: Shiju Jose <shiju.jose@huawei.com> > --- > .../ABI/testing/sysfs-edac-memory-repair | 244 +++++++++ > Documentation/edac/features.rst | 3 + > Documentation/edac/index.rst | 1 + > Documentation/edac/memory_repair.rst | 101 ++++ > drivers/edac/Makefile | 2 +- > drivers/edac/edac_device.c | 33 ++ > drivers/edac/mem_repair.c | 492 ++++++++++++++++++ > include/linux/edac.h | 139 +++++ > 8 files changed, 1014 insertions(+), 1 deletion(-) > create mode 100644 Documentation/ABI/testing/sysfs-edac-memory-repair > create mode 100644 Documentation/edac/memory_repair.rst > create mode 100755 drivers/edac/mem_repair.c > > diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair b/Documentation/ABI/testing/sysfs-edac-memory-repair > new file mode 100644 > index 000000000000..e9268f3780ed > --- /dev/null > +++ b/Documentation/ABI/testing/sysfs-edac-memory-repair > @@ -0,0 +1,244 @@ > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX > +Date: Jan 2025 > +KernelVersion: 6.14 > +Contact: linux-edac@vger.kernel.org > +Description: > + The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory > + pertains to the memory media repair features control, such as > + PPR (Post Package Repair), memory sparing etc, where<dev-name> > + directory corresponds to a device registered with the EDAC > + device driver for the memory repair features. > + > + Post Package Repair is a maintenance operation requests the memory > + device to perform a repair operation on its media, in detail is a > + memory self-healing feature that fixes a failing memory location by > + replacing it with a spare row in a DRAM device. For example, a > + CXL memory device with DRAM components that support PPR features may > + implement PPR maintenance operations. DRAM components may support > + two types of PPR functions: hard PPR, for a permanent row repair, and > + soft PPR, for a temporary row repair. soft PPR is much faster than > + hard PPR, but the repair is lost with a power cycle. > + > + Memory sparing is a repair function that replaces a portion > + of memory with a portion of functional memory at that same > + sparing granularity. Memory sparing has cacheline/row/bank/rank > + sparing granularities. For example, in memory-sparing mode, > + one memory rank serves as a spare for other ranks on the same > + channel in case they fail. The spare rank is held in reserve and > + not used as active memory until a failure is indicated, with > + reserved capacity subtracted from the total available memory > + in the system.The DIMM installation order for memory sparing > + varies based on the number of processors and memory modules > + installed in the server. After an error threshold is surpassed > + in a system protected by memory sparing, the content of a failing > + rank of DIMMs is copied to the spare rank. The failing rank is > + then taken offline and the spare rank placed online for use as > + active memory in place of the failed rank. > + > + The sysfs attributes nodes for a repair feature are only > + present if the parent driver has implemented the corresponding > + attr callback function and provided the necessary operations > + to the EDAC device driver during registration. > + > + In some states of system configuration (e.g. before address > + decoders have been configured), memory devices (e.g. CXL) > + may not have an active mapping in the main host address > + physical address map. As such, the memory to repair must be > + identified by a device specific physical addressing scheme > + using a device physical address(DPA). The DPA and other control > + attributes to use will be presented in related error records. > + > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function > +Date: Jan 2025 > +KernelVersion: 6.14 > +Contact: linux-edac@vger.kernel.org > +Description: > + (RO) Memory repair function type. For eg. post package repair, > + memory sparing etc. > + EDAC_SOFT_PPR - Soft post package repair > + EDAC_HARD_PPR - Hard post package repair > + EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing > + EDAC_ROW_MEM_SPARING - Row memory sparing > + EDAC_BANK_MEM_SPARING - Bank memory sparing > + EDAC_RANK_MEM_SPARING - Rank memory sparing > + All other values are reserved. Too big strings. Why are them in upper cases? IMO: soft-ppr, hard-ppr, ... would be enough. Also, Is it mandatory that all types are supported? If not, you need a way to report to userspace what of them are supported. One option would be that reading /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function would return something like: soft-ppr [hard-ppr] row-mem-sparing Also, as this will be parsed in ReST format, you need to change the description to use bullets, otherwise the html/pdf version of the document will place everything on a single line. E.g. something like: Description: (RO) Memory repair function type. For eg. post package repair, memory sparing etc. Can be: - EDAC_SOFT_PPR - Soft post package repair - EDAC_HARD_PPR - Hard post package repair - EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing - EDAC_ROW_MEM_SPARING - Row memory sparing - EDAC_BANK_MEM_SPARING - Bank memory sparing - EDAC_RANK_MEM_SPARING - Rank memory sparing - All other values are reserved. Same applies to other sysfs nodes. See for instance: Documentation/ABI/stable/sysfs-class-backlight And see how it is formatted after Sphinx processing at the Kernel Admin guide: https://www.kernel.org/doc/html/latest/admin-guide/abi-stable.html#symbols-under-sys-class Please fix it on all places you have a list of values. > + > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode > +Date: Jan 2025 > +KernelVersion: 6.14 > +Contact: linux-edac@vger.kernel.org > +Description: > + (RW) Read/Write the current persist repair mode set for a > + repair function. Persist repair modes supported in the > + device, based on the memory repair function is temporary > + or permanent and is lost with a power cycle. > + EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair). > + EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair). > + All other values are reserved. Same here: edac/ is already in the path. No need to place EDAC_ at the name. > + > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/dpa_support > +Date: Jan 2025 > +KernelVersion: 6.14 > +Contact: linux-edac@vger.kernel.org > +Description: > + (RO) True if memory device required device physical > + address (DPA) of memory to repair. > + False if memory device required host specific physical > + address (HPA) of memory to repair. Please remove the extra spaces before "address", as otherwise conversion to ReST may do the wrong thing or may produce doc warnings. > + In some states of system configuration (e.g. before address > + decoders have been configured), memory devices (e.g. CXL) > + may not have an active mapping in the main host address > + physical address map. As such, the memory to repair must be > + identified by a device specific physical addressing scheme > + using a DPA. The device physical address(DPA) to use will be > + presented in related error records. > + > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_safe_when_in_use > +Date: Jan 2025 > +KernelVersion: 6.14 > +Contact: linux-edac@vger.kernel.org > +Description: > + (RO) True if memory media is accessible and data is retained > + during the memory repair operation. > + The data may not be retained and memory requests may not be > + correctly processed during a repair operation. In such case > + the repair operation should not executed at runtime. Please add an extra line before "The data" to ensure that the output at the admin-guide won't merge the two paragraphs. Same on other places along this patch series: paragraphs need a blank line at the description. > + > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/hpa > +Date: Jan 2025 > +KernelVersion: 6.14 > +Contact: linux-edac@vger.kernel.org > +Description: > + (RW) Host Physical Address (HPA) of the memory to repair. > + See attribute 'dpa_support' for more details. > + The HPA to use will be provided in related error records. > + > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/dpa > +Date: Jan 2025 > +KernelVersion: 6.14 > +Contact: linux-edac@vger.kernel.org > +Description: > + (RW) Device Physical Address (DPA) of the memory to repair. > + See attribute 'dpa_support' for more details. > + The specific DPA to use will be provided in related error > + records. > + > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/nibble_mask > +Date: Jan 2025 > +KernelVersion: 6.14 > +Contact: linux-edac@vger.kernel.org > +Description: > + (RW) Read/Write Nibble mask of the memory to repair. > + Nibble mask identifies one or more nibbles in error on the > + memory bus that produced the error event. Nibble Mask bit 0 > + shall be set if nibble 0 on the memory bus produced the > + event, etc. For example, CXL PPR and sparing, a nibble mask > + bit set to 1 indicates the request to perform repair > + operation in the specific device. All nibble mask bits set > + to 1 indicates the request to perform the operation in all > + devices. For CXL memory to repiar, the specific value of > + nibble mask to use will be provided in related error records. > + For more details, See nibble mask field in CXL spec ver 3.1, > + section 8.2.9.7.1.2 Table 8-103 soft PPR and section > + 8.2.9.7.1.3 Table 8-104 hard PPR, section 8.2.9.7.1.4 > + Table 8-105 memory sparing. > + > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/bank_group > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/bank > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/rank > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/row > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/column > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/channel > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/sub_channel > +Date: Jan 2025 > +KernelVersion: 6.14 > +Contact: linux-edac@vger.kernel.org > +Description: > + (RW) The control attributes associated with memory address > + that is to be repaired. The specific value of attributes to > + use depends on the portion of memory to repair and may be > + reported to host in related error records and may be > + available to userspace in trace events, such as in CXL > + memory devices. > + > + channel - The channel of the memory to repair. Channel is > + defined as an interface that can be independently accessed > + for a transaction. > + rank - The rank of the memory to repair. Rank is defined as a > + set of memory devices on a channel that together execute a > + transaction. > + bank_group - The bank group of the memory to repair. > + bank - The bank number of the memory to repair. > + row - The row number of the memory to repair. > + column - The column number of the memory to repair. > + sub_channel - The sub-channel of the memory to repair. Same problem here with regards to bad ReST input. I would do: channel The channel of the memory to repair. Channel is defined as an interface that can be independently accessed for a transaction. rank The rank of the memory to repair. Rank is defined as a set of memory devices on a channel that together execute a transaction. as this would provide a better output at admin-guide while still being nicer to read as text. > + > + The requirement to set these attributes varies based on the > + repair function. The attributes in sysfs are not present > + unless required for a repair function. > + For example, CXL spec ver 3.1, Section 8.2.9.7.1.2 Table 8-103 > + soft PPR and Section 8.2.9.7.1.3 Table 8-104 hard PPR operations, > + these attributes are not required to set. > + For example, CXL spec ver 3.1, Section 8.2.9.7.1.4 Table 8-105 > + memory sparing, these attributes are required to set based on > + memory sparing granularity as follows. > + Channel: Channel associated with the DPA that is to be spared > + and applies to all subclasses of sparing (cacheline, bank, > + row and rank sparing). > + Rank: Rank associated with the DPA that is to be spared and > + applies to all subclasses of sparing. > + Bank & Bank Group: Bank & bank group are associated with > + the DPA that is to be spared and applies to cacheline sparing, > + row sparing and bank sparing subclasses. > + Row: Row associated with the DPA that is to be spared and > + applies to cacheline sparing and row sparing subclasses. > + Column: column associated with the DPA that is to be spared > + and applies to cacheline sparing only. > + Sub-channel: sub-channel associated with the DPA that is to > + be spared and applies to cacheline sparing only. Same here: this will all be on a single paragraph which would be really weird. > + > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_nibble_mask > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank_group > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_rank > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_row > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_column > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_channel > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_sub_channel > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_nibble_mask > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank_group > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_rank > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_row > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_column > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_channel > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_sub_channel > +Date: Jan 2025 > +KernelVersion: 6.14 > +Contact: linux-edac@vger.kernel.org > +Description: > + (RW) The supported range of control attributes (optional) > + associated with a memory address that is to be repaired. > + The memory device may give the supported range of > + attributes to use and it will depend on the memory device > + and the portion of memory to repair. > + The userspace may receive the specific value of attributes > + to use for a repair operation from the memory device via > + related error records and trace events, such as in CXL > + memory devices. > + > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair > +Date: Jan 2025 > +KernelVersion: 6.14 > +Contact: linux-edac@vger.kernel.org > +Description: > + (WO) Issue the memory repair operation for the specified > + memory repair attributes. The operation may fail if resources > + are insufficient based on the requirements of the memory > + device and repair function. > + EDAC_DO_MEM_REPAIR - issue repair operation. > + All other values are reserved. > diff --git a/Documentation/edac/features.rst b/Documentation/edac/features.rst > index ba3ab993ee4f..bfd5533b81b7 100644 > --- a/Documentation/edac/features.rst > +++ b/Documentation/edac/features.rst > @@ -97,3 +97,6 @@ RAS features > ------------ > 1. Memory Scrub > Memory scrub features are documented in `Documentation/edac/scrub.rst`. > + > +2. Memory Repair > +Memory repair features are documented in `Documentation/edac/memory_repair.rst`. > diff --git a/Documentation/edac/index.rst b/Documentation/edac/index.rst > index dfb0c9fb9ab1..d6778f4562dd 100644 > --- a/Documentation/edac/index.rst > +++ b/Documentation/edac/index.rst > @@ -8,4 +8,5 @@ EDAC Subsystem > :maxdepth: 1 > > features > + memory_repair > scrub > diff --git a/Documentation/edac/memory_repair.rst b/Documentation/edac/memory_repair.rst > new file mode 100644 > index 000000000000..2787a8a2d6ba > --- /dev/null > +++ b/Documentation/edac/memory_repair.rst > @@ -0,0 +1,101 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +========================== > +EDAC Memory Repair Control > +========================== > + > +Copyright (c) 2024 HiSilicon Limited. > + > +:Author: Shiju Jose <shiju.jose@huawei.com> > +:License: The GNU Free Documentation License, Version 1.2 > + (dual licensed under the GPL v2) > +:Original Reviewers: > + > +- Written for: 6.14 See my comments with regards to license on the previous patches. > + > +Introduction > +------------ > +Memory devices may support repair operations to address issues in their > +memory media. Post Package Repair (PPR) and memory sparing are examples > +of such features. > + > +Post Package Repair(PPR) > +~~~~~~~~~~~~~~~~~~~~~~~~ > +Post Package Repair is a maintenance operation requests the memory device > +to perform repair operation on its media, in detail is a memory self-healing > +feature that fixes a failing memory location by replacing it with a spare > +row in a DRAM device. For example, a CXL memory device with DRAM components > +that support PPR features may implement PPR maintenance operations. DRAM > +components may support types of PPR functions, hard PPR, for a permanent row > +repair, and soft PPR, for a temporary row repair. Soft PPR is much faster > +than hard PPR, but the repair is lost with a power cycle. The data may not > +be retained and memory requests may not be correctly processed during a > +repair operation. In such case, the repair operation should not executed > +at runtime. > +For example, CXL memory devices, soft PPR and hard PPR repair operations > +may be supported. See CXL spec rev 3.1 sections 8.2.9.7.1.1 PPR Maintenance > +Operations, 8.2.9.7.1.2 sPPR Maintenance Operation and 8.2.9.7.1.3 hPPR > +Maintenance Operation for more details. Paragraphs require blank lines in ReST. Also, please place a link to the specs. I strongly suggest looking at the output of all docs with make htmldocs and make pdfdocs to be sure that the paragraphs and the final document will be properly handled. You may use: SPHINXDIRS="<book name(s)>" to speed-up documentation builds. Please see Sphinx documentation for more details about what it is expected there: https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html > + > +Memory Sparing > +~~~~~~~~~~~~~~ > +Memory sparing is a repair function that replaces a portion of memory with > +a portion of functional memory at that same sparing granularity. Memory > +sparing has cacheline/row/bank/rank sparing granularities. For example, in > +memory-sparing mode, one memory rank serves as a spare for other ranks on > +the same channel in case they fail. The spare rank is held in reserve and > +not used as active memory until a failure is indicated, with reserved > +capacity subtracted from the total available memory in the system. The DIMM > +installation order for memory sparing varies based on the number of processors > +and memory modules installed in the server. After an error threshold is > +surpassed in a system protected by memory sparing, the content of a failing > +rank of DIMMs is copied to the spare rank. The failing rank is then taken > +offline and the spare rank placed online for use as active memory in place > +of the failed rank. > + > +For example, CXL memory devices may support various subclasses for sparing > +operation vary in terms of the scope of the sparing being performed. > +Cacheline sparing subclass refers to a sparing action that can replace a > +full cacheline. Row sparing is provided as an alternative to PPR sparing > +functions and its scope is that of a single DDR row. Bank sparing allows > +an entire bank to be replaced. Rank sparing is defined as an operation > +in which an entire DDR rank is replaced. See CXL spec 3.1 section > +8.2.9.7.1.4 Memory Sparing Maintenance Operations for more details. > + > +Use cases of generic memory repair features control > +--------------------------------------------------- > + > +1. The soft PPR , hard PPR and memory-sparing features share similar > +control attributes. Therefore, there is a need for a standardized, generic > +sysfs repair control that is exposed to userspace and used by > +administrators, scripts and tools. > + > +2. When a CXL device detects an error in a memory component, it may inform > +the host of the need for a repair maintenance operation by using an event > +record where the "maintenance needed" flag is set. The event record > +specifies the device physical address(DPA) and attributes of the memory that > +requires repair. The kernel reports the corresponding CXL general media or > +DRAM trace event to userspace, and userspace tools (e.g. rasdaemon) initiate > +a repair maintenance operation in response to the device request using the > +sysfs repair control. > + > +3. Userspace tools, such as rasdaemon, may request a PPR/sparing on a memory > +region when an uncorrected memory error or an excess of corrected memory > +errors is reported on that memory. > + > +4. Multiple PPR/sparing instances may be present per memory device. > + > +The File System > +--------------- > + > +The control attributes of a registered memory repair instance could be > +accessed in the > + > +/sys/bus/edac/devices/<dev-name>/mem_repairX/ > + > +sysfs > +----- > + > +Sysfs files are documented in > + > +`Documentation/ABI/testing/sysfs-edac-memory-repair`. > diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile > index 3a49304860f0..1de9fe66ac6b 100644 > --- a/drivers/edac/Makefile > +++ b/drivers/edac/Makefile > @@ -10,7 +10,7 @@ obj-$(CONFIG_EDAC) := edac_core.o > > edac_core-y := edac_mc.o edac_device.o edac_mc_sysfs.o > edac_core-y += edac_module.o edac_device_sysfs.o wq.o > -edac_core-y += scrub.o ecs.o > +edac_core-y += scrub.o ecs.o mem_repair.o > > edac_core-$(CONFIG_EDAC_DEBUG) += debugfs.o > > diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c > index 1c1142a2e4e4..a401d81dad8a 100644 > --- a/drivers/edac/edac_device.c > +++ b/drivers/edac/edac_device.c > @@ -575,6 +575,7 @@ static void edac_dev_release(struct device *dev) > { > struct edac_dev_feat_ctx *ctx = container_of(dev, struct edac_dev_feat_ctx, dev); > > + kfree(ctx->mem_repair); > kfree(ctx->scrub); > kfree(ctx->dev.groups); > kfree(ctx); > @@ -611,6 +612,7 @@ int edac_dev_register(struct device *parent, char *name, > const struct attribute_group **ras_attr_groups; > struct edac_dev_data *dev_data; > struct edac_dev_feat_ctx *ctx; > + int mem_repair_cnt = 0; > int attr_gcnt = 0; > int scrub_cnt = 0; > int ret, feat; > @@ -628,6 +630,10 @@ int edac_dev_register(struct device *parent, char *name, > case RAS_FEAT_ECS: > attr_gcnt += ras_features[feat].ecs_info.num_media_frus; > break; > + case RAS_FEAT_MEM_REPAIR: > + attr_gcnt++; > + mem_repair_cnt++; > + break; > default: > return -EINVAL; > } > @@ -651,8 +657,17 @@ int edac_dev_register(struct device *parent, char *name, > } > } > > + if (mem_repair_cnt) { > + ctx->mem_repair = kcalloc(mem_repair_cnt, sizeof(*ctx->mem_repair), GFP_KERNEL); > + if (!ctx->mem_repair) { > + ret = -ENOMEM; > + goto data_mem_free; > + } > + } > + > attr_gcnt = 0; > scrub_cnt = 0; > + mem_repair_cnt = 0; > for (feat = 0; feat < num_features; feat++, ras_features++) { > switch (ras_features->ft_type) { > case RAS_FEAT_SCRUB: > @@ -686,6 +701,23 @@ int edac_dev_register(struct device *parent, char *name, > > attr_gcnt += ras_features->ecs_info.num_media_frus; > break; > + case RAS_FEAT_MEM_REPAIR: > + if (!ras_features->mem_repair_ops || > + mem_repair_cnt != ras_features->instance) > + goto data_mem_free; > + > + dev_data = &ctx->mem_repair[mem_repair_cnt]; > + dev_data->instance = mem_repair_cnt; > + dev_data->mem_repair_ops = ras_features->mem_repair_ops; > + dev_data->private = ras_features->ctx; > + ret = edac_mem_repair_get_desc(parent, &ras_attr_groups[attr_gcnt], > + ras_features->instance); > + if (ret) > + goto data_mem_free; > + > + mem_repair_cnt++; > + attr_gcnt++; > + break; > default: > ret = -EINVAL; > goto data_mem_free; > @@ -712,6 +744,7 @@ int edac_dev_register(struct device *parent, char *name, > return devm_add_action_or_reset(parent, edac_dev_unreg, &ctx->dev); > > data_mem_free: > + kfree(ctx->mem_repair); > kfree(ctx->scrub); > groups_free: > kfree(ras_attr_groups); > diff --git a/drivers/edac/mem_repair.c b/drivers/edac/mem_repair.c > new file mode 100755 > index 000000000000..e7439fd26c41 > --- /dev/null > +++ b/drivers/edac/mem_repair.c > @@ -0,0 +1,492 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * The generic EDAC memory repair driver is designed to control the memory > + * devices with memory repair features, such as Post Package Repair (PPR), > + * memory sparing etc. The common sysfs memory repair interface abstracts > + * the control of various arbitrary memory repair functionalities into a > + * unified set of functions. > + * > + * Copyright (c) 2024 HiSilicon Limited. > + */ > + > +#include <linux/edac.h> > + > +enum edac_mem_repair_attributes { > + MEM_REPAIR_FUNCTION, > + MEM_REPAIR_PERSIST_MODE, > + MEM_REPAIR_DPA_SUPPORT, > + MEM_REPAIR_SAFE_IN_USE, > + MEM_REPAIR_HPA, > + MEM_REPAIR_MIN_HPA, > + MEM_REPAIR_MAX_HPA, > + MEM_REPAIR_DPA, > + MEM_REPAIR_MIN_DPA, > + MEM_REPAIR_MAX_DPA, > + MEM_REPAIR_NIBBLE_MASK, > + MEM_REPAIR_MIN_NIBBLE_MASK, > + MEM_REPAIR_MAX_NIBBLE_MASK, > + MEM_REPAIR_BANK_GROUP, > + MEM_REPAIR_MIN_BANK_GROUP, > + MEM_REPAIR_MAX_BANK_GROUP, > + MEM_REPAIR_BANK, > + MEM_REPAIR_MIN_BANK, > + MEM_REPAIR_MAX_BANK, > + MEM_REPAIR_RANK, > + MEM_REPAIR_MIN_RANK, > + MEM_REPAIR_MAX_RANK, > + MEM_REPAIR_ROW, > + MEM_REPAIR_MIN_ROW, > + MEM_REPAIR_MAX_ROW, > + MEM_REPAIR_COLUMN, > + MEM_REPAIR_MIN_COLUMN, > + MEM_REPAIR_MAX_COLUMN, > + MEM_REPAIR_CHANNEL, > + MEM_REPAIR_MIN_CHANNEL, > + MEM_REPAIR_MAX_CHANNEL, > + MEM_REPAIR_SUB_CHANNEL, > + MEM_REPAIR_MIN_SUB_CHANNEL, > + MEM_REPAIR_MAX_SUB_CHANNEL, > + MEM_DO_REPAIR, > + MEM_REPAIR_MAX_ATTRS > +}; > + > +struct edac_mem_repair_dev_attr { > + struct device_attribute dev_attr; > + u8 instance; > +}; > + > +struct edac_mem_repair_context { > + char name[EDAC_FEAT_NAME_LEN]; > + struct edac_mem_repair_dev_attr mem_repair_dev_attr[MEM_REPAIR_MAX_ATTRS]; > + struct attribute *mem_repair_attrs[MEM_REPAIR_MAX_ATTRS + 1]; > + struct attribute_group group; > +}; > + > +#define TO_MEM_REPAIR_DEV_ATTR(_dev_attr) \ > + container_of(_dev_attr, struct edac_mem_repair_dev_attr, dev_attr) > + > +#define EDAC_MEM_REPAIR_ATTR_SHOW(attrib, cb, type, format) \ > +static ssize_t attrib##_show(struct device *ras_feat_dev, \ > + struct device_attribute *attr, char *buf) \ > +{ \ > + u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance; \ > + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); \ > + const struct edac_mem_repair_ops *ops = \ > + ctx->mem_repair[inst].mem_repair_ops; \ > + type data; \ > + int ret; \ > + \ > + ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, \ > + &data); \ > + if (ret) \ > + return ret; \ > + \ > + return sysfs_emit(buf, format, data); \ > +} > + > +EDAC_MEM_REPAIR_ATTR_SHOW(repair_function, get_repair_function, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(persist_mode, get_persist_mode, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(dpa_support, get_dpa_support, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(repair_safe_when_in_use, get_repair_safe_when_in_use, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(hpa, get_hpa, u64, "0x%llx\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(min_hpa, get_min_hpa, u64, "0x%llx\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(max_hpa, get_max_hpa, u64, "0x%llx\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(dpa, get_dpa, u64, "0x%llx\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(min_dpa, get_min_dpa, u64, "0x%llx\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(max_dpa, get_max_dpa, u64, "0x%llx\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(nibble_mask, get_nibble_mask, u64, "0x%llx\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(min_nibble_mask, get_min_nibble_mask, u64, "0x%llx\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(max_nibble_mask, get_max_nibble_mask, u64, "0x%llx\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(bank_group, get_bank_group, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank_group, get_min_bank_group, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank_group, get_max_bank_group, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(bank, get_bank, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank, get_min_bank, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank, get_max_bank, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(rank, get_rank, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(min_rank, get_min_rank, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(max_rank, get_max_rank, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(row, get_row, u64, "0x%llx\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(min_row, get_min_row, u64, "0x%llx\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(max_row, get_max_row, u64, "0x%llx\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(column, get_column, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(min_column, get_min_column, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(max_column, get_max_column, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(channel, get_channel, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(min_channel, get_min_channel, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(max_channel, get_max_channel, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(sub_channel, get_sub_channel, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(min_sub_channel, get_min_sub_channel, u32, "%u\n") > +EDAC_MEM_REPAIR_ATTR_SHOW(max_sub_channel, get_max_sub_channel, u32, "%u\n") > + > +#define EDAC_MEM_REPAIR_ATTR_STORE(attrib, cb, type, conv_func) \ > +static ssize_t attrib##_store(struct device *ras_feat_dev, \ > + struct device_attribute *attr, \ > + const char *buf, size_t len) \ > +{ \ > + u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance; \ > + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); \ > + const struct edac_mem_repair_ops *ops = \ > + ctx->mem_repair[inst].mem_repair_ops; \ > + type data; \ > + int ret; \ > + \ > + ret = conv_func(buf, 0, &data); \ > + if (ret < 0) \ > + return ret; \ > + \ > + ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, \ > + data); \ > + if (ret) \ > + return ret; \ > + \ > + return len; \ > +} > + > +EDAC_MEM_REPAIR_ATTR_STORE(persist_mode, set_persist_mode, unsigned long, kstrtoul) > +EDAC_MEM_REPAIR_ATTR_STORE(hpa, set_hpa, u64, kstrtou64) > +EDAC_MEM_REPAIR_ATTR_STORE(dpa, set_dpa, u64, kstrtou64) > +EDAC_MEM_REPAIR_ATTR_STORE(nibble_mask, set_nibble_mask, u64, kstrtou64) > +EDAC_MEM_REPAIR_ATTR_STORE(bank_group, set_bank_group, unsigned long, kstrtoul) > +EDAC_MEM_REPAIR_ATTR_STORE(bank, set_bank, unsigned long, kstrtoul) > +EDAC_MEM_REPAIR_ATTR_STORE(rank, set_rank, unsigned long, kstrtoul) > +EDAC_MEM_REPAIR_ATTR_STORE(row, set_row, u64, kstrtou64) > +EDAC_MEM_REPAIR_ATTR_STORE(column, set_column, unsigned long, kstrtoul) > +EDAC_MEM_REPAIR_ATTR_STORE(channel, set_channel, unsigned long, kstrtoul) > +EDAC_MEM_REPAIR_ATTR_STORE(sub_channel, set_sub_channel, unsigned long, kstrtoul) > + > +#define EDAC_MEM_REPAIR_DO_OP(attrib, cb) \ > +static ssize_t attrib##_store(struct device *ras_feat_dev, \ > + struct device_attribute *attr, \ > + const char *buf, size_t len) \ > +{ \ > + u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance; \ > + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); \ > + const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops; \ > + unsigned long data; \ > + int ret; \ > + \ > + ret = kstrtoul(buf, 0, &data); \ > + if (ret < 0) \ > + return ret; \ > + \ > + ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, data); \ > + if (ret) \ > + return ret; \ > + \ > + return len; \ > +} > + > +EDAC_MEM_REPAIR_DO_OP(repair, do_repair) > + > +static umode_t mem_repair_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id) > +{ > + struct device *ras_feat_dev = kobj_to_dev(kobj); > + struct device_attribute *dev_attr = container_of(a, struct device_attribute, attr); > + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); > + u8 inst = TO_MEM_REPAIR_DEV_ATTR(dev_attr)->instance; > + const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops; > + > + switch (attr_id) { > + case MEM_REPAIR_FUNCTION: > + if (ops->get_repair_function) > + return a->mode; > + break; > + case MEM_REPAIR_PERSIST_MODE: > + if (ops->get_persist_mode) { > + if (ops->set_persist_mode) > + return a->mode; > + else > + return 0444; > + } > + break; > + case MEM_REPAIR_DPA_SUPPORT: > + if (ops->get_dpa_support) > + return a->mode; > + break; > + case MEM_REPAIR_SAFE_IN_USE: > + if (ops->get_repair_safe_when_in_use) > + return a->mode; > + break; > + case MEM_REPAIR_HPA: > + if (ops->get_hpa) { > + if (ops->set_hpa) > + return a->mode; > + else > + return 0444; > + } > + break; > + case MEM_REPAIR_MIN_HPA: > + if (ops->get_min_hpa) > + return a->mode; > + break; > + case MEM_REPAIR_MAX_HPA: > + if (ops->get_max_hpa) > + return a->mode; > + break; > + case MEM_REPAIR_DPA: > + if (ops->get_dpa) { > + if (ops->set_dpa) > + return a->mode; > + else > + return 0444; > + } > + break; > + case MEM_REPAIR_MIN_DPA: > + if (ops->get_min_dpa) > + return a->mode; > + break; > + case MEM_REPAIR_MAX_DPA: > + if (ops->get_max_dpa) > + return a->mode; > + break; > + case MEM_REPAIR_NIBBLE_MASK: > + if (ops->get_nibble_mask) { > + if (ops->set_nibble_mask) > + return a->mode; > + else > + return 0444; > + } > + break; > + case MEM_REPAIR_MIN_NIBBLE_MASK: > + if (ops->get_min_nibble_mask) > + return a->mode; > + break; > + case MEM_REPAIR_MAX_NIBBLE_MASK: > + if (ops->get_max_nibble_mask) > + return a->mode; > + break; > + case MEM_REPAIR_BANK_GROUP: > + if (ops->get_bank_group) { > + if (ops->set_bank_group) > + return a->mode; > + else > + return 0444; > + } > + break; > + case MEM_REPAIR_MIN_BANK_GROUP: > + if (ops->get_min_bank_group) > + return a->mode; > + break; > + case MEM_REPAIR_MAX_BANK_GROUP: > + if (ops->get_max_bank_group) > + return a->mode; > + break; > + case MEM_REPAIR_BANK: > + if (ops->get_bank) { > + if (ops->set_bank) > + return a->mode; > + else > + return 0444; > + } > + break; > + case MEM_REPAIR_MIN_BANK: > + if (ops->get_min_bank) > + return a->mode; > + break; > + case MEM_REPAIR_MAX_BANK: > + if (ops->get_max_bank) > + return a->mode; > + break; > + case MEM_REPAIR_RANK: > + if (ops->get_rank) { > + if (ops->set_rank) > + return a->mode; > + else > + return 0444; > + } > + break; > + case MEM_REPAIR_MIN_RANK: > + if (ops->get_min_rank) > + return a->mode; > + break; > + case MEM_REPAIR_MAX_RANK: > + if (ops->get_max_rank) > + return a->mode; > + break; > + case MEM_REPAIR_ROW: > + if (ops->get_row) { > + if (ops->set_row) > + return a->mode; > + else > + return 0444; > + } > + break; > + case MEM_REPAIR_MIN_ROW: > + if (ops->get_min_row) > + return a->mode; > + break; > + case MEM_REPAIR_MAX_ROW: > + if (ops->get_max_row) > + return a->mode; > + break; > + case MEM_REPAIR_COLUMN: > + if (ops->get_column) { > + if (ops->set_column) > + return a->mode; > + else > + return 0444; > + } > + break; > + case MEM_REPAIR_MIN_COLUMN: > + if (ops->get_min_column) > + return a->mode; > + break; > + case MEM_REPAIR_MAX_COLUMN: > + if (ops->get_max_column) > + return a->mode; > + break; > + case MEM_REPAIR_CHANNEL: > + if (ops->get_channel) { > + if (ops->set_channel) > + return a->mode; > + else > + return 0444; > + } > + break; > + case MEM_REPAIR_MIN_CHANNEL: > + if (ops->get_min_channel) > + return a->mode; > + break; > + case MEM_REPAIR_MAX_CHANNEL: > + if (ops->get_max_channel) > + return a->mode; > + break; > + case MEM_REPAIR_SUB_CHANNEL: > + if (ops->get_sub_channel) { > + if (ops->set_sub_channel) > + return a->mode; > + else > + return 0444; > + } > + break; > + case MEM_REPAIR_MIN_SUB_CHANNEL: > + if (ops->get_min_sub_channel) > + return a->mode; > + break; > + case MEM_REPAIR_MAX_SUB_CHANNEL: > + if (ops->get_max_sub_channel) > + return a->mode; > + break; > + case MEM_DO_REPAIR: > + if (ops->do_repair) > + return a->mode; > + break; > + default: > + break; > + } > + > + return 0; > +} > + > +#define EDAC_MEM_REPAIR_ATTR_RO(_name, _instance) \ > + ((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RO(_name), \ > + .instance = _instance }) > + > +#define EDAC_MEM_REPAIR_ATTR_WO(_name, _instance) \ > + ((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_WO(_name), \ > + .instance = _instance }) > + > +#define EDAC_MEM_REPAIR_ATTR_RW(_name, _instance) \ > + ((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RW(_name), \ > + .instance = _instance }) > + > +static int mem_repair_create_desc(struct device *dev, > + const struct attribute_group **attr_groups, > + u8 instance) > +{ > + struct edac_mem_repair_context *ctx; > + struct attribute_group *group; > + int i; > + struct edac_mem_repair_dev_attr dev_attr[] = { > + [MEM_REPAIR_FUNCTION] = EDAC_MEM_REPAIR_ATTR_RO(repair_function, > + instance), > + [MEM_REPAIR_PERSIST_MODE] = > + EDAC_MEM_REPAIR_ATTR_RW(persist_mode, instance), > + [MEM_REPAIR_DPA_SUPPORT] = > + EDAC_MEM_REPAIR_ATTR_RO(dpa_support, instance), > + [MEM_REPAIR_SAFE_IN_USE] = > + EDAC_MEM_REPAIR_ATTR_RO(repair_safe_when_in_use, > + instance), > + [MEM_REPAIR_HPA] = EDAC_MEM_REPAIR_ATTR_RW(hpa, instance), > + [MEM_REPAIR_MIN_HPA] = EDAC_MEM_REPAIR_ATTR_RO(min_hpa, instance), > + [MEM_REPAIR_MAX_HPA] = EDAC_MEM_REPAIR_ATTR_RO(max_hpa, instance), > + [MEM_REPAIR_DPA] = EDAC_MEM_REPAIR_ATTR_RW(dpa, instance), > + [MEM_REPAIR_MIN_DPA] = EDAC_MEM_REPAIR_ATTR_RO(min_dpa, instance), > + [MEM_REPAIR_MAX_DPA] = EDAC_MEM_REPAIR_ATTR_RO(max_dpa, instance), > + [MEM_REPAIR_NIBBLE_MASK] = > + EDAC_MEM_REPAIR_ATTR_RW(nibble_mask, instance), > + [MEM_REPAIR_MIN_NIBBLE_MASK] = > + EDAC_MEM_REPAIR_ATTR_RO(min_nibble_mask, instance), > + [MEM_REPAIR_MAX_NIBBLE_MASK] = > + EDAC_MEM_REPAIR_ATTR_RO(max_nibble_mask, instance), > + [MEM_REPAIR_BANK_GROUP] = > + EDAC_MEM_REPAIR_ATTR_RW(bank_group, instance), > + [MEM_REPAIR_MIN_BANK_GROUP] = > + EDAC_MEM_REPAIR_ATTR_RO(min_bank_group, instance), > + [MEM_REPAIR_MAX_BANK_GROUP] = > + EDAC_MEM_REPAIR_ATTR_RO(max_bank_group, instance), > + [MEM_REPAIR_BANK] = EDAC_MEM_REPAIR_ATTR_RW(bank, instance), > + [MEM_REPAIR_MIN_BANK] = EDAC_MEM_REPAIR_ATTR_RO(min_bank, instance), > + [MEM_REPAIR_MAX_BANK] = EDAC_MEM_REPAIR_ATTR_RO(max_bank, instance), > + [MEM_REPAIR_RANK] = EDAC_MEM_REPAIR_ATTR_RW(rank, instance), > + [MEM_REPAIR_MIN_RANK] = EDAC_MEM_REPAIR_ATTR_RO(min_rank, instance), > + [MEM_REPAIR_MAX_RANK] = EDAC_MEM_REPAIR_ATTR_RO(max_rank, instance), > + [MEM_REPAIR_ROW] = EDAC_MEM_REPAIR_ATTR_RW(row, instance), > + [MEM_REPAIR_MIN_ROW] = EDAC_MEM_REPAIR_ATTR_RO(min_row, instance), > + [MEM_REPAIR_MAX_ROW] = EDAC_MEM_REPAIR_ATTR_RO(max_row, instance), > + [MEM_REPAIR_COLUMN] = EDAC_MEM_REPAIR_ATTR_RW(column, instance), > + [MEM_REPAIR_MIN_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(min_column, instance), > + [MEM_REPAIR_MAX_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(max_column, instance), > + [MEM_REPAIR_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RW(channel, instance), > + [MEM_REPAIR_MIN_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(min_channel, instance), > + [MEM_REPAIR_MAX_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(max_channel, instance), > + [MEM_REPAIR_SUB_CHANNEL] = > + EDAC_MEM_REPAIR_ATTR_RW(sub_channel, instance), > + [MEM_REPAIR_MIN_SUB_CHANNEL] = > + EDAC_MEM_REPAIR_ATTR_RO(min_sub_channel, instance), > + [MEM_REPAIR_MAX_SUB_CHANNEL] = > + EDAC_MEM_REPAIR_ATTR_RO(max_sub_channel, instance), > + [MEM_DO_REPAIR] = EDAC_MEM_REPAIR_ATTR_WO(repair, instance) > + }; > + > + ctx = devm_kzalloc(dev, sizeof(*ctx), GFP_KERNEL); > + if (!ctx) > + return -ENOMEM; > + > + for (i = 0; i < MEM_REPAIR_MAX_ATTRS; i++) { > + memcpy(&ctx->mem_repair_dev_attr[i].dev_attr, > + &dev_attr[i], sizeof(dev_attr[i])); > + ctx->mem_repair_attrs[i] = > + &ctx->mem_repair_dev_attr[i].dev_attr.attr; > + } > + > + sprintf(ctx->name, "%s%d", "mem_repair", instance); > + group = &ctx->group; > + group->name = ctx->name; > + group->attrs = ctx->mem_repair_attrs; > + group->is_visible = mem_repair_attr_visible; > + attr_groups[0] = group; > + > + return 0; > +} > + > +/** > + * edac_mem_repair_get_desc - get EDAC memory repair descriptors > + * @dev: client device with memory repair feature > + * @attr_groups: pointer to attribute group container > + * @instance: device's memory repair instance number. > + * > + * Return: > + * * %0 - Success. > + * * %-EINVAL - Invalid parameters passed. > + * * %-ENOMEM - Dynamic memory allocation failed. > + */ > +int edac_mem_repair_get_desc(struct device *dev, > + const struct attribute_group **attr_groups, u8 instance) > +{ > + if (!dev || !attr_groups) > + return -EINVAL; > + > + return mem_repair_create_desc(dev, attr_groups, instance); > +} > diff --git a/include/linux/edac.h b/include/linux/edac.h > index 979e91426701..5d07192bf1a7 100644 > --- a/include/linux/edac.h > +++ b/include/linux/edac.h > @@ -668,6 +668,7 @@ static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci, > enum edac_dev_feat { > RAS_FEAT_SCRUB, > RAS_FEAT_ECS, > + RAS_FEAT_MEM_REPAIR, > RAS_FEAT_MAX > }; > > @@ -729,11 +730,147 @@ int edac_ecs_get_desc(struct device *ecs_dev, > const struct attribute_group **attr_groups, > u16 num_media_frus); > > +enum edac_mem_repair_function { > + EDAC_SOFT_PPR, > + EDAC_HARD_PPR, > + EDAC_CACHELINE_MEM_SPARING, > + EDAC_ROW_MEM_SPARING, > + EDAC_BANK_MEM_SPARING, > + EDAC_RANK_MEM_SPARING, > +}; > + > +enum edac_mem_repair_persist_mode { > + EDAC_MEM_REPAIR_SOFT, /* soft memory repair */ > + EDAC_MEM_REPAIR_HARD, /* hard memory repair */ > +}; > + > +enum edac_mem_repair_cmd { > + EDAC_DO_MEM_REPAIR = 1, > +}; > + > +/** > + * struct edac_mem_repair_ops - memory repair operations > + * (all elements are optional except do_repair, set_hpa/set_dpa) > + * @get_repair_function: get the memory repair function, listed in > + * enum edac_mem_repair_function. > + * @get_persist_mode: get the current persist mode. Persist repair modes supported > + * in the device is based on the memory repair function which is > + * temporary or permanent and is lost with a power cycle. > + * EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair). > + * EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair). > + * All other values are reserved. > + * @set_persist_mode: set the persist mode of the memory repair instance. > + * @get_dpa_support: get dpa support flag. In some states of system configuration > + * (e.g. before address decoders have been configured), memory devices > + * (e.g. CXL) may not have an active mapping in the main host address > + * physical address map. As such, the memory to repair must be identified > + * by a device specific physical addressing scheme using a device physical > + * address(DPA). The DPA and other control attributes to use for the > + * dry_run and repair operations will be presented in related error records. > + * @get_repair_safe_when_in_use: get whether memory media is accessible and > + * data is retained during repair operation. > + * @get_hpa: get current host physical address (HPA). > + * @set_hpa: set host physical address (HPA) of memory to repair. > + * @get_min_hpa: get the minimum supported host physical address (HPA). > + * @get_max_hpa: get the maximum supported host physical address (HPA). > + * @get_dpa: get current device physical address (DPA). > + * @set_dpa: set device physical address (DPA) of memory to repair. > + * @get_min_dpa: get the minimum supported device physical address (DPA). > + * @get_max_dpa: get the maximum supported device physical address (DPA). > + * @get_nibble_mask: get current nibble mask. > + * @set_nibble_mask: set nibble mask of memory to repair. > + * @get_min_nibble_mask: get the minimum supported nibble mask. > + * @get_max_nibble_mask: get the maximum supported nibble mask. > + * @get_bank_group: get current bank group. > + * @set_bank_group: set bank group of memory to repair. > + * @get_min_bank_group: get the minimum supported bank group. > + * @get_max_bank_group: get the maximum supported bank group. > + * @get_bank: get current bank. > + * @set_bank: set bank of memory to repair. > + * @get_min_bank: get the minimum supported bank. > + * @get_max_bank: get the maximum supported bank. > + * @get_rank: get current rank. > + * @set_rank: set rank of memory to repair. > + * @get_min_rank: get the minimum supported rank. > + * @get_max_rank: get the maximum supported rank. > + * @get_row: get current row. > + * @set_row: set row of memory to repair. > + * @get_min_row: get the minimum supported row. > + * @get_max_row: get the maximum supported row. > + * @get_column: get current column. > + * @set_column: set column of memory to repair. > + * @get_min_column: get the minimum supported column. > + * @get_max_column: get the maximum supported column. > + * @get_channel: get current channel. > + * @set_channel: set channel of memory to repair. > + * @get_min_channel: get the minimum supported channel. > + * @get_max_channel: get the maximum supported channel. > + * @get_sub_channel: get current sub channel. > + * @set_sub_channel: set sub channel of memory to repair. > + * @get_min_sub_channel: get the minimum supported sub channel. > + * @get_max_sub_channel: get the maximum supported sub channel. > + * @do_repair: Issue memory repair operation for the HPA/DPA and > + * other control attributes set for the memory to repair. > + */ > +struct edac_mem_repair_ops { > + int (*get_repair_function)(struct device *dev, void *drv_data, u32 *val); > + int (*get_persist_mode)(struct device *dev, void *drv_data, u32 *mode); > + int (*set_persist_mode)(struct device *dev, void *drv_data, u32 mode); > + int (*get_dpa_support)(struct device *dev, void *drv_data, u32 *val); > + int (*get_repair_safe_when_in_use)(struct device *dev, void *drv_data, u32 *val); > + int (*get_hpa)(struct device *dev, void *drv_data, u64 *hpa); > + int (*set_hpa)(struct device *dev, void *drv_data, u64 hpa); > + int (*get_min_hpa)(struct device *dev, void *drv_data, u64 *hpa); > + int (*get_max_hpa)(struct device *dev, void *drv_data, u64 *hpa); > + int (*get_dpa)(struct device *dev, void *drv_data, u64 *dpa); > + int (*set_dpa)(struct device *dev, void *drv_data, u64 dpa); > + int (*get_min_dpa)(struct device *dev, void *drv_data, u64 *dpa); > + int (*get_max_dpa)(struct device *dev, void *drv_data, u64 *dpa); > + int (*get_nibble_mask)(struct device *dev, void *drv_data, u64 *val); > + int (*set_nibble_mask)(struct device *dev, void *drv_data, u64 val); > + int (*get_min_nibble_mask)(struct device *dev, void *drv_data, u64 *val); > + int (*get_max_nibble_mask)(struct device *dev, void *drv_data, u64 *val); > + int (*get_bank_group)(struct device *dev, void *drv_data, u32 *val); > + int (*set_bank_group)(struct device *dev, void *drv_data, u32 val); > + int (*get_min_bank_group)(struct device *dev, void *drv_data, u32 *val); > + int (*get_max_bank_group)(struct device *dev, void *drv_data, u32 *val); > + int (*get_bank)(struct device *dev, void *drv_data, u32 *val); > + int (*set_bank)(struct device *dev, void *drv_data, u32 val); > + int (*get_min_bank)(struct device *dev, void *drv_data, u32 *val); > + int (*get_max_bank)(struct device *dev, void *drv_data, u32 *val); > + int (*get_rank)(struct device *dev, void *drv_data, u32 *val); > + int (*set_rank)(struct device *dev, void *drv_data, u32 val); > + int (*get_min_rank)(struct device *dev, void *drv_data, u32 *val); > + int (*get_max_rank)(struct device *dev, void *drv_data, u32 *val); > + int (*get_row)(struct device *dev, void *drv_data, u64 *val); > + int (*set_row)(struct device *dev, void *drv_data, u64 val); > + int (*get_min_row)(struct device *dev, void *drv_data, u64 *val); > + int (*get_max_row)(struct device *dev, void *drv_data, u64 *val); > + int (*get_column)(struct device *dev, void *drv_data, u32 *val); > + int (*set_column)(struct device *dev, void *drv_data, u32 val); > + int (*get_min_column)(struct device *dev, void *drv_data, u32 *val); > + int (*get_max_column)(struct device *dev, void *drv_data, u32 *val); > + int (*get_channel)(struct device *dev, void *drv_data, u32 *val); > + int (*set_channel)(struct device *dev, void *drv_data, u32 val); > + int (*get_min_channel)(struct device *dev, void *drv_data, u32 *val); > + int (*get_max_channel)(struct device *dev, void *drv_data, u32 *val); > + int (*get_sub_channel)(struct device *dev, void *drv_data, u32 *val); > + int (*set_sub_channel)(struct device *dev, void *drv_data, u32 val); > + int (*get_min_sub_channel)(struct device *dev, void *drv_data, u32 *val); > + int (*get_max_sub_channel)(struct device *dev, void *drv_data, u32 *val); > + int (*do_repair)(struct device *dev, void *drv_data, u32 val); > +}; > + > +int edac_mem_repair_get_desc(struct device *dev, > + const struct attribute_group **attr_groups, > + u8 instance); > + > /* EDAC device feature information structure */ > struct edac_dev_data { > union { > const struct edac_scrub_ops *scrub_ops; > const struct edac_ecs_ops *ecs_ops; > + const struct edac_mem_repair_ops *mem_repair_ops; > }; > u8 instance; > void *private; > @@ -744,6 +881,7 @@ struct edac_dev_feat_ctx { > void *private; > struct edac_dev_data *scrub; > struct edac_dev_data ecs; > + struct edac_dev_data *mem_repair; > }; > > struct edac_dev_feature { > @@ -752,6 +890,7 @@ struct edac_dev_feature { > union { > const struct edac_scrub_ops *scrub_ops; > const struct edac_ecs_ops *ecs_ops; > + const struct edac_mem_repair_ops *mem_repair_ops; > }; > void *ctx; > struct edac_ecs_ex_info ecs_info; Thanks, Mauro
Hi Mauro, Thanks for the comments. >-----Original Message----- >From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> >Sent: 14 January 2025 11:48 >To: Shiju Jose <shiju.jose@huawei.com> >Cc: linux-edac@vger.kernel.org; linux-cxl@vger.kernel.org; linux- >acpi@vger.kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org; >bp@alien8.de; tony.luck@intel.com; rafael@kernel.org; lenb@kernel.org; >mchehab@kernel.org; dan.j.williams@intel.com; dave@stgolabs.net; Jonathan >Cameron <jonathan.cameron@huawei.com>; dave.jiang@intel.com; >alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com; >david@redhat.com; Vilas.Sridharan@amd.com; leo.duran@amd.com; >Yazen.Ghannam@amd.com; rientjes@google.com; jiaqiyan@google.com; >Jon.Grimm@amd.com; dave.hansen@linux.intel.com; >naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com; >somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com; >duenwen@google.com; gthelen@google.com; >wschwartz@amperecomputing.com; dferguson@amperecomputing.com; >wbs@os.amperecomputing.com; nifan.cxl@gmail.com; tanxiaofei ><tanxiaofei@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Roberto >Sassu <roberto.sassu@huawei.com>; kangkang.shen@futurewei.com; >wanghuiqiang <wanghuiqiang@huawei.com>; Linuxarm ><linuxarm@huawei.com> >Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature > >Em Mon, 6 Jan 2025 12:10:00 +0000 ><shiju.jose@huawei.com> escreveu: > >> From: Shiju Jose <shiju.jose@huawei.com> >> >> Add a generic EDAC memory repair control driver to manage memory repairs >> in the system, such as CXL Post Package Repair (PPR) and CXL memory sparing >> features. >> >> For example, a CXL device with DRAM components that support PPR features >> may implement PPR maintenance operations. DRAM components may support >two >> types of PPR, hard PPR, for a permanent row repair, and soft PPR, for a >> temporary row repair. Soft PPR is much faster than hard PPR, but the repair >> is lost with a power cycle. >> Similarly a CXL memory device may support soft and hard memory sparing at >> cacheline, row, bank and rank granularities. Memory sparing is defined as >> a repair function that replaces a portion of memory with a portion of >> functional memory at that same granularity. >> When a CXL device detects an error in a memory, it may report the host of >> the need for a repair maintenance operation by using an event record where >> the "maintenance needed" flag is set. The event records contains the device >> physical address(DPA) and other attributes of the memory to repair (such as >> channel, sub-channel, bank group, bank, rank, row, column etc). The kernel >> will report the corresponding CXL general media or DRAM trace event to >> userspace, and userspace tools (e.g. rasdaemon) will initiate a repair >> operation in response to the device request via the sysfs repair control. >> >> Device with memory repair features registers with EDAC device driver, >> which retrieves memory repair descriptor from EDAC memory repair driver >> and exposes the sysfs repair control attributes to userspace in >> /sys/bus/edac/devices/<dev-name>/mem_repairX/. >> >> The common memory repair control interface abstracts the control of >> arbitrary memory repair functionality into a standardized set of functions. >> The sysfs memory repair attribute nodes are only available if the client >> driver has implemented the corresponding attribute callback function and >> provided operations to the EDAC device driver during registration. >> >> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> >> --- >> .../ABI/testing/sysfs-edac-memory-repair | 244 +++++++++ >> Documentation/edac/features.rst | 3 + >> Documentation/edac/index.rst | 1 + >> Documentation/edac/memory_repair.rst | 101 ++++ >> drivers/edac/Makefile | 2 +- >> drivers/edac/edac_device.c | 33 ++ >> drivers/edac/mem_repair.c | 492 ++++++++++++++++++ >> include/linux/edac.h | 139 +++++ >> 8 files changed, 1014 insertions(+), 1 deletion(-) >> create mode 100644 Documentation/ABI/testing/sysfs-edac-memory-repair >> create mode 100644 Documentation/edac/memory_repair.rst >> create mode 100755 drivers/edac/mem_repair.c >> >> diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair >b/Documentation/ABI/testing/sysfs-edac-memory-repair >> new file mode 100644 >> index 000000000000..e9268f3780ed >> --- /dev/null >> +++ b/Documentation/ABI/testing/sysfs-edac-memory-repair >> @@ -0,0 +1,244 @@ >> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX >> +Date: Jan 2025 >> +KernelVersion: 6.14 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + The sysfs EDAC bus devices /<dev-name>/mem_repairX >subdirectory >> + pertains to the memory media repair features control, such as >> + PPR (Post Package Repair), memory sparing etc, where<dev- >name> >> + directory corresponds to a device registered with the EDAC >> + device driver for the memory repair features. >> + >> + Post Package Repair is a maintenance operation requests the >memory >> + device to perform a repair operation on its media, in detail is a >> + memory self-healing feature that fixes a failing memory >location by >> + replacing it with a spare row in a DRAM device. For example, a >> + CXL memory device with DRAM components that support PPR >features may >> + implement PPR maintenance operations. DRAM components >may support >> + two types of PPR functions: hard PPR, for a permanent row >repair, and >> + soft PPR, for a temporary row repair. soft PPR is much faster >than >> + hard PPR, but the repair is lost with a power cycle. >> + >> + Memory sparing is a repair function that replaces a portion >> + of memory with a portion of functional memory at that same >> + sparing granularity. Memory sparing has >cacheline/row/bank/rank >> + sparing granularities. For example, in memory-sparing mode, >> + one memory rank serves as a spare for other ranks on the same >> + channel in case they fail. The spare rank is held in reserve and >> + not used as active memory until a failure is indicated, with >> + reserved capacity subtracted from the total available memory >> + in the system.The DIMM installation order for memory sparing >> + varies based on the number of processors and memory modules >> + installed in the server. After an error threshold is surpassed >> + in a system protected by memory sparing, the content of a >failing >> + rank of DIMMs is copied to the spare rank. The failing rank is >> + then taken offline and the spare rank placed online for use as >> + active memory in place of the failed rank. >> + >> + The sysfs attributes nodes for a repair feature are only >> + present if the parent driver has implemented the corresponding >> + attr callback function and provided the necessary operations >> + to the EDAC device driver during registration. >> + >> + In some states of system configuration (e.g. before address >> + decoders have been configured), memory devices (e.g. CXL) >> + may not have an active mapping in the main host address >> + physical address map. As such, the memory to repair must be >> + identified by a device specific physical addressing scheme >> + using a device physical address(DPA). The DPA and other control >> + attributes to use will be presented in related error records. >> + >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/repair_function >> +Date: Jan 2025 >> +KernelVersion: 6.14 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (RO) Memory repair function type. For eg. post package repair, >> + memory sparing etc. >> + EDAC_SOFT_PPR - Soft post package repair >> + EDAC_HARD_PPR - Hard post package repair >> + EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing >> + EDAC_ROW_MEM_SPARING - Row memory sparing >> + EDAC_BANK_MEM_SPARING - Bank memory sparing >> + EDAC_RANK_MEM_SPARING - Rank memory sparing >> + All other values are reserved. > >Too big strings. Why are them in upper cases? IMO: > > soft-ppr, hard-ppr, ... would be enough. > Here return repair type (single value, such as 0, 1, or 2 etc not as decoded string for eg."EDAC_SOFT_PPR") of the memory repair instance, which is defined as enums (EDAC_SOFT_PPR, EDAC_HARD_PPR, ... etc) for the memory repair interface in the include/linux/edac.h. enum edac_mem_repair_function { EDAC_SOFT_PPR, EDAC_HARD_PPR, EDAC_CACHELINE_MEM_SPARING, EDAC_ROW_MEM_SPARING, EDAC_BANK_MEM_SPARING, EDAC_RANK_MEM_SPARING, }; I documented return value in terms of the above enums. >Also, Is it mandatory that all types are supported? If not, you need a >way to report to userspace what of them are supported. One option >would be that reading /sys/bus/edac/devices/<dev- >name>/mem_repairX/repair_function >would return something like: > > soft-ppr [hard-ppr] row-mem-sparing > Same as above. It is not returned in the decoded string format. >Also, as this will be parsed in ReST format, you need to change the >description to use bullets, otherwise the html/pdf version of the >document will place everything on a single line. E.g. something like: Sure. > >Description: > (RO) Memory repair function type. For eg. post package repair, > memory sparing etc. Can be: > > - EDAC_SOFT_PPR - Soft post package repair > - EDAC_HARD_PPR - Hard post package repair > - EDAC_CACHELINE_MEM_SPARING - Cacheline memory >sparing > - EDAC_ROW_MEM_SPARING - Row memory sparing > - EDAC_BANK_MEM_SPARING - Bank memory sparing > - EDAC_RANK_MEM_SPARING - Rank memory sparing > - All other values are reserved. > >Same applies to other sysfs nodes. See for instance: > > Documentation/ABI/stable/sysfs-class-backlight > >And see how it is formatted after Sphinx processing at the Kernel >Admin guide: > > https://www.kernel.org/doc/html/latest/admin-guide/abi- >stable.html#symbols-under-sys-class > >Please fix it on all places you have a list of values. Sure. > >> + >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/persist_mode >> +Date: Jan 2025 >> +KernelVersion: 6.14 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (RW) Read/Write the current persist repair mode set for a >> + repair function. Persist repair modes supported in the >> + device, based on the memory repair function is temporary >> + or permanent and is lost with a power cycle. >> + EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary >repair). >> + EDAC_MEM_REPAIR_HARD - Hard memory repair function >(permanent repair). >> + All other values are reserved. > >Same here: edac/ is already in the path. No need to place EDAC_ at the name. > Sam as above. Return a single value, not as decoded string. But documented in terms of the enums defined for interface in the include/linux/edac.h >> + >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/dpa_support >> +Date: Jan 2025 >> +KernelVersion: 6.14 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (RO) True if memory device required device physical >> + address (DPA) of memory to repair. >> + False if memory device required host specific physical >> + address (HPA) of memory to repair. > >Please remove the extra spaces before "address", as otherwise conversion to >ReST may do the wrong thing or may produce doc warnings. Will fix. > >> + In some states of system configuration (e.g. before address >> + decoders have been configured), memory devices (e.g. CXL) >> + may not have an active mapping in the main host address >> + physical address map. As such, the memory to repair must be >> + identified by a device specific physical addressing scheme >> + using a DPA. The device physical address(DPA) to use will be >> + presented in related error records. >> + >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/repair_safe_when_in_use >> +Date: Jan 2025 >> +KernelVersion: 6.14 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (RO) True if memory media is accessible and data is retained >> + during the memory repair operation. >> + The data may not be retained and memory requests may not be >> + correctly processed during a repair operation. In such case >> + the repair operation should not executed at runtime. > >Please add an extra line before "The data" to ensure that the output at >the admin-guide won't merge the two paragraphs. Same on other places along >this patch series: paragraphs need a blank line at the description. > Sure. >> + >> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/hpa >> +Date: Jan 2025 >> +KernelVersion: 6.14 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (RW) Host Physical Address (HPA) of the memory to repair. >> + See attribute 'dpa_support' for more details. >> + The HPA to use will be provided in related error records. >> + >> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/dpa >> +Date: Jan 2025 >> +KernelVersion: 6.14 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (RW) Device Physical Address (DPA) of the memory to repair. >> + See attribute 'dpa_support' for more details. >> + The specific DPA to use will be provided in related error >> + records. >> + >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/nibble_mask >> +Date: Jan 2025 >> +KernelVersion: 6.14 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (RW) Read/Write Nibble mask of the memory to repair. >> + Nibble mask identifies one or more nibbles in error on the >> + memory bus that produced the error event. Nibble Mask bit 0 >> + shall be set if nibble 0 on the memory bus produced the >> + event, etc. For example, CXL PPR and sparing, a nibble mask >> + bit set to 1 indicates the request to perform repair >> + operation in the specific device. All nibble mask bits set >> + to 1 indicates the request to perform the operation in all >> + devices. For CXL memory to repiar, the specific value of >> + nibble mask to use will be provided in related error records. >> + For more details, See nibble mask field in CXL spec ver 3.1, >> + section 8.2.9.7.1.2 Table 8-103 soft PPR and section >> + 8.2.9.7.1.3 Table 8-104 hard PPR, section 8.2.9.7.1.4 >> + Table 8-105 memory sparing. >> + >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/bank_group >> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/bank >> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/rank >> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/row >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/column >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/channel >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/sub_channel >> +Date: Jan 2025 >> +KernelVersion: 6.14 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (RW) The control attributes associated with memory address >> + that is to be repaired. The specific value of attributes to >> + use depends on the portion of memory to repair and may be >> + reported to host in related error records and may be >> + available to userspace in trace events, such as in CXL >> + memory devices. >> + >> + channel - The channel of the memory to repair. Channel is >> + defined as an interface that can be independently accessed >> + for a transaction. >> + rank - The rank of the memory to repair. Rank is defined as a >> + set of memory devices on a channel that together execute a >> + transaction. >> + bank_group - The bank group of the memory to repair. >> + bank - The bank number of the memory to repair. >> + row - The row number of the memory to repair. >> + column - The column number of the memory to repair. >> + sub_channel - The sub-channel of the memory to repair. > >Same problem here with regards to bad ReST input. I would do: > > channel > The channel of the memory to repair. Channel is > defined as an interface that can be independently accessed > for a transaction. > > rank > The rank of the memory to repair. Rank is defined as a > set of memory devices on a channel that together execute a > transaction. > Sure. Will fix. >as this would provide a better output at admin-guide while still being >nicer to read as text. > >> + >> + The requirement to set these attributes varies based on the >> + repair function. The attributes in sysfs are not present >> + unless required for a repair function. >> + For example, CXL spec ver 3.1, Section 8.2.9.7.1.2 Table 8-103 >> + soft PPR and Section 8.2.9.7.1.3 Table 8-104 hard PPR >operations, >> + these attributes are not required to set. >> + For example, CXL spec ver 3.1, Section 8.2.9.7.1.4 Table 8-105 >> + memory sparing, these attributes are required to set based on >> + memory sparing granularity as follows. >> + Channel: Channel associated with the DPA that is to be spared >> + and applies to all subclasses of sparing (cacheline, bank, >> + row and rank sparing). >> + Rank: Rank associated with the DPA that is to be spared and >> + applies to all subclasses of sparing. >> + Bank & Bank Group: Bank & bank group are associated with >> + the DPA that is to be spared and applies to cacheline sparing, >> + row sparing and bank sparing subclasses. >> + Row: Row associated with the DPA that is to be spared and >> + applies to cacheline sparing and row sparing subclasses. >> + Column: column associated with the DPA that is to be spared >> + and applies to cacheline sparing only. >> + Sub-channel: sub-channel associated with the DPA that is to >> + be spared and applies to cacheline sparing only. > >Same here: this will all be on a single paragraph which would be really >weird. Will fix. > >> + >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_hpa >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_dpa >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_nibble_mask >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_bank_group >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_bank >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_rank >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_row >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_column >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_channel >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/min_sub_channel >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_hpa >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_dpa >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_nibble_mask >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_bank_group >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_bank >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_rank >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_row >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_column >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_channel >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/max_sub_channel >> +Date: Jan 2025 >> +KernelVersion: 6.14 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (RW) The supported range of control attributes (optional) >> + associated with a memory address that is to be repaired. >> + The memory device may give the supported range of >> + attributes to use and it will depend on the memory device >> + and the portion of memory to repair. >> + The userspace may receive the specific value of attributes >> + to use for a repair operation from the memory device via >> + related error records and trace events, such as in CXL >> + memory devices. >> + >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/repair >> +Date: Jan 2025 >> +KernelVersion: 6.14 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (WO) Issue the memory repair operation for the specified >> + memory repair attributes. The operation may fail if resources >> + are insufficient based on the requirements of the memory >> + device and repair function. >> + EDAC_DO_MEM_REPAIR - issue repair operation. >> + All other values are reserved. >> diff --git a/Documentation/edac/features.rst >b/Documentation/edac/features.rst >> index ba3ab993ee4f..bfd5533b81b7 100644 >> --- a/Documentation/edac/features.rst >> +++ b/Documentation/edac/features.rst >> @@ -97,3 +97,6 @@ RAS features >> ------------ >> 1. Memory Scrub >> Memory scrub features are documented in `Documentation/edac/scrub.rst`. >> + >> +2. Memory Repair >> +Memory repair features are documented in >`Documentation/edac/memory_repair.rst`. >> diff --git a/Documentation/edac/index.rst b/Documentation/edac/index.rst >> index dfb0c9fb9ab1..d6778f4562dd 100644 >> --- a/Documentation/edac/index.rst >> +++ b/Documentation/edac/index.rst >> @@ -8,4 +8,5 @@ EDAC Subsystem >> :maxdepth: 1 >> >> features >> + memory_repair >> scrub >> diff --git a/Documentation/edac/memory_repair.rst >b/Documentation/edac/memory_repair.rst >> new file mode 100644 >> index 000000000000..2787a8a2d6ba >> --- /dev/null >> +++ b/Documentation/edac/memory_repair.rst >> @@ -0,0 +1,101 @@ >> +.. SPDX-License-Identifier: GPL-2.0 >> + >> +========================== >> +EDAC Memory Repair Control >> +========================== >> + >> +Copyright (c) 2024 HiSilicon Limited. >> + >> +:Author: Shiju Jose <shiju.jose@huawei.com> >> +:License: The GNU Free Documentation License, Version 1.2 >> + (dual licensed under the GPL v2) >> +:Original Reviewers: >> + >> +- Written for: 6.14 > >See my comments with regards to license on the previous patches. Ok. > >> + >> +Introduction >> +------------ >> +Memory devices may support repair operations to address issues in their >> +memory media. Post Package Repair (PPR) and memory sparing are >examples >> +of such features. >> + >> +Post Package Repair(PPR) >> +~~~~~~~~~~~~~~~~~~~~~~~~ >> +Post Package Repair is a maintenance operation requests the memory device >> +to perform repair operation on its media, in detail is a memory self-healing >> +feature that fixes a failing memory location by replacing it with a spare >> +row in a DRAM device. For example, a CXL memory device with DRAM >components >> +that support PPR features may implement PPR maintenance operations. >DRAM >> +components may support types of PPR functions, hard PPR, for a permanent >row >> +repair, and soft PPR, for a temporary row repair. Soft PPR is much faster >> +than hard PPR, but the repair is lost with a power cycle. The data may not >> +be retained and memory requests may not be correctly processed during a >> +repair operation. In such case, the repair operation should not executed >> +at runtime. >> +For example, CXL memory devices, soft PPR and hard PPR repair operations >> +may be supported. See CXL spec rev 3.1 sections 8.2.9.7.1.1 PPR Maintenance >> +Operations, 8.2.9.7.1.2 sPPR Maintenance Operation and 8.2.9.7.1.3 hPPR >> +Maintenance Operation for more details. > >Paragraphs require blank lines in ReST. Also, please place a link to the >specs. > >I strongly suggest looking at the output of all docs with make htmldocs >and make pdfdocs to be sure that the paragraphs and the final document >will be properly handled. You may use: > > SPHINXDIRS="<book name(s)>" > >to speed-up documentation builds. > >Please see Sphinx documentation for more details about what it is expected >there: > > https://www.sphinx- >doc.org/en/master/usage/restructuredtext/basics.html Thanks for information. I will check and fix. I had fixed blank line requirements in most of the main documentations, but was not aware of location of output for the ABI docs and missed. > >> + >> +Memory Sparing >> +~~~~~~~~~~~~~~ >> +Memory sparing is a repair function that replaces a portion of memory with >> +a portion of functional memory at that same sparing granularity. Memory >> +sparing has cacheline/row/bank/rank sparing granularities. For example, in >> +memory-sparing mode, one memory rank serves as a spare for other ranks >on >> +the same channel in case they fail. The spare rank is held in reserve and >> +not used as active memory until a failure is indicated, with reserved >> +capacity subtracted from the total available memory in the system. The >DIMM >> +installation order for memory sparing varies based on the number of >processors >> +and memory modules installed in the server. After an error threshold is >> +surpassed in a system protected by memory sparing, the content of a failing >> +rank of DIMMs is copied to the spare rank. The failing rank is then taken >> +offline and the spare rank placed online for use as active memory in place >> +of the failed rank. >> + >> +For example, CXL memory devices may support various subclasses for sparing >> +operation vary in terms of the scope of the sparing being performed. >> +Cacheline sparing subclass refers to a sparing action that can replace a >> +full cacheline. Row sparing is provided as an alternative to PPR sparing >> +functions and its scope is that of a single DDR row. Bank sparing allows >> +an entire bank to be replaced. Rank sparing is defined as an operation >> +in which an entire DDR rank is replaced. See CXL spec 3.1 section >> +8.2.9.7.1.4 Memory Sparing Maintenance Operations for more details. >> + >> +Use cases of generic memory repair features control >> +--------------------------------------------------- >> + >> +1. The soft PPR , hard PPR and memory-sparing features share similar >> +control attributes. Therefore, there is a need for a standardized, generic >> +sysfs repair control that is exposed to userspace and used by >> +administrators, scripts and tools. >> + >> +2. When a CXL device detects an error in a memory component, it may >inform >> +the host of the need for a repair maintenance operation by using an event >> +record where the "maintenance needed" flag is set. The event record >> +specifies the device physical address(DPA) and attributes of the memory that >> +requires repair. The kernel reports the corresponding CXL general media or >> +DRAM trace event to userspace, and userspace tools (e.g. rasdaemon) >initiate >> +a repair maintenance operation in response to the device request using the >> +sysfs repair control. >> + >> +3. Userspace tools, such as rasdaemon, may request a PPR/sparing on a >memory >> +region when an uncorrected memory error or an excess of corrected >memory >> +errors is reported on that memory. >> + >> +4. Multiple PPR/sparing instances may be present per memory device. >> + >> +The File System >> +--------------- >> + >> +The control attributes of a registered memory repair instance could be >> +accessed in the >> + >> +/sys/bus/edac/devices/<dev-name>/mem_repairX/ >> + >> +sysfs >> +----- >> + >> +Sysfs files are documented in >> + >> +`Documentation/ABI/testing/sysfs-edac-memory-repair`. >> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile >> index 3a49304860f0..1de9fe66ac6b 100644 >> --- a/drivers/edac/Makefile >> +++ b/drivers/edac/Makefile >> @@ -10,7 +10,7 @@ obj-$(CONFIG_EDAC) := edac_core.o >> >> edac_core-y := edac_mc.o edac_device.o edac_mc_sysfs.o >> edac_core-y += edac_module.o edac_device_sysfs.o wq.o >> -edac_core-y += scrub.o ecs.o >> +edac_core-y += scrub.o ecs.o mem_repair.o >> >> edac_core-$(CONFIG_EDAC_DEBUG) += debugfs.o >> >> diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c >> index 1c1142a2e4e4..a401d81dad8a 100644 >> --- a/drivers/edac/edac_device.c >> +++ b/drivers/edac/edac_device.c >> @@ -575,6 +575,7 @@ static void edac_dev_release(struct device *dev) >> { >> struct edac_dev_feat_ctx *ctx = container_of(dev, struct >edac_dev_feat_ctx, dev); >> >> + kfree(ctx->mem_repair); >> kfree(ctx->scrub); >> kfree(ctx->dev.groups); >> kfree(ctx); >> @@ -611,6 +612,7 @@ int edac_dev_register(struct device *parent, char >*name, >> const struct attribute_group **ras_attr_groups; >> struct edac_dev_data *dev_data; >> struct edac_dev_feat_ctx *ctx; >> + int mem_repair_cnt = 0; >> int attr_gcnt = 0; >> int scrub_cnt = 0; >> int ret, feat; >> @@ -628,6 +630,10 @@ int edac_dev_register(struct device *parent, char >*name, >> case RAS_FEAT_ECS: >> attr_gcnt += >ras_features[feat].ecs_info.num_media_frus; >> break; >> + case RAS_FEAT_MEM_REPAIR: >> + attr_gcnt++; >> + mem_repair_cnt++; >> + break; >> default: >> return -EINVAL; >> } >> @@ -651,8 +657,17 @@ int edac_dev_register(struct device *parent, char >*name, >> } >> } >> >> + if (mem_repair_cnt) { >> + ctx->mem_repair = kcalloc(mem_repair_cnt, sizeof(*ctx- >>mem_repair), GFP_KERNEL); >> + if (!ctx->mem_repair) { >> + ret = -ENOMEM; >> + goto data_mem_free; >> + } >> + } >> + >> attr_gcnt = 0; >> scrub_cnt = 0; >> + mem_repair_cnt = 0; >> for (feat = 0; feat < num_features; feat++, ras_features++) { >> switch (ras_features->ft_type) { >> case RAS_FEAT_SCRUB: >> @@ -686,6 +701,23 @@ int edac_dev_register(struct device *parent, char >*name, >> >> attr_gcnt += ras_features->ecs_info.num_media_frus; >> break; >> + case RAS_FEAT_MEM_REPAIR: >> + if (!ras_features->mem_repair_ops || >> + mem_repair_cnt != ras_features->instance) >> + goto data_mem_free; >> + >> + dev_data = &ctx->mem_repair[mem_repair_cnt]; >> + dev_data->instance = mem_repair_cnt; >> + dev_data->mem_repair_ops = ras_features- >>mem_repair_ops; >> + dev_data->private = ras_features->ctx; >> + ret = edac_mem_repair_get_desc(parent, >&ras_attr_groups[attr_gcnt], >> + ras_features->instance); >> + if (ret) >> + goto data_mem_free; >> + >> + mem_repair_cnt++; >> + attr_gcnt++; >> + break; >> default: >> ret = -EINVAL; >> goto data_mem_free; >> @@ -712,6 +744,7 @@ int edac_dev_register(struct device *parent, char >*name, >> return devm_add_action_or_reset(parent, edac_dev_unreg, &ctx->dev); >> >> data_mem_free: >> + kfree(ctx->mem_repair); >> kfree(ctx->scrub); >> groups_free: >> kfree(ras_attr_groups); >> diff --git a/drivers/edac/mem_repair.c b/drivers/edac/mem_repair.c >> new file mode 100755 >> index 000000000000..e7439fd26c41 >> --- /dev/null >> +++ b/drivers/edac/mem_repair.c >> @@ -0,0 +1,492 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* >> + * The generic EDAC memory repair driver is designed to control the memory >> + * devices with memory repair features, such as Post Package Repair (PPR), >> + * memory sparing etc. The common sysfs memory repair interface abstracts >> + * the control of various arbitrary memory repair functionalities into a >> + * unified set of functions. >> + * >> + * Copyright (c) 2024 HiSilicon Limited. >> + */ >> + >> +#include <linux/edac.h> >> + >> +enum edac_mem_repair_attributes { >> + MEM_REPAIR_FUNCTION, >> + MEM_REPAIR_PERSIST_MODE, >> + MEM_REPAIR_DPA_SUPPORT, >> + MEM_REPAIR_SAFE_IN_USE, >> + MEM_REPAIR_HPA, >> + MEM_REPAIR_MIN_HPA, >> + MEM_REPAIR_MAX_HPA, >> + MEM_REPAIR_DPA, >> + MEM_REPAIR_MIN_DPA, >> + MEM_REPAIR_MAX_DPA, >> + MEM_REPAIR_NIBBLE_MASK, >> + MEM_REPAIR_MIN_NIBBLE_MASK, >> + MEM_REPAIR_MAX_NIBBLE_MASK, >> + MEM_REPAIR_BANK_GROUP, >> + MEM_REPAIR_MIN_BANK_GROUP, >> + MEM_REPAIR_MAX_BANK_GROUP, >> + MEM_REPAIR_BANK, >> + MEM_REPAIR_MIN_BANK, >> + MEM_REPAIR_MAX_BANK, >> + MEM_REPAIR_RANK, >> + MEM_REPAIR_MIN_RANK, >> + MEM_REPAIR_MAX_RANK, >> + MEM_REPAIR_ROW, >> + MEM_REPAIR_MIN_ROW, >> + MEM_REPAIR_MAX_ROW, >> + MEM_REPAIR_COLUMN, >> + MEM_REPAIR_MIN_COLUMN, >> + MEM_REPAIR_MAX_COLUMN, >> + MEM_REPAIR_CHANNEL, >> + MEM_REPAIR_MIN_CHANNEL, >> + MEM_REPAIR_MAX_CHANNEL, >> + MEM_REPAIR_SUB_CHANNEL, >> + MEM_REPAIR_MIN_SUB_CHANNEL, >> + MEM_REPAIR_MAX_SUB_CHANNEL, >> + MEM_DO_REPAIR, >> + MEM_REPAIR_MAX_ATTRS >> +}; >> + >> +struct edac_mem_repair_dev_attr { >> + struct device_attribute dev_attr; >> + u8 instance; >> +}; >> + >> +struct edac_mem_repair_context { >> + char name[EDAC_FEAT_NAME_LEN]; >> + struct edac_mem_repair_dev_attr >mem_repair_dev_attr[MEM_REPAIR_MAX_ATTRS]; >> + struct attribute *mem_repair_attrs[MEM_REPAIR_MAX_ATTRS + 1]; >> + struct attribute_group group; >> +}; >> + >> +#define TO_MEM_REPAIR_DEV_ATTR(_dev_attr) \ >> + container_of(_dev_attr, struct edac_mem_repair_dev_attr, >dev_attr) >> + >> +#define EDAC_MEM_REPAIR_ATTR_SHOW(attrib, cb, type, format) > \ >> +static ssize_t attrib##_show(struct device *ras_feat_dev, > \ >> + struct device_attribute *attr, char *buf) > \ >> +{ > \ >> + u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance; > \ >> + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); > \ >> + const struct edac_mem_repair_ops *ops = > \ >> + ctx->mem_repair[inst].mem_repair_ops; > \ >> + type data; > \ >> + int ret; \ >> + > \ >> + ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, > \ >> + &data); > \ >> + if (ret) \ >> + return ret; > \ >> + > \ >> + return sysfs_emit(buf, format, data); > \ >> +} >> + >> +EDAC_MEM_REPAIR_ATTR_SHOW(repair_function, get_repair_function, >u32, "%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(persist_mode, get_persist_mode, u32, >"%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(dpa_support, get_dpa_support, u32, >"%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(repair_safe_when_in_use, >get_repair_safe_when_in_use, u32, "%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(hpa, get_hpa, u64, "0x%llx\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(min_hpa, get_min_hpa, u64, "0x%llx\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(max_hpa, get_max_hpa, u64, "0x%llx\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(dpa, get_dpa, u64, "0x%llx\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(min_dpa, get_min_dpa, u64, "0x%llx\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(max_dpa, get_max_dpa, u64, "0x%llx\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(nibble_mask, get_nibble_mask, u64, >"0x%llx\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(min_nibble_mask, get_min_nibble_mask, >u64, "0x%llx\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(max_nibble_mask, >get_max_nibble_mask, u64, "0x%llx\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(bank_group, get_bank_group, u32, >"%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank_group, get_min_bank_group, >u32, "%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank_group, get_max_bank_group, >u32, "%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(bank, get_bank, u32, "%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank, get_min_bank, u32, "%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank, get_max_bank, u32, "%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(rank, get_rank, u32, "%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(min_rank, get_min_rank, u32, "%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(max_rank, get_max_rank, u32, "%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(row, get_row, u64, "0x%llx\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(min_row, get_min_row, u64, "0x%llx\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(max_row, get_max_row, u64, "0x%llx\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(column, get_column, u32, "%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(min_column, get_min_column, u32, >"%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(max_column, get_max_column, u32, >"%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(channel, get_channel, u32, "%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(min_channel, get_min_channel, u32, >"%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(max_channel, get_max_channel, u32, >"%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(sub_channel, get_sub_channel, u32, >"%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(min_sub_channel, get_min_sub_channel, >u32, "%u\n") >> +EDAC_MEM_REPAIR_ATTR_SHOW(max_sub_channel, >get_max_sub_channel, u32, "%u\n") >> + >> +#define EDAC_MEM_REPAIR_ATTR_STORE(attrib, cb, type, conv_func) > \ >> +static ssize_t attrib##_store(struct device *ras_feat_dev, > \ >> + struct device_attribute *attr, > \ >> + const char *buf, size_t len) \ >> +{ > \ >> + u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance; > \ >> + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); > \ >> + const struct edac_mem_repair_ops *ops = > \ >> + ctx->mem_repair[inst].mem_repair_ops; > \ >> + type data; > \ >> + int ret; \ >> + > \ >> + ret = conv_func(buf, 0, &data); > \ >> + if (ret < 0) > \ >> + return ret; > \ >> + > \ >> + ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, > \ >> + data); > \ >> + if (ret) \ >> + return ret; > \ >> + > \ >> + return len; > \ >> +} >> + >> +EDAC_MEM_REPAIR_ATTR_STORE(persist_mode, set_persist_mode, >unsigned long, kstrtoul) >> +EDAC_MEM_REPAIR_ATTR_STORE(hpa, set_hpa, u64, kstrtou64) >> +EDAC_MEM_REPAIR_ATTR_STORE(dpa, set_dpa, u64, kstrtou64) >> +EDAC_MEM_REPAIR_ATTR_STORE(nibble_mask, set_nibble_mask, u64, >kstrtou64) >> +EDAC_MEM_REPAIR_ATTR_STORE(bank_group, set_bank_group, unsigned >long, kstrtoul) >> +EDAC_MEM_REPAIR_ATTR_STORE(bank, set_bank, unsigned long, kstrtoul) >> +EDAC_MEM_REPAIR_ATTR_STORE(rank, set_rank, unsigned long, kstrtoul) >> +EDAC_MEM_REPAIR_ATTR_STORE(row, set_row, u64, kstrtou64) >> +EDAC_MEM_REPAIR_ATTR_STORE(column, set_column, unsigned long, >kstrtoul) >> +EDAC_MEM_REPAIR_ATTR_STORE(channel, set_channel, unsigned long, >kstrtoul) >> +EDAC_MEM_REPAIR_ATTR_STORE(sub_channel, set_sub_channel, unsigned >long, kstrtoul) >> + >> +#define EDAC_MEM_REPAIR_DO_OP(attrib, cb) > \ >> +static ssize_t attrib##_store(struct device *ras_feat_dev, > \ >> + struct device_attribute *attr, > \ >> + const char *buf, size_t len) > \ >> +{ > \ >> + u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance; > \ >> + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); > \ >> + const struct edac_mem_repair_ops *ops = ctx- >>mem_repair[inst].mem_repair_ops; \ >> + unsigned long data; > \ >> + int ret; > \ >> + > \ >> + ret = kstrtoul(buf, 0, &data); > \ >> + if (ret < 0) > \ >> + return ret; > \ >> + > \ >> + ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, >data); \ >> + if (ret) > \ >> + return ret; > \ >> + > \ >> + return len; > \ >> +} >> + >> +EDAC_MEM_REPAIR_DO_OP(repair, do_repair) >> + >> +static umode_t mem_repair_attr_visible(struct kobject *kobj, struct attribute >*a, int attr_id) >> +{ >> + struct device *ras_feat_dev = kobj_to_dev(kobj); >> + struct device_attribute *dev_attr = container_of(a, struct >device_attribute, attr); >> + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); >> + u8 inst = TO_MEM_REPAIR_DEV_ATTR(dev_attr)->instance; >> + const struct edac_mem_repair_ops *ops = ctx- >>mem_repair[inst].mem_repair_ops; >> + >> + switch (attr_id) { >> + case MEM_REPAIR_FUNCTION: >> + if (ops->get_repair_function) >> + return a->mode; >> + break; >> + case MEM_REPAIR_PERSIST_MODE: >> + if (ops->get_persist_mode) { >> + if (ops->set_persist_mode) >> + return a->mode; >> + else >> + return 0444; >> + } >> + break; >> + case MEM_REPAIR_DPA_SUPPORT: >> + if (ops->get_dpa_support) >> + return a->mode; >> + break; >> + case MEM_REPAIR_SAFE_IN_USE: >> + if (ops->get_repair_safe_when_in_use) >> + return a->mode; >> + break; >> + case MEM_REPAIR_HPA: >> + if (ops->get_hpa) { >> + if (ops->set_hpa) >> + return a->mode; >> + else >> + return 0444; >> + } >> + break; >> + case MEM_REPAIR_MIN_HPA: >> + if (ops->get_min_hpa) >> + return a->mode; >> + break; >> + case MEM_REPAIR_MAX_HPA: >> + if (ops->get_max_hpa) >> + return a->mode; >> + break; >> + case MEM_REPAIR_DPA: >> + if (ops->get_dpa) { >> + if (ops->set_dpa) >> + return a->mode; >> + else >> + return 0444; >> + } >> + break; >> + case MEM_REPAIR_MIN_DPA: >> + if (ops->get_min_dpa) >> + return a->mode; >> + break; >> + case MEM_REPAIR_MAX_DPA: >> + if (ops->get_max_dpa) >> + return a->mode; >> + break; >> + case MEM_REPAIR_NIBBLE_MASK: >> + if (ops->get_nibble_mask) { >> + if (ops->set_nibble_mask) >> + return a->mode; >> + else >> + return 0444; >> + } >> + break; >> + case MEM_REPAIR_MIN_NIBBLE_MASK: >> + if (ops->get_min_nibble_mask) >> + return a->mode; >> + break; >> + case MEM_REPAIR_MAX_NIBBLE_MASK: >> + if (ops->get_max_nibble_mask) >> + return a->mode; >> + break; >> + case MEM_REPAIR_BANK_GROUP: >> + if (ops->get_bank_group) { >> + if (ops->set_bank_group) >> + return a->mode; >> + else >> + return 0444; >> + } >> + break; >> + case MEM_REPAIR_MIN_BANK_GROUP: >> + if (ops->get_min_bank_group) >> + return a->mode; >> + break; >> + case MEM_REPAIR_MAX_BANK_GROUP: >> + if (ops->get_max_bank_group) >> + return a->mode; >> + break; >> + case MEM_REPAIR_BANK: >> + if (ops->get_bank) { >> + if (ops->set_bank) >> + return a->mode; >> + else >> + return 0444; >> + } >> + break; >> + case MEM_REPAIR_MIN_BANK: >> + if (ops->get_min_bank) >> + return a->mode; >> + break; >> + case MEM_REPAIR_MAX_BANK: >> + if (ops->get_max_bank) >> + return a->mode; >> + break; >> + case MEM_REPAIR_RANK: >> + if (ops->get_rank) { >> + if (ops->set_rank) >> + return a->mode; >> + else >> + return 0444; >> + } >> + break; >> + case MEM_REPAIR_MIN_RANK: >> + if (ops->get_min_rank) >> + return a->mode; >> + break; >> + case MEM_REPAIR_MAX_RANK: >> + if (ops->get_max_rank) >> + return a->mode; >> + break; >> + case MEM_REPAIR_ROW: >> + if (ops->get_row) { >> + if (ops->set_row) >> + return a->mode; >> + else >> + return 0444; >> + } >> + break; >> + case MEM_REPAIR_MIN_ROW: >> + if (ops->get_min_row) >> + return a->mode; >> + break; >> + case MEM_REPAIR_MAX_ROW: >> + if (ops->get_max_row) >> + return a->mode; >> + break; >> + case MEM_REPAIR_COLUMN: >> + if (ops->get_column) { >> + if (ops->set_column) >> + return a->mode; >> + else >> + return 0444; >> + } >> + break; >> + case MEM_REPAIR_MIN_COLUMN: >> + if (ops->get_min_column) >> + return a->mode; >> + break; >> + case MEM_REPAIR_MAX_COLUMN: >> + if (ops->get_max_column) >> + return a->mode; >> + break; >> + case MEM_REPAIR_CHANNEL: >> + if (ops->get_channel) { >> + if (ops->set_channel) >> + return a->mode; >> + else >> + return 0444; >> + } >> + break; >> + case MEM_REPAIR_MIN_CHANNEL: >> + if (ops->get_min_channel) >> + return a->mode; >> + break; >> + case MEM_REPAIR_MAX_CHANNEL: >> + if (ops->get_max_channel) >> + return a->mode; >> + break; >> + case MEM_REPAIR_SUB_CHANNEL: >> + if (ops->get_sub_channel) { >> + if (ops->set_sub_channel) >> + return a->mode; >> + else >> + return 0444; >> + } >> + break; >> + case MEM_REPAIR_MIN_SUB_CHANNEL: >> + if (ops->get_min_sub_channel) >> + return a->mode; >> + break; >> + case MEM_REPAIR_MAX_SUB_CHANNEL: >> + if (ops->get_max_sub_channel) >> + return a->mode; >> + break; >> + case MEM_DO_REPAIR: >> + if (ops->do_repair) >> + return a->mode; >> + break; >> + default: >> + break; >> + } >> + >> + return 0; >> +} >> + >> +#define EDAC_MEM_REPAIR_ATTR_RO(_name, _instance) \ >> + ((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RO(_name), >\ >> + .instance = _instance }) >> + >> +#define EDAC_MEM_REPAIR_ATTR_WO(_name, _instance) \ >> + ((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_WO(_name), >\ >> + .instance = _instance }) >> + >> +#define EDAC_MEM_REPAIR_ATTR_RW(_name, _instance) \ >> + ((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RW(_name), >\ >> + .instance = _instance }) >> + >> +static int mem_repair_create_desc(struct device *dev, >> + const struct attribute_group **attr_groups, >> + u8 instance) >> +{ >> + struct edac_mem_repair_context *ctx; >> + struct attribute_group *group; >> + int i; >> + struct edac_mem_repair_dev_attr dev_attr[] = { >> + [MEM_REPAIR_FUNCTION] = >EDAC_MEM_REPAIR_ATTR_RO(repair_function, >> + instance), >> + [MEM_REPAIR_PERSIST_MODE] = >> + EDAC_MEM_REPAIR_ATTR_RW(persist_mode, >instance), >> + [MEM_REPAIR_DPA_SUPPORT] = >> + EDAC_MEM_REPAIR_ATTR_RO(dpa_support, >instance), >> + [MEM_REPAIR_SAFE_IN_USE] = >> + > EDAC_MEM_REPAIR_ATTR_RO(repair_safe_when_in_use, >> + instance), >> + [MEM_REPAIR_HPA] = EDAC_MEM_REPAIR_ATTR_RW(hpa, >instance), >> + [MEM_REPAIR_MIN_HPA] = >EDAC_MEM_REPAIR_ATTR_RO(min_hpa, instance), >> + [MEM_REPAIR_MAX_HPA] = >EDAC_MEM_REPAIR_ATTR_RO(max_hpa, instance), >> + [MEM_REPAIR_DPA] = EDAC_MEM_REPAIR_ATTR_RW(dpa, >instance), >> + [MEM_REPAIR_MIN_DPA] = >EDAC_MEM_REPAIR_ATTR_RO(min_dpa, instance), >> + [MEM_REPAIR_MAX_DPA] = >EDAC_MEM_REPAIR_ATTR_RO(max_dpa, instance), >> + [MEM_REPAIR_NIBBLE_MASK] = >> + EDAC_MEM_REPAIR_ATTR_RW(nibble_mask, >instance), >> + [MEM_REPAIR_MIN_NIBBLE_MASK] = >> + > EDAC_MEM_REPAIR_ATTR_RO(min_nibble_mask, instance), >> + [MEM_REPAIR_MAX_NIBBLE_MASK] = >> + > EDAC_MEM_REPAIR_ATTR_RO(max_nibble_mask, instance), >> + [MEM_REPAIR_BANK_GROUP] = >> + EDAC_MEM_REPAIR_ATTR_RW(bank_group, >instance), >> + [MEM_REPAIR_MIN_BANK_GROUP] = >> + > EDAC_MEM_REPAIR_ATTR_RO(min_bank_group, instance), >> + [MEM_REPAIR_MAX_BANK_GROUP] = >> + > EDAC_MEM_REPAIR_ATTR_RO(max_bank_group, instance), >> + [MEM_REPAIR_BANK] = EDAC_MEM_REPAIR_ATTR_RW(bank, >instance), >> + [MEM_REPAIR_MIN_BANK] = >EDAC_MEM_REPAIR_ATTR_RO(min_bank, instance), >> + [MEM_REPAIR_MAX_BANK] = >EDAC_MEM_REPAIR_ATTR_RO(max_bank, instance), >> + [MEM_REPAIR_RANK] = EDAC_MEM_REPAIR_ATTR_RW(rank, >instance), >> + [MEM_REPAIR_MIN_RANK] = >EDAC_MEM_REPAIR_ATTR_RO(min_rank, instance), >> + [MEM_REPAIR_MAX_RANK] = >EDAC_MEM_REPAIR_ATTR_RO(max_rank, instance), >> + [MEM_REPAIR_ROW] = EDAC_MEM_REPAIR_ATTR_RW(row, >instance), >> + [MEM_REPAIR_MIN_ROW] = >EDAC_MEM_REPAIR_ATTR_RO(min_row, instance), >> + [MEM_REPAIR_MAX_ROW] = >EDAC_MEM_REPAIR_ATTR_RO(max_row, instance), >> + [MEM_REPAIR_COLUMN] = >EDAC_MEM_REPAIR_ATTR_RW(column, instance), >> + [MEM_REPAIR_MIN_COLUMN] = >EDAC_MEM_REPAIR_ATTR_RO(min_column, instance), >> + [MEM_REPAIR_MAX_COLUMN] = >EDAC_MEM_REPAIR_ATTR_RO(max_column, instance), >> + [MEM_REPAIR_CHANNEL] = >EDAC_MEM_REPAIR_ATTR_RW(channel, instance), >> + [MEM_REPAIR_MIN_CHANNEL] = >EDAC_MEM_REPAIR_ATTR_RO(min_channel, instance), >> + [MEM_REPAIR_MAX_CHANNEL] = >EDAC_MEM_REPAIR_ATTR_RO(max_channel, instance), >> + [MEM_REPAIR_SUB_CHANNEL] = >> + EDAC_MEM_REPAIR_ATTR_RW(sub_channel, >instance), >> + [MEM_REPAIR_MIN_SUB_CHANNEL] = >> + > EDAC_MEM_REPAIR_ATTR_RO(min_sub_channel, instance), >> + [MEM_REPAIR_MAX_SUB_CHANNEL] = >> + > EDAC_MEM_REPAIR_ATTR_RO(max_sub_channel, instance), >> + [MEM_DO_REPAIR] = EDAC_MEM_REPAIR_ATTR_WO(repair, >instance) >> + }; >> + >> + ctx = devm_kzalloc(dev, sizeof(*ctx), GFP_KERNEL); >> + if (!ctx) >> + return -ENOMEM; >> + >> + for (i = 0; i < MEM_REPAIR_MAX_ATTRS; i++) { >> + memcpy(&ctx->mem_repair_dev_attr[i].dev_attr, >> + &dev_attr[i], sizeof(dev_attr[i])); >> + ctx->mem_repair_attrs[i] = >> + &ctx->mem_repair_dev_attr[i].dev_attr.attr; >> + } >> + >> + sprintf(ctx->name, "%s%d", "mem_repair", instance); >> + group = &ctx->group; >> + group->name = ctx->name; >> + group->attrs = ctx->mem_repair_attrs; >> + group->is_visible = mem_repair_attr_visible; >> + attr_groups[0] = group; >> + >> + return 0; >> +} >> + >> +/** >> + * edac_mem_repair_get_desc - get EDAC memory repair descriptors >> + * @dev: client device with memory repair feature >> + * @attr_groups: pointer to attribute group container >> + * @instance: device's memory repair instance number. >> + * >> + * Return: >> + * * %0 - Success. >> + * * %-EINVAL - Invalid parameters passed. >> + * * %-ENOMEM - Dynamic memory allocation failed. >> + */ >> +int edac_mem_repair_get_desc(struct device *dev, >> + const struct attribute_group **attr_groups, u8 >instance) >> +{ >> + if (!dev || !attr_groups) >> + return -EINVAL; >> + >> + return mem_repair_create_desc(dev, attr_groups, instance); >> +} >> diff --git a/include/linux/edac.h b/include/linux/edac.h >> index 979e91426701..5d07192bf1a7 100644 >> --- a/include/linux/edac.h >> +++ b/include/linux/edac.h >> @@ -668,6 +668,7 @@ static inline struct dimm_info *edac_get_dimm(struct >mem_ctl_info *mci, >> enum edac_dev_feat { >> RAS_FEAT_SCRUB, >> RAS_FEAT_ECS, >> + RAS_FEAT_MEM_REPAIR, >> RAS_FEAT_MAX >> }; >> >> @@ -729,11 +730,147 @@ int edac_ecs_get_desc(struct device *ecs_dev, >> const struct attribute_group **attr_groups, >> u16 num_media_frus); >> >> +enum edac_mem_repair_function { >> + EDAC_SOFT_PPR, >> + EDAC_HARD_PPR, >> + EDAC_CACHELINE_MEM_SPARING, >> + EDAC_ROW_MEM_SPARING, >> + EDAC_BANK_MEM_SPARING, >> + EDAC_RANK_MEM_SPARING, >> +}; >> + >> +enum edac_mem_repair_persist_mode { >> + EDAC_MEM_REPAIR_SOFT, /* soft memory repair */ >> + EDAC_MEM_REPAIR_HARD, /* hard memory repair */ >> +}; >> + >> +enum edac_mem_repair_cmd { >> + EDAC_DO_MEM_REPAIR = 1, >> +}; >> + >> +/** >> + * struct edac_mem_repair_ops - memory repair operations >> + * (all elements are optional except do_repair, set_hpa/set_dpa) >> + * @get_repair_function: get the memory repair function, listed in >> + * enum edac_mem_repair_function. >> + * @get_persist_mode: get the current persist mode. Persist repair modes >supported >> + * in the device is based on the memory repair function which >is >> + * temporary or permanent and is lost with a power cycle. >> + * EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary >repair). >> + * EDAC_MEM_REPAIR_HARD - Hard memory repair function >(permanent repair). >> + * All other values are reserved. >> + * @set_persist_mode: set the persist mode of the memory repair instance. >> + * @get_dpa_support: get dpa support flag. In some states of system >configuration >> + * (e.g. before address decoders have been configured), >memory devices >> + * (e.g. CXL) may not have an active mapping in the main host >address >> + * physical address map. As such, the memory to repair must be >identified >> + * by a device specific physical addressing scheme using a >device physical >> + * address(DPA). The DPA and other control attributes to use for >the >> + * dry_run and repair operations will be presented in related >error records. >> + * @get_repair_safe_when_in_use: get whether memory media is accessible >and >> + * data is retained during repair operation. >> + * @get_hpa: get current host physical address (HPA). >> + * @set_hpa: set host physical address (HPA) of memory to repair. >> + * @get_min_hpa: get the minimum supported host physical address (HPA). >> + * @get_max_hpa: get the maximum supported host physical address (HPA). >> + * @get_dpa: get current device physical address (DPA). >> + * @set_dpa: set device physical address (DPA) of memory to repair. >> + * @get_min_dpa: get the minimum supported device physical address >(DPA). >> + * @get_max_dpa: get the maximum supported device physical address >(DPA). >> + * @get_nibble_mask: get current nibble mask. >> + * @set_nibble_mask: set nibble mask of memory to repair. >> + * @get_min_nibble_mask: get the minimum supported nibble mask. >> + * @get_max_nibble_mask: get the maximum supported nibble mask. >> + * @get_bank_group: get current bank group. >> + * @set_bank_group: set bank group of memory to repair. >> + * @get_min_bank_group: get the minimum supported bank group. >> + * @get_max_bank_group: get the maximum supported bank group. >> + * @get_bank: get current bank. >> + * @set_bank: set bank of memory to repair. >> + * @get_min_bank: get the minimum supported bank. >> + * @get_max_bank: get the maximum supported bank. >> + * @get_rank: get current rank. >> + * @set_rank: set rank of memory to repair. >> + * @get_min_rank: get the minimum supported rank. >> + * @get_max_rank: get the maximum supported rank. >> + * @get_row: get current row. >> + * @set_row: set row of memory to repair. >> + * @get_min_row: get the minimum supported row. >> + * @get_max_row: get the maximum supported row. >> + * @get_column: get current column. >> + * @set_column: set column of memory to repair. >> + * @get_min_column: get the minimum supported column. >> + * @get_max_column: get the maximum supported column. >> + * @get_channel: get current channel. >> + * @set_channel: set channel of memory to repair. >> + * @get_min_channel: get the minimum supported channel. >> + * @get_max_channel: get the maximum supported channel. >> + * @get_sub_channel: get current sub channel. >> + * @set_sub_channel: set sub channel of memory to repair. >> + * @get_min_sub_channel: get the minimum supported sub channel. >> + * @get_max_sub_channel: get the maximum supported sub channel. >> + * @do_repair: Issue memory repair operation for the HPA/DPA and >> + * other control attributes set for the memory to repair. >> + */ >> +struct edac_mem_repair_ops { >> + int (*get_repair_function)(struct device *dev, void *drv_data, u32 *val); >> + int (*get_persist_mode)(struct device *dev, void *drv_data, u32 >*mode); >> + int (*set_persist_mode)(struct device *dev, void *drv_data, u32 mode); >> + int (*get_dpa_support)(struct device *dev, void *drv_data, u32 *val); >> + int (*get_repair_safe_when_in_use)(struct device *dev, void *drv_data, >u32 *val); >> + int (*get_hpa)(struct device *dev, void *drv_data, u64 *hpa); >> + int (*set_hpa)(struct device *dev, void *drv_data, u64 hpa); >> + int (*get_min_hpa)(struct device *dev, void *drv_data, u64 *hpa); >> + int (*get_max_hpa)(struct device *dev, void *drv_data, u64 *hpa); >> + int (*get_dpa)(struct device *dev, void *drv_data, u64 *dpa); >> + int (*set_dpa)(struct device *dev, void *drv_data, u64 dpa); >> + int (*get_min_dpa)(struct device *dev, void *drv_data, u64 *dpa); >> + int (*get_max_dpa)(struct device *dev, void *drv_data, u64 *dpa); >> + int (*get_nibble_mask)(struct device *dev, void *drv_data, u64 *val); >> + int (*set_nibble_mask)(struct device *dev, void *drv_data, u64 val); >> + int (*get_min_nibble_mask)(struct device *dev, void *drv_data, u64 >*val); >> + int (*get_max_nibble_mask)(struct device *dev, void *drv_data, u64 >*val); >> + int (*get_bank_group)(struct device *dev, void *drv_data, u32 *val); >> + int (*set_bank_group)(struct device *dev, void *drv_data, u32 val); >> + int (*get_min_bank_group)(struct device *dev, void *drv_data, u32 >*val); >> + int (*get_max_bank_group)(struct device *dev, void *drv_data, u32 >*val); >> + int (*get_bank)(struct device *dev, void *drv_data, u32 *val); >> + int (*set_bank)(struct device *dev, void *drv_data, u32 val); >> + int (*get_min_bank)(struct device *dev, void *drv_data, u32 *val); >> + int (*get_max_bank)(struct device *dev, void *drv_data, u32 *val); >> + int (*get_rank)(struct device *dev, void *drv_data, u32 *val); >> + int (*set_rank)(struct device *dev, void *drv_data, u32 val); >> + int (*get_min_rank)(struct device *dev, void *drv_data, u32 *val); >> + int (*get_max_rank)(struct device *dev, void *drv_data, u32 *val); >> + int (*get_row)(struct device *dev, void *drv_data, u64 *val); >> + int (*set_row)(struct device *dev, void *drv_data, u64 val); >> + int (*get_min_row)(struct device *dev, void *drv_data, u64 *val); >> + int (*get_max_row)(struct device *dev, void *drv_data, u64 *val); >> + int (*get_column)(struct device *dev, void *drv_data, u32 *val); >> + int (*set_column)(struct device *dev, void *drv_data, u32 val); >> + int (*get_min_column)(struct device *dev, void *drv_data, u32 *val); >> + int (*get_max_column)(struct device *dev, void *drv_data, u32 *val); >> + int (*get_channel)(struct device *dev, void *drv_data, u32 *val); >> + int (*set_channel)(struct device *dev, void *drv_data, u32 val); >> + int (*get_min_channel)(struct device *dev, void *drv_data, u32 *val); >> + int (*get_max_channel)(struct device *dev, void *drv_data, u32 *val); >> + int (*get_sub_channel)(struct device *dev, void *drv_data, u32 *val); >> + int (*set_sub_channel)(struct device *dev, void *drv_data, u32 val); >> + int (*get_min_sub_channel)(struct device *dev, void *drv_data, u32 >*val); >> + int (*get_max_sub_channel)(struct device *dev, void *drv_data, u32 >*val); >> + int (*do_repair)(struct device *dev, void *drv_data, u32 val); >> +}; >> + >> +int edac_mem_repair_get_desc(struct device *dev, >> + const struct attribute_group **attr_groups, >> + u8 instance); >> + >> /* EDAC device feature information structure */ >> struct edac_dev_data { >> union { >> const struct edac_scrub_ops *scrub_ops; >> const struct edac_ecs_ops *ecs_ops; >> + const struct edac_mem_repair_ops *mem_repair_ops; >> }; >> u8 instance; >> void *private; >> @@ -744,6 +881,7 @@ struct edac_dev_feat_ctx { >> void *private; >> struct edac_dev_data *scrub; >> struct edac_dev_data ecs; >> + struct edac_dev_data *mem_repair; >> }; >> >> struct edac_dev_feature { >> @@ -752,6 +890,7 @@ struct edac_dev_feature { >> union { >> const struct edac_scrub_ops *scrub_ops; >> const struct edac_ecs_ops *ecs_ops; >> + const struct edac_mem_repair_ops *mem_repair_ops; >> }; >> void *ctx; >> struct edac_ecs_ex_info ecs_info; > >Thanks, >Mauro Thanks, Shiju
Em Thu, 9 Jan 2025 14:24:33 +0000 Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu: > On Thu, 9 Jan 2025 13:32:22 +0100 > Borislav Petkov <bp@alien8.de> wrote: > > Hi Boris, > > > On Thu, Jan 09, 2025 at 11:00:43AM +0000, Shiju Jose wrote: > > > The min_ and max_ attributes of the control attributes are added for your > > > feedback on V15 to expose supported ranges of these control attributes to the user, > > > in the following links. > > > > Sure, but you can make that differently: > > > > cat /sys/bus/edac/devices/<dev-name>/mem_repairX/bank > > [x:y] > > > > which is the allowed range. > > To my thinking that would fail the test of being an intuitive interface. > To issue a repair command requires that multiple attributes be configured > before triggering the actual repair. > > Think of it as setting the coordinates of the repair in a high dimensional > space. > > In the extreme case of fine grained repair (Cacheline), to identify the > relevant subunit of memory (obtained from the error record that we are > basing the decision to repair on) we need to specify all of: > > Channel, sub-channel, rank, bank group, row, column and nibble mask. > For coarser granularity repair only specify a subset of these applies and > only the relevant controls are exposed to userspace. > > They are broken out as specific attributes to enable each to be set before > triggering the action with a write to the repair attribute. > > There are several possible alternatives: > > Option 1 > > "A:B:C:D:E:F:G:H:I:J" opaque single write to trigger the repair where > each number is providing one of those coordinates and where a readback > let's us known what each number is. > > That single attribute interface is very hard to extend in an intuitive way. > > History tell us more levels will be introduced in the middle, not just > at the finest granularity, making such an interface hard to extend in > a backwards compatible way. > > Another alternative of a key value list would make for a nasty sysfs > interface. > > Option 2 > There are sysfs interfaces that use a selection type presentation. > > Write: "C", Read: "A, B, [C], D" but that only works well for discrete sets > of options and is a pain to parse if read back is necessary. Writing it as: a b [c] d or even: a, b, [c], d doesn't make it hard to be parse on userspace. Adding a comma makes Kernel code a little bigger, as it needs an extra check at the loop to check if the line is empty or not: if (*tmp != '\0') *tmp += snprintf(", ") Btwm we have an implementation like that on kernelspace/userspace for the RC API: - Kernelspace: https://github.com/torvalds/linux/blob/master/drivers/media/rc/rc-main.c#L1125 6 lines of code + a const table with names/values, if we use the same example for EDAC: const char *name[] = { "foo", "bar" }; for (i = 0; i < ARRAY_SIZE(names); i++) { if (enabled & names[i].type) tmp += sprintf(tmp, "[%s] ", names[i].name); else if (allowed & proto_names[i].type) tmp += sprintf(tmp, "%s ", names[i].name); } - Userspace: https://git.linuxtv.org/v4l-utils.git/tree/utils/keytable/keytable.c#n197 5 lines of code + a const table, if we use the same example for ras-daemon: const char *name[] = { [EDAC_FOO] = "[foo]", [EDAC_BAR] = "[bar]", }; for (p = strtok(arg, " ,"); p; p = strtok(NULL, " ,")) for (i = 0; i < ARRAY_SIZE(name); i++) if (!strcasecmp(p, name[i]) return i; return -1; (strtok handles both space and commas at the above example) IMO, this is a lot better, as the alternative would be to have separate sysfs nodes to describe what values are valid for a given edac devnode. See, userspace needs to know what values are valid for a given device and support for it may vary depending on the Kernel and device version. So, we need to have the information about what values are valid stored on some sysfs devnode, to allow backward compatibility. > > So in conclusion, I think the proposed multiple sysfs attribute style > with them reading back the most recent value written is the least bad > solution to a complex control interface. > > > > > echo ... > > > > then writes in the bank. > > > > > ... so we would propose we do not add max_ and min_ for now and see how the > > > use cases evolve. > > > > Yes, you should apply that same methodology to the rest of the new features > > you're adding: only add functionality for the stuff that is actually being > > used now. You can always extend it later. > > > > Changing an already user-visible API is a whole different story and a lot lot > > harder, even impossible. > > > > So I'd suggest you prune the EDAC patches from all the hypothetical usage and > > then send only what remains so that I can try to queue them. > > Sure. In this case the addition of min/max was perhaps a wrong response to > your request for a way to those ranges rather than just rejecting a write > of something out of range as earlier version did. > > We can revisit in future if range discovery becomes necessary. Personally > I don't think it is given we are only taking these actions in response error > records that give us precisely what to write and hence are always in range. For RO devnodes, there's no need for ranges, but those are likely needed for RW, as otherwise userspace may try to write invalid requests and/or have backward-compatibility issues. Thanks, Mauro
Em Thu, 9 Jan 2025 16:01:59 +0000 Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu: > > My goal here was to make this user-friendly. Because you need some way of > > knowing what valid ranges are and in order to trigger the repair, if it needs > > to happen for a range. IMO, user-friendly is important, as it allows people to manually use the feature. This is interesting for debugging purposes and also to test if some hardware is doing the right thing. Ok, in practice, production will use an userspace tool like rasdaemon, and/or some scripts [1]. [1] I'd say that rasdaemon should have an initialization phase to discover capabilities that can be discovered. As an example, rasdaemon could, for instance, reserve some sparing memory at init time, if the hardware (partially) supports it. For instance, maybe a CXL device could not be able to handle rank-mem-sparing, but it can handle bank-mem-sparing. > In at least the CXL case I'm fairly sure most of them are not discoverable. > Until you see errors you have no idea what the memory topology is. Sure, but some things can be discovered in advance, like what CXL scrubbing features are supported by a given hardware. If the hardware supports detecting ranges for row/bank/rank sparing, it would be nice to have this reported in a way that userspace can properly set it at OS init time, if desired by the sysadmins. > > Or, you can teach the repair logic to ignore invalid ranges and "clamp" things > > to whatever makes sense. > > For that you'd need to have a path to read back what happened. If sysfs is RW, you have it there already after committing the value set. > > Again, I'm looking at it from the usability perspective. I haven't actually > > needed this scrub+repair functionality yet to know whether the UI makes sense. > > So yeah, collecting some feedback from real-life use cases would probably give > > you a lot better understanding of how that UI should be designed... perhaps > > you won't ever need the ranges, whow knows. > > > > So yes, preemptively designing stuff like that "in the dark" is kinda hard. > > :-) > > The discoverability is unnecessary for any known usecase. > > Ok. Then can we just drop the range discoverability entirely or we go with > your suggestion and do not support read back of what has been > requested but instead have the reads return a range if known or "" / > return -EONOTSUPP if simply not known? It sounds to be that ranges are needed at least to setup mem sparing. > I can live with that though to me we are heading in the direction of > a less intuitive interface to save a small number of additional files. > > Jonathan > > > > Thanks, Mauro
On Tue, 14 Jan 2025 13:38:31 +0100 Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote: > Em Thu, 9 Jan 2025 14:24:33 +0000 > Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu: > > > On Thu, 9 Jan 2025 13:32:22 +0100 > > Borislav Petkov <bp@alien8.de> wrote: > > > > Hi Boris, > > > > > On Thu, Jan 09, 2025 at 11:00:43AM +0000, Shiju Jose wrote: > > > > The min_ and max_ attributes of the control attributes are added for your > > > > feedback on V15 to expose supported ranges of these control attributes to the user, > > > > in the following links. > > > > > > Sure, but you can make that differently: > > > > > > cat /sys/bus/edac/devices/<dev-name>/mem_repairX/bank > > > [x:y] > > > > > > which is the allowed range. > > > > To my thinking that would fail the test of being an intuitive interface. > > To issue a repair command requires that multiple attributes be configured > > before triggering the actual repair. > > > > Think of it as setting the coordinates of the repair in a high dimensional > > space. > > > > In the extreme case of fine grained repair (Cacheline), to identify the > > relevant subunit of memory (obtained from the error record that we are > > basing the decision to repair on) we need to specify all of: > > > > Channel, sub-channel, rank, bank group, row, column and nibble mask. > > For coarser granularity repair only specify a subset of these applies and > > only the relevant controls are exposed to userspace. > > > > They are broken out as specific attributes to enable each to be set before > > triggering the action with a write to the repair attribute. > > > > There are several possible alternatives: > > > > Option 1 > > > > "A:B:C:D:E:F:G:H:I:J" opaque single write to trigger the repair where > > each number is providing one of those coordinates and where a readback > > let's us known what each number is. > > > > That single attribute interface is very hard to extend in an intuitive way. > > > > History tell us more levels will be introduced in the middle, not just > > at the finest granularity, making such an interface hard to extend in > > a backwards compatible way. > > > > Another alternative of a key value list would make for a nasty sysfs > > interface. > > > > Option 2 > > There are sysfs interfaces that use a selection type presentation. > > > > Write: "C", Read: "A, B, [C], D" but that only works well for discrete sets > > of options and is a pain to parse if read back is necessary. > > Writing it as: > > a b [c] d > > or even: > a, b, [c], d > > doesn't make it hard to be parse on userspace. Adding a comma makes > Kernel code a little bigger, as it needs an extra check at the loop > to check if the line is empty or not: > > if (*tmp != '\0') > *tmp += snprintf(", ") > > Btwm we have an implementation like that on kernelspace/userspace for > the RC API: > > - Kernelspace: > https://github.com/torvalds/linux/blob/master/drivers/media/rc/rc-main.c#L1125 > 6 lines of code + a const table with names/values, if we use the same example > for EDAC: > > const char *name[] = { "foo", "bar" }; > > for (i = 0; i < ARRAY_SIZE(names); i++) { > if (enabled & names[i].type) > tmp += sprintf(tmp, "[%s] ", names[i].name); > else if (allowed & proto_names[i].type) > tmp += sprintf(tmp, "%s ", names[i].name); > } > > > - Userspace: > https://git.linuxtv.org/v4l-utils.git/tree/utils/keytable/keytable.c#n197 > 5 lines of code + a const table, if we use the same example > for ras-daemon: > > const char *name[] = { > [EDAC_FOO] = "[foo]", > [EDAC_BAR] = "[bar]", > }; > > for (p = strtok(arg, " ,"); p; p = strtok(NULL, " ,")) > for (i = 0; i < ARRAY_SIZE(name); i++) > if (!strcasecmp(p, name[i]) > return i; > return -1; > > (strtok handles both space and commas at the above example) > > IMO, this is a lot better, as the alternative would be to have separate > sysfs nodes to describe what values are valid for a given edac devnode. > > See, userspace needs to know what values are valid for a given > device and support for it may vary depending on the Kernel and > device version. So, we need to have the information about what values > are valid stored on some sysfs devnode, to allow backward compatibility. These aren't selectors from a discrete list so the question is more whether a syntax of <min> value <max> is intuitive or not. I'm not aware of precedence for this one. There was another branch of the thread where Boris mentioned this as an option. It isn't bad to deal with and an easy change to the code, but I have an open question on what choice we make for representing unknown min / max. For separate files the absence of the file indicates we don't have any information. > > > > > So in conclusion, I think the proposed multiple sysfs attribute style > > with them reading back the most recent value written is the least bad > > solution to a complex control interface. > > > > > > > > echo ... > > > > > > then writes in the bank. > > > > > > > ... so we would propose we do not add max_ and min_ for now and see how the > > > > use cases evolve. > > > > > > Yes, you should apply that same methodology to the rest of the new features > > > you're adding: only add functionality for the stuff that is actually being > > > used now. You can always extend it later. > > > > > > Changing an already user-visible API is a whole different story and a lot lot > > > harder, even impossible. > > > > > > So I'd suggest you prune the EDAC patches from all the hypothetical usage and > > > then send only what remains so that I can try to queue them. > > > > Sure. In this case the addition of min/max was perhaps a wrong response to > > your request for a way to those ranges rather than just rejecting a write > > of something out of range as earlier version did. > > > > We can revisit in future if range discovery becomes necessary. Personally > > I don't think it is given we are only taking these actions in response error > > records that give us precisely what to write and hence are always in range. > > For RO devnodes, there's no need for ranges, but those are likely needed for > RW, as otherwise userspace may try to write invalid requests and/or have > backward-compatibility issues. Given these parameters are only meaningfully written with values coming ultimately from error records, userspace should never consider writing something that is out of range except during testing. I don't mind presenting the range where known (in CXL case it is not discoverable for most of them) but I wouldn't expect tooling to ever read it as known correct values to write come from the error records. Checking those values against provided limits seems an unnecessary step given an invalid parameter that slips through will be rejected by the hardware anyway. Jonathan > > > Thanks, > Mauro
Em Thu, 9 Jan 2025 18:34:48 +0000 Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu: > On Thu, 9 Jan 2025 17:19:02 +0100 > Borislav Petkov <bp@alien8.de> wrote: > > > > But then why do you even need the interface at all? > > > > Why can't the kernel automatically collect all those attributes and start the > > scrubbing automatically - no need for any user interaction...? Implementing policies at Kernelspace is a very bad idea. See, to properly implement scrubbing and memory sparing policies, one needs to have knowledge not only about the current Kernel lifetime (which may be a recent boot due to a Kernel upgrade), but also for events that happened in the past months/years. On other words, a database is needed to know if a memory error was just a random issue due to high cosmic ray activities (so, a soft PPR would be used - just in case), or if it is due to some memory region that it is known to have past problems, probably an indication of a a hardware issue - in which case a hard PPR would be used instead. If this were meant to be done automatically, CXL wouldn't need to send events about that to the OSPM. Also, different usecases may require different policies. So, better to let an userspace daemon to handle policies, and use sysfs for such daemon to to setup the hardware. Thanks, Mauro
Em Mon, 6 Jan 2025 12:10:00 +0000 <shiju.jose@huawei.com> escreveu: > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function > +Date: Jan 2025 > +KernelVersion: 6.14 > +Contact: linux-edac@vger.kernel.org > +Description: > + (RO) Memory repair function type. For eg. post package repair, > + memory sparing etc. > + EDAC_SOFT_PPR - Soft post package repair > + EDAC_HARD_PPR - Hard post package repair > + EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing > + EDAC_ROW_MEM_SPARING - Row memory sparing > + EDAC_BANK_MEM_SPARING - Bank memory sparing > + EDAC_RANK_MEM_SPARING - Rank memory sparing > + All other values are reserved. > + > +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode > +Date: Jan 2025 > +KernelVersion: 6.14 > +Contact: linux-edac@vger.kernel.org > +Description: > + (RW) Read/Write the current persist repair mode set for a > + repair function. Persist repair modes supported in the > + device, based on the memory repair function is temporary > + or permanent and is lost with a power cycle. > + EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair). > + EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair). > + All other values are reserved. > + After re-reading some things, I suspect that the above can be simplified a little bit by folding soft/hard PPR into a single element at /repair_function, and letting it clearer that persist_mode is valid only for PPR (I think this is the case, right?), e.g. something like: What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function ... Description: (RO) Memory repair function type. For e.g. post package repair, memory sparing etc. Valid values are: - ppr - post package repair. Please define its mode via /sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode - cacheline-sparing - Cacheline memory sparing - row-sparing - Row memory sparing - bank-sparing - Bank memory sparing - rank-sparing - Rank memory sparing - All other values are reserved. and define persist_mode in a different way: What: /sys/bus/edac/devices/<dev-name>/mem_repairX/ppr_persist_mode ... Description: (RW) Read/Write the current persist repair (PPR) mode set for a post package repair function. Persist repair modes supported in the device, based on the memory repair function is temporary or permanent and is lost with a power cycle. Valid values are: - repair-soft - Soft PPR function (temporary repair). - repair-hard - Hard memory repair function (permanent repair). - All other values are reserved. Thanks, Mauro
Em Tue, 14 Jan 2025 12:31:44 +0000 Shiju Jose <shiju.jose@huawei.com> escreveu: > Hi Mauro, > > Thanks for the comments. > > >-----Original Message----- > >From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> > >Sent: 14 January 2025 11:48 > >To: Shiju Jose <shiju.jose@huawei.com> > >Cc: linux-edac@vger.kernel.org; linux-cxl@vger.kernel.org; linux- > >acpi@vger.kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org; > >bp@alien8.de; tony.luck@intel.com; rafael@kernel.org; lenb@kernel.org; > >mchehab@kernel.org; dan.j.williams@intel.com; dave@stgolabs.net; Jonathan > >Cameron <jonathan.cameron@huawei.com>; dave.jiang@intel.com; > >alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com; > >david@redhat.com; Vilas.Sridharan@amd.com; leo.duran@amd.com; > >Yazen.Ghannam@amd.com; rientjes@google.com; jiaqiyan@google.com; > >Jon.Grimm@amd.com; dave.hansen@linux.intel.com; > >naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com; > >somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com; > >duenwen@google.com; gthelen@google.com; > >wschwartz@amperecomputing.com; dferguson@amperecomputing.com; > >wbs@os.amperecomputing.com; nifan.cxl@gmail.com; tanxiaofei > ><tanxiaofei@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Roberto > >Sassu <roberto.sassu@huawei.com>; kangkang.shen@futurewei.com; > >wanghuiqiang <wanghuiqiang@huawei.com>; Linuxarm > ><linuxarm@huawei.com> > >Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature > > > >Em Mon, 6 Jan 2025 12:10:00 +0000 > ><shiju.jose@huawei.com> escreveu: > > > >> From: Shiju Jose <shiju.jose@huawei.com> > >> > >> Add a generic EDAC memory repair control driver to manage memory repairs > >> in the system, such as CXL Post Package Repair (PPR) and CXL memory sparing > >> features. > >> > >> For example, a CXL device with DRAM components that support PPR features > >> may implement PPR maintenance operations. DRAM components may support > >two > >> types of PPR, hard PPR, for a permanent row repair, and soft PPR, for a > >> temporary row repair. Soft PPR is much faster than hard PPR, but the repair > >> is lost with a power cycle. > >> Similarly a CXL memory device may support soft and hard memory sparing at > >> cacheline, row, bank and rank granularities. Memory sparing is defined as > >> a repair function that replaces a portion of memory with a portion of > >> functional memory at that same granularity. > >> When a CXL device detects an error in a memory, it may report the host of > >> the need for a repair maintenance operation by using an event record where > >> the "maintenance needed" flag is set. The event records contains the device > >> physical address(DPA) and other attributes of the memory to repair (such as > >> channel, sub-channel, bank group, bank, rank, row, column etc). The kernel > >> will report the corresponding CXL general media or DRAM trace event to > >> userspace, and userspace tools (e.g. rasdaemon) will initiate a repair > >> operation in response to the device request via the sysfs repair control. > >> > >> Device with memory repair features registers with EDAC device driver, > >> which retrieves memory repair descriptor from EDAC memory repair driver > >> and exposes the sysfs repair control attributes to userspace in > >> /sys/bus/edac/devices/<dev-name>/mem_repairX/. > >> > >> The common memory repair control interface abstracts the control of > >> arbitrary memory repair functionality into a standardized set of functions. > >> The sysfs memory repair attribute nodes are only available if the client > >> driver has implemented the corresponding attribute callback function and > >> provided operations to the EDAC device driver during registration. > >> > >> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> > >> --- > >> .../ABI/testing/sysfs-edac-memory-repair | 244 +++++++++ > >> Documentation/edac/features.rst | 3 + > >> Documentation/edac/index.rst | 1 + > >> Documentation/edac/memory_repair.rst | 101 ++++ > >> drivers/edac/Makefile | 2 +- > >> drivers/edac/edac_device.c | 33 ++ > >> drivers/edac/mem_repair.c | 492 ++++++++++++++++++ > >> include/linux/edac.h | 139 +++++ > >> 8 files changed, 1014 insertions(+), 1 deletion(-) > >> create mode 100644 Documentation/ABI/testing/sysfs-edac-memory-repair > >> create mode 100644 Documentation/edac/memory_repair.rst > >> create mode 100755 drivers/edac/mem_repair.c > >> > >> diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair > >b/Documentation/ABI/testing/sysfs-edac-memory-repair > >> new file mode 100644 > >> index 000000000000..e9268f3780ed > >> --- /dev/null > >> +++ b/Documentation/ABI/testing/sysfs-edac-memory-repair > >> @@ -0,0 +1,244 @@ > >> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX > >> +Date: Jan 2025 > >> +KernelVersion: 6.14 > >> +Contact: linux-edac@vger.kernel.org > >> +Description: > >> + The sysfs EDAC bus devices /<dev-name>/mem_repairX > >subdirectory > >> + pertains to the memory media repair features control, such as > >> + PPR (Post Package Repair), memory sparing etc, where<dev- > >name> > >> + directory corresponds to a device registered with the EDAC > >> + device driver for the memory repair features. > >> + > >> + Post Package Repair is a maintenance operation requests the > >memory > >> + device to perform a repair operation on its media, in detail is a > >> + memory self-healing feature that fixes a failing memory > >location by > >> + replacing it with a spare row in a DRAM device. For example, a > >> + CXL memory device with DRAM components that support PPR > >features may > >> + implement PPR maintenance operations. DRAM components > >may support > >> + two types of PPR functions: hard PPR, for a permanent row > >repair, and > >> + soft PPR, for a temporary row repair. soft PPR is much faster > >than > >> + hard PPR, but the repair is lost with a power cycle. > >> + > >> + Memory sparing is a repair function that replaces a portion > >> + of memory with a portion of functional memory at that same > >> + sparing granularity. Memory sparing has > >cacheline/row/bank/rank > >> + sparing granularities. For example, in memory-sparing mode, > >> + one memory rank serves as a spare for other ranks on the same > >> + channel in case they fail. The spare rank is held in reserve and > >> + not used as active memory until a failure is indicated, with > >> + reserved capacity subtracted from the total available memory > >> + in the system.The DIMM installation order for memory sparing > >> + varies based on the number of processors and memory modules > >> + installed in the server. After an error threshold is surpassed > >> + in a system protected by memory sparing, the content of a > >failing > >> + rank of DIMMs is copied to the spare rank. The failing rank is > >> + then taken offline and the spare rank placed online for use as > >> + active memory in place of the failed rank. > >> + > >> + The sysfs attributes nodes for a repair feature are only > >> + present if the parent driver has implemented the corresponding > >> + attr callback function and provided the necessary operations > >> + to the EDAC device driver during registration. > >> + > >> + In some states of system configuration (e.g. before address > >> + decoders have been configured), memory devices (e.g. CXL) > >> + may not have an active mapping in the main host address > >> + physical address map. As such, the memory to repair must be > >> + identified by a device specific physical addressing scheme > >> + using a device physical address(DPA). The DPA and other control > >> + attributes to use will be presented in related error records. > >> + > >> +What: /sys/bus/edac/devices/<dev- > >name>/mem_repairX/repair_function > >> +Date: Jan 2025 > >> +KernelVersion: 6.14 > >> +Contact: linux-edac@vger.kernel.org > >> +Description: > >> + (RO) Memory repair function type. For eg. post package repair, > >> + memory sparing etc. > >> + EDAC_SOFT_PPR - Soft post package repair > >> + EDAC_HARD_PPR - Hard post package repair > >> + EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing > >> + EDAC_ROW_MEM_SPARING - Row memory sparing > >> + EDAC_BANK_MEM_SPARING - Bank memory sparing > >> + EDAC_RANK_MEM_SPARING - Rank memory sparing > >> + All other values are reserved. > > > >Too big strings. Why are them in upper cases? IMO: > > > > soft-ppr, hard-ppr, ... would be enough. > > > Here return repair type (single value, such as 0, 1, or 2 etc not as decoded string for eg."EDAC_SOFT_PPR") > of the memory repair instance, which is defined as enums (EDAC_SOFT_PPR, EDAC_HARD_PPR, ... etc) > for the memory repair interface in the include/linux/edac.h. > > enum edac_mem_repair_function { > EDAC_SOFT_PPR, > EDAC_HARD_PPR, > EDAC_CACHELINE_MEM_SPARING, > EDAC_ROW_MEM_SPARING, > EDAC_BANK_MEM_SPARING, > EDAC_RANK_MEM_SPARING, > }; > > I documented return value in terms of the above enums. The ABI documentation describes exactly what numeric/strings values will be there. So, if you place: EDAC_SOFT_PPR It means a string with EDAC_SOFT_PPR, not a numeric zero value. Also, as I explained at: https://lore.kernel.org/linux-edac/1bf421f9d1924d68860d08c70829a705@huawei.com/T/#m1e60da13198b47701a4c2f740d4b78701f912d2d it doesn't make sense to report soft/hard PPR, as the persist mode is designed to be on a different sysfs devnode (/persist_mode on your proposal). So, here you need to fold EDAC_SOFT_PPR and EDAC_HARD_PPR into a single value ("ppr"). - Btw, very few sysfs nodes use numbers for things that can be mapped with enums: $ git grep -l "\- 0" Documentation/ABI|wc -l 20 (several of those are actually false-positives) and this is done mostly when it reports what the hardware actually outputs when reading some register. Thanks, Mauro
>-----Original Message----- >From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> >Sent: 14 January 2025 13:47 >To: Shiju Jose <shiju.jose@huawei.com> >Cc: linux-edac@vger.kernel.org; linux-cxl@vger.kernel.org; linux- >acpi@vger.kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org; >bp@alien8.de; tony.luck@intel.com; rafael@kernel.org; lenb@kernel.org; >mchehab@kernel.org; dan.j.williams@intel.com; dave@stgolabs.net; Jonathan >Cameron <jonathan.cameron@huawei.com>; dave.jiang@intel.com; >alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com; >david@redhat.com; Vilas.Sridharan@amd.com; leo.duran@amd.com; >Yazen.Ghannam@amd.com; rientjes@google.com; jiaqiyan@google.com; >Jon.Grimm@amd.com; dave.hansen@linux.intel.com; >naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com; >somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com; >duenwen@google.com; gthelen@google.com; >wschwartz@amperecomputing.com; dferguson@amperecomputing.com; >wbs@os.amperecomputing.com; nifan.cxl@gmail.com; tanxiaofei ><tanxiaofei@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Roberto >Sassu <roberto.sassu@huawei.com>; kangkang.shen@futurewei.com; >wanghuiqiang <wanghuiqiang@huawei.com>; Linuxarm ><linuxarm@huawei.com> >Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature > >Em Mon, 6 Jan 2025 12:10:00 +0000 ><shiju.jose@huawei.com> escreveu: > >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/repair_function >> +Date: Jan 2025 >> +KernelVersion: 6.14 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (RO) Memory repair function type. For eg. post package repair, >> + memory sparing etc. >> + EDAC_SOFT_PPR - Soft post package repair >> + EDAC_HARD_PPR - Hard post package repair >> + EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing >> + EDAC_ROW_MEM_SPARING - Row memory sparing >> + EDAC_BANK_MEM_SPARING - Bank memory sparing >> + EDAC_RANK_MEM_SPARING - Rank memory sparing >> + All other values are reserved. >> + >> +What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/persist_mode >> +Date: Jan 2025 >> +KernelVersion: 6.14 >> +Contact: linux-edac@vger.kernel.org >> +Description: >> + (RW) Read/Write the current persist repair mode set for a >> + repair function. Persist repair modes supported in the >> + device, based on the memory repair function is temporary >> + or permanent and is lost with a power cycle. >> + EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary >repair). >> + EDAC_MEM_REPAIR_HARD - Hard memory repair function >(permanent repair). >> + All other values are reserved. >> + > >After re-reading some things, I suspect that the above can be simplified a little >bit by folding soft/hard PPR into a single element at /repair_function, and letting >it clearer that persist_mode is valid only for PPR (I think this is the case, right?), >e.g. something like: persist_mode is valid for memory sparing features(atleast in CXL) as well. In the case of CXL memory sparing, host has option to request either soft or hard sparing in a flag when issue a memory sparing operation. > > What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/repair_function > ... > Description: > (RO) Memory repair function type. For e.g. post >package repair, > memory sparing etc. Valid values are: > > - ppr - post package repair. > Please define its mode via > /sys/bus/edac/devices/<dev- >name>/mem_repairX/persist_mode > - cacheline-sparing - Cacheline memory sparing > - row-sparing - Row memory sparing > - bank-sparing - Bank memory sparing > - rank-sparing - Rank memory sparing > - All other values are reserved. > >and define persist_mode in a different way: Note: For return as decoded strings instead of raw value, I need to add some extra callback function/s in the edac/memory_repair.c for these attributes and which will reduce the current level of optimization done to minimize the code size. > > What: /sys/bus/edac/devices/<dev- >name>/mem_repairX/ppr_persist_mode Same as above. persist_mode is needed for memory sparing feature too. > ... > Description: > (RW) Read/Write the current persist repair (PPR) mode set for a > post package repair function. Persist repair modes supported > in the device, based on the memory repair function is >temporary > or permanent and is lost with a power cycle. Valid values are: > > - repair-soft - Soft PPR function (temporary repair). > - repair-hard - Hard memory repair function (permanent >repair). > - All other values are reserved. > >Thanks, >Mauro Thanks, Shiju
Em Tue, 14 Jan 2025 13:05:37 +0000 Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu: > On Tue, 14 Jan 2025 13:38:31 +0100 > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote: > > > Em Thu, 9 Jan 2025 14:24:33 +0000 > > Jonathan Cameron <Jonathan.Cameron@huawei.com> escreveu: > > > > > On Thu, 9 Jan 2025 13:32:22 +0100 > > > Borislav Petkov <bp@alien8.de> wrote: > > > > > > Hi Boris, > > > > > > > On Thu, Jan 09, 2025 at 11:00:43AM +0000, Shiju Jose wrote: > > > > > The min_ and max_ attributes of the control attributes are added for your > > > > > feedback on V15 to expose supported ranges of these control attributes to the user, > > > > > in the following links. > > > > > > > > Sure, but you can make that differently: > > > > > > > > cat /sys/bus/edac/devices/<dev-name>/mem_repairX/bank > > > > [x:y] > > > > > > > > which is the allowed range. > > > > > > To my thinking that would fail the test of being an intuitive interface. > > > To issue a repair command requires that multiple attributes be configured > > > before triggering the actual repair. > > > > > > Think of it as setting the coordinates of the repair in a high dimensional > > > space. > > > > > > In the extreme case of fine grained repair (Cacheline), to identify the > > > relevant subunit of memory (obtained from the error record that we are > > > basing the decision to repair on) we need to specify all of: > > > > > > Channel, sub-channel, rank, bank group, row, column and nibble mask. > > > For coarser granularity repair only specify a subset of these applies and > > > only the relevant controls are exposed to userspace. > > > > > > They are broken out as specific attributes to enable each to be set before > > > triggering the action with a write to the repair attribute. > > > > > > There are several possible alternatives: > > > > > > Option 1 > > > > > > "A:B:C:D:E:F:G:H:I:J" opaque single write to trigger the repair where > > > each number is providing one of those coordinates and where a readback > > > let's us known what each number is. > > > > > > That single attribute interface is very hard to extend in an intuitive way. > > > > > > History tell us more levels will be introduced in the middle, not just > > > at the finest granularity, making such an interface hard to extend in > > > a backwards compatible way. > > > > > > Another alternative of a key value list would make for a nasty sysfs > > > interface. > > > > > > Option 2 > > > There are sysfs interfaces that use a selection type presentation. > > > > > > Write: "C", Read: "A, B, [C], D" but that only works well for discrete sets > > > of options and is a pain to parse if read back is necessary. > > > > Writing it as: > > > > a b [c] d > > > > or even: > > a, b, [c], d > > > > doesn't make it hard to be parse on userspace. Adding a comma makes > > Kernel code a little bigger, as it needs an extra check at the loop > > to check if the line is empty or not: > > > > if (*tmp != '\0') > > *tmp += snprintf(", ") > > > > Btwm we have an implementation like that on kernelspace/userspace for > > the RC API: > > > > - Kernelspace: > > https://github.com/torvalds/linux/blob/master/drivers/media/rc/rc-main.c#L1125 > > 6 lines of code + a const table with names/values, if we use the same example > > for EDAC: > > > > const char *name[] = { "foo", "bar" }; > > > > for (i = 0; i < ARRAY_SIZE(names); i++) { > > if (enabled & names[i].type) > > tmp += sprintf(tmp, "[%s] ", names[i].name); > > else if (allowed & proto_names[i].type) > > tmp += sprintf(tmp, "%s ", names[i].name); > > } > > > > > > - Userspace: > > https://git.linuxtv.org/v4l-utils.git/tree/utils/keytable/keytable.c#n197 > > 5 lines of code + a const table, if we use the same example > > for ras-daemon: > > > > const char *name[] = { > > [EDAC_FOO] = "[foo]", > > [EDAC_BAR] = "[bar]", > > }; > > > > for (p = strtok(arg, " ,"); p; p = strtok(NULL, " ,")) > > for (i = 0; i < ARRAY_SIZE(name); i++) > > if (!strcasecmp(p, name[i]) > > return i; > > return -1; > > > > (strtok handles both space and commas at the above example) > > > > IMO, this is a lot better, as the alternative would be to have separate > > sysfs nodes to describe what values are valid for a given edac devnode. > > > > See, userspace needs to know what values are valid for a given > > device and support for it may vary depending on the Kernel and > > device version. So, we need to have the information about what values > > are valid stored on some sysfs devnode, to allow backward compatibility. > > These aren't selectors from a discrete list so the question is more > whether a syntax of > <min> value <max> > is intuitive or not. I'm not aware of precedence for this one. From my side, I prefer having 3 separate sysfs nodes, as this is a very common practice. Doing it on a different way sounds an API violation, but if someone insists on dropping min/max, this can be argued at https://lore.kernel.org/linux-api/. On a very quick search: $ ./scripts/get_abi.pl search "\bmin.*max" I can't see any place using min and max at the same devnode. $ ./scripts/get_abi.pl search "\b(min|max)"|grep /sys/ |wc -l 234 So, it sounds to me that merging those into a single devnode is an API violation. > > There was another branch of the thread where Boris mentioned this as an > option. It isn't bad to deal with and an easy change to the code, > but I have an open question on what choice we make for representing > unknown min / max. For separate files the absence of the file > indicates we don't have any information. > > > > > > > > > > So in conclusion, I think the proposed multiple sysfs attribute style > > > with them reading back the most recent value written is the least bad > > > solution to a complex control interface. > > > > > > > > > > > echo ... > > > > > > > > then writes in the bank. > > > > > > > > > ... so we would propose we do not add max_ and min_ for now and see how the > > > > > use cases evolve. > > > > > > > > Yes, you should apply that same methodology to the rest of the new features > > > > you're adding: only add functionality for the stuff that is actually being > > > > used now. You can always extend it later. > > > > > > > > Changing an already user-visible API is a whole different story and a lot lot > > > > harder, even impossible. > > > > > > > > So I'd suggest you prune the EDAC patches from all the hypothetical usage and > > > > then send only what remains so that I can try to queue them. > > > > > > Sure. In this case the addition of min/max was perhaps a wrong response to > > > your request for a way to those ranges rather than just rejecting a write > > > of something out of range as earlier version did. > > > > > > We can revisit in future if range discovery becomes necessary. Personally > > > I don't think it is given we are only taking these actions in response error > > > records that give us precisely what to write and hence are always in range. > > > > For RO devnodes, there's no need for ranges, but those are likely needed for > > RW, as otherwise userspace may try to write invalid requests and/or have > > backward-compatibility issues. > > Given these parameters are only meaningfully written with values coming > ultimately from error records, userspace should never consider writing > something that is out of range except during testing. > > I don't mind presenting the range where known (in CXL case it is not > discoverable for most of them) but I wouldn't expect tooling to ever > read it as known correct values to write come from the error records. > Checking those values against provided limits seems an unnecessary step > given an invalid parameter that slips through will be rejected by the > hardware anyway. I'm fine starting without min/max if there's no current usecase, provided that: 1. when needed, we add min/max as separate devnodes; 2. there won't be any backward issues when min/max gets added. Regards, Mauro
Jonathan Cameron wrote: > On Fri, 10 Jan 2025 14:49:03 -0800 > Dan Williams <dan.j.williams@intel.com> wrote: [..] > > This is where you lose me. The error record is a point in time snapshot > > of the SPA:HPA:DPA:<proprietary internal "DIMM" mapping>. The security > > model for memory operations is based on coordinating with the kernel's > > understanding of how that SPA is *currently* being used. > > Whilst it is being used I agree. Key is to only do disruptive / data > changing actions when it is not being used. Sure. > > The kernel can not just take userspace's word for it that potentially > > data changing, or temporary loss-of-use operations are safe to execute > > just because once upon a time userspace saw an error record that > > implicated a given SPA in the past, especially over reboot. > > There are two cases (discoverable from hardware) > > 1) Non disruptive. No security concern as the device guarantees to > not interrupt traffic and the memory contents is copied to the new > location. Basically software never knows it happened. > 2) Disruptive. We only allow this if the memory is offline. In the CXL case > the CXL specific code must check no memory on the device is online so > we aren't disrupting anything. The other implementation we have code > for (will post after this lands) has finer granularity constraints and only > the page needs to be offline. > As it is offline the content is not preserved anyway. We may need to add extra > constraints along with future support for temporal persistence / sharing but > we can do that as part of adding that support in general. > (Personally I think in those cases memory repair is a job for the out of > band management anyway). > > In neither case am I seeing a security concern. Am I missing something? s/security/system-integrity/ 1/ Hardware engineers may have a different definition of "non-disuptive" than software. See the history around hibernate_quiet_exec() to work around the disruption of latency spikes. If this is poorly specified across vendors we are going to wish that we did not build a "take userspace's word for it" interface. 2/ Yes, if the device is not actively decoding any in use memory feel free to run destructive operations on the device. However, is sysfs the right interface for "submit multi-parameter atomic operation with transient result"? I lean heavily into sysfs, but ioctl and netlink have a role to play in scenarios like this. Otherwise userspace can inject error records back into the kernel with the expectation that the kernel can only accept the DIMM address and not any of the translation data which might be stale. [..] > > Again, the repair control assumes that the kernel can just trust > > userspace to get it right. When the kernel knows the SPA implications it > > can add safety like "you are going to issue sparing on deviceA that will > > temporarily take deviceA offline. CXL subsystem tells me deviceA is > > interleaved with deviceB in SPA so the whole SPA range needs to be > > offline before this operation proceeds". That is not someting that > > userspace can reliably coordinate. > > Absolutely he kernel has to enforce this. Same way we protect against > poison injection in some cases. Right now the enforcement is slightly > wrong (Shiju is looking at it again) as we were enforcing at wrong > granularity (specific dpa, not device). Identifying that hole is a good > outcome of this discussion making us take another look. > > Enforcing this is one of the key jobs of the CXL specific driver. > We considered doing it in the core, but the granularity differences > between our initial few examples meant we decided on specific driver > implementations of the checks for now. Which specific driver? Is this not just a callback provided via the EDAC registration interface to say "sparing allowed"? Yes, this needs to avoid the midlayer mistake, but I expect more CXL memory exporting devices can live with the CXL core's determination that HDM decode is live or not. > > > > 3/ What if the device does not use DDR terminology / topology terms for > > > > repair? > > > > > > Then we provide the additional interfaces assuming the correspond to well > > > known terms. If someone is using a magic key then we can get grumpy > > > with them, but that can also be supported. > > > > > > Mostly I'd expect a new technology to overlap a lot of the existing > > > interface and maybe add one or two more; which layer in the stack for > > > HBM for instance. > > > > The concern is the assertion that sysfs needs to care about all these > > parameters vs an ABI that says "repair errorX". If persistence and > > validity of error records is the concern lets build an ABI for that and > > not depend upon trust in userspace to properly coordinate memory > > integrity concerns. > > It doesn't have to. It just has to ensure that the memory device is in the correct > state. So check, not coordinate. At a larger scale, coordination is already doable > (subject to races that we must avoid by holding locks), tear down the regions > so there are no mappings on the device you want to repair. Don't bring them > up again until after you are done. > > The main use case is probably do it before you bring the mappings up, but > same result. Agree. > > > > > > > > > The main alternative is where the device takes an HPA / SPA / DPA. We have one > > > driver that does that queued up behind this series that uses HPA. PPR uses > > > DPA. In that case userspace first tries to see if it can repair by HPA then > > > DPA and if not moves on to see if it it can use the fuller description. > > > We will see devices supporting HPA / DPA (which to use depends on when you > > > are doing the operation and what has been configured) but mostly I'd expect > > > either HPA/DPA or fine grained on a given repair instance. > > > > > > HPA only works if the address decoders are always configured (so not on CXL) > > > What is actually happening in that case is typically that a firmware is > > > involved that can look up address decoders etc, and map the control HPA > > > to Bank / row etc to issue the actual low level commands. This keeps > > > the memory mapping completely secret rather than exposing it in error > > > records. > > > > > > > > > > > I expect the flow rasdaemon would want is that the current PFA (leaky > > > > bucket Pre-Failure Analysis) decides that the number of soft-offlines it > > > > has performed exceeds some threshold and it wants to attempt to repair > > > > memory. > > > > > > Sparing may happen prior to point where we'd have done a soft offline > > > if non disruptive (whether it is can be read from another bit of the > > > ABI). Memory repair might be much less disruptive than soft-offline! > > > I rather hope memory manufacturers build that, but I'm aware of at least > > > one case where they didn't and the memory must be offline. > > > > That's a good point, spare before offline makes sense. > > If transparent an resources not constrained. > Very much not if we have to tear down the memory first. > > > > > [..] > > > However, there are other usecases where this isn't needed which is why > > > that isn't a precursor for this series. > > > > > > Initial enablement targets two situations: > > > 1) Repair can be done in non disruptive way - no need to soft offline at all. > > > > Modulo needing to quiesce access over the sparing event? > > Absolutely. This is only doable in devices that don't need to quiesce. > > > > > > 2) Repair can be done at boot before memory is onlined or on admin > > > action to take the whole region offline, then repair specific chunks of > > > memory before bringing it back online. > > > > Which is userspace racing the kernel to online memory? > > If you are doing this scheme you don't automatically online memory. So > both are in userspace control and can be easily sequenced. > If you aren't auto onlining then buy devices with hard PPR and do it by offlining > manually, repairing and rebooting. Or buy devices that don't need to quiecse > and cross your fingers the dodgy ram doesn't throw an error before you get > that far. Little choice if you decide to online right at the start as normal > memory. > > > > > > > So, yes, +1 to simpler for now where software effectively just needs to > > > > deal with a handful of "region repair" buttons and the semantics of > > > > those are coarse and sub-optimal. Wait for a future where a tool author > > > > says, "we have had good success getting bulk offlined pages back into > > > > service, but now we need this specific finer grained kernel interface to > > > > avoid wasting spare banks prematurely". > > > > > > Depends on where you think that interface is. I can absolutely see that > > > as a control to RAS Daemon. Option 2 above, region is offline, repair > > > all dodgy looking fine grained buckets. > > > > > > Note though that a suboptimal repair may mean permanent use of very rare > > > resources. So there needs to be a control a the finest granularity as well. > > > Which order those get added to userspace tools doesn't matter to me. > > > > > > If you mean that interface in kernel it brings some non trivial requirements. > > > The kernel would need all of: > > > 1) Tracking interface for all error records so the reverse map from region > > > to specific bank / row etc is available for a subset of entries. The > > > kernel would need to know which of those are important (soft offline > > > might help in that use case, otherwise that means decision algorithms > > > are in kernel or we have fine grained queue for region repair in parallel > > > with soft-offline). > > > 2) A way to inject the reverse map information from a userspace store > > > (to deal with reboot etc). > > > > Not a way to inject the reverse map information, a way to inject the > > error records and assert that memory topology changes have not > > invalidated those records. > > There is no way to tell that the topology hasn't changed. > For the reasons above, I don't think we care. Instead of trying to stop > userspace reparing the wrong memory, make sure it is safe for it to do that. > (The kernel is rarely in the business of preventing the slightly stupid) If the policy is "error records with SPA from the current boot can be trusted" and "userspace requests outside of current boot error records must only be submitted to known offline" then I think we are aligned. > > > That sounds a lot harder to deal with than relying on the usespace program > > > that already does the tracking across boots. > > > > I am stuck behind the barrier of userspace must not assume it knows > > better than the kernel about the SPA impact of a DIMM sparing > > event. The kernel needs evidence either live records from within the > > same kernel boot or validated records from a previous boot. > > I think this is the wrong approach. The operation must be 'safe'. > With that in place we absolutely can let userspace assume it knows better than > the kernel. Violent agreement? Operation must be safe, yes, next what are the criteria for kernel management of safety. Offline-only repair is great place to be.
On Tue, 14 Jan 2025 11:35:21 -0800 Dan Williams <dan.j.williams@intel.com> wrote: > Jonathan Cameron wrote: > > On Fri, 10 Jan 2025 14:49:03 -0800 > > Dan Williams <dan.j.williams@intel.com> wrote: > [..] > > > This is where you lose me. The error record is a point in time snapshot > > > of the SPA:HPA:DPA:<proprietary internal "DIMM" mapping>. The security > > > model for memory operations is based on coordinating with the kernel's > > > understanding of how that SPA is *currently* being used. > > > > Whilst it is being used I agree. Key is to only do disruptive / data > > changing actions when it is not being used. > > Sure. > > > > The kernel can not just take userspace's word for it that potentially > > > data changing, or temporary loss-of-use operations are safe to execute > > > just because once upon a time userspace saw an error record that > > > implicated a given SPA in the past, especially over reboot. > > > > There are two cases (discoverable from hardware) > > > > 1) Non disruptive. No security concern as the device guarantees to > > not interrupt traffic and the memory contents is copied to the new > > location. Basically software never knows it happened. > > 2) Disruptive. We only allow this if the memory is offline. In the CXL case > > the CXL specific code must check no memory on the device is online so > > we aren't disrupting anything. The other implementation we have code > > for (will post after this lands) has finer granularity constraints and only > > the page needs to be offline. > > As it is offline the content is not preserved anyway. We may need to add extra > > constraints along with future support for temporal persistence / sharing but > > we can do that as part of adding that support in general. > > (Personally I think in those cases memory repair is a job for the out of > > band management anyway). > > > > In neither case am I seeing a security concern. Am I missing something? > > s/security/system-integrity/ > > 1/ Hardware engineers may have a different definition of "non-disuptive" > than software. See the history around hibernate_quiet_exec() to work > around the disruption of latency spikes. If this is poorly specified > across vendors we are going to wish that we did not build a "take > userspace's word for it" interface. Sure, but given the spec is fairly specific "CXL.mem requests are correctly preserved and data is retained" they will have to stay with latency requirements of the CXL system. If a device breaks these admittedly soft rules then we can add allow / deny lists or let userspace tooling handle that (which I'd prefer). Note that this stuff happens on some devices under the hood anyway, so I'd not expect the command driven case to be much worse (but I accept it might be). I'd also expect some very grumpy large purchasers of that device to get the firmware changed to not advertise that it is fine to do it live if the performance drop is large. Note this happens today under the hood on at least some (I think most?) servers. Have you ever seen a latency bug report from memory sparing? How often? The memory and controllers on CXL devices are going to be very similar in technology to host memory controllers so I can't see why they'd be much worse (subject of course to cheap and nasty maybe turning up - I doubt that will be common given the cost of the RAM behind these things). This applies just as much however we implement this interface. So a concern but not one specific to the 'how' part. > > 2/ Yes, if the device is not actively decoding any in use memory feel > free to run destructive operations on the device. However, is sysfs the > right interface for "submit multi-parameter atomic operation with > transient result"? I lean heavily into sysfs, but ioctl and netlink have > a role to play in scenarios like this. Otherwise userspace can inject > error records back into the kernel with the expectation that the kernel > can only accept the DIMM address and not any of the translation data > which might be stale. I'd argue yes to the sysfs question. This is common enough for other subsystems (e.g. CXL - I still do all tests with bash scripts!) and this really isn't a particularly complex interface. I'd expect very light weight tooling to be able to use it (though solutions to work out whether to do it are way more complex). It is a simple more or less self describing interface that directly exposes what controls are available. If we end up reinjecting error records (which I'm still pushing back strongly on because I am yet to see any reason to do so) then maybe sysfs is not the right interface. Pushing it to netlink or ioctl doesn't change the complexity of the interface, so I'm not sure there is any strong reason to do so. Files are cheap and easy to use from all manner of simple scripting. I think from later discussion below that we don't need to reinject which is nice! > > [..] > > > Again, the repair control assumes that the kernel can just trust > > > userspace to get it right. When the kernel knows the SPA implications it > > > can add safety like "you are going to issue sparing on deviceA that will > > > temporarily take deviceA offline. CXL subsystem tells me deviceA is > > > interleaved with deviceB in SPA so the whole SPA range needs to be > > > offline before this operation proceeds". That is not someting that > > > userspace can reliably coordinate. > > > > Absolutely he kernel has to enforce this. Same way we protect against > > poison injection in some cases. Right now the enforcement is slightly > > wrong (Shiju is looking at it again) as we were enforcing at wrong > > granularity (specific dpa, not device). Identifying that hole is a good > > outcome of this discussion making us take another look. > > > > Enforcing this is one of the key jobs of the CXL specific driver. > > We considered doing it in the core, but the granularity differences > > between our initial few examples meant we decided on specific driver > > implementations of the checks for now. > > Which specific driver? Is this not just a callback provided via the EDAC > registration interface to say "sparing allowed"? That would be a reasonable interface to add. There was push back in earlier reviews on an explicit query so we backed away from this sort of thing. The question IIRC was what was the point in querying given we could just try it and rapidly see an error if it can't. There is a callback to say it is disruptive but it is an exercise for userspace to figure out what it needs to do to allow the request to succeed. The granularity varies across the two sparing implementations I have access to, so this interface would just be a 'can I do it now?' Whether that is helpful given userspace will have to be involved in getting that answer to be 'yes' is not entirely clear to me. So interesting thought, but I'm not yet convinced this isn't a userspace problem + the suck it and see call to find out the answer. > > Yes, this needs to avoid the midlayer mistake, but I expect more CXL > memory exporting devices can live with the CXL core's determination that > HDM decode is live or not. I think this is valid for the ones that advertise that they need the traffic to stop. From later in discussion I think we are aligned on that. > > > > > > 3/ What if the device does not use DDR terminology / topology terms for > > > > > repair? > > > > > > > > Then we provide the additional interfaces assuming the correspond to well > > > > known terms. If someone is using a magic key then we can get grumpy > > > > with them, but that can also be supported. > > > > > > > > Mostly I'd expect a new technology to overlap a lot of the existing > > > > interface and maybe add one or two more; which layer in the stack for > > > > HBM for instance. > > > > > > The concern is the assertion that sysfs needs to care about all these > > > parameters vs an ABI that says "repair errorX". If persistence and > > > validity of error records is the concern lets build an ABI for that and > > > not depend upon trust in userspace to properly coordinate memory > > > integrity concerns. > > > > It doesn't have to. It just has to ensure that the memory device is in the correct > > state. So check, not coordinate. At a larger scale, coordination is already doable > > (subject to races that we must avoid by holding locks), tear down the regions > > so there are no mappings on the device you want to repair. Don't bring them > > up again until after you are done. > > > > The main use case is probably do it before you bring the mappings up, but > > same result. > > Agree. > > > > > > > > > > > > > > The main alternative is where the device takes an HPA / SPA / DPA. We have one > > > > driver that does that queued up behind this series that uses HPA. PPR uses > > > > DPA. In that case userspace first tries to see if it can repair by HPA then > > > > DPA and if not moves on to see if it it can use the fuller description. > > > > We will see devices supporting HPA / DPA (which to use depends on when you > > > > are doing the operation and what has been configured) but mostly I'd expect > > > > either HPA/DPA or fine grained on a given repair instance. > > > > > > > > HPA only works if the address decoders are always configured (so not on CXL) > > > > What is actually happening in that case is typically that a firmware is > > > > involved that can look up address decoders etc, and map the control HPA > > > > to Bank / row etc to issue the actual low level commands. This keeps > > > > the memory mapping completely secret rather than exposing it in error > > > > records. > > > > > > > > > > > > > > I expect the flow rasdaemon would want is that the current PFA (leaky > > > > > bucket Pre-Failure Analysis) decides that the number of soft-offlines it > > > > > has performed exceeds some threshold and it wants to attempt to repair > > > > > memory. > > > > > > > > Sparing may happen prior to point where we'd have done a soft offline > > > > if non disruptive (whether it is can be read from another bit of the > > > > ABI). Memory repair might be much less disruptive than soft-offline! > > > > I rather hope memory manufacturers build that, but I'm aware of at least > > > > one case where they didn't and the memory must be offline. > > > > > > That's a good point, spare before offline makes sense. > > > > If transparent an resources not constrained. > > Very much not if we have to tear down the memory first. > > > > > > > > [..] > > > > However, there are other usecases where this isn't needed which is why > > > > that isn't a precursor for this series. > > > > > > > > Initial enablement targets two situations: > > > > 1) Repair can be done in non disruptive way - no need to soft offline at all. > > > > > > Modulo needing to quiesce access over the sparing event? > > > > Absolutely. This is only doable in devices that don't need to quiesce. > > > > > > > > > 2) Repair can be done at boot before memory is onlined or on admin > > > > action to take the whole region offline, then repair specific chunks of > > > > memory before bringing it back online. > > > > > > Which is userspace racing the kernel to online memory? > > > > If you are doing this scheme you don't automatically online memory. So > > both are in userspace control and can be easily sequenced. > > If you aren't auto onlining then buy devices with hard PPR and do it by offlining > > manually, repairing and rebooting. Or buy devices that don't need to quiecse > > and cross your fingers the dodgy ram doesn't throw an error before you get > > that far. Little choice if you decide to online right at the start as normal > > memory. > > > > > > > > > > So, yes, +1 to simpler for now where software effectively just needs to > > > > > deal with a handful of "region repair" buttons and the semantics of > > > > > those are coarse and sub-optimal. Wait for a future where a tool author > > > > > says, "we have had good success getting bulk offlined pages back into > > > > > service, but now we need this specific finer grained kernel interface to > > > > > avoid wasting spare banks prematurely". > > > > > > > > Depends on where you think that interface is. I can absolutely see that > > > > as a control to RAS Daemon. Option 2 above, region is offline, repair > > > > all dodgy looking fine grained buckets. > > > > > > > > Note though that a suboptimal repair may mean permanent use of very rare > > > > resources. So there needs to be a control a the finest granularity as well. > > > > Which order those get added to userspace tools doesn't matter to me. > > > > > > > > If you mean that interface in kernel it brings some non trivial requirements. > > > > The kernel would need all of: > > > > 1) Tracking interface for all error records so the reverse map from region > > > > to specific bank / row etc is available for a subset of entries. The > > > > kernel would need to know which of those are important (soft offline > > > > might help in that use case, otherwise that means decision algorithms > > > > are in kernel or we have fine grained queue for region repair in parallel > > > > with soft-offline). > > > > 2) A way to inject the reverse map information from a userspace store > > > > (to deal with reboot etc). > > > > > > Not a way to inject the reverse map information, a way to inject the > > > error records and assert that memory topology changes have not > > > invalidated those records. > > > > There is no way to tell that the topology hasn't changed. > > For the reasons above, I don't think we care. Instead of trying to stop > > userspace reparing the wrong memory, make sure it is safe for it to do that. > > (The kernel is rarely in the business of preventing the slightly stupid) > > If the policy is "error records with SPA from the current boot can be > trusted" and "userspace requests outside of current boot error records > must only be submitted to known offline" then I think we are aligned. Ok, I'm not sure the requirement is clear, but given that is probably not too bad to do and we can rip it out later if it turns out to be overkill. I'll discuss this with Shiju and others later today. For now I think this belongs in the specific drivers, not the core as they will be tracking different things and getting the data from different paths. How we make this work with other error records is going to be more painful but should be easy enough for CXL non firmware first reporting. Other sources of records can come later. Eventually we might end up with a top level PA check in the EDAC core code, and additional optional checks in the drivers, or factor out the shared bit as library code. Not sure yet. > > > > > That sounds a lot harder to deal with than relying on the usespace program > > > > that already does the tracking across boots. > > > > > > I am stuck behind the barrier of userspace must not assume it knows > > > better than the kernel about the SPA impact of a DIMM sparing > > > event. The kernel needs evidence either live records from within the > > > same kernel boot or validated records from a previous boot. > > > > I think this is the wrong approach. The operation must be 'safe'. > > With that in place we absolutely can let userspace assume it knows better than > > the kernel. > > Violent agreement? Operation must be safe, yes, next what are the criteria > for kernel management of safety. Offline-only repair is great place to > be. Offline makes life easy but removes a significant usecase - the above extra checks bring that back so may give a way forwards. The combination of 'only previous records' and online isn't an important one as you can act on them before it is online. So 'maybe' a path forwards that adds extra guarantees. I'm not convinced we need them, but it seems to not restrict real usecases so not too bad. Jonathan
Em Tue, 14 Jan 2025 11:35:21 -0800 Dan Williams <dan.j.williams@intel.com> escreveu: > > There is no way to tell that the topology hasn't changed. > > For the reasons above, I don't think we care. Instead of trying to stop > > userspace reparing the wrong memory, make sure it is safe for it to do that. > > (The kernel is rarely in the business of preventing the slightly stupid) > > If the policy is "error records with SPA from the current boot can be > trusted" and "userspace requests outside of current boot error records > must only be submitted to known offline" then I think we are aligned. Surely userspace cannot infere if past errors on SPA are for the same DPA block, but it may still decide between soft/hard PPR based on different criteria adopted by the machine admins - or use instead memory sparing. So, yeah sanity checks at Kernel level to identify "trust" level based on having DPA data or not makes sense, but the final decision about the action should be taken on userspace. Thanks, Mauro
diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair b/Documentation/ABI/testing/sysfs-edac-memory-repair new file mode 100644 index 000000000000..e9268f3780ed --- /dev/null +++ b/Documentation/ABI/testing/sysfs-edac-memory-repair @@ -0,0 +1,244 @@ +What: /sys/bus/edac/devices/<dev-name>/mem_repairX +Date: Jan 2025 +KernelVersion: 6.14 +Contact: linux-edac@vger.kernel.org +Description: + The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory + pertains to the memory media repair features control, such as + PPR (Post Package Repair), memory sparing etc, where<dev-name> + directory corresponds to a device registered with the EDAC + device driver for the memory repair features. + + Post Package Repair is a maintenance operation requests the memory + device to perform a repair operation on its media, in detail is a + memory self-healing feature that fixes a failing memory location by + replacing it with a spare row in a DRAM device. For example, a + CXL memory device with DRAM components that support PPR features may + implement PPR maintenance operations. DRAM components may support + two types of PPR functions: hard PPR, for a permanent row repair, and + soft PPR, for a temporary row repair. soft PPR is much faster than + hard PPR, but the repair is lost with a power cycle. + + Memory sparing is a repair function that replaces a portion + of memory with a portion of functional memory at that same + sparing granularity. Memory sparing has cacheline/row/bank/rank + sparing granularities. For example, in memory-sparing mode, + one memory rank serves as a spare for other ranks on the same + channel in case they fail. The spare rank is held in reserve and + not used as active memory until a failure is indicated, with + reserved capacity subtracted from the total available memory + in the system.The DIMM installation order for memory sparing + varies based on the number of processors and memory modules + installed in the server. After an error threshold is surpassed + in a system protected by memory sparing, the content of a failing + rank of DIMMs is copied to the spare rank. The failing rank is + then taken offline and the spare rank placed online for use as + active memory in place of the failed rank. + + The sysfs attributes nodes for a repair feature are only + present if the parent driver has implemented the corresponding + attr callback function and provided the necessary operations + to the EDAC device driver during registration. + + In some states of system configuration (e.g. before address + decoders have been configured), memory devices (e.g. CXL) + may not have an active mapping in the main host address + physical address map. As such, the memory to repair must be + identified by a device specific physical addressing scheme + using a device physical address(DPA). The DPA and other control + attributes to use will be presented in related error records. + +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function +Date: Jan 2025 +KernelVersion: 6.14 +Contact: linux-edac@vger.kernel.org +Description: + (RO) Memory repair function type. For eg. post package repair, + memory sparing etc. + EDAC_SOFT_PPR - Soft post package repair + EDAC_HARD_PPR - Hard post package repair + EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing + EDAC_ROW_MEM_SPARING - Row memory sparing + EDAC_BANK_MEM_SPARING - Bank memory sparing + EDAC_RANK_MEM_SPARING - Rank memory sparing + All other values are reserved. + +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode +Date: Jan 2025 +KernelVersion: 6.14 +Contact: linux-edac@vger.kernel.org +Description: + (RW) Read/Write the current persist repair mode set for a + repair function. Persist repair modes supported in the + device, based on the memory repair function is temporary + or permanent and is lost with a power cycle. + EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair). + EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair). + All other values are reserved. + +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/dpa_support +Date: Jan 2025 +KernelVersion: 6.14 +Contact: linux-edac@vger.kernel.org +Description: + (RO) True if memory device required device physical + address (DPA) of memory to repair. + False if memory device required host specific physical + address (HPA) of memory to repair. + In some states of system configuration (e.g. before address + decoders have been configured), memory devices (e.g. CXL) + may not have an active mapping in the main host address + physical address map. As such, the memory to repair must be + identified by a device specific physical addressing scheme + using a DPA. The device physical address(DPA) to use will be + presented in related error records. + +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_safe_when_in_use +Date: Jan 2025 +KernelVersion: 6.14 +Contact: linux-edac@vger.kernel.org +Description: + (RO) True if memory media is accessible and data is retained + during the memory repair operation. + The data may not be retained and memory requests may not be + correctly processed during a repair operation. In such case + the repair operation should not executed at runtime. + +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/hpa +Date: Jan 2025 +KernelVersion: 6.14 +Contact: linux-edac@vger.kernel.org +Description: + (RW) Host Physical Address (HPA) of the memory to repair. + See attribute 'dpa_support' for more details. + The HPA to use will be provided in related error records. + +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/dpa +Date: Jan 2025 +KernelVersion: 6.14 +Contact: linux-edac@vger.kernel.org +Description: + (RW) Device Physical Address (DPA) of the memory to repair. + See attribute 'dpa_support' for more details. + The specific DPA to use will be provided in related error + records. + +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/nibble_mask +Date: Jan 2025 +KernelVersion: 6.14 +Contact: linux-edac@vger.kernel.org +Description: + (RW) Read/Write Nibble mask of the memory to repair. + Nibble mask identifies one or more nibbles in error on the + memory bus that produced the error event. Nibble Mask bit 0 + shall be set if nibble 0 on the memory bus produced the + event, etc. For example, CXL PPR and sparing, a nibble mask + bit set to 1 indicates the request to perform repair + operation in the specific device. All nibble mask bits set + to 1 indicates the request to perform the operation in all + devices. For CXL memory to repiar, the specific value of + nibble mask to use will be provided in related error records. + For more details, See nibble mask field in CXL spec ver 3.1, + section 8.2.9.7.1.2 Table 8-103 soft PPR and section + 8.2.9.7.1.3 Table 8-104 hard PPR, section 8.2.9.7.1.4 + Table 8-105 memory sparing. + +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/bank_group +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/bank +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/rank +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/row +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/column +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/channel +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/sub_channel +Date: Jan 2025 +KernelVersion: 6.14 +Contact: linux-edac@vger.kernel.org +Description: + (RW) The control attributes associated with memory address + that is to be repaired. The specific value of attributes to + use depends on the portion of memory to repair and may be + reported to host in related error records and may be + available to userspace in trace events, such as in CXL + memory devices. + + channel - The channel of the memory to repair. Channel is + defined as an interface that can be independently accessed + for a transaction. + rank - The rank of the memory to repair. Rank is defined as a + set of memory devices on a channel that together execute a + transaction. + bank_group - The bank group of the memory to repair. + bank - The bank number of the memory to repair. + row - The row number of the memory to repair. + column - The column number of the memory to repair. + sub_channel - The sub-channel of the memory to repair. + + The requirement to set these attributes varies based on the + repair function. The attributes in sysfs are not present + unless required for a repair function. + For example, CXL spec ver 3.1, Section 8.2.9.7.1.2 Table 8-103 + soft PPR and Section 8.2.9.7.1.3 Table 8-104 hard PPR operations, + these attributes are not required to set. + For example, CXL spec ver 3.1, Section 8.2.9.7.1.4 Table 8-105 + memory sparing, these attributes are required to set based on + memory sparing granularity as follows. + Channel: Channel associated with the DPA that is to be spared + and applies to all subclasses of sparing (cacheline, bank, + row and rank sparing). + Rank: Rank associated with the DPA that is to be spared and + applies to all subclasses of sparing. + Bank & Bank Group: Bank & bank group are associated with + the DPA that is to be spared and applies to cacheline sparing, + row sparing and bank sparing subclasses. + Row: Row associated with the DPA that is to be spared and + applies to cacheline sparing and row sparing subclasses. + Column: column associated with the DPA that is to be spared + and applies to cacheline sparing only. + Sub-channel: sub-channel associated with the DPA that is to + be spared and applies to cacheline sparing only. + +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_nibble_mask +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank_group +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_rank +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_row +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_column +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_channel +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_sub_channel +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_nibble_mask +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank_group +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_rank +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_row +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_column +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_channel +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_sub_channel +Date: Jan 2025 +KernelVersion: 6.14 +Contact: linux-edac@vger.kernel.org +Description: + (RW) The supported range of control attributes (optional) + associated with a memory address that is to be repaired. + The memory device may give the supported range of + attributes to use and it will depend on the memory device + and the portion of memory to repair. + The userspace may receive the specific value of attributes + to use for a repair operation from the memory device via + related error records and trace events, such as in CXL + memory devices. + +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair +Date: Jan 2025 +KernelVersion: 6.14 +Contact: linux-edac@vger.kernel.org +Description: + (WO) Issue the memory repair operation for the specified + memory repair attributes. The operation may fail if resources + are insufficient based on the requirements of the memory + device and repair function. + EDAC_DO_MEM_REPAIR - issue repair operation. + All other values are reserved. diff --git a/Documentation/edac/features.rst b/Documentation/edac/features.rst index ba3ab993ee4f..bfd5533b81b7 100644 --- a/Documentation/edac/features.rst +++ b/Documentation/edac/features.rst @@ -97,3 +97,6 @@ RAS features ------------ 1. Memory Scrub Memory scrub features are documented in `Documentation/edac/scrub.rst`. + +2. Memory Repair +Memory repair features are documented in `Documentation/edac/memory_repair.rst`. diff --git a/Documentation/edac/index.rst b/Documentation/edac/index.rst index dfb0c9fb9ab1..d6778f4562dd 100644 --- a/Documentation/edac/index.rst +++ b/Documentation/edac/index.rst @@ -8,4 +8,5 @@ EDAC Subsystem :maxdepth: 1 features + memory_repair scrub diff --git a/Documentation/edac/memory_repair.rst b/Documentation/edac/memory_repair.rst new file mode 100644 index 000000000000..2787a8a2d6ba --- /dev/null +++ b/Documentation/edac/memory_repair.rst @@ -0,0 +1,101 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== +EDAC Memory Repair Control +========================== + +Copyright (c) 2024 HiSilicon Limited. + +:Author: Shiju Jose <shiju.jose@huawei.com> +:License: The GNU Free Documentation License, Version 1.2 + (dual licensed under the GPL v2) +:Original Reviewers: + +- Written for: 6.14 + +Introduction +------------ +Memory devices may support repair operations to address issues in their +memory media. Post Package Repair (PPR) and memory sparing are examples +of such features. + +Post Package Repair(PPR) +~~~~~~~~~~~~~~~~~~~~~~~~ +Post Package Repair is a maintenance operation requests the memory device +to perform repair operation on its media, in detail is a memory self-healing +feature that fixes a failing memory location by replacing it with a spare +row in a DRAM device. For example, a CXL memory device with DRAM components +that support PPR features may implement PPR maintenance operations. DRAM +components may support types of PPR functions, hard PPR, for a permanent row +repair, and soft PPR, for a temporary row repair. Soft PPR is much faster +than hard PPR, but the repair is lost with a power cycle. The data may not +be retained and memory requests may not be correctly processed during a +repair operation. In such case, the repair operation should not executed +at runtime. +For example, CXL memory devices, soft PPR and hard PPR repair operations +may be supported. See CXL spec rev 3.1 sections 8.2.9.7.1.1 PPR Maintenance +Operations, 8.2.9.7.1.2 sPPR Maintenance Operation and 8.2.9.7.1.3 hPPR +Maintenance Operation for more details. + +Memory Sparing +~~~~~~~~~~~~~~ +Memory sparing is a repair function that replaces a portion of memory with +a portion of functional memory at that same sparing granularity. Memory +sparing has cacheline/row/bank/rank sparing granularities. For example, in +memory-sparing mode, one memory rank serves as a spare for other ranks on +the same channel in case they fail. The spare rank is held in reserve and +not used as active memory until a failure is indicated, with reserved +capacity subtracted from the total available memory in the system. The DIMM +installation order for memory sparing varies based on the number of processors +and memory modules installed in the server. After an error threshold is +surpassed in a system protected by memory sparing, the content of a failing +rank of DIMMs is copied to the spare rank. The failing rank is then taken +offline and the spare rank placed online for use as active memory in place +of the failed rank. + +For example, CXL memory devices may support various subclasses for sparing +operation vary in terms of the scope of the sparing being performed. +Cacheline sparing subclass refers to a sparing action that can replace a +full cacheline. Row sparing is provided as an alternative to PPR sparing +functions and its scope is that of a single DDR row. Bank sparing allows +an entire bank to be replaced. Rank sparing is defined as an operation +in which an entire DDR rank is replaced. See CXL spec 3.1 section +8.2.9.7.1.4 Memory Sparing Maintenance Operations for more details. + +Use cases of generic memory repair features control +--------------------------------------------------- + +1. The soft PPR , hard PPR and memory-sparing features share similar +control attributes. Therefore, there is a need for a standardized, generic +sysfs repair control that is exposed to userspace and used by +administrators, scripts and tools. + +2. When a CXL device detects an error in a memory component, it may inform +the host of the need for a repair maintenance operation by using an event +record where the "maintenance needed" flag is set. The event record +specifies the device physical address(DPA) and attributes of the memory that +requires repair. The kernel reports the corresponding CXL general media or +DRAM trace event to userspace, and userspace tools (e.g. rasdaemon) initiate +a repair maintenance operation in response to the device request using the +sysfs repair control. + +3. Userspace tools, such as rasdaemon, may request a PPR/sparing on a memory +region when an uncorrected memory error or an excess of corrected memory +errors is reported on that memory. + +4. Multiple PPR/sparing instances may be present per memory device. + +The File System +--------------- + +The control attributes of a registered memory repair instance could be +accessed in the + +/sys/bus/edac/devices/<dev-name>/mem_repairX/ + +sysfs +----- + +Sysfs files are documented in + +`Documentation/ABI/testing/sysfs-edac-memory-repair`. diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile index 3a49304860f0..1de9fe66ac6b 100644 --- a/drivers/edac/Makefile +++ b/drivers/edac/Makefile @@ -10,7 +10,7 @@ obj-$(CONFIG_EDAC) := edac_core.o edac_core-y := edac_mc.o edac_device.o edac_mc_sysfs.o edac_core-y += edac_module.o edac_device_sysfs.o wq.o -edac_core-y += scrub.o ecs.o +edac_core-y += scrub.o ecs.o mem_repair.o edac_core-$(CONFIG_EDAC_DEBUG) += debugfs.o diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c index 1c1142a2e4e4..a401d81dad8a 100644 --- a/drivers/edac/edac_device.c +++ b/drivers/edac/edac_device.c @@ -575,6 +575,7 @@ static void edac_dev_release(struct device *dev) { struct edac_dev_feat_ctx *ctx = container_of(dev, struct edac_dev_feat_ctx, dev); + kfree(ctx->mem_repair); kfree(ctx->scrub); kfree(ctx->dev.groups); kfree(ctx); @@ -611,6 +612,7 @@ int edac_dev_register(struct device *parent, char *name, const struct attribute_group **ras_attr_groups; struct edac_dev_data *dev_data; struct edac_dev_feat_ctx *ctx; + int mem_repair_cnt = 0; int attr_gcnt = 0; int scrub_cnt = 0; int ret, feat; @@ -628,6 +630,10 @@ int edac_dev_register(struct device *parent, char *name, case RAS_FEAT_ECS: attr_gcnt += ras_features[feat].ecs_info.num_media_frus; break; + case RAS_FEAT_MEM_REPAIR: + attr_gcnt++; + mem_repair_cnt++; + break; default: return -EINVAL; } @@ -651,8 +657,17 @@ int edac_dev_register(struct device *parent, char *name, } } + if (mem_repair_cnt) { + ctx->mem_repair = kcalloc(mem_repair_cnt, sizeof(*ctx->mem_repair), GFP_KERNEL); + if (!ctx->mem_repair) { + ret = -ENOMEM; + goto data_mem_free; + } + } + attr_gcnt = 0; scrub_cnt = 0; + mem_repair_cnt = 0; for (feat = 0; feat < num_features; feat++, ras_features++) { switch (ras_features->ft_type) { case RAS_FEAT_SCRUB: @@ -686,6 +701,23 @@ int edac_dev_register(struct device *parent, char *name, attr_gcnt += ras_features->ecs_info.num_media_frus; break; + case RAS_FEAT_MEM_REPAIR: + if (!ras_features->mem_repair_ops || + mem_repair_cnt != ras_features->instance) + goto data_mem_free; + + dev_data = &ctx->mem_repair[mem_repair_cnt]; + dev_data->instance = mem_repair_cnt; + dev_data->mem_repair_ops = ras_features->mem_repair_ops; + dev_data->private = ras_features->ctx; + ret = edac_mem_repair_get_desc(parent, &ras_attr_groups[attr_gcnt], + ras_features->instance); + if (ret) + goto data_mem_free; + + mem_repair_cnt++; + attr_gcnt++; + break; default: ret = -EINVAL; goto data_mem_free; @@ -712,6 +744,7 @@ int edac_dev_register(struct device *parent, char *name, return devm_add_action_or_reset(parent, edac_dev_unreg, &ctx->dev); data_mem_free: + kfree(ctx->mem_repair); kfree(ctx->scrub); groups_free: kfree(ras_attr_groups); diff --git a/drivers/edac/mem_repair.c b/drivers/edac/mem_repair.c new file mode 100755 index 000000000000..e7439fd26c41 --- /dev/null +++ b/drivers/edac/mem_repair.c @@ -0,0 +1,492 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * The generic EDAC memory repair driver is designed to control the memory + * devices with memory repair features, such as Post Package Repair (PPR), + * memory sparing etc. The common sysfs memory repair interface abstracts + * the control of various arbitrary memory repair functionalities into a + * unified set of functions. + * + * Copyright (c) 2024 HiSilicon Limited. + */ + +#include <linux/edac.h> + +enum edac_mem_repair_attributes { + MEM_REPAIR_FUNCTION, + MEM_REPAIR_PERSIST_MODE, + MEM_REPAIR_DPA_SUPPORT, + MEM_REPAIR_SAFE_IN_USE, + MEM_REPAIR_HPA, + MEM_REPAIR_MIN_HPA, + MEM_REPAIR_MAX_HPA, + MEM_REPAIR_DPA, + MEM_REPAIR_MIN_DPA, + MEM_REPAIR_MAX_DPA, + MEM_REPAIR_NIBBLE_MASK, + MEM_REPAIR_MIN_NIBBLE_MASK, + MEM_REPAIR_MAX_NIBBLE_MASK, + MEM_REPAIR_BANK_GROUP, + MEM_REPAIR_MIN_BANK_GROUP, + MEM_REPAIR_MAX_BANK_GROUP, + MEM_REPAIR_BANK, + MEM_REPAIR_MIN_BANK, + MEM_REPAIR_MAX_BANK, + MEM_REPAIR_RANK, + MEM_REPAIR_MIN_RANK, + MEM_REPAIR_MAX_RANK, + MEM_REPAIR_ROW, + MEM_REPAIR_MIN_ROW, + MEM_REPAIR_MAX_ROW, + MEM_REPAIR_COLUMN, + MEM_REPAIR_MIN_COLUMN, + MEM_REPAIR_MAX_COLUMN, + MEM_REPAIR_CHANNEL, + MEM_REPAIR_MIN_CHANNEL, + MEM_REPAIR_MAX_CHANNEL, + MEM_REPAIR_SUB_CHANNEL, + MEM_REPAIR_MIN_SUB_CHANNEL, + MEM_REPAIR_MAX_SUB_CHANNEL, + MEM_DO_REPAIR, + MEM_REPAIR_MAX_ATTRS +}; + +struct edac_mem_repair_dev_attr { + struct device_attribute dev_attr; + u8 instance; +}; + +struct edac_mem_repair_context { + char name[EDAC_FEAT_NAME_LEN]; + struct edac_mem_repair_dev_attr mem_repair_dev_attr[MEM_REPAIR_MAX_ATTRS]; + struct attribute *mem_repair_attrs[MEM_REPAIR_MAX_ATTRS + 1]; + struct attribute_group group; +}; + +#define TO_MEM_REPAIR_DEV_ATTR(_dev_attr) \ + container_of(_dev_attr, struct edac_mem_repair_dev_attr, dev_attr) + +#define EDAC_MEM_REPAIR_ATTR_SHOW(attrib, cb, type, format) \ +static ssize_t attrib##_show(struct device *ras_feat_dev, \ + struct device_attribute *attr, char *buf) \ +{ \ + u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance; \ + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); \ + const struct edac_mem_repair_ops *ops = \ + ctx->mem_repair[inst].mem_repair_ops; \ + type data; \ + int ret; \ + \ + ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, \ + &data); \ + if (ret) \ + return ret; \ + \ + return sysfs_emit(buf, format, data); \ +} + +EDAC_MEM_REPAIR_ATTR_SHOW(repair_function, get_repair_function, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(persist_mode, get_persist_mode, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(dpa_support, get_dpa_support, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(repair_safe_when_in_use, get_repair_safe_when_in_use, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(hpa, get_hpa, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_hpa, get_min_hpa, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_hpa, get_max_hpa, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(dpa, get_dpa, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_dpa, get_min_dpa, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_dpa, get_max_dpa, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(nibble_mask, get_nibble_mask, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_nibble_mask, get_min_nibble_mask, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_nibble_mask, get_max_nibble_mask, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(bank_group, get_bank_group, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank_group, get_min_bank_group, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank_group, get_max_bank_group, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(bank, get_bank, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank, get_min_bank, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank, get_max_bank, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(rank, get_rank, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_rank, get_min_rank, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_rank, get_max_rank, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(row, get_row, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_row, get_min_row, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_row, get_max_row, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(column, get_column, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_column, get_min_column, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_column, get_max_column, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(channel, get_channel, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_channel, get_min_channel, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_channel, get_max_channel, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(sub_channel, get_sub_channel, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_sub_channel, get_min_sub_channel, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_sub_channel, get_max_sub_channel, u32, "%u\n") + +#define EDAC_MEM_REPAIR_ATTR_STORE(attrib, cb, type, conv_func) \ +static ssize_t attrib##_store(struct device *ras_feat_dev, \ + struct device_attribute *attr, \ + const char *buf, size_t len) \ +{ \ + u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance; \ + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); \ + const struct edac_mem_repair_ops *ops = \ + ctx->mem_repair[inst].mem_repair_ops; \ + type data; \ + int ret; \ + \ + ret = conv_func(buf, 0, &data); \ + if (ret < 0) \ + return ret; \ + \ + ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, \ + data); \ + if (ret) \ + return ret; \ + \ + return len; \ +} + +EDAC_MEM_REPAIR_ATTR_STORE(persist_mode, set_persist_mode, unsigned long, kstrtoul) +EDAC_MEM_REPAIR_ATTR_STORE(hpa, set_hpa, u64, kstrtou64) +EDAC_MEM_REPAIR_ATTR_STORE(dpa, set_dpa, u64, kstrtou64) +EDAC_MEM_REPAIR_ATTR_STORE(nibble_mask, set_nibble_mask, u64, kstrtou64) +EDAC_MEM_REPAIR_ATTR_STORE(bank_group, set_bank_group, unsigned long, kstrtoul) +EDAC_MEM_REPAIR_ATTR_STORE(bank, set_bank, unsigned long, kstrtoul) +EDAC_MEM_REPAIR_ATTR_STORE(rank, set_rank, unsigned long, kstrtoul) +EDAC_MEM_REPAIR_ATTR_STORE(row, set_row, u64, kstrtou64) +EDAC_MEM_REPAIR_ATTR_STORE(column, set_column, unsigned long, kstrtoul) +EDAC_MEM_REPAIR_ATTR_STORE(channel, set_channel, unsigned long, kstrtoul) +EDAC_MEM_REPAIR_ATTR_STORE(sub_channel, set_sub_channel, unsigned long, kstrtoul) + +#define EDAC_MEM_REPAIR_DO_OP(attrib, cb) \ +static ssize_t attrib##_store(struct device *ras_feat_dev, \ + struct device_attribute *attr, \ + const char *buf, size_t len) \ +{ \ + u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance; \ + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); \ + const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops; \ + unsigned long data; \ + int ret; \ + \ + ret = kstrtoul(buf, 0, &data); \ + if (ret < 0) \ + return ret; \ + \ + ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, data); \ + if (ret) \ + return ret; \ + \ + return len; \ +} + +EDAC_MEM_REPAIR_DO_OP(repair, do_repair) + +static umode_t mem_repair_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id) +{ + struct device *ras_feat_dev = kobj_to_dev(kobj); + struct device_attribute *dev_attr = container_of(a, struct device_attribute, attr); + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); + u8 inst = TO_MEM_REPAIR_DEV_ATTR(dev_attr)->instance; + const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops; + + switch (attr_id) { + case MEM_REPAIR_FUNCTION: + if (ops->get_repair_function) + return a->mode; + break; + case MEM_REPAIR_PERSIST_MODE: + if (ops->get_persist_mode) { + if (ops->set_persist_mode) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_DPA_SUPPORT: + if (ops->get_dpa_support) + return a->mode; + break; + case MEM_REPAIR_SAFE_IN_USE: + if (ops->get_repair_safe_when_in_use) + return a->mode; + break; + case MEM_REPAIR_HPA: + if (ops->get_hpa) { + if (ops->set_hpa) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_HPA: + if (ops->get_min_hpa) + return a->mode; + break; + case MEM_REPAIR_MAX_HPA: + if (ops->get_max_hpa) + return a->mode; + break; + case MEM_REPAIR_DPA: + if (ops->get_dpa) { + if (ops->set_dpa) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_DPA: + if (ops->get_min_dpa) + return a->mode; + break; + case MEM_REPAIR_MAX_DPA: + if (ops->get_max_dpa) + return a->mode; + break; + case MEM_REPAIR_NIBBLE_MASK: + if (ops->get_nibble_mask) { + if (ops->set_nibble_mask) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_NIBBLE_MASK: + if (ops->get_min_nibble_mask) + return a->mode; + break; + case MEM_REPAIR_MAX_NIBBLE_MASK: + if (ops->get_max_nibble_mask) + return a->mode; + break; + case MEM_REPAIR_BANK_GROUP: + if (ops->get_bank_group) { + if (ops->set_bank_group) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_BANK_GROUP: + if (ops->get_min_bank_group) + return a->mode; + break; + case MEM_REPAIR_MAX_BANK_GROUP: + if (ops->get_max_bank_group) + return a->mode; + break; + case MEM_REPAIR_BANK: + if (ops->get_bank) { + if (ops->set_bank) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_BANK: + if (ops->get_min_bank) + return a->mode; + break; + case MEM_REPAIR_MAX_BANK: + if (ops->get_max_bank) + return a->mode; + break; + case MEM_REPAIR_RANK: + if (ops->get_rank) { + if (ops->set_rank) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_RANK: + if (ops->get_min_rank) + return a->mode; + break; + case MEM_REPAIR_MAX_RANK: + if (ops->get_max_rank) + return a->mode; + break; + case MEM_REPAIR_ROW: + if (ops->get_row) { + if (ops->set_row) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_ROW: + if (ops->get_min_row) + return a->mode; + break; + case MEM_REPAIR_MAX_ROW: + if (ops->get_max_row) + return a->mode; + break; + case MEM_REPAIR_COLUMN: + if (ops->get_column) { + if (ops->set_column) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_COLUMN: + if (ops->get_min_column) + return a->mode; + break; + case MEM_REPAIR_MAX_COLUMN: + if (ops->get_max_column) + return a->mode; + break; + case MEM_REPAIR_CHANNEL: + if (ops->get_channel) { + if (ops->set_channel) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_CHANNEL: + if (ops->get_min_channel) + return a->mode; + break; + case MEM_REPAIR_MAX_CHANNEL: + if (ops->get_max_channel) + return a->mode; + break; + case MEM_REPAIR_SUB_CHANNEL: + if (ops->get_sub_channel) { + if (ops->set_sub_channel) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_SUB_CHANNEL: + if (ops->get_min_sub_channel) + return a->mode; + break; + case MEM_REPAIR_MAX_SUB_CHANNEL: + if (ops->get_max_sub_channel) + return a->mode; + break; + case MEM_DO_REPAIR: + if (ops->do_repair) + return a->mode; + break; + default: + break; + } + + return 0; +} + +#define EDAC_MEM_REPAIR_ATTR_RO(_name, _instance) \ + ((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RO(_name), \ + .instance = _instance }) + +#define EDAC_MEM_REPAIR_ATTR_WO(_name, _instance) \ + ((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_WO(_name), \ + .instance = _instance }) + +#define EDAC_MEM_REPAIR_ATTR_RW(_name, _instance) \ + ((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RW(_name), \ + .instance = _instance }) + +static int mem_repair_create_desc(struct device *dev, + const struct attribute_group **attr_groups, + u8 instance) +{ + struct edac_mem_repair_context *ctx; + struct attribute_group *group; + int i; + struct edac_mem_repair_dev_attr dev_attr[] = { + [MEM_REPAIR_FUNCTION] = EDAC_MEM_REPAIR_ATTR_RO(repair_function, + instance), + [MEM_REPAIR_PERSIST_MODE] = + EDAC_MEM_REPAIR_ATTR_RW(persist_mode, instance), + [MEM_REPAIR_DPA_SUPPORT] = + EDAC_MEM_REPAIR_ATTR_RO(dpa_support, instance), + [MEM_REPAIR_SAFE_IN_USE] = + EDAC_MEM_REPAIR_ATTR_RO(repair_safe_when_in_use, + instance), + [MEM_REPAIR_HPA] = EDAC_MEM_REPAIR_ATTR_RW(hpa, instance), + [MEM_REPAIR_MIN_HPA] = EDAC_MEM_REPAIR_ATTR_RO(min_hpa, instance), + [MEM_REPAIR_MAX_HPA] = EDAC_MEM_REPAIR_ATTR_RO(max_hpa, instance), + [MEM_REPAIR_DPA] = EDAC_MEM_REPAIR_ATTR_RW(dpa, instance), + [MEM_REPAIR_MIN_DPA] = EDAC_MEM_REPAIR_ATTR_RO(min_dpa, instance), + [MEM_REPAIR_MAX_DPA] = EDAC_MEM_REPAIR_ATTR_RO(max_dpa, instance), + [MEM_REPAIR_NIBBLE_MASK] = + EDAC_MEM_REPAIR_ATTR_RW(nibble_mask, instance), + [MEM_REPAIR_MIN_NIBBLE_MASK] = + EDAC_MEM_REPAIR_ATTR_RO(min_nibble_mask, instance), + [MEM_REPAIR_MAX_NIBBLE_MASK] = + EDAC_MEM_REPAIR_ATTR_RO(max_nibble_mask, instance), + [MEM_REPAIR_BANK_GROUP] = + EDAC_MEM_REPAIR_ATTR_RW(bank_group, instance), + [MEM_REPAIR_MIN_BANK_GROUP] = + EDAC_MEM_REPAIR_ATTR_RO(min_bank_group, instance), + [MEM_REPAIR_MAX_BANK_GROUP] = + EDAC_MEM_REPAIR_ATTR_RO(max_bank_group, instance), + [MEM_REPAIR_BANK] = EDAC_MEM_REPAIR_ATTR_RW(bank, instance), + [MEM_REPAIR_MIN_BANK] = EDAC_MEM_REPAIR_ATTR_RO(min_bank, instance), + [MEM_REPAIR_MAX_BANK] = EDAC_MEM_REPAIR_ATTR_RO(max_bank, instance), + [MEM_REPAIR_RANK] = EDAC_MEM_REPAIR_ATTR_RW(rank, instance), + [MEM_REPAIR_MIN_RANK] = EDAC_MEM_REPAIR_ATTR_RO(min_rank, instance), + [MEM_REPAIR_MAX_RANK] = EDAC_MEM_REPAIR_ATTR_RO(max_rank, instance), + [MEM_REPAIR_ROW] = EDAC_MEM_REPAIR_ATTR_RW(row, instance), + [MEM_REPAIR_MIN_ROW] = EDAC_MEM_REPAIR_ATTR_RO(min_row, instance), + [MEM_REPAIR_MAX_ROW] = EDAC_MEM_REPAIR_ATTR_RO(max_row, instance), + [MEM_REPAIR_COLUMN] = EDAC_MEM_REPAIR_ATTR_RW(column, instance), + [MEM_REPAIR_MIN_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(min_column, instance), + [MEM_REPAIR_MAX_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(max_column, instance), + [MEM_REPAIR_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RW(channel, instance), + [MEM_REPAIR_MIN_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(min_channel, instance), + [MEM_REPAIR_MAX_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(max_channel, instance), + [MEM_REPAIR_SUB_CHANNEL] = + EDAC_MEM_REPAIR_ATTR_RW(sub_channel, instance), + [MEM_REPAIR_MIN_SUB_CHANNEL] = + EDAC_MEM_REPAIR_ATTR_RO(min_sub_channel, instance), + [MEM_REPAIR_MAX_SUB_CHANNEL] = + EDAC_MEM_REPAIR_ATTR_RO(max_sub_channel, instance), + [MEM_DO_REPAIR] = EDAC_MEM_REPAIR_ATTR_WO(repair, instance) + }; + + ctx = devm_kzalloc(dev, sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return -ENOMEM; + + for (i = 0; i < MEM_REPAIR_MAX_ATTRS; i++) { + memcpy(&ctx->mem_repair_dev_attr[i].dev_attr, + &dev_attr[i], sizeof(dev_attr[i])); + ctx->mem_repair_attrs[i] = + &ctx->mem_repair_dev_attr[i].dev_attr.attr; + } + + sprintf(ctx->name, "%s%d", "mem_repair", instance); + group = &ctx->group; + group->name = ctx->name; + group->attrs = ctx->mem_repair_attrs; + group->is_visible = mem_repair_attr_visible; + attr_groups[0] = group; + + return 0; +} + +/** + * edac_mem_repair_get_desc - get EDAC memory repair descriptors + * @dev: client device with memory repair feature + * @attr_groups: pointer to attribute group container + * @instance: device's memory repair instance number. + * + * Return: + * * %0 - Success. + * * %-EINVAL - Invalid parameters passed. + * * %-ENOMEM - Dynamic memory allocation failed. + */ +int edac_mem_repair_get_desc(struct device *dev, + const struct attribute_group **attr_groups, u8 instance) +{ + if (!dev || !attr_groups) + return -EINVAL; + + return mem_repair_create_desc(dev, attr_groups, instance); +} diff --git a/include/linux/edac.h b/include/linux/edac.h index 979e91426701..5d07192bf1a7 100644 --- a/include/linux/edac.h +++ b/include/linux/edac.h @@ -668,6 +668,7 @@ static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci, enum edac_dev_feat { RAS_FEAT_SCRUB, RAS_FEAT_ECS, + RAS_FEAT_MEM_REPAIR, RAS_FEAT_MAX }; @@ -729,11 +730,147 @@ int edac_ecs_get_desc(struct device *ecs_dev, const struct attribute_group **attr_groups, u16 num_media_frus); +enum edac_mem_repair_function { + EDAC_SOFT_PPR, + EDAC_HARD_PPR, + EDAC_CACHELINE_MEM_SPARING, + EDAC_ROW_MEM_SPARING, + EDAC_BANK_MEM_SPARING, + EDAC_RANK_MEM_SPARING, +}; + +enum edac_mem_repair_persist_mode { + EDAC_MEM_REPAIR_SOFT, /* soft memory repair */ + EDAC_MEM_REPAIR_HARD, /* hard memory repair */ +}; + +enum edac_mem_repair_cmd { + EDAC_DO_MEM_REPAIR = 1, +}; + +/** + * struct edac_mem_repair_ops - memory repair operations + * (all elements are optional except do_repair, set_hpa/set_dpa) + * @get_repair_function: get the memory repair function, listed in + * enum edac_mem_repair_function. + * @get_persist_mode: get the current persist mode. Persist repair modes supported + * in the device is based on the memory repair function which is + * temporary or permanent and is lost with a power cycle. + * EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair). + * EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair). + * All other values are reserved. + * @set_persist_mode: set the persist mode of the memory repair instance. + * @get_dpa_support: get dpa support flag. In some states of system configuration + * (e.g. before address decoders have been configured), memory devices + * (e.g. CXL) may not have an active mapping in the main host address + * physical address map. As such, the memory to repair must be identified + * by a device specific physical addressing scheme using a device physical + * address(DPA). The DPA and other control attributes to use for the + * dry_run and repair operations will be presented in related error records. + * @get_repair_safe_when_in_use: get whether memory media is accessible and + * data is retained during repair operation. + * @get_hpa: get current host physical address (HPA). + * @set_hpa: set host physical address (HPA) of memory to repair. + * @get_min_hpa: get the minimum supported host physical address (HPA). + * @get_max_hpa: get the maximum supported host physical address (HPA). + * @get_dpa: get current device physical address (DPA). + * @set_dpa: set device physical address (DPA) of memory to repair. + * @get_min_dpa: get the minimum supported device physical address (DPA). + * @get_max_dpa: get the maximum supported device physical address (DPA). + * @get_nibble_mask: get current nibble mask. + * @set_nibble_mask: set nibble mask of memory to repair. + * @get_min_nibble_mask: get the minimum supported nibble mask. + * @get_max_nibble_mask: get the maximum supported nibble mask. + * @get_bank_group: get current bank group. + * @set_bank_group: set bank group of memory to repair. + * @get_min_bank_group: get the minimum supported bank group. + * @get_max_bank_group: get the maximum supported bank group. + * @get_bank: get current bank. + * @set_bank: set bank of memory to repair. + * @get_min_bank: get the minimum supported bank. + * @get_max_bank: get the maximum supported bank. + * @get_rank: get current rank. + * @set_rank: set rank of memory to repair. + * @get_min_rank: get the minimum supported rank. + * @get_max_rank: get the maximum supported rank. + * @get_row: get current row. + * @set_row: set row of memory to repair. + * @get_min_row: get the minimum supported row. + * @get_max_row: get the maximum supported row. + * @get_column: get current column. + * @set_column: set column of memory to repair. + * @get_min_column: get the minimum supported column. + * @get_max_column: get the maximum supported column. + * @get_channel: get current channel. + * @set_channel: set channel of memory to repair. + * @get_min_channel: get the minimum supported channel. + * @get_max_channel: get the maximum supported channel. + * @get_sub_channel: get current sub channel. + * @set_sub_channel: set sub channel of memory to repair. + * @get_min_sub_channel: get the minimum supported sub channel. + * @get_max_sub_channel: get the maximum supported sub channel. + * @do_repair: Issue memory repair operation for the HPA/DPA and + * other control attributes set for the memory to repair. + */ +struct edac_mem_repair_ops { + int (*get_repair_function)(struct device *dev, void *drv_data, u32 *val); + int (*get_persist_mode)(struct device *dev, void *drv_data, u32 *mode); + int (*set_persist_mode)(struct device *dev, void *drv_data, u32 mode); + int (*get_dpa_support)(struct device *dev, void *drv_data, u32 *val); + int (*get_repair_safe_when_in_use)(struct device *dev, void *drv_data, u32 *val); + int (*get_hpa)(struct device *dev, void *drv_data, u64 *hpa); + int (*set_hpa)(struct device *dev, void *drv_data, u64 hpa); + int (*get_min_hpa)(struct device *dev, void *drv_data, u64 *hpa); + int (*get_max_hpa)(struct device *dev, void *drv_data, u64 *hpa); + int (*get_dpa)(struct device *dev, void *drv_data, u64 *dpa); + int (*set_dpa)(struct device *dev, void *drv_data, u64 dpa); + int (*get_min_dpa)(struct device *dev, void *drv_data, u64 *dpa); + int (*get_max_dpa)(struct device *dev, void *drv_data, u64 *dpa); + int (*get_nibble_mask)(struct device *dev, void *drv_data, u64 *val); + int (*set_nibble_mask)(struct device *dev, void *drv_data, u64 val); + int (*get_min_nibble_mask)(struct device *dev, void *drv_data, u64 *val); + int (*get_max_nibble_mask)(struct device *dev, void *drv_data, u64 *val); + int (*get_bank_group)(struct device *dev, void *drv_data, u32 *val); + int (*set_bank_group)(struct device *dev, void *drv_data, u32 val); + int (*get_min_bank_group)(struct device *dev, void *drv_data, u32 *val); + int (*get_max_bank_group)(struct device *dev, void *drv_data, u32 *val); + int (*get_bank)(struct device *dev, void *drv_data, u32 *val); + int (*set_bank)(struct device *dev, void *drv_data, u32 val); + int (*get_min_bank)(struct device *dev, void *drv_data, u32 *val); + int (*get_max_bank)(struct device *dev, void *drv_data, u32 *val); + int (*get_rank)(struct device *dev, void *drv_data, u32 *val); + int (*set_rank)(struct device *dev, void *drv_data, u32 val); + int (*get_min_rank)(struct device *dev, void *drv_data, u32 *val); + int (*get_max_rank)(struct device *dev, void *drv_data, u32 *val); + int (*get_row)(struct device *dev, void *drv_data, u64 *val); + int (*set_row)(struct device *dev, void *drv_data, u64 val); + int (*get_min_row)(struct device *dev, void *drv_data, u64 *val); + int (*get_max_row)(struct device *dev, void *drv_data, u64 *val); + int (*get_column)(struct device *dev, void *drv_data, u32 *val); + int (*set_column)(struct device *dev, void *drv_data, u32 val); + int (*get_min_column)(struct device *dev, void *drv_data, u32 *val); + int (*get_max_column)(struct device *dev, void *drv_data, u32 *val); + int (*get_channel)(struct device *dev, void *drv_data, u32 *val); + int (*set_channel)(struct device *dev, void *drv_data, u32 val); + int (*get_min_channel)(struct device *dev, void *drv_data, u32 *val); + int (*get_max_channel)(struct device *dev, void *drv_data, u32 *val); + int (*get_sub_channel)(struct device *dev, void *drv_data, u32 *val); + int (*set_sub_channel)(struct device *dev, void *drv_data, u32 val); + int (*get_min_sub_channel)(struct device *dev, void *drv_data, u32 *val); + int (*get_max_sub_channel)(struct device *dev, void *drv_data, u32 *val); + int (*do_repair)(struct device *dev, void *drv_data, u32 val); +}; + +int edac_mem_repair_get_desc(struct device *dev, + const struct attribute_group **attr_groups, + u8 instance); + /* EDAC device feature information structure */ struct edac_dev_data { union { const struct edac_scrub_ops *scrub_ops; const struct edac_ecs_ops *ecs_ops; + const struct edac_mem_repair_ops *mem_repair_ops; }; u8 instance; void *private; @@ -744,6 +881,7 @@ struct edac_dev_feat_ctx { void *private; struct edac_dev_data *scrub; struct edac_dev_data ecs; + struct edac_dev_data *mem_repair; }; struct edac_dev_feature { @@ -752,6 +890,7 @@ struct edac_dev_feature { union { const struct edac_scrub_ops *scrub_ops; const struct edac_ecs_ops *ecs_ops; + const struct edac_mem_repair_ops *mem_repair_ops; }; void *ctx; struct edac_ecs_ex_info ecs_info;