Message ID | 20191010202418.25098-7-rrichter@marvell.com (mailing list archive) |
---|---|
State | New, archived |
Series | EDAC: Rework edac_mc and ghes drivers |
On Thu, 10 Oct 2019 20:25:16 +0000, Robert Richter <rrichter@marvell.com> wrote:

> Looking at how mci->{ue,ce}_per_layer[EDAC_MAX_LAYERS] is used, it
> turns out that only the leaves in the memory hierarchy are consumed
> (in sysfs), but not the intermediate layers, e.g.:
>
>   count = dimm->mci->ce_per_layer[dimm->mci->n_layers-1][dimm->idx];
>
> These unused counters only add complexity, remove them. The error
> counter values are directly stored in struct dimm_info now.

Hmm... not sure if this patch is correct. I remember that there are some
border cases in some drivers (maybe the 3-layer drivers used together
with RDIMM memory controllers?) where some errors are not associated
with a specific DIMM but are instead related to a problem on the memory
bus.

Also, depending on how the memory controllers are organized [1], the ECC
logic groups memory in DIMM pairs. So, when an error occurs, it may be
either at DIMM1 or DIMM2.

[1] On Intel, this happens with pre-Nehalem memory controllers.

Due to that, storing errors in the dimm struct sounds wrong, as the
error may affect multiple DIMMs or even the entire layer.

> Signed-off-by: Robert Richter <rrichter@marvell.com>

[...]

Thanks,
Mauro

On 11.10.19 07:40:31, Mauro Carvalho Chehab wrote:
> On Thu, 10 Oct 2019 20:25:16 +0000,
> Robert Richter <rrichter@marvell.com> wrote:
>
> > Looking at how mci->{ue,ce}_per_layer[EDAC_MAX_LAYERS] is used, it
> > turns out that only the leaves in the memory hierarchy are consumed
> > (in sysfs), but not the intermediate layers, e.g.:
> >
> >   count = dimm->mci->ce_per_layer[dimm->mci->n_layers-1][dimm->idx];
> >
> > These unused counters only add complexity, remove them. The error
> > counter values are directly stored in struct dimm_info now.
>
> Hmm... not sure if this patch is correct. I remember that there are some
> border cases in some drivers (maybe the 3-layer drivers used together
> with RDIMM memory controllers?) where some errors are not associated
> with a specific DIMM but are instead related to a problem on the memory
> bus.
>
> Also, depending on how the memory controllers are organized [1], the ECC
> logic groups memory in DIMM pairs. So, when an error occurs, it may be
> either at DIMM1 or DIMM2.
>
> [1] On Intel, this happens with pre-Nehalem memory controllers.
>
> Due to that, storing errors in the dimm struct sounds wrong, as the
> error may affect multiple DIMMs or even the entire layer.

That was my first thought too, but the counter values are not used at
all. The only exception is to provide *per-dimm* counters here:
{ce,ue}_per_layer[n_layers-1][dimm->idx].

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/edac_mc_sysfs.c?id=4f5cafb5cb8471e54afdc9054d973535614f7675#n567
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/edac_mc_sysfs.c?id=4f5cafb5cb8471e54afdc9054d973535614f7675#n584

The case you mentioned above is when the mc only sends part of the
error location (with a top, mid or low layer missing). The dimm cannot
be identified then. In this case edac_mc_handle_error() tries to find
a unique row (+ channel information if available) and lists all
possible dimm labels in e->label. See:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/edac_mc.c?id=4f5cafb5cb8471e54afdc9054d973535614f7675#n1153

Thus, we see a counter increment for the row (and also for the channel
if it can be identified), but this is counted only in the mci->csrows
array, which is not removed by this patch.

That said, the {ue,ce}_per_layer[] arrays can be removed while keeping
the same driver functionality, esp. for the case you mentioned above.

-Robert
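
Robert's argument hinges on the index arithmetic of the removed loop: the
entry written for the last layer is the linear DIMM index, which is the only
entry sysfs ever read via ce_per_layer[n_layers - 1][dimm->idx]. Below is a
minimal stand-alone sketch of that arithmetic (user-space C with hypothetical
layer sizes and error position, not kernel code):

    #include <stdio.h>

    #define N_LAYERS 3

    int main(void)
    {
            unsigned int size[N_LAYERS] = { 2, 4, 2 };  /* hypothetical layer sizes */
            int pos[N_LAYERS]           = { 1, 2, 0 };  /* hypothetical error position */
            unsigned int index = 0;

            for (int i = 0; i < N_LAYERS; i++) {
                    index += pos[i];
                    /* the removed code incremented ce_per_layer[i][index] here */
                    if (i < N_LAYERS - 1)
                            index *= size[i + 1];
            }

            /* 1 * (4 * 2) + 2 * 2 + 0 == 12: the flat (leaf) DIMM index */
            printf("leaf index = %u\n", index);
            return 0;
    }

Only that leaf entry was reachable from sysfs; the counters written for the
intermediate layers were never read, which is what the patch removes.
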
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 45b02bb31964..c1e142643006 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -315,10 +315,9 @@ struct mem_ctl_info *edac_mc_alloc(unsigned int mc_num,
         struct csrow_info *csr;
         struct rank_info *chan;
         struct dimm_info *dimm;
-        u32 *ce_per_layer[EDAC_MAX_LAYERS], *ue_per_layer[EDAC_MAX_LAYERS];
         unsigned int pos[EDAC_MAX_LAYERS];
-        unsigned int idx, size, tot_dimms = 1, count = 1;
-        unsigned int tot_csrows = 1, tot_channels = 1, tot_errcount = 0;
+        unsigned int idx, size, tot_dimms = 1;
+        unsigned int tot_csrows = 1, tot_channels = 1;
         void *pvt, *p, *ptr = NULL;
         int i, j, row, chn, n, len;
         bool per_rank = false;
@@ -346,19 +345,10 @@ struct mem_ctl_info *edac_mc_alloc(unsigned int mc_num,
          * stringent as what the compiler would provide if we could simply
          * hardcode everything into a single struct.
          */
-        mci = edac_align_ptr(&ptr, sizeof(*mci), 1);
-        layer = edac_align_ptr(&ptr, sizeof(*layer), n_layers);
-        for (i = 0; i < n_layers; i++) {
-                count *= layers[i].size;
-                edac_dbg(4, "errcount layer %d size %d\n", i, count);
-                ce_per_layer[i] = edac_align_ptr(&ptr, sizeof(u32), count);
-                ue_per_layer[i] = edac_align_ptr(&ptr, sizeof(u32), count);
-                tot_errcount += 2 * count;
-        }
-
-        edac_dbg(4, "allocating %d error counters\n", tot_errcount);
-        pvt = edac_align_ptr(&ptr, sz_pvt, 1);
-        size = ((unsigned long)pvt) + sz_pvt;
+        mci = edac_align_ptr(&ptr, sizeof(*mci), 1);
+        layer = edac_align_ptr(&ptr, sizeof(*layer), n_layers);
+        pvt = edac_align_ptr(&ptr, sz_pvt, 1);
+        size = ((unsigned long)pvt) + sz_pvt;

         edac_dbg(1, "allocating %u bytes for mci data (%d %s, %d csrows/channels)\n",
                  size,
@@ -374,10 +364,6 @@ struct mem_ctl_info *edac_mc_alloc(unsigned int mc_num,
          * rather than an imaginary chunk of memory located at address 0.
          */
         layer = (struct edac_mc_layer *)(((char *)mci) + ((unsigned long)layer));
-        for (i = 0; i < n_layers; i++) {
-                mci->ce_per_layer[i] = (u32 *)((char *)mci + ((unsigned long)ce_per_layer[i]));
-                mci->ue_per_layer[i] = (u32 *)((char *)mci + ((unsigned long)ue_per_layer[i]));
-        }
         pvt = sz_pvt ? (((char *)mci) + ((unsigned long)pvt)) : NULL;

         /* setup index and various internal pointers */
@@ -908,53 +894,31 @@ const char *edac_layer_name[] = {
 EXPORT_SYMBOL_GPL(edac_layer_name);

 static void edac_inc_ce_error(struct mem_ctl_info *mci,
-                              bool enable_per_layer_report,
                               const int pos[EDAC_MAX_LAYERS],
                               const u16 count)
 {
-        int i, index = 0;
+        struct dimm_info *dimm = edac_get_dimm(mci, pos[0], pos[1], pos[2]);

         mci->ce_mc += count;

-        if (!enable_per_layer_report) {
+        if (dimm)
+                dimm->ce_count += count;
+        else
                 mci->ce_noinfo_count += count;
-                return;
-        }
-
-        for (i = 0; i < mci->n_layers; i++) {
-                if (pos[i] < 0)
-                        break;
-                index += pos[i];
-                mci->ce_per_layer[i][index] += count;
-
-                if (i < mci->n_layers - 1)
-                        index *= mci->layers[i + 1].size;
-        }
 }

 static void edac_inc_ue_error(struct mem_ctl_info *mci,
-                              bool enable_per_layer_report,
                               const int pos[EDAC_MAX_LAYERS],
                               const u16 count)
 {
-        int i, index = 0;
+        struct dimm_info *dimm = edac_get_dimm(mci, pos[0], pos[1], pos[2]);

         mci->ue_mc += count;

-        if (!enable_per_layer_report) {
+        if (dimm)
+                dimm->ue_count += count;
+        else
                 mci->ue_noinfo_count += count;
-                return;
-        }
-
-        for (i = 0; i < mci->n_layers; i++) {
-                if (pos[i] < 0)
-                        break;
-                index += pos[i];
-                mci->ue_per_layer[i][index] += count;
-
-                if (i < mci->n_layers - 1)
-                        index *= mci->layers[i + 1].size;
-        }
 }

 static void edac_ce_error(struct mem_ctl_info *mci,
@@ -965,7 +929,6 @@ static void edac_ce_error(struct mem_ctl_info *mci,
                           const char *label,
                           const char *detail,
                           const char *other_detail,
-                          const bool enable_per_layer_report,
                           const unsigned long page_frame_number,
                           const unsigned long offset_in_page,
                           long grain)
@@ -988,7 +951,7 @@ static void edac_ce_error(struct mem_ctl_info *mci,
                           error_count, msg, msg_aux, label,
                           location, detail);
         }
-        edac_inc_ce_error(mci, enable_per_layer_report, pos, error_count);
+        edac_inc_ce_error(mci, pos, error_count);

         if (mci->scrub_mode == SCRUB_SW_SRC) {
                 /*
@@ -1018,8 +981,7 @@ static void edac_ue_error(struct mem_ctl_info *mci,
                           const char *location,
                           const char *label,
                           const char *detail,
-                          const char *other_detail,
-                          const bool enable_per_layer_report)
+                          const char *other_detail)
 {
         char *msg_aux = "";

@@ -1048,7 +1010,7 @@ static void edac_ue_error(struct mem_ctl_info *mci,
                          msg, msg_aux, label, location, detail);
         }

-        edac_inc_ue_error(mci, enable_per_layer_report, pos, error_count);
+        edac_inc_ue_error(mci, pos, error_count);
 }

 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
@@ -1064,16 +1026,16 @@ void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
                          "page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx",
                          e->page_frame_number, e->offset_in_page,
                          e->grain, e->syndrome);
-                edac_ce_error(mci, e->error_count, pos, e->msg, e->location, e->label,
-                              detail, e->other_detail, e->enable_per_layer_report,
+                edac_ce_error(mci, e->error_count, pos, e->msg, e->location,
+                              e->label, detail, e->other_detail,
                               e->page_frame_number, e->offset_in_page, e->grain);
         } else {
                 snprintf(detail, sizeof(detail),
                          "page:0x%lx offset:0x%lx grain:%ld",
                          e->page_frame_number, e->offset_in_page, e->grain);

-                edac_ue_error(mci, e->error_count, pos, e->msg, e->location, e->label,
-                              detail, e->other_detail, e->enable_per_layer_report);
+                edac_ue_error(mci, e->error_count, pos, e->msg, e->location,
+                              e->label, detail, e->other_detail);
         }


@@ -1099,6 +1061,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
         int i, n_labels = 0;
         u8 grain_bits;
         struct edac_raw_error_desc *e = &mci->error_desc;
+        bool any_memory = true;

         edac_dbg(3, "MC%d\n", mci->mc_idx);

@@ -1116,9 +1079,9 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,

         /*
          * Check if the event report is consistent and if the memory
-         * location is known. If it is known, enable_per_layer_report will be
-         * true, the DIMM(s) label info will be filled and the per-layer
-         * error counters will be incremented.
+         * location is known. If it is known, the DIMM(s) label info
+         * will be filled and the DIMM's error counters will be
+         * incremented.
          */
         for (i = 0; i < mci->n_layers; i++) {
                 if (pos[i] >= (int)mci->layers[i].size) {
@@ -1136,7 +1099,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
                         pos[i] = -1;
                 }
                 if (pos[i] >= 0)
-                        e->enable_per_layer_report = true;
+                        any_memory = false;
         }

         /*
@@ -1166,16 +1129,17 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
                 e->grain = dimm->grain;

                 /*
-                 * If the error is memory-controller wide, there's no need to
-                 * seek for the affected DIMMs because the whole
-                 * channel/memory controller/... may be affected.
-                 * Also, don't show errors for empty DIMM slots.
+                 * If the error is memory-controller wide, there's no
+                 * need to seek for the affected DIMMs because the
+                 * whole channel/memory controller/... may be
+                 * affected. Also, don't show errors for empty DIMM
+                 * slots.
                  */
-                if (!e->enable_per_layer_report || !dimm->nr_pages)
+                if (any_memory || !dimm->nr_pages)
                         continue;

                 if (n_labels >= EDAC_MAX_LABELS) {
-                        e->enable_per_layer_report = false;
+                        any_memory = true;
                         break;
                 }
                 n_labels++;
@@ -1205,7 +1169,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
                         chan = -2;
         }

-        if (!e->enable_per_layer_report) {
+        if (any_memory) {
                 strcpy(e->label, "any memory");
         } else {
                 edac_dbg(4, "csrow/channel to increment: (%d,%d)\n", row, chan);
diff --git a/drivers/edac/edac_mc_sysfs.c b/drivers/edac/edac_mc_sysfs.c
index 0367554e7437..8682df2f7f4f 100644
--- a/drivers/edac/edac_mc_sysfs.c
+++ b/drivers/edac/edac_mc_sysfs.c
@@ -556,10 +556,8 @@ static ssize_t dimmdev_ce_count_show(struct device *dev,
                                      char *data)
 {
         struct dimm_info *dimm = to_dimm(dev);
-        u32 count;

-        count = dimm->mci->ce_per_layer[dimm->mci->n_layers-1][dimm->idx];
-        return sprintf(data, "%u\n", count);
+        return sprintf(data, "%u\n", dimm->ce_count);
 }

 static ssize_t dimmdev_ue_count_show(struct device *dev,
@@ -567,10 +565,8 @@ static ssize_t dimmdev_ue_count_show(struct device *dev,
                                      char *data)
 {
         struct dimm_info *dimm = to_dimm(dev);
-        u32 count;

-        count = dimm->mci->ue_per_layer[dimm->mci->n_layers-1][dimm->idx];
-        return sprintf(data, "%u\n", count);
+        return sprintf(data, "%u\n", dimm->ue_count);
 }

 /* dimm/rank attribute files */
@@ -666,7 +662,9 @@ static ssize_t mci_reset_counters_store(struct device *dev,
                                         const char *data, size_t count)
 {
         struct mem_ctl_info *mci = to_mci(dev);
-        int cnt, row, chan, i;
+        struct dimm_info *dimm;
+        int row, chan;
+
         mci->ue_mc = 0;
         mci->ce_mc = 0;
         mci->ue_noinfo_count = 0;
@@ -682,11 +680,9 @@ static ssize_t mci_reset_counters_store(struct device *dev,
                         ri->channels[chan]->ce_count = 0;
         }

-        cnt = 1;
-        for (i = 0; i < mci->n_layers; i++) {
-                cnt *= mci->layers[i].size;
-                memset(mci->ce_per_layer[i], 0, cnt * sizeof(u32));
-                memset(mci->ue_per_layer[i], 0, cnt * sizeof(u32));
+        mci_for_each_dimm(mci, dimm) {
+                dimm->ue_count = 0;
+                dimm->ce_count = 0;
         }

         mci->start_time = jiffies;
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 842080d7b33a..e0b90c6d7d63 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -347,11 +347,8 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
                                mem_err->mem_dev_handle);

                 index = get_dimm_smbios_index(mem_err->mem_dev_handle);
-                if (index >= 0) {
+                if (index >= 0)
                         e->top_layer = index;
-                        e->enable_per_layer_report = true;
-                }
-
         }
         if (p > e->location)
                 *(p - 1) = '\0';
diff --git a/include/linux/edac.h b/include/linux/edac.h
index 8beb6e834be9..8e72222e50b0 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -383,6 +383,9 @@ struct dimm_info {
         unsigned int csrow, cschannel;  /* Points to the old API data */

         u16 smbios_handle;              /* Handle for SMBIOS type 17 */
+
+        u32 ce_count;
+        u32 ue_count;
 };

 /**
@@ -453,8 +456,6 @@ struct errcount_attribute_data {
  * @location:           location of the error
  * @label:              label of the affected DIMM(s)
  * @other_detail:       other driver-specific detail about the error
- * @enable_per_layer_report:    if false, the error affects all layers
- *                              (typically, a memory controller error)
  */
 struct edac_raw_error_desc {
         /*
@@ -475,7 +476,6 @@ struct edac_raw_error_desc {
         unsigned long syndrome;
         const char *msg;
         const char *other_detail;
-        bool enable_per_layer_report;
 };

 /* MEMORY controller information structure
@@ -565,7 +565,6 @@ struct mem_ctl_info {
          */
         u32 ce_noinfo_count, ue_noinfo_count;
         u32 ue_mc, ce_mc;
-        u32 *ce_per_layer[EDAC_MAX_LAYERS], *ue_per_layer[EDAC_MAX_LAYERS];

         struct completion complete;

Looking at how mci->{ue,ce}_per_layer[EDAC_MAX_LAYERS] is used, it
turns out that only the leaves in the memory hierarchy are consumed
(in sysfs), but not the intermediate layers, e.g.:

  count = dimm->mci->ce_per_layer[dimm->mci->n_layers-1][dimm->idx];

These unused counters only add complexity, remove them. The error
counter values are directly stored in struct dimm_info now.

Signed-off-by: Robert Richter <rrichter@marvell.com>
---
 drivers/edac/edac_mc.c       | 104 ++++++++++++-----------------------
 drivers/edac/edac_mc_sysfs.c |  20 +++----
 drivers/edac/ghes_edac.c     |   5 +-
 include/linux/edac.h         |   7 +--
 4 files changed, 46 insertions(+), 90 deletions(-)
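
For reference, a simplified stand-alone model of the counting scheme after
this patch (user-space C, not the kernel implementation; get_dimm() is only a
stand-in for edac_get_dimm() and the structures are pared down to the
counters): a fully resolved position is charged to that DIMM's own counter,
anything else falls back to the memory controller's noinfo counter, and the
MC-wide total is always incremented.

    #include <stdio.h>

    struct dimm_info { unsigned int ce_count; };

    struct mem_ctl_info {
            unsigned int ce_mc, ce_noinfo_count;
            struct dimm_info dimms[4];              /* hypothetical 2x2 layout, 4 DIMMs */
    };

    /* Stand-in for edac_get_dimm(): NULL when the location is incomplete. */
    static struct dimm_info *get_dimm(struct mem_ctl_info *mci, int top, int mid)
    {
            if (top < 0 || mid < 0)
                    return NULL;
            return &mci->dimms[top * 2 + mid];
    }

    static void inc_ce_error(struct mem_ctl_info *mci, const int pos[2],
                             unsigned int count)
    {
            struct dimm_info *dimm = get_dimm(mci, pos[0], pos[1]);

            mci->ce_mc += count;                    /* MC-wide total, always counted */
            if (dimm)
                    dimm->ce_count += count;        /* per-DIMM counter, shown in sysfs */
            else
                    mci->ce_noinfo_count += count;  /* location not fully known */
    }

    int main(void)
    {
            struct mem_ctl_info mci = { 0 };
            int known[2]   = { 1, 0 };              /* resolves to dimms[2] */
            int unknown[2] = { 1, -1 };             /* channel missing */

            inc_ce_error(&mci, known, 1);
            inc_ce_error(&mci, unknown, 2);
            printf("ce_mc=%u dimm2=%u noinfo=%u\n",
                   mci.ce_mc, mci.dimms[2].ce_count, mci.ce_noinfo_count);
            return 0;
    }

The uncorrected-error path follows the same pattern with ue_count and
ue_noinfo_count.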