EDAC: skx_common: downgrade message importance on missing PCI device

Message ID 20191204212325.c4k47p5hrnn3vpb5@redhat.com (mailing list archive)
State New, archived
Series: EDAC: skx_common: downgrade message importance on missing PCI device

Commit Message

Aristeu Rozanski Dec. 4, 2019, 9:23 p.m. UTC
Both the skx_edac and i10nm_edac drivers are loaded based on the
matching CPU being available, which leads to the module being loaded
automatically in virtual machines as well. There the load will fail due
to the missing PCI devices. In both drivers the first function to make
use of the PCI devices, skx_get_hi_lo(), will simply print

	EDAC skx: Can't get tolm/tohm

for each CPU core, which is noisy. This patch makes it a debug message.

Signed-off-by: Aristeu Rozanski <aris@redhat.com>

Comments

Luck, Tony Dec. 9, 2019, 9:52 p.m. UTC | #1
>	EDAC skx: Can't get tolm/tohm
>
> for each CPU core, which is noisy. This patch makes it a debug message.

This looks like we call skx_init() once per core. Do we keep calling it because
the calls are failing?  Or do we do that even when calls succeed?

I was only really expecting that skx_init() would be called once.

-Tony
Luck, Tony Dec. 10, 2019, 12:02 a.m. UTC | #2
> This looks like we call skx_init() once per core. Do we keep calling it because
> the calls are failing?  Or do we do that even when calls succeed?
>
> I was only really expecting that skx_init() would be called once.

So (by experimentation) it seems that if the module load fails it
will be retried num_online_cpus times (though not bound to each
CPU in turn ... it will maybe try the init call on the same CPU multiple
times, but miss running on some CPUs).

If the load succeeds, then whoever is repeating the load decides
to stop.

-Tony
Borislav Petkov Dec. 10, 2019, 9 a.m. UTC | #3
On Tue, Dec 10, 2019 at 12:02:45AM +0000, Luck, Tony wrote:
> > This looks like we call skx_init() once per core. Do we keep calling it because
> > the calls are failing?  Or do we do that even when calls succeed?
> >
> > I was only really expecting that skx_init() would be called once.
> 
> So (by experimentation) it seems that if the module load fails it
> will be retried num_online_cpus times (though not bound to each
> CPU in turn ... it will maybe try the init call on the same CPU multiple
> times, but miss running on some CPUs).
> 
> If the load succeeds, then whoever is repeating the load decides
> to stop.

That's the result of our conversion to MODULE_DEVICE_TABLE to match CPU
models. So it tries once on each CPU:

https://lkml.kernel.org/r/20191107103857.GC19501@zn.tnic
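
For reference, the CPU matching boilerplate in skx_base.c looks roughly
like this (a sketch from memory -- the real table entry may also carry
driver_data, and i10nm_edac has an equivalent table):

#include <linux/module.h>
#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>

static const struct x86_cpu_id skx_cpuids[] = {
	{ X86_VENDOR_INTEL, 6, INTEL_FAM6_SKYLAKE_X, 0, 0 },
	{ }
};
/*
 * This exports a modalias per CPU, so udev ends up running modprobe
 * once for every core it enumerates; a successful load stops the
 * retries, a failed one does not.
 */
MODULE_DEVICE_TABLE(x86cpu, skx_cpuids);

static int __init skx_init(void)
{
	if (!x86_match_cpu(skx_cpuids))
		return -ENODEV;
	/* ... locate the memory controller PCI devices ... */
	return 0;
}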

I have no clean solution for this except maybe remembering the return
value of the first instance probing in the edac core module and then
asking it... it ain't pretty though.
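
Something like this, I mean (entirely made up, none of these symbols
exist in the EDAC core; locking and module-unload handling omitted):

/* EDAC core: cache the outcome of the first platform probe. */
static int edac_first_probe_result = 1;	/* 1 == nothing probed yet */

void edac_record_probe_result(int err)
{
	if (edac_first_probe_result == 1)
		edac_first_probe_result = err;
}

int edac_last_probe_result(void)
{
	return edac_first_probe_result;
}

and then in skx_init()/i10nm_init(), before doing any real work:

	/* A previous instance already failed: bail out quietly. */
	if (edac_last_probe_result() < 0)
		return edac_last_probe_result();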
Aristeu Rozanski Dec. 10, 2019, 2:18 p.m. UTC | #4
On Tue, Dec 10, 2019 at 12:02:45AM +0000, Luck, Tony wrote:
> > This looks like we call skx_init() once per core. Do we keep calling it because
> > the calls are failing?  Or do we do that even when calls succeed?
> >
> > I was only really expecting that skx_init() would be called once.
> 
> So (by experimentation) it seems that if the module load fails it
> will be retried num_online_cpus times (though not bound to each
> CPU in turn ... it will maybe try the init call on the same CPU multiple
> times, but miss running on some CPUs).
> 
> If the load succeeds, then whoever is repeating the load decides
> to stop.

Or it silently fails to load the module again for each of the online CPUs.
Aristeu Rozanski Jan. 6, 2020, 3:12 p.m. UTC | #5
On Tue, Dec 10, 2019 at 10:00:13AM +0100, Borislav Petkov wrote:
> On Tue, Dec 10, 2019 at 12:02:45AM +0000, Luck, Tony wrote:
> > > This looks like we call skx_init() once per core. Do we keep calling it because
> > > the calls are failing?  Or do we do that even when calls succeed?
> > >
> > > I was only really expecting that skx_init() would be called once.
> > 
> > So (by experimentation) it seems that if the module load fails it
> > will be retried num_online_cpus times (though not bound to each
> > CPU in turn ... it will maybe try the init call on the same CPU multiple
> > times, but miss running on some CPUs).
> > 
> > If the load succeeds, then whoever is repeating the load decides
> > to stop.
> 
> That's the result of our conversion to MODULE_DEVICE_TABLE to match CPU
> models. So it tries once on each CPU:
> 
> https://lkml.kernel.org/r/20191107103857.GC19501@zn.tnic
> 
> I have no clean solution for this except maybe remembering the return
> value of the first instance probing in the edac core module and then
> asking it... it ain't pretty though.

The other option I can think of is just allowing the module to load in
this situation: the CPU is present but no memory controller PCI devices
are. Some drivers do allow loading without a device present, without
errors. It's not clean, but neither is having to come up with something
modutils could pick up as a hint to not try to load more than once.
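
Roughly like this, say (hypothetical, and not what this patch does; the
device ID and register offsets are placeholders for the real ones in
skx_base.c, and skx_cpuids is the driver's CPU match table):

static int __init skx_init(void)
{
	static int off[] = { 0xd0, 0xd4, 0xd8 };	/* placeholder offsets */
	u64 tolm, tohm;

	if (!x86_match_cpu(skx_cpuids))
		return -ENODEV;

	/*
	 * No memory controller PCI devices (e.g. inside a VM): claim
	 * success so the module stays loaded but idle, nothing is
	 * logged, and udev stops retrying the load on every CPU.
	 */
	if (skx_get_hi_lo(0x2034, off, &tolm, &tohm) < 0)
		return 0;

	/* ... normal probe continues ... */
	return 0;
}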

Or we could just downgrade this specific message, since it's a very
common case, and let the more unusual situations report more than once.
Borislav Petkov Jan. 6, 2020, 4:17 p.m. UTC | #6
On Mon, Jan 06, 2020 at 10:12:42AM -0500, 'Aristeu Rozanski' wrote:
> The other option I can think of is just allowing the module to load in
> this situation: the CPU is present but no memory controller PCI devices
> are. Some drivers do allow loading without a device present, without
> errors. It's not clean, but neither is having to come up with something
> modutils could pick up as a hint to not try to load more than once.

Yeah, but having a driver loaded when there's no hw to drive is also not
pretty.

> Or we could just downgrade this specific message, since it's a very
> common case, and let the more unusual situations report more than once.

Yap, we did a similar thing for amd64_edac:

7fdfee926be7 ("EDAC/amd64: Get rid of the ECC disabled long message")

but instead of downgrading prio we actually went and killed it directly.

:-)
Aristeu Rozanski Jan. 6, 2020, 4:20 p.m. UTC | #7
On Mon, Jan 06, 2020 at 05:17:32PM +0100, Borislav Petkov wrote:
> On Mon, Jan 06, 2020 at 10:12:42AM -0500, 'Aristeu Rozanski' wrote:
> > The other option I can think of is just allowing the module to load in
> > this situation: the CPU is present but no memory controller PCI devices
> > are. Some drivers do allow loading without a device present, without
> > errors. It's not clean, but neither is having to come up with something
> > modutils could pick up as a hint to not try to load more than once.
> 
> Yeah, but having a driver loaded when there's no hw to drive is also not
> pretty.
> 
> > Or we could just downgrade this specific message, since it's a very
> > common case, and let the more unusual situations report more than once.
> 
> Yap, we did a similar thing for amd64_edac:
> 
> 7fdfee926be7 ("EDAC/amd64: Get rid of the ECC disabled long message")
> 
> but instead of downgrading prio we actually went and killed it directly.
> 
> :-)

OK, will resubmit this patch just removing the messages then.

Thanks!
Borislav Petkov Jan. 6, 2020, 4:23 p.m. UTC | #8
On Mon, Jan 06, 2020 at 11:20:14AM -0500, 'Aristeu Rozanski' wrote:
> OK, will resubmit this patch just removing the messages then.

I'm not saying you should blindly remove them. They might be useful for
debugging purposes so you should consider that usage angle first. In the
AMD case, the message was really useless.
Aristeu Rozanski Jan. 7, 2020, 3:51 p.m. UTC | #9
On Mon, Jan 06, 2020 at 05:23:06PM +0100, Borislav Petkov wrote:
> I'm not saying you should blindly remove them. They might be useful for
> debugging purposes so you should consider that usage angle first. In the
> AMD case, the message was really useless.

Ah, I thought you had an objection to this patch :)
Do you mind considering it for inclusion?
Borislav Petkov Jan. 7, 2020, 4:45 p.m. UTC | #10
On Tue, Jan 07, 2020 at 10:51:09AM -0500, 'Aristeu Rozanski' wrote:
> On Mon, Jan 06, 2020 at 05:23:06PM +0100, Borislav Petkov wrote:
> > I'm not saying you should blindly remove them. They might be useful for
> > debugging purposes so you should consider that usage angle first. In the
> > AMD case, the message was really useless.
> 
> Ah, I thought you had an objection to this patch :)
> Do you mind considering it for inclusion?

That's Tony's call as he's maintaining the Intel side of EDAC.
Luck, Tony Jan. 7, 2020, 9:43 p.m. UTC | #11
On Tue, Jan 07, 2020 at 05:45:28PM +0100, Borislav Petkov wrote:
> On Tue, Jan 07, 2020 at 10:51:09AM -0500, 'Aristeu Rozanski' wrote:
> > On Mon, Jan 06, 2020 at 05:23:06PM +0100, Borislav Petkov wrote:
> > > I'm not saying you should blindly remove them. They might be useful for
> > > debugging purposes so you should consider that usage angle first. In the
> > > AMD case, the message was really useless.
> > 
> > Ah, I thought you had an objection to this patch :)
> > Do you mind considering it for inclusion?
> 
> That's Tony's call as he's maintaining the Intel side of EDAC.

Already applied to git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git
in edac-for-next branch.  Sorry, should have sent you an "Applied" message.

-Tony
Aristeu Rozanski Jan. 8, 2020, 2:54 p.m. UTC | #12
On Tue, Jan 07, 2020 at 01:43:10PM -0800, Luck, Tony wrote:
> Already applied to git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git
> in edac-for-next branch.  Sorry, should have sent you an "Applied" message.

Thanks Tony.
Patch

diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c
index 95662a4ff4c4..99bbaf629b8d 100644
--- a/drivers/edac/skx_common.c
+++ b/drivers/edac/skx_common.c
@@ -256,7 +256,7 @@ int skx_get_hi_lo(unsigned int did, int off[], u64 *tolm, u64 *tohm)
 
 	pdev = pci_get_device(PCI_VENDOR_ID_INTEL, did, NULL);
 	if (!pdev) {
-		skx_printk(KERN_ERR, "Can't get tolm/tohm\n");
+		edac_dbg(2, "Can't get tolm/tohm\n");
 		return -ENODEV;
 	}