diff mbox

drm: Reduce EDID warnings from DRM_ERROR to DRM_NOTE

Message ID 20170210195913.9878-1-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Chris Wilson Feb. 10, 2017, 7:59 p.m. UTC
The warnings from parsing the EDID are not driver errors, but the
"normal but significant" conditions from the external device. As such,
they do not need the ferocity of an *ERROR*, but can use the less harsh
DRM_NOTE instead.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/drm_edid.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

Comments

Thierry Reding Feb. 13, 2017, 7:41 a.m. UTC | #1
On Fri, Feb 10, 2017 at 07:59:13PM +0000, Chris Wilson wrote:
> The warnings from parsing the EDID are not driver errors, but the
> "normal but significant" conditions from the external device. As such,
> they do not need the ferocity of an *ERROR*, but can use the less harsh
> DRM_NOTE instead.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/drm_edid.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)

The below are all conditions that happen when the EDID is bad. I'm not
sure that really qualifies as "normal".

From a quick look through the code we don't always trigger an error from
the below failure paths at higher levels, so decreasing the level here
has the potential to let this kind of exceptional condition go
unnoticed.

Thierry
Chris Wilson Feb. 13, 2017, 8:59 a.m. UTC | #2
On Mon, Feb 13, 2017 at 08:41:10AM +0100, Thierry Reding wrote:
> On Fri, Feb 10, 2017 at 07:59:13PM +0000, Chris Wilson wrote:
> > The warnings from parsing the EDID are not driver errors, but the
> > "normal but significant" conditions from the external device. As such,
> > they do not need the ferocity of an *ERROR*, but can use the less harsh
> > DRM_NOTE instead.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> >  drivers/gpu/drm/drm_edid.c | 15 ++++++++-------
> >  1 file changed, 8 insertions(+), 7 deletions(-)
> 
> The below are all conditions that happen when the EDID is bad. I'm not
> sure that really qualifies as "normal".

Often it is - a bad EDID on the monitor will always be bad. The
challenge is distinguishing that from silent data corruption during the
read - a reported read failure are trivial.
 
> From a quick look through the code we don't always trigger an error from
> the below failure paths at higher levels, so decreasing the level here
> has the potential to let this kind of exceptional condition go
> unnoticed.

The messages are not gone, they are higher than the default loglevel,
but now below the level at which they are printed to a terminal. The
bad EDID is either expected or recoverable, and definitely not fatal
so I don't think an *ERROR* is justified.
-Chris
Sean Paul Feb. 13, 2017, 5:17 p.m. UTC | #3
On Mon, Feb 13, 2017 at 3:59 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Mon, Feb 13, 2017 at 08:41:10AM +0100, Thierry Reding wrote:
>> On Fri, Feb 10, 2017 at 07:59:13PM +0000, Chris Wilson wrote:
>> > The warnings from parsing the EDID are not driver errors, but the
>> > "normal but significant" conditions from the external device. As such,
>> > they do not need the ferocity of an *ERROR*, but can use the less harsh
>> > DRM_NOTE instead.
>> >
>> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> > ---
>> >  drivers/gpu/drm/drm_edid.c | 15 ++++++++-------
>> >  1 file changed, 8 insertions(+), 7 deletions(-)
>>
>> The below are all conditions that happen when the EDID is bad. I'm not
>> sure that really qualifies as "normal".
>
> Often it is - a bad EDID on the monitor will always be bad. The
> challenge is distinguishing that from silent data corruption during the
> read - a reported read failure are trivial.
>
>> From a quick look through the code we don't always trigger an error from
>> the below failure paths at higher levels, so decreasing the level here
>> has the potential to let this kind of exceptional condition go
>> unnoticed.
>
> The messages are not gone, they are higher than the default loglevel,
> but now below the level at which they are printed to a terminal. The
> bad EDID is either expected or recoverable, and definitely not fatal
> so I don't think an *ERROR* is justified.

I tend to agree.

The description for the KERN_NOTICE level is "normal but significant
condition". I might argue that the presence of these EDID messages
represents a normal *or* significant condition (depending on why the
EDID is bad), but I don't think it's unreasonable to expect people to
check their logs if the display/mode is not working properly.

Sean



> -Chris
>
> --
> Chris Wilson, Intel Open Source Technology Centre
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Daniel Vetter Feb. 14, 2017, 9:36 p.m. UTC | #4
On Mon, Feb 13, 2017 at 12:17:27PM -0500, Sean Paul wrote:
> On Mon, Feb 13, 2017 at 3:59 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Mon, Feb 13, 2017 at 08:41:10AM +0100, Thierry Reding wrote:
> >> On Fri, Feb 10, 2017 at 07:59:13PM +0000, Chris Wilson wrote:
> >> > The warnings from parsing the EDID are not driver errors, but the
> >> > "normal but significant" conditions from the external device. As such,
> >> > they do not need the ferocity of an *ERROR*, but can use the less harsh
> >> > DRM_NOTE instead.
> >> >
> >> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >> > ---
> >> >  drivers/gpu/drm/drm_edid.c | 15 ++++++++-------
> >> >  1 file changed, 8 insertions(+), 7 deletions(-)
> >>
> >> The below are all conditions that happen when the EDID is bad. I'm not
> >> sure that really qualifies as "normal".
> >
> > Often it is - a bad EDID on the monitor will always be bad. The
> > challenge is distinguishing that from silent data corruption during the
> > read - a reported read failure are trivial.
> >
> >> From a quick look through the code we don't always trigger an error from
> >> the below failure paths at higher levels, so decreasing the level here
> >> has the potential to let this kind of exceptional condition go
> >> unnoticed.
> >
> > The messages are not gone, they are higher than the default loglevel,
> > but now below the level at which they are printed to a terminal. The
> > bad EDID is either expected or recoverable, and definitely not fatal
> > so I don't think an *ERROR* is justified.
> 
> I tend to agree.
> 
> The description for the KERN_NOTICE level is "normal but significant
> condition". I might argue that the presence of these EDID messages
> represents a normal *or* significant condition (depending on why the
> EDID is bad), but I don't think it's unreasonable to expect people to
> check their logs if the display/mode is not working properly.

So for cases where we know that there is shit hw out there (specifically
kvm switches that mangle the cea block without adjusting the edid) we
already tune down the error to debug level. So in principle totally agree
with tuning down anything that happens because it's outside of our control
to info or debug, but do we still need this patch after the cea one has
landed? Our CI at least seems happy ...

Cheers, Daniel
Chris Wilson Feb. 14, 2017, 9:43 p.m. UTC | #5
On Tue, Feb 14, 2017 at 10:36:09PM +0100, Daniel Vetter wrote:
> On Mon, Feb 13, 2017 at 12:17:27PM -0500, Sean Paul wrote:
> > On Mon, Feb 13, 2017 at 3:59 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > > On Mon, Feb 13, 2017 at 08:41:10AM +0100, Thierry Reding wrote:
> > >> On Fri, Feb 10, 2017 at 07:59:13PM +0000, Chris Wilson wrote:
> > >> > The warnings from parsing the EDID are not driver errors, but the
> > >> > "normal but significant" conditions from the external device. As such,
> > >> > they do not need the ferocity of an *ERROR*, but can use the less harsh
> > >> > DRM_NOTE instead.
> > >> >
> > >> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > >> > ---
> > >> >  drivers/gpu/drm/drm_edid.c | 15 ++++++++-------
> > >> >  1 file changed, 8 insertions(+), 7 deletions(-)
> > >>
> > >> The below are all conditions that happen when the EDID is bad. I'm not
> > >> sure that really qualifies as "normal".
> > >
> > > Often it is - a bad EDID on the monitor will always be bad. The
> > > challenge is distinguishing that from silent data corruption during the
> > > read - a reported read failure are trivial.
> > >
> > >> From a quick look through the code we don't always trigger an error from
> > >> the below failure paths at higher levels, so decreasing the level here
> > >> has the potential to let this kind of exceptional condition go
> > >> unnoticed.
> > >
> > > The messages are not gone, they are higher than the default loglevel,
> > > but now below the level at which they are printed to a terminal. The
> > > bad EDID is either expected or recoverable, and definitely not fatal
> > > so I don't think an *ERROR* is justified.
> > 
> > I tend to agree.
> > 
> > The description for the KERN_NOTICE level is "normal but significant
> > condition". I might argue that the presence of these EDID messages
> > represents a normal *or* significant condition (depending on why the
> > EDID is bad), but I don't think it's unreasonable to expect people to
> > check their logs if the display/mode is not working properly.
> 
> So for cases where we know that there is shit hw out there (specifically
> kvm switches that mangle the cea block without adjusting the edid) we
> already tune down the error to debug level. So in principle totally agree
> with tuning down anything that happens because it's outside of our control
> to info or debug, but do we still need this patch after the cea one has
> landed? Our CI at least seems happy ...

Yes. The one machine with a dodgy EDID also happens to have a dodgy
BIOS. This reduces the number of consistent errors to 1, but since an
unrelated error still remains, CI doesn't detect the improvement.
https://intel-gfx-ci.01.org/CI/CI_DRM_2198/fi-skl-6700k/igt@drv_module_reload@basic-reload.html
-Chris
Daniel Vetter Feb. 14, 2017, 10:23 p.m. UTC | #6
On Tue, Feb 14, 2017 at 09:43:45PM +0000, Chris Wilson wrote:
> On Tue, Feb 14, 2017 at 10:36:09PM +0100, Daniel Vetter wrote:
> > On Mon, Feb 13, 2017 at 12:17:27PM -0500, Sean Paul wrote:
> > > On Mon, Feb 13, 2017 at 3:59 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > > > On Mon, Feb 13, 2017 at 08:41:10AM +0100, Thierry Reding wrote:
> > > >> On Fri, Feb 10, 2017 at 07:59:13PM +0000, Chris Wilson wrote:
> > > >> > The warnings from parsing the EDID are not driver errors, but the
> > > >> > "normal but significant" conditions from the external device. As such,
> > > >> > they do not need the ferocity of an *ERROR*, but can use the less harsh
> > > >> > DRM_NOTE instead.
> > > >> >
> > > >> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > >> > ---
> > > >> >  drivers/gpu/drm/drm_edid.c | 15 ++++++++-------
> > > >> >  1 file changed, 8 insertions(+), 7 deletions(-)
> > > >>
> > > >> The below are all conditions that happen when the EDID is bad. I'm not
> > > >> sure that really qualifies as "normal".
> > > >
> > > > Often it is - a bad EDID on the monitor will always be bad. The
> > > > challenge is distinguishing that from silent data corruption during the
> > > > read - a reported read failure are trivial.
> > > >
> > > >> From a quick look through the code we don't always trigger an error from
> > > >> the below failure paths at higher levels, so decreasing the level here
> > > >> has the potential to let this kind of exceptional condition go
> > > >> unnoticed.
> > > >
> > > > The messages are not gone, they are higher than the default loglevel,
> > > > but now below the level at which they are printed to a terminal. The
> > > > bad EDID is either expected or recoverable, and definitely not fatal
> > > > so I don't think an *ERROR* is justified.
> > > 
> > > I tend to agree.
> > > 
> > > The description for the KERN_NOTICE level is "normal but significant
> > > condition". I might argue that the presence of these EDID messages
> > > represents a normal *or* significant condition (depending on why the
> > > EDID is bad), but I don't think it's unreasonable to expect people to
> > > check their logs if the display/mode is not working properly.
> > 
> > So for cases where we know that there is shit hw out there (specifically
> > kvm switches that mangle the cea block without adjusting the edid) we
> > already tune down the error to debug level. So in principle totally agree
> > with tuning down anything that happens because it's outside of our control
> > to info or debug, but do we still need this patch after the cea one has
> > landed? Our CI at least seems happy ...
> 
> Yes. The one machine with a dodgy EDID also happens to have a dodgy
> BIOS. This reduces the number of consistent errors to 1, but since an
> unrelated error still remains, CI doesn't detect the improvement.
> https://intel-gfx-ci.01.org/CI/CI_DRM_2198/fi-skl-6700k/igt@drv_module_reload@basic-reload.html

Ok, count my convinced, I pushed the patch to drm-misc-next.
-Daniel
diff mbox

Patch

diff --git a/drivers/gpu/drm/drm_edid.c b/drivers/gpu/drm/drm_edid.c
index 5a3b34a88ac3..24e7b282f16c 100644
--- a/drivers/gpu/drm/drm_edid.c
+++ b/drivers/gpu/drm/drm_edid.c
@@ -1140,7 +1140,7 @@  bool drm_edid_block_valid(u8 *raw_edid, int block, bool print_bad_edid,
 			DRM_DEBUG("Assuming a KVM switch modified the CEA block but left the original checksum\n");
 		} else {
 			if (print_bad_edid)
-				DRM_ERROR("EDID checksum is invalid, remainder is %d\n", csum);
+				DRM_NOTE("EDID checksum is invalid, remainder is %d\n", csum);
 
 			goto bad;
 		}
@@ -1150,7 +1150,7 @@  bool drm_edid_block_valid(u8 *raw_edid, int block, bool print_bad_edid,
 	switch (raw_edid[0]) {
 	case 0: /* base */
 		if (edid->version != 1) {
-			DRM_ERROR("EDID has major version %d, instead of 1\n", edid->version);
+			DRM_NOTE("EDID has major version %d, instead of 1\n", edid->version);
 			goto bad;
 		}
 
@@ -1167,11 +1167,12 @@  bool drm_edid_block_valid(u8 *raw_edid, int block, bool print_bad_edid,
 bad:
 	if (print_bad_edid) {
 		if (drm_edid_is_zero(raw_edid, EDID_LENGTH)) {
-			printk(KERN_ERR "EDID block is all zeroes\n");
+			printk(KERN_NOTICE "EDID block is all zeroes\n");
 		} else {
-			printk(KERN_ERR "Raw EDID:\n");
-			print_hex_dump(KERN_ERR, " \t", DUMP_PREFIX_NONE, 16, 1,
-			       raw_edid, EDID_LENGTH, false);
+			printk(KERN_NOTICE "Raw EDID:\n");
+			print_hex_dump(KERN_NOTICE,
+				       " \t", DUMP_PREFIX_NONE, 16, 1,
+				       raw_edid, EDID_LENGTH, false);
 		}
 	}
 	return false;
@@ -4002,7 +4003,7 @@  static int validate_displayid(u8 *displayid, int length, int idx)
 		csum += displayid[i];
 	}
 	if (csum) {
-		DRM_ERROR("DisplayID checksum invalid, remainder is %d\n", csum);
+		DRM_NOTE("DisplayID checksum invalid, remainder is %d\n", csum);
 		return -EINVAL;
 	}
 	return 0;