
[i-g-t,1/2] tests/chamelium: Skip suspend/resume test with unreliable hotplug event

Message ID 20170718151627.29641-2-paul.kocialkowski@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Paul Kocialkowski July 18, 2017, 3:16 p.m. UTC
It may occur that a hotplug uevent is detected at resume, even though it
does not indicate that an actual hotplug happened. This is the case when
link training fails on any other connector.

There is currently no way to distinguish what connector caused a hotplug
uevent, nor what the reason for that uevent really is. This makes it
impossible to find out whether the test actually passed or not.

To circumvent this problem, the link status of each connector is
collected before and after suspend and compared to skip the test if
the state was good before and turned to bad after resume.

This only concerns the EDID change test, where we cannot check the
connector state (that is not supposed to have changed). For actual
hotplug tests, the tests should be safe since they check each
connector's state after receiving the uevent.

The situation described here happens with DP-VGA bridges that fail link
training after resume, as they need some more time to respond on their
AUX channel.

Signed-off-by: Paul Kocialkowski <paul.kocialkowski@linux.intel.com>
---
 tests/chamelium.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

Comments

Chris Wilson July 18, 2017, 9:21 p.m. UTC | #1
Quoting Paul Kocialkowski (2017-07-18 16:16:26)
> It may occur that a hotplug uevent is detected at resume, even though it
> does not indicate that an actual hotplug happened. This is the case when
> link training fails on any other connector.
> 
> There is currently no way to distinguish what connector caused a hotplug
> uevent, nor what the reason for that uevent really is. This makes it
> impossible to find out whether the test actually passed or not.

And you may get more than one and then this skips even though the test
passed. Looks like the patch is overcompensating. What you can do is
repeat the test a few times, and then look at all the different errors
you get. If the connector remains (no MST disappearance) once it goes
bad, it should remain bad and so not generate any new uevent. Or you
only repeat the test whilst link_status[old] != link_status[new].
-Chris
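
The "repeat whilst link_status[old] != link_status[new]" idea could be
sketched roughly as below. This is only an illustration under stated
assumptions: judge_cycle(), link_status_changed() and the cycle_verdict
enum are hypothetical names, not existing IGT helpers, and the per-port
flags are assumed to be collected the same way the patch collects them
(true meaning the connector reported a bad link status).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical verdict for one suspend/resume cycle. */
enum cycle_verdict {
	CYCLE_TRUST_RESULT,	/* link status stable: trust the uevent check */
	CYCLE_RETRY,		/* link status changed: run another cycle */
	CYCLE_GIVE_UP,		/* link status changed, no attempts left */
};

/* Return true if any port's "link status is bad" flag differs. */
static bool
link_status_changed(const bool *before, const bool *after, size_t count)
{
	size_t i;

	assert(before && after);

	for (i = 0; i < count; i++)
		if (before[i] != after[i])
			return true;

	return false;
}

/*
 * Decide what to do after one cycle: only trust the hotplug uevent when
 * no connector changed link status across suspend/resume.
 */
static enum cycle_verdict
judge_cycle(const bool *before, const bool *after, size_t count,
	    int attempts_left)
{
	if (!link_status_changed(before, after, count))
		return CYCLE_TRUST_RESULT;

	return attempts_left > 0 ? CYCLE_RETRY : CYCLE_GIVE_UP;
}
```

A caller would take a snapshot before suspend, another after resume, and
loop on CYCLE_RETRY with the newer snapshot as the next baseline. Note
that this only helps if a link that is already bad stops producing
uevents.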
Paul Kocialkowski July 19, 2017, 8:31 a.m. UTC | #2
On Tue, 2017-07-18 at 22:21 +0100, Chris Wilson wrote:
> Quoting Paul Kocialkowski (2017-07-18 16:16:26)
> > It may occur that a hotplug uevent is detected at resume, even
> > though it
> > does not indicate that an actual hotplug happened. This is the case
> > when
> > link training fails on any other connector.
> > 
> > There is currently no way to distinguish what connector caused a
> > hotplug
> > uevent, nor what the reason for that uevent really is. This makes it
> > impossible to find out whether the test actually passed or not.
> 
> And you may get more than one and then this skips even though the test
> passed. Looks like the patch is overcompensating. What you can do is
> repeat the test a few times, and then look at all the different errors
> you get. If the connector remains (no mst disappareance) once it goes
> bad, it should remain bad and so not generate any new uevent. Or you
> only repeat the test whilst link_status[old] != link_status[new].

I am not sure it is really desirable to repeat the test until we are
fairly certain it succeeds. This involves suspend/resume, which is
already long enough as it is.

Also, a uevent will be generated every time link training fails,
regardless of whether it was already failing before (I just tested that
to make sure). In my case, it's due to a DP-VGA bridge that will
consistently fail link training in the first seconds after resume.

So this is actually even worse than I thought, because there is no way
to find out that this is why a uevent was generated if the link status
was already bad before.

So I don't see how we can manage with the information currently at our
disposal.

My main point here is that we need more information about what's going
on than simply "HOTPLUG=1". These patches demonstrate that working
around the lack of information is a pain for testing purposes and can
only lead to semi-working hackish workarounds.

Do you agree that this is what the problem really is?
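
To make the limitation concrete, here is a minimal sketch of looking up
keys in a uevent property list. The sample payload in the usage note and
the "CONNECTOR" key are assumptions for illustration only (the thread
only establishes that "HOTPLUG=1" is sent), and uevent_get() is a
hypothetical helper, not a libudev or IGT function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Look up `key` in an array of "KEY=VALUE" strings, as a uevent's
 * properties are commonly represented. Returns a pointer to the value,
 * or NULL when the key is absent.
 */
static const char *
uevent_get(const char *const *props, size_t count, const char *key)
{
	size_t klen, i;

	assert(props && key);

	klen = strlen(key);

	for (i = 0; i < count; i++) {
		/* Match the whole key, followed immediately by '='. */
		if (strncmp(props[i], key, klen) == 0 &&
		    props[i][klen] == '=')
			return props[i] + klen + 1;
	}

	return NULL;
}
```

With an illustrative payload such as { "ACTION=change", "SUBSYSTEM=drm",
"HOTPLUG=1" }, looking up "HOTPLUG" yields "1", while a lookup for a
hypothetical "CONNECTOR" key returns NULL: nothing in the event says
which connector triggered it, or why.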
Lyude Paul July 19, 2017, 3:47 p.m. UTC | #3
On Wed, 2017-07-19 at 11:31 +0300, Paul Kocialkowski wrote:
> On Tue, 2017-07-18 at 22:21 +0100, Chris Wilson wrote:
> > Quoting Paul Kocialkowski (2017-07-18 16:16:26)
> > > It may occur that a hotplug uevent is detected at resume, even
> > > though it
> > > does not indicate that an actual hotplug happened. This is the
> > > case
> > > when
> > > link training fails on any other connector.
> > > 
> > > There is currently no way to distinguish what connector caused a
> > > hotplug
> > > uevent, nor what the reason for that uevent really is. This makes
> > > it
> > > impossible to find out whether the test actually passed or not.
> > 
> > And you may get more than one and then this skips even though the
> > test
> > passed. Looks like the patch is overcompensating. What you can do
> > is
> > repeat the test a few times, and then look at all the different
> > errors
> > you get. If the connector remains (no mst disappareance) once it
> > goes
> > bad, it should remain bad and so not generate any new uevent. Or
> > you
> > only repeat the test whilst link_status[old] != link_status[new].
> 
> I am not sure it is really desirable to repeat the test until we are
> fairly certain it succeeds. This involves suspend/resume, that is
> already long enough as it is.
> 
> Also, a uevent will be generated everytime link training fails,
> regardless of whether it was already failing before (I just tested
> that
> to make sure). In my case, it's due to a DP-VGA bridge that will
> consistently fail link training in the first seconds after resume.
> 
> So this is actually even worse that I thought, because there is no
> way
> to find out that this is why a uevent was generated if the link
> status
> was already bad before.
> 
> So I don't see how we can manage with the current information at
> disposal.
> 
> My main point here is that we need more information about what's
> going
> on than simply "HOTPLUG=1". These patches demonstrate that working
> around the lack of information is a pain for testing purposes and can
> only leads to semi-working hackish workarounds.
> 
> Do you agree that this is what the problem really is?
Yes, I agree we need more debugging information for when hotplugs fail.
That said, the fact that i915 unconditionally sends hotplug events on
resume (this appears to be a hack added to avoid missing hotplug events
across suspend/resume) is really what's causing this problem
specifically.

We really need the debugging stuff Martin and I suggested for the
kernel, and also more drm helpers to actually do edid checks and that
sort of stuff so that we don't have to deal with dirty hacks like this
:\.

Patch

diff --git a/tests/chamelium.c b/tests/chamelium.c
index e26f0557..8af33aaa 100644
--- a/tests/chamelium.c
+++ b/tests/chamelium.c
@@ -87,6 +87,31 @@  get_precalculated_crc(struct chamelium_port *port, int w, int h)
 }
 
 static void
+get_connectors_link_status_failed(data_t *data, bool *link_status_failed)
+{
+	drmModeConnector *connector;
+	uint64_t link_status;
+	drmModePropertyPtr prop;
+	int p;
+
+	for (p = 0; p < data->port_count; p++) {
+		connector = chamelium_port_get_connector(data->chamelium,
+							 data->ports[p], false);
+
+		igt_assert(kmstest_get_property(data->drm_fd,
+						connector->connector_id,
+						DRM_MODE_OBJECT_CONNECTOR,
+						"link-status", NULL,
+						&link_status, &prop));
+
+		link_status_failed[p] = link_status == DRM_MODE_LINK_STATUS_BAD;
+
+		drmModeFreeProperty(prop);
+		drmModeFreeConnector(connector);
+	}
+}
+
+static void
 require_connector_present(data_t *data, unsigned int type)
 {
 	int i;
@@ -310,6 +335,8 @@  test_suspend_resume_edid_change(data_t *data, struct chamelium_port *port,
 				int alt_edid_id)
 {
 	struct udev_monitor *mon = igt_watch_hotplug();
+	bool link_status_failed[2][data->port_count];
+	int p;
 
 	reset_state(data, port);
 
@@ -326,8 +353,16 @@  test_suspend_resume_edid_change(data_t *data, struct chamelium_port *port,
 	 */
 	chamelium_port_set_edid(data->chamelium, port, alt_edid_id);
 
+	get_connectors_link_status_failed(data, link_status_failed[0]);
+
 	igt_system_suspend_autoresume(state, test);
+
 	igt_assert(igt_hotplug_detected(mon, HOTPLUG_TIMEOUT));
+
+	get_connectors_link_status_failed(data, link_status_failed[1]);
+
+	for (p = 0; p < data->port_count; p++)
+		igt_skip_on(!link_status_failed[0][p] && link_status_failed[1][p]);
 }
 
 static igt_output_t *