diff mbox

Panic after S3 resume and modeset with MST

Message ID s5ha884jiho.wl-tiwai@suse.de (mailing list archive)
State New, archived
Headers show

Commit Message

Takashi Iwai March 29, 2017, 1:54 p.m. UTC
On Wed, 29 Mar 2017 15:34:15 +0200,
Ville Syrjälä wrote:
> 
> On Wed, Mar 29, 2017 at 03:10:09PM +0200, Takashi Iwai wrote:
> > On Mon, 27 Mar 2017 18:02:13 +0200,
> > Takashi Iwai wrote:
> > > 
> > > Hi,
> > > 
> > > the upstream fix a16b7658f4e0d4aec9bc3e75a5f0cc3f7a3a0422
> > >     drm/i915: Call intel_dp_mst_resume() before resuming displays
> > > 
> > > seems to trigger a kernel panic when some modeset change happens after
> > > S3 resume.  The details are found in openSUSE bugzilla,
> > >   https://bugzilla.suse.com/show_bug.cgi?id=1029634
> > > 
> > > In short, the following procedure causes a kernel panic (supposedly)
> > > almost 100% on Dell Latitude with Skylake with MST DP on dock:
> > > 
> > > - Boot with a docking station, DP-1 connected.
> > > - Login on X
> > > - xrandr --output eDP-1 --primary --auto --output DP-1-1 --auto --left-of eDP-1
> > >   ==> This changes the mode.
> > > - Suspend ("systemctl suspend" in my case), and close the lid.
> > > - Remove from the dock (keep the lid closed).
> > > - Open the lid, which resumes automatically.  It works.
> > > - Suspend again.
> > > - Connect to the dock again (keep the lid closed).
> > > - Open the lid, which resumes automatically.  It's still OK.
> > > - xrandr --output eDP-1 --primary --auto --output DP-1-1 --auto --left-of eDP-1
> > >   ==> Now the kernel feezes.
> > > 
> > > Reverting the commit mentioned above fixes the problem.
> > > 
> > > The problem is present in all versions I tested.  The reported kernel
> > > in the Bugzilla is 4.4.x-based one, but the issue is seen in 4.11-rc3,
> > > too.  Note that the S3 resume itself works in 4.11-rc3; the kernel
> > > panic happens when invoking xrandr manually after that.
> > > 
> > > Unfortunately, I couldn't get a kernel panic message, so far.  kdump
> > > didn't work well in this case by some reason.  There are some
> > > screenshots taken by the original reporter (could switch VT
> > > beforehand), but I don't know whether it helps.
> > > 
> > > If you have any hints for further debugging, it'd be highly
> > > appreciated.
> > 
> > It seems that the patch below works around the problem.
> > Can anyone enlighten what's going on there?
> > 
> > 
> > thanks,
> > 
> > Takashi
> > 
> > -- 8< --
> > From: Takashi Iwai <tiwai@suse.de>
> > Subject: [PATCH] drm/i915: Fix crash after S3 resume with DP MST mode change
> > 
> > We've got a bug report showing that Skylake Dell machines with a
> > docking station causes a kernel panic after S3 resume and modeset.
> > The details are found in the openSUSE bugzilla entry below.  The
> > typical test procedure is:
> > 
> > - Laptop is Dell Latitude with eDP (1366x768)
> > - Boot with docking station connected to a DP (1920x1080)
> > - Login, change the mode via
> >   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-of eDP-1
> > - Suspend, and close the lid after the suspend
> >   (or close the lid to trigger the suspend)
> > - Undock while keeping the lid closed.
> > - Open the lid, which triggers the resume;
> >   the machine wakes up well, and X shows up.  No problem, so far.
> > - Suspend again, close the lid.
> > - Dock again while keeping the lid closed.
> > - Open the lid, triggering the resume; this wakes up still fine.
> > - At this moment, run xrandr again to re-setup DP-1
> >   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-of eDP-1
> >   ==> This triggers a hard crash.
> > 
> > I could bisect it, and this leaded to the commit a16b7658f4e0
> > ("drm/i915: Call intel_dp_mst_resume() before resuming displays").
> > 
> > Basically the commit just shuffles the calls of intel_display_resume()
> > and intel_dp_mst_resume().  So as a workaround, I tried to split
> > intel_dp_mst_resume() call to postpone the suspected code (the
> > invocation of intel_dp_check_mst_status()), then bingo, this cured the
> > problem.
> > 
> > But don't ask me *why* this fixes.  It's still in a cargo-cult state.
> > 
> > Fixes: a16b7658f4e0 ("drm/i915: Call intel_dp_mst_resume() before resuming displays")
> > Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1029634
> > Signed-off-by: Takashi Iwai <tiwai@suse.de>
> > ---
> >  drivers/gpu/drm/i915/i915_drv.c  |  5 ++++-
> >  drivers/gpu/drm/i915/intel_dp.c  | 20 +++++++++++++++++++-
> >  drivers/gpu/drm/i915/intel_drv.h |  3 ++-
> >  3 files changed, 25 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> > index 1c75402a59c1..62c40090ceed 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.c
> > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > @@ -1559,6 +1559,7 @@ static int i915_suspend_switcheroo(struct drm_device *dev, pm_message_t state)
> >  static int i915_drm_resume(struct drm_device *dev)
> >  {
> >  	struct drm_i915_private *dev_priv = to_i915(dev);
> > +	int mst_pending;
> >  	int ret;
> >  
> >  	disable_rpm_wakeref_asserts(dev_priv);
> > @@ -1608,10 +1609,12 @@ static int i915_drm_resume(struct drm_device *dev)
> >  		dev_priv->display.hpd_irq_setup(dev_priv);
> >  	spin_unlock_irq(&dev_priv->irq_lock);
> >  
> > -	intel_dp_mst_resume(dev);
> > +	mst_pending = intel_dp_mst_resume(dev);
> >  
> >  	intel_display_resume(dev);
> >  
> > +	intel_dp_mst_resume_post(dev, mst_pending);
> > +
> >  	drm_kms_helper_poll_enable(dev);
> >  
> >  	/*
> > diff --git a/drivers/gpu/drm/i915/intel_dp.c b/drivers/gpu/drm/i915/intel_dp.c
> > index d1670b8afbf5..fc5ea900e6f3 100644
> > --- a/drivers/gpu/drm/i915/intel_dp.c
> > +++ b/drivers/gpu/drm/i915/intel_dp.c
> > @@ -6027,9 +6027,10 @@ void intel_dp_mst_suspend(struct drm_device *dev)
> >  	}
> >  }
> >  
> > -void intel_dp_mst_resume(struct drm_device *dev)
> > +int intel_dp_mst_resume(struct drm_device *dev)
> >  {
> >  	struct drm_i915_private *dev_priv = to_i915(dev);
> > +	int pending = 0;
> >  	int i;
> >  
> >  	for (i = 0; i < I915_MAX_PORTS; i++) {
> > @@ -6041,6 +6042,23 @@ void intel_dp_mst_resume(struct drm_device *dev)
> >  
> >  		ret = drm_dp_mst_topology_mgr_resume(&intel_dig_port->dp.mst_mgr);
> >  		if (ret)
> > +			pending |= 1 << i;
> > +	}
> > +
> > +	return pending;
> > +}
> > +
> > +void intel_dp_mst_resume_post(struct drm_device *dev, int pending)
> > +{
> > +	struct drm_i915_private *dev_priv = to_i915(dev);
> > +	int i;
> > +
> > +	for (i = 0; i < I915_MAX_PORTS; i++) {
> > +		struct intel_digital_port *intel_dig_port =
> > +			dev_priv->hotplug.irq_port[i];
> > +		if (!intel_dig_port || !intel_dig_port->dp.can_mst)
> > +			continue;
> > +		if (pending & (1 << i))
> >  			intel_dp_check_mst_status(&intel_dig_port->dp);
> >  	}
> >  }
> 
> The whole MST resume is a bit of chicken and egg type of situation. We
> need the HPD interrupts to resume the previous state, but we don't want
> to actually process real hotplugs until we've done the resume. The
> current code is definitely broken IMO.
> 
> But I'm not really sure why this patch fixes things because the HPD
> processing that will occur when we talk to the sink during the display
> resume should also call intel_dp_check_mst_status().

Actually, just dropping intel_dp_check_mst_status() calls in
intel_dp_mst_resume() seems enough to fix the problem.

Below is the v2 patch doing that.  Does it make more sense?


thanks,

Takashi

-- 8< --
From: Takashi Iwai <tiwai@suse.de>
Subject: [PATCH] drm/i915: Fix crash after S3 resume with DP MST mode change
 (v2)

We've got a bug report showing that Skylake Dell machines with a
docking station causes a kernel panic after S3 resume and modeset.
The details are found in the openSUSE bugzilla entry below.  The
typical test procedure is:

- Laptop is Dell Latitude with eDP (1366x768)
- Boot with docking station connected to a DP (1920x1080)
- Login, change the mode via
  xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-of eDP-1
- Suspend, and close the lid after the suspend
  (or close the lid to trigger the suspend)
- Undock while keeping the lid closed.
- Open the lid, which triggers the resume;
  the machine wakes up well, and X shows up.  No problem, so far.
- Suspend again, close the lid.
- Dock again while keeping the lid closed.
- Open the lid, triggering the resume; this wakes up still fine.
- At this moment, run xrandr again to re-setup DP-1
  xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-of eDP-1
  ==> This triggers a hard crash.

I could bisect it, and this leaded to the commit a16b7658f4e0
("drm/i915: Call intel_dp_mst_resume() before resuming displays").

This patch tries to work around the crash by ignoring the failed port
from drm_dp_mst_topology_mgr_resume().  They should be handled in hpd
later in anyway.

v1->v2: just ignore the drm_dp_mst_topology_mgr_resume() error codes
        instead of postponing.

Fixes: a16b7658f4e0 ("drm/i915: Call intel_dp_mst_resume() before resuming displays")
Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1029634
Signed-off-by: Takashi Iwai <tiwai@suse.de>
---
 drivers/gpu/drm/i915/intel_dp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Lyude Paul March 30, 2017, 12:24 a.m. UTC | #1
On Wed, 2017-03-29 at 15:54 +0200, Takashi Iwai wrote:
> On Wed, 29 Mar 2017 15:34:15 +0200,
> Ville Syrjälä wrote:
> > 
> > On Wed, Mar 29, 2017 at 03:10:09PM +0200, Takashi Iwai wrote:
> > > On Mon, 27 Mar 2017 18:02:13 +0200,
> > > Takashi Iwai wrote:
> > > > 
> > > > Hi,
> > > > 
> > > > the upstream fix a16b7658f4e0d4aec9bc3e75a5f0cc3f7a3a0422
> > > >     drm/i915: Call intel_dp_mst_resume() before resuming
> > > > displays
> > > > 
> > > > seems to trigger a kernel panic when some modeset change
> > > > happens after
> > > > S3 resume.  The details are found in openSUSE bugzilla,
> > > >   https://bugzilla.suse.com/show_bug.cgi?id=1029634
> > > > 
> > > > In short, the following procedure causes a kernel panic
> > > > (supposedly)
> > > > almost 100% on Dell Latitude with Skylake with MST DP on dock:
> > > > 
> > > > - Boot with a docking station, DP-1 connected.
> > > > - Login on X
> > > > - xrandr --output eDP-1 --primary --auto --output DP-1-1 --auto 
> > > > --left-of eDP-1
> > > >   ==> This changes the mode.
> > > > - Suspend ("systemctl suspend" in my case), and close the lid.
> > > > - Remove from the dock (keep the lid closed).
> > > > - Open the lid, which resumes automatically.  It works.
> > > > - Suspend again.
> > > > - Connect to the dock again (keep the lid closed).
> > > > - Open the lid, which resumes automatically.  It's still OK.
> > > > - xrandr --output eDP-1 --primary --auto --output DP-1-1 --auto 
> > > > --left-of eDP-1
> > > >   ==> Now the kernel feezes.
> > > > 
> > > > Reverting the commit mentioned above fixes the problem.
> > > > 
> > > > The problem is present in all versions I tested.  The reported
> > > > kernel
> > > > in the Bugzilla is 4.4.x-based one, but the issue is seen in
> > > > 4.11-rc3,
> > > > too.  Note that the S3 resume itself works in 4.11-rc3; the
> > > > kernel
> > > > panic happens when invoking xrandr manually after that.
> > > > 
> > > > Unfortunately, I couldn't get a kernel panic message, so
> > > > far.  kdump
> > > > didn't work well in this case by some reason.  There are some
> > > > screenshots taken by the original reporter (could switch VT
> > > > beforehand), but I don't know whether it helps.
> > > > 
> > > > If you have any hints for further debugging, it'd be highly
> > > > appreciated.
> > > 
> > > It seems that the patch below works around the problem.
> > > Can anyone enlighten what's going on there?
> > > 
> > > 
> > > thanks,
> > > 
> > > Takashi
> > > 
> > > -- 8< --
> > > From: Takashi Iwai <tiwai@suse.de>
> > > Subject: [PATCH] drm/i915: Fix crash after S3 resume with DP MST
> > > mode change
> > > 
> > > We've got a bug report showing that Skylake Dell machines with a
> > > docking station causes a kernel panic after S3 resume and
> > > modeset.
> > > The details are found in the openSUSE bugzilla entry below.  The
> > > typical test procedure is:
> > > 
> > > - Laptop is Dell Latitude with eDP (1366x768)
> > > - Boot with docking station connected to a DP (1920x1080)
> > > - Login, change the mode via
> > >   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-of
> > > eDP-1
> > > - Suspend, and close the lid after the suspend
> > >   (or close the lid to trigger the suspend)
> > > - Undock while keeping the lid closed.
> > > - Open the lid, which triggers the resume;
> > >   the machine wakes up well, and X shows up.  No problem, so far.
> > > - Suspend again, close the lid.
> > > - Dock again while keeping the lid closed.
> > > - Open the lid, triggering the resume; this wakes up still fine.
> > > - At this moment, run xrandr again to re-setup DP-1
> > >   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-of
> > > eDP-1
> > >   ==> This triggers a hard crash.
> > > 
> > > I could bisect it, and this leaded to the commit a16b7658f4e0
> > > ("drm/i915: Call intel_dp_mst_resume() before resuming
> > > displays").
> > > 
> > > Basically the commit just shuffles the calls of
> > > intel_display_resume()
> > > and intel_dp_mst_resume().  So as a workaround, I tried to split
> > > intel_dp_mst_resume() call to postpone the suspected code (the
> > > invocation of intel_dp_check_mst_status()), then bingo, this
> > > cured the
> > > problem.
> > > 
> > > But don't ask me *why* this fixes.  It's still in a cargo-cult
> > > state.
> > > 
> > > Fixes: a16b7658f4e0 ("drm/i915: Call intel_dp_mst_resume() before
> > > resuming displays")
> > > Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1029634
> > > Signed-off-by: Takashi Iwai <tiwai@suse.de>
> > > ---
> > >  drivers/gpu/drm/i915/i915_drv.c  |  5 ++++-
> > >  drivers/gpu/drm/i915/intel_dp.c  | 20 +++++++++++++++++++-
> > >  drivers/gpu/drm/i915/intel_drv.h |  3 ++-
> > >  3 files changed, 25 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/i915_drv.c
> > > b/drivers/gpu/drm/i915/i915_drv.c
> > > index 1c75402a59c1..62c40090ceed 100644
> > > --- a/drivers/gpu/drm/i915/i915_drv.c
> > > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > > @@ -1559,6 +1559,7 @@ static int i915_suspend_switcheroo(struct
> > > drm_device *dev, pm_message_t state)
> > >  static int i915_drm_resume(struct drm_device *dev)
> > >  {
> > >  	struct drm_i915_private *dev_priv = to_i915(dev);
> > > +	int mst_pending;
> > >  	int ret;
> > >  
> > >  	disable_rpm_wakeref_asserts(dev_priv);
> > > @@ -1608,10 +1609,12 @@ static int i915_drm_resume(struct
> > > drm_device *dev)
> > >  		dev_priv->display.hpd_irq_setup(dev_priv);
> > >  	spin_unlock_irq(&dev_priv->irq_lock);
> > >  
> > > -	intel_dp_mst_resume(dev);
> > > +	mst_pending = intel_dp_mst_resume(dev);
> > >  
> > >  	intel_display_resume(dev);
> > >  
> > > +	intel_dp_mst_resume_post(dev, mst_pending);
> > > +
> > >  	drm_kms_helper_poll_enable(dev);
> > >  
> > >  	/*
> > > diff --git a/drivers/gpu/drm/i915/intel_dp.c
> > > b/drivers/gpu/drm/i915/intel_dp.c
> > > index d1670b8afbf5..fc5ea900e6f3 100644
> > > --- a/drivers/gpu/drm/i915/intel_dp.c
> > > +++ b/drivers/gpu/drm/i915/intel_dp.c
> > > @@ -6027,9 +6027,10 @@ void intel_dp_mst_suspend(struct
> > > drm_device *dev)
> > >  	}
> > >  }
> > >  
> > > -void intel_dp_mst_resume(struct drm_device *dev)
> > > +int intel_dp_mst_resume(struct drm_device *dev)
> > >  {
> > >  	struct drm_i915_private *dev_priv = to_i915(dev);
> > > +	int pending = 0;
> > >  	int i;
> > >  
> > >  	for (i = 0; i < I915_MAX_PORTS; i++) {
> > > @@ -6041,6 +6042,23 @@ void intel_dp_mst_resume(struct drm_device
> > > *dev)
> > >  
> > >  		ret =
> > > drm_dp_mst_topology_mgr_resume(&intel_dig_port->dp.mst_mgr);
> > >  		if (ret)
> > > +			pending |= 1 << i;
> > > +	}
> > > +
> > > +	return pending;
> > > +}
> > > +
> > > +void intel_dp_mst_resume_post(struct drm_device *dev, int
> > > pending)
> > > +{
> > > +	struct drm_i915_private *dev_priv = to_i915(dev);
> > > +	int i;
> > > +
> > > +	for (i = 0; i < I915_MAX_PORTS; i++) {
> > > +		struct intel_digital_port *intel_dig_port =
> > > +			dev_priv->hotplug.irq_port[i];
> > > +		if (!intel_dig_port || !intel_dig_port-
> > > >dp.can_mst)
> > > +			continue;
> > > +		if (pending & (1 << i))
> > >  			intel_dp_check_mst_status(&intel_dig_por
> > > t->dp);
> > >  	}
> > >  }
> > 
> > The whole MST resume is a bit of chicken and egg type of situation.
> > We
> > need the HPD interrupts to resume the previous state, but we don't
> > want
> > to actually process real hotplugs until we've done the resume. The
> > current code is definitely broken IMO.
> > 
> > But I'm not really sure why this patch fixes things because the HPD
> > processing that will occur when we talk to the sink during the
> > display
> > resume should also call intel_dp_check_mst_status().
> 
> Actually, just dropping intel_dp_check_mst_status() calls in
> intel_dp_mst_resume() seems enough to fix the problem.
> 
> Below is the v2 patch doing that.  Does it make more sense?

Barely, unfortunately :(. I'll be honest I'm a little surprised this
patch doesn't break anything, but you do seem to be right in saying
that hotplugging re-sets up the displays in the event that things go
wrong.

Going through the backtrace and the dmesg that you gave on the bugzilla
I can't figure out any definitive answer for why this is crashing. We
will have to get dmesg output from when it crashes to understand this
I'm pretty sure.

JFYI (I really need to make a guide for this somewhere...) I usually
use the netconsole module to get kernel output from machines when they
kernel panic. In your case I would set this up right before you do the
final `xrandr` call that kills the machine, since netconsole has not
been very reliable in my experience over suspend/resume cycles. It
might take a few tries to get it to print the whole crash to the
netconsole (and you will need to make sure the system has a PCI
ethernet adapter, USB or wifi won't do), but this usually works.

If you could try getting a full dmesg with this, that would be super
appreciated.

JFYI: in the future you should use git send-email for sending these
patches to intel-gfx, since the format you sent it in is preventing
your patch from getting noticed by patchwork. Other then that though:

Reviewed-by: Lyude <lyude@redhat.com>
> 
> 
> thanks,
> 
> Takashi
> 
> -- 8< --
> From: Takashi Iwai <tiwai@suse.de>
> Subject: [PATCH] drm/i915: Fix crash after S3 resume with DP MST mode
> change
>  (v2)
> 
> We've got a bug report showing that Skylake Dell machines with a
> docking station causes a kernel panic after S3 resume and modeset.
> The details are found in the openSUSE bugzilla entry below.  The
> typical test procedure is:
> 
> - Laptop is Dell Latitude with eDP (1366x768)
> - Boot with docking station connected to a DP (1920x1080)
> - Login, change the mode via
>   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-of eDP-1
> - Suspend, and close the lid after the suspend
>   (or close the lid to trigger the suspend)
> - Undock while keeping the lid closed.
> - Open the lid, which triggers the resume;
>   the machine wakes up well, and X shows up.  No problem, so far.
> - Suspend again, close the lid.
> - Dock again while keeping the lid closed.
> - Open the lid, triggering the resume; this wakes up still fine.
> - At this moment, run xrandr again to re-setup DP-1
>   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-of eDP-1
>   ==> This triggers a hard crash.
> 
> I could bisect it, and this leaded to the commit a16b7658f4e0
> ("drm/i915: Call intel_dp_mst_resume() before resuming displays").
> 
> This patch tries to work around the crash by ignoring the failed port
> from drm_dp_mst_topology_mgr_resume().  They should be handled in hpd
> later in anyway.
> 
> v1->v2: just ignore the drm_dp_mst_topology_mgr_resume() error codes
>         instead of postponing.
> 
> Fixes: a16b7658f4e0 ("drm/i915: Call intel_dp_mst_resume() before
> resuming displays")
> Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1029634
> Signed-off-by: Takashi Iwai <tiwai@suse.de>
> ---
>  drivers/gpu/drm/i915/intel_dp.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_dp.c
> b/drivers/gpu/drm/i915/intel_dp.c
> index d1670b8afbf5..a6c0f0ac16eb 100644
> --- a/drivers/gpu/drm/i915/intel_dp.c
> +++ b/drivers/gpu/drm/i915/intel_dp.c
> @@ -6041,6 +6041,7 @@ void intel_dp_mst_resume(struct drm_device
> *dev)
>  
>  		ret =
> drm_dp_mst_topology_mgr_resume(&intel_dig_port->dp.mst_mgr);
>  		if (ret)
> -			intel_dp_check_mst_status(&intel_dig_port-
> >dp);
> +			DRM_DEBUG_KMS("DP MST resume failed for
> port-%c\n",
> +				      port_name(intel_dig_port-
> >port));
>  	}
>  }
Takashi Iwai March 30, 2017, 5:55 a.m. UTC | #2
On Thu, 30 Mar 2017 02:24:37 +0200,
Lyude Paul wrote:
> 
> On Wed, 2017-03-29 at 15:54 +0200, Takashi Iwai wrote:
> > On Wed, 29 Mar 2017 15:34:15 +0200,
> > Ville Syrjälä wrote:
> > > 
> > > On Wed, Mar 29, 2017 at 03:10:09PM +0200, Takashi Iwai wrote:
> > > > On Mon, 27 Mar 2017 18:02:13 +0200,
> > > > Takashi Iwai wrote:
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > the upstream fix a16b7658f4e0d4aec9bc3e75a5f0cc3f7a3a0422
> > > > >     drm/i915: Call intel_dp_mst_resume() before resuming
> > > > > displays
> > > > > 
> > > > > seems to trigger a kernel panic when some modeset change
> > > > > happens after
> > > > > S3 resume.  The details are found in openSUSE bugzilla,
> > > > >   https://bugzilla.suse.com/show_bug.cgi?id=1029634
> > > > > 
> > > > > In short, the following procedure causes a kernel panic
> > > > > (supposedly)
> > > > > almost 100% on Dell Latitude with Skylake with MST DP on dock:
> > > > > 
> > > > > - Boot with a docking station, DP-1 connected.
> > > > > - Login on X
> > > > > - xrandr --output eDP-1 --primary --auto --output DP-1-1 --auto 
> > > > > --left-of eDP-1
> > > > >   ==> This changes the mode.
> > > > > - Suspend ("systemctl suspend" in my case), and close the lid.
> > > > > - Remove from the dock (keep the lid closed).
> > > > > - Open the lid, which resumes automatically.  It works.
> > > > > - Suspend again.
> > > > > - Connect to the dock again (keep the lid closed).
> > > > > - Open the lid, which resumes automatically.  It's still OK.
> > > > > - xrandr --output eDP-1 --primary --auto --output DP-1-1 --auto 
> > > > > --left-of eDP-1
> > > > >   ==> Now the kernel feezes.
> > > > > 
> > > > > Reverting the commit mentioned above fixes the problem.
> > > > > 
> > > > > The problem is present in all versions I tested.  The reported
> > > > > kernel
> > > > > in the Bugzilla is 4.4.x-based one, but the issue is seen in
> > > > > 4.11-rc3,
> > > > > too.  Note that the S3 resume itself works in 4.11-rc3; the
> > > > > kernel
> > > > > panic happens when invoking xrandr manually after that.
> > > > > 
> > > > > Unfortunately, I couldn't get a kernel panic message, so
> > > > > far.  kdump
> > > > > didn't work well in this case by some reason.  There are some
> > > > > screenshots taken by the original reporter (could switch VT
> > > > > beforehand), but I don't know whether it helps.
> > > > > 
> > > > > If you have any hints for further debugging, it'd be highly
> > > > > appreciated.
> > > > 
> > > > It seems that the patch below works around the problem.
> > > > Can anyone enlighten what's going on there?
> > > > 
> > > > 
> > > > thanks,
> > > > 
> > > > Takashi
> > > > 
> > > > -- 8< --
> > > > From: Takashi Iwai <tiwai@suse.de>
> > > > Subject: [PATCH] drm/i915: Fix crash after S3 resume with DP MST
> > > > mode change
> > > > 
> > > > We've got a bug report showing that Skylake Dell machines with a
> > > > docking station causes a kernel panic after S3 resume and
> > > > modeset.
> > > > The details are found in the openSUSE bugzilla entry below.  The
> > > > typical test procedure is:
> > > > 
> > > > - Laptop is Dell Latitude with eDP (1366x768)
> > > > - Boot with docking station connected to a DP (1920x1080)
> > > > - Login, change the mode via
> > > >   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-of
> > > > eDP-1
> > > > - Suspend, and close the lid after the suspend
> > > >   (or close the lid to trigger the suspend)
> > > > - Undock while keeping the lid closed.
> > > > - Open the lid, which triggers the resume;
> > > >   the machine wakes up well, and X shows up.  No problem, so far.
> > > > - Suspend again, close the lid.
> > > > - Dock again while keeping the lid closed.
> > > > - Open the lid, triggering the resume; this wakes up still fine.
> > > > - At this moment, run xrandr again to re-setup DP-1
> > > >   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-of
> > > > eDP-1
> > > >   ==> This triggers a hard crash.
> > > > 
> > > > I could bisect it, and this leaded to the commit a16b7658f4e0
> > > > ("drm/i915: Call intel_dp_mst_resume() before resuming
> > > > displays").
> > > > 
> > > > Basically the commit just shuffles the calls of
> > > > intel_display_resume()
> > > > and intel_dp_mst_resume().  So as a workaround, I tried to split
> > > > intel_dp_mst_resume() call to postpone the suspected code (the
> > > > invocation of intel_dp_check_mst_status()), then bingo, this
> > > > cured the
> > > > problem.
> > > > 
> > > > But don't ask me *why* this fixes.  It's still in a cargo-cult
> > > > state.
> > > > 
> > > > Fixes: a16b7658f4e0 ("drm/i915: Call intel_dp_mst_resume() before
> > > > resuming displays")
> > > > Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1029634
> > > > Signed-off-by: Takashi Iwai <tiwai@suse.de>
> > > > ---
> > > >  drivers/gpu/drm/i915/i915_drv.c  |  5 ++++-
> > > >  drivers/gpu/drm/i915/intel_dp.c  | 20 +++++++++++++++++++-
> > > >  drivers/gpu/drm/i915/intel_drv.h |  3 ++-
> > > >  3 files changed, 25 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_drv.c
> > > > b/drivers/gpu/drm/i915/i915_drv.c
> > > > index 1c75402a59c1..62c40090ceed 100644
> > > > --- a/drivers/gpu/drm/i915/i915_drv.c
> > > > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > > > @@ -1559,6 +1559,7 @@ static int i915_suspend_switcheroo(struct
> > > > drm_device *dev, pm_message_t state)
> > > >  static int i915_drm_resume(struct drm_device *dev)
> > > >  {
> > > >  	struct drm_i915_private *dev_priv = to_i915(dev);
> > > > +	int mst_pending;
> > > >  	int ret;
> > > >  
> > > >  	disable_rpm_wakeref_asserts(dev_priv);
> > > > @@ -1608,10 +1609,12 @@ static int i915_drm_resume(struct
> > > > drm_device *dev)
> > > >  		dev_priv->display.hpd_irq_setup(dev_priv);
> > > >  	spin_unlock_irq(&dev_priv->irq_lock);
> > > >  
> > > > -	intel_dp_mst_resume(dev);
> > > > +	mst_pending = intel_dp_mst_resume(dev);
> > > >  
> > > >  	intel_display_resume(dev);
> > > >  
> > > > +	intel_dp_mst_resume_post(dev, mst_pending);
> > > > +
> > > >  	drm_kms_helper_poll_enable(dev);
> > > >  
> > > >  	/*
> > > > diff --git a/drivers/gpu/drm/i915/intel_dp.c
> > > > b/drivers/gpu/drm/i915/intel_dp.c
> > > > index d1670b8afbf5..fc5ea900e6f3 100644
> > > > --- a/drivers/gpu/drm/i915/intel_dp.c
> > > > +++ b/drivers/gpu/drm/i915/intel_dp.c
> > > > @@ -6027,9 +6027,10 @@ void intel_dp_mst_suspend(struct
> > > > drm_device *dev)
> > > >  	}
> > > >  }
> > > >  
> > > > -void intel_dp_mst_resume(struct drm_device *dev)
> > > > +int intel_dp_mst_resume(struct drm_device *dev)
> > > >  {
> > > >  	struct drm_i915_private *dev_priv = to_i915(dev);
> > > > +	int pending = 0;
> > > >  	int i;
> > > >  
> > > >  	for (i = 0; i < I915_MAX_PORTS; i++) {
> > > > @@ -6041,6 +6042,23 @@ void intel_dp_mst_resume(struct drm_device
> > > > *dev)
> > > >  
> > > >  		ret =
> > > > drm_dp_mst_topology_mgr_resume(&intel_dig_port->dp.mst_mgr);
> > > >  		if (ret)
> > > > +			pending |= 1 << i;
> > > > +	}
> > > > +
> > > > +	return pending;
> > > > +}
> > > > +
> > > > +void intel_dp_mst_resume_post(struct drm_device *dev, int
> > > > pending)
> > > > +{
> > > > +	struct drm_i915_private *dev_priv = to_i915(dev);
> > > > +	int i;
> > > > +
> > > > +	for (i = 0; i < I915_MAX_PORTS; i++) {
> > > > +		struct intel_digital_port *intel_dig_port =
> > > > +			dev_priv->hotplug.irq_port[i];
> > > > +		if (!intel_dig_port || !intel_dig_port-
> > > > >dp.can_mst)
> > > > +			continue;
> > > > +		if (pending & (1 << i))
> > > >  			intel_dp_check_mst_status(&intel_dig_por
> > > > t->dp);
> > > >  	}
> > > >  }
> > > 
> > > The whole MST resume is a bit of chicken and egg type of situation.
> > > We
> > > need the HPD interrupts to resume the previous state, but we don't
> > > want
> > > to actually process real hotplugs until we've done the resume. The
> > > current code is definitely broken IMO.
> > > 
> > > But I'm not really sure why this patch fixes things because the HPD
> > > processing that will occur when we talk to the sink during the
> > > display
> > > resume should also call intel_dp_check_mst_status().
> > 
> > Actually, just dropping intel_dp_check_mst_status() calls in
> > intel_dp_mst_resume() seems enough to fix the problem.
> > 
> > Below is the v2 patch doing that.  Does it make more sense?
> 
> Barely, unfortunately :(. I'll be honest I'm a little surprised this
> patch doesn't break anything, but you do seem to be right in saying
> that hotplugging re-sets up the displays in the event that things go
> wrong.
> 
> Going through the backtrace and the dmesg that you gave on the bugzilla
> I can't figure out any definitive answer for why this is crashing. We
> will have to get dmesg output from when it crashes to understand this
> I'm pretty sure.
> 
> JFYI (I really need to make a guide for this somewhere...) I usually
> use the netconsole module to get kernel output from machines when they
> kernel panic. In your case I would set this up right before you do the
> final `xrandr` call that kills the machine, since netconsole has not
> been very reliable in my experience over suspend/resume cycles. It
> might take a few tries to get it to print the whole crash to the
> netconsole (and you will need to make sure the system has a PCI
> ethernet adapter, USB or wifi won't do), but this usually works.

I've already tried netconsole (even before kdump), but unfortunately
it doesn't work in this case.  As mentioned in Bugzilla, the network
goes crazy when the machine crashes (even the whole network segment
the machine connects to is stalling).  It's similar like the symptom
we've seen with the SKL CPU-state screw up in the past with intel_idle
driver.

> If you could try getting a full dmesg with this, that would be super
> appreciated.
> 
> JFYI: in the future you should use git send-email for sending these
> patches to intel-gfx, since the format you sent it in is preventing
> your patch from getting noticed by patchwork. Other then that though:
> 
> Reviewed-by: Lyude <lyude@redhat.com>

Thanks, it's not meant to be a real submission of the patch, but
currently it's still a sort of RFC.  If there is no better solution
insight, I'll submit it as a proper fix patch.


Takashi

> > 
> > 
> > thanks,
> > 
> > Takashi
> > 
> > -- 8< --
> > From: Takashi Iwai <tiwai@suse.de>
> > Subject: [PATCH] drm/i915: Fix crash after S3 resume with DP MST mode
> > change
> >  (v2)
> > 
> > We've got a bug report showing that Skylake Dell machines with a
> > docking station causes a kernel panic after S3 resume and modeset.
> > The details are found in the openSUSE bugzilla entry below.  The
> > typical test procedure is:
> > 
> > - Laptop is Dell Latitude with eDP (1366x768)
> > - Boot with docking station connected to a DP (1920x1080)
> > - Login, change the mode via
> >   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-of eDP-1
> > - Suspend, and close the lid after the suspend
> >   (or close the lid to trigger the suspend)
> > - Undock while keeping the lid closed.
> > - Open the lid, which triggers the resume;
> >   the machine wakes up well, and X shows up.  No problem, so far.
> > - Suspend again, close the lid.
> > - Dock again while keeping the lid closed.
> > - Open the lid, triggering the resume; this wakes up still fine.
> > - At this moment, run xrandr again to re-setup DP-1
> >   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-of eDP-1
> >   ==> This triggers a hard crash.
> > 
> > I could bisect it, and this leaded to the commit a16b7658f4e0
> > ("drm/i915: Call intel_dp_mst_resume() before resuming displays").
> > 
> > This patch tries to work around the crash by ignoring the failed port
> > from drm_dp_mst_topology_mgr_resume().  They should be handled in hpd
> > later in anyway.
> > 
> > v1->v2: just ignore the drm_dp_mst_topology_mgr_resume() error codes
> >         instead of postponing.
> > 
> > Fixes: a16b7658f4e0 ("drm/i915: Call intel_dp_mst_resume() before
> > resuming displays")
> > Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1029634
> > Signed-off-by: Takashi Iwai <tiwai@suse.de>
> > ---
> >  drivers/gpu/drm/i915/intel_dp.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_dp.c
> > b/drivers/gpu/drm/i915/intel_dp.c
> > index d1670b8afbf5..a6c0f0ac16eb 100644
> > --- a/drivers/gpu/drm/i915/intel_dp.c
> > +++ b/drivers/gpu/drm/i915/intel_dp.c
> > @@ -6041,6 +6041,7 @@ void intel_dp_mst_resume(struct drm_device
> > *dev)
> >  
> >  		ret =
> > drm_dp_mst_topology_mgr_resume(&intel_dig_port->dp.mst_mgr);
> >  		if (ret)
> > -			intel_dp_check_mst_status(&intel_dig_port-
> > >dp);
> > +			DRM_DEBUG_KMS("DP MST resume failed for
> > port-%c\n",
> > +				      port_name(intel_dig_port-
> > >port));
> >  	}
> >  }
> -- 
> Cheers,
> 	Lyude
>
Lyude Paul March 30, 2017, 6:07 p.m. UTC | #3
On Thu, 2017-03-30 at 07:55 +0200, Takashi Iwai wrote:
> On Thu, 30 Mar 2017 02:24:37 +0200,
> Lyude Paul wrote:
> > 
> > On Wed, 2017-03-29 at 15:54 +0200, Takashi Iwai wrote:
> > > On Wed, 29 Mar 2017 15:34:15 +0200,
> > > Ville Syrjälä wrote:
> > > > 
> > > > On Wed, Mar 29, 2017 at 03:10:09PM +0200, Takashi Iwai wrote:
> > > > > On Mon, 27 Mar 2017 18:02:13 +0200,
> > > > > Takashi Iwai wrote:
> > > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > the upstream fix a16b7658f4e0d4aec9bc3e75a5f0cc3f7a3a0422
> > > > > >     drm/i915: Call intel_dp_mst_resume() before resuming
> > > > > > displays
> > > > > > 
> > > > > > seems to trigger a kernel panic when some modeset change
> > > > > > happens after
> > > > > > S3 resume.  The details are found in openSUSE bugzilla,
> > > > > >   https://bugzilla.suse.com/show_bug.cgi?id=1029634
> > > > > > 
> > > > > > In short, the following procedure causes a kernel panic
> > > > > > (supposedly)
> > > > > > almost 100% on Dell Latitude with Skylake with MST DP on
> > > > > > dock:
> > > > > > 
> > > > > > - Boot with a docking station, DP-1 connected.
> > > > > > - Login on X
> > > > > > - xrandr --output eDP-1 --primary --auto --output DP-1-1 --
> > > > > > auto 
> > > > > > --left-of eDP-1
> > > > > >   ==> This changes the mode.
> > > > > > - Suspend ("systemctl suspend" in my case), and close the
> > > > > > lid.
> > > > > > - Remove from the dock (keep the lid closed).
> > > > > > - Open the lid, which resumes automatically.  It works.
> > > > > > - Suspend again.
> > > > > > - Connect to the dock again (keep the lid closed).
> > > > > > - Open the lid, which resumes automatically.  It's still
> > > > > > OK.
> > > > > > - xrandr --output eDP-1 --primary --auto --output DP-1-1 --
> > > > > > auto 
> > > > > > --left-of eDP-1
> > > > > >   ==> Now the kernel feezes.
> > > > > > 
> > > > > > Reverting the commit mentioned above fixes the problem.
> > > > > > 
> > > > > > The problem is present in all versions I tested.  The
> > > > > > reported
> > > > > > kernel
> > > > > > in the Bugzilla is 4.4.x-based one, but the issue is seen
> > > > > > in
> > > > > > 4.11-rc3,
> > > > > > too.  Note that the S3 resume itself works in 4.11-rc3; the
> > > > > > kernel
> > > > > > panic happens when invoking xrandr manually after that.
> > > > > > 
> > > > > > Unfortunately, I couldn't get a kernel panic message, so
> > > > > > far.  kdump
> > > > > > didn't work well in this case by some reason.  There are
> > > > > > some
> > > > > > screenshots taken by the original reporter (could switch VT
> > > > > > beforehand), but I don't know whether it helps.
> > > > > > 
> > > > > > If you have any hints for further debugging, it'd be highly
> > > > > > appreciated.
> > > > > 
> > > > > It seems that the patch below works around the problem.
> > > > > Can anyone enlighten what's going on there?
> > > > > 
> > > > > 
> > > > > thanks,
> > > > > 
> > > > > Takashi
> > > > > 
> > > > > -- 8< --
> > > > > From: Takashi Iwai <tiwai@suse.de>
> > > > > Subject: [PATCH] drm/i915: Fix crash after S3 resume with DP
> > > > > MST
> > > > > mode change
> > > > > 
> > > > > We've got a bug report showing that Skylake Dell machines
> > > > > with a
> > > > > docking station causes a kernel panic after S3 resume and
> > > > > modeset.
> > > > > The details are found in the openSUSE bugzilla entry
> > > > > below.  The
> > > > > typical test procedure is:
> > > > > 
> > > > > - Laptop is Dell Latitude with eDP (1366x768)
> > > > > - Boot with docking station connected to a DP (1920x1080)
> > > > > - Login, change the mode via
> > > > >   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-
> > > > > of
> > > > > eDP-1
> > > > > - Suspend, and close the lid after the suspend
> > > > >   (or close the lid to trigger the suspend)
> > > > > - Undock while keeping the lid closed.
> > > > > - Open the lid, which triggers the resume;
> > > > >   the machine wakes up well, and X shows up.  No problem, so
> > > > > far.
> > > > > - Suspend again, close the lid.
> > > > > - Dock again while keeping the lid closed.
> > > > > - Open the lid, triggering the resume; this wakes up still
> > > > > fine.
> > > > > - At this moment, run xrandr again to re-setup DP-1
> > > > >   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-
> > > > > of
> > > > > eDP-1
> > > > >   ==> This triggers a hard crash.
> > > > > 
> > > > > I could bisect it, and this leaded to the commit a16b7658f4e0
> > > > > ("drm/i915: Call intel_dp_mst_resume() before resuming
> > > > > displays").
> > > > > 
> > > > > Basically the commit just shuffles the calls of
> > > > > intel_display_resume()
> > > > > and intel_dp_mst_resume().  So as a workaround, I tried to
> > > > > split
> > > > > intel_dp_mst_resume() call to postpone the suspected code
> > > > > (the
> > > > > invocation of intel_dp_check_mst_status()), then bingo, this
> > > > > cured the
> > > > > problem.
> > > > > 
> > > > > But don't ask me *why* this fixes.  It's still in a cargo-
> > > > > cult
> > > > > state.
> > > > > 
> > > > > Fixes: a16b7658f4e0 ("drm/i915: Call intel_dp_mst_resume()
> > > > > before
> > > > > resuming displays")
> > > > > Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1029634
> > > > > Signed-off-by: Takashi Iwai <tiwai@suse.de>
> > > > > ---
> > > > >  drivers/gpu/drm/i915/i915_drv.c  |  5 ++++-
> > > > >  drivers/gpu/drm/i915/intel_dp.c  | 20 +++++++++++++++++++-
> > > > >  drivers/gpu/drm/i915/intel_drv.h |  3 ++-
> > > > >  3 files changed, 25 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/i915_drv.c
> > > > > b/drivers/gpu/drm/i915/i915_drv.c
> > > > > index 1c75402a59c1..62c40090ceed 100644
> > > > > --- a/drivers/gpu/drm/i915/i915_drv.c
> > > > > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > > > > @@ -1559,6 +1559,7 @@ static int
> > > > > i915_suspend_switcheroo(struct
> > > > > drm_device *dev, pm_message_t state)
> > > > >  static int i915_drm_resume(struct drm_device *dev)
> > > > >  {
> > > > >  	struct drm_i915_private *dev_priv = to_i915(dev);
> > > > > +	int mst_pending;
> > > > >  	int ret;
> > > > >  
> > > > >  	disable_rpm_wakeref_asserts(dev_priv);
> > > > > @@ -1608,10 +1609,12 @@ static int i915_drm_resume(struct
> > > > > drm_device *dev)
> > > > >  		dev_priv->display.hpd_irq_setup(dev_priv);
> > > > >  	spin_unlock_irq(&dev_priv->irq_lock);
> > > > >  
> > > > > -	intel_dp_mst_resume(dev);
> > > > > +	mst_pending = intel_dp_mst_resume(dev);
> > > > >  
> > > > >  	intel_display_resume(dev);
> > > > >  
> > > > > +	intel_dp_mst_resume_post(dev, mst_pending);
> > > > > +
> > > > >  	drm_kms_helper_poll_enable(dev);
> > > > >  
> > > > >  	/*
> > > > > diff --git a/drivers/gpu/drm/i915/intel_dp.c
> > > > > b/drivers/gpu/drm/i915/intel_dp.c
> > > > > index d1670b8afbf5..fc5ea900e6f3 100644
> > > > > --- a/drivers/gpu/drm/i915/intel_dp.c
> > > > > +++ b/drivers/gpu/drm/i915/intel_dp.c
> > > > > @@ -6027,9 +6027,10 @@ void intel_dp_mst_suspend(struct
> > > > > drm_device *dev)
> > > > >  	}
> > > > >  }
> > > > >  
> > > > > -void intel_dp_mst_resume(struct drm_device *dev)
> > > > > +int intel_dp_mst_resume(struct drm_device *dev)
> > > > >  {
> > > > >  	struct drm_i915_private *dev_priv = to_i915(dev);
> > > > > +	int pending = 0;
> > > > >  	int i;
> > > > >  
> > > > >  	for (i = 0; i < I915_MAX_PORTS; i++) {
> > > > > @@ -6041,6 +6042,23 @@ void intel_dp_mst_resume(struct
> > > > > drm_device
> > > > > *dev)
> > > > >  
> > > > >  		ret =
> > > > > drm_dp_mst_topology_mgr_resume(&intel_dig_port->dp.mst_mgr);
> > > > >  		if (ret)
> > > > > +			pending |= 1 << i;
> > > > > +	}
> > > > > +
> > > > > +	return pending;
> > > > > +}
> > > > > +
> > > > > +void intel_dp_mst_resume_post(struct drm_device *dev, int
> > > > > pending)
> > > > > +{
> > > > > +	struct drm_i915_private *dev_priv = to_i915(dev);
> > > > > +	int i;
> > > > > +
> > > > > +	for (i = 0; i < I915_MAX_PORTS; i++) {
> > > > > +		struct intel_digital_port *intel_dig_port =
> > > > > +			dev_priv->hotplug.irq_port[i];
> > > > > +		if (!intel_dig_port || !intel_dig_port-
> > > > > > dp.can_mst)
> > > > > 
> > > > > +			continue;
> > > > > +		if (pending & (1 << i))
> > > > >  			intel_dp_check_mst_status(&intel_dig
> > > > > _por
> > > > > t->dp);
> > > > >  	}
> > > > >  }
> > > > 
> > > > The whole MST resume is a bit of chicken and egg type of
> > > > situation.
> > > > We
> > > > need the HPD interrupts to resume the previous state, but we
> > > > don't
> > > > want
> > > > to actually process real hotplugs until we've done the resume.
> > > > The
> > > > current code is definitely broken IMO.
> > > > 
> > > > But I'm not really sure why this patch fixes things because the
> > > > HPD
> > > > processing that will occur when we talk to the sink during the
> > > > display
> > > > resume should also call intel_dp_check_mst_status().
> > > 
> > > Actually, just dropping intel_dp_check_mst_status() calls in
> > > intel_dp_mst_resume() seems enough to fix the problem.
> > > 
> > > Below is the v2 patch doing that.  Does it make more sense?
> > 
> > Barely, unfortunately :(. I'll be honest I'm a little surprised
> > this
> > patch doesn't break anything, but you do seem to be right in saying
> > that hotplugging re-sets up the displays in the event that things
> > go
> > wrong.
> > 
> > Going through the backtrace and the dmesg that you gave on the
> > bugzilla
> > I can't figure out any definitive answer for why this is crashing.
> > We
> > will have to get dmesg output from when it crashes to understand
> > this
> > I'm pretty sure.
> > 
> > JFYI (I really need to make a guide for this somewhere...) I
> > usually
> > use the netconsole module to get kernel output from machines when
> > they
> > kernel panic. In your case I would set this up right before you do
> > the
> > final `xrandr` call that kills the machine, since netconsole has
> > not
> > been very reliable in my experience over suspend/resume cycles. It
> > might take a few tries to get it to print the whole crash to the
> > netconsole (and you will need to make sure the system has a PCI
> > ethernet adapter, USB or wifi won't do), but this usually works.
> 
> I've already tried netconsole (even before kdump), but unfortunately
> it doesn't work in this case.  As mentioned in Bugzilla, the network
> goes crazy when the machine crashes (even the whole network segment
> the machine connects to is stalling).  It's similar like the symptom
> we've seen with the SKL CPU-state screw up in the past with
> intel_idle
> driver.
Do you think you might be able to get a translation of the stack dump
to the actual lines this issues is occurring on? You should be able to
do this against i915.ko using eu-addr2line (it will have to be against
the same kernel build you're running on the crashing machine, hence why
I can't just do it myself). I might see if I can give you some patches
to spit some info out.

> 
> > If you could try getting a full dmesg with this, that would be
> > super
> > appreciated.
> > 
> > JFYI: in the future you should use git send-email for sending these
> > patches to intel-gfx, since the format you sent it in is preventing
> > your patch from getting noticed by patchwork. Other then that
> > though:
> > 
> > Reviewed-by: Lyude <lyude@redhat.com>
> 
> Thanks, it's not meant to be a real submission of the patch, but
> currently it's still a sort of RFC.  If there is no better solution
> insight, I'll submit it as a proper fix patch.
> 
> 
> Takashi
> 
> > > 
> > > 
> > > thanks,
> > > 
> > > Takashi
> > > 
> > > -- 8< --
> > > From: Takashi Iwai <tiwai@suse.de>
> > > Subject: [PATCH] drm/i915: Fix crash after S3 resume with DP MST
> > > mode
> > > change
> > >  (v2)
> > > 
> > > We've got a bug report showing that Skylake Dell machines with a
> > > docking station causes a kernel panic after S3 resume and
> > > modeset.
> > > The details are found in the openSUSE bugzilla entry below.  The
> > > typical test procedure is:
> > > 
> > > - Laptop is Dell Latitude with eDP (1366x768)
> > > - Boot with docking station connected to a DP (1920x1080)
> > > - Login, change the mode via
> > >   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-of
> > > eDP-1
> > > - Suspend, and close the lid after the suspend
> > >   (or close the lid to trigger the suspend)
> > > - Undock while keeping the lid closed.
> > > - Open the lid, which triggers the resume;
> > >   the machine wakes up well, and X shows up.  No problem, so far.
> > > - Suspend again, close the lid.
> > > - Dock again while keeping the lid closed.
> > > - Open the lid, triggering the resume; this wakes up still fine.
> > > - At this moment, run xrandr again to re-setup DP-1
> > >   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-of
> > > eDP-1
> > >   ==> This triggers a hard crash.
> > > 
> > > I could bisect it, and this leaded to the commit a16b7658f4e0
> > > ("drm/i915: Call intel_dp_mst_resume() before resuming
> > > displays").
> > > 
> > > This patch tries to work around the crash by ignoring the failed
> > > port
> > > from drm_dp_mst_topology_mgr_resume().  They should be handled in
> > > hpd
> > > later in anyway.
> > > 
> > > v1->v2: just ignore the drm_dp_mst_topology_mgr_resume() error
> > > codes
> > >         instead of postponing.
> > > 
> > > Fixes: a16b7658f4e0 ("drm/i915: Call intel_dp_mst_resume() before
> > > resuming displays")
> > > Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1029634
> > > Signed-off-by: Takashi Iwai <tiwai@suse.de>
> > > ---
> > >  drivers/gpu/drm/i915/intel_dp.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/intel_dp.c
> > > b/drivers/gpu/drm/i915/intel_dp.c
> > > index d1670b8afbf5..a6c0f0ac16eb 100644
> > > --- a/drivers/gpu/drm/i915/intel_dp.c
> > > +++ b/drivers/gpu/drm/i915/intel_dp.c
> > > @@ -6041,6 +6041,7 @@ void intel_dp_mst_resume(struct drm_device
> > > *dev)
> > >  
> > >  		ret =
> > > drm_dp_mst_topology_mgr_resume(&intel_dig_port->dp.mst_mgr);
> > >  		if (ret)
> > > -			intel_dp_check_mst_status(&intel_dig_por
> > > t-
> > > > dp);
> > > 
> > > +			DRM_DEBUG_KMS("DP MST resume failed for
> > > port-%c\n",
> > > +				      port_name(intel_dig_port-
> > > > port));
> > > 
> > >  	}
> > >  }
> > 
> > -- 
> > Cheers,
> > 	Lyude
> >
Takashi Iwai March 30, 2017, 6:50 p.m. UTC | #4
On Thu, 30 Mar 2017 20:07:42 +0200,
Lyude Paul wrote:
> 
> On Thu, 2017-03-30 at 07:55 +0200, Takashi Iwai wrote:
> > On Thu, 30 Mar 2017 02:24:37 +0200,
> > Lyude Paul wrote:
> > > 
> > > On Wed, 2017-03-29 at 15:54 +0200, Takashi Iwai wrote:
> > > > On Wed, 29 Mar 2017 15:34:15 +0200,
> > > > Ville Syrjälä wrote:
> > > > > 
> > > > > On Wed, Mar 29, 2017 at 03:10:09PM +0200, Takashi Iwai wrote:
> > > > > > On Mon, 27 Mar 2017 18:02:13 +0200,
> > > > > > Takashi Iwai wrote:
> > > > > > > 
> > > > > > > Hi,
> > > > > > > 
> > > > > > > the upstream fix a16b7658f4e0d4aec9bc3e75a5f0cc3f7a3a0422
> > > > > > >     drm/i915: Call intel_dp_mst_resume() before resuming
> > > > > > > displays
> > > > > > > 
> > > > > > > seems to trigger a kernel panic when some modeset change
> > > > > > > happens after
> > > > > > > S3 resume.  The details are found in openSUSE bugzilla,
> > > > > > >   https://bugzilla.suse.com/show_bug.cgi?id=1029634
> > > > > > > 
> > > > > > > In short, the following procedure causes a kernel panic
> > > > > > > (supposedly)
> > > > > > > almost 100% on Dell Latitude with Skylake with MST DP on
> > > > > > > dock:
> > > > > > > 
> > > > > > > - Boot with a docking station, DP-1 connected.
> > > > > > > - Login on X
> > > > > > > - xrandr --output eDP-1 --primary --auto --output DP-1-1 --
> > > > > > > auto 
> > > > > > > --left-of eDP-1
> > > > > > >   ==> This changes the mode.
> > > > > > > - Suspend ("systemctl suspend" in my case), and close the
> > > > > > > lid.
> > > > > > > - Remove from the dock (keep the lid closed).
> > > > > > > - Open the lid, which resumes automatically.  It works.
> > > > > > > - Suspend again.
> > > > > > > - Connect to the dock again (keep the lid closed).
> > > > > > > - Open the lid, which resumes automatically.  It's still
> > > > > > > OK.
> > > > > > > - xrandr --output eDP-1 --primary --auto --output DP-1-1 --
> > > > > > > auto 
> > > > > > > --left-of eDP-1
> > > > > > >   ==> Now the kernel feezes.
> > > > > > > 
> > > > > > > Reverting the commit mentioned above fixes the problem.
> > > > > > > 
> > > > > > > The problem is present in all versions I tested.  The
> > > > > > > reported
> > > > > > > kernel
> > > > > > > in the Bugzilla is 4.4.x-based one, but the issue is seen
> > > > > > > in
> > > > > > > 4.11-rc3,
> > > > > > > too.  Note that the S3 resume itself works in 4.11-rc3; the
> > > > > > > kernel
> > > > > > > panic happens when invoking xrandr manually after that.
> > > > > > > 
> > > > > > > Unfortunately, I couldn't get a kernel panic message, so
> > > > > > > far.  kdump
> > > > > > > didn't work well in this case by some reason.  There are
> > > > > > > some
> > > > > > > screenshots taken by the original reporter (could switch VT
> > > > > > > beforehand), but I don't know whether it helps.
> > > > > > > 
> > > > > > > If you have any hints for further debugging, it'd be highly
> > > > > > > appreciated.
> > > > > > 
> > > > > > It seems that the patch below works around the problem.
> > > > > > Can anyone enlighten what's going on there?
> > > > > > 
> > > > > > 
> > > > > > thanks,
> > > > > > 
> > > > > > Takashi
> > > > > > 
> > > > > > -- 8< --
> > > > > > From: Takashi Iwai <tiwai@suse.de>
> > > > > > Subject: [PATCH] drm/i915: Fix crash after S3 resume with DP
> > > > > > MST
> > > > > > mode change
> > > > > > 
> > > > > > We've got a bug report showing that Skylake Dell machines
> > > > > > with a
> > > > > > docking station causes a kernel panic after S3 resume and
> > > > > > modeset.
> > > > > > The details are found in the openSUSE bugzilla entry
> > > > > > below.  The
> > > > > > typical test procedure is:
> > > > > > 
> > > > > > - Laptop is Dell Latitude with eDP (1366x768)
> > > > > > - Boot with docking station connected to a DP (1920x1080)
> > > > > > - Login, change the mode via
> > > > > >   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-
> > > > > > of
> > > > > > eDP-1
> > > > > > - Suspend, and close the lid after the suspend
> > > > > >   (or close the lid to trigger the suspend)
> > > > > > - Undock while keeping the lid closed.
> > > > > > - Open the lid, which triggers the resume;
> > > > > >   the machine wakes up well, and X shows up.  No problem, so
> > > > > > far.
> > > > > > - Suspend again, close the lid.
> > > > > > - Dock again while keeping the lid closed.
> > > > > > - Open the lid, triggering the resume; this wakes up still
> > > > > > fine.
> > > > > > - At this moment, run xrandr again to re-setup DP-1
> > > > > >   xrandr --output eDP-1 --auto --output DP-1-1 --auto --left-
> > > > > > of
> > > > > > eDP-1
> > > > > >   ==> This triggers a hard crash.
> > > > > > 
> > > > > > I could bisect it, and this leaded to the commit a16b7658f4e0
> > > > > > ("drm/i915: Call intel_dp_mst_resume() before resuming
> > > > > > displays").
> > > > > > 
> > > > > > Basically the commit just shuffles the calls of
> > > > > > intel_display_resume()
> > > > > > and intel_dp_mst_resume().  So as a workaround, I tried to
> > > > > > split
> > > > > > intel_dp_mst_resume() call to postpone the suspected code
> > > > > > (the
> > > > > > invocation of intel_dp_check_mst_status()), then bingo, this
> > > > > > cured the
> > > > > > problem.
> > > > > > 
> > > > > > But don't ask me *why* this fixes.  It's still in a cargo-
> > > > > > cult
> > > > > > state.
> > > > > > 
> > > > > > Fixes: a16b7658f4e0 ("drm/i915: Call intel_dp_mst_resume()
> > > > > > before
> > > > > > resuming displays")
> > > > > > Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1029634
> > > > > > Signed-off-by: Takashi Iwai <tiwai@suse.de>
> > > > > > ---
> > > > > >  drivers/gpu/drm/i915/i915_drv.c  |  5 ++++-
> > > > > >  drivers/gpu/drm/i915/intel_dp.c  | 20 +++++++++++++++++++-
> > > > > >  drivers/gpu/drm/i915/intel_drv.h |  3 ++-
> > > > > >  3 files changed, 25 insertions(+), 3 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/i915/i915_drv.c
> > > > > > b/drivers/gpu/drm/i915/i915_drv.c
> > > > > > index 1c75402a59c1..62c40090ceed 100644
> > > > > > --- a/drivers/gpu/drm/i915/i915_drv.c
> > > > > > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > > > > > @@ -1559,6 +1559,7 @@ static int
> > > > > > i915_suspend_switcheroo(struct
> > > > > > drm_device *dev, pm_message_t state)
> > > > > >  static int i915_drm_resume(struct drm_device *dev)
> > > > > >  {
> > > > > >  	struct drm_i915_private *dev_priv = to_i915(dev);
> > > > > > +	int mst_pending;
> > > > > >  	int ret;
> > > > > >  
> > > > > >  	disable_rpm_wakeref_asserts(dev_priv);
> > > > > > @@ -1608,10 +1609,12 @@ static int i915_drm_resume(struct
> > > > > > drm_device *dev)
> > > > > >  		dev_priv->display.hpd_irq_setup(dev_priv);
> > > > > >  	spin_unlock_irq(&dev_priv->irq_lock);
> > > > > >  
> > > > > > -	intel_dp_mst_resume(dev);
> > > > > > +	mst_pending = intel_dp_mst_resume(dev);
> > > > > >  
> > > > > >  	intel_display_resume(dev);
> > > > > >  
> > > > > > +	intel_dp_mst_resume_post(dev, mst_pending);
> > > > > > +
> > > > > >  	drm_kms_helper_poll_enable(dev);
> > > > > >  
> > > > > >  	/*
> > > > > > diff --git a/drivers/gpu/drm/i915/intel_dp.c
> > > > > > b/drivers/gpu/drm/i915/intel_dp.c
> > > > > > index d1670b8afbf5..fc5ea900e6f3 100644
> > > > > > --- a/drivers/gpu/drm/i915/intel_dp.c
> > > > > > +++ b/drivers/gpu/drm/i915/intel_dp.c
> > > > > > @@ -6027,9 +6027,10 @@ void intel_dp_mst_suspend(struct
> > > > > > drm_device *dev)
> > > > > >  	}
> > > > > >  }
> > > > > >  
> > > > > > -void intel_dp_mst_resume(struct drm_device *dev)
> > > > > > +int intel_dp_mst_resume(struct drm_device *dev)
> > > > > >  {
> > > > > >  	struct drm_i915_private *dev_priv = to_i915(dev);
> > > > > > +	int pending = 0;
> > > > > >  	int i;
> > > > > >  
> > > > > >  	for (i = 0; i < I915_MAX_PORTS; i++) {
> > > > > > @@ -6041,6 +6042,23 @@ void intel_dp_mst_resume(struct
> > > > > > drm_device
> > > > > > *dev)
> > > > > >  
> > > > > >  		ret =
> > > > > > drm_dp_mst_topology_mgr_resume(&intel_dig_port->dp.mst_mgr);
> > > > > >  		if (ret)
> > > > > > +			pending |= 1 << i;
> > > > > > +	}
> > > > > > +
> > > > > > +	return pending;
> > > > > > +}
> > > > > > +
> > > > > > +void intel_dp_mst_resume_post(struct drm_device *dev, int
> > > > > > pending)
> > > > > > +{
> > > > > > +	struct drm_i915_private *dev_priv = to_i915(dev);
> > > > > > +	int i;
> > > > > > +
> > > > > > +	for (i = 0; i < I915_MAX_PORTS; i++) {
> > > > > > +		struct intel_digital_port *intel_dig_port =
> > > > > > +			dev_priv->hotplug.irq_port[i];
> > > > > > +		if (!intel_dig_port || !intel_dig_port-
> > > > > > > dp.can_mst)
> > > > > > 
> > > > > > +			continue;
> > > > > > +		if (pending & (1 << i))
> > > > > >  			intel_dp_check_mst_status(&intel_dig
> > > > > > _por
> > > > > > t->dp);
> > > > > >  	}
> > > > > >  }
> > > > > 
> > > > > The whole MST resume is a bit of chicken and egg type of
> > > > > situation.
> > > > > We
> > > > > need the HPD interrupts to resume the previous state, but we
> > > > > don't
> > > > > want
> > > > > to actually process real hotplugs until we've done the resume.
> > > > > The
> > > > > current code is definitely broken IMO.
> > > > > 
> > > > > But I'm not really sure why this patch fixes things because the
> > > > > HPD
> > > > > processing that will occur when we talk to the sink during the
> > > > > display
> > > > > resume should also call intel_dp_check_mst_status().
> > > > 
> > > > Actually, just dropping intel_dp_check_mst_status() calls in
> > > > intel_dp_mst_resume() seems enough to fix the problem.
> > > > 
> > > > Below is the v2 patch doing that.  Does it make more sense?
> > > 
> > > Barely, unfortunately :(. I'll be honest I'm a little surprised
> > > this
> > > patch doesn't break anything, but you do seem to be right in saying
> > > that hotplugging re-sets up the displays in the event that things
> > > go
> > > wrong.
> > > 
> > > Going through the backtrace and the dmesg that you gave on the
> > > bugzilla
> > > I can't figure out any definitive answer for why this is crashing.
> > > We
> > > will have to get dmesg output from when it crashes to understand
> > > this
> > > I'm pretty sure.
> > > 
> > > JFYI (I really need to make a guide for this somewhere...) I
> > > usually
> > > use the netconsole module to get kernel output from machines when
> > > they
> > > kernel panic. In your case I would set this up right before you do
> > > the
> > > final `xrandr` call that kills the machine, since netconsole has
> > > not
> > > been very reliable in my experience over suspend/resume cycles. It
> > > might take a few tries to get it to print the whole crash to the
> > > netconsole (and you will need to make sure the system has a PCI
> > > ethernet adapter, USB or wifi won't do), but this usually works.
> > 
> > I've already tried netconsole (even before kdump), but unfortunately
> > it doesn't work in this case.  As mentioned in Bugzilla, the network
> > goes crazy when the machine crashes (even the whole network segment
> > the machine connects to is stalling).  It's similar like the symptom
> > we've seen with the SKL CPU-state screw up in the past with
> > intel_idle
> > driver.
> Do you think you might be able to get a translation of the stack dump
> to the actual lines this issues is occurring on? You should be able to
> do this against i915.ko using eu-addr2line (it will have to be against
> the same kernel build you're running on the crashing machine, hence why
> I can't just do it myself). I might see if I can give you some patches
> to spit some info out.

Sure, if we get a proper stack dump, we can analyze it somehow.  You
can use addr2line, or even check objdump output manually.
But in this case, as already mentioned, it was impossible to get any
sensible stack trace on my machine with 4.11-rc, so far,
unfortunately.  So no material to read.

That is, the problem isn't how to translate it, but how to get it.
Normal ways didn't work.  Maybe I can try AMT, but I doubt that it'll
give any output since kdump already failed...


thanks,

Takashi
Lyude Paul March 30, 2017, 8:01 p.m. UTC | #5
On Thu, 2017-03-30 at 20:50 +0200, Takashi Iwai wrote:
<snip>
> 
> Sure, if we get a proper stack dump, we can analyze it somehow.  You
> can use addr2line, or even check objdump output manually.
> But in this case, as already mentioned, it was impossible to get any
> sensible stack trace on my machine with 4.11-rc, so far,
> unfortunately.  So no material to read.

huh? I thought that was what the file called "screenshot showing kernel
panic trace" on the bugzilla was (although that backtrace definitely
didn't look too relevant)... anyway if you are having trouble getting
just a stack trace though, one of my coworkers here has taught me a
trick called divide and conquer.

The idea is pretty simple. Let's say we have a block of code like this
in the kernel

void some_resume_func() {
	cool_function_call();
	this_is_neat_too();

	foo();
	bar();
	death();
	baz();
	zab();
}

And you know it's crashing inside this function on resume (e.g. it
could be in foo(), bar(), or that suspicious death() function) but you
have no way of getting a back trace.

This is where the trick comes in: while you might not be able to get a
stack trace, you can probably at least tell the difference between when
the machine reboots immediately as a result of calling
emergency_restart(), and whether it's just hanging due to the bug.

So what you do is kind of like bisecting, except instead of testing
different commits you see what happens when you insert a call to
emergency_restart() and move it around:

- Try #1:

void some_resume_func() {
	cool_function_call();
	this_is_neat_too();

	foo();
	emergency_restart();
	bar();
	death();
	baz();
	zab();
}

The machine immediately reboots, so the problem is below where we
inserted the emergency_reboot() call

- Try #2:

void some_resume_func() {
	cool_function_call();
	this_is_neat_too();

	foo();
	bar();
	death();
	emergency_restart();
	baz();
	zab();
}

The machine hangs, so we know the problem's either in the call to bar()
or death().

- Try #3:

void some_resume_func() {
	cool_function_call();
	this_is_neat_too();

	foo();
	bar();
	emergency_restart();
	death();
	baz();
	zab();
}

The machine reboots immediately this time, which means that the problem
has to be occurring inside the suspicious death() function. Of course,
if we want to keep debugging further we can go into the death()
function itself and try the same thing to figure out which line inside
it is causing the issue.

So if you do this except around wherever it looks like this crash might
be happening. From:

https://bugzilla.suse.com/show_bug.cgi?id=1029634#c5

It sounds like this happens on hotplugging, so the place to start this
would probably be i915_hotplug_work_func(). Keep going down the call
stack there and you should eventually find the culprit.

The only complication I foresee here is that you'll have to write a
little bit of additional debugging code so that
i915_hotplug_work_func() doesn't actually call emergency_restart()
until right before the moment where the crash happens. This shouldn't
be too difficult, you could do something like add a module parameter to
i915 that you change right before the final step of reproducing the bug
that enables the calls to emergency_restart(). If you have any trouble
with this part, feel free to let me know and I'll hack together a quick
patch you can use.

Lemme know if this helps at all :).

> 
> That is, the problem isn't how to translate it, but how to get it.
> Normal ways didn't work.  Maybe I can try AMT, but I doubt that it'll
> give any output since kdump already failed...
> 
> 
> thanks,
> 
> Takashi
Takashi Iwai March 30, 2017, 8:27 p.m. UTC | #6
On Thu, 30 Mar 2017 22:01:31 +0200,
Lyude Paul wrote:
> 
> On Thu, 2017-03-30 at 20:50 +0200, Takashi Iwai wrote:
> <snip>
> > 
> > Sure, if we get a proper stack dump, we can analyze it somehow.  You
> > can use addr2line, or even check objdump output manually.
> > But in this case, as already mentioned, it was impossible to get any
> > sensible stack trace on my machine with 4.11-rc, so far,
> > unfortunately.  So no material to read.
> 
> huh? I thought that was what the file called "screenshot showing kernel
> panic trace" on the bugzilla was (although that backtrace definitely
> didn't look too relevant)...

It's not from my machine, and it's not from 4.11-rc.  It's a
screenshot taken on 4.4.x openSUSE kernel with the backport of your
fix.  So it might be some help, but the stack trace there is merely a
red herring.

The reason I couldn't get such a screenshot is that VT switching is
broken on 4.11 in multiple ways.  One VT bug got fixed in 4.11-rc4,
but another still remains....

> anyway if you are having trouble getting
> just a stack trace though, one of my coworkers here has taught me a
> trick called divide and conquer.
> 
> The idea is pretty simple. Let's say we have a block of code like this
> in the kernel
> 
> void some_resume_func() {
> 	cool_function_call();
> 	this_is_neat_too();
> 
> 	foo();
> 	bar();
> 	death();
> 	baz();
> 	zab();
> }
> 
> And you know it's crashing inside this function on resume (e.g. it
> could be in foo(), bar(), or that suspicious death() function) but you
> have no way of getting a back trace.
> 
> This is where the trick comes in: while you might not be able to get a
> stack trace, you can probably at least tell the difference between when
> the machine reboots immediately as a result of calling
> emergency_restart(), and whether it's just hanging due to the bug.
> 
> So what you do is kind of like bisecting, except instead of testing
> different commits you see what happens when you insert a call to
> emergency_restart() and move it around:
> 
> - Try #1:
> 
> void some_resume_func() {
> 	cool_function_call();
> 	this_is_neat_too();
> 
> 	foo();
> 	emergency_restart();
> 	bar();
> 	death();
> 	baz();
> 	zab();
> }
> 
> The machine immediately reboots, so the problem is below where we
> inserted the emergency_reboot() call
> 
> - Try #2:
> 
> void some_resume_func() {
> 	cool_function_call();
> 	this_is_neat_too();
> 
> 	foo();
> 	bar();
> 	death();
> 	emergency_restart();
> 	baz();
> 	zab();
> }
> 
> The machine hangs, so we know the problem's either in the call to bar()
> or death().
> 
> - Try #3:
> 
> void some_resume_func() {
> 	cool_function_call();
> 	this_is_neat_too();
> 
> 	foo();
> 	bar();
> 	emergency_restart();
> 	death();
> 	baz();
> 	zab();
> }
> 
> The machine reboots immediately this time, which means that the problem
> has to be occurring inside the suspicious death() function. Of course,
> if we want to keep debugging further we can go into the death()
> function itself and try the same thing to figure out which line inside
> it is causing the issue.

Heh, the divide-and-conquer is also the strategy how I reached to my
patch :)  I divided the possible cause (the call of
intel_dp_mst_resume()), split them, and luckily it worked by the first
shot.

> So if you do this except around wherever it looks like this crash might
> be happening. From:
> 
> https://bugzilla.suse.com/show_bug.cgi?id=1029634#c5
> 
> It sounds like this happens on hotplugging, so the place to start this
> would probably be i915_hotplug_work_func(). Keep going down the call
> stack there and you should eventually find the culprit.
> 
> The only complication I foresee here is that you'll have to write a
> little bit of additional debugging code so that
> i915_hotplug_work_func() doesn't actually call emergency_restart()
> until right before the moment where the crash happens. This shouldn't
> be too difficult, you could do something like add a module parameter to
> i915 that you change right before the final step of reproducing the bug
> that enables the calls to emergency_restart(). If you have any trouble
> with this part, feel free to let me know and I'll hack together a quick
> patch you can use.

Right, that's the most difficult part; for reproducing the crash, we
need multiple suspend/resume and dock/undock, so the code path may be
executed multiple times.  And tracking i915_hotplug_work_func() in the
way you suggested isn't so trivial, as it's with full of indirect
calls...

A trick I often used instead is to put additional delays (very long
ones) between the suspected code lines with marking via trace_ or
normal printk, and track at which point we could reach.  Then you
don't need a frequent reboot but just a few long runs.  Of course, it
can't be used for irq context, but for the work, it's OK.

Maybe I'll give it a try, but likely later in the next week; I'll be
very busy for other tasks in tomorrow, sorry.


thanks,

Takashi

> 
> Lemme know if this helps at all :).
> 
> > 
> > That is, the problem isn't how to translate it, but how to get it.
> > Normal ways didn't work.  Maybe I can try AMT, but I doubt that it'll
> > give any output since kdump already failed...
> > 
> > 
> > thanks,
> > 
> > Takashi
>
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/intel_dp.c b/drivers/gpu/drm/i915/intel_dp.c
index d1670b8afbf5..a6c0f0ac16eb 100644
--- a/drivers/gpu/drm/i915/intel_dp.c
+++ b/drivers/gpu/drm/i915/intel_dp.c
@@ -6041,6 +6041,7 @@  void intel_dp_mst_resume(struct drm_device *dev)
 
 		ret = drm_dp_mst_topology_mgr_resume(&intel_dig_port->dp.mst_mgr);
 		if (ret)
-			intel_dp_check_mst_status(&intel_dig_port->dp);
+			DRM_DEBUG_KMS("DP MST resume failed for port-%c\n",
+				      port_name(intel_dig_port->port));
 	}
 }