diff mbox

[RFC] drm/vblanks: Deal with HW vblank counter resets.

Message ID 20171107062617.4227-1-dhinakaran.pandiyan@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Dhinakaran Pandiyan Nov. 7, 2017, 6:26 a.m. UTC
Some HW vblank counters reset due to power management events, which messes
up the vblank counting logic. This leads to screen freezes with user space
waiting on vblank events that may not occur if the counter keeps resetting.

For e.g., After the HW vblank counter resets
[    9.007359] [drm:drm_update_vblank_count [drm]] updating vblank count
on crtc 0: current=297, diff=4294965389, hw=5 hw_last=1912

So, fall back to the SW counter, computed using  vblank timestamps
and frame duration, when the HW counter value deviates by 50% of the SW
computed value.

I have tested this patch on my SKL laptop with i915.enable_psr=1 and it
*seems* to solve the screen freeze issue seen with PSR when DMC is loaded.

Known issues:
1) The 50% deviation margin is arbitrary.
2) "Redundant vblirq ignored" messages are more frequent.

I am sending this as an RFC to get feedback on whether the fall back
approach is sane and if it should be implemented in the core.

Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Dhinakaran Pandiyan <dhinakaran.pandiyan@intel.com>
---
 drivers/gpu/drm/drm_vblank.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

Comments

Michel Dänzer Nov. 7, 2017, 9:47 a.m. UTC | #1
On 07/11/17 07:26 AM, Dhinakaran Pandiyan wrote:
> Some HW vblank counters reset due to power management events, which messes
> up the vblank counting logic. This leads to screen freezes with user space
> waiting on vblank events that may not occur if the counter keeps resetting.
> 
> For e.g., After the HW vblank counter resets
> [    9.007359] [drm:drm_update_vblank_count [drm]] updating vblank count
> on crtc 0: current=297, diff=4294965389, hw=5 hw_last=1912
> 
> So, fall back to the SW counter, computed using  vblank timestamps
> and frame duration, when the HW counter value deviates by 50% of the SW
> computed value.
> 
> I have tested this patch on my SKL laptop with i915.enable_psr=1 and it
> *seems* to solve the screen freeze issue seen with PSR when DMC is loaded.
> 
> Known issues:
> 1) The 50% deviation margin is arbitrary.
> 2) "Redundant vblirq ignored" messages are more frequent.
> 
> I am sending this as an RFC to get feedback on whether the fall back
> approach is sane and if it should be implemented in the core.

Is there no way for the driver to know under which circumstances the
reset to 0 might happen? If there is, maybe it could be solved by
calling drm_crtc_vblank_off() before it might happen and
drm_crtc_vblank_on() after it might have happened.

Otherwise, might it be better not to use the HW counter at all when it's
known not to be reliable?
Ville Syrjälä Nov. 7, 2017, 10:50 a.m. UTC | #2
On Tue, Nov 07, 2017 at 10:47:00AM +0100, Michel Dänzer wrote:
> On 07/11/17 07:26 AM, Dhinakaran Pandiyan wrote:
> > Some HW vblank counters reset due to power management events, which messes
> > up the vblank counting logic. This leads to screen freezes with user space
> > waiting on vblank events that may not occur if the counter keeps resetting.
> > 
> > For e.g., After the HW vblank counter resets
> > [    9.007359] [drm:drm_update_vblank_count [drm]] updating vblank count
> > on crtc 0: current=297, diff=4294965389, hw=5 hw_last=1912
> > 
> > So, fall back to the SW counter, computed using  vblank timestamps
> > and frame duration, when the HW counter value deviates by 50% of the SW
> > computed value.
> > 
> > I have tested this patch on my SKL laptop with i915.enable_psr=1 and it
> > *seems* to solve the screen freeze issue seen with PSR when DMC is loaded.
> > 
> > Known issues:
> > 1) The 50% deviation margin is arbitrary.
> > 2) "Redundant vblirq ignored" messages are more frequent.
> > 
> > I am sending this as an RFC to get feedback on whether the fall back
> > approach is sane and if it should be implemented in the core.
> 
> Is there no way for the driver to know under which circumstances the
> reset to 0 might happen? If there is, maybe it could be solved by
> calling drm_crtc_vblank_off() before it might happen and
> drm_crtc_vblank_on() after it might have happened.
> 
> Otherwise, might it be better not to use the HW counter at all when it's
> known not to be reliable?

Yeah, I think we could just avoid using the HW counter whenever there's
a possibility of PSR being used on the crtc.

Assuming we still want to use the HW counter on crtcs that can't do PSR,
we're going to need some kind of per-crtc mechanism to tell drm_vblank.c
which method to use. hwmode.private_flags is one option. We already use
it for choosing between the scanline counter and hardware timestamps when
calculating the scanout position. But in this case the flag wouldn't be
exactly private since drm_vblank.c would have to know about it as well.
If that is deemed to be a problem, then we might just need to add a bool
to struct drm_vblank_crtc to indicate that the hw counter is good, or
maybe even move the dev->max_vblank_count to live inside
struct drm_vblank_crtc.
Michel Dänzer Nov. 7, 2017, 10:55 a.m. UTC | #3
On 07/11/17 11:50 AM, Ville Syrjälä wrote:
> On Tue, Nov 07, 2017 at 10:47:00AM +0100, Michel Dänzer wrote:
>> On 07/11/17 07:26 AM, Dhinakaran Pandiyan wrote:
>>> Some HW vblank counters reset due to power management events, which messes
>>> up the vblank counting logic. This leads to screen freezes with user space
>>> waiting on vblank events that may not occur if the counter keeps resetting.
>>>
>>> For e.g., After the HW vblank counter resets
>>> [    9.007359] [drm:drm_update_vblank_count [drm]] updating vblank count
>>> on crtc 0: current=297, diff=4294965389, hw=5 hw_last=1912
>>>
>>> So, fall back to the SW counter, computed using  vblank timestamps
>>> and frame duration, when the HW counter value deviates by 50% of the SW
>>> computed value.
>>>
>>> I have tested this patch on my SKL laptop with i915.enable_psr=1 and it
>>> *seems* to solve the screen freeze issue seen with PSR when DMC is loaded.
>>>
>>> Known issues:
>>> 1) The 50% deviation margin is arbitrary.
>>> 2) "Redundant vblirq ignored" messages are more frequent.
>>>
>>> I am sending this as an RFC to get feedback on whether the fall back
>>> approach is sane and if it should be implemented in the core.
>>
>> Is there no way for the driver to know under which circumstances the
>> reset to 0 might happen? If there is, maybe it could be solved by
>> calling drm_crtc_vblank_off() before it might happen and
>> drm_crtc_vblank_on() after it might have happened.
>>
>> Otherwise, might it be better not to use the HW counter at all when it's
>> known not to be reliable?
> 
> Yeah, I think we could just avoid using the HW counter whenever there's
> a possibility of PSR being used on the crtc.
> 
> Assuming we still want to use the HW counter on crtcs that can't do PSR,
> we're going to need some kind of per-crtc mechanism to tell drm_vblank.c
> which method to use. hwmode.private_flags is one option. We already use
> it for choosing between the scanline counter and hardware timestamps when
> calculating the scanout position. But in this case the flag wouldn't be
> exactly private since drm_vblank.c would have to know about it as well.
> If that is deemed to be a problem, then we might just need to add a bool
> to struct drm_vblank_crtc to indicate that the hw counter is good, or
> maybe even move the dev->max_vblank_count to live inside
> struct drm_vblank_crtc.

Or just allow setting struct drm_vblank_crtc's get_vblank_counter member
to NULL?
Ville Syrjälä Nov. 7, 2017, 11:26 a.m. UTC | #4
On Tue, Nov 07, 2017 at 11:55:44AM +0100, Michel Dänzer wrote:
> On 07/11/17 11:50 AM, Ville Syrjälä wrote:
> > On Tue, Nov 07, 2017 at 10:47:00AM +0100, Michel Dänzer wrote:
> >> On 07/11/17 07:26 AM, Dhinakaran Pandiyan wrote:
> >>> Some HW vblank counters reset due to power management events, which messes
> >>> up the vblank counting logic. This leads to screen freezes with user space
> >>> waiting on vblank events that may not occur if the counter keeps resetting.
> >>>
> >>> For e.g., After the HW vblank counter resets
> >>> [    9.007359] [drm:drm_update_vblank_count [drm]] updating vblank count
> >>> on crtc 0: current=297, diff=4294965389, hw=5 hw_last=1912
> >>>
> >>> So, fall back to the SW counter, computed using  vblank timestamps
> >>> and frame duration, when the HW counter value deviates by 50% of the SW
> >>> computed value.
> >>>
> >>> I have tested this patch on my SKL laptop with i915.enable_psr=1 and it
> >>> *seems* to solve the screen freeze issue seen with PSR when DMC is loaded.
> >>>
> >>> Known issues:
> >>> 1) The 50% deviation margin is arbitrary.
> >>> 2) "Redundant vblirq ignored" messages are more frequent.
> >>>
> >>> I am sending this as an RFC to get feedback on whether the fall back
> >>> approach is sane and if it should be implemented in the core.
> >>
> >> Is there no way for the driver to know under which circumstances the
> >> reset to 0 might happen? If there is, maybe it could be solved by
> >> calling drm_crtc_vblank_off() before it might happen and
> >> drm_crtc_vblank_on() after it might have happened.
> >>
> >> Otherwise, might it be better not to use the HW counter at all when it's
> >> known not to be reliable?
> > 
> > Yeah, I think we could just avoid using the HW counter whenever there's
> > a possibility of PSR being used on the crtc.
> > 
> > Assuming we still want to use the HW counter on crtcs that can't do PSR,
> > we're going to need some kind of per-crtc mechanism to tell drm_vblank.c
> > which method to use. hwmode.private_flags is one option. We already use
> > it for choosing between the scanline counter and hardware timestamps when
> > calculating the scanout position. But in this case the flag wouldn't be
> > exactly private since drm_vblank.c would have to know about it as well.
> > If that is deemed to be a problem, then we might just need to add a bool
> > to struct drm_vblank_crtc to indicate that the hw counter is good, or
> > maybe even move the dev->max_vblank_count to live inside
> > struct drm_vblank_crtc.
> 
> Or just allow setting struct drm_vblank_crtc's get_vblank_counter member
> to NULL?

Looks like that's actually under drm_crtc_funcs. I'm thinking we want
to keep that as const. Not sure we want to add a third way for drivers
to provide a .get_vblank_counter() hook (the second one being the
old drm_driver.get_vblank_counter() hook, which i915 is still using
actually).
Daniel Vetter Nov. 7, 2017, 12:39 p.m. UTC | #5
On Tue, Nov 07, 2017 at 10:47:00AM +0100, Michel Dänzer wrote:
> On 07/11/17 07:26 AM, Dhinakaran Pandiyan wrote:
> > Some HW vblank counters reset due to power management events, which messes
> > up the vblank counting logic. This leads to screen freezes with user space
> > waiting on vblank events that may not occur if the counter keeps resetting.
> > 
> > For e.g., After the HW vblank counter resets
> > [    9.007359] [drm:drm_update_vblank_count [drm]] updating vblank count
> > on crtc 0: current=297, diff=4294965389, hw=5 hw_last=1912
> > 
> > So, fall back to the SW counter, computed using  vblank timestamps
> > and frame duration, when the HW counter value deviates by 50% of the SW
> > computed value.
> > 
> > I have tested this patch on my SKL laptop with i915.enable_psr=1 and it
> > *seems* to solve the screen freeze issue seen with PSR when DMC is loaded.
> > 
> > Known issues:
> > 1) The 50% deviation margin is arbitrary.
> > 2) "Redundant vblirq ignored" messages are more frequent.
> > 
> > I am sending this as an RFC to get feedback on whether the fall back
> > approach is sane and if it should be implemented in the core.
> 
> Is there no way for the driver to know under which circumstances the
> reset to 0 might happen? If there is, maybe it could be solved by
> calling drm_crtc_vblank_off() before it might happen and
> drm_crtc_vblank_on() after it might have happened.
> 
> Otherwise, might it be better not to use the HW counter at all when it's
> known not to be reliable?

We know when it happens, so agreed this isn't a good/workable solution
really. I thought the plan to fix that was to fix up our runtime pm to
make sure the vblank counter doesn't get reset while we need it (pending
flip or vblank). And in-between (when the vblank counter is totally off)
we'd fix any mismatch by adjusting the sw vblank counter with an explicit
call (where we can use the elapsed time to estimate the elapsed vblank
counts well enough). Adding a magic hack like this doesn't sound like a
good plan to me indeed.
-Daniel
Dhinakaran Pandiyan Nov. 7, 2017, 7:14 p.m. UTC | #6
On Tue, 2017-11-07 at 13:39 +0100, Daniel Vetter wrote:
> On Tue, Nov 07, 2017 at 10:47:00AM +0100, Michel Dänzer wrote:

> > On 07/11/17 07:26 AM, Dhinakaran Pandiyan wrote:

> > > Some HW vblank counters reset due to power management events, which messes

> > > up the vblank counting logic. This leads to screen freezes with user space

> > > waiting on vblank events that may not occur if the counter keeps resetting.

> > > 

> > > For e.g., After the HW vblank counter resets

> > > [    9.007359] [drm:drm_update_vblank_count [drm]] updating vblank count

> > > on crtc 0: current=297, diff=4294965389, hw=5 hw_last=1912

> > > 

> > > So, fall back to the SW counter, computed using  vblank timestamps

> > > and frame duration, when the HW counter value deviates by 50% of the SW

> > > computed value.

> > > 

> > > I have tested this patch on my SKL laptop with i915.enable_psr=1 and it

> > > *seems* to solve the screen freeze issue seen with PSR when DMC is loaded.

> > > 

> > > Known issues:

> > > 1) The 50% deviation margin is arbitrary.

> > > 2) "Redundant vblirq ignored" messages are more frequent.

> > > 

> > > I am sending this as an RFC to get feedback on whether the fall back

> > > approach is sane and if it should be implemented in the core.

> > 

> > Is there no way for the driver to know under which circumstances the

> > reset to 0 might happen? 


Not precisely, but the driver does need to allow the firmware to go into
lower power states.

> If there is, maybe it could be solved by

> > calling drm_crtc_vblank_off() before it might happen and

> > drm_crtc_vblank_on() after it might have happened.

> > 

> > Otherwise, might it be better not to use the HW counter at all when it's

> > known not to be reliable?

> 

> We know when it happens, so agreed this isn't a good/workable solution

> really. I thought the plan to fix that was to fix up our runtime pm to

> make sure the vblank counter doesn't get reset while we need it (pending

> flip or vblank). And in-between (when the vblank counter is totally off)

> we'd fix any mismatch by adjusting the sw vblank counter with an explicit

> call (where we can use the elapsed time to estimate the elapsed vblank

> counts well enough). Adding a magic hack like this doesn't sound like a

> good plan to me indeed.

> -Daniel


Thanks for the feedback, I wasn't aware of an agreed plan. I just went
through patchwork history now and realized that the full-time SW counter
approach was tried earlier, but not much information on why it was
abandoned.

Allowing runtime PM (PSR + DMC) to do it's job while letting the SW
estimate vblanks if the hardware counter is unreliable didn't seem like
a terrible plan. It could save more power fwiw. But, I understand your
concerns, will see what I can do.

-DK
diff mbox

Patch

diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
index 57cc6e37c810..8000aae5f1f7 100644
--- a/drivers/gpu/drm/drm_vblank.c
+++ b/drivers/gpu/drm/drm_vblank.c
@@ -190,11 +190,12 @@  static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
 				    bool in_vblank_irq)
 {
 	struct drm_vblank_crtc *vblank = &dev->vblank[pipe];
-	u32 cur_vblank, diff;
+	u32 cur_vblank;
 	bool rc;
 	ktime_t t_vblank;
 	int count = DRM_TIMESTAMP_MAXRETRIES;
 	int framedur_ns = vblank->framedur_ns;
+	u32 diff = in_vblank_irq ? 1 : 0;
 
 	/*
 	 * Interrupts were disabled prior to this call, so deal with counter
@@ -213,26 +214,31 @@  static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
 		rc = drm_get_last_vbltimestamp(dev, pipe, &t_vblank, in_vblank_irq);
 	} while (cur_vblank != __get_vblank_counter(dev, pipe) && --count > 0);
 
-	if (dev->max_vblank_count != 0) {
-		/* trust the hw counter when it's around */
+	if (dev->max_vblank_count)
 		diff = (cur_vblank - vblank->last) & dev->max_vblank_count;
-	} else if (rc && framedur_ns) {
+
+	if (rc && framedur_ns) {
 		u64 diff_ns = ktime_to_ns(ktime_sub(t_vblank, vblank->time));
+		u32 sw_diff;
 
 		/*
 		 * Figure out how many vblanks we've missed based
 		 * on the difference in the timestamps and the
 		 * frame/field duration.
 		 */
-		diff = DIV_ROUND_CLOSEST_ULL(diff_ns, framedur_ns);
-
-		if (diff == 0 && in_vblank_irq)
+		sw_diff = DIV_ROUND_CLOSEST_ULL(diff_ns, framedur_ns);
+		if (sw_diff == 0 && in_vblank_irq)
 			DRM_DEBUG_VBL("crtc %u: Redundant vblirq ignored."
 				      " diff_ns = %lld, framedur_ns = %d)\n",
 				      pipe, (long long) diff_ns, framedur_ns);
-	} else {
-		/* some kind of default for drivers w/o accurate vbl timestamping */
-		diff = in_vblank_irq ? 1 : 0;
+
+		if (!dev->max_vblank_count)
+			diff = sw_diff;
+		else if (sw_diff && abs(diff - sw_diff) > DIV_ROUND_CLOSEST(sw_diff, 2)) {
+			DRM_DEBUG_VBL("hw vblank counter(%u) deviates from sw (%u)\n",
+				      diff, sw_diff);
+			diff = sw_diff;
+		}
 	}
 
 	/*