diff mbox

Small vc4 kms fixes and some questions.

Message ID 573E6B05.5010507@gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Mario Kleiner May 20, 2016, 1:40 a.m. UTC
On 05/13/2016 08:51 PM, Eric Anholt wrote:
> Mario Kleiner <mario.kleiner.de@gmail.com> writes:
>
>> On 05/09/2016 09:38 PM, Eric Anholt wrote:
>>> Mario Kleiner <mario.kleiner.de@gmail.com> writes:
>>>
>>>> Hi Eric and all,
>>>>
>>>> two small fixes against vc4 kms, built and tested agains the
>>>> Raspberry Pi foundations 4.4.8 kernel tree on RPi2B.
>>>>
>>>> I'm tinkering with a Rpi 2B a bit to see if your vc4 work can
>>>> already make the Pi useful as a device for some serious but low
>>>> cost neuro-science applications.
>>>>
>>>> Eric:
>>>>
>>>> Is there any public documentation about the HVS hardware video scaler
>>>> or the pixel valves? I could find other docs about Videocore's 3d
>>>> part, but nothing about hvs or the pixel valves? Or are the register
>>>> definitions inside the vc4 already all that exists in the hw?
>>>
>>> Nope, docs for display never got released.  I've got docs internally,
>>> and I'm happy to try to answer questions.
>>>
>>
>> Ah good. My questions are around making the pageflip completion events
>> and vblank timestamps as precise and reliable as possible.
>>
>> Atm. i'm working on a patch to the flip completion handling to make it
>> more robust. The current code in my stress tests with 10000 flips sends
>> out flip completions too early in about 1-2% of the trials.
>>
>> My current patch reduces these to 0 failures in my test. I'll send the
>> patch out later after some more tweaking and cleanup.
>>
>> E.g., for crtc/pixelvalve 1 the patch checks if the SCALER_DISPLIST1 and
>> SCALER_DISPLACT1 registers in the HVS match, and only then sends out the
>> pageflip completion event, otherwise waits for the next vblank. I assume
>> SCALER_DISPLACT1 is the true current value for start of active display
>> list whereas SCALER_DISPLIST1 is the value that got latched and then
>> gets committed at the next vblank after writing to the reg. This seems
>> to work well according to my special measurement equipment which can
>> timestamp the true start of display of a new framebuffer.
>
> Oh, interesting.  The docs had said that the display lists were
> re-parsed on every line, so I assumed that also meant that the reparsing
> started from the head pointer every line.  I've confirmed in the
> hardware what you've found: the head pointer is only latched at VSTART
> or INIT signal to the display fifo (that start signal comes from the
> pixel valve).
>

Yes. You can observe it also as a symptom with async pageflips, where 
the return from flip is async, but the display still doesn't tear.

>> However i don't know if this already perfect, or just strongly reduced
>> error rate, so i need to know when the value gets committed (start of
>> vblank, end of vblank, vsync)? And when does the vblank irq fire for the
>> different possible settings of:
>>
>> # define PV_INT_VID_IDLE                        BIT(9)
>> # define PV_INT_VFP_END                         BIT(8)
>> # define PV_INT_VFP_START                       BIT(7)
>> # define PV_INT_VACT_START                      BIT(6)
>> # define PV_INT_VBP_START                       BIT(5)
>> # define PV_INT_VSYNC_START                     BIT(4)
>>
>> Currently the driver uses PV_INT_VFP_START.
>
> OK, looks like the PV timing state is:
>
> IDLE
> START
> (vfp lines)
> VSYNC
> (vsw lines)
> VBP
> (vbp lines)
> VACT
> (vact lines)
> VFP
> (jump to waiting for vfp lines)
>
> The normal timing loop after START does its transitions at HFP, and
> those transitions are when you get the VFP/VBP/VSYNC START/END
> interrupts.
>
> You'll jump back to IDLE almost right away if VID_EN is dropped.
>
> The VSTART signal to the HVS is when PV does IDLE -> START or on the
> last pixel of the active scanout.  Note, this is *not* the PV's START
> signal, which looks like it's basically unused.  Also, I think it's an
> interesting note that we don't have VFP_START on our first frame, as far
> as I've found.
>
> The PV requests that the HVS generate the next line at the last pixel of
> each HACTIVE when we're either at end of VBP or VACTIVE-but-not-its-end.
>

Thanks for the detailed explanation Eric. The way this is designed is 
almost as if the Broadcom hw engineers had read our DRM vblank handling 
and DRI swap scheduling code and decided to build the hw to exactly 
match the expectations of our code :)

This means that my patch for pageflip completion robustness should be 
perfectly robust and not subject to races with the hardware :) - My 
testing over many runs of 10000's of flips confirms that. I've sent the 
patch out for review, cc'ed to you, and it should be fine as is for 
inclusion if you are happy with it.

>> The second question is if the HVS or pixelvalves have some kind of
>> scanline register that reports the currently scanned out scanline? I'd
>> like to implement scanout position queries, so we can get instant high
>> precision vblank timestamps if possible like we have for intel, amd and
>> nouveau, so we'd have precise timestamps, a vblank counter and also
>> additional power savings. Or lacking that are there other regs that
>> could be used to timestamp vblanks or updates of display lists in the
>> hardware?
>
> HVS has bits 0:11 of DISPSTATx for the Y line being generated.  That
> will be in a different clock domain from the PV, but it's probably good
> enough, right?
>

Mostly. It's not quite as good as having true scanout position from the 
PV's. Attached is my current w.i.p patch for scanoutpos based 
timestamping. It already works quite well with my timing tests, but as 
you can see from the code and the longish explanation, we can't get 
close to perfect accuracy if we query the timestamp while the PV is in 
vblank, and need some trickery to get ok'ish results there.

The problem, as far as my understanding of the hw from the results goes, 
is that the HVS has some linebuffer fifo which can hold a couple of 
composited scanlines, e.g., 13 - 24 for typical video resolutions, for 
later consumption by the PV. The HVS refills much faster than the PV 
consumes. During active scanout that means the PV and HVS work in 
lockstep, the HVS fifo is almost always completely full and the HVS is 
throttled to the rate with which the PV consumes. This is good, because 
the scanlinepos of the PV is the compositing pos of the HVS from 
DISPSTATx minus the capacity of the fifo.

At the last few lines of active scanout the HVS stops compositing while
the PV drains the fifo, so our position estimation gets inaccurate - not 
a big deal in practice. In VBLANK however, we don't get any meaningful 
reading because the HVS apparently quickly refills the fifo to full 
capacity after the VSTART signal from the PV and then it is idle until 
start of active scanout when the PV starts to consume from the fifo again.

You can see multiple special cases for "in vblank" to deal with this ok'ish.

Anyway, this still gives us almost all advantages of scanoutpos based 
timestamping, except for the blind spot in vblank when we can't use HVS 
readings. Even so the timestamps will always be accurate to an error of 
less than 1 vblank duration or ~1 msec, and typically accurate to about 
0.1 msecs, according to my measurements.

Two questions:

1. Can you tell me something about the size of that fifo - capacity in 
lines, depending on horizontal resolution? My heuristic formula in the 
patch...

fifo_lines = (2048 * 7 / mode->crtc_hdisplay) - 1;

... seems to work well for the three or four video modes i could 
actually test. But having the real numbers would be better.

2. In the special case hack for vpos < fifo_lines, i try to estimate the 
refill speed of the HVS as ratio between HVS clock and mode clock...

*vpos = -vblank_lines + (*vpos * mode->crtc_clock / 250000);

... under the assumption that the HVS clock is == system clock and that 
clock is a constant 250 Mhz, based on some numbers from some of the 
public docs. However, i'm not sure if the 250 Mhz is right, or if this 
is even constant across Soc's or wrt. power management. That specific 
code path so far doesn't really improve precision. I'm not sure if i 
should drop it, or refine it, or how. But maybe my assumptions about HVS 
composition rate vs. PV scanout rate are wrong there, or the clock value 
is wrong?

thanks,
-mario
diff mbox

Patch

From 3c563068a891271fc544a5c515995971e3f02955 Mon Sep 17 00:00:00 2001
From: Mario Kleiner <mario.kleiner.de@gmail.com>
Date: Wed, 18 May 2016 05:55:11 +0000
Subject: [PATCH] drm/vc4: Implement precise vblank timestamping.

Precise vblank timestamping is implemented via the
usual scanout position based method. However, on
VC4 the Pixelvalves PV do not have a scanout position
register. Only the Hardware video scaler HVS has a
similar register which describes which scanline for
the output is currently composited and stored in the
HVS fifo for later consumption by the PV.

This causes a problem in that the HVS runs at a much
faster clock (system clock / audio gate) than the PV
which runs at video mode dot clock, so the unless the
fifo between HVS and PV is full, the HVS will progress
faster in its observable read line position than video
scan rate, so the HVS position reading can't be directly
translated into a scanout position for timestamp correction.

Additionally when the PV is in vblank, it doesn't consume
from the fifo, so the fifo gets full very quickly and then
the HVS stops compositing until the PV enters active scanout
and starts consuming scanlines from the fifo again, making
new space for the HVS to composite.

Therefore a simple translation of HVS read position into
elapsed time since (or to) start of active scanout does
not work, but for the most interesting cases we can still
get useful and accurate results:

1. The PV enters active scanout of a new frame with the
   fifo of the HVS completely full, and the HVS can refill
   any fifo line which gets consumed and thereby freed up by
   the PV during active scanout very quickly. Therefore the
   PV and HVS work effectively in lock-step during active
   scanout with the fifo never having more than 1 scanline
   freed up by the PV before it gets refilled. The PV's
   real scanout position is therefore trailing the HVS
   compositing position by scanoutpos = hvspos - fifosize
   and we can get the true scanoutpos as HVS readpos minus
   fifo size, so precise timestamping works while in active
   scanout, except for the last few scanlines of the image,
   when the HVS reaches end of frame, stops compositing and
   the PV catches up and drains the fifo.

2. During early vblank, while the HVS is still refilling the
   empty fifo, we can try to translate fifo fill level into
   PV position in vblank, taking the ratio betwenn HVS clock
   and PV clock into account. From that we estimate how many
   scanlines for the PV to go before start of new active
   scanout.

3. Once the HVS fifo is full in vblank, we can only guess
   something half-way reasonable. If called from vblank irq,
   we assume the irq is usually dispatched with minimum
   delay, so we can take a timestamp taken at entry into vblank
   irq handler as a baseline and then add a full vblank duration
   until the guessed start of active scanout. As irq dispatch
   is usually pretty low latency this works with relatively
   low jitter and good results.

   If we aren't called from vblank then we could be anywhere
   within the vblank interval, so we return a neutral result,
   simply the current system timestamp, and hope for the best.

In practice we are usually in case 3 + vblank irq, or case 2,
and the timestamps are rather precise, and at least never off
more than 1 vblank duration. During vblank on/off, or if the
timestamping gets strongly delayed into active scanout due to
things ike irqs being off for a long time, lock contention etc.,
we end in case 1 and also get very precise timestamps.

TODO:

- Case 2 does not improve accuracy, probably due to wrong
  assumptions about HVS clock frequency, or about refill
  behavior, or because it only covers about 1-3 scanlines
  at most and suffers roundoff errors. Drop case 2? Or a
  way to refine it?

- Replace heuristically found numbers for fifo linebuffer
  capacity by some actual info from Eric.

- A way to improve case 3 in non-irq case?

Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
---
 drivers/gpu/drm/vc4/vc4_crtc.c | 155 +++++++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/vc4/vc4_drv.c  |   2 +
 drivers/gpu/drm/vc4/vc4_drv.h  |   7 ++
 drivers/gpu/drm/vc4/vc4_kms.c  |  19 +++++
 4 files changed, 183 insertions(+)

diff --git a/drivers/gpu/drm/vc4/vc4_crtc.c b/drivers/gpu/drm/vc4/vc4_crtc.c
index 1981ca3..2de8ba9 100644
--- a/drivers/gpu/drm/vc4/vc4_crtc.c
+++ b/drivers/gpu/drm/vc4/vc4_crtc.c
@@ -46,6 +46,9 @@  struct vc4_crtc {
 	const struct vc4_crtc_data *data;
 	void __iomem *regs;
 
+	/* Timestamp at start of vblank irq - unaffected by lock delays. */
+	ktime_t t_vblank;
+
 	/* Which HVS channel we're using for our CRTC. */
 	int channel;
 
@@ -146,6 +149,156 @@  int vc4_crtc_debugfs_regs(struct seq_file *m, void *unused)
 }
 #endif
 
+int vc4_crtc_get_scanoutpos(struct drm_device *dev, unsigned int crtc_id,
+			    unsigned int flags, int *vpos, int *hpos,
+			    ktime_t *stime, ktime_t *etime,
+			    const struct drm_display_mode *mode)
+{
+	struct vc4_dev *vc4 = to_vc4_dev(dev);
+	struct vc4_crtc *vc4_crtc = vc4->crtc[crtc_id];
+	u32 val;
+	int fifo_lines, vblank_lines;
+	int ret = 0;
+
+	/* preempt_disable_rt() should go right here in PREEMPT_RT patchset. */
+
+	/* Get optional system timestamp before query. */
+	if (stime)
+		*stime = ktime_get();
+
+	/*
+	 * Read vertical scanline which is currently composed for our
+	 * pixelvalve by the HVS, and also scaler status.
+	 */
+	val = HVS_READ(SCALER_DISPSTATX(vc4_crtc->channel));
+
+	/* Get optional system timestamp after query. */
+	if (etime)
+		*etime = ktime_get();
+
+	/* preempt_enable_rt() should go right here in PREEMPT_RT patchset. */
+
+	/* vpos of composed (not scanned out) scanline is in 12 lsb's. */
+	*vpos = val & VC4_MASK(11, 0);
+
+	/* No hpos info available. */
+	if (hpos)
+		*hpos = 0;
+
+	/* XXX Proper estimate? Works for different modes, but guesswork...
+	 * Estimated the linebuffer fifo to hold 7 lines at 2048 horizontal
+	 * resolution, then guess a reasonable size according to HACTIVE.
+	 */
+	fifo_lines = (2048 * 7 / mode->crtc_hdisplay) - 1;
+
+	DRM_DEBUG_KMS("crtc %d : vblirq %d : fifo-lines %d : fifo-full %d : usec-irqdelay: %d : scanline %d\n",
+		      crtc_id, (flags & DRM_CALLED_FROM_VBLIRQ) ? 1 : 0, fifo_lines,
+		      ((val & SCALER_DISPSTATX_FULL) == SCALER_DISPSTATX_FULL) ? 1 : 0,
+		      (int) (ktime_to_ns(*stime) - ktime_to_ns(vc4_crtc->t_vblank)) / 1000,
+		      *vpos);
+
+	/* HVS more than fifo_lines into frame for compositing? */
+	if (*vpos > fifo_lines) {
+		/*
+		 * We are in active scanout and can get some meaningful results
+		 * from HVS. The actual PV scanout can not trail behind more
+		 * than fifo_lines as that is the fifo's capacity. Assume that
+		 * in active scanout the HVS and PV work in lockstep wrt. HVS
+		 * refilling the fifo and PV consuming from the fifo, ie.
+		 * whenever the PV consumes and frees up a scanline in the
+		 * fifo, the HVS will immediately refill it, therefore
+		 * incrementing vpos. Therefore we choose HVS refill position -
+		 * fifo size in scanlines as a good guess wrt. the real scanout
+		 * position of the PV.
+		 */
+		*vpos -= fifo_lines + 1;
+		ret |= DRM_SCANOUTPOS_VALID | DRM_SCANOUTPOS_ACCURATE;
+		return ret;
+	}
+
+	/*
+	 * Less: This happens when we are in vblank and the HVS, after getting
+	 * the VSTART restart signal from the PV, just started refilling its
+	 * fifo with new lines from the top-most lines of the new framebuffers.
+	 * The PV does not scan out in vblank, so does not remove lines from
+	 * the fifo, so the fifo will be full quickly. At that point we can not
+	 * get meaningful readings wrt. scanline position of the PV and need
+	 * to make things up in a approximative and somewhat consistent way.
+	 */
+	ret |= DRM_SCANOUTPOS_IN_VBLANK;
+	vblank_lines = mode->crtc_vtotal - mode->crtc_vdisplay;
+
+	/* XXX Is there really a point in this special case - worth it?
+	 * If the HVS fifo is not yet full, we can try to estimate how many
+	 * scanlines we are into VBLANK by how many fifo lines HVS could
+	 * refill since start of VBLANK, therefore how many scanlines to go
+	 * until start of active scanout.
+	 */
+	if ((*vpos < fifo_lines) &&
+	    ((val & SCALER_DISPSTATX_FULL) != SCALER_DISPSTATX_FULL)) {
+		/*
+		 * Fifo not yet fully refilled, so presumably at beginning of
+		 * vblank. Calculate scanlines to go until end of vblank. We
+		 * assume the HVS fills its fifo at the system clock rate of
+		 * 250 Mhz, but the PV moves through the vblank at video dot
+		 * clock rate, so must not reduce lines to go by *vpos but only
+		 * by the fraction of dot_clock / system_clock.
+		 * XXX Is this right? Is 250 Mhz right?
+		 */
+		*vpos = -vblank_lines + (*vpos * mode->crtc_clock / 250000);
+		ret |= DRM_SCANOUTPOS_VALID | DRM_SCANOUTPOS_ACCURATE;
+		return ret;
+	}
+
+	/* Fifo full. The HVS readings can't tell us where we are in VBLANK,
+	 * because the HVS is not compositing and updating its source read
+	 * position and the PV will not start consuming scanlines before the
+	 * start of active scanout, therefore the HVS can not update anything
+	 * while we are in VBLANK. All we can do is fake something reasonable.
+	 */
+	if (flags & DRM_CALLED_FROM_VBLIRQ) {
+		/*
+		 * If we are called from vblank irq then we assume the irq
+		 * handler got called within the first line of vblank,
+		 * so assume PV had about a full vblank scanlines to go.
+		 * As a base timestamp use the one taken at entry into vblank
+		 * irq handler, so it was not affected by delays due to lock
+		 * contention on event_lock or vblank_time lock in the core.
+		 */
+		*vpos = -vblank_lines;
+		if (stime)
+			*stime = vc4_crtc->t_vblank;
+		if (etime)
+			*etime = vc4_crtc->t_vblank;
+
+		ret |= DRM_SCANOUTPOS_VALID;
+	} else {
+		/*
+		 * No clue where we are. Return a vpos of zero, which will
+		 * cause calling code to just return the system timestamp.
+		 * At least this is no worse than the standard fallback.
+		 */
+		*vpos = 0;
+		ret |= DRM_SCANOUTPOS_VALID;
+	}
+
+	return ret;
+}
+
+int vc4_crtc_get_vblank_timestamp(struct drm_device *dev, unsigned int crtc_id,
+				  int *max_error, struct timeval *vblank_time,
+				  unsigned flags)
+{
+	struct vc4_dev *vc4 = to_vc4_dev(dev);
+	struct vc4_crtc *vc4_crtc = vc4->crtc[crtc_id];
+	struct drm_crtc *crtc = &vc4_crtc->base;
+
+	/* Helper routine in DRM core does all the work: */
+	return drm_calc_vbltimestamp_from_scanoutpos(dev, crtc_id, max_error,
+						     vblank_time, flags,
+						     &crtc->hwmode);
+}
+
 static void vc4_crtc_destroy(struct drm_crtc *crtc)
 {
 	drm_crtc_cleanup(crtc);
@@ -526,7 +679,9 @@  static irqreturn_t vc4_crtc_irq_handler(int irq, void *data)
 	irqreturn_t ret = IRQ_NONE;
 
 	if (stat & PV_INT_VFP_START) {
+		vc4_crtc->t_vblank = ktime_get();
 		CRTC_WRITE(PV_INTSTAT, PV_INT_VFP_START);
+		/* udelay(1500); */
 		drm_crtc_handle_vblank(&vc4_crtc->base);
 		vc4_crtc_handle_page_flip(vc4_crtc);
 		ret = IRQ_HANDLED;
diff --git a/drivers/gpu/drm/vc4/vc4_drv.c b/drivers/gpu/drm/vc4/vc4_drv.c
index 7ff70ce7..bf9054c 100644
--- a/drivers/gpu/drm/vc4/vc4_drv.c
+++ b/drivers/gpu/drm/vc4/vc4_drv.c
@@ -92,6 +92,8 @@  static struct drm_driver vc4_drm_driver = {
 	.enable_vblank = vc4_enable_vblank,
 	.disable_vblank = vc4_disable_vblank,
 	.get_vblank_counter = drm_vblank_no_hw_counter,
+	.get_scanout_position = vc4_crtc_get_scanoutpos,
+	.get_vblank_timestamp = vc4_crtc_get_vblank_timestamp,
 
 #if defined(CONFIG_DEBUG_FS)
 	.debugfs_init = vc4_debugfs_init,
diff --git a/drivers/gpu/drm/vc4/vc4_drv.h b/drivers/gpu/drm/vc4/vc4_drv.h
index 37cac59..1b5dc60 100644
--- a/drivers/gpu/drm/vc4/vc4_drv.h
+++ b/drivers/gpu/drm/vc4/vc4_drv.h
@@ -415,6 +415,13 @@  extern struct platform_driver vc4_crtc_driver;
 int vc4_enable_vblank(struct drm_device *dev, unsigned int crtc_id);
 void vc4_disable_vblank(struct drm_device *dev, unsigned int crtc_id);
 int vc4_crtc_debugfs_regs(struct seq_file *m, void *arg);
+int vc4_crtc_get_scanoutpos(struct drm_device *dev, unsigned int crtc_id,
+			    unsigned int flags, int *vpos, int *hpos,
+			    ktime_t *stime, ktime_t *etime,
+			    const struct drm_display_mode *mode);
+int vc4_crtc_get_vblank_timestamp(struct drm_device *dev, unsigned int crtc_id,
+				  int *max_error, struct timeval *vblank_time,
+				  unsigned flags);
 
 /* vc4_debugfs.c */
 int vc4_debugfs_init(struct drm_minor *minor);
diff --git a/drivers/gpu/drm/vc4/vc4_kms.c b/drivers/gpu/drm/vc4/vc4_kms.c
index d423ba1..6092c55 100644
--- a/drivers/gpu/drm/vc4/vc4_kms.c
+++ b/drivers/gpu/drm/vc4/vc4_kms.c
@@ -89,6 +89,22 @@  static struct vc4_commit *commit_init(struct drm_atomic_state *state)
 	return c;
 }
 
+static void vc4_update_crtc_state(struct drm_atomic_state *state)
+{
+	struct drm_crtc *crtc;
+	struct drm_crtc_state *crtc_state;
+	int i;
+
+	/* Double check state. */
+	for_each_crtc_in_state(state, crtc, crtc_state, i) {
+		/* Update hwmode for vblank functions */
+		if (crtc->state->active)
+			crtc->hwmode = crtc->state->adjusted_mode;
+		else
+			crtc->hwmode.crtc_clock = 0;
+	}
+}
+
 /**
  * vc4_atomic_commit - commit validated state object
  * @dev: DRM device
@@ -154,6 +170,9 @@  static int vc4_atomic_commit(struct drm_device *dev,
 
 	drm_atomic_helper_swap_state(dev, state);
 
+	/* Update crtc state */
+	vc4_update_crtc_state(state);
+
 	/*
 	 * Everything below can be run asynchronously without the need to grab
 	 * any modeset locks at all under one condition: It must be guaranteed
-- 
2.7.0