Major 2.6.38 / 2.6.39 / 3.0 regression ignored?

Message ID	yun7h7atcty.fsf@aiko.keithp.com (mailing list archive)
State	New, archived
Headers	From: Keith Packard <keithp@keithp.com>
	To: Kirill Smelkov <kirr@mns.spb.ru>, Pekka Enberg <penberg@kernel.org>
	Cc: Herbert Xu <herbert@gondor.hengli.com.au>, Luke-Jr <luke@dashjr.org>, intel-gfx@lists.freedesktop.org, LKML <linux-kernel@vger.kernel.org>, dri-devel@lists.freedesktop.org, "Rafael J. Wysocki" <rjw@sisk.pl>, Ray Lee <ray-lk@madrabbit.org>, Andrew Morton <akpm@linux-foundation.org>, Linus Torvalds <torvalds@linux-foundation.org>
	Subject: Re: [Intel-gfx] Major 2.6.38 / 2.6.39 / 3.0 regression ignored?
	Date: Fri, 22 Jul 2011 11:00:41 -0700
	Message-ID: <yun7h7atcty.fsf@aiko.keithp.com>
	In-Reply-To: <20110722110806.GA29757@tugrik.mns.mnsspb.ru>

Keith Packard July 22, 2011, 6 p.m. UTC

On Fri, 22 Jul 2011 15:08:06 +0400, Kirill Smelkov <kirr@mns.spb.ru> wrote:

> And now after v3.0 is out, I've tested it again, and yes, like it was
> broken on v3.0-rc5, it is (now even more) broken on v3.0 -- after first
> bad io access the system freezes completely:

I looked at this when I first saw it (a couple of weeks ago), and I
couldn't see any obvious reason this patch would cause this particular
problem. I didn't want to revert the patch at that point as I feared it
would cause other subtle problems. Given that you've got a work-around,
it seemed best to just push this off past 3.0.

Given the failing address passed to ioread32, this seems like it's
probably the call to READ_BREADCRUMB -- I915_BREADCRUMB_INDEX is 0x21,
which is an offset in 32-bit units within the hardware status page. If
the status_page.page_addr value was zero, then the computed address
would end up being 0x84.
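
The arithmetic can be checked with a small standalone sketch. It assumes only what is stated above: the breadcrumb index is 0x21 and counts 32-bit units. The helper names hws_slot_addr and check_null_base_offset are hypothetical and are not part of the i915 driver.

```c
#include <assert.h>
#include <stdint.h>

/* Value quoted in the thread: the breadcrumb sits at dword index 0x21. */
#define I915_BREADCRUMB_INDEX 0x21

/* Hypothetical helper: byte address of a 32-bit slot in the status page. */
static uintptr_t hws_slot_addr(uintptr_t page_addr, unsigned int index)
{
	return page_addr + (uintptr_t)index * sizeof(uint32_t);
}

/* With a zero base pointer the slot address collapses to the bare offset. */
static void check_null_base_offset(void)
{
	assert(hws_slot_addr(0, I915_BREADCRUMB_INDEX) == 0x84);
}
```

So a zero page_addr would plausibly turn the breadcrumb read into an access at address 0x84, matching the failing address in the report.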

And, it looks like status_page.page_addr *will* end up being zero as a
result of the patch in question. The patch resets the entire ring
structure contents back to the initial values, which includes smashing
the status_page structure to zero, clearing the value of
status_page.page_addr set in i915_init_phys_hws.
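
A minimal sketch of that failure mode, using simplified stand-in types rather than the real driver structures (struct ring_sketch, template_ring and both function names are assumptions made for illustration): a whole-struct assignment from a template silently discards a pointer that an earlier init step stored.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the driver structures (illustrative only). */
struct status_page_sketch {
	void *page_addr;
};

struct ring_sketch {
	const char *name;
	struct status_page_sketch status_page;
};

/* Template with no status page address, like a static initial ring value. */
static const struct ring_sketch template_ring = { .name = "render ring" };

/* Early step: records where the status page lives. */
static void early_init(struct ring_sketch *ring, void *hws)
{
	ring->status_page.page_addr = hws;
}

/* Late step: copies the template over the whole structure. */
static void late_init(struct ring_sketch *ring)
{
	*ring = template_ring;
}

/* Returns nonzero if the pointer set by early_init survives late_init. */
static int page_addr_survives_reinit(void)
{
	static char hws[4096];
	struct ring_sketch ring = { 0 };

	early_init(&ring, hws);
	assert(ring.status_page.page_addr == hws);
	late_init(&ring);
	return ring.status_page.page_addr != NULL;
}
```

In this sketch page_addr_survives_reinit() returns 0: every field not named in the template is zeroed by the struct copy, including the pointer set earlier.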

Here's an untested patch which moves the initialization of
status_page.page_addr into intel_render_ring_init_dri. I note that
intel_init_render_ring_buffer *already* has the setting of the
status_page.page_addr value, and so I've removed the setting of
status_page.page_addr from i915_init_phys_hws.

I suspect we could remove the memset from intel_init_render_ring_buffer;
it seems entirely superfluous given the memset in i915_init_phys_hws.

From 159ba1dd207fc52590ce8a3afd83f40bd2cedf46 Mon Sep 17 00:00:00 2001
From: Keith Packard <keithp@keithp.com>
Date: Fri, 22 Jul 2011 10:44:39 -0700
Subject: [PATCH] drm/i915: Initialize RCS ring status page address in
 intel_render_ring_init_dri

Physically-addressed hardware status pages are initialized early in
the driver load process by i915_init_phys_hws. For UMS environments,
the ring structure is not initialized until the X server starts. At
that point, the entire ring structure is re-initialized with all new
values. Any values set in the ring structure (including
ring->status_page.page_addr) will be lost when the ring is
re-initialized.

This patch moves the initialization of the status_page.page_addr value
to intel_render_ring_init_dri.

Signed-off-by: Keith Packard <keithp@keithp.com>
---
 drivers/gpu/drm/i915/i915_dma.c         |    6 ++----
 drivers/gpu/drm/i915/intel_ringbuffer.c |    3 +++
 2 files changed, 5 insertions(+), 4 deletions(-)

Kirill Smelkov July 22, 2011, 8:23 p.m. UTC | #1

Keith,

first of all thanks for your prompt reply. Then...

On Fri, Jul 22, 2011 at 11:00:41AM -0700, Keith Packard wrote:
> On Fri, 22 Jul 2011 15:08:06 +0400, Kirill Smelkov <kirr@mns.spb.ru> wrote:
> 
> > And now after v3.0 is out, I've tested it again, and yes, like it was
> > broken on v3.0-rc5, it is (now even more) broken on v3.0 -- after first
> > bad io access the system freezes completely:
> 
> I looked at this when I first saw it (a couple of weeks ago), and I
> couldn't see any obvious reason this patch would cause this particular
> problem. I didn't want to revert the patch at that point as I feared it
> would cause other subtle problems. Given that you've got a work-around,
> it seemed best to just push this off past 3.0.

What kind of a workaround are you talking about? Sorry, to me it all
looked like "UMS is being ignored forever". Anyway, let's move on to try
to solve the issue.


> Given the failing address passed to ioread32, this seems like it's
> probably the call to READ_BREADCRUMB -- I915_BREADCRUMB_INDEX is 0x21,
> which is an offset in 32-bit units within the hardware status page. If
> the status_page.page_addr value was zero, then the computed address
> would end up being 0x84.
> 
> And, it looks like status_page.page_addr *will* end up being zero as a
> result of the patch in question. The patch resets the entire ring
> structure contents back to the initial values, which includes smashing
> the status_page structure to zero, clearing the value of
> status_page.page_addr set in i915_init_phys_hws.
> 
> Here's an untested patch which moves the initialization of
> status_page.page_addr into intel_render_ring_init_dri. I note that
> intel_init_render_ring_buffer *already* has the setting of the
> status_page.page_addr value, and so I've removed the setting of
> status_page.page_addr from i915_init_phys_hws.
> 
> I suspect we could remove the memset from intel_init_render_ring_buffer;
> it seems entirely superfluous given the memset in i915_init_phys_hws.
> 
> From 159ba1dd207fc52590ce8a3afd83f40bd2cedf46 Mon Sep 17 00:00:00 2001
> From: Keith Packard <keithp@keithp.com>
> Date: Fri, 22 Jul 2011 10:44:39 -0700
> Subject: [PATCH] drm/i915: Initialize RCS ring status page address in
>  intel_render_ring_init_dri
> 
> Physically-addressed hardware status pages are initialized early in
> the driver load process by i915_init_phys_hws. For UMS environments,
> the ring structure is not initialized until the X server starts. At
> that point, the entire ring structure is re-initialized with all new
> values. Any values set in the ring structure (including
> ring->status_page.page_addr) will be lost when the ring is
> re-initialized.
> 
> This patch moves the initialization of the status_page.page_addr value
> to intel_render_ring_init_dri.
> 
> Signed-off-by: Keith Packard <keithp@keithp.com>
> ---
>  drivers/gpu/drm/i915/i915_dma.c         |    6 ++----
>  drivers/gpu/drm/i915/intel_ringbuffer.c |    3 +++
>  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> index 1271282..8a3942c 100644
> --- a/drivers/gpu/drm/i915/i915_dma.c
> +++ b/drivers/gpu/drm/i915/i915_dma.c
> @@ -61,7 +61,6 @@ static void i915_write_hws_pga(struct drm_device *dev)
>  static int i915_init_phys_hws(struct drm_device *dev)
>  {
>  	drm_i915_private_t *dev_priv = dev->dev_private;
> -	struct intel_ring_buffer *ring = LP_RING(dev_priv);
>  
>  	/* Program Hardware Status Page */
>  	dev_priv->status_page_dmah =
> @@ -71,10 +70,9 @@ static int i915_init_phys_hws(struct drm_device *dev)
>  		DRM_ERROR("Can not allocate hardware status page\n");
>  		return -ENOMEM;
>  	}
> -	ring->status_page.page_addr =
> -		(void __force __iomem *)dev_priv->status_page_dmah->vaddr;
>  
> -	memset_io(ring->status_page.page_addr, 0, PAGE_SIZE);
> +	memset_io((void __force __iomem *)dev_priv->status_page_dmah->vaddr,
> +		  0, PAGE_SIZE);
>  
>  	i915_write_hws_pga(dev);
>  
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index e961568..47b9b27 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -1321,6 +1321,9 @@ int intel_render_ring_init_dri(struct drm_device *dev, u64 start, u32 size)
>  		ring->get_seqno = pc_render_get_seqno;
>  	}
>  
> +	if (!I915_NEED_GFX_HWS(dev))
> +		ring->status_page.page_addr = dev_priv->status_page_dmah->vaddr;
> +
>  	ring->dev = dev;
>  	INIT_LIST_HEAD(&ring->active_list);
>  	INIT_LIST_HEAD(&ring->request_list);

I can't tell whether this is correct, because the Intel gfx driver is
unknown to me, but at first glance your description sounds reasonable.

I'm out of office till ~ next week's tuesday, and on return I'll try
to test it on the hardware in question.


Thanks again,
Kirill

Keith Packard July 22, 2011, 8:50 p.m. UTC | #2

On Sat, 23 Jul 2011 00:23:36 +0400, Kirill Smelkov <kirr@mns.spb.ru> wrote:

> What kind of a workaround are you talking about?

Just reverting the commit -- that makes your machine work, even if it's
wrong for other machines.

> Sorry, to me it all looked like "UMS is being ignored forever".

You're right, of course -- UMS is a huge wart on the kernel driver at
this point, keeping it working while also adding new functionality
continues to cause challenges. We tend to expect that most people will
run reasonably contemporaneous kernel and user space code, and so three
years after the switch, it continues to surprise us when someone
actually tries UMS.

> I'm out of office till ~ next week's tuesday, and on return I'll try
> to test it on the hardware in question.

Let me know; I've pushed this patch to my drm-intel-fixes tree on
kernel.org in the meantime; if it does solve the problem, I'd like to
add your Tested-by: line.

Kirill Smelkov July 22, 2011, 9:08 p.m. UTC | #3

On Fri, Jul 22, 2011 at 01:50:04PM -0700, Keith Packard wrote:
> On Sat, 23 Jul 2011 00:23:36 +0400, Kirill Smelkov <kirr@mns.spb.ru> wrote:
> 
> > What kind of a workaround are you talking about?
> 
> Just reverting the commit -- that makes your machine work, even if it's
> wrong for other machines.

Yes, I could revert it. But since the driver is reasonably complex, it
is better to know what I'm doing and that the change makes sense,
especially when it's not "my machine", but lots of target boards located
all over the country.

That's why I wanted your feedback (and, imho, reasonably so, because I
did the homework): so as not to be on my own, alone.

> > Sorry, to me it all looked like "UMS is being ignored forever".
> 
> You're right, of course -- UMS is a huge wart on the kernel driver at
> this point, keeping it working while also adding new functionality
> continues to cause challenges. We tend to expect that most people will
> run reasonably contemporaneous kernel and user space code, and so three
> years after the switch, it continues to surprise us when someone
> actually tries UMS.

We are planning to upgrade to KMS too. The kernel is upgraded more often
than userspace, because of the already mentioned (thanks!) "no
regression" rule. Userspace is more complex and more work in my context,
so it is lagging, but eventually we'll get there.

So I hope some day, when everyone has upgraded, UMS support can be
"cleaned up" out of the driver.

> > I'm out of office till ~ next week's tuesday, and on return I'll try
> > to test it on the hardware in question.
> 
> Let me know; I've pushed this patch to my drm-intel-fixes tree on
> kernel.org in the meantime; if it does solve the problem, I'd like to
> add your Tested-by: line.

Yes, sure, I'll let you know the results.

Thanks,
Kirill

Kirill Smelkov July 22, 2011, 9:31 p.m. UTC | #4

On Sat, Jul 23, 2011 at 01:08:14AM +0400, Kirill Smelkov wrote:
> On Fri, Jul 22, 2011 at 01:50:04PM -0700, Keith Packard wrote:

> > You're right, of course -- UMS is a huge wart on the kernel driver at
> > this point, keeping it working while also adding new functionality
> > continues to cause challenges. We tend to expect that most people will
> > run reasonably contemporaneous kernel and user space code, and so three
> > years after the switch, it continues to surprise us when someone
> > actually tries UMS.
> 
> We are planning to upgrade to KMS too. The kernel is upgraded more often
> than userspace, because of the already mentioned (thanks!) "no
> regression" rule. Userspace is more complex and more work in my context,
> so it is lagging, but eventually we'll get there.

Also wanted to say that if the whole of X could be built, like the kernel,
from one repo without a multirepo-setup tool, with 100% reliable
incremental rebuilds, etc... it would be a bit easier to upgrade X too.

Sorry for being a bit offtopic, could not resist. I have been keeping that
thought in my head for ~ 2 years already, and now had a chance to mention it.

Thanks,
Kirill

Kirill Smelkov July 23, 2011, 6:19 p.m. UTC | #5

On Sat, Jul 23, 2011 at 11:10:53AM -0400, Alex Deucher wrote:
> On Fri, Jul 22, 2011 at 5:31 PM, Kirill Smelkov <kirr@mns.spb.ru> wrote:
> > On Sat, Jul 23, 2011 at 01:08:14AM +0400, Kirill Smelkov wrote:
> >> On Fri, Jul 22, 2011 at 01:50:04PM -0700, Keith Packard wrote:
> >
> >> > You're right, of course -- UMS is a huge wart on the kernel driver at
> >> > this point, keeping it working while also adding new functionality
> >> > continues to cause challenges. We tend to expect that most people will
> >> > run reasonably contemporaneous kernel and user space code, and so three
> >> > years after the switch, it continues to surprise us when someone
> >> > actually tries UMS.
> >>
> >> We are planning to upgrade to KMS too. The kernel is upgraded more often
> >> than userspace, because of the already mentioned (thanks!) "no
> >> regression" rule. Userspace is more complex and more work in my context,
> >> so it is lagging, but eventually we'll get there.
> >
> > Also wanted to say that if the whole of X could be built, like the kernel,
> > from one repo without a multirepo-setup tool, with 100% reliable
> > incremental rebuilds, etc... it would be a bit easier to upgrade X too.
> >
> > Sorry for being a bit offtopic, could not resist. I have been keeping that
> > thought in my head for ~ 2 years already, and now had a chance to mention it.
> 
> You don't have to rebuild all of X to use KMS.  In most cases, you
> just need to update the ddx for your card.

I meant the rebuild not in order to use KMS, but in the general case. To me,
one of the kernel's great advantages is that it is a lot of self-consistent
code, thanks to being maintained in one repo with a good build system and a
good development process. As a result it is (relatively) easy to
upgrade.

Anyway, this is just a note from a stranger to both the kernel and X, so
whatever...


Kirill

Keith Packard July 25, 2011, 4:29 a.m. UTC | #6

On Sat, 23 Jul 2011 18:55:48 +0300 (EEST), Pekka Enberg <penberg@kernel.org> wrote:

> I know I sound like a broken record but I really wish you i915 devs were
> a little more eager to revert broken patches early rather than late. I mean,
> this particular breakage was already bisected but nobody said or
> did anything - and it's not like it's the first time either!

We've switched processes starting with 2.6.39 and I think we're doing
better in this regard. For this particular issue, the regression came
with 2.6.38, and the revert was too large for me to consider merging
just before 3.0 shipped -- I knew reverting it *would* cause problems
for anyone using UMS on newer hardware.

> I suppose I need to bribe Linus somehow to be more strict with you
> folks.

He nicely delivered the message for you a few months ago in person.

In any case, I'm hoping that my smaller fix will resolve the problem and
also not cause regressions for other users.

Kirill Smelkov July 26, 2011, 1:48 p.m. UTC | #7

On Sat, Jul 23, 2011 at 12:23:36AM +0400, Kirill Smelkov wrote:
> [...]
> 
> I can't tell whether this is correct, because the Intel gfx driver is
> unknown to me, but at first glance your description sounds reasonable.
> 
> I'm out of office till ~ next week's tuesday, and on return I'll try
> to test it on the hardware in question.

Keith, thanks again for the patch. As promised I've tested it on the
hardware in question and yes, bad_access is gone and X seems to work,
so thank you, but...


I see there are more such bugs in intel_render_ring_init_dri(), which
the guilty patch introduced. For example, ring->irq_queue is left
uninitialized, and so is ring->irq_lock, etc...

I'm an X newbie, so if there is something stupid here X-wise, please don't
beat me too hard, but to me the gist of the problem is the original
patch, where Chris does

( git show e8616b6ced6137085e6657cc63bc2fe3900b8616 )
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index 03e3370..51fbc5e 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -1291,6 +1291,48 @@ int intel_init_render_ring_buffer(struct drm_device *dev)
>         return intel_init_ring_buffer(dev, ring);
>  }
>  
> +int intel_render_ring_init_dri(struct drm_device *dev, u64 start, u32 size)
> +{
> +       drm_i915_private_t *dev_priv = dev->dev_private;
> +       struct intel_ring_buffer *ring = &dev_priv->ring[RCS];
> +
> +       *ring = render_ring;
          ^^^^^^^^^^^^^^^^^^^
          here resets

> +       if (INTEL_INFO(dev)->gen >= 6) {
> +               ring->add_request = gen6_add_request;
> +               ring->irq_get = gen6_render_ring_get_irq;
> +               ring->irq_put = gen6_render_ring_put_irq;
> +       } else if (IS_GEN5(dev)) {
> +               ring->add_request = pc_render_add_request;
> +               ring->get_seqno = pc_render_get_seqno;
> +       }

and then the rest of the `ring` is initialized seemingly copy-pasted
from intel_init_ring_buffer():

> +       ring->dev = dev;
> +       INIT_LIST_HEAD(&ring->active_list);
> +       INIT_LIST_HEAD(&ring->request_list);
> +       INIT_LIST_HEAD(&ring->gpu_write_list);
> +
> +       ring->size = size;
> +       ring->effective_size = ring->size;
> +       if (IS_I830(ring->dev))
> +               ring->effective_size -= 128;
> +
> +       ring->map.offset = start;
> +       ring->map.size = size;
> +       ring->map.type = 0;
> +       ring->map.flags = 0;
> +       ring->map.mtrr = 0;
...

where all 3 chunks come almost verbatim from intel_init_ring_buffer(), and
the ring->effective_size tweak even lost its original comment:

# original version from intel_init_ring_buffer():
        /* Workaround an erratum on the i830 which causes a hang if
         * the TAIL pointer points to within the last 2 cachelines
         * of the buffer.
         */
        ring->effective_size = ring->size;
        if (IS_I830(ring->dev))
                ring->effective_size -= 128;

...


The line marked "here resets" resets all the fields, and maybe it's not a good
idea to re-initialize them all afterwards (missing some, as this thread shows),
or at least, if it is really needed, to share the initialization code between
intel_render_ring_init_dri() and intel_init_ring_buffer()?
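
The sharing idea can be sketched as follows. This is not the driver's code: the struct, the function names, the ring size constant and the boolean "ready" flags (stand-ins for real lock and wait-queue initialization) are all hypothetical. The point is only that both entry paths go through one common helper, so a field cannot be initialized in one path and forgotten in the other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ring size used by the KMS-style path in this sketch. */
#define SKETCH_KMS_RING_SIZE (32u * 4096u)

struct ring_sketch {
	void *dev;
	unsigned int size;
	unsigned int effective_size;
	bool irq_lock_ready;   /* stand-in for initializing ring->irq_lock */
	bool irq_queue_ready;  /* stand-in for initializing ring->irq_queue */
};

/* Single place that sets up every field both paths depend on. */
static void ring_init_common(struct ring_sketch *ring, void *dev,
			     unsigned int size, bool is_i830)
{
	assert(ring != NULL);

	ring->dev = dev;
	ring->size = size;

	/* i830 erratum from the thread: keep TAIL out of the last
	 * 2 cachelines (128 bytes) of the buffer. */
	ring->effective_size = size;
	if (is_i830)
		ring->effective_size -= 128;

	ring->irq_lock_ready = true;
	ring->irq_queue_ready = true;
}

/* KMS-style path: the driver chooses the size itself. */
static void ring_init_kms(struct ring_sketch *ring, void *dev, bool is_i830)
{
	ring_init_common(ring, dev, SKETCH_KMS_RING_SIZE, is_i830);
}

/* UMS/DRI-style path: the size comes from the caller. */
static void ring_init_dri(struct ring_sketch *ring, void *dev,
			  unsigned int size, bool is_i830)
{
	ring_init_common(ring, dev, size, is_i830);
}
```

With this shape, adding a new field means touching ring_init_common() once, and the erratum comment stays next to the code it explains.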

From the outside it looks like the offending patch was done as a quick
fix in a hurry (lots of copy-paste), and maybe it would be better to
re-do it properly...


Thanks again,
Kirill

Kirill Smelkov Aug. 9, 2011, 12:08 p.m. UTC | #8

On Tue, Jul 26, 2011 at 05:48:27PM +0400, Kirill Smelkov wrote:
> [...]
> 
> Keith, thanks again for the patch. As promised I've tested it on the
> hardware in question and yes, bad_access is gone and X seems to work,
> so thank you, but...
> 
> 
> I see there are more such bugs in introduced-in-guilty-patch
> intel_render_ring_init_dri(). For example ring->irq_queue is
> left uninitialized and also ring->irq_lock etc...
> 
> [...]
> 
> The line marked "here resets" resets all the fields, and maybe it's not a good
> idea to re-initialize them all afterwards (missing some as this thread show),
> or at least if it is really needed, share initialization code between
> intel_render_ring_init_dri() and intel_init_ring_buffer() ?
> 
> From the outside it looks like the offending patch was done as a quick
> fix in a hurry (lots of copy-paste), and maybe it would be better to
> re-do it properly...

Silence... ?

I take this to mean UMS is still being ignored, because e.g. the
uninitialized ring->irq_lock which I wrote about above is for sure used,
e.g. in gen6_render_ring_get_irq(), which is added to the ring vtable in
intel_render_ring_init_dri().

And is copy-pasting, instead of properly structuring things, also ok?


Why not revert what caused trouble and introduced other subtle bugs, and
redo things properly in the first place?
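
The concern above can be reproduced outside the kernel. What follows is a minimal userspace C sketch, not the i915 code: struct ring, ring_template, ring_init_common() and ring_init_dri_style() are hypothetical stand-ins invented for illustration. It shows that assigning a template struct over a live one clears every field the template does not set, and how a shared helper might keep two init paths from drifting apart.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct intel_ring_buffer; not the i915 type. */
struct ring {
	const char *name;
	void *status_page_addr;      /* set early, by a different init path */
	int lock_ready;              /* stands in for irq_lock / irq_queue setup */
	unsigned int size;
	unsigned int effective_size;
};

/* Template that sets only some fields; all others are implicitly zero. */
const struct ring ring_template = {
	.name = "render ring",
};

/* Shared setup that every init path would call (an assumed design). */
void ring_init_common(struct ring *ring, unsigned int size)
{
	ring->lock_ready = 1;
	ring->size = size;
	ring->effective_size = size;
}

/* Mirrors the pattern under discussion: reset first, then re-initialize. */
void ring_init_dri_style(struct ring *ring, unsigned int size, void *hws)
{
	*ring = ring_template;        /* wipes status_page_addr and lock_ready */
	ring->status_page_addr = hws; /* has to be restored explicitly */
	ring_init_common(ring, size);
}

/* Self-check: the reset loses earlier state, the init path restores it. */
void ring_reset_demo(void)
{
	static char hws_page[64];
	struct ring ring = { .status_page_addr = hws_page, .lock_ready = 1 };

	ring = ring_template;
	assert(ring.status_page_addr == NULL);
	assert(ring.lock_ready == 0);

	ring_init_dri_style(&ring, 4096, hws_page);
	assert(ring.status_page_addr == hws_page);
	assert(ring.lock_ready == 1);
}
```

Calling ring_reset_demo() from a main() exercises both steps. With a shared helper, a field added later would only need to be initialized in one place.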

Vasily Khoruzhick Aug. 9, 2011, 2 p.m. UTC | #9

On Tuesday 09 August 2011 15:08:03 Kirill Smelkov wrote:
> On Tue, Jul 26, 2011 at 05:48:27PM +0400, Kirill Smelkov wrote:
> > On Sat, Jul 23, 2011 at 12:23:36AM +0400, Kirill Smelkov wrote:
> > > Keith,
> > > 
> > > first of all thanks for your prompt reply. Then...
> > > 
> > > On Fri, Jul 22, 2011 at 11:00:41AM -0700, Keith Packard wrote:
> > > > On Fri, 22 Jul 2011 15:08:06 +0400, Kirill Smelkov <kirr@mns.spb.ru> wrote:
> > > > > And now after v3.0 is out, I've tested it again, and yes, like it
> > > > > was broken on v3.0-rc5, it is (now even more) broken on v3.0 --
> > > > > after first bad io access the system freezes completely:
> > > > 
> > > > I looked at this when I first saw it (a couple of weeks ago), and I
> > > > couldn't see any obvious reason this patch would cause this
> > > > particular problem. I didn't want to revert the patch at that point
> > > > as I feared it would cause other subtle problems. Given that you've
> > > > got a work-around, it seemed best to just push this off past 3.0.
> > > 
> > > What kind of a workaround are you talking about? Sorry, to me it all
> > > looked like "UMS is being ignored forever". Anyway, let's move on to
> > > try to solve the issue.
> > > 
> > > > Given the failing address passed to ioread32, this seems like it's
> > > > probably the call to READ_BREADCRUMB -- I915_BREADCRUMB_INDEX is
> > > > 0x21, which is an offset in 32-bit units within the hardware status
> > > > page. If the status_page.page_addr value was zero, then the computed
> > > > address would end up being 0x84.
> > > > 
> > > > And, it looks like status_page.page_addr *will* end up being zero as
> > > > a result of the patch in question. The patch resets the entire ring
> > > > structure contents back to the initial values, which includes
> > > > smashing the status_page structure to zero, clearing the value of
> > > > status_page.page_addr set in i915_init_phys_hws.
> > > > 
> > > > Here's an untested patch which moves the initialization of
> > > > status_page.page_addr into intel_render_ring_init_dri. I note that
> > > > intel_init_render_ring_buffer *already* has the setting of the
> > > > status_page.page_addr value, and so I've removed the setting of
> > > > status_page.page_addr from i915_init_phys_hws.
> > > > 
> > > > I suspect we could remove the memset from
> > > > intel_init_render_ring_buffer; it seems entirely superfluous given
> > > > the memset in i915_init_phys_hws.
> > > > 
> > > > From 159ba1dd207fc52590ce8a3afd83f40bd2cedf46 Mon Sep 17 00:00:00 2001
> > > > From: Keith Packard <keithp@keithp.com>
> > > > Date: Fri, 22 Jul 2011 10:44:39 -0700
> > > > Subject: [PATCH] drm/i915: Initialize RCS ring status page address in
> > > >  intel_render_ring_init_dri
> > > > 
> > > > Physically-addressed hardware status pages are initialized early in
> > > > the driver load process by i915_init_phys_hws. For UMS environments,
> > > > the ring structure is not initialized until the X server starts. At
> > > > that point, the entire ring structure is re-initialized with all new
> > > > values. Any values set in the ring structure (including
> > > > ring->status_page.page_addr) will be lost when the ring is
> > > > re-initialized.
> > > > 
> > > > This patch moves the initialization of the status_page.page_addr
> > > > value to intel_render_ring_init_dri.
> > > > 
> > > > Signed-off-by: Keith Packard <keithp@keithp.com>
> > > > ---
> > > >  drivers/gpu/drm/i915/i915_dma.c         |    6 ++----
> > > >  drivers/gpu/drm/i915/intel_ringbuffer.c |    3 +++
> > > >  2 files changed, 5 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> > > > index 1271282..8a3942c 100644
> > > > --- a/drivers/gpu/drm/i915/i915_dma.c
> > > > +++ b/drivers/gpu/drm/i915/i915_dma.c
> > > > @@ -61,7 +61,6 @@ static void i915_write_hws_pga(struct drm_device *dev)
> > > >  static int i915_init_phys_hws(struct drm_device *dev)
> > > >  {
> > > >  	drm_i915_private_t *dev_priv = dev->dev_private;
> > > > -	struct intel_ring_buffer *ring = LP_RING(dev_priv);
> > > >  
> > > >  	/* Program Hardware Status Page */
> > > >  	dev_priv->status_page_dmah =
> > > > @@ -71,10 +70,9 @@ static int i915_init_phys_hws(struct drm_device *dev)
> > > >  		DRM_ERROR("Can not allocate hardware status page\n");
> > > >  		return -ENOMEM;
> > > >  	}
> > > > -	ring->status_page.page_addr =
> > > > -		(void __force __iomem *)dev_priv->status_page_dmah->vaddr;
> > > >  
> > > > -	memset_io(ring->status_page.page_addr, 0, PAGE_SIZE);
> > > > +	memset_io((void __force __iomem *)dev_priv->status_page_dmah->vaddr,
> > > > +		  0, PAGE_SIZE);
> > > >  
> > > >  	i915_write_hws_pga(dev);
> > > >  
> > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > index e961568..47b9b27 100644
> > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > @@ -1321,6 +1321,9 @@ int intel_render_ring_init_dri(struct drm_device *dev, u64 start, u32 size)
> > > >  		ring->get_seqno = pc_render_get_seqno;
> > > >  	}
> > > >  
> > > > +	if (!I915_NEED_GFX_HWS(dev))
> > > > +		ring->status_page.page_addr = dev_priv->status_page_dmah->vaddr;
> > > > +
> > > >  	ring->dev = dev;
> > > >  	INIT_LIST_HEAD(&ring->active_list);
> > > >  	INIT_LIST_HEAD(&ring->request_list);
> > > 
> > > I can't tell whether this is correct, because intel gfx driver is
> > > unknown to me, but from the first glance your description sounds
> > > reasonable.
> > > 
> > > I'm out of office till ~ next week's tuesday, and on return I'll try
> > > to test it on the hardware in question.
> > 
> > Keith, thanks again for the patch. As promised I've tested it on the
> > hardware in question and yes, bad_access is gone and X seems to work,
> > so thank you, but...
> > 
> > 
> > I see there are more such bugs in introduced-in-guilty-patch
> > intel_render_ring_init_dri(). For example ring->irq_queue is
> > left uninitialized and also ring->irq_lock etc...
> > 
> > 
> > I'm X newbie, so if here is something stupid X-wise, please don't
> > beat me too hard, but to me the gist of the problem is the original
> > patch, where Chris does
> > 
> > ( git show e8616b6ced6137085e6657cc63bc2fe3900b8616 )
> > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > index 03e3370..51fbc5e 100644
> > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > @@ -1291,6 +1291,48 @@ int intel_init_render_ring_buffer(struct drm_device *dev)
> > >         return intel_init_ring_buffer(dev, ring);
> > >  }
> > >  
> > > +int intel_render_ring_init_dri(struct drm_device *dev, u64 start, u32 size)
> > > +{
> > > +       drm_i915_private_t *dev_priv = dev->dev_private;
> > > +       struct intel_ring_buffer *ring = &dev_priv->ring[RCS];
> > > +
> > > +       *ring = render_ring;
> >           ^^^^^^^^^^^^^^^^^^^
> >           here resets
> > 
> > > +       if (INTEL_INFO(dev)->gen >= 6) {
> > > +               ring->add_request = gen6_add_request;
> > > +               ring->irq_get = gen6_render_ring_get_irq;
> > > +               ring->irq_put = gen6_render_ring_put_irq;
> > > +       } else if (IS_GEN5(dev)) {
> > > +               ring->add_request = pc_render_add_request;
> > > +               ring->get_seqno = pc_render_get_seqno;
> > > +       }
> > 
> > and then the rest of the `ring` is initialized seemingly copy-pasted
> > from intel_init_ring_buffer():
> > 
> > > +       ring->dev = dev;
> > > +       INIT_LIST_HEAD(&ring->active_list);
> > > +       INIT_LIST_HEAD(&ring->request_list);
> > > +       INIT_LIST_HEAD(&ring->gpu_write_list);
> > > +
> > > +       ring->size = size;
> > > +       ring->effective_size = ring->size;
> > > +       if (IS_I830(ring->dev))
> > > +               ring->effective_size -= 128;
> > > +
> > > +       ring->map.offset = start;
> > > +       ring->map.size = size;
> > > +       ring->map.type = 0;
> > > +       ring->map.flags = 0;
> > > +       ring->map.mtrr = 0;
> > 
> > ...
> > 
> > where both 3 chunks go almost exactly from intel_init_ring_buffer(), and
> > ring->effective_size tweak even stripped original comment:
> > 
> > # original version from intel_init_ring_buffer():
> >         /* Workaround an erratum on the i830 which causes a hang if
> >          * the TAIL pointer points to within the last 2 cachelines
> >          * of the buffer.
> >          */
> >         ring->effective_size = ring->size;
> >         if (IS_I830(ring->dev))
> >                 ring->effective_size -= 128;
> > 
> > ...
> > 
> > 
> > The line marked "here resets" resets all the fields, and maybe it's not a
> > good idea to re-initialize them all afterwards (missing some as this
> > thread show), or at least if it is really needed, share initialization
> > code between intel_render_ring_init_dri() and intel_init_ring_buffer() ?
> > 
> > From the outside it looks like the offending patch was done as a quick
> > fix in a hurry (lots of copy-paste), and maybe it would be better to
> > re-do it properly...
> 
> Silence... ?
> 
> I read UMS is still ignored, because e.g. that uninitialized
> ring->irq_lock which I've wrote about above is for sure used e.g. in
> gen6_render_ring_get_irq() added to ring vtable in
> intel_render_ring_init_dri().

I really doubt that UMS supports gen6 hardware.

Regards
Vasily
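
For reference, the 0x84 address in Keith's quoted analysis is just the breadcrumb index scaled from 32-bit units to bytes, with a zero page base. A minimal C sketch of that arithmetic follows. The index value 0x21 comes from the thread; BREADCRUMB_INDEX and hws_slot_addr() are hypothetical names used only for illustration, not driver symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Index quoted in the thread, counted in 32-bit units within the page. */
#define BREADCRUMB_INDEX 0x21u

/* Byte address of a 32-bit slot in a status page (hypothetical helper). */
uintptr_t hws_slot_addr(uintptr_t page_base, unsigned int index)
{
	return page_base + (uintptr_t)index * sizeof(uint32_t);
}

/* Self-check: a zeroed page address yields the failing address 0x84. */
void hws_slot_addr_demo(void)
{
	assert(hws_slot_addr(0, BREADCRUMB_INDEX) == 0x84);
	assert(hws_slot_addr(0x1000, BREADCRUMB_INDEX) == 0x1084);
}
```

A base of zero is what a wiped status_page.page_addr would give, which is consistent with the bad access described in the quoted text.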

Kirill Smelkov Aug. 9, 2011, 2:47 p.m. UTC | #10

On Tue, Aug 09, 2011 at 05:00:52PM +0300, Vasily Khoruzhick wrote:
> On Tuesday 09 August 2011 15:08:03 Kirill Smelkov wrote:
> > On Tue, Jul 26, 2011 at 05:48:27PM +0400, Kirill Smelkov wrote:
> > > On Sat, Jul 23, 2011 at 12:23:36AM +0400, Kirill Smelkov wrote:
> > > > Keith,
> > > > 
> > > > first of all thanks for your prompt reply. Then...
> > > > 
> > > > On Fri, Jul 22, 2011 at 11:00:41AM -0700, Keith Packard wrote:
> > > > > On Fri, 22 Jul 2011 15:08:06 +0400, Kirill Smelkov <kirr@mns.spb.ru> wrote:
> > > > > > And now after v3.0 is out, I've tested it again, and yes, like it
> > > > > > was broken on v3.0-rc5, it is (now even more) broken on v3.0 --
> > > > > > after first bad io access the system freezes completely:
> > > > > 
> > > > > I looked at this when I first saw it (a couple of weeks ago), and I
> > > > > couldn't see any obvious reason this patch would cause this
> > > > > particular problem. I didn't want to revert the patch at that point
> > > > > as I feared it would cause other subtle problems. Given that you've
> > > > > got a work-around, it seemed best to just push this off past 3.0.
> > > > 
> > > > What kind of a workaround are you talking about? Sorry, to me it all
> > > > looked like "UMS is being ignored forever". Anyway, let's move on to
> > > > try to solve the issue.
> > > > 
> > > > > Given the failing address passed to ioread32, this seems like it's
> > > > > probably the call to READ_BREADCRUMB -- I915_BREADCRUMB_INDEX is
> > > > > 0x21, which is an offset in 32-bit units within the hardware status
> > > > > page. If the status_page.page_addr value was zero, then the computed
> > > > > address would end up being 0x84.
> > > > > 
> > > > > And, it looks like status_page.page_addr *will* end up being zero as
> > > > > a result of the patch in question. The patch resets the entire ring
> > > > > structure contents back to the initial values, which includes
> > > > > smashing the status_page structure to zero, clearing the value of
> > > > > status_page.page_addr set in i915_init_phys_hws.
> > > > > 
> > > > > Here's an untested patch which moves the initialization of
> > > > > status_page.page_addr into intel_render_ring_init_dri. I note that
> > > > > intel_init_render_ring_buffer *already* has the setting of the
> > > > > status_page.page_addr value, and so I've removed the setting of
> > > > > status_page.page_addr from i915_init_phys_hws.
> > > > > 
> > > > > I suspect we could remove the memset from
> > > > > intel_init_render_ring_buffer; it seems entirely superfluous given
> > > > > the memset in i915_init_phys_hws.
> > > > > 
> > > > > From 159ba1dd207fc52590ce8a3afd83f40bd2cedf46 Mon Sep 17 00:00:00 2001
> > > > > From: Keith Packard <keithp@keithp.com>
> > > > > Date: Fri, 22 Jul 2011 10:44:39 -0700
> > > > > Subject: [PATCH] drm/i915: Initialize RCS ring status page address in
> > > > >  intel_render_ring_init_dri
> > > > > 
> > > > > Physically-addressed hardware status pages are initialized early in
> > > > > the driver load process by i915_init_phys_hws. For UMS environments,
> > > > > the ring structure is not initialized until the X server starts. At
> > > > > that point, the entire ring structure is re-initialized with all new
> > > > > values. Any values set in the ring structure (including
> > > > > ring->status_page.page_addr) will be lost when the ring is
> > > > > re-initialized.
> > > > > 
> > > > > This patch moves the initialization of the status_page.page_addr
> > > > > value to intel_render_ring_init_dri.
> > > > > 
> > > > > Signed-off-by: Keith Packard <keithp@keithp.com>
> > > > > ---
> > > > >  drivers/gpu/drm/i915/i915_dma.c         |    6 ++----
> > > > >  drivers/gpu/drm/i915/intel_ringbuffer.c |    3 +++
> > > > >  2 files changed, 5 insertions(+), 4 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> > > > > index 1271282..8a3942c 100644
> > > > > --- a/drivers/gpu/drm/i915/i915_dma.c
> > > > > +++ b/drivers/gpu/drm/i915/i915_dma.c
> > > > > @@ -61,7 +61,6 @@ static void i915_write_hws_pga(struct drm_device *dev)
> > > > >  static int i915_init_phys_hws(struct drm_device *dev)
> > > > >  {
> > > > >  	drm_i915_private_t *dev_priv = dev->dev_private;
> > > > > -	struct intel_ring_buffer *ring = LP_RING(dev_priv);
> > > > >  
> > > > >  	/* Program Hardware Status Page */
> > > > >  	dev_priv->status_page_dmah =
> > > > > @@ -71,10 +70,9 @@ static int i915_init_phys_hws(struct drm_device *dev)
> > > > >  		DRM_ERROR("Can not allocate hardware status page\n");
> > > > >  		return -ENOMEM;
> > > > >  	}
> > > > > -	ring->status_page.page_addr =
> > > > > -		(void __force __iomem *)dev_priv->status_page_dmah->vaddr;
> > > > >  
> > > > > -	memset_io(ring->status_page.page_addr, 0, PAGE_SIZE);
> > > > > +	memset_io((void __force __iomem *)dev_priv->status_page_dmah->vaddr,
> > > > > +		  0, PAGE_SIZE);
> > > > >  
> > > > >  	i915_write_hws_pga(dev);
> > > > >  
> > > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > index e961568..47b9b27 100644
> > > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > @@ -1321,6 +1321,9 @@ int intel_render_ring_init_dri(struct drm_device *dev, u64 start, u32 size)
> > > > >  		ring->get_seqno = pc_render_get_seqno;
> > > > >  	}
> > > > >  
> > > > > +	if (!I915_NEED_GFX_HWS(dev))
> > > > > +		ring->status_page.page_addr = dev_priv->status_page_dmah->vaddr;
> > > > > +
> > > > >  	ring->dev = dev;
> > > > >  	INIT_LIST_HEAD(&ring->active_list);
> > > > >  	INIT_LIST_HEAD(&ring->request_list);
> > > > 
> > > > I can't tell whether this is correct, because intel gfx driver is
> > > > unknown to me, but from the first glance your description sounds
> > > > reasonable.
> > > > 
> > > > I'm out of office till ~ next week's tuesday, and on return I'll try
> > > > to test it on the hardware in question.
> > > 
> > > Keith, thanks again for the patch. As promised I've tested it on the
> > > hardware in question and yes, bad_access is gone and X seems to work,
> > > so thank you, but...
> > > 
> > > 
> > > I see there are more such bugs in introduced-in-guilty-patch
> > > intel_render_ring_init_dri(). For example ring->irq_queue is
> > > left uninitialized and also ring->irq_lock etc...
> > > 
> > > 
> > > I'm X newbie, so if here is something stupid X-wise, please don't
> > > beat me too hard, but to me the gist of the problem is the original
> > > patch, where Chris does
> > > 
> > > ( git show e8616b6ced6137085e6657cc63bc2fe3900b8616 )
> > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > index 03e3370..51fbc5e 100644
> > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > @@ -1291,6 +1291,48 @@ int intel_init_render_ring_buffer(struct drm_device *dev)
> > > >         return intel_init_ring_buffer(dev, ring);
> > > >  }
> > > >  
> > > > +int intel_render_ring_init_dri(struct drm_device *dev, u64 start, u32 size)
> > > > +{
> > > > +       drm_i915_private_t *dev_priv = dev->dev_private;
> > > > +       struct intel_ring_buffer *ring = &dev_priv->ring[RCS];
> > > > +
> > > > +       *ring = render_ring;
> > >           ^^^^^^^^^^^^^^^^^^^
> > >           here resets
> > > 
> > > > +       if (INTEL_INFO(dev)->gen >= 6) {
> > > > +               ring->add_request = gen6_add_request;
> > > > +               ring->irq_get = gen6_render_ring_get_irq;
> > > > +               ring->irq_put = gen6_render_ring_put_irq;
> > > > +       } else if (IS_GEN5(dev)) {
> > > > +               ring->add_request = pc_render_add_request;
> > > > +               ring->get_seqno = pc_render_get_seqno;
> > > > +       }
> > > 
> > > and then the rest of the `ring` is initialized seemingly copy-pasted
> > > from intel_init_ring_buffer():
> > > 
> > > > +       ring->dev = dev;
> > > > +       INIT_LIST_HEAD(&ring->active_list);
> > > > +       INIT_LIST_HEAD(&ring->request_list);
> > > > +       INIT_LIST_HEAD(&ring->gpu_write_list);
> > > > +
> > > > +       ring->size = size;
> > > > +       ring->effective_size = ring->size;
> > > > +       if (IS_I830(ring->dev))
> > > > +               ring->effective_size -= 128;
> > > > +
> > > > +       ring->map.offset = start;
> > > > +       ring->map.size = size;
> > > > +       ring->map.type = 0;
> > > > +       ring->map.flags = 0;
> > > > +       ring->map.mtrr = 0;
> > > 
> > > ...
> > > 
> > > where both 3 chunks go almost exactly from intel_init_ring_buffer(), and
> > > ring->effective_size tweak even stripped original comment:
> > > 
> > > # original version from intel_init_ring_buffer():
> > >         /* Workaround an erratum on the i830 which causes a hang if
> > >          * the TAIL pointer points to within the last 2 cachelines
> > >          * of the buffer.
> > >          */
> > >         ring->effective_size = ring->size;
> > >         if (IS_I830(ring->dev))
> > >                 ring->effective_size -= 128;
> > > 
> > > ...
> > > 
> > > 
> > > The line marked "here resets" resets all the fields, and maybe it's not a
> > > good idea to re-initialize them all afterwards (missing some as this
> > > thread show), or at least if it is really needed, share initialization
> > > code between intel_render_ring_init_dri() and intel_init_ring_buffer() ?
> > > 
> > > From the outside it looks like the offending patch was done as a quick
> > > fix in a hurry (lots of copy-paste), and maybe it would be better to
> > > re-do it properly...
> > 
> > Silence... ?
> > 
> > I read UMS is still ignored, because e.g. that uninitialized
> > ring->irq_lock which I've wrote about above is for sure used e.g. in
> > gen6_render_ring_get_irq() added to ring vtable in
> > intel_render_ring_init_dri().
> 
> I really doubt that UMS supports gen6 hardware.

Then why is it there in intel_render_ring_init_dri()?

    int intel_render_ring_init_dri(struct drm_device *dev, u64 start, u32 size)
    {
    	drm_i915_private_t *dev_priv = dev->dev_private;
    	struct intel_ring_buffer *ring = &dev_priv->ring[RCS];
    
    	*ring = render_ring;
    	if (INTEL_INFO(dev)->gen >= 6) {
    		ring->add_request = gen6_add_request;
    		ring->irq_get = gen6_render_ring_get_irq;
    		ring->irq_put = gen6_render_ring_put_irq;
    	} else if (IS_GEN5(dev)) {
    		ring->add_request = pc_render_add_request;
    		ring->get_seqno = pc_render_get_seqno;
    	}

It was added by the same guilty commit, e8616b6c, that I'm talking about.

Vasily Khoruzhick Aug. 9, 2011, 3:09 p.m. UTC | #11

On Tuesday 09 August 2011 17:47:56 Kirill Smelkov wrote:
> On Tue, Aug 09, 2011 at 05:00:52PM +0300, Vasily Khoruzhick wrote:
> > On Tuesday 09 August 2011 15:08:03 Kirill Smelkov wrote:
> > > On Tue, Jul 26, 2011 at 05:48:27PM +0400, Kirill Smelkov wrote:
> > > > On Sat, Jul 23, 2011 at 12:23:36AM +0400, Kirill Smelkov wrote:
> > > > > Keith,
> > > > > 
> > > > > first of all thanks for your prompt reply. Then...
> > > > > 
> > > > > On Fri, Jul 22, 2011 at 11:00:41AM -0700, Keith Packard wrote:
> > > > > > On Fri, 22 Jul 2011 15:08:06 +0400, Kirill Smelkov <kirr@mns.spb.ru> wrote:
> > > > > > > And now after v3.0 is out, I've tested it again, and yes, like
> > > > > > > it was broken on v3.0-rc5, it is (now even more) broken on
> > > > > > > v3.0 -- after first bad io access the system freezes completely:
> > > > > > 
> > > > > > I looked at this when I first saw it (a couple of weeks ago), and
> > > > > > I couldn't see any obvious reason this patch would cause this
> > > > > > particular problem. I didn't want to revert the patch at that
> > > > > > point as I feared it would cause other subtle problems. Given
> > > > > > that you've got a work-around, it seemed best to just push this
> > > > > > off past 3.0.
> > > > > 
> > > > > What kind of a workaround are you talking about? Sorry, to me it
> > > > > all looked like "UMS is being ignored forever". Anyway, let's move
> > > > > on to try to solve the issue.
> > > > > 
> > > > > > Given the failing address passed to ioread32, this seems like
> > > > > > it's probably the call to READ_BREADCRUMB --
> > > > > > I915_BREADCRUMB_INDEX is 0x21, which is an offset in 32-bit
> > > > > > units within the hardware status page. If the
> > > > > > status_page.page_addr value was zero, then the computed address
> > > > > > would end up being 0x84.
> > > > > > 
> > > > > > And, it looks like status_page.page_addr *will* end up being zero
> > > > > > as a result of the patch in question. The patch resets the
> > > > > > entire ring structure contents back to the initial values, which
> > > > > > includes smashing the status_page structure to zero, clearing
> > > > > > the value of status_page.page_addr set in i915_init_phys_hws.
> > > > > > 
> > > > > > Here's an untested patch which moves the initialization of
> > > > > > status_page.page_addr into intel_render_ring_init_dri. I note
> > > > > > that intel_init_render_ring_buffer *already* has the setting of
> > > > > > the status_page.page_addr value, and so I've removed the setting
> > > > > > of status_page.page_addr from i915_init_phys_hws.
> > > > > > 
> > > > > > I suspect we could remove the memset from
> > > > > > intel_init_render_ring_buffer; it seems entirely superfluous
> > > > > > given the memset in i915_init_phys_hws.
> > > > > > 
> > > > > > From 159ba1dd207fc52590ce8a3afd83f40bd2cedf46 Mon Sep 17 00:00:00 2001
> > > > > > From: Keith Packard <keithp@keithp.com>
> > > > > > Date: Fri, 22 Jul 2011 10:44:39 -0700
> > > > > > Subject: [PATCH] drm/i915: Initialize RCS ring status page address in
> > > > > >  intel_render_ring_init_dri
> > > > > > 
> > > > > > Physically-addressed hardware status pages are initialized early
> > > > > > in the driver load process by i915_init_phys_hws. For UMS
> > > > > > environments, the ring structure is not initialized until the X
> > > > > > server starts. At that point, the entire ring structure is
> > > > > > re-initialized with all new values. Any values set in the ring
> > > > > > structure (including
> > > > > > ring->status_page.page_addr) will be lost when the ring is
> > > > > > re-initialized.
> > > > > > 
> > > > > > This patch moves the initialization of the status_page.page_addr
> > > > > > value to intel_render_ring_init_dri.
> > > > > > 
> > > > > > Signed-off-by: Keith Packard <keithp@keithp.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/i915/i915_dma.c         |    6 ++----
> > > > > >  drivers/gpu/drm/i915/intel_ringbuffer.c |    3 +++
> > > > > >  2 files changed, 5 insertions(+), 4 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> > > > > > index 1271282..8a3942c 100644
> > > > > > --- a/drivers/gpu/drm/i915/i915_dma.c
> > > > > > +++ b/drivers/gpu/drm/i915/i915_dma.c
> > > > > > @@ -61,7 +61,6 @@ static void i915_write_hws_pga(struct drm_device *dev)
> > > > > >  static int i915_init_phys_hws(struct drm_device *dev)
> > > > > >  {
> > > > > >  	drm_i915_private_t *dev_priv = dev->dev_private;
> > > > > > -	struct intel_ring_buffer *ring = LP_RING(dev_priv);
> > > > > >  
> > > > > >  	/* Program Hardware Status Page */
> > > > > >  	dev_priv->status_page_dmah =
> > > > > > @@ -71,10 +70,9 @@ static int i915_init_phys_hws(struct drm_device *dev)
> > > > > >  		DRM_ERROR("Can not allocate hardware status page\n");
> > > > > >  		return -ENOMEM;
> > > > > >  	}
> > > > > > -	ring->status_page.page_addr =
> > > > > > -		(void __force __iomem *)dev_priv->status_page_dmah->vaddr;
> > > > > >  
> > > > > > -	memset_io(ring->status_page.page_addr, 0, PAGE_SIZE);
> > > > > > +	memset_io((void __force __iomem *)dev_priv->status_page_dmah->vaddr,
> > > > > > +		  0, PAGE_SIZE);
> > > > > >  
> > > > > >  	i915_write_hws_pga(dev);
> > > > > >  
> > > > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > index e961568..47b9b27 100644
> > > > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > @@ -1321,6 +1321,9 @@ int intel_render_ring_init_dri(struct drm_device *dev, u64 start, u32 size)
> > > > > >  		ring->get_seqno = pc_render_get_seqno;
> > > > > >  	}
> > > > > >  
> > > > > > +	if (!I915_NEED_GFX_HWS(dev))
> > > > > > +		ring->status_page.page_addr = dev_priv->status_page_dmah->vaddr;
> > > > > > +
> > > > > >  	ring->dev = dev;
> > > > > >  	INIT_LIST_HEAD(&ring->active_list);
> > > > > >  	INIT_LIST_HEAD(&ring->request_list);
> > > > > 
> > > > > I can't tell whether this is correct, because intel gfx driver is
> > > > > unknown to me, but from the first glance your description sounds
> > > > > reasonable.
> > > > > 
> > > > > I'm out of office till ~ next week's tuesday, and on return I'll
> > > > > try to test it on the hardware in question.
> > > > 
> > > > Keith, thanks again for the patch. As promised I've tested it on the
> > > > hardware in question and yes, bad_access is gone and X seems to work,
> > > > so thank you, but...
> > > > 
> > > > 
> > > > I see there are more such bugs in introduced-in-guilty-patch
> > > > intel_render_ring_init_dri(). For example ring->irq_queue is
> > > > left uninitialized and also ring->irq_lock etc...
> > > > 
> > > > 
> > > > I'm X newbie, so if here is something stupid X-wise, please don't
> > > > beat me too hard, but to me the gist of the problem is the original
> > > > patch, where Chris does
> > > > 
> > > > ( git show e8616b6ced6137085e6657cc63bc2fe3900b8616 )
> > > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > index 03e3370..51fbc5e 100644
> > > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > @@ -1291,6 +1291,48 @@ int intel_init_render_ring_buffer(struct drm_device *dev)
> > > > >         return intel_init_ring_buffer(dev, ring);
> > > > >  }
> > > > >  
> > > > > +int intel_render_ring_init_dri(struct drm_device *dev, u64 start, u32 size)
> > > > > +{
> > > > > +       drm_i915_private_t *dev_priv = dev->dev_private;
> > > > > +       struct intel_ring_buffer *ring = &dev_priv->ring[RCS];
> > > > > +
> > > > > +       *ring = render_ring;
> > > >           ^^^^^^^^^^^^^^^^^^^
> > > >           here resets
> > > > 
> > > > > +       if (INTEL_INFO(dev)->gen >= 6) {
> > > > > +               ring->add_request = gen6_add_request;
> > > > > +               ring->irq_get = gen6_render_ring_get_irq;
> > > > > +               ring->irq_put = gen6_render_ring_put_irq;
> > > > > +       } else if (IS_GEN5(dev)) {
> > > > > +               ring->add_request = pc_render_add_request;
> > > > > +               ring->get_seqno = pc_render_get_seqno;
> > > > > +       }
> > > > 
> > > > and then the rest of the `ring` is initialized seemingly copy-pasted
> > > > from intel_init_ring_buffer():
> > > > 
> > > > > +       ring->dev = dev;
> > > > > +       INIT_LIST_HEAD(&ring->active_list);
> > > > > +       INIT_LIST_HEAD(&ring->request_list);
> > > > > +       INIT_LIST_HEAD(&ring->gpu_write_list);
> > > > > +
> > > > > +       ring->size = size;
> > > > > +       ring->effective_size = ring->size;
> > > > > +       if (IS_I830(ring->dev))
> > > > > +               ring->effective_size -= 128;
> > > > > +
> > > > > +       ring->map.offset = start;
> > > > > +       ring->map.size = size;
> > > > > +       ring->map.type = 0;
> > > > > +       ring->map.flags = 0;
> > > > > +       ring->map.mtrr = 0;
> > > > 
> > > > ...
> > > > 
> > > > where both 3 chunks go almost exactly from intel_init_ring_buffer(),
> > > > and ring->effective_size tweak even stripped original comment:
> > > > 
> > > > # original version from intel_init_ring_buffer():
> > > > >         /* Workaround an erratum on the i830 which causes a hang if
> > > > >          * the TAIL pointer points to within the last 2 cachelines
> > > > >          * of the buffer.
> > > > >          */
> > > > >         ring->effective_size = ring->size;
> > > > >         if (IS_I830(ring->dev))
> > > > >                 ring->effective_size -= 128;
> > > > 
> > > > ...
> > > > 
> > > > 
> > > > The line marked "here resets" resets all the fields, and maybe it's
> > > > not a good idea to re-initialize them all afterwards (missing some
> > > > as this thread show), or at least if it is really needed, share
> > > > initialization code between intel_render_ring_init_dri() and
> > > > intel_init_ring_buffer() ?
> > > > 
> > > > > From the outside it looks like the offending patch was done as a
> > > > > quick fix in a hurry (lots of copy-paste), and maybe it would be
> > > > > better to re-do it properly...
> > > 
> > > Silence... ?
> > > 
> > > I read UMS is still ignored, because e.g. that uninitialized
> > > ring->irq_lock which I've wrote about above is for sure used e.g. in
> > > gen6_render_ring_get_irq() added to ring vtable in
> > > intel_render_ring_init_dri().
> > 
> > I really doubt that UMS supports gen6 hardware.
> 
> Then why it is there in intel_render_ring_init_dri():
> 
>     int intel_render_ring_init_dri(struct drm_device *dev, u64 start, u32 size) {
>     	drm_i915_private_t *dev_priv = dev->dev_private;
>     	struct intel_ring_buffer *ring = &dev_priv->ring[RCS];
> 
>     	*ring = render_ring;
>     	if (INTEL_INFO(dev)->gen >= 6) {

This branch executes only when hw generation is 6 or newer.

>     		ring->add_request = gen6_add_request;
>     		ring->irq_get = gen6_render_ring_get_irq;
>     		ring->irq_put = gen6_render_ring_put_irq;
>     	} else if (IS_GEN5(dev)) {
>     		ring->add_request = pc_render_add_request;
>     		ring->get_seqno = pc_render_get_seqno;
>     	}

Kirill Smelkov Aug. 9, 2011, 3:34 p.m. UTC | #12

On Tue, Aug 09, 2011 at 06:09:57PM +0300, Vasily Khoruzhick wrote:
> On Tuesday 09 August 2011 17:47:56 Kirill Smelkov wrote:
> > On Tue, Aug 09, 2011 at 05:00:52PM +0300, Vasily Khoruzhick wrote:
> > > On Tuesday 09 August 2011 15:08:03 Kirill Smelkov wrote:
> > > > On Tue, Jul 26, 2011 at 05:48:27PM +0400, Kirill Smelkov wrote:
> > > > > On Sat, Jul 23, 2011 at 12:23:36AM +0400, Kirill Smelkov wrote:
> > > > > > Keith,
> > > > > > 
> > > > > > first of all thanks for your prompt reply. Then...
> > > > > > 
> > > > > > On Fri, Jul 22, 2011 at 11:00:41AM -0700, Keith Packard wrote:
> > > > > > > On Fri, 22 Jul 2011 15:08:06 +0400, Kirill Smelkov
> > > > > > > <kirr@mns.spb.ru>
> > > 
> > > wrote:
> > > > > > > > And now after v3.0 is out, I've tested it again, and yes, like
> > > > > > > > it was broken on v3.0-rc5, it is (now even more) broken on
> > > > > > > > v3.0 -- after first
> > > > > > > 
> > > > > > > > bad io access the system freezes completely:
> > > > > > > I looked at this when I first saw it (a couple of weeks ago), and
> > > > > > > I couldn't see any obvious reason this patch would cause this
> > > > > > > particular problem. I didn't want to revert the patch at that
> > > > > > > point as I feared it would cause other subtle problems. Given
> > > > > > > that you've got a work-around, it seemed best to just push this
> > > > > > > off past 3.0.
> > > > > > 
> > > > > > What kind of a workaround are you talking about? Sorry, to me it
> > > > > > all looked like "UMS is being ignored forever". Anyway, let's move
> > > > > > on to try to solve the issue.
> > > > > > 
> > > > > > > Given the failing address passed to ioread32, this seems like
> > > > > > > it's probably the call to READ_BREADCRUMB --
> > > > > > > I915_BREADCRUMB_INDEX is 0x21, which is an offset in 32-bit
> > > > > > > units within the hardware status page. If the
> > > > > > > status_page.page_addr value was zero, then the computed address
> > > > > > > would end up being 0x84.
> > > > > > > 
> > > > > > > And, it looks like status_page.page_addr *will* end up being zero
> > > > > > > as a result of the patch in question. The patch resets the
> > > > > > > entire ring structure contents back to the initial values, which
> > > > > > > includes smashing the status_page structure to zero, clearing
> > > > > > > the value of status_page.page_addr set in i915_init_phys_hws.
> > > > > > > 
> > > > > > > Here's an untested patch which moves the initialization of
> > > > > > > status_page.page_addr into intel_render_ring_init_dri. I note
> > > > > > > that intel_init_render_ring_buffer *already* has the setting of
> > > > > > > the status_page.page_addr value, and so I've removed the setting
> > > > > > > of status_page.page_addr from i915_init_phys_hws.
> > > > > > > 
> > > > > > > I suspect we could remove the memset from
> > > > > > > intel_init_render_ring_buffer; it seems entirely superfluous
> > > > > > > given the memset in i915_init_phys_hws.
> > > > > > > 
> > > > > > > > From 159ba1dd207fc52590ce8a3afd83f40bd2cedf46 Mon Sep 17 00:00:00 2001
> > > > > > > > From: Keith Packard <keithp@keithp.com>
> > > > > > > > Date: Fri, 22 Jul 2011 10:44:39 -0700
> > > > > > > > Subject: [PATCH] drm/i915: Initialize RCS ring status page address in
> > > > > > > >  intel_render_ring_init_dri
> > > > > > > 
> > > > > > > Physically-addressed hardware status pages are initialized early
> > > > > > > in the driver load process by i915_init_phys_hws. For UMS
> > > > > > > environments, the ring structure is not initialized until the X
> > > > > > > server starts. At that point, the entire ring structure is
> > > > > > > re-initialized with all new values. Any values set in the ring
> > > > > > > structure (including
> > > > > > > ring->status_page.page_addr) will be lost when the ring is
> > > > > > > re-initialized.
> > > > > > > 
> > > > > > > This patch moves the initialization of the status_page.page_addr
> > > > > > > value to intel_render_ring_init_dri.
> > > > > > > 
> > > > > > > Signed-off-by: Keith Packard <keithp@keithp.com>
> > > > > > > ---
> > > > > > > 
> > > > > > >  drivers/gpu/drm/i915/i915_dma.c         |    6 ++----
> > > > > > >  drivers/gpu/drm/i915/intel_ringbuffer.c |    3 +++
> > > > > > >  2 files changed, 5 insertions(+), 4 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/gpu/drm/i915/i915_dma.c
> > > > > > > b/drivers/gpu/drm/i915/i915_dma.c index 1271282..8a3942c 100644
> > > > > > > --- a/drivers/gpu/drm/i915/i915_dma.c
> > > > > > > +++ b/drivers/gpu/drm/i915/i915_dma.c
> > > > > > > @@ -61,7 +61,6 @@ static void i915_write_hws_pga(struct
> > > > > > > drm_device *dev)
> > > > > > > 
> > > > > > >  static int i915_init_phys_hws(struct drm_device *dev)
> > > > > > >  {
> > > > > > >  
> > > > > > >  	drm_i915_private_t *dev_priv = dev->dev_private;
> > > > > > > 
> > > > > > > -	struct intel_ring_buffer *ring = LP_RING(dev_priv);
> > > > > > > 
> > > > > > >  	/* Program Hardware Status Page */
> > > > > > >  	dev_priv->status_page_dmah =
> > > > > > > 
> > > > > > > @@ -71,10 +70,9 @@ static int i915_init_phys_hws(struct
> > > > > > > drm_device *dev)
> > > > > > > 
> > > > > > >  		DRM_ERROR("Can not allocate hardware status page\n");
> > > > > > >  		return -ENOMEM;
> > > > > > >  	
> > > > > > >  	}
> > > > > > > 
> > > > > > > -	ring->status_page.page_addr =
> > > > > > > -		(void __force __iomem *)dev_priv->status_page_dmah->vaddr;
> > > > > > > 
> > > > > > > -	memset_io(ring->status_page.page_addr, 0, PAGE_SIZE);
> > > > > > > +	memset_io((void __force __iomem
> > > > > > > *)dev_priv->status_page_dmah->vaddr, +		  0, PAGE_SIZE);
> > > > > > > 
> > > > > > >  	i915_write_hws_pga(dev);
> > > > > > > 
> > > > > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > > b/drivers/gpu/drm/i915/intel_ringbuffer.c index e961568..47b9b27
> > > > > > > 100644
> > > > > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > > @@ -1321,6 +1321,9 @@ int intel_render_ring_init_dri(struct
> > > > > > > drm_device *dev, u64 start, u32 size)
> > > > > > > 
> > > > > > >  		ring->get_seqno = pc_render_get_seqno;
> > > > > > >  	
> > > > > > >  	}
> > > > > > > 
> > > > > > > +	if (!I915_NEED_GFX_HWS(dev))
> > > > > > > +		ring->status_page.page_addr =
> > > > > > > dev_priv->status_page_dmah->vaddr; +
> > > > > > > 
> > > > > > >  	ring->dev = dev;
> > > > > > >  	INIT_LIST_HEAD(&ring->active_list);
> > > > > > >  	INIT_LIST_HEAD(&ring->request_list);
> > > > > > 
> > > > > > I can't tell whether this is correct, because intel gfx driver is
> > > > > > unknown to me, but from the first glance your description sounds
> > > > > > reasonable.
> > > > > > 
> > > > > > I'm out of office till ~ next week's tuesday, and on return I'll
> > > > > > try to test it on the hardware in question.
> > > > > 
> > > > > Keith, thanks again for the patch. As promised I've tested it on the
> > > > > hardware in question and yes, bad_access is gone and X seems to work,
> > > > > so thank you, but...
> > > > > 
> > > > > 
> > > > > I see there are more such bugs in introduced-in-guilty-patch
> > > > > intel_render_ring_init_dri(). For example ring->irq_queue is
> > > > > left uninitialized and also ring->irq_lock etc...
> > > > > 
> > > > > 
> > > > > I'm X newbie, so if here is something stupid X-wise, please don't
> > > > > beat me too hard, but to me the gist of the problem is the original
> > > > > patch, where Chris does
> > > > > 
> > > > > ( git show e8616b6ced6137085e6657cc63bc2fe3900b8616 )
> > > > > 
> > > > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > b/drivers/gpu/drm/i915/intel_ringbuffer.c index 03e3370..51fbc5e
> > > > > > 100644
> > > > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > @@ -1291,6 +1291,48 @@ int intel_init_render_ring_buffer(struct
> > > > > > drm_device *dev)
> > > > > > 
> > > > > >         return intel_init_ring_buffer(dev, ring);
> > > > > >  
> > > > > >  }
> > > > > > 
> > > > > > > +int intel_render_ring_init_dri(struct drm_device *dev, u64 start, u32 size)
> > > > > > > +{
> > > > > > +       drm_i915_private_t *dev_priv = dev->dev_private;
> > > > > > +       struct intel_ring_buffer *ring = &dev_priv->ring[RCS];
> > > > > > +
> > > > > > +       *ring = render_ring;
> > > > > > 
> > > > >           ^^^^^^^^^^^^^^^^^^^
> > > > >           here resets
> > > > > > 
> > > > > > +       if (INTEL_INFO(dev)->gen >= 6) {
> > > > > > +               ring->add_request = gen6_add_request;
> > > > > > +               ring->irq_get = gen6_render_ring_get_irq;
> > > > > > +               ring->irq_put = gen6_render_ring_put_irq;
> > > > > > +       } else if (IS_GEN5(dev)) {
> > > > > > +               ring->add_request = pc_render_add_request;
> > > > > > +               ring->get_seqno = pc_render_get_seqno;
> > > > > > +       }
> > > > > 
> > > > > and then the rest of the `ring` is initialized seemingly copy-pasted
> > > > > 
> > > > > from intel_init_ring_buffer():
> > > > > > +       ring->dev = dev;
> > > > > > +       INIT_LIST_HEAD(&ring->active_list);
> > > > > > +       INIT_LIST_HEAD(&ring->request_list);
> > > > > > +       INIT_LIST_HEAD(&ring->gpu_write_list);
> > > > > > +
> > > > > > +       ring->size = size;
> > > > > > +       ring->effective_size = ring->size;
> > > > > > +       if (IS_I830(ring->dev))
> > > > > > +               ring->effective_size -= 128;
> > > > > > +
> > > > > > +       ring->map.offset = start;
> > > > > > +       ring->map.size = size;
> > > > > > +       ring->map.type = 0;
> > > > > > +       ring->map.flags = 0;
> > > > > > +       ring->map.mtrr = 0;
> > > > > 
> > > > > ...
> > > > > 
> > > > > where both 3 chunks go almost exactly from intel_init_ring_buffer(),
> > > > > and ring->effective_size tweak even stripped original comment:
> > > > > 
> > > > > # original version from intel_init_ring_buffer():
> > > > > >         /* Workaround an erratum on the i830 which causes a hang if
> > > > > >          * the TAIL pointer points to within the last 2 cachelines
> > > > > >          * of the buffer.
> > > > > >          */
> > > > > >         ring->effective_size = ring->size;
> > > > > >         if (IS_I830(ring->dev))
> > > > > >                 ring->effective_size -= 128;
> > > > > 
> > > > > ...
> > > > > 
> > > > > 
> > > > > The line marked "here resets" resets all the fields, and maybe it's
> > > > > not a good idea to re-initialize them all afterwards (missing some
> > > > > as this thread show), or at least if it is really needed, share
> > > > > initialization code between intel_render_ring_init_dri() and
> > > > > intel_init_ring_buffer() ?
> > > > > 
> > > > > > From the outside it looks like the offending patch was done as a
> > > > > > quick fix in a hurry (lots of copy-paste), and maybe it would be
> > > > > > better to re-do it properly...
> > > > 
> > > > Silence... ?
> > > > 
> > > > I read UMS is still ignored, because e.g. that uninitialized
> > > > ring->irq_lock which I've wrote about above is for sure used e.g. in
> > > > gen6_render_ring_get_irq() added to ring vtable in
> > > > intel_render_ring_init_dri().
> > > 
> > > I really doubt that UMS supports gen6 hardware.
> > 
> > Then why it is there in intel_render_ring_init_dri():
> > 
> >     int intel_render_ring_init_dri(struct drm_device *dev, u64 start, u32 size) {
> >     	drm_i915_private_t *dev_priv = dev->dev_private;
> >     	struct intel_ring_buffer *ring = &dev_priv->ring[RCS];
> > 
> >     	*ring = render_ring;
> >     	if (INTEL_INFO(dev)->gen >= 6) {
> 
> This branch executes only when hw generation is 6 or newer.

and adds gen6_render_ring_get_irq() to the vtable, which uses
ring->irq_lock, which is left uninitialized.

I don't understand what you are trying to say. Why does it matter that
a branch executes only on such-and-such hardware when that branch
contains a bug? Could you please clarify?
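
To make the concern concrete, here is a minimal userspace sketch of the
pattern, not the driver code. The struct layout, the irq_lock_ready flag
(standing in for spin_lock_init(&ring->irq_lock)) and all function names
are simplified assumptions made up for illustration; only the
whole-struct assignment and the 0x21 breadcrumb index come from this
thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the driver structures; not the real i915 types. */
struct status_page {
	void *page_addr;
};

struct ring {
	const char *name;
	struct status_page status_page;
	int irq_lock_ready;	/* hypothetical flag: 1 once the lock is set up */
};

/* Static template, playing the role render_ring plays in the driver. */
static const struct ring ring_template = {
	.name = "render ring",
};

/* Early setup: record the status page address before the ring is set up. */
static void early_init(struct ring *ring, void *hws)
{
	assert(hws != NULL);
	ring->status_page.page_addr = hws;
}

/* Late init in the style under discussion: the whole-struct assignment
 * discards every field that was set before it, and nothing is restored. */
static void late_init_buggy(struct ring *ring)
{
	*ring = ring_template;
}

/* Late init that puts back what the reset wiped out. */
static void late_init_fixed(struct ring *ring, void *hws)
{
	*ring = ring_template;
	ring->status_page.page_addr = hws;
	ring->irq_lock_ready = 1;
}

/* Address of the breadcrumb: index 0x21 counted in 32-bit units. */
static uintptr_t breadcrumb_addr(const struct ring *ring)
{
	return (uintptr_t)ring->status_page.page_addr + 0x21 * sizeof(uint32_t);
}
```

With page_addr reset to NULL, breadcrumb_addr() yields 0x21 * 4 = 0x84,
which matches the failing address Keith described. Any other field the
late init forgets to set up again is left in the same template state.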


> >     		ring->add_request = gen6_add_request;
> >     		ring->irq_get = gen6_render_ring_get_irq;
> >     		ring->irq_put = gen6_render_ring_put_irq;
> >     	} else if (IS_GEN5(dev)) {
> >     		ring->add_request = pc_render_add_request;
> >     		ring->get_seqno = pc_render_get_seqno;
> >     	}

Vasily Khoruzhick Aug. 9, 2011, 4:02 p.m. UTC | #13

On Tuesday 09 August 2011 18:34:46 Kirill Smelkov wrote:
> On Tue, Aug 09, 2011 at 06:09:57PM +0300, Vasily Khoruzhick wrote:
> > On Tuesday 09 August 2011 17:47:56 Kirill Smelkov wrote:
> > > On Tue, Aug 09, 2011 at 05:00:52PM +0300, Vasily Khoruzhick wrote:
> > > > On Tuesday 09 August 2011 15:08:03 Kirill Smelkov wrote:
> > > > > On Tue, Jul 26, 2011 at 05:48:27PM +0400, Kirill Smelkov wrote:
> > > > > > On Sat, Jul 23, 2011 at 12:23:36AM +0400, Kirill Smelkov wrote:
> > > > > > > Keith,
> > > > > > > 
> > > > > > > first of all thanks for your prompt reply. Then...
> > > > > > > 
> > > > > > > On Fri, Jul 22, 2011 at 11:00:41AM -0700, Keith Packard wrote:
> > > > > > > > On Fri, 22 Jul 2011 15:08:06 +0400, Kirill Smelkov
> > > > > > > > <kirr@mns.spb.ru>
> > > > 
> > > > wrote:
> > > > > > > > > And now after v3.0 is out, I've tested it again, and yes,
> > > > > > > > > like it was broken on v3.0-rc5, it is (now even more)
> > > > > > > > > broken on v3.0 -- after first
> > > > > > > > 
> > > > > > > > > bad io access the system freezes completely:
> > > > > > > > I looked at this when I first saw it (a couple of weeks ago),
> > > > > > > > and I couldn't see any obvious reason this patch would cause
> > > > > > > > this particular problem. I didn't want to revert the patch
> > > > > > > > at that point as I feared it would cause other subtle
> > > > > > > > problems. Given that you've got a work-around, it seemed
> > > > > > > > best to just push this off past 3.0.
> > > > > > > 
> > > > > > > What kind of a workaround are you talking about? Sorry, to me
> > > > > > > it all looked like "UMS is being ignored forever". Anyway,
> > > > > > > let's move on to try to solve the issue.
> > > > > > > 
> > > > > > > > Given the failing address passed to ioread32, this seems like
> > > > > > > > it's probably the call to READ_BREADCRUMB --
> > > > > > > > I915_BREADCRUMB_INDEX is 0x21, which is an offset in 32-bit
> > > > > > > > units within the hardware status page. If the
> > > > > > > > status_page.page_addr value was zero, then the computed
> > > > > > > > address would end up being 0x84.
> > > > > > > > 
> > > > > > > > And, it looks like status_page.page_addr *will* end up being
> > > > > > > > zero as a result of the patch in question. The patch resets
> > > > > > > > the entire ring structure contents back to the initial
> > > > > > > > values, which includes smashing the status_page structure to
> > > > > > > > zero, clearing the value of status_page.page_addr set in
> > > > > > > > i915_init_phys_hws.
> > > > > > > > 
> > > > > > > > Here's an untested patch which moves the initialization of
> > > > > > > > status_page.page_addr into intel_render_ring_init_dri. I note
> > > > > > > > that intel_init_render_ring_buffer *already* has the setting
> > > > > > > > of the status_page.page_addr value, and so I've removed the
> > > > > > > > setting of status_page.page_addr from i915_init_phys_hws.
> > > > > > > > 
> > > > > > > > I suspect we could remove the memset from
> > > > > > > > intel_init_render_ring_buffer; it seems entirely superfluous
> > > > > > > > given the memset in i915_init_phys_hws.
> > > > > > > > 
> > > > > > > > > From 159ba1dd207fc52590ce8a3afd83f40bd2cedf46 Mon Sep 17 00:00:00 2001
> > > > > > > > > From: Keith Packard <keithp@keithp.com>
> > > > > > > > > Date: Fri, 22 Jul 2011 10:44:39 -0700
> > > > > > > > > Subject: [PATCH] drm/i915: Initialize RCS ring status page address in
> > > > > > > > >  intel_render_ring_init_dri
> > > > > > > > 
> > > > > > > > Physically-addressed hardware status pages are initialized
> > > > > > > > early in the driver load process by i915_init_phys_hws. For
> > > > > > > > UMS environments, the ring structure is not initialized
> > > > > > > > until the X server starts. At that point, the entire ring
> > > > > > > > structure is re-initialized with all new values. Any values
> > > > > > > > set in the ring structure (including
> > > > > > > > ring->status_page.page_addr) will be lost when the ring is
> > > > > > > > re-initialized.
> > > > > > > > 
> > > > > > > > This patch moves the initialization of the
> > > > > > > > status_page.page_addr value to intel_render_ring_init_dri.
> > > > > > > > 
> > > > > > > > Signed-off-by: Keith Packard <keithp@keithp.com>
> > > > > > > > ---
> > > > > > > > 
> > > > > > > >  drivers/gpu/drm/i915/i915_dma.c         |    6 ++----
> > > > > > > >  drivers/gpu/drm/i915/intel_ringbuffer.c |    3 +++
> > > > > > > >  2 files changed, 5 insertions(+), 4 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/i915/i915_dma.c
> > > > > > > > b/drivers/gpu/drm/i915/i915_dma.c index 1271282..8a3942c
> > > > > > > > 100644 --- a/drivers/gpu/drm/i915/i915_dma.c
> > > > > > > > +++ b/drivers/gpu/drm/i915/i915_dma.c
> > > > > > > > @@ -61,7 +61,6 @@ static void i915_write_hws_pga(struct
> > > > > > > > drm_device *dev)
> > > > > > > > 
> > > > > > > >  static int i915_init_phys_hws(struct drm_device *dev)
> > > > > > > >  {
> > > > > > > >  
> > > > > > > >  	drm_i915_private_t *dev_priv = dev->dev_private;
> > > > > > > > 
> > > > > > > > -	struct intel_ring_buffer *ring = LP_RING(dev_priv);
> > > > > > > > 
> > > > > > > >  	/* Program Hardware Status Page */
> > > > > > > >  	dev_priv->status_page_dmah =
> > > > > > > > 
> > > > > > > > @@ -71,10 +70,9 @@ static int i915_init_phys_hws(struct
> > > > > > > > drm_device *dev)
> > > > > > > > 
> > > > > > > >  		DRM_ERROR("Can not allocate hardware status page\n");
> > > > > > > >  		return -ENOMEM;
> > > > > > > >  	
> > > > > > > >  	}
> > > > > > > > 
> > > > > > > > -	ring->status_page.page_addr =
> > > > > > > > -		(void __force __iomem *)dev_priv->status_page_dmah-
>vaddr;
> > > > > > > > 
> > > > > > > > -	memset_io(ring->status_page.page_addr, 0, PAGE_SIZE);
> > > > > > > > +	memset_io((void __force __iomem
> > > > > > > > *)dev_priv->status_page_dmah->vaddr, +		  0, PAGE_SIZE);
> > > > > > > > 
> > > > > > > >  	i915_write_hws_pga(dev);
> > > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > > > b/drivers/gpu/drm/i915/intel_ringbuffer.c index
> > > > > > > > e961568..47b9b27 100644
> > > > > > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > > > @@ -1321,6 +1321,9 @@ int intel_render_ring_init_dri(struct
> > > > > > > > drm_device *dev, u64 start, u32 size)
> > > > > > > > 
> > > > > > > >  		ring->get_seqno = pc_render_get_seqno;
> > > > > > > >  	
> > > > > > > >  	}
> > > > > > > > 
> > > > > > > > +	if (!I915_NEED_GFX_HWS(dev))
> > > > > > > > +		ring->status_page.page_addr =
> > > > > > > > dev_priv->status_page_dmah->vaddr; +
> > > > > > > > 
> > > > > > > >  	ring->dev = dev;
> > > > > > > >  	INIT_LIST_HEAD(&ring->active_list);
> > > > > > > >  	INIT_LIST_HEAD(&ring->request_list);
> > > > > > > 
> > > > > > > I can't tell whether this is correct, because intel gfx driver
> > > > > > > is unknown to me, but from the first glance your description
> > > > > > > sounds reasonable.
> > > > > > > 
> > > > > > > I'm out of office till ~ next week's tuesday, and on return
> > > > > > > I'll try to test it on the hardware in question.
> > > > > > 
> > > > > > Keith, thanks again for the patch. As promised I've tested it on
> > > > > > the hardware in question and yes, bad_access is gone and X seems
> > > > > > to work, so thank you, but...
> > > > > > 
> > > > > > 
> > > > > > I see there are more such bugs in introduced-in-guilty-patch
> > > > > > intel_render_ring_init_dri(). For example ring->irq_queue is
> > > > > > left uninitialized and also ring->irq_lock etc...
> > > > > > 
> > > > > > 
> > > > > > I'm X newbie, so if here is something stupid X-wise, please don't
> > > > > > beat me too hard, but to me the gist of the problem is the
> > > > > > original patch, where Chris does
> > > > > > 
> > > > > > ( git show e8616b6ced6137085e6657cc63bc2fe3900b8616 )
> > > > > > 
> > > > > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > > b/drivers/gpu/drm/i915/intel_ringbuffer.c index
> > > > > > > 03e3370..51fbc5e 100644
> > > > > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > > @@ -1291,6 +1291,48 @@ int intel_init_render_ring_buffer(struct
> > > > > > > drm_device *dev)
> > > > > > > 
> > > > > > >         return intel_init_ring_buffer(dev, ring);
> > > > > > >  
> > > > > > >  }
> > > > > > > 
> > > > > > > > +int intel_render_ring_init_dri(struct drm_device *dev, u64 start, u32 size)
> > > > > > > > +{
> > > > > > > +       drm_i915_private_t *dev_priv = dev->dev_private;
> > > > > > > +       struct intel_ring_buffer *ring = &dev_priv->ring[RCS];
> > > > > > > +
> > > > > > > +       *ring = render_ring;
> > > > > > > 
> > > > > >           ^^^^^^^^^^^^^^^^^^^
> > > > > >           here resets
> > > > > > > 
> > > > > > > +       if (INTEL_INFO(dev)->gen >= 6) {
> > > > > > > +               ring->add_request = gen6_add_request;
> > > > > > > +               ring->irq_get = gen6_render_ring_get_irq;
> > > > > > > +               ring->irq_put = gen6_render_ring_put_irq;
> > > > > > > +       } else if (IS_GEN5(dev)) {
> > > > > > > +               ring->add_request = pc_render_add_request;
> > > > > > > +               ring->get_seqno = pc_render_get_seqno;
> > > > > > > +       }
> > > > > > 
> > > > > > and then the rest of the `ring` is initialized seemingly
> > > > > > copy-pasted
> > > > > > 
> > > > > > from intel_init_ring_buffer():
> > > > > > > +       ring->dev = dev;
> > > > > > > +       INIT_LIST_HEAD(&ring->active_list);
> > > > > > > +       INIT_LIST_HEAD(&ring->request_list);
> > > > > > > +       INIT_LIST_HEAD(&ring->gpu_write_list);
> > > > > > > +
> > > > > > > +       ring->size = size;
> > > > > > > +       ring->effective_size = ring->size;
> > > > > > > +       if (IS_I830(ring->dev))
> > > > > > > +               ring->effective_size -= 128;
> > > > > > > +
> > > > > > > +       ring->map.offset = start;
> > > > > > > +       ring->map.size = size;
> > > > > > > +       ring->map.type = 0;
> > > > > > > +       ring->map.flags = 0;
> > > > > > > +       ring->map.mtrr = 0;
> > > > > > 
> > > > > > ...
> > > > > > 
> > > > > > where both 3 chunks go almost exactly from
> > > > > > intel_init_ring_buffer(), and ring->effective_size tweak even
> > > > > > stripped original comment:
> > > > > > 
> > > > > > # original version from intel_init_ring_buffer():
> > > > > > >         /* Workaround an erratum on the i830 which causes a hang if
> > > > > > >          * the TAIL pointer points to within the last 2 cachelines
> > > > > > >          * of the buffer.
> > > > > > >          */
> > > > > > >         ring->effective_size = ring->size;
> > > > > > >         if (IS_I830(ring->dev))
> > > > > > >                 ring->effective_size -= 128;
> > > > > > 
> > > > > > ...
> > > > > > 
> > > > > > 
> > > > > > The line marked "here resets" resets all the fields, and maybe
> > > > > > it's not a good idea to re-initialize them all afterwards
> > > > > > (missing some as this thread show), or at least if it is really
> > > > > > needed, share initialization code between
> > > > > > intel_render_ring_init_dri() and intel_init_ring_buffer() ?
> > > > > > 
> > > > > > > From the outside it looks like the offending patch was done as a
> > > > > > > quick fix in a hurry (lots of copy-paste), and maybe it would be
> > > > > > > better to re-do it properly...
> > > > > 
> > > > > Silence... ?
> > > > > 
> > > > > I read UMS is still ignored, because e.g. that uninitialized
> > > > > ring->irq_lock which I've wrote about above is for sure used e.g.
> > > > > in gen6_render_ring_get_irq() added to ring vtable in
> > > > > intel_render_ring_init_dri().
> > > > 
> > > > I really doubt that UMS supports gen6 hardware.
> > > 
> > > Then why it is there in intel_render_ring_init_dri():
> > > > 
> > > >     int intel_render_ring_init_dri(struct drm_device *dev, u64 start, u32 size) {
> > >     	drm_i915_private_t *dev_priv = dev->dev_private;
> > >     	struct intel_ring_buffer *ring = &dev_priv->ring[RCS];
> > >     	
> > >     	*ring = render_ring;
> > >     	if (INTEL_INFO(dev)->gen >= 6) {
> > 
> > This branch executes only when hw generation is 6 or newer.
> 
> and adds gen6_render_ring_get_irq() to vtable which uses ring->irq_lock
> which is left uninitialized.
> 
> I don't understand what you were trying to say. How does it matter if
> some branch executes only for such-and-such hardware, when this branch
> contains bugs? Could you please clarify?

What I mean is that xf86-video-intel with gen6 support does not support
UMS, so you can't even hit this "bug".

Kirill Smelkov Aug. 9, 2011, 4:32 p.m. UTC | #14

On Tue, Aug 09, 2011 at 07:02:59PM +0300, Vasily Khoruzhick wrote:
> On Tuesday 09 August 2011 18:34:46 Kirill Smelkov wrote:
> > On Tue, Aug 09, 2011 at 06:09:57PM +0300, Vasily Khoruzhick wrote:
> > > On Tuesday 09 August 2011 17:47:56 Kirill Smelkov wrote:
> > > > On Tue, Aug 09, 2011 at 05:00:52PM +0300, Vasily Khoruzhick wrote:
> > > > > On Tuesday 09 August 2011 15:08:03 Kirill Smelkov wrote:
> > > > > > On Tue, Jul 26, 2011 at 05:48:27PM +0400, Kirill Smelkov wrote:
> > > > > > > On Sat, Jul 23, 2011 at 12:23:36AM +0400, Kirill Smelkov wrote:
> > > > > > > > Keith,
> > > > > > > > 
> > > > > > > > first of all thanks for your prompt reply. Then...
> > > > > > > > 
> > > > > > > > On Fri, Jul 22, 2011 at 11:00:41AM -0700, Keith Packard wrote:
> > > > > > > > > On Fri, 22 Jul 2011 15:08:06 +0400, Kirill Smelkov
> > > > > > > > > <kirr@mns.spb.ru>
> > > > > 
> > > > > wrote:
> > > > > > > > > > And now after v3.0 is out, I've tested it again, and yes,
> > > > > > > > > > like it was broken on v3.0-rc5, it is (now even more)
> > > > > > > > > > broken on v3.0 -- after first
> > > > > > > > > 
> > > > > > > > > > bad io access the system freezes completely:
> > > > > > > > > I looked at this when I first saw it (a couple of weeks ago),
> > > > > > > > > and I couldn't see any obvious reason this patch would cause
> > > > > > > > > this particular problem. I didn't want to revert the patch
> > > > > > > > > at that point as I feared it would cause other subtle
> > > > > > > > > problems. Given that you've got a work-around, it seemed
> > > > > > > > > best to just push this off past 3.0.
> > > > > > > > 
> > > > > > > > What kind of a workaround are you talking about? Sorry, to me
> > > > > > > > it all looked like "UMS is being ignored forever". Anyway,
> > > > > > > > let's move on to try to solve the issue.
> > > > > > > > 
> > > > > > > > > Given the failing address passed to ioread32, this seems like
> > > > > > > > > it's probably the call to READ_BREADCRUMB --
> > > > > > > > > I915_BREADCRUMB_INDEX is 0x21, which is an offset in 32-bit
> > > > > > > > > units within the hardware status page. If the
> > > > > > > > > status_page.page_addr value was zero, then the computed
> > > > > > > > > address would end up being 0x84.
> > > > > > > > > 
> > > > > > > > > And, it looks like status_page.page_addr *will* end up being
> > > > > > > > > zero as a result of the patch in question. The patch resets
> > > > > > > > > the entire ring structure contents back to the initial
> > > > > > > > > values, which includes smashing the status_page structure to
> > > > > > > > > zero, clearing the value of status_page.page_addr set in
> > > > > > > > > i915_init_phys_hws.
> > > > > > > > > 
> > > > > > > > > Here's an untested patch which moves the initialization of
> > > > > > > > > status_page.page_addr into intel_render_ring_init_dri. I note
> > > > > > > > > that intel_init_render_ring_buffer *already* has the setting
> > > > > > > > > of the status_page.page_addr value, and so I've removed the
> > > > > > > > > setting of status_page.page_addr from i915_init_phys_hws.
> > > > > > > > > 
> > > > > > > > > I suspect we could remove the memset from
> > > > > > > > > intel_init_render_ring_buffer; it seems entirely superfluous
> > > > > > > > > given the memset in i915_init_phys_hws.
> > > > > > > > > 
> > > > > > > > > > From 159ba1dd207fc52590ce8a3afd83f40bd2cedf46 Mon Sep 17 00:00:00 2001
> > > > > > > > > > From: Keith Packard <keithp@keithp.com>
> > > > > > > > > > Date: Fri, 22 Jul 2011 10:44:39 -0700
> > > > > > > > > > Subject: [PATCH] drm/i915: Initialize RCS ring status page address in
> > > > > > > > > >  intel_render_ring_init_dri
> > > > > > > > > 
> > > > > > > > > Physically-addressed hardware status pages are initialized
> > > > > > > > > early in the driver load process by i915_init_phys_hws. For
> > > > > > > > > UMS environments, the ring structure is not initialized
> > > > > > > > > until the X server starts. At that point, the entire ring
> > > > > > > > > structure is re-initialized with all new values. Any values
> > > > > > > > > set in the ring structure (including
> > > > > > > > > ring->status_page.page_addr) will be lost when the ring is
> > > > > > > > > re-initialized.
> > > > > > > > > 
> > > > > > > > > This patch moves the initialization of the
> > > > > > > > > status_page.page_addr value to intel_render_ring_init_dri.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Keith Packard <keithp@keithp.com>
> > > > > > > > > ---
> > > > > > > > > 
> > > > > > > > >  drivers/gpu/drm/i915/i915_dma.c         |    6 ++----
> > > > > > > > >  drivers/gpu/drm/i915/intel_ringbuffer.c |    3 +++
> > > > > > > > >  2 files changed, 5 insertions(+), 4 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/drivers/gpu/drm/i915/i915_dma.c
> > > > > > > > > b/drivers/gpu/drm/i915/i915_dma.c index 1271282..8a3942c
> > > > > > > > > 100644 --- a/drivers/gpu/drm/i915/i915_dma.c
> > > > > > > > > +++ b/drivers/gpu/drm/i915/i915_dma.c
> > > > > > > > > @@ -61,7 +61,6 @@ static void i915_write_hws_pga(struct
> > > > > > > > > drm_device *dev)
> > > > > > > > > 
> > > > > > > > >  static int i915_init_phys_hws(struct drm_device *dev)
> > > > > > > > >  {
> > > > > > > > >  
> > > > > > > > >  	drm_i915_private_t *dev_priv = dev->dev_private;
> > > > > > > > > 
> > > > > > > > > -	struct intel_ring_buffer *ring = LP_RING(dev_priv);
> > > > > > > > > 
> > > > > > > > >  	/* Program Hardware Status Page */
> > > > > > > > >  	dev_priv->status_page_dmah =
> > > > > > > > > 
> > > > > > > > > @@ -71,10 +70,9 @@ static int i915_init_phys_hws(struct
> > > > > > > > > drm_device *dev)
> > > > > > > > > 
> > > > > > > > >  		DRM_ERROR("Can not allocate hardware status page\n");
> > > > > > > > >  		return -ENOMEM;
> > > > > > > > >  	
> > > > > > > > >  	}
> > > > > > > > > 
> > > > > > > > > -	ring->status_page.page_addr =
> > > > > > > > > -		(void __force __iomem *)dev_priv->status_page_dmah-
> >vaddr;
> > > > > > > > > 
> > > > > > > > > -	memset_io(ring->status_page.page_addr, 0, PAGE_SIZE);
> > > > > > > > > +	memset_io((void __force __iomem
> > > > > > > > > *)dev_priv->status_page_dmah->vaddr, +		  0, PAGE_SIZE);
> > > > > > > > > 
> > > > > > > > >  	i915_write_hws_pga(dev);
> > > > > > > > > 
> > > > > > > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > > > > b/drivers/gpu/drm/i915/intel_ringbuffer.c index
> > > > > > > > > e961568..47b9b27 100644
> > > > > > > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > > > > @@ -1321,6 +1321,9 @@ int intel_render_ring_init_dri(struct
> > > > > > > > > drm_device *dev, u64 start, u32 size)
> > > > > > > > > 
> > > > > > > > >  		ring->get_seqno = pc_render_get_seqno;
> > > > > > > > >  	
> > > > > > > > >  	}
> > > > > > > > > 
> > > > > > > > > +	if (!I915_NEED_GFX_HWS(dev))
> > > > > > > > > +		ring->status_page.page_addr =
> > > > > > > > > dev_priv->status_page_dmah->vaddr; +
> > > > > > > > > 
> > > > > > > > >  	ring->dev = dev;
> > > > > > > > >  	INIT_LIST_HEAD(&ring->active_list);
> > > > > > > > >  	INIT_LIST_HEAD(&ring->request_list);
> > > > > > > > 
> > > > > > > > I can't tell whether this is correct, because intel gfx driver
> > > > > > > > is unknown to me, but from the first glance your description
> > > > > > > > sounds reasonable.
> > > > > > > > 
> > > > > > > > I'm out of office till ~ next week's tuesday, and on return
> > > > > > > > I'll try to test it on the hardware in question.
> > > > > > > 
> > > > > > > Keith, thanks again for the patch. As promised I've tested it on
> > > > > > > the hardware in question and yes, bad_access is gone and X seems
> > > > > > > to work, so thank you, but...
> > > > > > > 
> > > > > > > 
> > > > > > > I see there are more such bugs in intel_render_ring_init_dri(),
> > > > > > > which the guilty patch introduced. For example, ring->irq_queue
> > > > > > > is left uninitialized, and also ring->irq_lock etc...
> > > > > > > 
> > > > > > > 
> > > > > > > I'm an X newbie, so if there is something stupid X-wise here,
> > > > > > > please don't beat me too hard, but to me the gist of the problem
> > > > > > > is the original patch, where Chris does
> > > > > > > 
> > > > > > > ( git show e8616b6ced6137085e6657cc63bc2fe3900b8616 )
> > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > > > index 03e3370..51fbc5e 100644
> > > > > > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > > > @@ -1291,6 +1291,48 @@ int intel_init_render_ring_buffer(struct drm_device *dev)
> > > > > > > >         return intel_init_ring_buffer(dev, ring);
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > +int intel_render_ring_init_dri(struct drm_device *dev, u64 start, u32 size)
> > > > > > > > +{
> > > > > > > > +       drm_i915_private_t *dev_priv = dev->dev_private;
> > > > > > > > +       struct intel_ring_buffer *ring = &dev_priv->ring[RCS];
> > > > > > > > +
> > > > > > > > +       *ring = render_ring;
> > > > > > >           ^^^^^^^^^^^^^^^^^^^
> > > > > > >           here resets
> > > > > > > > +       if (INTEL_INFO(dev)->gen >= 6) {
> > > > > > > > +               ring->add_request = gen6_add_request;
> > > > > > > > +               ring->irq_get = gen6_render_ring_get_irq;
> > > > > > > > +               ring->irq_put = gen6_render_ring_put_irq;
> > > > > > > > +       } else if (IS_GEN5(dev)) {
> > > > > > > > +               ring->add_request = pc_render_add_request;
> > > > > > > > +               ring->get_seqno = pc_render_get_seqno;
> > > > > > > > +       }
> > > > > > > 
> > > > > > > and then the rest of the `ring` is initialized, seemingly
> > > > > > > copy-pasted from intel_init_ring_buffer():
> > > > > > > > +       ring->dev = dev;
> > > > > > > > +       INIT_LIST_HEAD(&ring->active_list);
> > > > > > > > +       INIT_LIST_HEAD(&ring->request_list);
> > > > > > > > +       INIT_LIST_HEAD(&ring->gpu_write_list);
> > > > > > > > +
> > > > > > > > +       ring->size = size;
> > > > > > > > +       ring->effective_size = ring->size;
> > > > > > > > +       if (IS_I830(ring->dev))
> > > > > > > > +               ring->effective_size -= 128;
> > > > > > > > +
> > > > > > > > +       ring->map.offset = start;
> > > > > > > > +       ring->map.size = size;
> > > > > > > > +       ring->map.type = 0;
> > > > > > > > +       ring->map.flags = 0;
> > > > > > > > +       ring->map.mtrr = 0;
> > > > > > > 
> > > > > > > ...
> > > > > > > 
> > > > > > > where all 3 chunks come almost verbatim from
> > > > > > > intel_init_ring_buffer(), and the ring->effective_size tweak
> > > > > > > even lost its original comment:
> > > > > > > 
> > > > > > > # original version from intel_init_ring_buffer():
> > > > > > >         /* Workaround an erratum on the i830 which causes a hang if
> > > > > > >          * the TAIL pointer points to within the last 2 cachelines
> > > > > > >          * of the buffer.
> > > > > > >          */
> > > > > > >         ring->effective_size = ring->size;
> > > > > > >         if (IS_I830(ring->dev))
> > > > > > >                 ring->effective_size -= 128;
> > > > > > > ...
> > > > > > > 
> > > > > > > 
> > > > > > > The line marked "here resets" resets all the fields, and maybe
> > > > > > > it's not a good idea to re-initialize them all afterwards
> > > > > > > (missing some, as this thread shows), or at least, if it is
> > > > > > > really needed, share initialization code between
> > > > > > > intel_render_ring_init_dri() and intel_init_ring_buffer()?
> > > > > > > 
> > > > > > > From the outside it looks like the offending patch was done as a
> > > > > > > quick fix in a hurry (lots of copy-paste), and maybe it would be
> > > > > > > better to re-do it properly...
> > > > > > 
> > > > > > Silence... ?
> > > > > > 
> > > > > > I take it UMS is still ignored, because e.g. that uninitialized
> > > > > > ring->irq_lock which I wrote about above is for sure used e.g.
> > > > > > in gen6_render_ring_get_irq(), added to the ring vtable in
> > > > > > intel_render_ring_init_dri().
> > > > > 
> > > > > I really doubt that UMS supports gen6 hardware.
> > > > 
> > > > Then why is it there in intel_render_ring_init_dri():
> > > >     int intel_render_ring_init_dri(struct drm_device *dev, u64 start, u32 size)
> > > >     {
> > > >     	drm_i915_private_t *dev_priv = dev->dev_private;
> > > >     	struct intel_ring_buffer *ring = &dev_priv->ring[RCS];
> > > >     	
> > > >     	*ring = render_ring;
> > > >     	if (INTEL_INFO(dev)->gen >= 6) {
> > > 
> > > This branch executes only when hw generation is 6 or newer.
> > 
> > and adds gen6_render_ring_get_irq() to vtable which uses ring->irq_lock
> > which is left uninitialized.
> > 
> > I don't understand what you were trying to say. How does it matter if
> > some branch executes only for such-and-such hardware, when this branch
> > contains bugs? Could you please clarify?
> 
> I want to say that xf86-video-intel with gen6 support does not support UMS. So 
> you can't even hit this "bug".


OK, but then there is dead code in the kernel, right? Or not dead at
all, because potentially some non-X userspace could trigger the bug.

Why was it added in the first place?


To me, intel_render_ring_init_dri() looks like it was copy-pasted from
several places in a hurry. And I was already bitten by one bug
introduced in it, without a single response for 3 kernel cycles, though
I asked for help several times and provided detailed info.

Finally Keith analyzed and plugged the NULL-pointer dereference (thanks),
but as I'm saying, it seems there are more bugs introduced in e8616b6c.
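
The class of bug discussed here can be sketched in plain C. This is only
an illustration, not the i915 code: struct demo_ring, demo_template and
both demo_* functions are hypothetical stand-ins, assuming just what the
thread describes (a whole-struct assignment that wipes fields set
earlier, followed by hand-copied re-initialization that misses some of
them).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct intel_ring_buffer (not the kernel type). */
struct demo_ring {
	const char *name;
	void *status_page_addr;      /* plays the role of status_page.page_addr */
	int lock_ready;              /* plays the role of an initialized irq_lock */
	unsigned int size;
	unsigned int effective_size;
};

/* Template copied over the ring, like "*ring = render_ring". */
static const struct demo_ring demo_template = { .name = "render ring" };

/*
 * Copy-pasted style init: the struct assignment resets every field,
 * and only some of them are set again afterwards.
 */
static void demo_ring_init_copy_pasted(struct demo_ring *ring, unsigned int size)
{
	*ring = demo_template;       /* status_page_addr and lock_ready are now 0 */
	ring->size = size;
	ring->effective_size = size;
}

/*
 * Shared init: one helper does the reset and then sets every field
 * that both code paths depend on.
 */
static void demo_ring_init_common(struct demo_ring *ring, void *status_page,
				  unsigned int size, int is_i830)
{
	*ring = demo_template;
	ring->status_page_addr = status_page;
	ring->lock_ready = 1;
	ring->size = size;
	ring->effective_size = size;
	if (is_i830)
		ring->effective_size -= 128;   /* keep TAIL out of the last 2 cachelines */
}
```

With a single helper, a field added later has to be initialized in one
place only, so the two entry points cannot drift apart.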

The patch title says "drm/i915: Initialise ring vfuncs for old DRI
paths", and one could ask why it couldn't be done without bugs and
regressions. Are we waiting for someone else to hit the remaining bugs
instead of fixing them in the first place?

Quite frankly, I don't understand the intel-gfx developers' attitude: why
is it me, just a random user, who is nitpicking here? Why is there no
interest/will to analyze the now obviously buggy/duplicate code and fix it?


If support for UMS/old-dri/whatever is dropped, could you please say so,
clean the legacy code out of the driver and move on? That would at
least be fair to people, so they don't keep hoping their old setups will
continue to work.


Thanks,
Kirill

Kirill Smelkov Aug. 9, 2011, 5:40 p.m. UTC | #15

On Tue, Aug 09, 2011 at 09:56:01AM -0700, Ray Lee wrote:
> On Tue, Aug 9, 2011 at 9:32 AM, Kirill Smelkov <kirr@mns.spb.ru> wrote:
> > Quite frankly, I don't understand the intel-gfx developers' attitude: why
> > is it me, just a random user, who is nitpicking here? Why is there no
> > interest/will to analyze the now obviously buggy/duplicate code and fix it?
> 
> Because they don't have an infinite amount of manpower. Actual bugs
> hitting actual users take precedence over 'cleanups' which always have
> a chance of causing regressions, as you're well aware. Code churn for
> the sake of abstract prettiness is discouraged, as it has a potential
> cost for little potential gain.
> 
> If you like, submit a patch. You may now be more up-to-date on those
> particular code paths than most of the intel-gfx developers.

Ray, I'd agree with you if the topic was about cleanups.

But here I was talking about a copy-pasted commit which introduced
regressions and bugs. If the user's dilemma is now either to "clean up"
after the developers himself, or to accept that something is broken
because the developers lack manpower and so plug things in a hurry,
increasing entropy, then I'd like to recall a good rule, at least for
myself one more time: don't break things in the first place.

I'm not talking about cleanup here. I'm talking about the original commit
which introduced the problems. There is no need to clean it up; it is
better to revert it and redo it properly, to avoid subsequent code churn
in lots of fixes.

Sorry, I won't submit a patch. If there is a need to find/fix/cleanup
obvious things after the company's developers, I have better things to
do, and a todo item to re-evaluate hardware for my next project.

Thanks,
Kirill

Kirill Smelkov Aug. 10, 2011, 8:36 a.m. UTC | #16

On Tue, Aug 09, 2011 at 10:43:08AM -0700, Ray Lee wrote:
> On Tue, Aug 9, 2011 at 10:40 AM, Kirill Smelkov <kirr@mns.spb.ru> wrote:
> >> If you like, submit a patch. You may now be more up-to-date on those
> >> particular code paths than most of the intel-gfx developers.
> >
> > Ray, I'd agree with you if the topic was about cleanups.
> 
> At this point it is about cleanups unless Keith's patch upthread does
> not work for you. Does it or not?

I already wrote two weeks ago that it does, but if this needs to be
restated one more time, here it is: Keith's patch fixes the problem in
the sense that X now starts and seemingly works (thanks), but several
issues remain, imho. I've got the message: if it's OK for intel-gfx
to leave them as is, it's OK for me.

Peace,
Kirill

Kirill Smelkov Aug. 10, 2011, 11:37 a.m. UTC | #17

On Wed, Aug 10, 2011 at 10:41:44AM +0100, Alan Cox wrote:
> > Sorry, I won't submit a patch. If there is a need to find/fix/cleanup
> > obvious things after the company's developers, I have better things to
> > do, and a todo item to re-evaluate hardware for my next project.
> 
> You seem confused. If you have a support contract of some form with a
> Linux supplier or Intel please contact your support. This mailing list
> isn't for support services.

Thanks for clarifying.
