diff mbox series

[RFC,1/1] drm/radeon: fix bad DMA from INTERRUPT_CNTL2

Message ID 7251833ab9439f4e34ba3fb2c5daf6c9e01b6551.1573698927.git.sbobroff@linux.ibm.com (mailing list archive)
State New, archived
Headers show
Series [RFC,1/1] drm/radeon: fix bad DMA from INTERRUPT_CNTL2 | expand

Commit Message

Sam Bobroff Nov. 14, 2019, 2:36 a.m. UTC
Currently, binding the radeon driver to a Radeon FirePro (PCI
1002:68f2) that has been passed through to a guest on a Power8 machine
causes a bad DMA read which causes the device to be frozen, leading to
the driver failing to bind and other problems.

The bad DMA is to the address written into the INTERRUPT_CNTL2
register during r600_irq_init(), and it can be fixed by substituting a
valid (dummy) DMA address.
---
Hello,

I've been tracking down a bug, described above, and I've been able to hack
in a workaround but I could use some help understanding how to fix it
properly.

What seems to be happening is that when the first CRTC is enabled (when
evergreen_irq_set() writes to GRPH_INT_CONTROL) the device immediately
performs a DMA read from the address that's been programmed into
INTERRUPT_CNTL2 by r600_irq_init().

That address isn't a valid DMA address, so it triggers the problem. The
address comes from rdev->ih.gpu_addr, which seems to be a 'linear GPU
address', calculated from the size of the card's VRAM. It definitely hasn't
come from a DMA mapping operation.

Based on the nearby comment, "set dummy read address to ring address", I
tried substituting a valid dummy DMA address (using the address mapped to
the 'dummy page' used by the GART) and it does prevent the problem.
However, I don't know what that register is supposed to do or what
information the device might be communicating via that DMA, so presumably
it breaks something.

Note: I have tried loading the patched driver with "test=1" and all the
self tests succeed.

Could anyone offer some insight into this problem?
What does that register do? What kind of address is it expecting?
What might be a good way of fixing it?

Thanks in advance,
Sam.

 drivers/gpu/drm/radeon/r600.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Alex Deucher Nov. 14, 2019, 5:05 a.m. UTC | #1
On Wed, Nov 13, 2019 at 9:53 PM Sam Bobroff <sbobroff@linux.ibm.com> wrote:
>
> Currently, binding the radeon driver to a Radeon FirePro (PCI
> 1002:68f2) that has been passed through to a guest on a Power8 machine
> causes a bad DMA read which causes the device to be frozen, leading to
> the driver failing to bind and other problems.
>
> The bad DMA is to the address written into the INTERRUPT_CNTL2
> register during r600_irq_init(), and it can be fixed by substituting a
> valid (dummy) DMA address.

The patch is correct.  INTERRUPT_CNTL2 takes a bus address not a GPU
MC address.  Care to update si.c and cik.c as well?

Thanks,

Alex

> ---
> Hello,
>
> I've been tracking down a bug, described above, and I've been able to hack
> in a workaround but I could use some help understanding how to fix it
> properly.
>
> What seems to be happening is that when the first CRTC is enabled (when
> evergreen_irq_set() writes to GRPH_INT_CONTROL) the device immediately
> performs a DMA read from the address that's been programmed into
> INTERRUPT_CNTL2 by r600_irq_init().
>
> That address isn't a valid DMA address, so it triggers the problem. The
> address comes from rdev->ih.gpu_addr, which seems to be a 'linear GPU
> address', calculated from the size of the card's VRAM. It definitely hasn't
> come from a DMA mapping operation.
>
> Based on the nearby comment, "set dummy read address to ring address", I
> tried substituting a valid dummy DMA address (using the address mapped to
> the 'dummy page' used by the GART) and it does prevent the problem.
> However, I don't know what that register is supposed to do or what
> information the device might be communicating via that DMA, so presumably
> it breaks something.
>
> Note: I have tried loading the patched driver with "test=1" and all the
> self tests succeed.
>
> Could anyone offer some insight into this problem?
> What does that register do? What kind of address is it expecting?
> What might be a good way of fixing it?
>
> Thanks in advance,
> Sam.
>
>  drivers/gpu/drm/radeon/r600.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c
> index e937cc01910d..033bc466a862 100644
> --- a/drivers/gpu/drm/radeon/r600.c
> +++ b/drivers/gpu/drm/radeon/r600.c
> @@ -3696,8 +3696,8 @@ int r600_irq_init(struct radeon_device *rdev)
>         }
>
>         /* setup interrupt control */
> -       /* set dummy read address to ring address */
> -       WREG32(INTERRUPT_CNTL2, rdev->ih.gpu_addr >> 8);
> +       /* set dummy read address to dummy page address */
> +       WREG32(INTERRUPT_CNTL2, rdev->dummy_page.addr >> 8);
>         interrupt_cntl = RREG32(INTERRUPT_CNTL);
>         /* IH_DUMMY_RD_OVERRIDE=0 - dummy read disabled with msi, enabled without msi
>          * IH_DUMMY_RD_OVERRIDE=1 - dummy read controlled by IH_DUMMY_RD_EN
> --
> 2.22.0.216.g00a2a96fc9
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Sam Bobroff Nov. 14, 2019, 11:08 p.m. UTC | #2
On Thu, Nov 14, 2019 at 12:05:19AM -0500, Alex Deucher wrote:
> On Wed, Nov 13, 2019 at 9:53 PM Sam Bobroff <sbobroff@linux.ibm.com> wrote:
> >
> > Currently, binding the radeon driver to a Radeon FirePro (PCI
> > 1002:68f2) that has been passed through to a guest on a Power8 machine
> > causes a bad DMA read which causes the device to be frozen, leading to
> > the driver failing to bind and other problems.
> >
> > The bad DMA is to the address written into the INTERRUPT_CNTL2
> > register during r600_irq_init(), and it can be fixed by substituting a
> > valid (dummy) DMA address.
> 
> The patch is correct.  INTERRUPT_CNTL2 takes a bus address not a GPU
> MC address.  Care to update si.c and cik.c as well?

Sure (although I can't test those cases). I see that there's one other
case in si_ih.c (although it's for amdgpu, not radeon); should I include
that as well?

> Thanks,
> 
> Alex
> 
> > ---
> > Hello,
> >
> > I've been tracking down a bug, described above, and I've been able to hack
> > in a workaround but I could use some help understanding how to fix it
> > properly.
> >
> > What seems to be happening is that when the first CRTC is enabled (when
> > evergreen_irq_set() writes to GRPH_INT_CONTROL) the device immediately
> > performs a DMA read from the address that's been programmed into
> > INTERRUPT_CNTL2 by r600_irq_init().
> >
> > That address isn't a valid DMA address, so it triggers the problem. The
> > address comes from rdev->ih.gpu_addr, which seems to be a 'linear GPU
> > address', calculated from the size of the card's VRAM. It definitely hasn't
> > come from a DMA mapping operation.
> >
> > Based on the nearby comment, "set dummy read address to ring address", I
> > tried substituting a valid dummy DMA address (using the address mapped to
> > the 'dummy page' used by the GART) and it does prevent the problem.
> > However, I don't know what that register is supposed to do or what
> > information the device might be communicating via that DMA, so presumably
> > it breaks something.
> >
> > Note: I have tried loading the patched driver with "test=1" and all the
> > self tests succeed.
> >
> > Could anyone offer some insight into this problem?
> > What does that register do? What kind of address is it expecting?
> > What might be a good way of fixing it?
> >
> > Thanks in advance,
> > Sam.
> >
> >  drivers/gpu/drm/radeon/r600.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c
> > index e937cc01910d..033bc466a862 100644
> > --- a/drivers/gpu/drm/radeon/r600.c
> > +++ b/drivers/gpu/drm/radeon/r600.c
> > @@ -3696,8 +3696,8 @@ int r600_irq_init(struct radeon_device *rdev)
> >         }
> >
> >         /* setup interrupt control */
> > -       /* set dummy read address to ring address */
> > -       WREG32(INTERRUPT_CNTL2, rdev->ih.gpu_addr >> 8);
> > +       /* set dummy read address to dummy page address */
> > +       WREG32(INTERRUPT_CNTL2, rdev->dummy_page.addr >> 8);
> >         interrupt_cntl = RREG32(INTERRUPT_CNTL);
> >         /* IH_DUMMY_RD_OVERRIDE=0 - dummy read disabled with msi, enabled without msi
> >          * IH_DUMMY_RD_OVERRIDE=1 - dummy read controlled by IH_DUMMY_RD_EN
> > --
> > 2.22.0.216.g00a2a96fc9
> >
> > _______________________________________________
> > dri-devel mailing list
> > dri-devel@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/dri-devel
Alex Deucher Nov. 14, 2019, 11:09 p.m. UTC | #3
On Thu, Nov 14, 2019 at 6:08 PM Sam Bobroff <sbobroff@linux.ibm.com> wrote:
>
> On Thu, Nov 14, 2019 at 12:05:19AM -0500, Alex Deucher wrote:
> > On Wed, Nov 13, 2019 at 9:53 PM Sam Bobroff <sbobroff@linux.ibm.com> wrote:
> > >
> > > Currently, binding the radeon driver to a Radeon FirePro (PCI
> > > 1002:68f2) that has been passed through to a guest on a Power8 machine
> > > causes a bad DMA read which causes the device to be frozen, leading to
> > > the driver failing to bind and other problems.
> > >
> > > The bad DMA is to the address written into the INTERRUPT_CNTL2
> > > register during r600_irq_init(), and it can be fixed by substituting a
> > > valid (dummy) DMA address.
> >
> > The patch is correct.  INTERRUPT_CNTL2 takes a bus address not a GPU
> > MC address.  Care to update si.c and cik.c as well?
>
> Sure (although I can't test those cases). I see that there's one other
> case in si_ih.c (although it's for amdgpu, not radeon); should I include
> that as well?
>

Yes, please.

thanks,

Alex

> > Thanks,
> >
> > Alex
> >
> > > ---
> > > Hello,
> > >
> > > I've been tracking down a bug, described above, and I've been able to hack
> > > in a workaround but I could use some help understanding how to fix it
> > > properly.
> > >
> > > What seems to be happening is that when the first CRTC is enabled (when
> > > evergreen_irq_set() writes to GRPH_INT_CONTROL) the device immediately
> > > performs a DMA read from the address that's been programmed into
> > > INTERRUPT_CNTL2 by r600_irq_init().
> > >
> > > That address isn't a valid DMA address, so it triggers the problem. The
> > > address comes from rdev->ih.gpu_addr, which seems to be a 'linear GPU
> > > address', calculated from the size of the card's VRAM. It definitely hasn't
> > > come from a DMA mapping operation.
> > >
> > > Based on the nearby comment, "set dummy read address to ring address", I
> > > tried substituting a valid dummy DMA address (using the address mapped to
> > > the 'dummy page' used by the GART) and it does prevent the problem.
> > > However, I don't know what that register is supposed to do or what
> > > information the device might be communicating via that DMA, so presumably
> > > it breaks something.
> > >
> > > Note: I have tried loading the patched driver with "test=1" and all the
> > > self tests succeed.
> > >
> > > Could anyone offer some insight into this problem?
> > > What does that register do? What kind of address is it expecting?
> > > What might be a good way of fixing it?
> > >
> > > Thanks in advance,
> > > Sam.
> > >
> > >  drivers/gpu/drm/radeon/r600.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c
> > > index e937cc01910d..033bc466a862 100644
> > > --- a/drivers/gpu/drm/radeon/r600.c
> > > +++ b/drivers/gpu/drm/radeon/r600.c
> > > @@ -3696,8 +3696,8 @@ int r600_irq_init(struct radeon_device *rdev)
> > >         }
> > >
> > >         /* setup interrupt control */
> > > -       /* set dummy read address to ring address */
> > > -       WREG32(INTERRUPT_CNTL2, rdev->ih.gpu_addr >> 8);
> > > +       /* set dummy read address to dummy page address */
> > > +       WREG32(INTERRUPT_CNTL2, rdev->dummy_page.addr >> 8);
> > >         interrupt_cntl = RREG32(INTERRUPT_CNTL);
> > >         /* IH_DUMMY_RD_OVERRIDE=0 - dummy read disabled with msi, enabled without msi
> > >          * IH_DUMMY_RD_OVERRIDE=1 - dummy read controlled by IH_DUMMY_RD_EN
> > > --
> > > 2.22.0.216.g00a2a96fc9
> > >
> > > _______________________________________________
> > > dri-devel mailing list
> > > dri-devel@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
diff mbox series

Patch

diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c
index e937cc01910d..033bc466a862 100644
--- a/drivers/gpu/drm/radeon/r600.c
+++ b/drivers/gpu/drm/radeon/r600.c
@@ -3696,8 +3696,8 @@  int r600_irq_init(struct radeon_device *rdev)
 	}
 
 	/* setup interrupt control */
-	/* set dummy read address to ring address */
-	WREG32(INTERRUPT_CNTL2, rdev->ih.gpu_addr >> 8);
+	/* set dummy read address to dummy page address */
+	WREG32(INTERRUPT_CNTL2, rdev->dummy_page.addr >> 8);
 	interrupt_cntl = RREG32(INTERRUPT_CNTL);
 	/* IH_DUMMY_RD_OVERRIDE=0 - dummy read disabled with msi, enabled without msi
 	 * IH_DUMMY_RD_OVERRIDE=1 - dummy read controlled by IH_DUMMY_RD_EN