diff mbox series

Revert "drm/radeon: handle PCIe root ports with addressing limitations"

Message ID 20200915184607.84435-1-alexander.deucher@amd.com
State New, archived
Headers show
Series Revert "drm/radeon: handle PCIe root ports with addressing limitations" | expand

Commit Message

Alex Deucher Sept. 15, 2020, 6:46 p.m. UTC
This change breaks tons of systems.

This reverts commit 33b3ad3788aba846fc8b9a065fe2685a0b64f713.

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=206973
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=206697
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=207763
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1140
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1287
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: christian.koenig@amd.com
---
 drivers/gpu/drm/radeon/radeon.h        |  1 +
 drivers/gpu/drm/radeon/radeon_device.c | 13 ++++++++-----
 drivers/gpu/drm/radeon/radeon_ttm.c    |  2 +-
 3 files changed, 10 insertions(+), 6 deletions(-)

Comments

Greg KH Sept. 16, 2020, 6:33 a.m. UTC | #1
On Tue, Sep 15, 2020 at 02:46:07PM -0400, Alex Deucher wrote:
> This change breaks tons of systems.

Very vague :(

This commit has also been merged for over a year, why the sudden
problem now?

> This reverts commit 33b3ad3788aba846fc8b9a065fe2685a0b64f713.

You mean "33b3ad3788ab ("drm/radeon: handle PCIe root ports with
addressing limitations")"?

That's the proper way to reference commits in changelogs please.  It's
even documented that way...

> 
> Bug: https://bugzilla.kernel.org/show_bug.cgi?id=206973
> Bug: https://bugzilla.kernel.org/show_bug.cgi?id=206697
> Bug: https://bugzilla.kernel.org/show_bug.cgi?id=207763
> Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1140
> Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1287
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> Cc: stable@vger.kernel.org
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: christian.koenig@amd.com

Fixes: 33b3ad3788ab ("drm/radeon: handle PCIe root ports with addressing limitations")

as well?

thanks,

greg k-h
Daniel Vetter Sept. 16, 2020, 9:59 a.m. UTC | #2
On Wed, Sep 16, 2020 at 08:33:00AM +0200, Greg KH wrote:
> On Tue, Sep 15, 2020 at 02:46:07PM -0400, Alex Deucher wrote:
> > This change breaks tons of systems.
> 
> Very vague :(
> 
> This commit has also been merged for over a year, why the sudden
> problem now?

Unrelated rant, but one year is generally what it takes for most users to
upgrade to new kernels, through their distro updates. Especially for older
hw like the radeon drivers (since 5 years or so amd gpus switched over to
amdgpu.ko).

So surprise that bugs only show up after 1+ year shouldn't be a surprise
:-) My personal rule is that I put a 1 year spacer between a risky change
and any cleanup that enables. Too many regrets in the past.

Cheers, Daniel

> 
> > This reverts commit 33b3ad3788aba846fc8b9a065fe2685a0b64f713.
> 
> You mean "33b3ad3788ab ("drm/radeon: handle PCIe root ports with
> addressing limitations")"?
> 
> That's the proper way to reference commits in changelogs please.  It's
> even documented that way...
> 
> > 
> > Bug: https://bugzilla.kernel.org/show_bug.cgi?id=206973
> > Bug: https://bugzilla.kernel.org/show_bug.cgi?id=206697
> > Bug: https://bugzilla.kernel.org/show_bug.cgi?id=207763
> > Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1140
> > Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1287
> > Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> > Cc: stable@vger.kernel.org
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: christian.koenig@amd.com
> 
> Fixes: 33b3ad3788ab ("drm/radeon: handle PCIe root ports with addressing limitations")
> 
> as well?
> 
> thanks,
> 
> greg k-h
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Alex Deucher Sept. 16, 2020, 1:08 p.m. UTC | #3
On Wed, Sep 16, 2020 at 2:32 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Tue, Sep 15, 2020 at 02:46:07PM -0400, Alex Deucher wrote:
> > This change breaks tons of systems.
>
> Very vague :(

Screen corruption making the system unusable.

>
> This commit has also been merged for over a year, why the sudden
> problem now?
>

It was noticed by several people closer to when the change went in as
well.  If you notice, most of the bugs date back quite a while.  We
looked into it a bit at the time but couldn't determine the problem.
It only seems to affect really old chips (like 15-20 years old) which
makes it hard to reproduce if you don't have an old system.  There
were a couple of threads at the time, but nothing was resolved.  I was
able to find one of them:
https://lkml.org/lkml/2019/12/14/263

There were several new bugs filed which brought the issue back to my
attention recently.

> > This reverts commit 33b3ad3788aba846fc8b9a065fe2685a0b64f713.
>
> You mean "33b3ad3788ab ("drm/radeon: handle PCIe root ports with
> addressing limitations")"?
>
> That's the proper way to reference commits in changelogs please.  It's
> even documented that way...

When you revert a patch with git, that is what it does.  Maybe we
should fix git to change the formatting.

>
> >
> > Bug: https://bugzilla.kernel.org/show_bug.cgi?id=206973
> > Bug: https://bugzilla.kernel.org/show_bug.cgi?id=206697
> > Bug: https://bugzilla.kernel.org/show_bug.cgi?id=207763
> > Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1140
> > Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1287
> > Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> > Cc: stable@vger.kernel.org
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: christian.koenig@amd.com
>
> Fixes: 33b3ad3788ab ("drm/radeon: handle PCIe root ports with addressing limitations")
>

Sure, I can add that, but it doesn't really fix it, it reverts it.
But point taken, it does fix the commit by removing it.

Alex

> as well?
>
> thanks,
>
> greg k-h
Alex Deucher Sept. 16, 2020, 10:16 p.m. UTC | #4
On Wed, Sep 16, 2020 at 3:04 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Sep 15, 2020 at 02:46:07PM -0400, Alex Deucher wrote:
> > This change breaks tons of systems.
>
> Did you do at least some basic root causing on why?  Do GPUs get
> fed address they can't deal with?  Any examples?
>
> Bug 1 doesn't seem to contain any analysis and was reported against
> a very old kernel that had all kind of fixes since.
>
> Bug 2 seems to imply a drm kthread is accessing some structure it
> shouldn't, which would imply a mismatch between pools used by radeon
> now and those actually provided by the core.  Something that should
> be pretty to trivial to fix for someone understanding the whole ttm
> pool maze.
>
> Bug 3: same as 1, but an even older kernel.
>
> Bug 4: looks like 1 and 3, and actually verified to work properly
> in 5.9-rc.  Did you try to get the other reporters test this as well?

It would appear that the change in 5.9 to disable AGP on radeon fixed
the issue.  I'm following up on the other tickets to see if I can get
confirmation.  On another thread[1], the user was able to avoid the
issue by disabling HIMEM.  Looks like some issue with HIMEM and/or
AGP.

Alex

[1] https://lkml.org/lkml/2019/12/14/263
diff mbox series

Patch

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index b7c3fb2bfb54..eed23dffccf4 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -2391,6 +2391,7 @@  struct radeon_device {
 	struct radeon_wb		wb;
 	struct radeon_dummy_page	dummy_page;
 	bool				shutdown;
+	bool				need_dma32;
 	bool				need_swiotlb;
 	bool				accel_working;
 	bool				fastfb_working; /* IGP feature*/
diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
index 266e3cbbd09b..f74c74ad8b5d 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1363,25 +1363,28 @@  int radeon_device_init(struct radeon_device *rdev,
 	else
 		rdev->mc.mc_mask = 0xffffffffULL; /* 32 bit MC */
 
-	/* set DMA mask.
+	/* set DMA mask + need_dma32 flags.
 	 * PCIE - can handle 40-bits.
 	 * IGP - can handle 40-bits
 	 * AGP - generally dma32 is safest
 	 * PCI - dma32 for legacy pci gart, 40 bits on newer asics
 	 */
-	dma_bits = 40;
+	rdev->need_dma32 = false;
 	if (rdev->flags & RADEON_IS_AGP)
-		dma_bits = 32;
+		rdev->need_dma32 = true;
 	if ((rdev->flags & RADEON_IS_PCI) &&
 	    (rdev->family <= CHIP_RS740))
-		dma_bits = 32;
+		rdev->need_dma32 = true;
 #ifdef CONFIG_PPC64
 	if (rdev->family == CHIP_CEDAR)
-		dma_bits = 32;
+		rdev->need_dma32 = true;
 #endif
 
+	dma_bits = rdev->need_dma32 ? 32 : 40;
 	r = dma_set_mask_and_coherent(&rdev->pdev->dev, DMA_BIT_MASK(dma_bits));
 	if (r) {
+		rdev->need_dma32 = true;
+		dma_bits = 32;
 		pr_warn("radeon: No suitable DMA available\n");
 		return r;
 	}
diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index 357e8e98cca9..d2550862313e 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -787,7 +787,7 @@  int radeon_ttm_init(struct radeon_device *rdev)
 			       &radeon_bo_driver,
 			       rdev->ddev->anon_inode->i_mapping,
 			       rdev->ddev->vma_offset_manager,
-			       dma_addressing_limited(&rdev->pdev->dev));
+			       rdev->need_dma32);
 	if (r) {
 		DRM_ERROR("failed initializing buffer object driver(%d).\n", r);
 		return r;