Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
diff mbox

Message ID 8020e778-c9c8-e67b-428a-436f2f1c0d7b@daenzer.net
State New
Headers show

Commit Message

Michel Dänzer March 24, 2017, 9:24 a.m. UTC
On 23/03/17 06:26 PM, Julien Isorce wrote:
> Hi Michel,
> 
> When it happens, the main thread of our gl based app is stuck on a
> ioctl(RADEON_CS). I set RADEON_THREAD=false to ease the debugging but
> same thing happens if true. Other threads are only si_shader:0,1,2,3 and
> are doing nothing, just waiting for jobs. I can also do sudo gdb -p
> $(pidof Xorg) to block the X11 server, to make sure there is no ping
> pong between 2 processes. All other processes are not loading
> dri/radeonsi_dri.so . And adding a few traces shows that the above ioctl
> call is looping for ever on
> https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/ttm/ttm_bo.c#L819
> <https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/ttm/ttm_bo.c#L819> and
> comes from
> mesa https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/winsys/radeon/drm/radeon_drm_cs.c#n454
> . 
> 
> After adding even more traces I can see that the bo, which is being
> indefinitely evicted, has the flag RADEON_GEM_NO_CPU_ACCESS.
> And it gets 3 potential placements after calling "radeon_evict_flags". 
>  1: VRAM cpu inaccessible, fpfn is 65536
>  2: VRAM cpu accessible, fpfn is 0
>  3: GTT, fpfn is 0
> 
> And it looks like it continuously succeeds to move on the second
> placement. So I might be wrong but it looks it is not even a ping pong
> between VRAM accessible / not accessible, it just keeps being blited in
> the CPU accessible part of the VRAM.

Thanks for the detailed description! AFAICT this can only happen due to
a silly mistake I made in this code. Does this fix it?

Comments

Julien Isorce March 24, 2017, 9:50 a.m. UTC | #1
Hi Michel,

(Just for other readers my reply has been delayed on the mailing lists and
should have been on second position)

We have actually spotted this /0/i/ but somehow I convinced myself it was
intentional. The reason I found was that you wanted to set the fpfn only if
there is 2 placements, which means it will try to move from accessible to
inaccessible.

I will have a go with that change and let you know. I do not remember if I
tried it for this soft lockup. But for sure it does not solve the hard
lockup that Zach also mentioned at the end of his reply. I am saying that
because this other issue has some similarities (same ioctl call).

But in general, isn't "radeon_lockup_timeout" supposed to detect this
situation ?

Thx
Julien


On 24 March 2017 at 09:24, Michel Dänzer <michel@daenzer.net> wrote:

> On 23/03/17 06:26 PM, Julien Isorce wrote:
> > Hi Michel,
> >
> > When it happens, the main thread of our gl based app is stuck on a
> > ioctl(RADEON_CS). I set RADEON_THREAD=false to ease the debugging but
> > same thing happens if true. Other threads are only si_shader:0,1,2,3 and
> > are doing nothing, just waiting for jobs. I can also do sudo gdb -p
> > $(pidof Xorg) to block the X11 server, to make sure there is no ping
> > pong between 2 processes. All other processes are not loading
> > dri/radeonsi_dri.so . And adding a few traces shows that the above ioctl
> > call is looping for ever on
> > https://github.com/torvalds/linux/blob/master/drivers/gpu/
> drm/ttm/ttm_bo.c#L819
> > <https://github.com/torvalds/linux/blob/master/drivers/gpu/
> drm/ttm/ttm_bo.c#L819> and
> > comes from
> > mesa https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/
> winsys/radeon/drm/radeon_drm_cs.c#n454
> > .
> >
> > After adding even more traces I can see that the bo, which is being
> > indefinitely evicted, has the flag RADEON_GEM_NO_CPU_ACCESS.
> > And it gets 3 potential placements after calling "radeon_evict_flags".
> >  1: VRAM cpu inaccessible, fpfn is 65536
> >  2: VRAM cpu accessible, fpfn is 0
> >  3: GTT, fpfn is 0
> >
> > And it looks like it continuously succeeds to move on the second
> > placement. So I might be wrong but it looks it is not even a ping pong
> > between VRAM accessible / not accessible, it just keeps being blited in
> > the CPU accessible part of the VRAM.
>
> Thanks for the detailed description! AFAICT this can only happen due to
> a silly mistake I made in this code. Does this fix it?
>
> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/
> radeon_ttm.c
> index 5c7cf644ba1d..37d68cd1f272 100644
> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> @@ -213,8 +213,8 @@ static void radeon_evict_flags(struct
> ttm_buffer_object *bo,
>                         rbo->placement.num_busy_placement = 0;
>                         for (i = 0; i < rbo->placement.num_placement; i++)
> {
>                                 if (rbo->placements[i].flags &
> TTM_PL_FLAG_VRAM) {
> -                                       if (rbo->placements[0].fpfn < fpfn)
> -                                               rbo->placements[0].fpfn =
> fpfn;
> +                                       if (rbo->placements[i].fpfn < fpfn)
> +                                               rbo->placements[i].fpfn =
> fpfn;
>                                 } else {
>                                         rbo->placement.busy_placement =
>                                                 &rbo->placements[i];
>
>
>
> --
> Earthling Michel Dänzer               |               http://www.amd.com
> Libre software enthusiast             |             Mesa and X developer
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>
Michel Dänzer March 24, 2017, 9:59 a.m. UTC | #2
On 24/03/17 06:50 PM, Julien Isorce wrote:
> Hi Michel,
> 
> (Just for other readers my reply has been delayed on the mailing lists
> and should have been on second position)

It is on https://patchwork.freedesktop.org/patch/145731/ , did you mean
something else?

The delay was because you weren't subscribed to the amd-gfx mailing list
yet, so your post went through the moderation queue.


> I will have a go with that change and let you know. I do not remember if
> I tried it for this soft lockup. But for sure it does not solve the hard
> lockup that Zach also mentioned at the end of his reply.

I'll follow up to his post about that.


> But in general, isn't "radeon_lockup_timeout" supposed to detect this
> situation ?

No, it's for detecting GPU hangs, whereas this is a CPU "hang" (infinite
loop).
Julien Isorce March 24, 2017, 6:59 p.m. UTC | #3
Hi Michel,

I double checked and you are right, the change 0 -> i works.

Cheers
Julien

On 24 March 2017 at 09:59, Michel Dänzer <michel@daenzer.net> wrote:

> On 24/03/17 06:50 PM, Julien Isorce wrote:
> > Hi Michel,
> >
> > (Just for other readers my reply has been delayed on the mailing lists
> > and should have been on second position)
>
> It is on https://patchwork.freedesktop.org/patch/145731/ , did you mean
> something else?
>
> The delay was because you weren't subscribed to the amd-gfx mailing list
> yet, so your post went through the moderation queue.
>
>
> > I will have a go with that change and let you know. I do not remember if
> > I tried it for this soft lockup. But for sure it does not solve the hard
> > lockup that Zach also mentioned at the end of his reply.
>
> I'll follow up to his post about that.
>
>
> > But in general, isn't "radeon_lockup_timeout" supposed to detect this
> > situation ?
>
> No, it's for detecting GPU hangs, whereas this is a CPU "hang" (infinite
> loop).
>
>
> --
> Earthling Michel Dänzer               |               http://www.amd.com
> Libre software enthusiast             |             Mesa and X developer
>
Michel Dänzer March 27, 2017, 12:59 a.m. UTC | #4
On 25/03/17 03:59 AM, Julien Isorce wrote:
> Hi Michel,
> 
> I double checked and you are right, the change 0 -> i works. 

Thanks for testing, fix submitted for review.

Patch
diff mbox

diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index 5c7cf644ba1d..37d68cd1f272 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -213,8 +213,8 @@  static void radeon_evict_flags(struct ttm_buffer_object *bo,
                        rbo->placement.num_busy_placement = 0;
                        for (i = 0; i < rbo->placement.num_placement; i++) {
                                if (rbo->placements[i].flags & TTM_PL_FLAG_VRAM) {
-                                       if (rbo->placements[0].fpfn < fpfn)
-                                               rbo->placements[0].fpfn = fpfn;
+                                       if (rbo->placements[i].fpfn < fpfn)
+                                               rbo->placements[i].fpfn = fpfn;
                                } else {
                                        rbo->placement.busy_placement =
                                                &rbo->placements[i];