Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"

Message ID 5f26f8a1-e5bd-ca7a-c7ee-08c8f04f2110@daenzer.net (mailing list archive)
State New, archived

Commit Message

Michel Dänzer March 24, 2017, 10:03 a.m. UTC
On 24/03/17 12:31 AM, Zachary Michaels wrote:
> 
> I should also note that we are experiencing another issue where the
> kernel locks up in similar circumstances. As Julien noted, we get no
> output, and the watchdogs don't seem to work. It may be the case that
> Xorg and our process are calling ttm_bo_mem_force_space concurrently,
> but I don't think we have enough information yet to say for
> sure. Reverting this commit does not fix that issue. I have some small
> amount of evidence indicating that bos flagged for CPU access are
> getting placed in CPU inaccessible memory. Could that cause this sort of
> kernel lockup?

Possibly, does this help?

Comments

Julien Isorce March 24, 2017, 7:01 p.m. UTC | #1
Hi Michel,

No this change does not help on the other issue (hard lockup).
I have not tried it in combination with the 0 -> i change.

Thx anyway.
Julien


Julien Isorce March 28, 2017, 8:24 a.m. UTC | #2
Hi Michel,

About the hard lockup, I noticed that I cannot reproduce it under the
following conditions:

1. soft lockup fix (the 0->i change which avoids the infinite loop)
2. Your suggestion: !(rbo->flags & RADEON_GEM_CPU_ACCESS)
3. radeon.gartsize=512 radeon.vramlimit=1024 (larger values do not help,
for example (1024, 1024) or (1024, 2048))

Without 1 and 2, but with 3, our test reproduces the soft lockup (which we
only discovered a few days ago).
Without 3 (with or without 1 and 2), our test reproduces the hard lockup,
which leaves no information in kern.log (sometimes a few NUL ^@ characters,
but not always).

We are converting this repro test into a piglit test so that we can share
it, but that will take some time. In short, it continuously uploads images
whose size is picked at random, up to 4K, so TTM's eviction mechanism is
hit a lot.

(The card is a FirePro W600, Cape Verde, 2048M.)

I am happy to try any other suggestion.

Thx
Julien
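
[Editor's sketch] The thread does not show the "0 -> i" change itself. Assuming it refers to the array index used when raising fpfn on each VRAM placement inside a loop over the BO's placements, the difference can be modelled in standalone C as below. Everything here (struct model_place, the MODEL_PL_FLAG_* bits, the function names) is hypothetical and only mimics the shape of the driver code; it is not the radeon source.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in flag bits; these are not the real TTM_PL_FLAG_* values. */
#define MODEL_PL_FLAG_VRAM (1u << 0)
#define MODEL_PL_FLAG_GTT  (1u << 1)

/* Hypothetical model of one placement: first allowed page and domain. */
struct model_place {
    unsigned fpfn;
    uint32_t flags;
};

/* Assumed "before" variant: iterates with i but always writes index 0. */
static void raise_fpfn_first_only(struct model_place *pl, size_t n,
                                  unsigned fpfn)
{
    for (size_t i = 0; i < n; i++) {
        if (!(pl[i].flags & MODEL_PL_FLAG_VRAM))
            continue;
        if (pl[0].fpfn < fpfn)
            pl[0].fpfn = fpfn;
    }
}

/* Assumed "after" variant: every VRAM placement gets the raised fpfn. */
static void raise_fpfn_each(struct model_place *pl, size_t n, unsigned fpfn)
{
    for (size_t i = 0; i < n; i++) {
        if (!(pl[i].flags & MODEL_PL_FLAG_VRAM))
            continue;
        if (pl[i].fpfn < fpfn)
            pl[i].fpfn = fpfn;
    }
}

/* Count VRAM placements that still allow pages below fpfn. */
static size_t count_vram_below(const struct model_place *pl, size_t n,
                               unsigned fpfn)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if ((pl[i].flags & MODEL_PL_FLAG_VRAM) && pl[i].fpfn < fpfn)
            count++;
    return count;
}
```

With two VRAM placements that both start at fpfn 0, the first variant leaves the second one untouched, so it still permits the CPU-visible range. That may be how an eviction could keep selecting the same location, but this is a guess and not something the thread confirms.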

Michel Dänzer March 28, 2017, 9:36 a.m. UTC | #3
On 28/03/17 05:24 PM, Julien Isorce wrote:
> Hi Michel,
> 
> About the hard lockup, I noticed that I cannot have it with the
> following conditions:
> 
> 1. soft lockup fix (the 0->i change which avoids infinite loop)
> 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
> 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values above do
> not help, for example (1024, 1024) or (1024, 2048))
> 
> Without 1 and 2, but with 3, our test reproduces the soft lockup (just
> discovered few days ago).
> Without 3 (and with or without 1., 2.), our test reproduces the hard
> lockup which one does not give any info in kern.log (sometimes some NUL
> ^@ characters but not always).

What exactly does "hard lockup" mean? What are the symptoms?
Julien Isorce March 28, 2017, 11 a.m. UTC | #4
On 28 March 2017 at 10:36, Michel Dänzer <michel@daenzer.net> wrote:

> On 28/03/17 05:24 PM, Julien Isorce wrote:
> > Hi Michel,
> >
> > About the hard lockup, I noticed that I cannot have it with the
> > following conditions:
> >
> > 1. soft lockup fix (the 0->i change which avoids infinite loop)
> > 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
> > 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values above do
> > not help, for example (1024, 1024) or (1024, 2048))
> >
> > Without 1 and 2, but with 3, our test reproduces the soft lockup (just
> > discovered few days ago).
> > Without 3 (and with or without 1., 2.), our test reproduces the hard
> > lockup which one does not give any info in kern.log (sometimes some NUL
> > ^@ characters but not always).
>
> What exactly does "hard lockup" mean? What are the symptoms?
>
>
Screens are frozen, cannot ssh, no mouse/keyboard, no kernel panic.
Requires hard reboot.
After reboot, nothing in /var/crash, nothing in /sys/fs/pstore, nothing in
kern.log except sometimes some nul characters ^@.
With added traces, it looks like the test app was still in an
ioctl(RADEON_CS), but that is hard to rely on since it is called a lot.
Using a serial console did not show additional debug messages. kgdb was not
useful, but is probably worth another attempt.
Michel Dänzer March 29, 2017, 8:59 a.m. UTC | #5
On 28/03/17 08:00 PM, Julien Isorce wrote:
> 
> On 28 March 2017 at 10:36, Michel Dänzer <michel@daenzer.net
> <mailto:michel@daenzer.net>> wrote:
> 
>     On 28/03/17 05:24 PM, Julien Isorce wrote:
>     > Hi Michel,
>     >
>     > About the hard lockup, I noticed that I cannot have it with the
>     > following conditions:
>     >
>     > 1. soft lockup fix (the 0->i change which avoids infinite loop)
>     > 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
>     > 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values above do
>     > not help, for example (1024, 1024) or (1024, 2048))
>     >
>     > Without 1 and 2, but with 3, our test reproduces the soft lockup (just
>     > discovered few days ago).
>     > Without 3 (and with or without 1., 2.), our test reproduces the hard
>     > lockup which one does not give any info in kern.log (sometimes some NUL
>     > ^@ characters but not always).
> 
>     What exactly does "hard lockup" mean? What are the symptoms?
> 
> 
> Screens are frozen, cannot ssh, no mouse/keyboard, no kernel panic.
> Requires hard reboot.
> After reboot, nothing in /var/crash, nothing in /sys/fs/pstore, nothing
> in kern.log except sometimes some nul characters ^@.

Does it still respond to ping when it's hung?


> Using a serial console did not show additional debug messages. kgdb was
> not useful but probably worth another attempt.

Right.

Anyway, I'm afraid it sounds like it's probably not directly related to
the issue I was thinking of for my previous test patch or other similar
ones I was thinking of writing.
Christian König March 29, 2017, 9:07 a.m. UTC | #6
Am 29.03.2017 um 10:59 schrieb Michel Dänzer:
> On 28/03/17 08:00 PM, Julien Isorce wrote:
>> On 28 March 2017 at 10:36, Michel Dänzer <michel@daenzer.net
>> <mailto:michel@daenzer.net>> wrote:
>>
>>      On 28/03/17 05:24 PM, Julien Isorce wrote:
>>      > Hi Michel,
>>      >
>>      > About the hard lockup, I noticed that I cannot have it with the
>>      > following conditions:
>>      >
>>      > 1. soft lockup fix (the 0->i change which avoids infinite loop)
>>      > 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
>>      > 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values above do
>>      > not help, for example (1024, 1024) or (1024, 2048))
>>      >
>>      > Without 1 and 2, but with 3, our test reproduces the soft lockup (just
>>      > discovered few days ago).
>>      > Without 3 (and with or without 1., 2.), our test reproduces the hard
>>      > lockup which one does not give any info in kern.log (sometimes some NUL
>>      > ^@ characters but not always).
>>
>>      What exactly does "hard lockup" mean? What are the symptoms?
>>
>>
>> Screens are frozen, cannot ssh, no mouse/keyboard, no kernel panic.
>> Requires hard reboot.
>> After reboot, nothing in /var/crash, nothing in /sys/fs/pstore, nothing
>> in kern.log except sometimes some nul characters ^@.
> Does it still respond to ping when it's hung?
>
>
>> Using a serial console did not show additional debug messages. kgdb was
>> not useful but probably worth another attempt.
> Right.
>
> Anyway, I'm afraid it sounds like it's probably not directly related to
> the issue I was thinking of for my previous test patch or other similar
> ones I was thinking of writing.

Yeah, agree.

In addition to that, a complete crash where you don't even get anything
over the serial console is rather unlikely to be caused by something an
application can do directly.

Possible causes are more likely power management or completely messing 
up a bus system. Have you tried disabling dpm as well?

Christian.
Michel Dänzer March 29, 2017, 9:36 a.m. UTC | #7
On 29/03/17 06:07 PM, Christian König wrote:
> Am 29.03.2017 um 10:59 schrieb Michel Dänzer:
>> On 28/03/17 08:00 PM, Julien Isorce wrote:
>>> On 28 March 2017 at 10:36, Michel Dänzer <michel@daenzer.net
>>> <mailto:michel@daenzer.net>> wrote:
>>>
>>>      On 28/03/17 05:24 PM, Julien Isorce wrote:
>>>      > Hi Michel,
>>>      >
>>>      > About the hard lockup, I noticed that I cannot have it with the
>>>      > following conditions:
>>>      >
>>>      > 1. soft lockup fix (the 0->i change which avoids infinite loop)
>>>      > 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
>>>      > 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values
>>> above do
>>>      > not help, for example (1024, 1024) or (1024, 2048))
>>>      >
>>>      > Without 1 and 2, but with 3, our test reproduces the soft
>>> lockup (just
>>>      > discovered few days ago).
>>>      > Without 3 (and with or without 1., 2.), our test reproduces
>>> the hard
>>>      > lockup which one does not give any info in kern.log (sometimes
>>> some NUL
>>>      > ^@ characters but not always).
>>>
>>>      What exactly does "hard lockup" mean? What are the symptoms?
>>>
>>>
>>> Screens are frozen, cannot ssh, no mouse/keyboard, no kernel panic.
>>> Requires hard reboot.
>>> After reboot, nothing in /var/crash, nothing in /sys/fs/pstore, nothing
>>> in kern.log except sometimes some nul characters ^@.
>> Does it still respond to ping when it's hung?
>>
>>
>>> Using a serial console did not show additional debug messages. kgdb was
>>> not useful but probably worth another attempt.
>> Right.
>>
>> Anyway, I'm afraid it sounds like it's probably not directly related to
>> the issue I was thinking of for my previous test patch or other similar
>> ones I was thinking of writing.
> 
> Yeah, agree.
> 
> Additional to that a complete crash where you don't even get anything
> over serial console is rather unlikely to be cause by something an
> application can do directly.
> 
> Possible causes are more likely power management or completely messing
> up a bus system. Have you tried disabling dpm as well?

Might also be worth trying the amdgpu kernel driver instead of radeon,
not sure how well the former currently works with Cape Verde though.
Nicolai Hähnle March 29, 2017, 5:26 p.m. UTC | #8
On 29.03.2017 11:36, Michel Dänzer wrote:
> On 29/03/17 06:07 PM, Christian König wrote:
>> Am 29.03.2017 um 10:59 schrieb Michel Dänzer:
>>> On 28/03/17 08:00 PM, Julien Isorce wrote:
>>>> On 28 March 2017 at 10:36, Michel Dänzer <michel@daenzer.net
>>>> <mailto:michel@daenzer.net>> wrote:
>>>>
>>>>      On 28/03/17 05:24 PM, Julien Isorce wrote:
>>>>      > Hi Michel,
>>>>      >
>>>>      > About the hard lockup, I noticed that I cannot have it with the
>>>>      > following conditions:
>>>>      >
>>>>      > 1. soft lockup fix (the 0->i change which avoids infinite loop)
>>>>      > 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
>>>>      > 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values
>>>> above do
>>>>      > not help, for example (1024, 1024) or (1024, 2048))
>>>>      >
>>>>      > Without 1 and 2, but with 3, our test reproduces the soft
>>>> lockup (just
>>>>      > discovered few days ago).
>>>>      > Without 3 (and with or without 1., 2.), our test reproduces
>>>> the hard
>>>>      > lockup which one does not give any info in kern.log (sometimes
>>>> some NUL
>>>>      > ^@ characters but not always).
>>>>
>>>>      What exactly does "hard lockup" mean? What are the symptoms?
>>>>
>>>>
>>>> Screens are frozen, cannot ssh, no mouse/keyboard, no kernel panic.
>>>> Requires hard reboot.
>>>> After reboot, nothing in /var/crash, nothing in /sys/fs/pstore, nothing
>>>> in kern.log except sometimes some nul characters ^@.
>>> Does it still respond to ping when it's hung?
>>>
>>>
>>>> Using a serial console did not show additional debug messages. kgdb was
>>>> not useful but probably worth another attempt.
>>> Right.
>>>
>>> Anyway, I'm afraid it sounds like it's probably not directly related to
>>> the issue I was thinking of for my previous test patch or other similar
>>> ones I was thinking of writing.
>>
>> Yeah, agree.
>>
>> Additional to that a complete crash where you don't even get anything
>> over serial console is rather unlikely to be cause by something an
>> application can do directly.
>>
>> Possible causes are more likely power management or completely messing
>> up a bus system. Have you tried disabling dpm as well?
>
> Might also be worth trying the amdgpu kernel driver instead of radeon,
> not sure how well the former currently works with Cape Verde though.

I've recently used it to experiment with the sparse buffer support. It 
worked well enough for that :)

Cheers,
Nicolai
Julien Isorce March 30, 2017, 11:03 a.m. UTC | #9
Thx for the suggestions.

No, it does not respond to ping.

radeon.dpm=0 does not help. But that only tells the driver to use the old
power management, right?
So I tried low, mid and high for /sys/class/drm/card0/device/power_profile
(with power_method set to profile).
With radeon.dpm=1 I tried all values for power_dpm_state and
power_dpm_force_performance_level. Same results.

I also tried the amd-staging-4.9 branch today, with the same result.

Note that this also happens on W600 (verde), W9000 (tahiti) and W9100
(hawaii), with radeonsi driver.

I have opened a bug here: https://bugs.freedesktop.org/show_bug.cgi?id=100465
It contains a test to reproduce the freeze.

Cheers
Julien


Patch

diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index 37d68cd1f272..40d1bb467a71 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -198,7 +198,8 @@  static void radeon_evict_flags(struct ttm_buffer_object *bo,
        case TTM_PL_VRAM:
                if (rbo->rdev->ring[radeon_copy_ring_index(rbo->rdev)].ready == false)
                        radeon_ttm_placement_from_domain(rbo, RADEON_GEM_DOMAIN_CPU);
-               else if (rbo->rdev->mc.visible_vram_size < rbo->rdev->mc.real_vram_size &&
+               else if (!(rbo->flags & RADEON_GEM_CPU_ACCESS) &&
+                        rbo->rdev->mc.visible_vram_size < rbo->rdev->mc.real_vram_size &&
                         bo->mem.start < (rbo->rdev->mc.visible_vram_size >> PAGE_SHIFT)) {
                        unsigned fpfn = rbo->rdev->mc.visible_vram_size >> PAGE_SHIFT;
                        int i;
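
[Editor's sketch] The condition this test patch changes can be read as a three-part predicate: the BO is not flagged for CPU access, the card has VRAM beyond the CPU-visible aperture, and the BO currently starts inside the visible part. The standalone C model below mirrors only that predicate. The struct layouts, MODEL_PAGE_SHIFT (4 KiB pages assumed) and the MODEL_GEM_CPU_ACCESS bit are hypothetical stand-ins, not the driver's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT     12        /* assumption: 4 KiB pages */
#define MODEL_GEM_CPU_ACCESS (1u << 0) /* stand-in bit, not the real value */

/* Hypothetical subset of the buffer object state used by the check. */
struct model_bo {
    uint32_t flags;      /* creation flags */
    uint64_t start_page; /* first page of the BO's current VRAM location */
};

/* Hypothetical subset of the memory controller description, in bytes. */
struct model_mc {
    uint64_t visible_vram_size;
    uint64_t real_vram_size;
};

/*
 * Returns true when the patched branch would be taken, i.e. when eviction
 * would first try the CPU-inaccessible part of VRAM.
 */
static bool try_invisible_vram_first(const struct model_bo *bo,
                                     const struct model_mc *mc)
{
    /* First page frame number beyond the CPU-visible aperture. */
    uint64_t fpfn = mc->visible_vram_size >> MODEL_PAGE_SHIFT;

    return !(bo->flags & MODEL_GEM_CPU_ACCESS) &&
           mc->visible_vram_size < mc->real_vram_size &&
           bo->start_page < fpfn;
}
```

Under this model, a BO created with the CPU-access flag never takes the branch, whatever its current location, which is the behaviour the added !(rbo->flags & RADEON_GEM_CPU_ACCESS) test is meant to give. What such a BO falls through to instead is not visible in the hunk above.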