Message ID | 5f26f8a1-e5bd-ca7a-c7ee-08c8f04f2110@daenzer.net (mailing list archive) |
---|---|
State | New, archived |
Hi Michel,

No, this change does not help with the other issue (hard lockup).
I have not tried it in combination with the 0 -> i change.

Thx anyway.
Julien

On 24 March 2017 at 10:03, Michel Dänzer <michel@daenzer.net> wrote:

> On 24/03/17 12:31 AM, Zachary Michaels wrote:
> >
> > I should also note that we are experiencing another issue where the
> > kernel locks up in similar circumstances. As Julien noted, we get no
> > output, and the watchdogs don't seem to work. It may be the case that
> > Xorg and our process are calling ttm_bo_mem_force_space concurrently,
> > but I don't think we have enough information yet to say for
> > sure. Reverting this commit does not fix that issue. I have some small
> > amount of evidence indicating that bos flagged for CPU access are
> > getting placed in CPU inaccessible memory. Could that cause this sort of
> > kernel lockup?
>
> Possibly, does this help?
>
> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> index 37d68cd1f272..40d1bb467a71 100644
> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> @@ -198,7 +198,8 @@ static void radeon_evict_flags(struct ttm_buffer_object *bo,
>  	case TTM_PL_VRAM:
>  		if (rbo->rdev->ring[radeon_copy_ring_index(rbo->rdev)].ready == false)
>  			radeon_ttm_placement_from_domain(rbo, RADEON_GEM_DOMAIN_CPU);
> -		else if (rbo->rdev->mc.visible_vram_size < rbo->rdev->mc.real_vram_size &&
> +		else if (!(rbo->flags & RADEON_GEM_CPU_ACCESS) &&
> +			 rbo->rdev->mc.visible_vram_size < rbo->rdev->mc.real_vram_size &&
>  			 bo->mem.start < (rbo->rdev->mc.visible_vram_size >> PAGE_SHIFT)) {
>  			unsigned fpfn = rbo->rdev->mc.visible_vram_size >> PAGE_SHIFT;
>  			int i;
>
> --
> Earthling Michel Dänzer | http://www.amd.com
> Libre software enthusiast | Mesa and X developer
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>
Hi Michel,

About the hard lockup, I noticed that I cannot reproduce it under the following conditions:

1. soft lockup fix (the 0->i change which avoids the infinite loop)
2. Your suggestion: the !(rbo->flags & RADEON_GEM_CPU_ACCESS) check
3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values above these do not help, for example (1024, 1024) or (1024, 2048))

Without 1 and 2, but with 3, our test reproduces the soft lockup (just discovered a few days ago).
Without 3 (and with or without 1 and 2), our test reproduces the hard lockup, which leaves no info in kern.log (sometimes some NUL ^@ characters, but not always).

We are converting this repro test to a piglit test in order to share it, but it will take some time. Put simply, it continuously uploads images with a randomly picked size of up to 4K, so TTM's eviction mechanism is hit a lot. (The card is a FirePro W600 Cape Verde 2048M.)

I am happy to try any other suggestion.

Thx
Julien
On 28/03/17 05:24 PM, Julien Isorce wrote:
> Hi Michel,
>
> About the hard lockup, I noticed that I cannot have it with the
> following conditions:
>
> 1. soft lockup fix (the 0->i change which avoids infinite loop)
> 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
> 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values above do
> not help, for example (1024, 1024) or (1024, 2048))
>
> Without 1 and 2, but with 3, our test reproduces the soft lockup (just
> discovered few days ago).
> Without 3 (and with or without 1., 2.), our test reproduces the hard
> lockup which one does not give any info in kern.log (sometimes some NUL
> ^@ characters but not always).

What exactly does "hard lockup" mean? What are the symptoms?
On 28 March 2017 at 10:36, Michel Dänzer <michel@daenzer.net> wrote:

> On 28/03/17 05:24 PM, Julien Isorce wrote:
> > Hi Michel,
> >
> > About the hard lockup, I noticed that I cannot have it with the
> > following conditions:
> >
> > 1. soft lockup fix (the 0->i change which avoids infinite loop)
> > 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
> > 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values above do
> > not help, for example (1024, 1024) or (1024, 2048))
> >
> > Without 1 and 2, but with 3, our test reproduces the soft lockup (just
> > discovered few days ago).
> > Without 3 (and with or without 1., 2.), our test reproduces the hard
> > lockup which one does not give any info in kern.log (sometimes some NUL
> > ^@ characters but not always).
>
> What exactly does "hard lockup" mean? What are the symptoms?
>

Screens are frozen, cannot ssh, no mouse/keyboard, no kernel panic. Requires hard reboot.
After reboot, nothing in /var/crash, nothing in /sys/fs/pstore, nothing in kern.log except sometimes some nul characters ^@.
After adding traces, it looks like the test app was still in an ioctl(RADEON_CS), but that is hard to rely on since this ioctl is called a lot.
Using a serial console did not show additional debug messages. kgdb was not useful but probably worth another attempt.
On 28/03/17 08:00 PM, Julien Isorce wrote:
>
> On 28 March 2017 at 10:36, Michel Dänzer <michel@daenzer.net> wrote:
>
>     On 28/03/17 05:24 PM, Julien Isorce wrote:
>     > Hi Michel,
>     >
>     > About the hard lockup, I noticed that I cannot have it with the
>     > following conditions:
>     >
>     > 1. soft lockup fix (the 0->i change which avoids infinite loop)
>     > 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
>     > 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values above do
>     > not help, for example (1024, 1024) or (1024, 2048))
>     >
>     > Without 1 and 2, but with 3, our test reproduces the soft lockup (just
>     > discovered few days ago).
>     > Without 3 (and with or without 1., 2.), our test reproduces the hard
>     > lockup which one does not give any info in kern.log (sometimes some NUL
>     > ^@ characters but not always).
>
>     What exactly does "hard lockup" mean? What are the symptoms?
>
>
> Screens are frozen, cannot ssh, no mouse/keyboard, no kernel panic.
> Requires hard reboot.
> After reboot, nothing in /var/crash, nothing in /sys/fs/pstore, nothing
> in kern.log except sometimes some nul characters ^@.

Does it still respond to ping when it's hung?

> Using a serial console did not show additional debug messages. kgdb was
> not useful but probably worth another attempt.

Right.

Anyway, I'm afraid it sounds like it's probably not directly related to
the issue I was thinking of for my previous test patch or other similar
ones I was thinking of writing.
On 29.03.2017 at 10:59, Michel Dänzer wrote:
> On 28/03/17 08:00 PM, Julien Isorce wrote:
>> On 28 March 2017 at 10:36, Michel Dänzer <michel@daenzer.net> wrote:
>>
>>     On 28/03/17 05:24 PM, Julien Isorce wrote:
>>     > Hi Michel,
>>     >
>>     > About the hard lockup, I noticed that I cannot have it with the
>>     > following conditions:
>>     >
>>     > 1. soft lockup fix (the 0->i change which avoids infinite loop)
>>     > 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
>>     > 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values above do
>>     > not help, for example (1024, 1024) or (1024, 2048))
>>     >
>>     > Without 1 and 2, but with 3, our test reproduces the soft lockup (just
>>     > discovered few days ago).
>>     > Without 3 (and with or without 1., 2.), our test reproduces the hard
>>     > lockup which one does not give any info in kern.log (sometimes some NUL
>>     > ^@ characters but not always).
>>
>>     What exactly does "hard lockup" mean? What are the symptoms?
>>
>>
>> Screens are frozen, cannot ssh, no mouse/keyboard, no kernel panic.
>> Requires hard reboot.
>> After reboot, nothing in /var/crash, nothing in /sys/fs/pstore, nothing
>> in kern.log except sometimes some nul characters ^@.
> Does it still respond to ping when it's hung?
>
>
>> Using a serial console did not show additional debug messages. kgdb was
>> not useful but probably worth another attempt.
> Right.
>
> Anyway, I'm afraid it sounds like it's probably not directly related to
> the issue I was thinking of for my previous test patch or other similar
> ones I was thinking of writing.

Yeah, agreed.

In addition, a complete crash where you don't even get anything over the serial console is rather unlikely to be caused by something an application can do directly.

Possible causes are more likely power management or completely messing up a bus system. Have you tried disabling dpm as well?

Christian.
On 29/03/17 06:07 PM, Christian König wrote:
> On 29.03.2017 at 10:59, Michel Dänzer wrote:
>> On 28/03/17 08:00 PM, Julien Isorce wrote:
>>> On 28 March 2017 at 10:36, Michel Dänzer <michel@daenzer.net> wrote:
>>>
>>>     On 28/03/17 05:24 PM, Julien Isorce wrote:
>>>     > Hi Michel,
>>>     >
>>>     > About the hard lockup, I noticed that I cannot have it with the
>>>     > following conditions:
>>>     >
>>>     > 1. soft lockup fix (the 0->i change which avoids infinite loop)
>>>     > 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
>>>     > 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values above do
>>>     > not help, for example (1024, 1024) or (1024, 2048))
>>>     >
>>>     > Without 1 and 2, but with 3, our test reproduces the soft lockup (just
>>>     > discovered few days ago).
>>>     > Without 3 (and with or without 1., 2.), our test reproduces the hard
>>>     > lockup which one does not give any info in kern.log (sometimes some NUL
>>>     > ^@ characters but not always).
>>>
>>>     What exactly does "hard lockup" mean? What are the symptoms?
>>>
>>>
>>> Screens are frozen, cannot ssh, no mouse/keyboard, no kernel panic.
>>> Requires hard reboot.
>>> After reboot, nothing in /var/crash, nothing in /sys/fs/pstore, nothing
>>> in kern.log except sometimes some nul characters ^@.
>> Does it still respond to ping when it's hung?
>>
>>
>>> Using a serial console did not show additional debug messages. kgdb was
>>> not useful but probably worth another attempt.
>> Right.
>>
>> Anyway, I'm afraid it sounds like it's probably not directly related to
>> the issue I was thinking of for my previous test patch or other similar
>> ones I was thinking of writing.
>
> Yeah, agree.
>
> Additional to that a complete crash where you don't even get anything
> over serial console is rather unlikely to be cause by something an
> application can do directly.
>
> Possible causes are more likely power management or completely messing
> up a bus system. Have you tried disabling dpm as well?

Might also be worth trying the amdgpu kernel driver instead of radeon,
not sure how well the former currently works with Cape Verde though.
On 29.03.2017 11:36, Michel Dänzer wrote:
> On 29/03/17 06:07 PM, Christian König wrote:
>> On 29.03.2017 at 10:59, Michel Dänzer wrote:
>>> On 28/03/17 08:00 PM, Julien Isorce wrote:
>>>> On 28 March 2017 at 10:36, Michel Dänzer <michel@daenzer.net> wrote:
>>>>
>>>>     On 28/03/17 05:24 PM, Julien Isorce wrote:
>>>>     > Hi Michel,
>>>>     >
>>>>     > About the hard lockup, I noticed that I cannot have it with the
>>>>     > following conditions:
>>>>     >
>>>>     > 1. soft lockup fix (the 0->i change which avoids infinite loop)
>>>>     > 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
>>>>     > 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values above do
>>>>     > not help, for example (1024, 1024) or (1024, 2048))
>>>>     >
>>>>     > Without 1 and 2, but with 3, our test reproduces the soft lockup (just
>>>>     > discovered few days ago).
>>>>     > Without 3 (and with or without 1., 2.), our test reproduces the hard
>>>>     > lockup which one does not give any info in kern.log (sometimes some NUL
>>>>     > ^@ characters but not always).
>>>>
>>>>     What exactly does "hard lockup" mean? What are the symptoms?
>>>>
>>>>
>>>> Screens are frozen, cannot ssh, no mouse/keyboard, no kernel panic.
>>>> Requires hard reboot.
>>>> After reboot, nothing in /var/crash, nothing in /sys/fs/pstore, nothing
>>>> in kern.log except sometimes some nul characters ^@.
>>> Does it still respond to ping when it's hung?
>>>
>>>
>>>> Using a serial console did not show additional debug messages. kgdb was
>>>> not useful but probably worth another attempt.
>>> Right.
>>>
>>> Anyway, I'm afraid it sounds like it's probably not directly related to
>>> the issue I was thinking of for my previous test patch or other similar
>>> ones I was thinking of writing.
>>
>> Yeah, agree.
>>
>> Additional to that a complete crash where you don't even get anything
>> over serial console is rather unlikely to be cause by something an
>> application can do directly.
>>
>> Possible causes are more likely power management or completely messing
>> up a bus system. Have you tried disabling dpm as well?
>
> Might also be worth trying the amdgpu kernel driver instead of radeon,
> not sure how well the former currently works with Cape Verde though.

I've recently used it to experiment with the sparse buffer support. It worked well enough for that :)

Cheers,
Nicolai
Thx for the suggestions.

No, it does not respond to ping.

radeon.dpm=0 does not help. But that only tells the driver to use the old power management, right? So I tried low, mid and high for /sys/class/drm/card0/device/power_profile (with power_method set to profile).
With radeon.dpm=1 I tried all values for power_dpm_state / power_dpm_force_performance_level. Same results.

I also tried the amd-staging-4.9 branch today, with the same result.

Note that this also happens on W600 (verde), W9000 (tahiti) and W9100 (hawaii), with the radeonsi driver.

I have opened a bug here: https://bugs.freedesktop.org/show_bug.cgi?id=100465 . It contains a test to reproduce the freeze.

Cheers
Julien
diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index 37d68cd1f272..40d1bb467a71 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -198,7 +198,8 @@ static void radeon_evict_flags(struct ttm_buffer_object *bo,
 	case TTM_PL_VRAM:
 		if (rbo->rdev->ring[radeon_copy_ring_index(rbo->rdev)].ready == false)
 			radeon_ttm_placement_from_domain(rbo, RADEON_GEM_DOMAIN_CPU);
-		else if (rbo->rdev->mc.visible_vram_size < rbo->rdev->mc.real_vram_size &&
+		else if (!(rbo->flags & RADEON_GEM_CPU_ACCESS) &&
+			 rbo->rdev->mc.visible_vram_size < rbo->rdev->mc.real_vram_size &&
 			 bo->mem.start < (rbo->rdev->mc.visible_vram_size >> PAGE_SHIFT)) {
 			unsigned fpfn = rbo->rdev->mc.visible_vram_size >> PAGE_SHIFT;
 			int i;
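
For readers who want to reason about the test patch outside the kernel, the condition it changes can be modelled in user space. The following is only a sketch under stated assumptions, not the driver code: the struct names, the `SKETCH_` constants (including the flag bit value and the 4 KiB page size) and the 256 MiB visible-VRAM figure are hypothetical stand-ins, and only the shape of the boolean expression is taken from the diff above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only; the kernel has its own definitions. */
#define SKETCH_PAGE_SHIFT     12u        /* assumes 4 KiB pages */
#define SKETCH_GEM_CPU_ACCESS (1u << 3)  /* placeholder for RADEON_GEM_CPU_ACCESS */

/* Hypothetical stand-in for the mc.visible_vram_size / mc.real_vram_size pair. */
struct sketch_vram {
    uint64_t visible_size;  /* bytes of CPU-visible VRAM */
    uint64_t real_size;     /* bytes of total VRAM */
};

/* Hypothetical stand-in for the buffer object fields the condition reads. */
struct sketch_bo {
    uint32_t flags;         /* stands in for rbo->flags */
    uint64_t start_page;    /* stands in for bo->mem.start (in pages) */
};

/* Mirrors "visible_vram_size >> PAGE_SHIFT": first page past the visible window. */
uint64_t first_invisible_pfn(const struct sketch_vram *vram)
{
    return vram->visible_size >> SKETCH_PAGE_SHIFT;
}

/* True when the patched "else if" condition from the diff would hold. */
bool takes_fpfn_branch(const struct sketch_bo *bo, const struct sketch_vram *vram)
{
    return !(bo->flags & SKETCH_GEM_CPU_ACCESS) &&
           vram->visible_size < vram->real_size &&
           bo->start_page < first_invisible_pfn(vram);
}

/* Small usage example with assumed sizes (256 MiB visible, 2048 MiB total). */
void sketch_self_check(void)
{
    struct sketch_vram vram = { 256ull << 20, 2048ull << 20 };
    struct sketch_bo plain = { 0, 10 };
    struct sketch_bo cpu = { SKETCH_GEM_CPU_ACCESS, 10 };

    assert(first_invisible_pfn(&vram) == 65536);
    assert(takes_fpfn_branch(&plain, &vram));   /* old and new condition agree */
    assert(!takes_fpfn_branch(&cpu, &vram));    /* new: CPU-access BOs skip the branch */
}
```

In this model the only behavioural difference introduced by the patch is the second case: a buffer flagged for CPU access no longer enters the branch that computes fpfn from the visible VRAM size. Whether that matters for the hard lockup is exactly what the thread leaves open.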