
[2/2] drm/vram-helper: Alternate between bottom-up and top-down placement

Message ID 20200422144055.27801-3-tzimmermann@suse.de (mailing list archive)
State New, archived
Headers show
Series drm/vram-helper: Reduce memory fragmentation | expand

Commit Message

Thomas Zimmermann April 22, 2020, 2:40 p.m. UTC
With limited VRAM available, fragmentation can lead to OOM errors.
Alternating between bottom-up and top-down placement keeps BOs near the
ends of the VRAM and the available pages consecutively near the middle.

A real-world example with 16 MiB of VRAM is shown below.

  > cat /sys/kernel/debug/dri/0/vram-mm
  0x0000000000000000-0x000000000000057f: 1407: free
  0x000000000000057f-0x0000000000000b5b: 1500: used
  0x0000000000000b5b-0x0000000000000ff0: 1173: free

The first free area was the location of the fbdev framebuffer. The used
area is Weston's current framebuffer of 1500 pages. Weston now cannot
do a pageflip to another 1500 page-wide framebuffer, even though enough
pages are available. The patch resolves this problem to

  > cat /sys/kernel/debug/dri/0/vram-mm
  0x0000000000000000-0x00000000000005dc: 1500: used
  0x00000000000005dc-0x0000000000000a14: 1080: free
  0x0000000000000a14-0x0000000000000ff0: 1500: used

with both of Weston's framebuffers located near the ends of the VRAM
memory.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/drm_gem_vram_helper.c | 28 ++++++++++++++++++++++-----
 include/drm/drm_gem_vram_helper.h     | 10 ++++++----
 2 files changed, 29 insertions(+), 9 deletions(-)

Comments

Gerd Hoffmann April 23, 2020, 11:18 a.m. UTC | #1
On Wed, Apr 22, 2020 at 04:40:55PM +0200, Thomas Zimmermann wrote:
> With limited VRAM available, fragmentation can lead to OOM errors.
> Alternating between bottom-up and top-down placement keeps BOs near the
> ends of the VRAM and the available pages consecutively near the middle.
> 
> A real-world example with 16 MiB of VRAM is shown below.
> 
>   > cat /sys/kernel/debug/dri/0/vram-mm
>   0x0000000000000000-0x000000000000057f: 1407: free
>   0x000000000000057f-0x0000000000000b5b: 1500: used
>   0x0000000000000b5b-0x0000000000000ff0: 1173: free
> 
> The first free area was the location of the fbdev framebuffer. The used
> area is Weston's current framebuffer of 1500 pages. Weston now cannot
> do a pageflip to another 1500 page-wide framebuffer, even though enough
> pages are available. The patch resolves this problem to
> 
>   > cat /sys/kernel/debug/dri/0/vram-mm
>   0x0000000000000000-0x00000000000005dc: 1500: used
>   0x00000000000005dc-0x0000000000000a14: 1080: free
>   0x0000000000000a14-0x0000000000000ff0: 1500: used
> 
> with both of Weston's framebuffers located near the ends of the VRAM
> memory.

I don't think it is that simple.

First:  How will that interact with cursor bo allocations?  IIRC the
strategy for them is to allocate top-down, for similar reasons (avoid
small cursor bo allocs fragmenting vram memory).

Second:  I think ttm will move bo's from vram to system only on memory
pressure.  So you can still end up with fragmented memory.  To make the
scheme with one fb @ top and one @ bottom work reliably you have to be
more aggressive about pushing out framebuffers.

Third:  I'd suggest making topdown allocations depend on the current state
instead of simply alternating, i.e. if there is a pinned framebuffer @
offset 0, then go for top-down.
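
A minimal sketch of that idea, assuming the driver tracks its currently
pinned framebuffer BO (the 'pinned_fb' pointer and the helper name are
assumptions, not part of the patch):

  /*
   * Hypothetical helper: request top-down placement only if a
   * framebuffer BO is already pinned at the bottom of VRAM.
   */
  static unsigned long pick_fb_pl_flags(struct drm_gem_vram_object *pinned_fb)
  {
          if (pinned_fb && drm_gem_vram_offset(pinned_fb) == 0)
                  return TTM_PL_FLAG_VRAM | TTM_PL_FLAG_TOPDOWN;
          return TTM_PL_FLAG_VRAM;
  }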

I also think using this scheme should be optional.  In the simplest case
we can let drivers opt in.  Or we try to do something clever
automatically: using the strategy only for framebuffers larger than 1/4
or 1/3 of total vram memory (which makes alloc failures due to
fragmentation much more likely).

cheers,
  Gerd
Thomas Zimmermann April 23, 2020, 12:44 p.m. UTC | #2
Hi Gerd,

thanks for the feedback.

On 23.04.20 13:18, Gerd Hoffmann wrote:
> On Wed, Apr 22, 2020 at 04:40:55PM +0200, Thomas Zimmermann wrote:
>> With limited VRAM available, fragmentation can lead to OOM errors.
>> Alternating between bottom-up and top-down placement keeps BOs near the
>> ends of the VRAM and the available pages consecutively near the middle.
>>
>> A real-world example with 16 MiB of VRAM is shown below.
>>
>>   > cat /sys/kernel/debug/dri/0/vram-mm
>>   0x0000000000000000-0x000000000000057f: 1407: free
>>   0x000000000000057f-0x0000000000000b5b: 1500: used
>>   0x0000000000000b5b-0x0000000000000ff0: 1173: free
>>
>> The first free area was the location of the fbdev framebuffer. The used
>> area is Weston's current framebuffer of 1500 pages. Weston now cannot
>> do a pageflip to another 1500 page-wide framebuffer, even though enough
>> pages are available. The patch resolves this problem to
>>
>>   > cat /sys/kernel/debug/dri/0/vram-mm
>>   0x0000000000000000-0x00000000000005dc: 1500: used
>>   0x00000000000005dc-0x0000000000000a14: 1080: free
>>   0x0000000000000a14-0x0000000000000ff0: 1500: used
>>
>> with both of Weston's framebuffers located near the ends of the VRAM
>> memory.
> 
> I don't think it is that simple.
> 
> First:  How will that interact with cursor bo allocations?  IIRC the
> strategy for them is to allocate top-down, for similar reasons (avoid
> small cursor bo allocs fragment vram memory).

In ast, 2 cursor BOs are allocated during driver initialization and kept
permanently at the vram's top end. I don't know about other drivers.

But cursor BOs are small, so they don't make much of a difference. What
is needed is space for 2 primary framebuffers during pageflips, with one
of them pinned. The other framebuffer can be located anywhere.


> 
> Second:  I think ttm will move bo's from vram to system only on memory
> pressure.  So you can still end up with fragmented memory.  To make the
> scheme with one fb @ top and one @ bottom work reliable you have to be
> more aggressive on pushing out framebuffers.

I'm in the process of converting mgag200 to atomic modesetting. The given
example is what I observed. I'm not claiming that the placement scheme
is perfect, but it is required to get mgag200 working with atomic
modesetting's pageflip logic. So we're solving a real problem here.

The bug comes from Weston's allocation strategy. Looking at the debug
output:

>>   0x0000000000000000-0x000000000000057f: 1407: free

This was fbdev's framebuffer with 1600x900@32bpp

>>   0x000000000000057f-0x0000000000000b5b: 1500: used

This is Weston's framebuffer also with 1600x900@32bpp. But Weston
allocates an additional, unused 60 scanlines. That is to render with
tiles of 64x64px, I suppose. fbdev doesn't do that, hence Weston's
second framebuffer doesn't fit into the free location of the fbdev
framebuffer.

The other drivers with a small amount of vram are also prone to this
problem. They simply have not yet encountered such a setup.


> 
> Third:  I'd suggest make topdown allocations depending on current state
> instead of simply alternating, i.e. if there is a pinned framebuffer @
> offset 0, then go for top-down.

That's what the current patch does. If the last pin was at the bottom,
the next goes to the top. And then the other way around. Without
alternating between both ends of vram, the problem would occur again when
fragmentation happens near the top end.


> 
> I also think using this scheme should be optional.  In the simplest case
> we can allow drivers opt-in.  Or we try do to something clever
> automatically: using the strategy only for framebuffers larger than 1/4
> or 1/3 of total vram memory (which makes alloc failures due to
> fragmentation much more likely).

I'd like to not change behavior automatically, but we can surely make
this optional.

Looking again at the vram helpers, this functionality could be
implemented in drm_gem_vram_plane_helper_prepare_fb(). Drivers with
other placement strategies could implement their own helper for prepare_fb.
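
For illustration, a rough sketch of such a driver-private prepare_fb
(error handling omitted; the function name is made up and this is an
assumption about how a driver could opt out, not code from the series):

  static int my_plane_helper_prepare_fb(struct drm_plane *plane,
                                        struct drm_plane_state *new_state)
  {
          struct drm_gem_vram_object *gbo;

          if (!new_state->fb)
                  return 0;
          /* pin with plain VRAM placement, no alternating top-down flag */
          gbo = drm_gem_vram_of_gem(new_state->fb->obj[0]);
          return drm_gem_vram_pin(gbo, TTM_PL_FLAG_VRAM);
  }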

Best regards
Thomas


> 
> cheers,
>   Gerd
>
Gerd Hoffmann April 23, 2020, 1:57 p.m. UTC | #3
> > I don't think it is that simple.
> > 
> > First:  How will that interact with cursor bo allocations?  IIRC the
> > strategy for them is to allocate top-down, for similar reasons (avoid
> > small cursor bo allocs fragment vram memory).
> 
> In ast, 2 cursor BOs are allocated during driver initialization and kept
> permanently at the vram's top end. I don't know about other drivers.

One-time allocation at init time shouldn't be a problem.

> But cursor BOs are small, so they don't make much of a difference. What
> is needed is space for 2 primary framebuffers during pageflips, with one
> of them pinned. The other framebuffer can be located anywhere.

The problem isn't the size.  The problem is dynamically allocated cursor
BOs can also fragment vram, especially if top-bottom allocation is also
used for large framebuffers so cursor BOs can end up somewhere in the
middle of vram.

> > Second:  I think ttm will move bo's from vram to system only on memory
> > pressure.  So you can still end up with fragmented memory.  To make the
> > scheme with one fb @ top and one @ bottom work reliable you have to be
> > more aggressive on pushing out framebuffers.
> 
> I'm the process of converting mgag200 to atomic modesetting. The given
> example is what I observed. I'm not claiming that the placement scheme
> is perfect, but it is required to get mgag200 working with atomic
> modesetting's pageflip logic. So we're solving a real problem here.

I don't doubt this is a real problem.

> The bug comes from Weston's allocation strategy. Looking at the debug
> output:
> 
> >>   0x0000000000000000-0x000000000000057f: 1407: free
> 
> This was fbdev's framebuffer with 1600x900@32bpp
> 
> >>   0x000000000000057f-0x0000000000000b5b: 1500: used
> 
> This is Weston's framebuffer also with 1600x900@32bpp. But Weston
> allocates an additional, unused 60 scanlines. That is to render with
> tiles of 64x64px, I suppose. fbdev doesn't do that, hence Weston's
> second framebuffer doesn't fit into the free location of the fbdev
> framebuffer.

Sure.  Assume there is just enough vram to fit in fbdev and two weston
framebuffers.  fbdev is allocated from bottom, first weston fb from top,
second weston fb from bottom again.  fbdev is not pushed out (no memory
pressure yet) so the second weston fb ends up in the middle of vram
fragmenting things.  And now you are again in the situation where you
might have enough free vram for an allocation but can't use it due to
fragmentation (probably harder to trigger in practice though).

That's why I would suggest to explicitly move out unpinned framebuffers
(aka fbdev) before pinning a new one (second weston fb) instead of
depending on ttm moving things out on OOM, to make sure you never
allocate something in the middle of vram.

> > Third:  I'd suggest make topdown allocations depending on current state
> > instead of simply alternating, i.e. if there is a pinned framebuffer @
> > offset 0, then go for top-down.
> 
> That's what the current patch does. If the last pin was at the bottom,
> the next goes to the top. And then the other way around. Without
> alternating between both end of vram, the problem would occur again when
> fragmentation happens near the top end.

I'd feel better checking the state of my current pins to figure out
whether I should alloc top-down or not, for robustness reasons.

> Looking again at the vram helpers, this functionality could be
> implemented in drm_gem_vram_plane_helper_prepare_fb(). Drivers with
> other placement strategies could implement their own helper for prepare_fb.

vram helpers could also simply offer two prepare_fb variants.

cheers,
  Gerd
Christian König April 23, 2020, 3:04 p.m. UTC | #4
Hi guys,

one thing you don't seem to have considered yet is that not setting the 
top-down flag also won't get you the bottom-up approach, but rather the 
best fit.

In addition to that, the top-down allocations have quite a bit more CPU
overhead, since they don't walk the size tree but the address tree in
search of a fitting hole. Nirmoy is currently working on improving this.

On 23.04.20 15:57, Gerd Hoffmann wrote:
>> But cursor BOs are small, so they don't make much of a difference. What
>> is needed is space for 2 primary framebuffers during pageflips, with one
>> of them pinned. The other framebuffer can be located anywhere.
> The problem isn't the size.  The problem is dynamically allocated cursor
> BOs can also fragment vram, especially if top-bottom allocation is also
> used for large framebuffers so cursor BOs can end up somewhere in the
> middle of vram.

Well maybe instead of a ping/pong approach between top and bottom use a 
size based approach to place small BOs at the top and only the larger 
ones at the bottom (or the other way around).

Cheers,
Christian.
Thomas Zimmermann April 24, 2020, 6:27 a.m. UTC | #5
Hi Christian

On 23.04.20 17:04, Christian König wrote:
> Hi guys,
> 
> one thing you don't seem to have considered yet is that not setting the
> top-down flag also won't get you the bottom-up approach, but rather the
> best fit.

Kind of unrelated: Would it pick the smallest free area that can hold a
BO? IOW, would a small cursor BO fill up a small free area even if
there's a much larger free area somewhere?


> 
> Additional to that the top-down allocations have quite a bit more CPU
> overhead since they don't walk the size tree, but the address tree in
> search for a fitting hole. Nirmoy is currently working on improving this.

I'm not worried about that. Compositors alternate between only two
framebuffer instances. Once their BOs have been placed, they remain in vram.


> 
> On 23.04.20 15:57, Gerd Hoffmann wrote:
>>> But cursor BOs are small, so they don't make much of a difference. What
>>> is needed is space for 2 primary framebuffers during pageflips, with one
>>> of them pinned. The other framebuffer can be located anywhere.
>> The problem isn't the size.  The problem is dynamically allocated cursor
>> BOs can also fragment vram, especially if top-bottom allocation is also
>> used for large framebuffers so cursor BOs can end up somewhere in the
>> middle of vram.
> 
> Well maybe instead of a ping/pong approach between top and bottom use a
> size based approach to place small BOs at the top and only the larger
> ones at the bottom (or the other way around).

That's what the current code does and it results in the OOM. Basically,
there's a large BO in the middle of vram and the pageflip's destination
BO does not fit before or after the large one.

Best regards
Thomas

> 
> Cheers,
> Christian.
Christian König April 24, 2020, 6:56 a.m. UTC | #6
On 24.04.20 08:27, Thomas Zimmermann wrote:
> Hi Christian
>
> On 23.04.20 17:04, Christian König wrote:
>> Hi guys,
>>
>> one thing you don't seem to have considered yet is that not setting the
>> top-down flag also won't get you the bottom-up approach, but rather the
>> best fit.
> Kind of unrelated: Would it pick the smallest free area that can hold a
> BO? IOW, would a small cursor BO fill up a small free area even if
> there's a much larger free area somewhere?

Yes, exactly that.

The bottom-up method searches for the lowest hole which can fit the 
requirements.

The top-down method searches for the highest hole which can fit the 
requirements.

Both of those walk the holes by the address index tree, but there is 
also the best method which looks at the holes by their size and tries to 
find the best fit.

The best method usually only needs a single tree walk instead of a 
linear search.
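
For reference, this maps onto drm_mm's insertion modes roughly like this
(an illustrative fragment, not code from the patch; 'mm' and 'size' are
placeholders, and inserting the same node three times is just to show the
modes side by side):

  struct drm_mm_node node = {};

  /* best fit: walk the size tree, pick the smallest hole that fits */
  drm_mm_insert_node_in_range(mm, &node, size, 0, 0,
                              0, U64_MAX, DRM_MM_INSERT_BEST);

  /* bottom-up: walk holes by address, lowest hole first */
  drm_mm_insert_node_in_range(mm, &node, size, 0, 0,
                              0, U64_MAX, DRM_MM_INSERT_LOW);

  /* top-down: walk holes by address, highest hole first */
  drm_mm_insert_node_in_range(mm, &node, size, 0, 0,
                              0, U64_MAX, DRM_MM_INSERT_HIGH);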

>> On 23.04.20 15:57, Gerd Hoffmann wrote:
>>>> But cursor BOs are small, so they don't make much of a difference. What
>>>> is needed is space for 2 primary framebuffers during pageflips, with one
>>>> of them pinned. The other framebuffer can be located anywhere.
>>> The problem isn't the size.  The problem is dynamically allocated cursor
>>> BOs can also fragment vram, especially if top-bottom allocation is also
>>> used for large framebuffers so cursor BOs can end up somewhere in the
>>> middle of vram.
>> Well maybe instead of a ping/pong approach between top and bottom use a
>> size based approach to place small BOs at the top and only the larger
>> ones at the bottom (or the other way around).
> That's what the current code does and it results in the OOM. Basically,
> there's a large BO in the middle of vram and the pageflip's destination
> BO does not fit before or after the large one.

Tough problem. No idea how to otherwise fix this without making the 
large BO moveable somehow.

Christian.

>
> Best regards
> Thomas
>
>> Cheers,
>> Christian.
Thomas Zimmermann April 24, 2020, 6:59 a.m. UTC | #7
Hi Gerd

On 23.04.20 15:57, Gerd Hoffmann wrote:
>>> I don't think it is that simple.
>>>
>>> First:  How will that interact with cursor bo allocations?  IIRC the
>>> strategy for them is to allocate top-down, for similar reasons (avoid
>>> small cursor bo allocs fragment vram memory).
>>
>> In ast, 2 cursor BOs are allocated during driver initialization and kept
>> permanently at the vram's top end. I don't know about other drivers.
> 
> One-time allocation at init time shouldn't be a problem.
> 
>> But cursor BOs are small, so they don't make much of a difference. What
>> is needed is space for 2 primary framebuffers during pageflips, with one
>> of them pinned. The other framebuffer can be located anywhere.
> 
> The problem isn't the size.  The problem is dynamically allocated cursor
> BOs can also fragment vram, especially if top-bottom allocation is also
> used for large framebuffers so cursor BOs can end up somewhere in the
> middle of vram.

But the problem is the size. Cursor BOs are unrelated. Of the
vram-helper-based drivers, only ast and mgag200 use cursors, and both
perma-pin them to the end of vram.


> 
>>> Second:  I think ttm will move bo's from vram to system only on memory
>>> pressure.  So you can still end up with fragmented memory.  To make the
>>> scheme with one fb @ top and one @ bottom work reliable you have to be
>>> more aggressive on pushing out framebuffers.
>>
>> I'm the process of converting mgag200 to atomic modesetting. The given
>> example is what I observed. I'm not claiming that the placement scheme
>> is perfect, but it is required to get mgag200 working with atomic
>> modesetting's pageflip logic. So we're solving a real problem here.
> 
> I don't doubt this is a real problem.
> 
>> The bug comes from Weston's allocation strategy. Looking at the debug
>> output:
>>
>>>>   0x0000000000000000-0x000000000000057f: 1407: free
>>
>> This was fbdev's framebuffer with 1600x900@32bpp
>>
>>>>   0x000000000000057f-0x0000000000000b5b: 1500: used
>>
>> This is Weston's framebuffer also with 1600x900@32bpp. But Weston
>> allocates an additional, unused 60 scanlines. That is to render with
>> tiles of 64x64px, I suppose. fbdev doesn't do that, hence Weston's
>> second framebuffer doesn't fit into the free location of the fbdev
>> framebuffer.
> 
> Sure.  Assume there is just enough vram to fit in fbdev and two weston
> framebuffers.  fbdev is allocated from bottom, first weston fb from top,
> second weston fb from bottom again.  fbdev is not pushed out (no memory
> pressure yet) so the second weston fb ends up in the middle of vram
> fragmenting things.  And now you are again in the situation where you
> might have enough free vram for an allocation but can't use it due to
> fragmention (probably harder to trigger in practice though).

Not quite. Framebuffer BOs of the current or smaller size will fit into
vram. It's only a problem if the size of the new framebuffer BO
increases. And that is exactly what currently happens with mgag200.

That aside, it's not a fair point, as you constructed an example that no
memory manager can resolve.

> 
> That's why I would suggest to explicitly move out unpinned framebuffers
> (aka fbdev) before pinning a new one (second weston fb) instead of
> depending on ttm moving things out on OOM, to make sure you never
> allocate something in the middle of vram.

We cannot do that. Evicting BOs from vram involves an unmap operation.
We did that in an earlier version of the code and received reports about
performance regressions and CPU cycles spent in TLB updates.

So we added the lazy-unmap feature, where BOs are only unmapped and
evicted when necessary. I think it was even you who suggested this idea. :)

> 
>>> Third:  I'd suggest make topdown allocations depending on current state
>>> instead of simply alternating, i.e. if there is a pinned framebuffer @
>>> offset 0, then go for top-down.
>>
>> That's what the current patch does. If the last pin was at the bottom,
>> the next goes to the top. And then the other way around. Without
>> alternating between both end of vram, the problem would occur again when
>> fragmentation happens near the top end.
> 
> I'd feel better when checking the state of my current pins to figure
> whenever I should alloc top-bottom or not, for robustness reasons.

I don't understand why this is more robust. For example, if you pin a
larger number of BOs, and then evict every other BO, there will always
be free areas among the remaining BOs. All that changes is the location
of those areas.

For a strategy, one cannot look at the BO size alone. In my initial
example, BOs are ~6 MiB and vram is 16 MiB. So any strategy à la 'choose
top-down for BOs >1/3 of vram' selects top-down for all framebuffer
BOs. This would result in the same OOM, but from top to bottom.

At some point one has to choose to switch to top-down, and then back
again at one of the next BOs. So the current patch effectively splits
vram into a lower half and an upper half and puts BOs in alternating halves.

Best regards
Thomas

> 
>> Looking again at the vram helpers, this functionality could be
>> implemented in drm_gem_vram_plane_helper_prepare_fb(). Drivers with
>> other placement strategies could implement their own helper for prepare_fb.
> 
> vram helpers could also simply offer two prepare_fb variants.
> 
> cheers,
>   Gerd
> 
Thomas Zimmermann April 24, 2020, 7:03 a.m. UTC | #8
Hi

On 24.04.20 08:56, Christian König wrote:
> On 24.04.20 08:27, Thomas Zimmermann wrote:
>> Hi Christian
>>
>> On 23.04.20 17:04, Christian König wrote:
>>> Hi guys,
>>>
>>> one thing you don't seem to have considered yet is that not setting the
>>> top-down flag also won't get you the bottom-up approach, but rather the
>>> best fit.
>> Kind of unrelated: Would it pick the smallest free area that can hold a
>> BO? IOW, would a small cursor BO fill up a small free area even if
>> there's a much larger free area somewhere?
> 
> Yes, exactly that.

I see.

> 
> The bottom-up method searches for the lowest hole which can fit the
> requirements.

But there's no way to explicitly select bottom-up? What is
TTM_PL_FLAG_CONTIGUOUS doing?

Best regards
Thomas


> 
> The top-down method searches for the highest hole which can fit the
> requirements.
> 
> Both of those walk the holes by the address index tree, but there is
> also the best method which looks at the holes by their size and tries to
> find the best fit.
> 
> The best method usually only needs a single tree walk instead of a
> linear search.
> 
>>> On 23.04.20 15:57, Gerd Hoffmann wrote:
>>>>> But cursor BOs are small, so they don't make much of a difference.
>>>>> What
>>>>> is needed is space for 2 primary framebuffers during pageflips,
>>>>> with one
>>>>> of them pinned. The other framebuffer can be located anywhere.
>>>> The problem isn't the size.  The problem is dynamically allocated
>>>> cursor
>>>> BOs can also fragment vram, especially if top-bottom allocation is also
>>>> used for large framebuffers so cursor BOs can end up somewhere in the
>>>> middle of vram.
>>> Well maybe instead of a ping/pong approach between top and bottom use a
>>> size based approach to place small BOs at the top and only the larger
>>> ones at the bottom (or the other way around).
>> That's what the current code does and it results in the OOM. Basically,
>> there's a large BO in the middle of vram and the pageflip's destination
>> BO does not fit before or after the large one.
> 
> Tough problem. No idea how to otherwise fix this without making the
> large BO moveable somehow.
> 
> Christian.
> 
>>
>> Best regards
>> Thomas
>>
>>> Cheers,
>>> Christian.
Christian König April 24, 2020, 7:39 a.m. UTC | #9
On 24.04.20 09:03, Thomas Zimmermann wrote:
> Hi
>
> On 24.04.20 08:56, Christian König wrote:
>> On 24.04.20 08:27, Thomas Zimmermann wrote:
>>> Hi Christian
>>>
>>> On 23.04.20 17:04, Christian König wrote:
>>>> Hi guys,
>>>>
>>>> one thing you don't seem to have considered yet is that not setting the
>>>> top-down flag also won't get you the bottom-up approach, but rather the
>>>> best fit.
>>> Kind of unrelated: Would it pick the smallest free area that can hold a
>>> BO? IOW, would a small cursor BO fill up a small free area even if
>>> there's a much larger free area somewhere?
>> Yes, exactly that.
> I see.
>
>> The bottom-up method searches for the lowest hole which can fit the
>> requirements.
> But there's no way to explictely select bottom-up? What is
> TTM_PL_FLAG_CONTIGUOUS doing?

Making sure that we allocate contiguously for amdgpu. Added that because I
couldn't fix a few problems otherwise.

Should probably be a driver-private placement flag instead, but I didn't
have time to clean that up.

Regards,
Christian.

>
> Best regards
> Thomas
>
>
>> The top-down method searches for the highest hole which can fit the
>> requirements.
>>
>> Both of those walk the holes by the address index tree, but there is
>> also the best method which looks at the holes by their size and tries to
>> find the best fit.
>>
>> The best method usually only needs a single tree walk instead of a
>> linear search.
>>
>>>> On 23.04.20 15:57, Gerd Hoffmann wrote:
>>>>>> But cursor BOs are small, so they don't make much of a difference.
>>>>>> What
>>>>>> is needed is space for 2 primary framebuffers during pageflips,
>>>>>> with one
>>>>>> of them pinned. The other framebuffer can be located anywhere.
>>>>> The problem isn't the size.  The problem is dynamically allocated
>>>>> cursor
>>>>> BOs can also fragment vram, especially if top-bottom allocation is also
>>>>> used for large framebuffers so cursor BOs can end up somewhere in the
>>>>> middle of vram.
>>>> Well maybe instead of a ping/pong approach between top and bottom use a
>>>> size based approach to place small BOs at the top and only the larger
>>>> ones at the bottom (or the other way around).
>>> That's what the current code does and it results in the OOM. Basically,
>>> there's a large BO in the middle of vram and the pageflip's destination
>>> BO does not fit before or after the large one.
>> Tough problem. No idea how to otherwise fix this without making the
>> large BO moveable somehow.
>>
>> Christian.
>>
>>> Best regards
>>> Thomas
>>>
>>>> Cheers,
>>>> Christian.
Gerd Hoffmann April 24, 2020, 9:38 a.m. UTC | #10
Hi,

> At some point one has to choose to switch to top-down, and then back
> again at one of the next BOs. So the current patch effectively splits
> vram into a lower half and an upper half and puts BOs in alternating halves.

Hmm, so maybe just make the split explicit instead of playing tricks
with top_bottom?  Register the lower vram half as TT_VRAM, register the
upper half as TT_PRIV.  That'll make sure you never have a big
allocation in the middle blocking things.  Or maybe add an allocation
strategy flag to ttm which effectively does the same.
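
A rough sketch of what the explicit split could look like at init time
(an assumption, not posted code; the driver would additionally have to
translate TTM_PL_PRIV offsets into the upper half of the aperture in its
io_mem_reserve/move code):

  /* split the VRAM pages evenly between two TTM memory regions */
  unsigned long half = (vmm->vram_size >> PAGE_SHIFT) / 2;
  int ret;

  ret = ttm_bo_init_mm(&vmm->bdev, TTM_PL_VRAM, half);  /* lower half */
  if (ret)
          return ret;
  ret = ttm_bo_init_mm(&vmm->bdev, TTM_PL_PRIV, half);  /* upper half */
  if (ret)
          return ret;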

take care,
  Gerd
Thomas Zimmermann April 28, 2020, 3:16 p.m. UTC | #11
Hi Gerd

On 24.04.20 11:38, Gerd Hoffmann wrote:
>   Hi,
> 
>> At some point one has to choose to switch to top-down, and then back
>> again at one of the next BOs. So the current patch effectively splits
>> vram into a lower half and an upper half and puts BOs in alternating halves.
> 
> Hmm, so maybe just make the split explicit instead of playing tricks
> with top_bottom?  Register the lower vram half as TT_VRAM, register the
> upper half as TT_PRIV.  That'll make sure you never have a big
> allocation in the middle blocking things.  Or maybe add a allocation
> strategy flag to ttm which effectively does the same.

It's not that easy. Current cursors in ast are statically allocated. As
soon as you add dynamic cursors into the mix, you'd get OOMs. If the
framebuffer BO goes into VRAM and the cursor BO goes into PRIV, pinning
the next framebuffer BO during a pageflip would send it to VRAM. But
VRAM is still filled with the first framebuffer BO. So the memory
manager needs some flexibility with the placement. Handling all that is
already TTM's job.

I'm preparing v2 of this patch set. It'll get static cursors out of the
way and make the feature opt-in.

Best regards
Thomas

> 
> take care,
>   Gerd
>
Gerd Hoffmann April 29, 2020, 7:25 a.m. UTC | #12
Hi,

> It's not that easy. Current cursors n ast are statically allocated. As
> soon as you add dynamic cursors into the mix, you'd get OOMs.

Well, with the split you can simply allocate dynamic cursors with
top-down to keep them out of the way.  It's not 100% perfect: the area
where the cursors are allocated from has a bit less free vram, so the
split effectively isn't 50/50 but more like 49/51.  But hey, still a lot
better than what we have today.

With two static cursors you can allocate one from TT_VRAM and one from
TT_PRIV so the remaining free vram is the same in both regions.

> If the framebuffer BO goes into VRAM and the cursor BO goes into PRIV,
> pinning the next framebuffer BO during a pageflip would send it to
> VRAM.

Oh, it seems my idea outline was a bit too short.  The plan is *not* to
alternate between VRAM and PRIV on allocations.  The plan is to add both
PRIV and VRAM to the placement array when pinning the framebuffer.

ttm can simply place the framebuffers as it pleases and where it wants.
Due to the split it can't do a big blocking allocation in the middle
of vram though.
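
A sketch of such a placement array, mirroring the style of the patch
below, with 'c' indexing into gbo->placements as there (an assumption
about the idea, not posted code):

  /* let ttm pick either half of vram for the framebuffer bo */
  gbo->placements[c++].flags = TTM_PL_FLAG_WC |
                               TTM_PL_FLAG_UNCACHED |
                               TTM_PL_FLAG_VRAM;   /* lower half */
  gbo->placements[c++].flags = TTM_PL_FLAG_WC |
                               TTM_PL_FLAG_UNCACHED |
                               TTM_PL_FLAG_PRIV;   /* upper half */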

If you are going to pageflip you should have one framebuffer already
pinned (the one currently displayed).  If that happens to live in TT_VRAM,
ttm should be able to make room in TT_PRIV to pin the new framebuffer
you want to pageflip to, and vice versa.

ttm will only evict unpinned BOs if needed; when running with smaller
framebuffers ttm will happily keep more than two framebuffers in vram.
All fully automatic without vram helpers having to pay much attention
to the allocation strategy.

> I'm preparing v2 of this patch set. It'll get static cursors out of the
> way and make the feature opt-in.

I think with the split model there is no need to make that optional.

cheers,
  Gerd
Christian König April 29, 2020, 9:11 a.m. UTC | #13
On 29.04.20 09:25, Gerd Hoffmann wrote:
>    Hi,
>
>> It's not that easy. Current cursors n ast are statically allocated. As
>> soon as you add dynamic cursors into the mix, you'd get OOMs.
> Well, with the split you can simply allocate dynamic cursors with
> top-bottom to keep them out of the way.  It's not 100% perfect, the area
> where the cursors are allocated from has a bit less free vram, so the
> split effectively isn't 50/50 but more like 49/51.  But hey, still alot
> better than what we have today.
>
> With two static cursors you can allocate one from TT_VRAM and one from
> TT_PRIV so the remaining free vram is the same in both regions.
>
>> If the framebuffer BO goes into VRAM and the cursor BO goes into PRIV,
>> pinning the next framebuffer BO during a pageflip would send it to
>> VRAM.
> Oh, seems my idea outline was a bit to short.  The plan is *not* to
> alternate between VRAM and PRIV on allocations.  The plan is to add both
> PRIV and VRAM to the placement array when pinning the framebuffer.
>
> ttm can simply place the framebuffers as it pleases and where it wants.
> Due to the split it can't do a big blocking allocation in the middle
> of vram though.

There is an easier and most likely less CPU-intensive way of doing this.

All you need to do is make allocations which are below half of your
address space allocate from the beginning of a hole, and if they are
above half of the address space, from the end of a hole.

This way you should always keep the largest hole in the middle.

This can be done with another flag to drm_mm and would most likely be
a nice addition to other drivers as well.
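
A rough sketch of that heuristic (hypothetical; drm_mm has no such flag
today, and 'hole_start', 'hole_end', 'size' and 'mm_size' stand in for
the values the allocator sees while walking the holes):

  /* keep the largest hole in the middle of the address space */
  u64 half = mm_size / 2;
  u64 node_start;

  if (hole_start < half)
          node_start = hole_start;        /* fill lower holes upwards */
  else
          node_start = hole_end - size;   /* fill upper holes downwards */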

Regards,
Christian.

>
> If you are going to pageflip you should have one framebuffer already
> pinned (the one currently displayed).  If that happens to live TT_VRAM
> ttm should be able to make room in TT_PRIV to pin the new framebuffer
> you want pageflip to, and visa-versa.
>
> ttm will only evict unpinned BOs if needed, when running with smaller
> framebuffers ttm will happily keep more than two framebuffers in vram.
> All fully automatic without vram helpers having to pay much attention
> to the allocation strategy.
>
>> I'm preparing v2 of this patch set. It'll get static cursors out of the
>> way and make the feature opt-in.
> I think with the split model there is no need to make that optional.
>
> cheers,
>    Gerd
>

Patch

diff --git a/drivers/gpu/drm/drm_gem_vram_helper.c b/drivers/gpu/drm/drm_gem_vram_helper.c
index 5d5bfb38bbed4..2d0b7474288de 100644
--- a/drivers/gpu/drm/drm_gem_vram_helper.c
+++ b/drivers/gpu/drm/drm_gem_vram_helper.c
@@ -146,15 +146,33 @@  static void drm_gem_vram_placement(struct drm_gem_vram_object *gbo,
 	unsigned int i;
 	unsigned int c = 0;
 	u32 invariant_flags = pl_flag & TTM_PL_FLAG_TOPDOWN;
+	struct drm_device *dev = gbo->bo.base.dev;
+	struct drm_vram_mm *vmm = dev->vram_mm;
 
 	gbo->placement.placement = gbo->placements;
 	gbo->placement.busy_placement = gbo->placements;
 
-	if (pl_flag & TTM_PL_FLAG_VRAM)
-		gbo->placements[c++].flags = TTM_PL_FLAG_WC |
-					     TTM_PL_FLAG_UNCACHED |
-					     TTM_PL_FLAG_VRAM |
-					     invariant_flags;
+	if (pl_flag & TTM_PL_FLAG_VRAM) {
+		/*
+		 * We usually have at most 2 pinned BOs during pageflips,
+		 * plus cursor BOs. Even with a significant number of free
+		 * pages, always placing bottom-up can lead to fragmentation
+		 * and OOM situations. So if there's no explicit request for
+		 * top-down placement, we alternatingly place BOs bottom-up
+		 * and top-down. The placement strategy should help to keep
+		 * free VRAM pages available near the middle of the VRAM.
+		 */
+		gbo->placements[c].flags = TTM_PL_FLAG_WC |
+					   TTM_PL_FLAG_UNCACHED |
+					   TTM_PL_FLAG_VRAM |
+					   invariant_flags;
+		if (!(invariant_flags & TTM_PL_FLAG_TOPDOWN)) {
+			if (vmm->place_topdown)
+				gbo->placements[c].flags |= TTM_PL_FLAG_TOPDOWN;
+			vmm->place_topdown = !vmm->place_topdown;
+		}
+		++c;
+	}
 
 	if (pl_flag & TTM_PL_FLAG_SYSTEM)
 		gbo->placements[c++].flags = TTM_PL_MASK_CACHING |
diff --git a/include/drm/drm_gem_vram_helper.h b/include/drm/drm_gem_vram_helper.h
index b63bcd1b996da..04767d0ff23a6 100644
--- a/include/drm/drm_gem_vram_helper.h
+++ b/include/drm/drm_gem_vram_helper.h
@@ -165,10 +165,10 @@  void drm_gem_vram_simple_display_pipe_cleanup_fb(
 
 /**
  * struct drm_vram_mm - An instance of VRAM MM
- * @vram_base:	Base address of the managed video memory
- * @vram_size:	Size of the managed video memory in bytes
- * @bdev:	The TTM BO device.
- * @funcs:	TTM BO functions
+ * @vram_base:		Base address of the managed video memory
+ * @vram_size:		Size of the managed video memory in bytes
+ * @bdev:		The TTM BO device
+ * @place_topdown:	Flags the next BO to be placed at the VRAM's high end
  *
  * The fields &struct drm_vram_mm.vram_base and
  * &struct drm_vram_mm.vrm_size are managed by VRAM MM, but are
@@ -180,6 +180,8 @@  struct drm_vram_mm {
 	size_t vram_size;
 
 	struct ttm_bo_device bdev;
+
+	bool place_topdown;
 };
 
 /**