
[3/3] drm/ttm: make up to 90% of system memory available

Message ID 20201117140615.255887-3-christian.koenig@amd.com (mailing list archive)
State New, archived
Series [1/3] drm/ttm: rework ttm_tt page limit

Commit Message

Christian König Nov. 17, 2020, 2:06 p.m. UTC
Increase the amount of system memory drivers can use to about 90% of
the total available.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Michel Dänzer Nov. 17, 2020, 4:38 p.m. UTC | #1
On 2020-11-17 3:06 p.m., Christian König wrote:
> Increase the ammount of system memory drivers can use to about 90% of
> the total available.
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>   drivers/gpu/drm/ttm/ttm_bo.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index a958135cb3fe..0a93df93dba4 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -1267,7 +1267,7 @@ static int ttm_bo_global_init(void)
>   	 * the available system memory.
>   	 */
>   	num_pages = (u64)si.totalram * si.mem_unit;
> -	num_pages = (num_pages * 50 / 100) >> PAGE_SHIFT;
> +	num_pages = (num_pages * 90 / 100) >> PAGE_SHIFT;
>   
>   	/* But for DMA32 we limit ourself to only use 2GiB maximum. */
>   	num_dma32_pages = (u64)(si.totalram - si.totalhigh) * si.mem_unit;
> 

This should update the comment added in patch 1.
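(For reference, the hunk with the comment brought in line might look roughly like the following; patch 1's actual comment text isn't visible in this thread, so the wording below is illustrative only.)

	/*
	 * Illustrative wording only -- patch 1's real comment is not shown
	 * here.  The point is that it now has to talk about roughly 90%,
	 * not half, of the available system memory.
	 */
	num_pages = (u64)si.totalram * si.mem_unit;
	num_pages = (num_pages * 90 / 100) >> PAGE_SHIFT;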
Daniel Vetter Nov. 17, 2020, 5:19 p.m. UTC | #2
On Tue, Nov 17, 2020 at 03:06:15PM +0100, Christian König wrote:
> Increase the ammount of system memory drivers can use to about 90% of
> the total available.
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/ttm/ttm_bo.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index a958135cb3fe..0a93df93dba4 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -1267,7 +1267,7 @@ static int ttm_bo_global_init(void)
>  	 * the available system memory.
>  	 */
>  	num_pages = (u64)si.totalram * si.mem_unit;
> -	num_pages = (num_pages * 50 / 100) >> PAGE_SHIFT;
> +	num_pages = (num_pages * 90 / 100) >> PAGE_SHIFT;

I don't think this is the design we want. As long as it was set at "half
of system memory" it was clear that a) it's a hack b) precision didn't
matter.

But if you go to the limit and still want to keep the "we make sure
there's no OOM" guarantee, then precision starts to matter:
- memory hotplug and hotunplug is a thing
- userspace can mlock, and it's configurable
- drivers can pin_user_pages for IO and random other stuff. Some of it is
  accounted as some subsystem-specific mlock (like rdma does iirc), some
  is just yolo or short-term enough
- none of what we do here takes into consideration any interactions with
  core mm tracking (like cgroups or numa or anything like that)

If we want to drop the "half of system ram" limit (and yes that makes
sense) I think the right design is:

- we give up on the "no OOM" guarantee.

- This means if you want real isolation of tasks, we need cgroups, and we
  need to integrate ttm cgroups with system memory cgroups somehow. Unlike
  randomly picked hardcoded limits this should work a lot more reliably
  and be a lot more useful in the field.

- This also means that drivers start to fail in interesting ways. I think
  locking headaches are covered with all the lockdep annotations I've
  pushed, plus some of the things I still have in-flight (I have a
  might_alloc() annotation somewhere). That leaves validation of error
  paths for when allocations fail. IME a very effective way we used in
  i915 is (ab)using EINTR restarting, which per the drmIoctl uapi spec is
  required. We could put a debug mode into ttm_tt which randomly fails
  with -EINTR to make sure it's all working correctly (plus anything else
  that allocates memory), and unlike real out-of-memory injection piglit
  and any other cts will complete without failure. Which gives us an
  excellent metric for "does it work". Actual OOM, even an injected one,
  tends to make stuff blow up in a way that's very hard to track and make
  sure you've got good coverage, since all your usual tests die pretty
  quickly. (A sketch of such an -EINTR injection hook is appended at the
  end of this mail.)

- ttm_tt needs to play fair with every other system memory user. We need
  to register a shrinker for each ttm_tt (so usually one per device I
  guess), which walks the lru (in shrink_count) and uses dma_resv_trylock
  for actual shrinking. We probably want to move it to SYSTEM first for
  that shrinker to pick up, so that there's some global fairness across
  all ttm_tt. A minimal sketch of such a shrinker follows right after
  this list.

- for GFP_DMA32 that means zone-aware shrinkers. We've never used those
  in practice, because thus far i915 didn't have any big need for low
  memory. But it's supposed to be a thing.
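
A minimal sketch of the per-device shrinker described two bullets up, assuming a hypothetical ttm_example_device that keeps its swappable TT bos on its own lru. Everything named ttm_example_* (the structs, the lru, the swapout helper) is made up for illustration; only the shrinker API, dma_resv_trylock() and register_shrinker() are real kernel interfaces here.

#include <linux/shrinker.h>
#include <linux/dma-resv.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct ttm_example_bo {
        struct list_head lru;
        struct dma_resv *resv;
        /* ... backing ttm_tt / pages ... */
};

struct ttm_example_device {
        struct shrinker shrinker;
        spinlock_t lru_lock;
        struct list_head tt_lru;        /* TT bos moved to SYSTEM */
        unsigned long shrinkable_pages; /* kept up to date elsewhere */
};

/* Placeholder: copy the bo's TT pages out to shmem, return pages freed. */
unsigned long ttm_example_swapout(struct ttm_example_bo *bo);

static unsigned long ttm_example_shrink_count(struct shrinker *shrink,
                                              struct shrink_control *sc)
{
        struct ttm_example_device *dev =
                container_of(shrink, struct ttm_example_device, shrinker);

        return READ_ONCE(dev->shrinkable_pages);
}

static unsigned long ttm_example_shrink_scan(struct shrinker *shrink,
                                             struct shrink_control *sc)
{
        struct ttm_example_device *dev =
                container_of(shrink, struct ttm_example_device, shrinker);
        unsigned long freed = 0, scanned = 0;

        spin_lock(&dev->lru_lock);
        while (scanned++ < sc->nr_to_scan && !list_empty(&dev->tt_lru)) {
                struct ttm_example_bo *bo =
                        list_first_entry(&dev->tt_lru, typeof(*bo), lru);

                /* Never block in the shrinker, skip contended objects. */
                if (!dma_resv_trylock(bo->resv)) {
                        list_move_tail(&bo->lru, &dev->tt_lru);
                        continue;
                }

                list_del_init(&bo->lru);
                spin_unlock(&dev->lru_lock);

                freed += ttm_example_swapout(bo);

                dma_resv_unlock(bo->resv);
                spin_lock(&dev->lru_lock);
        }
        spin_unlock(&dev->lru_lock);

        return freed ? freed : SHRINK_STOP;
}

int ttm_example_register_shrinker(struct ttm_example_device *dev)
{
        dev->shrinker.count_objects = ttm_example_shrink_count;
        dev->shrinker.scan_objects = ttm_example_shrink_scan;
        dev->shrinker.seeks = DEFAULT_SEEKS;
        return register_shrinker(&dev->shrinker);
}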

It's a bit more code than the oneliner above, but I also think it's a lot
more solid. Plus it would resolve the last big issue where i915 gem is
fairly fundamentally different from ttm. On that front I think once
Maarten's locking rework for i915 has landed and all the other ttm rework
from you and Dave is in, we've resolved them all.


>  	/* But for DMA32 we limit ourself to only use 2GiB maximum. */
>  	num_dma32_pages = (u64)(si.totalram - si.totalhigh) * si.mem_unit;
> -- 
> 2.25.1
>
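
A sketch of the -EINTR injection hook suggested above; the knob and the call site are made up, only the idea of randomly failing allocation paths with -EINTR (which userspace must restart per the drmIoctl() contract) comes from the mail.

#include <linux/errno.h>
#include <linux/module.h>
#include <linux/random.h>

struct ttm_tt;

/* Hypothetical knob; could just as well be a debugfs file. */
static bool ttm_debug_fail_alloc;
module_param_named(debug_fail_alloc, ttm_debug_fail_alloc, bool, 0644);

static int ttm_maybe_fail_with_eintr(void)
{
        if (!ttm_debug_fail_alloc)
                return 0;

        /* Fail roughly 1 in 16 calls. */
        return (prandom_u32() & 0xf) ? 0 : -EINTR;
}

/* Hypothetical call site: the top of every path that allocates pages. */
int ttm_example_tt_populate(struct ttm_tt *ttm)
{
        int ret = ttm_maybe_fail_with_eintr();

        if (ret)
                return ret;

        /* ... real page pool allocation ... */
        return 0;
}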
Christian König Nov. 18, 2020, 12:09 p.m. UTC | #3
Am 17.11.20 um 17:38 schrieb Michel Dänzer:
> On 2020-11-17 3:06 p.m., Christian König wrote:
>> Increase the ammount of system memory drivers can use to about 90% of
>> the total available.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/ttm/ttm_bo.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
>> index a958135cb3fe..0a93df93dba4 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>> @@ -1267,7 +1267,7 @@ static int ttm_bo_global_init(void)
>>        * the available system memory.
>>        */
>>       num_pages = (u64)si.totalram * si.mem_unit;
>> -    num_pages = (num_pages * 50 / 100) >> PAGE_SHIFT;
>> +    num_pages = (num_pages * 90 / 100) >> PAGE_SHIFT;
>>         /* But for DMA32 we limit ourself to only use 2GiB maximum. */
>>       num_dma32_pages = (u64)(si.totalram - si.totalhigh) * si.mem_unit;
>>
>
> This should update the comment added in patch 1.

Yeah, indeed. I've tested this with 70, 75, 80, 85 and finally 90, and it
looks like I forgot to update the comment along the way.

Thanks,
Christian.
Christian König Nov. 18, 2020, 1:35 p.m. UTC | #4
Am 17.11.20 um 18:19 schrieb Daniel Vetter:
> On Tue, Nov 17, 2020 at 03:06:15PM +0100, Christian König wrote:
>> Increase the ammount of system memory drivers can use to about 90% of
>> the total available.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/ttm/ttm_bo.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
>> index a958135cb3fe..0a93df93dba4 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>> @@ -1267,7 +1267,7 @@ static int ttm_bo_global_init(void)
>>   	 * the available system memory.
>>   	 */
>>   	num_pages = (u64)si.totalram * si.mem_unit;
>> -	num_pages = (num_pages * 50 / 100) >> PAGE_SHIFT;
>> +	num_pages = (num_pages * 90 / 100) >> PAGE_SHIFT;
> I don't think this is the design we want. As long as it was set at "half
> of system memory" it was clear that a) it's a hack b) precision didn't
> matter.

Yeah, I'm also wondering what to do here exactly. In general I would
completely nuke that limit if possible.

> But if you go to the limit and still want to keep the "we make sure
> there's no OOM", then precision starts to matter:
> - memory hotplug and hotunplug is a thing
> - userspace can mlock, and it's configureable
> - drivers can pin_user_pages for IO and random other stuff. Some of it is
>    accounted as some subsystem specific mlock (like rdma does iirc), some
>    is just yolo or short term enough (like)
> - none of what we do here takes into considerations any interactions with
>    core mm tracking (like cgroups or numa or anything like that)

OOM is perfectly fine with me; we just shouldn't run into an OOM killer
situation because we call shmem_read_mapping_page_gfp() in the shrinker
callback.

Any idea how we could guarantee that?

> If we want to drop the "half of system ram" limit (and yes that makes
> sense) I think the right design is:
>
> - we give up on the "no OOM" guarantee.
>
> - This means if you want real isolation of tasks, we need cgroups, and we
>    need to integrate ttm cgroups with system memory cgroups somehow. Unlike
>    randomly picked hardcoded limits this should work a lot more reliably
>    and be a lot more useful in practical use in the field.
>
> - This also means that drivers start to fail in interesting ways. I think
>    locking headaches are covered with all the lockdep annotations I've
>    pushed, plus some of the things I still have in-flight (I have a
>    might_alloc() annotations somewhere). That leaves validation of error
>    paths for when allocations fail. Ime a very effective way we used in
>    i915 is (ab)using EINTR restarting, which per drmIoctl uapi spec is
>    requried. We could put a debug mode into ttm_tt which randomly fails
>    with -EINTR to make sure it's all working correctly (plus anything else
>    that allocates memory), and unlike real out-of-memory injection piglit
>    and any other cts will complete without failure. Which gives us an
>    excellent metric for "does it work". Actualy OOM, even injected one,
>    tends to make stuff blow up in a way that's very hard to track and make
>    sure you've got good coverage, since all your usual tests die pretty
>    quickly.
>
> - ttm_tt needs to play fair with every other system memory user. We need
>    to register a shrinker for each ttm_tt (so usually one per device I
>    guess), which walks the lru (in shrink_count) and uses dma_resv_trylock
>    for actual shrinking. We probably want to move it to SYSTEM first for
>    that shrinker to pick up, so that there's some global fairness across
>    all ttm_tt.

I already have patches for this.

What's still missing is teaching the OOM killer which task is using the
memory, since memory referenced through file descriptors is currently
not accounted towards any process resources.

> - for GFP_DMA32 that means zone aware shrinkers. We've never used those,
>    because thus far i915 didn't have any big need for low memory, so we
>    haven't used this in practice. But it's supposed to be a thing.

I think we can mostly forget about GFP_DMA32; this should only be used
for AGP and other ancient hardware.

Christian.

> It's a bit more code than the oneliner above, but I also think it's a lot
> more solid. Plus it would resolve the last big issue where i915 gem is
> fairly fundamentally different compared to ttm. For that question I think
> once Maarten's locking rework for i915 has landed and all the other ttm
> rework from you and Dave is in, we've resolved them all.
>
>
>>   	/* But for DMA32 we limit ourself to only use 2GiB maximum. */
>>   	num_dma32_pages = (u64)(si.totalram - si.totalhigh) * si.mem_unit;
>> -- 
>> 2.25.1
>>
Daniel Vetter Nov. 18, 2020, 9:15 p.m. UTC | #5
On Wed, Nov 18, 2020 at 02:35:24PM +0100, Christian König wrote:
> Am 17.11.20 um 18:19 schrieb Daniel Vetter:
> > On Tue, Nov 17, 2020 at 03:06:15PM +0100, Christian König wrote:
> > > Increase the ammount of system memory drivers can use to about 90% of
> > > the total available.
> > > 
> > > Signed-off-by: Christian König <christian.koenig@amd.com>
> > > ---
> > >   drivers/gpu/drm/ttm/ttm_bo.c | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> > > index a958135cb3fe..0a93df93dba4 100644
> > > --- a/drivers/gpu/drm/ttm/ttm_bo.c
> > > +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> > > @@ -1267,7 +1267,7 @@ static int ttm_bo_global_init(void)
> > >   	 * the available system memory.
> > >   	 */
> > >   	num_pages = (u64)si.totalram * si.mem_unit;
> > > -	num_pages = (num_pages * 50 / 100) >> PAGE_SHIFT;
> > > +	num_pages = (num_pages * 90 / 100) >> PAGE_SHIFT;
> > I don't think this is the design we want. As long as it was set at "half
> > of system memory" it was clear that a) it's a hack b) precision didn't
> > matter.
> 
> Yeah, I'm also wondering what to do here exactly. In generally I would
> completely nuke that limit if possible.
> 
> > But if you go to the limit and still want to keep the "we make sure
> > there's no OOM", then precision starts to matter:
> > - memory hotplug and hotunplug is a thing
> > - userspace can mlock, and it's configureable
> > - drivers can pin_user_pages for IO and random other stuff. Some of it is
> >    accounted as some subsystem specific mlock (like rdma does iirc), some
> >    is just yolo or short term enough (like)
> > - none of what we do here takes into considerations any interactions with
> >    core mm tracking (like cgroups or numa or anything like that)
> 
> OOM is perfectly fine with me, we should just not run into an OOM killer
> situation because we call shmem_read_mapping_page_gfp() in the shrinker
> callback.
> 
> Any idea how we could guarantee that?

Uh ...

I just realized I missed a difference between how ttm works and how i915
works. With i915 directly using shmem, all we do in the shrinker is unpin
the shmem pages. With ttm we have the page pool in the middle. And it's
needed for dma coherent and other things. This is rather fundamental.

Btw I don't think you'll get OOM, you'll get a lockdep splat because you're
recursing on the fs_reclaim fake lock. We should probably put a might_alloc()
into shmem_read_mapping_page_gfp().
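
Assuming the in-flight might_alloc() helper is what eventually lands, and until shmem itself carries the annotation, a caller-side stand-in could be as small as this (the wrapper name is made up):

#include <linux/sched/mm.h>
#include <linux/shmem_fs.h>

static struct page *ttm_shmem_get_page(struct address_space *mapping,
                                       pgoff_t index, gfp_t gfp)
{
        /* Under lockdep this splats right away if we recurse from a shrinker. */
        might_alloc(gfp);

        return shmem_read_mapping_page_gfp(mapping, index, gfp);
}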

> > If we want to drop the "half of system ram" limit (and yes that makes
> > sense) I think the right design is:
> > 
> > - we give up on the "no OOM" guarantee.
> > 
> > - This means if you want real isolation of tasks, we need cgroups, and we
> >    need to integrate ttm cgroups with system memory cgroups somehow. Unlike
> >    randomly picked hardcoded limits this should work a lot more reliably
> >    and be a lot more useful in practical use in the field.
> > 
> > - This also means that drivers start to fail in interesting ways. I think
> >    locking headaches are covered with all the lockdep annotations I've
> >    pushed, plus some of the things I still have in-flight (I have a
> >    might_alloc() annotations somewhere). That leaves validation of error
> >    paths for when allocations fail. Ime a very effective way we used in
> >    i915 is (ab)using EINTR restarting, which per drmIoctl uapi spec is
> >    requried. We could put a debug mode into ttm_tt which randomly fails
> >    with -EINTR to make sure it's all working correctly (plus anything else
> >    that allocates memory), and unlike real out-of-memory injection piglit
> >    and any other cts will complete without failure. Which gives us an
> >    excellent metric for "does it work". Actualy OOM, even injected one,
> >    tends to make stuff blow up in a way that's very hard to track and make
> >    sure you've got good coverage, since all your usual tests die pretty
> >    quickly.
> > 
> > - ttm_tt needs to play fair with every other system memory user. We need
> >    to register a shrinker for each ttm_tt (so usually one per device I
> >    guess), which walks the lru (in shrink_count) and uses dma_resv_trylock
> >    for actual shrinking. We probably want to move it to SYSTEM first for
> >    that shrinker to pick up, so that there's some global fairness across
> >    all ttm_tt.
> 
> I already have patches for this.
> 
> What's still missing is teaching the OOM killer which task is using the
> memory since memory referenced through the file descriptors are currently
> not accounted towards any process resources.

Yeah, I don't think we've fixed that problem for i915 either. It loves to
kill random other stuff. In igt we solve this by marking any igt testcase
as "pls kill this first". Well, "solve".

> > - for GFP_DMA32 that means zone aware shrinkers. We've never used those,
> >    because thus far i915 didn't have any big need for low memory, so we
> >    haven't used this in practice. But it's supposed to be a thing.
> 
> I think we can mostly forget about GFP_DMA32, this should only be used for
> AGP and other ancient hardware.

Ok, that's good. So having an arbitrary "only half of GFP_DMA32" for these
should also be acceptable, since back then gpus didn't really have
gigabytes of vram yet. Biggest r500 seems to have topped out at 512MB.

How much do we require the dma page pool? Afaiui it's only needed when
there are bounce buffers or something nasty like that present. Are there
more use cases?

For fixing the TT -> SYSTEM problem of calling shmem_read_mapping_page_gfp
from shrinker I think there's 3 solutions:

1) Call shmem_read_mapping_page_gfp with GFP_IO. This way we should not be
  calling back into our shrinkers, which are at the GFP_FS level. This isn't
  great, but since page reclaim is a gigantic loop which repeats as long as
  we're making meaningful forward progress, we should be able to trickle
  TT pages to SYSTEM pages even under severe memory pressure. (A minimal
  sketch of this option is appended at the end of this mail.)

2) For the normal page pool (i.e. neither GFP_DMA32 nor dma_alloc_coherent
  or write-combine) we stop copying the pages for TT <-> SYSTEM, but just
  directly pin the pages we get from shmem. Like all the shmem based soc
  drivers do. No copying, no need for allocations, no problem in
  shrinkers.

3) For the remaining page pools we'd need ttmfs so that we can control the
  address_space_operations and swap our pages out directly, bypassing
  shmem and the copying. Iirc Chris Wilson once had a prototype somewhere
  under gemfs for this. Only needed if 1) isn't fast enough for these
  because there are too many such cases where we care.

At least on intel hw the direction is very much towards fully coherent,
so 1&2 should be good enough.
-Daniel
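
A minimal sketch of option 1; ttm_example_swapout_page() is a placeholder name, and it uses GFP_NOFS rather than GFP_IO per the correction further down the thread. The gfp mask is the interesting bit: reclaim may still do IO here, but cannot recurse back into FS-level shrinkers like the one discussed above.

#include <linux/err.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/shmem_fs.h>
#include <linux/swap.h>

static int ttm_example_swapout_page(struct address_space *swap_storage,
                                    pgoff_t index, struct page *from_page)
{
        struct page *to_page;

        /* GFP_NOFS: IO is fine, re-entering FS reclaim (our shrinker) is not. */
        to_page = shmem_read_mapping_page_gfp(swap_storage, index, GFP_NOFS);
        if (IS_ERR(to_page))
                return PTR_ERR(to_page);

        copy_highpage(to_page, from_page);
        set_page_dirty(to_page);
        mark_page_accessed(to_page);
        put_page(to_page);

        return 0;
}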

> 
> Christian.
> 
> > It's a bit more code than the oneliner above, but I also think it's a lot
> > more solid. Plus it would resolve the last big issue where i915 gem is
> > fairly fundamentally different compared to ttm. For that question I think
> > once Maarten's locking rework for i915 has landed and all the other ttm
> > rework from you and Dave is in, we've resolved them all.
> > 
> > 
> > >   	/* But for DMA32 we limit ourself to only use 2GiB maximum. */
> > >   	num_dma32_pages = (u64)(si.totalram - si.totalhigh) * si.mem_unit;
> > > -- 
> > > 2.25.1
> > > 
>
Christian König Nov. 19, 2020, 3:27 p.m. UTC | #6
Am 18.11.20 um 22:15 schrieb Daniel Vetter:
> On Wed, Nov 18, 2020 at 02:35:24PM +0100, Christian König wrote:
>> Am 17.11.20 um 18:19 schrieb Daniel Vetter:
>>> On Tue, Nov 17, 2020 at 03:06:15PM +0100, Christian König wrote:
>>>> Increase the ammount of system memory drivers can use to about 90% of
>>>> the total available.
>>>>
>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/ttm/ttm_bo.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
>>>> index a958135cb3fe..0a93df93dba4 100644
>>>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
>>>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>>>> @@ -1267,7 +1267,7 @@ static int ttm_bo_global_init(void)
>>>>    	 * the available system memory.
>>>>    	 */
>>>>    	num_pages = (u64)si.totalram * si.mem_unit;
>>>> -	num_pages = (num_pages * 50 / 100) >> PAGE_SHIFT;
>>>> +	num_pages = (num_pages * 90 / 100) >> PAGE_SHIFT;
>>> I don't think this is the design we want. As long as it was set at "half
>>> of system memory" it was clear that a) it's a hack b) precision didn't
>>> matter.
>> Yeah, I'm also wondering what to do here exactly. In generally I would
>> completely nuke that limit if possible.
>>
>>> But if you go to the limit and still want to keep the "we make sure
>>> there's no OOM", then precision starts to matter:
>>> - memory hotplug and hotunplug is a thing
>>> - userspace can mlock, and it's configureable
>>> - drivers can pin_user_pages for IO and random other stuff. Some of it is
>>>     accounted as some subsystem specific mlock (like rdma does iirc), some
>>>     is just yolo or short term enough (like)
>>> - none of what we do here takes into considerations any interactions with
>>>     core mm tracking (like cgroups or numa or anything like that)
>> OOM is perfectly fine with me, we should just not run into an OOM killer
>> situation because we call shmem_read_mapping_page_gfp() in the shrinker
>> callback.
>>
>> Any idea how we could guarantee that?
> Uh ...
>
> I just realized I missed something between how ttm works and how i915
> works.  With i915 directly using shmem, all we do in the shrinker is unpin
> the shmem pages. With ttm we have the page pool in the middle. And it's
> needed for dma coherent and other things. This is rather fundamental.

Yeah, the WC/UC handling is otherwise way too slow. We could only do this
for normal WB allocations.

> btw I don't think you'll get OOM, you get lockdep splat because you're
> recusring on fs_reclaim fake lock. We should probably put a might_alloc()
> into shmem_read_mapping_page_gfp().

Yeah, one reason why I haven't enabled that yet.

>
>>> If we want to drop the "half of system ram" limit (and yes that makes
>>> sense) I think the right design is:
>>>
>>> - we give up on the "no OOM" guarantee.
>>>
>>> - This means if you want real isolation of tasks, we need cgroups, and we
>>>     need to integrate ttm cgroups with system memory cgroups somehow. Unlike
>>>     randomly picked hardcoded limits this should work a lot more reliably
>>>     and be a lot more useful in practical use in the field.
>>>
>>> - This also means that drivers start to fail in interesting ways. I think
>>>     locking headaches are covered with all the lockdep annotations I've
>>>     pushed, plus some of the things I still have in-flight (I have a
>>>     might_alloc() annotations somewhere). That leaves validation of error
>>>     paths for when allocations fail. Ime a very effective way we used in
>>>     i915 is (ab)using EINTR restarting, which per drmIoctl uapi spec is
>>>     requried. We could put a debug mode into ttm_tt which randomly fails
>>>     with -EINTR to make sure it's all working correctly (plus anything else
>>>     that allocates memory), and unlike real out-of-memory injection piglit
>>>     and any other cts will complete without failure. Which gives us an
>>>     excellent metric for "does it work". Actualy OOM, even injected one,
>>>     tends to make stuff blow up in a way that's very hard to track and make
>>>     sure you've got good coverage, since all your usual tests die pretty
>>>     quickly.
>>>
>>> - ttm_tt needs to play fair with every other system memory user. We need
>>>     to register a shrinker for each ttm_tt (so usually one per device I
>>>     guess), which walks the lru (in shrink_count) and uses dma_resv_trylock
>>>     for actual shrinking. We probably want to move it to SYSTEM first for
>>>     that shrinker to pick up, so that there's some global fairness across
>>>     all ttm_tt.
>> I already have patches for this.
>>
>> What's still missing is teaching the OOM killer which task is using the
>> memory since memory referenced through the file descriptors are currently
>> not accounted towards any process resources.
> Yeah I don't think we've fixed that problem for i915 either. It loves to
> kill randome other stuff. In igt we solve this by marking any igt testcase
> as "pls kill this first". Well "solve".

Well that is a "creative" solution :)

I will rather be looking into whether we can somehow track which
files_struct a file belongs to and whether we can somehow use this in the
OOM killer.

My last try at giving each file something like that wasn't received well.

The real bummer is that you can really nicely create a denial of service
using memfd_create() because of this :)

>>> - for GFP_DMA32 that means zone aware shrinkers. We've never used those,
>>>     because thus far i915 didn't have any big need for low memory, so we
>>>     haven't used this in practice. But it's supposed to be a thing.
>> I think we can mostly forget about GFP_DMA32, this should only be used for
>> AGP and other ancient hardware.
> Ok, that's good. So having an arbitrary "only half of GFP_DMA32" for these
> should also be acceptable, since back then gpus didn't really have
> gigabytes of vram yet. Biggest r500 seems to have topped out at 512MB.
>
> How much do we require the dma page pool? Afaiui it's only when there's
> bounce buffers or something nasty like that present. Are there more use
> cases?

Not that I know of.

>
> For fixing the TT -> SYSTEM problem of calling shmem_read_mapping_page_gfp
> from shrinker I think there's 3 solutions:
>
> 1)Call shmem_read_mapping_page_gfp with GFP_IO. This way we should not be
>    calling back into our shrinker, which are at GFP_FS level. This isn't
>    great, but since page reclaim is a giantic loop which repeats as long as
>    we're making meaningful forward progress, we should be able to trickle
>    TT pages to SYSTEM pages even under severe memory pressure.

This is most likely the way to go for us.

> 2)For the normal page pool (i.e. neither GFP_DMA32 nor dma_alloc_coherent
>    or write-combine) we stop copying the pages for TT <-> SYSTEM, but just
>    directly pin the pages we get from shmem. Like all the shmem based soc
>    drivers do. No copying, no need for allocations, no problem in
>    shrinkers.

Not really doable; WC is just too widely used and shmem doesn't support
larger orders.

Regards,
Christian.

>
> 3)For the remaining page pools we'd need ttmfs so that we can control the
>    address_space_operations and directly swap out from our pages to
>    swapout, bypassing shmem and the copying. Iirc Chris Wilson had a
>    prototype once somewhere under gemfs for this. Only needed if 1) isn't
>    fast enough for these because there's too many such cases where we care.
>
> At least on intel hw the direction is very much towards fully coherent,
> so 1&2 should be good enough.
> -Daniel
>
>> Christian.
>>
>>> It's a bit more code than the oneliner above, but I also think it's a lot
>>> more solid. Plus it would resolve the last big issue where i915 gem is
>>> fairly fundamentally different compared to ttm. For that question I think
>>> once Maarten's locking rework for i915 has landed and all the other ttm
>>> rework from you and Dave is in, we've resolved them all.
>>>
>>>
>>>>    	/* But for DMA32 we limit ourself to only use 2GiB maximum. */
>>>>    	num_dma32_pages = (u64)(si.totalram - si.totalhigh) * si.mem_unit;
>>>> -- 
>>>> 2.25.1
>>>>
Daniel Vetter Nov. 19, 2020, 3:42 p.m. UTC | #7
On Thu, Nov 19, 2020 at 04:27:00PM +0100, Christian König wrote:
> Am 18.11.20 um 22:15 schrieb Daniel Vetter:
> > On Wed, Nov 18, 2020 at 02:35:24PM +0100, Christian König wrote:
> > > Am 17.11.20 um 18:19 schrieb Daniel Vetter:
> > > > On Tue, Nov 17, 2020 at 03:06:15PM +0100, Christian König wrote:
> > > > > Increase the ammount of system memory drivers can use to about 90% of
> > > > > the total available.
> > > > > 
> > > > > Signed-off-by: Christian König <christian.koenig@amd.com>
> > > > > ---
> > > > >    drivers/gpu/drm/ttm/ttm_bo.c | 2 +-
> > > > >    1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > index a958135cb3fe..0a93df93dba4 100644
> > > > > --- a/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> > > > > @@ -1267,7 +1267,7 @@ static int ttm_bo_global_init(void)
> > > > >    	 * the available system memory.
> > > > >    	 */
> > > > >    	num_pages = (u64)si.totalram * si.mem_unit;
> > > > > -	num_pages = (num_pages * 50 / 100) >> PAGE_SHIFT;
> > > > > +	num_pages = (num_pages * 90 / 100) >> PAGE_SHIFT;
> > > > I don't think this is the design we want. As long as it was set at "half
> > > > of system memory" it was clear that a) it's a hack b) precision didn't
> > > > matter.
> > > Yeah, I'm also wondering what to do here exactly. In generally I would
> > > completely nuke that limit if possible.
> > > 
> > > > But if you go to the limit and still want to keep the "we make sure
> > > > there's no OOM", then precision starts to matter:
> > > > - memory hotplug and hotunplug is a thing
> > > > - userspace can mlock, and it's configureable
> > > > - drivers can pin_user_pages for IO and random other stuff. Some of it is
> > > >     accounted as some subsystem specific mlock (like rdma does iirc), some
> > > >     is just yolo or short term enough (like)
> > > > - none of what we do here takes into considerations any interactions with
> > > >     core mm tracking (like cgroups or numa or anything like that)
> > > OOM is perfectly fine with me, we should just not run into an OOM killer
> > > situation because we call shmem_read_mapping_page_gfp() in the shrinker
> > > callback.
> > > 
> > > Any idea how we could guarantee that?
> > Uh ...
> > 
> > I just realized I missed something between how ttm works and how i915
> > works.  With i915 directly using shmem, all we do in the shrinker is unpin
> > the shmem pages. With ttm we have the page pool in the middle. And it's
> > needed for dma coherent and other things. This is rather fundamental.
> 
> Yeah, the WC/UC handling is otherwise way to slow. Only for normal WB
> allocations we could do this.
> 
> > btw I don't think you'll get OOM, you get lockdep splat because you're
> > recusring on fs_reclaim fake lock. We should probably put a might_alloc()
> > into shmem_read_mapping_page_gfp().
> 
> Yeah, one reason why I haven't enabled that yet.
> 
> > 
> > > > If we want to drop the "half of system ram" limit (and yes that makes
> > > > sense) I think the right design is:
> > > > 
> > > > - we give up on the "no OOM" guarantee.
> > > > 
> > > > - This means if you want real isolation of tasks, we need cgroups, and we
> > > >     need to integrate ttm cgroups with system memory cgroups somehow. Unlike
> > > >     randomly picked hardcoded limits this should work a lot more reliably
> > > >     and be a lot more useful in practical use in the field.
> > > > 
> > > > - This also means that drivers start to fail in interesting ways. I think
> > > >     locking headaches are covered with all the lockdep annotations I've
> > > >     pushed, plus some of the things I still have in-flight (I have a
> > > >     might_alloc() annotations somewhere). That leaves validation of error
> > > >     paths for when allocations fail. Ime a very effective way we used in
> > > >     i915 is (ab)using EINTR restarting, which per drmIoctl uapi spec is
> > > >     requried. We could put a debug mode into ttm_tt which randomly fails
> > > >     with -EINTR to make sure it's all working correctly (plus anything else
> > > >     that allocates memory), and unlike real out-of-memory injection piglit
> > > >     and any other cts will complete without failure. Which gives us an
> > > >     excellent metric for "does it work". Actualy OOM, even injected one,
> > > >     tends to make stuff blow up in a way that's very hard to track and make
> > > >     sure you've got good coverage, since all your usual tests die pretty
> > > >     quickly.
> > > > 
> > > > - ttm_tt needs to play fair with every other system memory user. We need
> > > >     to register a shrinker for each ttm_tt (so usually one per device I
> > > >     guess), which walks the lru (in shrink_count) and uses dma_resv_trylock
> > > >     for actual shrinking. We probably want to move it to SYSTEM first for
> > > >     that shrinker to pick up, so that there's some global fairness across
> > > >     all ttm_tt.
> > > I already have patches for this.
> > > 
> > > What's still missing is teaching the OOM killer which task is using the
> > > memory since memory referenced through the file descriptors are currently
> > > not accounted towards any process resources.
> > Yeah I don't think we've fixed that problem for i915 either. It loves to
> > kill randome other stuff. In igt we solve this by marking any igt testcase
> > as "pls kill this first". Well "solve".
> 
> Well that is a "creative" solution :)
> 
> I will be rather looking into if we can somehow track to which files_struct
> a file belongs to and if we somehow can use this in the OOM killer.
> 
> My last try of giving each file something wasn't received well.
> 
> The real boomer is that you can really nicely create a deny of service using
> memfd_create() because of this :)

Yeah proper accounting is the right fix here I agree.

> > > > - for GFP_DMA32 that means zone aware shrinkers. We've never used those,
> > > >     because thus far i915 didn't have any big need for low memory, so we
> > > >     haven't used this in practice. But it's supposed to be a thing.
> > > I think we can mostly forget about GFP_DMA32, this should only be used for
> > > AGP and other ancient hardware.
> > Ok, that's good. So having an arbitrary "only half of GFP_DMA32" for these
> > should also be acceptable, since back then gpus didn't really have
> > gigabytes of vram yet. Biggest r500 seems to have topped out at 512MB.
> > 
> > How much do we require the dma page pool? Afaiui it's only when there's
> > bounce buffers or something nasty like that present. Are there more use
> > cases?
> 
> Not that I know off.
> 
> > 
> > For fixing the TT -> SYSTEM problem of calling shmem_read_mapping_page_gfp
> > from shrinker I think there's 3 solutions:
> > 
> > 1)Call shmem_read_mapping_page_gfp with GFP_IO. This way we should not be
> >    calling back into our shrinker, which are at GFP_FS level. This isn't
> >    great, but since page reclaim is a giantic loop which repeats as long as
> >    we're making meaningful forward progress, we should be able to trickle
> >    TT pages to SYSTEM pages even under severe memory pressure.
> 
> This is most likely the way to go for us.

GFP_NOFS is the flag, I mixed them up.

> > 2)For the normal page pool (i.e. neither GFP_DMA32 nor dma_alloc_coherent
> >    or write-combine) we stop copying the pages for TT <-> SYSTEM, but just
> >    directly pin the pages we get from shmem. Like all the shmem based soc
> >    drivers do. No copying, no need for allocations, no problem in
> >    shrinkers.
> 
> Not really doable, WC is just to widely used and shmem doesn't support
> larger orders.

Hm, I thought you get large pages automatically if you reassemble them.
It's just the interface that sucks.

btw did you look at 3 below? It might break dma-api assumptions a bit too
badly since we're doing nice page flushing behind everyone's back.
-Daniel

> 
> Regards,
> Christian.
> 
> > 
> > 3)For the remaining page pools we'd need ttmfs so that we can control the
> >    address_space_operations and directly swap out from our pages to
> >    swapout, bypassing shmem and the copying. Iirc Chris Wilson had a
> >    prototype once somewhere under gemfs for this. Only needed if 1) isn't
> >    fast enough for these because there's too many such cases where we care.
> > 
> > At least on intel hw the direction is very much towards fully coherent,
> > so 1&2 should be good enough.
> > -Daniel
> > 
> > > Christian.
> > > 
> > > > It's a bit more code than the oneliner above, but I also think it's a lot
> > > > more solid. Plus it would resolve the last big issue where i915 gem is
> > > > fairly fundamentally different compared to ttm. For that question I think
> > > > once Maarten's locking rework for i915 has landed and all the other ttm
> > > > rework from you and Dave is in, we've resolved them all.
> > > > 
> > > > 
> > > > >    	/* But for DMA32 we limit ourself to only use 2GiB maximum. */
> > > > >    	num_dma32_pages = (u64)(si.totalram - si.totalhigh) * si.mem_unit;
> > > > > -- 
> > > > > 2.25.1
> > > > > 
>

Patch

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index a958135cb3fe..0a93df93dba4 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -1267,7 +1267,7 @@  static int ttm_bo_global_init(void)
 	 * the available system memory.
 	 */
 	num_pages = (u64)si.totalram * si.mem_unit;
-	num_pages = (num_pages * 50 / 100) >> PAGE_SHIFT;
+	num_pages = (num_pages * 90 / 100) >> PAGE_SHIFT;
 
 	/* But for DMA32 we limit ourself to only use 2GiB maximum. */
 	num_dma32_pages = (u64)(si.totalram - si.totalhigh) * si.mem_unit;