diff mbox series

dma/pool: do not complain if DMA pool is not allocated

Message ID 20220325122559.14251-1-mhocko@kernel.org (mailing list archive)
State New
Headers show
Series dma/pool: do not complain if DMA pool is not allocated | expand

Commit Message

Michal Hocko March 25, 2022, 12:25 p.m. UTC
From: Michal Hocko <mhocko@suse.com>

we have a system complainging about order-5 allocation for the DMA pool.
This is something that a674e48c5443 ("dma/pool: create dma atomic pool
only if dma zone has managed pages") has already tried to achieve but I
do not think it went all the way to have it covered completely. In this
particular case has_managed_dma() will not work because:
[    0.678539][    T0] Initmem setup node 0 [mem 0x0000000000001000-0x000000027dffffff]
[    0.686316][    T0] On node 0, zone DMA: 1 pages in unavailable ranges
[    0.687093][    T0] On node 0, zone DMA32: 36704 pages in unavailable ranges
[    0.694278][    T0] On node 0, zone Normal: 53252 pages in unavailable ranges
[    0.701257][    T0] On node 0, zone Normal: 8192 pages in unavailable ranges

The allocation failure on the DMA zone shouldn't be really critical for
the system operation so just silence the warning instead.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 kernel/dma/pool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Michal Hocko March 25, 2022, 12:58 p.m. UTC | #1
On Fri 25-03-22 13:25:59, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> we have a system complainging about order-5 allocation for the DMA pool.
> This is something that a674e48c5443 ("dma/pool: create dma atomic pool
> only if dma zone has managed pages") has already tried to achieve but I
> do not think it went all the way to have it covered completely. In this
> particular case has_managed_dma() will not work because:
> [    0.678539][    T0] Initmem setup node 0 [mem 0x0000000000001000-0x000000027dffffff]
> [    0.686316][    T0] On node 0, zone DMA: 1 pages in unavailable ranges
> [    0.687093][    T0] On node 0, zone DMA32: 36704 pages in unavailable ranges
> [    0.694278][    T0] On node 0, zone Normal: 53252 pages in unavailable ranges
> [    0.701257][    T0] On node 0, zone Normal: 8192 pages in unavailable ranges

Dang, I have just realized that I have misread the boot log and it has
turned out that a674e48c5443 is covering my situation because the
allocation failure message says:
Node 0 DMA free:0kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:636kB managed:0kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB

I thought there are only few pages in the managed by the DMA zone. This
is still theoretically possible so I think __GFP_NOWARN makes sense here
but it would require to change the patch description.

Is this really worth it?

> 
> The allocation failure on the DMA zone shouldn't be really critical for
> the system operation so just silence the warning instead.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  kernel/dma/pool.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
> index 4d40dcce7604..1bf6de398986 100644
> --- a/kernel/dma/pool.c
> +++ b/kernel/dma/pool.c
> @@ -205,7 +205,7 @@ static int __init dma_atomic_pool_init(void)
>  		ret = -ENOMEM;
>  	if (has_managed_dma()) {
>  		atomic_pool_dma = __dma_atomic_pool_init(atomic_pool_size,
> -						GFP_KERNEL | GFP_DMA);
> +						GFP_KERNEL | GFP_DMA | __GFP_NOWARN);
>  		if (!atomic_pool_dma)
>  			ret = -ENOMEM;
>  	}
> -- 
> 2.30.2
Christoph Hellwig March 25, 2022, 4:48 p.m. UTC | #2
On Fri, Mar 25, 2022 at 01:58:42PM +0100, Michal Hocko wrote:
> Dang, I have just realized that I have misread the boot log and it has
> turned out that a674e48c5443 is covering my situation because the
> allocation failure message says:
>
> Node 0 DMA free:0kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:636kB managed:0kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB

As in your report is from a kernel that does not have a674e48c5443
yet?

> 
> I thought there are only few pages in the managed by the DMA zone. This
> is still theoretically possible so I think __GFP_NOWARN makes sense here
> but it would require to change the patch description.
> 
> Is this really worth it?

In general I think for kernels where we need the pool and can't allocate
it, a warning is very useful.  We just shouldn't spew it when there is
no need for the pool to start with.
Michal Hocko March 25, 2022, 4:54 p.m. UTC | #3
On Fri 25-03-22 17:48:56, Christoph Hellwig wrote:
> On Fri, Mar 25, 2022 at 01:58:42PM +0100, Michal Hocko wrote:
> > Dang, I have just realized that I have misread the boot log and it has
> > turned out that a674e48c5443 is covering my situation because the
> > allocation failure message says:
> >
> > Node 0 DMA free:0kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:636kB managed:0kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> 
> As in your report is from a kernel that does not have a674e48c5443
> yet?

yes. I just mixed up the early boot messages and thought that DMA zone
ended up with a single page. That message was saying something else
though.
 
> > I thought there are only few pages in the managed by the DMA zone. This
> > is still theoretically possible so I think __GFP_NOWARN makes sense here
> > but it would require to change the patch description.
> > 
> > Is this really worth it?
> 
> In general I think for kernels where we need the pool and can't allocate
> it, a warning is very useful.  We just shouldn't spew it when there is
> no need for the pool to start with.

Well, do we have any way to find that out during early boot?
Michal Hocko Aug. 3, 2022, 9:52 a.m. UTC | #4
On Fri 25-03-22 17:54:33, Michal Hocko wrote:
> On Fri 25-03-22 17:48:56, Christoph Hellwig wrote:
> > On Fri, Mar 25, 2022 at 01:58:42PM +0100, Michal Hocko wrote:
> > > Dang, I have just realized that I have misread the boot log and it has
> > > turned out that a674e48c5443 is covering my situation because the
> > > allocation failure message says:
> > >
> > > Node 0 DMA free:0kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:636kB managed:0kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > 
> > As in your report is from a kernel that does not have a674e48c5443
> > yet?
> 
> yes. I just mixed up the early boot messages and thought that DMA zone
> ended up with a single page. That message was saying something else
> though.

OK, so I have another machine spewing this warning. Still on an older
kernel but I do not think the current upstream would be any different in
that regards. This time the DMA zone is populated and consumed from
large part and the pool size request is just too large for it:

[   14.017417][    T1] swapper/0: page allocation failure: order:10, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0-7
[   14.017429][    T1] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 5.14.21-150400.22-default #1 SLE15-SP4 0b6a6578ade2de5c4a0b916095dff44f76ef1704
[   14.017434][    T1] Hardware name: XXXX
[   14.017437][    T1] Call Trace:
[   14.017444][    T1]  <TASK>
[   14.017449][    T1]  dump_stack_lvl+0x45/0x57
[   14.017469][    T1]  warn_alloc+0xfe/0x160
[   14.017490][    T1]  __alloc_pages_slowpath.constprop.112+0xc27/0xc60
[   14.017497][    T1]  ? rdinit_setup+0x2b/0x2b
[   14.017509][    T1]  ? rdinit_setup+0x2b/0x2b
[   14.017512][    T1]  __alloc_pages+0x2d5/0x320
[   14.017517][    T1]  alloc_page_interleave+0xf/0x70
[   14.017531][    T1]  atomic_pool_expand+0x4a/0x200
[   14.017541][    T1]  ? rdinit_setup+0x2b/0x2b
[   14.017544][    T1]  __dma_atomic_pool_init+0x44/0x90
[   14.017556][    T1]  dma_atomic_pool_init+0xad/0x13f
[   14.017560][    T1]  ? __dma_atomic_pool_init+0x90/0x90
[   14.017562][    T1]  do_one_initcall+0x41/0x200
[   14.017581][    T1]  kernel_init_freeable+0x236/0x298
[   14.017589][    T1]  ? rest_init+0xd0/0xd0
[   14.017596][    T1]  kernel_init+0x16/0x120
[   14.017599][    T1]  ret_from_fork+0x22/0x30
[   14.017604][    T1]  </TASK>
[...]
[   14.018026][    T1] Node 0 DMA free:160kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   14.018035][    T1] lowmem_reserve[]: 0 0 0 0 0
[   14.018339][    T1] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 0*64kB 1*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 160kB

So the DMA zone has only 160kB free while the pool would like to use 4MB
of it which obviously fails. I haven't tried to check who is consuming
the DMA zone memory and why but this shouldn't be all that important
because the pool clearly cannot allocate and there is not much the
user/admin can do about that. Well, the pool could be explicitly
requested smaller but is that really what we expect them to do?
  
> > > I thought there are only few pages in the managed by the DMA zone. This
> > > is still theoretically possible so I think __GFP_NOWARN makes sense here
> > > but it would require to change the patch description.
> > > 
> > > Is this really worth it?
> > 
> > In general I think for kernels where we need the pool and can't allocate
> > it, a warning is very useful.  We just shouldn't spew it when there is
> > no need for the pool to start with.
> 
> Well, do we have any way to find that out during early boot?

Thinking about it. We should get a warning when the actual allocation
from the pool fails no? That would be more useful information than the
pre-allocation failure when it is not really clear whether anybody is
ever going to consume it.

What do you think? Should I repost my original patch with the updated
changelog?
Baoquan He Aug. 3, 2022, 2:59 p.m. UTC | #5
On 08/03/22 at 11:52am, Michal Hocko wrote:
> On Fri 25-03-22 17:54:33, Michal Hocko wrote:
> > On Fri 25-03-22 17:48:56, Christoph Hellwig wrote:
> > > On Fri, Mar 25, 2022 at 01:58:42PM +0100, Michal Hocko wrote:
> > > > Dang, I have just realized that I have misread the boot log and it has
> > > > turned out that a674e48c5443 is covering my situation because the
> > > > allocation failure message says:
> > > >
> > > > Node 0 DMA free:0kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:636kB managed:0kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > > 
> > > As in your report is from a kernel that does not have a674e48c5443
> > > yet?
> > 
> > yes. I just mixed up the early boot messages and thought that DMA zone
> > ended up with a single page. That message was saying something else
> > though.
> 
> OK, so I have another machine spewing this warning. Still on an older
> kernel but I do not think the current upstream would be any different in
> that regards. This time the DMA zone is populated and consumed from
> large part and the pool size request is just too large for it:
> 
> [   14.017417][    T1] swapper/0: page allocation failure: order:10, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0-7
> [   14.017429][    T1] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 5.14.21-150400.22-default #1 SLE15-SP4 0b6a6578ade2de5c4a0b916095dff44f76ef1704
> [   14.017434][    T1] Hardware name: XXXX
> [   14.017437][    T1] Call Trace:
> [   14.017444][    T1]  <TASK>
> [   14.017449][    T1]  dump_stack_lvl+0x45/0x57
> [   14.017469][    T1]  warn_alloc+0xfe/0x160
> [   14.017490][    T1]  __alloc_pages_slowpath.constprop.112+0xc27/0xc60
> [   14.017497][    T1]  ? rdinit_setup+0x2b/0x2b
> [   14.017509][    T1]  ? rdinit_setup+0x2b/0x2b
> [   14.017512][    T1]  __alloc_pages+0x2d5/0x320
> [   14.017517][    T1]  alloc_page_interleave+0xf/0x70
> [   14.017531][    T1]  atomic_pool_expand+0x4a/0x200
> [   14.017541][    T1]  ? rdinit_setup+0x2b/0x2b
> [   14.017544][    T1]  __dma_atomic_pool_init+0x44/0x90
> [   14.017556][    T1]  dma_atomic_pool_init+0xad/0x13f
> [   14.017560][    T1]  ? __dma_atomic_pool_init+0x90/0x90
> [   14.017562][    T1]  do_one_initcall+0x41/0x200
> [   14.017581][    T1]  kernel_init_freeable+0x236/0x298
> [   14.017589][    T1]  ? rest_init+0xd0/0xd0
> [   14.017596][    T1]  kernel_init+0x16/0x120
> [   14.017599][    T1]  ret_from_fork+0x22/0x30
> [   14.017604][    T1]  </TASK>
> [...]
> [   14.018026][    T1] Node 0 DMA free:160kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> [   14.018035][    T1] lowmem_reserve[]: 0 0 0 0 0
> [   14.018339][    T1] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 0*64kB 1*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 160kB
> 
> So the DMA zone has only 160kB free while the pool would like to use 4MB
> of it which obviously fails. I haven't tried to check who is consuming
> the DMA zone memory and why but this shouldn't be all that important
> because the pool clearly cannot allocate and there is not much the
> user/admin can do about that. Well, the pool could be explicitly
> requested smaller but is that really what we expect them to do?
>   
> > > > I thought there are only few pages in the managed by the DMA zone. This
> > > > is still theoretically possible so I think __GFP_NOWARN makes sense here
> > > > but it would require to change the patch description.
> > > > 
> > > > Is this really worth it?
> > > 
> > > In general I think for kernels where we need the pool and can't allocate
> > > it, a warning is very useful.  We just shouldn't spew it when there is
> > > no need for the pool to start with.
> > 
> > Well, do we have any way to find that out during early boot?
> 
> Thinking about it. We should get a warning when the actual allocation
> from the pool fails no? That would be more useful information than the
> pre-allocation failure when it is not really clear whether anybody is
> ever going to consume it.

Hi Michal,

You haven't told on which ARCH you met this issue, is it x86_64?
If yes, I have one patch queued to fix it in another way which I have
been trying to take in mind.

Thanks
Baoquan
Michal Hocko Aug. 3, 2022, 3:05 p.m. UTC | #6
On Wed 03-08-22 22:59:26, Baoquan He wrote:
> On 08/03/22 at 11:52am, Michal Hocko wrote:
> > On Fri 25-03-22 17:54:33, Michal Hocko wrote:
> > > On Fri 25-03-22 17:48:56, Christoph Hellwig wrote:
> > > > On Fri, Mar 25, 2022 at 01:58:42PM +0100, Michal Hocko wrote:
> > > > > Dang, I have just realized that I have misread the boot log and it has
> > > > > turned out that a674e48c5443 is covering my situation because the
> > > > > allocation failure message says:
> > > > >
> > > > > Node 0 DMA free:0kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:636kB managed:0kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > > > 
> > > > As in your report is from a kernel that does not have a674e48c5443
> > > > yet?
> > > 
> > > yes. I just mixed up the early boot messages and thought that DMA zone
> > > ended up with a single page. That message was saying something else
> > > though.
> > 
> > OK, so I have another machine spewing this warning. Still on an older
> > kernel but I do not think the current upstream would be any different in
> > that regards. This time the DMA zone is populated and consumed from
> > large part and the pool size request is just too large for it:
> > 
> > [   14.017417][    T1] swapper/0: page allocation failure: order:10, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0-7
> > [   14.017429][    T1] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 5.14.21-150400.22-default #1 SLE15-SP4 0b6a6578ade2de5c4a0b916095dff44f76ef1704
> > [   14.017434][    T1] Hardware name: XXXX
> > [   14.017437][    T1] Call Trace:
> > [   14.017444][    T1]  <TASK>
> > [   14.017449][    T1]  dump_stack_lvl+0x45/0x57
> > [   14.017469][    T1]  warn_alloc+0xfe/0x160
> > [   14.017490][    T1]  __alloc_pages_slowpath.constprop.112+0xc27/0xc60
> > [   14.017497][    T1]  ? rdinit_setup+0x2b/0x2b
> > [   14.017509][    T1]  ? rdinit_setup+0x2b/0x2b
> > [   14.017512][    T1]  __alloc_pages+0x2d5/0x320
> > [   14.017517][    T1]  alloc_page_interleave+0xf/0x70
> > [   14.017531][    T1]  atomic_pool_expand+0x4a/0x200
> > [   14.017541][    T1]  ? rdinit_setup+0x2b/0x2b
> > [   14.017544][    T1]  __dma_atomic_pool_init+0x44/0x90
> > [   14.017556][    T1]  dma_atomic_pool_init+0xad/0x13f
> > [   14.017560][    T1]  ? __dma_atomic_pool_init+0x90/0x90
> > [   14.017562][    T1]  do_one_initcall+0x41/0x200
> > [   14.017581][    T1]  kernel_init_freeable+0x236/0x298
> > [   14.017589][    T1]  ? rest_init+0xd0/0xd0
> > [   14.017596][    T1]  kernel_init+0x16/0x120
> > [   14.017599][    T1]  ret_from_fork+0x22/0x30
> > [   14.017604][    T1]  </TASK>
> > [...]
> > [   14.018026][    T1] Node 0 DMA free:160kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > [   14.018035][    T1] lowmem_reserve[]: 0 0 0 0 0
> > [   14.018339][    T1] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 0*64kB 1*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 160kB
> > 
> > So the DMA zone has only 160kB free while the pool would like to use 4MB
> > of it which obviously fails. I haven't tried to check who is consuming
> > the DMA zone memory and why but this shouldn't be all that important
> > because the pool clearly cannot allocate and there is not much the
> > user/admin can do about that. Well, the pool could be explicitly
> > requested smaller but is that really what we expect them to do?
> >   
> > > > > I thought there are only few pages in the managed by the DMA zone. This
> > > > > is still theoretically possible so I think __GFP_NOWARN makes sense here
> > > > > but it would require to change the patch description.
> > > > > 
> > > > > Is this really worth it?
> > > > 
> > > > In general I think for kernels where we need the pool and can't allocate
> > > > it, a warning is very useful.  We just shouldn't spew it when there is
> > > > no need for the pool to start with.
> > > 
> > > Well, do we have any way to find that out during early boot?
> > 
> > Thinking about it. We should get a warning when the actual allocation
> > from the pool fails no? That would be more useful information than the
> > pre-allocation failure when it is not really clear whether anybody is
> > ever going to consume it.
> 
> Hi Michal,
> 
> You haven't told on which ARCH you met this issue, is it x86_64?

yes x86_64, so a small 16MB DMA zone.

> If yes, I have one patch queued to fix it in another way which I have
> been trying to take in mind.

Any reference?
Baoquan He Aug. 3, 2022, 3:32 p.m. UTC | #7
On 08/03/22 at 05:05pm, Michal Hocko wrote:
> On Wed 03-08-22 22:59:26, Baoquan He wrote:
> > On 08/03/22 at 11:52am, Michal Hocko wrote:
> > > On Fri 25-03-22 17:54:33, Michal Hocko wrote:
> > > > On Fri 25-03-22 17:48:56, Christoph Hellwig wrote:
> > > > > On Fri, Mar 25, 2022 at 01:58:42PM +0100, Michal Hocko wrote:
> > > > > > Dang, I have just realized that I have misread the boot log and it has
> > > > > > turned out that a674e48c5443 is covering my situation because the
> > > > > > allocation failure message says:
> > > > > >
> > > > > > Node 0 DMA free:0kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:636kB managed:0kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > > > > 
> > > > > As in your report is from a kernel that does not have a674e48c5443
> > > > > yet?
> > > > 
> > > > yes. I just mixed up the early boot messages and thought that DMA zone
> > > > ended up with a single page. That message was saying something else
> > > > though.
> > > 
> > > OK, so I have another machine spewing this warning. Still on an older
> > > kernel but I do not think the current upstream would be any different in
> > > that regards. This time the DMA zone is populated and consumed from
> > > large part and the pool size request is just too large for it:
> > > 
> > > [   14.017417][    T1] swapper/0: page allocation failure: order:10, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0-7
> > > [   14.017429][    T1] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 5.14.21-150400.22-default #1 SLE15-SP4 0b6a6578ade2de5c4a0b916095dff44f76ef1704
> > > [   14.017434][    T1] Hardware name: XXXX
> > > [   14.017437][    T1] Call Trace:
> > > [   14.017444][    T1]  <TASK>
> > > [   14.017449][    T1]  dump_stack_lvl+0x45/0x57
> > > [   14.017469][    T1]  warn_alloc+0xfe/0x160
> > > [   14.017490][    T1]  __alloc_pages_slowpath.constprop.112+0xc27/0xc60
> > > [   14.017497][    T1]  ? rdinit_setup+0x2b/0x2b
> > > [   14.017509][    T1]  ? rdinit_setup+0x2b/0x2b
> > > [   14.017512][    T1]  __alloc_pages+0x2d5/0x320
> > > [   14.017517][    T1]  alloc_page_interleave+0xf/0x70
> > > [   14.017531][    T1]  atomic_pool_expand+0x4a/0x200
> > > [   14.017541][    T1]  ? rdinit_setup+0x2b/0x2b
> > > [   14.017544][    T1]  __dma_atomic_pool_init+0x44/0x90
> > > [   14.017556][    T1]  dma_atomic_pool_init+0xad/0x13f
> > > [   14.017560][    T1]  ? __dma_atomic_pool_init+0x90/0x90
> > > [   14.017562][    T1]  do_one_initcall+0x41/0x200
> > > [   14.017581][    T1]  kernel_init_freeable+0x236/0x298
> > > [   14.017589][    T1]  ? rest_init+0xd0/0xd0
> > > [   14.017596][    T1]  kernel_init+0x16/0x120
> > > [   14.017599][    T1]  ret_from_fork+0x22/0x30
> > > [   14.017604][    T1]  </TASK>
> > > [...]
> > > [   14.018026][    T1] Node 0 DMA free:160kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > > [   14.018035][    T1] lowmem_reserve[]: 0 0 0 0 0
> > > [   14.018339][    T1] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 0*64kB 1*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 160kB
> > > 
> > > So the DMA zone has only 160kB free while the pool would like to use 4MB
> > > of it which obviously fails. I haven't tried to check who is consuming
> > > the DMA zone memory and why but this shouldn't be all that important
> > > because the pool clearly cannot allocate and there is not much the
> > > user/admin can do about that. Well, the pool could be explicitly
> > > requested smaller but is that really what we expect them to do?
> > >   
> > > > > > I thought there are only few pages in the managed by the DMA zone. This
> > > > > > is still theoretically possible so I think __GFP_NOWARN makes sense here
> > > > > > but it would require to change the patch description.
> > > > > > 
> > > > > > Is this really worth it?
> > > > > 
> > > > > In general I think for kernels where we need the pool and can't allocate
> > > > > it, a warning is very useful.  We just shouldn't spew it when there is
> > > > > no need for the pool to start with.
> > > > 
> > > > Well, do we have any way to find that out during early boot?
> > > 
> > > Thinking about it. We should get a warning when the actual allocation
> > > from the pool fails no? That would be more useful information than the
> > > pre-allocation failure when it is not really clear whether anybody is
> > > ever going to consume it.
> > 
> > Hi Michal,
> > 
> > You haven't told on which ARCH you met this issue, is it x86_64?
> 
> yes x86_64, so a small 16MB DMA zone.

Yeah, the 16M DMA zone is redicilous and exists only for hardly seen
ISA-style devices support. Haven't prepared the log well.

From 0b32b4c441f9e28bbda06eefbd14c25d00924830 Mon Sep 17 00:00:00 2001
From: Baoquan He <bhe@redhat.com>
Date: Wed, 6 Jul 2022 15:26:15 +0800
Subject: [PATCH] x86, 64: let zone DMA cover low 4G if no ISA-style devices
Content-type: text/plain

It doesn't make sense to let the rare legacy ISA-style devies
drag x86_64 to have a tiny zone DMA of 16M which cause many
troubles.

Signed-off-by: Baoquan He <bhe@redhat.com>
---
 arch/x86/Kconfig   | 1 -
 arch/x86/mm/init.c | 5 ++++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5aa4c2ecf5c7..93af781f9445 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2761,7 +2761,6 @@ config ISA_BUS
 # x86_64 have no ISA slots, but can have ISA-style DMA.
 config ISA_DMA_API
 	bool "ISA-style DMA support" if (X86_64 && EXPERT)
-	default y
 	help
 	  Enables ISA-style DMA support for devices requiring such controllers.
 	  If unsure, say Y.
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 82a042c03824..c9ffb38dcc6a 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1024,9 +1024,12 @@ void __init zone_sizes_init(void)
 
 	memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
 
-#ifdef CONFIG_ZONE_DMA
+#if defined(CONFIG_ZONE_DMA) && defined(CONFIG_ISA_DMA_API)
 	max_zone_pfns[ZONE_DMA]		= min(MAX_DMA_PFN, max_low_pfn);
+#else
+	max_zone_pfns[ZONE_DMA]		= min(MAX_DMA32_PFN, max_low_pfn);
 #endif
+
 #ifdef CONFIG_ZONE_DMA32
 	max_zone_pfns[ZONE_DMA32]	= min(MAX_DMA32_PFN, max_low_pfn);
 #endif
Michal Hocko Aug. 3, 2022, 3:44 p.m. UTC | #8
On Wed 03-08-22 23:32:03, Baoquan He wrote:
> On 08/03/22 at 05:05pm, Michal Hocko wrote:
> > On Wed 03-08-22 22:59:26, Baoquan He wrote:
> > > On 08/03/22 at 11:52am, Michal Hocko wrote:
> > > > On Fri 25-03-22 17:54:33, Michal Hocko wrote:
> > > > > On Fri 25-03-22 17:48:56, Christoph Hellwig wrote:
> > > > > > On Fri, Mar 25, 2022 at 01:58:42PM +0100, Michal Hocko wrote:
> > > > > > > Dang, I have just realized that I have misread the boot log and it has
> > > > > > > turned out that a674e48c5443 is covering my situation because the
> > > > > > > allocation failure message says:
> > > > > > >
> > > > > > > Node 0 DMA free:0kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:636kB managed:0kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > > > > > 
> > > > > > As in your report is from a kernel that does not have a674e48c5443
> > > > > > yet?
> > > > > 
> > > > > yes. I just mixed up the early boot messages and thought that DMA zone
> > > > > ended up with a single page. That message was saying something else
> > > > > though.
> > > > 
> > > > OK, so I have another machine spewing this warning. Still on an older
> > > > kernel but I do not think the current upstream would be any different in
> > > > that regards. This time the DMA zone is populated and consumed from
> > > > large part and the pool size request is just too large for it:
> > > > 
> > > > [   14.017417][    T1] swapper/0: page allocation failure: order:10, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0-7
> > > > [   14.017429][    T1] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 5.14.21-150400.22-default #1 SLE15-SP4 0b6a6578ade2de5c4a0b916095dff44f76ef1704
> > > > [   14.017434][    T1] Hardware name: XXXX
> > > > [   14.017437][    T1] Call Trace:
> > > > [   14.017444][    T1]  <TASK>
> > > > [   14.017449][    T1]  dump_stack_lvl+0x45/0x57
> > > > [   14.017469][    T1]  warn_alloc+0xfe/0x160
> > > > [   14.017490][    T1]  __alloc_pages_slowpath.constprop.112+0xc27/0xc60
> > > > [   14.017497][    T1]  ? rdinit_setup+0x2b/0x2b
> > > > [   14.017509][    T1]  ? rdinit_setup+0x2b/0x2b
> > > > [   14.017512][    T1]  __alloc_pages+0x2d5/0x320
> > > > [   14.017517][    T1]  alloc_page_interleave+0xf/0x70
> > > > [   14.017531][    T1]  atomic_pool_expand+0x4a/0x200
> > > > [   14.017541][    T1]  ? rdinit_setup+0x2b/0x2b
> > > > [   14.017544][    T1]  __dma_atomic_pool_init+0x44/0x90
> > > > [   14.017556][    T1]  dma_atomic_pool_init+0xad/0x13f
> > > > [   14.017560][    T1]  ? __dma_atomic_pool_init+0x90/0x90
> > > > [   14.017562][    T1]  do_one_initcall+0x41/0x200
> > > > [   14.017581][    T1]  kernel_init_freeable+0x236/0x298
> > > > [   14.017589][    T1]  ? rest_init+0xd0/0xd0
> > > > [   14.017596][    T1]  kernel_init+0x16/0x120
> > > > [   14.017599][    T1]  ret_from_fork+0x22/0x30
> > > > [   14.017604][    T1]  </TASK>
> > > > [...]
> > > > [   14.018026][    T1] Node 0 DMA free:160kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > > > [   14.018035][    T1] lowmem_reserve[]: 0 0 0 0 0
> > > > [   14.018339][    T1] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 0*64kB 1*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 160kB
> > > > 
> > > > So the DMA zone has only 160kB free while the pool would like to use 4MB
> > > > of it which obviously fails. I haven't tried to check who is consuming
> > > > the DMA zone memory and why but this shouldn't be all that important
> > > > because the pool clearly cannot allocate and there is not much the
> > > > user/admin can do about that. Well, the pool could be explicitly
> > > > requested smaller but is that really what we expect them to do?
> > > >   
> > > > > > > I thought there are only few pages in the managed by the DMA zone. This
> > > > > > > is still theoretically possible so I think __GFP_NOWARN makes sense here
> > > > > > > but it would require to change the patch description.
> > > > > > > 
> > > > > > > Is this really worth it?
> > > > > > 
> > > > > > In general I think for kernels where we need the pool and can't allocate
> > > > > > it, a warning is very useful.  We just shouldn't spew it when there is
> > > > > > no need for the pool to start with.
> > > > > 
> > > > > Well, do we have any way to find that out during early boot?
> > > > 
> > > > Thinking about it. We should get a warning when the actual allocation
> > > > from the pool fails no? That would be more useful information than the
> > > > pre-allocation failure when it is not really clear whether anybody is
> > > > ever going to consume it.
> > > 
> > > Hi Michal,
> > > 
> > > You haven't told on which ARCH you met this issue, is it x86_64?
> > 
> > yes x86_64, so a small 16MB DMA zone.
> 
> Yeah, the 16M DMA zone is redicilous and exists only for hardly seen
> ISA-style devices support. Haven't prepared the log well.

Agreed on that! I would essentially suggest to completely ignore pool
pre-allocation failures for the small DMA zone. There is barely anything
to be ever consuming it.

Unfortunately generic kernels cannot really know there is any
crippled device without some code to some checking early boot (and I am
not even sure this would be sufficient).

> >From 0b32b4c441f9e28bbda06eefbd14c25d00924830 Mon Sep 17 00:00:00 2001
> From: Baoquan He <bhe@redhat.com>
> Date: Wed, 6 Jul 2022 15:26:15 +0800
> Subject: [PATCH] x86, 64: let zone DMA cover low 4G if no ISA-style devices
> Content-type: text/plain
> 
> It doesn't make sense to let the rare legacy ISA-style devies
> drag x86_64 to have a tiny zone DMA of 16M which cause many
> troubles.
> 
> Signed-off-by: Baoquan He <bhe@redhat.com>
> ---
>  arch/x86/Kconfig   | 1 -
>  arch/x86/mm/init.c | 5 ++++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5aa4c2ecf5c7..93af781f9445 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2761,7 +2761,6 @@ config ISA_BUS
>  # x86_64 have no ISA slots, but can have ISA-style DMA.
>  config ISA_DMA_API
>  	bool "ISA-style DMA support" if (X86_64 && EXPERT)
> -	default y
>  	help
>  	  Enables ISA-style DMA support for devices requiring such controllers.
>  	  If unsure, say Y.
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 82a042c03824..c9ffb38dcc6a 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -1024,9 +1024,12 @@ void __init zone_sizes_init(void)
>  
>  	memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
>  
> -#ifdef CONFIG_ZONE_DMA
> +#if defined(CONFIG_ZONE_DMA) && defined(CONFIG_ISA_DMA_API)
>  	max_zone_pfns[ZONE_DMA]		= min(MAX_DMA_PFN, max_low_pfn);
> +#else
> +	max_zone_pfns[ZONE_DMA]		= min(MAX_DMA32_PFN, max_low_pfn);
>  #endif
> +
>  #ifdef CONFIG_ZONE_DMA32
>  	max_zone_pfns[ZONE_DMA32]	= min(MAX_DMA32_PFN, max_low_pfn);
>  #endif

I would rather see the zone go away completley and free up the slot in
page flags.  This seems like a hack to have two zones representing the
same physical memory range.

This also mostly papers over this particular problem by allocating
allocating two pools for the same range.
Baoquan He Aug. 4, 2022, 11:01 a.m. UTC | #9
On 08/03/22 at 05:44pm, Michal Hocko wrote:
> On Wed 03-08-22 23:32:03, Baoquan He wrote:
> > On 08/03/22 at 05:05pm, Michal Hocko wrote:
> > > On Wed 03-08-22 22:59:26, Baoquan He wrote:
> > > > On 08/03/22 at 11:52am, Michal Hocko wrote:
> > > > > On Fri 25-03-22 17:54:33, Michal Hocko wrote:
> > > > > > On Fri 25-03-22 17:48:56, Christoph Hellwig wrote:
> > > > > > > On Fri, Mar 25, 2022 at 01:58:42PM +0100, Michal Hocko wrote:
> > > > > > > > Dang, I have just realized that I have misread the boot log and it has
> > > > > > > > turned out that a674e48c5443 is covering my situation because the
> > > > > > > > allocation failure message says:
> > > > > > > >
> > > > > > > > Node 0 DMA free:0kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:636kB managed:0kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > > > > > > 
> > > > > > > As in your report is from a kernel that does not have a674e48c5443
> > > > > > > yet?
> > > > > > 
> > > > > > yes. I just mixed up the early boot messages and thought that DMA zone
> > > > > > ended up with a single page. That message was saying something else
> > > > > > though.
> > > > > 
> > > > > OK, so I have another machine spewing this warning. Still on an older
> > > > > kernel but I do not think the current upstream would be any different in
> > > > > that regards. This time the DMA zone is populated and consumed from
> > > > > large part and the pool size request is just too large for it:
> > > > > 
> > > > > [   14.017417][    T1] swapper/0: page allocation failure: order:10, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0-7
> > > > > [   14.017429][    T1] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 5.14.21-150400.22-default #1 SLE15-SP4 0b6a6578ade2de5c4a0b916095dff44f76ef1704
> > > > > [   14.017434][    T1] Hardware name: XXXX
> > > > > [   14.017437][    T1] Call Trace:
> > > > > [   14.017444][    T1]  <TASK>
> > > > > [   14.017449][    T1]  dump_stack_lvl+0x45/0x57
> > > > > [   14.017469][    T1]  warn_alloc+0xfe/0x160
> > > > > [   14.017490][    T1]  __alloc_pages_slowpath.constprop.112+0xc27/0xc60
> > > > > [   14.017497][    T1]  ? rdinit_setup+0x2b/0x2b
> > > > > [   14.017509][    T1]  ? rdinit_setup+0x2b/0x2b
> > > > > [   14.017512][    T1]  __alloc_pages+0x2d5/0x320
> > > > > [   14.017517][    T1]  alloc_page_interleave+0xf/0x70
> > > > > [   14.017531][    T1]  atomic_pool_expand+0x4a/0x200
> > > > > [   14.017541][    T1]  ? rdinit_setup+0x2b/0x2b
> > > > > [   14.017544][    T1]  __dma_atomic_pool_init+0x44/0x90
> > > > > [   14.017556][    T1]  dma_atomic_pool_init+0xad/0x13f
> > > > > [   14.017560][    T1]  ? __dma_atomic_pool_init+0x90/0x90
> > > > > [   14.017562][    T1]  do_one_initcall+0x41/0x200
> > > > > [   14.017581][    T1]  kernel_init_freeable+0x236/0x298
> > > > > [   14.017589][    T1]  ? rest_init+0xd0/0xd0
> > > > > [   14.017596][    T1]  kernel_init+0x16/0x120
> > > > > [   14.017599][    T1]  ret_from_fork+0x22/0x30
> > > > > [   14.017604][    T1]  </TASK>
> > > > > [...]
> > > > > [   14.018026][    T1] Node 0 DMA free:160kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > > > > [   14.018035][    T1] lowmem_reserve[]: 0 0 0 0 0
> > > > > [   14.018339][    T1] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 0*64kB 1*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 160kB
> > > > > 
> > > > > So the DMA zone has only 160kB free while the pool would like to use 4MB
> > > > > of it which obviously fails. I haven't tried to check who is consuming
> > > > > the DMA zone memory and why but this shouldn't be all that important
> > > > > because the pool clearly cannot allocate and there is not much the
> > > > > user/admin can do about that. Well, the pool could be explicitly
> > > > > requested smaller but is that really what we expect them to do?
> > > > >   
> > > > > > > > I thought there are only few pages in the managed by the DMA zone. This
> > > > > > > > is still theoretically possible so I think __GFP_NOWARN makes sense here
> > > > > > > > but it would require to change the patch description.
> > > > > > > > 
> > > > > > > > Is this really worth it?
> > > > > > > 
> > > > > > > In general I think for kernels where we need the pool and can't allocate
> > > > > > > it, a warning is very useful.  We just shouldn't spew it when there is
> > > > > > > no need for the pool to start with.
> > > > > > 
> > > > > > Well, do we have any way to find that out during early boot?
> > > > > 
> > > > > Thinking about it. We should get a warning when the actual allocation
> > > > > from the pool fails no? That would be more useful information than the
> > > > > pre-allocation failure when it is not really clear whether anybody is
> > > > > ever going to consume it.
> > > > 
> > > > Hi Michal,
> > > > 
> > > > You haven't told on which ARCH you met this issue, is it x86_64?
> > > 
> > > yes x86_64, so a small 16MB DMA zone.
> > 
> > Yeah, the 16M DMA zone is redicilous and exists only for hardly seen
> > ISA-style devices support. Haven't prepared the log well.
> 
> Agreed on that! I would essentially suggest to completely ignore pool
> pre-allocation failures for the small DMA zone. There is barely anything
> to be ever consuming it.

I would personally suggest to keep it. W/o that, we even don't know the
issue we are talking about now. I see below commit as a workaround, and
have been trying to fix it finally with a better solution.

commit c4dc63f0032c ("mm/page_alloc.c: do not warn allocation failure on zone DMA if no managed pages")

After attempts, I realize it's time to let one zone DMA or DMA32 cover
the whole low 4G memory on x86_64. That's the real fix. The tiny 16M DMA
on 64bit system is root cause.

> 
> Unfortunately generic kernels cannot really know there is any
> crippled device without some code to some checking early boot (and I am
> not even sure this would be sufficient).
> 
> > >From 0b32b4c441f9e28bbda06eefbd14c25d00924830 Mon Sep 17 00:00:00 2001
> > From: Baoquan He <bhe@redhat.com>
> > Date: Wed, 6 Jul 2022 15:26:15 +0800
> > Subject: [PATCH] x86, 64: let zone DMA cover low 4G if no ISA-style devices
> > Content-type: text/plain
> > 
> > It doesn't make sense to let the rare legacy ISA-style devies
> > drag x86_64 to have a tiny zone DMA of 16M which cause many
> > troubles.
> > 
> > Signed-off-by: Baoquan He <bhe@redhat.com>
> > ---
> >  arch/x86/Kconfig   | 1 -
> >  arch/x86/mm/init.c | 5 ++++-
> >  2 files changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 5aa4c2ecf5c7..93af781f9445 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -2761,7 +2761,6 @@ config ISA_BUS
> >  # x86_64 have no ISA slots, but can have ISA-style DMA.
> >  config ISA_DMA_API
> >  	bool "ISA-style DMA support" if (X86_64 && EXPERT)
> > -	default y
> >  	help
> >  	  Enables ISA-style DMA support for devices requiring such controllers.
> >  	  If unsure, say Y.
> > diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> > index 82a042c03824..c9ffb38dcc6a 100644
> > --- a/arch/x86/mm/init.c
> > +++ b/arch/x86/mm/init.c
> > @@ -1024,9 +1024,12 @@ void __init zone_sizes_init(void)
> >  
> >  	memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
> >  
> > -#ifdef CONFIG_ZONE_DMA
> > +#if defined(CONFIG_ZONE_DMA) && defined(CONFIG_ISA_DMA_API)
> >  	max_zone_pfns[ZONE_DMA]		= min(MAX_DMA_PFN, max_low_pfn);
> > +#else
> > +	max_zone_pfns[ZONE_DMA]		= min(MAX_DMA32_PFN, max_low_pfn);
> >  #endif
> > +
> >  #ifdef CONFIG_ZONE_DMA32
> >  	max_zone_pfns[ZONE_DMA32]	= min(MAX_DMA32_PFN, max_low_pfn);
> >  #endif
> 
> I would rather see the zone go away completley and free up the slot in
> page flags.  This seems like a hack to have two zones representing the
> same physical memory range.
> 
> This also mostly papers over this particular problem by allocating
> allocating two pools for the same range.

No, it doesn't paper over anything, and isn't a hack. Zone dma now covers
low 4G, just zone DMA32 is empty. Any allocation request with GFP_DMA will
be satisfied, while request with GFP_DMA32 will fall back to zone DMA.

See my summary about zone DMA/DMA32 on ARCHes. Currently only x86_64
always has this burdensome tiny DMA zone. Other ARCH has made adjustment
to avoid that conditionally. The way I took in above patch is similar
with arm64 handling.

The two pools for the same range has been there on arm64 and mips, we
can easily fix it by introducing has_managed_dma32() and checking it
before allocating atomic_pool_dma32 pool, just like I have done for
atomic_pool_dma there.

=============================
ARCH which has DMA32
        ZONE_DMA       ZONE_DMA32
arm64   0~X            empty or X~4G  (X is got from ACPI or DT. Otherwise it's 4G by default, DMA32 is empty)
ia64    None           0~4G
mips    empty or 0~16M X~4G  (zone DMA is empty on SGI_IP22 or SGI_IP28, otherwise 16M by default like i386)
riscv   None           0~4G
x86_64  16M            16M~4G
=============================


As for the only one DMA or DMA32 zone exist on x86_64 you suggested, I
made below draft change which only creates zone DMA32 to cover the whole
low 4G meomry, just like RISC-V and ia64 are doing. It works well on
one intel machine, no other change is required. However, I have one
concern, it possibly comes from my own misunderstanding, please help
point out where I got it wrong. If there's only DMA32 zone, and
CONFIG_ZONE_DMA is disabled, how does it handle GFP_DMA allocation
request? See gfp_zone(), it will return ZONE_NORMAL to user even though
user expects to get memory for DMA handling?

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fb5900e2c29a..7ec4a7aec43c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -34,6 +34,7 @@ config X86_64
 	select SWIOTLB
 	select ARCH_HAS_ELFCORE_COMPAT
 	select ZONE_DMA32
+	select ZONE_DMA if ISA_DMA_API
 
 config FORCE_DYNAMIC_FTRACE
 	def_bool y
@@ -2752,7 +2753,6 @@ config ISA_BUS
 # x86_64 have no ISA slots, but can have ISA-style DMA.
 config ISA_DMA_API
 	bool "ISA-style DMA support" if (X86_64 && EXPERT)
-	default y
 	help
 	  Enables ISA-style DMA support for devices requiring such controllers.
 	  If unsure, say Y.
diff --git a/mm/Kconfig b/mm/Kconfig
index 169e64192e48..50ad23abcb7e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -962,7 +962,7 @@ config ARCH_HAS_ZONE_DMA_SET
 
 config ZONE_DMA
 	bool "Support DMA zone" if ARCH_HAS_ZONE_DMA_SET
-	default y if ARM64 || X86
+	default y if ARM64 || X86_32
 
 config ZONE_DMA32
 	bool "Support DMA32 zone" if ARCH_HAS_ZONE_DMA_SET
Michal Hocko Aug. 4, 2022, 12:01 p.m. UTC | #10
On Thu 04-08-22 19:01:28, Baoquan He wrote:
> On 08/03/22 at 05:44pm, Michal Hocko wrote:
> > On Wed 03-08-22 23:32:03, Baoquan He wrote:
> > > On 08/03/22 at 05:05pm, Michal Hocko wrote:
> > > > On Wed 03-08-22 22:59:26, Baoquan He wrote:
> > > > > On 08/03/22 at 11:52am, Michal Hocko wrote:
> > > > > > On Fri 25-03-22 17:54:33, Michal Hocko wrote:
> > > > > > > On Fri 25-03-22 17:48:56, Christoph Hellwig wrote:
> > > > > > > > On Fri, Mar 25, 2022 at 01:58:42PM +0100, Michal Hocko wrote:
> > > > > > > > > Dang, I have just realized that I have misread the boot log and it has
> > > > > > > > > turned out that a674e48c5443 is covering my situation because the
> > > > > > > > > allocation failure message says:
> > > > > > > > >
> > > > > > > > > Node 0 DMA free:0kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:636kB managed:0kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > > > > > > > 
> > > > > > > > As in your report is from a kernel that does not have a674e48c5443
> > > > > > > > yet?
> > > > > > > 
> > > > > > > yes. I just mixed up the early boot messages and thought that DMA zone
> > > > > > > ended up with a single page. That message was saying something else
> > > > > > > though.
> > > > > > 
> > > > > > OK, so I have another machine spewing this warning. Still on an older
> > > > > > kernel but I do not think the current upstream would be any different in
> > > > > > that regards. This time the DMA zone is populated and consumed from
> > > > > > large part and the pool size request is just too large for it:
> > > > > > 
> > > > > > [   14.017417][    T1] swapper/0: page allocation failure: order:10, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0-7
> > > > > > [   14.017429][    T1] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 5.14.21-150400.22-default #1 SLE15-SP4 0b6a6578ade2de5c4a0b916095dff44f76ef1704
> > > > > > [   14.017434][    T1] Hardware name: XXXX
> > > > > > [   14.017437][    T1] Call Trace:
> > > > > > [   14.017444][    T1]  <TASK>
> > > > > > [   14.017449][    T1]  dump_stack_lvl+0x45/0x57
> > > > > > [   14.017469][    T1]  warn_alloc+0xfe/0x160
> > > > > > [   14.017490][    T1]  __alloc_pages_slowpath.constprop.112+0xc27/0xc60
> > > > > > [   14.017497][    T1]  ? rdinit_setup+0x2b/0x2b
> > > > > > [   14.017509][    T1]  ? rdinit_setup+0x2b/0x2b
> > > > > > [   14.017512][    T1]  __alloc_pages+0x2d5/0x320
> > > > > > [   14.017517][    T1]  alloc_page_interleave+0xf/0x70
> > > > > > [   14.017531][    T1]  atomic_pool_expand+0x4a/0x200
> > > > > > [   14.017541][    T1]  ? rdinit_setup+0x2b/0x2b
> > > > > > [   14.017544][    T1]  __dma_atomic_pool_init+0x44/0x90
> > > > > > [   14.017556][    T1]  dma_atomic_pool_init+0xad/0x13f
> > > > > > [   14.017560][    T1]  ? __dma_atomic_pool_init+0x90/0x90
> > > > > > [   14.017562][    T1]  do_one_initcall+0x41/0x200
> > > > > > [   14.017581][    T1]  kernel_init_freeable+0x236/0x298
> > > > > > [   14.017589][    T1]  ? rest_init+0xd0/0xd0
> > > > > > [   14.017596][    T1]  kernel_init+0x16/0x120
> > > > > > [   14.017599][    T1]  ret_from_fork+0x22/0x30
> > > > > > [   14.017604][    T1]  </TASK>
> > > > > > [...]
> > > > > > [   14.018026][    T1] Node 0 DMA free:160kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > > > > > [   14.018035][    T1] lowmem_reserve[]: 0 0 0 0 0
> > > > > > [   14.018339][    T1] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 0*64kB 1*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 160kB
> > > > > > 
> > > > > > So the DMA zone has only 160kB free while the pool would like to use 4MB
> > > > > > of it which obviously fails. I haven't tried to check who is consuming
> > > > > > the DMA zone memory and why but this shouldn't be all that important
> > > > > > because the pool clearly cannot allocate and there is not much the
> > > > > > user/admin can do about that. Well, the pool could be explicitly
> > > > > > requested smaller but is that really what we expect them to do?
> > > > > >   
> > > > > > > > > I thought there are only few pages in the managed by the DMA zone. This
> > > > > > > > > is still theoretically possible so I think __GFP_NOWARN makes sense here
> > > > > > > > > but it would require to change the patch description.
> > > > > > > > > 
> > > > > > > > > Is this really worth it?
> > > > > > > > 
> > > > > > > > In general I think for kernels where we need the pool and can't allocate
> > > > > > > > it, a warning is very useful.  We just shouldn't spew it when there is
> > > > > > > > no need for the pool to start with.
> > > > > > > 
> > > > > > > Well, do we have any way to find that out during early boot?
> > > > > > 
> > > > > > Thinking about it. We should get a warning when the actual allocation
> > > > > > from the pool fails no? That would be more useful information than the
> > > > > > pre-allocation failure when it is not really clear whether anybody is
> > > > > > ever going to consume it.
> > > > > 
> > > > > Hi Michal,
> > > > > 
> > > > > You haven't told on which ARCH you met this issue, is it x86_64?
> > > > 
> > > > yes x86_64, so a small 16MB DMA zone.
> > > 
> > > Yeah, the 16M DMA zone is redicilous and exists only for hardly seen
> > > ISA-style devices support. Haven't prepared the log well.
> > 
> > Agreed on that! I would essentially suggest to completely ignore pool
> > pre-allocation failures for the small DMA zone. There is barely anything
> > to be ever consuming it.
> 
> I would personally suggest to keep it. W/o that, we even don't know the
> issue we are talking about now. I see below commit as a workaround, and
> have been trying to fix it finally with a better solution.
> 
> commit c4dc63f0032c ("mm/page_alloc.c: do not warn allocation failure on zone DMA if no managed pages")

This will not help in any case but an empty DMA zone. As you can see
this is not the case in my example.

> After attempts, I realize it's time to let one zone DMA or DMA32 cover
> the whole low 4G memory on x86_64. That's the real fix. The tiny 16M DMA
> on 64bit system is root cause.

Yes, I would agree with that. This means DMA zone is gone completely.
 
[...]
> > This also mostly papers over this particular problem by allocating
> > allocating two pools for the same range.
> 
> No, it doesn't paper over anything, and isn't a hack. Zone dma now covers
> low 4G, just zone DMA32 is empty. Any allocation request with GFP_DMA will
> be satisfied, while request with GFP_DMA32 will fall back to zone DMA.

This is breaking expectation of the DMA zone on x86. This is not
acceptable. You really want to do it other way around. Only have DMA32
zone and cover the whole 32b phys address range. DMA zone would be empty
which is something we can have already. This would make the crap HW fail
on allocations.

> See my summary about zone DMA/DMA32 on ARCHes. Currently only x86_64
> always has this burdensome tiny DMA zone. Other ARCH has made adjustment
> to avoid that conditionally. The way I took in above patch is similar
> with arm64 handling.

Yeah a historical baggage we have to live with or just stop pretending
we really do support that ISA HW. Or make that dynamic and only set up
DMA zone when there is a HW like that present.

> The two pools for the same range has been there on arm64 and mips, we
> can easily fix it by introducing has_managed_dma32() and checking it
> before allocating atomic_pool_dma32 pool, just like I have done for
> atomic_pool_dma there.
> 
> =============================
> ARCH which has DMA32
>         ZONE_DMA       ZONE_DMA32
> arm64   0~X            empty or X~4G  (X is got from ACPI or DT. Otherwise it's 4G by default, DMA32 is empty)
> ia64    None           0~4G
> mips    empty or 0~16M X~4G  (zone DMA is empty on SGI_IP22 or SGI_IP28, otherwise 16M by default like i386)
> riscv   None           0~4G
> x86_64  16M            16M~4G
> =============================
> 
> 
> As for the only one DMA or DMA32 zone exist on x86_64 you suggested, I
> made below draft change which only creates zone DMA32 to cover the whole
> low 4G meomry, just like RISC-V and ia64 are doing. It works well on
> one intel machine, no other change is required. However, I have one
> concern, it possibly comes from my own misunderstanding, please help
> point out where I got it wrong. If there's only DMA32 zone, and
> CONFIG_ZONE_DMA is disabled, how does it handle GFP_DMA allocation
> request?

the expected behavior should be to fail the allocation.

> See gfp_zone(), it will return ZONE_NORMAL to user even though
> user expects to get memory for DMA handling?

It's been quite some time since I've had gfp_zone cached in my brain. I
suspect it would require some special casing for GFP_DMA to always fail.
I would have to study the code again. We have discussed this at some
LSFMM few years back. Maybe you can find something at lwn.

In any case. This is a larger change. Do you agree that the warning is
pointless and __GFP_NOWARN is a very simple way to deal with it until we
sort out situation of the DMA zone on x86 which is a long standing
problem?
Baoquan He Aug. 5, 2022, 12:34 p.m. UTC | #11
On 08/04/22 at 02:01pm, Michal Hocko wrote:
...snip...
> > > > > > > Thinking about it. We should get a warning when the actual allocation
> > > > > > > from the pool fails no? That would be more useful information than the
> > > > > > > pre-allocation failure when it is not really clear whether anybody is
> > > > > > > ever going to consume it.
> > > > > > 
> > > > > > Hi Michal,
> > > > > > 
> > > > > > You haven't told on which ARCH you met this issue, is it x86_64?
> > > > > 
> > > > > yes x86_64, so a small 16MB DMA zone.
> > > > 
> > > > Yeah, the 16M DMA zone is redicilous and exists only for hardly seen
> > > > ISA-style devices support. Haven't prepared the log well.
> > > 
> > > Agreed on that! I would essentially suggest to completely ignore pool
> > > pre-allocation failures for the small DMA zone. There is barely anything
> > > to be ever consuming it.
> > 
> > I would personally suggest to keep it. W/o that, we even don't know the
> > issue we are talking about now. I see below commit as a workaround, and
> > have been trying to fix it finally with a better solution.
> > 
> > commit c4dc63f0032c ("mm/page_alloc.c: do not warn allocation failure on zone DMA if no managed pages")
> 
> This will not help in any case but an empty DMA zone. As you can see
> this is not the case in my example.

Yeah, it's different case.

> 
> > After attempts, I realize it's time to let one zone DMA or DMA32 cover
> > the whole low 4G memory on x86_64. That's the real fix. The tiny 16M DMA
> > on 64bit system is root cause.
> 
> Yes, I would agree with that. This means DMA zone is gone completely.
>  
> [...]
> > > This also mostly papers over this particular problem by allocating
> > > allocating two pools for the same range.
> > 
> > No, it doesn't paper over anything, and isn't a hack. Zone dma now covers
> > low 4G, just zone DMA32 is empty. Any allocation request with GFP_DMA will
> > be satisfied, while request with GFP_DMA32 will fall back to zone DMA.
> 
> This is breaking expectation of the DMA zone on x86. This is not
> acceptable. You really want to do it other way around. Only have DMA32
> zone and cover the whole 32b phys address range. DMA zone would be empty
> which is something we can have already. This would make the crap HW fail
> on allocations.

OK, I will consider it in the direction that only DMA32 zone cover the
low 4G as you suggested. Do you have pointer about them? Really
appreciate it.
> 
> > See my summary about zone DMA/DMA32 on ARCHes. Currently only x86_64
> > always has this burdensome tiny DMA zone. Other ARCH has made adjustment
> > to avoid that conditionally. The way I took in above patch is similar
> > with arm64 handling.
> 
> Yeah a historical baggage we have to live with or just stop pretending
> we really do support that ISA HW. Or make that dynamic and only set up
> DMA zone when there is a HW like that present.

Ok, I see, you don't like the handling on arm64 and mips either.

> 
> > The two pools for the same range has been there on arm64 and mips, we
> > can easily fix it by introducing has_managed_dma32() and checking it
> > before allocating atomic_pool_dma32 pool, just like I have done for
> > atomic_pool_dma there.
> > 
> > =============================
> > ARCH which has DMA32
> >         ZONE_DMA       ZONE_DMA32
> > arm64   0~X            empty or X~4G  (X is got from ACPI or DT. Otherwise it's 4G by default, DMA32 is empty)
> > ia64    None           0~4G
> > mips    empty or 0~16M X~4G  (zone DMA is empty on SGI_IP22 or SGI_IP28, otherwise 16M by default like i386)
> > riscv   None           0~4G
> > x86_64  16M            16M~4G
> > =============================
> > 
> > 
> > As for the only one DMA or DMA32 zone exist on x86_64 you suggested, I
> > made below draft change which only creates zone DMA32 to cover the whole
> > low 4G meomry, just like RISC-V and ia64 are doing. It works well on
> > one intel machine, no other change is required. However, I have one
> > concern, it possibly comes from my own misunderstanding, please help
> > point out where I got it wrong. If there's only DMA32 zone, and
> > CONFIG_ZONE_DMA is disabled, how does it handle GFP_DMA allocation
> > request?
> 
> the expected behavior should be to fail the allocation.

Could you tell why we should fail the allocation?

With my understanding, whether it is allocation request with GFP_DMA
or GFP_DMA32, it's requesting memory for DMA buffer, or requesting
memory under 32bit. If we only have zone DMA32 covering low 4G, and
people requests meomry with GFP_DMA, we should try to satisfy it with
memory from zone DMA32. Because people may not know the internal detail
about memory from zone DMA or DMA32, and they just want to get a buffer
for DMA transfering.

With your saying, if there's only DMA32 existing and cover the whole low
4G, any allocation request with GFP_DMA need be failed, we need search
and change all those places where GFP_DMA is set.

> 
> > See gfp_zone(), it will return ZONE_NORMAL to user even though
> > user expects to get memory for DMA handling?
> 
> It's been quite some time since I've had gfp_zone cached in my brain. I
> suspect it would require some special casing for GFP_DMA to always fail.
> I would have to study the code again. We have discussed this at some
> LSFMM few years back. Maybe you can find something at lwn.
> 
> In any case. This is a larger change. Do you agree that the warning is
> pointless and __GFP_NOWARN is a very simple way to deal with it until we
> sort out situation of the DMA zone on x86 which is a long standing
> problem?

I understand the warning will be annoying and customer will complain
about it. We can mute it for the time being, and may open it later
when we get rid of the 16M DMA zone.

So, yes, I agree that there's nothing we can do currently except of
muting it. I can add my tag when you update and post.
Michal Hocko Aug. 5, 2022, 5:37 p.m. UTC | #12
On Fri 05-08-22 20:34:32, Baoquan He wrote:
> On 08/04/22 at 02:01pm, Michal Hocko wrote:
> ...snip...
> > > > > > > > Thinking about it. We should get a warning when the actual allocation
> > > > > > > > from the pool fails no? That would be more useful information than the
> > > > > > > > pre-allocation failure when it is not really clear whether anybody is
> > > > > > > > ever going to consume it.
> > > > > > > 
> > > > > > > Hi Michal,
> > > > > > > 
> > > > > > > You haven't told on which ARCH you met this issue, is it x86_64?
> > > > > > 
> > > > > > yes x86_64, so a small 16MB DMA zone.
> > > > > 
> > > > > Yeah, the 16M DMA zone is redicilous and exists only for hardly seen
> > > > > ISA-style devices support. Haven't prepared the log well.
> > > > 
> > > > Agreed on that! I would essentially suggest to completely ignore pool
> > > > pre-allocation failures for the small DMA zone. There is barely anything
> > > > to be ever consuming it.
> > > 
> > > I would personally suggest to keep it. W/o that, we even don't know the
> > > issue we are talking about now. I see below commit as a workaround, and
> > > have been trying to fix it finally with a better solution.
> > > 
> > > commit c4dc63f0032c ("mm/page_alloc.c: do not warn allocation failure on zone DMA if no managed pages")
> > 
> > This will not help in any case but an empty DMA zone. As you can see
> > this is not the case in my example.
> 
> Yeah, it's different case.
> 
> > 
> > > After attempts, I realize it's time to let one zone DMA or DMA32 cover
> > > the whole low 4G memory on x86_64. That's the real fix. The tiny 16M DMA
> > > on 64bit system is root cause.
> > 
> > Yes, I would agree with that. This means DMA zone is gone completely.
> >  
> > [...]
> > > > This also mostly papers over this particular problem by allocating
> > > > allocating two pools for the same range.
> > > 
> > > No, it doesn't paper over anything, and isn't a hack. Zone dma now covers
> > > low 4G, just zone DMA32 is empty. Any allocation request with GFP_DMA will
> > > be satisfied, while request with GFP_DMA32 will fall back to zone DMA.
> > 
> > This is breaking expectation of the DMA zone on x86. This is not
> > acceptable. You really want to do it other way around. Only have DMA32
> > zone and cover the whole 32b phys address range. DMA zone would be empty
> > which is something we can have already. This would make the crap HW fail
> > on allocations.
> 
> OK, I will consider it in the direction that only DMA32 zone cover the
> low 4G as you suggested. Do you have pointer about them? Really
> appreciate it.

Not sure I understand what you are asking.

[...]

> > the expected behavior should be to fail the allocation.
> 
> Could you tell why we should fail the allocation?

If we cannot guarantee DMA (low 16MB) then the allocation should fail
otherwise any HW that really requires that address range could fail in
unpredictable way. Sure you can try harder and check whether DMA32 zone
has any pages from that range but I am not really sure this is worth it.

> With my understanding, whether it is allocation request with GFP_DMA
> or GFP_DMA32, it's requesting memory for DMA buffer, or requesting
> memory under 32bit. If we only have zone DMA32 covering low 4G, and
> people requests meomry with GFP_DMA, we should try to satisfy it with
> memory from zone DMA32. Because people may not know the internal detail
> about memory from zone DMA or DMA32, and they just want to get a buffer
> for DMA transfering.

GFP_DMA is a misnomer. It doesn't really say that the allocation should
be done for actual DMA. GFP_DMA really says allocate from ZONE_DMA. It
is my understanding that all actual DMA users should use a dedicated dma
allocation APIs which should do the right thing wrt. address constrains.

> With your saying, if there's only DMA32 existing and cover the whole low
> 4G, any allocation request with GFP_DMA need be failed, we need search
> and change all those places where GFP_DMA is set.

Yes a mass review of GFP_DMA usage is certainly due and it would have to
preceed any other changes. This is likely the reason why no unification
has happened yet.
 
> > > See gfp_zone(), it will return ZONE_NORMAL to user even though
> > > user expects to get memory for DMA handling?
> > 
> > It's been quite some time since I've had gfp_zone cached in my brain. I
> > suspect it would require some special casing for GFP_DMA to always fail.
> > I would have to study the code again. We have discussed this at some
> > LSFMM few years back. Maybe you can find something at lwn.
> > 
> > In any case. This is a larger change. Do you agree that the warning is
> > pointless and __GFP_NOWARN is a very simple way to deal with it until we
> > sort out situation of the DMA zone on x86 which is a long standing
> > problem?
> 
> I understand the warning will be annoying and customer will complain
> about it. We can mute it for the time being, and may open it later
> when we get rid of the 16M DMA zone.
> 
> So, yes, I agree that there's nothing we can do currently except of
> muting it. I can add my tag when you update and post.

Thanks. I will wait for Christoph and his comments and post a full patch
next week.
Michal Hocko Aug. 9, 2022, 3:37 p.m. UTC | #13
Here we go again.
---
From 1dc9d7504624b273de47a88a9907f43533bfe08e Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 25 Mar 2022 13:25:59 +0100
Subject: [PATCH] dma/pool: do not complain if DMA pool is not allocated

we have a system complainging about order-10 allocation for the DMA pool.

[   14.017417][    T1] swapper/0: page allocation failure: order:10, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0-7
[   14.017429][    T1] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 5.14.21-150400.22-default #1 SLE15-SP4 0b6a6578ade2de5c4a0b916095dff44f76ef1704
[   14.017434][    T1] Hardware name: XXXX
[   14.017437][    T1] Call Trace:
[   14.017444][    T1]  <TASK>
[   14.017449][    T1]  dump_stack_lvl+0x45/0x57
[   14.017469][    T1]  warn_alloc+0xfe/0x160
[   14.017490][    T1]  __alloc_pages_slowpath.constprop.112+0xc27/0xc60
[   14.017497][    T1]  ? rdinit_setup+0x2b/0x2b
[   14.017509][    T1]  ? rdinit_setup+0x2b/0x2b
[   14.017512][    T1]  __alloc_pages+0x2d5/0x320
[   14.017517][    T1]  alloc_page_interleave+0xf/0x70
[   14.017531][    T1]  atomic_pool_expand+0x4a/0x200
[   14.017541][    T1]  ? rdinit_setup+0x2b/0x2b
[   14.017544][    T1]  __dma_atomic_pool_init+0x44/0x90
[   14.017556][    T1]  dma_atomic_pool_init+0xad/0x13f
[   14.017560][    T1]  ? __dma_atomic_pool_init+0x90/0x90
[   14.017562][    T1]  do_one_initcall+0x41/0x200
[   14.017581][    T1]  kernel_init_freeable+0x236/0x298
[   14.017589][    T1]  ? rest_init+0xd0/0xd0
[   14.017596][    T1]  kernel_init+0x16/0x120
[   14.017599][    T1]  ret_from_fork+0x22/0x30
[   14.017604][    T1]  </TASK>
[...]
[   14.018026][    T1] Node 0 DMA free:160kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   14.018035][    T1] lowmem_reserve[]: 0 0 0 0 0
[   14.018339][    T1] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 0*64kB 1*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 160kB

The usable memory in the DMA zone is obviously too small for the pool
pre-allocation. The allocation failure raises concern by admins because
this is considered an error state.

In fact the preallocation itself doesn't expose any actual problem. It
is not even clear whether anybody is ever going to use this pool. If yes
then a warning will be triggered anyway.

Silence the warning to prevent confusion and bug reports.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 kernel/dma/pool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
index 4d40dcce7604..1bf6de398986 100644
--- a/kernel/dma/pool.c
+++ b/kernel/dma/pool.c
@@ -205,7 +205,7 @@ static int __init dma_atomic_pool_init(void)
 		ret = -ENOMEM;
 	if (has_managed_dma()) {
 		atomic_pool_dma = __dma_atomic_pool_init(atomic_pool_size,
-						GFP_KERNEL | GFP_DMA);
+						GFP_KERNEL | GFP_DMA | __GFP_NOWARN);
 		if (!atomic_pool_dma)
 			ret = -ENOMEM;
 	}
Andrew Morton Aug. 10, 2022, 1:32 a.m. UTC | #14
On Tue, 9 Aug 2022 17:37:59 +0200 Michal Hocko <mhocko@suse.com> wrote:

> we have a system complainging about order-10 allocation for the DMA pool.
> 

I'll add a cc:stable to this - if future users like the patch, so will
current ones!
Baoquan He Aug. 10, 2022, 2:19 a.m. UTC | #15
On 08/09/22 at 05:37pm, Michal Hocko wrote:
> Here we go again.
> ---
> From 1dc9d7504624b273de47a88a9907f43533bfe08e Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 25 Mar 2022 13:25:59 +0100
> Subject: [PATCH] dma/pool: do not complain if DMA pool is not allocated
> 
> we have a system complainging about order-10 allocation for the DMA pool.
> 
> [   14.017417][    T1] swapper/0: page allocation failure: order:10, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0-7
> [   14.017429][    T1] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 5.14.21-150400.22-default #1 SLE15-SP4 0b6a6578ade2de5c4a0b916095dff44f76ef1704
> [   14.017434][    T1] Hardware name: XXXX
> [   14.017437][    T1] Call Trace:
> [   14.017444][    T1]  <TASK>
> [   14.017449][    T1]  dump_stack_lvl+0x45/0x57
> [   14.017469][    T1]  warn_alloc+0xfe/0x160
> [   14.017490][    T1]  __alloc_pages_slowpath.constprop.112+0xc27/0xc60
> [   14.017497][    T1]  ? rdinit_setup+0x2b/0x2b
> [   14.017509][    T1]  ? rdinit_setup+0x2b/0x2b
> [   14.017512][    T1]  __alloc_pages+0x2d5/0x320
> [   14.017517][    T1]  alloc_page_interleave+0xf/0x70
> [   14.017531][    T1]  atomic_pool_expand+0x4a/0x200
> [   14.017541][    T1]  ? rdinit_setup+0x2b/0x2b
> [   14.017544][    T1]  __dma_atomic_pool_init+0x44/0x90
> [   14.017556][    T1]  dma_atomic_pool_init+0xad/0x13f
> [   14.017560][    T1]  ? __dma_atomic_pool_init+0x90/0x90
> [   14.017562][    T1]  do_one_initcall+0x41/0x200
> [   14.017581][    T1]  kernel_init_freeable+0x236/0x298
> [   14.017589][    T1]  ? rest_init+0xd0/0xd0
> [   14.017596][    T1]  kernel_init+0x16/0x120
> [   14.017599][    T1]  ret_from_fork+0x22/0x30
> [   14.017604][    T1]  </TASK>
> [...]
> [   14.018026][    T1] Node 0 DMA free:160kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> [   14.018035][    T1] lowmem_reserve[]: 0 0 0 0 0
> [   14.018339][    T1] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 0*64kB 1*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 160kB
> 
> The usable memory in the DMA zone is obviously too small for the pool
> pre-allocation. The allocation failure raises concern by admins because
> this is considered an error state.
> 
> In fact the preallocation itself doesn't expose any actual problem. It
> is not even clear whether anybody is ever going to use this pool. If yes
> then a warning will be triggered anyway.
> 
> Silence the warning to prevent confusion and bug reports.

LGTM,

Reviewed-by: Baoquan He <bhe@redhat.com>

> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  kernel/dma/pool.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
> index 4d40dcce7604..1bf6de398986 100644
> --- a/kernel/dma/pool.c
> +++ b/kernel/dma/pool.c
> @@ -205,7 +205,7 @@ static int __init dma_atomic_pool_init(void)
>  		ret = -ENOMEM;
>  	if (has_managed_dma()) {
>  		atomic_pool_dma = __dma_atomic_pool_init(atomic_pool_size,
> -						GFP_KERNEL | GFP_DMA);
> +						GFP_KERNEL | GFP_DMA | __GFP_NOWARN);
>  		if (!atomic_pool_dma)
>  			ret = -ENOMEM;
>  	}
> -- 
> 2.30.2
> 
> -- 
> Michal Hocko
> SUSE Labs
>
Michal Hocko Aug. 10, 2022, 6:40 a.m. UTC | #16
On Tue 09-08-22 18:32:52, Andrew Morton wrote:
> On Tue, 9 Aug 2022 17:37:59 +0200 Michal Hocko <mhocko@suse.com> wrote:
> 
> > we have a system complainging about order-10 allocation for the DMA pool.
> > 
> 
> I'll add a cc:stable to this - if future users like the patch, so will
> current ones!

Technically speaking this is not a fix so I am not sure the stable tree
is really a great fit. On the other hand I am definitely going to
backport this to older SLES kernels because we have had at least 2
reports where this has been brought up for clarification.

That being said, no objection from me.
Christoph Hellwig Aug. 11, 2022, 7:27 a.m. UTC | #17
On Fri, Mar 25, 2022 at 05:54:32PM +0100, Michal Hocko wrote:
> > > I thought there are only few pages in the managed by the DMA zone. This
> > > is still theoretically possible so I think __GFP_NOWARN makes sense here
> > > but it would require to change the patch description.
> > > 
> > > Is this really worth it?
> > 
> > In general I think for kernels where we need the pool and can't allocate
> > it, a warning is very useful.  We just shouldn't spew it when there is
> > no need for the pool to start with.
> 
> Well, do we have any way to find that out during early boot?

In general an architecture / configuration that selects
CONFIG_ZONE_DMA needs it.  We could try to reduce that dependency and/or
make it boot time configurable, but there's still plenty of device with
sub-32bit addessing limits around, so I'm not sure it would help much.
Christoph Hellwig Aug. 11, 2022, 7:28 a.m. UTC | #18
On Wed, Aug 03, 2022 at 11:52:10AM +0200, Michal Hocko wrote:
> OK, so I have another machine spewing this warning. Still on an older
> kernel but I do not think the current upstream would be any different in
> that regards. This time the DMA zone is populated and consumed from
> large part and the pool size request is just too large for it:

I can't really parse the last sentence.  What does "consumed from large
part" mean here?
Christoph Hellwig Aug. 11, 2022, 7:29 a.m. UTC | #19
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5aa4c2ecf5c7..93af781f9445 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2761,7 +2761,6 @@ config ISA_BUS
>  # x86_64 have no ISA slots, but can have ISA-style DMA.
>  config ISA_DMA_API
>  	bool "ISA-style DMA support" if (X86_64 && EXPERT)
> -	default y

This looks sensible to me, but you'll have to get it past Linus.

> -#ifdef CONFIG_ZONE_DMA
> +#if defined(CONFIG_ZONE_DMA) && defined(CONFIG_ISA_DMA_API)
>  	max_zone_pfns[ZONE_DMA]		= min(MAX_DMA_PFN, max_low_pfn);
> +#else
> +	max_zone_pfns[ZONE_DMA]		= min(MAX_DMA32_PFN, max_low_pfn);
>  #endif

But this simply can't work at all.
Christoph Hellwig Aug. 11, 2022, 7:34 a.m. UTC | #20
On Fri, Aug 05, 2022 at 07:37:16PM +0200, Michal Hocko wrote:
> GFP_DMA is a misnomer. It doesn't really say that the allocation should
> be done for actual DMA. GFP_DMA really says allocate from ZONE_DMA.

Yes.  But in practice that either is to do DMA of some form, including
the s390 variant not using the dma api, or completely stupid cargo
cult code.

> It
> is my understanding that all actual DMA users should use a dedicated dma
> allocation APIs which should do the right thing wrt. address constrains.

Yes.  And we were at a point where we got to that, but it seems somewhat
recently a lot of people added completely stupid GFP_DMA kmallocs in the
all over the crypto drivers.  Sigh...
Christoph Hellwig Aug. 11, 2022, 7:38 a.m. UTC | #21
On Tue, Aug 09, 2022 at 05:37:59PM +0200, Michal Hocko wrote:
> Here we go again.

And just as last time I think this is the wrong thing to do.
IFF we actually need the pool and we can't allocate it we want this
warning.

But order-10 seems very larger for the 31-bit ISA DMA zone, so we might
want to look into calculating the pool size based on the zone size
instead.
Christoph Hellwig Aug. 11, 2022, 7:43 a.m. UTC | #22
On Wed, Aug 03, 2022 at 05:44:16PM +0200, Michal Hocko wrote:
> Unfortunately generic kernels cannot really know there is any
> crippled device without some code to some checking early boot (and I am
> not even sure this would be sufficient).

But we can know if we need the pool, which is only when AMD SEV is
enabled.  So we could add a check and skip allocating all the pools
including the GFP_DMA32 and GFP_KERNEL ones.  I can look into that.
Christoph Hellwig Aug. 11, 2022, 7:49 a.m. UTC | #23
On Thu, Aug 04, 2022 at 07:01:28PM +0800, Baoquan He wrote:
> After attempts, I realize it's time to let one zone DMA or DMA32 cover
> the whole low 4G memory on x86_64. That's the real fix. The tiny 16M DMA
> on 64bit system is root cause.

We can't for two reasons:

 - people still use ISA cards on x86, including the industrial PC104
   version, and we still have drivers that rely on it
 - we still have PCI and PCIe devices with small than 26, 28, 30 and 31
   bit addressing limitations

We could try to get the 24-bit DMA entirely out of the zone allocator
and only fill a genpool at bootmem time.  But that requires fixing up
all the direct users of page and slab allocations on it first (of
which 90+% look bogus, with the s390 drivers being the obvious
exception).

Or we could make 'low' memory a special ZONE_MOVABLE and have an
allocator that can search by physical address an replace ZONE_DMA
and ZONE_DMA32 with that.  Which sounds like a nice idea to me, but
is pretty invasive.
Michal Hocko Aug. 11, 2022, 8:20 a.m. UTC | #24
On Thu 11-08-22 09:28:17, Christoph Hellwig wrote:
> On Wed, Aug 03, 2022 at 11:52:10AM +0200, Michal Hocko wrote:
> > OK, so I have another machine spewing this warning. Still on an older
> > kernel but I do not think the current upstream would be any different in
> > that regards. This time the DMA zone is populated and consumed from
> > large part and the pool size request is just too large for it:
> 
> I can't really parse the last sentence.  What does "consumed from large
> part" mean here?

Meminfo part says
Node 0 DMA free:160kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB

So the zone has 15MB of managed memory (by the page allocator), yet only
160kB is free early boot during the allocation. So it is mostly consumed
by somebody. I haven't really checked by whom.

Does that exaplain the above better?
Christoph Hellwig Aug. 11, 2022, 8:21 a.m. UTC | #25
On Thu, Aug 11, 2022 at 10:20:43AM +0200, Michal Hocko wrote:
> Meminfo part says
> Node 0 DMA free:160kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> 
> So the zone has 15MB of managed memory (by the page allocator), yet only
> 160kB is free early boot during the allocation. So it is mostly consumed
> by somebody. I haven't really checked by whom.
> 
> Does that exaplain the above better?

Yes.  I'm really curious who eats up all the GFP_DMA memory early during
boot, though.
Michal Hocko Aug. 11, 2022, 8:25 a.m. UTC | #26
On Thu 11-08-22 09:38:40, Christoph Hellwig wrote:
> On Tue, Aug 09, 2022 at 05:37:59PM +0200, Michal Hocko wrote:
> > Here we go again.
> 
> And just as last time I think this is the wrong thing to do.
> IFF we actually need the pool and we can't allocate it we want this
> warning.
> 
> But order-10 seems very larger for the 31-bit ISA DMA zone, so we might
> want to look into calculating the pool size based on the zone size
> instead.

No objection for a better size tuning from my side. I find it 
I suspect you will need __GFP_NOWARN to be used for downsizing the
request size for all attempts anyway.
Michal Hocko Aug. 11, 2022, 8:33 a.m. UTC | #27
On Thu 11-08-22 10:21:32, Christoph Hellwig wrote:
> On Thu, Aug 11, 2022 at 10:20:43AM +0200, Michal Hocko wrote:
> > Meminfo part says
> > Node 0 DMA free:160kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > 
> > So the zone has 15MB of managed memory (by the page allocator), yet only
> > 160kB is free early boot during the allocation. So it is mostly consumed
> > by somebody. I haven't really checked by whom.
> > 
> > Does that exaplain the above better?
> 
> Yes.  I'm really curious who eats up all the GFP_DMA memory early during
> boot, though.

Sorry, no idea and I do not have direct access to the machine. I can try
to dig out more but, honestly, I am not sure I will find time for that.
My main motivation was to reduce a shouting warning for something that
doesn't indicate any real problem as this has been second (maybe third)
time somebody has been complaining/asking about it.

I do get your point that the sizing is probably wrong and I agree this
is something that can be tuned better but I would rather vote for a
useful warning when the explicit request fails rather than being to
eager and warn when it is not really clear this is a problem in the
first place. In both cases admin cannot really do much other than
report. For the early boot we can only tell, this is not an immediate
problem, just ignore. For the later we know the device and see whether
we can do something about that.

Just my 2c
Michal Hocko Aug. 11, 2022, 8:42 a.m. UTC | #28
On Thu 11-08-22 09:49:46, Christoph Hellwig wrote:
> On Thu, Aug 04, 2022 at 07:01:28PM +0800, Baoquan He wrote:
> > After attempts, I realize it's time to let one zone DMA or DMA32 cover
> > the whole low 4G memory on x86_64. That's the real fix. The tiny 16M DMA
> > on 64bit system is root cause.
> 
> We can't for two reasons:
> 
>  - people still use ISA cards on x86, including the industrial PC104
>    version, and we still have drivers that rely on it
>  - we still have PCI and PCIe devices with small than 26, 28, 30 and 31
>    bit addressing limitations
> 
> We could try to get the 24-bit DMA entirely out of the zone allocator
> and only fill a genpool at bootmem time.  But that requires fixing up
> all the direct users of page and slab allocations on it first (of
> which 90+% look bogus, with the s390 drivers being the obvious
> exception).

Completely agreed!

> Or we could make 'low' memory a special ZONE_MOVABLE and have an
> allocator that can search by physical address an replace ZONE_DMA
> and ZONE_DMA32 with that.  Which sounds like a nice idea to me, but
> is pretty invasive.

Yes.
diff mbox series

Patch

diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
index 4d40dcce7604..1bf6de398986 100644
--- a/kernel/dma/pool.c
+++ b/kernel/dma/pool.c
@@ -205,7 +205,7 @@  static int __init dma_atomic_pool_init(void)
 		ret = -ENOMEM;
 	if (has_managed_dma()) {
 		atomic_pool_dma = __dma_atomic_pool_init(atomic_pool_size,
-						GFP_KERNEL | GFP_DMA);
+						GFP_KERNEL | GFP_DMA | __GFP_NOWARN);
 		if (!atomic_pool_dma)
 			ret = -ENOMEM;
 	}