diff mbox series

swiotlb: set IO TLB segment size via cmdline

Message ID 20210914151016.3174924-1-Roman_Skakun@epam.com (mailing list archive)
State New, archived
Headers show
Series swiotlb: set IO TLB segment size via cmdline | expand

Commit Message

Roman Skakun Sept. 14, 2021, 3:10 p.m. UTC
From: Roman Skakun <roman_skakun@epam.com>

It is possible when default IO TLB size is not
enough to fit a long buffers as described here [1].

This patch makes a way to set this parameter
using cmdline instead of recompiling a kernel.

[1] https://www.xilinx.com/support/answers/72694.html

Signed-off-by: Roman Skakun <roman_skakun@epam.com>
---
 .../admin-guide/kernel-parameters.txt         |  5 +-
 arch/mips/cavium-octeon/dma-octeon.c          |  2 +-
 arch/powerpc/platforms/pseries/svm.c          |  2 +-
 drivers/xen/swiotlb-xen.c                     |  7 +--
 include/linux/swiotlb.h                       |  1 +
 kernel/dma/swiotlb.c                          | 51 ++++++++++++++-----
 6 files changed, 48 insertions(+), 20 deletions(-)

Comments

Jan Beulich Sept. 14, 2021, 3:29 p.m. UTC | #1
On 14.09.2021 17:10, Roman Skakun wrote:
> From: Roman Skakun <roman_skakun@epam.com>
> 
> It is possible when default IO TLB size is not
> enough to fit a long buffers as described here [1].
> 
> This patch makes a way to set this parameter
> using cmdline instead of recompiling a kernel.
> 
> [1] https://www.xilinx.com/support/answers/72694.html

I'm not convinced the swiotlb use describe there falls under "intended
use" - mapping a 1280x720 framebuffer in a single chunk? (As an aside,
the bottom of this page is also confusing, as following "Then we can
confirm the modified swiotlb size in the boot log:" there is a log
fragment showing the same original size of 64Mb.

> --- a/arch/mips/cavium-octeon/dma-octeon.c
> +++ b/arch/mips/cavium-octeon/dma-octeon.c
> @@ -237,7 +237,7 @@ void __init plat_swiotlb_setup(void)
>  		swiotlbsize = 64 * (1<<20);
>  #endif
>  	swiotlb_nslabs = swiotlbsize >> IO_TLB_SHIFT;
> -	swiotlb_nslabs = ALIGN(swiotlb_nslabs, IO_TLB_SEGSIZE);
> +	swiotlb_nslabs = ALIGN(swiotlb_nslabs, swiotlb_io_seg_size());

In order to be sure to catch all uses like this one (including ones
which make it upstream in parallel to yours), I think you will want
to rename the original IO_TLB_SEGSIZE to e.g. IO_TLB_DEFAULT_SEGSIZE.

> @@ -81,15 +86,30 @@ static unsigned int max_segment;
>  static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
>  
>  static int __init
> -setup_io_tlb_npages(char *str)
> +setup_io_tlb_params(char *str)
>  {
> +	unsigned long tmp;
> +
>  	if (isdigit(*str)) {
> -		/* avoid tail segment of size < IO_TLB_SEGSIZE */
> -		default_nslabs =
> -			ALIGN(simple_strtoul(str, &str, 0), IO_TLB_SEGSIZE);
> +		default_nslabs = simple_strtoul(str, &str, 0);
>  	}
>  	if (*str == ',')
>  		++str;
> +
> +	/* get max IO TLB segment size */
> +	if (isdigit(*str)) {
> +		tmp = simple_strtoul(str, &str, 0);
> +		if (tmp)
> +			io_tlb_seg_size = ALIGN(tmp, IO_TLB_SEGSIZE);

From all I can tell io_tlb_seg_size wants to be a power of 2. Merely
aligning to a multiple of IO_TLB_SEGSIZE isn't going to be enough.

Jan
Christoph Hellwig Sept. 14, 2021, 3:30 p.m. UTC | #2
On Tue, Sep 14, 2021 at 05:29:07PM +0200, Jan Beulich wrote:
> I'm not convinced the swiotlb use describe there falls under "intended
> use" - mapping a 1280x720 framebuffer in a single chunk? (As an aside,
> the bottom of this page is also confusing, as following "Then we can
> confirm the modified swiotlb size in the boot log:" there is a log
> fragment showing the same original size of 64Mb.

It doesn't.  We also do not add hacks to the kernel for whacky out
of tree modules.
Stefano Stabellini Sept. 15, 2021, 1:51 a.m. UTC | #3
On Tue, 14 Sep 2021, Christoph Hellwig wrote:
> On Tue, Sep 14, 2021 at 05:29:07PM +0200, Jan Beulich wrote:
> > I'm not convinced the swiotlb use describe there falls under "intended
> > use" - mapping a 1280x720 framebuffer in a single chunk? (As an aside,
> > the bottom of this page is also confusing, as following "Then we can
> > confirm the modified swiotlb size in the boot log:" there is a log
> > fragment showing the same original size of 64Mb.
> 
> It doesn't.  We also do not add hacks to the kernel for whacky out
> of tree modules.

Also, Option 1 listed in the webpage seems to be a lot better. Any
reason you can't do that? Because that option both solves the problem
and increases performance.
Roman Skakun Sept. 15, 2021, 1:37 p.m. UTC | #4
Hi Jan,

Thanks for the answer.

>> From: Roman Skakun <roman_skakun@epam.com>
>>
>> It is possible when default IO TLB size is not
>> enough to fit a long buffers as described here [1].
>>
>> This patch makes a way to set this parameter
>> using cmdline instead of recompiling a kernel.
>>
>> [1] https://www.xilinx.com/support/answers/72694.html
>
>  I'm not convinced the swiotlb use describe there falls under "intended
>  use" - mapping a 1280x720 framebuffer in a single chunk?

I had the same issue while mapping DMA chuck ~4MB for gem fb when
using xen vdispl.
I got the next log:
[ 142.030421] rcar-fcp fea2f000.fcp: swiotlb buffer is full (sz:
3686400 bytes), total 32768 (slots), used 32 (slots)

It happened when I tried to map bounce buffer, which has a large size.
The default size if 128(IO_TLB_SEGSIZE) * 2048(IO_TLB_SHIFT) = 262144
bytes, but we requested 3686400 bytes.
When I change IO_TLB_SEGSIZE to 2048. (2048(IO_TLB_SEGSIZE)  *
2048(IO_TLB_SHIFT) = 4194304bytes).
It makes possible to retrieve a bounce buffer for requested size.
After changing this value, the problem is gone.

>  the bottom of this page is also confusing, as following "Then we can
>  confirm the modified swiotlb size in the boot log:" there is a log
>  fragment showing the same original size of 64Mb.

I suspect, this is a mistake in the article.
According to https://elixir.bootlin.com/linux/v5.14.4/source/kernel/dma/swiotlb.c#L214
and
https://elixir.bootlin.com/linux/v5.15-rc1/source/kernel/dma/swiotlb.c#L182
The IO_TLB_SEGSIZE is not used to calculate total size of swiotlb area.
This explains why we have the same total size before and after changing of
TLB segment size.

>  In order to be sure to catch all uses like this one (including ones
>  which make it upstream in parallel to yours), I think you will want
>  to rename the original IO_TLB_SEGSIZE to e.g. IO_TLB_DEFAULT_SEGSIZE.

I don't understand your point. Can you clarify this?

>> +     /* get max IO TLB segment size */
>> +     if (isdigit(*str)) {
>> +             tmp = simple_strtoul(str, &str, 0);
>> +             if (tmp)
>> +                     io_tlb_seg_size = ALIGN(tmp, IO_TLB_SEGSIZE);
>
> From all I can tell io_tlb_seg_size wants to be a power of 2. Merely
> aligning to a multiple of IO_TLB_SEGSIZE isn't going to be enough.

Yes, right, thanks!

Cheers,
Roman.

вт, 14 сент. 2021 г. в 18:29, Jan Beulich <jbeulich@suse.com>:
>
> On 14.09.2021 17:10, Roman Skakun wrote:
> > From: Roman Skakun <roman_skakun@epam.com>
> >
> > It is possible when default IO TLB size is not
> > enough to fit a long buffers as described here [1].
> >
> > This patch makes a way to set this parameter
> > using cmdline instead of recompiling a kernel.
> >
> > [1] https://www.xilinx.com/support/answers/72694.html
>
> I'm not convinced the swiotlb use describe there falls under "intended
> use" - mapping a 1280x720 framebuffer in a single chunk? (As an aside,
> the bottom of this page is also confusing, as following "Then we can
> confirm the modified swiotlb size in the boot log:" there is a log
> fragment showing the same original size of 64Mb.
>
> > --- a/arch/mips/cavium-octeon/dma-octeon.c
> > +++ b/arch/mips/cavium-octeon/dma-octeon.c
> > @@ -237,7 +237,7 @@ void __init plat_swiotlb_setup(void)
> >               swiotlbsize = 64 * (1<<20);
> >  #endif
> >       swiotlb_nslabs = swiotlbsize >> IO_TLB_SHIFT;
> > -     swiotlb_nslabs = ALIGN(swiotlb_nslabs, IO_TLB_SEGSIZE);
> > +     swiotlb_nslabs = ALIGN(swiotlb_nslabs, swiotlb_io_seg_size());
>
> In order to be sure to catch all uses like this one (including ones
> which make it upstream in parallel to yours), I think you will want
> to rename the original IO_TLB_SEGSIZE to e.g. IO_TLB_DEFAULT_SEGSIZE.
>
> > @@ -81,15 +86,30 @@ static unsigned int max_segment;
> >  static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
> >
> >  static int __init
> > -setup_io_tlb_npages(char *str)
> > +setup_io_tlb_params(char *str)
> >  {
> > +     unsigned long tmp;
> > +
> >       if (isdigit(*str)) {
> > -             /* avoid tail segment of size < IO_TLB_SEGSIZE */
> > -             default_nslabs =
> > -                     ALIGN(simple_strtoul(str, &str, 0), IO_TLB_SEGSIZE);
> > +             default_nslabs = simple_strtoul(str, &str, 0);
> >       }
> >       if (*str == ',')
> >               ++str;
> > +
> > +     /* get max IO TLB segment size */
> > +     if (isdigit(*str)) {
> > +             tmp = simple_strtoul(str, &str, 0);
> > +             if (tmp)
> > +                     io_tlb_seg_size = ALIGN(tmp, IO_TLB_SEGSIZE);
>
> From all I can tell io_tlb_seg_size wants to be a power of 2. Merely
> aligning to a multiple of IO_TLB_SEGSIZE isn't going to be enough.
>
> Jan
>
Jan Beulich Sept. 15, 2021, 1:49 p.m. UTC | #5
On 15.09.2021 15:37, Roman Skakun wrote:
>>> From: Roman Skakun <roman_skakun@epam.com>
>>>
>>> It is possible when default IO TLB size is not
>>> enough to fit a long buffers as described here [1].
>>>
>>> This patch makes a way to set this parameter
>>> using cmdline instead of recompiling a kernel.
>>>
>>> [1] https://www.xilinx.com/support/answers/72694.html
>>
>>  I'm not convinced the swiotlb use describe there falls under "intended
>>  use" - mapping a 1280x720 framebuffer in a single chunk?
> 
> I had the same issue while mapping DMA chuck ~4MB for gem fb when
> using xen vdispl.
> I got the next log:
> [ 142.030421] rcar-fcp fea2f000.fcp: swiotlb buffer is full (sz:
> 3686400 bytes), total 32768 (slots), used 32 (slots)
> 
> It happened when I tried to map bounce buffer, which has a large size.
> The default size if 128(IO_TLB_SEGSIZE) * 2048(IO_TLB_SHIFT) = 262144
> bytes, but we requested 3686400 bytes.
> When I change IO_TLB_SEGSIZE to 2048. (2048(IO_TLB_SEGSIZE)  *
> 2048(IO_TLB_SHIFT) = 4194304bytes).
> It makes possible to retrieve a bounce buffer for requested size.
> After changing this value, the problem is gone.

But the question remains: Why does the framebuffer need to be mapped
in a single giant chunk?

>>  In order to be sure to catch all uses like this one (including ones
>>  which make it upstream in parallel to yours), I think you will want
>>  to rename the original IO_TLB_SEGSIZE to e.g. IO_TLB_DEFAULT_SEGSIZE.
> 
> I don't understand your point. Can you clarify this?

There's a concrete present example: I have a patch pending adding
another use of IO_TLB_SEGSIZE. This use would need to be replaced
like you do here in several places. The need for the additional
replacement would be quite obvious (from a build failure) if you
renamed the manifest constant. Without renaming, it'll take
someone running into an issue on a live system, which I consider
far worse. This is because a simple re-basing of one of the
patches on top of the other will not point out the need for the
extra replacement, nor would a test build (with both patches in
place).

Jan
Christoph Hellwig Sept. 15, 2021, 1:53 p.m. UTC | #6
On Wed, Sep 15, 2021 at 03:49:52PM +0200, Jan Beulich wrote:
> But the question remains: Why does the framebuffer need to be mapped
> in a single giant chunk?

More importantly: if you use dynamic dma mappings for your framebuffer
you're doing something wrong.
Roman Skakun Sept. 16, 2021, 9:40 a.m. UTC | #7
Hi Stefano,

> Also, Option 1 listed in the webpage seems to be a lot better. Any
> reason you can't do that? Because that option both solves the problem
> and increases performance.

Yes, Option 1 is probably more efficient.
But I use another platform under Xen without DMA adjustment functionality.
I pinned this webpage only for example to describe the similar problem I had.

Cheers,
Roman

ср, 15 сент. 2021 г. в 04:51, Stefano Stabellini <sstabellini@kernel.org>:

>
> On Tue, 14 Sep 2021, Christoph Hellwig wrote:
> > On Tue, Sep 14, 2021 at 05:29:07PM +0200, Jan Beulich wrote:
> > > I'm not convinced the swiotlb use describe there falls under "intended
> > > use" - mapping a 1280x720 framebuffer in a single chunk? (As an aside,
> > > the bottom of this page is also confusing, as following "Then we can
> > > confirm the modified swiotlb size in the boot log:" there is a log
> > > fragment showing the same original size of 64Mb.
> >
> > It doesn't.  We also do not add hacks to the kernel for whacky out
> > of tree modules.
>
> Also, Option 1 listed in the webpage seems to be a lot better. Any
> reason you can't do that? Because that option both solves the problem
> and increases performance.



--
Best Regards, Roman.
Roman Skakun Sept. 17, 2021, 9:36 a.m. UTC | #8
Hi, Christoph

I use Xen PV display. In my case, PV display backend(Dom0) allocates
contiguous buffer via DMA-API to
to implement zero-copy between Dom0 and DomU.

When I start Weston under DomU, I got the next log in Dom0:
```
[ 112.554471] CPU: 0 PID: 367 Comm: weston Tainted: G O
5.10.0-yocto-standard+ #312
[ 112.575149] Call trace:
[ 112.577666] dump_backtrace+0x0/0x1b0
[ 112.581373] show_stack+0x18/0x70
[ 112.584746] dump_stack+0xd0/0x12c
[ 112.588200] swiotlb_tbl_map_single+0x234/0x360
[ 112.592781] xen_swiotlb_map_page+0xe4/0x4c0
[ 112.597095] xen_swiotlb_map_sg+0x84/0x12c
[ 112.601249] dma_map_sg_attrs+0x54/0x60
[ 112.605138] vsp1_du_map_sg+0x30/0x60
[ 112.608851] rcar_du_vsp_map_fb+0x134/0x170
[ 112.613082] rcar_du_vsp_plane_prepare_fb+0x44/0x64
[ 112.618007] drm_atomic_helper_prepare_planes+0xac/0x160
[ 112.623362] drm_atomic_helper_commit+0x88/0x390
[ 112.628029] drm_atomic_nonblocking_commit+0x4c/0x60
[ 112.633043] drm_mode_atomic_ioctl+0x9a8/0xb0c
[ 112.637532] drm_ioctl_kernel+0xc4/0x11c
[ 112.641506] drm_ioctl+0x21c/0x460
[ 112.644967] __arm64_sys_ioctl+0xa8/0xf0
[ 112.648939] el0_svc_common.constprop.0+0x78/0x1a0
[ 112.653775] do_el0_svc+0x24/0x90
[ 112.657148] el0_svc+0x14/0x20
[ 112.660254] el0_sync_handler+0x1a4/0x1b0
[ 112.664315] el0_sync+0x174/0x180
[ 112.668145] rcar-fcp fea2f000.fcp: swiotlb buffer is full (sz:
3686400 bytes), total 65536 (slots), used 112 (slots)
```
The problem is happened here:
https://elixir.bootlin.com/linux/v5.14.4/source/drivers/gpu/drm/rcar-du/rcar_du_vsp.c#L202

Sgt was created in dma_get_sgtable() by dma_common_get_sgtable() and
includes a single page chunk
as shown here:
https://elixir.bootlin.com/linux/v5.14.5/source/kernel/dma/ops_helpers.c#L18

After creating a new sgt, we tried to map this sgt through vsp1_du_map_sg().
Internally, vsp1_du_map_sg() using ops->map_sg (e.g
xen_swiotlb_map_sg) to perform
mapping.

I realized that required segment is too big to be fitted to default
swiotlb segment and condition
https://elixir.bootlin.com/linux/latest/source/kernel/dma/swiotlb.c#L474
is always false.

I know that I use a large buffer, but why can't I map this buffer in one chunk?

Thanks!

ср, 15 сент. 2021 г. в 16:53, Christoph Hellwig <hch@lst.de>:
>
> On Wed, Sep 15, 2021 at 03:49:52PM +0200, Jan Beulich wrote:
> > But the question remains: Why does the framebuffer need to be mapped
> > in a single giant chunk?
>
> More importantly: if you use dynamic dma mappings for your framebuffer
> you're doing something wrong.
Robin Murphy Sept. 17, 2021, 9:44 a.m. UTC | #9
On 2021-09-17 10:36, Roman Skakun wrote:
> Hi, Christoph
> 
> I use Xen PV display. In my case, PV display backend(Dom0) allocates
> contiguous buffer via DMA-API to
> to implement zero-copy between Dom0 and DomU.

Well, something's gone badly wrong there - if you have to shadow the 
entire thing in a bounce buffer to import it then it's hardly zero-copy, 
is it? If you want to do buffer sharing the buffer really needs to be 
allocated appropriately to begin with, such that all relevant devices 
can access it directly. That might be something which needs fixing in Xen.

Robin.

> When I start Weston under DomU, I got the next log in Dom0:
> ```
> [ 112.554471] CPU: 0 PID: 367 Comm: weston Tainted: G O
> 5.10.0-yocto-standard+ #312
> [ 112.575149] Call trace:
> [ 112.577666] dump_backtrace+0x0/0x1b0
> [ 112.581373] show_stack+0x18/0x70
> [ 112.584746] dump_stack+0xd0/0x12c
> [ 112.588200] swiotlb_tbl_map_single+0x234/0x360
> [ 112.592781] xen_swiotlb_map_page+0xe4/0x4c0
> [ 112.597095] xen_swiotlb_map_sg+0x84/0x12c
> [ 112.601249] dma_map_sg_attrs+0x54/0x60
> [ 112.605138] vsp1_du_map_sg+0x30/0x60
> [ 112.608851] rcar_du_vsp_map_fb+0x134/0x170
> [ 112.613082] rcar_du_vsp_plane_prepare_fb+0x44/0x64
> [ 112.618007] drm_atomic_helper_prepare_planes+0xac/0x160
> [ 112.623362] drm_atomic_helper_commit+0x88/0x390
> [ 112.628029] drm_atomic_nonblocking_commit+0x4c/0x60
> [ 112.633043] drm_mode_atomic_ioctl+0x9a8/0xb0c
> [ 112.637532] drm_ioctl_kernel+0xc4/0x11c
> [ 112.641506] drm_ioctl+0x21c/0x460
> [ 112.644967] __arm64_sys_ioctl+0xa8/0xf0
> [ 112.648939] el0_svc_common.constprop.0+0x78/0x1a0
> [ 112.653775] do_el0_svc+0x24/0x90
> [ 112.657148] el0_svc+0x14/0x20
> [ 112.660254] el0_sync_handler+0x1a4/0x1b0
> [ 112.664315] el0_sync+0x174/0x180
> [ 112.668145] rcar-fcp fea2f000.fcp: swiotlb buffer is full (sz:
> 3686400 bytes), total 65536 (slots), used 112 (slots)
> ```
> The problem is happened here:
> https://elixir.bootlin.com/linux/v5.14.4/source/drivers/gpu/drm/rcar-du/rcar_du_vsp.c#L202
> 
> Sgt was created in dma_get_sgtable() by dma_common_get_sgtable() and
> includes a single page chunk
> as shown here:
> https://elixir.bootlin.com/linux/v5.14.5/source/kernel/dma/ops_helpers.c#L18
> 
> After creating a new sgt, we tried to map this sgt through vsp1_du_map_sg().
> Internally, vsp1_du_map_sg() using ops->map_sg (e.g
> xen_swiotlb_map_sg) to perform
> mapping.
> 
> I realized that required segment is too big to be fitted to default
> swiotlb segment and condition
> https://elixir.bootlin.com/linux/latest/source/kernel/dma/swiotlb.c#L474
> is always false.
> 
> I know that I use a large buffer, but why can't I map this buffer in one chunk?
> 
> Thanks!
> 
> ср, 15 сент. 2021 г. в 16:53, Christoph Hellwig <hch@lst.de>:
>>
>> On Wed, Sep 15, 2021 at 03:49:52PM +0200, Jan Beulich wrote:
>>> But the question remains: Why does the framebuffer need to be mapped
>>> in a single giant chunk?
>>
>> More importantly: if you use dynamic dma mappings for your framebuffer
>> you're doing something wrong.
> 
> 
>
Jan Beulich Sept. 17, 2021, 9:52 a.m. UTC | #10
On 17.09.2021 11:36, Roman Skakun wrote:
> I use Xen PV display. In my case, PV display backend(Dom0) allocates
> contiguous buffer via DMA-API to
> to implement zero-copy between Dom0 and DomU.

Why does the buffer need to be allocated by Dom0? If it was allocated
by DomU, it could use grants to give the backend direct (zero-copy)
access.

Jan
Roman Skakun Sept. 17, 2021, 10:53 a.m. UTC | #11
Hi Jan,

>>>  In order to be sure to catch all uses like this one (including ones
>>>  which make it upstream in parallel to yours), I think you will want
>>>  to rename the original IO_TLB_SEGSIZE to e.g. IO_TLB_DEFAULT_SEGSIZE.
>>
>> I don't understand your point. Can you clarify this?
>
> There's a concrete present example: I have a patch pending adding
> another use of IO_TLB_SEGSIZE. This use would need to be replaced
> like you do here in several places. The need for the additional
> replacement would be quite obvious (from a build failure) if you
> renamed the manifest constant. Without renaming, it'll take
> someone running into an issue on a live system, which I consider
> far worse. This is because a simple re-basing of one of the
> patches on top of the other will not point out the need for the
> extra replacement, nor would a test build (with both patches in
> place).

It's reasonable.
I will change the original IO_TLB_SEGSIZE to IO_TLB_DEFAULT_SEGSIZE in the
next patch series.

Thanks.

ср, 15 сент. 2021 г. в 16:50, Jan Beulich <jbeulich@suse.com>:
>
> On 15.09.2021 15:37, Roman Skakun wrote:
> >>> From: Roman Skakun <roman_skakun@epam.com>
> >>>
> >>> It is possible when default IO TLB size is not
> >>> enough to fit a long buffers as described here [1].
> >>>
> >>> This patch makes a way to set this parameter
> >>> using cmdline instead of recompiling a kernel.
> >>>
> >>> [1] https://www.xilinx.com/support/answers/72694.html
> >>
> >>  I'm not convinced the swiotlb use describe there falls under "intended
> >>  use" - mapping a 1280x720 framebuffer in a single chunk?
> >
> > I had the same issue while mapping DMA chuck ~4MB for gem fb when
> > using xen vdispl.
> > I got the next log:
> > [ 142.030421] rcar-fcp fea2f000.fcp: swiotlb buffer is full (sz:
> > 3686400 bytes), total 32768 (slots), used 32 (slots)
> >
> > It happened when I tried to map bounce buffer, which has a large size.
> > The default size if 128(IO_TLB_SEGSIZE) * 2048(IO_TLB_SHIFT) = 262144
> > bytes, but we requested 3686400 bytes.
> > When I change IO_TLB_SEGSIZE to 2048. (2048(IO_TLB_SEGSIZE)  *
> > 2048(IO_TLB_SHIFT) = 4194304bytes).
> > It makes possible to retrieve a bounce buffer for requested size.
> > After changing this value, the problem is gone.
>
> But the question remains: Why does the framebuffer need to be mapped
> in a single giant chunk?
>
> >>  In order to be sure to catch all uses like this one (including ones
> >>  which make it upstream in parallel to yours), I think you will want
> >>  to rename the original IO_TLB_SEGSIZE to e.g. IO_TLB_DEFAULT_SEGSIZE.
> >
> > I don't understand your point. Can you clarify this?
>
> There's a concrete present example: I have a patch pending adding
> another use of IO_TLB_SEGSIZE. This use would need to be replaced
> like you do here in several places. The need for the additional
> replacement would be quite obvious (from a build failure) if you
> renamed the manifest constant. Without renaming, it'll take
> someone running into an issue on a live system, which I consider
> far worse. This is because a simple re-basing of one of the
> patches on top of the other will not point out the need for the
> extra replacement, nor would a test build (with both patches in
> place).
>
> Jan
>
Roman Skakun Sept. 21, 2021, 3:20 p.m. UTC | #12
Hi Robin,

>> I use Xen PV display. In my case, PV display backend(Dom0) allocates
>> contiguous buffer via DMA-API to
>> to implement zero-copy between Dom0 and DomU.
>>
> Well, something's gone badly wrong there - if you have to shadow the
> entire thing in a bounce buffer to import it then it's hardly zero-copy,
> is it? If you want to do buffer sharing the buffer really needs to be
> allocated appropriately to begin with, such that all relevant devices
> can access it directly. That might be something which needs fixing in Xen.
>

Right, in case when we want to use a zero-copy approach need to avoid
using swiotlb
bounce buffer for all devices which is potentially using this buffer.
The root of the problem is that this buffer mapped to foreign pages
and when I tried to
retrieve dma_addr for this buffer I got a foreign MFN that bigger than
32 bit and swiotlb tries to
use bounce buffer.
I understood, that, need to find a way to avoid using swiotlb in this case.
At the moment, it's unclear how to do this properly.
But, this is another story...

I guess, we can have the situation when some device like rcar-du needs
to use a sufficiently large
buffer which is greater than 256 KB (128(CURRENT_IO_TLB_SEGMENT *
2048) and need to
adjust this parameter during boot time, not compilation time.
In order to this point, this patch was created.

Thanks,
Roman

пт, 17 сент. 2021 г. в 12:44, Robin Murphy <robin.murphy@arm.com>:
>
> On 2021-09-17 10:36, Roman Skakun wrote:
> > Hi, Christoph
> >
> > I use Xen PV display. In my case, PV display backend(Dom0) allocates
> > contiguous buffer via DMA-API to
> > to implement zero-copy between Dom0 and DomU.
>
> Well, something's gone badly wrong there - if you have to shadow the
> entire thing in a bounce buffer to import it then it's hardly zero-copy,
> is it? If you want to do buffer sharing the buffer really needs to be
> allocated appropriately to begin with, such that all relevant devices
> can access it directly. That might be something which needs fixing in Xen.
>
> Robin.
>
> > When I start Weston under DomU, I got the next log in Dom0:
> > ```
> > [ 112.554471] CPU: 0 PID: 367 Comm: weston Tainted: G O
> > 5.10.0-yocto-standard+ #312
> > [ 112.575149] Call trace:
> > [ 112.577666] dump_backtrace+0x0/0x1b0
> > [ 112.581373] show_stack+0x18/0x70
> > [ 112.584746] dump_stack+0xd0/0x12c
> > [ 112.588200] swiotlb_tbl_map_single+0x234/0x360
> > [ 112.592781] xen_swiotlb_map_page+0xe4/0x4c0
> > [ 112.597095] xen_swiotlb_map_sg+0x84/0x12c
> > [ 112.601249] dma_map_sg_attrs+0x54/0x60
> > [ 112.605138] vsp1_du_map_sg+0x30/0x60
> > [ 112.608851] rcar_du_vsp_map_fb+0x134/0x170
> > [ 112.613082] rcar_du_vsp_plane_prepare_fb+0x44/0x64
> > [ 112.618007] drm_atomic_helper_prepare_planes+0xac/0x160
> > [ 112.623362] drm_atomic_helper_commit+0x88/0x390
> > [ 112.628029] drm_atomic_nonblocking_commit+0x4c/0x60
> > [ 112.633043] drm_mode_atomic_ioctl+0x9a8/0xb0c
> > [ 112.637532] drm_ioctl_kernel+0xc4/0x11c
> > [ 112.641506] drm_ioctl+0x21c/0x460
> > [ 112.644967] __arm64_sys_ioctl+0xa8/0xf0
> > [ 112.648939] el0_svc_common.constprop.0+0x78/0x1a0
> > [ 112.653775] do_el0_svc+0x24/0x90
> > [ 112.657148] el0_svc+0x14/0x20
> > [ 112.660254] el0_sync_handler+0x1a4/0x1b0
> > [ 112.664315] el0_sync+0x174/0x180
> > [ 112.668145] rcar-fcp fea2f000.fcp: swiotlb buffer is full (sz:
> > 3686400 bytes), total 65536 (slots), used 112 (slots)
> > ```
> > The problem is happened here:
> > https://elixir.bootlin.com/linux/v5.14.4/source/drivers/gpu/drm/rcar-du/rcar_du_vsp.c#L202
> >
> > Sgt was created in dma_get_sgtable() by dma_common_get_sgtable() and
> > includes a single page chunk
> > as shown here:
> > https://elixir.bootlin.com/linux/v5.14.5/source/kernel/dma/ops_helpers.c#L18
> >
> > After creating a new sgt, we tried to map this sgt through vsp1_du_map_sg().
> > Internally, vsp1_du_map_sg() using ops->map_sg (e.g
> > xen_swiotlb_map_sg) to perform
> > mapping.
> >
> > I realized that required segment is too big to be fitted to default
> > swiotlb segment and condition
> > https://elixir.bootlin.com/linux/latest/source/kernel/dma/swiotlb.c#L474
> > is always false.
> >
> > I know that I use a large buffer, but why can't I map this buffer in one chunk?
> >
> > Thanks!
> >
> > ср, 15 сент. 2021 г. в 16:53, Christoph Hellwig <hch@lst.de>:
> >>
> >> On Wed, Sep 15, 2021 at 03:49:52PM +0200, Jan Beulich wrote:
> >>> But the question remains: Why does the framebuffer need to be mapped
> >>> in a single giant chunk?
> >>
> >> More importantly: if you use dynamic dma mappings for your framebuffer
> >> you're doing something wrong.
> >
> >
> >
diff mbox series

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 91ba391f9b32..f842a523a485 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5558,8 +5558,9 @@ 
 			it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst)
 
 	swiotlb=	[ARM,IA-64,PPC,MIPS,X86]
-			Format: { <int> | force | noforce }
-			<int> -- Number of I/O TLB slabs
+			Format: { <slabs> [,<io_tlb_segment_size>] [,force | noforce]​ }
+			<slabs> -- Number of I/O TLB slabs
+			<io_tlb_segment_size> -- Max IO TLB segment size
 			force -- force using of bounce buffers even if they
 			         wouldn't be automatically used by the kernel
 			noforce -- Never use bounce buffers (for debugging)
diff --git a/arch/mips/cavium-octeon/dma-octeon.c b/arch/mips/cavium-octeon/dma-octeon.c
index df70308db0e6..446c73bc936e 100644
--- a/arch/mips/cavium-octeon/dma-octeon.c
+++ b/arch/mips/cavium-octeon/dma-octeon.c
@@ -237,7 +237,7 @@  void __init plat_swiotlb_setup(void)
 		swiotlbsize = 64 * (1<<20);
 #endif
 	swiotlb_nslabs = swiotlbsize >> IO_TLB_SHIFT;
-	swiotlb_nslabs = ALIGN(swiotlb_nslabs, IO_TLB_SEGSIZE);
+	swiotlb_nslabs = ALIGN(swiotlb_nslabs, swiotlb_io_seg_size());
 	swiotlbsize = swiotlb_nslabs << IO_TLB_SHIFT;
 
 	octeon_swiotlb = memblock_alloc_low(swiotlbsize, PAGE_SIZE);
diff --git a/arch/powerpc/platforms/pseries/svm.c b/arch/powerpc/platforms/pseries/svm.c
index 87f001b4c4e4..2a1f09c722ac 100644
--- a/arch/powerpc/platforms/pseries/svm.c
+++ b/arch/powerpc/platforms/pseries/svm.c
@@ -47,7 +47,7 @@  void __init svm_swiotlb_init(void)
 	unsigned long bytes, io_tlb_nslabs;
 
 	io_tlb_nslabs = (swiotlb_size_or_default() >> IO_TLB_SHIFT);
-	io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
+	io_tlb_nslabs = ALIGN(io_tlb_nslabs, swiotlb_io_seg_size());
 
 	bytes = io_tlb_nslabs << IO_TLB_SHIFT;
 
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 643fe440c46e..0fc9c6cb6815 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -110,12 +110,13 @@  static int xen_swiotlb_fixup(void *buf, unsigned long nslabs)
 	int dma_bits;
 	dma_addr_t dma_handle;
 	phys_addr_t p = virt_to_phys(buf);
+	unsigned long tlb_segment_size = swiotlb_io_seg_size();
 
-	dma_bits = get_order(IO_TLB_SEGSIZE << IO_TLB_SHIFT) + PAGE_SHIFT;
+	dma_bits = get_order(tlb_segment_size << IO_TLB_SHIFT) + PAGE_SHIFT;
 
 	i = 0;
 	do {
-		int slabs = min(nslabs - i, (unsigned long)IO_TLB_SEGSIZE);
+		int slabs = min(nslabs - i, (unsigned long)tlb_segment_size);
 
 		do {
 			rc = xen_create_contiguous_region(
@@ -153,7 +154,7 @@  static const char *xen_swiotlb_error(enum xen_swiotlb_err err)
 	return "";
 }
 
-#define DEFAULT_NSLABS		ALIGN(SZ_64M >> IO_TLB_SHIFT, IO_TLB_SEGSIZE)
+#define DEFAULT_NSLABS		ALIGN(SZ_64M >> IO_TLB_SHIFT, swiotlb_io_seg_size())
 
 int __ref xen_swiotlb_init(void)
 {
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index b0cb2a9973f4..35c3ffeda9fa 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -59,6 +59,7 @@  void swiotlb_sync_single_for_cpu(struct device *dev, phys_addr_t tlb_addr,
 		size_t size, enum dma_data_direction dir);
 dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
 		size_t size, enum dma_data_direction dir, unsigned long attrs);
+unsigned long swiotlb_io_seg_size(void);
 
 #ifdef CONFIG_SWIOTLB
 extern enum swiotlb_force swiotlb_force;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 87c40517e822..6b505206fc13 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -72,6 +72,11 @@  enum swiotlb_force swiotlb_force;
 
 struct io_tlb_mem io_tlb_default_mem;
 
+/*
+ * Maximum IO TLB segment size.
+ */
+static unsigned long io_tlb_seg_size = IO_TLB_SEGSIZE;
+
 /*
  * Max segment that we can provide which (if pages are contingous) will
  * not be bounced (unless SWIOTLB_FORCE is set).
@@ -81,15 +86,30 @@  static unsigned int max_segment;
 static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
 
 static int __init
-setup_io_tlb_npages(char *str)
+setup_io_tlb_params(char *str)
 {
+	unsigned long tmp;
+
 	if (isdigit(*str)) {
-		/* avoid tail segment of size < IO_TLB_SEGSIZE */
-		default_nslabs =
-			ALIGN(simple_strtoul(str, &str, 0), IO_TLB_SEGSIZE);
+		default_nslabs = simple_strtoul(str, &str, 0);
 	}
 	if (*str == ',')
 		++str;
+
+	/* get max IO TLB segment size */
+	if (isdigit(*str)) {
+		tmp = simple_strtoul(str, &str, 0);
+		if (tmp)
+			io_tlb_seg_size = ALIGN(tmp, IO_TLB_SEGSIZE);
+	}
+	if (*str == ',')
+		++str;
+
+	/* update io_tlb_nslabs after applying a new segment size and
+	 * avoid tail segment of size < IO TLB segment size
+	 */
+	default_nslabs = ALIGN(default_nslabs, io_tlb_seg_size);
+
 	if (!strcmp(str, "force"))
 		swiotlb_force = SWIOTLB_FORCE;
 	else if (!strcmp(str, "noforce"))
@@ -97,7 +117,7 @@  setup_io_tlb_npages(char *str)
 
 	return 0;
 }
-early_param("swiotlb", setup_io_tlb_npages);
+early_param("swiotlb", setup_io_tlb_params);
 
 unsigned int swiotlb_max_segment(void)
 {
@@ -118,6 +138,11 @@  unsigned long swiotlb_size_or_default(void)
 	return default_nslabs << IO_TLB_SHIFT;
 }
 
+unsigned long swiotlb_io_seg_size(void)
+{
+	return io_tlb_seg_size;
+}
+
 void __init swiotlb_adjust_size(unsigned long size)
 {
 	/*
@@ -128,7 +153,7 @@  void __init swiotlb_adjust_size(unsigned long size)
 	if (default_nslabs != IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT)
 		return;
 	size = ALIGN(size, IO_TLB_SIZE);
-	default_nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
+	default_nslabs = ALIGN(size >> IO_TLB_SHIFT, io_tlb_seg_size);
 	pr_info("SWIOTLB bounce buffer size adjusted to %luMB", size >> 20);
 }
 
@@ -147,7 +172,7 @@  void swiotlb_print_info(void)
 
 static inline unsigned long io_tlb_offset(unsigned long val)
 {
-	return val & (IO_TLB_SEGSIZE - 1);
+	return val & (io_tlb_seg_size - 1);
 }
 
 static inline unsigned long nr_slots(u64 val)
@@ -192,7 +217,7 @@  static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 
 	spin_lock_init(&mem->lock);
 	for (i = 0; i < mem->nslabs; i++) {
-		mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
+		mem->slots[i].list = io_tlb_seg_size - io_tlb_offset(i);
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
 		mem->slots[i].alloc_size = 0;
 	}
@@ -261,7 +286,7 @@  int
 swiotlb_late_init_with_default_size(size_t default_size)
 {
 	unsigned long nslabs =
-		ALIGN(default_size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
+		ALIGN(default_size >> IO_TLB_SHIFT, io_tlb_seg_size);
 	unsigned long bytes;
 	unsigned char *vstart = NULL;
 	unsigned int order;
@@ -522,7 +547,7 @@  static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 			alloc_size - (offset + ((i - index) << IO_TLB_SHIFT));
 	}
 	for (i = index - 1;
-	     io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 &&
+	     io_tlb_offset(i) != io_tlb_seg_size - 1 &&
 	     mem->slots[i].list; i--)
 		mem->slots[i].list = ++count;
 
@@ -600,7 +625,7 @@  static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 	 * with slots below and above the pool being returned.
 	 */
 	spin_lock_irqsave(&mem->lock, flags);
-	if (index + nslots < ALIGN(index + 1, IO_TLB_SEGSIZE))
+	if (index + nslots < ALIGN(index + 1, io_tlb_seg_size))
 		count = mem->slots[index + nslots].list;
 	else
 		count = 0;
@@ -620,7 +645,7 @@  static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 	 * available (non zero)
 	 */
 	for (i = index - 1;
-	     io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 && mem->slots[i].list;
+	     io_tlb_offset(i) != io_tlb_seg_size - 1 && mem->slots[i].list;
 	     i--)
 		mem->slots[i].list = ++count;
 	mem->used -= nslots;
@@ -698,7 +723,7 @@  dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
 
 size_t swiotlb_max_mapping_size(struct device *dev)
 {
-	return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE;
+	return ((size_t)IO_TLB_SIZE) * io_tlb_seg_size;
 }
 
 bool is_swiotlb_active(struct device *dev)