diff mbox series

mmc: dw_mmc: Fix IDMAC operation with pages bigger than 4K

Message ID 20240306232052.21317-1-semen.protsenko@linaro.org (mailing list archive)
State New, archived
Headers show
Series mmc: dw_mmc: Fix IDMAC operation with pages bigger than 4K | expand

Commit Message

Sam Protsenko March 6, 2024, 11:20 p.m. UTC
Commit 616f87661792 ("mmc: pass queue_limits to blk_mq_alloc_disk") [1]
revealed the long living issue in dw_mmc.c driver, existing since the
time when it was first introduced in commit f95f3850f7a9 ("mmc: dw_mmc:
Add Synopsys DesignWare mmc host driver."), also making kernel boot
broken on platforms using dw_mmc driver with 16K or 64K pages enabled,
with this message in dmesg:

    mmcblk: probe of mmc0:0001 failed with error -22

That's happening because mmc_blk_probe() fails when it calls
blk_validate_limits() consequently, which returns the error due to
failed max_segment_size check in this code:

    /*
     * The maximum segment size has an odd historic 64k default that
     * drivers probably should override.  Just like the I/O size we
     * require drivers to at least handle a full page per segment.
     */
    ...
    if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
        return -EINVAL;

In case when IDMAC (Internal DMA Controller) is used, dw_mmc.c always
sets .max_seg_size to 4 KiB:

    mmc->max_seg_size = 0x1000;

The comment in the code above explains why it's incorrect. Arnd
suggested setting .max_seg_size to .max_req_size to fix it, which is
also what some other drivers are doing:

   $ grep -rl 'max_seg_size.*=.*max_req_size' drivers/mmc/host/ | \
     wc -l
   18

This change is not only fixing the boot with 16K/64K pages, but also
leads to a better MMC performance. The linear write performance was
tested on E850-96 board (eMMC only), before commit [1] (where it's
possible to boot with 16K/64K pages without this fix, to be able to do
a comparison). It was tested with this command:

    # dd if=/dev/zero of=somefile bs=1M count=500 oflag=sync

Test results are as follows:

  - 4K pages,  .max_seg_size = 4 KiB:                   94.2 MB/s
  - 4K pages,  .max_seg_size = .max_req_size = 512 KiB: 96.9 MB/s
  - 16K pages, .max_seg_size = 4 KiB:                   126 MB/s
  - 16K pages, .max_seg_size = .max_req_size = 2 MiB:   128 MB/s
  - 64K pages, .max_seg_size = 4 KiB:                   138 MB/s
  - 64K pages, .max_seg_size = .max_req_size = 8 MiB:   138 MB/s

Unfortunately, SD card controller is not enabled in E850-96 yet, so it
wasn't possible for me to run the test on some cheap SD cards to check
this patch's impact on those. But it's possible that this change might
also reduce the writes count, thus improving SD/eMMC longevity.

All credit for the analysis and the suggested solution goes to Arnd.

[1] https://lore.kernel.org/all/20240215070300.2200308-18-hch@lst.de/

Fixes: f95f3850f7a9 ("mmc: dw_mmc: Add Synopsys DesignWare mmc host driver.")
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Closes: https://lore.kernel.org/all/CA+G9fYtddf2Fd3be+YShHP6CmSDNcn0ptW8qg+stUKW+Cn0rjQ@mail.gmail.com/
Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
---
 drivers/mmc/host/dw_mmc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Arnd Bergmann March 7, 2024, 7:52 a.m. UTC | #1
On Thu, Mar 7, 2024, at 00:20, Sam Protsenko wrote:
> Commit 616f87661792 ("mmc: pass queue_limits to blk_mq_alloc_disk") [1]
> revealed the long living issue in dw_mmc.c driver, existing since the
> time when it was first introduced in commit f95f3850f7a9 ("mmc: dw_mmc:
> Add Synopsys DesignWare mmc host driver."), also making kernel boot
> broken on platforms using dw_mmc driver with 16K or 64K pages enabled,
> with this message in dmesg:
>
>     mmcblk: probe of mmc0:0001 failed with error -22
>
> That's happening because mmc_blk_probe() fails when it calls
> blk_validate_limits() consequently, which returns the error due to
> failed max_segment_size check in this code:
>
>     /*
>      * The maximum segment size has an odd historic 64k default that
>      * drivers probably should override.  Just like the I/O size we
>      * require drivers to at least handle a full page per segment.
>      */
>     ...
>     if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
>         return -EINVAL;
>
> In case when IDMAC (Internal DMA Controller) is used, dw_mmc.c always
> sets .max_seg_size to 4 KiB:
>
>     mmc->max_seg_size = 0x1000;
>
> The comment in the code above explains why it's incorrect. Arnd
> suggested setting .max_seg_size to .max_req_size to fix it, which is
> also what some other drivers are doing:
>
>    $ grep -rl 'max_seg_size.*=.*max_req_size' drivers/mmc/host/ | \
>      wc -l
>    18

Nice summary!

> This change is not only fixing the boot with 16K/64K pages, but also
> leads to a better MMC performance. The linear write performance was
> tested on E850-96 board (eMMC only), before commit [1] (where it's
> possible to boot with 16K/64K pages without this fix, to be able to do
> a comparison). It was tested with this command:
>
>     # dd if=/dev/zero of=somefile bs=1M count=500 oflag=sync
>
> Test results are as follows:
>
>   - 4K pages,  .max_seg_size = 4 KiB:                   94.2 MB/s
>   - 4K pages,  .max_seg_size = .max_req_size = 512 KiB: 96.9 MB/s
>   - 16K pages, .max_seg_size = 4 KiB:                   126 MB/s
>   - 16K pages, .max_seg_size = .max_req_size = 2 MiB:   128 MB/s
>   - 64K pages, .max_seg_size = 4 KiB:                   138 MB/s
>   - 64K pages, .max_seg_size = .max_req_size = 8 MiB:   138 MB/s

Thanks for sharing these results. From what I can see here, the
performance changes significantly with the page size, but barely
with the max_seg_size, so this does not have the effect I was
hoping for. On a more positive note this likely means that we
don't have to urgently backport your fix.

This could mean that either there is not much coalescing across
pages after all, or that the bottleneck is somewhere else.

> diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> index 8e2d676b9239..cccd5633ff40 100644
> --- a/drivers/mmc/host/dw_mmc.c
> +++ b/drivers/mmc/host/dw_mmc.c
> @@ -2951,8 +2951,8 @@ static int dw_mci_init_slot(struct dw_mci *host)
>  	if (host->use_dma == TRANS_MODE_IDMAC) {
>  		mmc->max_segs = host->ring_size;
>  		mmc->max_blk_size = 65535;
> -		mmc->max_seg_size = 0x1000;
> -		mmc->max_req_size = mmc->max_seg_size * host->ring_size;
> +		mmc->max_req_size = DW_MCI_DESC_DATA_LENGTH * host->ring_size;
> +		mmc->max_seg_size = mmc->max_req_size;

The change looks good to me.

I see that the host->ring_size depends on PAGE_SIZE as well:

#define DESC_RING_BUF_SZ        PAGE_SIZE
host->ring_size = DESC_RING_BUF_SZ / sizeof(struct idmac_desc_64addr);
host->sg_cpu = dmam_alloc_coherent(host->dev,
               DESC_RING_BUF_SZ, &host->sg_dma, GFP_KERNEL);

I don't see any reason for the ring buffer size to be tied to
PAGE_SIZE at all, it was probably picked as a reasonable
default in the initial driver but isn't necessarily ideal.

From what I can see, the number of 4KB elements in the
ring can be as small as 128 (4KB pages, 64-bit addresses)
or as big as 4096 (64KB pages, 32-bit addresses), which is
quite a difference. If you are still motivated to drill
down into this, could you try changing DESC_RING_BUF_SZ
to a fixed size of either 4KB or 64KB and test again
with the opposite page size, to see if that changes the
throughput?

If a larger ring buffer gives us significantly better
throughput, we may want to always use a higher number
independent of page size. On the other hand, if the
64KB number (the 138MB/s) does not change with a smaller
ring, we may as well reduce that in order to limit the
maximum latency that is caused by a single I/O operation.

     Arnd
Sam Protsenko April 2, 2024, 10:43 p.m. UTC | #2
On Thu, Mar 7, 2024 at 1:52 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Thu, Mar 7, 2024, at 00:20, Sam Protsenko wrote:
> > Commit 616f87661792 ("mmc: pass queue_limits to blk_mq_alloc_disk") [1]
> > revealed the long living issue in dw_mmc.c driver, existing since the
> > time when it was first introduced in commit f95f3850f7a9 ("mmc: dw_mmc:
> > Add Synopsys DesignWare mmc host driver."), also making kernel boot
> > broken on platforms using dw_mmc driver with 16K or 64K pages enabled,
> > with this message in dmesg:
> >
> >     mmcblk: probe of mmc0:0001 failed with error -22
> >
> > That's happening because mmc_blk_probe() fails when it calls
> > blk_validate_limits() consequently, which returns the error due to
> > failed max_segment_size check in this code:
> >
> >     /*
> >      * The maximum segment size has an odd historic 64k default that
> >      * drivers probably should override.  Just like the I/O size we
> >      * require drivers to at least handle a full page per segment.
> >      */
> >     ...
> >     if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
> >         return -EINVAL;
> >
> > In case when IDMAC (Internal DMA Controller) is used, dw_mmc.c always
> > sets .max_seg_size to 4 KiB:
> >
> >     mmc->max_seg_size = 0x1000;
> >
> > The comment in the code above explains why it's incorrect. Arnd
> > suggested setting .max_seg_size to .max_req_size to fix it, which is
> > also what some other drivers are doing:
> >
> >    $ grep -rl 'max_seg_size.*=.*max_req_size' drivers/mmc/host/ | \
> >      wc -l
> >    18
>
> Nice summary!
>
> > This change is not only fixing the boot with 16K/64K pages, but also
> > leads to a better MMC performance. The linear write performance was
> > tested on E850-96 board (eMMC only), before commit [1] (where it's
> > possible to boot with 16K/64K pages without this fix, to be able to do
> > a comparison). It was tested with this command:
> >
> >     # dd if=/dev/zero of=somefile bs=1M count=500 oflag=sync
> >
> > Test results are as follows:
> >
> >   - 4K pages,  .max_seg_size = 4 KiB:                   94.2 MB/s
> >   - 4K pages,  .max_seg_size = .max_req_size = 512 KiB: 96.9 MB/s
> >   - 16K pages, .max_seg_size = 4 KiB:                   126 MB/s
> >   - 16K pages, .max_seg_size = .max_req_size = 2 MiB:   128 MB/s
> >   - 64K pages, .max_seg_size = 4 KiB:                   138 MB/s
> >   - 64K pages, .max_seg_size = .max_req_size = 8 MiB:   138 MB/s
>
> Thanks for sharing these results. From what I can see here, the
> performance changes significantly with the page size, but barely
> with the max_seg_size, so this does not have the effect I was
> hoping for. On a more positive note this likely means that we
> don't have to urgently backport your fix.
>
> This could mean that either there is not much coalescing across
> pages after all, or that the bottleneck is somewhere else.
>
> > diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> > index 8e2d676b9239..cccd5633ff40 100644
> > --- a/drivers/mmc/host/dw_mmc.c
> > +++ b/drivers/mmc/host/dw_mmc.c
> > @@ -2951,8 +2951,8 @@ static int dw_mci_init_slot(struct dw_mci *host)
> >       if (host->use_dma == TRANS_MODE_IDMAC) {
> >               mmc->max_segs = host->ring_size;
> >               mmc->max_blk_size = 65535;
> > -             mmc->max_seg_size = 0x1000;
> > -             mmc->max_req_size = mmc->max_seg_size * host->ring_size;
> > +             mmc->max_req_size = DW_MCI_DESC_DATA_LENGTH * host->ring_size;
> > +             mmc->max_seg_size = mmc->max_req_size;
>
> The change looks good to me.
>
> I see that the host->ring_size depends on PAGE_SIZE as well:
>
> #define DESC_RING_BUF_SZ        PAGE_SIZE
> host->ring_size = DESC_RING_BUF_SZ / sizeof(struct idmac_desc_64addr);
> host->sg_cpu = dmam_alloc_coherent(host->dev,
>                DESC_RING_BUF_SZ, &host->sg_dma, GFP_KERNEL);
>
> I don't see any reason for the ring buffer size to be tied to
> PAGE_SIZE at all, it was probably picked as a reasonable
> default in the initial driver but isn't necessarily ideal.
>
> From what I can see, the number of 4KB elements in the
> ring can be as small as 128 (4KB pages, 64-bit addresses)
> or as big as 4096 (64KB pages, 32-bit addresses), which is
> quite a difference. If you are still motivated to drill
> down into this, could you try changing DESC_RING_BUF_SZ
> to a fixed size of either 4KB or 64KB and test again
> with the opposite page size, to see if that changes the
> throughput?
>

Hi Arnd,

Sorry for the late reply. I'm a bit of busy with something else right
now (trying to enable this same driver for Exynos850 in U-Boot, hehe),
I'll try to carve out some time later and tinker with
DESC_RING_BUF_SZ. But for now, can we just apply this patch as is? As
I understand, it's fixing quite a major issue (at least from what I
heard), so it would be nice to have it in -next and -stable. Does that
sound reasonable?

Thanks!

> If a larger ring buffer gives us significantly better
> throughput, we may want to always use a higher number
> independent of page size. On the other hand, if the
> 64KB number (the 138MB/s) does not change with a smaller
> ring, we may as well reduce that in order to limit the
> maximum latency that is caused by a single I/O operation.
>
>      Arnd
Ulf Hansson April 4, 2024, 9:28 a.m. UTC | #3
On Wed, 3 Apr 2024 at 00:43, Sam Protsenko <semen.protsenko@linaro.org> wrote:
>
> On Thu, Mar 7, 2024 at 1:52 AM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Thu, Mar 7, 2024, at 00:20, Sam Protsenko wrote:
> > > Commit 616f87661792 ("mmc: pass queue_limits to blk_mq_alloc_disk") [1]
> > > revealed the long living issue in dw_mmc.c driver, existing since the
> > > time when it was first introduced in commit f95f3850f7a9 ("mmc: dw_mmc:
> > > Add Synopsys DesignWare mmc host driver."), also making kernel boot
> > > broken on platforms using dw_mmc driver with 16K or 64K pages enabled,
> > > with this message in dmesg:
> > >
> > >     mmcblk: probe of mmc0:0001 failed with error -22
> > >
> > > That's happening because mmc_blk_probe() fails when it calls
> > > blk_validate_limits() consequently, which returns the error due to
> > > failed max_segment_size check in this code:
> > >
> > >     /*
> > >      * The maximum segment size has an odd historic 64k default that
> > >      * drivers probably should override.  Just like the I/O size we
> > >      * require drivers to at least handle a full page per segment.
> > >      */
> > >     ...
> > >     if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
> > >         return -EINVAL;
> > >
> > > In case when IDMAC (Internal DMA Controller) is used, dw_mmc.c always
> > > sets .max_seg_size to 4 KiB:
> > >
> > >     mmc->max_seg_size = 0x1000;
> > >
> > > The comment in the code above explains why it's incorrect. Arnd
> > > suggested setting .max_seg_size to .max_req_size to fix it, which is
> > > also what some other drivers are doing:
> > >
> > >    $ grep -rl 'max_seg_size.*=.*max_req_size' drivers/mmc/host/ | \
> > >      wc -l
> > >    18
> >
> > Nice summary!
> >
> > > This change is not only fixing the boot with 16K/64K pages, but also
> > > leads to a better MMC performance. The linear write performance was
> > > tested on E850-96 board (eMMC only), before commit [1] (where it's
> > > possible to boot with 16K/64K pages without this fix, to be able to do
> > > a comparison). It was tested with this command:
> > >
> > >     # dd if=/dev/zero of=somefile bs=1M count=500 oflag=sync
> > >
> > > Test results are as follows:
> > >
> > >   - 4K pages,  .max_seg_size = 4 KiB:                   94.2 MB/s
> > >   - 4K pages,  .max_seg_size = .max_req_size = 512 KiB: 96.9 MB/s
> > >   - 16K pages, .max_seg_size = 4 KiB:                   126 MB/s
> > >   - 16K pages, .max_seg_size = .max_req_size = 2 MiB:   128 MB/s
> > >   - 64K pages, .max_seg_size = 4 KiB:                   138 MB/s
> > >   - 64K pages, .max_seg_size = .max_req_size = 8 MiB:   138 MB/s
> >
> > Thanks for sharing these results. From what I can see here, the
> > performance changes significantly with the page size, but barely
> > with the max_seg_size, so this does not have the effect I was
> > hoping for. On a more positive note this likely means that we
> > don't have to urgently backport your fix.
> >
> > This could mean that either there is not much coalescing across
> > pages after all, or that the bottleneck is somewhere else.
> >
> > > diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> > > index 8e2d676b9239..cccd5633ff40 100644
> > > --- a/drivers/mmc/host/dw_mmc.c
> > > +++ b/drivers/mmc/host/dw_mmc.c
> > > @@ -2951,8 +2951,8 @@ static int dw_mci_init_slot(struct dw_mci *host)
> > >       if (host->use_dma == TRANS_MODE_IDMAC) {
> > >               mmc->max_segs = host->ring_size;
> > >               mmc->max_blk_size = 65535;
> > > -             mmc->max_seg_size = 0x1000;
> > > -             mmc->max_req_size = mmc->max_seg_size * host->ring_size;
> > > +             mmc->max_req_size = DW_MCI_DESC_DATA_LENGTH * host->ring_size;
> > > +             mmc->max_seg_size = mmc->max_req_size;
> >
> > The change looks good to me.
> >
> > I see that the host->ring_size depends on PAGE_SIZE as well:
> >
> > #define DESC_RING_BUF_SZ        PAGE_SIZE
> > host->ring_size = DESC_RING_BUF_SZ / sizeof(struct idmac_desc_64addr);
> > host->sg_cpu = dmam_alloc_coherent(host->dev,
> >                DESC_RING_BUF_SZ, &host->sg_dma, GFP_KERNEL);
> >
> > I don't see any reason for the ring buffer size to be tied to
> > PAGE_SIZE at all, it was probably picked as a reasonable
> > default in the initial driver but isn't necessarily ideal.
> >
> > From what I can see, the number of 4KB elements in the
> > ring can be as small as 128 (4KB pages, 64-bit addresses)
> > or as big as 4096 (64KB pages, 32-bit addresses), which is
> > quite a difference. If you are still motivated to drill
> > down into this, could you try changing DESC_RING_BUF_SZ
> > to a fixed size of either 4KB or 64KB and test again
> > with the opposite page size, to see if that changes the
> > throughput?
> >
>
> Hi Arnd,
>
> Sorry for the late reply. I'm a bit of busy with something else right
> now (trying to enable this same driver for Exynos850 in U-Boot, hehe),
> I'll try to carve out some time later and tinker with
> DESC_RING_BUF_SZ. But for now, can we just apply this patch as is? As
> I understand, it's fixing quite a major issue (at least from what I
> heard), so it would be nice to have it in -next and -stable. Does that
> sound reasonable?

Ideally, I would prefer it if you could try out Arnd's proposal,
unless you think that's too far ahead for you, of course? The point
is, that I would rather avoid us from messing around with these kinds
of configurations.

>
> Thanks!
>
> > If a larger ring buffer gives us significantly better
> > throughput, we may want to always use a higher number
> > independent of page size. On the other hand, if the
> > 64KB number (the 138MB/s) does not change with a smaller
> > ring, we may as well reduce that in order to limit the
> > maximum latency that is caused by a single I/O operation.
> >
> >      Arnd

Kind regards
Uffe
Sam Protsenko Aug. 27, 2024, 1:28 a.m. UTC | #4
On Thu, Mar 7, 2024 at 1:52 AM Arnd Bergmann <arnd@arndb.de> wrote:

[snip]

>
> > This change is not only fixing the boot with 16K/64K pages, but also
> > leads to a better MMC performance. The linear write performance was
> > tested on E850-96 board (eMMC only), before commit [1] (where it's
> > possible to boot with 16K/64K pages without this fix, to be able to do
> > a comparison). It was tested with this command:
> >
> >     # dd if=/dev/zero of=somefile bs=1M count=500 oflag=sync
> >
> > Test results are as follows:
> >
> >   - 4K pages,  .max_seg_size = 4 KiB:                   94.2 MB/s
> >   - 4K pages,  .max_seg_size = .max_req_size = 512 KiB: 96.9 MB/s
> >   - 16K pages, .max_seg_size = 4 KiB:                   126 MB/s
> >   - 16K pages, .max_seg_size = .max_req_size = 2 MiB:   128 MB/s
> >   - 64K pages, .max_seg_size = 4 KiB:                   138 MB/s
> >   - 64K pages, .max_seg_size = .max_req_size = 8 MiB:   138 MB/s
>
> Thanks for sharing these results. From what I can see here, the
> performance changes significantly with the page size, but barely
> with the max_seg_size, so this does not have the effect I was
> hoping for. On a more positive note this likely means that we
> don't have to urgently backport your fix.
>
> This could mean that either there is not much coalescing across
> pages after all, or that the bottleneck is somewhere else.
>
> > diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> > index 8e2d676b9239..cccd5633ff40 100644
> > --- a/drivers/mmc/host/dw_mmc.c
> > +++ b/drivers/mmc/host/dw_mmc.c
> > @@ -2951,8 +2951,8 @@ static int dw_mci_init_slot(struct dw_mci *host)
> >       if (host->use_dma == TRANS_MODE_IDMAC) {
> >               mmc->max_segs = host->ring_size;
> >               mmc->max_blk_size = 65535;
> > -             mmc->max_seg_size = 0x1000;
> > -             mmc->max_req_size = mmc->max_seg_size * host->ring_size;
> > +             mmc->max_req_size = DW_MCI_DESC_DATA_LENGTH * host->ring_size;
> > +             mmc->max_seg_size = mmc->max_req_size;
>
> The change looks good to me.
>
> I see that the host->ring_size depends on PAGE_SIZE as well:
>
> #define DESC_RING_BUF_SZ        PAGE_SIZE
> host->ring_size = DESC_RING_BUF_SZ / sizeof(struct idmac_desc_64addr);
> host->sg_cpu = dmam_alloc_coherent(host->dev,
>                DESC_RING_BUF_SZ, &host->sg_dma, GFP_KERNEL);
>
> I don't see any reason for the ring buffer size to be tied to
> PAGE_SIZE at all, it was probably picked as a reasonable
> default in the initial driver but isn't necessarily ideal.
>
> From what I can see, the number of 4KB elements in the
> ring can be as small as 128 (4KB pages, 64-bit addresses)
> or as big as 4096 (64KB pages, 32-bit addresses), which is
> quite a difference. If you are still motivated to drill
> down into this, could you try changing DESC_RING_BUF_SZ
> to a fixed size of either 4KB or 64KB and test again
> with the opposite page size, to see if that changes the
> throughput?
>

Sorry for the huge delay. Just ran the tests:

- 4K pages, DESC_RING_BUF_SZ = 4K: 97 MB/s
- 4K pages, DESC_RING_BUF_SZ = 16K: 98 MB/s
- 4K pages, DESC_RING_BUF_SZ = 64K: 97 MB/s
- 16K pages, DESC_RING_BUF_SZ = 4K: 123 MB/s
- 16K pages, DESC_RING_BUF_SZ = 16K: 125 MB/s
- 16K pages, DESC_RING_BUF_SZ = 64K: 124 MB/s
- 64K pages, DESC_RING_BUF_SZ = 4K: 137 MB/s
- 64K pages, DESC_RING_BUF_SZ = 16K: 135 MB/s
- 64K pages, DESC_RING_BUF_SZE = 64K: 138 MB/s

As you can see, changing the DESC_RING_BUF_SZ value doesn't change MMC
throughput much; the fluctuations are just a measurement/statistical
error. Not that it matters much, but my measurement method was running
the same dd command I mentioned earlier:

    # dd if=/dev/zero of=somefile bs=1M count=500 oflag=sync

For each combination, I ran it 10 times, with different wait time
between runs (as it affects the throughput), then taking top 10%
percentile value -- it just felt right. The dispersion wasn't too big
either. I used the same method in my previous mail as well, so it's
safe to compare these values with those.

Not sure if my test procedure and results cover everything you wanted
to see, so please let me know if you want me to run more tests (e.g.
by changing dd params, or running multiple dd commands with shorter
wait in between, etc).

> If a larger ring buffer gives us significantly better
> throughput, we may want to always use a higher number
> independent of page size. On the other hand, if the
> 64KB number (the 138MB/s) does not change with a smaller
> ring, we may as well reduce that in order to limit the
> maximum latency that is caused by a single I/O operation.
>

From what you said, it looks like it may make a sense to reduce
DESC_RING_BUF_SZ down to 4 KiB? If so, I'd suggest we do that in a
separate patch, as this one actually fixes kernel panic when 16k/64k
pages are enabled. Please let me know what you think.

Thanks!
Arnd Bergmann Aug. 27, 2024, 7:27 a.m. UTC | #5
On Tue, Aug 27, 2024, at 03:28, Sam Protsenko wrote:
> On Thu, Mar 7, 2024 at 1:52 AM Arnd Bergmann <arnd@arndb.de> wrote:

>>
>> The change looks good to me.
>>
>> I see that the host->ring_size depends on PAGE_SIZE as well:
>>
>> #define DESC_RING_BUF_SZ        PAGE_SIZE
>> host->ring_size = DESC_RING_BUF_SZ / sizeof(struct idmac_desc_64addr);
>> host->sg_cpu = dmam_alloc_coherent(host->dev,
>>                DESC_RING_BUF_SZ, &host->sg_dma, GFP_KERNEL);
>>
>> I don't see any reason for the ring buffer size to be tied to
>> PAGE_SIZE at all, it was probably picked as a reasonable
>> default in the initial driver but isn't necessarily ideal.
>>
>> From what I can see, the number of 4KB elements in the
>> ring can be as small as 128 (4KB pages, 64-bit addresses)
>> or as big as 4096 (64KB pages, 32-bit addresses), which is
>> quite a difference. If you are still motivated to drill
>> down into this, could you try changing DESC_RING_BUF_SZ
>> to a fixed size of either 4KB or 64KB and test again
>> with the opposite page size, to see if that changes the
>> throughput?
>>
>
> Sorry for the huge delay. Just ran the tests:
>
> - 4K pages, DESC_RING_BUF_SZ = 4K: 97 MB/s
> - 4K pages, DESC_RING_BUF_SZ = 16K: 98 MB/s
> - 4K pages, DESC_RING_BUF_SZ = 64K: 97 MB/s
> - 16K pages, DESC_RING_BUF_SZ = 4K: 123 MB/s
> - 16K pages, DESC_RING_BUF_SZ = 16K: 125 MB/s
> - 16K pages, DESC_RING_BUF_SZ = 64K: 124 MB/s
> - 64K pages, DESC_RING_BUF_SZ = 4K: 137 MB/s
> - 64K pages, DESC_RING_BUF_SZ = 16K: 135 MB/s
> - 64K pages, DESC_RING_BUF_SZE = 64K: 138 MB/s

Thanks!

> From what you said, it looks like it may make a sense to reduce
> DESC_RING_BUF_SZ down to 4 KiB? If so, I'd suggest we do that in a
> separate patch, as this one actually fixes kernel panic when 16k/64k
> pages are enabled. Please let me know what you think.

Agreed, sounds good to me.

     Arnd
Sam Protsenko Aug. 27, 2024, 5 p.m. UTC | #6
Hi Ulf,

On Tue, Aug 27, 2024 at 2:27 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Tue, Aug 27, 2024, at 03:28, Sam Protsenko wrote:
> > On Thu, Mar 7, 2024 at 1:52 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> >>
> >> The change looks good to me.
> >>
> >> I see that the host->ring_size depends on PAGE_SIZE as well:
> >>
> >> #define DESC_RING_BUF_SZ        PAGE_SIZE
> >> host->ring_size = DESC_RING_BUF_SZ / sizeof(struct idmac_desc_64addr);
> >> host->sg_cpu = dmam_alloc_coherent(host->dev,
> >>                DESC_RING_BUF_SZ, &host->sg_dma, GFP_KERNEL);
> >>
> >> I don't see any reason for the ring buffer size to be tied to
> >> PAGE_SIZE at all, it was probably picked as a reasonable
> >> default in the initial driver but isn't necessarily ideal.
> >>
> >> From what I can see, the number of 4KB elements in the
> >> ring can be as small as 128 (4KB pages, 64-bit addresses)
> >> or as big as 4096 (64KB pages, 32-bit addresses), which is
> >> quite a difference. If you are still motivated to drill
> >> down into this, could you try changing DESC_RING_BUF_SZ
> >> to a fixed size of either 4KB or 64KB and test again
> >> with the opposite page size, to see if that changes the
> >> throughput?
> >>
> >
> > Sorry for the huge delay. Just ran the tests:
> >
> > - 4K pages, DESC_RING_BUF_SZ = 4K: 97 MB/s
> > - 4K pages, DESC_RING_BUF_SZ = 16K: 98 MB/s
> > - 4K pages, DESC_RING_BUF_SZ = 64K: 97 MB/s
> > - 16K pages, DESC_RING_BUF_SZ = 4K: 123 MB/s
> > - 16K pages, DESC_RING_BUF_SZ = 16K: 125 MB/s
> > - 16K pages, DESC_RING_BUF_SZ = 64K: 124 MB/s
> > - 64K pages, DESC_RING_BUF_SZ = 4K: 137 MB/s
> > - 64K pages, DESC_RING_BUF_SZ = 16K: 135 MB/s
> > - 64K pages, DESC_RING_BUF_SZE = 64K: 138 MB/s
>
> Thanks!
>
> > From what you said, it looks like it may make a sense to reduce
> > DESC_RING_BUF_SZ down to 4 KiB? If so, I'd suggest we do that in a
> > separate patch, as this one actually fixes kernel panic when 16k/64k
> > pages are enabled. Please let me know what you think.
>
> Agreed, sounds good to me.
>

Can you please take this patch? It should apply fine without the need
to rebase it. It fixes a kernel panic when big pages are enabled, so
it's an important fix for us. As discussed with Arnd, the
DESC_RING_BUF_SZ change can be made later as a separate patch.

Thanks!
Ulf Hansson Aug. 28, 2024, 3:23 p.m. UTC | #7
On Thu, 7 Mar 2024 at 00:20, Sam Protsenko <semen.protsenko@linaro.org> wrote:
>
> Commit 616f87661792 ("mmc: pass queue_limits to blk_mq_alloc_disk") [1]
> revealed the long living issue in dw_mmc.c driver, existing since the
> time when it was first introduced in commit f95f3850f7a9 ("mmc: dw_mmc:
> Add Synopsys DesignWare mmc host driver."), also making kernel boot
> broken on platforms using dw_mmc driver with 16K or 64K pages enabled,
> with this message in dmesg:
>
>     mmcblk: probe of mmc0:0001 failed with error -22
>
> That's happening because mmc_blk_probe() fails when it calls
> blk_validate_limits() consequently, which returns the error due to
> failed max_segment_size check in this code:
>
>     /*
>      * The maximum segment size has an odd historic 64k default that
>      * drivers probably should override.  Just like the I/O size we
>      * require drivers to at least handle a full page per segment.
>      */
>     ...
>     if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
>         return -EINVAL;
>
> In case when IDMAC (Internal DMA Controller) is used, dw_mmc.c always
> sets .max_seg_size to 4 KiB:
>
>     mmc->max_seg_size = 0x1000;
>
> The comment in the code above explains why it's incorrect. Arnd
> suggested setting .max_seg_size to .max_req_size to fix it, which is
> also what some other drivers are doing:
>
>    $ grep -rl 'max_seg_size.*=.*max_req_size' drivers/mmc/host/ | \
>      wc -l
>    18
>
> This change is not only fixing the boot with 16K/64K pages, but also
> leads to a better MMC performance. The linear write performance was
> tested on E850-96 board (eMMC only), before commit [1] (where it's
> possible to boot with 16K/64K pages without this fix, to be able to do
> a comparison). It was tested with this command:
>
>     # dd if=/dev/zero of=somefile bs=1M count=500 oflag=sync
>
> Test results are as follows:
>
>   - 4K pages,  .max_seg_size = 4 KiB:                   94.2 MB/s
>   - 4K pages,  .max_seg_size = .max_req_size = 512 KiB: 96.9 MB/s
>   - 16K pages, .max_seg_size = 4 KiB:                   126 MB/s
>   - 16K pages, .max_seg_size = .max_req_size = 2 MiB:   128 MB/s
>   - 64K pages, .max_seg_size = 4 KiB:                   138 MB/s
>   - 64K pages, .max_seg_size = .max_req_size = 8 MiB:   138 MB/s
>
> Unfortunately, SD card controller is not enabled in E850-96 yet, so it
> wasn't possible for me to run the test on some cheap SD cards to check
> this patch's impact on those. But it's possible that this change might
> also reduce the writes count, thus improving SD/eMMC longevity.
>
> All credit for the analysis and the suggested solution goes to Arnd.
>
> [1] https://lore.kernel.org/all/20240215070300.2200308-18-hch@lst.de/
>
> Fixes: f95f3850f7a9 ("mmc: dw_mmc: Add Synopsys DesignWare mmc host driver.")
> Suggested-by: Arnd Bergmann <arnd@arndb.de>
> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> Closes: https://lore.kernel.org/all/CA+G9fYtddf2Fd3be+YShHP6CmSDNcn0ptW8qg+stUKW+Cn0rjQ@mail.gmail.com/
> Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>

Applied for fixes and by adding a stable-tag, thanks!

Kind regards
Uffe


> ---
>  drivers/mmc/host/dw_mmc.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> index 8e2d676b9239..cccd5633ff40 100644
> --- a/drivers/mmc/host/dw_mmc.c
> +++ b/drivers/mmc/host/dw_mmc.c
> @@ -2951,8 +2951,8 @@ static int dw_mci_init_slot(struct dw_mci *host)
>         if (host->use_dma == TRANS_MODE_IDMAC) {
>                 mmc->max_segs = host->ring_size;
>                 mmc->max_blk_size = 65535;
> -               mmc->max_seg_size = 0x1000;
> -               mmc->max_req_size = mmc->max_seg_size * host->ring_size;
> +               mmc->max_req_size = DW_MCI_DESC_DATA_LENGTH * host->ring_size;
> +               mmc->max_seg_size = mmc->max_req_size;
>                 mmc->max_blk_count = mmc->max_req_size / 512;
>         } else if (host->use_dma == TRANS_MODE_EDMAC) {
>                 mmc->max_segs = 64;
> --
> 2.39.2
>
Ron Economos Sept. 16, 2024, 6:09 a.m. UTC | #8
On 3/6/24 3:20 PM, Sam Protsenko wrote:
> Commit 616f87661792 ("mmc: pass queue_limits to blk_mq_alloc_disk") [1]
> revealed the long living issue in dw_mmc.c driver, existing since the
> time when it was first introduced in commit f95f3850f7a9 ("mmc: dw_mmc:
> Add Synopsys DesignWare mmc host driver."), also making kernel boot
> broken on platforms using dw_mmc driver with 16K or 64K pages enabled,
> with this message in dmesg:
>
>      mmcblk: probe of mmc0:0001 failed with error -22
>
> That's happening because mmc_blk_probe() fails when it calls
> blk_validate_limits() consequently, which returns the error due to
> failed max_segment_size check in this code:
>
>      /*
>       * The maximum segment size has an odd historic 64k default that
>       * drivers probably should override.  Just like the I/O size we
>       * require drivers to at least handle a full page per segment.
>       */
>      ...
>      if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
>          return -EINVAL;
>
> In case when IDMAC (Internal DMA Controller) is used, dw_mmc.c always
> sets .max_seg_size to 4 KiB:
>
>      mmc->max_seg_size = 0x1000;
>
> The comment in the code above explains why it's incorrect. Arnd
> suggested setting .max_seg_size to .max_req_size to fix it, which is
> also what some other drivers are doing:
>
>     $ grep -rl 'max_seg_size.*=.*max_req_size' drivers/mmc/host/ | \
>       wc -l
>     18
>
> This change is not only fixing the boot with 16K/64K pages, but also
> leads to a better MMC performance. The linear write performance was
> tested on E850-96 board (eMMC only), before commit [1] (where it's
> possible to boot with 16K/64K pages without this fix, to be able to do
> a comparison). It was tested with this command:
>
>      # dd if=/dev/zero of=somefile bs=1M count=500 oflag=sync
>
> Test results are as follows:
>
>    - 4K pages,  .max_seg_size = 4 KiB:                   94.2 MB/s
>    - 4K pages,  .max_seg_size = .max_req_size = 512 KiB: 96.9 MB/s
>    - 16K pages, .max_seg_size = 4 KiB:                   126 MB/s
>    - 16K pages, .max_seg_size = .max_req_size = 2 MiB:   128 MB/s
>    - 64K pages, .max_seg_size = 4 KiB:                   138 MB/s
>    - 64K pages, .max_seg_size = .max_req_size = 8 MiB:   138 MB/s
>
> Unfortunately, SD card controller is not enabled in E850-96 yet, so it
> wasn't possible for me to run the test on some cheap SD cards to check
> this patch's impact on those. But it's possible that this change might
> also reduce the writes count, thus improving SD/eMMC longevity.
>
> All credit for the analysis and the suggested solution goes to Arnd.
>
> [1] https://lore.kernel.org/all/20240215070300.2200308-18-hch@lst.de/
>
> Fixes: f95f3850f7a9 ("mmc: dw_mmc: Add Synopsys DesignWare mmc host driver.")
> Suggested-by: Arnd Bergmann <arnd@arndb.de>
> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> Closes: https://lore.kernel.org/all/CA+G9fYtddf2Fd3be+YShHP6CmSDNcn0ptW8qg+stUKW+Cn0rjQ@mail.gmail.com/
> Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
> ---
>   drivers/mmc/host/dw_mmc.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> index 8e2d676b9239..cccd5633ff40 100644
> --- a/drivers/mmc/host/dw_mmc.c
> +++ b/drivers/mmc/host/dw_mmc.c
> @@ -2951,8 +2951,8 @@ static int dw_mci_init_slot(struct dw_mci *host)
>   	if (host->use_dma == TRANS_MODE_IDMAC) {
>   		mmc->max_segs = host->ring_size;
>   		mmc->max_blk_size = 65535;
> -		mmc->max_seg_size = 0x1000;
> -		mmc->max_req_size = mmc->max_seg_size * host->ring_size;
> +		mmc->max_req_size = DW_MCI_DESC_DATA_LENGTH * host->ring_size;
> +		mmc->max_seg_size = mmc->max_req_size;
>   		mmc->max_blk_count = mmc->max_req_size / 512;
>   	} else if (host->use_dma == TRANS_MODE_EDMAC) {
>   		mmc->max_segs = 64;

Unfortunately, this patch creates side effects for the RISC-V StarFive 
VisionFive2 board (which uses the DesignWare controller). The following 
swiotlb buffer warnings are produced at boot:

2024-09-15T17:54:27.194101-07:00 visionfive kernel: dwmmc_starfive 
16020000.mmc: swiotlb buffer is full (sz: 352256 bytes), total 32768 
(slots), used 222 (slots)
2024-09-15T17:54:27.194107-07:00 visionfive kernel: dwmmc_starfive 
16020000.mmc: swiotlb buffer is full (sz: 352256 bytes), total 32768 
(slots), used 222 (slots)
2024-09-15T17:54:27.194123-07:00 visionfive kernel: dwmmc_starfive 
16020000.mmc: swiotlb buffer is full (sz: 524288 bytes), total 32768 
(slots), used 110 (slots)
2024-09-15T17:54:27.194129-07:00 visionfive kernel: dwmmc_starfive 
16020000.mmc: swiotlb buffer is full (sz: 524288 bytes), total 32768 
(slots), used 110 (slots)
diff mbox series

Patch

diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
index 8e2d676b9239..cccd5633ff40 100644
--- a/drivers/mmc/host/dw_mmc.c
+++ b/drivers/mmc/host/dw_mmc.c
@@ -2951,8 +2951,8 @@  static int dw_mci_init_slot(struct dw_mci *host)
 	if (host->use_dma == TRANS_MODE_IDMAC) {
 		mmc->max_segs = host->ring_size;
 		mmc->max_blk_size = 65535;
-		mmc->max_seg_size = 0x1000;
-		mmc->max_req_size = mmc->max_seg_size * host->ring_size;
+		mmc->max_req_size = DW_MCI_DESC_DATA_LENGTH * host->ring_size;
+		mmc->max_seg_size = mmc->max_req_size;
 		mmc->max_blk_count = mmc->max_req_size / 512;
 	} else if (host->use_dma == TRANS_MODE_EDMAC) {
 		mmc->max_segs = 64;