mm: readahead: Increase maximum readahead window

Message ID 20171004091205.468-1-jack@suse.cz
State New

Commit Message

Jan Kara Oct. 4, 2017, 9:12 a.m. UTC
Increase default maximum allowed readahead window from 128 KB to 512 KB.
This improves performance for some workloads (see below for details) where
ability to scale readahead window to larger sizes allows for better total
throughput while chances for regression are rather low given readahead
window size is dynamically computed based on observation (and thus it never
grows large for workloads with a random read pattern).
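
To put the two limits in terms of pages, here is a minimal sketch of the conversion. The helper name ra_kb_to_pages() is hypothetical, and the 4096-byte page size in the usage note is an assumption about the target machine, not something stated in this patch.

```c
#include <assert.h>

/*
 * Convert a readahead limit given in kilobytes into a page count.
 * page_size is in bytes and is supplied by the caller; it is an
 * assumption of this sketch, not a value taken from the patch.
 */
static unsigned long ra_kb_to_pages(unsigned long kb, unsigned long page_size)
{
	assert(page_size > 0);
	return kb * 1024UL / page_size;
}
```

Assuming 4 KB pages, ra_kb_to_pages(128, 4096) gives 32 pages and ra_kb_to_pages(512, 4096) gives 128 pages, so the ceiling on the window grows fourfold while the starting window is unaffected.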

Note that the same tuning can be done using udev rules or by manually setting
the read_ahead_kb sysfs parameter; however, we believe the new value is a better
default that most users will want to use. As a data point, we have carried this
patch in SUSE kernels for over 8 years.
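
For readers who want to try the manual tuning on an unpatched kernel, the following is a rough userspace sketch. It assumes the per-device limit is exposed at /sys/block/<dev>/queue/read_ahead_kb; the helper names are hypothetical and "sda" in the usage note is only an example device.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the sysfs path of the per-device readahead limit.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int ra_sysfs_path(char *buf, size_t len, const char *dev)
{
	int n;

	assert(buf != NULL && dev != NULL);
	n = snprintf(buf, len, "/sys/block/%s/queue/read_ahead_kb", dev);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read the current limit in KB; returns -1 on any error. */
static long ra_read_kb(const char *dev)
{
	char path[256];
	long kb = -1;
	FILE *f;

	if (ra_sysfs_path(path, sizeof(path), dev) != 0)
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &kb) != 1)
		kb = -1;
	fclose(f);
	return kb;
}

/* Write a new limit in KB (normally needs root); returns 0 on success. */
static int ra_write_kb(const char *dev, long kb)
{
	char path[256];
	FILE *f;
	int ok;

	if (ra_sysfs_path(path, sizeof(path), dev) != 0)
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%ld\n", kb) > 0;
	return (fclose(f) == 0 && ok) ? 0 : -1;
}
```

Calling ra_write_kb("sda", 512) as root would request the same 512 KB limit this patch makes the default, and ra_read_kb("sda") reads it back. A udev rule achieves the same thing persistently across reboots.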

Some data from the last evaluation of this patch follows (on a 4.4-based kernel; I
can rerun those tests on a newer kernel but nothing has changed in the readahead
area since 4.4). The patch was evaluated on two machines:
o A UMA machine, 8 cores and rotary storage
o A NUMA machine, 4 sockets, 48 cores and SSD storage

Five basic tests were conducted:

1. paralleldd-single
   paralleldd uses different instances of dd to access a single file and
   write the contents to /dev/null. The performance of it depends on how
   well readahead works for a single file. It's mostly sequential IO.

2. paralleldd-multi
   Similar to test 1 except each instance of dd accesses a different file
   so each instance of dd is accessing data sequentially but the timing
   makes it look like random read IO.

3. pgbench-small
   A standard init of pgbench and execution with a small data set

4. pgbench-large
   A standard init of pgbench and execution with a large data set

5. bonnie++ with dataset sizes 2X RAM and in asynchronous mode

UMA paralleldd-single on ext3
                                  4.4.0                 4.4.0
                                vanilla        readahead-v1r1
Amean    Elapsd-1        5.42 (  0.00%)        5.40 (  0.50%)
Amean    Elapsd-3        7.51 (  0.00%)        5.54 ( 26.25%)
Amean    Elapsd-5        7.15 (  0.00%)        5.90 ( 17.46%)
Amean    Elapsd-7        5.81 (  0.00%)        5.61 (  3.42%)
Amean    Elapsd-8        6.05 (  0.00%)        5.73 (  5.36%)

The results speak for themselves: readahead is a major boost when there
are multiple readers of data. It's not displayed but system CPU
usage is lower overall. The IO stats support the results:

                       4.4.0       4.4.0
                     vanilla readahead-v1r1
Mean sda-avgqusz        7.44        8.59
Mean sda-avgrqsz      279.77      722.52
Mean sda-await         31.95       48.82
Mean sda-r_await        3.32       11.58
Mean sda-w_await      127.51      119.60
Mean sda-svctm          1.47        3.46
Mean sda-rrqm          27.82       23.52
Mean sda-wrqm           4.52        5.00

It shows that the average request size is 2.5 times larger even
though the merging stats are similar. It's also interesting to
note that average wait times are higher but more IO is being
initiated per dd instance.
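
The ratio and the percentage columns can be recomputed from the rounded table values with a small sketch like the one below. The helper names are hypothetical, and since the inputs are already rounded to two decimals the results can differ slightly from the printed percentages.

```c
#include <assert.h>

/* Ratio of the patched value to the baseline value. */
static double stat_ratio(double base, double patched)
{
	assert(base > 0.0);
	return patched / base;
}

/*
 * Percentage gain for a lower-is-better metric such as elapsed time:
 * positive means the patched kernel was faster.
 */
static double pct_gain_lower_is_better(double base, double patched)
{
	assert(base > 0.0);
	return (base - patched) / base * 100.0;
}
```

With the avgrqsz figures above, stat_ratio(279.77, 722.52) comes out at roughly 2.58, and pct_gain_lower_is_better(7.51, 5.54) for the Elapsd-3 row gives roughly 26.2, close to the 26.25% printed in the table.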

It's interesting to note that this is specific to ext3 and that xfs showed
a small regression with larger readahead.

UMA paralleldd-single on xfs
                                  4.4.0                 4.4.0
                                vanilla        readahead-v1r1
Min      Elapsd-1        6.91 (  0.00%)        7.10 ( -2.75%)
Min      Elapsd-3        6.77 (  0.00%)        6.93 ( -2.36%)
Min      Elapsd-5        6.82 (  0.00%)        7.00 ( -2.64%)
Min      Elapsd-7        6.84 (  0.00%)        7.05 ( -3.07%)
Min      Elapsd-8        7.02 (  0.00%)        7.04 ( -0.28%)
Amean    Elapsd-1        7.08 (  0.00%)        7.20 ( -1.68%)
Amean    Elapsd-3        7.03 (  0.00%)        7.12 ( -1.40%)
Amean    Elapsd-5        7.22 (  0.00%)        7.38 ( -2.34%)
Amean    Elapsd-7        7.07 (  0.00%)        7.19 ( -1.75%)
Amean    Elapsd-8        7.23 (  0.00%)        7.23 ( -0.10%)

The IO stats are not displayed but show a similar ratio to ext3 and system
CPU usage is also lower. Hence, this slowdown is unexplained but may be
due to differences in XFS in the read path and how it locks even though
direct IO is not a factor. Tracing was not enabled to see what flags are
passed into xfs_ilock to see if the IO is all behind one lock but it's
one potential explanation.

UMA paralleldd-multi on ext3

This showed nothing interesting as the test was too short-lived to draw
any conclusions. There was some difference in the kernels but it was
within the noise. The same applies for XFS.

UMA pgbench-small on ext3

This showed very little that was interesting. The database load time
was slower but by a very small margin. The actual transaction times
were highly variable and inconclusive.

NUMA pgbench-small on ext3

Load times are not reported but they completed 1.5% faster.

                             4.4.0                 4.4.0
                           vanilla        readahead-v1r1
Hmean    1       3000.54 (  0.00%)     2895.28 ( -3.51%)
Hmean    8      20596.33 (  0.00%)    19291.92 ( -6.33%)
Hmean    12     30760.68 (  0.00%)    30019.58 ( -2.41%)
Hmean    24     74383.22 (  0.00%)    73580.80 ( -1.08%)
Hmean    32     88377.30 (  0.00%)    88928.70 (  0.62%)
Hmean    48     88133.53 (  0.00%)    96099.16 (  9.04%)
Hmean    80     55981.37 (  0.00%)    76886.10 ( 37.34%)
Hmean    112    74060.29 (  0.00%)    87632.95 ( 18.33%)
Hmean    144    51331.50 (  0.00%)    66135.77 ( 28.84%)
Hmean    172    44256.92 (  0.00%)    63521.73 ( 43.53%)
Hmean    192    35942.74 (  0.00%)    71121.35 ( 97.87%)

The impact here is substantial particularly for higher thread-counts.
It's interesting to note that there is an apparent regression for low
thread counts. In general, there was a high degree of variability
but the gains were all outside of the noise. The IO stats
did not show any particular pattern about request size as the workload
is mostly resident in memory. The real curiosity is that readahead
should have had little or no impact here as the data is mostly resident
in memory. Observing the transactions over time, there was a lot of
variability and the performance is likely dominated by whether the
data happened to be local or not. In itself, this test does not push
for inclusion of the patch due to the lack of IO but is included for
completeness.

UMA pgbench-small on xfs

Similar observations to ext3 on the load times. The transaction times
were stable but showed no significant performance difference.

UMA pgbench-large on ext3

Database load times were slightly faster (3.36%). The transaction times
were slower on average, more variable but still very close to the noise.

UMA pgbench-large on xfs

No significant difference on either database load times or transactions.

UMA bonnie on ext3

                                               4.4.0                       4.4.0
                                             vanilla              readahead-v1r1
Hmean    SeqOut Char            81079.98 (  0.00%)        81172.05 (  0.11%)
Hmean    SeqOut Block          104416.12 (  0.00%)       104116.24 ( -0.29%)
Hmean    SeqOut Rewrite         44153.34 (  0.00%)        44596.23 (  1.00%)
Hmean    SeqIn  Char            88144.56 (  0.00%)        91702.67 (  4.04%)
Hmean    SeqIn  Block          134581.06 (  0.00%)       137245.71 (  1.98%)
Hmean    Random seeks             258.46 (  0.00%)          280.82 (  8.65%)
Hmean    SeqCreate ops              2.25 (  0.00%)            2.25 (  0.00%)
Hmean    SeqCreate read             2.25 (  0.00%)            2.25 (  0.00%)
Hmean    SeqCreate del            911.29 (  0.00%)          880.24 ( -3.41%)
Hmean    RandCreate ops             2.25 (  0.00%)            2.25 (  0.00%)
Hmean    RandCreate read            2.00 (  0.00%)            2.25 ( 12.50%)
Hmean    RandCreate del           911.89 (  0.00%)          878.80 ( -3.63%)

The difference in headline performance figures is marginal and well within noise.
The system CPU usage tells a slightly different story

               4.4.0       4.4.0
             vanilla readahead-v1r1
User         1817.53     1798.89
System        499.40      420.65
Elapsed     10692.67    10588.08

As do the IO stats

                      4.4.0       4.4.0
                     vanilla readahead-v1r1
Mean sda-avgqusz     1079.16     1083.35
Mean sda-avgrqsz      807.95     1225.08
Mean sda-await       7308.06     9647.13
Mean sda-r_await      119.04      133.27
Mean sda-w_await    19106.20    20255.41
Mean sda-svctm          4.67        7.02
Mean sda-rrqm           1.80        0.99
Mean sda-wrqm        5597.12     5723.32

NUMA bonnie on ext3

bonnie
                                               4.4.0                       4.4.0
                                             vanilla              readahead-v1r1
Hmean    SeqOut Char            58660.72 (  0.00%)        58930.39 (  0.46%)
Hmean    SeqOut Block          253950.92 (  0.00%)       261466.37 (  2.96%)
Hmean    SeqOut Rewrite        151960.60 (  0.00%)       161300.48 (  6.15%)
Hmean    SeqIn  Char            57015.41 (  0.00%)        55699.16 ( -2.31%)
Hmean    SeqIn  Block          600448.14 (  0.00%)       627565.09 (  4.52%)
Hmean    Random seeks               0.00 (  0.00%)            0.00 (  0.00%)
Hmean    SeqCreate ops              1.00 (  0.00%)            1.00 (  0.00%)
Hmean    SeqCreate read             3.00 (  0.00%)            3.00 (  0.00%)
Hmean    SeqCreate del             90.91 (  0.00%)           79.88 (-12.14%)
Hmean    RandCreate ops             1.00 (  0.00%)            1.50 ( 50.00%)
Hmean    RandCreate read            3.00 (  0.00%)            3.00 (  0.00%)
Hmean    RandCreate del            92.95 (  0.00%)           93.97 (  1.10%)

The impact is small but in line with the UMA machine in a number of details.
As before, the CPU usage is lower even if the iostats show very little
difference overall.

Overall, the headline performance figures are mostly improved or show
little difference. There is a small anomaly with XFS that indicates it may
not always win there due to other factors. There is also the possibility
that a mostly random read workload that was larger than memory with each
read spanning multiple pages but less than the max readahead window would
suffer but the probability is low as the readahead window should scale
properly. On balance, this is a win -- particularly on the large read
workloads.
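
To illustrate why a larger maximum mostly matters for sequential readers, here is a toy model of window ramp-up. It is not kernel code: the quadruple-then-double growth rule and the 4-page starting window are assumptions made for illustration, loosely following the ramp-up heuristic in mm/readahead.c, and all names are hypothetical.

```c
#include <assert.h>

/*
 * Toy model: grow the window 4x while it is small relative to max,
 * 2x afterwards, and never beyond max. Units are pages.
 */
static unsigned long next_window(unsigned long cur, unsigned long max)
{
	unsigned long next = (cur < max / 16) ? 4 * cur : 2 * cur;

	return next < max ? next : max;
}

/* Count how many sequential hits it takes to grow from start to max. */
static unsigned int steps_to_max(unsigned long start, unsigned long max)
{
	unsigned int steps = 0;

	assert(start > 0 && max > 0);
	while (start < max) {
		start = next_window(start, max);
		steps++;
	}
	return steps;
}
```

In this model, starting from 4 pages, the old 32-page ceiling is reached after 3 sequential hits and the new 128-page ceiling after 4. A reader that keeps seeking never accumulates those hits, so its window stays near the starting size whatever the maximum is.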

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/mm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Darrick J. Wong Oct. 4, 2017, 5:41 p.m. UTC | #1
On Wed, Oct 04, 2017 at 11:12:05AM +0200, Jan Kara wrote:
> Increase default maximum allowed readahead window from 128 KB to 512 KB.
> This improves performance for some workloads (see below for details) where
> ability to scale readahead window to larger sizes allows for better total
> throughput while chances for regression are rather low given readahead
> window size is dynamically computed based on observation (and thus it never
> grows large for workloads with a random read pattern).
> 
> Note that the same tuning can be done using udev rules or by manually setting
> the sysctl parameter however we believe the new value is a better default most
> users will want to use. As a data point we carry this patch in SUSE kernels
> for over 8 years.
> 
> Some data from the last evaluation of this patch (on 4.4-based kernel, I can
> rerun those tests on a newer kernel but nothing has changed in the readahead
> area since 4.4). The patch was evaluated on two machines

This is pure speculation, but I think this is worth at least a quick
retry on 4.14 to see what's changed in the past 10 kernel releases.  For
one thing, ext3 no longer exists, and XFS' file IO path has changed
quite a lot since then.

> o a UMA machine, 8 cores and rotary storage
> o A NUMA machine, 4 socket, 48 cores and SSD storage
> 
> Five basic tests were conducted;
> 
> 1. paralleldd-single
>    paralleldd uses different instances of dd to access a single file and
>    write the contents to /dev/null. The performance of it depends on how
>    well readahead works for a single file. It's mostly sequential IO.
> 
> 2. paralleldd-multi
>    Similar to test 1 except each instance of dd accesses a different file
>    so each instance of dd is accessing data sequentially but the timing
>    makes it look like random read IO.
> 
> 3. pgbench-small
>    A standard init of pgbench and execution with a small data set
> 
> 4. pgbench-large
>    A standard init of pgbench and execution with a large data set
> 
> 5. bonnie++ with dataset sizes 2X RAM and in asynchronous mode
> 
> UMA paralleldd-single on ext3
>                                   4.4.0                 4.4.0
>                                 vanilla        readahead-v1r1
> Amean    Elapsd-1        5.42 (  0.00%)        5.40 (  0.50%)
> Amean    Elapsd-3        7.51 (  0.00%)        5.54 ( 26.25%)
> Amean    Elapsd-5        7.15 (  0.00%)        5.90 ( 17.46%)
> Amean    Elapsd-7        5.81 (  0.00%)        5.61 (  3.42%)
> Amean    Elapsd-8        6.05 (  0.00%)        5.73 (  5.36%)
> 
> Results speak for themselves, readahead is a major boost when there
> are multiple readers of data. It's not displayed but system CPU
> usage is lower overall. The IO stats support the results
> 
>                        4.4.0       4.4.0
>                      vanilla readahead-v1r1
> Mean sda-avgqusz        7.44        8.59
> Mean sda-avgrqsz      279.77      722.52
> Mean sda-await         31.95       48.82
> Mean sda-r_await        3.32       11.58
> Mean sda-w_await      127.51      119.60
> Mean sda-svctm          1.47        3.46
> Mean sda-rrqm          27.82       23.52
> Mean sda-wrqm           4.52        5.00
> 
> It shows that the average request size is 2.5 times larger even
> though the merging stats are similar. It's also interesting to
> note that average wait times are higher but more IO is being
> initiated per dd instance.
> 
> It's interesting to note that this is specific to ext3 and that xfs showed
> a small regression with larger readahead.
> 
> UMA paralleldd-single on xfs
>                                   4.4.0                 4.4.0
>                                 vanilla        readahead-v1r1
> Min      Elapsd-1        6.91 (  0.00%)        7.10 ( -2.75%)
> Min      Elapsd-3        6.77 (  0.00%)        6.93 ( -2.36%)
> Min      Elapsd-5        6.82 (  0.00%)        7.00 ( -2.64%)
> Min      Elapsd-7        6.84 (  0.00%)        7.05 ( -3.07%)
> Min      Elapsd-8        7.02 (  0.00%)        7.04 ( -0.28%)
> Amean    Elapsd-1        7.08 (  0.00%)        7.20 ( -1.68%)
> Amean    Elapsd-3        7.03 (  0.00%)        7.12 ( -1.40%)
> Amean    Elapsd-5        7.22 (  0.00%)        7.38 ( -2.34%)
> Amean    Elapsd-7        7.07 (  0.00%)        7.19 ( -1.75%)
> Amean    Elapsd-8        7.23 (  0.00%)        7.23 ( -0.10%)
> 
> The IO stats are not displayed but show a similar ratio to ext3 and system
> CPU usage is also lower. Hence, this slowdown is unexplained but may be
> due to differences in XFS in the read path and how it locks even though
> direct IO is not a factor. Tracing was not enabled to see what flags are
> passed into xfs_ilock to see if the IO is all behind one lock but it's
> one potential explanation.
> 
> UMA paralleldd-multi on ext3
> 
> This showed nothing interesting as the test was too short-lived to draw
> any conclusions. There was some difference in the kernels but it was
> within the noise. The same applies for XFS.
> 
> UMA pgbench-small on ext3
> 
> This showed very little that was interesting. The database load time
> was slower but by a very small margin. The actual transaction times
> were highly variable and inconclusive.
> 
> NUMA pgbench-small on ext3
> 
> Load times are not reported but they completed 1.5% faster.
> 
>                              4.4.0                 4.4.0
>                            vanilla        readahead-v1r1
> Hmean    1       3000.54 (  0.00%)     2895.28 ( -3.51%)
> Hmean    8      20596.33 (  0.00%)    19291.92 ( -6.33%)
> Hmean    12     30760.68 (  0.00%)    30019.58 ( -2.41%)
> Hmean    24     74383.22 (  0.00%)    73580.80 ( -1.08%)
> Hmean    32     88377.30 (  0.00%)    88928.70 (  0.62%)
> Hmean    48     88133.53 (  0.00%)    96099.16 (  9.04%)
> Hmean    80     55981.37 (  0.00%)    76886.10 ( 37.34%)
> Hmean    112    74060.29 (  0.00%)    87632.95 ( 18.33%)
> Hmean    144    51331.50 (  0.00%)    66135.77 ( 28.84%)
> Hmean    172    44256.92 (  0.00%)    63521.73 ( 43.53%)
> Hmean    192    35942.74 (  0.00%)    71121.35 ( 97.87%)
> 
> The impact here is substantial particularly for higher thread-counts.
> It's interesting to note that there is an apparent regression for low
> thread counts. In general, there was a high degree of variability
> but the gains were all outside of the noise. In general, the io stats
> did not show any particular pattern about request size as the workload
> is mostly resident in memory. The real curiosity is that readahead
> should have had little or no impact here as the data is mostly resident
> in memory. Observing the transactions over time, there was a lot of
> variability and the performance is likely dominated by whether the
> data happened to be local or not. In itself, this test does not push
> for inclusion of the patch due to the lack of IO but is included for
> completeness.
> 
> UMA pgbench-small on xfs
> 
> Similar observations to ext3 on the load times. The transaction times
> were stable but showed no significant performance difference.
> 
> UMA pgbench-large on ext3
> 
> Database load times were slightly faster (3.36%). The transaction times
> were slower on average, more variable but still very close to the noise.
> 
> UMA pgbench-large on xfs
> 
> No significant difference on either database load times or transactions.
> 
> UMA bonnie on ext3
> 
>                                                4.4.0                       4.4.0
>                                              vanilla              readahead-v1r1
> Hmean    SeqOut Char            81079.98 (  0.00%)        81172.05 (  0.11%)
> Hmean    SeqOut Block          104416.12 (  0.00%)       104116.24 ( -0.29%)
> Hmean    SeqOut Rewrite         44153.34 (  0.00%)        44596.23 (  1.00%)
> Hmean    SeqIn  Char            88144.56 (  0.00%)        91702.67 (  4.04%)
> Hmean    SeqIn  Block          134581.06 (  0.00%)       137245.71 (  1.98%)
> Hmean    Random seeks             258.46 (  0.00%)          280.82 (  8.65%)
> Hmean    SeqCreate ops              2.25 (  0.00%)            2.25 (  0.00%)
> Hmean    SeqCreate read             2.25 (  0.00%)            2.25 (  0.00%)
> Hmean    SeqCreate del            911.29 (  0.00%)          880.24 ( -3.41%)
> Hmean    RandCreate ops             2.25 (  0.00%)            2.25 (  0.00%)
> Hmean    RandCreate read            2.00 (  0.00%)            2.25 ( 12.50%)
> Hmean    RandCreate del           911.89 (  0.00%)          878.80 ( -3.63%)
> 
> The difference in headline performance figures is marginal and well within noise.
> The system CPU usage tells a slightly different story
> 
>                4.4.0       4.4.0
>              vanilla readahead-v1r1
> User         1817.53     1798.89
> System        499.40      420.65
> Elapsed     10692.67    10588.08
> 
> As do the IO stats
> 
>                       4.4.0       4.4.0
>                      vanilla readahead-v1r1
> Mean sda-avgqusz     1079.16     1083.35
> Mean sda-avgrqsz      807.95     1225.08
> Mean sda-await       7308.06     9647.13
> Mean sda-r_await      119.04      133.27
> Mean sda-w_await    19106.20    20255.41
> Mean sda-svctm          4.67        7.02
> Mean sda-rrqm           1.80        0.99
> Mean sda-wrqm        5597.12     5723.32
> 
> NUMA bonnie on ext3
> 
> bonnie
>                                                4.4.0                       4.4.0
>                                              vanilla              readahead-v1r1
> Hmean    SeqOut Char            58660.72 (  0.00%)        58930.39 (  0.46%)
> Hmean    SeqOut Block          253950.92 (  0.00%)       261466.37 (  2.96%)
> Hmean    SeqOut Rewrite        151960.60 (  0.00%)       161300.48 (  6.15%)
> Hmean    SeqIn  Char            57015.41 (  0.00%)        55699.16 ( -2.31%)
> Hmean    SeqIn  Block          600448.14 (  0.00%)       627565.09 (  4.52%)
> Hmean    Random seeks               0.00 (  0.00%)            0.00 (  0.00%)
> Hmean    SeqCreate ops              1.00 (  0.00%)            1.00 (  0.00%)
> Hmean    SeqCreate read             3.00 (  0.00%)            3.00 (  0.00%)
> Hmean    SeqCreate del             90.91 (  0.00%)           79.88 (-12.14%)
> Hmean    RandCreate ops             1.00 (  0.00%)            1.50 ( 50.00%)
> Hmean    RandCreate read            3.00 (  0.00%)            3.00 (  0.00%)
> Hmean    RandCreate del            92.95 (  0.00%)           93.97 (  1.10%)
> 
> The impact is small but in line with the UMA machine in a number of details.
> As before, the CPU usage is lower even if the iostats show very little
> differences overall.
> 
> Overall, the headline performance figures are mostly improved or show
> little difference. There is a small anomaly with XFS that indicates it may
> not always win there due to other factors. There is also the possibility

/me wonders what the anomaly is/was?

(Well, not that much.  If it disappears on 4.14 then I don't care at
all. :P)

--D

> that a mostly random read workload that was larger than memory with each
> read spanning multiple pages but less than the max readahead window would
> suffer but the probability is low as the readahead window should scale
> properly. On balance, this is a win -- particularly on the large read
> workloads.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  include/linux/mm.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 00bad7793788..c50c6f442786 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1991,7 +1991,7 @@ int write_one_page(struct page *page, int wait);
>  void task_dirty_inc(struct task_struct *tsk);
>  
>  /* readahead.c */
> -#define VM_MAX_READAHEAD	128	/* kbytes */
> +#define VM_MAX_READAHEAD	512	/* kbytes */
>  #define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
>  
>  int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
Jan Kara Oct. 5, 2017, 8:39 a.m. UTC | #2
On Wed 04-10-17 10:41:51, Darrick J. Wong wrote:
> On Wed, Oct 04, 2017 at 11:12:05AM +0200, Jan Kara wrote:
> > Increase default maximum allowed readahead window from 128 KB to 512 KB.
> > This improves performance for some workloads (see below for details) where
> > ability to scale readahead window to larger sizes allows for better total
> > throughput while chances for regression are rather low given readahead
> > window size is dynamically computed based on observation (and thus it never
> > grows large for workloads with a random read pattern).
> > 
> > Note that the same tuning can be done using udev rules or by manually setting
> > the sysctl parameter however we believe the new value is a better default most
> > users will want to use. As a data point we carry this patch in SUSE kernels
> > for over 8 years.
> > 
> > Some data from the last evaluation of this patch (on 4.4-based kernel, I can
> > rerun those tests on a newer kernel but nothing has changed in the readahead
> > area since 4.4). The patch was evaluated on two machines
> 
> This is purely speculating, but I think this is worth at least a quick
> retry on 4.14 to see what's changed in the past 10 kernel release.  For
> one thing, ext3 no longer exists, and XFS' file IO path has changed
> quite a lot since then.

ext3 in this test is actually using the ext4 driver already, so that has not
changed. I agree XFS has changed quite a bit so results might differ there.
I can rerun it with a current kernel to see whether XFS behavior has changed.

								Honza

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 00bad7793788..c50c6f442786 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1991,7 +1991,7 @@  int write_one_page(struct page *page, int wait);
 void task_dirty_inc(struct task_struct *tsk);
 
 /* readahead.c */
-#define VM_MAX_READAHEAD	128	/* kbytes */
+#define VM_MAX_READAHEAD	512	/* kbytes */
 #define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
 
 int force_page_cache_readahead(struct address_space *mapping, struct file *filp,