
[RFC] mm: readahead: add readahead_shift into backing device

Message ID 20190322154610.164564-1-liumartin@google.com (mailing list archive)
State New, archived
Series [RFC] mm: readahead: add readahead_shift into backing device

Commit Message

Martin Liu March 22, 2019, 3:46 p.m. UTC
As discussed in https://lore.kernel.org/patchwork/patch/334982/,
an open file's ra_pages can get out of sync with bdi.ra_pages
after sequential, random, or failed reads. With the current
design, users must reopen the file or call fadvise to bring it
back in sync. However, there are cases where we want to change
file ra_pages system-wide to improve performance, such as
increasing ra_pages to speed up boot, or decreasing it to
mitigate thrashing under memory pressure. Those are the cases
where we need to change this value globally and immediately.
However, from the discussion we know the bdi readahead value
tends to be an optimal, physical property of the storage and
does not tend to change dynamically. We therefore propose adding
readahead_shift to the backing device. This value is used as a
bit shift applied to the file's readahead value to compute the
effective readahead size, with a negative value dividing rather
than multiplying in the fixup function. The fixup only takes
effect when the shift is nonzero; otherwise it returns the
file's ra_pages directly. An administrator can thus use this
value to scale the readahead of everything backed by a given
device to achieve their goal.

Signed-off-by: Martin Liu <liumartin@google.com>
---
I'd like to get some feedback on this patch. If you have any
concerns, please let me know. Thanks.
---
 block/blk-sysfs.c                | 28 ++++++++++++++++++++++++++++
 include/linux/backing-dev-defs.h |  1 +
 include/linux/backing-dev.h      | 10 ++++++++++
 mm/backing-dev.c                 | 27 +++++++++++++++++++++++++++
 mm/filemap.c                     |  8 +++++---
 mm/readahead.c                   |  2 +-
 6 files changed, 72 insertions(+), 4 deletions(-)
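
For reference, the arithmetic that the patch's read_ahead_fixup()
helper performs, as a small standalone sketch (illustrative only; the
128-page default window assumes a 4 KiB page size):

#include <stdio.h>

/* Mirror of the proposed fixup: a positive shift multiplies the
 * readahead window, a negative shift divides it. */
static unsigned long apply_ra_shift(unsigned long ra_pages, int shift)
{
	return shift >= 0 ? ra_pages << shift : ra_pages >> -shift;
}

int main(void)
{
	int shifts[] = { -2, -1, 0, 1, 2 };

	for (int i = 0; i < 5; i++)
		printf("shift %+d -> %4lu pages\n",
		       shifts[i], apply_ra_shift(128, shifts[i]));
	return 0;
}

With the default 128-page window this prints 32, 64, 128, 256 and 512
pages respectively. An administrator would set the shift through the
new sysfs attribute, e.g. "echo -2 > /sys/block/sda/queue/read_ahead_shift"
(device name hypothetical).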

Comments

Fengguang Wu March 25, 2019, 12:16 p.m. UTC | #1
Martin,

On Fri, Mar 22, 2019 at 11:46:11PM +0800, Martin Liu wrote:
>As discussed in https://lore.kernel.org/patchwork/patch/334982/,
>an open file's ra_pages can get out of sync with bdi.ra_pages
>after sequential, random, or failed reads. With the current
>design, users must reopen the file or call fadvise to bring it
>back in sync. However, there are cases where we want to change
>file ra_pages system-wide to improve performance, such as
>increasing ra_pages to speed up boot, or decreasing it to

Do you have examples of some distro making use of larger ra_pages
for boot-time optimization?

There are fadvise() based readahead/preload tools that can be more
efficient for that purpose.

>mitigate thrashing under memory pressure. Those are the cases

Suppose N read streams with equal read speed. The thrash-free memory
requirement would be (N * 2 * ra_pages).

If N=1000 and ra_pages=1MB, it'd require 2GB of memory, which looks
affordable on mainstream servers.

Also we have try_context_readahead(), which is good at estimating the
thrashing-free readahead size -- as long as the system load and
read speed remain stable, and node/zone reclaims are balanced.

For normal systems without a huge number of slow read streams, e.g.
desktops, reducing ra_pages may save a little memory by reducing
readahead misses, especially on mmap() read-around, though at the
risk of issuing more IOs. So I'd imagine only desktops with 1GB of
memory or less really need to worry about a conservative readahead
size.

>where we need to change this value globally and immediately.

Such a requirement should be rare enough?

OTOH, the current Linux behavior of "changing ra_pages only changes
future file opens" does not seem like something a typical end user
would expect or understand.

>However, from the discussion we know the bdi readahead value
>tends to be an optimal, physical property of the storage and
>does not tend to change dynamically. We therefore propose adding
>readahead_shift to the backing device. This value is used as a
>bit shift applied to the file's readahead value to compute the
>effective readahead size, with a negative value dividing rather
>than multiplying in the fixup function. The fixup only takes
>effect when the shift is nonzero; otherwise it returns the
>file's ra_pages directly. An administrator can thus use this
>value to scale the readahead of everything backed by a given
>device to achieve their goal.

Sorry but it sounds like introducing an unnecessarily twisted new
interface. I'm afraid it fixes the pain for 0.001% of users while
bringing more puzzlement to the majority.

If we make up our minds to fix the problem, it'd be more acceptable to
limit it to an internal per-file data structure change, like this:

--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -884,7 +884,7 @@ struct file_ra_state {
        unsigned int async_size;        /* do asynchronous readahead when
                                           there are only # of pages ahead */

-       unsigned int ra_pages;          /* Maximum readahead window */
+       unsigned int ra_pages_shift;
        unsigned int mmap_miss;         /* Cache miss stat for mmap accesses */
        loff_t prev_pos;                /* Cache last read() position */
 };

Then let fadvise() and shrink_readahead_size_eio() adjust that
per-file ra_pages_shift.
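
A minimal sketch of how the readahead paths could then derive their
window from that field, for illustration only (file_ra_max_pages() is
a hypothetical helper, and ra_pages_shift is assumed to be signed so
per-file shrinking works):

static inline unsigned long file_ra_max_pages(struct file *file)
{
	struct backing_dev_info *bdi = inode_to_bdi(file->f_mapping->host);
	int shift = file->f_ra.ra_pages_shift;

	/* Scale the always-current bdi-wide window by the per-file
	 * shift: global ra_pages changes take effect immediately,
	 * while fadvise()/readahead-error adjustments survive as a
	 * relative factor. */
	return shift >= 0 ? bdi->ra_pages << shift :
			    bdi->ra_pages >> -shift;
}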

Cheers,
Fengguang

Mark Salyzyn March 25, 2019, 4:59 p.m. UTC | #2
On 03/25/2019 05:16 AM, Fengguang Wu wrote:
> Martin,
>
> On Fri, Mar 22, 2019 at 11:46:11PM +0800, Martin Liu wrote:
>> As discussed in https://lore.kernel.org/patchwork/patch/334982/,
>> an open file's ra_pages can get out of sync with bdi.ra_pages
>> after sequential, random, or failed reads. With the current
>> design, users must reopen the file or call fadvise to bring it
>> back in sync. However, there are cases where we want to change
>> file ra_pages system-wide to improve performance, such as
>> increasing ra_pages to speed up boot, or decreasing it to
>
> Do you have examples that some distro making use of larger ra_pages
> for boot time optimization?

Android (if you are willing to squint and look at android-common AOSP 
kernels as a Distro).

>
> Suppose N read streams with equal read speed. The thrash-free memory
> requirement would be (N * 2 * ra_pages).
>
> If N=1000 and ra_pages=1MB, it'd require 2GB of memory, which looks
> affordable on mainstream servers.
That is 50% of the memory on a high-end Android device ...
>
> Sorry but it sounds like introducing an unnecessarily twisted new
> interface. I'm afraid it fixes the pain for 0.001% of users while
> bringing more puzzlement to the majority.
More than 2B Android devices on the planet is 0.001%?

I am not defending the proposed interface, though; if there is something
better that can be used, then looking into:
>
> Then let fadvise() and shrink_readahead_size_eio() adjust that
> per-file ra_pages_shift.
Sounds like this would require a lot from init to globally audit and 
reduce the read-ahead for all open files?

Sincerely -- Mark Salyzyn
Fengguang Wu March 26, 2019, 1:30 a.m. UTC | #3
On Mon, Mar 25, 2019 at 09:59:31AM -0700, Mark Salyzyn wrote:
>On 03/25/2019 05:16 AM, Fengguang Wu wrote:
>> Martin,
>>
>> On Fri, Mar 22, 2019 at 11:46:11PM +0800, Martin Liu wrote:
>>> As discussed in https://lore.kernel.org/patchwork/patch/334982/,
>>> an open file's ra_pages can get out of sync with bdi.ra_pages
>>> after sequential, random, or failed reads. With the current
>>> design, users must reopen the file or call fadvise to bring it
>>> back in sync. However, there are cases where we want to change
>>> file ra_pages system-wide to improve performance, such as
>>> increasing ra_pages to speed up boot, or decreasing it to
>>
>> Do you have examples of some distro making use of larger ra_pages
>> for boot-time optimization?
>
>Android (if you are willing to squint and look at android-common AOSP
>kernels as a Distro).

OK. I wonder how exactly Android makes use of it. Since phones are not
using hard disks, they should benefit less from large ra_pages. Would
you kindly point me to the code?

>> Suppose N read streams with equal read speed. The thrash-free memory
>> requirement would be (N * 2 * ra_pages).
>>
>> If N=1000 and ra_pages=1MB, it'd require 2GB of memory, which looks
>> affordable on mainstream servers.
>That is 50% of the memory on a high-end Android device ...

Yeah, but I'm obviously not talking about Android devices here. Will a
phone serve 1000 concurrent read streams?

>> Sorry but it sounds like introducing an unnecessarily twisted new
>> interface. I'm afraid it fixes the pain for 0.001% of users while
>> bringing more puzzlement to the majority.
> More than 2B Android devices on the planet is 0.001%?

Nope. Sorry I didn't know about the Android usage.
Actually nobody mentioned it in the past discussions.

>I am not defending the proposed interface, though; if there is something
>better that can be used, then looking into:
>>
>> Then let fadvise() and shrink_readahead_size_eio() adjust that
>> per-file ra_pages_shift.
>Sounds like this would require a lot from init to globally audit and
>reduce the read-ahead for all open files?

It depends. In theory it should be possible to create a standalone
kernel module to dump the page cache and get a current snapshot of
all cached file pages. It'd be a one-shot action and wouldn't require
continuous auditing.

[RFC] kernel facilities for cache prefetching
https://lwn.net/Articles/182128

This tool may also work. It's quick to get the list of open files by
walking /proc/*/fd/, however it's not as easy to get the list of cached
file names.

https://github.com/tobert/pcstat

Perhaps we can do a simplified /proc/filecache that only dumps the
list of cached file names, then let mincore() based tools take care
of the rest of the work.
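
A minimal userspace sketch of the mincore() approach for a single
file (illustrative only; error handling trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd = open(argv[1], O_RDONLY);
	long psize = sysconf(_SC_PAGESIZE);
	struct stat st;

	fstat(fd, &st);
	size_t pages = (st.st_size + psize - 1) / psize;
	void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	unsigned char *vec = malloc(pages);

	/* Bit 0 of vec[i] is set iff page i is resident in the
	 * page cache. */
	mincore(map, st.st_size, vec);

	size_t cached = 0;
	for (size_t i = 0; i < pages; i++)
		cached += vec[i] & 1;
	printf("%s: %zu/%zu pages cached\n", argv[1], cached, pages);
	return 0;
}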

Regards,
Fengguang
Martin Liu March 26, 2019, 8:12 a.m. UTC | #4
On Tue, Mar 26, 2019 at 09:30:58AM +0800, Fengguang Wu wrote:
> On Mon, Mar 25, 2019 at 09:59:31AM -0700, Mark Salyzyn wrote:
> > On 03/25/2019 05:16 AM, Fengguang Wu wrote:
> > > Martin,
> > > 
> > > On Fri, Mar 22, 2019 at 11:46:11PM +0800, Martin Liu wrote:
> > > > As discussed in https://lore.kernel.org/patchwork/patch/334982/,
> > > > an open file's ra_pages can get out of sync with bdi.ra_pages
> > > > after sequential, random, or failed reads. With the current
> > > > design, users must reopen the file or call fadvise to bring it
> > > > back in sync. However, there are cases where we want to change
> > > > file ra_pages system-wide to improve performance, such as
> > > > increasing ra_pages to speed up boot, or decreasing it to
> > > 
> > > Do you have examples of some distro making use of larger ra_pages
> > > for boot-time optimization?
> > 
> > Android (if you are willing to squint and look at android-common AOSP
> > kernels as a Distro).
> 
> OK. I wonder how exactly Android makes use of it. Since phones are not
> using hard disks, they should benefit less from large ra_pages. Would
> you kindly point me to the code?
>
Yes, one example is as below:
https://source.android.com/devices/tech/perf/boot-times#optimizing-i-o-efficiency
> > > Suppose N read streams with equal read speed. The thrash-free memory
> > > requirement would be (N * 2 * ra_pages).
> > > 
> > > If N=1000 and ra_pages=1MB, it'd require 2GB of memory, which looks
> > > affordable on mainstream servers.
> > That is 50% of the memory on a high-end Android device ...
> 
> Yeah, but I'm obviously not talking about Android devices here. Will a
> phone serve 1000 concurrent read streams?
> 
For Android, some important persistent services and native HALs may
hold fds for a long time (likely more than 100 of them) unless a
restart is requested, which would then impact the overall user
experience. Some low-end devices, which are a big portion of Android
devices, may have even smaller memory sizes. Thus, when the device is
under memory pressure, this can bring more overhead and hurt
performance. With the current design, we don't have a way to shrink
readahead immediately. This interface gives an administrator the
flexibility to decide how much readahead participates in the
mitigation, based on the metrics it has.
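
A minimal sketch of what such a mitigation daemon could do with the
proposed knob (the bdi device name "254:0" and the pressure policy
are assumptions for illustration, not part of the patch):

#include <stdio.h>

/* Write a readahead shift to the proposed per-bdi sysfs attribute,
 * e.g. -2 to quarter the effective window under memory pressure
 * and 0 to restore the default. */
static int set_ra_shift(const char *bdi, int shift)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/class/bdi/%s/read_ahead_shift", bdi);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", shift);
	return fclose(f);
}

/* Usage: set_ra_shift("254:0", under_pressure ? -2 : 0); */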

> > > Sorry but it sounds like introducing an unnecessarily twisted new
> > > interface. I'm afraid it fixes the pain for 0.001% of users while
> > > bringing more puzzlement to the majority.
> > More than 2B Android devices on the planet is 0.001%?
> 
> Nope. Sorry I didn't know about the Android usage.
> Actually nobody mentioned it in the past discussions.
> 
> > I am not defending the proposed interface, though; if there is something
> > better that can be used, then looking into:
> > > 
> > > Then let fadvise() and shrink_readahead_size_eio() adjust that
> > > per-file ra_pages_shift.
> > Sounds like this would require a lot from init to globally audit and
> > reduce the read-ahead for all open files?
> 
> It depends. In theory it should be possible to create a standalone
> kernel module to dump the page cache and get a current snapshot of
> all cached file pages. It'd be a one-shot action and wouldn't require
> continuous auditing.
> 
> [RFC] kernel facilities for cache prefetching
> https://lwn.net/Articles/182128
> 
> This tool may also work. It's quick to get the list of open files by
> walking /proc/*/fd/, however it's not as easy to get the list of cached
> file names.
> 
> https://github.com/tobert/pcstat
> 
> Perhaps we can do a simplified /proc/filecache that only dumps the
> list of cached file names, then let mincore() based tools take care
> of the rest of the work.
> 
Thanks for the information; it's very useful. For Android, though,
things update pretty frequently, and the lists would need to be
refreshed as end users install apps, runtime optimizations run, or
new OTAs arrive. Therefore, this might require quite a lot of effort
to maintain. Please kindly correct me if I've misunderstood anything.
Thanks.

Regards,
Martin

Patch

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 59685918167e..f48b68e05d0e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -107,6 +107,27 @@  queue_ra_store(struct request_queue *q, const char *page, size_t count)
 	return ret;
 }
 
+static ssize_t queue_ra_shift_show(struct request_queue *q, char *page)
+{
+	int ra_shift = q->backing_dev_info->read_ahead_shift;
+
+	return scnprintf(page, PAGE_SIZE - 1, "%d\n", ra_shift);
+}
+
+static ssize_t queue_ra_shift_store(struct request_queue *q, const char *page,
+				    size_t count)
+{
+	s8 ra_shift;
+	ssize_t ret = kstrtos8(page, 10, &ra_shift);
+
+	if (ret)
+		return ret;
+
+	q->backing_dev_info->read_ahead_shift = ra_shift;
+
+	return count;
+}
+
 static ssize_t queue_max_sectors_show(struct request_queue *q, char *page)
 {
 	int max_sectors_kb = queue_max_sectors(q) >> 1;
@@ -540,6 +561,12 @@  static struct queue_sysfs_entry queue_ra_entry = {
 	.store = queue_ra_store,
 };
 
+static struct queue_sysfs_entry queue_ra_shift_entry = {
+	.attr = {.name = "read_ahead_shift", .mode = 0644 },
+	.show = queue_ra_shift_show,
+	.store = queue_ra_shift_store,
+};
+
 static struct queue_sysfs_entry queue_max_sectors_entry = {
 	.attr = {.name = "max_sectors_kb", .mode = 0644 },
 	.show = queue_max_sectors_show,
@@ -729,6 +756,7 @@  static struct queue_sysfs_entry throtl_sample_time_entry = {
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
+	&queue_ra_shift_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
 	&queue_max_segments_entry.attr,
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 07e02d6df5ad..452002df5232 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -167,6 +167,7 @@  struct bdi_writeback {
 struct backing_dev_info {
 	struct list_head bdi_list;
 	unsigned long ra_pages;	/* max readahead in PAGE_SIZE units */
+	s8 read_ahead_shift; /* shift to ra_pages */
 	unsigned long io_pages;	/* max allowed IO size */
 	congested_fn *congested_fn; /* Function pointer if device is md/dm */
 	void *congested_data;	/* Pointer to aux data for congested func */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index f9b029180241..23f755fe386e 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -221,6 +221,16 @@  static inline int bdi_sched_wait(void *word)
 	return 0;
 }
 
+static inline unsigned long read_ahead_fixup(struct file *file)
+{
+	unsigned long ra = file->f_ra.ra_pages;
+	s8 shift = (inode_to_bdi(file->f_mapping->host))->read_ahead_shift;
+
+	ra = (shift >= 0) ? ra << shift : ra >> abs(shift);
+
+	return ra;
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct bdi_writeback_congested *
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 72e6d0c55cfa..1b5163a5ead5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -172,6 +172,32 @@  static DEVICE_ATTR_RW(name);
 
 BDI_SHOW(read_ahead_kb, K(bdi->ra_pages))
 
+static ssize_t read_ahead_shift_store(struct device *dev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t count)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+	s8 read_ahead_shift;
+	ssize_t ret = kstrtos8(buf, 10, &read_ahead_shift);
+
+	if (ret)
+		return ret;
+
+	bdi->read_ahead_shift = read_ahead_shift;
+
+	return count;
+}
+
+static ssize_t read_ahead_shift_show(struct device *dev,
+				     struct device_attribute *attr, char *page)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+
+	return scnprintf(page, PAGE_SIZE - 1, "%d\n", bdi->read_ahead_shift);
+}
+
+static DEVICE_ATTR_RW(read_ahead_shift);
+
 static ssize_t min_ratio_store(struct device *dev,
 		struct device_attribute *attr, const char *buf, size_t count)
 {
@@ -223,6 +249,7 @@  static DEVICE_ATTR_RO(stable_pages_required);
 
 static struct attribute *bdi_dev_attrs[] = {
 	&dev_attr_read_ahead_kb.attr,
+	&dev_attr_read_ahead_shift.attr,
 	&dev_attr_min_ratio.attr,
 	&dev_attr_max_ratio.attr,
 	&dev_attr_stable_pages_required.attr,
diff --git a/mm/filemap.c b/mm/filemap.c
index d78f577baef2..cd8d0e00ff93 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2469,6 +2469,7 @@  static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	struct address_space *mapping = file->f_mapping;
 	struct file *fpin = NULL;
 	pgoff_t offset = vmf->pgoff;
+	unsigned long max;
 
 	/* If we don't want any read-ahead, don't bother */
 	if (vmf->vma->vm_flags & VM_RAND_READ)
@@ -2498,9 +2499,10 @@  static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	 * mmap read-around
 	 */
 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
-	ra->start = max_t(long, 0, offset - ra->ra_pages / 2);
-	ra->size = ra->ra_pages;
-	ra->async_size = ra->ra_pages / 4;
+	max = read_ahead_fixup(file);
+	ra->start = max_t(long, 0, offset - max / 2);
+	ra->size = max;
+	ra->async_size = max / 4;
 	ra_submit(ra, mapping, file);
 	return fpin;
 }
diff --git a/mm/readahead.c b/mm/readahead.c
index a4593654a26c..546bf344fc4d 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -384,7 +384,7 @@  ondemand_readahead(struct address_space *mapping,
 		   unsigned long req_size)
 {
 	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
-	unsigned long max_pages = ra->ra_pages;
+	unsigned long max_pages = read_ahead_fixup(filp);
 	unsigned long add_pages;
 	pgoff_t prev_offset;