[RFC] mm: implement write-behind policy for sequential file writes

Message ID 150693809463.587641.5712378065494786263.stgit@buzz (mailing list archive)
State New, archived

Commit Message

Konstantin Khlebnikov Oct. 2, 2017, 9:54 a.m. UTC
Traditional writeback tries to accumulate as much dirty data as possible.
This is a worthwhile strategy for extremely short-lived files and for batching
writes to save battery power. But for workloads where disk latency is
important, this policy generates periodic disk load spikes which increase
latency for concurrent operations.

The present writeback engine allows tuning only the dirty data size or
expiration time. Such tuning cannot eliminate the spikes - it just lowers and
multiplies them. The other option is switching into sync mode, which flushes
written data right after each write; obviously this has a significant
performance impact. Such tuning is also system-wide and affects memory-mapped
and randomly written files, which flusher threads handle much better.

This patch implements a write-behind policy which tracks sequential writes
and starts background writeback once there are enough dirty pages in a row.

Write-behind tracks the current writing position and looks into two windows
behind it: the first represents unwritten pages, the second - async writeback.

The next write starts background writeback when the first window exceeds the
threshold, and waits for pages falling behind the async writeback window. This
allows combining small writes into bigger requests and maintaining optimal
io-depth.
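
For example, with the defaults below (a 256 kB minimum window and a 1 MB async
window, 4 kB pages): background writeback for a sequentially written file is
kicked off each time 64 dirty pages accumulate in a row, and a write is made to
wait only for chunks whose writeback has fallen more than 1 MB behind the
current position - so roughly 1 MB of writeback stays in flight behind the
writer.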

This affects only writes via syscalls; memory-mapped writes are unchanged.
Write-behind also doesn't affect files with fadvise POSIX_FADV_RANDOM.

If the async window is set to 0 then write-behind skips dirty pages for a
congested disk and never waits for writeback. This mode is used for files
opened with O_NONBLOCK.

Also, for files with fadvise POSIX_FADV_NOREUSE, write-behind automatically
evicts completely written pages from the cache. This is perfect for writing
verbose logs without pushing more important data out of the cache.
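
As an illustration, a log writer could opt in from userspace with the standard
posix_fadvise(2) call - a minimal sketch, with an illustrative path and message:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Illustrative log path; O_APPEND matches a typical logger. */
	int fd = open("/var/log/app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (fd < 0)
		return 1;

	/* Declare a "use once" pattern: with write-behind enabled, pages
	 * of this file are dropped from the cache once fully written. */
	posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);

	const char line[] = "verbose log line\n";
	write(fd, line, sizeof(line) - 1);

	close(fd);
	return 0;
}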

As a bonus, write-behind makes blkio throttling much smoother for most bulk
file operations, like copying or downloading, which write sequentially.

The size of the minimal write-behind request is set in:
/sys/block/$DISK/bdi/min_write_behind_kb
Default is 256 kB; 0 disables write-behind for this disk.

The size of the async window is set in:
/sys/block/$DISK/bdi/async_write_behind_kb
Default is 1024 kB; 0 disables the synchronous part of write-behind.

Write-behind is controlled by the sysctl vm.dirty_write_behind:
=0: disabled, default
=1: enabled
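
For example, to try the policy on a single disk (device name illustrative):

# sysctl vm.dirty_write_behind=1
# echo 512 > /sys/block/sda/bdi/min_write_behind_kb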

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 Documentation/ABI/testing/sysfs-class-bdi |   11 ++++
 Documentation/sysctl/vm.txt               |   15 +++++
 include/linux/backing-dev-defs.h          |    2 +
 include/linux/fs.h                        |    9 +++
 include/linux/mm.h                        |    3 +
 kernel/sysctl.c                           |    9 +++
 mm/backing-dev.c                          |   46 +++++++++-------
 mm/fadvise.c                              |    4 +
 mm/page-writeback.c                       |   84 +++++++++++++++++++++++++++++
 9 files changed, 162 insertions(+), 21 deletions(-)

Comments

Florian Weimer Oct. 2, 2017, 11:23 a.m. UTC | #1
On 10/02/2017 11:54 AM, Konstantin Khlebnikov wrote:
> This patch implements a write-behind policy which tracks sequential writes
> and starts background writeback once there are enough dirty pages in a row.

Does this apply to data for files which have never been written to disk 
before?

I think one of the largest benefits of the extensive write-back caching 
in Linux is that the cache is discarded if the file is deleted before it 
is ever written to disk.  (But maybe I'm wrong about this.)

Thanks,
Florian
Konstantin Khlebnikov Oct. 2, 2017, 11:55 a.m. UTC | #2
On 02.10.2017 14:23, Florian Weimer wrote:
> On 10/02/2017 11:54 AM, Konstantin Khlebnikov wrote:
>> This patch implements a write-behind policy which tracks sequential writes
>> and starts background writeback once there are enough dirty pages in a row.
> 
> Does this apply to data for files which have never been written to disk before?
> 
> I think one of the largest benefits of the extensive write-back caching in Linux is that the cache is discarded if the file is deleted 
> before it is ever written to disk.  (But maybe I'm wrong about this.)

Yes. I've mentioned that the current policy is good for short-lived files.

Write-behind keeps small files (<256 kB) in cache and writes files smaller
than 1 MB in the background; synchronous writes start only after 1 MB.

But on the other hand, such files have to be written anyway if somebody calls
sync, or metadata changes are serialized by journal transactions, or memory
pressure flushes them to disk. So this caching is very unstable and uncertain.
In some cases caching makes the whole operation much slower because the actual
disk write starts later than it could.
Linus Torvalds Oct. 2, 2017, 7:54 p.m. UTC | #3
On Mon, Oct 2, 2017 at 2:54 AM, Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
>
> This patch implements a write-behind policy which tracks sequential writes
> and starts background writeback once there are enough dirty pages in a row.

This looks lovely to me.

I do wonder if you also looked at finishing the background
write-behind at close() time, because it strikes me that once you
start doing that async writeout, it would probably be good to make
sure you try to do the whole file.

I'm thinking of filesystems that do delayed allocation etc - I'd
expect that you'd want the whole file to get allocated on disk
together, rather than have the "first 256kB aligned chunks" allocated
thanks to write-behind, and then the final part allocated much later
(after other files may have triggered their own write-behind). Think
loads like copying lots of pictures around, for example.

I don't have any particularly strong feelings about this, but I do
suspect that once you have started that IO, you do want to finish it
all up as the file write is done. No?

It would also be really nice to see some numbers. Perhaps a comparison
of "vmstat 1" or similar when writing a big file to some slow medium
like a USB stick (which is something we've done very very badly at,
and this should help smooth out)?

                Linus
Jens Axboe Oct. 2, 2017, 8 p.m. UTC | #4
On 10/02/2017 03:54 AM, Konstantin Khlebnikov wrote:
> Traditional writeback tries to accumulate as much dirty data as possible.
> This is a worthwhile strategy for extremely short-lived files and for batching
> writes to save battery power. But for workloads where disk latency is
> important, this policy generates periodic disk load spikes which increase
> latency for concurrent operations.
> 
> The present writeback engine allows tuning only the dirty data size or
> expiration time. Such tuning cannot eliminate the spikes - it just lowers and
> multiplies them. The other option is switching into sync mode, which flushes
> written data right after each write; obviously this has a significant
> performance impact. Such tuning is also system-wide and affects memory-mapped
> and randomly written files, which flusher threads handle much better.
> 
> This patch implements a write-behind policy which tracks sequential writes
> and starts background writeback once there are enough dirty pages in a row.

This is a great idea in general. My only concerns would be around cases
where we don't expect the writes to ever make it to media. It's not an
uncommon use case - an app dirties some memory in a file, and expects
to truncate/unlink it before it makes it to disk. We don't want to trigger
writeback for those. Arguably that should be app hinted.

> Write-behind tracks the current writing position and looks into two windows
> behind it: the first represents unwritten pages, the second - async writeback.
> 
> The next write starts background writeback when the first window exceeds the
> threshold, and waits for pages falling behind the async writeback window. This
> allows combining small writes into bigger requests and maintaining optimal
> io-depth.
> 
> This affects only writes via syscalls; memory-mapped writes are unchanged.
> Write-behind also doesn't affect files with fadvise POSIX_FADV_RANDOM.
> 
> If the async window is set to 0 then write-behind skips dirty pages for a
> congested disk and never waits for writeback. This mode is used for files
> opened with O_NONBLOCK.
> 
> Also, for files with fadvise POSIX_FADV_NOREUSE, write-behind automatically
> evicts completely written pages from the cache. This is perfect for writing
> verbose logs without pushing more important data out of the cache.
> 
> As a bonus, write-behind makes blkio throttling much smoother for most bulk
> file operations, like copying or downloading, which write sequentially.
> 
> The size of the minimal write-behind request is set in:
> /sys/block/$DISK/bdi/min_write_behind_kb
> Default is 256 kB; 0 disables write-behind for this disk.
> 
> The size of the async window is set in:
> /sys/block/$DISK/bdi/async_write_behind_kb
> Default is 1024 kB; 0 disables the synchronous part of write-behind.

Should we expose these, or just make them a function of the IO limitations
exposed by the device? Something like 2x max request size, or similar.

Finally, do you have any test results?
Konstantin Khlebnikov Oct. 2, 2017, 8:58 p.m. UTC | #5
On 02.10.2017 22:54, Linus Torvalds wrote:
> On Mon, Oct 2, 2017 at 2:54 AM, Konstantin Khlebnikov
> <khlebnikov@yandex-team.ru> wrote:
>>
>> This patch implements a write-behind policy which tracks sequential writes
>> and starts background writeback once there are enough dirty pages in a row.
> 
> This looks lovely to me.
> 
> I do wonder if you also looked at finishing the background
> write-behind at close() time, because it strikes me that once you
> start doing that async writeout, it would probably be good to make
> sure you try to do the whole file.

Smaller files or tails are a lesser problem, and forced writeback here
might add bigger overhead due to small requests or too-random IO.
Also, an open+append+close pattern could generate too much IO.

> 
> I'm thinking of filesystems that do delayed allocation etc - I'd
> expect that you'd want the whole file to get allocated on disk
> together, rather than have the "first 256kB aligned chunks" allocated
> thanks to write-behind, and then the final part allocated much later
> (after other files may have triggered their own write-behind). Think
> loads like copying lots of pictures around, for example.

As far as I know, ext4 preallocates space beyond the file end for writing
patterns like append + fsync. Thus the allocated extents should be bigger
than 256k. I haven't looked into this yet.

> 
> I don't have any particularly strong feelings about this, but I do
> suspect that once you have started that IO, you do want to finish it
> all up as the file write is done. No?

I'm aiming at continuous file operations like downloading a huge file
or writing a verbose log. The original motivation came from low-latency
server workloads which suffer from parallel bulk operations that generate
tons of dirty pages. Probably for general-purpose usage the thresholds
should be increased significantly to cover only really bulky patterns.

> 
> It would also be really nice to see some numbers. Perhaps a comparison
> of "vmstat 1" or similar when writing a big file to some slow medium
> like a USB stick (which is something we've done very very badly at,
> and this should help smooth out)?

I'll try to find out some real cases with numbers.

For now I see that massive write + fdatasync (dd conv=fdatasync, fio)
always ends earlier because writeback now starts earlier too.
Without fdatasync it's obviously slower.

Cp to a usb stick + umount should show the same result; plus, cp could be
interrupted at any point without contaminating the cache with dirty pages.

Kernel compilation takes almost the same time because most files are
smaller than 256k.

> 
>                  Linus
>
Konstantin Khlebnikov Oct. 2, 2017, 9:50 p.m. UTC | #6
On 02.10.2017 23:00, Jens Axboe wrote:
> On 10/02/2017 03:54 AM, Konstantin Khlebnikov wrote:
>> Traditional writeback tries to accumulate as much dirty data as possible.
>> This is a worthwhile strategy for extremely short-lived files and for batching
>> writes to save battery power. But for workloads where disk latency is
>> important, this policy generates periodic disk load spikes which increase
>> latency for concurrent operations.
>>
>> The present writeback engine allows tuning only the dirty data size or
>> expiration time. Such tuning cannot eliminate the spikes - it just lowers and
>> multiplies them. The other option is switching into sync mode, which flushes
>> written data right after each write; obviously this has a significant
>> performance impact. Such tuning is also system-wide and affects memory-mapped
>> and randomly written files, which flusher threads handle much better.
>>
>> This patch implements a write-behind policy which tracks sequential writes
>> and starts background writeback once there are enough dirty pages in a row.
> 
> This is a great idea in general. My only concerns would be around cases
> where we don't expect the writes to ever make it to media. It's not an
> uncommon use case - app dirties some memory in a file, and expects
> to truncate/unlink it before it makes it to disk. We don't want to trigger
> writeback for those. Arguably that should be app hinted.

Yes, this is a case where serious degradation might happen.

The 256k threshold saves small files from being written.
Big temporary files have good chances of being pushed
to disk anyway by memory pressure or the flusher thread.
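
That said, apps that know better can already hint: POSIX_FADV_RANDOM disables
write-behind for a file. A sketch of a scratch-file writer opting out (path
and flags illustrative):

#include <fcntl.h>
#include <unistd.h>

int open_scratch(void)
{
	/* Illustrative temp file the app expects to unlink before
	 * the data would ever need to reach the disk. */
	int fd = open("/tmp/scratch", O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	/* Sets FMODE_RANDOM, which makes generic_write_behind() bail out
	 * early, so sequential writes here won't trigger write-behind.
	 * (Note this also tunes down readahead for the descriptor.) */
	posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

	unlink("/tmp/scratch");	/* data can die quietly in the page cache */
	return fd;
}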

> 
>> Write-behind tracks the current writing position and looks into two windows
>> behind it: the first represents unwritten pages, the second - async writeback.
>>
>> The next write starts background writeback when the first window exceeds the
>> threshold, and waits for pages falling behind the async writeback window. This
>> allows combining small writes into bigger requests and maintaining optimal
>> io-depth.
>>
>> This affects only writes via syscalls; memory-mapped writes are unchanged.
>> Write-behind also doesn't affect files with fadvise POSIX_FADV_RANDOM.
>>
>> If the async window is set to 0 then write-behind skips dirty pages for a
>> congested disk and never waits for writeback. This mode is used for files
>> opened with O_NONBLOCK.
>>
>> Also, for files with fadvise POSIX_FADV_NOREUSE, write-behind automatically
>> evicts completely written pages from the cache. This is perfect for writing
>> verbose logs without pushing more important data out of the cache.
>>
>> As a bonus, write-behind makes blkio throttling much smoother for most bulk
>> file operations, like copying or downloading, which write sequentially.
>>
>> The size of the minimal write-behind request is set in:
>> /sys/block/$DISK/bdi/min_write_behind_kb
>> Default is 256 kB; 0 disables write-behind for this disk.
>>
>> The size of the async window is set in:
>> /sys/block/$DISK/bdi/async_write_behind_kb
>> Default is 1024 kB; 0 disables the synchronous part of write-behind.
> 
> Should we expose these, or just make them a function of the IO limitations
> exposed by the device? Something like 2x max request size, or similar.

The window depends on IO latency expectations for the parallel workload
and on concurrency at all levels.
Also it seems that RAIDs need special treatment.
For now I think this is the minimal possible interface.

> 
> Finally, do you have any test results?
> 

Nothing in particular yet.

For example:
$ fio  --name=test --rw=write --filesize=1G --ioengine=sync --blocksize=4k --end_fsync=1

with the patch it ends earlier:
9.0s -> 8.2s for HDD
5.4s -> 4.7s for SSD
because writing starts earlier. Both runs use the old sq/cfq.
Andreas Dilger Oct. 2, 2017, 10:29 p.m. UTC | #7
On Oct 2, 2017, at 10:58 PM, Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote:
> 
> On 02.10.2017 22:54, Linus Torvalds wrote:
>> On Mon, Oct 2, 2017 at 2:54 AM, Konstantin Khlebnikov
>> <khlebnikov@yandex-team.ru> wrote:
>>> 
>>> This patch implements a write-behind policy which tracks sequential writes
>>> and starts background writeback once there are enough dirty pages in a row.
>> This looks lovely to me.
>> I do wonder if you also looked at finishing the background
>> write-behind at close() time, because it strikes me that once you
>> start doing that async writeout, it would probably be good to make
>> sure you try to do the whole file.
> 
> Smaller files or tails are a lesser problem, and forced writeback here
> might add bigger overhead due to small requests or too-random IO.
> Also, an open+append+close pattern could generate too much IO.
> 
>> I'm thinking of filesystems that do delayed allocation etc - I'd
>> expect that you'd want the whole file to get allocated on disk
>> together, rather than have the "first 256kB aligned chunks" allocated
>> thanks to write-behind, and then the final part allocated much later
>> (after other files may have triggered their own write-behind). Think
>> loads like copying lots of pictures around, for example.
> 
> As far as I know, ext4 preallocates space beyond the file end for writing
> patterns like append + fsync. Thus the allocated extents should be bigger
> than 256k. I haven't looked into this yet.
> 
>> I don't have any particularly strong feelings about this, but I do
>> suspect that once you have started that IO, you do want to finish it
>> all up as the file write is done. No?
> 
> I'm aiming at continuous file operations like downloading a huge file
> or writing a verbose log. The original motivation came from low-latency
> server workloads which suffer from parallel bulk operations that generate
> tons of dirty pages. Probably for general-purpose usage the thresholds
> should be increased significantly to cover only really bulky patterns.
> 
>> It would also be really nice to see some numbers. Perhaps a comparison
>> of "vmstat 1" or similar when writing a big file to some slow medium
>> like a USB stick (which is something we've done very very badly at,
>> and this should help smooth out)?
> 
> I'll try to find out some real cases with numbers.
> 
> For now I see that massive write + fdatasync (dd conv=fdatasync, fio)
> always ends earlier because writeback now starts earlier too.
> Without fdatasync it's obviously slower.
> 
> Cp to a usb stick + umount should show the same result; plus, cp could be
> interrupted at any point without contaminating the cache with dirty pages.
> 
> Kernel compilation takes almost the same time because most files are
> smaller than 256k.

For what it's worth, Lustre clients have been doing "early writes" forever,
when at least a full/contiguous RPC worth (1MB) of dirty data is available,
because network bandwidth is a terrible thing to waste.  The oft-cited case
of "app writes to a file that only lives a few seconds on disk before it is
deleted" is IMHO fairly rare in real life, mostly dbench and back in the
days of disk based /tmp.

Delaying data writes for large files means that 30s * bandwidth of data
could have been written before VM page aging kicks in, unless memory
pressure causes writeout first.  With fast devices/networks, this might
be many GB of data filling up memory that could have been written out.

Cheers, Andreas
Dave Chinner Oct. 2, 2017, 10:45 p.m. UTC | #8
On Mon, Oct 02, 2017 at 12:54:53PM -0700, Linus Torvalds wrote:
> On Mon, Oct 2, 2017 at 2:54 AM, Konstantin Khlebnikov
> <khlebnikov@yandex-team.ru> wrote:
> >
> > This patch implements a write-behind policy which tracks sequential writes
> > and starts background writeback once there are enough dirty pages in a row.
> 
> This looks lovely to me.

Yup, it's a good idea. Needs some tweaking, though.

> I do wonder if you also looked at finishing the background
> write-behind at close() time, because it strikes me that once you
> start doing that async writeout, it would probably be good to make
> sure you try to do the whole file.

Inserting arbitrary pipeline bubbles is never good for
performance. Think untar:

create file, write data, create file, write data, create ....

With async write-behind, it pipelines like this:

create, write
	create, write
		create, write
			create, write

If we block on close, it becomes:

create, write, wait
		   create, write, wait
				      create, write, wait

Basically performance of things like cp, untar, etc will suck
badly if we wait for write behind on close() - it's essentially the
same thing as forcing these apps to fdatasync() after every file
is written....

> I'm thinking of filesystems that do delayed allocation etc - I'd
> expect that you'd want the whole file to get allocated on disk
> together, rather than have the "first 256kB aligned chunks" allocated
> thanks to write-behind, and then the final part allocated much later
> (after other files may have triggered their own write-behind). Think
> loads like copying lots of pictures around, for example.

Yeah, this is going to completely screw up delayed allocation
because it doesn't allow time to aggregate large extents in memory
before writeback and allocation occur. Compared to the above, the untar
behaviour for the existing writeback mode is:

create, create, create, create ......  write, write, write, write

i.e. the data writeback is completely decoupled from the creation of
files. With delalloc, this means all the directory and inodes are
created in the syscall context, all closely packed together on disk,
and once that is done the data writeback starts allocating file
extents, which then get allocated as single extents and get packed
tightly together to give cross-file sequential write behaviour and
minimal seeks.

And when the metadata gets written back, the internal XFS algorithms
will sort that all into sequentially issued IO that gets merged into
large sequential IO, too, further minimising seeks.

IOWs, what ends up on disk is:

<metadata><lots of file data in large contiguous extents>

With writebehind, we'll end up with "alloc metadata, alloc data"
for each file being written, which will result in this sort of thing
for an untar:

<m><ddd><m><d><m><ddddddd><m><ddd> .....

It also means we can no longer do cross-file sequentialisation to
minimise seeks on writeback, and metadata writeback will turn into a
massive seek fest instead of being nicely aggregated into large
writes.

If we consider large concurrent sequential writes, we have
heuristics in the delalloc code to get them built into
larger-than-necessary delalloc extents so we typically end up on
disk with very large data extents in each file (hundreds of MB to
8GB each) such that on disk we end up with:

<m><FILE 1 EXT 1><FILE 2 EXT 1><FILE 1 EXT 2> ....

With writebehind, we'll be doing allocation every MB, so end up with
lots of small interleaved extents on disk like this:

<m><f1><f2><f3><f1><f3><f2><f1><f2><f3><f1> ....

(FYI, this is what happened with these workloads prior to the
addition of the speculative delalloc heuristics we have now.)

Now this won't affect overall write speed, but when we go to read
the file back, we have to seek every 1MB IO.  IOWs, we might get the
same write performance because we're still doing sequential writes,
but the performance degradation is seen on the read side when we
have to access that data again.

Further, what happens when we free just file 1? Instead of getting
back a large contiguous free space extent, we get back a heap of
tiny, fragmented free spaces. IOWs, the interleaving causes the
rapid onset of filesystem aging symptoms which will further degrade
allocation behaviour as we'll quickly run out of large contiguous
free spaces to allocate from.

IOWs, rapid write-behind behaviour might not significantly affect
initial write performance on an empty filesystem. It will, in
general, increase file fragmentation, increase interleaving of
metadata and data, reduce metadata writeback and read performance,
increase free space fragmentation, reduce data read performance and
speed up the onset of aging-related performance degradation.

Don't get me wrong - writebehind can be very effective for some
workloads. However, I think that a write-behind default of 1MB is
bordering on insane because it pushes most filesystems straight into
the above problems. At minimum, per-file writebehind needs to have a
much higher default threshold and writeback chunk size to allow
filesystems to avoid the above problems.

Perhaps we need to think about a small per-backing dev threshold
where the behaviour is the current writeback behaviour, but once
it's exceeded we then switch to write-behind so that the amount of
dirty data doesn't exceed that threshold. Make the threshold 2-3x
the bdi's current writeback throughput and we've got something that
should mostly self-tune to 2-3s of outstanding dirty data per
backing dev whilst mostly avoiding the issues with small, strict
write-behind thresholds.
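
(Concretely: a bdi sustaining 100MB/s of writeback would switch from normal
writeback to write-behind at roughly 200-300MB of outstanding dirty data, a
10MB/s USB stick at 20-30MB - each capped at the same 2-3s of exposure.)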

Cheers,

Dave
Linus Torvalds Oct. 2, 2017, 11:08 p.m. UTC | #9
On Mon, Oct 2, 2017 at 3:45 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Yup, it's a good idea. Needs some tweaking, though.

Probably a lot. 256kB seems very eager.

> If we block on close, it becomes:

I'm not at all suggesting blocking at close, just doing that final
async write-behind (assuming we started any earlier write-behind) so
that the writeout ends up seeing the whole file, rather than
"everything but the very end".

> Perhaps we need to think about a small per-backing dev threshold
> where the behaviour is the current writeback behaviour, but once
> it's exceeded we then switch to write-behind so that the amount of
> dirty data doesn't exceed that threshold.

Yes, that sounds like a really good idea, and as a way to avoid
starting too early.

However, part of the problem there is that we don't have that
historical "what is dirty", because it would often be in previous
files. Konstantin's patch is simple partly because it has only that
single-file history to worry about.

You could obviously keep that simplicity, and just accept the fact
that the early dirty data ends up being kept dirty, and consider it
just the startup cost and not even try to do the write-behind on that
oldest data.

But I do agree that 256kB is a very early threshold, and likely too
small for many cases.

               Linus
Dave Chinner Oct. 3, 2017, 12:08 a.m. UTC | #10
On Mon, Oct 02, 2017 at 04:08:46PM -0700, Linus Torvalds wrote:
> On Mon, Oct 2, 2017 at 3:45 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Yup, it's a good idea. Needs some tweaking, though.
> 
> Probably a lot. 256kB seems very eager.
> 
> > If we block on close, it becomes:
> 
> I'm not at all suggesting blocking at close, just doing that final
> async write-behind (assuming we started any earlier write-behind) so
> that the writeout ends up seeing the whole file, rather than
> "everything but the very end".

That's fine by me - we already do that in certain cases - but
AFAICT that's not the way the write-behind code as presented
works. If the file is larger than the async write-behind size then
it will also block waiting for previous write-behind to complete.

I think all we'd need is a call to filemap_fdatawrite()....
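
Roughly, in the last-close path, something like this (a sketch, not code
from the posted patch):

	/* On final close: if write-behind was active on this file, kick
	 * off async writeout of whatever is left, without waiting. */
	if (file->f_write_behind)
		filemap_fdatawrite(file->f_mapping);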

> > Perhaps we need to think about a small per-backing dev threshold
> > where the behaviour is the current writeback behaviour, but once
> > it's exceeded we then switch to write-behind so that the amount of
> > dirty data doesn't exceed that threshold.
> 
> Yes, that sounds like a really good idea, and as a way to avoid
> starting too early.
> 
> However, part of the problem there is that we don't have that
> historical "what is dirty", because it would often be in previous
> files. Konstantin's patch is simple partly because it has only that
> single-file history to worry about.
>
> You could obviously keep that simplicity, and just accept the fact
> that the early dirty data ends up being kept dirty, and consider it
> just the startup cost and not even try to do the write-behind on that
> oldest data.

I'm not sure we need to care about that - the bdi knows how much
dirty data there is on the device, and so we can switch from
device-based writeback to per-file writeback at that point. If we
trigger a background flush of all the existing dirty
data when we switch modes then we wouldn't leave any of it hanging
around for ages while other file data gets written...

Cheers,

Dave.

Patch

diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi
index d773d5697cf5..50a8b8750c13 100644
--- a/Documentation/ABI/testing/sysfs-class-bdi
+++ b/Documentation/ABI/testing/sysfs-class-bdi
@@ -30,6 +30,17 @@  read_ahead_kb (read-write)
 
 	Size of the read-ahead window in kilobytes
 
+min_write_behind_kb (read-write)
+
+	Size of minimal write-behind request in kilobytes.
+	0 -> disable write-behind for this disk.
+
+async_write_behind_kb (read-write)
+
+	Size of async write-behind window in kilobytes.
+	Next write will wait for writeback falling behind window.
+	0 -> completely async mode, skip if disk is congested.
+
 min_ratio (read-write)
 
 	Under normal circumstances each device is given a part of the
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 9baf66a9ef4e..c491fb6d8ba6 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -28,6 +28,7 @@  Currently, these files are in /proc/sys/vm:
 - dirty_expire_centisecs
 - dirty_ratio
 - dirty_writeback_centisecs
+- dirty_write_behind
 - drop_caches
 - extfrag_threshold
 - hugepages_treat_as_movable
@@ -188,6 +189,20 @@  Setting this to zero disables periodic writeback altogether.
 
 ==============================================================
 
+dirty_write_behind
+
+This controls write-behind writeback policy - automatic background writeback
+for sequentially written data behind current writing position.
+
+=0: disabled, default
+=1: enabled
+
+Minimum request size and async window size are configured for each bdi:
+/sys/block/$DEV/bdi/min_write_behind_kb
+/sys/block/$DEV/bdi/async_write_behind_kb
+
+==============================================================
+
 drop_caches
 
 Writing to this will cause the kernel to drop clean caches, as well as
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 866c433e7d32..ba5322ea970a 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -143,6 +143,8 @@  struct backing_dev_info {
 	struct list_head bdi_list;
 	unsigned long ra_pages;	/* max readahead in PAGE_SIZE units */
 	unsigned long io_pages;	/* max allowed IO size */
+	unsigned long min_write_behind; /* Minimum write-behind in pages */
+	unsigned long async_write_behind; /* Async write-behind in pages */
 	congested_fn *congested_fn; /* Function pointer if device is md/dm */
 	void *congested_data;	/* Pointer to aux data for congested func */
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 339e73742e73..828494ce556e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -144,6 +144,8 @@  typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define FMODE_CAN_READ          ((__force fmode_t)0x20000)
 /* Has write method(s) */
 #define FMODE_CAN_WRITE         ((__force fmode_t)0x40000)
+/* "Use once" access pattern is expected */
+#define FMODE_NOREUSE		((__force fmode_t)0x80000)
 
 /* File was opened by fanotify and shouldn't generate fanotify events */
 #define FMODE_NONOTIFY		((__force fmode_t)0x4000000)
@@ -871,6 +873,7 @@  struct file {
 	struct fown_struct	f_owner;
 	const struct cred	*f_cred;
 	struct file_ra_state	f_ra;
+	pgoff_t			f_write_behind;
 
 	u64			f_version;
 #ifdef CONFIG_SECURITY
@@ -2655,6 +2658,9 @@  extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
 			   int datasync);
 extern int vfs_fsync(struct file *file, int datasync);
 
+extern int vm_dirty_write_behind;
+extern ssize_t generic_write_behind(struct kiocb *iocb, ssize_t count);
+
 /*
  * Sync the bytes written if this was a synchronous write.  Expect ki_pos
  * to already be updated for the write, and will return either the amount
@@ -2668,7 +2674,8 @@  static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
 				(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
 		if (ret)
 			return ret;
-	}
+	} else if (vm_dirty_write_behind)
+		return generic_write_behind(iocb, count);
 
 	return count;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f8c10d336e42..592efaeca2d4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2217,6 +2217,9 @@  extern int filemap_page_mkwrite(struct vm_fault *vmf);
 int __must_check write_one_page(struct page *page);
 void task_dirty_inc(struct task_struct *tsk);
 
+#define VM_MIN_WRITE_BEHIND_KB		256
+#define VM_ASYNC_WRITE_BEHIND_KB	1024
+
 /* readahead.c */
 #define VM_MAX_READAHEAD	128	/* kbytes */
 #define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 423554ad3610..a40e4839a390 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1346,6 +1346,15 @@  static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 	{
+		.procname	= "dirty_write_behind",
+		.data		= &vm_dirty_write_behind,
+		.maxlen		= sizeof(vm_dirty_write_behind),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{
 		.procname       = "nr_pdflush_threads",
 		.mode           = 0444 /* read-only */,
 		.proc_handler   = pdflush_proc_obsolete,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e19606bb41a0..c0f8aba4133d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -138,25 +138,6 @@  static inline void bdi_debug_unregister(struct backing_dev_info *bdi)
 }
 #endif
 
-static ssize_t read_ahead_kb_store(struct device *dev,
-				  struct device_attribute *attr,
-				  const char *buf, size_t count)
-{
-	struct backing_dev_info *bdi = dev_get_drvdata(dev);
-	unsigned long read_ahead_kb;
-	ssize_t ret;
-
-	ret = kstrtoul(buf, 10, &read_ahead_kb);
-	if (ret < 0)
-		return ret;
-
-	bdi->ra_pages = read_ahead_kb >> (PAGE_SHIFT - 10);
-
-	return count;
-}
-
-#define K(pages) ((pages) << (PAGE_SHIFT - 10))
-
 #define BDI_SHOW(name, expr)						\
 static ssize_t name##_show(struct device *dev,				\
 			   struct device_attribute *attr, char *page)	\
@@ -167,7 +148,27 @@  static ssize_t name##_show(struct device *dev,				\
 }									\
 static DEVICE_ATTR_RW(name);
 
-BDI_SHOW(read_ahead_kb, K(bdi->ra_pages))
+#define BDI_ATTR_KB(name, field)					\
+static ssize_t name##_store(struct device *dev,				\
+			    struct device_attribute *attr,		\
+			    const char *buf, size_t count)		\
+{									\
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);		\
+	unsigned long kb;						\
+	ssize_t ret;							\
+									\
+	ret = kstrtoul(buf, 10, &kb);					\
+	if (ret < 0)							\
+		return ret;						\
+									\
+	bdi->field = kb >> (PAGE_SHIFT - 10);				\
+	return count;							\
+}									\
+BDI_SHOW(name, ((bdi->field) << (PAGE_SHIFT - 10)))
+
+BDI_ATTR_KB(read_ahead_kb, ra_pages)
+BDI_ATTR_KB(min_write_behind_kb, min_write_behind)
+BDI_ATTR_KB(async_write_behind_kb, async_write_behind)
 
 static ssize_t min_ratio_store(struct device *dev,
 		struct device_attribute *attr, const char *buf, size_t count)
@@ -220,6 +221,8 @@  static DEVICE_ATTR_RO(stable_pages_required);
 
 static struct attribute *bdi_dev_attrs[] = {
 	&dev_attr_read_ahead_kb.attr,
+	&dev_attr_min_write_behind_kb.attr,
+	&dev_attr_async_write_behind_kb.attr,
 	&dev_attr_min_ratio.attr,
 	&dev_attr_max_ratio.attr,
 	&dev_attr_stable_pages_required.attr,
@@ -836,6 +839,9 @@  static int bdi_init(struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&bdi->wb_list);
 	init_waitqueue_head(&bdi->wb_waitq);
 
+	bdi->min_write_behind = VM_MIN_WRITE_BEHIND_KB >> (PAGE_SHIFT - 10);
+	bdi->async_write_behind = VM_ASYNC_WRITE_BEHIND_KB >> (PAGE_SHIFT - 10);
+
 	ret = cgwb_bdi_init(bdi);
 
 	return ret;
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 702f239cd6db..8817343955e7 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -82,6 +82,7 @@  SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 		f.file->f_ra.ra_pages = bdi->ra_pages;
 		spin_lock(&f.file->f_lock);
 		f.file->f_mode &= ~FMODE_RANDOM;
+		f.file->f_mode &= ~FMODE_NOREUSE;
 		spin_unlock(&f.file->f_lock);
 		break;
 	case POSIX_FADV_RANDOM:
@@ -113,6 +114,9 @@  SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 					   nrpages);
 		break;
 	case POSIX_FADV_NOREUSE:
+		spin_lock(&f.file->f_lock);
+		f.file->f_mode |= FMODE_NOREUSE;
+		spin_unlock(&f.file->f_lock);
 		break;
 	case POSIX_FADV_DONTNEED:
 		if (!inode_write_congested(mapping->host))
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0b9c5cbe8eba..95151f3ebd4f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2851,3 +2851,87 @@  void wait_for_stable_page(struct page *page)
 		wait_on_page_writeback(page);
 }
 EXPORT_SYMBOL_GPL(wait_for_stable_page);
+
+int vm_dirty_write_behind __read_mostly;
+EXPORT_SYMBOL(vm_dirty_write_behind);
+
+/**
+ * generic_write_behind() - writeback dirty pages behind current position.
+ *
+ * This function tracks writing position and starts background writeback if
+ * file has enough sequentially written data.
+ *
+ * Returns @count or a negative error code if I/O failed.
+ */
+extern ssize_t generic_write_behind(struct kiocb *iocb, ssize_t count)
+{
+	struct file *file = iocb->ki_filp;
+	struct address_space *mapping = file->f_mapping;
+	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
+	unsigned long min_size = READ_ONCE(bdi->min_write_behind);
+	unsigned long async_size = READ_ONCE(bdi->async_write_behind);
+	pgoff_t head = file->f_write_behind;
+	pgoff_t begin = (iocb->ki_pos - count) >> PAGE_SHIFT;
+	pgoff_t end = iocb->ki_pos >> PAGE_SHIFT;
+	int ret;
+
+	/* Disabled, contiguous but not big enough yet, or marked as random. */
+	if (!min_size || end - head < min_size || (file->f_mode & FMODE_RANDOM))
+		goto out;
+
+	spin_lock(&file->f_lock);
+
+	/* Re-read under lock. */
+	head = file->f_write_behind;
+
+	/* Non-contiguous, move head position. */
+	if (head > end || begin - head > async_size)
+		file->f_write_behind = head = begin;
+
+	/* Still not big enough. */
+	if (end - head < min_size) {
+		spin_unlock(&file->f_lock);
+		goto out;
+	}
+
+	/* Set head for next iteration, everything behind will be written. */
+	file->f_write_behind = end;
+
+	spin_unlock(&file->f_lock);
+
+	/* Non-blocking files always work in async mode. */
+	if (file->f_flags & O_NONBLOCK)
+		async_size = 0;
+
+	/* Skip pages in async mode if disk is congested. */
+	if (!async_size && inode_write_congested(mapping->host))
+		goto out;
+
+	/* Start background writeback. */
+	ret = __filemap_fdatawrite_range(mapping,
+					 (loff_t)head << PAGE_SHIFT,
+					 ((loff_t)end << PAGE_SHIFT) - 1,
+					 WB_SYNC_NONE);
+	if (ret < 0)
+		return ret;
+
+	if (!async_size || head < async_size)
+		goto out;
+
+	/* Wait for pages falling behind async window. */
+	head -= async_size;
+	end -= async_size;
+	ret = filemap_fdatawait_range(mapping,
+				      (loff_t)head << PAGE_SHIFT,
+				      ((loff_t)end << PAGE_SHIFT) - 1);
+	if (ret < 0)
+		return ret;
+
+	/* Evict completely written pages if no more access expected. */
+	if (file->f_mode & FMODE_NOREUSE)
+		invalidate_mapping_pages(mapping, head, end - 1);
+
+out:
+	return count;
+}
+EXPORT_SYMBOL(generic_write_behind);