Message ID | 150693809463.587641.5712378065494786263.stgit@buzz
---|---
State | New, archived
On 10/02/2017 11:54 AM, Konstantin Khlebnikov wrote:
> This patch implements write-behind policy which tracks sequential writes
> and starts background writeback when have enough dirty pages in a row.

Does this apply to data for files which have never been written to disk before?

I think one of the largest benefits of the extensive write-back caching in Linux is that the cache is discarded if the file is deleted before it is ever written to disk. (But maybe I'm wrong about this.)

Thanks,
Florian
On 02.10.2017 14:23, Florian Weimer wrote:
> On 10/02/2017 11:54 AM, Konstantin Khlebnikov wrote:
>> This patch implements write-behind policy which tracks sequential writes
>> and starts background writeback when have enough dirty pages in a row.
>
> Does this apply to data for files which have never been written to disk before?
>
> I think one of the largest benefits of the extensive write-back caching in Linux
> is that the cache is discarded if the file is deleted before it is ever written
> to disk. (But maybe I'm wrong about this.)

Yes. I've mentioned that the current policy is good for short-living files.

Write-behind keeps small files (<256 kB) in cache and writes files smaller than 1 MB in the background; synchronous writes start only after 1 MB.

On the other hand, such files have to be written anyway if somebody calls sync, if metadata changes are serialized by journal transactions, or if memory pressure flushes them to disk. So this caching is very unstable and uncertain. In some cases caching makes the whole operation much slower, because the actual disk write starts later than it could.
On Mon, Oct 2, 2017 at 2:54 AM, Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote:
>
> This patch implements write-behind policy which tracks sequential writes
> and starts background writeback when have enough dirty pages in a row.

This looks lovely to me.

I do wonder if you also looked at finishing the background write-behind at close() time, because it strikes me that once you start doing that async writeout, it would probably be good to make sure you try to do the whole file.

I'm thinking of filesystems that do delayed allocation etc - I'd expect that you'd want the whole file to get allocated on disk together, rather than have the "first 256kB aligned chunks" allocated thanks to write-behind, and then the final part allocated much later (after other files may have triggered their own write-behind). Think loads like copying lots of pictures around, for example.

I don't have any particularly strong feelings about this, but I do suspect that once you have started that IO, you do want to finish it all up as the file write is done. No?

It would also be really nice to see some numbers. Perhaps a comparison of "vmstat 1" or similar when writing a big file to some slow medium like a USB stick (which is something we've done very very badly at, and this should help smooth out)?

Linus
On 10/02/2017 03:54 AM, Konstantin Khlebnikov wrote:
> Traditional writeback tries to accumulate as much dirty data as possible.
> This is a good strategy for extremely short-living files and for batching
> writes to save battery power. But for workloads where disk latency is
> important, this policy generates periodic disk load spikes which increase
> latency for concurrent operations.
>
> The present writeback engine allows tuning only dirty data size or
> expiration time. Such tuning cannot eliminate spikes - it just lowers and
> multiplies them. The other option is switching into sync mode, which
> flushes written data right after each write; obviously this has a
> significant performance impact. Such tuning is system-wide and affects
> memory-mapped and randomly written files, which flusher threads handle
> much better.
>
> This patch implements a write-behind policy which tracks sequential writes
> and starts background writeback when enough dirty pages accumulate in a row.

This is a great idea in general. My only concerns would be around cases where we don't expect the writes to ever make it to media. It's not an uncommon use case - an app dirties some memory in a file, and expects to truncate/unlink it before it makes it to disk. We don't want to trigger writeback for those. Arguably that should be app hinted.

> Write-behind tracks the current writing position and looks into two windows
> behind it: the first represents unwritten pages, the second - async writeback.
>
> The next write starts background writeback when the first window exceeds the
> threshold, and waits for pages falling behind the async writeback window.
> This allows combining small writes into bigger requests and maintaining an
> optimal io-depth.
>
> This affects only writes via syscalls; memory-mapped writes are unchanged.
> Also, write-behind doesn't affect files with fadvise POSIX_FADV_RANDOM.
>
> If the async window is set to 0 then write-behind skips dirty pages for a
> congested disk and never waits for writeback. This is used for files with
> O_NONBLOCK.
>
> Also, for files with fadvise POSIX_FADV_NOREUSE, write-behind automatically
> evicts completely written pages from the cache. This is perfect for writing
> verbose logs without pushing more important data out of the cache.
>
> As a bonus, write-behind makes blkio throttling much smoother for most
> bulk file operations, like copying or downloading, which write sequentially.
>
> Size of the minimal write-behind request is set in:
> /sys/block/$DISK/bdi/min_write_behind_kb
> Default is 256 kB; 0 disables write-behind for this disk.
>
> Size of the async window is set in:
> /sys/block/$DISK/bdi/async_write_behind_kb
> Default is 1024 kB; 0 disables sync write-behind.

Should we expose these, or just make them a function of the IO limitations exposed by the device? Something like 2x max request size, or similar.

Finally, do you have any test results?
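The knobs quoted above could be exercised from a shell roughly like this. This is only a sketch: the sysfs files and the vm.dirty_write_behind sysctl exist only with this patch applied, and $DISK stands for a real block device.

```shell
# Inspect the per-bdi write-behind knobs the patch adds
cat /sys/block/$DISK/bdi/min_write_behind_kb     # default 256
cat /sys/block/$DISK/bdi/async_write_behind_kb   # default 1024

# Enable the policy globally, then raise the minimum request size
# so only genuinely bulky sequential writers trigger it
echo 1    > /proc/sys/vm/dirty_write_behind
echo 1024 > /sys/block/$DISK/bdi/min_write_behind_kb

# Disable write-behind for this disk only
echo 0 > /sys/block/$DISK/bdi/min_write_behind_kb
```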
On 02.10.2017 22:54, Linus Torvalds wrote:
> On Mon, Oct 2, 2017 at 2:54 AM, Konstantin Khlebnikov
> <khlebnikov@yandex-team.ru> wrote:
>>
>> This patch implements write-behind policy which tracks sequential writes
>> and starts background writeback when have enough dirty pages in a row.
>
> This looks lovely to me.
>
> I do wonder if you also looked at finishing the background
> write-behind at close() time, because it strikes me that once you
> start doing that async writeout, it would probably be good to make
> sure you try to do the whole file.

Smaller files or tails are a lesser problem, and forced writeback here might add bigger overhead due to small requests or too-random IO. Also, an open+append+close pattern could generate too much IO.

> I'm thinking of filesystems that do delayed allocation etc - I'd
> expect that you'd want the whole file to get allocated on disk
> together, rather than have the "first 256kB aligned chunks" allocated
> thanks to write-behind, and then the final part allocated much later
> (after other files may have triggered their own write-behind). Think
> loads like copying lots of pictures around, for example.

As far as I know, ext4 preallocates space beyond the file end for writing patterns like append + fsync. Thus allocated extents should be bigger than 256k. I haven't looked into this yet.

> I don't have any particularly strong feelings about this, but I do
> suspect that once you have started that IO, you do want to finish it
> all up as the file write is done. No?

I'm aiming at continuous file operations like downloading a huge file or writing a verbose log. The original motivation came from low-latency server workloads which suffer from parallel bulk operations that generate tons of dirty pages. Probably for general-purpose usage the thresholds should be increased significantly, to cover only really bulky patterns.

> It would also be really nice to see some numbers. Perhaps a comparison
> of "vmstat 1" or similar when writing a big file to some slow medium
> like a USB stick (which is something we've done very very badly at,
> and this should help smooth out)?

I'll try to find some real cases with numbers.

For now I see that a massive write + fdatasync (dd conv=fdatasync, fio) always ends earlier, because writeback now starts earlier too. Without fdatasync it's obviously slower.

Cp to a USB stick + umount should show the same result; plus, cp could be interrupted at any point without contaminating the cache with dirty pages.

Kernel compilation takes almost the same time because most files are smaller than 256k.

> Linus
On 02.10.2017 23:00, Jens Axboe wrote:
> On 10/02/2017 03:54 AM, Konstantin Khlebnikov wrote:
>> Traditional writeback tries to accumulate as much dirty data as possible.
>> This is a good strategy for extremely short-living files and for batching
>> writes to save battery power. But for workloads where disk latency is
>> important, this policy generates periodic disk load spikes which increase
>> latency for concurrent operations.
>>
>> The present writeback engine allows tuning only dirty data size or
>> expiration time. Such tuning cannot eliminate spikes - it just lowers and
>> multiplies them. The other option is switching into sync mode, which
>> flushes written data right after each write; obviously this has a
>> significant performance impact. Such tuning is system-wide and affects
>> memory-mapped and randomly written files, which flusher threads handle
>> much better.
>>
>> This patch implements a write-behind policy which tracks sequential writes
>> and starts background writeback when enough dirty pages accumulate in a row.
>
> This is a great idea in general. My only concerns would be around cases
> where we don't expect the writes to ever make it to media. It's not an
> uncommon use case - an app dirties some memory in a file, and expects
> to truncate/unlink it before it makes it to disk. We don't want to trigger
> writeback for those. Arguably that should be app hinted.

Yes, this is a case where serious degradation might happen. The 256k threshold saves small files from being written. Big temporary files have good chances of being pushed to disk anyway, by memory pressure or the flusher thread.

>> Write-behind tracks the current writing position and looks into two windows
>> behind it: the first represents unwritten pages, the second - async writeback.
>>
>> The next write starts background writeback when the first window exceeds the
>> threshold, and waits for pages falling behind the async writeback window.
>> This allows combining small writes into bigger requests and maintaining an
>> optimal io-depth.
>>
>> This affects only writes via syscalls; memory-mapped writes are unchanged.
>> Also, write-behind doesn't affect files with fadvise POSIX_FADV_RANDOM.
>>
>> If the async window is set to 0 then write-behind skips dirty pages for a
>> congested disk and never waits for writeback. This is used for files with
>> O_NONBLOCK.
>>
>> Also, for files with fadvise POSIX_FADV_NOREUSE, write-behind automatically
>> evicts completely written pages from the cache. This is perfect for writing
>> verbose logs without pushing more important data out of the cache.
>>
>> As a bonus, write-behind makes blkio throttling much smoother for most
>> bulk file operations, like copying or downloading, which write sequentially.
>>
>> Size of the minimal write-behind request is set in:
>> /sys/block/$DISK/bdi/min_write_behind_kb
>> Default is 256 kB; 0 disables write-behind for this disk.
>>
>> Size of the async window is set in:
>> /sys/block/$DISK/bdi/async_write_behind_kb
>> Default is 1024 kB; 0 disables sync write-behind.
>
> Should we expose these, or just make them a function of the IO limitations
> exposed by the device? Something like 2x max request size, or similar.

The window depends on IO latency expectations for parallel workloads and on concurrency at all levels. Also, it seems RAIDs need special treatment. For now I think this is the minimal possible interface.

> Finally, do you have any test results?

Nothing particular yet. For example:

$ fio --name=test --rw=write --filesize=1G --ioengine=sync --blocksize=4k --end_fsync=1

With the patch this ends earlier:
9.0s -> 8.2s for HDD
5.4s -> 4.7s for SSD
because the write starts earlier. Both use the old sq/cfq.
On Oct 2, 2017, at 10:58 PM, Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote:
> On 02.10.2017 22:54, Linus Torvalds wrote:
>> On Mon, Oct 2, 2017 at 2:54 AM, Konstantin Khlebnikov
>> <khlebnikov@yandex-team.ru> wrote:
>>>
>>> This patch implements write-behind policy which tracks sequential writes
>>> and starts background writeback when have enough dirty pages in a row.
>>
>> This looks lovely to me.
>>
>> I do wonder if you also looked at finishing the background
>> write-behind at close() time, because it strikes me that once you
>> start doing that async writeout, it would probably be good to make
>> sure you try to do the whole file.
>
> Smaller files or tails are a lesser problem, and forced writeback here
> might add bigger overhead due to small requests or too-random IO.
> Also, an open+append+close pattern could generate too much IO.
>
>> I'm thinking of filesystems that do delayed allocation etc - I'd
>> expect that you'd want the whole file to get allocated on disk
>> together, rather than have the "first 256kB aligned chunks" allocated
>> thanks to write-behind, and then the final part allocated much later
>> (after other files may have triggered their own write-behind). Think
>> loads like copying lots of pictures around, for example.
>
> As far as I know, ext4 preallocates space beyond the file end for writing
> patterns like append + fsync. Thus allocated extents should be bigger
> than 256k. I haven't looked into this yet.
>
>> I don't have any particularly strong feelings about this, but I do
>> suspect that once you have started that IO, you do want to finish it
>> all up as the file write is done. No?
>
> I'm aiming at continuous file operations like downloading a huge file
> or writing a verbose log. The original motivation came from low-latency
> server workloads which suffer from parallel bulk operations that generate
> tons of dirty pages. Probably for general-purpose usage the thresholds
> should be increased significantly, to cover only really bulky patterns.
>
>> It would also be really nice to see some numbers. Perhaps a comparison
>> of "vmstat 1" or similar when writing a big file to some slow medium
>> like a USB stick (which is something we've done very very badly at,
>> and this should help smooth out)?
>
> I'll try to find some real cases with numbers.
>
> For now I see that a massive write + fdatasync (dd conv=fdatasync, fio)
> always ends earlier, because writeback now starts earlier too.
> Without fdatasync it's obviously slower.
>
> Cp to a USB stick + umount should show the same result; plus, cp could be
> interrupted at any point without contaminating the cache with dirty pages.
>
> Kernel compilation takes almost the same time because most files are
> smaller than 256k.

For what it's worth, Lustre clients have been doing "early writes" forever, when at least a full/contiguous RPC worth (1MB) of dirty data is available, because network bandwidth is a terrible thing to waste.

The oft-cited case of "app writes to a file that only lives a few seconds on disk before it is deleted" is IMHO fairly rare in real life - mostly dbench, and back in the days of disk-based /tmp.

Delaying data writes for large files means that 30s * bandwidth of data could have been written before VM page aging kicks in, unless memory pressure causes writeout first. With fast devices/networks, this might be many GB of data filling up memory that could have been written out.

Cheers, Andreas
On Mon, Oct 02, 2017 at 12:54:53PM -0700, Linus Torvalds wrote:
> On Mon, Oct 2, 2017 at 2:54 AM, Konstantin Khlebnikov
> <khlebnikov@yandex-team.ru> wrote:
> >
> > This patch implements write-behind policy which tracks sequential writes
> > and starts background writeback when have enough dirty pages in a row.
>
> This looks lovely to me.

Yup, it's a good idea. Needs some tweaking, though.

> I do wonder if you also looked at finishing the background
> write-behind at close() time, because it strikes me that once you
> start doing that async writeout, it would probably be good to make
> sure you try to do the whole file.

Inserting arbitrary pipeline bubbles is never good for performance. Think untar: create file, write data, create file, write data, create ....

With async write-behind, it pipelines like this:

create, write
create, write
create, write
create, write

If we block on close, it becomes:

create, write, wait
create, write, wait
create, write, wait

Basically, performance of things like cp, untar, etc. will suck badly if we wait for write-behind on close() - it's essentially the same thing as forcing these apps to fdatasync() after every file is written....

> I'm thinking of filesystems that do delayed allocation etc - I'd
> expect that you'd want the whole file to get allocated on disk
> together, rather than have the "first 256kB aligned chunks" allocated
> thanks to write-behind, and then the final part allocated much later
> (after other files may have triggered their own write-behind). Think
> loads like copying lots of pictures around, for example.

Yeah, this is going to completely screw up delayed allocation, because it doesn't allow time to aggregate large extents in memory before writeback and allocation occur.

Compared to the above, the untar behaviour for the existing writeback mode is:

create, create, create, create ......
write, write, write, write

i.e. the data writeback is completely decoupled from the creation of files. With delalloc, this means all the directories and inodes are created in syscall context, all closely packed together on disk, and once that is done the data writeback starts allocating file extents, which then get allocated as single extents and packed tightly together to give cross-file sequential write behaviour and minimal seeks. And when the metadata gets written back, the internal XFS algorithms will sort it all into sequentially issued IO that gets merged into large sequential IO, too, further minimising seeks.

IOWs, what ends up on disk is:

<metadata><lots of file data in large contiguous extents>

With write-behind, we'll end up with "alloc metadata, alloc data" for each file being written, which will result in this sort of thing for an untar:

<m><ddd><m><d><m><ddddddd><m><ddd> .....

It also means we can no longer do cross-file sequentialisation to minimise seeks on writeback, and metadata writeback will turn into a massive seek fest instead of being nicely aggregated into large writes.

If we consider large concurrent sequential writes, we have heuristics in the delalloc code to build them into larger-than-necessary delalloc extents, so we typically end up on disk with very large data extents in each file (hundreds of MB to 8GB each), such that on disk we end up with:

<m><FILE 1 EXT 1><FILE 2 EXT 1><FILE 1 EXT 2> ....

With write-behind, we'll be doing allocation every MB, so we end up with lots of small interleaved extents on disk like this:

<m><f1><f2><f3><f1><f3><f2><f1><f2><f3><f1> ....

(FYI, this is what happened with these workloads prior to the addition of the speculative delalloc heuristics we have now.)

Now this won't affect overall write speed, but when we go to read the file back, we have to seek every 1MB of IO. IOWs, we might get the same write performance because we're still doing sequential writes, but the performance degradation is seen on the read side when we have to access that data again.

Further, what happens when we free just file 1? Instead of getting back a large contiguous free space extent, we get back a heap of tiny, fragmented free spaces. IOWs, the interleaving causes the rapid onset of filesystem aging symptoms, which will further degrade allocation behaviour as we quickly run out of large contiguous free spaces to allocate from.

IOWs, rapid write-behind behaviour might not significantly affect initial write performance on an empty filesystem. It will, in general, increase file fragmentation, increase interleaving of metadata and data, reduce metadata writeback and read performance, increase free space fragmentation, reduce data read performance, and speed up the onset of aging-related performance degradation.

Don't get me wrong - write-behind can be very effective for some workloads. However, I think that a write-behind default of 1MB is bordering on insane, because it pushes most filesystems straight into the above problems. At minimum, per-file write-behind needs a much higher default threshold and writeback chunk size to allow filesystems to avoid the above problems.

Perhaps we need to think about a small per-backing-dev threshold where the behaviour is the current writeback behaviour, but once it's exceeded we then switch to write-behind so that the amount of dirty data doesn't exceed that threshold. Make the threshold 2-3x the bdi's current writeback throughput and we've got something that should mostly self-tune to 2-3s of outstanding dirty data per backing dev, whilst mostly avoiding the issues with small, strict write-behind thresholds.

Cheers, Dave
On Mon, Oct 2, 2017 at 3:45 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Yup, it's a good idea. Needs some tweaking, though.

Probably a lot. 256kB seems very eager.

> If we block on close, it becomes:

I'm not at all suggesting blocking at close, just doing that final async write-behind (assuming we started any earlier write-behind) so that the writeout ends up seeing the whole file, rather than "everything but the very end".

> Perhaps we need to think about a small per-backing dev threshold
> where the behaviour is the current writeback behaviour, but once
> it's exceeded we then switch to write-behind so that the amount of
> dirty data doesn't exceed that threshold.

Yes, that sounds like a really good idea, and a way to avoid starting too early.

However, part of the problem there is that we don't have that historical "what is dirty", because it would often be in previous files. Konstantin's patch is simple partly because it has only that single-file history to worry about.

You could obviously keep that simplicity, and just accept the fact that the early dirty data ends up being kept dirty - consider it just the startup cost, and don't even try to do write-behind on that oldest data.

But I do agree that 256kB is a very early threshold, and likely too small for many cases.

Linus
On Mon, Oct 02, 2017 at 04:08:46PM -0700, Linus Torvalds wrote:
> On Mon, Oct 2, 2017 at 3:45 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Yup, it's a good idea. Needs some tweaking, though.
>
> Probably a lot. 256kB seems very eager.
>
> > If we block on close, it becomes:
>
> I'm not at all suggesting blocking at close, just doing that final
> async write-behind (assuming we started any earlier write-behind) so
> that the writeout ends up seeing the whole file, rather than
> "everything but the very end"

That's fine by me - we already do that in certain cases - but AFAICT that's not the way the write-behind code presented works. If the file is larger than the async write-behind size, then it will also block waiting for previous write-behind to complete. I think all we'd need is a call to filemap_fdatawrite()....

> > Perhaps we need to think about a small per-backing dev threshold
> > where the behaviour is the current writeback behaviour, but once
> > it's exceeded we then switch to write-behind so that the amount of
> > dirty data doesn't exceed that threshold.
>
> Yes, that sounds like a really good idea, and a way to avoid
> starting too early.
>
> However, part of the problem there is that we don't have that
> historical "what is dirty", because it would often be in previous
> files. Konstantin's patch is simple partly because it has only that
> single-file history to worry about.
>
> You could obviously keep that simplicity, and just accept the fact
> that the early dirty data ends up being kept dirty, and consider it
> just the startup cost and not even try to do the write-behind on that
> oldest data.

I'm not sure we need to care about that - the bdi knows how much dirty data there is on the device, and so we can switch from device-based writeback to per-file writeback at that point. If we trigger a background flush of all the existing dirty data when we switch modes, then we wouldn't leave any of it hanging around for ages while other file data gets written...

Cheers, Dave.
diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi
index d773d5697cf5..50a8b8750c13 100644
--- a/Documentation/ABI/testing/sysfs-class-bdi
+++ b/Documentation/ABI/testing/sysfs-class-bdi
@@ -30,6 +30,17 @@ read_ahead_kb (read-write)
 
 	Size of the read-ahead window in kilobytes
 
+min_write_behind_kb (read-write)
+
+	Size of minimal write-behind request in kilobytes.
+	0 -> disable write-behind for this disk.
+
+async_write_behind_kb (read-write)
+
+	Size of async write-behind window in kilobytes.
+	Next write will wait for writeback falling behind window.
+	0 -> completely async mode, skip if disk is congested.
+
 min_ratio (read-write)
 
 	Under normal circumstances each device is given a part of the
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 9baf66a9ef4e..c491fb6d8ba6 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -28,6 +28,7 @@ Currently, these files are in /proc/sys/vm:
 - dirty_expire_centisecs
 - dirty_ratio
 - dirty_writeback_centisecs
+- dirty_write_behind
 - drop_caches
 - extfrag_threshold
 - hugepages_treat_as_movable
@@ -188,6 +189,20 @@ Setting this to zero disables periodic writeback altogether.
 
 ==============================================================
 
+dirty_write_behind
+
+This controls the write-behind writeback policy - automatic background
+writeback for sequentially written data behind the current writing position.
+
+=0: disabled, default
+=1: enabled
+
+Minimum request size and async window size are configured for each bdi:
+/sys/block/$DEV/bdi/min_write_behind_kb
+/sys/block/$DEV/bdi/async_write_behind_kb
+
+==============================================================
+
 drop_caches
 
 Writing to this will cause the kernel to drop clean caches, as well as
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 866c433e7d32..ba5322ea970a 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -143,6 +143,8 @@ struct backing_dev_info {
 	struct list_head bdi_list;
 	unsigned long ra_pages;	/* max readahead in PAGE_SIZE units */
 	unsigned long io_pages;	/* max allowed IO size */
+	unsigned long min_write_behind;		/* Minimum write-behind in pages */
+	unsigned long async_write_behind;	/* Async write-behind in pages */
 
 	congested_fn *congested_fn; /* Function pointer if device is md/dm */
 	void *congested_data;	/* Pointer to aux data for congested func */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 339e73742e73..828494ce556e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -144,6 +144,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define FMODE_CAN_READ	((__force fmode_t)0x20000)
 /* Has write method(s) */
 #define FMODE_CAN_WRITE	((__force fmode_t)0x40000)
+/* "Use once" access pattern is expected */
+#define FMODE_NOREUSE	((__force fmode_t)0x80000)
 
 /* File was opened by fanotify and shouldn't generate fanotify events */
 #define FMODE_NONOTIFY	((__force fmode_t)0x4000000)
@@ -871,6 +873,7 @@ struct file {
 	struct fown_struct	f_owner;
 	const struct cred	*f_cred;
 	struct file_ra_state	f_ra;
+	pgoff_t			f_write_behind;
 
 	u64			f_version;
 #ifdef CONFIG_SECURITY
@@ -2655,6 +2658,9 @@ extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
 			   int datasync);
 extern int vfs_fsync(struct file *file, int datasync);
 
+extern int vm_dirty_write_behind;
+extern ssize_t generic_write_behind(struct kiocb *iocb, ssize_t count);
+
 /*
  * Sync the bytes written if this was a synchronous write.  Expect ki_pos
  * to already be updated for the write, and will return either the amount
@@ -2668,7 +2674,8 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
 				(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
 		if (ret)
 			return ret;
-	}
+	} else if (vm_dirty_write_behind)
+		return generic_write_behind(iocb, count);
 
 	return count;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f8c10d336e42..592efaeca2d4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2217,6 +2217,9 @@ extern int filemap_page_mkwrite(struct vm_fault *vmf);
 int __must_check write_one_page(struct page *page);
 void task_dirty_inc(struct task_struct *tsk);
 
+#define VM_MIN_WRITE_BEHIND_KB		256
+#define VM_ASYNC_WRITE_BEHIND_KB	1024
+
 /* readahead.c */
 #define VM_MAX_READAHEAD	128	/* kbytes */
 #define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 423554ad3610..a40e4839a390 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1346,6 +1346,15 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 	{
+		.procname	= "dirty_write_behind",
+		.data		= &vm_dirty_write_behind,
+		.maxlen		= sizeof(vm_dirty_write_behind),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{
 		.procname	= "nr_pdflush_threads",
 		.mode		= 0444 /* read-only */,
 		.proc_handler	= pdflush_proc_obsolete,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e19606bb41a0..c0f8aba4133d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -138,25 +138,6 @@ static inline void bdi_debug_unregister(struct backing_dev_info *bdi)
 }
 #endif
 
-static ssize_t read_ahead_kb_store(struct device *dev,
-				   struct device_attribute *attr,
-				   const char *buf, size_t count)
-{
-	struct backing_dev_info *bdi = dev_get_drvdata(dev);
-	unsigned long read_ahead_kb;
-	ssize_t ret;
-
-	ret = kstrtoul(buf, 10, &read_ahead_kb);
-	if (ret < 0)
-		return ret;
-
-	bdi->ra_pages = read_ahead_kb >> (PAGE_SHIFT - 10);
-
-	return count;
-}
-
-#define K(pages) ((pages) << (PAGE_SHIFT - 10))
-
 #define BDI_SHOW(name, expr)					\
 static ssize_t name##_show(struct device *dev,			\
 			   struct device_attribute *attr, char *page)	\
@@ -167,7 +148,27 @@ static ssize_t name##_show(struct device *dev,		\
 }								\
 static DEVICE_ATTR_RW(name);
 
-BDI_SHOW(read_ahead_kb, K(bdi->ra_pages))
+#define BDI_ATTR_KB(name, field)				\
+static ssize_t name##_store(struct device *dev,			\
+			    struct device_attribute *attr,	\
+			    const char *buf, size_t count)	\
+{								\
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);	\
+	unsigned long kb;					\
+	ssize_t ret;						\
+								\
+	ret = kstrtoul(buf, 10, &kb);				\
+	if (ret < 0)						\
+		return ret;					\
+								\
+	bdi->field = kb >> (PAGE_SHIFT - 10);			\
+	return count;						\
+}								\
+BDI_SHOW(name, ((bdi->field) << (PAGE_SHIFT - 10)))
+
+BDI_ATTR_KB(read_ahead_kb, ra_pages)
+BDI_ATTR_KB(min_write_behind_kb, min_write_behind)
+BDI_ATTR_KB(async_write_behind_kb, async_write_behind)
 
 static ssize_t min_ratio_store(struct device *dev,
 		struct device_attribute *attr, const char *buf, size_t count)
@@ -220,6 +221,8 @@ static DEVICE_ATTR_RO(stable_pages_required);
 
 static struct attribute *bdi_dev_attrs[] = {
 	&dev_attr_read_ahead_kb.attr,
+	&dev_attr_min_write_behind_kb.attr,
+	&dev_attr_async_write_behind_kb.attr,
 	&dev_attr_min_ratio.attr,
 	&dev_attr_max_ratio.attr,
 	&dev_attr_stable_pages_required.attr,
@@ -836,6 +839,9 @@ static int bdi_init(struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&bdi->wb_list);
 	init_waitqueue_head(&bdi->wb_waitq);
 
+	bdi->min_write_behind = VM_MIN_WRITE_BEHIND_KB >> (PAGE_SHIFT - 10);
+	bdi->async_write_behind = VM_ASYNC_WRITE_BEHIND_KB >> (PAGE_SHIFT - 10);
+
 	ret = cgwb_bdi_init(bdi);
 
 	return ret;
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 702f239cd6db..8817343955e7 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -82,6 +82,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 		f.file->f_ra.ra_pages = bdi->ra_pages;
 		spin_lock(&f.file->f_lock);
 		f.file->f_mode &= ~FMODE_RANDOM;
+		f.file->f_mode &= ~FMODE_NOREUSE;
 		spin_unlock(&f.file->f_lock);
 		break;
 	case POSIX_FADV_RANDOM:
@@ -113,6 +114,9 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 					   nrpages);
 		break;
 	case POSIX_FADV_NOREUSE:
+		spin_lock(&f.file->f_lock);
+		f.file->f_mode |= FMODE_NOREUSE;
+		spin_unlock(&f.file->f_lock);
 		break;
 	case POSIX_FADV_DONTNEED:
 		if (!inode_write_congested(mapping->host))
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0b9c5cbe8eba..95151f3ebd4f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2851,3 +2851,87 @@ void wait_for_stable_page(struct page *page)
 		wait_on_page_writeback(page);
 }
 EXPORT_SYMBOL_GPL(wait_for_stable_page);
+
+int vm_dirty_write_behind __read_mostly;
+EXPORT_SYMBOL(vm_dirty_write_behind);
+
+/**
+ * generic_write_behind() - writeback dirty pages behind current position.
+ *
+ * This function tracks the writing position and starts background writeback
+ * if the file has enough sequentially written data.
+ *
+ * Returns @count or a negative error code if I/O failed.
+ */
+ssize_t generic_write_behind(struct kiocb *iocb, ssize_t count)
+{
+	struct file *file = iocb->ki_filp;
+	struct address_space *mapping = file->f_mapping;
+	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
+	unsigned long min_size = READ_ONCE(bdi->min_write_behind);
+	unsigned long async_size = READ_ONCE(bdi->async_write_behind);
+	pgoff_t head = file->f_write_behind;
+	pgoff_t begin = (iocb->ki_pos - count) >> PAGE_SHIFT;
+	pgoff_t end = iocb->ki_pos >> PAGE_SHIFT;
+	int ret;
+
+	/* Disabled, contiguous but not big enough yet, or marked as random. */
+	if (!min_size || end - head < min_size || (file->f_mode & FMODE_RANDOM))
+		goto out;
+
+	spin_lock(&file->f_lock);
+
+	/* Re-read under lock. */
+	head = file->f_write_behind;
+
+	/* Non-contiguous: move head position. */
+	if (head > end || begin - head > async_size)
+		file->f_write_behind = head = begin;
+
+	/* Still not big enough. */
+	if (end - head < min_size) {
+		spin_unlock(&file->f_lock);
+		goto out;
+	}
+
+	/* Set head for next iteration; everything behind will be written. */
+	file->f_write_behind = end;
+
+	spin_unlock(&file->f_lock);
+
+	/* Non-blocking files always work in async mode. */
+	if (file->f_flags & O_NONBLOCK)
+		async_size = 0;
+
+	/* Skip pages in async mode if the disk is congested. */
+	if (!async_size && inode_write_congested(mapping->host))
+		goto out;
+
+	/* Start background writeback. */
+	ret = __filemap_fdatawrite_range(mapping,
+					 (loff_t)head << PAGE_SHIFT,
+					 ((loff_t)end << PAGE_SHIFT) - 1,
+					 WB_SYNC_NONE);
+	if (ret < 0)
+		return ret;
+
+	if (!async_size || head < async_size)
+		goto out;
+
+	/* Wait for pages falling behind the async window. */
+	head -= async_size;
+	end -= async_size;
+	ret = filemap_fdatawait_range(mapping,
+				      (loff_t)head << PAGE_SHIFT,
+				      ((loff_t)end << PAGE_SHIFT) - 1);
+	if (ret < 0)
+		return ret;
+
+	/* Evict completely written pages if no more access is expected. */
+	if (file->f_mode & FMODE_NOREUSE)
+		invalidate_mapping_pages(mapping, head, end - 1);
+
+out:
+	return count;
+}
+EXPORT_SYMBOL(generic_write_behind);
Traditional writeback tries to accumulate as much dirty data as possible. This is a worthwhile strategy for extremely short-living files and for batching writes to save battery power. But for workloads where disk latency matters, this policy generates periodic disk load spikes which increase latency for concurrent operations.

The present writeback engine only allows tuning the dirty data size or the expiration time. Such tuning cannot eliminate spikes - it merely lowers and multiplies them. The other option is switching into sync mode, which flushes written data right after each write; obviously this has a significant performance impact. Moreover, such tuning is system-wide and also affects memory-mapped and randomly written files, which flusher threads handle much better.

This patch implements a write-behind policy which tracks sequential writes and starts background writeback once there are enough dirty pages in a row.

Write-behind tracks the current writing position and looks into two windows behind it: the first covers unwritten pages, the second covers async writeback. The next write starts background writeback when the first window exceeds the threshold, and waits for pages falling behind the async writeback window. This allows combining small writes into bigger requests and maintaining an optimal io-depth.

This affects only writes via syscalls; memory-mapped writes are unchanged. Write-behind also does not affect files marked with fadvise POSIX_FADV_RANDOM.

If the async window is set to 0, write-behind skips dirty pages when the disk is congested and never waits for writeback. This mode is used for files opened with O_NONBLOCK.

For files marked with fadvise POSIX_FADV_NOREUSE, write-behind automatically evicts completely written pages from the cache. This is perfect for writing verbose logs without pushing more important data out of the cache.

As a bonus, write-behind makes blkio throttling much smoother for most bulk file operations, like copying or downloading, which write sequentially.
The size of the minimal write-behind request is set in:
/sys/block/$DISK/bdi/min_write_behind_kb
Default is 256 kB; 0 disables write-behind for this disk.

The size of the async window is set in:
/sys/block/$DISK/bdi/async_write_behind_kb
Default is 1024 kB; 0 disables synchronous write-behind.

Write-behind is controlled by the sysctl vm.dirty_write_behind:
=0: disabled, default
=1: enabled

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 Documentation/ABI/testing/sysfs-class-bdi |   11 ++++
 Documentation/sysctl/vm.txt               |   15 +++++
 include/linux/backing-dev-defs.h          |    2 +
 include/linux/fs.h                        |    9 +++
 include/linux/mm.h                        |    3 +
 kernel/sysctl.c                           |    9 +++
 mm/backing-dev.c                          |   46 +++++++++-------
 mm/fadvise.c                              |    4 +
 mm/page-writeback.c                       |   84 +++++++++++++++++++++++++++++
 9 files changed, 162 insertions(+), 21 deletions(-)