diff mbox

fs: Make sync_file_range(2) use WB_SYNC_NONE writeback

Message ID 1445714897-26342-1-git-send-email-jack@suse.com (mailing list archive)
State New, archived
Headers show

Commit Message

Jan Kara Oct. 24, 2015, 7:28 p.m. UTC
sync_file_range(2) is documented to issue writeback only for pages that
are not currently being written. After all the system call has been
created for userspace to be able to issue background writeout and so
waiting for in-flight IO is undesirable there. However commit
ee53a891f474 (mm: do_sync_mapping_range integrity fix) switched
do_sync_mapping_range() and thus sync_file_range() to issue writeback in
WB_SYNC_ALL mode since do_sync_mapping_range() was used by other code
relying on WB_SYNC_ALL semantics.

These days do_sync_mapping_range() went away and we can switch
sync_file_range(2) back to issuing WB_SYNC_NONE writeback. That should
help PostgreSQL avoid large latency spikes when flushing data in the
background.

Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jan Kara <jack@suse.com>
---
 fs/sync.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Andres Freund Oct. 28, 2015, 9:30 a.m. UTC | #1
Hi Jan,

Thanks for looking into this and promptly sending a patch.

On 2015-10-24 21:28:17 +0200, Jan Kara wrote:
> These days do_sync_mapping_range() went away and we can switch
> sync_file_range(2) back to issuing WB_SYNC_NONE writeback. That should
> help PostgreSQL avoid large latency spikes when flushing data in the
> background.
> 
> diff --git a/fs/sync.c b/fs/sync.c
> index fbc98ee62044..ef60e812d771 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -343,7 +343,8 @@ SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,
>  	}
>  
>  	if (flags & SYNC_FILE_RANGE_WRITE) {
> -		ret = filemap_fdatawrite_range(mapping, offset, endbyte);
> +		ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
> +						 WB_SYNC_NONE);
>  		if (ret < 0)
>  			goto out_put;
>  	}

Thanks. Would scheduling a comparative benchmark of this be helpful
pushing htis forward ? Would probably only be early next week, I'm at
the european postgresql conference right now.

Regards,

Andres
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara Oct. 28, 2015, 2:18 p.m. UTC | #2
Hi Andres,

On Wed 28-10-15 10:30:40, Andres Freund wrote:
> On 2015-10-24 21:28:17 +0200, Jan Kara wrote:
> > These days do_sync_mapping_range() went away and we can switch
> > sync_file_range(2) back to issuing WB_SYNC_NONE writeback. That should
> > help PostgreSQL avoid large latency spikes when flushing data in the
> > background.
> > 
> > diff --git a/fs/sync.c b/fs/sync.c
> > index fbc98ee62044..ef60e812d771 100644
> > --- a/fs/sync.c
> > +++ b/fs/sync.c
> > @@ -343,7 +343,8 @@ SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,
> >  	}
> >  
> >  	if (flags & SYNC_FILE_RANGE_WRITE) {
> > -		ret = filemap_fdatawrite_range(mapping, offset, endbyte);
> > +		ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
> > +						 WB_SYNC_NONE);
> >  		if (ret < 0)
> >  			goto out_put;
> >  	}
> 
> Thanks. Would scheduling a comparative benchmark of this be helpful
> pushing htis forward ? Would probably only be early next week, I'm at
> the european postgresql conference right now.

If you could run it, it would be nice. Thanks!

								Honza
Andres Freund Nov. 6, 2015, 1:06 p.m. UTC | #3
Hi,

On 2015-10-28 15:18:52 +0100, Jan Kara wrote:
> On Wed 28-10-15 10:30:40, Andres Freund wrote:
> > On 2015-10-24 21:28:17 +0200, Jan Kara wrote:
> > Thanks. Would scheduling a comparative benchmark of this be helpful
> > pushing htis forward ? Would probably only be early next week, I'm at
> > the european postgresql conference right now.
> 
> If you could run it, it would be nice. Thanks!

Sorry that it took longer. Had some $work deadline making a surprise
attack (sneaky bastard) and then difficulity getting time on a bigger
box. Hence I only have numbers from a single SSD drive (840 EVO 1TB)
laptop (16GB RAM, i7-4800MQ), with nothing else running.

I've compared 4.3.0-andres-07965-gd1e41ff, with/without the patch
applied. Best of three results:

Before:
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prepared
number of clients: 48
number of threads: 48
duration: 1000 s
number of transactions actually processed: 7873859
latency average: 6.094 ms
latency stddev: 35.376 ms
tps = 7871.887045 (including connections establishing)
tps = 7872.135871 (excluding connections establishing)


After:
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prepared
number of clients: 48
number of threads: 48
duration: 1000 s
number of transactions actually processed: 9370637
latency average: 5.119 ms
latency stddev: 24.423 ms
tps = 9369.212486 (including connections establishing)
tps = 9369.725595 (excluding connections establishing)

Pretty tidy improvement I'd say.

Feel free to add a
Tested-By: Andres Freund <andres@anarazel.de>

There's still quite some further room of improvement. E.g. some
quite massive peaks in latency, which don't coincide with background
work in postgres:
progress: 869.0 s, 9971.2 tps, lat 4.828 ms stddev 6.201
progress: 870.0 s, 8196.4 tps, lat 4.632 ms stddev 6.774
progress: 871.0 s, 497.0 tps, lat 89.598 ms stddev 308.398
progress: 872.0 s, 10360.6 tps, lat 5.917 ms stddev 40.041
progress: 873.0 s, 7005.5 tps, lat 6.743 ms stddev 21.985

Without controlling writeout via sync_file_range(), we frequently can
see stalls in the tens of seconds on busy machines though... So this is
still much better ;)

Greetings,

Andres Freund
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara Nov. 6, 2015, 8:43 p.m. UTC | #4
Hello,

On Fri 06-11-15 14:06:12, Andres Freund wrote:
> On 2015-10-28 15:18:52 +0100, Jan Kara wrote:
> > On Wed 28-10-15 10:30:40, Andres Freund wrote:
> > > On 2015-10-24 21:28:17 +0200, Jan Kara wrote:
> > > Thanks. Would scheduling a comparative benchmark of this be helpful
> > > pushing htis forward ? Would probably only be early next week, I'm at
> > > the european postgresql conference right now.
> > 
> > If you could run it, it would be nice. Thanks!
> 
> Sorry that it took longer. Had some $work deadline making a surprise
> attack (sneaky bastard) and then difficulity getting time on a bigger
> box. Hence I only have numbers from a single SSD drive (840 EVO 1TB)
> laptop (16GB RAM, i7-4800MQ), with nothing else running.
> 
> I've compared 4.3.0-andres-07965-gd1e41ff, with/without the patch
> applied. Best of three results:
> 
> Before:
> transaction type: TPC-B (sort of)
> scaling factor: 300
> query mode: prepared
> number of clients: 48
> number of threads: 48
> duration: 1000 s
> number of transactions actually processed: 7873859
> latency average: 6.094 ms
> latency stddev: 35.376 ms
> tps = 7871.887045 (including connections establishing)
> tps = 7872.135871 (excluding connections establishing)
> 
> 
> After:
> transaction type: TPC-B (sort of)
> scaling factor: 300
> query mode: prepared
> number of clients: 48
> number of threads: 48
> duration: 1000 s
> number of transactions actually processed: 9370637
> latency average: 5.119 ms
> latency stddev: 24.423 ms
> tps = 9369.212486 (including connections establishing)
> tps = 9369.725595 (excluding connections establishing)
> 
> Pretty tidy improvement I'd say.
> 
> Feel free to add a
> Tested-By: Andres Freund <andres@anarazel.de>
> 
> There's still quite some further room of improvement. E.g. some
> quite massive peaks in latency, which don't coincide with background
> work in postgres:
> progress: 869.0 s, 9971.2 tps, lat 4.828 ms stddev 6.201
> progress: 870.0 s, 8196.4 tps, lat 4.632 ms stddev 6.774
> progress: 871.0 s, 497.0 tps, lat 89.598 ms stddev 308.398
> progress: 872.0 s, 10360.6 tps, lat 5.917 ms stddev 40.041
> progress: 873.0 s, 7005.5 tps, lat 6.743 ms stddev 21.985
> 
> Without controlling writeout via sync_file_range(), we frequently can
> see stalls in the tens of seconds on busy machines though... So this is
> still much better ;)

Thanks for doing the test and I'm happy it improved your results. Andrew
has already picked up the patch into his tree so it will eventually hit the
upstream kernel (not sure if for 4.4 already but for 4.5 certainly).

								Honza
diff mbox

Patch

diff --git a/fs/sync.c b/fs/sync.c
index fbc98ee62044..ef60e812d771 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -343,7 +343,8 @@  SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,
 	}
 
 	if (flags & SYNC_FILE_RANGE_WRITE) {
-		ret = filemap_fdatawrite_range(mapping, offset, endbyte);
+		ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
+						 WB_SYNC_NONE);
 		if (ret < 0)
 			goto out_put;
 	}