
[v2,0/7] large atomic writes for xfs

Message ID 20241210125737.786928-1-john.g.garry@oracle.com

Message

John Garry Dec. 10, 2024, 12:57 p.m. UTC
Currently the atomic write unit min and max are fixed at the FS blocksize
for xfs and ext4.

This series expands support to allow multiple FS blocks to be written
atomically.

To allow multiple blocks to be written atomically, the fs must ensure blocks
are allocated with some alignment and granularity. For xfs, today only
rtvol provides this through rt_extsize. So initial support for large
atomic writes will be for rtvol here. Support can easily be expanded to
regular files through the proposed forcealign feature.

An atomic write which spans mixed unwritten and mapped extents will be
required to have the unwritten extents pre-zeroed, which will be supported
in iomap.

Based on v6.13-rc2.

Patches available at the following:
https://github.com/johnpgarry/linux/tree/atomic-write-large-atomics-v6.13-v2

Changes since v1:
- Add extent zeroing support
- Rebase

John Garry (6):
  iomap: Increase iomap_dio_zero() size limit
  iomap: Add zero unwritten mappings dio support
  xfs: Add extent zeroing support for atomic writes
  xfs: Switch atomic write size check in xfs_file_write_iter()
  xfs: Add RT atomic write unit max to xfs_mount
  xfs: Update xfs_get_atomic_write_attr() for large atomic writes

Ritesh Harjani (IBM) (1):
  iomap: Lift blocksize restriction on atomic writes

 fs/iomap/direct-io.c   | 100 +++++++++++++++++++++++++++++++++++++----
 fs/xfs/libxfs/xfs_sb.c |   3 ++
 fs/xfs/xfs_file.c      |  97 ++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_iops.c      |  21 ++++++++-
 fs/xfs/xfs_iops.h      |   2 +
 fs/xfs/xfs_mount.h     |   1 +
 fs/xfs/xfs_rtalloc.c   |  25 +++++++++++
 fs/xfs/xfs_rtalloc.h   |   4 ++
 include/linux/iomap.h  |   3 ++
 9 files changed, 239 insertions(+), 17 deletions(-)

Comments

Christoph Hellwig Dec. 13, 2024, 2:38 p.m. UTC | #1
On Tue, Dec 10, 2024 at 12:57:30PM +0000, John Garry wrote:
> Currently the atomic write unit min and max is fixed at the FS blocksize
> for xfs and ext4.
> 
> This series expands support to allow multiple FS blocks to be written
> atomically.

Can you explain the workload you're interested in a bit more? 

I'm still very scared of expanding use of the large allocation sizes.

IIRC you showed some numbers where increasing the FSB size to something
larger did not look good in your benchmarks, but I'd like to understand
why.  Do you have a link to these numbers just to refresh everyone's minds
why that wasn't a good idea?  Did that also include supporting atomic
writes in the sector size <= write size <= FS block size range, which
aren't currently supported but would be very useful?
John Garry Dec. 13, 2024, 5:15 p.m. UTC | #2
On 13/12/2024 14:38, Christoph Hellwig wrote:
> On Tue, Dec 10, 2024 at 12:57:30PM +0000, John Garry wrote:
>> Currently the atomic write unit min and max is fixed at the FS blocksize
>> for xfs and ext4.
>>
>> This series expands support to allow multiple FS blocks to be written
>> atomically.
> 
> Can you explain the workload you're interested in a bit more?

Sure, so some background is that we are using atomic writes for InnoDB 
MySQL so that we can stop relying on the double-write buffer for crash 
protection. MySQL is using an internal 16K page size (so we want 16K 
atomic writes).
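
To make the 16K atomic write side concrete, here is a minimal userspace 
sketch (hypothetical example, not MySQL code; it assumes a 6.11+ kernel 
where RWF_ATOMIC exists and where statx() with STATX_WRITE_ATOMIC reports 
an atomic write unit max of at least 16K for this file):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* untorn-write flag, include/uapi/linux/fs.h, 6.11+ */
#endif

#define DB_PAGE_SZ	16384		/* InnoDB page size assumed for this example */

/* Write one 16K database page as a single untorn (atomic) direct write. */
static ssize_t write_page_atomic(int fd, const void *page, off_t off)
{
	struct iovec iov = {
		.iov_base = (void *)page,	/* O_DIRECT: buffer must be suitably aligned */
		.iov_len = DB_PAGE_SZ,
	};

	/* the offset must be naturally aligned to the 16K write size */
	return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
}

int main(void)
{
	void *page;
	int fd = open("tablespace.ibd", O_WRONLY | O_DIRECT);

	if (fd < 0 || posix_memalign(&page, DB_PAGE_SZ, DB_PAGE_SZ))
		return 1;
	/* ... fill the page with data ... */
	return write_page_atomic(fd, page, 0) == DB_PAGE_SZ ? 0 : 1;
}

Without this series, xfs and ext4 only advertise an atomic write unit max 
equal to the FS block size, so a 4K FS blocksize caps this at 4K today.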

MySQL has what is known as a REDO log - see 
https://dev.mysql.com/doc/dev/mysql-server/9.0.1/PAGE_INNODB_REDO_LOG.html

Essentially it means that for any data page we write, ahead of time we 
do a buffered 512B log update followed by a periodic fsync. I think that 
such a thing is common to many apps.

> 
> I'm still very scared of expanding use of the large allocation sizes.

Yes

> 
> IIRC you showed some numbers where increasing the FSB size to something
> larger did not look good in your benchmarks, but I'd like to understand
> why.  Do you have a link to these numbers just to refresh everyones minds
> why that wasn't a good idea. 

I don't think that I can share numbers, but I will summarize the findings.

When we tried just using 16K FS blocksize, we found for low thread count 
testing that performance was poor - even worse than the baseline of 4K FS 
blocksize and double-write buffer. We put this down to high write 
latency for the REDO log. As you can imagine, mostly writing 16K for only 
a 512B update is not efficient in terms of traffic generated and increased 
latency (versus 4K FS block size). At higher thread count, performance 
was better. We put that down to bigger log data portions being written 
to REDO per FS block write.

For 4K FS blocksize and 16K atomic writes configs - supported via 
forcealign or RTvol - performance was generally good across the board. 
forcealign was a bit better.
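
For reference, a rough sketch of the RTvol-style config (hypothetical 
example, not from this series or MySQL: it assumes the fs was created with 
a realtime device and a 16K rt extent size, e.g. mkfs.xfs -r 
rtdev=...,extsize=16k, so that rt allocations come in 16K-aligned units):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* struct fsxattr, FS_IOC_FS[GS]ETXATTR, FS_XFLAG_REALTIME */

/*
 * Create a file and flag it realtime so that its data is allocated from
 * the rt device; the flag can only be changed while the file still has
 * no extents allocated, i.e. while it is empty.
 */
static int create_rt_file(const char *path)
{
	struct fsxattr fsx;
	int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);

	if (fd < 0)
		return -1;
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		goto err;
	fsx.fsx_xflags |= FS_XFLAG_REALTIME;
	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
		goto err;
	return fd;
err:
	close(fd);
	return -1;
}

Alternatively, new files can inherit the rt flag from a parent directory 
with FS_XFLAG_RTINHERIT set (or mkfs.xfs -d rtinherit=1).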

We also tried a hybrid solution with 2x partitions - 1x partition with 
16K FS block size for data and 1x partition with 4K FS block size for 
REDO. Performance here was also good. Unfortunately, though, this config 
is not fit for production - that is because we have a requirement to do 
FS snapshots and that is not possible across 2x FS instances. We also 
considered block device snapshots, but there is reluctance to try that too.

> Did that also include supporting atomic
> writes in the sector size <= write size <= FS block size range, which
> aren't currently supported, but very useful?

I have no use for that so far.

Thanks,
John
Christoph Hellwig Dec. 13, 2024, 5:22 p.m. UTC | #3
On Fri, Dec 13, 2024 at 05:15:55PM +0000, John Garry wrote:
> Sure, so some background is that we are using atomic writes for innodb 
> MySQL so that we can stop relying on the double-write buffer for crash 
> protection. MySQL is using an internal 16K page size (so we want 16K atomic 
> writes).

Makes perfect sense so far.

>
> MySQL has what is known as a REDO log - see 
> https://dev.mysql.com/doc/dev/mysql-server/9.0.1/PAGE_INNODB_REDO_LOG.html
>
> Essentially it means that for any data page we write, ahead of time we do a 
> buffered 512B log update followed by a periodic fsync. I think that such a 
> thing is common to many apps.

So it's actually using buffered I/O for that and not direct I/O?

> When we tried just using 16K FS blocksize, we found for low thread count 
> testing that performance was poor - even worse baseline of 4K FS blocksize 
> and double-write buffer. We put this down to high write latency for REDO 
> log. As you can imagine, mostly writing 16K for only a 512B update is not 
> efficient in terms of traffic generated and increased latency (versus 4K FS 
> block size). At higher thread count, performance was better. We put that 
> down to bigger log data portions to be written to REDO per FS block write.

So if the redo log uses buffered I/O I can see how that would bloat writes.
But then again using buffered I/O for a REDO log seems pretty silly
to start with.
John Garry Dec. 13, 2024, 5:43 p.m. UTC | #4
On 13/12/2024 17:22, Christoph Hellwig wrote:
> On Fri, Dec 13, 2024 at 05:15:55PM +0000, John Garry wrote:
>> Sure, so some background is that we are using atomic writes for innodb
>> MySQL so that we can stop relying on the double-write buffer for crash
>> protection. MySQL is using an internal 16K page size (so we want 16K atomic
>> writes).
> 
> Make perfect sense so far.
> 
>>
>> MySQL has what is known as a REDO log - see
>> https://dev.mysql.com/doc/dev/mysql-server/9.0.1/PAGE_INNODB_REDO_LOG.html
>>
>> Essentially it means that for any data page we write, ahead of time we do a
>> buffered 512B log update followed by a periodic fsync. I think that such a
>> thing is common to many apps.
> 
> So it's actually using buffered I/O for that and not direct I/O?

Right

>> When we tried just using 16K FS blocksize, we found for low thread count
>> testing that performance was poor - even worse baseline of 4K FS blocksize
>> and double-write buffer. We put this down to high write latency for REDO
>> log. As you can imagine, mostly writing 16K for only a 512B update is not
>> efficient in terms of traffic generated and increased latency (versus 4K FS
>> block size). At higher thread count, performance was better. We put that
>> down to bigger log data portions to be written to REDO per FS block write.
> 
> So if the redo log uses buffered I/O I can see how that would bloat writes.
> But then again using buffered I/O for a REDO log seems pretty silly
> to start with.
> 

Yeah, at the low end, it may make sense to do the 512B write via DIO. 
But OTOH sync'ing many redo log FS blocks at once at the high end can be 
more efficient.

From what I have heard, this was attempted before (using DIO) by some 
vendor, but did not come to much.

So it seems that we are stuck with this redo log limitation.

Let me know if you have any other ideas to avoid large atomic writes...

Cheers,
John
Darrick J. Wong Dec. 14, 2024, 12:42 a.m. UTC | #5
On Fri, Dec 13, 2024 at 05:43:09PM +0000, John Garry wrote:
> On 13/12/2024 17:22, Christoph Hellwig wrote:
> > On Fri, Dec 13, 2024 at 05:15:55PM +0000, John Garry wrote:
> > > Sure, so some background is that we are using atomic writes for innodb
> > > MySQL so that we can stop relying on the double-write buffer for crash
> > > protection. MySQL is using an internal 16K page size (so we want 16K atomic
> > > writes).
> > 
> > Make perfect sense so far.
> > 
> > > 
> > > MySQL has what is known as a REDO log - see
> > > https://dev.mysql.com/doc/dev/mysql-server/9.0.1/PAGE_INNODB_REDO_LOG.html
> > > 
> > > Essentially it means that for any data page we write, ahead of time we do a
> > > buffered 512B log update followed by a periodic fsync. I think that such a
> > > thing is common to many apps.
> > 
> > So it's actually using buffered I/O for that and not direct I/O?
> 
> Right
> 
> > > When we tried just using 16K FS blocksize, we found for low thread count
> > > testing that performance was poor - even worse baseline of 4K FS blocksize
> > > and double-write buffer. We put this down to high write latency for REDO
> > > log. As you can imagine, mostly writing 16K for only a 512B update is not
> > > efficient in terms of traffic generated and increased latency (versus 4K FS
> > > block size). At higher thread count, performance was better. We put that
> > > down to bigger log data portions to be written to REDO per FS block write.
> > 
> > So if the redo log uses buffered I/O I can see how that would bloat writes.
> > But then again using buffered I/O for a REDO log seems pretty silly
> > to start with.
> > 
> 
> Yeah, at the low end, it may make sense to do the 512B write via DIO. But
> OTOH sync'ing many redo log FS blocks at once at the high end can be more
> efficient.
> 
> From what I have heard, this was attempted before (using DIO) by some
> vendor, but did not come to much.
> 
> So it seems that we are stuck with this redo log limitation.
> 
> Let me know if you have any other ideas to avoid large atomic writes...

From the description it sounds like the redo log consists of 512b blocks
that describe small changes to the 16k table file pages.  If they're
issuing 16k atomic writes to get each of those 512b redo log records to
disk it's no wonder that cranks up the overhead substantially.  Also,
replaying those tiny updates through the pagecache beats issuing a bunch
of tiny nonlocalized writes.

For the first case I don't know why they need atomic writes -- 512b redo
log records can't be torn because they're single-sector writes.  The
second case might be better done with exchange-range.

--D

> Cheers,
> John
> 
>
John Garry Dec. 16, 2024, 8:40 a.m. UTC | #6
>>
>> Yeah, at the low end, it may make sense to do the 512B write via DIO. But
>> OTOH sync'ing many redo log FS blocks at once at the high end can be more
>> efficient.
>>
>>  From what I have heard, this was attempted before (using DIO) by some
>> vendor, but did not come to much.
>>
>> So it seems that we are stuck with this redo log limitation.
>>
>> Let me know if you have any other ideas to avoid large atomic writes...
> 
>  From the description it sounds like the redo log consists of 512b blocks
> that describe small changes to the 16k table file pages.  If they're
> issuing 16k atomic writes to get each of those 512b redo log records to
> disk it's no wonder that cranks up the overhead substantially. 

They are not issuing the redo log atomically. They do 512B buffered 
writes and then periodically fsync.
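
i.e. the write pattern is roughly the following (illustrative sketch only, 
not MySQL's actual code); with a 16K FS block size each flush can end up 
writing back a whole 16K block for a 512B record, which is where the extra 
write traffic comes from:

#include <unistd.h>

#define REDO_REC_SZ	512

/* Append one 512B redo record via the page cache (no O_DIRECT). */
static int append_redo_record(int fd, const void *rec, off_t off)
{
	return pwrite(fd, rec, REDO_REC_SZ, off) == REDO_REC_SZ ? 0 : -1;
}

/* Periodic durability point covering every record written since the last flush. */
static int flush_redo_log(int fd)
{
	return fsync(fd);
}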

> Also,
> replaying those tiny updates through the pagecache beats issuing a bunch
> of tiny nonlocalized writes.
> 
> For the first case I don't know why they need atomic writes -- 512b redo
> log records can't be torn because they're single-sector writes.  The
> second case might be better done with exchange-range.
> 

As for exchange-range, that would very much pre-date any MySQL port. 
Furthermore, I can't imagine that exchange-range support is portable to 
other FSes, which is probably quite important. Anyway, they are not 
issuing the redo log atomically, so I don't know if mentioning 
exchange-range is relevant.

Regardless of what MySQL is specifically doing here, there are going to 
be other users/applications which want to keep a 4K FS blocksize and do 
larger atomic writes.

Thanks,
John
Christoph Hellwig Dec. 17, 2024, 7:11 a.m. UTC | #7
On Fri, Dec 13, 2024 at 05:43:09PM +0000, John Garry wrote:
>> So if the redo log uses buffered I/O I can see how that would bloat writes.
>> But then again using buffered I/O for a REDO log seems pretty silly
>> to start with.
>>
>
> Yeah, at the low end, it may make sense to do the 512B write via DIO. But 
> OTOH sync'ing many redo log FS blocks at once at the high end can be more 
> efficient.
>
> From what I have heard, this was attempted before (using DIO) by some 
> vendor, but did not come to much.

I can't see how buffered I/O will be faster than an optimized direct I/O
implementation.  Then again, compared to very dumb dio code that doesn't
replace the caching in the page cache, I can easily see how dio would be
much worse.  But given that people care about optimizing this workload
enough to look into changes all over the kernel I/O stack, I would have
expected that touching the code to write the redo log should not be
out of the picture.
John Garry Dec. 17, 2024, 8:23 a.m. UTC | #8
On 17/12/2024 07:11, Christoph Hellwig wrote:
> On Fri, Dec 13, 2024 at 05:43:09PM +0000, John Garry wrote:
>>> So if the redo log uses buffered I/O I can see how that would bloat writes.
>>> But then again using buffered I/O for a REDO log seems pretty silly
>>> to start with.
>>>
>>
>> Yeah, at the low end, it may make sense to do the 512B write via DIO. But
>> OTOH sync'ing many redo log FS blocks at once at the high end can be more
>> efficient.
>>
>>  From what I have heard, this was attempted before (using DIO) by some
>> vendor, but did not come to much.
> 
> I can't see how buffered I/O will be fast than an optimized direct I/O
> implementation.  Then again compared to very dumb dio code that doesn't
> replace the caching in the page I can easily see how dio would be
> much worse.  But given that people care about optimizing this workload
> enough to look into changes all over the kernel I/O stack I would
> expected that touching the code to write the redo log should not be
> out of the picture.

For sure, and I get the impression that - in general - optimising this 
redo log is something that effort is being put into. I will admit that I 
don't know much about this redo log code, but I can go back with the 
feedback that the redo log should be optimised for switching to a larger 
FS blocksize. But that may take a long time and prove fruitless.

And even if something is done for this particular case, other scenarios 
are still going to want large atomic writes while keeping the 4K FS 
block size.

Apart from all of that, I get that you don't want to grow the big 
alloc code, but is this series really doing that? Or does the smaller v1?

Thanks,
John