
[RFC,0/7] buffered block atomic writes

Message ID: 20240422143923.3927601-1-john.g.garry@oracle.com

John Garry April 22, 2024, 2:39 p.m. UTC
This series introduces a proof-of-concept for buffered block atomic
writes.

There is a requirement for userspace to be able to issue a write which
will not be torn due to HW or some other failure. A solution is presented
in [0] and [1].

Those series only support atomic writes for direct IO. The primary
target of atomic (or untorn) writes is databases like InnoDB/MySQL,
which require direct IO support. However, as mentioned in [2], there is
a desire to also support atomic writes for databases which use buffered
IO, like Postgres.

The issue raised in [2] was that the proposed API is not suitable for
buffered atomic writes. Specifically, since the API permits a range of
atomic write sizes, it is too difficult to track in the pagecache the
geometry of atomic writes which overlap with other atomic writes of
differing sizes and alignments. In addition, tracking and handling
overlapping atomic and non-atomic writes is also difficult.

In this series, buffered atomic writes are supported based upon the
following principles:
- A buffered atomic write requires the RWF_ATOMIC flag to be set, the
  same as for direct IO. The other atomic write rules also apply, such
  as power-of-2 size and natural alignment.
- For an inode, only a single size of buffered atomic write is allowed.
  So for statx, atomic_write_unit_min == atomic_write_unit_max always
  for buffered atomic writes.
- A single folio maps to an atomic write in the pagecache. Folios match
  atomic writes well, as an atomic write must be a power-of-2 in size and
  naturally aligned.
- A folio is tagged as "atomic" when atomically written. If any part of
  an "atomic" folio is fully or partially overwritten with a non-atomic
  write, the folio loses its atomicity. Indeed, issuing a non-atomic
  write over an atomic write would typically be considered a userspace
  bug.
- If userspace wants to guarantee a buffered atomic write is written to
  media atomically after the write syscall returns, it must use RWF_SYNC
  or similar (along with RWF_ATOMIC).

This series only supports buffered atomic writes for XFS. I do have
some patches adding buffered atomic write support to the bdev file
operations. I did not include them, as:
a. I don't know of any requirement for this support
b. atomic_write_unit_min and atomic_write_unit_max would be fixed at
   PAGE_SIZE there, which is very limiting. However, an API like
   BLKBSZSET could be added to allow userspace to program the values
   for atomic_write_unit_{min, max}.
c. We may want to support atomic_write_unit_{min, max} < PAGE_SIZE, and
   this becomes more complicated to support.
d. I would like to see what happens with bs > ps work there.

This series is just an early proof-of-concept, to prove that the API
proposed for block atomic writes can work for buffered IO. I would like to
unblock that direct IO series and have it merged.

Patches are based on [0], [1], and [3] (the bs > ps series). For the
bs > ps series, I had to borrow an earlier filemap change which allows
the folio min and max order to be selected.

All patches can be found at:
https://github.com/johnpgarry/linux/tree/atomic-writes-v6.9-v6-fs-v2-buffered

[0] https://lore.kernel.org/linux-block/20240326133813.3224593-1-john.g.garry@oracle.com/
[1] https://lore.kernel.org/linux-block/20240304130428.13026-1-john.g.garry@oracle.com/
[2] https://lore.kernel.org/linux-fsdevel/20240228061257.GA106651@mit.edu/
[3] https://lore.kernel.org/linux-xfs/20240313170253.2324812-1-kernel@pankajraghav.com/

John Garry (7):
  fs: Rename STATX{_ATTR}_WRITE_ATOMIC -> STATX{_ATTR}_WRITE_ATOMIC_DIO
  filemap: Change mapping_set_folio_min_order() ->
    mapping_set_folio_orders()
  mm: Add PG_atomic
  fs: Add initial buffered atomic write support info to statx
  fs: iomap: buffered atomic write support
  fs: xfs: buffered atomic writes statx support
  fs: xfs: Enable buffered atomic writes

 block/bdev.c                   |  9 +++---
 fs/iomap/buffered-io.c         | 53 +++++++++++++++++++++++++++++-----
 fs/iomap/trace.h               |  3 +-
 fs/stat.c                      | 26 ++++++++++++-----
 fs/xfs/libxfs/xfs_inode_buf.c  |  8 +++++
 fs/xfs/xfs_file.c              | 12 ++++++--
 fs/xfs/xfs_icache.c            | 10 ++++---
 fs/xfs/xfs_ioctl.c             |  3 ++
 fs/xfs/xfs_iops.c              | 11 +++++--
 include/linux/fs.h             |  3 +-
 include/linux/iomap.h          |  1 +
 include/linux/page-flags.h     |  5 ++++
 include/linux/pagemap.h        | 20 ++++++++-----
 include/trace/events/mmflags.h |  3 +-
 include/uapi/linux/stat.h      |  6 ++--
 mm/filemap.c                   |  8 ++++-
 16 files changed, 141 insertions(+), 40 deletions(-)

Comments

Pankaj Raghav (Samsung) April 25, 2024, 2:47 p.m. UTC | #1
On Mon, Apr 22, 2024 at 02:39:18PM +0000, John Garry wrote:
> Borrowed from:
> 
> https://lore.kernel.org/linux-fsdevel/20240213093713.1753368-2-kernel@pankajraghav.com/
> (credit given in due course)
> 
> We will need to be able to only use a single folio order for buffered
> atomic writes, so allow the mapping folio order min and max be set.

> 
> We still have the restriction of not being able to support order-1
> folios - it will be required to lift this limit at some stage.

This is already supported upstream for file-backed folios:
commit: 8897277acfef7f70fdecc054073bea2542fc7a1b

> index fc8eb9c94e9c..c22455fa28a1 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -363,9 +363,10 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>  #endif
>  
>  /*
> - * mapping_set_folio_min_order() - Set the minimum folio order
> + * mapping_set_folio_orders() - Set the minimum and max folio order

In the new series (sorry forgot to CC you), I added a new helper called
mapping_set_folio_order_range() which does something similar to avoid
confusion based on willy's suggestion:
https://lore.kernel.org/linux-xfs/20240425113746.335530-3-kernel@pankajraghav.com/

mapping_set_folio_min_order() also sets the max folio order to
MAX_PAGECACHE_ORDER anyway. So there is no need to call it explicitly
here?

>  /**
> @@ -400,7 +406,7 @@ static inline void mapping_set_folio_min_order(struct address_space *mapping,
>   */
>  static inline void mapping_set_large_folios(struct address_space *mapping)
>  {
> -	mapping_set_folio_min_order(mapping, 0);
> +	mapping_set_folio_orders(mapping, 0, MAX_PAGECACHE_ORDER);
>  }
>  
>  static inline unsigned int mapping_max_folio_order(struct address_space *mapping)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index d81530b0aac0..d5effe50ddcb 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1898,9 +1898,15 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  no_page:
>  	if (!folio && (fgp_flags & FGP_CREAT)) {
>  		unsigned int min_order = mapping_min_folio_order(mapping);
> -		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
> +		unsigned int max_order = mapping_max_folio_order(mapping);
> +		unsigned int order = FGF_GET_ORDER(fgp_flags);
>  		int err;
>  
> +		if (order > max_order)
> +			order = max_order;
> +		else if (order < min_order)
> +			order = max_order;

order = min_order; ?
--
Pankaj
John Garry April 26, 2024, 8:02 a.m. UTC | #2
On 25/04/2024 15:47, Pankaj Raghav (Samsung) wrote:
> On Mon, Apr 22, 2024 at 02:39:18PM +0000, John Garry wrote:
>> Borrowed from:
>>
>> https://lore.kernel.org/linux-fsdevel/20240213093713.1753368-2-kernel@pankajraghav.com/
>> (credit given in due course)
>>
>> We will need to be able to only use a single folio order for buffered
>> atomic writes, so allow the mapping folio order min and max be set.
> 
>>
>> We still have the restriction of not being able to support order-1
>> folios - it will be required to lift this limit at some stage.
> 
> This is already supported upstream for file-backed folios:
> commit: 8897277acfef7f70fdecc054073bea2542fc7a1b

ok

> 
>> index fc8eb9c94e9c..c22455fa28a1 100644
>> --- a/include/linux/pagemap.h
>> +++ b/include/linux/pagemap.h
>> @@ -363,9 +363,10 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>>   #endif
>>   
>>   /*
>> - * mapping_set_folio_min_order() - Set the minimum folio order
>> + * mapping_set_folio_orders() - Set the minimum and max folio order
> 
> In the new series (sorry forgot to CC you),

no worries, I saw it

> I added a new helper called
> mapping_set_folio_order_range() which does something similar to avoid
> confusion based on willy's suggestion:
> https://lore.kernel.org/linux-xfs/20240425113746.335530-3-kernel@pankajraghav.com/
> 

Fine, I can include that

> mapping_set_folio_min_order() also sets max folio order to be
> MAX_PAGECACHE_ORDER order anyway. So no need of explicitly calling it
> here?
> 

Here mapping_set_folio_min_order() is being replaced with 
mapping_set_folio_order_range(), so not sure why you mention that. 
Regardless, I'll use your mapping_set_folio_order_range().

>>   /**
>> @@ -400,7 +406,7 @@ static inline void mapping_set_folio_min_order(struct address_space *mapping,
>>    */
>>   static inline void mapping_set_large_folios(struct address_space *mapping)
>>   {
>> -	mapping_set_folio_min_order(mapping, 0);
>> +	mapping_set_folio_orders(mapping, 0, MAX_PAGECACHE_ORDER);
>>   }
>>   
>>   static inline unsigned int mapping_max_folio_order(struct address_space *mapping)
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index d81530b0aac0..d5effe50ddcb 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1898,9 +1898,15 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>>   no_page:
>>   	if (!folio && (fgp_flags & FGP_CREAT)) {
>>   		unsigned int min_order = mapping_min_folio_order(mapping);
>> -		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
>> +		unsigned int max_order = mapping_max_folio_order(mapping);
>> +		unsigned int order = FGF_GET_ORDER(fgp_flags);
>>   		int err;
>>   
>> +		if (order > max_order)
>> +			order = max_order;
>> +		else if (order < min_order)
>> +			order = max_order;
> 
> order = min_order; ?

right

Thanks,
John