diff mbox series

[RFC,2/2] ceph: truncate the file contents when needed when file scrypted

Message ID 20210903081510.982827-3-xiubli@redhat.com (mailing list archive)
State New, archived
Headers show
Series ceph: size handling for the fscrypt | expand

Commit Message

Xiubo Li Sept. 3, 2021, 8:15 a.m. UTC
From: Xiubo Li <xiubli@redhat.com>

When truncating the file, it will leave the MDS to handle that,
but the MDS won't update the file contents. So when the fscrypt
is enabled, if the truncate size is not aligned to the block size,
the kclient will round up the truancate size to the block size and
leave the the last block untouched.

The opaque fscrypt_file field will be used to tricker whether the
last block need to do the rmw to truncate the a specified block's
contents, we can get which block needs to do the rmw by round down
the fscrypt_file.

In kclient side, there is not need to do the rmw immediately after
the file is truncated. We can defer doing that whenever the kclient
will update that block in late future. And before that any kclient
will check the fscrypt_file field when reading that block, if the
fscrypt_file field is none zero that means the related block needs
to be zeroed in range of [fscrypt_file, round_up(fscrypt_file + PAGE_SIZE))
in pagecache or readed data buffer.

Once the that block contents are updated and writeback to ceph, the
kclient will reset the fscrypt_file field in MDS side, then 0 means
no need to care about the truncate stuff any more.

There has one special case and that is when there have 2 ftruncates
are called:

1) If the second's size equals to the first one, do nothing about
   the fscrypt_file.
2) If the second's size is smaller than the first one, then we need
   to update the fscrypt_file with new size.
3) If the second's size is larger than the first one, then we must
   leave what the fscrypt_file is. Because we always need to truncate
   more.

Add one CEPH_CLIENT_CAPS_RESET_FSCRYPT_FILE flag in the cap reqeust
to tell the MDS to reset the scrypt_file field once the specified
block has been updated, so there still need to adapt to this in the
MDS PR.

And also this patch just assume that all the buffer and none buffer
read/write related enscrypt/descrypt work has been done.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
---
 fs/ceph/addr.c  | 19 ++++++++++++++-
 fs/ceph/caps.c  | 24 +++++++++++++++++++
 fs/ceph/file.c  | 62 ++++++++++++++++++++++++++++++++++++++++++++++---
 fs/ceph/inode.c | 27 ++++++++++++++++++---
 fs/ceph/super.h | 12 ++++++++--
 5 files changed, 135 insertions(+), 9 deletions(-)

Comments

Jeff Layton Sept. 7, 2021, 4:26 p.m. UTC | #1
On Fri, 2021-09-03 at 16:15 +0800, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
> 
> When truncating the file, it will leave the MDS to handle that,
> but the MDS won't update the file contents. So when the fscrypt
> is enabled, if the truncate size is not aligned to the block size,
> the kclient will round up the truancate size to the block size and
> leave the the last block untouched.
> 
> The opaque fscrypt_file field will be used to tricker whether the
> last block need to do the rmw to truncate the a specified block's
> contents, we can get which block needs to do the rmw by round down
> the fscrypt_file.
> 
> In kclient side, there is not need to do the rmw immediately after
> the file is truncated. We can defer doing that whenever the kclient
> will update that block in late future. And before that any kclient
> will check the fscrypt_file field when reading that block, if the
> fscrypt_file field is none zero that means the related block needs
> to be zeroed in range of [fscrypt_file, round_up(fscrypt_file + PAGE_SIZE))
> in pagecache or readed data buffer.
> 

s/PAGE_SIZE/CEPH_FSCRYPT_BLOCK_SIZE/

Yes, on x86_64 they are equal, but that's not the case on all arches.
Also, we are moving toward a pagecache that may hold larger pages on
x86_64 too.
 
> Once the that block contents are updated and writeback
> kclient will reset the fscrypt_file field in MDS side, then 0 means
> no need to care about the truncate stuff any more.
> 

I'm a little unclear on what the fscrypt_file actually represents here. 

I had proposed that we just make the fscrypt_file field hold the
"actual" i_size and we'd make the old size field always be a rounded-
up version of the size. The MDS would treat that as an opaque value
under Fw caps, and the client could use that field to determine i_size.
That has a side benefit too -- if the client doesn't support fscrypt,
it'll see the rounded-up sizes which are close enough and don't violate
any POSIX rules.

In your version, fscrypt_file also holds the actual size of the inode,
but sometimes you're zeroing it out, and I don't understand why:

1) What benefit does this extra complexity provide?

2) once you zero out that value, how do you know what the real i_size
is? I assume the old size field will hold a rounded-up size (mostly for
the benefit of the MDS), so you don't have a record of the real i_size
at that point.

Note that there is no reason you can't store _more_ than just a size in
fscrypt_file. For example, If you need extra flags to indicate
truncation/rmw state, then you could just make it hold a struct that has
the size and a flags field.

> There has one special case and that is when there have 2 ftruncates
> are called:
> 
> 1) If the second's size equals to the first one, do nothing about
>    the fscrypt_file.
> 2) If the second's size is smaller than the first one, then we need
>    to update the fscrypt_file with new size.
> 3) If the second's size is larger than the first one, then we must
>    leave what the fscrypt_file is. Because we always need to truncate
>    more.
> 
> Add one CEPH_CLIENT_CAPS_RESET_FSCRYPT_FILE flag in the cap reqeust
> to tell the MDS to reset the scrypt_file field once the specified
> block has been updated, so there still need to adapt to this in the
> MDS PR.
> 
> And also this patch just assume that all the buffer and none buffer
> read/write related enscrypt/descrypt work has been done.
> 
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
>  fs/ceph/addr.c  | 19 ++++++++++++++-
>  fs/ceph/caps.c  | 24 +++++++++++++++++++
>  fs/ceph/file.c  | 62 ++++++++++++++++++++++++++++++++++++++++++++++---
>  fs/ceph/inode.c | 27 ++++++++++++++++++---
>  fs/ceph/super.h | 12 ++++++++--
>  5 files changed, 135 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 6d3f74d46e5b..9f1dd2fc427d 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -212,6 +212,7 @@ static bool ceph_netfs_clamp_length(struct netfs_read_subrequest *subreq)
>  static void finish_netfs_read(struct ceph_osd_request *req)
>  {
>  	struct ceph_fs_client *fsc = ceph_inode_to_client(req->r_inode);
> +	struct ceph_inode_info *ci = ceph_inode(req->r_inode);
>  	struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0);
>  	struct netfs_read_subrequest *subreq = req->r_priv;
>  	int num_pages;
> @@ -234,8 +235,15 @@ static void finish_netfs_read(struct ceph_osd_request *req)
>  
>  	netfs_subreq_terminated(subreq, err, true);
>  
> +	/* FIXME: This should be done after descryped */
> +	if (req->r_result > 0)
> +		ceph_try_to_zero_truncate_block_off(ci, osd_data->alignment,
> +						    osd_data->length,ceph_osd_data
> +						    osd_data->pages);
> +

The 3rd arg to ceph_try_to_zero_truncate_block_off is end_pos, but here
you're passing in the length of the OSD write. Doesn't that need to
added to the pos of the write?

I'm also a little unclear as to why you need to adjust the truncation
offset at this point. 

>  	num_pages = calc_pages_for(osd_data->alignment, osd_data->length);
>  	ceph_put_page_vector(osd_data->pages, num_pages, false);
> +
>  	iput(req->r_inode);
>  }
>  
> @@ -555,8 +563,10 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
>  				  req->r_end_latency, len, err);
>  
>  	ceph_osdc_put_request(req);
> -	if (err == 0)
> +	if (err == 0) {
> +		ceph_reset_truncate_block_off(ci, page_off, len);
>  		err = len;
> +	}
>  
>  	if (err < 0) {
>  		struct writeback_control tmp_wbc;
> @@ -661,10 +671,17 @@ static void writepages_finish(struct ceph_osd_request *req)
>  					   (u64)osd_data->length);
>  		total_pages += num_pages;
>  		for (j = 0; j < num_pages; j++) {
> +			u64 page_off;
> +
>  			page = osd_data->pages[j];
>  			BUG_ON(!page);
>  			WARN_ON(!PageUptodate(page));
>  
> +			page_off = osd_data->alignment + j * PAGE_SIZE;
> +			if (rc >= 0)
> +			    ceph_reset_truncate_block_off(ci, page_off,
> +							  page_off + PAGE_SIZE);
> +
>  			if (atomic_long_dec_return(&fsc->writeback_count) <
>  			     CONGESTION_OFF_THRESH(
>  					fsc->mount_options->congestion_kb))
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index d628dcdbf869..a211ab4c3f7a 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -1425,6 +1425,9 @@ static vthatoid __prep_cap(struct cap_msg_args *arg, struct ceph_cap *cap,
>  			}
>  		}
>  	}
> +	if (ci->i_truncate_block_off < 0)
> +		flags |= CEPH_CLIENT_CAPS_RESET_FSCRYPT_FILE;It sounds like you want to do something quite different from what I originally proposed. 
> +
>  	arg->flags = flags;
>  	arg->encrypted = IS_ENCRYPTED(inode);
>  	if (ci->fscrypt_auth_len &&
> @@ -3155,6 +3158,27 @@ void ceph_put_cap_refs_no_check_caps(struct ceph_inode_info *ci, int had)
>  	__ceph_put_cap_refs(ci, had, PUT_CAP_REFS_NO_CHECK);
>  }
>  
> +/*
> + * Clear the i_truncate_block_off and fscrypt_file
> + * if old last encrypted block has been updated.
> + */
> +void __ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
> +				      u64 start_pos, u64 end_pos)
> +{
> +	if (ci->i_truncate_block_off > 0 &&
> +	    ci->i_truncate_block_off >= start_pos &&
> +	    ci->i_truncate_block_off < end_pos)
> +		ci->i_truncate_block_off = 0;
> +}
> +
> +void ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
> +				    u64 start_pos, u64 end_pos)
> +{
> +	spin_lock(&ci->i_ceph_lock);
> +	__ceph_reset_truncate_block_off(ci, start_pos, end_pos);
> +	spin_unlock(&ci->i_ceph_lock);
> +}
> +
>  /*
>   * Release @nr WRBUFFER refs on dirty pages for the given @snapc snap
>   * context.  Adjust per-snap dirty page accounting as appropriate.
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 6e677b40410e..cfa4cbe08c10 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -885,10 +885,34 @@ static void fscrypt_adjust_off_and_len(struct inode *inode, u64 *off, u64 *len)
>  	}
>  }
>  
> +void ceph_try_to_zero_truncate_block_off(struct ceph_inode_info *ci,
> +					 u64 start_pos, u64 end_pos,
> +					 struct page **pages)
> +{
> +	u64 zoff, zlen;
> +
> +	spin_lock(&ci->i_ceph_lock);
> +	if (ci->i_truncate_block_off >= start_pos &&
> +			ci->i_truncate_block_off < end_pos) {
> +		zoff = ci->i_truncate_block_off - start_pos;
> +		zlen = round_up(ci->i_truncate_block_off, PAGE_SIZE) - ci->i_truncate_block_off;
> +
> +		spin_unlock(&ci->i_ceph_lock);
> +		ceph_zero_page_vector_range(zoff, zlen, pages);
> +		spin_lock(&ci->i_ceph_lock);
> +	}
> +	spin_unlock(&ci->i_ceph_lock);
> +}
>  #else
>  static inline void fscrypt_adjust_off_and_len(struct inode *inode, u64 *off, u64 *len)
>  {
>  }
> +
> +void ceph_try_to_zero_truncate_block_off(struct ceph_inode_info *ci,
> +					 u64 start_pos, u64 end_pos,
> +					 struct page **pages)
> +{
> +}
>  #endif
>  
>  /*
> @@ -1030,6 +1054,13 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
>  			ret += zlen;
>  		}
>  
> +		/*
> +		 * If the inode is ENCRYPTED the read_off is aligned to PAGE_SIZE
> +		 */
> +		ceph_try_to_zero_truncate_block_off(ci, read_off,
> +						    read_off + read_len,
> +						    pages);
> +
>  		idx = 0;
>  		left = ret > 0 ? ret : 0;
>  		while (left > 0) {
> @@ -1413,12 +1444,34 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
>  
>  		size = i_size_read(inode);
>  		if (!write) {
> +			struct iov_iter i;
> +			size_t boff;
> +			int zlen;
> +
>  			if (ret == -ENOENT)
>  				ret = 0;
> +
> +			/* Zero the truncate block off */
> +			spin_lock(&ci->i_ceph_lock);
> +			boff = ci->i_truncate_block_off;
> +			if (IS_ENCRYPTED(inode) && ret > 0 && boff > 0 &&
> +			    boff >= (iocb->ki_pos & PAGE_MASK) &&
> +			    boff < round_up(ret, PAGE_SIZE)) {
> +				int advance = 0;
> +
> +				zlen = round_up(boff, PAGE_SIZE) - boff;
> +				if (ci->i_truncate_block_off >= iocb->ki_pos)
> +					advance = boff - iocb->ki_pos;
> +
> +				iov_iter_bvec(&i, READ, bvecs, num_pages, len);
> +				iov_iter_advance(&i, advance);
> +				iov_iter_zero(zlen, &i);
> +			}
> +			spin_unlock(&ci->i_ceph_lock);
> +
>  			if (ret >= 0 && ret < len && pos + ret < size) {
> -				struct iov_iter i;
> -				int zlen = min_t(size_t, len - ret,
> -						 size - pos - ret);
> +				zlen = min_t(size_t, len - ret,
> +					     size - pos - ret);
>  
>  				iov_iter_bvec(&i, READ, bvecs, num_pages, len);
>  				iov_iter_advance(&i, ret);
> @@ -1967,6 +2020,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
>  	struct ceph_osd_client *osdc = &fsc->client->osdc;
>  	struct ceph_cap_flush *prealloc_cf;
> +	u64 start_pos = iocb->ki_pos;
>  	ssize_t count, written = 0;
>  	int err, want, got;
>  	bool direct_lock = false;
> @@ -2110,6 +2164,8 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		int dirty;
>  
>  		spin_lock(&ci->i_ceph_lock);
> +		__ceph_reset_truncate_block_off(ci, start_pos, iocb->ki_pos);
> +
>  		ci->i_inline_version = CEPH_INLINE_NONE;
>  		dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR,
>  					       &prealloc_cf);
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 1a4c9bc485fc..c48c77c1bcf4 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -625,6 +625,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
>  	ci->fscrypt_auth = NULL;
>  	ci->fscrypt_auth_len = 0;
>  #endif
> +	ci->i_truncate_block_off = 0;
>  
>  	return &ci->vfs_inode;
>  }
> @@ -1033,11 +1034,24 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
>  
>  		pool_ns = old_ns;
>  
> +		/* the fscrypt_file is 0 means the file content truncate has been done */
>  		if (IS_ENCRYPTED(inode) && size &&
> -		    (iinfo->fscrypt_file_len == sizeof(__le64))) {
> +		    iinfo->fscrypt_file_len == sizeof(__le64) &&
> +		    __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file) > 0) {
>  			size = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
>  			if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
>  				pr_warn("size=%llu fscrypt_file=%llu\n", info->size, size);
> +
> +			/*
> +			 * If the second truncate come just after the first
> +			 * truncate, and if the second has a larger size there
> +			 * is no need to update the i_truncate_block_off.
> +			 * Only when the second one has a smaller size, that
> +			 * means we need to truncate more.
> +			 */
> +			if (ci->i_truncate_block_off > 0 &&
> +			    size < ci->i_truncate_block_off)
> +				ci->i_truncate_block_off = size;
>  		}
>  
>  		queue_trunc = ceph_fill_file_size(inode, issued,
> @@ -2390,8 +2404,15 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>  				req->r_args.setattr.old_size =
>  					cpu_to_le64(round_up(isize,
>  							     CEPH_FSCRYPT_BLOCK_SIZE));
> -				req->r_fscrypt_file = attr->ia_size;
> -				/* FIXME: client must zero out any partial blocks! */
> +				if (attr->ia_size < isize) {
> +					if(IS_ALIGNED(attr->ia_size, CEPH_FSCRYPT_BLOCK_SIZE))
> +						req->r_fscrypt_file = 0;
> +					else
> +						req->r_fscrypt_file = attr->ia_size;
> +					/* FIXME: client must zero out any partial blocks! */
> +				} else if (attr->ia_size > isize) {
> +					req->r_fscrypt_file = attr->ia_size;
> +				}
>  			} else {
>  				req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
>  				req->r_args.setattr.old_size = cpu_to_le64(isize);
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index 7f3976b3319d..856caeb25fb6 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -443,9 +443,9 @@ struct ceph_inode_info {
>  	struct fscache_cookie *fscache;
>  #endif
>  	u32 fscrypt_auth_len;
> -	u32 fscrypt_file_len;
>  	u8 *fscrypt_auth;
> -	u8 *fscrypt_file;
> +	/* need to zero the last block when decrypting the content to pagecache */
> +	size_t i_truncate_block_off;
>  

Ugh, do we really need yet another field in the inode? This seems
totally unnecessary, but maybe I'm missing some subtlety in the
truncation handling that requires this extra tracking.

What is this intended to represent anyway?

>  	errseq_t i_meta_err;
>  
> @@ -1192,6 +1192,10 @@ extern void ceph_put_cap_refs(struct ceph_inode_info *ci, int had);
>  extern void ceph_put_cap_refs_async(struct ceph_inode_info *ci, int had);
>  extern void ceph_put_cap_refs_no_check_caps(struct ceph_inode_info *ci,
>  					    int had);
> +extern void __ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
> +					    u64 start_pos, u64 end_pos);
> +extern void ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
> +					  u64 start_pos, u64 end_pos);
>  extern void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
>  				       struct ceph_snap_context *snapc);
>  extern void ceph_flush_snaps(struct ceph_inode_info *ci,
> @@ -1282,6 +1286,10 @@ extern int ceph_locks_to_pagelist(struct ceph_filelock *flocks,
>  extern void ceph_fs_debugfs_init(struct ceph_fs_client *client);
>  extern void ceph_fs_debugfs_cleanup(struct ceph_fs_client *client);
>  
> +extern void ceph_try_to_zero_truncate_block_off(struct ceph_inode_info *ci,
> +						u64 start_pos, u64 end_pos,
> +						struct page **pages);
> +
>  /* quota.c */
>  static inline bool __ceph_has_any_quota(struct ceph_inode_info *ci)
>  {
Xiubo Li Sept. 8, 2021, 9:37 a.m. UTC | #2
On 9/8/21 12:26 AM, Jeff Layton wrote:
> On Fri, 2021-09-03 at 16:15 +0800, xiubli@redhat.com wrote:
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> When truncating the file, it will leave the MDS to handle that,
>> but the MDS won't update the file contents. So when the fscrypt
>> is enabled, if the truncate size is not aligned to the block size,
>> the kclient will round up the truancate size to the block size and
>> leave the the last block untouched.
>>
>> The opaque fscrypt_file field will be used to tricker whether the
>> last block need to do the rmw to truncate the a specified block's
>> contents, we can get which block needs to do the rmw by round down
>> the fscrypt_file.
>>
>> In kclient side, there is not need to do the rmw immediately after
>> the file is truncated. We can defer doing that whenever the kclient
>> will update that block in late future. And before that any kclient
>> will check the fscrypt_file field when reading that block, if the
>> fscrypt_file field is none zero that means the related block needs
>> to be zeroed in range of [fscrypt_file, round_up(fscrypt_file + PAGE_SIZE))
>> in pagecache or readed data buffer.
>>
> s/PAGE_SIZE/CEPH_FSCRYPT_BLOCK_SIZE/
>
> Yes, on x86_64 they are equal, but that's not the case on all arches.
> Also, we are moving toward a pagecache that may hold larger pages on
> x86_64 too.
>   
Okay.
>> Once the that block contents are updated and writeback
>> kclient will reset the fscrypt_file field in MDS side, then 0 means
>> no need to care about the truncate stuff any more.
>>
> I'm a little unclear on what the fscrypt_file actually represents here.
>
> I had proposed that we just make the fscrypt_file field hold the
> "actual" i_size and we'd make the old size field always be a rounded-
> up version of the size. The MDS would treat that as an opaque value
> under Fw caps, and the client could use that field to determine i_size.
> That has a side benefit too -- if the client doesn't support fscrypt,
> it'll see the rounded-up sizes which are close enough and don't violate
> any POSIX rules.
>
> In your version, fscrypt_file also holds the actual size of the inode,
> but sometimes you're zeroing it out, and I don't understand why:

I think I forgot to fix this after I adapt to multiple ftruncates case, 
this patch is not correctly handling the "actual" file size.

I just want the fscrypt_file field always to hold the offset from which 
the contents needed to be zeroed, and the range should be [fscrypt_file, 
round_up(fscrypt_file +CEPH_FSCRYPT_BLOCK_SIZE)).

In single ftruncate case the fscrypt_file should equal to the "actual" 
file size. Then the "req->r_args.setattr.size = attr->ia_size" and 
"req->r_args.setattr.old_size = isize", no need to round up in 
__ceph_setattr() in kclient side, and leave the MDS to do that, but we 
need to pass the CEPH_FSCRYPT_BLOCK_SIZE at the same time.


But in multiple ftruncates case if the second ftruncate has a larger 
size, the fscrypt_file won't be changed and the ceph backend will help 
zero the extended part, but we still need to zero the contents in the 
first ftruncate. If the second ftruncate has a smaller size, the 
fscrypt_file need to be updated and always keeps the smaller size.


> 1) What benefit does this extra complexity provide?

Just in case for multiple ftruncates case as above.

> 2) once you zero out that value, how do you know what the real i_size
> is? I assume the old size field will hold a rounded-up size (mostly for
> the benefit of the MDS), so you don't have a record of the real i_size
> at that point.

As above mentioned, the MDS will help do the round up instead and the 
"setattr.size" and "setattr.old_size" won't do the round up any more in 
kclient side.

>
> Note that there is no reason you can't store _more_ than just a size in
> fscrypt_file. For example, If you need extra flags to indicate
> truncation/rmw state, then you could just make it hold a struct that has
> the size and a flags field.

Yeah, a struct is a good idea, something likes:

struct fscrypt_file {

     u64 truncate_offset;

     u8 fscrypt_block_size;  // in kbytes

     ...

};


>
>> There has one special case and that is when there have 2 ftruncates
>> are called:
>>
>> 1) If the second's size equals to the first one, do nothing about
>>     the fscrypt_file.
>> 2) If the second's size is smaller than the first one, then we need
>>     to update the fscrypt_file with new size.
>> 3) If the second's size is larger than the first one, then we must
>>     leave what the fscrypt_file is. Because we always need to truncate
>>     more.
>>
>> Add one CEPH_CLIENT_CAPS_RESET_FSCRYPT_FILE flag in the cap reqeust
>> to tell the MDS to reset the scrypt_file field once the specified
>> block has been updated, so there still need to adapt to this in the
>> MDS PR.
>>
>> And also this patch just assume that all the buffer and none buffer
>> read/write related enscrypt/descrypt work has been done.
>>
>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>> ---
>>   fs/ceph/addr.c  | 19 ++++++++++++++-
>>   fs/ceph/caps.c  | 24 +++++++++++++++++++
>>   fs/ceph/file.c  | 62 ++++++++++++++++++++++++++++++++++++++++++++++---
>>   fs/ceph/inode.c | 27 ++++++++++++++++++---
>>   fs/ceph/super.h | 12 ++++++++--
>>   5 files changed, 135 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>> index 6d3f74d46e5b..9f1dd2fc427d 100644
>> --- a/fs/ceph/addr.c
>> +++ b/fs/ceph/addr.c
>> @@ -212,6 +212,7 @@ static bool ceph_netfs_clamp_length(struct netfs_read_subrequest *subreq)
>>   static void finish_netfs_read(struct ceph_osd_request *req)
>>   {
>>   	struct ceph_fs_client *fsc = ceph_inode_to_client(req->r_inode);
>> +	struct ceph_inode_info *ci = ceph_inode(req->r_inode);
>>   	struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0);
>>   	struct netfs_read_subrequest *subreq = req->r_priv;
>>   	int num_pages;
>> @@ -234,8 +235,15 @@ static void finish_netfs_read(struct ceph_osd_request *req)
>>   
>>   	netfs_subreq_terminated(subreq, err, true);
>>   
>> +	/* FIXME: This should be done after descryped */
>> +	if (req->r_result > 0)
>> +		ceph_try_to_zero_truncate_block_off(ci, osd_data->alignment,
>> +						    osd_data->length,
>> +						    osd_data->pages);
>> +
> The 3rd arg to ceph_try_to_zero_truncate_block_off is end_pos, but here
> you're passing in the length of the OSD write. Doesn't that need to
> added to the pos of the write?

Yeah, it should be the page_off instead.


>
> I'm also a little unclear as to why you need to adjust the truncation
> offset at this point.

It won't, the ceph_try_to_zero_truncate_block_off() will only zero the 
readed pages in range of [fscrypt_off, round_up(fscrypt_file 
+CEPH_FSCRYPT_BLOCK_SIZE)).


>>   	num_pages = calc_pages_for(osd_data->alignment, osd_data->length);
>>   	ceph_put_page_vector(osd_data->pages, num_pages, false);
>> +
>>   	iput(req->r_inode);
>>   }
>>   
>> @@ -555,8 +563,10 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
>>   				  req->r_end_latency, len, err);
>>   
>>   	ceph_osdc_put_request(req);
>> -	if (err == 0)
>> +	if (err == 0) {
>> +		ceph_reset_truncate_block_off(ci, page_off, len);
>>   		err = len;
>> +	}
>>   
>>   	if (err < 0) {
>>   		struct writeback_control tmp_wbc;
>> @@ -661,10 +671,17 @@ static void writepages_finish(struct ceph_osd_request *req)
>>   					   (u64)osd_data->length);
>>   		total_pages += num_pages;
>>   		for (j = 0; j < num_pages; j++) {
>> +			u64 page_off;
>> +
>>   			page = osd_data->pages[j];
>>   			BUG_ON(!page);
>>   			WARN_ON(!PageUptodate(page));
>>   
>> +			page_off = osd_data->alignment + j * PAGE_SIZE;
>> +			if (rc >= 0)
>> +			    ceph_reset_truncate_block_off(ci, page_off,
>> +							  page_off + PAGE_SIZE);
>> +
>>   			if (atomic_long_dec_return(&fsc->writeback_count) <
>>   			     CONGESTION_OFF_THRESH(
>>   					fsc->mount_options->congestion_kb))
>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> index d628dcdbf869..a211ab4c3f7a 100644
>> --- a/fs/ceph/caps.c
>> +++ b/fs/ceph/caps.c
>> @@ -1425,6 +1425,9 @@ static vthatoid __prep_cap(struct cap_msg_args *arg, struct ceph_cap *cap,
>>   			}
>>   		}
>>   	}
>> +	if (ci->i_truncate_block_off < 0)
>> +		flags |= CEPH_CLIENT_CAPS_RESET_FSCRYPT_FILE;It sounds like you want to do something quite different from what I originally proposed.
>> +
>>   	arg->flags = flags;
>>   	arg->encrypted = IS_ENCRYPTED(inode);
>>   	if (ci->fscrypt_auth_len &&
>> @@ -3155,6 +3158,27 @@ void ceph_put_cap_refs_no_check_caps(struct ceph_inode_info *ci, int had)
>>   	__ceph_put_cap_refs(ci, had, PUT_CAP_REFS_NO_CHECK);
>>   }
>>   
>> +/*
>> + * Clear the i_truncate_block_off and fscrypt_file
>> + * if old last encrypted block has been updated.
>> + */
>> +void __ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
>> +				      u64 start_pos, u64 end_pos)
>> +{
>> +	if (ci->i_truncate_block_off > 0 &&
>> +	    ci->i_truncate_block_off >= start_pos &&
>> +	    ci->i_truncate_block_off < end_pos)
>> +		ci->i_truncate_block_off = 0;
>> +}
>> +
>> +void ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
>> +				    u64 start_pos, u64 end_pos)
>> +{
>> +	spin_lock(&ci->i_ceph_lock);
>> +	__ceph_reset_truncate_block_off(ci, start_pos, end_pos);
>> +	spin_unlock(&ci->i_ceph_lock);
>> +}
>> +
>>   /*
>>    * Release @nr WRBUFFER refs on dirty pages for the given @snapc snap
>>    * context.  Adjust per-snap dirty page accounting as appropriate.
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index 6e677b40410e..cfa4cbe08c10 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -885,10 +885,34 @@ static void fscrypt_adjust_off_and_len(struct inode *inode, u64 *off, u64 *len)
>>   	}
>>   }
>>   
>> +void ceph_try_to_zero_truncate_block_off(struct ceph_inode_info *ci,
>> +					 u64 start_pos, u64 end_pos,
>> +					 struct page **pages)
>> +{
>> +	u64 zoff, zlen;
>> +
>> +	spin_lock(&ci->i_ceph_lock);
>> +	if (ci->i_truncate_block_off >= start_pos &&
>> +			ci->i_truncate_block_off < end_pos) {
>> +		zoff = ci->i_truncate_block_off - start_pos;
>> +		zlen = round_up(ci->i_truncate_block_off, PAGE_SIZE) - ci->i_truncate_block_off;
>> +
>> +		spin_unlock(&ci->i_ceph_lock);
>> +		ceph_zero_page_vector_range(zoff, zlen, pages);
>> +		spin_lock(&ci->i_ceph_lock);
>> +	}
>> +	spin_unlock(&ci->i_ceph_lock);
>> +}
>>   #else
>>   static inline void fscrypt_adjust_off_and_len(struct inode *inode, u64 *off, u64 *len)
>>   {
>>   }
>> +
>> +void ceph_try_to_zero_truncate_block_off(struct ceph_inode_info *ci,
>> +					 u64 start_pos, u64 end_pos,
>> +					 struct page **pages)
>> +{
>> +}
>>   #endif
>>   
>>   /*
>> @@ -1030,6 +1054,13 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
>>   			ret += zlen;
>>   		}
>>   
>> +		/*
>> +		 * If the inode is ENCRYPTED the read_off is aligned to PAGE_SIZE
>> +		 */
>> +		ceph_try_to_zero_truncate_block_off(ci, read_off,
>> +						    read_off + read_len,
>> +						    pages);
>> +
>>   		idx = 0;
>>   		left = ret > 0 ? ret : 0;
>>   		while (left > 0) {
>> @@ -1413,12 +1444,34 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
>>   
>>   		size = i_size_read(inode);
>>   		if (!write) {
>> +			struct iov_iter i;
>> +			size_t boff;
>> +			int zlen;
>> +
>>   			if (ret == -ENOENT)
>>   				ret = 0;
>> +
>> +			/* Zero the truncate block off */
>> +			spin_lock(&ci->i_ceph_lock);
>> +			boff = ci->i_truncate_block_off;
>> +			if (IS_ENCRYPTED(inode) && ret > 0 && boff > 0 &&
>> +			    boff >= (iocb->ki_pos & PAGE_MASK) &&
>> +			    boff < round_up(ret, PAGE_SIZE)) {
>> +				int advance = 0;
>> +
>> +				zlen = round_up(boff, PAGE_SIZE) - boff;
>> +				if (ci->i_truncate_block_off >= iocb->ki_pos)
>> +					advance = boff - iocb->ki_pos;
>> +
>> +				iov_iter_bvec(&i, READ, bvecs, num_pages, len);
>> +				iov_iter_advance(&i, advance);
>> +				iov_iter_zero(zlen, &i);
>> +			}
>> +			spin_unlock(&ci->i_ceph_lock);
>> +
>>   			if (ret >= 0 && ret < len && pos + ret < size) {
>> -				struct iov_iter i;
>> -				int zlen = min_t(size_t, len - ret,
>> -						 size - pos - ret);
>> +				zlen = min_t(size_t, len - ret,
>> +					     size - pos - ret);
>>   
>>   				iov_iter_bvec(&i, READ, bvecs, num_pages, len);
>>   				iov_iter_advance(&i, ret);
>> @@ -1967,6 +2020,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>   	struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
>>   	struct ceph_osd_client *osdc = &fsc->client->osdc;
>>   	struct ceph_cap_flush *prealloc_cf;
>> +	u64 start_pos = iocb->ki_pos;
>>   	ssize_t count, written = 0;
>>   	int err, want, got;
>>   	bool direct_lock = false;
>> @@ -2110,6 +2164,8 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>   		int dirty;
>>   
>>   		spin_lock(&ci->i_ceph_lock);
>> +		__ceph_reset_truncate_block_off(ci, start_pos, iocb->ki_pos);
>> +
>>   		ci->i_inline_version = CEPH_INLINE_NONE;
>>   		dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR,
>>   					       &prealloc_cf);
>> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>> index 1a4c9bc485fc..c48c77c1bcf4 100644
>> --- a/fs/ceph/inode.c
>> +++ b/fs/ceph/inode.c
>> @@ -625,6 +625,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
>>   	ci->fscrypt_auth = NULL;
>>   	ci->fscrypt_auth_len = 0;
>>   #endif
>> +	ci->i_truncate_block_off = 0;
>>   
>>   	return &ci->vfs_inode;
>>   }
>> @@ -1033,11 +1034,24 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
>>   
>>   		pool_ns = old_ns;
>>   
>> +		/* the fscrypt_file is 0 means the file content truncate has been done */
>>   		if (IS_ENCRYPTED(inode) && size &&
>> -		    (iinfo->fscrypt_file_len == sizeof(__le64))) {
>> +		    iinfo->fscrypt_file_len == sizeof(__le64) &&
>> +		    __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file) > 0) {
>>   			size = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
>>   			if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
>>   				pr_warn("size=%llu fscrypt_file=%llu\n", info->size, size);
>> +
>> +			/*
>> +			 * If the second truncate come just after the first
>> +			 * truncate, and if the second has a larger size there
>> +			 * is no need to update the i_truncate_block_off.
>> +			 * Only when the second one has a smaller size, that
>> +			 * means we need to truncate more.
>> +			 */
>> +			if (ci->i_truncate_block_off > 0 &&
>> +			    size < ci->i_truncate_block_off)
>> +				ci->i_truncate_block_off = size;
>>   		}
>>   
>>   		queue_trunc = ceph_fill_file_size(inode, issued,
>> @@ -2390,8 +2404,15 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>>   				req->r_args.setattr.old_size =
>>   					cpu_to_le64(round_up(isize,
>>   							     CEPH_FSCRYPT_BLOCK_SIZE));
>> -				req->r_fscrypt_file = attr->ia_size;
>> -				/* FIXME: client must zero out any partial blocks! */
>> +				if (attr->ia_size < isize) {
>> +					if(IS_ALIGNED(attr->ia_size, CEPH_FSCRYPT_BLOCK_SIZE))
>> +						req->r_fscrypt_file = 0;
>> +					else
>> +						req->r_fscrypt_file = attr->ia_size;
>> +					/* FIXME: client must zero out any partial blocks! */
>> +				} else if (attr->ia_size > isize) {
>> +					req->r_fscrypt_file = attr->ia_size;
>> +				}
>>   			} else {
>>   				req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
>>   				req->r_args.setattr.old_size = cpu_to_le64(isize);
>> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
>> index 7f3976b3319d..856caeb25fb6 100644
>> --- a/fs/ceph/super.h
>> +++ b/fs/ceph/super.h
>> @@ -443,9 +443,9 @@ struct ceph_inode_info {
>>   	struct fscache_cookie *fscache;
>>   #endif
>>   	u32 fscrypt_auth_len;
>> -	u32 fscrypt_file_len;
>>   	u8 *fscrypt_auth;
>> -	u8 *fscrypt_file;
>> +	/* need to zero the last block when decrypting the content to pagecache */
>> +	size_t i_truncate_block_off;
>>   
> Ugh, do we really need yet another field in the inode? This seems
> totally unnecessary, but maybe I'm missing some subtlety in the
> truncation handling that requires this extra tracking.
>
> What is this intended to represent anyway?

Just a draft patch. And I remove the unused "u8 *fscrypt_file" member.

We can remove this and just reuse the "u8 *fscrypt_file" here.

After the inode is filled, this will keep the offset from which it needs 
to do the zero stuff after reading to the local pagecache or buffer.

>>   	errseq_t i_meta_err;
>>   
>> @@ -1192,6 +1192,10 @@ extern void ceph_put_cap_refs(struct ceph_inode_info *ci, int had);
>>   extern void ceph_put_cap_refs_async(struct ceph_inode_info *ci, int had);
>>   extern void ceph_put_cap_refs_no_check_caps(struct ceph_inode_info *ci,
>>   					    int had);
>> +extern void __ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
>> +					    u64 start_pos, u64 end_pos);
>> +extern void ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
>> +					  u64 start_pos, u64 end_pos);
>>   extern void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
>>   				       struct ceph_snap_context *snapc);
>>   extern void ceph_flush_snaps(struct ceph_inode_info *ci,
>> @@ -1282,6 +1286,10 @@ extern int ceph_locks_to_pagelist(struct ceph_filelock *flocks,
>>   extern void ceph_fs_debugfs_init(struct ceph_fs_client *client);
>>   extern void ceph_fs_debugfs_cleanup(struct ceph_fs_client *client);
>>   
>> +extern void ceph_try_to_zero_truncate_block_off(struct ceph_inode_info *ci,
>> +						u64 start_pos, u64 end_pos,
>> +						struct page **pages);
>> +
>>   /* quota.c */
>>   static inline bool __ceph_has_any_quota(struct ceph_inode_info *ci)
>>   {
Jeff Layton Sept. 8, 2021, 1:57 p.m. UTC | #3
On Wed, 2021-09-08 at 17:37 +0800, Xiubo Li wrote:
> On 9/8/21 12:26 AM, Jeff Layton wrote:
> > On Fri, 2021-09-03 at 16:15 +0800, xiubli@redhat.com wrote:
> > > From: Xiubo Li <xiubli@redhat.com>
> > > 
> > > When truncating the file, it will leave the MDS to handle that,
> > > but the MDS won't update the file contents. So when the fscrypt
> > > is enabled, if the truncate size is not aligned to the block size,
> > > the kclient will round up the truancate size to the block size and
> > > leave the the last block untouched.
> > > 
> > > The opaque fscrypt_file field will be used to tricker whether the
> > > last block need to do the rmw to truncate the a specified block's
> > > contents, we can get which block needs to do the rmw by round down
> > > the fscrypt_file.
> > > 
> > > In kclient side, there is not need to do the rmw immediately after
> > > the file is truncated. We can defer doing that whenever the kclient
> > > will update that block in late future. And before that any kclient
> > > will check the fscrypt_file field when reading that block, if the
> > > fscrypt_file field is none zero that means the related block needs
> > > to be zeroed in range of [fscrypt_file, round_up(fscrypt_file + PAGE_SIZE))
> > > in pagecache or readed data buffer.
> > > 
> > s/PAGE_SIZE/CEPH_FSCRYPT_BLOCK_SIZE/
> > 
> > Yes, on x86_64 they are equal, but that's not the case on all arches.
> > Also, we are moving toward a pagecache that may hold larger pages on
> > x86_64 too.
> >   
> Okay.
> > > Once the that block contents are updated and writeback
> > > kclient will reset the fscrypt_file field in MDS side, then 0 means
> > > no need to care about the truncate stuff any more.
> > > 
> > I'm a little unclear on what the fscrypt_file actually represents here.
> > 
> > I had proposed that we just make the fscrypt_file field hold the
> > "actual" i_size and we'd make the old size field always be a rounded-
> > up version of the size. The MDS would treat that as an opaque value
> > under Fw caps, and the client could use that field to determine i_size.
> > That has a side benefit too -- if the client doesn't support fscrypt,
> > it'll see the rounded-up sizes which are close enough and don't violate
> > any POSIX rules.
> > 
> > In your version, fscrypt_file also holds the actual size of the inode,
> > but sometimes you're zeroing it out, and I don't understand why:
> 
> I think I forgot to fix this after I adapt to multiple ftruncates case, 
> this patch is not correctly handling the "actual" file size.
> 
> I just want the fscrypt_file field always to hold the offset from which 
> the contents needed to be zeroed, and the range should be [fscrypt_file, 
> round_up(fscrypt_file +CEPH_FSCRYPT_BLOCK_SIZE)).
> 
> In single ftruncate case the fscrypt_file should equal to the "actual" 
> file size. Then the "req->r_args.setattr.size = attr->ia_size" and 
> "req->r_args.setattr.old_size = isize", no need to round up in 
> __ceph_setattr() in kclient side, and leave the MDS to do that, but we 
> need to pass the CEPH_FSCRYPT_BLOCK_SIZE at the same time.
> 

I'm really not a fan of pushing this logic into the MDS. Why does it
need to know anything about the CEPH_FSCRYPT_BLOCK_SIZE at all?

> 
> But in multiple ftruncates case if the second ftruncate has a larger 
> size, the fscrypt_file won't be changed and the ceph backend will help 
> zero the extended part, but we still need to zero the contents in the 
> first ftruncate. If the second ftruncate has a smaller size, the 
> fscrypt_file need to be updated and always keeps the smaller size.
> 

I don't get it. Maybe you can walk me through a concrete example of how
racing ftruncates are a problem?

Suppose we have a file and client1 truncates it down from a large size
to 7k. client1 then sends the MDS a SETATTR to truncate it at 8k, and
does a RMW on the last (4k) block. client2 comes along at the same time
and truncates it up to 13k. client2 issues a SETATTR to extend the file
to 16k and does a RMW on the last block too (which would presumably
already be all zeroes anyway).

Maybe I'm missing something obvious, but I don't see how any of this is
an issue? Where does the race come in?

> 
> > 1) What benefit does this extra complexity provide?
> 
> Just in case for multiple ftruncates case as above.
> 
> > 2) once you zero out that value, how do you know what the real i_size
> > is? I assume the old size field will hold a rounded-up size (mostly for
> > the benefit of the MDS), so you don't have a record of the real i_size
> > at that point.
> 
> As above mentioned, the MDS will help do the round up instead and the 
> "setattr.size" and "setattr.old_size" won't do the round up any more in 
> kclient side.
> 


I really don't understand what this is fixing and why you need to push
the rounding logic into the MDS. fscrypt is largely a client side
feature, and I think it'd be best to keep as much of the logic in the
client as possible.

> > 
> > Note that there is no reason you can't store _more_ than just a size in
> > fscrypt_file. For example, If you need extra flags to indicate
> > truncation/rmw state, then you could just make it hold a struct that has
> > the size and a flags field.
> 
> Yeah, a struct is a good idea, something likes:
> 
> struct fscrypt_file {
> 
>      u64 truncate_offset;
> 
>      u8 fscrypt_block_size;  // in kbytes
> 
>      ...
> 
> };
> 
> 
> > 
> > > There has one special case and that is when there have 2 ftruncates
> > > are called:
> > > 
> > > 1) If the second's size equals to the first one, do nothing about
> > >     the fscrypt_file.
> > > 2) If the second's size is smaller than the first one, then we need
> > >     to update the fscrypt_file with new size.
> > > 3) If the second's size is larger than the first one, then we must
> > >     leave what the fscrypt_file is. Because we always need to truncate
> > >     more.
> > > 
> > > Add one CEPH_CLIENT_CAPS_RESET_FSCRYPT_FILE flag in the cap reqeust
> > > to tell the MDS to reset the scrypt_file field once the specified
> > > block has been updated, so there still need to adapt to this in the
> > > MDS PR.
> > > 
> > > And also this patch just assume that all the buffer and none buffer
> > > read/write related enscrypt/descrypt work has been done.
> > > 
> > > Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > > ---
> > >   fs/ceph/addr.c  | 19 ++++++++++++++-
> > >   fs/ceph/caps.c  | 24 +++++++++++++++++++
> > >   fs/ceph/file.c  | 62 ++++++++++++++++++++++++++++++++++++++++++++++---
> > >   fs/ceph/inode.c | 27 ++++++++++++++++++---
> > >   fs/ceph/super.h | 12 ++++++++--
> > >   5 files changed, 135 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > index 6d3f74d46e5b..9f1dd2fc427d 100644
> > > --- a/fs/ceph/addr.c
> > > +++ b/fs/ceph/addr.c
> > > @@ -212,6 +212,7 @@ static bool ceph_netfs_clamp_length(struct netfs_read_subrequest *subreq)
> > >   static void finish_netfs_read(struct ceph_osd_request *req)
> > >   {
> > >   	struct ceph_fs_client *fsc = ceph_inode_to_client(req->r_inode);
> > > +	struct ceph_inode_info *ci = ceph_inode(req->r_inode);
> > >   	struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0);
> > >   	struct netfs_read_subrequest *subreq = req->r_priv;
> > >   	int num_pages;
> > > @@ -234,8 +235,15 @@ static void finish_netfs_read(struct ceph_osd_request *req)
> > >   
> > >   	netfs_subreq_terminated(subreq, err, true);
> > >   
> > > +	/* FIXME: This should be done after descryped */
> > > +	if (req->r_result > 0)
> > > +		ceph_try_to_zero_truncate_block_off(ci, osd_data->alignment,
> > > +						    osd_data->length,
> > > +						    osd_data->pages);
> > > +
> > The 3rd arg to ceph_try_to_zero_truncate_block_off is end_pos, but here
> > you're passing in the length of the OSD write. Doesn't that need to
> > added to the pos of the write?
> 
> Yeah, it should be the page_off instead.
> 
> 
> > 
> > I'm also a little unclear as to why you need to adjust the truncation
> > offset at this point.
> 
> It won't, the ceph_try_to_zero_truncate_block_off() will only zero the 
> readed pages in range of [fscrypt_off, round_up(fscrypt_file 
> +CEPH_FSCRYPT_BLOCK_SIZE)).
> 
> 
> > >   	num_pages = calc_pages_for(osd_data->alignment, osd_data->length);
> > >   	ceph_put_page_vector(osd_data->pages, num_pages, false);
> > > +
> > >   	iput(req->r_inode);
> > >   }
> > >   
> > > @@ -555,8 +563,10 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
> > >   				  req->r_end_latency, len, err);
> > >   
> > >   	ceph_osdc_put_request(req);
> > > -	if (err == 0)
> > > +	if (err == 0) {
> > > +		ceph_reset_truncate_block_off(ci, page_off, len);
> > >   		err = len;
> > > +	}
> > >   
> > >   	if (err < 0) {
> > >   		struct writeback_control tmp_wbc;
> > > @@ -661,10 +671,17 @@ static void writepages_finish(struct ceph_osd_request *req)
> > >   					   (u64)osd_data->length);
> > >   		total_pages += num_pages;
> > >   		for (j = 0; j < num_pages; j++) {
> > > +			u64 page_off;
> > > +
> > >   			page = osd_data->pages[j];
> > >   			BUG_ON(!page);
> > >   			WARN_ON(!PageUptodate(page));
> > >   
> > > +			page_off = osd_data->alignment + j * PAGE_SIZE;
> > > +			if (rc >= 0)
> > > +			    ceph_reset_truncate_block_off(ci, page_off,
> > > +							  page_off + PAGE_SIZE);
> > > +
> > >   			if (atomic_long_dec_return(&fsc->writeback_count) <
> > >   			     CONGESTION_OFF_THRESH(
> > >   					fsc->mount_options->congestion_kb))
> > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > index d628dcdbf869..a211ab4c3f7a 100644
> > > --- a/fs/ceph/caps.c
> > > +++ b/fs/ceph/caps.c
> > > @@ -1425,6 +1425,9 @@ static vthatoid __prep_cap(struct cap_msg_args *arg, struct ceph_cap *cap,
> > >   			}
> > >   		}
> > >   	}
> > > +	if (ci->i_truncate_block_off < 0)
> > > +		flags |= CEPH_CLIENT_CAPS_RESET_FSCRYPT_FILE;It sounds like you want to do something quite different from what I originally proposed.
> > > +
> > >   	arg->flags = flags;
> > >   	arg->encrypted = IS_ENCRYPTED(inode);
> > >   	if (ci->fscrypt_auth_len &&
> > > @@ -3155,6 +3158,27 @@ void ceph_put_cap_refs_no_check_caps(struct ceph_inode_info *ci, int had)
> > >   	__ceph_put_cap_refs(ci, had, PUT_CAP_REFS_NO_CHECK);
> > >   }
> > >   
> > > +/*
> > > + * Clear the i_truncate_block_off and fscrypt_file
> > > + * if old last encrypted block has been updated.
> > > + */
> > > +void __ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
> > > +				      u64 start_pos, u64 end_pos)
> > > +{
> > > +	if (ci->i_truncate_block_off > 0 &&
> > > +	    ci->i_truncate_block_off >= start_pos &&
> > > +	    ci->i_truncate_block_off < end_pos)
> > > +		ci->i_truncate_block_off = 0;
> > > +}
> > > +
> > > +void ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
> > > +				    u64 start_pos, u64 end_pos)
> > > +{
> > > +	spin_lock(&ci->i_ceph_lock);
> > > +	__ceph_reset_truncate_block_off(ci, start_pos, end_pos);
> > > +	spin_unlock(&ci->i_ceph_lock);
> > > +}
> > > +
> > >   /*
> > >    * Release @nr WRBUFFER refs on dirty pages for the given @snapc snap
> > >    * context.  Adjust per-snap dirty page accounting as appropriate.
> > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > index 6e677b40410e..cfa4cbe08c10 100644
> > > --- a/fs/ceph/file.c
> > > +++ b/fs/ceph/file.c
> > > @@ -885,10 +885,34 @@ static void fscrypt_adjust_off_and_len(struct inode *inode, u64 *off, u64 *len)
> > >   	}
> > >   }
> > >   
> > > +void ceph_try_to_zero_truncate_block_off(struct ceph_inode_info *ci,
> > > +					 u64 start_pos, u64 end_pos,
> > > +					 struct page **pages)
> > > +{
> > > +	u64 zoff, zlen;
> > > +
> > > +	spin_lock(&ci->i_ceph_lock);
> > > +	if (ci->i_truncate_block_off >= start_pos &&
> > > +			ci->i_truncate_block_off < end_pos) {
> > > +		zoff = ci->i_truncate_block_off - start_pos;
> > > +		zlen = round_up(ci->i_truncate_block_off, PAGE_SIZE) - ci->i_truncate_block_off;
> > > +
> > > +		spin_unlock(&ci->i_ceph_lock);
> > > +		ceph_zero_page_vector_range(zoff, zlen, pages);
> > > +		spin_lock(&ci->i_ceph_lock);
> > > +	}
> > > +	spin_unlock(&ci->i_ceph_lock);
> > > +}
> > >   #else
> > >   static inline void fscrypt_adjust_off_and_len(struct inode *inode, u64 *off, u64 *len)
> > >   {
> > >   }
> > > +
> > > +void ceph_try_to_zero_truncate_block_off(struct ceph_inode_info *ci,
> > > +					 u64 start_pos, u64 end_pos,
> > > +					 struct page **pages)
> > > +{
> > > +}
> > >   #endif
> > >   
> > >   /*
> > > @@ -1030,6 +1054,13 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
> > >   			ret += zlen;
> > >   		}
> > >   
> > > +		/*
> > > +		 * If the inode is ENCRYPTED the read_off is aligned to PAGE_SIZE
> > > +		 */
> > > +		ceph_try_to_zero_truncate_block_off(ci, read_off,
> > > +						    read_off + read_len,
> > > +						    pages);
> > > +
> > >   		idx = 0;
> > >   		left = ret > 0 ? ret : 0;
> > >   		while (left > 0) {
> > > @@ -1413,12 +1444,34 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
> > >   
> > >   		size = i_size_read(inode);
> > >   		if (!write) {
> > > +			struct iov_iter i;
> > > +			size_t boff;
> > > +			int zlen;
> > > +
> > >   			if (ret == -ENOENT)
> > >   				ret = 0;
> > > +
> > > +			/* Zero the truncate block off */
> > > +			spin_lock(&ci->i_ceph_lock);
> > > +			boff = ci->i_truncate_block_off;
> > > +			if (IS_ENCRYPTED(inode) && ret > 0 && boff > 0 &&
> > > +			    boff >= (iocb->ki_pos & PAGE_MASK) &&
> > > +			    boff < round_up(ret, PAGE_SIZE)) {
> > > +				int advance = 0;
> > > +rovide?
> > > +				zlen = round_up(boff, PAGE_SIZE) - boff;
> > > +				if (ci->i_truncate_block_off >= iocb->ki_pos)
> > > +					advance = boff - iocb->ki_pos;
> > > +
> > > +				iov_iter_bvec(&i, READ, bvecs, num_pages, len);
> > > +				iov_iter_advance(&i, advance);
> > > +				iov_iter_zero(zlen, &i);
> > > +			}
> > > +			spin_unlock(&ci->i_ceph_lock);
> > > +
> > >   			if (ret >= 0 && ret < len && pos + ret < size) {
> > > -				struct iov_iter i;
> > > -				int zlen = min_t(size_t, len - ret,
> > > -						 size - pos - ret);
> > > +				zlen = min_t(size_t, len - ret,
> > > +					     size - pos - ret);
> > >   
> > >   				iov_iter_bvec(&i, READ, bvecs, num_pages, len);
> > >   				iov_iter_advance(&i, ret);
> > > @@ -1967,6 +2020,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > >   	struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
> > >   	struct ceph_osd_client *osdc = &fsc->client->osdc;
> > >   	struct ceph_cap_flush *prealloc_cf;
> > > +	u64 start_pos = iocb->ki_pos;
> > >   	ssize_t count, written = 0;
> > >   	int err, want, got;
> > >   	bool direct_lock = false;
> > > @@ -2110,6 +2164,8 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > >   		int dirty;
> > >   
> > >   		spin_lock(&ci->i_ceph_lock);
> > > +		__ceph_reset_truncate_block_off(ci, start_pos, iocb->ki_pos);
> > > +
> > >   		ci->i_inline_version = CEPH_INLINE_NONE;
> > >   		dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR,
> > >   					       &prealloc_cf);
> > > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > > index 1a4c9bc485fc..c48c77c1bcf4 100644
> > > --- a/fs/ceph/inode.c
> > > +++ b/fs/ceph/inode.c
> > > @@ -625,6 +625,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
> > >   	ci->fscrypt_auth = NULL;
> > >   	ci->fscrypt_auth_len = 0;
> > >   #endif
> > > +	ci->i_truncate_block_off = 0;
> > >   
> > >   	return &ci->vfs_inode;
> > >   }
> > > @@ -1033,11 +1034,24 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
> > >   
> > >   		pool_ns = old_ns;
> > >   
> > > +		/* the fscrypt_file is 0 means the file content truncate has been done */
> > >   		if (IS_ENCRYPTED(inode) && size &&
> > > -		    (iinfo->fscrypt_file_len == sizeof(__le64))) {
> > > +		    iinfo->fscrypt_file_len == sizeof(__le64) &&
> > > +		    __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file) > 0) {
> > >   			size = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
> > >   			if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
> > >   				pr_warn("size=%llu fscrypt_file=%llu\n", info->size, size);
> > > +
> > > +			/*
> > > +			 * If the second truncate come just after the first
> > > +			 * truncate, and if the second has a larger size there
> > > +			 * is no need to update the i_truncate_block_off.
> > > +			 * Only when the second one has a smaller size, that
> > > +			 * means we need to truncate more.
> > > +			 */
> > > +			if (ci->i_truncate_block_off > 0 &&
> > > +			    size < ci->i_truncate_block_off)
> > > +				ci->i_truncate_block_off = size;
> > >   		}
> > >   
> > >   		queue_trunc = ceph_fill_file_size(inode, issued,
> > > @@ -2390,8 +2404,15 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
> > >   				req->r_args.setattr.old_size =
> > >   					cpu_to_le64(round_up(isize,
> > >   							     CEPH_FSCRYPT_BLOCK_SIZE));
> > > -				req->r_fscrypt_file = attr->ia_size;
> > > -				/* FIXME: client must zero out any partial blocks! */
> > > +				if (attr->ia_size < isize) {
> > > +					if(IS_ALIGNED(attr->ia_size, CEPH_FSCRYPT_BLOCK_SIZE))
> > > +						req->r_fscrypt_file = 0;
> > > +					else
> > > +						req->r_fscrypt_file = attr->ia_size;
> > > +					/* FIXME: client must zero out any partial blocks! */
> > > +				} else if (attr->ia_size > isize) {
> > > +					req->r_fscrypt_file = attr->ia_size;
> > > +				}
> > >   			} else {
> > >   				req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
> > >   				req->r_args.setattr.old_size = cpu_to_le64(isize);
> > > diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> > > index 7f3976b3319d..856caeb25fb6 100644
> > > --- a/fs/ceph/super.h
> > > +++ b/fs/ceph/super.h
> > > @@ -443,9 +443,9 @@ struct ceph_inode_info {
> > >   	struct fscache_cookie *fscache;
> > >   #endif
> > >   	u32 fscrypt_auth_len;
> > > -	u32 fscrypt_file_len;
> > >   	u8 *fscrypt_auth;
> > > -	u8 *fscrypt_file;
> > > +	/* need to zero the last block when decrypting the content to pagecache */
> > > +	size_t i_truncate_block_off;
> > >   
> > Ugh, do we really need yet another field in the inode? This seems
> > totally unnecessary, but maybe I'm missing some subtlety in the
> > truncation handling that requires this extra tracking.
> > 
> > What is this intended to represent anyway?
> 
> Just a draft patch. And I remove the unused "u8 *fscrypt_file" member.
> 
> We can remove this and just reuse the "u8 *fscrypt_file" here.
> 
> After the inode is filled, this will keep the offset from which it needs 
> to do the zero stuff after reading to the local pagecache or buffer.
> 
> > >   	errseq_t i_meta_err;
> > >   
> > > @@ -1192,6 +1192,10 @@ extern void ceph_put_cap_refs(struct ceph_inode_info *ci, int had);
> > >   extern void ceph_put_cap_refs_async(struct ceph_inode_info *ci, int had);
> > >   extern void ceph_put_cap_refs_no_check_caps(struct ceph_inode_info *ci,
> > >   					    int had);
> > > +extern void __ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
> > > +					    u64 start_pos, u64 end_pos);
> > > +extern void ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
> > > +					  u64 start_pos, u64 end_pos);
> > >   extern void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
> > >   				       struct ceph_snap_context *snapc);
> > >   extern void ceph_flush_snaps(struct ceph_inode_info *ci,
> > > @@ -1282,6 +1286,10 @@ extern int ceph_locks_to_pagelist(struct ceph_filelock *flocks,
> > >   extern void ceph_fs_debugfs_init(struct ceph_fs_client *client);
> > >   extern void ceph_fs_debugfs_cleanup(struct ceph_fs_client *client);
> > >   
> > > +extern void ceph_try_to_zero_truncate_block_off(struct ceph_inode_info *ci,
> > > +						u64 start_pos, u64 end_pos,
> > > +						struct page **pages);
> > > +
> > >   /* quota.c */
> > >   static inline bool __ceph_has_any_quota(struct ceph_inode_info *ci)
> > >   {
>
Xiubo Li Sept. 9, 2021, 3:38 a.m. UTC | #4
On 9/8/21 9:57 PM, Jeff Layton wrote:
> On Wed, 2021-09-08 at 17:37 +0800, Xiubo Li wrote:
>> On 9/8/21 12:26 AM, Jeff Layton wrote:
>>> On Fri, 2021-09-03 at 16:15 +0800, xiubli@redhat.com wrote:
>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>
>>>> When truncating the file, it will leave the MDS to handle that,
>>>> but the MDS won't update the file contents. So when the fscrypt
>>>> is enabled, if the truncate size is not aligned to the block size,
>>>> the kclient will round up the truancate size to the block size and
>>>> leave the the last block untouched.
>>>>
>>>> The opaque fscrypt_file field will be used to tricker whether the
>>>> last block need to do the rmw to truncate the a specified block's
>>>> contents, we can get which block needs to do the rmw by round down
>>>> the fscrypt_file.
>>>>
>>>> In kclient side, there is not need to do the rmw immediately after
>>>> the file is truncated. We can defer doing that whenever the kclient
>>>> will update that block in late future. And before that any kclient
>>>> will check the fscrypt_file field when reading that block, if the
>>>> fscrypt_file field is none zero that means the related block needs
>>>> to be zeroed in range of [fscrypt_file, round_up(fscrypt_file + PAGE_SIZE))
>>>> in pagecache or readed data buffer.
>>>>
>>> s/PAGE_SIZE/CEPH_FSCRYPT_BLOCK_SIZE/
>>>
>>> Yes, on x86_64 they are equal, but that's not the case on all arches.
>>> Also, we are moving toward a pagecache that may hold larger pages on
>>> x86_64 too.
>>>    
>> Okay.
>>>> Once the that block contents are updated and writeback
>>>> kclient will reset the fscrypt_file field in MDS side, then 0 means
>>>> no need to care about the truncate stuff any more.
>>>>
>>> I'm a little unclear on what the fscrypt_file actually represents here.
>>>
>>> I had proposed that we just make the fscrypt_file field hold the
>>> "actual" i_size and we'd make the old size field always be a rounded-
>>> up version of the size. The MDS would treat that as an opaque value
>>> under Fw caps, and the client could use that field to determine i_size.
>>> That has a side benefit too -- if the client doesn't support fscrypt,
>>> it'll see the rounded-up sizes which are close enough and don't violate
>>> any POSIX rules.
>>>
>>> In your version, fscrypt_file also holds the actual size of the inode,
>>> but sometimes you're zeroing it out, and I don't understand why:
>> I think I forgot to fix this after I adapt to multiple ftruncates case,
>> this patch is not correctly handling the "actual" file size.
>>
>> I just want the fscrypt_file field always to hold the offset from which
>> the contents needed to be zeroed, and the range should be [fscrypt_file,
>> round_up(fscrypt_file +CEPH_FSCRYPT_BLOCK_SIZE)).
>>
>> In single ftruncate case the fscrypt_file should equal to the "actual"
>> file size. Then the "req->r_args.setattr.size = attr->ia_size" and
>> "req->r_args.setattr.old_size = isize", no need to round up in
>> __ceph_setattr() in kclient side, and leave the MDS to do that, but we
>> need to pass the CEPH_FSCRYPT_BLOCK_SIZE at the same time.
>>
> I'm really not a fan of pushing this logic into the MDS. Why does it
> need to know anything about the CEPH_FSCRYPT_BLOCK_SIZE at all?

 From your current patch set, you are rounding up the 
"req->r_args.setattr.size" and "req->r_args.setattr.old_size" to the 
BLOCK end in __ceph_setattr().

Without considering keep the file scryption logic in kclient only, I 
need the "req->r_args.setattr.size" to keep the file's real size.

Since the MDS will do the truncate stuff. And if we won't round the 
"req->r_args.setattr.size" up to the BLOCK end any more, then the MDS 
needs to know whether and how to round up the file size to the block end 
when truncating the file. Because the fscrypt_file won't record the 
file's real size any more, it maybe zero, more detail please see the 
example below.

Yeah, but as you mentioned bellow if we will keep the file scryption 
logic in kclient only, I need one extra field to do the defer rmw:

struct fscrypt_file {

     u64 file_real_size;         // always keep the file's real size and 
the "req->r_args.setattr.size = round_up(file_real_size, BLOCK_SIZE)" as 
you do in your current patch set.

     u64 file_truncate_offset;  // this will always record in which 
BLOCK we need to do the rmw, this maybe 0 or located in the file's LAST 
block and maybe not, more detail please the example below.

}

The "file_truncate_offset" member will be what I need to do the defer rmw.


>> But in multiple ftruncates case if the second ftruncate has a larger
>> size, the fscrypt_file won't be changed and the ceph backend will help
>> zero the extended part, but we still need to zero the contents in the
>> first ftruncate. If the second ftruncate has a smaller size, the
>> fscrypt_file need to be updated and always keeps the smaller size.
>>
> I don't get it. Maybe you can walk me through a concrete example of how
> racing ftruncates are a problem?

Sorry for confusing.

For example:

1), if there has a file named "bar", and currently the size is 100K 
bytes, the CEPH_FSCRYPT_BLOCK_SIZE size equals to 4K.

2), first ftruncate("bar", 7K) comes, then both the file real size and 
"file_truncate_offset" will be set to 7K, then in MDS it will truncate 
the file from 8K, and the last block's [7K, 8K) contents need to be 
zeroed anyway later.

3), immediately a second ftruncate("bar", 16K) comes, from the ftruncate 
man page it says the new extended [7K, 16K) should be zeroed when 
truncating the file. That means the OSD should help zero the [8K, 16K), 
but won't touch the block [4K, 8K), in which the [7K, 8K) contents still 
needs to be zeroed. So in this case the "file_truncate_offset" won't be 
changed and still be 7K. Then the "file_truncate_offset" won't be 
located in the last block of the file any more.

4), if the second ftruncate in step 3) the new size is 3K, then the MDS 
will truncate the file from 4K and [7K, 8K) contents will be discard 
anyway, so we need to update the "file_truncate_offset" to 3K, that 
means a new BLOCK [0K, 4K) needs to do the rmw, by zeroing the [3K, 4K).

5), if the new truncate in step 3) the new size is 4K, since the 4K < 7K 
and 4K is aligned to the BLOCK size, so no need to rmw any block any 
more, then we can just clear the "file_truncate_offset" field.


For defer RMW logic please see the following example:



> Suppose we have a file and client1 truncates it down from a large size
> to 7k. client1 then sends the MDS a SETATTR to truncate it at 8k, and
> does a RMW on the last (4k) block. client2 comes along at the same time
> and truncates it up to 13k. client2 issues a SETATTR to extend the file
> to 16k and does a RMW on the last block too (which would presumably
> already be all zeroes anyway).

I think you meant in the none defer RMW case, this is not what my defer 
RMW approach will do.

After the client1 truncated the file, it won't do the RMW if it won't 
write any data to that file, and then we assume the client1 is unmounted 
immediately.

And when the client1 is truncating the file, it will update the 
"file_truncate_offset" to 7K, which means the [7K, 8K) in the LAST block 
needs to be zeroed.

Then the client2 reads that the file size is 7K and the 
"file_truncate_offset" is 7K too, and the client2 wants to truncate the 
file up to 13K. Since the OSD should help us zero the extended part [8K, 
13K] when truncating, but won't touch the block [4K, 8K), for which it 
still needs to do the RMW. Then the client2 is unmounted too before 
writing any data to the file. After this the "file_truncate_offset" 
won't be located in the file's LAST block any more.

After that if the client3 will update the whole file's contents, it will 
read all the file 13K bytes contents to local page buffers, since the 
"file_truncate_offset" is 7K and then in the page buffer the range [7K, 
8K) will be zeroed just after the contents are dencrypted inplace. Then 
if the client3 successfully flushes that dirty data back and then the 
deferred RMW for block [4K, 8K) should be done at the same time, and the 
"file_truncate_offset" should be cleared too.

While if the client3 won't update the block [4K, 8K), the 
"file_truncate_offset" will be kept all the time until the above RMW is 
done in future.


BRs

- Xiubo


>
> Maybe I'm missing something obvious, but I don't see how any of this is
> an issue? Where does the race come in?
>
>>> 1) What benefit does this extra complexity provide?
>> Just in case for multiple ftruncates case as above.
>>
>>> 2) once you zero out that value, how do you know what the real i_size
>>> is? I assume the old size field will hold a rounded-up size (mostly for
>>> the benefit of the MDS), so you don't have a record of the real i_size
>>> at that point.
>> As above mentioned, the MDS will help do the round up instead and the
>> "setattr.size" and "setattr.old_size" won't do the round up any more in
>> kclient side.
>>
>
> I really don't understand what this is fixing and why you need to push
> the rounding logic into the MDS. fscrypt is largely a client side
> feature, and I think it'd be best to keep as much of the logic in the
> client as possible.

>
>>> Note that there is no reason you can't store _more_ than just a size in
>>> fscrypt_file. For example, If you need extra flags to indicate
>>> truncation/rmw state, then you could just make it hold a struct that has
>>> the size and a flags field.
>> Yeah, a struct is a good idea, something likes:
>>
>> struct fscrypt_file {
>>
>>       u64 truncate_offset;
>>
>>       u8 fscrypt_block_size;  // in kbytes
>>
>>       ...
>>
>> };
>>
>>
>>>> There has one special case and that is when there have 2 ftruncates
>>>> are called:
>>>>
>>>> 1) If the second's size equals to the first one, do nothing about
>>>>      the fscrypt_file.
>>>> 2) If the second's size is smaller than the first one, then we need
>>>>      to update the fscrypt_file with new size.
>>>> 3) If the second's size is larger than the first one, then we must
>>>>      leave what the fscrypt_file is. Because we always need to truncate
>>>>      more.
>>>>
>>>> Add one CEPH_CLIENT_CAPS_RESET_FSCRYPT_FILE flag in the cap reqeust
>>>> to tell the MDS to reset the scrypt_file field once the specified
>>>> block has been updated, so there still need to adapt to this in the
>>>> MDS PR.
>>>>
>>>> And also this patch just assume that all the buffer and none buffer
>>>> read/write related enscrypt/descrypt work has been done.
>>>>
>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>> ---
>>>>    fs/ceph/addr.c  | 19 ++++++++++++++-
>>>>    fs/ceph/caps.c  | 24 +++++++++++++++++++
>>>>    fs/ceph/file.c  | 62 ++++++++++++++++++++++++++++++++++++++++++++++---
>>>>    fs/ceph/inode.c | 27 ++++++++++++++++++---
>>>>    fs/ceph/super.h | 12 ++++++++--
>>>>    5 files changed, 135 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>>>> index 6d3f74d46e5b..9f1dd2fc427d 100644
>>>> --- a/fs/ceph/addr.c
>>>> +++ b/fs/ceph/addr.c
>>>> @@ -212,6 +212,7 @@ static bool ceph_netfs_clamp_length(struct netfs_read_subrequest *subreq)
>>>>    static void finish_netfs_read(struct ceph_osd_request *req)
>>>>    {
>>>>    	struct ceph_fs_client *fsc = ceph_inode_to_client(req->r_inode);
>>>> +	struct ceph_inode_info *ci = ceph_inode(req->r_inode);
>>>>    	struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0);
>>>>    	struct netfs_read_subrequest *subreq = req->r_priv;
>>>>    	int num_pages;
>>>> @@ -234,8 +235,15 @@ static void finish_netfs_read(struct ceph_osd_request *req)
>>>>    
>>>>    	netfs_subreq_terminated(subreq, err, true);
>>>>    
>>>> +	/* FIXME: This should be done after descryped */
>>>> +	if (req->r_result > 0)
>>>> +		ceph_try_to_zero_truncate_block_off(ci, osd_data->alignment,
>>>> +						    osd_data->length,
>>>> +						    osd_data->pages);
>>>> +
>>> The 3rd arg to ceph_try_to_zero_truncate_block_off is end_pos, but here
>>> you're passing in the length of the OSD write. Doesn't that need to
>>> added to the pos of the write?
>> Yeah, it should be the page_off instead.
>>
>>
>>> I'm also a little unclear as to why you need to adjust the truncation
>>> offset at this point.
>> It won't, the ceph_try_to_zero_truncate_block_off() will only zero the
>> readed pages in range of [fscrypt_off, round_up(fscrypt_file
>> +CEPH_FSCRYPT_BLOCK_SIZE)).
>>
>>
>>>>    	num_pages = calc_pages_for(osd_data->alignment, osd_data->length);
>>>>    	ceph_put_page_vector(osd_data->pages, num_pages, false);
>>>> +
>>>>    	iput(req->r_inode);
>>>>    }
>>>>    
>>>> @@ -555,8 +563,10 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
>>>>    				  req->r_end_latency, len, err);
>>>>    
>>>>    	ceph_osdc_put_request(req);
>>>> -	if (err == 0)
>>>> +	if (err == 0) {
>>>> +		ceph_reset_truncate_block_off(ci, page_off, len);
>>>>    		err = len;
>>>> +	}
>>>>    
>>>>    	if (err < 0) {
>>>>    		struct writeback_control tmp_wbc;
>>>> @@ -661,10 +671,17 @@ static void writepages_finish(struct ceph_osd_request *req)
>>>>    					   (u64)osd_data->length);
>>>>    		total_pages += num_pages;
>>>>    		for (j = 0; j < num_pages; j++) {
>>>> +			u64 page_off;
>>>> +
>>>>    			page = osd_data->pages[j];
>>>>    			BUG_ON(!page);
>>>>    			WARN_ON(!PageUptodate(page));
>>>>    
>>>> +			page_off = osd_data->alignment + j * PAGE_SIZE;
>>>> +			if (rc >= 0)
>>>> +			    ceph_reset_truncate_block_off(ci, page_off,
>>>> +							  page_off + PAGE_SIZE);
>>>> +
>>>>    			if (atomic_long_dec_return(&fsc->writeback_count) <
>>>>    			     CONGESTION_OFF_THRESH(
>>>>    					fsc->mount_options->congestion_kb))
>>>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>>>> index d628dcdbf869..a211ab4c3f7a 100644
>>>> --- a/fs/ceph/caps.c
>>>> +++ b/fs/ceph/caps.c
>>>> @@ -1425,6 +1425,9 @@ static vthatoid __prep_cap(struct cap_msg_args *arg, struct ceph_cap *cap,
>>>>    			}
>>>>    		}
>>>>    	}
>>>> +	if (ci->i_truncate_block_off < 0)
>>>> +		flags |= CEPH_CLIENT_CAPS_RESET_FSCRYPT_FILE;It sounds like you want to do something quite different from what I originally proposed.
>>>> +
>>>>    	arg->flags = flags;
>>>>    	arg->encrypted = IS_ENCRYPTED(inode);
>>>>    	if (ci->fscrypt_auth_len &&
>>>> @@ -3155,6 +3158,27 @@ void ceph_put_cap_refs_no_check_caps(struct ceph_inode_info *ci, int had)
>>>>    	__ceph_put_cap_refs(ci, had, PUT_CAP_REFS_NO_CHECK);
>>>>    }
>>>>    
>>>> +/*
>>>> + * Clear the i_truncate_block_off and fscrypt_file
>>>> + * if old last encrypted block has been updated.
>>>> + */
>>>> +void __ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
>>>> +				      u64 start_pos, u64 end_pos)
>>>> +{
>>>> +	if (ci->i_truncate_block_off > 0 &&
>>>> +	    ci->i_truncate_block_off >= start_pos &&
>>>> +	    ci->i_truncate_block_off < end_pos)
>>>> +		ci->i_truncate_block_off = 0;
>>>> +}
>>>> +
>>>> +void ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
>>>> +				    u64 start_pos, u64 end_pos)
>>>> +{
>>>> +	spin_lock(&ci->i_ceph_lock);
>>>> +	__ceph_reset_truncate_block_off(ci, start_pos, end_pos);
>>>> +	spin_unlock(&ci->i_ceph_lock);
>>>> +}
>>>> +
>>>>    /*
>>>>     * Release @nr WRBUFFER refs on dirty pages for the given @snapc snap
>>>>     * context.  Adjust per-snap dirty page accounting as appropriate.
>>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>>> index 6e677b40410e..cfa4cbe08c10 100644
>>>> --- a/fs/ceph/file.c
>>>> +++ b/fs/ceph/file.c
>>>> @@ -885,10 +885,34 @@ static void fscrypt_adjust_off_and_len(struct inode *inode, u64 *off, u64 *len)
>>>>    	}
>>>>    }
>>>>    
>>>> +void ceph_try_to_zero_truncate_block_off(struct ceph_inode_info *ci,
>>>> +					 u64 start_pos, u64 end_pos,
>>>> +					 struct page **pages)
>>>> +{
>>>> +	u64 zoff, zlen;
>>>> +
>>>> +	spin_lock(&ci->i_ceph_lock);
>>>> +	if (ci->i_truncate_block_off >= start_pos &&
>>>> +			ci->i_truncate_block_off < end_pos) {
>>>> +		zoff = ci->i_truncate_block_off - start_pos;
>>>> +		zlen = round_up(ci->i_truncate_block_off, PAGE_SIZE) - ci->i_truncate_block_off;
>>>> +
>>>> +		spin_unlock(&ci->i_ceph_lock);
>>>> +		ceph_zero_page_vector_range(zoff, zlen, pages);
>>>> +		spin_lock(&ci->i_ceph_lock);
>>>> +	}
>>>> +	spin_unlock(&ci->i_ceph_lock);
>>>> +}
>>>>    #else
>>>>    static inline void fscrypt_adjust_off_and_len(struct inode *inode, u64 *off, u64 *len)
>>>>    {
>>>>    }
>>>> +
>>>> +void ceph_try_to_zero_truncate_block_off(struct ceph_inode_info *ci,
>>>> +					 u64 start_pos, u64 end_pos,
>>>> +					 struct page **pages)
>>>> +{
>>>> +}
>>>>    #endif
>>>>    
>>>>    /*
>>>> @@ -1030,6 +1054,13 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
>>>>    			ret += zlen;
>>>>    		}
>>>>    
>>>> +		/*
>>>> +		 * If the inode is ENCRYPTED the read_off is aligned to PAGE_SIZE
>>>> +		 */
>>>> +		ceph_try_to_zero_truncate_block_off(ci, read_off,
>>>> +						    read_off + read_len,
>>>> +						    pages);
>>>> +
>>>>    		idx = 0;
>>>>    		left = ret > 0 ? ret : 0;
>>>>    		while (left > 0) {
>>>> @@ -1413,12 +1444,34 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
>>>>    
>>>>    		size = i_size_read(inode);
>>>>    		if (!write) {
>>>> +			struct iov_iter i;
>>>> +			size_t boff;
>>>> +			int zlen;
>>>> +
>>>>    			if (ret == -ENOENT)
>>>>    				ret = 0;
>>>> +
>>>> +			/* Zero the truncate block off */
>>>> +			spin_lock(&ci->i_ceph_lock);
>>>> +			boff = ci->i_truncate_block_off;
>>>> +			if (IS_ENCRYPTED(inode) && ret > 0 && boff > 0 &&
>>>> +			    boff >= (iocb->ki_pos & PAGE_MASK) &&
>>>> +			    boff < round_up(ret, PAGE_SIZE)) {
>>>> +				int advance = 0;
>>>> +rovide?
>>>> +				zlen = round_up(boff, PAGE_SIZE) - boff;
>>>> +				if (ci->i_truncate_block_off >= iocb->ki_pos)
>>>> +					advance = boff - iocb->ki_pos;
>>>> +
>>>> +				iov_iter_bvec(&i, READ, bvecs, num_pages, len);
>>>> +				iov_iter_advance(&i, advance);
>>>> +				iov_iter_zero(zlen, &i);
>>>> +			}
>>>> +			spin_unlock(&ci->i_ceph_lock);
>>>> +
>>>>    			if (ret >= 0 && ret < len && pos + ret < size) {
>>>> -				struct iov_iter i;
>>>> -				int zlen = min_t(size_t, len - ret,
>>>> -						 size - pos - ret);
>>>> +				zlen = min_t(size_t, len - ret,
>>>> +					     size - pos - ret);
>>>>    
>>>>    				iov_iter_bvec(&i, READ, bvecs, num_pages, len);
>>>>    				iov_iter_advance(&i, ret);
>>>> @@ -1967,6 +2020,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>>>    	struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
>>>>    	struct ceph_osd_client *osdc = &fsc->client->osdc;
>>>>    	struct ceph_cap_flush *prealloc_cf;
>>>> +	u64 start_pos = iocb->ki_pos;
>>>>    	ssize_t count, written = 0;
>>>>    	int err, want, got;
>>>>    	bool direct_lock = false;
>>>> @@ -2110,6 +2164,8 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>>>    		int dirty;
>>>>    
>>>>    		spin_lock(&ci->i_ceph_lock);
>>>> +		__ceph_reset_truncate_block_off(ci, start_pos, iocb->ki_pos);
>>>> +
>>>>    		ci->i_inline_version = CEPH_INLINE_NONE;
>>>>    		dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR,
>>>>    					       &prealloc_cf);
>>>> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>>>> index 1a4c9bc485fc..c48c77c1bcf4 100644
>>>> --- a/fs/ceph/inode.c
>>>> +++ b/fs/ceph/inode.c
>>>> @@ -625,6 +625,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
>>>>    	ci->fscrypt_auth = NULL;
>>>>    	ci->fscrypt_auth_len = 0;
>>>>    #endif
>>>> +	ci->i_truncate_block_off = 0;
>>>>    
>>>>    	return &ci->vfs_inode;
>>>>    }
>>>> @@ -1033,11 +1034,24 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
>>>>    
>>>>    		pool_ns = old_ns;
>>>>    
>>>> +		/* the fscrypt_file is 0 means the file content truncate has been done */
>>>>    		if (IS_ENCRYPTED(inode) && size &&
>>>> -		    (iinfo->fscrypt_file_len == sizeof(__le64))) {
>>>> +		    iinfo->fscrypt_file_len == sizeof(__le64) &&
>>>> +		    __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file) > 0) {
>>>>    			size = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
>>>>    			if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
>>>>    				pr_warn("size=%llu fscrypt_file=%llu\n", info->size, size);
>>>> +
>>>> +			/*
>>>> +			 * If the second truncate come just after the first
>>>> +			 * truncate, and if the second has a larger size there
>>>> +			 * is no need to update the i_truncate_block_off.
>>>> +			 * Only when the second one has a smaller size, that
>>>> +			 * means we need to truncate more.
>>>> +			 */
>>>> +			if (ci->i_truncate_block_off > 0 &&
>>>> +			    size < ci->i_truncate_block_off)
>>>> +				ci->i_truncate_block_off = size;
>>>>    		}
>>>>    
>>>>    		queue_trunc = ceph_fill_file_size(inode, issued,
>>>> @@ -2390,8 +2404,15 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>>>>    				req->r_args.setattr.old_size =
>>>>    					cpu_to_le64(round_up(isize,
>>>>    							     CEPH_FSCRYPT_BLOCK_SIZE));
>>>> -				req->r_fscrypt_file = attr->ia_size;
>>>> -				/* FIXME: client must zero out any partial blocks! */
>>>> +				if (attr->ia_size < isize) {
>>>> +					if(IS_ALIGNED(attr->ia_size, CEPH_FSCRYPT_BLOCK_SIZE))
>>>> +						req->r_fscrypt_file = 0;
>>>> +					else
>>>> +						req->r_fscrypt_file = attr->ia_size;
>>>> +					/* FIXME: client must zero out any partial blocks! */
>>>> +				} else if (attr->ia_size > isize) {
>>>> +					req->r_fscrypt_file = attr->ia_size;
>>>> +				}
>>>>    			} else {
>>>>    				req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
>>>>    				req->r_args.setattr.old_size = cpu_to_le64(isize);
>>>> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
>>>> index 7f3976b3319d..856caeb25fb6 100644
>>>> --- a/fs/ceph/super.h
>>>> +++ b/fs/ceph/super.h
>>>> @@ -443,9 +443,9 @@ struct ceph_inode_info {
>>>>    	struct fscache_cookie *fscache;
>>>>    #endif
>>>>    	u32 fscrypt_auth_len;
>>>> -	u32 fscrypt_file_len;
>>>>    	u8 *fscrypt_auth;
>>>> -	u8 *fscrypt_file;
>>>> +	/* need to zero the last block when decrypting the content to pagecache */
>>>> +	size_t i_truncate_block_off;
>>>>    
>>> Ugh, do we really need yet another field in the inode? This seems
>>> totally unnecessary, but maybe I'm missing some subtlety in the
>>> truncation handling that requires this extra tracking.
>>>
>>> What is this intended to represent anyway?
>> Just a draft patch. And I remove the unused "u8 *fscrypt_file" member.
>>
>> We can remove this and just reuse the "u8 *fscrypt_file" here.
>>
>> After the inode is filled, this will keep the offset from which it needs
>> to do the zero stuff after reading to the local pagecache or buffer.
>>
>>>>    	errseq_t i_meta_err;
>>>>    
>>>> @@ -1192,6 +1192,10 @@ extern void ceph_put_cap_refs(struct ceph_inode_info *ci, int had);
>>>>    extern void ceph_put_cap_refs_async(struct ceph_inode_info *ci, int had);
>>>>    extern void ceph_put_cap_refs_no_check_caps(struct ceph_inode_info *ci,
>>>>    					    int had);
>>>> +extern void __ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
>>>> +					    u64 start_pos, u64 end_pos);
>>>> +extern void ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
>>>> +					  u64 start_pos, u64 end_pos);
>>>>    extern void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
>>>>    				       struct ceph_snap_context *snapc);
>>>>    extern void ceph_flush_snaps(struct ceph_inode_info *ci,
>>>> @@ -1282,6 +1286,10 @@ extern int ceph_locks_to_pagelist(struct ceph_filelock *flocks,
>>>>    extern void ceph_fs_debugfs_init(struct ceph_fs_client *client);
>>>>    extern void ceph_fs_debugfs_cleanup(struct ceph_fs_client *client);
>>>>    
>>>> +extern void ceph_try_to_zero_truncate_block_off(struct ceph_inode_info *ci,
>>>> +						u64 start_pos, u64 end_pos,
>>>> +						struct page **pages);
>>>> +
>>>>    /* quota.c */
>>>>    static inline bool __ceph_has_any_quota(struct ceph_inode_info *ci)
>>>>    {
Jeff Layton Sept. 9, 2021, 12:48 p.m. UTC | #5
On Thu, 2021-09-09 at 11:38 +0800, Xiubo Li wrote:
> On 9/8/21 9:57 PM, Jeff Layton wrote:
> > On Wed, 2021-09-08 at 17:37 +0800, Xiubo Li wrote:
> > > On 9/8/21 12:26 AM, Jeff Layton wrote:
> > > > On Fri, 2021-09-03 at 16:15 +0800, xiubli@redhat.com wrote:
> > > > > From: Xiubo Li <xiubli@redhat.com>
> > > > > 
> > > > > When truncating the file, it will leave the MDS to handle that,
> > > > > but the MDS won't update the file contents. So when the fscrypt
> > > > > is enabled, if the truncate size is not aligned to the block size,
> > > > > the kclient will round up the truancate size to the block size and
> > > > > leave the the last block untouched.
> > > > > 
> > > > > The opaque fscrypt_file field will be used to tricker whether the
> > > > > last block need to do the rmw to truncate the a specified block's
> > > > > contents, we can get which block needs to do the rmw by round down
> > > > > the fscrypt_file.
> > > > > 
> > > > > In kclient side, there is not need to do the rmw immediately after
> > > > > the file is truncated. We can defer doing that whenever the kclient
> > > > > will update that block in late future. And before that any kclient
> > > > > will check the fscrypt_file field when reading that block, if the
> > > > > fscrypt_file field is none zero that means the related block needs
> > > > > to be zeroed in range of [fscrypt_file, round_up(fscrypt_file + PAGE_SIZE))
> > > > > in pagecache or readed data buffer.
> > > > > 
> > > > s/PAGE_SIZE/CEPH_FSCRYPT_BLOCK_SIZE/
> > > > 
> > > > Yes, on x86_64 they are equal, but that's not the case on all arches.
> > > > Also, we are moving toward a pagecache that may hold larger pages on
> > > > x86_64 too.
> > > >    
> > > Okay.
> > > > > Once the that block contents are updated and writeback
> > > > > kclient will reset the fscrypt_file field in MDS side, then 0 means
> > > > > no need to care about the truncate stuff any more.
> > > > > 
> > > > I'm a little unclear on what the fscrypt_file actually represents here.
> > > > 
> > > > I had proposed that we just make the fscrypt_file field hold the
> > > > "actual" i_size and we'd make the old size field always be a rounded-
> > > > up version of the size. The MDS would treat that as an opaque value
> > > > under Fw caps, and the client could use that field to determine i_size.
> > > > That has a side benefit too -- if the client doesn't support fscrypt,
> > > > it'll see the rounded-up sizes which are close enough and don't violate
> > > > any POSIX rules.
> > > > 
> > > > In your version, fscrypt_file also holds the actual size of the inode,
> > > > but sometimes you're zeroing it out, and I don't understand why:
> > > I think I forgot to fix this after I adapt to multiple ftruncates case,
> > > this patch is not correctly handling the "actual" file size.
> > > 
> > > I just want the fscrypt_file field always to hold the offset from which
> > > the contents needed to be zeroed, and the range should be [fscrypt_file,
> > > round_up(fscrypt_file +CEPH_FSCRYPT_BLOCK_SIZE)).
> > > 
> > > In single ftruncate case the fscrypt_file should equal to the "actual"
> > > file size. Then the "req->r_args.setattr.size = attr->ia_size" and
> > > "req->r_args.setattr.old_size = isize", no need to round up in
> > > __ceph_setattr() in kclient side, and leave the MDS to do that, but we
> > > need to pass the CEPH_FSCRYPT_BLOCK_SIZE at the same time.
> > > 
> > I'm really not a fan of pushing this logic into the MDS. Why does it
> > need to know anything about the CEPH_FSCRYPT_BLOCK_SIZE at all?
> 
>  From your current patch set, you are rounding up the 
> "req->r_args.setattr.size" and "req->r_args.setattr.old_size" to the 
> BLOCK end in __ceph_setattr().
> 
> Without considering keep the file scryption logic in kclient only, I 
> need the "req->r_args.setattr.size" to keep the file's real size.
> 
> Since the MDS will do the truncate stuff. And if we won't round the 
> "req->r_args.setattr.size" up to the BLOCK end any more, then the MDS 
> needs to know whether and how to round up the file size to the block end 
> when truncating the file. Because the fscrypt_file won't record the 
> file's real size any more, it maybe zero, more detail please see the 
> example below.
> 
> Yeah, but as you mentioned bellow if we will keep the file scryption 
> logic in kclient only, I need one extra field to do the defer rmw:
> 
> struct fscrypt_file {
> 
>      u64 file_real_size;         // always keep the file's real size and 
> the "req->r_args.setattr.size = round_up(file_real_size, BLOCK_SIZE)" as 
> you do in your current patch set.
> 
>      u64 file_truncate_offset;  // this will always record in which 
> BLOCK we need to do the rmw, this maybe 0 or located in the file's LAST 
> block and maybe not, more detail please the example below.
> 
> }
> 
> The "file_truncate_offset" member will be what I need to do the defer rmw.
> 
> 
> > > But in multiple ftruncates case if the second ftruncate has a larger
> > > size, the fscrypt_file won't be changed and the ceph backend will help
> > > zero the extended part, but we still need to zero the contents in the
> > > first ftruncate. If the second ftruncate has a smaller size, the
> > > fscrypt_file need to be updated and always keeps the smaller size.
> > > 
> > I don't get it. Maybe you can walk me through a concrete example of how
> > racing ftruncates are a problem?
> 
> Sorry for confusing.
> 
> For example:
> 
> 1), if there has a file named "bar", and currently the size is 100K 
> bytes, the CEPH_FSCRYPT_BLOCK_SIZE size equals to 4K.
> 
> 2), first ftruncate("bar", 7K) comes, then both the file real size and 
> "file_truncate_offset" will be set to 7K, then in MDS it will truncate 
> the file from 8K, and the last block's [7K, 8K) contents need to be 
> zeroed anyway later.
> 
> 3), immediately a second ftruncate("bar", 16K) comes, from the ftruncate 
> man page it says the new extended [7K, 16K) should be zeroed when 
> truncating the file. That means the OSD should help zero the [8K, 16K), 
> but won't touch the block [4K, 8K), in which the [7K, 8K) contents still 
> needs to be zeroed. So in this case the "file_truncate_offset" won't be 
> changed and still be 7K. Then the "file_truncate_offset" won't be 
> located in the last block of the file any more.
> 

Woah, why didn't the file_truncate_offset get changed when you truncated
up to 16k?

When you issue the truncate SETATTR to the MDS, you're changing the size
field in the inode. The fscrypt_file field _must_ be updated at the same
time. I think that means that we need to extend SETATTR to also update
fscrypt_file.

> 4), if the second ftruncate in step 3) the new size is 3K, then the MDS 
> will truncate the file from 4K and [7K, 8K) contents will be discard 
> anyway, so we need to update the "file_truncate_offset" to 3K, that 
> means a new BLOCK [0K, 4K) needs to do the rmw, by zeroing the [3K, 4K).
> 
> 5), if the new truncate in step 3) the new size is 4K, since the 4K < 7K 
> and 4K is aligned to the BLOCK size, so no need to rmw any block any 
> more, then we can just clear the "file_truncate_offset" field.
> 
> 
> For defer RMW logic please see the following example:
> 
> 

Ok, thanks. I think I understand now how racing truncates are a problem.
Really, it comes down to the fact that we don't have a good way to
mediate the last-block RMW operation when competing clients are issuing
truncates.

I'm not sure adding this file_truncate_offset field really does all that
much good though. You don't really have a way to ensure that a
truncating client will see the changes to that field before it issues
its RMW op.

> 
> > Suppose we have a file and client1 truncates it down from a large size
> > to 7k. client1 then sends the MDS a SETATTR to truncate it at 8k, and
> > does a RMW on the last (4k) block. client2 comes along at the same time
> > and truncates it up to 13k. client2 issues a SETATTR to extend the file
> > to 16k and does a RMW on the last block too (which would presumably
> > already be all zeroes anyway).
> 
> I think you meant in the none defer RMW case, this is not what my defer 
> RMW approach will do.
> 
> After the client1 truncated the file, it won't do the RMW if it won't 
> write any data to that file, and then we assume the client1 is unmounted 
> immediately.
> 
> And when the client1 is truncating the file, it will update the 
> "file_truncate_offset" to 7K, which means the [7K, 8K) in the LAST block 
> needs to be zeroed.
> 
> Then the client2 reads that the file size is 7K and the 
> "file_truncate_offset" is 7K too, and the client2 wants to truncate the 
> file up to 13K. Since the OSD should help us zero the extended part [8K, 
> 13K] when truncating, but won't touch the block [4K, 8K), for which it 
> still needs to do the RMW. Then the client2 is unmounted too before 
> writing any data to the file. After this the "file_truncate_offset" 
> won't be located in the file's LAST block any more.
> 
> After that if the client3 will update the whole file's contents, it will 
> read all the file 13K bytes contents to local page buffers, since the 
> "file_truncate_offset" is 7K and then in the page buffer the range [7K, 
> 8K) will be zeroed just after the contents are dencrypted inplace. Then 
> if the client3 successfully flushes that dirty data back and then the 
> deferred RMW for block [4K, 8K) should be done at the same time, and the 
> "file_truncate_offset" should be cleared too.
> 
> While if the client3 won't update the block [4K, 8K), the 
> "file_truncate_offset" will be kept all the time until the above RMW is 
> done in future.
> 
> 

Ok, I think I finally understand what you're saying.

You want to rely on the next client to do a write to handle the zeroing
at the end. You basically just want to keep track of whether and where
it should zero up to the end of the next crypto block, and defer that
until a client is writing.

I'll have to think about whether that's still racy. Part of the problem
is that once the client doesn't have caps, it doesn't have a way to
ensure that fscrypt_file (whatever it holds) doesn't change while it's
doing that zeroing.

Really, it comes down to the fact that truncates are supposed to be an
atomic operation, but we need to perform actions in two different
places.

Hmmm...it's worse than that even -- if the truncate changes it so that
the last block is in an entirely different object, then there are 3
places that will need to coordinate access. 

Tricky.
Xiubo Li Sept. 10, 2021, 2:30 a.m. UTC | #6
On 9/9/21 8:48 PM, Jeff Layton wrote:
> On Thu, 2021-09-09 at 11:38 +0800, Xiubo Li wrote:
>> On 9/8/21 9:57 PM, Jeff Layton wrote:
>>> On Wed, 2021-09-08 at 17:37 +0800, Xiubo Li wrote:
>>>> On 9/8/21 12:26 AM, Jeff Layton wrote:
>>>>> On Fri, 2021-09-03 at 16:15 +0800, xiubli@redhat.com wrote:
>>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>>
>>>>>> When truncating the file, it will leave the MDS to handle that,
>>>>>> but the MDS won't update the file contents. So when the fscrypt
>>>>>> is enabled, if the truncate size is not aligned to the block size,
>>>>>> the kclient will round up the truancate size to the block size and
>>>>>> leave the the last block untouched.
>>>>>>
>>>>>> The opaque fscrypt_file field will be used to tricker whether the
>>>>>> last block need to do the rmw to truncate the a specified block's
>>>>>> contents, we can get which block needs to do the rmw by round down
>>>>>> the fscrypt_file.
>>>>>>
>>>>>> In kclient side, there is not need to do the rmw immediately after
>>>>>> the file is truncated. We can defer doing that whenever the kclient
>>>>>> will update that block in late future. And before that any kclient
>>>>>> will check the fscrypt_file field when reading that block, if the
>>>>>> fscrypt_file field is none zero that means the related block needs
>>>>>> to be zeroed in range of [fscrypt_file, round_up(fscrypt_file + PAGE_SIZE))
>>>>>> in pagecache or readed data buffer.
>>>>>>
>>>>> s/PAGE_SIZE/CEPH_FSCRYPT_BLOCK_SIZE/
>>>>>
>>>>> Yes, on x86_64 they are equal, but that's not the case on all arches.
>>>>> Also, we are moving toward a pagecache that may hold larger pages on
>>>>> x86_64 too.
>>>>>     
>>>> Okay.
>>>>>> Once the that block contents are updated and writeback
>>>>>> kclient will reset the fscrypt_file field in MDS side, then 0 means
>>>>>> no need to care about the truncate stuff any more.
>>>>>>
>>>>> I'm a little unclear on what the fscrypt_file actually represents here.
>>>>>
>>>>> I had proposed that we just make the fscrypt_file field hold the
>>>>> "actual" i_size and we'd make the old size field always be a rounded-
>>>>> up version of the size. The MDS would treat that as an opaque value
>>>>> under Fw caps, and the client could use that field to determine i_size.
>>>>> That has a side benefit too -- if the client doesn't support fscrypt,
>>>>> it'll see the rounded-up sizes which are close enough and don't violate
>>>>> any POSIX rules.
>>>>>
>>>>> In your version, fscrypt_file also holds the actual size of the inode,
>>>>> but sometimes you're zeroing it out, and I don't understand why:
>>>> I think I forgot to fix this after I adapt to multiple ftruncates case,
>>>> this patch is not correctly handling the "actual" file size.
>>>>
>>>> I just want the fscrypt_file field always to hold the offset from which
>>>> the contents needed to be zeroed, and the range should be [fscrypt_file,
>>>> round_up(fscrypt_file +CEPH_FSCRYPT_BLOCK_SIZE)).
>>>>
>>>> In single ftruncate case the fscrypt_file should equal to the "actual"
>>>> file size. Then the "req->r_args.setattr.size = attr->ia_size" and
>>>> "req->r_args.setattr.old_size = isize", no need to round up in
>>>> __ceph_setattr() in kclient side, and leave the MDS to do that, but we
>>>> need to pass the CEPH_FSCRYPT_BLOCK_SIZE at the same time.
>>>>
>>> I'm really not a fan of pushing this logic into the MDS. Why does it
>>> need to know anything about the CEPH_FSCRYPT_BLOCK_SIZE at all?
>>   From your current patch set, you are rounding up the
>> "req->r_args.setattr.size" and "req->r_args.setattr.old_size" to the
>> BLOCK end in __ceph_setattr().
>>
>> Without considering keep the file scryption logic in kclient only, I
>> need the "req->r_args.setattr.size" to keep the file's real size.
>>
>> Since the MDS will do the truncate stuff. And if we won't round the
>> "req->r_args.setattr.size" up to the BLOCK end any more, then the MDS
>> needs to know whether and how to round up the file size to the block end
>> when truncating the file. Because the fscrypt_file won't record the
>> file's real size any more, it maybe zero, more detail please see the
>> example below.
>>
>> Yeah, but as you mentioned bellow if we will keep the file scryption
>> logic in kclient only, I need one extra field to do the defer rmw:
>>
>> struct fscrypt_file {
>>
>>       u64 file_real_size;         // always keep the file's real size and
>> the "req->r_args.setattr.size = round_up(file_real_size, BLOCK_SIZE)" as
>> you do in your current patch set.
>>
>>       u64 file_truncate_offset;  // this will always record in which
>> BLOCK we need to do the rmw, this maybe 0 or located in the file's LAST
>> block and maybe not, more detail please the example below.
>>
>> }
>>
>> The "file_truncate_offset" member will be what I need to do the defer rmw.
>>
>>
>>>> But in multiple ftruncates case if the second ftruncate has a larger
>>>> size, the fscrypt_file won't be changed and the ceph backend will help
>>>> zero the extended part, but we still need to zero the contents in the
>>>> first ftruncate. If the second ftruncate has a smaller size, the
>>>> fscrypt_file need to be updated and always keeps the smaller size.
>>>>
>>> I don't get it. Maybe you can walk me through a concrete example of how
>>> racing ftruncates are a problem?
>> Sorry for confusing.
>>
>> For example:
>>
>> 1), if there has a file named "bar", and currently the size is 100K
>> bytes, the CEPH_FSCRYPT_BLOCK_SIZE size equals to 4K.
>>
>> 2), first ftruncate("bar", 7K) comes, then both the file real size and
>> "file_truncate_offset" will be set to 7K, then in MDS it will truncate
>> the file from 8K, and the last block's [7K, 8K) contents need to be
>> zeroed anyway later.
>>
>> 3), immediately a second ftruncate("bar", 16K) comes, from the ftruncate
>> man page it says the new extended [7K, 16K) should be zeroed when
>> truncating the file. That means the OSD should help zero the [8K, 16K),
>> but won't touch the block [4K, 8K), in which the [7K, 8K) contents still
>> needs to be zeroed. So in this case the "file_truncate_offset" won't be
>> changed and still be 7K. Then the "file_truncate_offset" won't be
>> located in the last block of the file any more.
>>
> Woah, why didn't the file_truncate_offset get changed when you truncated
> up to 16k?
>
> When you issue the truncate SETATTR to the MDS, you're changing the size
> field in the inode. The fscrypt_file field _must_ be updated at the same
> time. I think that means that we need to extend SETATTR to also update
> fscrypt_file.
>
>> 4), if the second ftruncate in step 3) the new size is 3K, then the MDS
>> will truncate the file from 4K and [7K, 8K) contents will be discard
>> anyway, so we need to update the "file_truncate_offset" to 3K, that
>> means a new BLOCK [0K, 4K) needs to do the rmw, by zeroing the [3K, 4K).
>>
>> 5), if the new truncate in step 3) the new size is 4K, since the 4K < 7K
>> and 4K is aligned to the BLOCK size, so no need to rmw any block any
>> more, then we can just clear the "file_truncate_offset" field.
>>
>>
>> For defer RMW logic please see the following example:
>>
>>
> Ok, thanks. I think I understand now how racing truncates are a problem.
> Really, it comes down to the fact that we don't have a good way to
> mediate the last-block RMW operation when competing clients are issuing
> truncates.
>
> I'm not sure adding this file_truncate_offset field really does all that
> much good though. You don't really have a way to ensure that a
> truncating client will see the changes to that field before it issues
> its RMW op.
>
>>> Suppose we have a file and client1 truncates it down from a large size
>>> to 7k. client1 then sends the MDS a SETATTR to truncate it at 8k, and
>>> does a RMW on the last (4k) block. client2 comes along at the same time
>>> and truncates it up to 13k. client2 issues a SETATTR to extend the file
>>> to 16k and does a RMW on the last block too (which would presumably
>>> already be all zeroes anyway).
>> I think you meant in the none defer RMW case, this is not what my defer
>> RMW approach will do.
>>
>> After the client1 truncated the file, it won't do the RMW if it won't
>> write any data to that file, and then we assume the client1 is unmounted
>> immediately.
>>
>> And when the client1 is truncating the file, it will update the
>> "file_truncate_offset" to 7K, which means the [7K, 8K) in the LAST block
>> needs to be zeroed.
>>
>> Then the client2 reads that the file size is 7K and the
>> "file_truncate_offset" is 7K too, and the client2 wants to truncate the
>> file up to 13K. Since the OSD should help us zero the extended part [8K,
>> 13K] when truncating, but won't touch the block [4K, 8K), for which it
>> still needs to do the RMW. Then the client2 is unmounted too before
>> writing any data to the file. After this the "file_truncate_offset"
>> won't be located in the file's LAST block any more.
>>
>> After that if the client3 will update the whole file's contents, it will
>> read all the file 13K bytes contents to local page buffers, since the
>> "file_truncate_offset" is 7K and then in the page buffer the range [7K,
>> 8K) will be zeroed just after the contents are dencrypted inplace. Then
>> if the client3 successfully flushes that dirty data back and then the
>> deferred RMW for block [4K, 8K) should be done at the same time, and the
>> "file_truncate_offset" should be cleared too.
>>
>> While if the client3 won't update the block [4K, 8K), the
>> "file_truncate_offset" will be kept all the time until the above RMW is
>> done in future.
>>
>>
> Ok, I think I finally understand what you're saying.
>
> You want to rely on the next client to do a write to handle the zeroing
> at the end. You basically just want to keep track of whether and where
> it should zero up to the end of the next crypto block, and defer that
> until a client is writing.

Yeah, the writing could also be in the same client just after the file 
is truncated.


>
> I'll have to think about whether that's still racy. Part of the problem
> is that once the client doesn't have caps, it doesn't have a way to
> ensure that fscrypt_file (whatever it holds) doesn't change while it's
> doing that zeroing.

As my understanding, if clientA want to read a file, it will open it 
with RD mode, then it will get the fscrypt_file meta and "Fr" caps, then 
it can safely read the file contents and do the zeroing for that block. 
Then the opened file shouldn't worry whether the fscrypt_file will be 
changed by others during it still holding the Fr caps, because if the 
clientB wants to update the fscrypt_file it must acquire the Fw caps 
first, before that the Fr caps must be revoked from clientA and at the 
same time the read pagecache in clientA will be invalidated.

And if after clientB have been issued the Fw caps and have modified that 
block and still not flushed back, a new clientC comes and wants to read 
the file, so the Fw caps must be revoked from clientB and the dirty data 
will be flushed, and then when releasing the Fw caps to the MDS, it will 
update the new fscrypt_file meta together.

I haven't see which case will be racy yet. Or did I miss some cases or 
something important ?

Thanks


>
> Really, it comes down to the fact that truncates are supposed to be an
> atomic operation, but we need to perform actions in two different
> places.
>
> Hmmm...it's worse than that even -- if the truncate changes it so that
> the last block is in an entirely different object, then there are 3
> places that will need to coordinate access.
>
> Tricky.
Jeff Layton Sept. 10, 2021, 11:46 a.m. UTC | #7
On Fri, 2021-09-10 at 10:30 +0800, Xiubo Li wrote:
> On 9/9/21 8:48 PM, Jeff Layton wrote:
> > On Thu, 2021-09-09 at 11:38 +0800, Xiubo Li wrote:
> > > On 9/8/21 9:57 PM, Jeff Layton wrote:
> > > > On Wed, 2021-09-08 at 17:37 +0800, Xiubo Li wrote:
> > > > > On 9/8/21 12:26 AM, Jeff Layton wrote:
> > > > > > On Fri, 2021-09-03 at 16:15 +0800, xiubli@redhat.com wrote:
> > > > > > > From: Xiubo Li <xiubli@redhat.com>
> > > > > > > 
> > > > > > > When truncating the file, it will leave the MDS to handle that,
> > > > > > > but the MDS won't update the file contents. So when the fscrypt
> > > > > > > is enabled, if the truncate size is not aligned to the block size,
> > > > > > > the kclient will round up the truancate size to the block size and
> > > > > > > leave the the last block untouched.
> > > > > > > 
> > > > > > > The opaque fscrypt_file field will be used to tricker whether the
> > > > > > > last block need to do the rmw to truncate the a specified block's
> > > > > > > contents, we can get which block needs to do the rmw by round down
> > > > > > > the fscrypt_file.
> > > > > > > 
> > > > > > > In kclient side, there is not need to do the rmw immediately after
> > > > > > > the file is truncated. We can defer doing that whenever the kclient
> > > > > > > will update that block in late future. And before that any kclient
> > > > > > > will check the fscrypt_file field when reading that block, if the
> > > > > > > fscrypt_file field is none zero that means the related block needs
> > > > > > > to be zeroed in range of [fscrypt_file, round_up(fscrypt_file + PAGE_SIZE))
> > > > > > > in pagecache or readed data buffer.
> > > > > > > 
> > > > > > s/PAGE_SIZE/CEPH_FSCRYPT_BLOCK_SIZE/
> > > > > > 
> > > > > > Yes, on x86_64 they are equal, but that's not the case on all arches.
> > > > > > Also, we are moving toward a pagecache that may hold larger pages on
> > > > > > x86_64 too.
> > > > > >     
> > > > > Okay.
> > > > > > > Once the that block contents are updated and writeback
> > > > > > > kclient will reset the fscrypt_file field in MDS side, then 0 means
> > > > > > > no need to care about the truncate stuff any more.
> > > > > > > 
> > > > > > I'm a little unclear on what the fscrypt_file actually represents here.
> > > > > > 
> > > > > > I had proposed that we just make the fscrypt_file field hold the
> > > > > > "actual" i_size and we'd make the old size field always be a rounded-
> > > > > > up version of the size. The MDS would treat that as an opaque value
> > > > > > under Fw caps, and the client could use that field to determine i_size.
> > > > > > That has a side benefit too -- if the client doesn't support fscrypt,
> > > > > > it'll see the rounded-up sizes which are close enough and don't violate
> > > > > > any POSIX rules.
> > > > > > 
> > > > > > In your version, fscrypt_file also holds the actual size of the inode,
> > > > > > but sometimes you're zeroing it out, and I don't understand why:
> > > > > I think I forgot to fix this after I adapt to multiple ftruncates case,
> > > > > this patch is not correctly handling the "actual" file size.
> > > > > 
> > > > > I just want the fscrypt_file field always to hold the offset from which
> > > > > the contents needed to be zeroed, and the range should be [fscrypt_file,
> > > > > round_up(fscrypt_file +CEPH_FSCRYPT_BLOCK_SIZE)).
> > > > > 
> > > > > In single ftruncate case the fscrypt_file should equal to the "actual"
> > > > > file size. Then the "req->r_args.setattr.size = attr->ia_size" and
> > > > > "req->r_args.setattr.old_size = isize", no need to round up in
> > > > > __ceph_setattr() in kclient side, and leave the MDS to do that, but we
> > > > > need to pass the CEPH_FSCRYPT_BLOCK_SIZE at the same time.
> > > > > 
> > > > I'm really not a fan of pushing this logic into the MDS. Why does it
> > > > need to know anything about the CEPH_FSCRYPT_BLOCK_SIZE at all?
> > >   From your current patch set, you are rounding up the
> > > "req->r_args.setattr.size" and "req->r_args.setattr.old_size" to the
> > > BLOCK end in __ceph_setattr().
> > > 
> > > Without considering keep the file scryption logic in kclient only, I
> > > need the "req->r_args.setattr.size" to keep the file's real size.
> > > 
> > > Since the MDS will do the truncate stuff. And if we won't round the
> > > "req->r_args.setattr.size" up to the BLOCK end any more, then the MDS
> > > needs to know whether and how to round up the file size to the block end
> > > when truncating the file. Because the fscrypt_file won't record the
> > > file's real size any more, it maybe zero, more detail please see the
> > > example below.
> > > 
> > > Yeah, but as you mentioned bellow if we will keep the file scryption
> > > logic in kclient only, I need one extra field to do the defer rmw:
> > > 
> > > struct fscrypt_file {
> > > 
> > >       u64 file_real_size;         // always keep the file's real size and
> > > the "req->r_args.setattr.size = round_up(file_real_size, BLOCK_SIZE)" as
> > > you do in your current patch set.
> > > 
> > >       u64 file_truncate_offset;  // this will always record in which
> > > BLOCK we need to do the rmw, this maybe 0 or located in the file's LAST
> > > block and maybe not, more detail please the example below.
> > > 
> > > }
> > > 
> > > The "file_truncate_offset" member will be what I need to do the defer rmw.
> > > 
> > > 
> > > > > But in multiple ftruncates case if the second ftruncate has a larger
> > > > > size, the fscrypt_file won't be changed and the ceph backend will help
> > > > > zero the extended part, but we still need to zero the contents in the
> > > > > first ftruncate. If the second ftruncate has a smaller size, the
> > > > > fscrypt_file need to be updated and always keeps the smaller size.
> > > > > 
> > > > I don't get it. Maybe you can walk me through a concrete example of how
> > > > racing ftruncates are a problem?
> > > Sorry for confusing.
> > > 
> > > For example:
> > > 
> > > 1), if there has a file named "bar", and currently the size is 100K
> > > bytes, the CEPH_FSCRYPT_BLOCK_SIZE size equals to 4K.
> > > 
> > > 2), first ftruncate("bar", 7K) comes, then both the file real size and
> > > "file_truncate_offset" will be set to 7K, then in MDS it will truncate
> > > the file from 8K, and the last block's [7K, 8K) contents need to be
> > > zeroed anyway later.
> > > 
> > > 3), immediately a second ftruncate("bar", 16K) comes, from the ftruncate
> > > man page it says the new extended [7K, 16K) should be zeroed when
> > > truncating the file. That means the OSD should help zero the [8K, 16K),
> > > but won't touch the block [4K, 8K), in which the [7K, 8K) contents still
> > > needs to be zeroed. So in this case the "file_truncate_offset" won't be
> > > changed and still be 7K. Then the "file_truncate_offset" won't be
> > > located in the last block of the file any more.
> > > 
> > Woah, why didn't the file_truncate_offset get changed when you truncated
> > up to 16k?
> > 
> > When you issue the truncate SETATTR to the MDS, you're changing the size
> > field in the inode. The fscrypt_file field _must_ be updated at the same
> > time. I think that means that we need to extend SETATTR to also update
> > fscrypt_file.
> > 
> > > 4), if the second ftruncate in step 3) the new size is 3K, then the MDS
> > > will truncate the file from 4K and [7K, 8K) contents will be discard
> > > anyway, so we need to update the "file_truncate_offset" to 3K, that
> > > means a new BLOCK [0K, 4K) needs to do the rmw, by zeroing the [3K, 4K).
> > > 
> > > 5), if the new truncate in step 3) the new size is 4K, since the 4K < 7K
> > > and 4K is aligned to the BLOCK size, so no need to rmw any block any
> > > more, then we can just clear the "file_truncate_offset" field.
> > > 
> > > 
> > > For defer RMW logic please see the following example:
> > > 
> > > 
> > Ok, thanks. I think I understand now how racing truncates are a problem.
> > Really, it comes down to the fact that we don't have a good way to
> > mediate the last-block RMW operation when competing clients are issuing
> > truncates.
> > 
> > I'm not sure adding this file_truncate_offset field really does all that
> > much good though. You don't really have a way to ensure that a
> > truncating client will see the changes to that field before it issues
> > its RMW op.
> > 
> > > > Suppose we have a file and client1 truncates it down from a large size
> > > > to 7k. client1 then sends the MDS a SETATTR to truncate it at 8k, and
> > > > does a RMW on the last (4k) block. client2 comes along at the same time
> > > > and truncates it up to 13k. client2 issues a SETATTR to extend the file
> > > > to 16k and does a RMW on the last block too (which would presumably
> > > > already be all zeroes anyway).
> > > I think you meant in the none defer RMW case, this is not what my defer
> > > RMW approach will do.
> > > 
> > > After the client1 truncated the file, it won't do the RMW if it won't
> > > write any data to that file, and then we assume the client1 is unmounted
> > > immediately.
> > > 
> > > And when the client1 is truncating the file, it will update the
> > > "file_truncate_offset" to 7K, which means the [7K, 8K) in the LAST block
> > > needs to be zeroed.
> > > 
> > > Then the client2 reads that the file size is 7K and the
> > > "file_truncate_offset" is 7K too, and the client2 wants to truncate the
> > > file up to 13K. Since the OSD should help us zero the extended part [8K,
> > > 13K] when truncating, but won't touch the block [4K, 8K), for which it
> > > still needs to do the RMW. Then the client2 is unmounted too before
> > > writing any data to the file. After this the "file_truncate_offset"
> > > won't be located in the file's LAST block any more.
> > > 
> > > After that if the client3 will update the whole file's contents, it will
> > > read all the file 13K bytes contents to local page buffers, since the
> > > "file_truncate_offset" is 7K and then in the page buffer the range [7K,
> > > 8K) will be zeroed just after the contents are dencrypted inplace. Then
> > > if the client3 successfully flushes that dirty data back and then the
> > > deferred RMW for block [4K, 8K) should be done at the same time, and the
> > > "file_truncate_offset" should be cleared too.
> > > 
> > > While if the client3 won't update the block [4K, 8K), the
> > > "file_truncate_offset" will be kept all the time until the above RMW is
> > > done in future.
> > > 
> > > 
> > Ok, I think I finally understand what you're saying.
> > 
> > You want to rely on the next client to do a write to handle the zeroing
> > at the end. You basically just want to keep track of whether and where
> > it should zero up to the end of the next crypto block, and defer that
> > until a client is writing.
> 
> Yeah, the writing could also be in the same client just after the file 
> is truncated.
> 
> 
> > 
> > I'll have to think about whether that's still racy. Part of the problem
> > is that once the client doesn't have caps, it doesn't have a way to
> > ensure that fscrypt_file (whatever it holds) doesn't change while it's
> > doing that zeroing.
> 
> As my understanding, if clientA want to read a file, it will open it 
> with RD mode, then it will get the fscrypt_file meta and "Fr" caps, then 
> it can safely read the file contents and do the zeroing for that block. 

Ok, so in this case, the client is just zeroing out the appropriate
region in the resulting read data, and not on the OSD. That should be
ok.

> Then the opened file shouldn't worry whether the fscrypt_file will be 
> changed by others during it still holding the Fr caps, because if the 
> clientB wants to update the fscrypt_file it must acquire the Fw caps 
> first, before that the Fr caps must be revoked from clientA and at the 
> same time the read pagecache in clientA will be invalidated.
> 

Are you certain that Fw caps is enough to ensure that no other client
holds Fr caps? IIRC, Frw don't have the same exclusionary relationship
that (e.g.) Asx has. To exclude Fr, you may need Fb.

(Side note: these rules really need to be codified into a document
somewhere. I started that with the capabilities doc in the ceph tree,
but it's light on details of the FILE caps)

> And if after clientB have been issued the Fw caps and have modified that 
> block and still not flushed back, a new clientC comes and wants to read 
> the file, so the Fw caps must be revoked from clientB and the dirty data 
> will be flushed, and then when releasing the Fw caps to the MDS, it will 
> update the new fscrypt_file meta together.
> 
> I haven't see which case will be racy yet. Or did I miss some cases or 
> something important ?
> 
> 

We also need to consider how legacy, non-fscrypt-capable clients will
interact with files that have this field set. If one comes in and writes
to or truncates one of these files, it's liable to leave a real mess.
The simplest solution may be to just bar them from doing anything with
fscrypted files aside from deleting them -- maybe we'd allow them to
acquire Fr caps but not Fw?

> > 
> > Really, it comes down to the fact that truncates are supposed to be an
> > atomic operation, but we need to perform actions in two different
> > places.
> > 
> > Hmmm...it's worse than that even -- if the truncate changes it so that
> > the last block is in an entirely different object, then there are 3
> > places that will need to coordinate access.
> > 
> > Tricky.
>
Xiubo Li Sept. 13, 2021, 5:42 a.m. UTC | #8
On 9/10/21 7:46 PM, Jeff Layton wrote:
> On Fri, 2021-09-10 at 10:30 +0800, Xiubo Li wrote:
>> On 9/9/21 8:48 PM, Jeff Layton wrote:
>>> On Thu, 2021-09-09 at 11:38 +0800, Xiubo Li wrote:
>>>> On 9/8/21 9:57 PM, Jeff Layton wrote:
>>>>> On Wed, 2021-09-08 at 17:37 +0800, Xiubo Li wrote:
>>>>>> On 9/8/21 12:26 AM, Jeff Layton wrote:
>>>>>>> On Fri, 2021-09-03 at 16:15 +0800, xiubli@redhat.com wrote:
>>>>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>>>>
>>>>>>>> When truncating the file, it will leave the MDS to handle that,
>>>>>>>> but the MDS won't update the file contents. So when the fscrypt
>>>>>>>> is enabled, if the truncate size is not aligned to the block size,
>>>>>>>> the kclient will round up the truancate size to the block size and
>>>>>>>> leave the the last block untouched.
>>>>>>>>
>>>>>>>> The opaque fscrypt_file field will be used to tricker whether the
>>>>>>>> last block need to do the rmw to truncate the a specified block's
>>>>>>>> contents, we can get which block needs to do the rmw by round down
>>>>>>>> the fscrypt_file.
>>>>>>>>
>>>>>>>> In kclient side, there is not need to do the rmw immediately after
>>>>>>>> the file is truncated. We can defer doing that whenever the kclient
>>>>>>>> will update that block in late future. And before that any kclient
>>>>>>>> will check the fscrypt_file field when reading that block, if the
>>>>>>>> fscrypt_file field is none zero that means the related block needs
>>>>>>>> to be zeroed in range of [fscrypt_file, round_up(fscrypt_file + PAGE_SIZE))
>>>>>>>> in pagecache or readed data buffer.
>>>>>>>>
>>>>>>> s/PAGE_SIZE/CEPH_FSCRYPT_BLOCK_SIZE/
>>>>>>>
>>>>>>> Yes, on x86_64 they are equal, but that's not the case on all arches.
>>>>>>> Also, we are moving toward a pagecache that may hold larger pages on
>>>>>>> x86_64 too.
>>>>>>>      
>>>>>> Okay.
>>>>>>>> Once the that block contents are updated and writeback
>>>>>>>> kclient will reset the fscrypt_file field in MDS side, then 0 means
>>>>>>>> no need to care about the truncate stuff any more.
>>>>>>>>
>>>>>>> I'm a little unclear on what the fscrypt_file actually represents here.
>>>>>>>
>>>>>>> I had proposed that we just make the fscrypt_file field hold the
>>>>>>> "actual" i_size and we'd make the old size field always be a rounded-
>>>>>>> up version of the size. The MDS would treat that as an opaque value
>>>>>>> under Fw caps, and the client could use that field to determine i_size.
>>>>>>> That has a side benefit too -- if the client doesn't support fscrypt,
>>>>>>> it'll see the rounded-up sizes which are close enough and don't violate
>>>>>>> any POSIX rules.
>>>>>>>
>>>>>>> In your version, fscrypt_file also holds the actual size of the inode,
>>>>>>> but sometimes you're zeroing it out, and I don't understand why:
>>>>>> I think I forgot to fix this after I adapt to multiple ftruncates case,
>>>>>> this patch is not correctly handling the "actual" file size.
>>>>>>
>>>>>> I just want the fscrypt_file field always to hold the offset from which
>>>>>> the contents needed to be zeroed, and the range should be [fscrypt_file,
>>>>>> round_up(fscrypt_file +CEPH_FSCRYPT_BLOCK_SIZE)).
>>>>>>
>>>>>> In single ftruncate case the fscrypt_file should equal to the "actual"
>>>>>> file size. Then the "req->r_args.setattr.size = attr->ia_size" and
>>>>>> "req->r_args.setattr.old_size = isize", no need to round up in
>>>>>> __ceph_setattr() in kclient side, and leave the MDS to do that, but we
>>>>>> need to pass the CEPH_FSCRYPT_BLOCK_SIZE at the same time.
>>>>>>
>>>>> I'm really not a fan of pushing this logic into the MDS. Why does it
>>>>> need to know anything about the CEPH_FSCRYPT_BLOCK_SIZE at all?
>>>>    From your current patch set, you are rounding up the
>>>> "req->r_args.setattr.size" and "req->r_args.setattr.old_size" to the
>>>> BLOCK end in __ceph_setattr().
>>>>
>>>> Without considering keep the file scryption logic in kclient only, I
>>>> need the "req->r_args.setattr.size" to keep the file's real size.
>>>>
>>>> Since the MDS will do the truncate stuff. And if we won't round the
>>>> "req->r_args.setattr.size" up to the BLOCK end any more, then the MDS
>>>> needs to know whether and how to round up the file size to the block end
>>>> when truncating the file. Because the fscrypt_file won't record the
>>>> file's real size any more, it maybe zero, more detail please see the
>>>> example below.
>>>>
>>>> Yeah, but as you mentioned bellow if we will keep the file scryption
>>>> logic in kclient only, I need one extra field to do the defer rmw:
>>>>
>>>> struct fscrypt_file {
>>>>
>>>>        u64 file_real_size;         // always keep the file's real size and
>>>> the "req->r_args.setattr.size = round_up(file_real_size, BLOCK_SIZE)" as
>>>> you do in your current patch set.
>>>>
>>>>        u64 file_truncate_offset;  // this will always record in which
>>>> BLOCK we need to do the rmw, this maybe 0 or located in the file's LAST
>>>> block and maybe not, more detail please the example below.
>>>>
>>>> }
>>>>
>>>> The "file_truncate_offset" member will be what I need to do the defer rmw.
>>>>
>>>>
>>>>>> But in multiple ftruncates case if the second ftruncate has a larger
>>>>>> size, the fscrypt_file won't be changed and the ceph backend will help
>>>>>> zero the extended part, but we still need to zero the contents in the
>>>>>> first ftruncate. If the second ftruncate has a smaller size, the
>>>>>> fscrypt_file need to be updated and always keeps the smaller size.
>>>>>>
>>>>> I don't get it. Maybe you can walk me through a concrete example of how
>>>>> racing ftruncates are a problem?
>>>> Sorry for confusing.
>>>>
>>>> For example:
>>>>
>>>> 1), if there has a file named "bar", and currently the size is 100K
>>>> bytes, the CEPH_FSCRYPT_BLOCK_SIZE size equals to 4K.
>>>>
>>>> 2), first ftruncate("bar", 7K) comes, then both the file real size and
>>>> "file_truncate_offset" will be set to 7K, then in MDS it will truncate
>>>> the file from 8K, and the last block's [7K, 8K) contents need to be
>>>> zeroed anyway later.
>>>>
>>>> 3), immediately a second ftruncate("bar", 16K) comes, from the ftruncate
>>>> man page it says the new extended [7K, 16K) should be zeroed when
>>>> truncating the file. That means the OSD should help zero the [8K, 16K),
>>>> but won't touch the block [4K, 8K), in which the [7K, 8K) contents still
>>>> needs to be zeroed. So in this case the "file_truncate_offset" won't be
>>>> changed and still be 7K. Then the "file_truncate_offset" won't be
>>>> located in the last block of the file any more.
>>>>
>>> Woah, why didn't the file_truncate_offset get changed when you truncated
>>> up to 16k?
>>>
>>> When you issue the truncate SETATTR to the MDS, you're changing the size
>>> field in the inode. The fscrypt_file field _must_ be updated at the same
>>> time. I think that means that we need to extend SETATTR to also update
>>> fscrypt_file.
>>>
>>>> 4), if the second ftruncate in step 3) the new size is 3K, then the MDS
>>>> will truncate the file from 4K and [7K, 8K) contents will be discard
>>>> anyway, so we need to update the "file_truncate_offset" to 3K, that
>>>> means a new BLOCK [0K, 4K) needs to do the rmw, by zeroing the [3K, 4K).
>>>>
>>>> 5), if the new truncate in step 3) the new size is 4K, since the 4K < 7K
>>>> and 4K is aligned to the BLOCK size, so no need to rmw any block any
>>>> more, then we can just clear the "file_truncate_offset" field.
>>>>
>>>>
>>>> For defer RMW logic please see the following example:
>>>>
>>>>
>>> Ok, thanks. I think I understand now how racing truncates are a problem.
>>> Really, it comes down to the fact that we don't have a good way to
>>> mediate the last-block RMW operation when competing clients are issuing
>>> truncates.
>>>
>>> I'm not sure adding this file_truncate_offset field really does all that
>>> much good though. You don't really have a way to ensure that a
>>> truncating client will see the changes to that field before it issues
>>> its RMW op.
>>>
>>>>> Suppose we have a file and client1 truncates it down from a large size
>>>>> to 7k. client1 then sends the MDS a SETATTR to truncate it at 8k, and
>>>>> does a RMW on the last (4k) block. client2 comes along at the same time
>>>>> and truncates it up to 13k. client2 issues a SETATTR to extend the file
>>>>> to 16k and does a RMW on the last block too (which would presumably
>>>>> already be all zeroes anyway).
>>>> I think you meant in the none defer RMW case, this is not what my defer
>>>> RMW approach will do.
>>>>
>>>> After the client1 truncated the file, it won't do the RMW if it won't
>>>> write any data to that file, and then we assume the client1 is unmounted
>>>> immediately.
>>>>
>>>> And when the client1 is truncating the file, it will update the
>>>> "file_truncate_offset" to 7K, which means the [7K, 8K) in the LAST block
>>>> needs to be zeroed.
>>>>
>>>> Then the client2 reads that the file size is 7K and the
>>>> "file_truncate_offset" is 7K too, and the client2 wants to truncate the
>>>> file up to 13K. Since the OSD should help us zero the extended part [8K,
>>>> 13K] when truncating, but won't touch the block [4K, 8K), for which it
>>>> still needs to do the RMW. Then the client2 is unmounted too before
>>>> writing any data to the file. After this the "file_truncate_offset"
>>>> won't be located in the file's LAST block any more.
>>>>
>>>> After that if the client3 will update the whole file's contents, it will
>>>> read all the file 13K bytes contents to local page buffers, since the
>>>> "file_truncate_offset" is 7K and then in the page buffer the range [7K,
>>>> 8K) will be zeroed just after the contents are dencrypted inplace. Then
>>>> if the client3 successfully flushes that dirty data back and then the
>>>> deferred RMW for block [4K, 8K) should be done at the same time, and the
>>>> "file_truncate_offset" should be cleared too.
>>>>
>>>> While if the client3 won't update the block [4K, 8K), the
>>>> "file_truncate_offset" will be kept all the time until the above RMW is
>>>> done in future.
>>>>
>>>>
>>> Ok, I think I finally understand what you're saying.
>>>
>>> You want to rely on the next client to do a write to handle the zeroing
>>> at the end. You basically just want to keep track of whether and where
>>> it should zero up to the end of the next crypto block, and defer that
>>> until a client is writing.
>> Yeah, the writing could also be in the same client just after the file
>> is truncated.
>>
>>
>>> I'll have to think about whether that's still racy. Part of the problem
>>> is that once the client doesn't have caps, it doesn't have a way to
>>> ensure that fscrypt_file (whatever it holds) doesn't change while it's
>>> doing that zeroing.
>> As my understanding, if clientA want to read a file, it will open it
>> with RD mode, then it will get the fscrypt_file meta and "Fr" caps, then
>> it can safely read the file contents and do the zeroing for that block.
> Ok, so in this case, the client is just zeroing out the appropriate
> region in the resulting read data, and not on the OSD. That should be
> ok.
>
>> Then the opened file shouldn't worry whether the fscrypt_file will be
>> changed by others during it still holding the Fr caps, because if the
>> clientB wants to update the fscrypt_file it must acquire the Fw caps
>> first, before that the Fr caps must be revoked from clientA and at the
>> same time the read pagecache in clientA will be invalidated.
>>
> Are you certain that Fw caps is enough to ensure that no other client
> holds Fr caps?

I spent hours and went through the mds Locker related code on the weekends.

 From the mds/lock.cc code, for mds filelock for example in the LOCK_MIX 
state and some interim transition states to LOCK_MIX it will allow 
different clients could hold any of Fw or Fr caps. But the Fb/Fc will be 
disabled. Checked the mds/Locker.cc code, found that the mds filelock 
could to switch LOCK_MIX state in some cases when there has one client 
wants Fw and another client wants any of Fr and Fw.

In this case I think the Linux advisory or mandatory locks are necessary 
to keep the file contents concurrency. In multiple processes' concurrent 
read/write or write/write cases without the Linux advisory/mandatory 
locks the file contents' concurrency won't be guaranteed, so the logic 
is the same here ?

If so, couldn't we just assume the Fw vs Fw and Fr vs Fw caps should be 
exclusive in correct use case ? For example, just after the mds filelock 
state switches to LOCK_MIX, if clientA gets the advisory file lock and 
the Fw caps, and even another clientB could be successfully issued the 
Fr caps, the clientB won't do any read because it should be still stuck 
and be waiting for the advisory file lock.


> IIRC, Frw don't have the same exclusionary relationship
> that (e.g.) Asx has. To exclude Fr, you may need Fb.
>
> (Side note: these rules really need to be codified into a document
> somewhere. I started that with the capabilities doc in the ceph tree,
> but it's light on details of the FILE caps)

Yeah, that doc is light on the details for now.


>> And if after clientB have been issued the Fw caps and have modified that
>> block and still not flushed back, a new clientC comes and wants to read
>> the file, so the Fw caps must be revoked from clientB and the dirty data
>> will be flushed, and then when releasing the Fw caps to the MDS, it will
>> update the new fscrypt_file meta together.
>>
>> I haven't see which case will be racy yet. Or did I miss some cases or
>> something important ?
>>
>>
> We also need to consider how legacy, non-fscrypt-capable clients will
> interact with files that have this field set. If one comes in and writes
> to or truncates one of these files, it's liable to leave a real mess.
> The simplest solution may be to just bar them from doing anything with
> fscrypted files aside from deleting them -- maybe we'd allow them to
> acquire Fr caps but not Fw?

For the legacy, non-fscrypt-capable clients, the reading contents should 
be encrypted any way, so there won't be any problem even if that 
specified block is not RMWed yet and it should ignore this field.

But for write case, I think the MDS should fail it in the open() stage 
if the mode has any of Write/Truncate, etc, and only Read/Buffer-read, 
etc are allowed. Or if we change the mds/Locker.cc code by not allowing 
it to issue the Fw caps to the legacy/non-fscrypt-capable clients, after 
the file is successfully opened with Write mode, it should be stuck 
forever when writing data to the file by waiting the Fw caps, which will 
never come ?

Thanks

>>> Really, it comes down to the fact that truncates are supposed to be an
>>> atomic operation, but we need to perform actions in two different
>>> places.
>>>
>>> Hmmm...it's worse than that even -- if the truncate changes it so that
>>> the last block is in an entirely different object, then there are 3
>>> places that will need to coordinate access.
>>>
>>> Tricky.
Jeff Layton Sept. 13, 2021, 2:05 p.m. UTC | #9
On Mon, 2021-09-13 at 13:42 +0800, Xiubo Li wrote:
> On 9/10/21 7:46 PM, Jeff Layton wrote:
> > On Fri, 2021-09-10 at 10:30 +0800, Xiubo Li wrote:
> > > On 9/9/21 8:48 PM, Jeff Layton wrote:
> > > > On Thu, 2021-09-09 at 11:38 +0800, Xiubo Li wrote:
> > > > > On 9/8/21 9:57 PM, Jeff Layton wrote:
> > > > > > On Wed, 2021-09-08 at 17:37 +0800, Xiubo Li wrote:
> > > > > > > On 9/8/21 12:26 AM, Jeff Layton wrote:
> > > > > > > > On Fri, 2021-09-03 at 16:15 +0800, xiubli@redhat.com wrote:
> > > > > > > > > From: Xiubo Li <xiubli@redhat.com>
> > > > > > > > > 
> > > > > > > > > When truncating the file, it will leave the MDS to handle that,
> > > > > > > > > but the MDS won't update the file contents. So when the fscrypt
> > > > > > > > > is enabled, if the truncate size is not aligned to the block size,
> > > > > > > > > the kclient will round up the truancate size to the block size and
> > > > > > > > > leave the the last block untouched.
> > > > > > > > > 
> > > > > > > > > The opaque fscrypt_file field will be used to tricker whether the
> > > > > > > > > last block need to do the rmw to truncate the a specified block's
> > > > > > > > > contents, we can get which block needs to do the rmw by round down
> > > > > > > > > the fscrypt_file.
> > > > > > > > > 
> > > > > > > > > In kclient side, there is not need to do the rmw immediately after
> > > > > > > > > the file is truncated. We can defer doing that whenever the kclient
> > > > > > > > > will update that block in late future. And before that any kclient
> > > > > > > > > will check the fscrypt_file field when reading that block, if the
> > > > > > > > > fscrypt_file field is none zero that means the related block needs
> > > > > > > > > to be zeroed in range of [fscrypt_file, round_up(fscrypt_file + PAGE_SIZE))
> > > > > > > > > in pagecache or readed data buffer.
> > > > > > > > > 
> > > > > > > > s/PAGE_SIZE/CEPH_FSCRYPT_BLOCK_SIZE/
> > > > > > > > 
> > > > > > > > Yes, on x86_64 they are equal, but that's not the case on all arches.
> > > > > > > > Also, we are moving toward a pagecache that may hold larger pages on
> > > > > > > > x86_64 too.
> > > > > > > >      
> > > > > > > Okay.
> > > > > > > > > Once the that block contents are updated and writeback
> > > > > > > > > kclient will reset the fscrypt_file field in MDS side, then 0 means
> > > > > > > > > no need to care about the truncate stuff any more.
> > > > > > > > > 
> > > > > > > > I'm a little unclear on what the fscrypt_file actually represents here.
> > > > > > > > 
> > > > > > > > I had proposed that we just make the fscrypt_file field hold the
> > > > > > > > "actual" i_size and we'd make the old size field always be a rounded-
> > > > > > > > up version of the size. The MDS would treat that as an opaque value
> > > > > > > > under Fw caps, and the client could use that field to determine i_size.
> > > > > > > > That has a side benefit too -- if the client doesn't support fscrypt,
> > > > > > > > it'll see the rounded-up sizes which are close enough and don't violate
> > > > > > > > any POSIX rules.
> > > > > > > > 
> > > > > > > > In your version, fscrypt_file also holds the actual size of the inode,
> > > > > > > > but sometimes you're zeroing it out, and I don't understand why:
> > > > > > > I think I forgot to fix this after I adapt to multiple ftruncates case,
> > > > > > > this patch is not correctly handling the "actual" file size.
> > > > > > > 
> > > > > > > I just want the fscrypt_file field always to hold the offset from which
> > > > > > > the contents needed to be zeroed, and the range should be [fscrypt_file,
> > > > > > > round_up(fscrypt_file +CEPH_FSCRYPT_BLOCK_SIZE)).
> > > > > > > 
> > > > > > > In single ftruncate case the fscrypt_file should equal to the "actual"
> > > > > > > file size. Then the "req->r_args.setattr.size = attr->ia_size" and
> > > > > > > "req->r_args.setattr.old_size = isize", no need to round up in
> > > > > > > __ceph_setattr() in kclient side, and leave the MDS to do that, but we
> > > > > > > need to pass the CEPH_FSCRYPT_BLOCK_SIZE at the same time.
> > > > > > > 
> > > > > > I'm really not a fan of pushing this logic into the MDS. Why does it
> > > > > > need to know anything about the CEPH_FSCRYPT_BLOCK_SIZE at all?
> > > > >    From your current patch set, you are rounding up the
> > > > > "req->r_args.setattr.size" and "req->r_args.setattr.old_size" to the
> > > > > BLOCK end in __ceph_setattr().
> > > > > 
> > > > > Without considering keep the file scryption logic in kclient only, I
> > > > > need the "req->r_args.setattr.size" to keep the file's real size.
> > > > > 
> > > > > Since the MDS will do the truncate stuff. And if we won't round the
> > > > > "req->r_args.setattr.size" up to the BLOCK end any more, then the MDS
> > > > > needs to know whether and how to round up the file size to the block end
> > > > > when truncating the file. Because the fscrypt_file won't record the
> > > > > file's real size any more, it maybe zero, more detail please see the
> > > > > example below.
> > > > > 
> > > > > Yeah, but as you mentioned bellow if we will keep the file scryption
> > > > > logic in kclient only, I need one extra field to do the defer rmw:
> > > > > 
> > > > > struct fscrypt_file {
> > > > > 
> > > > >        u64 file_real_size;         // always keep the file's real size and
> > > > > the "req->r_args.setattr.size = round_up(file_real_size, BLOCK_SIZE)" as
> > > > > you do in your current patch set.
> > > > > 
> > > > >        u64 file_truncate_offset;  // this will always record in which
> > > > > BLOCK we need to do the rmw, this maybe 0 or located in the file's LAST
> > > > > block and maybe not, more detail please the example below.
> > > > > 
> > > > > }
> > > > > 
> > > > > The "file_truncate_offset" member will be what I need to do the defer rmw.
> > > > > 
> > > > > 
> > > > > > > But in multiple ftruncates case if the second ftruncate has a larger
> > > > > > > size, the fscrypt_file won't be changed and the ceph backend will help
> > > > > > > zero the extended part, but we still need to zero the contents in the
> > > > > > > first ftruncate. If the second ftruncate has a smaller size, the
> > > > > > > fscrypt_file need to be updated and always keeps the smaller size.
> > > > > > > 
> > > > > > I don't get it. Maybe you can walk me through a concrete example of how
> > > > > > racing ftruncates are a problem?
> > > > > Sorry for confusing.
> > > > > 
> > > > > For example:
> > > > > 
> > > > > 1), if there has a file named "bar", and currently the size is 100K
> > > > > bytes, the CEPH_FSCRYPT_BLOCK_SIZE size equals to 4K.
> > > > > 
> > > > > 2), first ftruncate("bar", 7K) comes, then both the file real size and
> > > > > "file_truncate_offset" will be set to 7K, then in MDS it will truncate
> > > > > the file from 8K, and the last block's [7K, 8K) contents need to be
> > > > > zeroed anyway later.
> > > > > 
> > > > > 3), immediately a second ftruncate("bar", 16K) comes, from the ftruncate
> > > > > man page it says the new extended [7K, 16K) should be zeroed when
> > > > > truncating the file. That means the OSD should help zero the [8K, 16K),
> > > > > but won't touch the block [4K, 8K), in which the [7K, 8K) contents still
> > > > > needs to be zeroed. So in this case the "file_truncate_offset" won't be
> > > > > changed and still be 7K. Then the "file_truncate_offset" won't be
> > > > > located in the last block of the file any more.
> > > > > 
> > > > Woah, why didn't the file_truncate_offset get changed when you truncated
> > > > up to 16k?
> > > > 
> > > > When you issue the truncate SETATTR to the MDS, you're changing the size
> > > > field in the inode. The fscrypt_file field _must_ be updated at the same
> > > > time. I think that means that we need to extend SETATTR to also update
> > > > fscrypt_file.
> > > > 
> > > > > 4), if the second ftruncate in step 3) the new size is 3K, then the MDS
> > > > > will truncate the file from 4K and [7K, 8K) contents will be discard
> > > > > anyway, so we need to update the "file_truncate_offset" to 3K, that
> > > > > means a new BLOCK [0K, 4K) needs to do the rmw, by zeroing the [3K, 4K).
> > > > > 
> > > > > 5), if the new truncate in step 3) the new size is 4K, since the 4K < 7K
> > > > > and 4K is aligned to the BLOCK size, so no need to rmw any block any
> > > > > more, then we can just clear the "file_truncate_offset" field.
> > > > > 
> > > > > 
> > > > > For defer RMW logic please see the following example:
> > > > > 
> > > > > 
> > > > Ok, thanks. I think I understand now how racing truncates are a problem.
> > > > Really, it comes down to the fact that we don't have a good way to
> > > > mediate the last-block RMW operation when competing clients are issuing
> > > > truncates.
> > > > 
> > > > I'm not sure adding this file_truncate_offset field really does all that
> > > > much good though. You don't really have a way to ensure that a
> > > > truncating client will see the changes to that field before it issues
> > > > its RMW op.
> > > > 
> > > > > > Suppose we have a file and client1 truncates it down from a large size
> > > > > > to 7k. client1 then sends the MDS a SETATTR to truncate it at 8k, and
> > > > > > does a RMW on the last (4k) block. client2 comes along at the same time
> > > > > > and truncates it up to 13k. client2 issues a SETATTR to extend the file
> > > > > > to 16k and does a RMW on the last block too (which would presumably
> > > > > > already be all zeroes anyway).
> > > > > I think you meant in the none defer RMW case, this is not what my defer
> > > > > RMW approach will do.
> > > > > 
> > > > > After the client1 truncated the file, it won't do the RMW if it won't
> > > > > write any data to that file, and then we assume the client1 is unmounted
> > > > > immediately.
> > > > > 
> > > > > And when the client1 is truncating the file, it will update the
> > > > > "file_truncate_offset" to 7K, which means the [7K, 8K) in the LAST block
> > > > > needs to be zeroed.
> > > > > 
> > > > > Then the client2 reads that the file size is 7K and the
> > > > > "file_truncate_offset" is 7K too, and the client2 wants to truncate the
> > > > > file up to 13K. Since the OSD should help us zero the extended part [8K,
> > > > > 13K] when truncating, but won't touch the block [4K, 8K), for which it
> > > > > still needs to do the RMW. Then the client2 is unmounted too before
> > > > > writing any data to the file. After this the "file_truncate_offset"
> > > > > won't be located in the file's LAST block any more.
> > > > > 
> > > > > After that if the client3 will update the whole file's contents, it will
> > > > > read all the file 13K bytes contents to local page buffers, since the
> > > > > "file_truncate_offset" is 7K and then in the page buffer the range [7K,
> > > > > 8K) will be zeroed just after the contents are dencrypted inplace. Then
> > > > > if the client3 successfully flushes that dirty data back and then the
> > > > > deferred RMW for block [4K, 8K) should be done at the same time, and the
> > > > > "file_truncate_offset" should be cleared too.
> > > > > 
> > > > > While if the client3 won't update the block [4K, 8K), the
> > > > > "file_truncate_offset" will be kept all the time until the above RMW is
> > > > > done in future.
> > > > > 
> > > > > 
> > > > Ok, I think I finally understand what you're saying.
> > > > 
> > > > You want to rely on the next client to do a write to handle the zeroing
> > > > at the end. You basically just want to keep track of whether and where
> > > > it should zero up to the end of the next crypto block, and defer that
> > > > until a client is writing.
> > > Yeah, the writing could also be in the same client just after the file
> > > is truncated.
> > > 
> > > 
> > > > I'll have to think about whether that's still racy. Part of the problem
> > > > is that once the client doesn't have caps, it doesn't have a way to
> > > > ensure that fscrypt_file (whatever it holds) doesn't change while it's
> > > > doing that zeroing.
> > > As my understanding, if clientA want to read a file, it will open it
> > > with RD mode, then it will get the fscrypt_file meta and "Fr" caps, then
> > > it can safely read the file contents and do the zeroing for that block.
> > Ok, so in this case, the client is just zeroing out the appropriate
> > region in the resulting read data, and not on the OSD. That should be
> > ok.
> > 
> > > Then the opened file shouldn't worry whether the fscrypt_file will be
> > > changed by others during it still holding the Fr caps, because if the
> > > clientB wants to update the fscrypt_file it must acquire the Fw caps
> > > first, before that the Fr caps must be revoked from clientA and at the
> > > same time the read pagecache in clientA will be invalidated.
> > > 
> > Are you certain that Fw caps is enough to ensure that no other client
> > holds Fr caps?
> 
> I spent hours and went through the mds Locker related code on the weekends.
> 

Ouch. ;)

>  From the mds/lock.cc code, for mds filelock for example in the LOCK_MIX 
> state and some interim transition states to LOCK_MIX it will allow 
> different clients could hold any of Fw or Fr caps. But the Fb/Fc will be 
> disabled. Checked the mds/Locker.cc code, found that the mds filelock 
> could to switch LOCK_MIX state in some cases when there has one client 
> wants Fw and another client wants any of Fr and Fw.
> 
> In this case I think the Linux advisory or mandatory locks are necessary 
> to keep the file contents concurrency. In multiple processes' concurrent 
> read/write or write/write cases without the Linux advisory/mandatory 
> locks the file contents' concurrency won't be guaranteed, so the logic 
> is the same here ?
> 

You mean like advisory fcntl or flock locks? Or something else? We can't
assume that applications will be coded to use locking, so this would
mean doing some sort of advisory/mandatory locking between the client
and the MDS itself.

Handling file locking over a distributed system is _really_ difficult.
You have to deal with all of the cases where the client or MDS could
spontaneously crash, and deal with them.

I don't think we can rely on any sort of exclusion or locking between
the clients for this. We'll need a way to do this that doesn't rely on
it.

> If so, couldn't we just assume the Fw vs Fw and Fr vs Fw caps should be 
> exclusive in correct use case ? For example, just after the mds filelock 
> state switches to LOCK_MIX, if clientA gets the advisory file lock and 
> the Fw caps, and even another clientB could be successfully issued the 
> Fr caps, the clientB won't do any read because it should be still stuck 
> and be waiting for the advisory file lock.
> 
> 

Changing the semantics of how Frw behave is probably a non-starter, but
maybe there is some way to gate that on the clients being fscrypt-
enabled.


> > IIRC, Frw don't have the same exclusionary relationship
> > that (e.g.) Asx has. To exclude Fr, you may need Fb.
> > 
> > (Side note: these rules really need to be codified into a document
> > somewhere. I started that with the capabilities doc in the ceph tree,
> > but it's light on details of the FILE caps)
> 
> Yeah, that doc is light on the details for now.
> 
> 
> > > And if after clientB have been issued the Fw caps and have modified that
> > > block and still not flushed back, a new clientC comes and wants to read
> > > the file, so the Fw caps must be revoked from clientB and the dirty data
> > > will be flushed, and then when releasing the Fw caps to the MDS, it will
> > > update the new fscrypt_file meta together.
> > > 
> > > I haven't see which case will be racy yet. Or did I miss some cases or
> > > something important ?
> > > 
> > > 
> > We also need to consider how legacy, non-fscrypt-capable clients will
> > interact with files that have this field set. If one comes in and writes
> > to or truncates one of these files, it's liable to leave a real mess.
> > The simplest solution may be to just bar them from doing anything with
> > fscrypted files aside from deleting them -- maybe we'd allow them to
> > acquire Fr caps but not Fw?
> 
> For the legacy, non-fscrypt-capable clients, the reading contents should 
> be encrypted any way, so there won't be any problem even if that 
> specified block is not RMWed yet and it should ignore this field.
> 
> But for write case, I think the MDS should fail it in the open() stage 
> if the mode has any of Write/Truncate, etc, and only Read/Buffer-read, 
> etc are allowed. Or if we change the mds/Locker.cc code by not allowing 
> it to issue the Fw caps to the legacy/non-fscrypt-capable clients, after 
> the file is successfully opened with Write mode, it should be stuck 
> forever when writing data to the file by waiting the Fw caps, which will 
> never come ?
> 

Yeah. I think we're probably going to need the MDS to do something to
keep non-fscrypt enabled clients away from encrypted file contents.
Making them fail at open time is probably the right thing to do.

That said, I'm not sure that really helps too much for solving the
truncation problems.

> Thanks
> 
> > > > Really, it comes down to the fact that truncates are supposed to be an
> > > > atomic operation, but we need to perform actions in two different
> > > > places.
> > > > 
> > > > Hmmm...it's worse than that even -- if the truncate changes it so that
> > > > the last block is in an entirely different object, then there are 3
> > > > places that will need to coordinate access.
> > > > 
> > > > Tricky.
>
Jeff Layton Sept. 13, 2021, 7:34 p.m. UTC | #10
On Mon, 2021-09-13 at 13:42 +0800, Xiubo Li wrote:
> On 9/10/21 7:46 PM, Jeff Layton wrote:
> > On Fri, 2021-09-10 at 10:30 +0800, Xiubo Li wrote:
> > > On 9/9/21 8:48 PM, Jeff Layton wrote:
> > > > On Thu, 2021-09-09 at 11:38 +0800, Xiubo Li wrote:
> > > > > On 9/8/21 9:57 PM, Jeff Layton wrote:
> > > > > > On Wed, 2021-09-08 at 17:37 +0800, Xiubo Li wrote:
> > > > > > > On 9/8/21 12:26 AM, Jeff Layton wrote:
> > > > > > > > On Fri, 2021-09-03 at 16:15 +0800, xiubli@redhat.com wrote:
> > > > > > > > > From: Xiubo Li <xiubli@redhat.com>
> > > > > > > > > 
> > > > > > > > > When truncating the file, it will leave the MDS to handle that,
> > > > > > > > > but the MDS won't update the file contents. So when the fscrypt
> > > > > > > > > is enabled, if the truncate size is not aligned to the block size,
> > > > > > > > > the kclient will round up the truancate size to the block size and
> > > > > > > > > leave the the last block untouched.
> > > > > > > > > 
> > > > > > > > > The opaque fscrypt_file field will be used to tricker whether the
> > > > > > > > > last block need to do the rmw to truncate the a specified block's
> > > > > > > > > contents, we can get which block needs to do the rmw by round down
> > > > > > > > > the fscrypt_file.
> > > > > > > > > 
> > > > > > > > > In kclient side, there is not need to do the rmw immediately after
> > > > > > > > > the file is truncated. We can defer doing that whenever the kclient
> > > > > > > > > will update that block in late future. And before that any kclient
> > > > > > > > > will check the fscrypt_file field when reading that block, if the
> > > > > > > > > fscrypt_file field is none zero that means the related block needs
> > > > > > > > > to be zeroed in range of [fscrypt_file, round_up(fscrypt_file + PAGE_SIZE))
> > > > > > > > > in pagecache or readed data buffer.
> > > > > > > > > 
> > > > > > > > s/PAGE_SIZE/CEPH_FSCRYPT_BLOCK_SIZE/
> > > > > > > > 
> > > > > > > > Yes, on x86_64 they are equal, but that's not the case on all arches.
> > > > > > > > Also, we are moving toward a pagecache that may hold larger pages on
> > > > > > > > x86_64 too.
> > > > > > > >      
> > > > > > > Okay.
> > > > > > > > > Once the that block contents are updated and writeback
> > > > > > > > > kclient will reset the fscrypt_file field in MDS side, then 0 means
> > > > > > > > > no need to care about the truncate stuff any more.
> > > > > > > > > 
> > > > > > > > I'm a little unclear on what the fscrypt_file actually represents here.
> > > > > > > > 
> > > > > > > > I had proposed that we just make the fscrypt_file field hold the
> > > > > > > > "actual" i_size and we'd make the old size field always be a rounded-
> > > > > > > > up version of the size. The MDS would treat that as an opaque value
> > > > > > > > under Fw caps, and the client could use that field to determine i_size.
> > > > > > > > That has a side benefit too -- if the client doesn't support fscrypt,
> > > > > > > > it'll see the rounded-up sizes which are close enough and don't violate
> > > > > > > > any POSIX rules.
> > > > > > > > 
> > > > > > > > In your version, fscrypt_file also holds the actual size of the inode,
> > > > > > > > but sometimes you're zeroing it out, and I don't understand why:
> > > > > > > I think I forgot to fix this after I adapt to multiple ftruncates case,
> > > > > > > this patch is not correctly handling the "actual" file size.
> > > > > > > 
> > > > > > > I just want the fscrypt_file field always to hold the offset from which
> > > > > > > the contents needed to be zeroed, and the range should be [fscrypt_file,
> > > > > > > round_up(fscrypt_file +CEPH_FSCRYPT_BLOCK_SIZE)).
> > > > > > > 
> > > > > > > In single ftruncate case the fscrypt_file should equal to the "actual"
> > > > > > > file size. Then the "req->r_args.setattr.size = attr->ia_size" and
> > > > > > > "req->r_args.setattr.old_size = isize", no need to round up in
> > > > > > > __ceph_setattr() in kclient side, and leave the MDS to do that, but we
> > > > > > > need to pass the CEPH_FSCRYPT_BLOCK_SIZE at the same time.
> > > > > > > 
> > > > > > I'm really not a fan of pushing this logic into the MDS. Why does it
> > > > > > need to know anything about the CEPH_FSCRYPT_BLOCK_SIZE at all?
> > > > >    From your current patch set, you are rounding up the
> > > > > "req->r_args.setattr.size" and "req->r_args.setattr.old_size" to the
> > > > > BLOCK end in __ceph_setattr().
> > > > > 
> > > > > Without considering keep the file scryption logic in kclient only, I
> > > > > need the "req->r_args.setattr.size" to keep the file's real size.
> > > > > 
> > > > > Since the MDS will do the truncate stuff. And if we won't round the
> > > > > "req->r_args.setattr.size" up to the BLOCK end any more, then the MDS
> > > > > needs to know whether and how to round up the file size to the block end
> > > > > when truncating the file. Because the fscrypt_file won't record the
> > > > > file's real size any more, it maybe zero, more detail please see the
> > > > > example below.
> > > > > 
> > > > > Yeah, but as you mentioned bellow if we will keep the file scryption
> > > > > logic in kclient only, I need one extra field to do the defer rmw:
> > > > > 
> > > > > struct fscrypt_file {
> > > > > 
> > > > >        u64 file_real_size;         // always keep the file's real size and
> > > > > the "req->r_args.setattr.size = round_up(file_real_size, BLOCK_SIZE)" as
> > > > > you do in your current patch set.
> > > > > 
> > > > >        u64 file_truncate_offset;  // this will always record in which
> > > > > BLOCK we need to do the rmw, this maybe 0 or located in the file's LAST
> > > > > block and maybe not, more detail please the example below.
> > > > > 
> > > > > }
> > > > > 
> > > > > The "file_truncate_offset" member will be what I need to do the defer rmw.
> > > > > 
> > > > > 
> > > > > > > But in multiple ftruncates case if the second ftruncate has a larger
> > > > > > > size, the fscrypt_file won't be changed and the ceph backend will help
> > > > > > > zero the extended part, but we still need to zero the contents in the
> > > > > > > first ftruncate. If the second ftruncate has a smaller size, the
> > > > > > > fscrypt_file need to be updated and always keeps the smaller size.
> > > > > > > 
> > > > > > I don't get it. Maybe you can walk me through a concrete example of how
> > > > > > racing ftruncates are a problem?
> > > > > Sorry for confusing.
> > > > > 
> > > > > For example:
> > > > > 
> > > > > 1), if there has a file named "bar", and currently the size is 100K
> > > > > bytes, the CEPH_FSCRYPT_BLOCK_SIZE size equals to 4K.
> > > > > 
> > > > > 2), first ftruncate("bar", 7K) comes, then both the file real size and
> > > > > "file_truncate_offset" will be set to 7K, then in MDS it will truncate
> > > > > the file from 8K, and the last block's [7K, 8K) contents need to be
> > > > > zeroed anyway later.
> > > > > 
> > > > > 3), immediately a second ftruncate("bar", 16K) comes, from the ftruncate
> > > > > man page it says the new extended [7K, 16K) should be zeroed when
> > > > > truncating the file. That means the OSD should help zero the [8K, 16K),
> > > > > but won't touch the block [4K, 8K), in which the [7K, 8K) contents still
> > > > > needs to be zeroed. So in this case the "file_truncate_offset" won't be
> > > > > changed and still be 7K. Then the "file_truncate_offset" won't be
> > > > > located in the last block of the file any more.
> > > > > 
> > > > Woah, why didn't the file_truncate_offset get changed when you truncated
> > > > up to 16k?
> > > > 
> > > > When you issue the truncate SETATTR to the MDS, you're changing the size
> > > > field in the inode. The fscrypt_file field _must_ be updated at the same
> > > > time. I think that means that we need to extend SETATTR to also update
> > > > fscrypt_file.
> > > > 
> > > > > 4), if the second ftruncate in step 3) the new size is 3K, then the MDS
> > > > > will truncate the file from 4K and [7K, 8K) contents will be discard
> > > > > anyway, so we need to update the "file_truncate_offset" to 3K, that
> > > > > means a new BLOCK [0K, 4K) needs to do the rmw, by zeroing the [3K, 4K).
> > > > > 
> > > > > 5), if the new truncate in step 3) the new size is 4K, since the 4K < 7K
> > > > > and 4K is aligned to the BLOCK size, so no need to rmw any block any
> > > > > more, then we can just clear the "file_truncate_offset" field.
> > > > > 
> > > > > 
> > > > > For defer RMW logic please see the following example:
> > > > > 
> > > > > 
> > > > Ok, thanks. I think I understand now how racing truncates are a problem.
> > > > Really, it comes down to the fact that we don't have a good way to
> > > > mediate the last-block RMW operation when competing clients are issuing
> > > > truncates.
> > > > 
> > > > I'm not sure adding this file_truncate_offset field really does all that
> > > > much good though. You don't really have a way to ensure that a
> > > > truncating client will see the changes to that field before it issues
> > > > its RMW op.
> > > > 
> > > > > > Suppose we have a file and client1 truncates it down from a large size
> > > > > > to 7k. client1 then sends the MDS a SETATTR to truncate it at 8k, and
> > > > > > does a RMW on the last (4k) block. client2 comes along at the same time
> > > > > > and truncates it up to 13k. client2 issues a SETATTR to extend the file
> > > > > > to 16k and does a RMW on the last block too (which would presumably
> > > > > > already be all zeroes anyway).
> > > > > I think you meant in the none defer RMW case, this is not what my defer
> > > > > RMW approach will do.
> > > > > 
> > > > > After the client1 truncated the file, it won't do the RMW if it won't
> > > > > write any data to that file, and then we assume the client1 is unmounted
> > > > > immediately.
> > > > > 
> > > > > And when the client1 is truncating the file, it will update the
> > > > > "file_truncate_offset" to 7K, which means the [7K, 8K) in the LAST block
> > > > > needs to be zeroed.
> > > > > 
> > > > > Then the client2 reads that the file size is 7K and the
> > > > > "file_truncate_offset" is 7K too, and the client2 wants to truncate the
> > > > > file up to 13K. Since the OSD should help us zero the extended part [8K,
> > > > > 13K] when truncating, but won't touch the block [4K, 8K), for which it
> > > > > still needs to do the RMW. Then the client2 is unmounted too before
> > > > > writing any data to the file. After this the "file_truncate_offset"
> > > > > won't be located in the file's LAST block any more.
> > > > > 
> > > > > After that if the client3 will update the whole file's contents, it will
> > > > > read all the file 13K bytes contents to local page buffers, since the
> > > > > "file_truncate_offset" is 7K and then in the page buffer the range [7K,
> > > > > 8K) will be zeroed just after the contents are dencrypted inplace. Then
> > > > > if the client3 successfully flushes that dirty data back and then the
> > > > > deferred RMW for block [4K, 8K) should be done at the same time, and the
> > > > > "file_truncate_offset" should be cleared too.
> > > > > 
> > > > > While if the client3 won't update the block [4K, 8K), the
> > > > > "file_truncate_offset" will be kept all the time until the above RMW is
> > > > > done in future.
> > > > > 
> > > > > 
> > > > Ok, I think I finally understand what you're saying.
> > > > 
> > > > You want to rely on the next client to do a write to handle the zeroing
> > > > at the end. You basically just want to keep track of whether and where
> > > > it should zero up to the end of the next crypto block, and defer that
> > > > until a client is writing.
> > > Yeah, the writing could also be in the same client just after the file
> > > is truncated.
> > > 
> > > 
> > > > I'll have to think about whether that's still racy. Part of the problem
> > > > is that once the client doesn't have caps, it doesn't have a way to
> > > > ensure that fscrypt_file (whatever it holds) doesn't change while it's
> > > > doing that zeroing.
> > > As my understanding, if clientA want to read a file, it will open it
> > > with RD mode, then it will get the fscrypt_file meta and "Fr" caps, then
> > > it can safely read the file contents and do the zeroing for that block.
> > Ok, so in this case, the client is just zeroing out the appropriate
> > region in the resulting read data, and not on the OSD. That should be
> > ok.
> > 
> > > Then the opened file shouldn't worry whether the fscrypt_file will be
> > > changed by others during it still holding the Fr caps, because if the
> > > clientB wants to update the fscrypt_file it must acquire the Fw caps
> > > first, before that the Fr caps must be revoked from clientA and at the
> > > same time the read pagecache in clientA will be invalidated.
> > > 
> > Are you certain that Fw caps is enough to ensure that no other client
> > holds Fr caps?
> 
> I spent hours and went through the mds Locker related code on the weekends.
> 
>  From the mds/lock.cc code, for mds filelock for example in the LOCK_MIX 
> state and some interim transition states to LOCK_MIX it will allow 
> different clients could hold any of Fw or Fr caps. But the Fb/Fc will be 
> disabled. Checked the mds/Locker.cc code, found that the mds filelock 
> could to switch LOCK_MIX state in some cases when there has one client 
> wants Fw and another client wants any of Fr and Fw.
> 
> In this case I think the Linux advisory or mandatory locks are necessary 
> to keep the file contents concurrency. In multiple processes' concurrent 
> read/write or write/write cases without the Linux advisory/mandatory 
> locks the file contents' concurrency won't be guaranteed, so the logic 
> is the same here ?
> 
> If so, couldn't we just assume the Fw vs Fw and Fr vs Fw caps should be 
> exclusive in correct use case ? For example, just after the mds filelock 
> state switches to LOCK_MIX, if clientA gets the advisory file lock and 
> the Fw caps, and even another clientB could be successfully issued the 
> Fr caps, the clientB won't do any read because it should be still stuck 
> and be waiting for the advisory file lock.
> 

I'm not sure I like that idea. Basically, that would change the meaning
of the what Frw caps represent, in a way that is not really consistent
with how they have been used before.

We could gate that new behavior on the new feature flags, but it sounds
pretty tough.

I think we have a couple of options:

1) we could just make the clients request and wait on Fx caps when they
do a truncate. They might stall for a bit if there is contention, but it
would ensure consistency and the client could be completely in charge of
the truncate. [a]

2) we could rev the protocol, and have the client send along the last
block to be written along with the SETATTR request. Maybe we even
consider just adding a new TRUNCATE call independent of SETATTR. The MDS
would remain in complete control of it at that point.

The other ideas I've considered seem more complex and don't offer any
significant advantages that I can see.

[a]: Side question: why does buffering a truncate require Fx and not Fb?
How do Fx and Fb interact?

> 
> > IIRC, Frw don't have the same exclusionary relationship
> > that (e.g.) Asx has. To exclude Fr, you may need Fb.
> > 
> > (Side note: these rules really need to be codified into a document
> > somewhere. I started that with the capabilities doc in the ceph tree,
> > but it's light on details of the FILE caps)
> 
> Yeah, that doc is light on the details for now.
> 
> 
> > > And if after clientB have been issued the Fw caps and have modified that
> > > block and still not flushed back, a new clientC comes and wants to read
> > > the file, so the Fw caps must be revoked from clientB and the dirty data
> > > will be flushed, and then when releasing the Fw caps to the MDS, it will
> > > update the new fscrypt_file meta together.
> > > 
> > > I haven't see which case will be racy yet. Or did I miss some cases or
> > > something important ?
> > > 
> > > 
> > We also need to consider how legacy, non-fscrypt-capable clients will
> > interact with files that have this field set. If one comes in and writes
> > to or truncates one of these files, it's liable to leave a real mess.
> > The simplest solution may be to just bar them from doing anything with
> > fscrypted files aside from deleting them -- maybe we'd allow them to
> > acquire Fr caps but not Fw?
> 
> For the legacy, non-fscrypt-capable clients, the reading contents should 
> be encrypted any way, so there won't be any problem even if that 
> specified block is not RMWed yet and it should ignore this field.
> 

Right, and I think we have to allow those clients to request Fr caps so
that they have the ability to backup and archive encrypted files without
needing the key. The cephfs-mirror-daemon, in particular, may need this.

> But for write case, I think the MDS should fail it in the open() stage 
> if the mode has any of Write/Truncate, etc, and only Read/Buffer-read, 
> etc are allowed. Or if we change the mds/Locker.cc code by not allowing 
> it to issue the Fw caps to the legacy/non-fscrypt-capable clients, after 
> the file is successfully opened with Write mode, it should be stuck 
> forever when writing data to the file by waiting the Fw caps, which will 
> never come ?
> 

Yes. Those clients should be barred from making any changes to file
contents or doing anything that might result in a new dentry being
attached to an existing inode.

We need to allow them to read files, and unlink them, but that's really
about it.

> 
> > > > Really, it comes down to the fact that truncates are supposed to be an
> > > > atomic operation, but we need to perform actions in two different
> > > > places.
> > > > 
> > > > Hmmm...it's worse than that even -- if the truncate changes it so that
> > > > the last block is in an entirely different object, then there are 3
> > > > places that will need to coordinate access.
> > > > 
> > > > Tricky.
>
Xiubo Li Sept. 14, 2021, 5:40 a.m. UTC | #11
On 9/14/21 3:34 AM, Jeff Layton wrote:

[...]

> I'll have to think about whether that's still racy. Part of the problem
>>>>> is that once the client doesn't have caps, it doesn't have a way to
>>>>> ensure that fscrypt_file (whatever it holds) doesn't change while it's
>>>>> doing that zeroing.
>>>> As my understanding, if clientA want to read a file, it will open it
>>>> with RD mode, then it will get the fscrypt_file meta and "Fr" caps, then
>>>> it can safely read the file contents and do the zeroing for that block.
>>> Ok, so in this case, the client is just zeroing out the appropriate
>>> region in the resulting read data, and not on the OSD. That should be
>>> ok.
>>>
>>>> Then the opened file shouldn't worry whether the fscrypt_file will be
>>>> changed by others during it still holding the Fr caps, because if the
>>>> clientB wants to update the fscrypt_file it must acquire the Fw caps
>>>> first, before that the Fr caps must be revoked from clientA and at the
>>>> same time the read pagecache in clientA will be invalidated.
>>>>
>>> Are you certain that Fw caps is enough to ensure that no other client
>>> holds Fr caps?
>> I spent hours and went through the mds Locker related code on the weekends.
>>
>>   From the mds/lock.cc code, for mds filelock for example in the LOCK_MIX
>> state and some interim transition states to LOCK_MIX it will allow
>> different clients could hold any of Fw or Fr caps. But the Fb/Fc will be
>> disabled. Checked the mds/Locker.cc code, found that the mds filelock
>> could to switch LOCK_MIX state in some cases when there has one client
>> wants Fw and another client wants any of Fr and Fw.
>>
>> In this case I think the Linux advisory or mandatory locks are necessary
>> to keep the file contents concurrency. In multiple processes' concurrent
>> read/write or write/write cases without the Linux advisory/mandatory
>> locks the file contents' concurrency won't be guaranteed, so the logic
>> is the same here ?
>>
>> If so, couldn't we just assume the Fw vs Fw and Fr vs Fw caps should be
>> exclusive in correct use case ? For example, just after the mds filelock
>> state switches to LOCK_MIX, if clientA gets the advisory file lock and
>> the Fw caps, and even another clientB could be successfully issued the
>> Fr caps, the clientB won't do any read because it should be still stuck
>> and be waiting for the advisory file lock.
>>
> I'm not sure I like that idea. Basically, that would change the meaning
> of the what Frw caps represent, in a way that is not really consistent
> with how they have been used before.
>
> We could gate that new behavior on the new feature flags, but it sounds
> pretty tough.
>
> I think we have a couple of options:
>
> 1) we could just make the clients request and wait on Fx caps when they
> do a truncate. They might stall for a bit if there is contention, but it
> would ensure consistency and the client could be completely in charge of
> the truncate. [a]

Yeah, for my defer RMW approach we need to held the Fx caps every time 
when writing/truncating files, and the Fs caps every time when reading.

While currently almost all the read/write code have ignored them because 
read/write do not need them in most cases.

I am not sure if we add the Fx caps to the 'need' in 
write/truncating,etc code and the Fs caps in "need" in reading related 
code will slow the perf. If my understanding is correct, the most of the 
mds filelock's lock states do no allow the Fx/Fs caps to clients, so the 
clients may need to wait a longer time than before.

After checking more about the Locker code, this seems not a perfect 
approach IMO.


> 2) we could rev the protocol, and have the client send along the last
> block to be written along with the SETATTR request. Maybe we even
> consider just adding a new TRUNCATE call independent of SETATTR. The MDS
> would remain in complete control of it at that point.

This approach seems much better, since the last block size will always 
less than or equal to 4K(CEPH_FSCRYPT_BLOCK_SIZE) and the truncate 
should be rare in normal use cases (?), with extra ~4K data in the 
SETATTR should be okay when truncating the file.

So when truncating a file, in kclient it should read that block, which 
needs to do the RMW, first, and then do the truncate locally and encrypt 
it again, and then together with SETATTR request send it to MDS. And the 
MDS will update that block just before truncating the file.

This approach could also keep the fscrypt logic being opaque for the MDS.


>
> The other ideas I've considered seem more complex and don't offer any
> significant advantages that I can see.
>
> [a]: Side question: why does buffering a truncate require Fx and not Fb?
> How do Fx and Fb interact?

For my defer RMW approach we need the Fx caps every time when writing 
the file, and the Fw caps is the 'need' caps for write, while the Fb is 
the 'want' caps. If the Fb caps is not allowed or issued by the MDS, it 
will write-through data to the osd, after that the Fxw could be safely 
released. If the client gets the Fb caps, the client must also hold the 
Fx caps until the buffer has been writen back.

>>> IIRC, Frw don't have the same exclusionary relationship
>>> that (e.g.) Asx has. To exclude Fr, you may need Fb.
>>>
>>> (Side note: these rules really need to be codified into a document
>>> somewhere. I started that with the capabilities doc in the ceph tree,
>>> but it's light on details of the FILE caps)
>> Yeah, that doc is light on the details for now.
>>
>>
>>>> And if after clientB have been issued the Fw caps and have modified that
>>>> block and still not flushed back, a new clientC comes and wants to read
>>>> the file, so the Fw caps must be revoked from clientB and the dirty data
>>>> will be flushed, and then when releasing the Fw caps to the MDS, it will
>>>> update the new fscrypt_file meta together.
>>>>
>>>> I haven't see which case will be racy yet. Or did I miss some cases or
>>>> something important ?
>>>>
>>>>
>>> We also need to consider how legacy, non-fscrypt-capable clients will
>>> interact with files that have this field set. If one comes in and writes
>>> to or truncates one of these files, it's liable to leave a real mess.
>>> The simplest solution may be to just bar them from doing anything with
>>> fscrypted files aside from deleting them -- maybe we'd allow them to
>>> acquire Fr caps but not Fw?
>> For the legacy, non-fscrypt-capable clients, the reading contents should
>> be encrypted any way, so there won't be any problem even if that
>> specified block is not RMWed yet and it should ignore this field.
>>
> Right, and I think we have to allow those clients to request Fr caps so
> that they have the ability to backup and archive encrypted files without
> needing the key. The cephfs-mirror-daemon, in particular, may need this.
>
>> But for write case, I think the MDS should fail it in the open() stage
>> if the mode has any of Write/Truncate, etc, and only Read/Buffer-read,
>> etc are allowed. Or if we change the mds/Locker.cc code by not allowing
>> it to issue the Fw caps to the legacy/non-fscrypt-capable clients, after
>> the file is successfully opened with Write mode, it should be stuck
>> forever when writing data to the file by waiting the Fw caps, which will
>> never come ?
>>
> Yes. Those clients should be barred from making any changes to file
> contents or doing anything that might result in a new dentry being
> attached to an existing inode.
>
> We need to allow them to read files, and unlink them, but that's really
> about it.

Yeah, agree.

BRs


>>>>> Really, it comes down to the fact that truncates are supposed to be an
>>>>> atomic operation, but we need to perform actions in two different
>>>>> places.
>>>>>
>>>>> Hmmm...it's worse than that even -- if the truncate changes it so that
>>>>> the last block is in an entirely different object, then there are 3
>>>>> places that will need to coordinate access.
>>>>>
>>>>> Tricky.
Xiubo Li Sept. 14, 2021, 5:43 a.m. UTC | #12
On 9/13/21 10:05 PM, Jeff Layton wrote:
> On Mon, 2021-09-13 at 13:42 +0800, Xiubo Li wrote:
>> On 9/10/21 7:46 PM, Jeff Layton wrote:
>>> On Fri, 2021-09-10 at 10:30 +0800, Xiubo Li wrote:
>>>> On 9/9/21 8:48 PM, Jeff Layton wrote:
>>>>> On Thu, 2021-09-09 at 11:38 +0800, Xiubo Li wrote:
>>>>>> On 9/8/21 9:57 PM, Jeff Layton wrote:
>>>>>>> On Wed, 2021-09-08 at 17:37 +0800, Xiubo Li wrote:
>>>>>>>> On 9/8/21 12:26 AM, Jeff Layton wrote:
>>>>>>>>> On Fri, 2021-09-03 at 16:15 +0800, xiubli@redhat.com wrote:
>>>>>>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>>>>>>
>>>>>>>>>> When truncating the file, it will leave the MDS to handle that,
>>>>>>>>>> but the MDS won't update the file contents. So when the fscrypt
>>>>>>>>>> is enabled, if the truncate size is not aligned to the block size,
>>>>>>>>>> the kclient will round up the truancate size to the block size and
>>>>>>>>>> leave the the last block untouched.
>>>>>>>>>>
>>>>>>>>>> The opaque fscrypt_file field will be used to tricker whether the
>>>>>>>>>> last block need to do the rmw to truncate the a specified block's
>>>>>>>>>> contents, we can get which block needs to do the rmw by round down
>>>>>>>>>> the fscrypt_file.
>>>>>>>>>>
>>>>>>>>>> In kclient side, there is not need to do the rmw immediately after
>>>>>>>>>> the file is truncated. We can defer doing that whenever the kclient
>>>>>>>>>> will update that block in late future. And before that any kclient
>>>>>>>>>> will check the fscrypt_file field when reading that block, if the
>>>>>>>>>> fscrypt_file field is none zero that means the related block needs
>>>>>>>>>> to be zeroed in range of [fscrypt_file, round_up(fscrypt_file + PAGE_SIZE))
>>>>>>>>>> in pagecache or readed data buffer.
>>>>>>>>>>
>>>>>>>>> s/PAGE_SIZE/CEPH_FSCRYPT_BLOCK_SIZE/
>>>>>>>>>
>>>>>>>>> Yes, on x86_64 they are equal, but that's not the case on all arches.
>>>>>>>>> Also, we are moving toward a pagecache that may hold larger pages on
>>>>>>>>> x86_64 too.
>>>>>>>>>       
>>>>>>>> Okay.
>>>>>>>>>> Once the that block contents are updated and writeback
>>>>>>>>>> kclient will reset the fscrypt_file field in MDS side, then 0 means
>>>>>>>>>> no need to care about the truncate stuff any more.
>>>>>>>>>>
>>>>>>>>> I'm a little unclear on what the fscrypt_file actually represents here.
>>>>>>>>>
>>>>>>>>> I had proposed that we just make the fscrypt_file field hold the
>>>>>>>>> "actual" i_size and we'd make the old size field always be a rounded-
>>>>>>>>> up version of the size. The MDS would treat that as an opaque value
>>>>>>>>> under Fw caps, and the client could use that field to determine i_size.
>>>>>>>>> That has a side benefit too -- if the client doesn't support fscrypt,
>>>>>>>>> it'll see the rounded-up sizes which are close enough and don't violate
>>>>>>>>> any POSIX rules.
>>>>>>>>>
>>>>>>>>> In your version, fscrypt_file also holds the actual size of the inode,
>>>>>>>>> but sometimes you're zeroing it out, and I don't understand why:
>>>>>>>> I think I forgot to fix this after I adapt to multiple ftruncates case,
>>>>>>>> this patch is not correctly handling the "actual" file size.
>>>>>>>>
>>>>>>>> I just want the fscrypt_file field always to hold the offset from which
>>>>>>>> the contents needed to be zeroed, and the range should be [fscrypt_file,
>>>>>>>> round_up(fscrypt_file +CEPH_FSCRYPT_BLOCK_SIZE)).
>>>>>>>>
>>>>>>>> In single ftruncate case the fscrypt_file should equal to the "actual"
>>>>>>>> file size. Then the "req->r_args.setattr.size = attr->ia_size" and
>>>>>>>> "req->r_args.setattr.old_size = isize", no need to round up in
>>>>>>>> __ceph_setattr() in kclient side, and leave the MDS to do that, but we
>>>>>>>> need to pass the CEPH_FSCRYPT_BLOCK_SIZE at the same time.
>>>>>>>>
>>>>>>> I'm really not a fan of pushing this logic into the MDS. Why does it
>>>>>>> need to know anything about the CEPH_FSCRYPT_BLOCK_SIZE at all?
>>>>>>     From your current patch set, you are rounding up the
>>>>>> "req->r_args.setattr.size" and "req->r_args.setattr.old_size" to the
>>>>>> BLOCK end in __ceph_setattr().
>>>>>>
>>>>>> Without considering keep the file scryption logic in kclient only, I
>>>>>> need the "req->r_args.setattr.size" to keep the file's real size.
>>>>>>
>>>>>> Since the MDS will do the truncate stuff. And if we won't round the
>>>>>> "req->r_args.setattr.size" up to the BLOCK end any more, then the MDS
>>>>>> needs to know whether and how to round up the file size to the block end
>>>>>> when truncating the file. Because the fscrypt_file won't record the
>>>>>> file's real size any more, it maybe zero, more detail please see the
>>>>>> example below.
>>>>>>
>>>>>> Yeah, but as you mentioned bellow if we will keep the file scryption
>>>>>> logic in kclient only, I need one extra field to do the defer rmw:
>>>>>>
>>>>>> struct fscrypt_file {
>>>>>>
>>>>>>         u64 file_real_size;         // always keep the file's real size and
>>>>>> the "req->r_args.setattr.size = round_up(file_real_size, BLOCK_SIZE)" as
>>>>>> you do in your current patch set.
>>>>>>
>>>>>>         u64 file_truncate_offset;  // this will always record in which
>>>>>> BLOCK we need to do the rmw, this maybe 0 or located in the file's LAST
>>>>>> block and maybe not, more detail please the example below.
>>>>>>
>>>>>> }
>>>>>>
>>>>>> The "file_truncate_offset" member will be what I need to do the defer rmw.
>>>>>>
>>>>>>
>>>>>>>> But in multiple ftruncates case if the second ftruncate has a larger
>>>>>>>> size, the fscrypt_file won't be changed and the ceph backend will help
>>>>>>>> zero the extended part, but we still need to zero the contents in the
>>>>>>>> first ftruncate. If the second ftruncate has a smaller size, the
>>>>>>>> fscrypt_file need to be updated and always keeps the smaller size.
>>>>>>>>
>>>>>>> I don't get it. Maybe you can walk me through a concrete example of how
>>>>>>> racing ftruncates are a problem?
>>>>>> Sorry for confusing.
>>>>>>
>>>>>> For example:
>>>>>>
>>>>>> 1), if there has a file named "bar", and currently the size is 100K
>>>>>> bytes, the CEPH_FSCRYPT_BLOCK_SIZE size equals to 4K.
>>>>>>
>>>>>> 2), first ftruncate("bar", 7K) comes, then both the file real size and
>>>>>> "file_truncate_offset" will be set to 7K, then in MDS it will truncate
>>>>>> the file from 8K, and the last block's [7K, 8K) contents need to be
>>>>>> zeroed anyway later.
>>>>>>
>>>>>> 3), immediately a second ftruncate("bar", 16K) comes, from the ftruncate
>>>>>> man page it says the new extended [7K, 16K) should be zeroed when
>>>>>> truncating the file. That means the OSD should help zero the [8K, 16K),
>>>>>> but won't touch the block [4K, 8K), in which the [7K, 8K) contents still
>>>>>> needs to be zeroed. So in this case the "file_truncate_offset" won't be
>>>>>> changed and still be 7K. Then the "file_truncate_offset" won't be
>>>>>> located in the last block of the file any more.
>>>>>>
>>>>> Woah, why didn't the file_truncate_offset get changed when you truncated
>>>>> up to 16k?
>>>>>
>>>>> When you issue the truncate SETATTR to the MDS, you're changing the size
>>>>> field in the inode. The fscrypt_file field _must_ be updated at the same
>>>>> time. I think that means that we need to extend SETATTR to also update
>>>>> fscrypt_file.
>>>>>
>>>>>> 4), if the second ftruncate in step 3) the new size is 3K, then the MDS
>>>>>> will truncate the file from 4K and [7K, 8K) contents will be discard
>>>>>> anyway, so we need to update the "file_truncate_offset" to 3K, that
>>>>>> means a new BLOCK [0K, 4K) needs to do the rmw, by zeroing the [3K, 4K).
>>>>>>
>>>>>> 5), if the new truncate in step 3) the new size is 4K, since the 4K < 7K
>>>>>> and 4K is aligned to the BLOCK size, so no need to rmw any block any
>>>>>> more, then we can just clear the "file_truncate_offset" field.
>>>>>>
>>>>>>
>>>>>> For defer RMW logic please see the following example:
>>>>>>
>>>>>>
>>>>> Ok, thanks. I think I understand now how racing truncates are a problem.
>>>>> Really, it comes down to the fact that we don't have a good way to
>>>>> mediate the last-block RMW operation when competing clients are issuing
>>>>> truncates.
>>>>>
>>>>> I'm not sure adding this file_truncate_offset field really does all that
>>>>> much good though. You don't really have a way to ensure that a
>>>>> truncating client will see the changes to that field before it issues
>>>>> its RMW op.
>>>>>
>>>>>>> Suppose we have a file and client1 truncates it down from a large size
>>>>>>> to 7k. client1 then sends the MDS a SETATTR to truncate it at 8k, and
>>>>>>> does a RMW on the last (4k) block. client2 comes along at the same time
>>>>>>> and truncates it up to 13k. client2 issues a SETATTR to extend the file
>>>>>>> to 16k and does a RMW on the last block too (which would presumably
>>>>>>> already be all zeroes anyway).
>>>>>> I think you meant in the none defer RMW case, this is not what my defer
>>>>>> RMW approach will do.
>>>>>>
>>>>>> After the client1 truncated the file, it won't do the RMW if it won't
>>>>>> write any data to that file, and then we assume the client1 is unmounted
>>>>>> immediately.
>>>>>>
>>>>>> And when the client1 is truncating the file, it will update the
>>>>>> "file_truncate_offset" to 7K, which means the [7K, 8K) in the LAST block
>>>>>> needs to be zeroed.
>>>>>>
>>>>>> Then the client2 reads that the file size is 7K and the
>>>>>> "file_truncate_offset" is 7K too, and the client2 wants to truncate the
>>>>>> file up to 13K. Since the OSD should help us zero the extended part [8K,
>>>>>> 13K] when truncating, but won't touch the block [4K, 8K), for which it
>>>>>> still needs to do the RMW. Then the client2 is unmounted too before
>>>>>> writing any data to the file. After this the "file_truncate_offset"
>>>>>> won't be located in the file's LAST block any more.
>>>>>>
>>>>>> After that if the client3 will update the whole file's contents, it will
>>>>>> read all the file 13K bytes contents to local page buffers, since the
>>>>>> "file_truncate_offset" is 7K and then in the page buffer the range [7K,
>>>>>> 8K) will be zeroed just after the contents are dencrypted inplace. Then
>>>>>> if the client3 successfully flushes that dirty data back and then the
>>>>>> deferred RMW for block [4K, 8K) should be done at the same time, and the
>>>>>> "file_truncate_offset" should be cleared too.
>>>>>>
>>>>>> While if the client3 won't update the block [4K, 8K), the
>>>>>> "file_truncate_offset" will be kept all the time until the above RMW is
>>>>>> done in future.
>>>>>>
>>>>>>
>>>>> Ok, I think I finally understand what you're saying.
>>>>>
>>>>> You want to rely on the next client to do a write to handle the zeroing
>>>>> at the end. You basically just want to keep track of whether and where
>>>>> it should zero up to the end of the next crypto block, and defer that
>>>>> until a client is writing.
>>>> Yeah, the writing could also be in the same client just after the file
>>>> is truncated.
>>>>
>>>>
>>>>> I'll have to think about whether that's still racy. Part of the problem
>>>>> is that once the client doesn't have caps, it doesn't have a way to
>>>>> ensure that fscrypt_file (whatever it holds) doesn't change while it's
>>>>> doing that zeroing.
>>>> As my understanding, if clientA want to read a file, it will open it
>>>> with RD mode, then it will get the fscrypt_file meta and "Fr" caps, then
>>>> it can safely read the file contents and do the zeroing for that block.
>>> Ok, so in this case, the client is just zeroing out the appropriate
>>> region in the resulting read data, and not on the OSD. That should be
>>> ok.
>>>
>>>> Then the opened file shouldn't worry whether the fscrypt_file will be
>>>> changed by others during it still holding the Fr caps, because if the
>>>> clientB wants to update the fscrypt_file it must acquire the Fw caps
>>>> first, before that the Fr caps must be revoked from clientA and at the
>>>> same time the read pagecache in clientA will be invalidated.
>>>>
>>> Are you certain that Fw caps is enough to ensure that no other client
>>> holds Fr caps?
>> I spent hours and went through the mds Locker related code on the weekends.
>>
> Ouch. ;)
>
>>   From the mds/lock.cc code, for mds filelock for example in the LOCK_MIX
>> state and some interim transition states to LOCK_MIX it will allow
>> different clients could hold any of Fw or Fr caps. But the Fb/Fc will be
>> disabled. Checked the mds/Locker.cc code, found that the mds filelock
>> could to switch LOCK_MIX state in some cases when there has one client
>> wants Fw and another client wants any of Fr and Fw.
>>
>> In this case I think the Linux advisory or mandatory locks are necessary
>> to keep the file contents concurrency. In multiple processes' concurrent
>> read/write or write/write cases without the Linux advisory/mandatory
>> locks the file contents' concurrency won't be guaranteed, so the logic
>> is the same here ?
>>
> You mean like advisory fcntl or flock locks? Or something else? We can't
> assume that applications will be coded to use locking, so this would
> mean doing some sort of advisory/mandatory locking between the client
> and the MDS itself.
>
> Handling file locking over a distributed system is _really_ difficult.
> You have to deal with all of the cases where the client or MDS could
> spontaneously crash, and deal with them.
>
> I don't think we can rely on any sort of exclusion or locking between
> the clients for this. We'll need a way to do this that doesn't rely on
> it.

Yeah, we shouldn't rely on them.


>> If so, couldn't we just assume the Fw vs Fw and Fr vs Fw caps should be
>> exclusive in correct use case ? For example, just after the mds filelock
>> state switches to LOCK_MIX, if clientA gets the advisory file lock and
>> the Fw caps, and even another clientB could be successfully issued the
>> Fr caps, the clientB won't do any read because it should be still stuck
>> and be waiting for the advisory file lock.
>>
>>
> Changing the semantics of how Frw behave is probably a non-starter, but
> maybe there is some way to gate that on the clients being fscrypt-
> enabled.
It is.
>
>>> IIRC, Frw don't have the same exclusionary relationship
>>> that (e.g.) Asx has. To exclude Fr, you may need Fb.
>>>
>>> (Side note: these rules really need to be codified into a document
>>> somewhere. I started that with the capabilities doc in the ceph tree,
>>> but it's light on details of the FILE caps)
>> Yeah, that doc is light on the details for now.
>>
>>
>>>> And if after clientB have been issued the Fw caps and have modified that
>>>> block and still not flushed back, a new clientC comes and wants to read
>>>> the file, so the Fw caps must be revoked from clientB and the dirty data
>>>> will be flushed, and then when releasing the Fw caps to the MDS, it will
>>>> update the new fscrypt_file meta together.
>>>>
>>>> I haven't see which case will be racy yet. Or did I miss some cases or
>>>> something important ?
>>>>
>>>>
>>> We also need to consider how legacy, non-fscrypt-capable clients will
>>> interact with files that have this field set. If one comes in and writes
>>> to or truncates one of these files, it's liable to leave a real mess.
>>> The simplest solution may be to just bar them from doing anything with
>>> fscrypted files aside from deleting them -- maybe we'd allow them to
>>> acquire Fr caps but not Fw?
>> For the legacy, non-fscrypt-capable clients, the reading contents should
>> be encrypted any way, so there won't be any problem even if that
>> specified block is not RMWed yet and it should ignore this field.
>>
>> But for write case, I think the MDS should fail it in the open() stage
>> if the mode has any of Write/Truncate, etc, and only Read/Buffer-read,
>> etc are allowed. Or if we change the mds/Locker.cc code by not allowing
>> it to issue the Fw caps to the legacy/non-fscrypt-capable clients, after
>> the file is successfully opened with Write mode, it should be stuck
>> forever when writing data to the file by waiting the Fw caps, which will
>> never come ?
>>
> Yeah. I think we're probably going to need the MDS to do something to
> keep non-fscrypt enabled clients away from encrypted file contents.
> Making them fail at open time is probably the right thing to do.
>
> That said, I'm not sure that really helps too much for solving the
> truncation problems.

I think we need to update your MDS PR to fix this.


>> Thanks
>>
>>>>> Really, it comes down to the fact that truncates are supposed to be an
>>>>> atomic operation, but we need to perform actions in two different
>>>>> places.
>>>>>
>>>>> Hmmm...it's worse than that even -- if the truncate changes it so that
>>>>> the last block is in an entirely different object, then there are 3
>>>>> places that will need to coordinate access.
>>>>>
>>>>> Tricky.
Jeff Layton Sept. 14, 2021, 2:24 p.m. UTC | #13
On Tue, 2021-09-14 at 13:40 +0800, Xiubo Li wrote:
> On 9/14/21 3:34 AM, Jeff Layton wrote:
> 
> [...]
> 
> > I'll have to think about whether that's still racy. Part of the problem
> > > > > > is that once the client doesn't have caps, it doesn't have a way to
> > > > > > ensure that fscrypt_file (whatever it holds) doesn't change while it's
> > > > > > doing that zeroing.
> > > > > As my understanding, if clientA want to read a file, it will open it
> > > > > with RD mode, then it will get the fscrypt_file meta and "Fr" caps, then
> > > > > it can safely read the file contents and do the zeroing for that block.
> > > > Ok, so in this case, the client is just zeroing out the appropriate
> > > > region in the resulting read data, and not on the OSD. That should be
> > > > ok.
> > > > 
> > > > > Then the opened file shouldn't worry whether the fscrypt_file will be
> > > > > changed by others during it still holding the Fr caps, because if the
> > > > > clientB wants to update the fscrypt_file it must acquire the Fw caps
> > > > > first, before that the Fr caps must be revoked from clientA and at the
> > > > > same time the read pagecache in clientA will be invalidated.
> > > > > 
> > > > Are you certain that Fw caps is enough to ensure that no other client
> > > > holds Fr caps?
> > > I spent hours and went through the mds Locker related code on the weekends.
> > > 
> > >   From the mds/lock.cc code, for mds filelock for example in the LOCK_MIX
> > > state and some interim transition states to LOCK_MIX it will allow
> > > different clients could hold any of Fw or Fr caps. But the Fb/Fc will be
> > > disabled. Checked the mds/Locker.cc code, found that the mds filelock
> > > could to switch LOCK_MIX state in some cases when there has one client
> > > wants Fw and another client wants any of Fr and Fw.
> > > 
> > > In this case I think the Linux advisory or mandatory locks are necessary
> > > to keep the file contents concurrency. In multiple processes' concurrent
> > > read/write or write/write cases without the Linux advisory/mandatory
> > > locks the file contents' concurrency won't be guaranteed, so the logic
> > > is the same here ?
> > > 
> > > If so, couldn't we just assume the Fw vs Fw and Fr vs Fw caps should be
> > > exclusive in correct use case ? For example, just after the mds filelock
> > > state switches to LOCK_MIX, if clientA gets the advisory file lock and
> > > the Fw caps, and even another clientB could be successfully issued the
> > > Fr caps, the clientB won't do any read because it should be still stuck
> > > and be waiting for the advisory file lock.
> > > 
> > I'm not sure I like that idea. Basically, that would change the meaning
> > of the what Frw caps represent, in a way that is not really consistent
> > with how they have been used before.
> > 
> > We could gate that new behavior on the new feature flags, but it sounds
> > pretty tough.
> > 
> > I think we have a couple of options:
> > 
> > 1) we could just make the clients request and wait on Fx caps when they
> > do a truncate. They might stall for a bit if there is contention, but it
> > would ensure consistency and the client could be completely in charge of
> > the truncate. [a]
> 
> Yeah, for my defer RMW approach we need to held the Fx caps every time 
> when writing/truncating files, and the Fs caps every time when reading.
> 
> While currently almost all the read/write code have ignored them because 
> read/write do not need them in most cases.
> 

Note that we already cache truncate operations when we have Fx. See
__ceph_setattr. Do we even need to change the read path here, or is the
existing code just wrong?

This is info I've been trying to get a handle on since I started working
on cephfs. The rules for FILE caps are still extremely unclear.

> I am not sure if we add the Fx caps to the 'need' in 
> write/truncating,etc code and the Fs caps in "need" in reading related 
> code will slow the perf. If my understanding is correct, the most of the 
> mds filelock's lock states do no allow the Fx/Fs caps to clients, so the 
> clients may need to wait a longer time than before.
> 

Maybe. If we start handing out Fs as a matter of course along with Fr,
it may not be _too_ bad, but I think we want to avoid relying on Fx in
the write path.

> After checking more about the Locker code, this seems not a perfect 
> approach IMO.
> 

That's fine, I trust your judgment here.

> 
> 
> > 2) we could rev the protocol, and have the client send along the last
> > block to be written along with the SETATTR request. Maybe we even
> > consider just adding a new TRUNCATE call independent of SETATTR. The MDS
> > would remain in complete control of it at that point.
> 
> This approach seems much better, since the last block size will always 
> less than or equal to 4K(CEPH_FSCRYPT_BLOCK_SIZE) and the truncate 
> should be rare in normal use cases (?), with extra ~4K data in the 
> SETATTR should be okay when truncating the file.
> 
> So when truncating a file, in kclient it should read that block, which 
> needs to do the RMW, first, and then do the truncate locally and encrypt 
> it again, and then together with SETATTR request send it to MDS. And the 
> MDS will update that block just before truncating the file.
> 
> This approach could also keep the fscrypt logic being opaque for the MDS.
> 
> 

Note that you'll need to guard against races on the RMW. For instance,
another client could issue a write to that last block just after we do
the read for the rmw.

If you do this, then you'll probably need to send along the object
version that you got from the read and have the MDS validate that before
it does the truncate and writes out the updated crypto block.

If something changed in the interim, the MDS can just return -EAGAIN or
whatever to the client and it can re-do the whole thing. It's a mess,
but it should be consistent.

I think we probably want a new call for this too instead of overloading
SETATTR (which is already extremely messy).

> > 
> > The other ideas I've considered seem more complex and don't offer any
> > significant advantages that I can see.
> > 
> > [a]: Side question: why does buffering a truncate require Fx and not Fb?
> > How do Fx and Fb interact?
> 
> For my defer RMW approach we need the Fx caps every time when writing 
> the file, and the Fw caps is the 'need' caps for write, while the Fb is 
> the 'want' caps. If the Fb caps is not allowed or issued by the MDS, it 
> will write-through data to the osd, after that the Fxw could be safely 
> released. If the client gets the Fb caps, the client must also hold the 
> Fx caps until the buffer has been writen back.
> 

The problem there is that will effectively serialize all writers of a
file -- even ones that are writing to completely different backend
objects.

That will almost certainly regress performance. I think we want to try
to avoid that. I'd rather that truncate be extremely slow and expensive
than slow down writes.

> > > > IIRC, Frw don't have the same exclusionary relationship
> > > > that (e.g.) Asx has. To exclude Fr, you may need Fb.
> > > > 
> > > > (Side note: these rules really need to be codified into a document
> > > > somewhere. I started that with the capabilities doc in the ceph tree,
> > > > but it's light on details of the FILE caps)
> > > Yeah, that doc is light on the details for now.
> > > 
> > > 
> > > > > And if after clientB have been issued the Fw caps and have modified that
> > > > > block and still not flushed back, a new clientC comes and wants to read
> > > > > the file, so the Fw caps must be revoked from clientB and the dirty data
> > > > > will be flushed, and then when releasing the Fw caps to the MDS, it will
> > > > > update the new fscrypt_file meta together.
> > > > > 
> > > > > I haven't see which case will be racy yet. Or did I miss some cases or
> > > > > something important ?
> > > > > 
> > > > > 
> > > > We also need to consider how legacy, non-fscrypt-capable clients will
> > > > interact with files that have this field set. If one comes in and writes
> > > > to or truncates one of these files, it's liable to leave a real mess.
> > > > The simplest solution may be to just bar them from doing anything with
> > > > fscrypted files aside from deleting them -- maybe we'd allow them to
> > > > acquire Fr caps but not Fw?
> > > For the legacy, non-fscrypt-capable clients, the reading contents should
> > > be encrypted any way, so there won't be any problem even if that
> > > specified block is not RMWed yet and it should ignore this field.
> > > 
> > Right, and I think we have to allow those clients to request Fr caps so
> > that they have the ability to backup and archive encrypted files without
> > needing the key. The cephfs-mirror-daemon, in particular, may need this.
> > 
> > > But for write case, I think the MDS should fail it in the open() stage
> > > if the mode has any of Write/Truncate, etc, and only Read/Buffer-read,
> > > etc are allowed. Or if we change the mds/Locker.cc code by not allowing
> > > it to issue the Fw caps to the legacy/non-fscrypt-capable clients, after
> > > the file is successfully opened with Write mode, it should be stuck
> > > forever when writing data to the file by waiting the Fw caps, which will
> > > never come ?
> > > 
> > Yes. Those clients should be barred from making any changes to file
> > contents or doing anything that might result in a new dentry being
> > attached to an existing inode.
> > 
> > We need to allow them to read files, and unlink them, but that's really
> > about it.
> 
> Yeah, agree.
> 
> BRs
> 
> 
> > > > > > Really, it comes down to the fact that truncates are supposed to be an
> > > > > > atomic operation, but we need to perform actions in two different
> > > > > > places.
> > > > > > 
> > > > > > Hmmm...it's worse than that even -- if the truncate changes it so that
> > > > > > the last block is in an entirely different object, then there are 3
> > > > > > places that will need to coordinate access.
> > > > > > 
> > > > > > Tricky.
>
Xiubo Li Sept. 16, 2021, 10:02 a.m. UTC | #14
On 9/14/21 10:24 PM, Jeff Layton wrote:
> On Tue, 2021-09-14 at 13:40 +0800, Xiubo Li wrote:
>> On 9/14/21 3:34 AM, Jeff Layton wrote:
>>
>> [...]
>>
>>> I'll have to think about whether that's still racy. Part of the problem
>>>>>>> is that once the client doesn't have caps, it doesn't have a way to
>>>>>>> ensure that fscrypt_file (whatever it holds) doesn't change while it's
>>>>>>> doing that zeroing.
>>>>>> As my understanding, if clientA want to read a file, it will open it
>>>>>> with RD mode, then it will get the fscrypt_file meta and "Fr" caps, then
>>>>>> it can safely read the file contents and do the zeroing for that block.
>>>>> Ok, so in this case, the client is just zeroing out the appropriate
>>>>> region in the resulting read data, and not on the OSD. That should be
>>>>> ok.
>>>>>
>>>>>> Then the opened file shouldn't worry whether the fscrypt_file will be
>>>>>> changed by others during it still holding the Fr caps, because if the
>>>>>> clientB wants to update the fscrypt_file it must acquire the Fw caps
>>>>>> first, before that the Fr caps must be revoked from clientA and at the
>>>>>> same time the read pagecache in clientA will be invalidated.
>>>>>>
>>>>> Are you certain that Fw caps is enough to ensure that no other client
>>>>> holds Fr caps?
>>>> I spent hours and went through the mds Locker related code on the weekends.
>>>>
>>>>    From the mds/lock.cc code, for mds filelock for example in the LOCK_MIX
>>>> state and some interim transition states to LOCK_MIX it will allow
>>>> different clients could hold any of Fw or Fr caps. But the Fb/Fc will be
>>>> disabled. Checked the mds/Locker.cc code, found that the mds filelock
>>>> could to switch LOCK_MIX state in some cases when there has one client
>>>> wants Fw and another client wants any of Fr and Fw.
>>>>
>>>> In this case I think the Linux advisory or mandatory locks are necessary
>>>> to keep the file contents concurrency. In multiple processes' concurrent
>>>> read/write or write/write cases without the Linux advisory/mandatory
>>>> locks the file contents' concurrency won't be guaranteed, so the logic
>>>> is the same here ?
>>>>
>>>> If so, couldn't we just assume the Fw vs Fw and Fr vs Fw caps should be
>>>> exclusive in correct use case ? For example, just after the mds filelock
>>>> state switches to LOCK_MIX, if clientA gets the advisory file lock and
>>>> the Fw caps, and even another clientB could be successfully issued the
>>>> Fr caps, the clientB won't do any read because it should be still stuck
>>>> and be waiting for the advisory file lock.
>>>>
>>> I'm not sure I like that idea. Basically, that would change the meaning
>>> of the what Frw caps represent, in a way that is not really consistent
>>> with how they have been used before.
>>>
>>> We could gate that new behavior on the new feature flags, but it sounds
>>> pretty tough.
>>>
>>> I think we have a couple of options:
>>>
>>> 1) we could just make the clients request and wait on Fx caps when they
>>> do a truncate. They might stall for a bit if there is contention, but it
>>> would ensure consistency and the client could be completely in charge of
>>> the truncate. [a]
>> Yeah, for my defer RMW approach we need to held the Fx caps every time
>> when writing/truncating files, and the Fs caps every time when reading.
>>
>> While currently almost all the read/write code have ignored them because
>> read/write do not need them in most cases.
>>
> Note that we already cache truncate operations when we have Fx.

Yeah, only when we have Fx and the attr.ia_size > isize will it cache 
the truncate operation.

For this I am a little curious why couldn't we cache truncate operations 
when attr.ia_size >= isize ?


>   See
> __ceph_setattr. Do we even need to change the read path here, or is the
> existing code just wrong?
>
> This is info I've been trying to get a handle on since I started working
> on cephfs. The rules for FILE caps are still extremely unclear.

I am still check this logic from MDS side. Still not very clear.


[...]
>>
>>> 2) we could rev the protocol, and have the client send along the last
>>> block to be written along with the SETATTR request. Maybe we even
>>> consider just adding a new TRUNCATE call independent of SETATTR. The MDS
>>> would remain in complete control of it at that point.
>> This approach seems much better, since the last block size will always
>> less than or equal to 4K(CEPH_FSCRYPT_BLOCK_SIZE) and the truncate
>> should be rare in normal use cases (?), with extra ~4K data in the
>> SETATTR should be okay when truncating the file.
>>
>> So when truncating a file, in kclient it should read that block, which
>> needs to do the RMW, first, and then do the truncate locally and encrypt
>> it again, and then together with SETATTR request send it to MDS. And the
>> MDS will update that block just before truncating the file.
>>
>> This approach could also keep the fscrypt logic being opaque for the MDS.
>>
>>
> Note that you'll need to guard against races on the RMW. For instance,
> another client could issue a write to that last block just after we do
> the read for the rmw.
>
> If you do this, then you'll probably need to send along the object
> version that you got from the read and have the MDS validate that before
> it does the truncate and writes out the updated crypto block.
>
> If something changed in the interim, the MDS can just return -EAGAIN or
> whatever to the client and it can re-do the whole thing. It's a mess,
> but it should be consistent.
>
> I think we probably want a new call for this too instead of overloading
> SETATTR (which is already extremely messy).

Okay, I will check this more carefully.


>>> The other ideas I've considered seem more complex and don't offer any
>>> significant advantages that I can see.
>>>
>>> [a]: Side question: why does buffering a truncate require Fx and not Fb?
>>> How do Fx and Fb interact?
>> For my defer RMW approach we need the Fx caps every time when writing
>> the file, and the Fw caps is the 'need' caps for write, while the Fb is
>> the 'want' caps. If the Fb caps is not allowed or issued by the MDS, it
>> will write-through data to the osd, after that the Fxw could be safely
>> released. If the client gets the Fb caps, the client must also hold the
>> Fx caps until the buffer has been writen back.
>>
> The problem there is that will effectively serialize all writers of a
> file -- even ones that are writing to completely different backend
> objects.
>
> That will almost certainly regress performance. I think we want to try
> to avoid that. I'd rather that truncate be extremely slow and expensive
> than slow down writes.

Agree.

Thanks.
Jeff Layton Sept. 17, 2021, 5:19 p.m. UTC | #15
On Thu, 2021-09-16 at 18:02 +0800, Xiubo Li wrote:
> On 9/14/21 10:24 PM, Jeff Layton wrote:
> > On Tue, 2021-09-14 at 13:40 +0800, Xiubo Li wrote:
> > > On 9/14/21 3:34 AM, Jeff Layton wrote:
> > > 
> > > [...]
> > > 
> > > > I'll have to think about whether that's still racy. Part of the problem
> > > > > > > > is that once the client doesn't have caps, it doesn't have a way to
> > > > > > > > ensure that fscrypt_file (whatever it holds) doesn't change while it's
> > > > > > > > doing that zeroing.
> > > > > > > As my understanding, if clientA want to read a file, it will open it
> > > > > > > with RD mode, then it will get the fscrypt_file meta and "Fr" caps, then
> > > > > > > it can safely read the file contents and do the zeroing for that block.
> > > > > > Ok, so in this case, the client is just zeroing out the appropriate
> > > > > > region in the resulting read data, and not on the OSD. That should be
> > > > > > ok.
> > > > > > 
> > > > > > > Then the opened file shouldn't worry whether the fscrypt_file will be
> > > > > > > changed by others during it still holding the Fr caps, because if the
> > > > > > > clientB wants to update the fscrypt_file it must acquire the Fw caps
> > > > > > > first, before that the Fr caps must be revoked from clientA and at the
> > > > > > > same time the read pagecache in clientA will be invalidated.
> > > > > > > 
> > > > > > Are you certain that Fw caps is enough to ensure that no other client
> > > > > > holds Fr caps?
> > > > > I spent hours and went through the mds Locker related code on the weekends.
> > > > > 
> > > > >    From the mds/lock.cc code, for mds filelock for example in the LOCK_MIX
> > > > > state and some interim transition states to LOCK_MIX it will allow
> > > > > different clients could hold any of Fw or Fr caps. But the Fb/Fc will be
> > > > > disabled. Checked the mds/Locker.cc code, found that the mds filelock
> > > > > could to switch LOCK_MIX state in some cases when there has one client
> > > > > wants Fw and another client wants any of Fr and Fw.
> > > > > 
> > > > > In this case I think the Linux advisory or mandatory locks are necessary
> > > > > to keep the file contents concurrency. In multiple processes' concurrent
> > > > > read/write or write/write cases without the Linux advisory/mandatory
> > > > > locks the file contents' concurrency won't be guaranteed, so the logic
> > > > > is the same here ?
> > > > > 
> > > > > If so, couldn't we just assume the Fw vs Fw and Fr vs Fw caps should be
> > > > > exclusive in correct use case ? For example, just after the mds filelock
> > > > > state switches to LOCK_MIX, if clientA gets the advisory file lock and
> > > > > the Fw caps, and even another clientB could be successfully issued the
> > > > > Fr caps, the clientB won't do any read because it should be still stuck
> > > > > and be waiting for the advisory file lock.
> > > > > 
> > > > I'm not sure I like that idea. Basically, that would change the meaning
> > > > of the what Frw caps represent, in a way that is not really consistent
> > > > with how they have been used before.
> > > > 
> > > > We could gate that new behavior on the new feature flags, but it sounds
> > > > pretty tough.
> > > > 
> > > > I think we have a couple of options:
> > > > 
> > > > 1) we could just make the clients request and wait on Fx caps when they
> > > > do a truncate. They might stall for a bit if there is contention, but it
> > > > would ensure consistency and the client could be completely in charge of
> > > > the truncate. [a]
> > > Yeah, for my defer RMW approach we need to held the Fx caps every time
> > > when writing/truncating files, and the Fs caps every time when reading.
> > > 
> > > While currently almost all the read/write code have ignored them because
> > > read/write do not need them in most cases.
> > > 
> > Note that we already cache truncate operations when we have Fx.
> 
> Yeah, only when we have Fx and the attr.ia_size > isize will it cache 
> the truncate operation.
> 
> For this I am a little curious why couldn't we cache truncate operations 
> when attr.ia_size >= isize ?
> 

It seems to me that if attr.ia_size == i_size, then there is nothing to
change. We can just optimize that case away, assuming the client has
caps that ensure the size is correct.

Based on the kclient code, for size changes after a write that extends a
file, it seems like Fw is sufficient to allow the client to buffer these
things. For a truncate (aka setattr) operation, we apparently need Fx.

It sort of makes sense, but the more I look over the existing code, the
less sure I am about how this is intended to work. I think before we
make any changes for fscrypt, we need to make sure we understand what's
there now.

> 
> >   See
> > __ceph_setattr. Do we even need to change the read path here, or is the
> > existing code just wrong?
> > 
> > This is info I've been trying to get a handle on since I started working
> > on cephfs. The rules for FILE caps are still extremely unclear.
> 
> I am still check this logic from MDS side. Still not very clear.
> 
> 
> [...]
> > > 
> > > > 2) we could rev the protocol, and have the client send along the last
> > > > block to be written along with the SETATTR request. Maybe we even
> > > > consider just adding a new TRUNCATE call independent of SETATTR. The MDS
> > > > would remain in complete control of it at that point.
> > > This approach seems much better, since the last block size will always
> > > less than or equal to 4K(CEPH_FSCRYPT_BLOCK_SIZE) and the truncate
> > > should be rare in normal use cases (?), with extra ~4K data in the
> > > SETATTR should be okay when truncating the file.
> > > 
> > > So when truncating a file, in kclient it should read that block, which
> > > needs to do the RMW, first, and then do the truncate locally and encrypt
> > > it again, and then together with SETATTR request send it to MDS. And the
> > > MDS will update that block just before truncating the file.
> > > 
> > > This approach could also keep the fscrypt logic being opaque for the MDS.
> > > 
> > > 
> > Note that you'll need to guard against races on the RMW. For instance,
> > another client could issue a write to that last block just after we do
> > the read for the rmw.
> > 
> > If you do this, then you'll probably need to send along the object
> > version that you got from the read and have the MDS validate that before
> > it does the truncate and writes out the updated crypto block.
> > 
> > If something changed in the interim, the MDS can just return -EAGAIN or
> > whatever to the client and it can re-do the whole thing. It's a mess,
> > but it should be consistent.
> > 
> > I think we probably want a new call for this too instead of overloading
> > SETATTR (which is already extremely messy).
> 
> Okay, I will check this more carefully.
> 
> 
> > > > The other ideas I've considered seem more complex and don't offer any
> > > > significant advantages that I can see.
> > > > 
> > > > [a]: Side question: why does buffering a truncate require Fx and not Fb?
> > > > How do Fx and Fb interact?
> > > For my defer RMW approach we need the Fx caps every time when writing
> > > the file, and the Fw caps is the 'need' caps for write, while the Fb is
> > > the 'want' caps. If the Fb caps is not allowed or issued by the MDS, it
> > > will write-through data to the osd, after that the Fxw could be safely
> > > released. If the client gets the Fb caps, the client must also hold the
> > > Fx caps until the buffer has been writen back.
> > > 
> > The problem there is that will effectively serialize all writers of a
> > file -- even ones that are writing to completely different backend
> > objects.
> > 
> > That will almost certainly regress performance. I think we want to try
> > to avoid that. I'd rather that truncate be extremely slow and expensive
> > than slow down writes.
> 
> Agree.
> 
> Thanks.
> 
>
Xiubo Li Sept. 20, 2021, 2:32 p.m. UTC | #16
On 9/18/21 1:19 AM, Jeff Layton wrote:
> On Thu, 2021-09-16 at 18:02 +0800, Xiubo Li wrote:
>> On 9/14/21 10:24 PM, Jeff Layton wrote:
>>> On Tue, 2021-09-14 at 13:40 +0800, Xiubo Li wrote:
>>>> On 9/14/21 3:34 AM, Jeff Layton wrote:
>>>>
>>>> [...]
>>>>
>>>>> I'll have to think about whether that's still racy. Part of the problem
>>>>>>>>> is that once the client doesn't have caps, it doesn't have a way to
>>>>>>>>> ensure that fscrypt_file (whatever it holds) doesn't change while it's
>>>>>>>>> doing that zeroing.
>>>>>>>> As my understanding, if clientA want to read a file, it will open it
>>>>>>>> with RD mode, then it will get the fscrypt_file meta and "Fr" caps, then
>>>>>>>> it can safely read the file contents and do the zeroing for that block.
>>>>>>> Ok, so in this case, the client is just zeroing out the appropriate
>>>>>>> region in the resulting read data, and not on the OSD. That should be
>>>>>>> ok.
>>>>>>>
>>>>>>>> Then the opened file shouldn't worry whether the fscrypt_file will be
>>>>>>>> changed by others during it still holding the Fr caps, because if the
>>>>>>>> clientB wants to update the fscrypt_file it must acquire the Fw caps
>>>>>>>> first, before that the Fr caps must be revoked from clientA and at the
>>>>>>>> same time the read pagecache in clientA will be invalidated.
>>>>>>>>
>>>>>>> Are you certain that Fw caps is enough to ensure that no other client
>>>>>>> holds Fr caps?
>>>>>> I spent hours and went through the mds Locker related code on the weekends.
>>>>>>
>>>>>>     From the mds/lock.cc code, for mds filelock for example in the LOCK_MIX
>>>>>> state and some interim transition states to LOCK_MIX it will allow
>>>>>> different clients could hold any of Fw or Fr caps. But the Fb/Fc will be
>>>>>> disabled. Checked the mds/Locker.cc code, found that the mds filelock
>>>>>> could to switch LOCK_MIX state in some cases when there has one client
>>>>>> wants Fw and another client wants any of Fr and Fw.
>>>>>>
>>>>>> In this case I think the Linux advisory or mandatory locks are necessary
>>>>>> to keep the file contents concurrency. In multiple processes' concurrent
>>>>>> read/write or write/write cases without the Linux advisory/mandatory
>>>>>> locks the file contents' concurrency won't be guaranteed, so the logic
>>>>>> is the same here ?
>>>>>>
>>>>>> If so, couldn't we just assume the Fw vs Fw and Fr vs Fw caps should be
>>>>>> exclusive in correct use case ? For example, just after the mds filelock
>>>>>> state switches to LOCK_MIX, if clientA gets the advisory file lock and
>>>>>> the Fw caps, and even another clientB could be successfully issued the
>>>>>> Fr caps, the clientB won't do any read because it should be still stuck
>>>>>> and be waiting for the advisory file lock.
>>>>>>
>>>>> I'm not sure I like that idea. Basically, that would change the meaning
>>>>> of the what Frw caps represent, in a way that is not really consistent
>>>>> with how they have been used before.
>>>>>
>>>>> We could gate that new behavior on the new feature flags, but it sounds
>>>>> pretty tough.
>>>>>
>>>>> I think we have a couple of options:
>>>>>
>>>>> 1) we could just make the clients request and wait on Fx caps when they
>>>>> do a truncate. They might stall for a bit if there is contention, but it
>>>>> would ensure consistency and the client could be completely in charge of
>>>>> the truncate. [a]
>>>> Yeah, for my defer RMW approach we need to held the Fx caps every time
>>>> when writing/truncating files, and the Fs caps every time when reading.
>>>>
>>>> While currently almost all the read/write code have ignored them because
>>>> read/write do not need them in most cases.
>>>>
>>> Note that we already cache truncate operations when we have Fx.
>> Yeah, only when we have Fx and the attr.ia_size > isize will it cache
>> the truncate operation.
>>
>> For this I am a little curious why couldn't we cache truncate operations
>> when attr.ia_size >= isize ?
>>
> It seems to me that if attr.ia_size == i_size, then there is nothing to
> change. We can just optimize that case away, assuming the client has
> caps that ensure the size is correct.

 From MDS side I didn't find any limitation why couldn't we optimize it.


>
> Based on the kclient code, for size changes after a write that extends a
> file, it seems like Fw is sufficient to allow the client to buffer these
> things.

Since the MDS will only allow us to increase the file size, just like 
the mtime/ctime/atime, so even the write size has exceed current file 
size, the Fw is sufficient.


>   For a truncate (aka setattr) operation, we apparently need Fx.

In case one client is truncating the file, if the new size is larger 
than or equal to current size, this should be okay and will behave just 
like normal write case above.

If the new size is smaller, the MDS will handle this in a different way. 
When the MDS received the truncate request, it will first xlock the 
filelock, which will switch the filelock state and in all these possible 
interim or stable states, the Fw caps will be revoked from all the 
clients, but the clients still could cache/buffer the file contents, 
that means no client is allowed to change the size during the truncate 
operation is on the way. After the truncate succeeds the MDS Locker will 
issue_truncate() to all the clients and the clients will truncate the 
caches/buffers, etc.

And also the truncate operations will always be queued sequentially.


> It sort of makes sense, but the more I look over the existing code, the
> less sure I am about how this is intended to work. I think before we
> make any changes for fscrypt, we need to make sure we understand what's
> there now.

So if my understanding is correct, the Fx is not a must for the truncate 
operation.

I will check more about this later.

BRs

-- Xiubo

>>>    See
>>> __ceph_setattr. Do we even need to change the read path here, or is the
>>> existing code just wrong?
>>>
>>> This is info I've been trying to get a handle on since I started working
>>> on cephfs. The rules for FILE caps are still extremely unclear.
>> I am still check this logic from MDS side. Still not very clear.
>>
>>
>> [...]
>>>>> 2) we could rev the protocol, and have the client send along the last
>>>>> block to be written along with the SETATTR request. Maybe we even
>>>>> consider just adding a new TRUNCATE call independent of SETATTR. The MDS
>>>>> would remain in complete control of it at that point.
>>>> This approach seems much better, since the last block size will always
>>>> less than or equal to 4K(CEPH_FSCRYPT_BLOCK_SIZE) and the truncate
>>>> should be rare in normal use cases (?), with extra ~4K data in the
>>>> SETATTR should be okay when truncating the file.
>>>>
>>>> So when truncating a file, in kclient it should read that block, which
>>>> needs to do the RMW, first, and then do the truncate locally and encrypt
>>>> it again, and then together with SETATTR request send it to MDS. And the
>>>> MDS will update that block just before truncating the file.
>>>>
>>>> This approach could also keep the fscrypt logic being opaque for the MDS.
>>>>
>>>>
>>> Note that you'll need to guard against races on the RMW. For instance,
>>> another client could issue a write to that last block just after we do
>>> the read for the rmw.
>>>
>>> If you do this, then you'll probably need to send along the object
>>> version that you got from the read and have the MDS validate that before
>>> it does the truncate and writes out the updated crypto block.
>>>
>>> If something changed in the interim, the MDS can just return -EAGAIN or
>>> whatever to the client and it can re-do the whole thing. It's a mess,
>>> but it should be consistent.
>>>
>>> I think we probably want a new call for this too instead of overloading
>>> SETATTR (which is already extremely messy).
>> Okay, I will check this more carefully.
>>
>>
>>>>> The other ideas I've considered seem more complex and don't offer any
>>>>> significant advantages that I can see.
>>>>>
>>>>> [a]: Side question: why does buffering a truncate require Fx and not Fb?
>>>>> How do Fx and Fb interact?
>>>> For my defer RMW approach we need the Fx caps every time when writing
>>>> the file, and the Fw caps is the 'need' caps for write, while the Fb is
>>>> the 'want' caps. If the Fb caps is not allowed or issued by the MDS, it
>>>> will write-through data to the osd, after that the Fxw could be safely
>>>> released. If the client gets the Fb caps, the client must also hold the
>>>> Fx caps until the buffer has been writen back.
>>>>
>>> The problem there is that will effectively serialize all writers of a
>>> file -- even ones that are writing to completely different backend
>>> objects.
>>>
>>> That will almost certainly regress performance. I think we want to try
>>> to avoid that. I'd rather that truncate be extremely slow and expensive
>>> than slow down writes.
>> Agree.
>>
>> Thanks.
>>
>>
Jeff Layton Sept. 20, 2021, 7:24 p.m. UTC | #17
On Mon, 2021-09-20 at 22:32 +0800, Xiubo Li wrote:
> On 9/18/21 1:19 AM, Jeff Layton wrote:
> > On Thu, 2021-09-16 at 18:02 +0800, Xiubo Li wrote:
> > > 
> > > For this I am a little curious why couldn't we cache truncate operations
> > > when attr.ia_size >= isize ?
> > > 
> > It seems to me that if attr.ia_size == i_size, then there is nothing to
> > change. We can just optimize that case away, assuming the client has
> > caps that ensure the size is correct.
> 
>  From MDS side I didn't find any limitation why couldn't we optimize it.
> 
> 

Well...I think we do still need to change the mtime in that case. We
probably do still need to do a setattr or something that makes the MDS
set it to its current local time, even if the size doesn't change.

> > 
> > Based on the kclient code, for size changes after a write that extends a
> > file, it seems like Fw is sufficient to allow the client to buffer these
> > things.
> 
> Since the MDS will only allow us to increase the file size, just like 
> the mtime/ctime/atime, so even the write size has exceed current file 
> size, the Fw is sufficient.
> 

Good.

> 
> >   For a truncate (aka setattr) operation, we apparently need Fx.
> 
> In case one client is truncating the file, if the new size is larger 
> than or equal to current size, this should be okay and will behave just 
> like normal write case above.
> 
> If the new size is smaller, the MDS will handle this in a different way. 
> When the MDS received the truncate request, it will first xlock the 
> filelock, which will switch the filelock state and in all these possible 
> interim or stable states, the Fw caps will be revoked from all the 
> clients, but the clients still could cache/buffer the file contents, 
> that means no client is allowed to change the size during the truncate 
> operation is on the way. After the truncate succeeds the MDS Locker will 
> issue_truncate() to all the clients and the clients will truncate the 
> caches/buffers, etc.
> 
> And also the truncate operations will always be queued sequentially.
> 
> 
> > It sort of makes sense, but the more I look over the existing code, the
> > less sure I am about how this is intended to work. I think before we
> > make any changes for fscrypt, we need to make sure we understand what's
> > there now.
> 
> So if my understanding is correct, the Fx is not a must for the truncate 
> operation.
> 

Basically, when a truncate up or down comes in, we have to make a
determination of whether we can buffer that change, or whether we need
to do it synchronously.

The current kclient code bases that on whether it has Fx. I suspect
that ensures that no other client has Frw, which sounds like what we
probably want. We need to prevent anyone from doing writes that might
extend the file at that time _and_ need to ensure that stat/statx for
the file size blocks until it's complete. ceph_getattr will issue a
GETATTR to the MDS if it doesn't have Fs in that case, so that should
be fine.

I assume that the MDS will never give out Fx caps to one client and Fs
to another. What about Fw, though? Do Fx caps conflict with Frw the
same way as they do with Fs?
Xiubo Li Sept. 22, 2021, 2:23 a.m. UTC | #18
On 9/21/21 3:24 AM, Jeff Layton wrote:
> On Mon, 2021-09-20 at 22:32 +0800, Xiubo Li wrote:
>> On 9/18/21 1:19 AM, Jeff Layton wrote:
>>> On Thu, 2021-09-16 at 18:02 +0800, Xiubo Li wrote:
>>>> For this I am a little curious why couldn't we cache truncate operations
>>>> when attr.ia_size >= isize ?
>>>>
>>> It seems to me that if attr.ia_size == i_size, then there is nothing to
>>> change. We can just optimize that case away, assuming the client has
>>> caps that ensure the size is correct.
>>   From MDS side I didn't find any limitation why couldn't we optimize it.
>>
>>
> Well...I think we do still need to change the mtime in that case. We
> probably do still need to do a setattr or something that makes the MDS
> set it to its current local time, even if the size doesn't change.

Since there hasn't any change for the file data, will change the 'mtime' 
make sense here ? If so, then why in case the current client has the Fr 
caps and sizes are the same it won't.

In this case I found in the MDS side, it will also update the 'ctime' 
always even it will change nothing.

>>> Based on the kclient code, for size changes after a write that extends a
>>> file, it seems like Fw is sufficient to allow the client to buffer these
>>> things.
>> Since the MDS will only allow us to increase the file size, just like
>> the mtime/ctime/atime, so even the write size has exceed current file
>> size, the Fw is sufficient.
>>
> Good.
>
>>>    For a truncate (aka setattr) operation, we apparently need Fx.
>> In case one client is truncating the file, if the new size is larger
>> than or equal to current size, this should be okay and will behave just
>> like normal write case above.
>>
>> If the new size is smaller, the MDS will handle this in a different way.
>> When the MDS received the truncate request, it will first xlock the
>> filelock, which will switch the filelock state and in all these possible
>> interim or stable states, the Fw caps will be revoked from all the
>> clients, but the clients still could cache/buffer the file contents,
>> that means no client is allowed to change the size during the truncate
>> operation is on the way. After the truncate succeeds the MDS Locker will
>> issue_truncate() to all the clients and the clients will truncate the
>> caches/buffers, etc.
>>
>> And also the truncate operations will always be queued sequentially.
>>
>>
>>> It sort of makes sense, but the more I look over the existing code, the
>>> less sure I am about how this is intended to work. I think before we
>>> make any changes for fscrypt, we need to make sure we understand what's
>>> there now.
>> So if my understanding is correct, the Fx is not a must for the truncate
>> operation.
>>
> Basically, when a truncate up or down comes in, we have to make a
> determination of whether we can buffer that change, or whether we need
> to do it synchronously.
>
> The current kclient code bases that on whether it has Fx. I suspect
> that ensures that no other client has Frw, which sounds like what we
> probably want.

The Fx caps is only for the loner client and once the client has the Fx 
caps it will always have the Fsrwcb at the same time. From the 
mds/locker.cc I didn't find it'll allow that.

If current client has the Fx caps, so when there has another client is 
trying to truncate the same file at the same time, the MDS will have to 
revoke the Fx caps first and during which the buffered truncate will be 
flushed and be finished first too. So when Fx caps is issued and new 
size equals to the current size, why couldn't we buffer it ?


>   We need to prevent anyone from doing writes that might
> extend the file at that time

Yeah, since the Fx is only for loner client, and all other clients will 
have zero file caps. so it won't happen.


>   _and_ need to ensure that stat/statx for
> the file size blocks until it's complete. ceph_getattr will issue a
> GETATTR to the MDS if it doesn't have Fs in that case, so that should
> be fine.
>
> I assume that the MDS will never give out Fx caps to one client and Fs
> to another.
Yeah, It won't happen.


>   What about Fw, though?

While for the Fw caps, it will allow it. If the filelock is in LOCK_MIX 
state, the filelock maybe in this state if there at least one client 
wants the Fw caps and some others want any of Fr and Fw caps.


>   Do Fx caps conflict with Frw the
> same way as they do with Fs?

Yeah, it will be the same with Fs.

Thanks.
Xiubo Li Sept. 24, 2021, 3:01 p.m. UTC | #19
On 9/14/21 3:34 AM, Jeff Layton wrote:
> On Mon, 2021-09-13 at 13:42 +0800, Xiubo Li wrote:
>> On 9/10/21 7:46 PM, Jeff Layton wrote:
>>> On Fri, 2021-09-10 at 10:30 +0800, Xiubo Li wrote:
>>>> On 9/9/21 8:48 PM, Jeff Layton wrote:
>>>>> On Thu, 2021-09-09 at 11:38 +0800, Xiubo Li wrote:
>>>>>> On 9/8/21 9:57 PM, Jeff Layton wrote:
>>>>>>> On Wed, 2021-09-08 at 17:37 +0800, Xiubo Li wrote:
>>>>>>>> On 9/8/21 12:26 AM, Jeff Layton wrote:
>>>>>>>>> On Fri, 2021-09-03 at 16:15 +0800,xiubli@redhat.com  wrote:
[...]
>> I spent hours and went through the mds Locker related code on the weekends.
>>
>>   From the mds/lock.cc code, for mds filelock for example in the LOCK_MIX
>> state and some interim transition states to LOCK_MIX it will allow
>> different clients could hold any of Fw or Fr caps. But the Fb/Fc will be
>> disabled. Checked the mds/Locker.cc code, found that the mds filelock
>> could to switch LOCK_MIX state in some cases when there has one client
>> wants Fw and another client wants any of Fr and Fw.
>>
>> In this case I think the Linux advisory or mandatory locks are necessary
>> to keep the file contents concurrency. In multiple processes' concurrent
>> read/write or write/write cases without the Linux advisory/mandatory
>> locks the file contents' concurrency won't be guaranteed, so the logic
>> is the same here ?
>>
>> If so, couldn't we just assume the Fw vs Fw and Fr vs Fw caps should be
>> exclusive in correct use case ? For example, just after the mds filelock
>> state switches to LOCK_MIX, if clientA gets the advisory file lock and
>> the Fw caps, and even another clientB could be successfully issued the
>> Fr caps, the clientB won't do any read because it should be still stuck
>> and be waiting for the advisory file lock.
>>
> I'm not sure I like that idea. Basically, that would change the meaning
> of the what Frw caps represent, in a way that is not really consistent
> with how they have been used before.
>
> We could gate that new behavior on the new feature flags, but it sounds
> pretty tough.
>
> I think we have a couple of options:
>
> 1) we could just make the clients request and wait on Fx caps when they
> do a truncate. They might stall for a bit if there is contention, but it
> would ensure consistency and the client could be completely in charge of
> the truncate. [a]

For this approach do you mean holding the Fx caps when doing a truncate ?

If so, I am afraid it won't work.

Because only in loner mode will the kclient could be issued the Fx caps, 
if we can get the Fx caps in the kclient then in MDS the filelock will 
be in LOCK_EXCL state.

And then if we send MDS the setattr request to truncate the size, then 
in the MDS it will try to xlock the filelock, then it will try to switch 
the filelock's state to LOCK_XLOCK, during which it must revoke the 
Frxrw caps from the kclient. That means the kclient will wait for the 
setattr request to finish by holding the Fx caps while the MDS will wait 
for the kclient to release the Fx caps.



> 2) we could rev the protocol, and have the client send along the last
> block to be written along with the SETATTR request. Maybe we even
> consider just adding a new TRUNCATE call independent of SETATTR. The MDS
> would remain in complete control of it at that point.
>
> The other ideas I've considered seem more complex and don't offer any
> significant advantages that I can see.
>
> [a]: Side question: why does buffering a truncate require Fx and not Fb?
> How do Fx and Fb interact?
>
>>> IIRC, Frw don't have the same exclusionary relationship
>>> that (e.g.) Asx has. To exclude Fr, you may need Fb.
>>>
>>> (Side note: these rules really need to be codified into a document
>>> somewhere. I started that with the capabilities doc in the ceph tree,
>>> but it's light on details of the FILE caps)
>> Yeah, that doc is light on the details for now.
>>
>>
>>>> And if after clientB have been issued the Fw caps and have modified that
>>>> block and still not flushed back, a new clientC comes and wants to read
>>>> the file, so the Fw caps must be revoked from clientB and the dirty data
>>>> will be flushed, and then when releasing the Fw caps to the MDS, it will
>>>> update the new fscrypt_file meta together.
>>>>
>>>> I haven't see which case will be racy yet. Or did I miss some cases or
>>>> something important ?
>>>>
>>>>
>>> We also need to consider how legacy, non-fscrypt-capable clients will
>>> interact with files that have this field set. If one comes in and writes
>>> to or truncates one of these files, it's liable to leave a real mess.
>>> The simplest solution may be to just bar them from doing anything with
>>> fscrypted files aside from deleting them -- maybe we'd allow them to
>>> acquire Fr caps but not Fw?
>> For the legacy, non-fscrypt-capable clients, the reading contents should
>> be encrypted any way, so there won't be any problem even if that
>> specified block is not RMWed yet and it should ignore this field.
>>
> Right, and I think we have to allow those clients to request Fr caps so
> that they have the ability to backup and archive encrypted files without
> needing the key. The cephfs-mirror-daemon, in particular, may need this.
>
>> But for write case, I think the MDS should fail it in the open() stage
>> if the mode has any of Write/Truncate, etc, and only Read/Buffer-read,
>> etc are allowed. Or if we change the mds/Locker.cc code by not allowing
>> it to issue the Fw caps to the legacy/non-fscrypt-capable clients, after
>> the file is successfully opened with Write mode, it should be stuck
>> forever when writing data to the file by waiting the Fw caps, which will
>> never come ?
>>
> Yes. Those clients should be barred from making any changes to file
> contents or doing anything that might result in a new dentry being
> attached to an existing inode.
>
> We need to allow them to read files, and unlink them, but that's really
> about it.
>
>>>>> Really, it comes down to the fact that truncates are supposed to be an
>>>>> atomic operation, but we need to perform actions in two different
>>>>> places.
>>>>>
>>>>> Hmmm...it's worse than that even -- if the truncate changes it so that
>>>>> the last block is in an entirely different object, then there are 3
>>>>> places that will need to coordinate access.
>>>>>
>>>>> Tricky.
Jeff Layton Sept. 24, 2021, 6:52 p.m. UTC | #20
On Wed, 2021-09-22 at 10:23 +0800, Xiubo Li wrote:
> On 9/21/21 3:24 AM, Jeff Layton wrote:
> > On Mon, 2021-09-20 at 22:32 +0800, Xiubo Li wrote:
> > > On 9/18/21 1:19 AM, Jeff Layton wrote:
> > > > On Thu, 2021-09-16 at 18:02 +0800, Xiubo Li wrote:
> > > > > For this I am a little curious why couldn't we cache truncate operations
> > > > > when attr.ia_size >= isize ?
> > > > > 
> > > > It seems to me that if attr.ia_size == i_size, then there is nothing to
> > > > change. We can just optimize that case away, assuming the client has
> > > > caps that ensure the size is correct.
> > >   From MDS side I didn't find any limitation why couldn't we optimize it.
> > > 
> > > 
> > Well...I think we do still need to change the mtime in that case. We
> > probably do still need to do a setattr or something that makes the MDS
> > set it to its current local time, even if the size doesn't change.
> 
> Since there hasn't any change for the file data, will change the 'mtime' 
> make sense here ? If so, then why in case the current client has the Fr 
> caps and sizes are the same it won't.
> 
> In this case I found in the MDS side, it will also update the 'ctime' 
> always even it will change nothing.
> 

I think I'm wrong here, actually, but the POSIX spec is a bit vague on
the topic. Conceptually though since nothing changed, it doesn't seem
like we'd be mandated to change the mtime or change attribute.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/truncate.html

> > > > Based on the kclient code, for size changes after a write that extends a
> > > > file, it seems like Fw is sufficient to allow the client to buffer these
> > > > things.
> > > Since the MDS will only allow us to increase the file size, just like
> > > the mtime/ctime/atime, so even the write size has exceed current file
> > > size, the Fw is sufficient.
> > > 
> > Good.
> > 
> > > >    For a truncate (aka setattr) operation, we apparently need Fx.
> > > In case one client is truncating the file, if the new size is larger
> > > than or equal to current size, this should be okay and will behave just
> > > like normal write case above.
> > > 
> > > If the new size is smaller, the MDS will handle this in a different way.
> > > When the MDS received the truncate request, it will first xlock the
> > > filelock, which will switch the filelock state and in all these possible
> > > interim or stable states, the Fw caps will be revoked from all the
> > > clients, but the clients still could cache/buffer the file contents,
> > > that means no client is allowed to change the size during the truncate
> > > operation is on the way. After the truncate succeeds the MDS Locker will
> > > issue_truncate() to all the clients and the clients will truncate the
> > > caches/buffers, etc.
> > > 
> > > And also the truncate operations will always be queued sequentially.
> > > 
> > > 
> > > > It sort of makes sense, but the more I look over the existing code, the
> > > > less sure I am about how this is intended to work. I think before we
> > > > make any changes for fscrypt, we need to make sure we understand what's
> > > > there now.
> > > So if my understanding is correct, the Fx is not a must for the truncate
> > > operation.
> > > 
> > Basically, when a truncate up or down comes in, we have to make a
> > determination of whether we can buffer that change, or whether we need
> > to do it synchronously.
> > 
> > The current kclient code bases that on whether it has Fx. I suspect
> > that ensures that no other client has Frw, which sounds like what we
> > probably want.
> 
> The Fx caps is only for the loner client and once the client has the Fx 
> caps it will always have the Fsrwcb at the same time. From the 
> mds/locker.cc I didn't find it'll allow that.
> 

So if I'm a "loner" client, then no one else has _any_ caps, right? If
so, then really Fx conflicts with all other FILE caps.

> If current client has the Fx caps, so when there has another client is 
> trying to truncate the same file at the same time, the MDS will have to 
> revoke the Fx caps first and during which the buffered truncate will be 
> flushed and be finished first too. So when Fx caps is issued and new 
> size equals to the current size, why couldn't we buffer it ?
> 
> 

I think we should be able to buffer it in that case.

> >   We need to prevent anyone from doing writes that might
> > extend the file at that time
> 
> Yeah, since the Fx is only for loner client, and all other clients will 
> have zero file caps. so it won't happen.
> 


What would be nice is to formally define what it means to be a "loner"
client? Does that mean that no other client has any outstanding caps?

A related question: Is loner status attached to the client or the cap
family? IOW, can I be a loner for FILE caps but not AUTH caps?

> 
> >   _and_ need to ensure that stat/statx for
> > the file size blocks until it's complete. ceph_getattr will issue a
> > GETATTR to the MDS if it doesn't have Fs in that case, so that should
> > be fine.
> > 
> > I assume that the MDS will never give out Fx caps to one client and Fs
> > to another.
> Yeah, It won't happen.
> 
> 
> >   What about Fw, though?
> 
> While for the Fw caps, it will allow it. If the filelock is in LOCK_MIX 
> state, the filelock maybe in this state if there at least one client 
> wants the Fw caps and some others want any of Fr and Fw caps.
> 

Sorry, I think you misunderstood my question.

We have to think about whether certain caps conflict with one another
when given to different clients. I meant do Fx and Fw conflict with one
another? If a client holds Fx and another requests Fw, then the MDS will
need to revoke Fx before it will hand Fw to another client. Correct?

I suspect that's the case as nothing else makes any sense. :)

> 
> >   Do Fx caps conflict with Frw the
> > same way as they do with Fs?
> 
> Yeah, it will be the same with Fs.
> 

Thanks for doing the archaeology on this. This is all really good info.
We need to get this into the capabilities.rst file in the ceph tree. I'd
really like to see a formal writeup of how FILE caps work.

Maybe we need a nice ASCII art matrix that indicates how various FILE
caps conflict with one another?
Xiubo Li Sept. 25, 2021, 1:02 a.m. UTC | #21
On 9/25/21 2:52 AM, Jeff Layton wrote:
> On Wed, 2021-09-22 at 10:23 +0800, Xiubo Li wrote:
>> On 9/21/21 3:24 AM, Jeff Layton wrote:
>>> On Mon, 2021-09-20 at 22:32 +0800, Xiubo Li wrote:
>>>> On 9/18/21 1:19 AM, Jeff Layton wrote:
>>>>> On Thu, 2021-09-16 at 18:02 +0800, Xiubo Li wrote:
>>>>>> For this I am a little curious why couldn't we cache truncate operations
>>>>>> when attr.ia_size >= isize ?
>>>>>>
>>>>> It seems to me that if attr.ia_size == i_size, then there is nothing to
>>>>> change. We can just optimize that case away, assuming the client has
>>>>> caps that ensure the size is correct.
>>>>    From MDS side I didn't find any limitation why couldn't we optimize it.
>>>>
>>>>
>>> Well...I think we do still need to change the mtime in that case. We
>>> probably do still need to do a setattr or something that makes the MDS
>>> set it to its current local time, even if the size doesn't change.
>> Since there hasn't any change for the file data, will change the 'mtime'
>> make sense here ? If so, then why in case the current client has the Fr

Sorry, s/Fr/Fs/


>> caps and sizes are the same it won't.
>>
>> In this case I found in the MDS side, it will also update the 'ctime'
>> always even it will change nothing.
>>
> I think I'm wrong here, actually, but the POSIX spec is a bit vague on
> the topic. Conceptually though since nothing changed, it doesn't seem
> like we'd be mandated to change the mtime or change attribute.
>
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/truncate.html

Yeah, I also read this spec and didn't find any place saying in this 
case should the mtime need to update.


>>>>> Based on the kclient code, for size changes after a write that extends a
>>>>> file, it seems like Fw is sufficient to allow the client to buffer these
>>>>> things.
>>>> Since the MDS will only allow us to increase the file size, just like
>>>> the mtime/ctime/atime, so even the write size has exceed current file
>>>> size, the Fw is sufficient.
>>>>
>>> Good.
>>>
>>>>>     For a truncate (aka setattr) operation, we apparently need Fx.
>>>> In case one client is truncating the file, if the new size is larger
>>>> than or equal to current size, this should be okay and will behave just
>>>> like normal write case above.
>>>>
>>>> If the new size is smaller, the MDS will handle this in a different way.
>>>> When the MDS received the truncate request, it will first xlock the
>>>> filelock, which will switch the filelock state and in all these possible
>>>> interim or stable states, the Fw caps will be revoked from all the
>>>> clients, but the clients still could cache/buffer the file contents,
>>>> that means no client is allowed to change the size during the truncate
>>>> operation is on the way. After the truncate succeeds the MDS Locker will
>>>> issue_truncate() to all the clients and the clients will truncate the
>>>> caches/buffers, etc.
>>>>
>>>> And also the truncate operations will always be queued sequentially.
>>>>
>>>>
>>>>> It sort of makes sense, but the more I look over the existing code, the
>>>>> less sure I am about how this is intended to work. I think before we
>>>>> make any changes for fscrypt, we need to make sure we understand what's
>>>>> there now.
>>>> So if my understanding is correct, the Fx is not a must for the truncate
>>>> operation.
>>>>
>>> Basically, when a truncate up or down comes in, we have to make a
>>> determination of whether we can buffer that change, or whether we need
>>> to do it synchronously.
>>>
>>> The current kclient code bases that on whether it has Fx. I suspect
>>> that ensures that no other client has Frw, which sounds like what we
>>> probably want.
>> The Fx caps is only for the loner client and once the client has the Fx
>> caps it will always have the Fsrwcb at the same time. From the
>> mds/locker.cc I didn't find it'll allow that.
>>
> So if I'm a "loner" client, then no one else has _any_ caps, right?

Yeah, from the mds/locker.c no one else should have any file caps.

If I didn't miss something from the MDS code, if there has any other 
client also wants any of Fr and Fw, so there will be no loner client 
exist or won't set a new one. That means when there has only one client 
will the loner should always exist, and in multiple clients case the 
loner should rarely exist.


>   If
> so, then really Fx conflicts with all other FILE caps.

Once a client has Fx, I think it always should have the Fsxrcwb.


>
>> If current client has the Fx caps, so when there has another client is
>> trying to truncate the same file at the same time, the MDS will have to
>> revoke the Fx caps first and during which the buffered truncate will be
>> flushed and be finished first too. So when Fx caps is issued and new
>> size equals to the current size, why couldn't we buffer it ?
>>
>>
> I think we should be able to buffer it in that case.
>
>>>    We need to prevent anyone from doing writes that might
>>> extend the file at that time
>> Yeah, since the Fx is only for loner client, and all other clients will
>> have zero file caps. so it won't happen.
>>
>
> What would be nice is to formally define what it means to be a "loner"
> client? Does that mean that no other client has any outstanding caps?
When the first client comes it will be the loner client, then if a new 
client comes the loner client should be revoked usually. From the code 
this is what I can get.

>
> A related question: Is loner status attached to the client or the cap
> family? IOW, can I be a loner for FILE caps but not AUTH caps?

The loner should always have the whole FILE caps, but possibly not the 
whole AUTH caps, only when the client want Ax will it try to switch the 
authlock to LOCK_EXCL, or just switch to LOCK_SYNC.

 From calc_ideal_loner(), it's calculating the loner client and if there 
has only one client's "cap.wanted() & 
(CEPH_CAP_ANY_WR|CEPH_CAP_FILE_RD)" is true will the calc_ideal_loner() 
succeed.

IMO we can just simply assume that the loner is for single client use case.


>>>    _and_ need to ensure that stat/statx for
>>> the file size blocks until it's complete. ceph_getattr will issue a
>>> GETATTR to the MDS if it doesn't have Fs in that case, so that should
>>> be fine.
>>>
>>> I assume that the MDS will never give out Fx caps to one client and Fs
>>> to another.
>> Yeah, It won't happen.
>>
>>
>>>    What about Fw, though?
>> While for the Fw caps, it will allow it. If the filelock is in LOCK_MIX
>> state, the filelock maybe in this state if there at least one client
>> wants the Fw caps and some others want any of Fr and Fw caps.
>>
> Sorry, I think you misunderstood my question.
>
> We have to think about whether certain caps conflict with one another
> when given to different clients. I meant do Fx and Fw conflict with one
> another? If a client holds Fx and another requests Fw, then the MDS will
> need to revoke Fx before it will hand Fw to another client. Correct?
>
Okay, yeah. Since only the loner client will have the Fx caps and 
because the FILE's LOCK_EXCL state won't allow any other client have the 
Fw caps, so in this case when the MDS is doing the FILE lock state 
transition the Fx will be revoked.


> I suspect that's the case as nothing else makes any sense. :)
>
>>>    Do Fx caps conflict with Frw the
>>> same way as they do with Fs?
>> Yeah, it will be the same with Fs.
>>
> Thanks for doing the archaeology on this. This is all really good info.
> We need to get this into the capabilities.rst file in the ceph tree. I'd
> really like to see a formal writeup of how FILE caps work.

Maybe we can try this after the file scryption feature is done.


> Maybe we need a nice ASCII art matrix that indicates how various FILE
> caps conflict with one another?
>
Yeah, sounds nice.
Xiubo Li Sept. 25, 2021, 9:56 a.m. UTC | #22
On 9/14/21 3:34 AM, Jeff Layton wrote:
> On Mon, 2021-09-13 at 13:42 +0800, Xiubo Li wrote:
>> On 9/10/21 7:46 PM, Jeff Layton wrote:
[...]
>>> Are you certain that Fw caps is enough to ensure that no other client
>>> holds Fr caps?
>> I spent hours and went through the mds Locker related code on the weekends.
>>
>>   From the mds/lock.cc code, for mds filelock for example in the LOCK_MIX
>> state and some interim transition states to LOCK_MIX it will allow
>> different clients could hold any of Fw or Fr caps. But the Fb/Fc will be
>> disabled. Checked the mds/Locker.cc code, found that the mds filelock
>> could to switch LOCK_MIX state in some cases when there has one client
>> wants Fw and another client wants any of Fr and Fw.
>>
>> In this case I think the Linux advisory or mandatory locks are necessary
>> to keep the file contents concurrency. In multiple processes' concurrent
>> read/write or write/write cases without the Linux advisory/mandatory
>> locks the file contents' concurrency won't be guaranteed, so the logic
>> is the same here ?
>>
>> If so, couldn't we just assume the Fw vs Fw and Fr vs Fw caps should be
>> exclusive in correct use case ? For example, just after the mds filelock
>> state switches to LOCK_MIX, if clientA gets the advisory file lock and
>> the Fw caps, and even another clientB could be successfully issued the
>> Fr caps, the clientB won't do any read because it should be still stuck
>> and be waiting for the advisory file lock.
>>
> I'm not sure I like that idea. Basically, that would change the meaning
> of the what Frw caps represent, in a way that is not really consistent
> with how they have been used before.
>
> We could gate that new behavior on the new feature flags, but it sounds
> pretty tough.
>
> I think we have a couple of options:
>
> 1) we could just make the clients request and wait on Fx caps when they
> do a truncate. They might stall for a bit if there is contention, but it
> would ensure consistency and the client could be completely in charge of
> the truncate. [a]
>
> 2) we could rev the protocol, and have the client send along the last
> block to be written along with the SETATTR request.

I am also thinking send the last block along with SETATTR request, it 
must journal the last block too, I am afraid it will occupy many cephfs 
meta pool in corner case, such as when client send massive truncate 
requests in a short time, just like in this bug: 
https://tracker.ceph.com/issues/52280.


>   Maybe we even
> consider just adding a new TRUNCATE call independent of SETATTR. The MDS
> would remain in complete control of it at that point.

Maybe we can just do:

When the MDS received a SETATTR request with size changing from clientA, 
it will try to xlock(filelock), during which the MDS will always only 
allow Fcb caps to all the clients, so another client could still 
buffering the last block.

I think we can just nudge the journal log for this request in MDS and do 
not do the early reply to let the clientA's truncate request to wait. 
When the journal log is successfully flushed and before releasing the 
xlock(filelock), we can tell the clientA to do the RMW for the last 
block. Currently while the xlock is held, no client could get the Frw 
caps, so we need to add one interim xlock state to only allow the 
xlocker clientA to have the Frw, such as:

[LOCK_XLOCKDONE_TRUNC]  = { LOCK_LOCK, false, LOCK_LOCK, 0, XCL, 0,   
0,   0,   0,   0,  0,0,CEPH_CAP_GRD|CEPH_CAP_GWR,0 },

So for clientA it will be safe to do the RMW, after this the MDS will 
finished the truncate request with safe reply only.


>
> The other ideas I've considered seem more complex and don't offer any
> significant advantages that I can see.
>
> [a]: Side question: why does buffering a truncate require Fx and not Fb?
> How do Fx and Fb interact?
>
>>> IIRC, Frw don't have the same exclusionary relationship
>>> that (e.g.) Asx has. To exclude Fr, you may need Fb.
>>>
>>> (Side note: these rules really need to be codified into a document
>>> somewhere. I started that with the capabilities doc in the ceph tree,
>>> but it's light on details of the FILE caps)
>> Yeah, that doc is light on the details for now.
>>
>>
>>>> And if after clientB have been issued the Fw caps and have modified that
>>>> block and still not flushed back, a new clientC comes and wants to read
>>>> the file, so the Fw caps must be revoked from clientB and the dirty data
>>>> will be flushed, and then when releasing the Fw caps to the MDS, it will
>>>> update the new fscrypt_file meta together.
>>>>
>>>> I haven't see which case will be racy yet. Or did I miss some cases or
>>>> something important ?
>>>>
>>>>
>>> We also need to consider how legacy, non-fscrypt-capable clients will
>>> interact with files that have this field set. If one comes in and writes
>>> to or truncates one of these files, it's liable to leave a real mess.
>>> The simplest solution may be to just bar them from doing anything with
>>> fscrypted files aside from deleting them -- maybe we'd allow them to
>>> acquire Fr caps but not Fw?
>> For the legacy, non-fscrypt-capable clients, the reading contents should
>> be encrypted any way, so there won't be any problem even if that
>> specified block is not RMWed yet and it should ignore this field.
>>
> Right, and I think we have to allow those clients to request Fr caps so
> that they have the ability to backup and archive encrypted files without
> needing the key. The cephfs-mirror-daemon, in particular, may need this.
>
>> But for write case, I think the MDS should fail it in the open() stage
>> if the mode has any of Write/Truncate, etc, and only Read/Buffer-read,
>> etc are allowed. Or if we change the mds/Locker.cc code by not allowing
>> it to issue the Fw caps to the legacy/non-fscrypt-capable clients, after
>> the file is successfully opened with Write mode, it should be stuck
>> forever when writing data to the file by waiting the Fw caps, which will
>> never come ?
>>
> Yes. Those clients should be barred from making any changes to file
> contents or doing anything that might result in a new dentry being
> attached to an existing inode.
>
> We need to allow them to read files, and unlink them, but that's really
> about it.
>
>>>>> Really, it comes down to the fact that truncates are supposed to be an
>>>>> atomic operation, but we need to perform actions in two different
>>>>> places.
>>>>>
>>>>> Hmmm...it's worse than that even -- if the truncate changes it so that
>>>>> the last block is in an entirely different object, then there are 3
>>>>> places that will need to coordinate access.
>>>>>
>>>>> Tricky.
Jeff Layton Oct. 11, 2021, 1:29 p.m. UTC | #23
On Sat, 2021-09-25 at 17:56 +0800, Xiubo Li wrote:
> On 9/14/21 3:34 AM, Jeff Layton wrote:
> > On Mon, 2021-09-13 at 13:42 +0800, Xiubo Li wrote:
> > > On 9/10/21 7:46 PM, Jeff Layton wrote:
> [...]
> > > > Are you certain that Fw caps is enough to ensure that no other client
> > > > holds Fr caps?
> > > I spent hours and went through the mds Locker related code on the weekends.
> > > 
> > >   From the mds/lock.cc code, for mds filelock for example in the LOCK_MIX
> > > state and some interim transition states to LOCK_MIX it will allow
> > > different clients could hold any of Fw or Fr caps. But the Fb/Fc will be
> > > disabled. Checked the mds/Locker.cc code, found that the mds filelock
> > > could to switch LOCK_MIX state in some cases when there has one client
> > > wants Fw and another client wants any of Fr and Fw.
> > > 
> > > In this case I think the Linux advisory or mandatory locks are necessary
> > > to keep the file contents concurrency. In multiple processes' concurrent
> > > read/write or write/write cases without the Linux advisory/mandatory
> > > locks the file contents' concurrency won't be guaranteed, so the logic
> > > is the same here ?
> > > 
> > > If so, couldn't we just assume the Fw vs Fw and Fr vs Fw caps should be
> > > exclusive in correct use case ? For example, just after the mds filelock
> > > state switches to LOCK_MIX, if clientA gets the advisory file lock and
> > > the Fw caps, and even another clientB could be successfully issued the
> > > Fr caps, the clientB won't do any read because it should be still stuck
> > > and be waiting for the advisory file lock.
> > > 
> > I'm not sure I like that idea. Basically, that would change the meaning
> > of the what Frw caps represent, in a way that is not really consistent
> > with how they have been used before.
> > 
> > We could gate that new behavior on the new feature flags, but it sounds
> > pretty tough.
> > 
> > I think we have a couple of options:
> > 
> > 1) we could just make the clients request and wait on Fx caps when they
> > do a truncate. They might stall for a bit if there is contention, but it
> > would ensure consistency and the client could be completely in charge of
> > the truncate. [a]
> > 
> > 2) we could rev the protocol, and have the client send along the last
> > block to be written along with the SETATTR request.
> 
> I am also thinking send the last block along with SETATTR request, it 
> must journal the last block too, I am afraid it will occupy many cephfs 
> meta pool in corner case, such as when client send massive truncate 
> requests in a short time, just like in this bug: 
> https://tracker.ceph.com/issues/52280.
> 
> 

Good point.

Yes, we'd need to buffer the last block on a truncate like this, but we
could limit the amount of truncates with "last block" operations that
run concurrently. We'd probably also want to cap the size of the "last
block" too.


> >   Maybe we even
> > consider just adding a new TRUNCATE call independent of SETATTR. The MDS
> > would remain in complete control of it at that point.
> 
> Maybe we can just do:
> 
> When the MDS received a SETATTR request with size changing from clientA, 
> it will try to xlock(filelock), during which the MDS will always only 
> allow Fcb caps to all the clients, so another client could still 
> buffering the last block.
> 
> I think we can just nudge the journal log for this request in MDS and do 
> not do the early reply to let the clientA's truncate request to wait. 
> When the journal log is successfully flushed and before releasing the 
> xlock(filelock), we can tell the clientA to do the RMW for the last 
> block. Currently while the xlock is held, no client could get the Frw 
> caps, so we need to add one interim xlock state to only allow the 
> xlocker clientA to have the Frw, such as:
> 
> [LOCK_XLOCKDONE_TRUNC]  = { LOCK_LOCK, false, LOCK_LOCK, 0, XCL, 0,   
> 0,   0,   0,   0,  0,0,CEPH_CAP_GRD|CEPH_CAP_GWR,0 },
> 
> So for clientA it will be safe to do the RMW, after this the MDS will 
> finished the truncate request with safe reply only.
> 
> 

This sounds pretty fragile. I worry about splitting responsibility for
truncates across two different entities (MDS and client). That means a
lot more complex failure cases.

What will you do if you do this, and then the client dies before it can
finish the RMW? How will you know when the client's RMW cycle is
complete? I assume it'll have to send a "truncate complete" message to
the MDS in that case to know when it can release the xlock?


> > 
> > The other ideas I've considered seem more complex and don't offer any
> > significant advantages that I can see.
> > 
> > [a]: Side question: why does buffering a truncate require Fx and not Fb?
> > How do Fx and Fb interact?
> > 
> > > > IIRC, Frw don't have the same exclusionary relationship
> > > > that (e.g.) Asx has. To exclude Fr, you may need Fb.
> > > > 
> > > > (Side note: these rules really need to be codified into a document
> > > > somewhere. I started that with the capabilities doc in the ceph tree,
> > > > but it's light on details of the FILE caps)
> > > Yeah, that doc is light on the details for now.
> > > 
> > > 
> > > > > And if after clientB have been issued the Fw caps and have modified that
> > > > > block and still not flushed back, a new clientC comes and wants to read
> > > > > the file, so the Fw caps must be revoked from clientB and the dirty data
> > > > > will be flushed, and then when releasing the Fw caps to the MDS, it will
> > > > > update the new fscrypt_file meta together.
> > > > > 
> > > > > I haven't see which case will be racy yet. Or did I miss some cases or
> > > > > something important ?
> > > > > 
> > > > > 
> > > > We also need to consider how legacy, non-fscrypt-capable clients will
> > > > interact with files that have this field set. If one comes in and writes
> > > > to or truncates one of these files, it's liable to leave a real mess.
> > > > The simplest solution may be to just bar them from doing anything with
> > > > fscrypted files aside from deleting them -- maybe we'd allow them to
> > > > acquire Fr caps but not Fw?
> > > For the legacy, non-fscrypt-capable clients, the reading contents should
> > > be encrypted any way, so there won't be any problem even if that
> > > specified block is not RMWed yet and it should ignore this field.
> > > 
> > Right, and I think we have to allow those clients to request Fr caps so
> > that they have the ability to backup and archive encrypted files without
> > needing the key. The cephfs-mirror-daemon, in particular, may need this.
> > 
> > > But for write case, I think the MDS should fail it in the open() stage
> > > if the mode has any of Write/Truncate, etc, and only Read/Buffer-read,
> > > etc are allowed. Or if we change the mds/Locker.cc code by not allowing
> > > it to issue the Fw caps to the legacy/non-fscrypt-capable clients, after
> > > the file is successfully opened with Write mode, it should be stuck
> > > forever when writing data to the file by waiting the Fw caps, which will
> > > never come ?
> > > 
> > Yes. Those clients should be barred from making any changes to file
> > contents or doing anything that might result in a new dentry being
> > attached to an existing inode.
> > 
> > We need to allow them to read files, and unlink them, but that's really
> > about it.
> > 
> > > > > > Really, it comes down to the fact that truncates are supposed to be an
> > > > > > atomic operation, but we need to perform actions in two different
> > > > > > places.
> > > > > > 
> > > > > > Hmmm...it's worse than that even -- if the truncate changes it so that
> > > > > > the last block is in an entirely different object, then there are 3
> > > > > > places that will need to coordinate access.
> > > > > > 
> > > > > > Tricky.
>
Xiubo Li Oct. 11, 2021, 3:16 p.m. UTC | #24
On 10/11/21 9:29 PM, Jeff Layton wrote:
> On Sat, 2021-09-25 at 17:56 +0800, Xiubo Li wrote:
>> On 9/14/21 3:34 AM, Jeff Layton wrote:
>>> On Mon, 2021-09-13 at 13:42 +0800, Xiubo Li wrote:
>>>> On 9/10/21 7:46 PM, Jeff Layton wrote:
>> [...]
>>>>> Are you certain that Fw caps is enough to ensure that no other client
>>>>> holds Fr caps?
>>>> I spent hours and went through the mds Locker related code on the weekends.
>>>>
>>>>    From the mds/lock.cc code, for mds filelock for example in the LOCK_MIX
>>>> state and some interim transition states to LOCK_MIX it will allow
>>>> different clients could hold any of Fw or Fr caps. But the Fb/Fc will be
>>>> disabled. Checked the mds/Locker.cc code, found that the mds filelock
>>>> could to switch LOCK_MIX state in some cases when there has one client
>>>> wants Fw and another client wants any of Fr and Fw.
>>>>
>>>> In this case I think the Linux advisory or mandatory locks are necessary
>>>> to keep the file contents concurrency. In multiple processes' concurrent
>>>> read/write or write/write cases without the Linux advisory/mandatory
>>>> locks the file contents' concurrency won't be guaranteed, so the logic
>>>> is the same here ?
>>>>
>>>> If so, couldn't we just assume the Fw vs Fw and Fr vs Fw caps should be
>>>> exclusive in correct use case ? For example, just after the mds filelock
>>>> state switches to LOCK_MIX, if clientA gets the advisory file lock and
>>>> the Fw caps, and even another clientB could be successfully issued the
>>>> Fr caps, the clientB won't do any read because it should be still stuck
>>>> and be waiting for the advisory file lock.
>>>>
>>> I'm not sure I like that idea. Basically, that would change the meaning
>>> of the what Frw caps represent, in a way that is not really consistent
>>> with how they have been used before.
>>>
>>> We could gate that new behavior on the new feature flags, but it sounds
>>> pretty tough.
>>>
>>> I think we have a couple of options:
>>>
>>> 1) we could just make the clients request and wait on Fx caps when they
>>> do a truncate. They might stall for a bit if there is contention, but it
>>> would ensure consistency and the client could be completely in charge of
>>> the truncate. [a]
>>>
>>> 2) we could rev the protocol, and have the client send along the last
>>> block to be written along with the SETATTR request.
>> I am also thinking send the last block along with SETATTR request, it
>> must journal the last block too, I am afraid it will occupy many cephfs
>> meta pool in corner case, such as when client send massive truncate
>> requests in a short time, just like in this bug:
>> https://tracker.ceph.com/issues/52280.
>>
>>
> Good point.
>
> Yes, we'd need to buffer the last block on a truncate like this, but we
> could limit the amount of truncates with "last block" operations that
> run concurrently. We'd probably also want to cap the size of the "last
> block" too.

Okay. So by far this seems the best approach.

I will try to it tomorrow.


>
>>>    Maybe we even
>>> consider just adding a new TRUNCATE call independent of SETATTR. The MDS
>>> would remain in complete control of it at that point.
>> Maybe we can just do:
>>
>> When the MDS received a SETATTR request with size changing from clientA,
>> it will try to xlock(filelock), during which the MDS will always only
>> allow Fcb caps to all the clients, so another client could still
>> buffering the last block.
>>
>> I think we can just nudge the journal log for this request in MDS and do
>> not do the early reply to let the clientA's truncate request to wait.
>> When the journal log is successfully flushed and before releasing the
>> xlock(filelock), we can tell the clientA to do the RMW for the last
>> block. Currently while the xlock is held, no client could get the Frw
>> caps, so we need to add one interim xlock state to only allow the
>> xlocker clientA to have the Frw, such as:
>>
>> [LOCK_XLOCKDONE_TRUNC]  = { LOCK_LOCK, false, LOCK_LOCK, 0, XCL, 0,
>> 0,   0,   0,   0,  0,0,CEPH_CAP_GRD|CEPH_CAP_GWR,0 },
>>
>> So for clientA it will be safe to do the RMW, after this the MDS will
>> finished the truncate request with safe reply only.
>>
>>
> This sounds pretty fragile. I worry about splitting responsibility for
> truncates across two different entities (MDS and client). That means a
> lot more complex failure cases.

Yeah, this will be more complex to handle the failure cases.


> What will you do if you do this, and then the client dies before it can
> finish the RMW? How will you know when the client's RMW cycle is
> complete? I assume it'll have to send a "truncate complete" message to
> the MDS in that case to know when it can release the xlock?
>
Okay, I didn't foresee this case, this sounds making it very complex...
>>> The other ideas I've considered seem more complex and don't offer any
>>> significant advantages that I can see.
>>>
>>> [a]: Side question: why does buffering a truncate require Fx and not Fb?
>>> How do Fx and Fb interact?
>>>
[...]
diff mbox series

Patch

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 6d3f74d46e5b..9f1dd2fc427d 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -212,6 +212,7 @@  static bool ceph_netfs_clamp_length(struct netfs_read_subrequest *subreq)
 static void finish_netfs_read(struct ceph_osd_request *req)
 {
 	struct ceph_fs_client *fsc = ceph_inode_to_client(req->r_inode);
+	struct ceph_inode_info *ci = ceph_inode(req->r_inode);
 	struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0);
 	struct netfs_read_subrequest *subreq = req->r_priv;
 	int num_pages;
@@ -234,8 +235,15 @@  static void finish_netfs_read(struct ceph_osd_request *req)
 
 	netfs_subreq_terminated(subreq, err, true);
 
+	/* FIXME: This should be done after descryped */
+	if (req->r_result > 0)
+		ceph_try_to_zero_truncate_block_off(ci, osd_data->alignment,
+						    osd_data->length,
+						    osd_data->pages);
+
 	num_pages = calc_pages_for(osd_data->alignment, osd_data->length);
 	ceph_put_page_vector(osd_data->pages, num_pages, false);
+
 	iput(req->r_inode);
 }
 
@@ -555,8 +563,10 @@  static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
 				  req->r_end_latency, len, err);
 
 	ceph_osdc_put_request(req);
-	if (err == 0)
+	if (err == 0) {
+		ceph_reset_truncate_block_off(ci, page_off, len);
 		err = len;
+	}
 
 	if (err < 0) {
 		struct writeback_control tmp_wbc;
@@ -661,10 +671,17 @@  static void writepages_finish(struct ceph_osd_request *req)
 					   (u64)osd_data->length);
 		total_pages += num_pages;
 		for (j = 0; j < num_pages; j++) {
+			u64 page_off;
+
 			page = osd_data->pages[j];
 			BUG_ON(!page);
 			WARN_ON(!PageUptodate(page));
 
+			page_off = osd_data->alignment + j * PAGE_SIZE;
+			if (rc >= 0)
+			    ceph_reset_truncate_block_off(ci, page_off,
+							  page_off + PAGE_SIZE);
+
 			if (atomic_long_dec_return(&fsc->writeback_count) <
 			     CONGESTION_OFF_THRESH(
 					fsc->mount_options->congestion_kb))
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index d628dcdbf869..a211ab4c3f7a 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1425,6 +1425,9 @@  static void __prep_cap(struct cap_msg_args *arg, struct ceph_cap *cap,
 			}
 		}
 	}
+	if (ci->i_truncate_block_off < 0)
+		flags |= CEPH_CLIENT_CAPS_RESET_FSCRYPT_FILE;
+
 	arg->flags = flags;
 	arg->encrypted = IS_ENCRYPTED(inode);
 	if (ci->fscrypt_auth_len &&
@@ -3155,6 +3158,27 @@  void ceph_put_cap_refs_no_check_caps(struct ceph_inode_info *ci, int had)
 	__ceph_put_cap_refs(ci, had, PUT_CAP_REFS_NO_CHECK);
 }
 
+/*
+ * Clear the i_truncate_block_off and fscrypt_file
+ * if old last encrypted block has been updated.
+ */
+void __ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
+				      u64 start_pos, u64 end_pos)
+{
+	if (ci->i_truncate_block_off > 0 &&
+	    ci->i_truncate_block_off >= start_pos &&
+	    ci->i_truncate_block_off < end_pos)
+		ci->i_truncate_block_off = 0;
+}
+
+void ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
+				    u64 start_pos, u64 end_pos)
+{
+	spin_lock(&ci->i_ceph_lock);
+	__ceph_reset_truncate_block_off(ci, start_pos, end_pos);
+	spin_unlock(&ci->i_ceph_lock);
+}
+
 /*
  * Release @nr WRBUFFER refs on dirty pages for the given @snapc snap
  * context.  Adjust per-snap dirty page accounting as appropriate.
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 6e677b40410e..cfa4cbe08c10 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -885,10 +885,34 @@  static void fscrypt_adjust_off_and_len(struct inode *inode, u64 *off, u64 *len)
 	}
 }
 
+void ceph_try_to_zero_truncate_block_off(struct ceph_inode_info *ci,
+					 u64 start_pos, u64 end_pos,
+					 struct page **pages)
+{
+	u64 zoff, zlen;
+
+	spin_lock(&ci->i_ceph_lock);
+	if (ci->i_truncate_block_off >= start_pos &&
+			ci->i_truncate_block_off < end_pos) {
+		zoff = ci->i_truncate_block_off - start_pos;
+		zlen = round_up(ci->i_truncate_block_off, PAGE_SIZE) - ci->i_truncate_block_off;
+
+		spin_unlock(&ci->i_ceph_lock);
+		ceph_zero_page_vector_range(zoff, zlen, pages);
+		spin_lock(&ci->i_ceph_lock);
+	}
+	spin_unlock(&ci->i_ceph_lock);
+}
 #else
 static inline void fscrypt_adjust_off_and_len(struct inode *inode, u64 *off, u64 *len)
 {
 }
+
+void ceph_try_to_zero_truncate_block_off(struct ceph_inode_info *ci,
+					 u64 start_pos, u64 end_pos,
+					 struct page **pages)
+{
+}
 #endif
 
 /*
@@ -1030,6 +1054,13 @@  static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
 			ret += zlen;
 		}
 
+		/*
+		 * If the inode is ENCRYPTED the read_off is aligned to PAGE_SIZE
+		 */
+		ceph_try_to_zero_truncate_block_off(ci, read_off,
+						    read_off + read_len,
+						    pages);
+
 		idx = 0;
 		left = ret > 0 ? ret : 0;
 		while (left > 0) {
@@ -1413,12 +1444,34 @@  ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 
 		size = i_size_read(inode);
 		if (!write) {
+			struct iov_iter i;
+			size_t boff;
+			int zlen;
+
 			if (ret == -ENOENT)
 				ret = 0;
+
+			/* Zero the truncate block off */
+			spin_lock(&ci->i_ceph_lock);
+			boff = ci->i_truncate_block_off;
+			if (IS_ENCRYPTED(inode) && ret > 0 && boff > 0 &&
+			    boff >= (iocb->ki_pos & PAGE_MASK) &&
+			    boff < round_up(ret, PAGE_SIZE)) {
+				int advance = 0;
+
+				zlen = round_up(boff, PAGE_SIZE) - boff;
+				if (ci->i_truncate_block_off >= iocb->ki_pos)
+					advance = boff - iocb->ki_pos;
+
+				iov_iter_bvec(&i, READ, bvecs, num_pages, len);
+				iov_iter_advance(&i, advance);
+				iov_iter_zero(zlen, &i);
+			}
+			spin_unlock(&ci->i_ceph_lock);
+
 			if (ret >= 0 && ret < len && pos + ret < size) {
-				struct iov_iter i;
-				int zlen = min_t(size_t, len - ret,
-						 size - pos - ret);
+				zlen = min_t(size_t, len - ret,
+					     size - pos - ret);
 
 				iov_iter_bvec(&i, READ, bvecs, num_pages, len);
 				iov_iter_advance(&i, ret);
@@ -1967,6 +2020,7 @@  static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
 	struct ceph_osd_client *osdc = &fsc->client->osdc;
 	struct ceph_cap_flush *prealloc_cf;
+	u64 start_pos = iocb->ki_pos;
 	ssize_t count, written = 0;
 	int err, want, got;
 	bool direct_lock = false;
@@ -2110,6 +2164,8 @@  static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		int dirty;
 
 		spin_lock(&ci->i_ceph_lock);
+		__ceph_reset_truncate_block_off(ci, start_pos, iocb->ki_pos);
+
 		ci->i_inline_version = CEPH_INLINE_NONE;
 		dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR,
 					       &prealloc_cf);
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 1a4c9bc485fc..c48c77c1bcf4 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -625,6 +625,7 @@  struct inode *ceph_alloc_inode(struct super_block *sb)
 	ci->fscrypt_auth = NULL;
 	ci->fscrypt_auth_len = 0;
 #endif
+	ci->i_truncate_block_off = 0;
 
 	return &ci->vfs_inode;
 }
@@ -1033,11 +1034,24 @@  int ceph_fill_inode(struct inode *inode, struct page *locked_page,
 
 		pool_ns = old_ns;
 
+		/* the fscrypt_file is 0 means the file content truncate has been done */
 		if (IS_ENCRYPTED(inode) && size &&
-		    (iinfo->fscrypt_file_len == sizeof(__le64))) {
+		    iinfo->fscrypt_file_len == sizeof(__le64) &&
+		    __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file) > 0) {
 			size = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
 			if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
 				pr_warn("size=%llu fscrypt_file=%llu\n", info->size, size);
+
+			/*
+			 * If the second truncate come just after the first
+			 * truncate, and if the second has a larger size there
+			 * is no need to update the i_truncate_block_off.
+			 * Only when the second one has a smaller size, that
+			 * means we need to truncate more.
+			 */
+			if (ci->i_truncate_block_off > 0 &&
+			    size < ci->i_truncate_block_off)
+				ci->i_truncate_block_off = size;
 		}
 
 		queue_trunc = ceph_fill_file_size(inode, issued,
@@ -2390,8 +2404,15 @@  int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 				req->r_args.setattr.old_size =
 					cpu_to_le64(round_up(isize,
 							     CEPH_FSCRYPT_BLOCK_SIZE));
-				req->r_fscrypt_file = attr->ia_size;
-				/* FIXME: client must zero out any partial blocks! */
+				if (attr->ia_size < isize) {
+					if(IS_ALIGNED(attr->ia_size, CEPH_FSCRYPT_BLOCK_SIZE))
+						req->r_fscrypt_file = 0;
+					else
+						req->r_fscrypt_file = attr->ia_size;
+					/* FIXME: client must zero out any partial blocks! */
+				} else if (attr->ia_size > isize) {
+					req->r_fscrypt_file = attr->ia_size;
+				}
 			} else {
 				req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
 				req->r_args.setattr.old_size = cpu_to_le64(isize);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 7f3976b3319d..856caeb25fb6 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -443,9 +443,9 @@  struct ceph_inode_info {
 	struct fscache_cookie *fscache;
 #endif
 	u32 fscrypt_auth_len;
-	u32 fscrypt_file_len;
 	u8 *fscrypt_auth;
-	u8 *fscrypt_file;
+	/* need to zero the last block when decrypting the content to pagecache */
+	size_t i_truncate_block_off;
 
 	errseq_t i_meta_err;
 
@@ -1192,6 +1192,10 @@  extern void ceph_put_cap_refs(struct ceph_inode_info *ci, int had);
 extern void ceph_put_cap_refs_async(struct ceph_inode_info *ci, int had);
 extern void ceph_put_cap_refs_no_check_caps(struct ceph_inode_info *ci,
 					    int had);
+extern void __ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
+					    u64 start_pos, u64 end_pos);
+extern void ceph_reset_truncate_block_off(struct ceph_inode_info *ci,
+					  u64 start_pos, u64 end_pos);
 extern void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
 				       struct ceph_snap_context *snapc);
 extern void ceph_flush_snaps(struct ceph_inode_info *ci,
@@ -1282,6 +1286,10 @@  extern int ceph_locks_to_pagelist(struct ceph_filelock *flocks,
 extern void ceph_fs_debugfs_init(struct ceph_fs_client *client);
 extern void ceph_fs_debugfs_cleanup(struct ceph_fs_client *client);
 
+extern void ceph_try_to_zero_truncate_block_off(struct ceph_inode_info *ci,
+						u64 start_pos, u64 end_pos,
+						struct page **pages);
+
 /* quota.c */
 static inline bool __ceph_has_any_quota(struct ceph_inode_info *ci)
 {