
[v2] ceph: invalidate pages when doing DIO in encrypted inodes

Message ID 20220401133243.1075-1-lhenriques@suse.de (mailing list archive)
State New, archived
Series [v2] ceph: invalidate pages when doing DIO in encrypted inodes

Commit Message

Luis Henriques April 1, 2022, 1:32 p.m. UTC
When doing DIO on an encrypted inode, we need to invalidate the page cache in
the range being written to, otherwise the cache will include invalid data.

Signed-off-by: Luís Henriques <lhenriques@suse.de>
---
 fs/ceph/file.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Changes since v1:
- Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
- Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO

Note: I'm not really sure this last change is required; it doesn't really
affect the generic/647 result, but it seems to be the most correct thing to do.
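
Roughly, that helper maps the new boolean onto the flag like this (a sketch,
not the exact fs/ceph/cache.h code; the cookie/aux arguments are approximate):

static inline void ceph_fscache_invalidate(struct inode *inode, bool dio_write)
{
        struct ceph_inode_info *ci = ceph_inode(inode);

        /* dio_write tells fscache the invalidation comes from a direct
         * write, so any cached data for the file gets discarded */
        fscache_invalidate(ceph_fscache_cookie(ci), &ci->i_vino,
                           i_size_read(inode),
                           dio_write ? FSCACHE_INVAL_DIO_WRITE : 0);
}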

Comments

Xiubo Li April 6, 2022, 5:24 a.m. UTC | #1
On 4/1/22 9:32 PM, Luís Henriques wrote:
> When doing DIO on an encrypted node, we need to invalidate the page cache in
> the range being written to, otherwise the cache will include invalid data.
>
> Signed-off-by: Luís Henriques <lhenriques@suse.de>
> ---
>   fs/ceph/file.c | 11 ++++++++++-
>   1 file changed, 10 insertions(+), 1 deletion(-)
>
> Changes since v1:
> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>
> Note: I'm not really sure this last change is required, it doesn't really
> affect generic/647 result, but seems to be the most correct.
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 5072570c2203..b2743c342305 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1605,7 +1605,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>   	if (ret < 0)
>   		return ret;
>   
> -	ceph_fscache_invalidate(inode, false);
> +	ceph_fscache_invalidate(inode, (iocb->ki_flags & IOCB_DIRECT));
>   	ret = invalidate_inode_pages2_range(inode->i_mapping,
>   					    pos >> PAGE_SHIFT,
>   					    (pos + count - 1) >> PAGE_SHIFT);
> @@ -1895,6 +1895,15 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>   		req->r_inode = inode;
>   		req->r_mtime = mtime;
>   
> +		if (IS_ENCRYPTED(inode) && (iocb->ki_flags & IOCB_DIRECT)) {
> +			ret = invalidate_inode_pages2_range(
> +				inode->i_mapping,
> +				write_pos >> PAGE_SHIFT,
> +				(write_pos + write_len - 1) >> PAGE_SHIFT);
> +			if (ret < 0)
> +				dout("invalidate_inode_pages2_range returned %d\n", ret);
> +		}

Shouldn't we fail it if the 'invalidate_inode_pages2_range()' fails here ?

-- Xiubo

> +
>   		/* Set up the assertion */
>   		if (rmw) {
>   			/*
>
Xiubo Li April 6, 2022, 6:28 a.m. UTC | #2
On 4/1/22 9:32 PM, Luís Henriques wrote:
> When doing DIO on an encrypted node, we need to invalidate the page cache in
> the range being written to, otherwise the cache will include invalid data.
>
> Signed-off-by: Luís Henriques <lhenriques@suse.de>
> ---
>   fs/ceph/file.c | 11 ++++++++++-
>   1 file changed, 10 insertions(+), 1 deletion(-)
>
> Changes since v1:
> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>
> Note: I'm not really sure this last change is required, it doesn't really
> affect generic/647 result, but seems to be the most correct.
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 5072570c2203..b2743c342305 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1605,7 +1605,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>   	if (ret < 0)
>   		return ret;
>   
> -	ceph_fscache_invalidate(inode, false);
> +	ceph_fscache_invalidate(inode, (iocb->ki_flags & IOCB_DIRECT));
>   	ret = invalidate_inode_pages2_range(inode->i_mapping,
>   					    pos >> PAGE_SHIFT,
>   					    (pos + count - 1) >> PAGE_SHIFT);

The above has already invalidated the pages, why doesn't it work ?

-- Xiubo

> @@ -1895,6 +1895,15 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>   		req->r_inode = inode;
>   		req->r_mtime = mtime;
>   
> +		if (IS_ENCRYPTED(inode) && (iocb->ki_flags & IOCB_DIRECT)) {
> +			ret = invalidate_inode_pages2_range(
> +				inode->i_mapping,
> +				write_pos >> PAGE_SHIFT,
> +				(write_pos + write_len - 1) >> PAGE_SHIFT);
> +			if (ret < 0)
> +				dout("invalidate_inode_pages2_range returned %d\n", ret);
> +		}
> +
>   		/* Set up the assertion */
>   		if (rmw) {
>   			/*
>
Luis Henriques April 6, 2022, 10:50 a.m. UTC | #3
Xiubo Li <xiubli@redhat.com> writes:

> On 4/1/22 9:32 PM, Luís Henriques wrote:
>> When doing DIO on an encrypted node, we need to invalidate the page cache in
>> the range being written to, otherwise the cache will include invalid data.
>>
>> Signed-off-by: Luís Henriques <lhenriques@suse.de>
>> ---
>>   fs/ceph/file.c | 11 ++++++++++-
>>   1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> Changes since v1:
>> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
>> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>>
>> Note: I'm not really sure this last change is required, it doesn't really
>> affect generic/647 result, but seems to be the most correct.
>>
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index 5072570c2203..b2743c342305 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -1605,7 +1605,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>   	if (ret < 0)
>>   		return ret;
>>   -	ceph_fscache_invalidate(inode, false);
>> +	ceph_fscache_invalidate(inode, (iocb->ki_flags & IOCB_DIRECT));
>>   	ret = invalidate_inode_pages2_range(inode->i_mapping,
>>   					    pos >> PAGE_SHIFT,
>>   					    (pos + count - 1) >> PAGE_SHIFT);
>> @@ -1895,6 +1895,15 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>   		req->r_inode = inode;
>>   		req->r_mtime = mtime;
>>   +		if (IS_ENCRYPTED(inode) && (iocb->ki_flags & IOCB_DIRECT)) {
>> +			ret = invalidate_inode_pages2_range(
>> +				inode->i_mapping,
>> +				write_pos >> PAGE_SHIFT,
>> +				(write_pos + write_len - 1) >> PAGE_SHIFT);
>> +			if (ret < 0)
>> +				dout("invalidate_inode_pages2_range returned %d\n", ret);
>> +		}
>
> Shouldn't we fail it if the 'invalidate_inode_pages2_range()' fails here ?

Yeah, I'm not really sure.  I'm simply following the usual pattern where
an invalidate_inode_pages2_range() failure is logged and ignored.  And
this is not ceph-specific, other filesystems seem to do the same thing.

Cheers,
Xiubo Li April 6, 2022, 10:57 a.m. UTC | #4
On 4/6/22 6:50 PM, Luís Henriques wrote:
> Xiubo Li <xiubli@redhat.com> writes:
>
>> On 4/1/22 9:32 PM, Luís Henriques wrote:
>>> When doing DIO on an encrypted node, we need to invalidate the page cache in
>>> the range being written to, otherwise the cache will include invalid data.
>>>
>>> Signed-off-by: Luís Henriques <lhenriques@suse.de>
>>> ---
>>>    fs/ceph/file.c | 11 ++++++++++-
>>>    1 file changed, 10 insertions(+), 1 deletion(-)
>>>
>>> Changes since v1:
>>> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
>>> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>>>
>>> Note: I'm not really sure this last change is required, it doesn't really
>>> affect generic/647 result, but seems to be the most correct.
>>>
>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>> index 5072570c2203..b2743c342305 100644
>>> --- a/fs/ceph/file.c
>>> +++ b/fs/ceph/file.c
>>> @@ -1605,7 +1605,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>    	if (ret < 0)
>>>    		return ret;
>>>    -	ceph_fscache_invalidate(inode, false);
>>> +	ceph_fscache_invalidate(inode, (iocb->ki_flags & IOCB_DIRECT));
>>>    	ret = invalidate_inode_pages2_range(inode->i_mapping,
>>>    					    pos >> PAGE_SHIFT,
>>>    					    (pos + count - 1) >> PAGE_SHIFT);
>>> @@ -1895,6 +1895,15 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>    		req->r_inode = inode;
>>>    		req->r_mtime = mtime;
>>>    +		if (IS_ENCRYPTED(inode) && (iocb->ki_flags & IOCB_DIRECT)) {
>>> +			ret = invalidate_inode_pages2_range(
>>> +				inode->i_mapping,
>>> +				write_pos >> PAGE_SHIFT,
>>> +				(write_pos + write_len - 1) >> PAGE_SHIFT);
>>> +			if (ret < 0)
>>> +				dout("invalidate_inode_pages2_range returned %d\n", ret);
>>> +		}
>> Shouldn't we fail it if the 'invalidate_inode_pages2_range()' fails here ?
> Yeah, I'm not really sure.  I'm simply following the usual pattern where
> an invalidate_inode_pages2_range() failure is logged and ignored.  And
> this is not ceph-specific, other filesystems seem to do the same thing.

I think that's because they are using it to invalidate the range only,
and do not depend on it to write back the dirty pages.

For example, they may call 'filemap_fdatawrite_range()' first, etc.

I saw that at the beginning of 'ceph_sync_write()' it already does a
'filemap_write_and_wait_range()', so the dirty pages should have been
flushed by then.

-- Xiubo



> Cheers,
Luis Henriques April 6, 2022, 10:57 a.m. UTC | #5
Xiubo Li <xiubli@redhat.com> writes:

> On 4/1/22 9:32 PM, Luís Henriques wrote:
>> When doing DIO on an encrypted node, we need to invalidate the page cache in
>> the range being written to, otherwise the cache will include invalid data.
>>
>> Signed-off-by: Luís Henriques <lhenriques@suse.de>
>> ---
>>   fs/ceph/file.c | 11 ++++++++++-
>>   1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> Changes since v1:
>> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
>> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>>
>> Note: I'm not really sure this last change is required, it doesn't really
>> affect generic/647 result, but seems to be the most correct.
>>
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index 5072570c2203..b2743c342305 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -1605,7 +1605,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>   	if (ret < 0)
>>   		return ret;
>>   -	ceph_fscache_invalidate(inode, false);
>> +	ceph_fscache_invalidate(inode, (iocb->ki_flags & IOCB_DIRECT));
>>   	ret = invalidate_inode_pages2_range(inode->i_mapping,
>>   					    pos >> PAGE_SHIFT,
>>   					    (pos + count - 1) >> PAGE_SHIFT);
>
> The above has already invalidated the pages, why doesn't it work ?

I suspect the reason is that later on we loop through the pages, calling
copy_page_from_iter() and then ceph_fscrypt_encrypt_pages().
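
Roughly, that part of ceph_sync_write() looks like this in the fscrypt
branch (a paraphrased sketch only; the exact helper arguments are
approximate):

left = write_len;
for (n = 0; n < num_pages; n++) {
	size_t plen = min_t(size_t, left, PAGE_SIZE - off);

	/* copying from the iov can fault the source buffer in; if that
	 * buffer is an mmap of this same file, the fault pulls this
	 * file's pages back into the page cache */
	ret = copy_page_from_iter(pages[n], off, plen, from);
	if (ret != plen) {
		ret = -EFAULT;
		break;
	}
	off = 0;
	left -= plen;
}

if (IS_ENCRYPTED(inode))
	ret = ceph_fscrypt_encrypt_pages(inode, pages, write_pos,
					 write_len, GFP_KERNEL);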

Cheers,
Xiubo Li April 6, 2022, 11:18 a.m. UTC | #6
On 4/6/22 6:57 PM, Luís Henriques wrote:
> Xiubo Li <xiubli@redhat.com> writes:
>
>> On 4/1/22 9:32 PM, Luís Henriques wrote:
>>> When doing DIO on an encrypted node, we need to invalidate the page cache in
>>> the range being written to, otherwise the cache will include invalid data.
>>>
>>> Signed-off-by: Luís Henriques <lhenriques@suse.de>
>>> ---
>>>    fs/ceph/file.c | 11 ++++++++++-
>>>    1 file changed, 10 insertions(+), 1 deletion(-)
>>>
>>> Changes since v1:
>>> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
>>> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>>>
>>> Note: I'm not really sure this last change is required, it doesn't really
>>> affect generic/647 result, but seems to be the most correct.
>>>
>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>> index 5072570c2203..b2743c342305 100644
>>> --- a/fs/ceph/file.c
>>> +++ b/fs/ceph/file.c
>>> @@ -1605,7 +1605,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>    	if (ret < 0)
>>>    		return ret;
>>>    -	ceph_fscache_invalidate(inode, false);
>>> +	ceph_fscache_invalidate(inode, (iocb->ki_flags & IOCB_DIRECT));
>>>    	ret = invalidate_inode_pages2_range(inode->i_mapping,
>>>    					    pos >> PAGE_SHIFT,
>>>    					    (pos + count - 1) >> PAGE_SHIFT);
>> The above has already invalidated the pages, why doesn't it work ?
> I suspect the reason is because later on we loop through the number of
> pages, call copy_page_from_iter() and then ceph_fscrypt_encrypt_pages().

I checked 'copy_page_from_iter()': it will kmap the pages but will kunmap
them again later, and it shouldn't touch i_mapping, unless I'm missing
something important.

As for 'ceph_fscrypt_encrypt_pages()', it encrypts/decrypts the content in
place; IMO if it needs to map the page it should also unmap it, just like
'copy_page_from_iter()' does.

I thought it could possibly be the RMW case, which may update i_mapping
when reading contents, but I checked the code and didn't find any place
doing this.  So I am wondering where those page caches come from?  And if
they really come from reading the contents, shouldn't we discard them
instead of flushing them back?

BTW, what's the problem without this fix?  An xfstest failure?
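
(By 'in place' I mean something like the following, where each page's
plaintext is simply overwritten with ciphertext and no extra page-cache
pages are involved.  This is only an illustration, not the actual ceph
helper, and 'first_blk' is a made-up placeholder:)

for (i = 0; i < npages; i++) {
	/* assumes one fscrypt block per page, purely for illustration */
	ret = fscrypt_encrypt_block_inplace(inode, pages[i], PAGE_SIZE, 0,
					    first_blk + i, GFP_KERNEL);
	if (ret)
		break;
}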


-- Xiubo

> Cheers,
Luis Henriques April 6, 2022, 11:33 a.m. UTC | #7
Xiubo Li <xiubli@redhat.com> writes:

> On 4/6/22 6:57 PM, Luís Henriques wrote:
>> Xiubo Li <xiubli@redhat.com> writes:
>>
>>> On 4/1/22 9:32 PM, Luís Henriques wrote:
>>>> When doing DIO on an encrypted node, we need to invalidate the page cache in
>>>> the range being written to, otherwise the cache will include invalid data.
>>>>
>>>> Signed-off-by: Luís Henriques <lhenriques@suse.de>
>>>> ---
>>>>    fs/ceph/file.c | 11 ++++++++++-
>>>>    1 file changed, 10 insertions(+), 1 deletion(-)
>>>>
>>>> Changes since v1:
>>>> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
>>>> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>>>>
>>>> Note: I'm not really sure this last change is required, it doesn't really
>>>> affect generic/647 result, but seems to be the most correct.
>>>>
>>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>>> index 5072570c2203..b2743c342305 100644
>>>> --- a/fs/ceph/file.c
>>>> +++ b/fs/ceph/file.c
>>>> @@ -1605,7 +1605,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>>    	if (ret < 0)
>>>>    		return ret;
>>>>    -	ceph_fscache_invalidate(inode, false);
>>>> +	ceph_fscache_invalidate(inode, (iocb->ki_flags & IOCB_DIRECT));
>>>>    	ret = invalidate_inode_pages2_range(inode->i_mapping,
>>>>    					    pos >> PAGE_SHIFT,
>>>>    					    (pos + count - 1) >> PAGE_SHIFT);
>>> The above has already invalidated the pages, why doesn't it work ?
>> I suspect the reason is because later on we loop through the number of
>> pages, call copy_page_from_iter() and then ceph_fscrypt_encrypt_pages().
>
> Checked the 'copy_page_from_iter()', it will do the kmap for the pages but will
> kunmap them again later. And they shouldn't update the i_mapping if I didn't
> miss something important.
>
> For 'ceph_fscrypt_encrypt_pages()' it will encrypt/dencrypt the context inplace,
> IMO if it needs to map the page and it should also unmap it just like in
> 'copy_page_from_iter()'.
>
> I thought it possibly be when we need to do RMW, it may will update the
> i_mapping when reading contents, but I checked the code didn't find any 
> place is doing this. So I am wondering where tha page caches come from ? If that
> page caches really from reading the contents, then we should discard it instead
> of flushing it back ?
>
> BTW, what's the problem without this fixing ? xfstest fails ?

Yes, generic/647 fails if you run it with test_dummy_encryption.  And I've
also checked that the RMW code was never executed in this test.

But yeah I have assumed (perhaps wrongly) that the kmap/kunmap could
change the inode->i_mapping.  In my debugging this seemed to be the case
for the O_DIRECT path.  That's why I added this extra call here.

Cheers,
Jeff Layton April 6, 2022, 11:48 a.m. UTC | #8
On Wed, 2022-04-06 at 12:33 +0100, Luís Henriques wrote:
> Xiubo Li <xiubli@redhat.com> writes:
> 
> > On 4/6/22 6:57 PM, Luís Henriques wrote:
> > > Xiubo Li <xiubli@redhat.com> writes:
> > > 
> > > > On 4/1/22 9:32 PM, Luís Henriques wrote:
> > > > > When doing DIO on an encrypted node, we need to invalidate the page cache in
> > > > > the range being written to, otherwise the cache will include invalid data.
> > > > > 
> > > > > Signed-off-by: Luís Henriques <lhenriques@suse.de>
> > > > > ---
> > > > >    fs/ceph/file.c | 11 ++++++++++-
> > > > >    1 file changed, 10 insertions(+), 1 deletion(-)
> > > > > 
> > > > > Changes since v1:
> > > > > - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
> > > > > - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
> > > > > 
> > > > > Note: I'm not really sure this last change is required, it doesn't really
> > > > > affect generic/647 result, but seems to be the most correct.
> > > > > 
> > > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > > > index 5072570c2203..b2743c342305 100644
> > > > > --- a/fs/ceph/file.c
> > > > > +++ b/fs/ceph/file.c
> > > > > @@ -1605,7 +1605,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
> > > > >    	if (ret < 0)
> > > > >    		return ret;
> > > > >    -	ceph_fscache_invalidate(inode, false);
> > > > > +	ceph_fscache_invalidate(inode, (iocb->ki_flags & IOCB_DIRECT));
> > > > >    	ret = invalidate_inode_pages2_range(inode->i_mapping,
> > > > >    					    pos >> PAGE_SHIFT,
> > > > >    					    (pos + count - 1) >> PAGE_SHIFT);
> > > > The above has already invalidated the pages, why doesn't it work ?
> > > I suspect the reason is because later on we loop through the number of
> > > pages, call copy_page_from_iter() and then ceph_fscrypt_encrypt_pages().
> > 
> > Checked the 'copy_page_from_iter()', it will do the kmap for the pages but will
> > kunmap them again later. And they shouldn't update the i_mapping if I didn't
> > miss something important.
> > 
> > For 'ceph_fscrypt_encrypt_pages()' it will encrypt/dencrypt the context inplace,
> > IMO if it needs to map the page and it should also unmap it just like in
> > 'copy_page_from_iter()'.
> > 
> > I thought it possibly be when we need to do RMW, it may will update the
> > i_mapping when reading contents, but I checked the code didn't find any 
> > place is doing this. So I am wondering where tha page caches come from ? If that
> > page caches really from reading the contents, then we should discard it instead
> > of flushing it back ?
> > 
> > BTW, what's the problem without this fixing ? xfstest fails ?
> 
> Yes, generic/647 fails if you run it with test_dummy_encryption.  And I've
> also checked that the RMW code was never executed in this test.
> 
> But yeah I have assumed (perhaps wrongly) that the kmap/kunmap could
> change the inode->i_mapping. 
> 

No, kmap/unmap are all about high memory and 32-bit architectures. Those
functions are usually no-ops on 64-bit arches.

> In my debugging this seemed to be the case
> for the O_DIRECT path.  That's why I added this extra call here.
> 

I agree with Xiubo that we really shouldn't need to invalidate multiple
times.

I guess in this test, we have a DIO write racing with an mmap read.
Probably what's happening is either that we can't invalidate the page
because it needs to be cleaned, or the mmap read is racing in just after
the invalidate occurs but before writeback.

In any case, it might be interesting to see whether you're getting
-EBUSY back from the new invalidate_inode_pages2 calls with your patch.
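
Something like this on top of your patch would show that (illustrative
only):

ret = invalidate_inode_pages2_range(inode->i_mapping,
				    write_pos >> PAGE_SHIFT,
				    (write_pos + write_len - 1) >> PAGE_SHIFT);
if (ret == -EBUSY)
	dout("page(s) could not be invalidated (-EBUSY)\n");
else if (ret < 0)
	dout("invalidate_inode_pages2_range returned %d\n", ret);
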
Xiubo Li April 6, 2022, 1:10 p.m. UTC | #9
On 4/6/22 7:48 PM, Jeff Layton wrote:
> On Wed, 2022-04-06 at 12:33 +0100, Luís Henriques wrote:
>> Xiubo Li <xiubli@redhat.com> writes:
>>
>>> On 4/6/22 6:57 PM, Luís Henriques wrote:
>>>> Xiubo Li <xiubli@redhat.com> writes:
>>>>
>>>>> On 4/1/22 9:32 PM, Luís Henriques wrote:
>>>>>> When doing DIO on an encrypted node, we need to invalidate the page cache in
>>>>>> the range being written to, otherwise the cache will include invalid data.
>>>>>>
>>>>>> Signed-off-by: Luís Henriques <lhenriques@suse.de>
>>>>>> ---
>>>>>>     fs/ceph/file.c | 11 ++++++++++-
>>>>>>     1 file changed, 10 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> Changes since v1:
>>>>>> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
>>>>>> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>>>>>>
>>>>>> Note: I'm not really sure this last change is required, it doesn't really
>>>>>> affect generic/647 result, but seems to be the most correct.
>>>>>>
>>>>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>>>>> index 5072570c2203..b2743c342305 100644
>>>>>> --- a/fs/ceph/file.c
>>>>>> +++ b/fs/ceph/file.c
>>>>>> @@ -1605,7 +1605,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>>>>     	if (ret < 0)
>>>>>>     		return ret;
>>>>>>     -	ceph_fscache_invalidate(inode, false);
>>>>>> +	ceph_fscache_invalidate(inode, (iocb->ki_flags & IOCB_DIRECT));
>>>>>>     	ret = invalidate_inode_pages2_range(inode->i_mapping,
>>>>>>     					    pos >> PAGE_SHIFT,
>>>>>>     					    (pos + count - 1) >> PAGE_SHIFT);
>>>>> The above has already invalidated the pages, why doesn't it work ?
>>>> I suspect the reason is because later on we loop through the number of
>>>> pages, call copy_page_from_iter() and then ceph_fscrypt_encrypt_pages().
>>> Checked the 'copy_page_from_iter()', it will do the kmap for the pages but will
>>> kunmap them again later. And they shouldn't update the i_mapping if I didn't
>>> miss something important.
>>>
>>> For 'ceph_fscrypt_encrypt_pages()' it will encrypt/dencrypt the context inplace,
>>> IMO if it needs to map the page and it should also unmap it just like in
>>> 'copy_page_from_iter()'.
>>>
>>> I thought it possibly be when we need to do RMW, it may will update the
>>> i_mapping when reading contents, but I checked the code didn't find any
>>> place is doing this. So I am wondering where tha page caches come from ? If that
>>> page caches really from reading the contents, then we should discard it instead
>>> of flushing it back ?
>>>
>>> BTW, what's the problem without this fixing ? xfstest fails ?
>> Yes, generic/647 fails if you run it with test_dummy_encryption.  And I've
>> also checked that the RMW code was never executed in this test.
>>
>> But yeah I have assumed (perhaps wrongly) that the kmap/kunmap could
>> change the inode->i_mapping.
>>
> No, kmap/unmap are all about high memory and 32-bit architectures. Those
> functions are usually no-ops on 64-bit arches.

Yeah, right.

So they do nothing here.

>> In my debugging this seemed to be the case
>> for the O_DIRECT path.  That's why I added this extra call here.
>>
> I agree with Xiubo that we really shouldn't need to invalidate multiple
> times.
>
> I guess in this test, we have a DIO write racing with an mmap read
> Probably what's happening is either that we can't invalidate the page
> because it needs to be cleaned, or the mmap read is racing in just after
> the invalidate occurs but before writeback.

This sounds like a possible case.


> In any case, it might be interesting to see whether you're getting
> -EBUSY back from the new invalidate_inode_pages2 calls with your patch.
>
If it's really this case, maybe this should be retried somewhere?

-- Xiubo
Jeff Layton April 6, 2022, 1:41 p.m. UTC | #10
On Wed, 2022-04-06 at 21:10 +0800, Xiubo Li wrote:
> On 4/6/22 7:48 PM, Jeff Layton wrote:
> > On Wed, 2022-04-06 at 12:33 +0100, Luís Henriques wrote:
> > > Xiubo Li <xiubli@redhat.com> writes:
> > > 
> > > > On 4/6/22 6:57 PM, Luís Henriques wrote:
> > > > > Xiubo Li <xiubli@redhat.com> writes:
> > > > > 
> > > > > > On 4/1/22 9:32 PM, Luís Henriques wrote:
> > > > > > > When doing DIO on an encrypted node, we need to invalidate the page cache in
> > > > > > > the range being written to, otherwise the cache will include invalid data.
> > > > > > > 
> > > > > > > Signed-off-by: Luís Henriques <lhenriques@suse.de>
> > > > > > > ---
> > > > > > >     fs/ceph/file.c | 11 ++++++++++-
> > > > > > >     1 file changed, 10 insertions(+), 1 deletion(-)
> > > > > > > 
> > > > > > > Changes since v1:
> > > > > > > - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
> > > > > > > - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
> > > > > > > 
> > > > > > > Note: I'm not really sure this last change is required, it doesn't really
> > > > > > > affect generic/647 result, but seems to be the most correct.
> > > > > > > 
> > > > > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > > > > > index 5072570c2203..b2743c342305 100644
> > > > > > > --- a/fs/ceph/file.c
> > > > > > > +++ b/fs/ceph/file.c
> > > > > > > @@ -1605,7 +1605,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
> > > > > > >     	if (ret < 0)
> > > > > > >     		return ret;
> > > > > > >     -	ceph_fscache_invalidate(inode, false);
> > > > > > > +	ceph_fscache_invalidate(inode, (iocb->ki_flags & IOCB_DIRECT));
> > > > > > >     	ret = invalidate_inode_pages2_range(inode->i_mapping,
> > > > > > >     					    pos >> PAGE_SHIFT,
> > > > > > >     					    (pos + count - 1) >> PAGE_SHIFT);
> > > > > > The above has already invalidated the pages, why doesn't it work ?
> > > > > I suspect the reason is because later on we loop through the number of
> > > > > pages, call copy_page_from_iter() and then ceph_fscrypt_encrypt_pages().
> > > > Checked the 'copy_page_from_iter()', it will do the kmap for the pages but will
> > > > kunmap them again later. And they shouldn't update the i_mapping if I didn't
> > > > miss something important.
> > > > 
> > > > For 'ceph_fscrypt_encrypt_pages()' it will encrypt/dencrypt the context inplace,
> > > > IMO if it needs to map the page and it should also unmap it just like in
> > > > 'copy_page_from_iter()'.
> > > > 
> > > > I thought it possibly be when we need to do RMW, it may will update the
> > > > i_mapping when reading contents, but I checked the code didn't find any
> > > > place is doing this. So I am wondering where tha page caches come from ? If that
> > > > page caches really from reading the contents, then we should discard it instead
> > > > of flushing it back ?
> > > > 
> > > > BTW, what's the problem without this fixing ? xfstest fails ?
> > > Yes, generic/647 fails if you run it with test_dummy_encryption.  And I've
> > > also checked that the RMW code was never executed in this test.
> > > 
> > > But yeah I have assumed (perhaps wrongly) that the kmap/kunmap could
> > > change the inode->i_mapping.
> > > 
> > No, kmap/unmap are all about high memory and 32-bit architectures. Those
> > functions are usually no-ops on 64-bit arches.
> 
> Yeah, right.
> 
> So they do nothing here.
> 
> > > In my debugging this seemed to be the case
> > > for the O_DIRECT path.  That's why I added this extra call here.
> > > 
> > I agree with Xiubo that we really shouldn't need to invalidate multiple
> > times.
> > 
> > I guess in this test, we have a DIO write racing with an mmap read
> > Probably what's happening is either that we can't invalidate the page
> > because it needs to be cleaned, or the mmap read is racing in just after
> > the invalidate occurs but before writeback.
> 
> This sounds a possible case.
> 
> 
> > In any case, it might be interesting to see whether you're getting
> > -EBUSY back from the new invalidate_inode_pages2 calls with your patch.
> > 
> If it's really this case maybe this should be retried some where ?
> 

Possibly, or we may need to implement ->launder_folio.

Either way, we need to understand what's happening first and then we can
figure out a solution for it.
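
Just to illustrate what that would mean: ->launder_folio is what
invalidate_inode_pages2_range() calls for a dirty, locked folio so it can
be written back and then dropped.  A hypothetical ceph hook could look
roughly like this (ceph_writeout_one_folio() is a made-up placeholder, not
an existing helper):

static int ceph_launder_folio(struct folio *folio)
{
	if (!folio_test_dirty(folio))
		return 0;

	/* write the folio back synchronously so the caller can safely
	 * remove it from the page cache afterwards */
	return ceph_writeout_one_folio(folio);
}
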
Xiubo Li April 7, 2022, 1:17 a.m. UTC | #11
On 4/6/22 9:41 PM, Jeff Layton wrote:
> On Wed, 2022-04-06 at 21:10 +0800, Xiubo Li wrote:
>> On 4/6/22 7:48 PM, Jeff Layton wrote:
>>> On Wed, 2022-04-06 at 12:33 +0100, Luís Henriques wrote:
>>>> Xiubo Li <xiubli@redhat.com> writes:
>>>>
>>>>> On 4/6/22 6:57 PM, Luís Henriques wrote:
>>>>>> Xiubo Li <xiubli@redhat.com> writes:
>>>>>>
>>>>>>> On 4/1/22 9:32 PM, Luís Henriques wrote:
>>>>>>>> When doing DIO on an encrypted node, we need to invalidate the page cache in
>>>>>>>> the range being written to, otherwise the cache will include invalid data.
>>>>>>>>
>>>>>>>> Signed-off-by: Luís Henriques <lhenriques@suse.de>
>>>>>>>> ---
>>>>>>>>      fs/ceph/file.c | 11 ++++++++++-
>>>>>>>>      1 file changed, 10 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> Changes since v1:
>>>>>>>> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
>>>>>>>> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>>>>>>>>
>>>>>>>> Note: I'm not really sure this last change is required, it doesn't really
>>>>>>>> affect generic/647 result, but seems to be the most correct.
>>>>>>>>
>>>>>>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>>>>>>> index 5072570c2203..b2743c342305 100644
>>>>>>>> --- a/fs/ceph/file.c
>>>>>>>> +++ b/fs/ceph/file.c
>>>>>>>> @@ -1605,7 +1605,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>>>>>>      	if (ret < 0)
>>>>>>>>      		return ret;
>>>>>>>>      -	ceph_fscache_invalidate(inode, false);
>>>>>>>> +	ceph_fscache_invalidate(inode, (iocb->ki_flags & IOCB_DIRECT));
>>>>>>>>      	ret = invalidate_inode_pages2_range(inode->i_mapping,
>>>>>>>>      					    pos >> PAGE_SHIFT,
>>>>>>>>      					    (pos + count - 1) >> PAGE_SHIFT);
>>>>>>> The above has already invalidated the pages, why doesn't it work ?
>>>>>> I suspect the reason is because later on we loop through the number of
>>>>>> pages, call copy_page_from_iter() and then ceph_fscrypt_encrypt_pages().
>>>>> Checked the 'copy_page_from_iter()', it will do the kmap for the pages but will
>>>>> kunmap them again later. And they shouldn't update the i_mapping if I didn't
>>>>> miss something important.
>>>>>
>>>>> For 'ceph_fscrypt_encrypt_pages()' it will encrypt/dencrypt the context inplace,
>>>>> IMO if it needs to map the page and it should also unmap it just like in
>>>>> 'copy_page_from_iter()'.
>>>>>
>>>>> I thought it possibly be when we need to do RMW, it may will update the
>>>>> i_mapping when reading contents, but I checked the code didn't find any
>>>>> place is doing this. So I am wondering where tha page caches come from ? If that
>>>>> page caches really from reading the contents, then we should discard it instead
>>>>> of flushing it back ?
>>>>>
>>>>> BTW, what's the problem without this fixing ? xfstest fails ?
>>>> Yes, generic/647 fails if you run it with test_dummy_encryption.  And I've
>>>> also checked that the RMW code was never executed in this test.
>>>>
>>>> But yeah I have assumed (perhaps wrongly) that the kmap/kunmap could
>>>> change the inode->i_mapping.
>>>>
>>> No, kmap/unmap are all about high memory and 32-bit architectures. Those
>>> functions are usually no-ops on 64-bit arches.
>> Yeah, right.
>>
>> So they do nothing here.
>>
>>>> In my debugging this seemed to be the case
>>>> for the O_DIRECT path.  That's why I added this extra call here.
>>>>
>>> I agree with Xiubo that we really shouldn't need to invalidate multiple
>>> times.
>>>
>>> I guess in this test, we have a DIO write racing with an mmap read
>>> Probably what's happening is either that we can't invalidate the page
>>> because it needs to be cleaned, or the mmap read is racing in just after
>>> the invalidate occurs but before writeback.
>> This sounds a possible case.
>>
>>
>>> In any case, it might be interesting to see whether you're getting
>>> -EBUSY back from the new invalidate_inode_pages2 calls with your patch.
>>>
>> If it's really this case maybe this should be retried some where ?
>>
> Possibly, or we may need to implement ->launder_folio.
>
> Either way, we need to understand what's happening first and then we can
> figure out a solution for it.

Yeah, makes sense.
Xiubo Li April 7, 2022, 3:19 a.m. UTC | #12
Hi Luis,

Please try the following patch, to see whether it resolves your issue:


diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 5d39d8e54273..3507e4066de4 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -2011,6 +2011,7 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to)
                      ceph_cap_string(got));
 
                 if (ci->i_inline_version == CEPH_INLINE_NONE) {
+                        filemap_invalidate_lock(inode->i_mapping);
                         if (!retry_op &&
                             (iocb->ki_flags & IOCB_DIRECT) &&
                             !IS_ENCRYPTED(inode)) {
@@ -2021,6 +2022,7 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to)
                         } else {
                                 ret = ceph_sync_read(iocb, to, &retry_op);
                         }
+                        filemap_invalidate_unlock(inode->i_mapping);
                 } else {
                         retry_op = READ_INLINE;
                 }
@@ -2239,11 +2241,13 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
                 /* we might need to revert back to that point */
                 data = *from;
+                filemap_invalidate_lock(inode->i_mapping);
                 if ((iocb->ki_flags & IOCB_DIRECT) && !IS_ENCRYPTED(inode))
                         written = ceph_direct_read_write(iocb, &data, snapc,
                                                          &prealloc_cf);
                 else
                         written = ceph_sync_write(iocb, &data, pos, snapc);
+                filemap_invalidate_unlock(inode->i_mapping);
                 if (direct_lock)
                         ceph_end_io_direct(inode);
                 else



On 4/1/22 9:32 PM, Luís Henriques wrote:
> When doing DIO on an encrypted node, we need to invalidate the page cache in
> the range being written to, otherwise the cache will include invalid data.
>
> Signed-off-by: Luís Henriques <lhenriques@suse.de>
> ---
>   fs/ceph/file.c | 11 ++++++++++-
>   1 file changed, 10 insertions(+), 1 deletion(-)
>
> Changes since v1:
> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>
> Note: I'm not really sure this last change is required, it doesn't really
> affect generic/647 result, but seems to be the most correct.
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 5072570c2203..b2743c342305 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1605,7 +1605,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>   	if (ret < 0)
>   		return ret;
>   
> -	ceph_fscache_invalidate(inode, false);
> +	ceph_fscache_invalidate(inode, (iocb->ki_flags & IOCB_DIRECT));
>   	ret = invalidate_inode_pages2_range(inode->i_mapping,
>   					    pos >> PAGE_SHIFT,
>   					    (pos + count - 1) >> PAGE_SHIFT);
> @@ -1895,6 +1895,15 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>   		req->r_inode = inode;
>   		req->r_mtime = mtime;
>   
> +		if (IS_ENCRYPTED(inode) && (iocb->ki_flags & IOCB_DIRECT)) {
> +			ret = invalidate_inode_pages2_range(
> +				inode->i_mapping,
> +				write_pos >> PAGE_SHIFT,
> +				(write_pos + write_len - 1) >> PAGE_SHIFT);
> +			if (ret < 0)
> +				dout("invalidate_inode_pages2_range returned %d\n", ret);
> +		}
> +
>   		/* Set up the assertion */
>   		if (rmw) {
>   			/*
>
Luis Henriques April 7, 2022, 9:06 a.m. UTC | #13
Xiubo Li <xiubli@redhat.com> writes:

> Hi Luis,
>
> Please try the following patch, to see could it resolve your issue:

No, this seems to deadlock when running my test.  I'm attaching the code
I'm using to test it; it's part of generic/647 but I've removed all the
other test cases that were passing.  I simply mount the filesystem with
test_dummy_encryption and run this test.  With your patch it'll hang;
without it it'll show "pwrite (O_DIRECT) is broken".  The extra
invalidate_inode_pages2_range() will make it pass.
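
For the archive, the failing case boils down to something like this (a
rough sketch only, with error handling stripped; it is not the actual
generic/647 source):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define CHUNK	(1024 * 1024)	/* O_DIRECT-friendly size and alignment */

int main(int argc, char **argv)
{
	int dio_fd, fd;
	char *map, *buf;

	if (argc < 2)
		return 1;

	dio_fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
	fd = open(argv[1], O_RDWR);
	ftruncate(fd, 2 * CHUNK);

	/* seed both halves of the file through the page cache */
	buf = malloc(CHUNK);
	memset(buf, 'a', CHUNK);
	pwrite(fd, buf, CHUNK, 0);
	memset(buf, 'b', CHUNK);
	pwrite(fd, buf, CHUNK, CHUNK);
	fsync(fd);

	map = mmap(NULL, 2 * CHUNK, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* O_DIRECT write of the 'b' half into the 'a' half, sourced from
	 * the mmap of the same file */
	pwrite(dio_fd, map + CHUNK, CHUNK, 0);

	/* a buffered read of the first half must now see 'b's; if the
	 * page cache kept stale data it will still see 'a's */
	pread(fd, buf, CHUNK, 0);
	if (memcmp(buf, map + CHUNK, CHUNK))
		printf("pwrite (O_DIRECT) is broken\n");
	return 0;
}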

Cheers,
Luis Henriques April 7, 2022, 11:55 a.m. UTC | #14
Xiubo Li <xiubli@redhat.com> writes:

> On 4/6/22 9:41 PM, Jeff Layton wrote:
>> On Wed, 2022-04-06 at 21:10 +0800, Xiubo Li wrote:
>>> On 4/6/22 7:48 PM, Jeff Layton wrote:
>>>> On Wed, 2022-04-06 at 12:33 +0100, Luís Henriques wrote:
>>>>> Xiubo Li <xiubli@redhat.com> writes:
>>>>>
>>>>>> On 4/6/22 6:57 PM, Luís Henriques wrote:
>>>>>>> Xiubo Li <xiubli@redhat.com> writes:
>>>>>>>
>>>>>>>> On 4/1/22 9:32 PM, Luís Henriques wrote:
>>>>>>>>> When doing DIO on an encrypted node, we need to invalidate the page cache in
>>>>>>>>> the range being written to, otherwise the cache will include invalid data.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Luís Henriques <lhenriques@suse.de>
>>>>>>>>> ---
>>>>>>>>>      fs/ceph/file.c | 11 ++++++++++-
>>>>>>>>>      1 file changed, 10 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> Changes since v1:
>>>>>>>>> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
>>>>>>>>> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>>>>>>>>>
>>>>>>>>> Note: I'm not really sure this last change is required, it doesn't really
>>>>>>>>> affect generic/647 result, but seems to be the most correct.
>>>>>>>>>
>>>>>>>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>>>>>>>> index 5072570c2203..b2743c342305 100644
>>>>>>>>> --- a/fs/ceph/file.c
>>>>>>>>> +++ b/fs/ceph/file.c
>>>>>>>>> @@ -1605,7 +1605,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>>>>>>>      	if (ret < 0)
>>>>>>>>>      		return ret;
>>>>>>>>>      -	ceph_fscache_invalidate(inode, false);
>>>>>>>>> +	ceph_fscache_invalidate(inode, (iocb->ki_flags & IOCB_DIRECT));
>>>>>>>>>      	ret = invalidate_inode_pages2_range(inode->i_mapping,
>>>>>>>>>      					    pos >> PAGE_SHIFT,
>>>>>>>>>      					    (pos + count - 1) >> PAGE_SHIFT);
>>>>>>>> The above has already invalidated the pages, why doesn't it work ?
>>>>>>> I suspect the reason is because later on we loop through the number of
>>>>>>> pages, call copy_page_from_iter() and then ceph_fscrypt_encrypt_pages().
>>>>>> Checked the 'copy_page_from_iter()', it will do the kmap for the pages but will
>>>>>> kunmap them again later. And they shouldn't update the i_mapping if I didn't
>>>>>> miss something important.
>>>>>>
>>>>>> For 'ceph_fscrypt_encrypt_pages()' it will encrypt/dencrypt the context inplace,
>>>>>> IMO if it needs to map the page and it should also unmap it just like in
>>>>>> 'copy_page_from_iter()'.
>>>>>>
>>>>>> I thought it possibly be when we need to do RMW, it may will update the
>>>>>> i_mapping when reading contents, but I checked the code didn't find any
>>>>>> place is doing this. So I am wondering where tha page caches come from ? If that
>>>>>> page caches really from reading the contents, then we should discard it instead
>>>>>> of flushing it back ?
>>>>>>
>>>>>> BTW, what's the problem without this fixing ? xfstest fails ?
>>>>> Yes, generic/647 fails if you run it with test_dummy_encryption.  And I've
>>>>> also checked that the RMW code was never executed in this test.
>>>>>
>>>>> But yeah I have assumed (perhaps wrongly) that the kmap/kunmap could
>>>>> change the inode->i_mapping.
>>>>>
>>>> No, kmap/unmap are all about high memory and 32-bit architectures. Those
>>>> functions are usually no-ops on 64-bit arches.
>>> Yeah, right.
>>>
>>> So they do nothing here.
>>>
>>>>> In my debugging this seemed to be the case
>>>>> for the O_DIRECT path.  That's why I added this extra call here.
>>>>>
>>>> I agree with Xiubo that we really shouldn't need to invalidate multiple
>>>> times.
>>>>
>>>> I guess in this test, we have a DIO write racing with an mmap read
>>>> Probably what's happening is either that we can't invalidate the page
>>>> because it needs to be cleaned, or the mmap read is racing in just after
>>>> the invalidate occurs but before writeback.
>>> This sounds a possible case.
>>>
>>>
>>>> In any case, it might be interesting to see whether you're getting
>>>> -EBUSY back from the new invalidate_inode_pages2 calls with your patch.
>>>>
>>> If it's really this case maybe this should be retried some where ?
>>>
>> Possibly, or we may need to implement ->launder_folio.
>>
>> Either way, we need to understand what's happening first and then we can
>> figure out a solution for it.
>
> Yeah, make sense.
>

OK, so here's what I got so far:

When we run this test *without* test_dummy_encryption, ceph_direct_read_write()
will be called and invalidate_inode_pages2_range() will do pretty much
nothing because the mapping will be empty (mapping_empty(inode->i_mapping)
will return 1).  If we use encryption, ceph_sync_write() will be called
instead and the mapping will, obviously, be empty as well.

The difference between the encrypted and non-encrypted paths (and the
reason the test passes without encryption) is that ceph_direct_read_write()
(non-encrypted) will call truncate_inode_pages_range() at a stage where
the mapping is not empty anymore (iter_get_bvecs_alloc will take care of
that).  In the encryption path (ceph_sync_write) the mapping will be
filled by copy_page_from_iter(), which will fault and do the read.
Because we don't have the truncate_inode_pages_range(), the cache will
contain invalid data after the write.  And that's why the extra
invalidate_inode_pages2_range() (or truncate_...) fixes this.

Cheers,
Jeff Layton April 7, 2022, 1:23 p.m. UTC | #15
On Thu, 2022-04-07 at 12:55 +0100, Luís Henriques wrote:
> Xiubo Li <xiubli@redhat.com> writes:
> 
> > On 4/6/22 9:41 PM, Jeff Layton wrote:
> > > On Wed, 2022-04-06 at 21:10 +0800, Xiubo Li wrote:
> > > > On 4/6/22 7:48 PM, Jeff Layton wrote:
> > > > > On Wed, 2022-04-06 at 12:33 +0100, Luís Henriques wrote:
> > > > > > Xiubo Li <xiubli@redhat.com> writes:
> > > > > > 
> > > > > > > On 4/6/22 6:57 PM, Luís Henriques wrote:
> > > > > > > > Xiubo Li <xiubli@redhat.com> writes:
> > > > > > > > 
> > > > > > > > > On 4/1/22 9:32 PM, Luís Henriques wrote:
> > > > > > > > > > When doing DIO on an encrypted node, we need to invalidate the page cache in
> > > > > > > > > > the range being written to, otherwise the cache will include invalid data.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Luís Henriques <lhenriques@suse.de>
> > > > > > > > > > ---
> > > > > > > > > >      fs/ceph/file.c | 11 ++++++++++-
> > > > > > > > > >      1 file changed, 10 insertions(+), 1 deletion(-)
> > > > > > > > > > 
> > > > > > > > > > Changes since v1:
> > > > > > > > > > - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
> > > > > > > > > > - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
> > > > > > > > > > 
> > > > > > > > > > Note: I'm not really sure this last change is required, it doesn't really
> > > > > > > > > > affect generic/647 result, but seems to be the most correct.
> > > > > > > > > > 
> > > > > > > > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > > > > > > > > index 5072570c2203..b2743c342305 100644
> > > > > > > > > > --- a/fs/ceph/file.c
> > > > > > > > > > +++ b/fs/ceph/file.c
> > > > > > > > > > @@ -1605,7 +1605,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
> > > > > > > > > >      	if (ret < 0)
> > > > > > > > > >      		return ret;
> > > > > > > > > >      -	ceph_fscache_invalidate(inode, false);
> > > > > > > > > > +	ceph_fscache_invalidate(inode, (iocb->ki_flags & IOCB_DIRECT));
> > > > > > > > > >      	ret = invalidate_inode_pages2_range(inode->i_mapping,
> > > > > > > > > >      					    pos >> PAGE_SHIFT,
> > > > > > > > > >      					    (pos + count - 1) >> PAGE_SHIFT);
> > > > > > > > > The above has already invalidated the pages, why doesn't it work ?
> > > > > > > > I suspect the reason is because later on we loop through the number of
> > > > > > > > pages, call copy_page_from_iter() and then ceph_fscrypt_encrypt_pages().
> > > > > > > Checked the 'copy_page_from_iter()', it will do the kmap for the pages but will
> > > > > > > kunmap them again later. And they shouldn't update the i_mapping if I didn't
> > > > > > > miss something important.
> > > > > > > 
> > > > > > > For 'ceph_fscrypt_encrypt_pages()' it will encrypt/dencrypt the context inplace,
> > > > > > > IMO if it needs to map the page and it should also unmap it just like in
> > > > > > > 'copy_page_from_iter()'.
> > > > > > > 
> > > > > > > I thought it possibly be when we need to do RMW, it may will update the
> > > > > > > i_mapping when reading contents, but I checked the code didn't find any
> > > > > > > place is doing this. So I am wondering where tha page caches come from ? If that
> > > > > > > page caches really from reading the contents, then we should discard it instead
> > > > > > > of flushing it back ?
> > > > > > > 
> > > > > > > BTW, what's the problem without this fixing ? xfstest fails ?
> > > > > > Yes, generic/647 fails if you run it with test_dummy_encryption.  And I've
> > > > > > also checked that the RMW code was never executed in this test.
> > > > > > 
> > > > > > But yeah I have assumed (perhaps wrongly) that the kmap/kunmap could
> > > > > > change the inode->i_mapping.
> > > > > > 
> > > > > No, kmap/unmap are all about high memory and 32-bit architectures. Those
> > > > > functions are usually no-ops on 64-bit arches.
> > > > Yeah, right.
> > > > 
> > > > So they do nothing here.
> > > > 
> > > > > > In my debugging this seemed to be the case
> > > > > > for the O_DIRECT path.  That's why I added this extra call here.
> > > > > > 
> > > > > I agree with Xiubo that we really shouldn't need to invalidate multiple
> > > > > times.
> > > > > 
> > > > > I guess in this test, we have a DIO write racing with an mmap read
> > > > > Probably what's happening is either that we can't invalidate the page
> > > > > because it needs to be cleaned, or the mmap read is racing in just after
> > > > > the invalidate occurs but before writeback.
> > > > This sounds a possible case.
> > > > 
> > > > 
> > > > > In any case, it might be interesting to see whether you're getting
> > > > > -EBUSY back from the new invalidate_inode_pages2 calls with your patch.
> > > > > 
> > > > If it's really this case maybe this should be retried some where ?
> > > > 
> > > Possibly, or we may need to implement ->launder_folio.
> > > 
> > > Either way, we need to understand what's happening first and then we can
> > > figure out a solution for it.
> > 
> > Yeah, make sense.
> > 
> 
> OK, so here's what I got so far:
> 
> When we run this test *without* test_dummy_encryption, ceph_direct_read_write()
> will be called and invalidate_inode_pages2_range() will do pretty much
> nothing because the mapping will be empty (mapping_empty(inode->i_mapping)
> will return 1).  If we use encryption, ceph_sync_write() will be called
> instead and the mapping, obviously, be will be empty as well.
> 
> The difference between in encrypted vs non-encrypted (and the reason the
> test passes without encryption) is that ceph_direct_read_write()
> (non-encrypted) will call truncate_inode_pages_range() at a stage where
> the mapping is not empty anymore (iter_get_bvecs_alloc will take care of
> that).
> 

Wait...why does iter_get_bvecs_alloc populate the mapping? The iter in
this case is almost certainly an iov_iter from userland so none of this
should have anything to do with the pagecache.

I suspect the faulting in occurs via the mmap reader task, and that the
truncate_inode_pages_range calls just happen to be enough to invalidate it.

>  In the encryption path (ceph_sync_write) the mapping will be
> filled with copy_page_from_iter(), which will fault and do the read.
> Because we don't have the truncate_inode_pages_range(), the cache will
> contain invalid data after the write.  And that's why the extra
> invalidate_inode_pages2_range (or truncate_...) fixes this.
> 

I think what we may want to do is consider adding these calls into
ceph_page_mkwrite:

        if (direct_lock)
                ceph_start_io_direct(inode);
        else
                ceph_start_io_write(inode);

...and similar ones (for read) in ceph_filemap_fault, along with "end"
calls to end the I/Os.

This is how we handle races between buffered read/write and direct I/O,
and I suspect the mmap codepaths may just need similar treatment.

Thoughts?
Jeff Layton April 7, 2022, 2:08 p.m. UTC | #16
On Thu, 2022-04-07 at 09:23 -0400, Jeff Layton wrote:
> On Thu, 2022-04-07 at 12:55 +0100, Luís Henriques wrote:
> > Xiubo Li <xiubli@redhat.com> writes:
> > 
> > > On 4/6/22 9:41 PM, Jeff Layton wrote:
> > > > On Wed, 2022-04-06 at 21:10 +0800, Xiubo Li wrote:
> > > > > On 4/6/22 7:48 PM, Jeff Layton wrote:
> > > > > > On Wed, 2022-04-06 at 12:33 +0100, Luís Henriques wrote:
> > > > > > > Xiubo Li <xiubli@redhat.com> writes:
> > > > > > > 
> > > > > > > > On 4/6/22 6:57 PM, Luís Henriques wrote:
> > > > > > > > > Xiubo Li <xiubli@redhat.com> writes:
> > > > > > > > > 
> > > > > > > > > > On 4/1/22 9:32 PM, Luís Henriques wrote:
> > > > > > > > > > > When doing DIO on an encrypted node, we need to invalidate the page cache in
> > > > > > > > > > > the range being written to, otherwise the cache will include invalid data.
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: Luís Henriques <lhenriques@suse.de>
> > > > > > > > > > > ---
> > > > > > > > > > >      fs/ceph/file.c | 11 ++++++++++-
> > > > > > > > > > >      1 file changed, 10 insertions(+), 1 deletion(-)
> > > > > > > > > > > 
> > > > > > > > > > > Changes since v1:
> > > > > > > > > > > - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
> > > > > > > > > > > - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
> > > > > > > > > > > 
> > > > > > > > > > > Note: I'm not really sure this last change is required, it doesn't really
> > > > > > > > > > > affect generic/647 result, but seems to be the most correct.
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > > > > > > > > > index 5072570c2203..b2743c342305 100644
> > > > > > > > > > > --- a/fs/ceph/file.c
> > > > > > > > > > > +++ b/fs/ceph/file.c
> > > > > > > > > > > @@ -1605,7 +1605,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
> > > > > > > > > > >      	if (ret < 0)
> > > > > > > > > > >      		return ret;
> > > > > > > > > > >      -	ceph_fscache_invalidate(inode, false);
> > > > > > > > > > > +	ceph_fscache_invalidate(inode, (iocb->ki_flags & IOCB_DIRECT));
> > > > > > > > > > >      	ret = invalidate_inode_pages2_range(inode->i_mapping,
> > > > > > > > > > >      					    pos >> PAGE_SHIFT,
> > > > > > > > > > >      					    (pos + count - 1) >> PAGE_SHIFT);
> > > > > > > > > > The above has already invalidated the pages, why doesn't it work ?
> > > > > > > > > I suspect the reason is because later on we loop through the number of
> > > > > > > > > pages, call copy_page_from_iter() and then ceph_fscrypt_encrypt_pages().
> > > > > > > > Checked the 'copy_page_from_iter()', it will do the kmap for the pages but will
> > > > > > > > kunmap them again later. And they shouldn't update the i_mapping if I didn't
> > > > > > > > miss something important.
> > > > > > > > 
> > > > > > > > For 'ceph_fscrypt_encrypt_pages()' it will encrypt/dencrypt the context inplace,
> > > > > > > > IMO if it needs to map the page and it should also unmap it just like in
> > > > > > > > 'copy_page_from_iter()'.
> > > > > > > > 
> > > > > > > > I thought it possibly be when we need to do RMW, it may will update the
> > > > > > > > i_mapping when reading contents, but I checked the code didn't find any
> > > > > > > > place is doing this. So I am wondering where tha page caches come from ? If that
> > > > > > > > page caches really from reading the contents, then we should discard it instead
> > > > > > > > of flushing it back ?
> > > > > > > > 
> > > > > > > > BTW, what's the problem without this fixing ? xfstest fails ?
> > > > > > > Yes, generic/647 fails if you run it with test_dummy_encryption.  And I've
> > > > > > > also checked that the RMW code was never executed in this test.
> > > > > > > 
> > > > > > > But yeah I have assumed (perhaps wrongly) that the kmap/kunmap could
> > > > > > > change the inode->i_mapping.
> > > > > > > 
> > > > > > No, kmap/unmap are all about high memory and 32-bit architectures. Those
> > > > > > functions are usually no-ops on 64-bit arches.
> > > > > Yeah, right.
> > > > > 
> > > > > So they do nothing here.
> > > > > 
> > > > > > > In my debugging this seemed to be the case
> > > > > > > for the O_DIRECT path.  That's why I added this extra call here.
> > > > > > > 
> > > > > > I agree with Xiubo that we really shouldn't need to invalidate multiple
> > > > > > times.
> > > > > > 
> > > > > > I guess in this test, we have a DIO write racing with an mmap read
> > > > > > Probably what's happening is either that we can't invalidate the page
> > > > > > because it needs to be cleaned, or the mmap read is racing in just after
> > > > > > the invalidate occurs but before writeback.
> > > > > This sounds a possible case.
> > > > > 
> > > > > 
> > > > > > In any case, it might be interesting to see whether you're getting
> > > > > > -EBUSY back from the new invalidate_inode_pages2 calls with your patch.
> > > > > > 
> > > > > If it's really this case maybe this should be retried some where ?
> > > > > 
> > > > Possibly, or we may need to implement ->launder_folio.
> > > > 
> > > > Either way, we need to understand what's happening first and then we can
> > > > figure out a solution for it.
> > > 
> > > Yeah, make sense.
> > > 
> > 
> > OK, so here's what I got so far:
> > 
> > When we run this test *without* test_dummy_encryption, ceph_direct_read_write()
> > will be called and invalidate_inode_pages2_range() will do pretty much
> > nothing because the mapping will be empty (mapping_empty(inode->i_mapping)
> > will return 1).  If we use encryption, ceph_sync_write() will be called
> > instead and the mapping, obviously, be will be empty as well.
> > 
> > The difference between in encrypted vs non-encrypted (and the reason the
> > test passes without encryption) is that ceph_direct_read_write()
> > (non-encrypted) will call truncate_inode_pages_range() at a stage where
> > the mapping is not empty anymore (iter_get_bvecs_alloc will take care of
> > that).
> > 
> 
> Wait...why does iter_get_bvecs_alloc populate the mapping? The iter in
> this case is almost certainly an iov_iter from userland so none of this
> should have anything to do with the pagecache.
> 
> I suspect the faulting in occurs via the mmap reader task, and that the
> truncate_inode_pages_range calls just happen enough to invalidate it.
> 
> >  In the encryption path (ceph_sync_write) the mapping will be
> > filled with copy_page_from_iter(), which will fault and do the read.
> > Because we don't have the truncate_inode_pages_range(), the cache will
> > contain invalid data after the write.  And that's why the extra
> > invalidate_inode_pages2_range (or truncate_...) fixes this.
> > 
> 
> I think what we may want to do is consider adding these calls into
> ceph_page_mkwrite:
> 
>         if (direct_lock)
>                 ceph_start_io_direct(inode);
>         else
>                 ceph_start_io_write(inode);
> 
> ...and similar ones (for read) in ceph_filemap_fault, along with "end"
> calls to end the I/Os.
> 
> This is how we handle races between buffered read/write and direct I/O,
> and I suspect the mmap codepaths may just need similar treatment.
> 
> Thoughts?

No, Luís tried this and said on IRC that it deadlocked. It seems obvious
in retrospect...

What we probably need to do is call filemap_write_and_wait_range before
issuing a direct or sync read or write. Then for direct/sync writes, we
also want to call invalidate_inode_pages2_range after the write returns.

We might consider doing an invalidation before issuing the call, but I
think it wouldn't help this testcase. generic/647 is doing O_DIRECT
writes to the file from a buffer that is mmapped from the same file. If
you invalidate before the write occurs you'll just end up faulting the
pages right back in.
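
IOW, something along these lines in the write path (just a sketch of the
ordering, not a tested patch):

/* flush anything dirty in the range before the direct/sync write */
ret = filemap_write_and_wait_range(inode->i_mapping, pos, pos + count - 1);
if (ret < 0)
	return ret;

/* ... issue the direct or sync write ... */

/* only invalidate afterward, so that pages faulted back in from the
 * mmapped source buffer while the write was in flight get dropped too */
ret = invalidate_inode_pages2_range(inode->i_mapping,
				    pos >> PAGE_SHIFT,
				    (pos + count - 1) >> PAGE_SHIFT);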

Patch

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 5072570c2203..b2743c342305 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1605,7 +1605,7 @@  ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
 	if (ret < 0)
 		return ret;
 
-	ceph_fscache_invalidate(inode, false);
+	ceph_fscache_invalidate(inode, (iocb->ki_flags & IOCB_DIRECT));
 	ret = invalidate_inode_pages2_range(inode->i_mapping,
 					    pos >> PAGE_SHIFT,
 					    (pos + count - 1) >> PAGE_SHIFT);
@@ -1895,6 +1895,15 @@  ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
 		req->r_inode = inode;
 		req->r_mtime = mtime;
 
+		if (IS_ENCRYPTED(inode) && (iocb->ki_flags & IOCB_DIRECT)) {
+			ret = invalidate_inode_pages2_range(
+				inode->i_mapping,
+				write_pos >> PAGE_SHIFT,
+				(write_pos + write_len - 1) >> PAGE_SHIFT);
+			if (ret < 0)
+				dout("invalidate_inode_pages2_range returned %d\n", ret);
+		}
+
 		/* Set up the assertion */
 		if (rmw) {
 			/*