[RFC,v3,00/10] ceph: asynchronous file create support

Message ID	20200121192928.469316-1-jlayton@kernel.org (mailing list archive)
Headers	show Return-Path: <SRS0=qTS0=3K=vger.kernel.org=ceph-devel-owner@kernel.org> From: Jeff Layton <jlayton@kernel.org> To: ceph-devel@vger.kernel.org Cc: idridryomov@gmail.com, sage@redhat.com, zyan@redhat.com Subject: [RFC PATCH v3 00/10] ceph: asynchronous file create support Date: Tue, 21 Jan 2020 14:29:18 -0500 Message-Id: <20200121192928.469316-1-jlayton@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: ceph-devel-owner@vger.kernel.org Precedence: bulk
Series	ceph: asynchronous file create support \| expand [RFC,v3,00/10] ceph: asynchronous file create support [RFC,v3,01/10] ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c [RFC,v3,02/10] ceph: make ceph_fill_inode non-static [RFC,v3,03/10] ceph: make dentry_lease_is_valid non-static [RFC,v3,04/10] ceph: make __take_cap_refs non-static [RFC,v3,05/10] ceph: decode interval_sets for delegated inos [RFC,v3,06/10] ceph: add flag to designate that a request is asynchronous [RFC,v3,07/10] ceph: add infrastructure for waiting for async create to complete [RFC,v3,08/10] ceph: add new MDS req field to hold delegated inode number [RFC,v3,09/10] ceph: cache layout in parent dir on first sync create [RFC,v3,10/10] ceph: attempt to do async create when possible

Jeffrey Layton Jan. 21, 2020, 7:29 p.m. UTC

v3:
- move some cephfs-specific code into ceph.ko
- present and track inode numbers as u64 values
- fix up check for dentry and cap eligibility checks
- set O_CEPH_EXCL on async creates
- attempt to handle errors better on async create (invalidate dentries
  and dir completeness).
- ensure that fsync waits for async create to complete

v2:
- move cached layout to dedicated field in inode
- protect cached layout with i_ceph_lock
- wipe cached layout in __check_cap_issue
- set max_size of file to layout.stripe_unit
- set truncate_size to (u64)-1
- use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
- set cap_id to 1 in async created inode
- allocate inode number before submitting request
- rework the prep for an async create to be more efficient
- don't allow MDS or cap messages involving an inode until we get async
  create reply

Still not quite ready for merge, but I've cleaned up a number of warts
in the v2 set. Performance numbers still look about the same.

There is definitely still a race of some sort that causes the client to
try to asynchronously create a dentry that already exists. I'm still
working on tracking that down.

Jeff Layton (10):
  ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
  ceph: make ceph_fill_inode non-static
  ceph: make dentry_lease_is_valid non-static
  ceph: make __take_cap_refs non-static
  ceph: decode interval_sets for delegated inos
  ceph: add flag to designate that a request is asynchronous
  ceph: add infrastructure for waiting for async create to complete
  ceph: add new MDS req field to hold delegated inode number
  ceph: cache layout in parent dir on first sync create
  ceph: attempt to do async create when possible

 fs/ceph/Makefile                     |   2 +-
 fs/ceph/caps.c                       |  38 +++--
 fs/ceph/dir.c                        |  13 +-
 fs/ceph/file.c                       | 240 +++++++++++++++++++++++++--
 fs/ceph/inode.c                      |  50 +++---
 fs/ceph/mds_client.c                 | 123 ++++++++++++--
 fs/ceph/mds_client.h                 |  17 +-
 fs/ceph/super.h                      |  16 +-
 net/ceph/ceph_fs.c => fs/ceph/util.c |   4 -
 include/linux/ceph/ceph_fs.h         |   8 +-
 net/ceph/Makefile                    |   2 +-
 11 files changed, 443 insertions(+), 70 deletions(-)
 rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%)

Yan, Zheng Jan. 22, 2020, 7:29 a.m. UTC | #1

On 1/22/20 3:29 AM, Jeff Layton wrote:
> v3:
> - move some cephfs-specific code into ceph.ko
> - present and track inode numbers as u64 values
> - fix up check for dentry and cap eligibility checks
> - set O_CEPH_EXCL on async creates
> - attempt to handle errors better on async create (invalidate dentries
>    and dir completeness).
> - ensure that fsync waits for async create to complete
> 
> v2:
> - move cached layout to dedicated field in inode
> - protect cached layout with i_ceph_lock
> - wipe cached layout in __check_cap_issue
> - set max_size of file to layout.stripe_unit
> - set truncate_size to (u64)-1
> - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
> - set cap_id to 1 in async created inode
> - allocate inode number before submitting request
> - rework the prep for an async create to be more efficient
> - don't allow MDS or cap messages involving an inode until we get async
>    create reply
> 
> Still not quite ready for merge, but I've cleaned up a number of warts
> in the v2 set. Performance numbers still look about the same.
> 
> There is definitely still a race of some sort that causes the client to
> try to asynchronously create a dentry that already exists. I'm still
> working on tracking that down.
> 
> Jeff Layton (10):
>    ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
>    ceph: make ceph_fill_inode non-static
>    ceph: make dentry_lease_is_valid non-static
>    ceph: make __take_cap_refs non-static
>    ceph: decode interval_sets for delegated inos
>    ceph: add flag to designate that a request is asynchronous
>    ceph: add infrastructure for waiting for async create to complete
>    ceph: add new MDS req field to hold delegated inode number
>    ceph: cache layout in parent dir on first sync create
>    ceph: attempt to do async create when possible
> 
>   fs/ceph/Makefile                     |   2 +-
>   fs/ceph/caps.c                       |  38 +++--
>   fs/ceph/dir.c                        |  13 +-
>   fs/ceph/file.c                       | 240 +++++++++++++++++++++++++--
>   fs/ceph/inode.c                      |  50 +++---
>   fs/ceph/mds_client.c                 | 123 ++++++++++++--
>   fs/ceph/mds_client.h                 |  17 +-
>   fs/ceph/super.h                      |  16 +-
>   net/ceph/ceph_fs.c => fs/ceph/util.c |   4 -
>   include/linux/ceph/ceph_fs.h         |   8 +-
>   net/ceph/Makefile                    |   2 +-
>   11 files changed, 443 insertions(+), 70 deletions(-)
>   rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%)
> 

Series

Reviewed-By: "Yan, Zheng" <zyan@redhat.com>

Yan, Zheng Jan. 22, 2020, 5:04 p.m. UTC | #2

On Wed, Jan 22, 2020 at 3:31 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> v3:
> - move some cephfs-specific code into ceph.ko
> - present and track inode numbers as u64 values
> - fix up check for dentry and cap eligibility checks
> - set O_CEPH_EXCL on async creates
> - attempt to handle errors better on async create (invalidate dentries
>   and dir completeness).
> - ensure that fsync waits for async create to complete
>
> v2:
> - move cached layout to dedicated field in inode
> - protect cached layout with i_ceph_lock
> - wipe cached layout in __check_cap_issue
> - set max_size of file to layout.stripe_unit
> - set truncate_size to (u64)-1
> - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
> - set cap_id to 1 in async created inode
> - allocate inode number before submitting request
> - rework the prep for an async create to be more efficient
> - don't allow MDS or cap messages involving an inode until we get async
>   create reply
>
> Still not quite ready for merge, but I've cleaned up a number of warts
> in the v2 set. Performance numbers still look about the same.
>
> There is definitely still a race of some sort that causes the client to
> try to asynchronously create a dentry that already exists. I'm still
> working on tracking that down.
>
> Jeff Layton (10):
>   ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
>   ceph: make ceph_fill_inode non-static
>   ceph: make dentry_lease_is_valid non-static
>   ceph: make __take_cap_refs non-static
>   ceph: decode interval_sets for delegated inos
>   ceph: add flag to designate that a request is asynchronous
>   ceph: add infrastructure for waiting for async create to complete
>   ceph: add new MDS req field to hold delegated inode number
>   ceph: cache layout in parent dir on first sync create
>   ceph: attempt to do async create when possible
>
>  fs/ceph/Makefile                     |   2 +-
>  fs/ceph/caps.c                       |  38 +++--
>  fs/ceph/dir.c                        |  13 +-
>  fs/ceph/file.c                       | 240 +++++++++++++++++++++++++--
>  fs/ceph/inode.c                      |  50 +++---
>  fs/ceph/mds_client.c                 | 123 ++++++++++++--
>  fs/ceph/mds_client.h                 |  17 +-
>  fs/ceph/super.h                      |  16 +-
>  net/ceph/ceph_fs.c => fs/ceph/util.c |   4 -
>  include/linux/ceph/ceph_fs.h         |   8 +-
>  net/ceph/Makefile                    |   2 +-
>  11 files changed, 443 insertions(+), 70 deletions(-)
>  rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%)
>
> --
> 2.24.1
>

I realized that there still are two issues:
-  we needs to clear delegated inos when mds failover
- we needs to clear caps for async dir operations when reconnecting to
mds. (see last commit of https://github.com/ceph/ceph/pull/32576)

Regards
Yan, Zheng

Jeffrey Layton Jan. 24, 2020, 5:19 p.m. UTC | #3

On Thu, 2020-01-23 at 01:04 +0800, Yan, Zheng wrote:
> On Wed, Jan 22, 2020 at 3:31 AM Jeff Layton <jlayton@kernel.org> wrote:
> > v3:
> > - move some cephfs-specific code into ceph.ko
> > - present and track inode numbers as u64 values
> > - fix up check for dentry and cap eligibility checks
> > - set O_CEPH_EXCL on async creates
> > - attempt to handle errors better on async create (invalidate dentries
> >   and dir completeness).
> > - ensure that fsync waits for async create to complete
> > 
> > v2:
> > - move cached layout to dedicated field in inode
> > - protect cached layout with i_ceph_lock
> > - wipe cached layout in __check_cap_issue
> > - set max_size of file to layout.stripe_unit
> > - set truncate_size to (u64)-1
> > - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
> > - set cap_id to 1 in async created inode
> > - allocate inode number before submitting request
> > - rework the prep for an async create to be more efficient
> > - don't allow MDS or cap messages involving an inode until we get async
> >   create reply
> > 
> > Still not quite ready for merge, but I've cleaned up a number of warts
> > in the v2 set. Performance numbers still look about the same.
> > 
> > There is definitely still a race of some sort that causes the client to
> > try to asynchronously create a dentry that already exists. I'm still
> > working on tracking that down.
> > 
> > Jeff Layton (10):
> >   ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
> >   ceph: make ceph_fill_inode non-static
> >   ceph: make dentry_lease_is_valid non-static
> >   ceph: make __take_cap_refs non-static
> >   ceph: decode interval_sets for delegated inos
> >   ceph: add flag to designate that a request is asynchronous
> >   ceph: add infrastructure for waiting for async create to complete
> >   ceph: add new MDS req field to hold delegated inode number
> >   ceph: cache layout in parent dir on first sync create
> >   ceph: attempt to do async create when possible
> > 
> >  fs/ceph/Makefile                     |   2 +-
> >  fs/ceph/caps.c                       |  38 +++--
> >  fs/ceph/dir.c                        |  13 +-
> >  fs/ceph/file.c                       | 240 +++++++++++++++++++++++++--
> >  fs/ceph/inode.c                      |  50 +++---
> >  fs/ceph/mds_client.c                 | 123 ++++++++++++--
> >  fs/ceph/mds_client.h                 |  17 +-
> >  fs/ceph/super.h                      |  16 +-
> >  net/ceph/ceph_fs.c => fs/ceph/util.c |   4 -
> >  include/linux/ceph/ceph_fs.h         |   8 +-
> >  net/ceph/Makefile                    |   2 +-
> >  11 files changed, 443 insertions(+), 70 deletions(-)
> >  rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%)
> > 
> > --
> > 2.24.1
> > 
> 
> I realized that there still are two issues:
> -  we needs to clear delegated inos when mds failover

I'm guessing we need to do this whenever any session is reconnected,
right? I think we may have bigger problems here though:

The issue is that with this set we assign out ino_t's prior to
submitting the request. We could get down into it and find that we're
reconnecting the session. At that point, that ino_t is no longer valid
for the session.

> - we needs to clear caps for async dir operations when reconnecting to
> mds. (see last commit of https://github.com/ceph/ceph/pull/32576)
> 

I guess we can use ceph_iterate_session_caps to do this. That said,
looking at your patch:

        if (in->is_dir()) {                                                     
          // remove caps for async dir operations                               
          cap.implemented &= (CEPH_CAP_ANY_SHARED | CEPH_CAP_ANY_EXCL);         
        }                                                                       

We remove all but Fsx here. That seems like quite a subtle difference
from how FILE caps work.

Given the way that we're doing the inode delegation and handling dir
caps, I think we may need to rework things such that async requests
never get queued to the s_waiting list, and instead return some sort of
distinct error that cues the caller to fall back to a synchronous
request.

That would help prevent some of the potential races here.

Yan, Zheng Jan. 25, 2020, 1:58 a.m. UTC | #4

On 1/25/20 1:19 AM, Jeff Layton wrote:
> On Thu, 2020-01-23 at 01:04 +0800, Yan, Zheng wrote:
>> On Wed, Jan 22, 2020 at 3:31 AM Jeff Layton <jlayton@kernel.org> wrote:
>>> v3:
>>> - move some cephfs-specific code into ceph.ko
>>> - present and track inode numbers as u64 values
>>> - fix up check for dentry and cap eligibility checks
>>> - set O_CEPH_EXCL on async creates
>>> - attempt to handle errors better on async create (invalidate dentries
>>>    and dir completeness).
>>> - ensure that fsync waits for async create to complete
>>>
>>> v2:
>>> - move cached layout to dedicated field in inode
>>> - protect cached layout with i_ceph_lock
>>> - wipe cached layout in __check_cap_issue
>>> - set max_size of file to layout.stripe_unit
>>> - set truncate_size to (u64)-1
>>> - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
>>> - set cap_id to 1 in async created inode
>>> - allocate inode number before submitting request
>>> - rework the prep for an async create to be more efficient
>>> - don't allow MDS or cap messages involving an inode until we get async
>>>    create reply
>>>
>>> Still not quite ready for merge, but I've cleaned up a number of warts
>>> in the v2 set. Performance numbers still look about the same.
>>>
>>> There is definitely still a race of some sort that causes the client to
>>> try to asynchronously create a dentry that already exists. I'm still
>>> working on tracking that down.
>>>
>>> Jeff Layton (10):
>>>    ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
>>>    ceph: make ceph_fill_inode non-static
>>>    ceph: make dentry_lease_is_valid non-static
>>>    ceph: make __take_cap_refs non-static
>>>    ceph: decode interval_sets for delegated inos
>>>    ceph: add flag to designate that a request is asynchronous
>>>    ceph: add infrastructure for waiting for async create to complete
>>>    ceph: add new MDS req field to hold delegated inode number
>>>    ceph: cache layout in parent dir on first sync create
>>>    ceph: attempt to do async create when possible
>>>
>>>   fs/ceph/Makefile                     |   2 +-
>>>   fs/ceph/caps.c                       |  38 +++--
>>>   fs/ceph/dir.c                        |  13 +-
>>>   fs/ceph/file.c                       | 240 +++++++++++++++++++++++++--
>>>   fs/ceph/inode.c                      |  50 +++---
>>>   fs/ceph/mds_client.c                 | 123 ++++++++++++--
>>>   fs/ceph/mds_client.h                 |  17 +-
>>>   fs/ceph/super.h                      |  16 +-
>>>   net/ceph/ceph_fs.c => fs/ceph/util.c |   4 -
>>>   include/linux/ceph/ceph_fs.h         |   8 +-
>>>   net/ceph/Makefile                    |   2 +-
>>>   11 files changed, 443 insertions(+), 70 deletions(-)
>>>   rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%)
>>>
>>> --
>>> 2.24.1
>>>
>>
>> I realized that there still are two issues:
>> -  we needs to clear delegated inos when mds failover
> 
> I'm guessing we need to do this whenever any session is reconnected,
> right? I think we may have bigger problems here though:
> 
> The issue is that with this set we assign out ino_t's prior to
> submitting the request. We could get down into it and find that we're
> reconnecting the session. At that point, that ino_t is no longer valid
> for the session.

yes, that's a problem for current kclient impelmentaion


> 
>> - we needs to clear caps for async dir operations when reconnecting to
>> mds. (see last commit of https://github.com/ceph/ceph/pull/32576)
>>
> 
> I guess we can use ceph_iterate_session_caps to do this. That said,
> looking at your patch:
> 
>          if (in->is_dir()) {
>            // remove caps for async dir operations
>            cap.implemented &= (CEPH_CAP_ANY_SHARED | CEPH_CAP_ANY_EXCL);
>          }
> 
> We remove all but Fsx here. That seems like quite a subtle difference
> from how FILE caps work.

only file caps have more than 2 bits. this removes all but AsxLsxXsxFsx.

> 
> Given the way that we're doing the inode delegation and handling dir
> caps, I think we may need to rework things such that async requests
> never get queued to the s_waiting list, and instead return some sort of
> distinct error that cues the caller to fall back to a synchronous
> request.
> 
> That would help prevent some of the potential races here.

my libcephfs implementation check if a request can be async just before 
sending the request.
>

Yan, Zheng Jan. 25, 2020, 1:58 a.m. UTC | #5

On 1/25/20 1:19 AM, Jeff Layton wrote:
> On Thu, 2020-01-23 at 01:04 +0800, Yan, Zheng wrote:
>> On Wed, Jan 22, 2020 at 3:31 AM Jeff Layton <jlayton@kernel.org> wrote:
>>> v3:
>>> - move some cephfs-specific code into ceph.ko
>>> - present and track inode numbers as u64 values
>>> - fix up check for dentry and cap eligibility checks
>>> - set O_CEPH_EXCL on async creates
>>> - attempt to handle errors better on async create (invalidate dentries
>>>    and dir completeness).
>>> - ensure that fsync waits for async create to complete
>>>
>>> v2:
>>> - move cached layout to dedicated field in inode
>>> - protect cached layout with i_ceph_lock
>>> - wipe cached layout in __check_cap_issue
>>> - set max_size of file to layout.stripe_unit
>>> - set truncate_size to (u64)-1
>>> - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
>>> - set cap_id to 1 in async created inode
>>> - allocate inode number before submitting request
>>> - rework the prep for an async create to be more efficient
>>> - don't allow MDS or cap messages involving an inode until we get async
>>>    create reply
>>>
>>> Still not quite ready for merge, but I've cleaned up a number of warts
>>> in the v2 set. Performance numbers still look about the same.
>>>
>>> There is definitely still a race of some sort that causes the client to
>>> try to asynchronously create a dentry that already exists. I'm still
>>> working on tracking that down.
>>>
>>> Jeff Layton (10):
>>>    ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
>>>    ceph: make ceph_fill_inode non-static
>>>    ceph: make dentry_lease_is_valid non-static
>>>    ceph: make __take_cap_refs non-static
>>>    ceph: decode interval_sets for delegated inos
>>>    ceph: add flag to designate that a request is asynchronous
>>>    ceph: add infrastructure for waiting for async create to complete
>>>    ceph: add new MDS req field to hold delegated inode number
>>>    ceph: cache layout in parent dir on first sync create
>>>    ceph: attempt to do async create when possible
>>>
>>>   fs/ceph/Makefile                     |   2 +-
>>>   fs/ceph/caps.c                       |  38 +++--
>>>   fs/ceph/dir.c                        |  13 +-
>>>   fs/ceph/file.c                       | 240 +++++++++++++++++++++++++--
>>>   fs/ceph/inode.c                      |  50 +++---
>>>   fs/ceph/mds_client.c                 | 123 ++++++++++++--
>>>   fs/ceph/mds_client.h                 |  17 +-
>>>   fs/ceph/super.h                      |  16 +-
>>>   net/ceph/ceph_fs.c => fs/ceph/util.c |   4 -
>>>   include/linux/ceph/ceph_fs.h         |   8 +-
>>>   net/ceph/Makefile                    |   2 +-
>>>   11 files changed, 443 insertions(+), 70 deletions(-)
>>>   rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%)
>>>
>>> --
>>> 2.24.1
>>>
>>
>> I realized that there still are two issues:
>> -  we needs to clear delegated inos when mds failover
> 
> I'm guessing we need to do this whenever any session is reconnected,
> right? I think we may have bigger problems here though:
> 
> The issue is that with this set we assign out ino_t's prior to
> submitting the request. We could get down into it and find that we're
> reconnecting the session. At that point, that ino_t is no longer valid
> for the session.

yes, that's a problem for current kclient impelmentaion


> 
>> - we needs to clear caps for async dir operations when reconnecting to
>> mds. (see last commit of https://github.com/ceph/ceph/pull/32576)
>>
> 
> I guess we can use ceph_iterate_session_caps to do this. That said,
> looking at your patch:
> 
>          if (in->is_dir()) {
>            // remove caps for async dir operations
>            cap.implemented &= (CEPH_CAP_ANY_SHARED | CEPH_CAP_ANY_EXCL);
>          }
> 
> We remove all but Fsx here. That seems like quite a subtle difference
> from how FILE caps work.

only file caps have more than 2 bits. this removes caps other than 
AsxLsxXsxFsx.

> 
> Given the way that we're doing the inode delegation and handling dir
> caps, I think we may need to rework things such that async requests
> never get queued to the s_waiting list, and instead return some sort of
> distinct error that cues the caller to fall back to a synchronous
> request.
> 
> That would help prevent some of the potential races here.

my libcephfs implementation check if a request can be async just before 
sending the request.
>

[RFC,v3,00/10] ceph: asynchronous file create support

Message

Comments