[RFC,v2,00/10] ceph: asynchronous file create support

Message ID: 20200115205912.38688-1-jlayton@kernel.org

Message

Jeff Layton Jan. 15, 2020, 8:59 p.m. UTC
v2:
- move cached layout to dedicated field in inode
- protect cached layout with i_ceph_lock
- wipe cached layout in __check_cap_issue
- set max_size of file to layout.stripe_unit
- set truncate_size to (u64)-1
- use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
- set cap_id to 1 in async created inode
- allocate inode number before submitting request
- rework the prep for an async create to be more efficient
- don't allow MDS or cap messages involving an inode until we get async
  create reply
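
As a rough sketch of the first three items above, the caching works
along these lines (helper and field names here are illustrative, not
necessarily what the patches use):

    /*
     * Cache the layout used by the first synchronous create in the
     * parent directory's inode, under i_ceph_lock, so later async
     * creates can reuse it without a round trip to the MDS.
     */
    static void cache_dir_layout(struct ceph_inode_info *dci,
                                 const struct ceph_file_layout *layout)
    {
            spin_lock(&dci->i_ceph_lock);
            dci->i_cached_layout = *layout;
            spin_unlock(&dci->i_ceph_lock);
    }

    /* Wiped again whenever the issued caps change, from __check_cap_issue() */
    static void wipe_dir_layout(struct ceph_inode_info *dci)
    {
            spin_lock(&dci->i_ceph_lock);
            memset(&dci->i_cached_layout, 0, sizeof(dci->i_cached_layout));
            spin_unlock(&dci->i_ceph_lock);
    }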

A lot of changes in this set, mostly based on Zheng and Xiubo's
comments. Performance is pretty similar to the previous set:

Untarring a kernel tarball into a cephfs takes about 98s with
async dirops disabled. With them enabled, it takes around 78s
(98/78 ≈ 1.26), which is about a 25% improvement.

This is not quite ready for merge. Error handling could still be
improved. With xfstest generic/531, I see some messages like this pop
up in the ring buffer:

    [ 7331.393110] ceph: ceph_async_create_cb: inode number mismatch! err=0 deleg_ino=0x100001232d9 target=0x100001232b9

Basically, we went to do an async create and got a different inode
number back than expected. That still needs investigation, but I
didn't see any test failures due to it.
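
For reference, the warning comes from a sanity check of the MDS reply
against the inode number the client picked; a hedged reconstruction
(from the log format only, not the actual callback body):

    static void check_async_create_reply(int err, u64 deleg_ino, u64 target_ino)
    {
            /*
             * The client created the dentry against deleg_ino; the MDS
             * reply names target_ino. These should always match.
             */
            if (deleg_ino != target_ino)
                    pr_warn("ceph_async_create_cb: inode number mismatch! err=%d deleg_ino=0x%llx target=0x%llx\n",
                            err, deleg_ino, target_ino);
    }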

Jeff Layton (10):
  libceph: export ceph_file_layout_is_valid
  ceph: make ceph_fill_inode non-static
  ceph: make dentry_lease_is_valid non-static
  ceph: make __take_cap_refs a public function
  ceph: decode interval_sets for delegated inos
  ceph: add flag to designate that a request is asynchronous
  ceph: add infrastructure for waiting for async create to complete
  ceph: add new MDS req field to hold delegated inode number
  ceph: cache layout in parent dir on first sync create
  ceph: attempt to do async create when possible

 fs/ceph/caps.c               |  34 ++++--
 fs/ceph/dir.c                |  13 ++-
 fs/ceph/file.c               | 218 +++++++++++++++++++++++++++++++++--
 fs/ceph/inode.c              |  50 ++++----
 fs/ceph/mds_client.c         | 126 ++++++++++++++++++--
 fs/ceph/mds_client.h         |   9 +-
 fs/ceph/super.h              |  16 ++-
 include/linux/ceph/ceph_fs.h |   8 +-
 net/ceph/ceph_fs.c           |   1 +
 9 files changed, 410 insertions(+), 65 deletions(-)

Comments

Jeff Layton Jan. 15, 2020, 9:30 p.m. UTC | #1
On Wed, 2020-01-15 at 15:59 -0500, Jeff Layton wrote:
> [...]

I forgot to mention that I went ahead and merged a few patches from the
first pile into testing, so this series is based on top of
ceph-client/testing as of today.

Cheers,
Xiubo Li Jan. 16, 2020, 8:57 a.m. UTC | #2
On 2020/1/16 4:59, Jeff Layton wrote:
> [...]

This looks good to me.

Thanks.
Yan, Zheng Jan. 16, 2020, 1:10 p.m. UTC | #3
On 1/16/20 4:59 AM, Jeff Layton wrote:
> [...]
>
> This is not quite ready for merge. Error handling could still be
> improved. With xfstest generic/531, I see some messages like this pop
> up in the ring buffer:
> 
>      [ 7331.393110] ceph: ceph_async_create_cb: inode number mismatch! err=0 deleg_ino=0x100001232d9 target=0x100001232b9
> 

How about always setting the O_EXCL flag in async create requests? It
may help to debug this issue.
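
Sketched, the idea is just to force the exclusive-create wire flag for
async creates (illustrative fragment; CEPH_O_EXCL is the wire flag from
ceph_fs.h, the surrounding names are guesses, not the series' code):

    u32 wire_flags = ceph_flags_sys2wire(flags);
    if (async_create)                  /* client already picked the ino */
            wire_flags |= CEPH_O_EXCL; /* make the MDS fail on collision */
    req->r_args.open.flags = cpu_to_le32(wire_flags);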

Jeff Layton Jan. 16, 2020, 2:15 p.m. UTC | #4
On Thu, 2020-01-16 at 21:10 +0800, Yan, Zheng wrote:
> On 1/16/20 4:59 AM, Jeff Layton wrote:
> > [...]
> 
> How about always setting the O_EXCL flag in async create requests? It
> may help to debug this issue.
> 

I was just thinking the same thing yesterday. I'll do that and we can
see what that turns up. Thanks!

Yan, Zheng Jan. 20, 2020, 1:20 p.m. UTC | #5
On 1/16/20 4:59 AM, Jeff Layton wrote:
> [...]

An issue with the kernel async unlink/create implementation is that
get_caps_for_async_unlink/try_prep_async_create are called before
ceph_mdsc_submit_request. There is a small window in which the
session's state may change. If the session is in the wrong state,
ceph_mdsc_submit_request() may wait rather than send the request
immediately.
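
Schematically (simplified, not the actual code paths):

    if (try_prep_async_create(dir, dentry, &lo, &ino)) {
            /*
             * Caps and a delegated ino are taken here based on the
             * session's state at *this* point...
             */
            set_async_create(req);  /* illustrative helper name */
    }
    /*
     * ...but the session may change state before the handoff below,
     * so this call can end up waiting for the session instead of
     * sending immediately, with the async-create resources still held.
     */
    err = ceph_mdsc_submit_request(mdsc, dir, req);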
Jeff Layton Jan. 21, 2020, 10:56 a.m. UTC | #6
On Mon, 2020-01-20 at 21:20 +0800, Yan, Zheng wrote:
> On 1/16/20 4:59 AM, Jeff Layton wrote:
> > [...]
> 
> An issue with the kernel async unlink/create implementation is that
> get_caps_for_async_unlink/try_prep_async_create are called before
> ceph_mdsc_submit_request. There is a small window in which the
> session's state may change. If the session is in the wrong state,
> ceph_mdsc_submit_request() may wait rather than send the request
> immediately.
> 

Is that a real issue (other than performance)?

We hold cap references, so assuming that the session can be reconnected
and that we keep the caps, everything should still work correctly, no?
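
Schematically, the argument is that the references pin the caps across
the hiccup (the cap mask below is a placeholder; ceph_take_cap_refs is
the helper made public earlier in this series):

    ceph_take_cap_refs(ci, CEPH_CAP_FILE_EXCL, false);  /* pin caps up front */
    err = ceph_mdsc_submit_request(mdsc, dir, req);     /* may block on reconnect */
    if (err)
            ceph_put_cap_refs(ci, CEPH_CAP_FILE_EXCL);  /* drop them on failure */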
Yan, Zheng Jan. 21, 2020, 1:20 p.m. UTC | #7
On 1/21/20 6:56 PM, Jeff Layton wrote:
> On Mon, 2020-01-20 at 21:20 +0800, Yan, Zheng wrote:
>> On 1/16/20 4:59 AM, Jeff Layton wrote:
>>> [...]
>>
>> An issue with the kernel async unlink/create implementation is that
>> get_caps_for_async_unlink/try_prep_async_create are called before
>> ceph_mdsc_submit_request. There is a small window in which the
>> session's state may change. If the session is in the wrong state,
>> ceph_mdsc_submit_request() may wait rather than send the request
>> immediately.
>>
> 
> Is that a real issue (other than performance)?
> 
> We hold cap references, so assuming that the session can be reconnected
> and that we keep the caps, everything should still work correctly, no?
> 

I think it may cause more trouble for error handling (for the case where
the client fails to reconnect the session).

Regards
Yan, Zheng
Jeff Layton Jan. 21, 2020, 2:37 p.m. UTC | #8
On Tue, 2020-01-21 at 21:20 +0800, Yan, Zheng wrote:
> On 1/21/20 6:56 PM, Jeff Layton wrote:
> > On Mon, 2020-01-20 at 21:20 +0800, Yan, Zheng wrote:
> > > On 1/16/20 4:59 AM, Jeff Layton wrote:
> > > > [...]
> > > 
> > > An issue with the kernel async unlink/create implementation is that
> > > get_caps_for_async_unlink/try_prep_async_create are called before
> > > ceph_mdsc_submit_request. There is a small window in which the
> > > session's state may change. If the session is in the wrong state,
> > > ceph_mdsc_submit_request() may wait rather than send the request
> > > immediately.
> > > 
> > 
> > Is that a real issue (other than performance)?
> > 
> > We hold cap references, so assuming that the session can be reconnected
> > and that we keep the caps, everything should still work correctly, no?
> > 
> 
> I think it may cause more trouble for error handling (for the case where
> the client fails to reconnect the session).
> 

Yeah, error handling is a real Achilles' heel in this whole thing. I'm
still considering how best to deal with errors when this occurs. There
may not be any good answers for that without adding new interfaces.