Message ID | 20200115205912.38688-1-jlayton@kernel.org (mailing list archive)
Series     | ceph: asynchronous file create support
On Wed, 2020-01-15 at 15:59 -0500, Jeff Layton wrote:
> v2:
> - move cached layout to dedicated field in inode
> - protect cached layout with i_ceph_lock
> - wipe cached layout in __check_cap_issue
> - set max_size of file to layout.stripe_unit
> - set truncate_size to (u64)-1
> - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
> - set cap_id to 1 in async created inode
> - allocate inode number before submitting request
> - rework the prep for an async create to be more efficient
> - don't allow MDS or cap messages involving an inode until we get async
>   create reply
>
> A lot of changes in this set, mostly based on Zheng and Xiubo's
> comments. Performance is pretty similar to the previous set:
>
> Untarring a kernel tarball into a cephfs takes about 98s with
> async dirops disabled. With them enabled, it takes around 78s,
> which is about a 25% improvement.
>
> This is not quite ready for merge. Error handling could still be
> improved. With xfstest generic/531, I see some messages like this pop
> up in the ring buffer:
>
>   [ 7331.393110] ceph: ceph_async_create_cb: inode number mismatch! err=0 deleg_ino=0x100001232d9 target=0x100001232b9
>
> Basically, we went to do an async create and got a different inode
> number back than expected. That still needs investigation, but I
> didn't see any test failures due to it.
>
> Jeff Layton (10):
>   libceph: export ceph_file_layout_is_valid
>   ceph: make ceph_fill_inode non-static
>   ceph: make dentry_lease_is_valid non-static
>   ceph: make __take_cap_refs a public function
>   ceph: decode interval_sets for delegated inos
>   ceph: add flag to designate that a request is asynchronous
>   ceph: add infrastructure for waiting for async create to complete
>   ceph: add new MDS req field to hold delegated inode number
>   ceph: cache layout in parent dir on first sync create
>   ceph: attempt to do async create when possible
>
>  fs/ceph/caps.c               |  34 ++++--
>  fs/ceph/dir.c                |  13 ++-
>  fs/ceph/file.c               | 218 +++++++++++++++++++++++++++++++++--
>  fs/ceph/inode.c              |  50 ++++----
>  fs/ceph/mds_client.c         | 126 ++++++++++++++++++--
>  fs/ceph/mds_client.h         |   9 +-
>  fs/ceph/super.h              |  16 ++-
>  include/linux/ceph/ceph_fs.h |   8 +-
>  net/ceph/ceph_fs.c           |   1 +
>  9 files changed, 410 insertions(+), 65 deletions(-)

I forgot to mention that I went ahead and merged a few patches from the
first pile into testing, so this series is based on top of
ceph-client/testing as of today.

Cheers,
On 2020/1/16 4:59, Jeff Layton wrote:
> v2:
> [...]
> Basically, we went to do an async create and got a different inode
> number back than expected. That still needs investigation, but I
> didn't see any test failures due to it.

This looks good to me. Thanks.
On 1/16/20 4:59 AM, Jeff Layton wrote:
> [...]
> This is not quite ready for merge. Error handling could still be
> improved. With xfstest generic/531, I see some messages like this pop
> up in the ring buffer:
>
>   [ 7331.393110] ceph: ceph_async_create_cb: inode number mismatch! err=0 deleg_ino=0x100001232d9 target=0x100001232b9
>

How about always setting the O_EXCL flag in async create requests? It
may help to debug this issue.

> Basically, we went to do an async create and got a different inode
> number back than expected. That still needs investigation, but I
> didn't see any test failures due to it.
On Thu, 2020-01-16 at 21:10 +0800, Yan, Zheng wrote:
> [...]
> How about always setting the O_EXCL flag in async create requests? It
> may help to debug this issue.

I was just thinking the same thing yesterday. I'll do that and we can
see what that turns up. Thanks!
On 1/16/20 4:59 AM, Jeff Layton wrote:
> v2:
> [...]

An issue with the kernel async unlink/create implementation is that
get_caps_for_async_unlink/try_prep_async_create are called before
ceph_mdsc_submit_request. There is a small window in which the
session's state may change. If the session is in the wrong state,
ceph_mdsc_submit_request() may wait rather than send the request
immediately.
On Mon, 2020-01-20 at 21:20 +0800, Yan, Zheng wrote:
> [...]
> An issue with the kernel async unlink/create implementation is that
> get_caps_for_async_unlink/try_prep_async_create are called before
> ceph_mdsc_submit_request. There is a small window in which the
> session's state may change. If the session is in the wrong state,
> ceph_mdsc_submit_request() may wait rather than send the request
> immediately.

Is that a real issue (other than performance)?

We hold cap references, so assuming that the session can be reconnected
and that we keep the caps, everything should still work correctly, no?
On 1/21/20 6:56 PM, Jeff Layton wrote:
> [...]
> Is that a real issue (other than performance)?
>
> We hold cap references, so assuming that the session can be reconnected
> and that we keep the caps, everything should still work correctly, no?

I think it may cause more trouble for error handling (for the case where
the client fails to reconnect the session).

Regards,
Yan, Zheng
On Tue, 2020-01-21 at 21:20 +0800, Yan, Zheng wrote:
> [...]
> I think it may cause more trouble for error handling (for the case where
> the client fails to reconnect the session).

Yeah, error handling is a real Achilles' heel in this whole thing. I'm
still considering how best to deal with errors when this occurs. There
may not be any good answers for that without adding new interfaces.