Message ID | 20200121192928.469316-1-jlayton@kernel.org (mailing list archive) |
---|---|
Headers | show |
Series | ceph: asynchronous file create support | expand |
On 1/22/20 3:29 AM, Jeff Layton wrote: > v3: > - move some cephfs-specific code into ceph.ko > - present and track inode numbers as u64 values > - fix up check for dentry and cap eligibility checks > - set O_CEPH_EXCL on async creates > - attempt to handle errors better on async create (invalidate dentries > and dir completeness). > - ensure that fsync waits for async create to complete > > v2: > - move cached layout to dedicated field in inode > - protect cached layout with i_ceph_lock > - wipe cached layout in __check_cap_issue > - set max_size of file to layout.stripe_unit > - set truncate_size to (u64)-1 > - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS > - set cap_id to 1 in async created inode > - allocate inode number before submitting request > - rework the prep for an async create to be more efficient > - don't allow MDS or cap messages involving an inode until we get async > create reply > > Still not quite ready for merge, but I've cleaned up a number of warts > in the v2 set. Performance numbers still look about the same. > > There is definitely still a race of some sort that causes the client to > try to asynchronously create a dentry that already exists. I'm still > working on tracking that down. > > Jeff Layton (10): > ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c > ceph: make ceph_fill_inode non-static > ceph: make dentry_lease_is_valid non-static > ceph: make __take_cap_refs non-static > ceph: decode interval_sets for delegated inos > ceph: add flag to designate that a request is asynchronous > ceph: add infrastructure for waiting for async create to complete > ceph: add new MDS req field to hold delegated inode number > ceph: cache layout in parent dir on first sync create > ceph: attempt to do async create when possible > > fs/ceph/Makefile | 2 +- > fs/ceph/caps.c | 38 +++-- > fs/ceph/dir.c | 13 +- > fs/ceph/file.c | 240 +++++++++++++++++++++++++-- > fs/ceph/inode.c | 50 +++--- > fs/ceph/mds_client.c | 123 ++++++++++++-- > fs/ceph/mds_client.h | 17 +- > fs/ceph/super.h | 16 +- > net/ceph/ceph_fs.c => fs/ceph/util.c | 4 - > include/linux/ceph/ceph_fs.h | 8 +- > net/ceph/Makefile | 2 +- > 11 files changed, 443 insertions(+), 70 deletions(-) > rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%) > Series Reviewed-By: "Yan, Zheng" <zyan@redhat.com>
On Wed, Jan 22, 2020 at 3:31 AM Jeff Layton <jlayton@kernel.org> wrote: > > v3: > - move some cephfs-specific code into ceph.ko > - present and track inode numbers as u64 values > - fix up check for dentry and cap eligibility checks > - set O_CEPH_EXCL on async creates > - attempt to handle errors better on async create (invalidate dentries > and dir completeness). > - ensure that fsync waits for async create to complete > > v2: > - move cached layout to dedicated field in inode > - protect cached layout with i_ceph_lock > - wipe cached layout in __check_cap_issue > - set max_size of file to layout.stripe_unit > - set truncate_size to (u64)-1 > - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS > - set cap_id to 1 in async created inode > - allocate inode number before submitting request > - rework the prep for an async create to be more efficient > - don't allow MDS or cap messages involving an inode until we get async > create reply > > Still not quite ready for merge, but I've cleaned up a number of warts > in the v2 set. Performance numbers still look about the same. > > There is definitely still a race of some sort that causes the client to > try to asynchronously create a dentry that already exists. I'm still > working on tracking that down. > > Jeff Layton (10): > ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c > ceph: make ceph_fill_inode non-static > ceph: make dentry_lease_is_valid non-static > ceph: make __take_cap_refs non-static > ceph: decode interval_sets for delegated inos > ceph: add flag to designate that a request is asynchronous > ceph: add infrastructure for waiting for async create to complete > ceph: add new MDS req field to hold delegated inode number > ceph: cache layout in parent dir on first sync create > ceph: attempt to do async create when possible > > fs/ceph/Makefile | 2 +- > fs/ceph/caps.c | 38 +++-- > fs/ceph/dir.c | 13 +- > fs/ceph/file.c | 240 +++++++++++++++++++++++++-- > fs/ceph/inode.c | 50 +++--- > fs/ceph/mds_client.c | 123 ++++++++++++-- > fs/ceph/mds_client.h | 17 +- > fs/ceph/super.h | 16 +- > net/ceph/ceph_fs.c => fs/ceph/util.c | 4 - > include/linux/ceph/ceph_fs.h | 8 +- > net/ceph/Makefile | 2 +- > 11 files changed, 443 insertions(+), 70 deletions(-) > rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%) > > -- > 2.24.1 > I realized that there still are two issues: - we needs to clear delegated inos when mds failover - we needs to clear caps for async dir operations when reconnecting to mds. (see last commit of https://github.com/ceph/ceph/pull/32576) Regards Yan, Zheng
On Thu, 2020-01-23 at 01:04 +0800, Yan, Zheng wrote: > On Wed, Jan 22, 2020 at 3:31 AM Jeff Layton <jlayton@kernel.org> wrote: > > v3: > > - move some cephfs-specific code into ceph.ko > > - present and track inode numbers as u64 values > > - fix up check for dentry and cap eligibility checks > > - set O_CEPH_EXCL on async creates > > - attempt to handle errors better on async create (invalidate dentries > > and dir completeness). > > - ensure that fsync waits for async create to complete > > > > v2: > > - move cached layout to dedicated field in inode > > - protect cached layout with i_ceph_lock > > - wipe cached layout in __check_cap_issue > > - set max_size of file to layout.stripe_unit > > - set truncate_size to (u64)-1 > > - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS > > - set cap_id to 1 in async created inode > > - allocate inode number before submitting request > > - rework the prep for an async create to be more efficient > > - don't allow MDS or cap messages involving an inode until we get async > > create reply > > > > Still not quite ready for merge, but I've cleaned up a number of warts > > in the v2 set. Performance numbers still look about the same. > > > > There is definitely still a race of some sort that causes the client to > > try to asynchronously create a dentry that already exists. I'm still > > working on tracking that down. > > > > Jeff Layton (10): > > ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c > > ceph: make ceph_fill_inode non-static > > ceph: make dentry_lease_is_valid non-static > > ceph: make __take_cap_refs non-static > > ceph: decode interval_sets for delegated inos > > ceph: add flag to designate that a request is asynchronous > > ceph: add infrastructure for waiting for async create to complete > > ceph: add new MDS req field to hold delegated inode number > > ceph: cache layout in parent dir on first sync create > > ceph: attempt to do async create when possible > > > > fs/ceph/Makefile | 2 +- > > fs/ceph/caps.c | 38 +++-- > > fs/ceph/dir.c | 13 +- > > fs/ceph/file.c | 240 +++++++++++++++++++++++++-- > > fs/ceph/inode.c | 50 +++--- > > fs/ceph/mds_client.c | 123 ++++++++++++-- > > fs/ceph/mds_client.h | 17 +- > > fs/ceph/super.h | 16 +- > > net/ceph/ceph_fs.c => fs/ceph/util.c | 4 - > > include/linux/ceph/ceph_fs.h | 8 +- > > net/ceph/Makefile | 2 +- > > 11 files changed, 443 insertions(+), 70 deletions(-) > > rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%) > > > > -- > > 2.24.1 > > > > I realized that there still are two issues: > - we needs to clear delegated inos when mds failover I'm guessing we need to do this whenever any session is reconnected, right? I think we may have bigger problems here though: The issue is that with this set we assign out ino_t's prior to submitting the request. We could get down into it and find that we're reconnecting the session. At that point, that ino_t is no longer valid for the session. > - we needs to clear caps for async dir operations when reconnecting to > mds. (see last commit of https://github.com/ceph/ceph/pull/32576) > I guess we can use ceph_iterate_session_caps to do this. That said, looking at your patch: if (in->is_dir()) { // remove caps for async dir operations cap.implemented &= (CEPH_CAP_ANY_SHARED | CEPH_CAP_ANY_EXCL); } We remove all but Fsx here. That seems like quite a subtle difference from how FILE caps work. Given the way that we're doing the inode delegation and handling dir caps, I think we may need to rework things such that async requests never get queued to the s_waiting list, and instead return some sort of distinct error that cues the caller to fall back to a synchronous request. That would help prevent some of the potential races here.
On 1/25/20 1:19 AM, Jeff Layton wrote: > On Thu, 2020-01-23 at 01:04 +0800, Yan, Zheng wrote: >> On Wed, Jan 22, 2020 at 3:31 AM Jeff Layton <jlayton@kernel.org> wrote: >>> v3: >>> - move some cephfs-specific code into ceph.ko >>> - present and track inode numbers as u64 values >>> - fix up check for dentry and cap eligibility checks >>> - set O_CEPH_EXCL on async creates >>> - attempt to handle errors better on async create (invalidate dentries >>> and dir completeness). >>> - ensure that fsync waits for async create to complete >>> >>> v2: >>> - move cached layout to dedicated field in inode >>> - protect cached layout with i_ceph_lock >>> - wipe cached layout in __check_cap_issue >>> - set max_size of file to layout.stripe_unit >>> - set truncate_size to (u64)-1 >>> - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS >>> - set cap_id to 1 in async created inode >>> - allocate inode number before submitting request >>> - rework the prep for an async create to be more efficient >>> - don't allow MDS or cap messages involving an inode until we get async >>> create reply >>> >>> Still not quite ready for merge, but I've cleaned up a number of warts >>> in the v2 set. Performance numbers still look about the same. >>> >>> There is definitely still a race of some sort that causes the client to >>> try to asynchronously create a dentry that already exists. I'm still >>> working on tracking that down. >>> >>> Jeff Layton (10): >>> ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c >>> ceph: make ceph_fill_inode non-static >>> ceph: make dentry_lease_is_valid non-static >>> ceph: make __take_cap_refs non-static >>> ceph: decode interval_sets for delegated inos >>> ceph: add flag to designate that a request is asynchronous >>> ceph: add infrastructure for waiting for async create to complete >>> ceph: add new MDS req field to hold delegated inode number >>> ceph: cache layout in parent dir on first sync create >>> ceph: attempt to do async create when possible >>> >>> fs/ceph/Makefile | 2 +- >>> fs/ceph/caps.c | 38 +++-- >>> fs/ceph/dir.c | 13 +- >>> fs/ceph/file.c | 240 +++++++++++++++++++++++++-- >>> fs/ceph/inode.c | 50 +++--- >>> fs/ceph/mds_client.c | 123 ++++++++++++-- >>> fs/ceph/mds_client.h | 17 +- >>> fs/ceph/super.h | 16 +- >>> net/ceph/ceph_fs.c => fs/ceph/util.c | 4 - >>> include/linux/ceph/ceph_fs.h | 8 +- >>> net/ceph/Makefile | 2 +- >>> 11 files changed, 443 insertions(+), 70 deletions(-) >>> rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%) >>> >>> -- >>> 2.24.1 >>> >> >> I realized that there still are two issues: >> - we needs to clear delegated inos when mds failover > > I'm guessing we need to do this whenever any session is reconnected, > right? I think we may have bigger problems here though: > > The issue is that with this set we assign out ino_t's prior to > submitting the request. We could get down into it and find that we're > reconnecting the session. At that point, that ino_t is no longer valid > for the session. yes, that's a problem for current kclient impelmentaion > >> - we needs to clear caps for async dir operations when reconnecting to >> mds. (see last commit of https://github.com/ceph/ceph/pull/32576) >> > > I guess we can use ceph_iterate_session_caps to do this. That said, > looking at your patch: > > if (in->is_dir()) { > // remove caps for async dir operations > cap.implemented &= (CEPH_CAP_ANY_SHARED | CEPH_CAP_ANY_EXCL); > } > > We remove all but Fsx here. That seems like quite a subtle difference > from how FILE caps work. only file caps have more than 2 bits. this removes all but AsxLsxXsxFsx. > > Given the way that we're doing the inode delegation and handling dir > caps, I think we may need to rework things such that async requests > never get queued to the s_waiting list, and instead return some sort of > distinct error that cues the caller to fall back to a synchronous > request. > > That would help prevent some of the potential races here. my libcephfs implementation check if a request can be async just before sending the request. >
On 1/25/20 1:19 AM, Jeff Layton wrote: > On Thu, 2020-01-23 at 01:04 +0800, Yan, Zheng wrote: >> On Wed, Jan 22, 2020 at 3:31 AM Jeff Layton <jlayton@kernel.org> wrote: >>> v3: >>> - move some cephfs-specific code into ceph.ko >>> - present and track inode numbers as u64 values >>> - fix up check for dentry and cap eligibility checks >>> - set O_CEPH_EXCL on async creates >>> - attempt to handle errors better on async create (invalidate dentries >>> and dir completeness). >>> - ensure that fsync waits for async create to complete >>> >>> v2: >>> - move cached layout to dedicated field in inode >>> - protect cached layout with i_ceph_lock >>> - wipe cached layout in __check_cap_issue >>> - set max_size of file to layout.stripe_unit >>> - set truncate_size to (u64)-1 >>> - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS >>> - set cap_id to 1 in async created inode >>> - allocate inode number before submitting request >>> - rework the prep for an async create to be more efficient >>> - don't allow MDS or cap messages involving an inode until we get async >>> create reply >>> >>> Still not quite ready for merge, but I've cleaned up a number of warts >>> in the v2 set. Performance numbers still look about the same. >>> >>> There is definitely still a race of some sort that causes the client to >>> try to asynchronously create a dentry that already exists. I'm still >>> working on tracking that down. >>> >>> Jeff Layton (10): >>> ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c >>> ceph: make ceph_fill_inode non-static >>> ceph: make dentry_lease_is_valid non-static >>> ceph: make __take_cap_refs non-static >>> ceph: decode interval_sets for delegated inos >>> ceph: add flag to designate that a request is asynchronous >>> ceph: add infrastructure for waiting for async create to complete >>> ceph: add new MDS req field to hold delegated inode number >>> ceph: cache layout in parent dir on first sync create >>> ceph: attempt to do async create when possible >>> >>> fs/ceph/Makefile | 2 +- >>> fs/ceph/caps.c | 38 +++-- >>> fs/ceph/dir.c | 13 +- >>> fs/ceph/file.c | 240 +++++++++++++++++++++++++-- >>> fs/ceph/inode.c | 50 +++--- >>> fs/ceph/mds_client.c | 123 ++++++++++++-- >>> fs/ceph/mds_client.h | 17 +- >>> fs/ceph/super.h | 16 +- >>> net/ceph/ceph_fs.c => fs/ceph/util.c | 4 - >>> include/linux/ceph/ceph_fs.h | 8 +- >>> net/ceph/Makefile | 2 +- >>> 11 files changed, 443 insertions(+), 70 deletions(-) >>> rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%) >>> >>> -- >>> 2.24.1 >>> >> >> I realized that there still are two issues: >> - we needs to clear delegated inos when mds failover > > I'm guessing we need to do this whenever any session is reconnected, > right? I think we may have bigger problems here though: > > The issue is that with this set we assign out ino_t's prior to > submitting the request. We could get down into it and find that we're > reconnecting the session. At that point, that ino_t is no longer valid > for the session. yes, that's a problem for current kclient impelmentaion > >> - we needs to clear caps for async dir operations when reconnecting to >> mds. (see last commit of https://github.com/ceph/ceph/pull/32576) >> > > I guess we can use ceph_iterate_session_caps to do this. That said, > looking at your patch: > > if (in->is_dir()) { > // remove caps for async dir operations > cap.implemented &= (CEPH_CAP_ANY_SHARED | CEPH_CAP_ANY_EXCL); > } > > We remove all but Fsx here. That seems like quite a subtle difference > from how FILE caps work. only file caps have more than 2 bits. this removes caps other than AsxLsxXsxFsx. > > Given the way that we're doing the inode delegation and handling dir > caps, I think we may need to rework things such that async requests > never get queued to the s_waiting list, and instead return some sort of > distinct error that cues the caller to fall back to a synchronous > request. > > That would help prevent some of the potential races here. my libcephfs implementation check if a request can be async just before sending the request. >