From patchwork Sun Oct 28 13:57:41 2018
X-Patchwork-Submitter: Xuehan Xu
X-Patchwork-Id: 10658635
From: xxhdx1985126@gmail.com
To: ceph-devel@vger.kernel.org
Cc: Xuehan Xu
Subject: [PATCH 1/2] ceph: issue getattr/lookup reqs to MDSes in an aggregative pattern
Date: Sun, 28 Oct 2018 21:57:41 +0800
Message-Id: <20181028135742.24668-1-xxhdx1985126@gmail.com>
X-Mailer: git-send-email 2.19.1
List-ID: ceph-devel@vger.kernel.org

From: Xuehan Xu <xxhdx1985126@gmail.com>

Instead of issuing a new getattr/lookup request to the MDSes for every
getattr/lookup op, only issue a new one if there is no in-flight request
that requires the same caps as the current op.

Signed-off-by: Xuehan Xu <xxhdx1985126@gmail.com>
---
 fs/ceph/dir.c        | 99 ++++++++++++++++++++++++++++++--------------
 fs/ceph/inode.c      | 48 ++++++++++++++++-----
 fs/ceph/mds_client.c | 23 +++++++++-
 fs/ceph/mds_client.h |  5 ++-
 fs/ceph/super.c      | 68 ++++++++++++++++++++++++++++++
 fs/ceph/super.h      | 13 ++++++
 6 files changed, 211 insertions(+), 45 deletions(-)
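A note on the idea before the diff: each inode keeps a registry of
in-flight getattr/lookup requests keyed by the requested cap mask, and a
caller that finds a matching in-flight request waits on its completion
instead of sending its own. The pattern is easier to see stripped of
kernel machinery. Below is a minimal userspace sketch of the same idea,
not the patch itself; all names are illustrative, pthreads stand in for
kernel completions, and a linked list stands in for the patch's
mask-keyed rbtree:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdlib.h>

    struct inflight {
            int mask;               /* key: the cap mask this request asks for */
            bool done;
            int result;
            int users;
            pthread_cond_t cv;
            struct inflight *next;
    };

    static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct inflight *registry;

    /* stand-in for the real round trip to the MDS */
    static int mds_round_trip(int mask)
    {
            (void)mask;
            return 0;
    }

    int aggregated_getattr(int mask)
    {
            struct inflight *r, **p;
            int err;

            pthread_mutex_lock(&registry_lock);
            for (r = registry; r; r = r->next)
                    if (r->mask == mask)
                            break;
            if (r) {
                    /* identical request already on the wire: just wait */
                    r->users++;
                    while (!r->done)
                            pthread_cond_wait(&r->cv, &registry_lock);
                    err = r->result;
                    if (--r->users == 0)
                            free(r);
                    pthread_mutex_unlock(&registry_lock);
                    return err;
            }

            /* no match: register ourselves, then do the round trip */
            r = calloc(1, sizeof(*r));
            r->mask = mask;
            r->users = 1;
            pthread_cond_init(&r->cv, NULL);
            r->next = registry;
            registry = r;
            pthread_mutex_unlock(&registry_lock);

            err = mds_round_trip(mask);

            pthread_mutex_lock(&registry_lock);
            for (p = &registry; *p != r; p = &(*p)->next)
                    ;               /* unlink so later callers issue anew */
            *p = r->next;
            r->result = err;
            r->done = true;
            pthread_cond_broadcast(&r->cv);
            if (--r->users == 0)
                    free(r);
            pthread_mutex_unlock(&registry_lock);
            return err;
    }

Note that, as in the patch, only an exact mask match piggybacks; a
request wanting a superset of an in-flight mask still goes to the MDS
on its own.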
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 036ac0f3a393..fa4911bd5576 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -731,7 +731,7 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry,
 {
 	struct ceph_fs_client *fsc = ceph_sb_to_client(dir->i_sb);
 	struct ceph_mds_client *mdsc = fsc->mdsc;
-	struct ceph_mds_request *req;
+	struct ceph_mds_request *req = NULL;
 	int op;
 	int mask;
 	int err;
@@ -765,6 +765,10 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry,
 		spin_unlock(&ci->i_ceph_lock);
 	}
 
+	mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
+	if (ceph_security_xattr_wanted(dir))
+		mask |= CEPH_CAP_XATTR_SHARED;
+
 	op = ceph_snap(dir) == CEPH_SNAPDIR ?
 		CEPH_MDS_OP_LOOKUPSNAP : CEPH_MDS_OP_LOOKUP;
 	req = ceph_mdsc_create_request(mdsc, op, USE_ANY_MDS);
@@ -772,12 +776,9 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry,
 		return ERR_CAST(req);
 	req->r_dentry = dget(dentry);
 	req->r_num_caps = 2;
-
-	mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
-	if (ceph_security_xattr_wanted(dir))
-		mask |= CEPH_CAP_XATTR_SHARED;
+
 	req->r_args.getattr.mask = cpu_to_le32(mask);
-
+
 	req->r_parent = dir;
 	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
 	err = ceph_mdsc_do_request(mdsc, NULL, req);
@@ -1176,6 +1177,7 @@ static int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags,
 			}
 		}
 	}
+	dout("dentry_lease_is_valid ttl = %ld, ceph_dentry.time = %ld, lease_renew_after = %ld, lease_renew_from = %ld, jiffies = %ld\n", ttl, di->time, di->lease_renew_after, di->lease_renew_from, jiffies);
 	}
 	spin_unlock(&dentry->d_lock);
 
@@ -1184,7 +1186,7 @@ static int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags,
 				   CEPH_MDS_LEASE_RENEW, seq);
 		ceph_put_mds_session(session);
 	}
-	dout("dentry_lease_is_valid - dentry %p = %d\n", dentry, valid);
+	dout("dentry_lease_is_valid - di %p, dentry %p = %d\n", di, dentry, valid);
 	return valid;
 }
 
@@ -1252,46 +1254,79 @@ static int ceph_d_revalidate(struct dentry *dentry, unsigned int flags)
 	if (!valid) {
 		struct ceph_mds_client *mdsc =
 			ceph_sb_to_client(dir->i_sb)->mdsc;
-		struct ceph_mds_request *req;
+		struct ceph_mds_request *req = NULL;
+		struct ceph_inode_info* cdir = ceph_inode(dir);
 		int op, err;
 		u32 mask;
 
 		if (flags & LOOKUP_RCU)
 			return -ECHILD;
 
+		mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
+		if (ceph_security_xattr_wanted(dir))
+			mask |= CEPH_CAP_XATTR_SHARED;
 		op = ceph_snap(dir) == CEPH_SNAPDIR ?
 			CEPH_MDS_OP_LOOKUPSNAP : CEPH_MDS_OP_LOOKUP;
-		req = ceph_mdsc_create_request(mdsc, op, USE_ANY_MDS);
-		if (!IS_ERR(req)) {
-			req->r_dentry = dget(dentry);
-			req->r_num_caps = 2;
-			req->r_parent = dir;
+		if (op == CEPH_MDS_OP_LOOKUP) {
+			mutex_lock(&cdir->lookups_inflight_lock);
+			dout("d_revalidate searching inode lookups inflight, %p, '%pd', inode %p offset %lld, mask: %d\n",
+			     dentry, dentry, d_inode(dentry), ceph_dentry(dentry)->offset, mask);
+			req = __search_inode_getattr_or_lookup(&cdir->lookups_inflight, mask, true);
+		}
+		if (req && op == CEPH_MDS_OP_LOOKUP) {
+			dout("d_revalidate found previous lookup inflight, %p, '%pd', inode %p offset %lld, mask: %d, req jiffies: %ld\n",
+			     dentry, dentry, d_inode(dentry), ceph_dentry(dentry)->offset, mask, req->r_started);
+			ceph_mdsc_get_request(req);
+			mutex_unlock(&cdir->lookups_inflight_lock);
+			err = ceph_mdsc_wait_for_request(req);
+			dout("d_revalidate waited previous lookup inflight, %p, '%pd', inode %p offset %lld, mask: %d, req jiffies: %ld, err: %d\n",
+			     dentry, dentry, d_inode(dentry), ceph_dentry(dentry)->offset, mask, req->r_started, err);
+		} else {
-			mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
-			if (ceph_security_xattr_wanted(dir))
-				mask |= CEPH_CAP_XATTR_SHARED;
-			req->r_args.getattr.mask = cpu_to_le32(mask);
+			req = ceph_mdsc_create_request(mdsc, op, USE_ANY_MDS);
+			if (op == CEPH_MDS_OP_LOOKUP) {
+				if (!IS_ERR(req)) {
+					req->r_dentry = dget(dentry);
+					req->r_num_caps = 2;
+					req->r_parent = dir;
+					req->r_args.getattr.mask = cpu_to_le32(mask);
+					__register_inode_getattr_or_lookup(cdir, req, true);
+					dout("d_revalidate no previous lookup inflight, just registered a new one, %p, '%pd', inode %p offset %lld, mask: %d, req jiffies: %ld\n",
+					     dentry, dentry, d_inode(dentry), ceph_dentry(dentry)->offset, mask, req->r_started);
+				}
+				mutex_unlock(&cdir->lookups_inflight_lock);
+			}
+			if (IS_ERR(req))
+				goto out;
 
 			err = ceph_mdsc_do_request(mdsc, NULL, req);
-			switch (err) {
-			case 0:
-				if (d_really_is_positive(dentry) &&
-				    d_inode(dentry) == req->r_target_inode)
-					valid = 1;
-				break;
-			case -ENOENT:
-				if (d_really_is_negative(dentry))
-					valid = 1;
-				/* Fallthrough */
-			default:
-				break;
+			if (op == CEPH_MDS_OP_LOOKUP) {
+				mutex_lock(&cdir->lookups_inflight_lock);
+				__unregister_inode_getattr_or_lookup(cdir, req, true);
+				dout("d_revalidate just unregistered one, %p, '%pd', inode %p offset %lld, mask: %d, req jiffies: %ld, err: %d\n",
+				     dentry, dentry, d_inode(dentry), ceph_dentry(dentry)->offset, mask, req->r_started, err);
+				mutex_unlock(&cdir->lookups_inflight_lock);
 			}
-			ceph_mdsc_put_request(req);
-			dout("d_revalidate %p lookup result=%d\n",
-			     dentry, err);
 		}
+		switch (err) {
+		case 0:
+			if (d_really_is_positive(dentry) &&
+			    d_inode(dentry) == req->r_target_inode)
+				valid = 1;
+			break;
+		case -ENOENT:
+			if (d_really_is_negative(dentry))
+				valid = 1;
+			/* Fallthrough */
+		default:
+			break;
+		}
+		ceph_mdsc_put_request(req);
+		dout("d_revalidate %p lookup result=%d\n",
+		     dentry, err);
 	}
 
+out:
 	dout("d_revalidate %p %s\n", dentry, valid ?
"valid" : "invalid"); if (valid) { ceph_dentry_lru_touch(dentry); diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index a866be999216..c51e2f186139 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -430,6 +430,8 @@ struct inode *ceph_alloc_inode(struct super_block *sb) dout("alloc_inode %p\n", &ci->vfs_inode); spin_lock_init(&ci->i_ceph_lock); + mutex_init(&ci->getattrs_inflight_lock); + mutex_init(&ci->lookups_inflight_lock); ci->i_version = 0; ci->i_inline_version = 0; @@ -461,6 +463,8 @@ struct inode *ceph_alloc_inode(struct super_block *sb) ci->i_xattrs.index_version = 0; ci->i_caps = RB_ROOT; + ci->getattrs_inflight = RB_ROOT; + ci->lookups_inflight = RB_ROOT; ci->i_auth_cap = NULL; ci->i_dirty_caps = 0; ci->i_flushing_caps = 0; @@ -1047,9 +1051,10 @@ static void update_dentry_lease(struct dentry *dentry, * Make sure dentry's inode matches tgt_vino. NULL tgt_vino means that * we expect a negative dentry. */ + dout("update_dentry_lease, d_inode: %p\n", dentry->d_inode); if (!tgt_vino && d_really_is_positive(dentry)) return; - + dout("update_dentry_lease, d_inode: %p\n", dentry->d_inode); if (tgt_vino && (d_really_is_negative(dentry) || !ceph_ino_compare(d_inode(dentry), tgt_vino))) return; @@ -2194,6 +2199,7 @@ int __ceph_do_getattr(struct inode *inode, struct page *locked_page, struct ceph_mds_request *req; int mode; int err; + struct ceph_inode_info* cinode = ceph_inode(inode); if (ceph_snap(inode) == CEPH_SNAPDIR) { dout("do_getattr inode %p SNAPDIR\n", inode); @@ -2205,16 +2211,36 @@ int __ceph_do_getattr(struct inode *inode, struct page *locked_page, if (!force && ceph_caps_issued_mask(ceph_inode(inode), mask, 1)) return 0; - mode = (mask & CEPH_STAT_RSTAT) ? USE_AUTH_MDS : USE_ANY_MDS; - req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_GETATTR, mode); - if (IS_ERR(req)) - return PTR_ERR(req); - req->r_inode = inode; - ihold(inode); - req->r_num_caps = 1; - req->r_args.getattr.mask = cpu_to_le32(mask); - req->r_locked_page = locked_page; - err = ceph_mdsc_do_request(mdsc, NULL, req); + mutex_lock(&cinode->getattrs_inflight_lock); + dout("__ceph_do_getattr searching inode getattrs inflight, inode %p, mask: %d\n", inode, mask); + req = __search_inode_getattr_or_lookup(&cinode->getattrs_inflight, mask, false); + if (req) { + dout("__ceph_do_getattr found previous inode getattr inflight, inode %p, mask: %d, req jiffies: %ld\n", inode, mask, req->r_started); + ceph_mdsc_get_request(req); + mutex_unlock(&cinode->getattrs_inflight_lock); + err = ceph_mdsc_wait_for_request(req); + dout("__ceph_do_getattr waited previous inode getattr inflight, inode %p, mask: %d, req jiffies: %ld, err: %d\n", inode, mask, req->r_started, err); + } else { + mode = (mask & CEPH_STAT_RSTAT) ? 
+		req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_GETATTR, mode);
+		if (!IS_ERR(req)) {
+			req->r_inode = inode;
+			ihold(inode);
+			req->r_num_caps = 1;
+			req->r_args.getattr.mask = cpu_to_le32(mask);
+			req->r_locked_page = locked_page;
+			__register_inode_getattr_or_lookup(cinode, req, false);
+			dout("__ceph_do_getattr no previous getattr inflight, inode %p, mask: %d, req jiffies: %ld\n", inode, mask, req->r_started);
+		}
+		mutex_unlock(&cinode->getattrs_inflight_lock);
+		if (IS_ERR(req))
+			return PTR_ERR(req);
+		err = ceph_mdsc_do_request(mdsc, NULL, req);
+		mutex_lock(&cinode->getattrs_inflight_lock);
+		__unregister_inode_getattr_or_lookup(cinode, req, false);
+		dout("__ceph_do_getattr just unregistered inode getattr inflight, inode %p, mask: %d, req jiffies: %ld, err: %d\n", inode, mask, req->r_started, err);
+		mutex_unlock(&cinode->getattrs_inflight_lock);
+	}
 	if (locked_page && err == 0) {
 		u64 inline_version = req->r_reply_info.targeti.inline_version;
 		if (inline_version == 0) {
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index dc8bc664a871..4412ee13164e 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1792,7 +1792,10 @@ ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode)
 	req->r_fmode = -1;
 	kref_init(&req->r_kref);
 	RB_CLEAR_NODE(&req->r_node);
+	RB_CLEAR_NODE(&req->getattr_node);
+	RB_CLEAR_NODE(&req->lookup_node);
 	INIT_LIST_HEAD(&req->r_wait);
+	init_completion(&req->batch_op_completion);
 	init_completion(&req->r_completion);
 	init_completion(&req->r_safe_completion);
 	INIT_LIST_HEAD(&req->r_unsafe_item);
@@ -2386,6 +2389,23 @@ void ceph_mdsc_submit_request(struct ceph_mds_client *mdsc,
 	mutex_unlock(&mdsc->mutex);
 }
 
+int ceph_mdsc_wait_for_request(struct ceph_mds_request* req)
+{
+	int err = 0;
+	long timeleft = wait_for_completion_killable_timeout(
+				&req->batch_op_completion,
+				ceph_timeout_jiffies(req->r_timeout));
+	if (timeleft > 0)
+		err = 0;
+	else if (!timeleft)
+		err = -EIO; /* timed out */
+	else
+		err = timeleft; /* killed */
+	if (err)
+		return err;
+	return req->batch_op_err;
+}
+
 /*
  * Synchrously perform an mds request.  Take care of all of the
  * session setup, forwarding, retry details.
@@ -2458,7 +2478,8 @@ int ceph_mdsc_do_request(struct ceph_mds_client *mdsc,
 	} else {
 		err = req->r_err;
 	}
-
+	req->batch_op_err = err;
+	complete_all(&req->batch_op_completion);
 out:
 	mutex_unlock(&mdsc->mutex);
 	dout("do_request %p done, result %d\n", req, err);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 2ec3b5b35067..830c97e1bcf0 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -199,6 +199,7 @@ typedef int (*ceph_mds_request_wait_callback_t) (struct ceph_mds_client *mdsc,
 struct ceph_mds_request {
 	u64 r_tid;                   /* transaction id */
 	struct rb_node r_node;
+	struct rb_node getattr_node, lookup_node;
 	struct ceph_mds_client *r_mdsc;
 
 	int r_op;                    /* mds op code */
@@ -250,7 +251,7 @@ struct ceph_mds_request {
 	struct ceph_msg  *r_reply;
 	struct ceph_mds_reply_info_parsed r_reply_info;
 	struct page *r_locked_page;
-	int r_err;
+	int r_err, batch_op_err;
 
 	unsigned long r_timeout;  /* optional.  jiffies, 0 is "wait forever" */
 	unsigned long r_started;  /* start time to measure timeout against */
@@ -273,6 +274,7 @@ struct ceph_mds_request {
 	struct kref r_kref;
 	struct list_head r_wait;
+	struct completion batch_op_completion;
 	struct completion r_completion;
 	struct completion r_safe_completion;
 	ceph_mds_request_callback_t r_callback;
@@ -411,6 +413,7 @@ extern struct ceph_mds_request *
 ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode);
 extern void ceph_mdsc_submit_request(struct ceph_mds_client *mdsc,
 				     struct ceph_mds_request *req);
+extern int ceph_mdsc_wait_for_request(struct ceph_mds_request* req);
 extern int ceph_mdsc_do_request(struct ceph_mds_client *mdsc,
 				struct inode *dir,
 				struct ceph_mds_request *req);
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 95a3b3ac9b6e..021fb7c1072c 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -1158,6 +1158,74 @@ static void __exit exit_ceph(void)
 	destroy_caches();
 }
 
+void __unregister_inode_getattr_or_lookup(struct ceph_inode_info* ci,
+					  struct ceph_mds_request* req,
+					  bool is_lookup)
+{
+	if (!is_lookup)
+		rb_erase(&req->getattr_node, &ci->getattrs_inflight);
+	else
+		rb_erase(&req->lookup_node, &ci->lookups_inflight);
+}
+
+void __register_inode_getattr_or_lookup(struct ceph_inode_info* ci,
+					struct ceph_mds_request* req,
+					bool is_lookup)
+{
+	struct rb_node **p = NULL, *parent = NULL;
+	struct ceph_mds_request *tmp = NULL;
+
+	if (!is_lookup)
+		p = &ci->getattrs_inflight.rb_node;
+	else
+		p = &ci->lookups_inflight.rb_node;
+
+	while (*p) {
+		parent = *p;
+		if (!is_lookup)
+			tmp = rb_entry(parent, struct ceph_mds_request, getattr_node);
+		else
+			tmp = rb_entry(parent, struct ceph_mds_request, lookup_node);
+		if (req->r_args.getattr.mask < tmp->r_args.getattr.mask)
+			p = &(*p)->rb_left;
+		else if (req->r_args.getattr.mask > tmp->r_args.getattr.mask)
+			p = &(*p)->rb_right;
+		else
+			BUG();
+	}
+
+	if (!is_lookup) {
+		rb_link_node(&req->getattr_node, parent, p);
+		rb_insert_color(&req->getattr_node, &ci->getattrs_inflight);
+	} else {
+		rb_link_node(&req->lookup_node, parent, p);
+		rb_insert_color(&req->lookup_node, &ci->lookups_inflight);
+	}
+}
+
+struct ceph_mds_request* __search_inode_getattr_or_lookup(struct rb_root* root,
+							   int mask,
+							   bool is_lookup)
+{
+	struct rb_node *node = root->rb_node;  /* top of the tree */
+
+	while (node) {
+		struct ceph_mds_request* tmp = NULL;
+
+		if (!is_lookup)
+			tmp = rb_entry(node, struct ceph_mds_request, getattr_node);
+		else
+			tmp = rb_entry(node, struct ceph_mds_request, lookup_node);
+
+		if (tmp->r_args.getattr.mask > mask)
+			node = node->rb_left;
+		else if (tmp->r_args.getattr.mask < mask)
+			node = node->rb_right;
+		else
+			return tmp;  /* Found it */
+	}
+	return NULL;
+}
+
 module_init(init_ceph);
 module_exit(exit_ceph);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index a7077a0c989f..d39234049e88 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -292,6 +292,8 @@ struct ceph_inode_info {
 	struct ceph_vino i_vino;   /* ceph ino + snap */
 
 	spinlock_t i_ceph_lock;
+	struct mutex getattrs_inflight_lock, lookups_inflight_lock;
+	struct rb_root getattrs_inflight, lookups_inflight;
 
 	u64 i_version;
 	u64 i_inline_version;
@@ -859,6 +861,17 @@ extern int ceph_fill_file_size(struct inode *inode, int issued,
 extern void ceph_fill_file_time(struct inode *inode, int issued,
 				u64 time_warp_seq, struct timespec *ctime,
 				struct timespec *mtime, struct timespec *atime);
+extern void __register_inode_getattr_or_lookup(struct ceph_inode_info* ci,
+					       struct ceph_mds_request* req,
+					       bool is_lookup);
+
+extern void __unregister_inode_getattr_or_lookup(struct ceph_inode_info* ci,
+						 struct ceph_mds_request* req,
+						 bool is_lookup);
+
+extern struct ceph_mds_request* __search_inode_getattr_or_lookup(struct rb_root* root,
+								  int mask,
+								  bool is_lookup);
 extern int ceph_fill_trace(struct super_block *sb,
 			   struct ceph_mds_request *req);
 extern int ceph_readdir_prepopulate(struct ceph_mds_request *req,
From patchwork Sun Oct 28 13:57:42 2018
X-Patchwork-Submitter: Xuehan Xu
X-Patchwork-Id: 10658637
From: xxhdx1985126@gmail.com
To: ceph-devel@vger.kernel.org
Cc: Xuehan Xu
Subject: [PATCH 2/2] ceph: aggregate ceph_sync_read requests
Date: Sun, 28 Oct 2018 21:57:42 +0800
Message-Id: <20181028135742.24668-2-xxhdx1985126@gmail.com>
X-Mailer: git-send-email 2.19.1
In-Reply-To: <20181028135742.24668-1-xxhdx1985126@gmail.com>
References: <20181028135742.24668-1-xxhdx1985126@gmail.com>
List-ID: ceph-devel@vger.kernel.org

From: Xuehan Xu <xxhdx1985126@gmail.com>

Currently, concurrent threads each issue their own file read request,
even when the range one thread asks for is fully contained in a
previously issued request that is still in flight. This commit makes
such requests wait for the earlier ones to finish and reuse their data,
saving the overhead of issuing duplicate reads.

Signed-off-by: Xuehan Xu <xxhdx1985126@gmail.com>
---
 fs/ceph/file.c  | 158 ++++++++++++++++++++++++++++++++++++++++++++++--
 fs/ceph/inode.c |   3 +
 fs/ceph/super.h |  29 ++++++++-
 3 files changed, 184 insertions(+), 6 deletions(-)
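A word on the range test this patch implements, before the diff. An
incoming read [start, end] piggybacks only on an in-flight read that
fully contains it. The patch keeps two interval trees: the normal one,
and a mirrored "suppliment" tree in which every range is reflected
through ULONG_MAX; a new op is registered in the mirrored tree when
another in-flight op shares its low endpoint. The containment test
itself, extracted into a self-contained sketch (plain C, with a linear
scan standing in for the kernel's interval_tree iterators; names are
illustrative):

    #include <limits.h>
    #include <stddef.h>

    struct range {
            unsigned long start, last;      /* inclusive, as in the patch */
            struct range *next;
    };

    /* return a registered range fully containing [start, last], if any */
    static struct range *find_covering(struct range *list,
                                       unsigned long start,
                                       unsigned long last)
    {
            struct range *r;

            for (r = list; r; r = r->next)
                    if (r->start <= start && r->last >= last)
                            return r;
            return NULL;
    }

    /*
     * The mirrored variant: a range [s, l] is stored in the second tree
     * as [ULONG_MAX - l, ULONG_MAX - s], so the same containment test
     * works with the query endpoints reflected the same way.
     */
    static struct range *find_covering_mirrored(struct range *mirrored,
                                                unsigned long start,
                                                unsigned long last)
    {
            return find_covering(mirrored,
                                 ULONG_MAX - last, ULONG_MAX - start);
    }

Containment survives the reflection because s <= start && l >= last is
equivalent to ULONG_MAX - l <= ULONG_MAX - last && ULONG_MAX - s >=
ULONG_MAX - start.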
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index ad0bed99b1d5..fb83c037a40f 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -626,6 +626,92 @@ static int striped_read(struct inode *inode,
 	return ret;
 }
 
+int ceph_wait_for_aggregated_read_op(struct ceph_aggregated_read_op* op)
+{
+	long timeleft = wait_for_completion_killable_timeout(&op->comp,
+					ceph_timeout_jiffies(op->timeout));
+	if (timeleft > 0)
+		return op->result;
+	else
+		return timeleft ? timeleft : -ETIMEDOUT;
+}
+
+bool find_previous_aggregated_read_op(struct ceph_inode_info* cinode,
+				      unsigned long start, unsigned long end,
+				      bool* repeated_low_endpoint,
+				      struct ceph_aggregated_read_op** ag_op)
+{
+	struct interval_tree_node* node_p =
+		interval_tree_iter_first(&cinode->aggregated_read_ops, start, end);
+	bool positive_found = false, negative_found = false;
+
+	while (node_p) {
+		if (node_p->start == start)
+			*repeated_low_endpoint = true;
+		if (node_p->start <= start &&
+		    node_p->last >= end) {
+			positive_found = true;
+			break;
+		}
+
+		node_p = interval_tree_iter_next(node_p, start, end);
+	}
+
+	dout("searched positive tree: found: %d\n", positive_found);
+
+	if (!positive_found) {
+		node_p = interval_tree_iter_first(&cinode->aggregated_read_ops_suppliment,
+						  ULONG_MAX - end,
+						  ULONG_MAX - start);
+		while (node_p) {
+			if (node_p->start <= ULONG_MAX - end &&
+			    node_p->last >= ULONG_MAX - start) {
+				negative_found = true;
+				break;
+			}
+			node_p = interval_tree_iter_next(node_p,
+							 ULONG_MAX - end,
+							 ULONG_MAX - start);
+		}
+	}
+
+	dout("searched negative tree: found: %d\n", negative_found);
+
+	if (positive_found)
+		*ag_op = container_of(node_p, struct ceph_aggregated_read_op, pos_node);
+	else if (negative_found)
+		*ag_op = container_of(node_p, struct ceph_aggregated_read_op, neg_node);
+
+	return positive_found || negative_found;
+}
+
+void register_aggregated_read_op(struct ceph_inode_info* cinode,
+				 struct ceph_aggregated_read_op* ag_op,
+				 bool suppliment)
+{
+	if (suppliment)
+		interval_tree_insert(&ag_op->neg_node, &cinode->aggregated_read_ops_suppliment);
+	else
+		interval_tree_insert(&ag_op->pos_node, &cinode->aggregated_read_ops);
+}
+
+void unregister_aggregated_read_op(struct ceph_inode_info* cinode,
+				   struct ceph_aggregated_read_op* ag_op,
+				   bool suppliment)
+{
+	if (suppliment)
+		interval_tree_remove(&ag_op->neg_node, &cinode->aggregated_read_ops_suppliment);
+	else
+		interval_tree_remove(&ag_op->pos_node, &cinode->aggregated_read_ops);
+}
+
+void ceph_put_aggregated_read_op(struct kref* kref)
+{
+	struct ceph_aggregated_read_op* ag_op = container_of(kref,
+					struct ceph_aggregated_read_op,
+					kref);
+	if (ag_op->num_pages)
+		ceph_release_page_vector(ag_op->pages, ag_op->num_pages);
+	kfree(ag_op);
+}
+
 /*
  * Completely synchronous read and write methods.  Direct from __user
  * buffer to osd, or directly to user pages (if O_DIRECT).
@@ -637,11 +723,15 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
+	struct ceph_inode_info* cinode = ceph_inode(inode);
 	struct page **pages;
 	u64 off = iocb->ki_pos;
 	int num_pages;
 	ssize_t ret;
 	size_t len = iov_iter_count(to);
+	bool found_previous_req = false;
+	bool repeated_low_endpoint = false;
+	struct ceph_aggregated_read_op *ag_op = NULL;
 
 	dout("sync_read on file %p %llu~%u %s\n", file, off, (unsigned)len,
 	     (file->f_flags & O_DIRECT) ? "O_DIRECT" : "");
@@ -676,24 +766,82 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
 			iov_iter_advance(to, 0);
 		}
 		ceph_put_page_vector(pages, num_pages, false);
+	} else if ((off + len) < (ULONG_MAX / 2)) {
+		mutex_lock(&cinode->aggregated_ops_lock);
+		dout("ceph_sync_read: trying to find previous aggregated read op, off: %lld, len: %ld.\n", off, len);
+		found_previous_req = find_previous_aggregated_read_op(cinode, off,
+				off + len, &repeated_low_endpoint, &ag_op);
+		if (found_previous_req) {
+			dout("ceph_sync_read: found previous aggregated read op, off: %lld, len: %ld.\n", off, len);
+			kref_get(&ag_op->kref);
+			mutex_unlock(&cinode->aggregated_ops_lock);
+			ret = ceph_wait_for_aggregated_read_op(ag_op);
+			dout("ceph_sync_read: waited aggregated read op, off: %lld, len: %ld.\n", off, len);
+		} else {
+			dout("ceph_sync_read: no previous aggregated read op, off: %lld, len: %ld.\n", off, len);
+			ag_op = kzalloc(sizeof(struct ceph_aggregated_read_op), GFP_KERNEL);
+			kref_init(&ag_op->kref);
+			ag_op->pos_node.start = off;
+			ag_op->pos_node.last = off + len;
+			ag_op->neg_node.start = ULONG_MAX - off - len;
+			ag_op->neg_node.last = ULONG_MAX - off;
+			init_completion(&ag_op->comp);
+			register_aggregated_read_op(cinode, ag_op, repeated_low_endpoint);
+			dout("ceph_sync_read: register new aggregated read op, off: %lld, len: %ld.\n", off, len);
+			mutex_unlock(&cinode->aggregated_ops_lock);
+
+			num_pages = calc_pages_for(off, len);
+			ag_op->pages = ceph_alloc_page_vector(num_pages, GFP_KERNEL);
+			if (IS_ERR(ag_op->pages))
+				return PTR_ERR(ag_op->pages);
+			ag_op->num_pages = num_pages;
+
+			ret = striped_read(inode, off, len, ag_op->pages, num_pages,
+					   (off & ~PAGE_MASK), checkeof);
+			dout("ceph_sync_read: aggregated read op striped_readed, off: %lld, len: %ld, ret: %ld.\n", off, len, ret);
+			ag_op->result = ret;
+			complete_all(&ag_op->comp);
+			mutex_lock(&cinode->aggregated_ops_lock);
+			unregister_aggregated_read_op(cinode, ag_op, repeated_low_endpoint);
+			mutex_unlock(&cinode->aggregated_ops_lock);
+			dout("ceph_sync_read: unregistered aggregated read op, off: %lld, len: %ld, ret: %ld.\n", off, len, ret);
+		}
+		if (ret > 0) {
+			int l, k = (off - ag_op->pos_node.start) >> PAGE_SHIFT;
+			size_t left = min_t(size_t, len, ret);
+
+			while (left) {
+				size_t page_off = off & ~PAGE_MASK;
+				size_t copy = min_t(size_t, left,
+						    PAGE_SIZE - page_off);
+				l = copy_page_to_iter(ag_op->pages[k++], page_off,
+						      copy, to);
+				off += l;
+				left -= l;
+				if (l < copy)
+					break;
+			}
+			dout("finished copy_page_to_iter: off: %lld, len: %ld\n", off, left);
+		}
+		kref_put(&ag_op->kref, ceph_put_aggregated_read_op);
 	} else {
 		num_pages = calc_pages_for(off, len);
 		pages = ceph_alloc_page_vector(num_pages, GFP_KERNEL);
 		if (IS_ERR(pages))
 			return PTR_ERR(pages);
-
+
 		ret = striped_read(inode, off, len, pages, num_pages,
-				(off & ~PAGE_MASK), checkeof);
+				   (off & ~PAGE_MASK), checkeof);
 		if (ret > 0) {
 			int l, k = 0;
 			size_t left = ret;
-
+
 			while (left) {
 				size_t page_off = off & ~PAGE_MASK;
 				size_t copy = min_t(size_t, left,
-						PAGE_SIZE - page_off);
+						    PAGE_SIZE - page_off);
 				l = copy_page_to_iter(pages[k++], page_off,
-						copy, to);
+						      copy, to);
 				off += l;
 				left -= l;
 				if (l < copy)
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index c51e2f186139..21b9bac2d8bb 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -432,6 +432,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
 	spin_lock_init(&ci->i_ceph_lock);
 	mutex_init(&ci->getattrs_inflight_lock);
 	mutex_init(&ci->lookups_inflight_lock);
+	mutex_init(&ci->aggregated_ops_lock);
 
 	ci->i_version = 0;
 	ci->i_inline_version = 0;
@@ -465,6 +466,8 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
 	ci->i_caps = RB_ROOT;
 	ci->getattrs_inflight = RB_ROOT;
 	ci->lookups_inflight = RB_ROOT;
+	ci->aggregated_read_ops = RB_ROOT_CACHED;
+	ci->aggregated_read_ops_suppliment = RB_ROOT_CACHED;
 	ci->i_auth_cap = NULL;
 	ci->i_dirty_caps = 0;
 	ci->i_flushing_caps = 0;
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index d39234049e88..811f2ab83331 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -16,6 +16,7 @@
 #include <linux/slab.h>
 #include <linux/posix_acl.h>
 #include <linux/refcount.h>
+#include <linux/interval_tree.h>
 
 #include <linux/ceph/libceph.h>
@@ -285,6 +286,31 @@ struct ceph_inode_xattrs_info {
 	u64 version, index_version;
 };
 
+struct ceph_aggregated_read_op {
+	struct kref kref;
+	struct page** pages;
+	int num_pages;
+	unsigned long timeout;
+	int result;
+	struct interval_tree_node pos_node, neg_node;
+	struct completion comp;
+};
+
+extern void ceph_put_aggregated_read_op(struct kref* kref);
+
+extern bool find_previous_aggregated_read_op(struct ceph_inode_info* cinode,
+					     unsigned long start, unsigned long end,
+					     bool* repeated_low_endpoint,
+					     struct ceph_aggregated_read_op** ag_op);
+
+extern void register_aggregated_read_op(struct ceph_inode_info* cinode,
+					struct ceph_aggregated_read_op* ag_op,
+					bool suppliment);
+
+extern void unregister_aggregated_read_op(struct ceph_inode_info* cinode,
+					  struct ceph_aggregated_read_op* ag_op,
+					  bool suppliment);
+
 /*
  * Ceph inode.
  */
@@ -292,8 +318,9 @@ struct ceph_inode_info {
 	struct ceph_vino i_vino;   /* ceph ino + snap */
 
 	spinlock_t i_ceph_lock;
-	struct mutex getattrs_inflight_lock, lookups_inflight_lock;
+	struct mutex getattrs_inflight_lock, lookups_inflight_lock, aggregated_ops_lock;
 	struct rb_root getattrs_inflight, lookups_inflight;
+	struct rb_root_cached aggregated_read_ops, aggregated_read_ops_suppliment;
 
 	u64 i_version;
 	u64 i_inline_version;
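For what it's worth, the access pattern the series targets can be
exercised from userspace with a toy program like the one below. The
mount path and sizes are made up, and whether these reads actually take
the ceph_sync_read path depends on the caps the client holds (Fc may be
revoked when the file is open from several clients; O_DIRECT reads take
a different branch), so treat this as an illustration of the workload,
not a guaranteed reproducer:

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    struct rd { off_t off; size_t len; };

    static int fd;

    static void *reader(void *arg)
    {
            struct rd *r = arg;
            char *buf = malloc(r->len);
            ssize_t n = pread(fd, buf, r->len, r->off);

            printf("read %zd bytes at %lld\n", n, (long long)r->off);
            free(buf);
            return NULL;
    }

    int main(void)
    {
            /* the second range [4096, 12288) sits entirely inside the
             * first [0, 16384), so with this series the second reader
             * can wait on the first reader's in-flight read instead of
             * issuing its own */
            struct rd a = { 0, 16384 }, b = { 4096, 8192 };
            pthread_t t1, t2;

            fd = open("/mnt/cephfs/testfile", O_RDONLY); /* hypothetical path */
            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            pthread_create(&t1, NULL, reader, &a);
            pthread_create(&t2, NULL, reader, &b);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            return close(fd);
    }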