From patchwork Thu May 21 19:23:50 2015
X-Patchwork-Submitter: Jerome Glisse
X-Patchwork-Id: 6458391
From: j.glisse@gmail.com
To: akpm@linux-foundation.org
Cc: Linus Torvalds, Mel Gorman, "H. Peter Anvin", Peter Zijlstra,
    Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
    Dave Airlie, Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
    Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
    Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel,
    Liran Liss, Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
    Michael Mantor, Paul Blinzer, Laurent Morichetti, Alexander Deucher,
    Oded Gabbay, Jérôme Glisse
Subject: [PATCH 33/36] IB/odp/hmm: add core infiniband structure and helper for ODP with HMM.
Date: Thu, 21 May 2015 15:23:50 -0400
Message-Id: <1432236233-4035-34-git-send-email-j.glisse@gmail.com>
X-Mailer: git-send-email 1.8.3.1
In-Reply-To: <1432236233-4035-1-git-send-email-j.glisse@gmail.com>
References: <1432236233-4035-1-git-send-email-j.glisse@gmail.com>

From: Jérôme Glisse

This adds new core infiniband structures and helpers to implement ODP
(on demand paging) on top of HMM. We need to retain the tree of ib_umem
because some hardware associates a unique identifier with each umem (or
MR) and only allows its hardware page table to be updated using this
unique id.

Signed-off-by: Jérôme Glisse
Signed-off-by: John Hubbard
cc:
---
 drivers/infiniband/core/umem_odp.c    | 148 +++++++++++++++++++++++++++++++++-
 drivers/infiniband/core/uverbs_cmd.c  |   6 +-
 drivers/infiniband/core/uverbs_main.c |   6 ++
 include/rdma/ib_umem_odp.h            |  28 ++++++-
 include/rdma/ib_verbs.h               |  17 +++-
 5 files changed, 199 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index e55e124..d5d57a8 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -41,9 +41,155 @@
 #include
 #include
+
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+
+
+static void ib_mirror_destroy(struct kref *kref)
+{
+	struct ib_mirror *ib_mirror;
+	struct ib_device *ib_device;
+
+	ib_mirror = container_of(kref, struct ib_mirror, kref);
+	hmm_mirror_unregister(&ib_mirror->base);
+
+	ib_device = ib_mirror->ib_device;
+	mutex_lock(&ib_device->hmm_mutex);
+	list_del_init(&ib_mirror->list);
+	mutex_unlock(&ib_device->hmm_mutex);
+	kfree(ib_mirror);
+}
+
+void ib_mirror_unref(struct ib_mirror *ib_mirror)
+{
+	if (ib_mirror == NULL)
+		return;
+
+	kref_put(&ib_mirror->kref, ib_mirror_destroy);
+}
+EXPORT_SYMBOL(ib_mirror_unref);
+
+static inline struct ib_mirror *ib_mirror_ref(struct ib_mirror *ib_mirror)
+{
+	if (!ib_mirror || !kref_get_unless_zero(&ib_mirror->kref))
+		return NULL;
+	return ib_mirror;
+}
+
+int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
+{
+	struct mm_struct *mm = get_task_mm(current);
+	struct ib_device *ib_device = context->device;
+	struct ib_mirror *ib_mirror;
+	struct pid *our_pid;
+	int ret;
+
+	if (!mm)
+		return -EINVAL;
+
+	if (!ib_device->hmm_ready) {
+		mmput(mm);
+		return -EINVAL;
+	}
+
+	/* FIXME can this really happen ? */
+	if (unlikely(ib_umem_start(umem) == ib_umem_end(umem))) {
+		mmput(mm);
+		return -EINVAL;
+	}
+
+	/* Prevent creating ODP MRs in child processes */
+	rcu_read_lock();
+	our_pid = get_task_pid(current->group_leader, PIDTYPE_PID);
+	rcu_read_unlock();
+	put_pid(our_pid);
+	if (context->tgid != our_pid) {
+		mmput(mm);
+		return -EINVAL;
+	}
+
+	umem->hugetlb = 0;
+	umem->odp_data = kmalloc(sizeof(*umem->odp_data), GFP_KERNEL);
+	if (umem->odp_data == NULL) {
+		mmput(mm);
+		return -ENOMEM;
+	}
+	umem->odp_data->private = NULL;
+	umem->odp_data->umem = umem;
+
+	mutex_lock(&ib_device->hmm_mutex);
+	/* Is there an existing mirror for this process mm ? */
+	ib_mirror = ib_mirror_ref(context->ib_mirror);
+	if (!ib_mirror) {
+		struct ib_mirror *tmp;
+
+		list_for_each_entry(tmp, &ib_device->ib_mirrors, list) {
+			if (tmp->base.hmm->mm != mm)
+				continue;
+			ib_mirror = ib_mirror_ref(tmp);
+			break;
+		}
+	}
+
+	if (ib_mirror == NULL) {
+		/* We need to create a new mirror. */
+		ib_mirror = kmalloc(sizeof(*ib_mirror), GFP_KERNEL);
+		if (ib_mirror == NULL) {
+			mutex_unlock(&ib_device->hmm_mutex);
+			mmput(mm);
+			return -ENOMEM;
+		}
+		kref_init(&ib_mirror->kref);
+		init_rwsem(&ib_mirror->umem_rwsem);
+		ib_mirror->umem_tree = RB_ROOT;
+		ib_mirror->ib_device = ib_device;
+
+		ib_mirror->base.device = &ib_device->hmm_dev;
+		ret = hmm_mirror_register(&ib_mirror->base);
+		if (ret) {
+			mutex_unlock(&ib_device->hmm_mutex);
+			kfree(ib_mirror);
+			mmput(mm);
+			return ret;
+		}
+
+		list_add(&ib_mirror->list, &ib_device->ib_mirrors);
+		context->ib_mirror = ib_mirror_ref(ib_mirror);
+	}
+	mutex_unlock(&ib_device->hmm_mutex);
+	umem->odp_data->ib_mirror = ib_mirror;
+
+	down_write(&ib_mirror->umem_rwsem);
+	rbt_ib_umem_insert(&umem->odp_data->interval_tree, &ib_mirror->umem_tree);
+	up_write(&ib_mirror->umem_rwsem);
+
+	mmput(mm);
+	return 0;
+}
+
+void ib_umem_odp_release(struct ib_umem *umem)
+{
+	struct ib_mirror *ib_mirror = umem->odp_data->ib_mirror;
+
+	/*
+	 * Ensure that no more pages are mapped in the umem.
+	 *
+	 * It is the driver's responsibility to ensure, before calling us,
+	 * that the hardware will not attempt to access the MR any more.
+	 */
+
+	/* One optimization to release resources early here would be to call:
+	 *	hmm_mirror_range_discard(&ib_mirror->base,
+	 *				 ib_umem_start(umem),
+	 *				 ib_umem_end(umem));
+	 * But umems can overlap, so we would need to discard only the range
+	 * covered by one and only one umem while holding the umem rwsem.
+ */ + down_write(&ib_mirror->umem_rwsem); + rbt_ib_umem_remove(&umem->odp_data->interval_tree, &mirror->umem_tree); + up_write(&ib_mirror->umem_rwsem); + + ib_mirror_unref(ib_mirror); + kfree(umem->odp_data); + kfree(umem); +} + + #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ + + static void ib_umem_notifier_start_account(struct ib_umem *item) { mutex_lock(&item->odp_data->umem_mutex); diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index ccd6bbe..3225ab5 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -337,7 +337,9 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, ucontext->closing = 0; #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING -#ifndef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM +#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM + ucontext->ib_mirror = NULL; +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ ucontext->umem_tree = RB_ROOT; init_rwsem(&ucontext->umem_rwsem); ucontext->odp_mrs_count = 0; @@ -348,7 +350,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, goto err_free; if (!(dev_attr.device_cap_flags & IB_DEVICE_ON_DEMAND_PAGING)) ucontext->invalidate_range = NULL; -#endif /* !CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ +#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */ resp.num_comp_vectors = file->device->num_comp_vectors; diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index 88cce9b..3f069d7 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -45,6 +45,7 @@ #include #include #include +#include #include @@ -297,6 +298,11 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, kfree(uobj); } +#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM + ib_mirror_unref(context->ib_mirror); + context->ib_mirror = NULL; +#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ + put_pid(context->tgid); return context->device->dealloc_ucontext(context); diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h index 765aeb3..c7c2670 100644 --- a/include/rdma/ib_umem_odp.h +++ b/include/rdma/ib_umem_odp.h @@ -37,6 +37,32 @@ #include #include +#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM +/* struct ib_mirror - per process mirror structure for infiniband driver. + * + * @ib_device: Infiniband device this mirror is associated with. + * @base: The hmm base mirror struct. + * @kref: Refcount for the structure. + * @list: For the list of ib_mirror of a given ib_device. + * @umem_tree: Red black tree of ib_umem ordered by virtual address. + * @umem_rwsem: Semaphore protecting the reb black tree. + * + * Because ib_ucontext struct is tie to file descriptor there can be several of + * them for a same process, which violate HMM requirement. Hence we create only + * one ib_mirror struct per process and have each ib_umem struct reference it. + */ +struct ib_mirror { + struct ib_device *ib_device; + struct hmm_mirror base; + struct kref kref; + struct list_head list; + struct rb_root umem_tree; + struct rw_semaphore umem_rwsem; +}; + +void ib_mirror_unref(struct ib_mirror *ib_mirror); +#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ + struct umem_odp_node { u64 __subtree_last; struct rb_node rb; @@ -44,7 +70,7 @@ struct umem_odp_node { struct ib_umem_odp { #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM -#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !" 
+	struct ib_mirror *ib_mirror;
 #else
 	/*
 	 * An array of the pages included in the on-demand paging umem.
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 7b00d30..83da1bd 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -49,6 +49,9 @@
 #include
 #include
 #include
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#include
+#endif
 
 #include
 #include
 
@@ -1157,7 +1160,9 @@ struct ib_ucontext {
 	struct pid *tgid;
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
-#ifndef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+	struct ib_mirror *ib_mirror;
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 	struct rb_root umem_tree;
 	/*
 	 * Protects .umem_rbroot and tree, as well as odp_mrs_count and
@@ -1172,7 +1177,7 @@ struct ib_ucontext {
 	/* A list of umems that don't have private mmu notifier counters yet. */
 	struct list_head no_private_counters;
 	int odp_mrs_count;
-#endif /* !CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 };
 
@@ -1657,6 +1662,14 @@ struct ib_device {
 
 	struct ib_dma_mapping_ops *dma_ops;
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+	/* For ODP using HMM. */
+	struct hmm_device hmm_dev;
+	struct list_head ib_mirrors;
+	struct mutex hmm_mutex;
+	bool hmm_ready;
+#endif
+
 	struct module *owner;
 	struct device dev;
 	struct kobject *ports_parent;
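
Two small sketches for reviewers follow; they are not part of the patch.

The patch adds hmm_dev, ib_mirrors, hmm_mutex and hmm_ready to struct
ib_device but does not show who initializes them. Below is a minimal sketch
of what a low-level driver might do at device setup time. The function name
example_setup_odp_hmm() is hypothetical, and the actual registration of
ib_device->hmm_dev with HMM belongs to the HMM side of the series, so it is
only indicated by a comment.

#include <rdma/ib_verbs.h>

/*
 * Hypothetical sketch: prepare the ODP-with-HMM fields that this patch
 * adds to struct ib_device. ib_umem_odp_get() refuses to run until
 * hmm_ready is set, so a driver would only set it once registration of
 * the hmm_device has succeeded.
 */
static int example_setup_odp_hmm(struct ib_device *ib_device)
{
	INIT_LIST_HEAD(&ib_device->ib_mirrors);
	mutex_init(&ib_device->hmm_mutex);

	/* ... register ib_device->hmm_dev with HMM here (see the HMM
	 * patches of this series for the actual interface) ... */

	ib_device->hmm_ready = true;
	return 0;
}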
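
The reason given above for keeping the per-umem tree is that some hardware
ties a unique id to each umem/MR and can only update its page tables per
umem. The sketch below shows how a driver could walk a mirror's tree when a
range of the mirrored address space is invalidated. The entry point
example_mirror_range_invalidate() and the per-umem callback are made up for
the illustration, and whether it is driven from an HMM mirror callback or
from the driver's own invalidation path is up to the driver; struct
ib_mirror with its umem_rwsem and umem_tree fields, and
rbt_ib_umem_for_each_in_range(), are the pieces provided by this patch and
the existing ODP code.

#include <rdma/ib_umem_odp.h>

/* Per-umem callback: a real driver would use the unique id the hardware
 * associated with this umem/MR to update its hardware page table. */
static int example_invalidate_one(struct ib_umem *umem, u64 start,
				  u64 end, void *cookie)
{
	/* driver-specific hardware page table update goes here */
	return 0;
}

static void example_mirror_range_invalidate(struct ib_mirror *ib_mirror,
					    u64 start, u64 end)
{
	/* Readers of the tree can take the rwsem shared; ib_umem_odp_get()
	 * and ib_umem_odp_release() take it exclusive for insert/remove. */
	down_read(&ib_mirror->umem_rwsem);
	rbt_ib_umem_for_each_in_range(&ib_mirror->umem_tree, start, end,
				      example_invalidate_one, NULL);
	up_read(&ib_mirror->umem_rwsem);
}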