Message ID: 1441658303-18081-6-git-send-email-pandit.parav@gmail.com (mailing list archive)
State: Changes Requested
On 07/09/2015 23:38, Parav Pandit wrote:
> +/* RDMA resources from device cgroup perspective */
> +enum devcgroup_rdma_rt {
> +	DEVCG_RDMA_RES_TYPE_UCTX,
> +	DEVCG_RDMA_RES_TYPE_CQ,
> +	DEVCG_RDMA_RES_TYPE_PD,
> +	DEVCG_RDMA_RES_TYPE_AH,
> +	DEVCG_RDMA_RES_TYPE_MR,
> +	DEVCG_RDMA_RES_TYPE_MW,

I didn't see memory windows in dev_cgroup_files in patch 3. Is it used?

> +	DEVCG_RDMA_RES_TYPE_SRQ,
> +	DEVCG_RDMA_RES_TYPE_QP,
> +	DEVCG_RDMA_RES_TYPE_FLOW,
> +	DEVCG_RDMA_RES_TYPE_MAX,
> +};

> +struct devcgroup_rdma_tracker {
> +	int limit;
> +	atomic_t usage;
> +	int failcnt;
> +};

Have you considered using struct res_counter?

> + * RDMA resource limits are hierarchical, so the highest configured limit of
> + * the hierarchy is enforced. Allowing resource limit configuration to default
> + * cgroup allows fair share to kernel space ULPs as well.

In what way is the highest configured limit of the hierarchy enforced? I
would expect all the limits along the hierarchy to be enforced.

> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
> +{
> +	struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
> +	int type = seq_cft(sf)->private;
> +	u32 usage;
> +
> +	if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
> +		seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
> +	} else {
> +		usage = dev_cg->rdma.tracker[type].limit;

If this is the resource limit, don't name it 'usage'.

> +		seq_printf(sf, "%u\n", usage);
> +	}
> +	return 0;
> +}

> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
> +{
> +	struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
> +	int type = seq_cft(sf)->private;
> +	u32 usage;
> +
> +	if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
> +		seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);

I'm not sure hiding the actual number is good, especially in the
show_usage case.

> +	} else {
> +		usage = dev_cg->rdma.tracker[type].limit;
> +		seq_printf(sf, "%u\n", usage);
> +	}
> +	return 0;
> +}

> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
> +				      enum devcgroup_rdma_rt type, int num)
> +{
> +	struct dev_cgroup *dev_cg, *p;
> +	struct task_struct *ctx_task;
> +
> +	if (!num)
> +		return;
> +
> +	/* get cgroup of ib_ucontext it belong to, to uncharge
> +	 * so that when its called from any worker tasks or any
> +	 * other tasks to which this resource doesn't belong to,
> +	 * it can be uncharged correctly.
> +	 */
> +	if (ucontext)
> +		ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
> +	else
> +		ctx_task = current;
> +	dev_cg = task_devcgroup(ctx_task);
> +
> +	spin_lock(&ctx_task->rdma_res_counter->lock);

Don't you need an rcu read lock and rcu_dereference to access
rdma_res_counter?

> +	ctx_task->rdma_res_counter->usage[type] -= num;
> +
> +	for (p = dev_cg; p; p = parent_devcgroup(p))
> +		uncharge_resource(p, type, num);
> +
> +	spin_unlock(&ctx_task->rdma_res_counter->lock);
> +
> +	if (type == DEVCG_RDMA_RES_TYPE_UCTX)
> +		rdma_free_res_counter(ctx_task);
> +}
> +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);

> +int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num)
> +{
> +	struct dev_cgroup *dev_cg = task_devcgroup(current);
> +	struct task_rdma_res_counter *res_cnt = current->rdma_res_counter;
> +	int status;
> +
> +	if (!res_cnt) {
> +		res_cnt = kzalloc(sizeof(*res_cnt), GFP_KERNEL);
> +		if (!res_cnt)
> +			return -ENOMEM;
> +
> +		spin_lock_init(&res_cnt->lock);
> +		rcu_assign_pointer(current->rdma_res_counter, res_cnt);

Don't you need the task lock to update rdma_res_counter here?

> +	}
> +
> +	/* synchronize with migration task by taking lock, to avoid
> +	 * race condition of performing cgroup resource migration
> +	 * in non atomic way with this task, which can leads to leaked
> +	 * resources in older cgroup.
> +	 */
> +	spin_lock(&res_cnt->lock);
> +	status = try_charge_resource(dev_cg, type, num);
> +	if (status)
> +		goto busy;
> +
> +	/* single task updating its rdma resource usage, so atomic is
> +	 * not required.
> +	 */
> +	current->rdma_res_counter->usage[type] += num;
> +
> +busy:
> +	spin_unlock(&res_cnt->lock);
> +	return status;
> +}
> +EXPORT_SYMBOL(devcgroup_rdma_try_charge_resource);

Regards,
Haggai
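For reference, the RCU read-side pattern the last two questions point at
would look roughly like this (a sketch only; it assumes the series marks
task_struct's rdma_res_counter pointer __rcu, which the posted patch does
not yet do):

	rcu_read_lock();
	res_cnt = rcu_dereference(ctx_task->rdma_res_counter);
	if (res_cnt) {
		/* spinlock inside the RCU read section is fine;
		 * it does not sleep.
		 */
		spin_lock(&res_cnt->lock);
		res_cnt->usage[type] -= num;
		spin_unlock(&res_cnt->lock);
	}
	rcu_read_unlock();

The read-side critical section guarantees the counter structure cannot be
freed underneath the reader as long as the freeing path uses
synchronize_rcu() or kfree_rcu(), as rdma_free_res_counter() in this patch
already attempts to do.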
On 07/09/2015 23:38, Parav Pandit wrote:
> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
> +				      enum devcgroup_rdma_rt type, int num)
> +{
> +	struct dev_cgroup *dev_cg, *p;
> +	struct task_struct *ctx_task;
> +
> +	if (!num)
> +		return;
> +
> +	/* get cgroup of ib_ucontext it belong to, to uncharge
> +	 * so that when its called from any worker tasks or any
> +	 * other tasks to which this resource doesn't belong to,
> +	 * it can be uncharged correctly.
> +	 */
> +	if (ucontext)
> +		ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
> +	else
> +		ctx_task = current;

So what happens if a process creates a ucontext, forks, and then the
child creates and destroys a CQ? If I understand correctly, created
resources are always charged to the current process (the child), but
when it is destroyed the owner of the ucontext (the parent) will be
uncharged.

Since ucontexts are not meant to be used by multiple processes, I think
it would be okay to always charge the owner process (the one that
created the ucontext).

> +	dev_cg = task_devcgroup(ctx_task);
> +
> +	spin_lock(&ctx_task->rdma_res_counter->lock);
> +	ctx_task->rdma_res_counter->usage[type] -= num;
> +
> +	for (p = dev_cg; p; p = parent_devcgroup(p))
> +		uncharge_resource(p, type, num);
> +
> +	spin_unlock(&ctx_task->rdma_res_counter->lock);
> +
> +	if (type == DEVCG_RDMA_RES_TYPE_UCTX)
> +		rdma_free_res_counter(ctx_task);
> +}
> +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);
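Always charging the ucontext owner could be done by resolving the owner
task once and using it on both the charge and uncharge paths. A sketch,
with a hypothetical helper (rdma_res_owner() is not in the posted series;
note that get_pid_task() takes a task reference which the caller must drop
with put_task_struct()):

	/* hypothetical helper: pick the task to charge/uncharge.
	 * Returns a referenced task; caller must put_task_struct() it.
	 */
	static struct task_struct *rdma_res_owner(struct ib_ucontext *ucontext)
	{
		if (ucontext) {
			struct task_struct *owner =
				get_pid_task(ucontext->tgid, PIDTYPE_PID);
			if (owner)
				return owner;
		}
		get_task_struct(current);
		return current;
	}

With this, a forked child creating or destroying a CQ would charge and
uncharge the same task (the ucontext creator), avoiding the parent/child
imbalance described above.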
On Tue, Sep 8, 2015 at 1:52 PM, Haggai Eran <haggaie@mellanox.com> wrote:
> On 07/09/2015 23:38, Parav Pandit wrote:
>> +/* RDMA resources from device cgroup perspective */
>> +enum devcgroup_rdma_rt {
>> +	DEVCG_RDMA_RES_TYPE_UCTX,
>> +	DEVCG_RDMA_RES_TYPE_CQ,
>> +	DEVCG_RDMA_RES_TYPE_PD,
>> +	DEVCG_RDMA_RES_TYPE_AH,
>> +	DEVCG_RDMA_RES_TYPE_MR,
>> +	DEVCG_RDMA_RES_TYPE_MW,
> I didn't see memory windows in dev_cgroup_files in patch 3. Is it used?

ib_uverbs_dereg_mr() needs a fix in my patch for MW, and alloc_mw() also
needs to use it. I will fix it.

>> +	DEVCG_RDMA_RES_TYPE_SRQ,
>> +	DEVCG_RDMA_RES_TYPE_QP,
>> +	DEVCG_RDMA_RES_TYPE_FLOW,
>> +	DEVCG_RDMA_RES_TYPE_MAX,
>> +};
>
>> +struct devcgroup_rdma_tracker {
>> +	int limit;
>> +	atomic_t usage;
>> +	int failcnt;
>> +};
> Have you considered using struct res_counter?

No. I will look into the structure and see whether it fits or not.

>
>> + * RDMA resource limits are hierarchical, so the highest configured limit of
>> + * the hierarchy is enforced. Allowing resource limit configuration to default
>> + * cgroup allows fair share to kernel space ULPs as well.
> In what way is the highest configured limit of the hierarchy enforced? I
> would expect all the limits along the hierarchy to be enforced.
>

In a hierarchy of, say, 3 cgroups, the smallest limit in the hierarchy
is applied.

Let's take an example to clarify.
Say cg_A, cg_B, cg_C:

Role          name                    limit
Parent        cg_A                    100
Child_level1  cg_B (child of cg_A)    20
Child_level2  cg_C (child of cg_B)    50

If the process allocating the rdma resource belongs to cg_C, the lowest
limit in the hierarchy (20, from cg_B) is applied during the charge()
stage. If cg_A's limit happens to be 10, since 10 is the lowest, its
limit would be applicable, as you expected. This is functionally similar
to the newly added PID subsystem.

>> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
>> +{
>> +	struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
>> +	int type = seq_cft(sf)->private;
>> +	u32 usage;
>> +
>> +	if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
>> +		seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
>> +	} else {
>> +		usage = dev_cg->rdma.tracker[type].limit;
> If this is the resource limit, don't name it 'usage'.
>

O.k. This is a typo I made copying from the usage show function. I will
change it.

>> +		seq_printf(sf, "%u\n", usage);
>> +	}
>> +	return 0;
>> +}
>
>> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
>> +{
>> +	struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
>> +	int type = seq_cft(sf)->private;
>> +	u32 usage;
>> +
>> +	if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
>> +		seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
> I'm not sure hiding the actual number is good, especially in the
> show_usage case.

This follows other controllers, the same as the newly added PID
subsystem, in showing the max limit.

>
>> +	} else {
>> +		usage = dev_cg->rdma.tracker[type].limit;
>> +		seq_printf(sf, "%u\n", usage);
>> +	}
>> +	return 0;
>> +}
>
>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>> +				      enum devcgroup_rdma_rt type, int num)
>> +{
>> +	struct dev_cgroup *dev_cg, *p;
>> +	struct task_struct *ctx_task;
>> +
>> +	if (!num)
>> +		return;
>> +
>> +	/* get cgroup of ib_ucontext it belong to, to uncharge
>> +	 * so that when its called from any worker tasks or any
>> +	 * other tasks to which this resource doesn't belong to,
>> +	 * it can be uncharged correctly.
>> +	 */
>> +	if (ucontext)
>> +		ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>> +	else
>> +		ctx_task = current;
>> +	dev_cg = task_devcgroup(ctx_task);
>> +
>> +	spin_lock(&ctx_task->rdma_res_counter->lock);
> Don't you need an rcu read lock and rcu_dereference to access
> rdma_res_counter?

I believe it's not required, because uncharge() can happen only from 3
contexts:
(a) from the caller task context, which made the allocation call, so no
synchronization is needed.
(b) from the resource dealloc context; again this is the same task
context which allocated it, so it is single threaded and there is no
need to synchronize.
(c) from the fput() context when the process is terminated abruptly or
as part of deferred cleanup; when this is happening there cannot be an
allocator task anyway.

>
>> +	ctx_task->rdma_res_counter->usage[type] -= num;
>> +
>> +	for (p = dev_cg; p; p = parent_devcgroup(p))
>> +		uncharge_resource(p, type, num);
>> +
>> +	spin_unlock(&ctx_task->rdma_res_counter->lock);
>> +
>> +	if (type == DEVCG_RDMA_RES_TYPE_UCTX)
>> +		rdma_free_res_counter(ctx_task);
>> +}
>> +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);
>
>> +int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num)
>> +{
>> +	struct dev_cgroup *dev_cg = task_devcgroup(current);
>> +	struct task_rdma_res_counter *res_cnt = current->rdma_res_counter;
>> +	int status;
>> +
>> +	if (!res_cnt) {
>> +		res_cnt = kzalloc(sizeof(*res_cnt), GFP_KERNEL);
>> +		if (!res_cnt)
>> +			return -ENOMEM;
>> +
>> +		spin_lock_init(&res_cnt->lock);
>> +		rcu_assign_pointer(current->rdma_res_counter, res_cnt);
> Don't you need the task lock to update rdma_res_counter here?
>

No. This is the caller task allocating it, so it's single threaded. It
needs to synchronize with the migration thread, which reads the counters
of all the processes while they are getting allocated and freed.
Therefore RCU is sufficient.

>> +	}
>> +
>> +	/* synchronize with migration task by taking lock, to avoid
>> +	 * race condition of performing cgroup resource migration
>> +	 * in non atomic way with this task, which can leads to leaked
>> +	 * resources in older cgroup.
>> +	 */
>> +	spin_lock(&res_cnt->lock);
>> +	status = try_charge_resource(dev_cg, type, num);
>> +	if (status)
>> +		goto busy;
>> +
>> +	/* single task updating its rdma resource usage, so atomic is
>> +	 * not required.
>> +	 */
>> +	current->rdma_res_counter->usage[type] += num;
>> +
>> +busy:
>> +	spin_unlock(&res_cnt->lock);
>> +	return status;
>> +}
>> +EXPORT_SYMBOL(devcgroup_rdma_try_charge_resource);
>
> Regards,
> Haggai
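The hierarchy example above can be made concrete with a toy model of the
charge walk. This is a userspace sketch (not kernel code) that mirrors
try_charge_resource() from this patch, using the cg_A/cg_B/cg_C limits
from the example:

	#include <stdio.h>

	struct cg { const char *name; int limit, usage; struct cg *parent; };

	static int try_charge(struct cg *cg, int num)
	{
		struct cg *p, *q;

		/* charge every level; fail if any level exceeds its limit */
		for (p = cg; p; p = p->parent) {
			p->usage += num;
			if (p->usage > p->limit)
				goto revert;
		}
		return 0;
	revert:
		/* roll back the levels already charged, including p */
		for (q = cg; q != p; q = q->parent)
			q->usage -= num;
		q->usage -= num;
		return -1;
	}

	int main(void)
	{
		struct cg a = { "cg_A", 100, 0, NULL };
		struct cg b = { "cg_B",  20, 0, &a };
		struct cg c = { "cg_C",  50, 0, &b };
		int i, ok = 0;

		/* 20 charges succeed; the 21st trips cg_B's limit of 20 */
		for (i = 0; i < 25; i++)
			if (!try_charge(&c, 1))
				ok++;
		printf("charged %d (cg_B's limit of 20 is effective)\n", ok);
		return 0;
	}

Because every ancestor is charged, the smallest limit along the path is
the one that bites, which is the behavior Parav describes.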
On Tue, Sep 8, 2015 at 2:06 PM, Haggai Eran <haggaie@mellanox.com> wrote:
> On 07/09/2015 23:38, Parav Pandit wrote:
>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>> +				      enum devcgroup_rdma_rt type, int num)
>> +{
>> +	struct dev_cgroup *dev_cg, *p;
>> +	struct task_struct *ctx_task;
>> +
>> +	if (!num)
>> +		return;
>> +
>> +	/* get cgroup of ib_ucontext it belong to, to uncharge
>> +	 * so that when its called from any worker tasks or any
>> +	 * other tasks to which this resource doesn't belong to,
>> +	 * it can be uncharged correctly.
>> +	 */
>> +	if (ucontext)
>> +		ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>> +	else
>> +		ctx_task = current;
> So what happens if a process creates a ucontext, forks, and then the
> child creates and destroys a CQ? If I understand correctly, created
> resources are always charged to the current process (the child), but
> when it is destroyed the owner of the ucontext (the parent) will be
> uncharged.
>
> Since ucontexts are not meant to be used by multiple processes, I think
> it would be okay to always charge the owner process (the one that
> created the ucontext).

I need to think about it. I would like to avoid keeping per-task
resource counters, for two reasons. For a while I thought that native
fork() doesn't take care to share the RDMA resources and all the CQ, QP
DMA-able memory from a PID namespace perspective.

1. It could well happen that a process and its child process are created
in PID namespace_A, after which the child is migrated to a new PID
namespace_B, after which the parent from namespace_A is terminated. I am
not sure how the ucontext ownership changes from parent to child process
at that point today. I prefer to keep this complexity out, if it exists
at all, as process migration across namespaces is not a frequent event
to optimize the code for.

2. Having a per-task counter (at the cost of some memory) allows
avoiding atomics during charge() and uncharge().

The intent is for each task (process and thread) to have its own
resource counter instance, but I can see that it's broken where it
charges the parent process without atomics as of now. Since, as you
said, it's ok to always charge the owner process, I either have to relax
the 2nd requirement and fall back to using atomics for charge() and
uncharge(), or I have to get rid of ucontext from the uncharge() API,
which is difficult due to fput() being in a worker thread context.

>
>> +	dev_cg = task_devcgroup(ctx_task);
>> +
>> +	spin_lock(&ctx_task->rdma_res_counter->lock);
>> +	ctx_task->rdma_res_counter->usage[type] -= num;
>> +
>> +	for (p = dev_cg; p; p = parent_devcgroup(p))
>> +		uncharge_resource(p, type, num);
>> +
>> +	spin_unlock(&ctx_task->rdma_res_counter->lock);
>> +
>> +	if (type == DEVCG_RDMA_RES_TYPE_UCTX)
>> +		rdma_free_res_counter(ctx_task);
>> +}
>> +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);
On 08/09/2015 13:18, Parav Pandit wrote:
>>> + * RDMA resource limits are hierarchical, so the highest configured limit of
>>> + * the hierarchy is enforced. Allowing resource limit configuration to default
>>> + * cgroup allows fair share to kernel space ULPs as well.
>> In what way is the highest configured limit of the hierarchy enforced? I
>> would expect all the limits along the hierarchy to be enforced.
>
> In a hierarchy of, say, 3 cgroups, the smallest limit in the hierarchy
> is applied.
>
> Let's take an example to clarify.
> Say cg_A, cg_B, cg_C:
>
> Role          name                    limit
> Parent        cg_A                    100
> Child_level1  cg_B (child of cg_A)    20
> Child_level2  cg_C (child of cg_B)    50
>
> If the process allocating the rdma resource belongs to cg_C, the lowest
> limit in the hierarchy (20, from cg_B) is applied during the charge()
> stage. If cg_A's limit happens to be 10, since 10 is the lowest, its
> limit would be applicable, as you expected.

Looking at the code, the usage in every level is charged. This is what I
would expect. I just think the comment is a bit misleading.

>>> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
>>> +{
>>> +	struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
>>> +	int type = seq_cft(sf)->private;
>>> +	u32 usage;
>>> +
>>> +	if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
>>> +		seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
>> I'm not sure hiding the actual number is good, especially in the
>> show_usage case.
>
> This follows other controllers, the same as the newly added PID
> subsystem, in showing the max limit.

Okay.

>>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>>> +				      enum devcgroup_rdma_rt type, int num)
>>> +{
>>> +	struct dev_cgroup *dev_cg, *p;
>>> +	struct task_struct *ctx_task;
>>> +
>>> +	if (!num)
>>> +		return;
>>> +
>>> +	/* get cgroup of ib_ucontext it belong to, to uncharge
>>> +	 * so that when its called from any worker tasks or any
>>> +	 * other tasks to which this resource doesn't belong to,
>>> +	 * it can be uncharged correctly.
>>> +	 */
>>> +	if (ucontext)
>>> +		ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>>> +	else
>>> +		ctx_task = current;
>>> +	dev_cg = task_devcgroup(ctx_task);
>>> +
>>> +	spin_lock(&ctx_task->rdma_res_counter->lock);
>> Don't you need an rcu read lock and rcu_dereference to access
>> rdma_res_counter?
>
> I believe it's not required, because uncharge() can happen only from 3
> contexts:
> (a) from the caller task context, which made the allocation call, so no
> synchronization is needed.
> (b) from the resource dealloc context; again this is the same task
> context which allocated it, so it is single threaded and there is no
> need to synchronize.

I don't think it is true. You can access uverbs from multiple threads.

What may help your case here I think is the fact that only when the last
ucontext is released you can change the rdma_res_counter field, and
ucontext release takes the ib_uverbs_file->mutex.

Still, I think it would be best to use rcu_dereference(), if only for
documentation and sparse.

> (c) from the fput() context when the process is terminated abruptly or
> as part of deferred cleanup; when this is happening there cannot be an
> allocator task anyway.
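The documentation-and-sparse point amounts to annotating the pointer and
using the lockdep-checked accessor on the update side. A sketch (the
__rcu annotation is not in the posted patch; task_lock() takes
p->alloc_lock, which is what the condition below checks):

	/* in struct task_struct, as this series proposes to add it: */
	struct task_rdma_res_counter __rcu *rdma_res_counter;

	/* update side, called with task_lock(task) held: */
	res_cnt = rcu_dereference_protected(task->rdma_res_counter,
					    lockdep_is_held(&task->alloc_lock));

With the __rcu annotation in place, sparse flags any plain dereference of
the field, which documents the locking rules at no runtime cost.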
On 08/09/2015 13:50, Parav Pandit wrote:
> On Tue, Sep 8, 2015 at 2:06 PM, Haggai Eran <haggaie@mellanox.com> wrote:
>> On 07/09/2015 23:38, Parav Pandit wrote:
>>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>>> +				      enum devcgroup_rdma_rt type, int num)
>>> +{
>>> +	struct dev_cgroup *dev_cg, *p;
>>> +	struct task_struct *ctx_task;
>>> +
>>> +	if (!num)
>>> +		return;
>>> +
>>> +	/* get cgroup of ib_ucontext it belong to, to uncharge
>>> +	 * so that when its called from any worker tasks or any
>>> +	 * other tasks to which this resource doesn't belong to,
>>> +	 * it can be uncharged correctly.
>>> +	 */
>>> +	if (ucontext)
>>> +		ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>>> +	else
>>> +		ctx_task = current;
>> So what happens if a process creates a ucontext, forks, and then the
>> child creates and destroys a CQ? If I understand correctly, created
>> resources are always charged to the current process (the child), but
>> when it is destroyed the owner of the ucontext (the parent) will be
>> uncharged.
>>
>> Since ucontexts are not meant to be used by multiple processes, I think
>> it would be okay to always charge the owner process (the one that
>> created the ucontext).
>
> I need to think about it. I would like to avoid keeping per-task
> resource counters, for two reasons. For a while I thought that native
> fork() doesn't take care to share the RDMA resources and all the CQ, QP
> DMA-able memory from a PID namespace perspective.
>
> 1. It could well happen that a process and its child process are created
> in PID namespace_A, after which the child is migrated to a new PID
> namespace_B, after which the parent from namespace_A is terminated. I am
> not sure how the ucontext ownership changes from parent to child process
> at that point today. I prefer to keep this complexity out, if it exists
> at all, as process migration across namespaces is not a frequent event
> to optimize the code for.
>
> 2. Having a per-task counter (at the cost of some memory) allows
> avoiding atomics during charge() and uncharge().
>
> The intent is for each task (process and thread) to have its own
> resource counter instance, but I can see that it's broken where it
> charges the parent process without atomics as of now. Since, as you
> said, it's ok to always charge the owner process, I either have to relax
> the 2nd requirement and fall back to using atomics for charge() and
> uncharge(), or I have to get rid of ucontext from the uncharge() API,
> which is difficult due to fput() being in a worker thread context.
>

I think the cost of atomic operations here would normally be negligible
compared to the cost of accessing the hardware to allocate or deallocate
these resources.
On Tue, Sep 8, 2015 at 7:20 PM, Haggai Eran <haggaie@mellanox.com> wrote:
> On 08/09/2015 13:18, Parav Pandit wrote:
>>>> + * RDMA resource limits are hierarchical, so the highest configured limit of
>>>> + * the hierarchy is enforced. Allowing resource limit configuration to default
>>>> + * cgroup allows fair share to kernel space ULPs as well.
>>> In what way is the highest configured limit of the hierarchy enforced? I
>>> would expect all the limits along the hierarchy to be enforced.
>>
>> In a hierarchy of, say, 3 cgroups, the smallest limit in the hierarchy
>> is applied.
>>
>> Let's take an example to clarify.
>> Say cg_A, cg_B, cg_C:
>>
>> Role          name                    limit
>> Parent        cg_A                    100
>> Child_level1  cg_B (child of cg_A)    20
>> Child_level2  cg_C (child of cg_B)    50
>>
>> If the process allocating the rdma resource belongs to cg_C, the lowest
>> limit in the hierarchy (20, from cg_B) is applied during the charge()
>> stage. If cg_A's limit happens to be 10, since 10 is the lowest, its
>> limit would be applicable, as you expected.
>
> Looking at the code, the usage in every level is charged. This is what I
> would expect. I just think the comment is a bit misleading.
>
>>>> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
>>>> +{
>>>> +	struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
>>>> +	int type = seq_cft(sf)->private;
>>>> +	u32 usage;
>>>> +
>>>> +	if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
>>>> +		seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
>>> I'm not sure hiding the actual number is good, especially in the
>>> show_usage case.
>>
>> This follows other controllers, the same as the newly added PID
>> subsystem, in showing the max limit.
>
> Okay.
>
>>>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>>>> +				      enum devcgroup_rdma_rt type, int num)
>>>> +{
>>>> +	struct dev_cgroup *dev_cg, *p;
>>>> +	struct task_struct *ctx_task;
>>>> +
>>>> +	if (!num)
>>>> +		return;
>>>> +
>>>> +	/* get cgroup of ib_ucontext it belong to, to uncharge
>>>> +	 * so that when its called from any worker tasks or any
>>>> +	 * other tasks to which this resource doesn't belong to,
>>>> +	 * it can be uncharged correctly.
>>>> +	 */
>>>> +	if (ucontext)
>>>> +		ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>>>> +	else
>>>> +		ctx_task = current;
>>>> +	dev_cg = task_devcgroup(ctx_task);
>>>> +
>>>> +	spin_lock(&ctx_task->rdma_res_counter->lock);
>>> Don't you need an rcu read lock and rcu_dereference to access
>>> rdma_res_counter?
>>
>> I believe it's not required, because uncharge() can happen only from 3
>> contexts:
>> (a) from the caller task context, which made the allocation call, so no
>> synchronization is needed.
>> (b) from the resource dealloc context; again this is the same task
>> context which allocated it, so it is single threaded and there is no
>> need to synchronize.
> I don't think it is true. You can access uverbs from multiple threads.

Yes, that's right. Though I designed counter structure allocation on a
per-task basis for individual thread access, I totally missed ucontext
sharing among threads. I replied in the other thread about making the
counters atomic during charge and uncharge to cover that case. Therefore
I need the rcu lock and dereference as well.

> What may help your case here I think is the fact that only when the last
> ucontext is released you can change the rdma_res_counter field, and
> ucontext release takes the ib_uverbs_file->mutex.
>
> Still, I think it would be best to use rcu_dereference(), if only for
> documentation and sparse.

Yes.

>
>> (c) from the fput() context when the process is terminated abruptly or
>> as part of deferred cleanup; when this is happening there cannot be an
>> allocator task anyway.
>
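The atomic variant being agreed on here could look roughly like this
(a sketch under the assumption that the per-task usage array simply
becomes atomic_t, so threads sharing a ucontext can update it without the
per-task spinlock; this is not what the posted patch does yet):

	struct task_rdma_res_counter {
		atomic_t usage[DEVCG_RDMA_RES_TYPE_MAX];
	};

	/* charge path: no lock needed for the per-task count */
	atomic_add(num, &res_cnt->usage[type]);

	/* uncharge path: going negative would indicate a bug */
	WARN_ON_ONCE(atomic_sub_return(num, &res_cnt->usage[type]) < 0);

As Haggai notes above, the extra atomic cost is dwarfed by the hardware
command needed to actually create or destroy the verbs object.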
diff --git a/include/linux/device_rdma_cgroup.h b/include/linux/device_rdma_cgroup.h
new file mode 100644
index 0000000..a2c261b
--- /dev/null
+++ b/include/linux/device_rdma_cgroup.h
@@ -0,0 +1,83 @@
+#ifndef _DEVICE_RDMA_CGROUP_H
+#define _DEVICE_RDMA_CGROUP_H
+
+#include <linux/cgroup.h>
+
+/* RDMA resources from device cgroup perspective */
+enum devcgroup_rdma_rt {
+	DEVCG_RDMA_RES_TYPE_UCTX,
+	DEVCG_RDMA_RES_TYPE_CQ,
+	DEVCG_RDMA_RES_TYPE_PD,
+	DEVCG_RDMA_RES_TYPE_AH,
+	DEVCG_RDMA_RES_TYPE_MR,
+	DEVCG_RDMA_RES_TYPE_MW,
+	DEVCG_RDMA_RES_TYPE_SRQ,
+	DEVCG_RDMA_RES_TYPE_QP,
+	DEVCG_RDMA_RES_TYPE_FLOW,
+	DEVCG_RDMA_RES_TYPE_MAX,
+};
+
+struct ib_ucontext;
+
+#define DEVCG_RDMA_MAX_RESOURCES S32_MAX
+
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+
+#define DEVCG_RDMA_MAX_RESOURCE_STR "max"
+
+enum devcgroup_rdma_access_files {
+	DEVCG_RDMA_LIST_USAGE,
+};
+
+struct task_rdma_res_counter {
+	/* allows atomic increment of task and cgroup counters
+	 * to avoid race with migration task.
+	 */
+	spinlock_t lock;
+	u32 usage[DEVCG_RDMA_RES_TYPE_MAX];
+};
+
+struct devcgroup_rdma_tracker {
+	int limit;
+	atomic_t usage;
+	int failcnt;
+};
+
+struct devcgroup_rdma {
+	struct devcgroup_rdma_tracker tracker[DEVCG_RDMA_RES_TYPE_MAX];
+};
+
+struct dev_cgroup;
+
+void init_devcgroup_rdma_tracker(struct dev_cgroup *dev_cg);
+ssize_t devcgroup_rdma_set_max_resource(struct kernfs_open_file *of,
+					char *buf,
+					size_t nbytes, loff_t off);
+int devcgroup_rdma_get_max_resource(struct seq_file *m, void *v);
+int devcgroup_rdma_show_usage(struct seq_file *m, void *v);
+
+int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num);
+void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
+				      enum devcgroup_rdma_rt type, int num);
+void devcgroup_rdma_fork(struct task_struct *task, void *priv);
+
+int devcgroup_rdma_can_attach(struct cgroup_subsys_state *css,
+			      struct cgroup_taskset *tset);
+void devcgroup_rdma_cancel_attach(struct cgroup_subsys_state *css,
+				  struct cgroup_taskset *tset);
+int devcgroup_rdma_query_resource_limit(enum devcgroup_rdma_rt type);
+#else
+
+static inline int devcgroup_rdma_try_charge_resource(
+	enum devcgroup_rdma_rt type, int num)
+{ return 0; }
+static inline void devcgroup_rdma_uncharge_resource(
+	struct ib_ucontext *ucontext,
+	enum devcgroup_rdma_rt type, int num)
+{ }
+static inline int devcgroup_rdma_query_resource_limit(
+	enum devcgroup_rdma_rt type)
+{ return DEVCG_RDMA_MAX_RESOURCES; }
+#endif
+
+#endif
diff --git a/security/device_rdma_cgroup.c b/security/device_rdma_cgroup.c
new file mode 100644
index 0000000..fb4cc59
--- /dev/null
+++ b/security/device_rdma_cgroup.c
@@ -0,0 +1,422 @@
+/*
+ * RDMA device cgroup controller of device controller cgroup.
+ *
+ * Provides a cgroup hierarchy to limit various RDMA resource allocation to a
+ * configured limit of the cgroup.
+ *
+ * It's easy for user space applications to consume RDMA device-specific
+ * hardware resources. Such resource exhaustion should be prevented so that
+ * user space applications and other kernel consumers get a chance to allocate
+ * and effectively use the hardware resources.
+ *
+ * In order to use the device rdma controller, set the maximum resource count
+ * per cgroup, which ensures that the total rdma resources for processes
+ * belonging to a cgroup don't exceed the configured limit.
+ *
+ * RDMA resource limits are hierarchical, so the highest configured limit of
+ * the hierarchy is enforced. Allowing resource limit configuration to default
+ * cgroup allows fair share to kernel space ULPs as well.
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License. See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include <linux/slab.h>
+#include <linux/device_rdma_cgroup.h>
+#include <linux/device_cgroup.h>
+#include <rdma/ib_verbs.h>
+
+/**
+ * init_devcgroup_rdma_tracker - initialize resource limits.
+ * @dev_cg: device cgroup pointer for which limits should be
+ * initialized.
+ */
+void init_devcgroup_rdma_tracker(struct dev_cgroup *dev_cg)
+{
+	int i;
+
+	for (i = 0; i < DEVCG_RDMA_RES_TYPE_MAX; i++)
+		dev_cg->rdma.tracker[i].limit = DEVCG_RDMA_MAX_RESOURCES;
+}
+
+ssize_t devcgroup_rdma_set_max_resource(struct kernfs_open_file *of,
+					char *buf,
+					size_t nbytes, loff_t off)
+{
+	struct cgroup_subsys_state *css = of_css(of);
+	struct dev_cgroup *dev_cg = css_to_devcgroup(css);
+	s64 new_limit;
+	int type = of_cft(of)->private;
+	int err;
+
+	buf = strstrip(buf);
+	if (!strcmp(buf, DEVCG_RDMA_MAX_RESOURCE_STR)) {
+		new_limit = DEVCG_RDMA_MAX_RESOURCES;
+		goto max_limit;
+	}
+
+	err = kstrtoll(buf, 0, &new_limit);
+	if (err)
+		return err;
+
+	if (new_limit < 0 || new_limit >= DEVCG_RDMA_MAX_RESOURCES)
+		return -EINVAL;
+
+max_limit:
+	dev_cg->rdma.tracker[type].limit = new_limit;
+	return nbytes;
+}
+
+int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
+{
+	struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
+	int type = seq_cft(sf)->private;
+	u32 usage;
+
+	if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
+		seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
+	} else {
+		usage = dev_cg->rdma.tracker[type].limit;
+		seq_printf(sf, "%u\n", usage);
+	}
+	return 0;
+}
+
+static const char * const rdma_res_name[] = {
+	[DEVCG_RDMA_RES_TYPE_UCTX] = "uctx",
+	[DEVCG_RDMA_RES_TYPE_CQ] = "cq",
+	[DEVCG_RDMA_RES_TYPE_PD] = "pd",
+	[DEVCG_RDMA_RES_TYPE_AH] = "ah",
+	[DEVCG_RDMA_RES_TYPE_MR] = "mr",
+	[DEVCG_RDMA_RES_TYPE_MW] = "mw",
+	[DEVCG_RDMA_RES_TYPE_SRQ] = "srq",
+	[DEVCG_RDMA_RES_TYPE_QP] = "qp",
+	[DEVCG_RDMA_RES_TYPE_FLOW] = "flow",
+};
+
+int devcgroup_rdma_show_usage(struct seq_file *m, void *v)
+{
+	struct dev_cgroup *devcg = css_to_devcgroup(seq_css(m));
+	const char *res_name = NULL;
+	u32 usage;
+	int i;
+
+	for (i = 0; i < DEVCG_RDMA_RES_TYPE_MAX; i++) {
+		res_name = rdma_res_name[i];
+		usage = atomic_read(&devcg->rdma.tracker[i].usage);
+		if (usage == DEVCG_RDMA_MAX_RESOURCES)
+			seq_printf(m, "%s %s\n", res_name,
+				   DEVCG_RDMA_MAX_RESOURCE_STR);
+		else
+			seq_printf(m, "%s %u\n", res_name, usage);
+	}
+	return 0;
+}
+
+static void rdma_free_res_counter(struct task_struct *task)
+{
+	struct task_rdma_res_counter *res_cnt = NULL;
+	bool free_res = false;
+
+	task_lock(task);
+	res_cnt = task->rdma_res_counter;
+	if (res_cnt &&
+	    res_cnt->usage[DEVCG_RDMA_RES_TYPE_UCTX] == 0) {
+		/* free resource counters if this is the last
+		 * ucontext, which is getting deallocated.
+		 */
+		task->rdma_res_counter = NULL;
+		free_res = true;
+	}
+	task_unlock(task);
+
+	/* synchronize with task migration activity from one cgroup to
+	 * another, which might be reading this task's resource counters.
+	 */
+	synchronize_rcu();
+	if (free_res)
+		kfree(res_cnt);
+}
+
+static void uncharge_resource(struct dev_cgroup *dev_cg,
+			      enum devcgroup_rdma_rt type, s64 num)
+{
+	/*
+	 * A negative count (or overflow for that matter) is invalid,
+	 * and indicates a bug in the device rdma controller.
+	 */
+	WARN_ON_ONCE(atomic_add_negative(-num,
+					 &dev_cg->rdma.tracker[type].usage));
+}
+
+static void uncharge_task_resource(struct task_struct *task,
+				   struct dev_cgroup *cg,
+				   enum devcgroup_rdma_rt type,
+				   int num)
+{
+	struct dev_cgroup *p;
+
+	if (!num)
+		return;
+
+	/* protect against the actual task which might be
+	 * freeing resource counter memory due to no resource
+	 * consumption.
+	 */
+	task_lock(task);
+	if (!task->rdma_res_counter) {
+		task_unlock(task);
+		return;
+	}
+	for (p = cg; p; p = parent_devcgroup(p))
+		uncharge_resource(p, type, num);
+
+	task_unlock(task);
+}
+
+/**
+ * devcgroup_rdma_uncharge_resource - hierarchically uncharge
+ * rdma resource count
+ * @ucontext: the ucontext from which to uncharge the resource;
+ * pass NULL when the caller knows that there was a past allocation
+ * and it is calling from the same process context to which this
+ * resource belongs.
+ * @type: the type of resource to uncharge
+ * @num: the number of resources to uncharge
+ */
+void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
+				      enum devcgroup_rdma_rt type, int num)
+{
+	struct dev_cgroup *dev_cg, *p;
+	struct task_struct *ctx_task;
+
+	if (!num)
+		return;
+
+	/* get cgroup of ib_ucontext it belong to, to uncharge
+	 * so that when its called from any worker tasks or any
+	 * other tasks to which this resource doesn't belong to,
+	 * it can be uncharged correctly.
+	 */
+	if (ucontext)
+		ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
+	else
+		ctx_task = current;
+	dev_cg = task_devcgroup(ctx_task);
+
+	spin_lock(&ctx_task->rdma_res_counter->lock);
+	ctx_task->rdma_res_counter->usage[type] -= num;
+
+	for (p = dev_cg; p; p = parent_devcgroup(p))
+		uncharge_resource(p, type, num);
+
+	spin_unlock(&ctx_task->rdma_res_counter->lock);
+
+	if (type == DEVCG_RDMA_RES_TYPE_UCTX)
+		rdma_free_res_counter(ctx_task);
+}
+EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);
+
+/**
+ * This function does not follow the configured rdma resource limit.
+ * It cannot fail and the new rdma resource count may exceed the limit.
+ * This is only used during task migration, where there is no other
+ * way out than violating the limit.
+ */
+static void charge_resource(struct dev_cgroup *dev_cg,
+			    enum devcgroup_rdma_rt type, int num)
+{
+	struct dev_cgroup *p;
+
+	for (p = dev_cg; p; p = parent_devcgroup(p)) {
+		struct devcgroup_rdma *rdma = &p->rdma;
+
+		atomic_add(num, &rdma->tracker[type].usage);
+	}
+}
+
+/**
+ * try_charge_resource - hierarchically try to charge
+ * the rdma resource count
+ * @type: the type of resource to charge
+ * @num: the number of rdma resources to charge
+ *
+ * This function follows the set limit. It will fail if the charge would cause
+ * the new value to exceed the hierarchical limit. Returns 0 if the charge
+ * succeeded, otherwise -EAGAIN.
+ */
+static int try_charge_resource(struct dev_cgroup *dev_cg,
+			       enum devcgroup_rdma_rt type, int num)
+{
+	struct dev_cgroup *p, *q;
+
+	for (p = dev_cg; p; p = parent_devcgroup(p)) {
+		struct devcgroup_rdma *rdma = &p->rdma;
+		s64 new = atomic_add_return(num,
+					    &rdma->tracker[type].usage);
+
+		if (new > rdma->tracker[type].limit)
+			goto revert;
+	}
+	return 0;
+
+revert:
+	for (q = dev_cg; q != p; q = parent_devcgroup(q))
+		uncharge_resource(q, type, num);
+	uncharge_resource(q, type, num);
+	return -EAGAIN;
+}
+
+/**
+ * devcgroup_rdma_try_charge_resource - hierarchically try to charge
+ * the rdma resource count
+ * @type: the type of resource to charge
+ * @num: the number of rdma resources to charge
+ *
+ * This function follows the set limit in a hierarchical way.
+ * It will fail if the charge would cause the new value to exceed the
+ * hierarchical limit.
+ * Returns 0 if the charge succeeded, otherwise -EAGAIN.
+ */
+int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num)
+{
+	struct dev_cgroup *dev_cg = task_devcgroup(current);
+	struct task_rdma_res_counter *res_cnt = current->rdma_res_counter;
+	int status;
+
+	if (!res_cnt) {
+		res_cnt = kzalloc(sizeof(*res_cnt), GFP_KERNEL);
+		if (!res_cnt)
+			return -ENOMEM;
+
+		spin_lock_init(&res_cnt->lock);
+		rcu_assign_pointer(current->rdma_res_counter, res_cnt);
+	}
+
+	/* synchronize with migration task by taking lock, to avoid
+	 * race condition of performing cgroup resource migration
+	 * in non atomic way with this task, which can leads to leaked
+	 * resources in older cgroup.
+	 */
+	spin_lock(&res_cnt->lock);
+	status = try_charge_resource(dev_cg, type, num);
+	if (status)
+		goto busy;
+
+	/* single task updating its rdma resource usage, so atomic is
+	 * not required.
+	 */
+	current->rdma_res_counter->usage[type] += num;
+
+busy:
+	spin_unlock(&res_cnt->lock);
+	return status;
+}
+EXPORT_SYMBOL(devcgroup_rdma_try_charge_resource);
+
+/**
+ * devcgroup_rdma_query_resource_limit - query the resource limit
+ * for a given resource type of the calling user process. It returns the
+ * hierarchically smallest limit of the cgroup hierarchy.
+ * @type: the type of resource to query the limit for
+ * Returns the resource limit across all the RDMA devices accessible
+ * to this process.
+ */
+int devcgroup_rdma_query_resource_limit(enum devcgroup_rdma_rt type)
+{
+	struct dev_cgroup *dev_cg, *p;
+	int cur_limit, limit;
+
+	dev_cg = task_devcgroup(current);
+	limit = dev_cg->rdma.tracker[type].limit;
+
+	/* find the controller in the given hierarchy with the lowest limit,
+	 * and report its limit to avoid confusion for users and applications
+	 * who rely on the query functionality.
+	 */
+	for (p = dev_cg; p; p = parent_devcgroup(p)) {
+		cur_limit = p->rdma.tracker[type].limit;
+		limit = min_t(int, cur_limit, limit);
+	}
+	return limit;
+}
+EXPORT_SYMBOL(devcgroup_rdma_query_resource_limit);
+
+int devcgroup_rdma_can_attach(struct cgroup_subsys_state *dst_css,
+			      struct cgroup_taskset *tset)
+{
+	struct dev_cgroup *dst_cg = css_to_devcgroup(dst_css);
+	struct dev_cgroup *old_cg;
+	struct task_struct *task;
+	struct task_rdma_res_counter *task_res_cnt;
+	int val, i;
+
+	cgroup_taskset_for_each(task, tset) {
+		old_cg = task_devcgroup(task);
+
+		/* protect against a task which might be deallocating the
+		 * rdma_res_counter structure because the last resource
+		 * of the task might be undergoing deallocation.
+		 */
+		rcu_read_lock();
+		task_res_cnt = rcu_dereference(task->rdma_res_counter);
+		if (!task_res_cnt)
+			goto empty_task;
+
+		spin_lock(&task_res_cnt->lock);
+		for (i = 0; i < DEVCG_RDMA_RES_TYPE_MAX; i++) {
+			val = task_res_cnt->usage[i];
+
+			charge_resource(dst_cg, i, val);
+			uncharge_task_resource(task, old_cg, i, val);
+		}
+		spin_unlock(&task_res_cnt->lock);
+
+empty_task:
+		rcu_read_unlock();
+	}
+	return 0;
+}
+
+void devcgroup_rdma_cancel_attach(struct cgroup_subsys_state *dst_css,
+				  struct cgroup_taskset *tset)
+{
+	struct dev_cgroup *dst_cg = css_to_devcgroup(dst_css);
+	struct dev_cgroup *old_cg;
+	struct task_struct *task;
+	struct task_rdma_res_counter *task_res_cnt;
+	u32 val;
+	int i;
+
+	cgroup_taskset_for_each(task, tset) {
+		old_cg = task_devcgroup(task);
+
+		/* protect against a task deallocating the rdma_res_counter
+		 * structure because the last ucontext resource of the task
+		 * might be getting deallocated.
+		 */
+		rcu_read_lock();
+		task_res_cnt = rcu_dereference(task->rdma_res_counter);
+		if (!task_res_cnt)
+			goto empty_task;
+
+		spin_lock(&task_res_cnt->lock);
+		for (i = 0; i < DEVCG_RDMA_RES_TYPE_MAX; i++) {
+			val = task_res_cnt->usage[i];
+
+			charge_resource(old_cg, i, val);
+			uncharge_task_resource(task, dst_cg, i, val);
+		}
+		spin_unlock(&task_res_cnt->lock);
+empty_task:
+		rcu_read_unlock();
+	}
+}
+
+void devcgroup_rdma_fork(struct task_struct *task, void *priv)
+{
+	/* There are per-task resource counters,
+	 * so ignore whatever clone copied over.
+	 */
+	task->rdma_res_counter = NULL;
+}
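To see how the exported API is meant to be consumed, here is a sketch of
a call site. create_cq_charged() and create_hw_cq() are hypothetical; the
real uverbs hook-up lives in a separate patch of this series:

	static struct ib_cq *create_cq_charged(struct ib_ucontext *ucontext,
					       struct ib_device *dev)
	{
		struct ib_cq *cq;
		int ret;

		/* charge before allocating; fails with -EAGAIN when the
		 * cgroup hierarchy's CQ limit would be exceeded.
		 */
		ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_CQ, 1);
		if (ret)
			return ERR_PTR(ret);

		cq = create_hw_cq(dev);		/* hypothetical allocation */
		if (IS_ERR(cq))			/* roll the charge back on failure */
			devcgroup_rdma_uncharge_resource(ucontext,
							 DEVCG_RDMA_RES_TYPE_CQ, 1);
		return cq;
	}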
Extension of device cgroup for RDMA device resources.

This implements an RDMA resource tracker to limit RDMA resources such as
AH, CQ, PD, QP, MR, SRQ etc. for processes of the cgroup. It implements
an RDMA resource limit module to limit the RDMA resources consumed by
processes of the cgroup.

RDMA resources are tracked on a per-task basis. RDMA resources across
multiple such devices are limited among multiple processes of the owning
device cgroup. The RDMA device cgroup extension returns an error when
user space applications try to allocate more resources than the
configured limit.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 include/linux/device_rdma_cgroup.h |  83 ++++++++
 security/device_rdma_cgroup.c      | 422 +++++++++++++++++++++++++++++++++++++
 2 files changed, 505 insertions(+)
 create mode 100644 include/linux/device_rdma_cgroup.h
 create mode 100644 security/device_rdma_cgroup.c