Message ID | 1452020286-9508-7-git-send-email-pandit.parav@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Hello,

On Wed, Jan 06, 2016 at 12:28:06AM +0530, Parav Pandit wrote:
> +5-4-1. RDMA Interface Files
> +
> +  rdma.resource.verb.list
> +  rdma.resource.verb.limit
> +  rdma.resource.verb.usage
> +  rdma.resource.verb.failcnt
> +  rdma.resource.hw.list
> +  rdma.resource.hw.limit
> +  rdma.resource.hw.usage
> +  rdma.resource.hw.failcnt

Can you please read the rest of cgroup.txt and put the interface in
line with the common conventions followed by other controllers?

Thanks.

On Wed, Jan 6, 2016 at 3:23 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Wed, Jan 06, 2016 at 12:28:06AM +0530, Parav Pandit wrote:
>> +5-4-1. RDMA Interface Files
>> +
>> +  rdma.resource.verb.list
>> +  rdma.resource.verb.limit
>> +  rdma.resource.verb.usage
>> +  rdma.resource.verb.failcnt
>> +  rdma.resource.hw.list
>> +  rdma.resource.hw.limit
>> +  rdma.resource.hw.usage
>> +  rdma.resource.hw.failcnt
>
> Can you please read the rest of cgroup.txt and put the interface in
> line with the common conventions followed by other controllers?
>

Yes. I read through. I can see two changes to be made in V2 version of
this patch.
1. rdma.resource.verb.usage and rdma.resource.verb.limit to change
respectively to,
2. rdma.resource.verb.stat and rdma.resource.verb.max.
3. rdma.resource.verb.failcnt indicate failure events, which I think
should go to events.
I roll out new patch for events post this patch as additional feature
and remove this feature in V2.

rdma.resource.verb.list file is unique to rdma cgroup, so I believe
this is fine.

We will conclude whether to have rdma.resource.hw.<files> or not in
other patches.
I am in opinion to keep "resource" and "verb" or "hw" tags around to
keep it verbose enough to know what are we trying to control.

Is that ok?

> Thanks.
>
> --
> tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Hello,

On Thu, Jan 07, 2016 at 04:14:26AM +0530, Parav Pandit wrote:
> Yes. I read through. I can see two changes to be made in V2 version of
> this patch.
> 1. rdma.resource.verb.usage and rdma.resource.verb.limit to change
> respectively to,
> 2. rdma.resource.verb.stat and rdma.resource.verb.max.
> 3. rdma.resource.verb.failcnt indicate failure events, which I think
> should go to events.

What's up with the ".resource" part?  Also can't the .max file list
the available resources?  Why does it need a separate list file?

> I roll out new patch for events post this patch as additional feature
> and remove this feature in V2.
>
> rdma.resource.verb.list file is unique to rdma cgroup, so I believe
> this is fine.

Please see above.

> We will conclude whether to have rdma.resource.hw.<files> or not in
> other patches.
> I am in opinion to keep "resource" and "verb" or "hw" tags around to
> keep it verbose enough to know what are we trying to control.

What does that achieve?  I feel that it's getting overengineered
constantly.

Thanks.

On Thu, Jan 7, 2016 at 4:27 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Thu, Jan 07, 2016 at 04:14:26AM +0530, Parav Pandit wrote:
>> Yes. I read through. I can see two changes to be made in V2 version of
>> this patch.
>> 1. rdma.resource.verb.usage and rdma.resource.verb.limit to change
>> respectively to,
>> 2. rdma.resource.verb.stat and rdma.resource.verb.max.
>> 3. rdma.resource.verb.failcnt indicate failure events, which I think
>> should go to events.
>
> What's up with the ".resource" part?

I can remove "resource" key word. It's just that if something other than
resource comes up to limit to in future, it will be hard to define at
that time.

> Also can't the .max file list
> the available resources?  Why does it need a separate list file?
>
max file does list them only after limits are configured for that
device. That's when rpool (array of max and usage counts) is allocated.

If user wants to know what all knobs are available, then list file
exposes them on per device basis without actually mentioning actual
limit or without allocating rpool arrays.

If you are hinting that I should allocate rpool array when rdma cgroup
is created, that can be done for already discovered devices.
If new devices are discovered after cgroup is created, for them we
anyway have to allocate/free when they appear/disappear.

In different implementation, where list of all the rdma cgroups can be
maintained, and rpool arrays can be allocated for all of them when new
device appear/disappear. This can move complexity of dynamic allocation
from try_charge/uncharge to device addition and removal APIs.
ib_register_ib_device() level.
However this comes with memory cost, where even if those device doesn't
participate in cgroup, for them rpool memory will be allocated for
each such rdma cgroup.

list file looks like below for two device entries.
mlx4_0 ah qp mr pd srq flow
ocrdma0 ah qp mr pd

max file looks like below.
mlx4_0 ah=100 qp=40 mr=10 pd=90 srq=10 flow=10

>> I roll out new patch for events post this patch as additional feature
>> and remove this feature in V2.
>>
>> rdma.resource.verb.list file is unique to rdma cgroup, so I believe
>> this is fine.
>
> Please see above.
>
>> We will conclude whether to have rdma.resource.hw.<files> or not in
>> other patches.
>> I am in opinion to keep "resource" and "verb" or "hw" tags around to
>> keep it verbose enough to know what are we trying to control.
>
> What does that achieve?  I feel that it's getting overengineered
> constantly.

Please see above for "resource". I guess we are not losing anything
by having "rdma.resource" vs just having "rdma".
But if that sounds too much, we can remove "resource".

>
> Thanks.
>
> --
> tejun

Hello,

On Thu, Jan 07, 2016 at 05:22:40AM +0530, Parav Pandit wrote:
> I can remove "resource" key word. It's just that if something other than
> resource comes up to limit to in future, it will be hard to define at
> that time.

Please remove.  The word doesn't mean anything in this context.

> > Also can't the .max file list
> > the available resources?  Why does it need a separate list file?
> >
> max file does list them only after limits are configured for that
> device. That's when rpool (array of max and usage counts) is allocated.
>
> If user wants to know what all knobs are available, then list file
> exposes them on per device basis without actually mentioning actual
> limit or without allocating rpool arrays.
...
> list file looks like below for two device entries.
> mlx4_0 ah qp mr pd srq flow
> ocrdma0 ah qp mr pd
>
> max file looks like below.
> mlx4_0 ah=100 qp=40 mr=10 pd=90 srq=10 flow=10

Just always show the settings for all devices in the max file like the
following?

  mlx4_0 ah=max qp=max mr=max pd=max srq=max flow=max
  ocrdma0 ah=max qp=max mr=max pd=max

Thanks.

On Thu, Jan 7, 2016 at 9:12 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Thu, Jan 07, 2016 at 05:22:40AM +0530, Parav Pandit wrote:
>> I can remove "resource" key word. It's just that if something other than
>> resource comes up to limit to in future, it will be hard to define at
>> that time.
>
> Please remove.  The word doesn't mean anything in this context.
>
ok. I will remove.

>> > Also can't the .max file list
>> > the available resources?  Why does it need a separate list file?
>> >
>> max file does list them only after limits are configured for that
>> device. That's when rpool (array of max and usage counts) is allocated.
>>
>> If user wants to know what all knobs are available, then list file
>> exposes them on per device basis without actually mentioning actual
>> limit or without allocating rpool arrays.
> ...
>> list file looks like below for two device entries.
>> mlx4_0 ah qp mr pd srq flow
>> ocrdma0 ah qp mr pd
>>
>> max file looks like below.
>> mlx4_0 ah=100 qp=40 mr=10 pd=90 srq=10 flow=10
>
> Just always show the settings for all devices in the max file like the
> following?
>
>   mlx4_0 ah=max qp=max mr=max pd=max srq=max flow=max
>   ocrdma0 ah=max qp=max mr=max pd=max

Should be possible. I will make the change.

>
> Thanks.
>
> --
> tejun

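The file format the thread converges on (one line per device, every resource always listed, unset limits shown as "max") can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the names `DEVICES`, `render_max_file` and `parse_max_line` are hypothetical and belong to neither the patch nor any kernel API, and the device and resource lists are copied from the examples quoted above.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical table: device name -> resources it exposes (taken from the thread's example).
DEVICES: Dict[str, List[str]] = {
    "mlx4_0": ["ah", "qp", "mr", "pd", "srq", "flow"],
    "ocrdma0": ["ah", "qp", "mr", "pd"],
}


def render_max_file(devices: Dict[str, List[str]],
                    limits: Dict[str, Dict[str, int]]) -> str:
    """Render one line per device; a resource with no configured limit shows 'max'."""
    lines = []
    for dev, resources in devices.items():
        configured = limits.get(dev, {})
        fields = [f"{name}={configured.get(name, 'max')}" for name in resources]
        lines.append(" ".join([dev] + fields))
    return "\n".join(lines)


def parse_max_line(line: str) -> Tuple[str, Dict[str, Optional[int]]]:
    """Parse 'dev key=val ...'; a value of None stands for 'max' (no limit)."""
    dev, *fields = line.split()
    parsed: Dict[str, Optional[int]] = {}
    for field in fields:
        key, sep, value = field.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"malformed field: {field!r}")
        parsed[key] = None if value == "max" else int(value)
    return dev, parsed


if __name__ == "__main__":
    # No limits configured: every device is still listed, which is why a
    # separate list file is not needed to discover the available knobs.
    print(render_max_file(DEVICES, {}))
    print(parse_max_line("mlx4_0 ah=100 qp=max"))
```

With this layout a reader learns the available resources from the max file itself, which is the point Tejun makes against keeping a separate list file.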
diff --git a/Documentation/cgroup-legacy/rdma.txt b/Documentation/cgroup-legacy/rdma.txt
new file mode 100644
index 0000000..70626c5
--- /dev/null
+++ b/Documentation/cgroup-legacy/rdma.txt
@@ -0,0 +1,129 @@
+                                RDMA Resource Controller
+                                ------------------------
+
+Contents
+--------
+
+1. Overview
+  1-1. What is RDMA resource controller?
+  1-2. Why RDMA resource controller needed?
+  1-3. How is RDMA resource controller implemented?
+2. Usage Examples
+
+1. Overview
+
+1-1. What is RDMA resource controller?
+-------------------------------------
+
+RDMA resource controller allows user to limit RDMA/IB specific resources
+that a given set of processes can use. These processes are grouped using
+RDMA resource controller.
+
+RDMA resource controller currently allows two different types of resource
+pools.
+(a) RDMA IB specification level verb resources defined by IB stack
+(b) HCA vendor device specific resources
+
+RDMA resource controller allows maximum of up to 64 resources in
+a resource pool which is the internal construct of rdma cgroup explained
+at later part of this document.
+
+1-2. Why RDMA resource controller needed?
+----------------------------------------
+
+Currently user space applications can easily take away all the rdma device
+specific resources such as AH, CQ, QP, MR etc. Due to which other applications
+in other cgroup or kernel space ULPs may not even get chance to allocate any
+rdma resources. This leads to service unavailability.
+
+Therefore RDMA resource controller is needed through which resource consumption
+of processes can be limited. Through this controller various different rdma
+resources described by IB uverbs layer and any HCA vendor driver can be
+accounted.
+
+1-3. How is RDMA resource controller implemented?
+------------------------------------------------
+
+rdma cgroup allows limit configuration of resources. These resources are not
+defined by the rdma controller. Instead they are defined by the IB stack
+and HCA device drivers (optionally).
+This provides great flexibility to allow IB stack to define new resources,
+without any changes to rdma cgroup.
+Rdma cgroup maintains resource accounting per cgroup, per device, per resource
+type using resource pool structure. Each such resource pool is limited up to
+64 resources in given resource pool by rdma cgroup, which can be extended
+later if required.
+
+This resource pool object is linked to the cgroup css. Typically there
+are 0 to 4 resource pool instances per cgroup, per device in most use cases.
+But nothing limits to have it more. At present hundreds of RDMA devices per
+single cgroup may not be handled optimally, however there is no known use case
+for such configuration either.
+
+Since RDMA resources can be allocated from any process and can be freed by any
+of the child processes which shares the address space, rdma resources are
+always owned by the creator cgroup css. This allows process migration from one
+to other cgroup without major complexity of transferring resource ownership;
+because such ownership is not really present due to shared nature of
+rdma resources. Linking resources around css also ensures that cgroups can be
+deleted after processes migrated. This allows process migration as well with
+active resources, even though that's not the primary use case.
+
+Finally mapping of the resource owner pid to cgroup is maintained using
+simple hash table to perform quick look-up during resource charging/uncharging
+time.
+
+Resource pool object is created in following situations.
+(a) User sets the limit and no previous resource pool exists for the device
+of interest for the cgroup.
+(b) No resource limits were configured, but IB/RDMA stack tries to
+charge the resource. So that it correctly uncharges them when applications are
+running without limits and later on when limits are enforced during uncharging,
+otherwise usage count will drop to negative. This is done using default
+resource pool. Instead of implementing any sort of time markers, default pool
+simplifies the design.
+
+Resource pool is destroyed if it was of default type (not created
+by administrative operation) and it's the last resource getting
+deallocated. Resource pool created as administrative operation is not
+deleted, as it's expected to be used in near future.
+
+If user setting tries to delete all the resource limit
+with active resources per device, RDMA cgroup just marks the pool as
+default pool with maximum limits for each resource, otherwise it deletes the
+default resource pool.
+
+2. Usage Examples
+-----------------
+
+(a) List available RDMA verb level resources:
+
+cat /sys/fs/cgroup/rdma/1/rdma.resource.verb.list
+#Output:
+mlx4_0 uctx ah pd mr srq qp flow
+
+(b) Configure resource limit:
+echo mlx4_0 mr=100 qp=10 ah=2 > /sys/fs/cgroup/rdma/1/rdma.resource.verb.limit
+echo ocrdma1 mr=120 qp=20 cq=10 > /sys/fs/cgroup/rdma/2/rdma.resource.verb.limit
+
+(c) Query resource limit:
+cat /sys/fs/cgroup/rdma/2/rdma.resource.verb.limit
+#Output:
+mlx4_0 mr=100 qp=10 ah=2
+ocrdma1 mr=120 qp=20 cq=10
+
+(d) Query current usage:
+cat /sys/fs/cgroup/rdma/2/rdma.resource.verb.usage
+#Output:
+mlx4_0 mr=95 qp=8 ah=2
+ocrdma1 mr=0 qp=20 cq=10
+
+(e) Delete resource limit:
+echo mlx4_0 remove > /sys/fs/cgroup/rdma/1/rdma.resource.verb.limit
+
+(f) List available HCA HW specific resources: (optional)
+cat /sys/fs/cgroup/rdma/1/rdma.resource.hw.list
+vendor1 hw_qp hw_cq hw_timer
+
+(g) Configure hw specific resource limit:
+echo vendor1 hw_qp=56 > /sys/fs/cgroup/rdma/2/rdma.resource.hw.limit
diff --git a/Documentation/cgroup.txt b/Documentation/cgroup.txt
index 983ba63..57eb59c 100644
--- a/Documentation/cgroup.txt
+++ b/Documentation/cgroup.txt
@@ -47,6 +47,8 @@ CONTENTS
   5-3. IO
     5-3-1. IO Interface Files
     5-3-2. Writeback
+  5-4. RDMA
+    5-4-1. RDMA Interface Files
 6. Namespace
   6-1. Basics
   6-2. The Root and Views
@@ -1017,6 +1019,83 @@ writeback as follows.
         total available memory and applied the same way as
         vm.dirty[_background]_ratio.
 
+5-4. RDMA
+
+The "rdma" controller regulates the distribution of RDMA resources.
+This controller implements both RDMA/IB verb level and RDMA HCA
+driver level resource distribution.
+
+5-4-1. RDMA Interface Files
+
+  rdma.resource.verb.list
+
+        A read-only file that exists for all the cgroups that describes
+        which all verb specific resources of a given device can be
+        distributed and accounted.
+
+        Lines are keyed by device name and are not ordered.
+        Each line contains space separated resource name that can be
+        distributed.
+
+        An example for mlx4_0 device follows.
+
+          mlx4_0 ah cq pd mr qp flow srq
+
+  rdma.resource.verb.limit
+        A read-write file that exists for all the cgroups that describes
+        current configured verbs resource limit for an RDMA/IB device.
+
+        Lines are keyed by device name and are not ordered.
+        Each line contains space separated resource name and its configured
+        limit that can be distributed.
+
+        An example for mlx4 and ocrdma device follows.
+
+          mlx4_0 mr=1000 qp=104 ah=2
+          ocrdma1 mr=900 qp=89 cq=10
+
+  rdma.resource.verb.usage
+        A read-only file that describes current resource usage.
+        It exists for all the cgroups including root.
+
+        An example for mlx4 and ocrdma device follows.
+
+          mlx4_0 mr=1000 qp=102 ah=2
+          ocrdma1 mr=900 qp=79 cq=10
+
+  rdma.resource.verb.failcnt
+        A read-only file that describes resource allocation failure
+        count for a given resource type of a particular device.
+        It exists for all the cgroups including root.
+
+        An example for mlx4 and ocrdma device follows.
+
+          mlx4_0 mr=0 qp=1 ah=1
+          ocrdma1 mr=2 qp=1 cq=1
+
+  rdma.resource.hw.list
+
+        A read-only file that exists for all the cgroups that describes
+        which all HCA hardware specific resources of a given device can be
+        distributed and accounted.
+
+  rdma.resource.hw.limit
+        A read-write file that exists for all the cgroups that describes
+        current configured HCA hardware resource limit for an RDMA/IB device.
+
+        Lines are keyed by device name and are not ordered.
+        Each line contains space separated resource name and its configured
+        limit that can be distributed.
+
+  rdma.resource.hw.usage
+        A read-only file that describes current resource usage.
+        It exists for all the cgroups including root.
+
+  rdma.resource.hw.failcnt
+        A read-only file that describes HCA hardware resource
+        allocation failure count for a given resource type of
+        a particular device.
+        It exists for all the cgroups including root.
 
 
 6. Namespace
Added documentation for rdma controller to use in legacy mode and
using new unified hierarchy.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 Documentation/cgroup-legacy/rdma.txt | 129 +++++++++++++++++++++++++++++++++++
 Documentation/cgroup.txt             |  79 +++++++++++++++++++++
 2 files changed, 208 insertions(+)
 create mode 100644 Documentation/cgroup-legacy/rdma.txt
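
The resource pool lifecycle described in section 1-3 of the proposed rdma.txt (a default pool created on first charge, an administrative pool created by setting a limit, and a pool demoted to default when limits are removed while resources are active) may be easier to follow as a toy model. The sketch below is a user-space approximation for a single cgroup only. It assumes hypothetical names (`ResourcePool`, `RdmaCgroupModel`, `set_limit`, `try_charge`, `uncharge`, `remove_limits`) that do not come from the patch, and it ignores hierarchy, locking and the 64-resource cap.

```python
from typing import Dict, Optional


class ResourcePool:
    """Per-device pool for one cgroup; a missing limit means 'max' (unlimited)."""

    def __init__(self, is_default: bool) -> None:
        self.is_default = is_default
        self.limits: Dict[str, int] = {}
        self.usage: Dict[str, int] = {}


class RdmaCgroupModel:
    """Toy model of one rdma cgroup; all names here are illustrative assumptions."""

    def __init__(self) -> None:
        self.pools: Dict[str, ResourcePool] = {}

    def set_limit(self, device: str, resource: str, limit: int) -> None:
        # Case (a): an administrative write creates the pool (or promotes a default one).
        pool = self.pools.setdefault(device, ResourcePool(is_default=False))
        pool.is_default = False
        pool.limits[resource] = limit

    def try_charge(self, device: str, resource: str) -> bool:
        # Case (b): charging without any configured limit creates a default pool,
        # so that a later uncharge always finds a matching usage count.
        pool = self.pools.setdefault(device, ResourcePool(is_default=True))
        limit: Optional[int] = pool.limits.get(resource)
        used = pool.usage.get(resource, 0)
        if limit is not None and used >= limit:
            return False
        pool.usage[resource] = used + 1
        return True

    def uncharge(self, device: str, resource: str) -> None:
        pool = self.pools[device]
        if pool.usage.get(resource, 0) <= 0:
            raise ValueError("uncharge without a matching charge")
        pool.usage[resource] -= 1
        # A default pool is destroyed when its last resource is released;
        # an administrative pool is kept for future use.
        if pool.is_default and not any(pool.usage.values()):
            del self.pools[device]

    def remove_limits(self, device: str) -> None:
        pool = self.pools.get(device)
        if pool is None:
            return
        pool.limits.clear()
        if any(pool.usage.values()):
            pool.is_default = True   # active resources: keep accounting, limits become 'max'
        else:
            del self.pools[device]   # nothing charged: the pool can go away


if __name__ == "__main__":
    cg = RdmaCgroupModel()
    cg.set_limit("mlx4_0", "qp", 1)
    print(cg.try_charge("mlx4_0", "qp"))   # first QP fits within the limit
    print(cg.try_charge("mlx4_0", "qp"))   # second QP exceeds it
```

The default pool is what keeps the usage count from going negative when limits are configured after resources were already charged, which is the design point the document makes in place of time markers.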