Message ID: 20130606124351.GZ27176@twins.programming.kicks-ass.net (mailing list archive)
State: Not Applicable, archived
On Thu, 6 Jun 2013, Peter Zijlstra wrote:

> Since RLIMIT_MEMLOCK is very clearly a limit on the amount of pages the
> process can 'lock' into memory it should very much include pinned pages
> as well as mlock()ed pages. Neither can be paged.

So we thought that this is the sum of the pages that a process has
mlocked. Initiated by the process and/or environment explicitly. A user
space initiated action.

> Since nobody had anything constructive to say about the VM_PINNED
> approach and the IB code hurts my head too much to make it work I
> propose we revert said patch.

I said that the use of a PIN page flag would allow correct accounting if
one wanted to interpret the limit the way you do.

> Once again the rationale; MLOCK(2) is part of the POSIX Realtime
> Extension (1003.1b-1993/1003.1i-1995). It states that the specified part
> of the user address space should stay memory resident until either
> program exit or a matching munlock() call.
>
> This definition basically excludes major faults from happening on the
> pages -- a major fault being one where IO needs to happen to obtain the
> page content; the direct implication being that page content must remain
> in memory.

Exactly, that is the definition.

> Linux has taken this literally and made mlock()ed pages subject to page
> migration (albeit only for the explicit move_pages() syscall; but it
> would very much like to make them subject to implicit page migration for
> the purpose of compaction etc.).

Page migration is not a page fault? The ability to move a process
completely (including its mlocked segments) is important for the manual
migration of process memory. That is what page migration was made for. If
mlocked pages are treated as pinned pages then the complete process can
no longer be moved from node to node.
> This view disregards the intention of the spec; since mlock() is part of
> the realtime spec the intention is very much that the user address range
> generate no faults; neither minor nor major -- any delay is
> unacceptable.

Where does it say that no faults are generated? Don't we generate COW
faults on mlocked ranges?

> This leaves the RT people unhappy -- therefore _if_ we continue with
> this Linux specific interpretation of mlock() we must introduce new
> syscalls that implement the intended mlock() semantics.

Intended means Peter's semantics?

> It was found that there are useful purposes for this weaker mlock(), a
> rationale to indeed have two sets of syscalls. The weaker mlock() can be
> used in the context of security -- where we avoid sensitive data being
> written to disk, and in the context of userspace daemons that are part
> of the IO path -- which would otherwise form IO deadlocks.

Migratable mlocked pages enable complete process migration between nodes
of a NUMA system for HPC workloads.

> The proposed second set of primitives would be mpin() and munpin() and
> would implement the intended mlock() semantics.

I agree that we need mpin and munpin. But they should not be called mlock
semantics.

> Such pages would not be migratable in any way (a possible
> implementation would be to 'pin' the pages using an extra refcount on
> the page frame). From the above we can see that any mpin()ed page is
> also an mlock()ed page, since mpin() will disallow any fault, and thus
> will also disallow major faults.

That cannot be so since mlocked pages need to be migratable.

> While we still lack the formal mpin() and munpin() syscalls there are a
> number of sites that have similar 'side effects' and result in user
> controlled 'pinning' of pages. Namely IB and perf.

Right, that's why we need this.

> For the purpose of RLIMIT_MEMLOCK we must use intent only as it is not
> part of the formal spec.
> The only useful thing is to limit the amount of
> pages a user can exempt from paging. This would therefore include all
> pages either mlock()ed or mpin()ed.

RLIMIT_MEMLOCK is a limit on the pages that a process has mlocked into
memory. Pinning is not initiated by user space but by the kernel. Either
temporarily (page count increases are used all over the kernel for this)
or for a longer time frame (IB and perf, and likely more drivers that we
have not found yet).

>
> Back to the patch; a resource limit must have a resource counter to
> enact the limit upon. Before the patch this was mm_struct::locked_vm.
> After the patch there is no such thing left.

The limit was not checked correctly before the patch since pinned pages
were accounted as mlocked.

> I state that since mlockall() disables/invalidates RLIMIT_MEMLOCK the
> actual resource counter value is irrelevant, and thus the reported
> problem is a non-problem.

Where does it disable RLIMIT_MEMLOCK?

> However, it would still be possible to observe weirdness in the very
> unlikely event that a user would indeed call mlock() upon an address
> range obtained from IB/perf. In this case he would be unduly constrained
> and find his effective RLIMIT_MEMLOCK limit halved (at worst).

This is weird for other reasons as well, since we are using two different
methods: mlock() leads to the page being marked PG_mlocked and put on a
special LRU list, whereas pinning is simply an increase in the refcount.
It's going to be difficult to keep the accounting straight because a page
can have both.

> I've yet to hear a coherent objection to the above. Christoph is always
> quick to yell: 'but it fixes a double accounting issue' but is
> completely deaf to the fact that he changed user visible semantics
> without mention and regard.

The semantics we agreed upon for mlock were that mlocked pages are
migratable. Pinned pages cannot be migrated and therefore are not mlocked
pages. It would be best to have two different mechanisms for this.
Note that the pinning is always done by the kernel and/or a device driver
related to some other activity, whereas mlocking is initiated by
userspace.

We do not need a new API. If you want to avoid the moving of mlocked
pages then disable the mechanisms that need to move processes around.
Having the ability in general to move processes is a degree of control
that we want. Pinning could be restricted to where it is required by the
hardware.

We could also add a process flag that exempts a certain process from page
migration, which may be the result of NUMA scheduler actions, explicit
requests from the user to migrate a process, defrag or compaction.

Excessive amounts of pinned pages severely limit the ability of the
kernel to manage its memory, which may cause instabilities. mlocked pages
are better because the kernel still has some options.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jun 06, 2013 at 06:46:50PM +0000, Christoph Lameter wrote:
> On Thu, 6 Jun 2013, Peter Zijlstra wrote:
>
> > Since RLIMIT_MEMLOCK is very clearly a limit on the amount of pages the
> > process can 'lock' into memory it should very much include pinned pages
> > as well as mlock()ed pages. Neither can be paged.
>
> So we thought that this is the sum of the pages that a process has
> mlocked. Initiated by the process and/or environment explicitly. A user
> space initiated action.

Which "we"? Also, it remains a fact that your changelog didn't mention
this change in semantics at all. Nor did you CC all affected parties.

However you twist this, your patch leaves an inconsistent mess. If you
really think they're two different things then you should have
introduced a second RLIMIT_MEMPIN to go along with your counter.

I'll argue against such a thing; for I think that limiting the total
amount of pages a user can exempt from paging is the far more
useful/natural thing to measure/limit.

> > Since nobody had anything constructive to say about the VM_PINNED
> > approach and the IB code hurts my head too much to make it work I
> > propose we revert said patch.
>
> I said that the use of a PIN page flag would allow correct accounting if
> one wanted to interpret the limit the way you do.

You failed to explain how that would help any. With a pin page flag you
still need to find the mm to unaccount crap from. Also, all user
controlled address space ops operate on vmas. We had VM_LOCKED far
before we had the lock page flag. And you cannot replace all VM_LOCKED
utility with the page flag either.

> > Once again the rationale; MLOCK(2) is part of the POSIX Realtime
> > Extension (1003.1b-1993/1003.1i-1995). It states that the specified
> > part of the user address space should stay memory resident until
> > either program exit or a matching munlock() call.
> >
> > This definition basically excludes major faults from happening on the
> > pages -- a major fault being one where IO needs to happen to obtain the
> > page content; the direct implication being that page content must remain
> > in memory.
>
> Exactly, that is the definition.
>
> > Linux has taken this literally and made mlock()ed pages subject to page
> > migration (albeit only for the explicit move_pages() syscall; but it
> > would very much like to make them subject to implicit page migration for
> > the purpose of compaction etc.).
>
> Page migration is not a page fault?

It introduces faults; what happens when a process hits the migration
pte? It gets a random delay and eventually services a minor fault to the
new page. At which point the saw will have cut your finger off (going
with the most popular RT application ever -- that of a bandsaw and a
laser beam).

> The ability to move a process
> completely (including its mlocked segments) is important for the manual
> migration of process memory. That is what page migration was made for. If
> mlocked pages are treated as pinned pages then the complete process can
> no longer be moved from node to node.
>
> > This view disregards the intention of the spec; since mlock() is part of
> > the realtime spec the intention is very much that the user address range
> > generate no faults; neither minor nor major -- any delay is
> > unacceptable.
>
> Where does it say that no faults are generated? Don't we generate COW
> faults on mlocked ranges?

That's under user control. If the user uses fork() the user can avoid
those faults by pre-faulting the pages.

> > This leaves the RT people unhappy -- therefore _if_ we continue with
> > this Linux specific interpretation of mlock() we must introduce new
> > syscalls that implement the intended mlock() semantics.
>
> Intended means Peter's semantics?

No, I don't actually write RT applications.
But I've had plenty of arguments with RT people when I explained to them
what our mlock() actually does vs what they expected it to do. They're
not happy. Aside from that, you HPC/HFT minimal latency lot should very
well appreciate the minimal interference they actually expect.

> > It was found that there are useful purposes for this weaker mlock(), a
> > rationale to indeed have two sets of syscalls. The weaker mlock() can be
> > used in the context of security -- where we avoid sensitive data being
> > written to disk, and in the context of userspace daemons that are part
> > of the IO path -- which would otherwise form IO deadlocks.
>
> Migratable mlocked pages enable complete process migration between nodes
> of a NUMA system for HPC workloads.

This might well be; and I'm not arguing we remove this. I'm merely
stating that it doesn't make everybody happy. Also, what purpose do HPC
type applications have for mlock()?

> > The proposed second set of primitives would be mpin() and munpin() and
> > would implement the intended mlock() semantics.
>
> I agree that we need mpin and munpin. But they should not be called mlock
> semantics.

Here we must disagree I fear; given that mlock() is of RT origin and RT
people very much want/expect mlock() to do what our proposed mpin() will
do.

> > Such pages would not be migratable in any way (a possible
> > implementation would be to 'pin' the pages using an extra refcount on
> > the page frame). From the above we can see that any mpin()ed page is
> > also an mlock()ed page, since mpin() will disallow any fault, and thus
> > will also disallow major faults.
>
> That cannot be so since mlocked pages need to be migratable.

I'm talking about the proposed mpin() stuff.

> > While we still lack the formal mpin() and munpin() syscalls there are a
> > number of sites that have similar 'side effects' and result in user
> > controlled 'pinning' of pages. Namely IB and perf.
>
> Right, that's why we need this.
So I proposed most of the machinery that would be required to actually
implement the syscalls. Except that the IB code stumped me. In
particular I cannot easily find the userspace address to unpin in the
ipath/qib release paths. Once we have that we can trivially implement
the syscalls.

> > For the purpose of RLIMIT_MEMLOCK we must use intent only as it is not
> > part of the formal spec. The only useful thing is to limit the amount of
> > pages a user can exempt from paging. This would therefore include all
> > pages either mlock()ed or mpin()ed.
>
> RLIMIT_MEMLOCK is a limit on the pages that a process has mlocked into
> memory.

See my argument above; I don't think userspace is served well with two
RLIMITs.

> Pinning is not initiated by user space but by the kernel. Either
> temporarily (page count increases are used all over the kernel for this)
> or for a longer time frame (IB and perf, and likely more drivers that we
> have not found yet).

Here I disagree. I'll argue that all pins that require userspace action
to go away are userspace controlled pins. This makes IB/perf user
controlled pins and they should thus be treated the same as
mpin()/munpin() calls.

> >
> > Back to the patch; a resource limit must have a resource counter to
> > enact the limit upon. Before the patch this was mm_struct::locked_vm.
> > After the patch there is no such thing left.
>
> The limit was not checked correctly before the patch since pinned pages
> were accounted as mlocked.

This goes back to what you want the limit to mean; I think a single
limit counting all pages exempt from paging is the far more useful
limit. Again, your changelog was completely devoid of any rlimit
discussion.

> > I state that since mlockall() disables/invalidates RLIMIT_MEMLOCK the
> > actual resource counter value is irrelevant, and thus the reported
> > problem is a non-problem.
>
> Where does it disable RLIMIT_MEMLOCK?

The only way to get an MCL_FUTURE is through CAP_IPC_LOCK.
Practically, most MCL_CURRENT requests are also larger than
RLIMIT_MEMLOCK and thus also require CAP_IPC_LOCK. Once you have
CAP_IPC_LOCK, RLIMIT_MEMLOCK becomes irrelevant.

> > However, it would still be possible to observe weirdness in the very
> > unlikely event that a user would indeed call mlock() upon an address
> > range obtained from IB/perf. In this case he would be unduly constrained
> > and find his effective RLIMIT_MEMLOCK limit halved (at worst).
>
> This is weird for other reasons as well, since we are using two different
> methods: mlock() leads to the page being marked PG_mlocked and put on a
> special LRU list, whereas pinning is simply an increase in the refcount.
> It's going to be difficult to keep the accounting straight because a
> page can have both.

The way the kernel implements these semantics is irrelevant.

> > I've yet to hear a coherent objection to the above. Christoph is always
> > quick to yell: 'but it fixes a double accounting issue' but is
> > completely deaf to the fact that he changed user visible semantics
> > without mention and regard.
>
> The semantics we agreed upon for mlock were that mlocked pages are
> migratable. Pinned pages cannot be migrated and therefore are not
> mlocked pages. It would be best to have two different mechanisms for
> this.

Best for whom? How is having two RLIMIT settings better than having one?
For me, being able to migrate a page is a consequence of allowing minor
faults, not a fundamental property.

> Note that the
> pinning is always done by the kernel and/or device driver related to some
> other activity whereas mlocking is initiated by userspace.

See my earlier argument. If unpinning requires userspace action, it's a
userspace controlled pin. Also, for perf the setup is very much a
userspace action too -- an mmap() call -- and I expect the same is true
for IB, although it might use different syscalls.

> We do not need a new API.
This is in direct contradiction with your earlier statement where you
say: "I agree that we need mpin and munpin." Please make up your mind.

> If you want to avoid the moving of mlocked pages then disable the
> mechanisms that need to move processes around.

This is not really a feasible option anymore; page migration is creeping
all over the place. Also, who is to say people don't want to run
something with huge pages alongside their RT proglet.

> Having the ability in
> general to move processes is a degree of control that we want. Pinning
> could be restricted to where it is required by the hardware.

Or software; you cannot blindly dismiss the realtime guarantees the
software needs to provide.

> We could also add a process flag that exempts a certain process from page
> migration, which may be the result of NUMA scheduler actions, explicit
> requests from the user to migrate a process, defrag or compaction.

This too is a new API, and one I think is much less precise than mpin().
You wouldn't want an entire process to be exempt from paging. Only the
code and data involved with the realtime part of the process need stay
put.

Furthermore, mpin() could simply migrate the pages into as few unmovable
page blocks as possible to minimise the impact on the rest of the VM.
Allocating entire processes as unmovable and migrating any DSO they
touch into unmovable is a far more onerous endeavour.

With the pre-compaction of pinned memory, I don't see the VM impact of
pinned pages being much bigger than that of mlocked pages. The main
issue will be limiting the amount of memory exempt from paging, not
fragmentation per se. Yes, you can get some unmovable block
fragmentation on unpin, but if it's a common thing a new pin will be
able to fill those blocks again.

> Excessive amounts of pinned pages severely limit the ability of the
> kernel to manage its memory, which may cause instabilities. mlocked pages
> are better because the kernel still has some options.
I'm not arguing which are better; I've even conceded we need mpin() and
can keep the current mlock() semantics. I'm arguing for what we _need_.
On Fri, 7 Jun 2013, Peter Zijlstra wrote:

> However you twist this, your patch leaves an inconsistent mess. If you
> really think they're two different things then you should have
> introduced a second RLIMIT_MEMPIN to go along with your counter.

Well, continuing to repeat myself: I worked based on agreed upon
characteristics of mlocked pages. The patch was there to address
brokenness in the mlock accounting because someone naively assumed that
pinning = mlock.

> I'll argue against such a thing; for I think that limiting the total
> amount of pages a user can exempt from paging is the far more
> useful/natural thing to measure/limit.

Pinned pages are exempted by the kernel. A device driver or some other
kernel process (reclaim, page migration, IO etc.) increases the page
count. There is currently no consistent accounting for pinned pages. The
pinned_vm counter was introduced to allow the largest pinners to track
what they did.

> > I said that the use of a PIN page flag would allow correct accounting if
> > one wanted to interpret the limit the way you do.
>
> You failed to explain how that would help any. With a pin page flag you
> still need to find the mm to unaccount crap from. Also, all user
> controlled address space ops operate on vmas.

Pinning is kernel controlled...

> > Page migration is not a page fault?
>
> It introduces faults; what happens when a process hits the migration
> pte? It gets a random delay and eventually services a minor fault to the
> new page.

Ok, but this is similar to reclaim and other such things that unmap
pages.

> At which point the saw will have cut your finger off (going with the
> most popular RT application ever -- that of a bandsaw and a laser beam).

I am pretty confused by your newer notion of RT. RT was about high
latency deterministic behavior, I thought. RT was basically an abused
marketing term and was referring to the bloating of the kernel with all
sorts of fair stuff that slows us down.
What happened to make you work on low latency stuff? There is some shift
that you still need to go through to make that transition. Yes, you
would want to avoid reclaim and all sorts of other stuff for low
latency. So you disable auto NUMA, defrag etc. to avoid these things.

> > > This leaves the RT people unhappy -- therefore _if_ we continue with
> > > this Linux specific interpretation of mlock() we must introduce new
> > > syscalls that implement the intended mlock() semantics.
> >
> > Intended means Peter's semantics?
>
> No, I don't actually write RT applications. But I've had plenty of
> arguments with RT people when I explained to them what our mlock()
> actually does vs what they expected it to do.

Ok. Guess this is all new to you at this point. I am happy to see that
you are willing to abandon your evil ways (although under pressure from
your users) and are willing to put the low latency people in the RT camp
now.

> They're not happy. Aside from that, you HPC/HFT minimal latency lot
> should very well appreciate the minimal interference they actually
> expect.

Sure we do, and we know how to work around the "fair scheduler" and
other stuff. But you are breaking the basics of how we do things with
your conflation of pinning and mlocking. We do not migrate, and do not
allow defragmentation or reclaim, when running low latency applications.
These are non-issues.

> This might well be; and I'm not arguing we remove this. I'm merely
> stating that it doesn't make everybody happy. Also, what purpose do HPC
> type applications have for mlock()?

HPC wants to keep pages in memory to avoid eviction. HPC apps are not as
sensitive to faults as low latency apps are. Minor faults have
traditionally been tolerated there. The lower you get in terms of the
latencies required, the more difficult OS control becomes.
> Here we must disagree I fear; given that mlock() is of RT origin and RT
> people very much want/expect mlock() to do what our proposed mpin() will
> do.

RT is a dirty word for me given the fairness and bloat issues. Not sure
what you mean with that. mlock is a means to keep data in memory and not
a magical wand that avoids all OS handling of the page.

> > That cannot be so since mlocked pages need to be migratable.
>
> I'm talking about the proposed mpin() stuff.

Could you write that up in detail? I am not sure how this could work at
this point.

> So I proposed most of the machinery that would be required to actually
> implement the syscalls. Except that the IB code stumped me. In
> particular I cannot easily find the userspace address to unpin in the
> ipath/qib release paths.
>
> Once we have that we can trivially implement the syscalls.

Why would you need syscalls? Pinning is driver/kernel subsystem
initiated and therefore the driver can do the pin/unpin calls.

> > Pinning is not initiated by user space but by the kernel. Either
> > temporarily (page count increases are used all over the kernel for this)
> > or for a longer time frame (IB and perf, and likely more drivers that
> > we have not found yet).
>
> Here I disagree. I'll argue that all pins that require userspace action
> to go away are userspace controlled pins. This makes IB/perf user
> controlled pins and they should thus be treated the same as
> mpin()/munpin() calls.

Both IB and perf are kernel subsystems that pin as a side effect of
another syscall.

> This goes back to what you want the limit to mean; I think a single
> limit counting all pages exempt from paging is the far more useful
> limit.

Then you first need to show that the scheme accounts for *all* pages
exempt from paging. This includes dirty pages and various pages of other
subsystems that have an increased refcount in order to keep these pages
where they are.
We need a couple of passes through the kernel to find these locations
and build up counters for them. Not all of these pages will be
associated with processes.

> The only way to get an MCL_FUTURE is through CAP_IPC_LOCK. Practically,
> most MCL_CURRENT requests are also larger than RLIMIT_MEMLOCK and thus
> also require CAP_IPC_LOCK.
>
> Once you have CAP_IPC_LOCK, RLIMIT_MEMLOCK becomes irrelevant.

mlockall does not require CAP_IPC_LOCK. Never had an issue.

> > The semantics we agreed upon for mlock were that mlocked pages are
> > migratable. Pinned pages cannot be migrated and therefore are not
> > mlocked pages. It would be best to have two different mechanisms for
> > this.
>
> Best for whom? How is having two RLIMIT settings better than having
> one?

The pinned pages do not need to have a separate limit. They are kernel
resources and could be bounded in other ways together with other
resources that could cause danger to the kernel (such as too many dirty
pages). Pinned pages are often not associated with a process.

> > We do not need a new API.
>
> This is in direct contradiction with your earlier statement where you
> say: "I agree that we need mpin and munpin."
>
> Please make up your mind.

Yes, we need mpin and munpin as an in-kernel API. Glad to clue you in.

> > Having the ability in
> > general to move processes is a degree of control that we want. Pinning
> > could be restricted to where it is required by the hardware.
>
> Or software; you cannot blindly dismiss the realtime guarantees the
> software needs to provide.

I cringe when I hear "realtime guarantees" intended to refer to low
latency... But maybe they turned you finally? And I may just be feeling
my decade-old frustration with your old approach to "realtime". What
exactly do you mean by realtime guarantees? (Given that "realtime" is
such an overloaded marketing term.)

> > Excessive amounts of pinned pages severely limit the ability of the
> > kernel to manage its memory, which may cause instabilities.
> > mlocked pages
> > are better because the kernel still has some options.
>
> I'm not arguing which are better; I've even conceded we need mpin() and
> can keep the current mlock() semantics. I'm arguing for what we _need_.

We need the kernel to track the pinned pages (all of them, not just
those from IB and perf) and provide proper global boundaries to avoid
kernel failures. Yes.
Let's try to get this wrapped up?

On Thu, 6 Jun 2013 14:43:51 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
> broke RLIMIT_MEMLOCK.

I rather like what bc3e53f682 did, actually. RLIMIT_MEMLOCK limits the
amount of memory you can mlock(). Nice and simple.

This pinning thing which infiniband/perf are doing is conceptually
different, and if we care at all, perhaps we should be looking at adding
RLIMIT_PINNED.

> Before that patch: mm_struct::locked_vm < RLIMIT_MEMLOCK; after that
> patch we have: mm_struct::locked_vm < RLIMIT_MEMLOCK &&
> mm_struct::pinned_vm < RLIMIT_MEMLOCK.

But this is a policy decision which was implemented in perf_mmap() and
perf can alter that decision. How bad would it be if perf just ignored
RLIMIT_MEMLOCK?

drivers/infiniband/hw/qib/qib_user_pages.c has issues, btw. It compares
the amount-to-be-pinned with rlimit(RLIMIT_MEMLOCK), but forgets to also
look at current->mm->pinned_vm. Duh.

It also does the pinned accounting in __qib_get_user_pages(), but for
__qib_release_user_pages() the caller is supposed to do it, which is
rather awkward.

Longer-term I don't think that infiniband or perf should be dinking
around with rlimit(RLIMIT_MEMLOCK) or ->pinned_vm. Those policy
decisions should be hoisted into a core mm helper where we can do it
uniformly (and more correctly than infiniband's attempt!).
On Thu, Jun 13, 2013 at 02:06:32PM -0700, Andrew Morton wrote:
> Let's try to get this wrapped up?
>
> On Thu, 6 Jun 2013 14:43:51 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
>
> > Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
> > broke RLIMIT_MEMLOCK.
>
> I rather like what bc3e53f682 did, actually. RLIMIT_MEMLOCK limits the
> amount of memory you can mlock(). Nice and simple.
>
> This pinning thing which infiniband/perf are doing is conceptually
> different, and if we care at all, perhaps we should be looking at adding
> RLIMIT_PINNED.

We could do that; but I really don't like doing it for the reasons I
outlined previously. It gives the user another knob to twiddle which is
pretty much the same as one he already has, just slightly different.

Like I said, I see RLIMIT_MEMLOCK to mean the amount of pages the user
can exempt from paging, since that is what the VM cares about most.

> > Before that patch: mm_struct::locked_vm < RLIMIT_MEMLOCK; after that
> > patch we have: mm_struct::locked_vm < RLIMIT_MEMLOCK &&
> > mm_struct::pinned_vm < RLIMIT_MEMLOCK.
>
> But this is a policy decision which was implemented in perf_mmap() and
> perf can alter that decision. How bad would it be if perf just ignored
> RLIMIT_MEMLOCK?

Then it could pin all memory -- seems like something bad.

> drivers/infiniband/hw/qib/qib_user_pages.c has issues, btw. It
> compares the amount-to-be-pinned with rlimit(RLIMIT_MEMLOCK), but
> forgets to also look at current->mm->pinned_vm. Duh.
>
> It also does the pinned accounting in __qib_get_user_pages(), but for
> __qib_release_user_pages() the caller is supposed to do it, which is
> rather awkward.
>
> Longer-term I don't think that infiniband or perf should be dinking
> around with rlimit(RLIMIT_MEMLOCK) or ->pinned_vm. Those policy
> decisions should be hoisted into a core mm helper where we can do it
> uniformly (and more correctly than infiniband's attempt!).
Agreed, hence my VM_PINNED proposal that would lift most of that into
the core VM. I just got really lost in the IB code :/
On Fri, Jun 07, 2013 at 02:52:05PM +0000, Christoph Lameter wrote:
> On Fri, 7 Jun 2013, Peter Zijlstra wrote:
>
> > However you twist this, your patch leaves an inconsistent mess. If you
> > really think they're two different things then you should have
> > introduced a second RLIMIT_MEMPIN to go along with your counter.
>
> Well, continuing to repeat myself: I worked based on agreed upon
> characteristics of mlocked pages. The patch was there to address
> brokenness in the mlock accounting because someone naively assumed that
> pinning = mlock.

They did no such thing; I am one of those who wrote such code. I
expressly used RLIMIT_MEMLOCK because it is the one limit userspace has
to limit pages that are exempt from paging.

> > I'll argue against such a thing; for I think that limiting the total
> > amount of pages a user can exempt from paging is the far more
> > useful/natural thing to measure/limit.
>
> Pinned pages are exempted by the kernel. A device driver or some other
> kernel process (reclaim, page migration, IO etc.) increases the page
> count. There is currently no consistent accounting for pinned pages. The
> pinned_vm counter was introduced to allow the largest pinners to track
> what they did.

No, not the largest: user space controlled pinners. The thing that makes
all the difference is the _USER_ control.

> > > I said that the use of a PIN page flag would allow correct accounting if
> > > one wanted to interpret the limit the way you do.
> >
> > You failed to explain how that would help any. With a pin page flag you
> > still need to find the mm to unaccount crap from. Also, all user
> > controlled address space ops operate on vmas.
>
> Pinning is kernel controlled...

That's pure bull; perf and IB have user controlled pinning.

> > > Page migration is not a page fault?
> >
> > It introduces faults; what happens when a process hits the migration
> > pte? It gets a random delay and eventually services a minor fault to the
> > new page.
> Ok but this is similar to reclaim and other such things that are
> unmapping pages.

Which are avoided by mlock(); an mlock()ed page will (typically) not get
unmapped.

> > At which point the saw will have cut your finger off (going with the
> > most popular RT application ever -- that of a bandsaw and a laser beam).
>
> I am pretty confused by your newer notion of RT. RT was about high latency
> deterministic behavior I thought. RT was basically an abused marketing
> term and was referring to the bloating of the kernel with all sorts of
> fair stuff that slows us down. What happened to make you work on low
> latency stuff? There is some shift that you still need to go through to
> make that transition. Yes, you would want to avoid reclaim and all sorts
> of other stuff for low latency. So you disable auto NUMA, defrag etc to
> avoid these things.

It's about low latency deterministic stuff; it's just that worst case
latency is more important than avg latency and hence some things get
more expensive on avg. But we don't want to disable defrag, defrag is
good for the general health of the system. Also not all applications are
RT; we want a kernel that's able to run general purpose stuff along with
some RT apps.

> > > > This leaves the RT people unhappy -- therefore _if_ we continue with
> > > > this Linux specific interpretation of mlock() we must introduce new
> > > > syscalls that implement the intended mlock() semantics.
> > >
> > > Intended means Peter's semantics?
> >
> > No, I don't actually write RT applications. But I've had plenty of
> > arguments with RT people when I explained to them what our mlock()
> > actually does vs what they expected it to do.
>
> Ok Guess this is all new to you at this point. I am happy to see that you
> are willing to abandon your evil ways (although under pressure from your
> users) and are willing to put the low latency people now in the RT camp.

*sigh*.. we have shared goals up to a point.
RT bounds and if possible lowers the worst case latency. The worst case
is absolutely most important for us. By doing so we often also lower the
avg latency, but it's not the primary goal.

> > They're not happy. Aside from that; you HPC/HFT minimal latency lot
> > should very well appreciate the minimal interference stuff they do
> > actually expect.
>
> Sure we do and we know how to do things to work around the "fair
> scheduler" and other stuff. But you are breaking the basics of how we do
> things with your conflation of pinning and mlocking.

Baseless statement there; I've given a very clear and concise definition
of how I see mlock() and mpin() work together. You can't dismiss all
that and expect people to take you seriously.

And argue until your face is blue but IB and perf are very much user
controlled pins.

> We do not migrate, do not allow defragmentation or reclaim when running
> low latency applications. These are non issues.

Sounds like you lot are a bunch of work-around hacks. We very much want
to allow running such applications without having to rebuild the kernel.
In fact we want to allow running apps that need these CONFIG things both
enabled and disabled.

And while page migration and defrag have CONFIG knobs, reclaim does not;
file based reclaim is unconditional (except as already noted when using
mlock()).

I just want to be able to have all that enabled and still have
applications be able to express their needs and be able to run.

> > Here we must disagree I fear; given that mlock() is of RT origin and RT
> > people very much want/expect mlock() to do what our proposed mpin() will
> > do.
>
> RT is a dirty word for me given the fairness and bloat issue. Not sure
> what you mean with that. mlock is a means to keep data in memory and not a
> magical wand that avoids all OS handling of the page.

RT is about bounded worst case latency, and a preference for that bound
to be as low as possible.
That is all RT is and wants to be -- it's definitely not about fairness
in any way shape or form.

But yes, the intention was very much for the page (and mapping) to be
left alone by the OS -- it's unfortunate the spec's wording doesn't
clarify this.

> > > That cannot be so since mlocked pages need to be migratable.
> >
> > I'm talking about the proposed mpin() stuff.
>
> Could you write that up in detail? I am not sure how this could work at
> this point.

*omg*.. I did. mpin() (and we'll include IB and perf here) is a stronger
mlock() in that it must avoid all faults, not only the major faults.

After that there's just implementation details; the proposed
implementation used VM_PINNED to tag the VMAs the user called mpin() on.
It will use get_user_pages() to acquire a ref and effect the actual
pinning. Below I also proposed we could pre-compact such regions into
UNMOVABLE page blocks. But I don't much care about the implementation
per-se. All I've been really arguing for is semantics.

> > So I proposed most of the machinery that would be required to actually
> > implement the syscalls. Except that the IB code stumped me. In
> > particular I cannot easily find the userspace address to unpin for
> > ipath/qib release paths.
> >
> > Once we have that we can trivially implement the syscalls.
>
> Why would you need syscalls? Pinning is driver/kernel subsystem initiated
> and therefore the driver can do the pin/unpin calls.

mpin()/munpin() syscalls so userspace can effect the same on their
desired ranges.

> > > Pinning is not initiated by user space but by the kernel. Either
> > > temporarily (page count increases are used all over the kernel for this)
> > > or for longer time frame (IB and Perf and likely more drivers that we have
> > > not found yet).
> >
> > Here I disagree, I'll argue that all pins that require userspace action
> > to go away are userspace controlled pins.
> > This makes IB/Perf user controlled pins and should thus be treated the
> > same as mpin()/munpin() calls.
>
> Both IB and Perf are kernel subsystems that pin as a side effect of
> another syscall.

Yeah so? You still have user controlled pinning; the kernel cannot get
rid of these resources on its own.

> > This goes back to what you want the limit to mean; I think a single
> > limit counting all pages exempt from paging is the far more useful
> > limit.
>
> Then you first need to show that the scheme accounts for *all* pages
> exempt from paging. This includes dirty pages and various pages of other
> subsystems that have an increased refcount in order to keep these pages
> where they are. We need a couple of passes through the kernel to find
> these locations and build up counters for these. Not all of these pages
> will be associated with processes.

All _user_ controlled pages exempt from paging. This would be mlock() +
IB + perf. AFAIK there's nothing else that a user can do to exempt pages
from paging.

Our big disconnect seems to be user controlled vs kernel controlled.
Why do you care about kernel pins? Those are transient, are they not?

> > The only way to get an MCL_FUTURE is through CAP_IPC_LOCK. Practically
> > most MCL_CURRENT are also larger than RLIMIT_MEMLOCK and this also
> > requires CAP_IPC_LOCK.
> >
> > Once you have CAP_IPC_LOCK, RLIMIT_MEMLOCK becomes irrelevant.
>
> mlockall does not require CAP_IPC_LOCK. Never had an issue.

MCL_FUTURE does absolutely require CAP_IPC_LOCK, MCL_CURRENT requires a
huge (as opposed to the default 64k) RLIMIT or CAP_IPC_LOCK.

There's no argument there, look at the code.

> > > The semantics we agreed upon for mlock were that they are migratable.
> > > Pinned pages cannot be migrated and therefore are not mlocked pages.
> > > It would be best to have two different mechanisms for this.
> >
> > Best for whom? How is making two RLIMIT settings better than having
> > one?
> The pinned pages do not need to have a separate limit. They are kernel
> resources and could be bounded in other ways together with other
> resources that could cause danger to the kernel (such as too many dirty
> pages). Pinned pages are often not associated with a process.

So why is grouping user space pinned pages with mlock()ed pages such a
bad idea? Both can cause VM deadlocks by not allowing pages to be
reclaimed.

Note that throughout I've been talking about user space controlled pins;
I really don't see a problem with kernel pins, which are by and large
small and transient.

> > > We do not need a new API.
> >
> > This is in direct contradiction with your earlier statement where you
> > say: "I agree that we need mpin and munpin.".
> >
> > Please make up your mind.
>
> Yes we need mpin and munpin as an in-kernel API. Glad to clue you in.

Dude,.. my VM_PINNED patch had that; it's you who seems to be somewhat
slow of wit here.

> > > Having the ability in
> > > general to move processes is a degree of control that we want. Pinning
> > > could be restricted to where it is required by the hardware.
> >
> > Or software; you cannot blindly dismiss realtime guarantees the software
> > needs to provide.
>
> I cringe when I hear realtime guarantees and intending that to refer to
> low latency.... But maybe they turned you finally? And I may just be
> feeling my decade old frustration with your old approach to "realtime".
>
> What do you exactly mean by realtime guarantees? (Given that "realtime" is
> such an overloaded marketing term).

Bounded, and as low as possible, worst case latency. The bound gives
determinism; the second gives finer resolution.

Yes there's a whole host of various real world realtime software
constraints. There's the hard realtime case where if you miss a deadline
the world 'ends' -- tokamak EM control for nuclear fusion; the band saw
vs finger thing etc.
And there's soft realtime in a hundred different shades but it's
generally understood to mean stuff that can recover from sporadic
lateness. We want to cater to all those people and thus typically focus
on the hard-rt case.

Now a RT application will typically have a large !RT part; it will log
data, maybe update decision matrices, talk to other computers etc. So
while we want our self driving car to not collide with stuff, we also
want to map the road, download maps, track other moving objects etc. Not
all of that has the same constraints; sensor data must always precede
maps if at all available, mapping/updating maps isn't at all important
but nice to have etc.

Such systems would like as full a general purpose system as possible
that is still able to provide guarantees where required. Therefore it's
desired to have 'nonsense' like compaction etc. enabled.

> > > Excessive amounts of pinned pages limit the ability of the kernel
> > > to manage its memory severely which may cause instabilities. mlocked
> > > pages are better because the kernel still has some options.
> >
> > I'm not arguing which are better; I've even conceded we need mpin() and
> > can keep the current mlock() semantics. I'm arguing for what we _need_.
>
> We need the kernel to track the pinned pages (all of them not just those
> from IB and perf) and provide proper global boundaries to avoid kernel
> failures. Yes.

That's not what I said. Also you failed to provide rationale for your
desire to do so. Why would we need to track the transient kernel pins?
Those will go away on their own.

The IB/perf/mpin() pins are special in that they're dependent on
userspace and will not go away on their own. Also they need to be
limited lest userspace consume too much of them and VM deadlocks might
ensue.
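For concreteness, the mpin()/VM_PINNED machinery Peter describes above
might look roughly like the following pseudocode sketch (2013-era kernel
style). This is not the actual patch: VM_PINNED existed only in his
unmerged series, and apply_vma_flag() is a made-up helper standing in
for the usual mlock_fixup()-style vma walking:

```c
/* Pseudocode sketch only; VM_PINNED and apply_vma_flag() are assumptions. */
SYSCALL_DEFINE2(mpin, unsigned long, start, size_t, len)
{
	struct mm_struct *mm = current->mm;
	long ret;

	down_write(&mm->mmap_sem);

	/* ...account the range against RLIMIT_MEMLOCK, as mlock() does... */

	/* 1) tag the range so the VM knows faults must never happen here */
	ret = apply_vma_flag(mm, start, len, VM_PINNED);

	/* 2) fault everything in now and take a reference on each page,
	 *    which exempts it from reclaim, migration and compaction */
	if (!ret)
		ret = get_user_pages(current, mm, start, len >> PAGE_SHIFT,
				     1 /* write */, 0 /* force */,
				     NULL, NULL);

	up_write(&mm->mmap_sem);
	return ret < 0 ? ret : 0;
}
```

munpin() would drop the flag and the page references; the IB/perf paths
would then be converted to share this accounting instead of open-coding
pinned_vm updates.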
On Thu, 13 Jun 2013, Andrew Morton wrote:

> Let's try to get this wrapped up?
>
> On Thu, 6 Jun 2013 14:43:51 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
>
> > Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
> > broke RLIMIT_MEMLOCK.
>
> I rather like what bc3e53f682 did, actually. RLIMIT_MEMLOCK limits the
> amount of memory you can mlock(). Nice and simple.
>
> This pinning thing which infiniband/perf are doing is conceptually
> different and if we care at all, perhaps we should be looking at adding
> RLIMIT_PINNED.

Actually PINNED is just a stronger version of MEMLOCK. PINNED and
MEMLOCK are both preventing the page from being paged out. PINNED adds
the constraint of preventing minor faults as well.

So I think the really important tuning knob is the limitation of pages
which cannot be paged out. And this is what RLIMIT_MEMLOCK is about.

Now if you want to add RLIMIT_PINNED as well, then it only limits the
number of pages which cannot create minor faults, but that does not
affect the limitation of total pages which cannot be paged out.

Thanks,

	tglx
On Mon, 17 Jun 2013, Peter Zijlstra wrote:

> They did no such thing; being one of those who wrote such code. I
> expressly used RLIMIT_MEMLOCK because it's the one limit userspace has
> to limit pages that are exempt from paging.

Don't remember reviewing that. Assumptions were wrong in that patch then.

> > Pinned pages are exempted by the kernel. A device driver or some other
> > kernel process (reclaim, page migration, io etc) increase the page count.
> > There is currently no consistent accounting for pinned pages. The
> > vm_pinned counter was introduced to allow the largest pinners to track
> > what they did.
>
> No, not the largest, user space controlled pinners. The thing that
> makes all the difference is the _USER_ control.

The pinning *cannot* be done from user space. Here it is the IB
subsystem that is doing it.

> > Pinning is kernel controlled...
>
> That's pure bull; perf and IB have user controlled pinning.

Pinning is a side effect of memory registration in IB.

> > Ok but this is similar to reclaim and other such things that are
> > unmapping pages.
>
> Which are avoided by mlock(); an mlock()ed page will (typically) not get
> unmapped.

Oh it will get unmapped definitely by page migration and such. Otherwise
it could not be moved.

> But we don't want to disable defrag, defrag is good for the general
> health of the system. Also not all applications are RT; we want a kernel
> that's able to run general purpose stuff along with some RT apps.

Still have no idea what you mean by RT.

> *sigh*.. we have shared goals up to a point. RT bounds and if possible
> lowers the worst case latency. The worst case is absolutely most
> important for us.

Ok that differs then.

> By doing so we often also lower the avg latency, but it's not the
> primary goal.

So far we have mostly seen these measures increase the average latency.

> And argue until your face is blue but IB and perf are very much user
> controlled pins.
Sure the user controls the memory registered with IB but the pinning is
done because a device maps the memory. Side effect.

> And while page migration and defrag have CONFIG knobs, reclaim does not;
> file based reclaim is unconditional (except as already noted when using
> mlock()).

Yes that is a big problem.

> I just want to be able to have all that enabled and still have
> applications be able to express their needs and be able to run.

That is not working because these goodies all cause additional latency,
by taking the processor away and disturbing the caches. You cannot have
the cake and eat it too.

> But yes, the intention was very much for the page (and mapping) to be
> left alone by the OS -- it's unfortunate the spec's wording doesn't
> clarify this.

All the talks about this with the mm developers that I have been in left
a different impression and we have implemented it according to the
understanding that mlocked pages are movable. It's not that I am that
fond of it but I accepted it and worked with that notion.

> > > I'm talking about the proposed mpin() stuff.
> >
> > Could you write that up in detail? I am not sure how this could work at
> > this point.
>
> *omg*.. I did. mpin() (and we'll include IB and perf here) is a stronger
> mlock() in that it must avoid all faults, not only the major faults.

Looked like hand waving to me. How exactly does mpin get implemented and
how does it interact with the vm? Do we have to create an additional LRU
for this? Does this do any good given that there is still reclaim and
lots of other stuff going on?

> But I don't much care about the implementation per-se. All I've been
> really arguing for is semantics.

Yes you are arguing to change the established semantics for mlocked
pages.

> > Both IB and Perf are kernel subsystems that pin as a side effect of
> > another syscall.
>
> Yeah so?
> You still have user controlled pinning; the kernel cannot get rid of
> these resources on its own.

Oh there is an mmu_notifier subsystem for this purpose. You can setup a
callback that can unpin pages. Again this is kernel specific. The device
driver needs to register the mmu notifier.

> All _user_ controlled pages exempt from paging. This would be mlock() +
> IB + perf. AFAIK there's nothing else that a user can do to exempt pages
> from paging.

What about dirtying pages, writeback?

> Our big disconnect seems to be user controlled vs kernel controlled.
>
> Why do you care about kernel pins? Those are transient, are they not?

Nope, the pins for page migration, IB etc can be quite long. And I
suspect that there are multiple subsystems that we are not aware of that
also pin for a longer time period but do not account for their pinning
actions.

> > mlockall does not require CAP_IPC_LOCK. Never had an issue.
>
> MCL_FUTURE does absolutely require CAP_IPC_LOCK, MCL_CURRENT requires a
> huge (as opposed to the default 64k) RLIMIT or CAP_IPC_LOCK.
>
> There's no argument there, look at the code.

I am sorry but we have been using mlockall() for years now without the
issues that you are bringing up. AFAICT mlockall does not require
MCL_FUTURE.

> > The pinned pages do not need to have a separate limit. They are kernel
> > resources and could be bounded in other ways together with other
> > resources that could cause danger to the kernel (such as too many
> > dirty pages). Pinned pages are often not associated with a process.
>
> So why is grouping user space pinned pages with mlock()ed pages such a
> bad idea? Both can cause VM deadlocks by not allowing pages to be
> reclaimed.

So can other things.

> > What do you exactly mean by realtime guarantees? (Given that "realtime"
> > is such an overloaded marketing term).
>
> Bounded, and as low as possible, worst case latency. The bound gives
> determinism; the second gives finer resolution.
Well and you still want to run reclaim on the processor etc? What is the
point of all of that when the OS can come in and take the processor away
for a much longer time frame.

> So while we want our self driving car to not collide with stuff, we also
> want to map the road, download maps, track other moving objects etc. Not
> all of that has the same constraints; sensor data must always precede
> maps if at all available, mapping/updating maps isn't at all important
> but nice to have etc.

Well if there is a big boulder on the road (reclaim) then I would worry
about that rather than put extensive effort into clearing the road of
small stones (like minor faults).

> That's not what I said. Also you failed to provide rationale for your
> desire to do so. Why would we need to track the transient kernel pins?
> Those will go away on their own.

I was not talking about transient pins but the longer term ones.

> The IB/perf/mpin() pins are special in that they're dependent on
> userspace and will not go away on their own. Also they need to be
> limited lest userspace consume too much of them and VM deadlocks might
> ensue.

Yes, that is what I said should be done.
* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Thu, 13 Jun 2013, Andrew Morton wrote:
> > Let's try to get this wrapped up?
> >
> > On Thu, 6 Jun 2013 14:43:51 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
> > > broke RLIMIT_MEMLOCK.
> >
> > I rather like what bc3e53f682 did, actually. RLIMIT_MEMLOCK limits the
> > amount of memory you can mlock(). Nice and simple.
> >
> > This pinning thing which infiniband/perf are doing is conceptually
> > different and if we care at all, perhaps we should be looking at adding
> > RLIMIT_PINNED.
>
> Actually PINNED is just a stronger version of MEMLOCK. PINNED and
> MEMLOCK are both preventing the page from being paged out. PINNED adds
> the constraint of preventing minor faults as well.
>
> So I think the really important tuning knob is the limitation of pages
> which cannot be paged out. And this is what RLIMIT_MEMLOCK is about.
>
> Now if you want to add RLIMIT_PINNED as well, then it only limits the
> number of pages which cannot create minor faults, but that does not
> affect the limitation of total pages which cannot be paged out.

Agreed.

( Furthermore, the RLIMIT_MEMLOCK semantics change actively broke code
  so this is not academic and it would be nice to progress with it. )

Thanks,

	Ingo
* Christoph Lameter <cl@gentwo.org> wrote:

> On Mon, 17 Jun 2013, Peter Zijlstra wrote:
>
> > They did no such thing; being one of those who wrote such code. I
> > expressly used RLIMIT_MEMLOCK because it's the one limit userspace has
> > to limit pages that are exempt from paging.
>
> Don't remember reviewing that. Assumptions were wrong in that patch then.
>
> > > Pinned pages are exempted by the kernel. A device driver or some other
> > > kernel process (reclaim, page migration, io etc) increase the page count.
> > > There is currently no consistent accounting for pinned pages. The
> > > vm_pinned counter was introduced to allow the largest pinners to track
> > > what they did.
> >
> > No, not the largest, user space controlled pinners. The thing that
> > makes all the difference is the _USER_ control.
>
> The pinning *cannot* be done from user space. Here it is the IB subsystem
> that is doing it.

Peter clearly pointed it out that in the perf case it's user-space that
initiates the pinned memory mapping which is resource-controlled via
RLIMIT_MEMLOCK - and this was implemented that way before your commit
broke the code.

You seem to be hell bent on defining 'memory pinning' only as "the thing
done via the mlock*() system calls", but that is a nonsensical
distinction that actively and incorrectly ignores other system calls
that can and do pin memory legitimately.

If some other system call results in mapping pinned memory that is at
least as restrictively pinned as an mlock()-ed vma (the perf syscall is
such) then it's entirely proper design to be resource controlled under
RLIMIT_MEMLOCK as well. In fact this worked so before your commit broke
it.

> > > mlockall does not require CAP_IPC_LOCK. Never had an issue.
> >
> > MCL_FUTURE does absolutely require CAP_IPC_LOCK, MCL_CURRENT requires
> > a huge (as opposed to the default 64k) RLIMIT or CAP_IPC_LOCK.
> >
> > There's no argument there, look at the code.
> I am sorry but we have been mlockall() for years now without the issues
> that you are bringing up. AFAICT mlockall does not require MCL_FUTURE.

You only have to read the mlockall() code to see that Peter's claim is
correct:

mm/mlock.c:

	SYSCALL_DEFINE1(mlockall, int, flags)
	{
		unsigned long lock_limit;
		int ret = -EINVAL;

		if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
			goto out;

		ret = -EPERM;
		if (!can_do_mlock())
			goto out;
	...

	int can_do_mlock(void)
	{
		if (capable(CAP_IPC_LOCK))
			return 1;
		if (rlimit(RLIMIT_MEMLOCK) != 0)
			return 1;
		return 0;
	}
	EXPORT_SYMBOL(can_do_mlock);

Q.E.D.

Thanks,

	Ingo
On Thu, 20 Jun 2013, Ingo Molnar wrote:

> Peter clearly pointed it out that in the perf case it's user-space that
> initiates the pinned memory mapping which is resource-controlled via
> RLIMIT_MEMLOCK - and this was implemented that way before your commit
> broke the code.

There is no way that user space can initiate a page pin right now. Perf
is pinning the page from the kernel. Similarly the IB subsystem pins
memory needed for device I/O.

> You seem to be hell bent on defining 'memory pinning' only as "the thing
> done via the mlock*() system calls", but that is a nonsensical
> distinction that actively and incorrectly ignores other system calls
> that can and do pin memory legitimately.

Nope. I have said that memory pinning is done by increasing the
refcount, which is different from mlock, which sets a page flag. I have
consistently argued that these are two different things. And I am a bit
surprised that this point has not been understood after all these
repetitions.

Memory pinning these days is done as a side effect of kernel / driver
needs. I.e. the memory registration done through the IB subsystem and
elsewhere.

> 	int can_do_mlock(void)
> 	{
> 		if (capable(CAP_IPC_LOCK))
> 			return 1;
> 		if (rlimit(RLIMIT_MEMLOCK) != 0)
> 			return 1;
> 		return 0;
> 	}
> 	EXPORT_SYMBOL(can_do_mlock);
>
> Q.E.D.

Argh. Just checked the apps. True. They did set the rlimit to 0 at some
point in order to make this work. Then they monitor the number of locked
pages and create alerts so that action can be taken if a system uses too
many mlocked pages.
On Thu, Jun 20, 2013 at 7:48 AM, Christoph Lameter <cl@linux.com> wrote:

> There is no way that user space can initiate a page pin right now. Perf
> is pinning the page from the kernel. Similarly the IB subsystem pins
> memory needed for device I/O.

Christoph, your argument would be a lot more convincing if you stopped
repeating this nonsense. Sure, in a strict sense, it might be true that
the IB subsystem in the kernel is the code that actually pins memory,
but given that unprivileged userspace can tell the kernel to pin
arbitrary parts of its memory for any amount of time, is that relevant?

And in fact taking your "initiate" word choice above, I don't even think
your statement is true -- userspace initiates the pinning by, for
example, doing an IB memory registration (libibverbs ibv_reg_mr() call),
which turns into a system call, which leads to the kernel trying to pin
pages. The pages aren't unpinned until userspace unregisters the memory
(or causes a cleanup by closing the context fd).

Here's an argument by analogy. Would it make any sense for me to say
userspace can't mlock memory, because only the kernel can set VM_LOCKED
on a vma? Of course not. Userspace has the mlock() system call, and
although the actual work happens in the kernel, we clearly want to be
able to limit the amount of memory locked by the kernel ON BEHALF OF
USERSPACE.

 - R.
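The userspace sequence Roland describes looks roughly like this -- a
hedged sketch, not runnable without libibverbs and an RDMA device, with
all error handling elided:

```c
/* Sketch: unprivileged userspace asks the kernel to pin 1 MB via IB
 * memory registration; the pin lasts until the app deregisters. */
#include <infiniband/verbs.h>
#include <stdlib.h>

int main(void)
{
	int n;
	struct ibv_device **devs = ibv_get_device_list(&n);
	struct ibv_context *ctx = ibv_open_device(devs[0]);
	struct ibv_pd *pd = ibv_alloc_pd(ctx);

	size_t len = 1 << 20;
	void *buf = malloc(len);

	/* This syscall path ends in get_user_pages(): the kernel pins
	 * 'len' bytes of this process's memory on its behalf. */
	struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
				       IBV_ACCESS_LOCAL_WRITE |
				       IBV_ACCESS_REMOTE_READ);

	/* ... RDMA traffic against mr->lkey / mr->rkey ... */

	ibv_dereg_mr(mr);	/* pin released only at user request */
	ibv_dealloc_pd(pd);
	ibv_close_device(ctx);
	ibv_free_device_list(devs);
	return 0;
}
```

This is the accounting question in miniature: the pin is performed by
the kernel, but its lifetime is entirely under userspace control.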
On Thu, 20 Jun 2013, Roland Dreier wrote:

> Christoph, your argument would be a lot more convincing if you stopped
> repeating this nonsense. Sure, in a strict sense, it might be true

Well this is regarding tracking of pages that need to stay resident and
since the kernel does the pinning through the IB subsystem it is
trackable right there. No nonsense and no need for a separate pinning
system call.

> that the IB subsystem in the kernel is the code that actually pins
> memory, but given that unprivileged userspace can tell the kernel to
> pin arbitrary parts of its memory for any amount of time, is that
> relevant? And in fact taking your "initiate" word choice above, I
> don't even think your statement is true -- userspace initiates the
> pinning by, for example, doing an IB memory registration (libibverbs
> ibv_reg_mr() call), which turns into a system call, which leads to the
> kernel trying to pin pages. The pages aren't unpinned until userspace
> unregisters the memory (or causes a cleanup by closing the context
> fd).

In some sense userspace initiates everything since the kernel's purpose
is to run applications. So you could say that everything is user
initiated if you wanted.

However, the user visible mechanism here is a registration of memory
with the IB subsystem for RDMA. The primary intent is not to pin the
pages but to make memory available for remote I/O. The pages are pinned
*because* otherwise remote RDMA operations could corrupt memory due to
the kernel moving/evicting memory.

> Here's an argument by analogy. Would it make any sense for me to say
> userspace can't mlock memory, because only the kernel can set
> VM_LOCKED on a vma? Of course not. Userspace has the mlock() system
> call, and although the actual work happens in the kernel, we clearly
> want to be able to limit the amount of memory locked by the kernel ON
> BEHALF OF USERSPACE.
I would think that mlock is a memory management function and therefore
the app/user directly says that the memory is not to be evicted from
memory. This is different for the IB subsystem, which is dealing with
I/O and only indirectly with memory. Had we a different mechanism to
prevent reclaim etc. we would not need to pin the pages.

Actually there is such a mechanism that could be used here. If you had a
reserved memory region that is not mapped by the kernel (boot time
alloc, device memory) then you could use VM_PFNMAP to refer to that
region and the kernel would not be able to do reclaim on that memory. No
pinning necessary if the IB subsystem would register that type of
memory.
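A rough kernel-side sketch of that VM_PFNMAP suggestion, using 2013-era
APIs; `mydev_phys` and `mydev_size` are hypothetical names for a region
reserved at boot (e.g. via memblock_reserve()), and this is illustrative
only, not how the IB core actually works:

```c
/* Sketch: expose a reserved region to userspace via VM_PFNMAP. Such
 * pages were never given to the page allocator, are not on any LRU,
 * and so are invisible to reclaim, migration and compaction -- no
 * per-page pin (refcount bump) is needed to keep them resident. */
static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	if (size > mydev_size)
		return -EINVAL;

	/* remap_pfn_range() marks the vma VM_PFNMAP | VM_IO. */
	return remap_pfn_range(vma, vma->vm_start,
			       mydev_phys >> PAGE_SHIFT,
			       size, vma->vm_page_prot);
}
```

The trade-off is that such memory is carved out of the system up front,
rather than being ordinary pageable memory temporarily exempted.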
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a841123..cc92137b 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -137,7 +137,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	down_write(&current->mm->mmap_sem);
 
-	locked     = npages + current->mm->pinned_vm;
+	locked     = npages + current->mm->locked_vm;
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 
 	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
@@ -207,7 +207,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		__ib_umem_release(context->device, umem, 0);
 		kfree(umem);
 	} else
-		current->mm->pinned_vm = locked;
+		current->mm->locked_vm = locked;
 
 	up_write(&current->mm->mmap_sem);
 	if (vma_list)
@@ -223,7 +223,7 @@ static void ib_umem_account(struct work_struct *work)
 	struct ib_umem *umem = container_of(work, struct ib_umem, work);
 
 	down_write(&umem->mm->mmap_sem);
-	umem->mm->pinned_vm -= umem->diff;
+	umem->mm->locked_vm -= umem->diff;
 	up_write(&umem->mm->mmap_sem);
 	mmput(umem->mm);
 	kfree(umem);
@@ -269,7 +269,7 @@ void ib_umem_release(struct ib_umem *umem)
 	} else
 		down_write(&mm->mmap_sem);
 
-	current->mm->pinned_vm -= diff;
+	current->mm->locked_vm -= diff;
 	up_write(&mm->mmap_sem);
 	mmput(mm);
 	kfree(umem);
diff --git a/drivers/infiniband/hw/ipath/ipath_user_pages.c b/drivers/infiniband/hw/ipath/ipath_user_pages.c
index dc66c45..cfed539 100644
--- a/drivers/infiniband/hw/ipath/ipath_user_pages.c
+++ b/drivers/infiniband/hw/ipath/ipath_user_pages.c
@@ -79,7 +79,7 @@ static int __ipath_get_user_pages(unsigned long start_page, size_t num_pages,
 		goto bail_release;
 	}
 
-	current->mm->pinned_vm += num_pages;
+	current->mm->locked_vm += num_pages;
 
 	ret = 0;
 	goto bail;
@@ -178,7 +178,7 @@ void ipath_release_user_pages(struct page **p, size_t num_pages)
 
 	__ipath_release_user_pages(p, num_pages, 1);
 
-	current->mm->pinned_vm -= num_pages;
+	current->mm->locked_vm -= num_pages;
 
 	up_write(&current->mm->mmap_sem);
 }
@@ -195,7 +195,7 @@ static void user_pages_account(struct work_struct *_work)
 		container_of(_work, struct ipath_user_pages_work, work);
 
 	down_write(&work->mm->mmap_sem);
-	work->mm->pinned_vm -= work->num_pages;
+	work->mm->locked_vm -= work->num_pages;
 	up_write(&work->mm->mmap_sem);
 	mmput(work->mm);
 	kfree(work);
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index 2bc1d2b..7689e49 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -74,7 +74,7 @@ static int __qib_get_user_pages(unsigned long start_page, size_t num_pages,
 		goto bail_release;
 	}
 
-	current->mm->pinned_vm += num_pages;
+	current->mm->locked_vm += num_pages;
 
 	ret = 0;
 	goto bail;
@@ -151,7 +151,7 @@ void qib_release_user_pages(struct page **p, size_t num_pages)
 	__qib_release_user_pages(p, num_pages, 1);
 
 	if (current->mm) {
-		current->mm->pinned_vm -= num_pages;
+		current->mm->locked_vm -= num_pages;
 		up_write(&current->mm->mmap_sem);
 	}
 }
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3e636d8..7d09d6a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -44,7 +44,6 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		"VmPeak:\t%8lu kB\n"
 		"VmSize:\t%8lu kB\n"
 		"VmLck:\t%8lu kB\n"
-		"VmPin:\t%8lu kB\n"
 		"VmHWM:\t%8lu kB\n"
 		"VmRSS:\t%8lu kB\n"
 		"VmData:\t%8lu kB\n"
@@ -56,7 +55,6 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		hiwater_vm << (PAGE_SHIFT-10),
 		total_vm << (PAGE_SHIFT-10),
 		mm->locked_vm << (PAGE_SHIFT-10),
-		mm->pinned_vm << (PAGE_SHIFT-10),
 		hiwater_rss << (PAGE_SHIFT-10),
 		total_rss << (PAGE_SHIFT-10),
 		data << (PAGE_SHIFT-10),
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..d185c13 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -356,7 +356,6 @@ struct mm_struct {
 	unsigned long total_vm;		/* Total pages mapped */
 	unsigned long locked_vm;	/* Pages that have PG_mlocked set */
-	unsigned long pinned_vm;	/* Refcount permanently increased */
 	unsigned long shared_vm;	/* Shared pages (files) */
 	unsigned long exec_vm;		/* VM_EXEC & ~VM_WRITE */
 	unsigned long stack_vm;		/* VM_GROWSUP/DOWN */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 95edd5a..1e926b2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3729,7 +3729,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 	if (ring_buffer_put(rb)) {
 		atomic_long_sub((size >> PAGE_SHIFT) + 1, &mmap_user->locked_vm);
-		vma->vm_mm->pinned_vm -= mmap_locked;
+		vma->vm_mm->locked_vm -= mmap_locked;
 		free_uid(mmap_user);
 	}
 }
@@ -3805,7 +3805,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	lock_limit = rlimit(RLIMIT_MEMLOCK);
 	lock_limit >>= PAGE_SHIFT;
 
-	locked = vma->vm_mm->pinned_vm + extra;
+	locked = vma->vm_mm->locked_vm + extra;
 
 	if ((locked > lock_limit) && perf_paranoid_tracepoint_raw() &&
 	    !capable(CAP_IPC_LOCK)) {
@@ -3831,7 +3831,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	rb->mmap_user = get_current_user();
 
 	atomic_long_add(user_extra, &user->locked_vm);
-	vma->vm_mm->pinned_vm += extra;
+	vma->vm_mm->locked_vm += extra;
 
 	rcu_assign_pointer(event->rb, rb);
Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages") broke RLIMIT_MEMLOCK. Before that patch the constraint was:

  mm_struct::locked_vm < RLIMIT_MEMLOCK;

after it we have:

  mm_struct::locked_vm < RLIMIT_MEMLOCK &&
  mm_struct::pinned_vm < RLIMIT_MEMLOCK.

The patch doesn't mention RLIMIT_MEMLOCK and thus also doesn't discuss this (user-visible) change in semantics, so we must assume it was unintentional.

Since RLIMIT_MEMLOCK is very clearly a limit on the amount of pages the process can 'lock' into memory, it should very much include pinned pages as well as mlock()ed pages; neither can be paged out.

Since nobody had anything constructive to say about the VM_PINNED approach and the IB code hurts my head too much to make it work, I propose we revert said patch.

Once again the rationale: MLOCK(2) is part of the POSIX Realtime Extension (1003.1b-1993/1003.1i-1995). It states that the specified part of the user address space must stay memory resident until either program exit or a matching munlock() call.

This definition basically excludes major faults from happening on those pages -- a major fault being one where IO must happen to obtain the page content; the direct implication being that the page content must remain in memory.

Linux has taken this literally and made mlock()ed pages subject to page migration (albeit only for the explicit move_pages() syscall; but it would very much like to make them subject to implicit page migration for the purpose of compaction etc.).

This view disregards the intention of the spec; since mlock() is part of the realtime spec, the intention is very much that the user address range generate no faults -- neither minor nor major -- as any delay is unacceptable.

This leaves the RT people unhappy -- therefore, _if_ we continue with this Linux-specific interpretation of mlock(), we must introduce new syscalls that implement the intended mlock() semantics.
It was found that there are useful purposes for this weaker mlock(), a rationale to indeed have two sets of syscalls. The weaker mlock() can be used in the context of security -- where we avoid sensitive data being written to disk -- and in the context of userspace daemons that are part of the IO path, which would otherwise form IO deadlocks.

The proposed second set of primitives would be mpin() and munpin(), and would implement the intended mlock() semantics. Such pages would not be migratable in any way (a possible implementation being to 'pin' the pages using an extra refcount on the page frame). From the above we can see that any mpin()ed page is also an mlock()ed page, since mpin() disallows any fault and thus also disallows major faults.

While we still lack the formal mpin() and munpin() syscalls, there are a number of sites with similar 'side effects' that result in user-controlled 'pinning' of pages -- namely IB and perf.

For the purpose of RLIMIT_MEMLOCK we must use intent only, as it is not part of the formal spec. The only useful thing is to limit the amount of pages a user can exempt from paging; this therefore includes all pages either mlock()ed or mpin()ed.

Back to the patch: a resource limit must have a resource counter to enact the limit upon. Before the patch this was mm_struct::locked_vm; after the patch there is no such thing left.

The patch was proposed to 'fix' a double-accounting problem where pages are both pinned and mlock()ed. This was particularly visible when using mlockall() on a process that uses either IB or perf. I state that since mlockall() disables/invalidates RLIMIT_MEMLOCK, the actual resource counter value is irrelevant, and thus the reported problem is a non-problem.

However, it would still be possible to observe weirdness in the very unlikely event that a user would indeed call mlock() upon an address range obtained from IB/perf.
In this case he would be unduly constrained and find his effective RLIMIT_MEMLOCK halved (at worst). After the patch, that same user finds he has an effectively doubled RLIMIT_MEMLOCK, since the IB/perf pages are not counted towards the same limit as his mlock()ed pages.

It is far more likely that a user will employ mlock() on ranges different from those he received from IB/perf, since he already knows those aren't going anywhere. Therefore the patch trades an unlikely weirdness for a much more likely weirdness.

So, barring a proper solution, I propose we revert. I've yet to hear a coherent objection to the above. Christoph is always quick to yell 'but it fixes a double-accounting issue', but is completely deaf to the fact that he changed user-visible semantics without mention or regard.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 drivers/infiniband/core/umem.c                 | 8 ++++----
 drivers/infiniband/hw/ipath/ipath_user_pages.c | 6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c     | 4 ++--
 fs/proc/task_mmu.c                             | 2 --
 include/linux/mm_types.h                       | 1 -
 kernel/events/core.c                           | 6 +++---
 6 files changed, 12 insertions(+), 15 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html