Message ID | 20230420060156.895881-3-usama.anjum@collabora.com (mailing list archive)
---|---
State | Mainlined, archived
Series | Implement IOCTL to get and optionally clear info about PTEs
On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote:
> +/* Supported flags */
> +#define PM_SCAN_OP_GET (1 << 0)
> +#define PM_SCAN_OP_WP (1 << 1)

These are the only flag options available in the PAGEMAP_SCAN IOCTL. PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can be specified as needed, but it cannot be specified without PM_SCAN_OP_GET. (Standalone WP was removed after you asked me not to duplicate functionality that can be achieved with UFFDIO_WRITEPROTECT.)

1) PM_SCAN_OP_GET | PM_SCAN_OP_WP
vs
2) UFFDIO_WRITEPROTECT

After removing the usage of uffd_wp_range() from the PAGEMAP_SCAN IOCTL, we are getting really good performance, comparable to when we depend on the SOFT_DIRTY flag in the PTE. But when we want to perform wp, PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than UFFDIO_WRITEPROTECT, both performance- and behavior-wise.

I've been told by someone else that UFFDIO_WRITEPROTECT blocks page faults in some way that the PAGEMAP_SCAN IOCTL doesn't. I still need to verify this, as I don't have tests comparing them one-to-one.

What are your thoughts about it? Have you thought about making UFFDIO_WRITEPROTECT perform better?

I'm sorry to mention the word "performance" here. We actually want better performance to emulate a Windows syscall; that is why we are adding this functionality. So either we need to see what can be improved in UFFDIO_WRITEPROTECT, or can I please add only PM_SCAN_OP_WP back in pagemap_ioctl? Thank you so much for the help.
Hi, Muhammad, On Wed, Apr 26, 2023 at 12:06:23PM +0500, Muhammad Usama Anjum wrote: > On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote: > > +/* Supported flags */ > > +#define PM_SCAN_OP_GET (1 << 0) > > +#define PM_SCAN_OP_WP (1 << 1) > We have only these flag options available in PAGEMAP_SCAN IOCTL. > PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can > be specified as need. But PM_SCAN_OP_WP cannot be specified without > PM_SCAN_OP_GET. (This was removed after you had asked me to not duplicate > functionality which can be achieved by UFFDIO_WRITEPROTECT.) > > 1) PM_SCAN_OP_GET | PM_SCAN_OP_WP > vs > 2) UFFDIO_WRITEPROTECT > > After removing the usage of uffd_wp_range() from PAGEMAP_SCAN IOCTL, we are > getting really good performance which is comparable just like we are > depending on SOFT_DIRTY flags in the PTE. But when we want to perform wp, > PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than UFFDIO_WRITEPROTECT > performance and behavior wise. > > I've got the results from someone else that UFFDIO_WRITEPROTECT block > pagefaults somehow which PAGEMAP_IOCTL doesn't. I still need to verify this > as I don't have tests comparing them one-to-one. > > What are your thoughts about it? Have you thought about making > UFFDIO_WRITEPROTECT perform better? > > I'm sorry to mention the word "performance" here. Actually we want better > performance to emulate Windows syscall. That is why we are adding this > functionality. So either we need to see what can be improved in > UFFDIO_WRITEPROTECT or can I please add only PM_SCAN_OP_WP back in > pagemap_ioctl? I'm fine if you want to add it back if it works for you. Though before that, could you remind me why there can be a difference on performance? Thanks,
Hi Peter, Thank you for your reply. On 4/26/23 7:13 PM, Peter Xu wrote: > Hi, Muhammad, > > On Wed, Apr 26, 2023 at 12:06:23PM +0500, Muhammad Usama Anjum wrote: >> On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote: >>> +/* Supported flags */ >>> +#define PM_SCAN_OP_GET (1 << 0) >>> +#define PM_SCAN_OP_WP (1 << 1) >> We have only these flag options available in PAGEMAP_SCAN IOCTL. >> PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can >> be specified as need. But PM_SCAN_OP_WP cannot be specified without >> PM_SCAN_OP_GET. (This was removed after you had asked me to not duplicate >> functionality which can be achieved by UFFDIO_WRITEPROTECT.) >> >> 1) PM_SCAN_OP_GET | PM_SCAN_OP_WP >> vs >> 2) UFFDIO_WRITEPROTECT >> >> After removing the usage of uffd_wp_range() from PAGEMAP_SCAN IOCTL, we are >> getting really good performance which is comparable just like we are >> depending on SOFT_DIRTY flags in the PTE. But when we want to perform wp, >> PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than UFFDIO_WRITEPROTECT >> performance and behavior wise. >> >> I've got the results from someone else that UFFDIO_WRITEPROTECT block >> pagefaults somehow which PAGEMAP_IOCTL doesn't. I still need to verify this >> as I don't have tests comparing them one-to-one. >> >> What are your thoughts about it? Have you thought about making >> UFFDIO_WRITEPROTECT perform better? >> >> I'm sorry to mention the word "performance" here. Actually we want better >> performance to emulate Windows syscall. That is why we are adding this >> functionality. So either we need to see what can be improved in >> UFFDIO_WRITEPROTECT or can I please add only PM_SCAN_OP_WP back in >> pagemap_ioctl? > > I'm fine if you want to add it back if it works for you. Though before > that, could you remind me why there can be a difference on performance? The only difference can be that UFFDIO_WRITEPROTECT acquires read mm lock once for entire duration. 
But for the PAGEMAP_SCAN IOCTL, we acquire and release it for each PMD to keep the intermediate buffer short. I realize this alone may be hard to convince you with, so I'll write a test to see what the exact difference is and show you the numbers.

> > Thanks, >
On 4/26/23 7:13 PM, Peter Xu wrote: > Hi, Muhammad, > > On Wed, Apr 26, 2023 at 12:06:23PM +0500, Muhammad Usama Anjum wrote: >> On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote: >>> +/* Supported flags */ >>> +#define PM_SCAN_OP_GET (1 << 0) >>> +#define PM_SCAN_OP_WP (1 << 1) >> We have only these flag options available in PAGEMAP_SCAN IOCTL. >> PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can >> be specified as need. But PM_SCAN_OP_WP cannot be specified without >> PM_SCAN_OP_GET. (This was removed after you had asked me to not duplicate >> functionality which can be achieved by UFFDIO_WRITEPROTECT.) >> >> 1) PM_SCAN_OP_GET | PM_SCAN_OP_WP >> vs >> 2) UFFDIO_WRITEPROTECT >> >> After removing the usage of uffd_wp_range() from PAGEMAP_SCAN IOCTL, we are >> getting really good performance which is comparable just like we are >> depending on SOFT_DIRTY flags in the PTE. But when we want to perform wp, >> PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than UFFDIO_WRITEPROTECT >> performance and behavior wise. >> >> I've got the results from someone else that UFFDIO_WRITEPROTECT block >> pagefaults somehow which PAGEMAP_IOCTL doesn't. I still need to verify this >> as I don't have tests comparing them one-to-one. >> >> What are your thoughts about it? Have you thought about making >> UFFDIO_WRITEPROTECT perform better? >> >> I'm sorry to mention the word "performance" here. Actually we want better >> performance to emulate Windows syscall. That is why we are adding this >> functionality. So either we need to see what can be improved in >> UFFDIO_WRITEPROTECT or can I please add only PM_SCAN_OP_WP back in >> pagemap_ioctl? > > I'm fine if you want to add it back if it works for you. Though before > that, could you remind me why there can be a difference on performance? I've looked at the code again and I think I've found something. Lets look at exact performance numbers: I've run 2 different tests. 
In the first test, UFFDIO_WRITEPROTECT is used for engaging WP; in the second, PM_SCAN_OP_WP is used. I've measured the average write time to the same memory which is being WP-ed, and the total execution time of these APIs:

**avg write time:**

| No of pages | 2000 | 8192 | 100000 | 500000 |
|------------------------|------|------|--------|--------|
| UFFDIO_WRITEPROTECT | 2200 | 2300 | 4100 | 4200 |
| PM_SCAN_OP_WP | 2000 | 2300 | 2500 | 2800 |

**Execution time measured in rdtsc:**

| No of pages | 2000 | 8192 | 100000 | 500000 |
|------------------------|------|-------|--------|--------|
| UFFDIO_WRITEPROTECT | 3200 | 14000 | 59000 | 58000 |
| PM_SCAN_OP_WP | 1900 | 7000 | 38000 | 40000 |

The average write time with UFFDIO_WRITEPROTECT is 1.3 times slower, and its execution time is 1.5 times slower. So UFFDIO_WRITEPROTECT both slows down writes to the pages and takes longer to execute. This shows that PM_SCAN_OP_WP performs better than UFFDIO_WRITEPROTECT. Although PM_SCAN_OP_WP and UFFDIO_WRITEPROTECT are implemented differently, we would have expected similar performance; instead there is quite a large difference.

PM_SCAN_OP_WP takes the read mm lock and uses walk_page_range(), which resolves VMAs from the address ranges and walks over them; pagemap_scan_pmd_entry() handles most of the work, including the TLB flushing. UFFDIO_WRITEPROTECT also takes the mm lock and iterates through all the page directory levels until a PTE is found, then updates the flags there and flushes the TLB for every PTE.

My next deduction is that we are getting worse performance because we flush the TLB one page at a time in the UFFDIO_WRITEPROTECT case, while we flush the TLB for 512 pages (mostly) at a time in the PM_SCAN_OP_WP case. I've verified this by adding some logs to change_pte_range() and pagemap_scan_pmd_entry(). Logs are attached.
I've allocated memory of 1000 pages and write-protected it with UFFDIO_WRITEPROTECT and PM_SCAN_OP_WP. The logs show that UFFDIO_WRITEPROTECT flushed the TLB 1000 times, one page at a time, while PM_SCAN_OP_WP flushed only 3 times, with bigger sizes. I've learned from past experience that TLB flushes are very expensive. Probably this is what we need to improve if we don't want to add PM_SCAN_OP_WP?

UFFDIO_WRITEPROTECT uses change_pte_range(), which is a very generic function, and I'm not sure if we can avoid the TLB flushes there when uffd_wp is true. We could try to flush somewhere else, ideally doing only one flush if possible. It will not be so straightforward to move away from the generic function. Thoughts?

> Thanks, >
On 5/22/23 3:24 PM, Muhammad Usama Anjum wrote: > On 4/26/23 7:13 PM, Peter Xu wrote: >> Hi, Muhammad, >> >> On Wed, Apr 26, 2023 at 12:06:23PM +0500, Muhammad Usama Anjum wrote: >>> On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote: >>>> +/* Supported flags */ >>>> +#define PM_SCAN_OP_GET (1 << 0) >>>> +#define PM_SCAN_OP_WP (1 << 1) >>> We have only these flag options available in PAGEMAP_SCAN IOCTL. >>> PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can >>> be specified as need. But PM_SCAN_OP_WP cannot be specified without >>> PM_SCAN_OP_GET. (This was removed after you had asked me to not duplicate >>> functionality which can be achieved by UFFDIO_WRITEPROTECT.) >>> >>> 1) PM_SCAN_OP_GET | PM_SCAN_OP_WP >>> vs >>> 2) UFFDIO_WRITEPROTECT >>> >>> After removing the usage of uffd_wp_range() from PAGEMAP_SCAN IOCTL, we are >>> getting really good performance which is comparable just like we are >>> depending on SOFT_DIRTY flags in the PTE. But when we want to perform wp, >>> PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than UFFDIO_WRITEPROTECT >>> performance and behavior wise. >>> >>> I've got the results from someone else that UFFDIO_WRITEPROTECT block >>> pagefaults somehow which PAGEMAP_IOCTL doesn't. I still need to verify this >>> as I don't have tests comparing them one-to-one. >>> >>> What are your thoughts about it? Have you thought about making >>> UFFDIO_WRITEPROTECT perform better? >>> >>> I'm sorry to mention the word "performance" here. Actually we want better >>> performance to emulate Windows syscall. That is why we are adding this >>> functionality. So either we need to see what can be improved in >>> UFFDIO_WRITEPROTECT or can I please add only PM_SCAN_OP_WP back in >>> pagemap_ioctl? >> >> I'm fine if you want to add it back if it works for you. Though before >> that, could you remind me why there can be a difference on performance? > I've looked at the code again and I think I've found something. 
Lets look > at exact performance numbers: > > I've run 2 different tests. In first test UFFDIO_WRITEPROTECT is being used > for engaging WP. In second test PM_SCAN_OP_WP is being used. I've measured > the average write time to the same memory which is being WP-ed and total > time of execution of these APIs: > > **avg write time:** > | No of pages | 2000 | 8192 | 100000 | 500000 | > |------------------------|------|------|--------|--------| > | UFFDIO_WRITEPROTECT | 2200 | 2300 | 4100 | 4200 | > | PM_SCAN_OP_WP | 2000 | 2300 | 2500 | 2800 | > > **Execution time measured in rdtsc:** > | No of pages | 2000 | 8192 | 100000 | 500000 | > |------------------------|------|-------|--------|--------| > | UFFDIO_WRITEPROTECT | 3200 | 14000 | 59000 | 58000 | > | PM_SCAN_OP_WP | 1900 | 7000 | 38000 | 40000 | > > Avg write time for UFFDIO_WRITEPROTECT is 1.3 times slow. The execution > time is 1.5 times slower in the case of UFFDIO_WRITEPROTECT. So > UFFDIO_WRITEPROTECT is making writes slower to the pages and execution time > is also slower. > > This proves that PM_SCAN_OP_WP is better than UFFDIO_WRITEPROTECT. Although > PM_SCAN_OP_WP and UFFDIO_WRITEPROTECT have been implemented differently. We > should have seen no difference in performance. But we have quite a lot of > difference in performance here. PM_SCAN_OP_WP takes read mm lock, uses > walk_page_range() to walk over pages which finds VMAs from address ranges > to walk over them and pagemap_scan_pmd_entry() is handling most of the work > including tlb flushing. UFFDIO_WRITEPROTECT is also taking the mm lock and > iterating from all the different page directories until a pte is found and > then flags are updated there and tlb is flushed for every pte. > > My next deduction would be that we are getting worse performance as we are > flushing tlb for one page at a time in case of UFFDIO_WRITEPROTECT. While > we flush tlb for 512 pages (moslty) at a time in case of PM_SCAN_OP_WP. 
> I've just verified this by adding some logs to the change_pte_range() and > pagemap_scan_pmd_entry(). Logs are attached. I've allocated memory of 1000 > pages and write-protected it with UFFDIO_WRITEPROTECT and PM_SCAN_OP_WP. > The logs show that UFFDIO_WRITEPROTECT has flushed tlb 1000 times of size 1 > page each time. While PM_SCAN_OP_WP has flushed only 3 times of bigger > sizes. I've learned over my last experience that tlb flush is very > expensive. Probably this is what we need to improve if we don't want to add > PM_SCAN_OP_WP? > > The UFFDIO_WRITEPROTECT uses change_pte_range() which is very generic > function and I'm not sure if can try to not do tlb flushes if uffd_wp is > true. We can try to do flush somewhere else and hopefully we should do only > one flush if possible. It will not be so straight forward to move away from > generic fundtion. Thoughts?

I've just tested this theory: I dropped the per-PTE flushes and did only one flush over the entire range in uffd_wp_range(). But it didn't improve the situation either, so I was wrong that the TLB flushes were the cause.

> >> Thanks, >> >
Hi, Muhammad, On Mon, May 22, 2023 at 04:26:07PM +0500, Muhammad Usama Anjum wrote: > On 5/22/23 3:24 PM, Muhammad Usama Anjum wrote: > > On 4/26/23 7:13 PM, Peter Xu wrote: > >> Hi, Muhammad, > >> > >> On Wed, Apr 26, 2023 at 12:06:23PM +0500, Muhammad Usama Anjum wrote: > >>> On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote: > >>>> +/* Supported flags */ > >>>> +#define PM_SCAN_OP_GET (1 << 0) > >>>> +#define PM_SCAN_OP_WP (1 << 1) > >>> We have only these flag options available in PAGEMAP_SCAN IOCTL. > >>> PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can > >>> be specified as need. But PM_SCAN_OP_WP cannot be specified without > >>> PM_SCAN_OP_GET. (This was removed after you had asked me to not duplicate > >>> functionality which can be achieved by UFFDIO_WRITEPROTECT.) > >>> > >>> 1) PM_SCAN_OP_GET | PM_SCAN_OP_WP > >>> vs > >>> 2) UFFDIO_WRITEPROTECT > >>> > >>> After removing the usage of uffd_wp_range() from PAGEMAP_SCAN IOCTL, we are > >>> getting really good performance which is comparable just like we are > >>> depending on SOFT_DIRTY flags in the PTE. But when we want to perform wp, > >>> PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than UFFDIO_WRITEPROTECT > >>> performance and behavior wise. > >>> > >>> I've got the results from someone else that UFFDIO_WRITEPROTECT block > >>> pagefaults somehow which PAGEMAP_IOCTL doesn't. I still need to verify this > >>> as I don't have tests comparing them one-to-one. > >>> > >>> What are your thoughts about it? Have you thought about making > >>> UFFDIO_WRITEPROTECT perform better? > >>> > >>> I'm sorry to mention the word "performance" here. Actually we want better > >>> performance to emulate Windows syscall. That is why we are adding this > >>> functionality. So either we need to see what can be improved in > >>> UFFDIO_WRITEPROTECT or can I please add only PM_SCAN_OP_WP back in > >>> pagemap_ioctl? > >> > >> I'm fine if you want to add it back if it works for you. 
Though before > >> that, could you remind me why there can be a difference on performance? > > I've looked at the code again and I think I've found something. Lets look > > at exact performance numbers: > > > > I've run 2 different tests. In first test UFFDIO_WRITEPROTECT is being used > > for engaging WP. In second test PM_SCAN_OP_WP is being used. I've measured > > the average write time to the same memory which is being WP-ed and total > > time of execution of these APIs:

What are the steps of the test? Is it as simple as "writeprotect", "unprotect", then writing all pages in a single thread?

Is UFFDIO_WRITEPROTECT sent in one range covering all pages?

Maybe you can attach the test program here too.

> > > > **avg write time:** > > | No of pages | 2000 | 8192 | 100000 | 500000 | > > |------------------------|------|------|--------|--------| > > | UFFDIO_WRITEPROTECT | 2200 | 2300 | 4100 | 4200 | > > | PM_SCAN_OP_WP | 2000 | 2300 | 2500 | 2800 | > > > > **Execution time measured in rdtsc:** > > | No of pages | 2000 | 8192 | 100000 | 500000 | > > |------------------------|------|-------|--------|--------| > > | UFFDIO_WRITEPROTECT | 3200 | 14000 | 59000 | 58000 | > > | PM_SCAN_OP_WP | 1900 | 7000 | 38000 | 40000 | > > > > Avg write time for UFFDIO_WRITEPROTECT is 1.3 times slow. The execution > > time is 1.5 times slower in the case of UFFDIO_WRITEPROTECT. So > > UFFDIO_WRITEPROTECT is making writes slower to the pages and execution time > > is also slower. > > > > This proves that PM_SCAN_OP_WP is better than UFFDIO_WRITEPROTECT. Although > > PM_SCAN_OP_WP and UFFDIO_WRITEPROTECT have been implemented differently. We > > should have seen no difference in performance. But we have quite a lot of > > difference in performance here. PM_SCAN_OP_WP takes read mm lock, uses > > walk_page_range() to walk over pages which finds VMAs from address ranges > > to walk over them and pagemap_scan_pmd_entry() is handling most of the work > > including tlb flushing.
UFFDIO_WRITEPROTECT is also taking the mm lock and > > iterating from all the different page directories until a pte is found and > > then flags are updated there and tlb is flushed for every pte. > > > > My next deduction would be that we are getting worse performance as we are > > flushing tlb for one page at a time in case of UFFDIO_WRITEPROTECT. While > > we flush tlb for 512 pages (moslty) at a time in case of PM_SCAN_OP_WP. > > I've just verified this by adding some logs to the change_pte_range() and > > pagemap_scan_pmd_entry(). Logs are attached. I've allocated memory of 1000 > > pages and write-protected it with UFFDIO_WRITEPROTECT and PM_SCAN_OP_WP. > > The logs show that UFFDIO_WRITEPROTECT has flushed tlb 1000 times of size 1 > > page each time. While PM_SCAN_OP_WP has flushed only 3 times of bigger > > sizes. I've learned over my last experience that tlb flush is very > > expensive. Probably this is what we need to improve if we don't want to add > > PM_SCAN_OP_WP? > > > > The UFFDIO_WRITEPROTECT uses change_pte_range() which is very generic > > function and I'm not sure if can try to not do tlb flushes if uffd_wp is > > true. We can try to do flush somewhere else and hopefully we should do only > > one flush if possible. It will not be so straight forward to move away from > > generic fundtion. Thoughts? > I've just tested this theory of not doing per pte flushes and only did one > flush on entire range in uffd_wp_range(). But it didn't improve the > situation either. I was wrong that tlb flushes may be the cause.

I had a feeling that you were trapping tlb_flush_pte_range(), which does not actually send any TLB flushes but only updates the mmu_gather object with the address range for future invalidations. That's probably why commenting it out didn't show an effect.

I am not sure whether the wr-protect path difference can be caused by the arch hooks, namely arch_enter_lazy_mmu_mode() / arch_leave_lazy_mmu_mode(). On x86 I saw that it's actually hooked onto some PV calls. I had a feeling that this is for optimization only, but maybe it's still a good idea for you to take that into your new code too:

  static inline void arch_enter_lazy_mmu_mode(void)
  {
          PVOP_VCALL0(mmu.lazy_mode.enter);
  }

The other thing is that I think you're flushing the TLB outside the pgtable lock in your new code. IIUC that's racy, see:

  commit 6ce64428d62026a10cb5d80138ff2f90cc21d367
  Author: Nadav Amit <namit@vmware.com>
  Date:   Fri Mar 12 21:08:17 2021 -0800

      mm/userfaultfd: fix memory corruption due to writeprotect

So you may want to put it at least inside the pgtable lock critical section, or IIUC you can also do inc_tlb_flush_pending() then dec_tlb_flush_pending(), just like __tlb_gather_mmu(), to make sure do_wp_page() will properly flush the page when it unluckily hits one of those pages.

That's also the spot (the flush_tlb_page() in 6ce64428d) that made me wonder whether it caused the slowness when writing to those pages. But it really depends on your test program: if it's single-threaded, I don't think it'll trigger, because by the time of the writes mm_tlb_flush_pending() should already return 0, so the TLB flush should logically not be needed. If you want, maybe you can double-check that.

So in short, I had a feeling that the new PM_SCAN_OP_WP just misses something here and there and that's why it's faster - which means even if it's faster, it may also be prone to race conditions etc., so we'd better figure it out..

Thanks,
On 5/24/23 12:43 AM, Peter Xu wrote: > Hi, Muhammad, > > On Mon, May 22, 2023 at 04:26:07PM +0500, Muhammad Usama Anjum wrote: >> On 5/22/23 3:24 PM, Muhammad Usama Anjum wrote: >>> On 4/26/23 7:13 PM, Peter Xu wrote: >>>> Hi, Muhammad, >>>> >>>> On Wed, Apr 26, 2023 at 12:06:23PM +0500, Muhammad Usama Anjum wrote: >>>>> On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote: >>>>>> +/* Supported flags */ >>>>>> +#define PM_SCAN_OP_GET (1 << 0) >>>>>> +#define PM_SCAN_OP_WP (1 << 1) >>>>> We have only these flag options available in PAGEMAP_SCAN IOCTL. >>>>> PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can >>>>> be specified as need. But PM_SCAN_OP_WP cannot be specified without >>>>> PM_SCAN_OP_GET. (This was removed after you had asked me to not duplicate >>>>> functionality which can be achieved by UFFDIO_WRITEPROTECT.) >>>>> >>>>> 1) PM_SCAN_OP_GET | PM_SCAN_OP_WP >>>>> vs >>>>> 2) UFFDIO_WRITEPROTECT >>>>> >>>>> After removing the usage of uffd_wp_range() from PAGEMAP_SCAN IOCTL, we are >>>>> getting really good performance which is comparable just like we are >>>>> depending on SOFT_DIRTY flags in the PTE. But when we want to perform wp, >>>>> PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than UFFDIO_WRITEPROTECT >>>>> performance and behavior wise. >>>>> >>>>> I've got the results from someone else that UFFDIO_WRITEPROTECT block >>>>> pagefaults somehow which PAGEMAP_IOCTL doesn't. I still need to verify this >>>>> as I don't have tests comparing them one-to-one. >>>>> >>>>> What are your thoughts about it? Have you thought about making >>>>> UFFDIO_WRITEPROTECT perform better? >>>>> >>>>> I'm sorry to mention the word "performance" here. Actually we want better >>>>> performance to emulate Windows syscall. That is why we are adding this >>>>> functionality. So either we need to see what can be improved in >>>>> UFFDIO_WRITEPROTECT or can I please add only PM_SCAN_OP_WP back in >>>>> pagemap_ioctl? 
>>>> >>>> I'm fine if you want to add it back if it works for you. Though before >>>> that, could you remind me why there can be a difference on performance? >>> I've looked at the code again and I think I've found something. Lets look >>> at exact performance numbers: >>> >>> I've run 2 different tests. In first test UFFDIO_WRITEPROTECT is being used >>> for engaging WP. In second test PM_SCAN_OP_WP is being used. I've measured >>> the average write time to the same memory which is being WP-ed and total >>> time of execution of these APIs: > > What is the steps of the test? Is it as simple as "writeprotect", > "unprotect", then write all pages in a single thread? > > Is UFFDIO_WRITEPROTECT sent in one range covering all pages? > > Maybe you can attach the test program here too.

I hadn't attached the test earlier, as I thought you wouldn't be interested in running it. I've attached it now. The test has multiple threads, where one thread tries to get the status of the flags and reset them, while the other threads write to that memory. In main(), we call the pagemap_scan ioctl to get the status of the flags and reset the memory area as well, while in N threads the memory is written.

I usually run the test as follows, where the memory area is 100000 * pages:
./win2_linux 8 100000 1 1 0

I'm running the tests on real hardware and the results are pretty consistent. I'm testing only on x86_64. PM_SCAN_OP_WP wins every time compared to UFFDIO_WRITEPROTECT. Note that the PM_SCAN_OP_WP op cannot be used on its own (without PM_SCAN_OP_GET) in v15.
So please find the updated WIP code here: https://gitlab.collabora.com/usama.anjum/linux-mainline/-/commits/memwatchv16/ > >>> >>> **avg write time:** >>> | No of pages | 2000 | 8192 | 100000 | 500000 | >>> |------------------------|------|------|--------|--------| >>> | UFFDIO_WRITEPROTECT | 2200 | 2300 | 4100 | 4200 | >>> | PM_SCAN_OP_WP | 2000 | 2300 | 2500 | 2800 | >>> >>> **Execution time measured in rdtsc:** >>> | No of pages | 2000 | 8192 | 100000 | 500000 | >>> |------------------------|------|-------|--------|--------| >>> | UFFDIO_WRITEPROTECT | 3200 | 14000 | 59000 | 58000 | >>> | PM_SCAN_OP_WP | 1900 | 7000 | 38000 | 40000 | >>> >>> Avg write time for UFFDIO_WRITEPROTECT is 1.3 times slow. The execution >>> time is 1.5 times slower in the case of UFFDIO_WRITEPROTECT. So >>> UFFDIO_WRITEPROTECT is making writes slower to the pages and execution time >>> is also slower. >>> >>> This proves that PM_SCAN_OP_WP is better than UFFDIO_WRITEPROTECT. Although >>> PM_SCAN_OP_WP and UFFDIO_WRITEPROTECT have been implemented differently. We >>> should have seen no difference in performance. But we have quite a lot of >>> difference in performance here. PM_SCAN_OP_WP takes read mm lock, uses >>> walk_page_range() to walk over pages which finds VMAs from address ranges >>> to walk over them and pagemap_scan_pmd_entry() is handling most of the work >>> including tlb flushing. UFFDIO_WRITEPROTECT is also taking the mm lock and >>> iterating from all the different page directories until a pte is found and >>> then flags are updated there and tlb is flushed for every pte. >>> >>> My next deduction would be that we are getting worse performance as we are >>> flushing tlb for one page at a time in case of UFFDIO_WRITEPROTECT. While >>> we flush tlb for 512 pages (moslty) at a time in case of PM_SCAN_OP_WP. >>> I've just verified this by adding some logs to the change_pte_range() and >>> pagemap_scan_pmd_entry(). Logs are attached. 
I've allocated memory of 1000 >>> pages and write-protected it with UFFDIO_WRITEPROTECT and PM_SCAN_OP_WP. >>> The logs show that UFFDIO_WRITEPROTECT has flushed tlb 1000 times of size 1 >>> page each time. While PM_SCAN_OP_WP has flushed only 3 times of bigger >>> sizes. I've learned over my last experience that tlb flush is very >>> expensive. Probably this is what we need to improve if we don't want to add >>> PM_SCAN_OP_WP? >>> >>> The UFFDIO_WRITEPROTECT uses change_pte_range() which is very generic >>> function and I'm not sure if can try to not do tlb flushes if uffd_wp is >>> true. We can try to do flush somewhere else and hopefully we should do only >>> one flush if possible. It will not be so straight forward to move away from >>> generic fundtion. Thoughts? >> I've just tested this theory of not doing per pte flushes and only did one >> flush on entire range in uffd_wp_range(). But it didn't improve the >> situation either. I was wrong that tlb flushes may be the cause. > > I had a feeling that you were trapping tlb_flush_pte_range(), which is > actually not really sending any TLB flushes but updating mmu_gather object > for the addr range for future invalidations. > > That's probably why it didn't show an effect when you comment it out. Yeah, probably. > > I am not sure whether the wr-protect path difference can be caused by the > arch hooks, namely arch_enter_lazy_mmu_mode() / arch_leave_lazy_mmu_mode(). > > On x86 I saw that it's actually hooked onto some PV calls. I had a feeling > that this is for optimization only, but maybe it's still a good idea you > also take that into your new code: > > static inline void arch_enter_lazy_mmu_mode(void) > { > PVOP_VCALL0(mmu.lazy_mode.enter); > } I've just looked into it. It isn't making any difference. But I think I should include it in the code. It must be helpful for hypervisors etc. > > The other thing is I think you're flushing tlb outside pgtable lock in your > new code. 
IIUC that's racy, see: > > commit 6ce64428d62026a10cb5d80138ff2f90cc21d367 > Author: Nadav Amit <namit@vmware.com> > Date: Fri Mar 12 21:08:17 2021 -0800 > > mm/userfaultfd: fix memory corruption due to writeprotect > > So you may want to put it at least into pgtable lock critical section, or > IIUC you can also do inc_tlb_flush_pending() then dec_tlb_flush_pending() > just like __tlb_gather_mmu(), to make sure do_wp_page() will properly flush > the page when unluckily hit some of the page.

Good point. I'll release the page table lock only after the TLB flush. I've just added this to the upcoming WIP v16.

> > That's also the spot (the flush_tlb_page() in 6ce64428d) that made me think > on whether it caused the slowness on writting to those pages. But it > really depends on your test program, e.g. if it's a single threaded I don't > think it'll trigger because when writting mm_tlb_flush_pending() should > start to return 0 already, so the tlb should logically not be needed. If > you want maybe you can double check that. > > So in short, I had a feeling that the new PM_SCAN_OP_WP just misses > something here and there so it's faster - it means even if it's faster it > may also be prone to race conditions etc so we'd better figure it out..

The test program is multi-threaded. The performance numbers cannot be reproduced with a single-threaded application.

> > Thanks, >
On Wed, May 24, 2023 at 04:26:33PM +0500, Muhammad Usama Anjum wrote: > On 5/24/23 12:43 AM, Peter Xu wrote: > > Hi, Muhammad, > > > > On Mon, May 22, 2023 at 04:26:07PM +0500, Muhammad Usama Anjum wrote: > >> On 5/22/23 3:24 PM, Muhammad Usama Anjum wrote: > >>> On 4/26/23 7:13 PM, Peter Xu wrote: > >>>> Hi, Muhammad, > >>>> > >>>> On Wed, Apr 26, 2023 at 12:06:23PM +0500, Muhammad Usama Anjum wrote: > >>>>> On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote: > >>>>>> +/* Supported flags */ > >>>>>> +#define PM_SCAN_OP_GET (1 << 0) > >>>>>> +#define PM_SCAN_OP_WP (1 << 1) > >>>>> We have only these flag options available in PAGEMAP_SCAN IOCTL. > >>>>> PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can > >>>>> be specified as need. But PM_SCAN_OP_WP cannot be specified without > >>>>> PM_SCAN_OP_GET. (This was removed after you had asked me to not duplicate > >>>>> functionality which can be achieved by UFFDIO_WRITEPROTECT.) > >>>>> > >>>>> 1) PM_SCAN_OP_GET | PM_SCAN_OP_WP > >>>>> vs > >>>>> 2) UFFDIO_WRITEPROTECT > >>>>> > >>>>> After removing the usage of uffd_wp_range() from PAGEMAP_SCAN IOCTL, we are > >>>>> getting really good performance which is comparable just like we are > >>>>> depending on SOFT_DIRTY flags in the PTE. But when we want to perform wp, > >>>>> PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than UFFDIO_WRITEPROTECT > >>>>> performance and behavior wise. > >>>>> > >>>>> I've got the results from someone else that UFFDIO_WRITEPROTECT block > >>>>> pagefaults somehow which PAGEMAP_IOCTL doesn't. I still need to verify this > >>>>> as I don't have tests comparing them one-to-one. > >>>>> > >>>>> What are your thoughts about it? Have you thought about making > >>>>> UFFDIO_WRITEPROTECT perform better? > >>>>> > >>>>> I'm sorry to mention the word "performance" here. Actually we want better > >>>>> performance to emulate Windows syscall. That is why we are adding this > >>>>> functionality. 
So either we need to see what can be improved in
> >>>>> UFFDIO_WRITEPROTECT or can I please add only PM_SCAN_OP_WP back in
> >>>>> pagemap_ioctl?
> >>>>
> >>>> I'm fine if you want to add it back if it works for you. Though before
> >>>> that, could you remind me why there can be a difference on performance?
> >>> I've looked at the code again and I think I've found something. Let's look
> >>> at exact performance numbers:
> >>>
> >>> I've run 2 different tests. In the first test UFFDIO_WRITEPROTECT is being
> >>> used for engaging WP. In the second test PM_SCAN_OP_WP is being used. I've
> >>> measured the average write time to the same memory which is being WP-ed
> >>> and the total execution time of these APIs:
> >
> > What are the steps of the test? Is it as simple as "writeprotect",
> > "unprotect", then write all pages in a single thread?
> >
> > Is UFFDIO_WRITEPROTECT sent in one range covering all pages?
> >
> > Maybe you can attach the test program here too.
>
> I'd not attached the test earlier as I thought that you wouldn't be
> interested in running the test. I've attached it now. The test has multiple

Thanks. No plan to run it, just to make sure I understand why such a
difference.

> threads where one thread tries to get the status of the flags and reset
> them, while other threads write to that memory. In main(), we call the
> pagemap_scan ioctl to get the status of the flags and reset the memory area
> as well. While in N threads, the memory is written.
>
> I usually run the test by the following, where the memory area is
> 100000 * pages:
> ./win2_linux 8 100000 1 1 0
>
> I'm running tests on real hardware. The results are pretty consistent. I'm
> also testing only on x86_64. PM_SCAN_OP_WP wins every time as compared to
> UFFDIO_WRITEPROTECT.
If it's a multi-threaded test, especially when the ioctl runs together with
the writers, then I'd assume it's caused by the writers frequently needing
to flush the tlb (on writes during UFFDIO_WRITEPROTECT); the flush target
could potentially also include the core running the main thread, which is
also trying to reprotect, because they run on the same mm.

This makes me think that your current test case is probably the worst case
for Nadav's patch 6ce64428d6, because (1) the UFFDIO_WRITEPROTECT covers a
super large range, and (2) there are a _lot_ of concurrent writers during
the ioctl, so all of them will need to trigger a tlb flush, and those tlb
flushes will further slow down the ioctl sender.

While I still think minimizing tlb flushes inside ioctl(UFFDIO_WRITEPROTECT)
is optimal in some cases - e.g. where there aren't that many concurrent
writers - I'll need to rethink all of this to find out whether we can have a
good way for both..

For now, if your workload is mostly exactly like your test case, maybe you
can have your pagemap version of a WP-only op there, making sure the tlb
flush is within the pgtable lock critical section (so you should be safe
even without Nadav's patch). If so, I'd appreciate it if you could add a
comment somewhere about this difference between using pagemap WP-only and
ioctl(UFFDIO_WRITEPROTECT). In short, functional-wise they should be the
same, with minor performance differences TBD (maybe one day we can find a
good approach for all and align them again, but that doesn't need to block
your work).
On 5/24/23 6:55 PM, Peter Xu wrote:
...
>>> What are the steps of the test? Is it as simple as "writeprotect",
>>> "unprotect", then write all pages in a single thread?
>>>
>>> Is UFFDIO_WRITEPROTECT sent in one range covering all pages?
>>>
>>> Maybe you can attach the test program here too.
>>
>> I'd not attached the test earlier as I thought that you wouldn't be
>> interested in running the test. I've attached it now. The test has multiple
>
> Thanks. No plan to run it, just to make sure I understand why such a
> difference.
>
>> threads where one thread tries to get status of flags and reset them, while
>> other threads write to that memory. In main(), we call the pagemap_scan
>> ioctl to get status of flags and reset the memory area as well. While in N
>> threads, the memory is written.
>>
>> I usually run the test by following where memory area is of 100000 * pages:
>> ./win2_linux 8 100000 1 1 0
>>
>> I'm running tests on real hardware. The results are pretty consistent. I'm
>> also testing only on x86_64. PM_SCAN_OP_WP wins every time as compared to
>> UFFDIO_WRITEPROTECT.
>
> If it's multi-threaded test especially when the ioctl runs together with
> the writers, then I'd assume it's caused by writers frequently need to
> flush tlb (when writes during UFFDIO_WRITEPROTECT), the flush target could
> potentially also include the core running the main thread who is also
> trying to reprotect because they run on the same mm.
>
> This makes me think that your current test case probably is the worst case
> of Nadav's patch 6ce64428d6 because (1) the UFFDIO_WRITEPROTECT covers a
> super large range, and (2) there're a _lot_ of concurrent writers during
> the ioctl, so all of them will need to trigger a tlb flush, and that tlb
> flush will further slow down the ioctl sender.
>
> While I think that's the optimal case sometimes, of having minimum tlb
> flush on the ioctl(UFFDIO_WRITEPROTECT), so maybe it makes sense somewhere
> else where concurrent writers are not that much. I'll need to rethink a bit
> on all these to find out whether we can have a good way for both..
>
> For now, if your workload is mostly exactly like your test case, maybe you
> can have your pagemap version of WP-only op there, making sure tlb flush is
> within the pgtable lock critical section (so you should be safe even
> without Nadav's patch). If so, I'd appreciate you can add some comment
> somewhere about such difference of using pagemap WP-only and
> ioctl(UFFDIO_WRITEPROTECT), though. In short, functional-wise they should
> be the same, but trivial detail difference on performance as TBD (maybe one
> day we can have a good approach for all and make them aligned again, but
> maybe that also doesn't need to block your work).

Thank you for understanding what I've been trying to convey. We are going
to translate a Windows syscall to this new ioctl, so it is very difficult
to pin down the exact use cases, as applications may use that syscall in
several different ways. One thing is for sure: we want the best performance
possible, which we get by adding WP-only. I'll add it and send v16. I think
we are almost there.
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 38b19a757281..e119e13a9ba5 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/uaccess.h>
 #include <linux/pkeys.h>
+#include <linux/minmax.h>
 
 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -1767,11 +1768,491 @@ static int pagemap_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+#define PM_SCAN_FOUND_MAX_PAGES	(1)
+#define PM_SCAN_BITS_ALL	(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
+				 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PM_SCAN_OPS		(PM_SCAN_OP_GET | PM_SCAN_OP_WP)
+#define PM_SCAN_DO_UFFD_WP(a)	(a->flags & PM_SCAN_OP_WP)
+#define PM_SCAN_BITMAP(wt, file, present, swap)	\
+	((wt) | ((file) << 1) | ((present) << 2) | ((swap) << 3))
+
+struct pagemap_scan_private {
+	struct page_region *vec;
+	struct page_region cur;
+	unsigned long vec_len, vec_index;
+	unsigned int max_pages, found_pages, flags;
+	unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
+};
+
+static inline bool is_pte_uffd_wp(pte_t pte)
+{
+	return (pte_present(pte) && pte_uffd_wp(pte)) ||
+	       pte_swp_uffd_wp_any(pte);
+}
+
+static inline void make_uffd_wp_pte(struct vm_area_struct *vma,
+				    unsigned long addr, pte_t *pte)
+{
+	pte_t ptent = *pte;
+
+	if (pte_present(ptent)) {
+		pte_t old_pte;
+
+		old_pte = ptep_modify_prot_start(vma, addr, pte);
+		ptent = pte_mkuffd_wp(ptent);
+		ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
+	} else if (is_swap_pte(ptent)) {
+		ptent = pte_swp_mkuffd_wp(ptent);
+		set_pte_at(vma->vm_mm, addr, pte, ptent);
+	}
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline bool is_pmd_uffd_wp(pmd_t pmd)
+{
+	return (pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
+	       (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd));
+}
+
+static inline void make_uffd_wp_pmd(struct vm_area_struct *vma,
+				    unsigned long addr, pmd_t *pmdp)
+{
+	pmd_t old, pmd = *pmdp;
+
+	if (pmd_present(pmd)) {
+		old = pmdp_invalidate_ad(vma, addr, pmdp);
+		pmd = pmd_mkuffd_wp(old);
+		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	} else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
+		pmd = pmd_swp_mkuffd_wp(pmd);
+		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	}
+}
+#endif
+
+#ifdef CONFIG_HUGETLB_PAGE
+static inline bool is_huge_pte_uffd_wp(pte_t pte)
+{
+	return ((pte_present(pte) && huge_pte_uffd_wp(pte)) ||
+		pte_swp_uffd_wp_any(pte));
+}
+
+static inline void make_uffd_wp_huge_pte(struct vm_area_struct *vma,
+					 unsigned long addr, pte_t *ptep,
+					 pte_t ptent)
+{
+	pte_t old_pte;
+
+	if (!huge_pte_none(ptent)) {
+		old_pte = huge_ptep_modify_prot_start(vma, addr, ptep);
+		ptent = huge_pte_mkuffd_wp(old_pte);
+		ptep_modify_prot_commit(vma, addr, ptep, old_pte, ptent);
+	} else {
+		set_huge_pte_at(vma->vm_mm, addr, ptep,
+				make_pte_marker(PTE_MARKER_UFFD_WP));
+	}
+}
+#endif
+
+static inline bool pagemap_scan_check_page_written(struct pagemap_scan_private *p)
+{
+	return (p->required_mask | p->anyof_mask | p->excluded_mask) &
+	       PAGE_IS_WRITTEN;
+}
+
+static int pagemap_scan_test_walk(unsigned long start, unsigned long end,
+				  struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+
+	if (pagemap_scan_check_page_written(p) && (!userfaultfd_wp(vma) ||
+	    !userfaultfd_wp_async(vma)))
+		return -EPERM;
+
+	if (vma->vm_flags & VM_PFNMAP)
+		return 1;
+
+	return 0;
+}
+
+static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
+			       struct pagemap_scan_private *p,
+			       unsigned long addr, unsigned int n_pages)
+{
+	unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap);
+	struct page_region *cur = &p->cur;
+
+	if (!n_pages)
+		return -EINVAL;
+
+	if ((p->required_mask & bitmap) != p->required_mask)
+		return 0;
+	if (p->anyof_mask && !(p->anyof_mask & bitmap))
+		return 0;
+	if (p->excluded_mask & bitmap)
+		return 0;
+
+	bitmap &= p->return_mask;
+	if (!bitmap)
+		return 0;
+
+	if (cur->bitmap == bitmap &&
+	    cur->start + cur->len * PAGE_SIZE == addr) {
+		cur->len += n_pages;
+		p->found_pages += n_pages;
+
+		if (p->max_pages && (p->found_pages == p->max_pages))
+			return PM_SCAN_FOUND_MAX_PAGES;
+
+		return 0;
+	}
+
+	/*
+	 * All data is copied to cur first. When more data is found, we push
+	 * cur to vec and copy new data to cur. The vec_index represents the
+	 * current index of vec array. We add 1 to the vec_index while
+	 * performing checks to account for data in cur.
+	 */
+	if (p->vec_index && (p->vec_index + 1) >= p->vec_len)
+		return -ENOSPC;
+
+	if (cur->len) {
+		memcpy(&p->vec[p->vec_index], cur, sizeof(*p->vec));
+		p->vec_index++;
+	}
+
+	cur->start = addr;
+	cur->len = n_pages;
+	cur->bitmap = bitmap;
+	p->found_pages += n_pages;
+
+	if (p->max_pages && (p->found_pages == p->max_pages))
+		return PM_SCAN_FOUND_MAX_PAGES;
+
+	return 0;
+}
+
+static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
+				  unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long addr = end;
+	pte_t *pte, *orig_pte;
+	spinlock_t *ptl;
+	bool is_written;
+	int ret = 0;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		unsigned long n_pages = (end - start)/PAGE_SIZE;
+
+		if (p->max_pages && n_pages > p->max_pages - p->found_pages)
+			n_pages = p->max_pages - p->found_pages;
+
+		is_written = !is_pmd_uffd_wp(*pmd);
+
+		/*
+		 * Break huge page into small pages if the WP operation
+		 * needs to be performed on only a portion of the huge page.
+		 */
+		if (is_written && PM_SCAN_DO_UFFD_WP(p) &&
+		    n_pages < HPAGE_SIZE/PAGE_SIZE) {
+			spin_unlock(ptl);
+			split_huge_pmd(vma, pmd, start);
+			goto process_smaller_pages;
+		}
+
+		ret = pagemap_scan_output(is_written, vma->vm_file,
+					  pmd_present(*pmd), is_swap_pmd(*pmd),
+					  p, start, n_pages);
+
+		if (ret >= 0 && is_written && PM_SCAN_DO_UFFD_WP(p))
+			make_uffd_wp_pmd(vma, addr, pmd);
+
+		spin_unlock(ptl);
+
+		if (PM_SCAN_DO_UFFD_WP(p))
+			flush_tlb_range(vma, start, end);
+
+		return ret;
+	}
+process_smaller_pages:
+	if (pmd_trans_unstable(pmd))
+		return 0;
+#endif
+
+	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
+	for (addr = start; addr < end && !ret; pte++, addr += PAGE_SIZE) {
+		is_written = !is_pte_uffd_wp(*pte);
+
+		ret = pagemap_scan_output(is_written, vma->vm_file,
+					  pte_present(*pte), is_swap_pte(*pte),
+					  p, addr, 1);
+
+		if (ret >= 0 && is_written && PM_SCAN_DO_UFFD_WP(p))
+			make_uffd_wp_pte(vma, addr, pte);
+	}
+	pte_unmap_unlock(orig_pte, ptl);
+
+	if (PM_SCAN_DO_UFFD_WP(p))
+		flush_tlb_range(vma, start, addr);
+
+	cond_resched();
+	return ret;
+}
+
+#ifdef CONFIG_HUGETLB_PAGE
+static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
+				      unsigned long start, unsigned long end,
+				      struct mm_walk *walk)
+{
+	unsigned long n_pages = (end - start)/PAGE_SIZE;
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	struct hstate *h = hstate_vma(vma);
+	int ret = -EPERM;
+	spinlock_t *ptl;
+	bool is_written;
+	pte_t pte;
+
+	if (p->max_pages && n_pages > p->max_pages - p->found_pages)
+		n_pages = p->max_pages - p->found_pages;
+
+	if (PM_SCAN_DO_UFFD_WP(p)) {
+		i_mmap_lock_write(vma->vm_file->f_mapping);
+		ptl = huge_pte_lock(h, vma->vm_mm, ptep);
+	}
+
+	pte = huge_ptep_get(ptep);
+	is_written = !is_huge_pte_uffd_wp(pte);
+
+	/*
+	 * Partial hugetlb page clear isn't supported
+	 */
+	if (is_written && PM_SCAN_DO_UFFD_WP(p) &&
+	    n_pages < HPAGE_SIZE/PAGE_SIZE)
+		goto unlock_and_return;
+
+	ret = pagemap_scan_output(is_written, vma->vm_file, pte_present(pte),
+				  is_swap_pte(pte), p, start, n_pages);
+	if (ret < 0)
+		goto unlock_and_return;
+
+	if (is_written && PM_SCAN_DO_UFFD_WP(p)) {
+		make_uffd_wp_huge_pte(vma, start, ptep, pte);
+		flush_hugetlb_tlb_range(vma, start, end);
+	}
+
+unlock_and_return:
+	if (PM_SCAN_DO_UFFD_WP(p)) {
+		spin_unlock(ptl);
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+	}
+
+	return ret;
+}
+#else
+#define pagemap_scan_hugetlb_entry NULL
+#endif
+
+static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
+				 int depth, struct mm_walk *walk)
+{
+	unsigned long n_pages = (end - addr)/PAGE_SIZE;
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	int ret = 0;
+
+	if (!vma)
+		return 0;
+
+	if (p->max_pages && p->found_pages + n_pages > p->max_pages)
+		n_pages = p->max_pages - p->found_pages;
+
+	ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
+				  n_pages);
+	return ret;
+}
+
+static const struct mm_walk_ops pagemap_scan_ops = {
+	.test_walk = pagemap_scan_test_walk,
+	.pmd_entry = pagemap_scan_pmd_entry,
+	.pte_hole = pagemap_scan_pte_hole,
+	.hugetlb_entry = pagemap_scan_hugetlb_entry,
+};
+
+static int pagemap_scan_args_valid(struct pm_scan_arg *arg, unsigned long start,
+				   struct page_region __user *vec)
+{
+	/* Detect illegal size, flags, len and masks */
+	if (arg->size != sizeof(struct pm_scan_arg))
+		return -EINVAL;
+	if (arg->flags & ~PM_SCAN_OPS)
+		return -EINVAL;
+	if (!arg->len)
+		return -EINVAL;
+	if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask |
+	     arg->return_mask) & ~PM_SCAN_BITS_ALL)
+		return -EINVAL;
+	if (!arg->required_mask && !arg->anyof_mask &&
+	    !arg->excluded_mask)
+		return -EINVAL;
+	if (!arg->return_mask)
+		return -EINVAL;
+
+	/* Validate memory ranges */
+	if (!(arg->flags & PM_SCAN_OP_GET))
+		return -EINVAL;
+	if (!arg->vec)
+		return -EINVAL;
+	if (arg->vec_len == 0)
+		return -EINVAL;
+
+	if (!IS_ALIGNED(start, PAGE_SIZE))
+		return -EINVAL;
+	if (!access_ok((void __user *)start, arg->len))
+		return -EFAULT;
+
+	if (!PM_SCAN_DO_UFFD_WP(arg))
+		return 0;
+
+	if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask) &
+	    ~PAGE_IS_WRITTEN)
+		return -EINVAL;
+
+	return 0;
+}
+
+static long do_pagemap_scan(struct mm_struct *mm,
+			    struct pm_scan_arg __user *uarg)
+{
+	unsigned long start, end, walk_start, walk_end;
+	unsigned long empty_slots, vec_index = 0;
+	struct mmu_notifier_range range;
+	struct page_region __user *vec;
+	struct pagemap_scan_private p;
+	struct pm_scan_arg arg;
+	int ret = 0;
+
+	if (copy_from_user(&arg, uarg, sizeof(arg)))
+		return -EFAULT;
+
+	start = untagged_addr((unsigned long)arg.start);
+	vec = (struct page_region *)untagged_addr((unsigned long)arg.vec);
+
+	ret = pagemap_scan_args_valid(&arg, start, vec);
+	if (ret)
+		return ret;
+
+	end = start + arg.len;
+	p.max_pages = arg.max_pages;
+	p.found_pages = 0;
+	p.flags = arg.flags;
+	p.required_mask = arg.required_mask;
+	p.anyof_mask = arg.anyof_mask;
+	p.excluded_mask = arg.excluded_mask;
+	p.return_mask = arg.return_mask;
+	p.cur.len = 0;
+	p.cur.start = 0;
+	p.vec = NULL;
+	p.vec_len = PAGEMAP_WALK_SIZE >> PAGE_SHIFT;
+
+	/*
+	 * Allocate a smaller buffer to get output from inside the page walk
+	 * functions and walk the page range in PAGEMAP_WALK_SIZE size chunks,
+	 * as we want to return the output to the user in compact form where
+	 * no two consecutive regions are contiguous with the same flags. So
+	 * store the latest element in p.cur between different walks and
+	 * store p.cur at the end of the walk to the user buffer.
+	 */
+	p.vec = kmalloc_array(p.vec_len, sizeof(*p.vec), GFP_KERNEL);
+	if (!p.vec)
+		return -ENOMEM;
+
+	if (p.flags & PM_SCAN_OP_WP) {
+		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
+					mm, start, end);
+		mmu_notifier_invalidate_range_start(&range);
+	}
+
+	walk_start = walk_end = start;
+	while (walk_end < end && !ret) {
+		p.vec_index = 0;
+
+		empty_slots = arg.vec_len - vec_index;
+		p.vec_len = min(p.vec_len, empty_slots);
+
+		walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
+		if (walk_end > end)
+			walk_end = end;
+
+		ret = mmap_read_lock_killable(mm);
+		if (ret)
+			goto free_data;
+		ret = walk_page_range(mm, walk_start, walk_end,
+				      &pagemap_scan_ops, &p);
+		mmap_read_unlock(mm);
+
+		if (ret && ret != -ENOSPC && ret != PM_SCAN_FOUND_MAX_PAGES)
+			goto free_data;
+
+		walk_start = walk_end;
+		if (p.vec_index) {
+			if (copy_to_user(&vec[vec_index], p.vec,
+					 p.vec_index * sizeof(*p.vec))) {
+				/*
+				 * Return error even though the OP succeeded
+				 */
+				ret = -EFAULT;
+				goto free_data;
+			}
+			vec_index += p.vec_index;
+		}
+	}
+
+	if (p.flags & PM_SCAN_OP_WP)
+		mmu_notifier_invalidate_range_end(&range);
+
+	if (p.cur.len) {
+		if (copy_to_user(&vec[vec_index], &p.cur, sizeof(*p.vec))) {
+			ret = -EFAULT;
+			goto free_data;
+		}
+		vec_index++;
+	}
+
+	ret = vec_index;
+
+free_data:
+	kfree(p.vec);
+	return ret;
+}
+
+static long do_pagemap_cmd(struct file *file, unsigned int cmd,
+			   unsigned long arg)
+{
+	struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)arg;
+	struct mm_struct *mm = file->private_data;
+
+	switch (cmd) {
+	case PAGEMAP_SCAN:
+		return do_pagemap_scan(mm, uarg);
+
+	default:
+		return -EINVAL;
+	}
+}
+
 const struct file_operations proc_pagemap_operations = {
 	.llseek		= mem_lseek, /* borrow this */
 	.read		= pagemap_read,
 	.open		= pagemap_open,
 	.release	= pagemap_release,
+	.unlocked_ioctl = do_pagemap_cmd,
+	.compat_ioctl	= do_pagemap_cmd,
 };
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..47879c38ce2f 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -305,4 +305,57 @@ typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)
 
+/* Pagemap ioctl */
+#define PAGEMAP_SCAN	_IOWR('f', 16, struct pm_scan_arg)
+
+/* Bits are set in the bitmap of the page_region and masks in pm_scan_args */
+#define PAGE_IS_WRITTEN		(1 << 0)
+#define PAGE_IS_FILE		(1 << 1)
+#define PAGE_IS_PRESENT		(1 << 2)
+#define PAGE_IS_SWAPPED		(1 << 3)
+
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start:	Start of the region
+ * @len:	Length of the region in pages
+ * @bitmap:	Bits set for the region
+ */
+struct page_region {
+	__u64 start;
+	__u64 len;
+	__u64 bitmap;
+};
+
+/*
+ * struct pm_scan_arg - Pagemap ioctl argument
+ * @size:		Size of the structure
+ * @flags:		Flags for the IOCTL
+ * @start:		Starting address of the region
+ * @len:		Length of the region (All the pages in this length are included)
+ * @vec:		Address of page_region struct array for output
+ * @vec_len:		Length of the page_region struct array
+ * @max_pages:		Optional max return pages
+ * @required_mask:	Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask:		Any mask - Any of these bits are set in the PTE
+ * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
+ * @return_mask:	Bits that are to be reported in page_region
+ */
+struct pm_scan_arg {
+	__u64 size;
+	__u64 flags;
+	__u64 start;
+	__u64 len;
+	__u64 vec;
+	__u64 vec_len;
+	__u64 max_pages;
+	__u64 required_mask;
+	__u64 anyof_mask;
+	__u64 excluded_mask;
+	__u64 return_mask;
+};
+
+/* Supported flags */
+#define PM_SCAN_OP_GET	(1 << 0)
+#define PM_SCAN_OP_WP	(1 << 1)
+
 #endif /* _UAPI_LINUX_FS_H */
This IOCTL, PAGEMAP_SCAN, on the pagemap file can be used to get and/or
clear the info about page table entries. The following operations are
supported by this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
  file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
  (PAGE_IS_SWAPPED).
- Find pages which have been written-to and write protect the pages
  (atomic PAGE_IS_WRITTEN + PM_SCAN_OP_WP)

This IOCTL can be extended to get information about more PTE bits.

Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
Changes in v15:
- Build fixes:
  - Use generic tlb flush function in pagemap_scan_pmd_entry() instead of
    the x86-specific flush function in do_pagemap_scan()
  - Remove #ifdef from pagemap_scan_hugetlb_entry()
  - Use mm instead of undefined vma->vm_mm

Changes in v14:
- Fix build error caused by #ifdef added at last minute in some configs

Changes in v13:
- Review updates
- mmap_read_lock_killable() instead of mmap_read_lock()
- Replace uffd_wp_range() with helpers which increase performance
  drastically for OP_WP operations by reducing the number of tlb flushes etc.
- Add MMU_NOTIFY_PROTECTION_VMA notification for the memory range

Changes in v12:
- Add hugetlb support to cover all memory types
- Merge "userfaultfd: Define dummy uffd_wp_range()" with this patch
- Review updates to the code

Changes in v11:
- Find written pages in a better way
- Fix a corner case (thanks Paul)
- Improve the code/comments
- Remove ENGAGE_WP + !GET operation
- Shorten the commit message in favour of moving documentation to
  pagemap.rst

Changes in v10:
- Move changes in tools/include/uapi/linux/fs.h to a separate patch
- Update commit message

Changes in v8:
- Correct is_pte_uffd_wp()
- Improve readability and error checks
- Remove some un-needed code

Changes in v7:
- Rebase on top of latest next
- Fix some corner cases
- Base soft-dirty on the uffd wp async
- Update the terminologies
- Optimize the memory usage inside the ioctl

Changes in v6:
- Rename variables and update comments
- Make IOCTL independent of soft_dirty config
- Change masks and bitmap type to __u64
- Improve code quality

Changes in v5:
- Remove tlb flushing even for clear operation

Changes in v4:
- Update the interface and implementation

Changes in v3:
- Tighten the user-kernel interface by using explicit types and add more
  error checking

Changes in v2:
- Convert the interface from syscall to ioctl
- Remove pidfd support as it doesn't make sense in ioctl
---
 fs/proc/task_mmu.c      | 481 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |  53 +++++
 2 files changed, 534 insertions(+)