From patchwork Fri Sep 4 11:31:12 2020
X-Patchwork-Submitter: Adalbert Lazăr
X-Patchwork-Id: 11756757
From: Adalbert Lazăr
To: linux-mm@kvack.org
Cc:
linux-api@vger.kernel.org, Andrew Morton , Alexander Graf , Stefan Hajnoczi , Jerome Glisse , Paolo Bonzini , =?utf-8?q?Mihai_Don=C8=9Bu?= , Mircea Cirjaliu , Andy Lutomirski , Arnd Bergmann , Sargun Dhillon , Aleksa Sarai , Oleg Nesterov , Jann Horn , Kees Cook , Matthew Wilcox , Christian Brauner , =?utf-8?q?Adalbert_Laz?= =?utf-8?q?=C4=83r?= Subject: [RESEND RFC PATCH 1/5] mm: add atomic capability to zap_details Date: Fri, 4 Sep 2020 14:31:12 +0300 Message-Id: <20200904113116.20648-2-alazar@bitdefender.com> In-Reply-To: <20200904113116.20648-1-alazar@bitdefender.com> References: <20200904113116.20648-1-alazar@bitdefender.com> MIME-Version: 1.0 X-Rspamd-Queue-Id: ED951180B3C8B X-Spamd-Result: default: False [0.00 / 100.00] X-Rspamd-Server: rspam05 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Mircea Cirjaliu Force zap_xxx_range() functions to loop without rescheduling. Useful for unmapping memory in an atomic context, although no checks for atomic context are being made. Signed-off-by: Mircea Cirjaliu Signed-off-by: Adalbert Lazăr --- include/linux/mm.h | 6 ++++++ mm/memory.c | 11 +++++++---- 2 files changed, 13 insertions(+), 4 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 5a323422d783..1be4482a7b81 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1601,8 +1601,14 @@ struct zap_details { struct address_space *check_mapping; /* Check page->mapping if set */ pgoff_t first_index; /* Lowest page->index to unmap */ pgoff_t last_index; /* Highest page->index to unmap */ + bool atomic; /* Do not sleep. */ }; +static inline bool zap_is_atomic(struct zap_details *details) +{ + return (unlikely(details) && details->atomic); +} + struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t pte); struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr, diff --git a/mm/memory.c b/mm/memory.c index f703fe8c8346..8e78fb151f8f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1056,7 +1056,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, if (pte_none(ptent)) continue; - if (need_resched()) + if (!zap_is_atomic(details) && need_resched()) break; if (pte_present(ptent)) { @@ -1159,7 +1159,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, } if (addr != end) { - cond_resched(); + if (!zap_is_atomic(details)) + cond_resched(); goto again; } @@ -1195,7 +1196,8 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, goto next; next = zap_pte_range(tlb, vma, pmd, addr, next, details); next: - cond_resched(); + if (!zap_is_atomic(details)) + cond_resched(); } while (pmd++, addr = next, addr != end); return addr; @@ -1224,7 +1226,8 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb, continue; next = zap_pmd_range(tlb, vma, pud, addr, next, details); next: - cond_resched(); + if (!zap_is_atomic(details)) + cond_resched(); } while (pud++, addr = next, addr != end); return addr; From patchwork Fri Sep 4 11:31:13 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 11756761 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B9E67138C for ; Fri, 4 Sep 2020 11:31:14 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org 
[205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 7AC0C206B7 for ; Fri, 4 Sep 2020 11:31:14 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 7AC0C206B7 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=bitdefender.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id C40F1900002; Fri, 4 Sep 2020 07:31:09 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id BED228E0003; Fri, 4 Sep 2020 07:31:09 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A914D900002; Fri, 4 Sep 2020 07:31:09 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0100.hostedemail.com [216.40.44.100]) by kanga.kvack.org (Postfix) with ESMTP id 7B1E88E0003 for ; Fri, 4 Sep 2020 07:31:09 -0400 (EDT) Received: from smtpin24.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id 34111181AC9C6 for ; Fri, 4 Sep 2020 11:31:09 +0000 (UTC) X-FDA: 77225162658.24.desk38_1d11326270b1 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin24.hostedemail.com (Postfix) with ESMTP id 03B661A4A0 for ; Fri, 4 Sep 2020 11:31:08 +0000 (UTC) X-Spam-Summary: 1,0,0,08ed7fd8a2f325cb,d41d8cd98f00b204,alazar@bitdefender.com,,RULES_HIT:1:2:41:69:152:355:379:800:960:966:968:973:988:989:1260:1261:1277:1311:1313:1314:1345:1359:1431:1437:1515:1516:1518:1593:1594:1605:1676:1730:1747:1777:1792:2196:2199:2393:2559:2562:2898:3138:3139:3140:3141:3142:3369:3865:3866:3867:3870:3874:4051:4250:4321:4385:4605:5007:6119:6120:6261:6742:7576:7901:7903:8957:9592:10004:11026:11473:11658:11914:12043:12219:12291:12296:12297:12438:12517:12519:12555:12679:12683:12986:13149:13230:13255:13894:14394:14659:21080:21088:21324:21451:21611:21627:21990:30054:30070:30074,0,RBL:91.199.104.161:@bitdefender.com:.lbl8.mailshell.net-64.100.201.201 62.2.8.100;04y8reextsjurn3qfqj9um1x555duyphcyfob3cjdmo6kzdgz1yjda4w3nosz1f.r8ntf39agn7rhjqqe6m9nfi78xt4dr6y6wagyywrzrc1tfhwxu7acajz5r3ajy7.1-lbl8.mailshell.net-223.238.255.100,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fp,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:25,LUA_SU MMARY:no X-HE-Tag: desk38_1d11326270b1 X-Filterd-Recvd-Size: 10279 Received: from mx01.bbu.dsd.mx.bitdefender.com (mx01.bbu.dsd.mx.bitdefender.com [91.199.104.161]) by imf19.hostedemail.com (Postfix) with ESMTP for ; Fri, 4 Sep 2020 11:31:08 +0000 (UTC) Received: from smtp.bitdefender.com (smtp01.buh.bitdefender.com [10.17.80.75]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id CD72B30747C8; Fri, 4 Sep 2020 14:31:06 +0300 (EEST) Received: from localhost.localdomain (unknown [195.189.155.252]) by smtp.bitdefender.com (Postfix) with ESMTPSA id 041833072787; Fri, 4 Sep 2020 14:31:05 +0300 (EEST) From: =?utf-8?q?Adalbert_Laz=C4=83r?= To: linux-mm@kvack.org Cc: linux-api@vger.kernel.org, Andrew Morton , Alexander Graf , Stefan Hajnoczi , Jerome Glisse , Paolo Bonzini , =?utf-8?q?Mihai_Don=C8=9Bu?= , Mircea Cirjaliu , Andy Lutomirski , Arnd Bergmann , Sargun Dhillon , Aleksa Sarai , Oleg Nesterov , Jann Horn , Kees Cook , Matthew Wilcox , Christian Brauner , =?utf-8?q?Adalbert_Laz?= =?utf-8?q?=C4=83r?= Subject: [RESEND RFC PATCH 2/5] mm: let the VMA decide how 
zap_pte_range() acts on mapped pages Date: Fri, 4 Sep 2020 14:31:13 +0300 Message-Id: <20200904113116.20648-3-alazar@bitdefender.com> In-Reply-To: <20200904113116.20648-1-alazar@bitdefender.com> References: <20200904113116.20648-1-alazar@bitdefender.com> MIME-Version: 1.0 X-Rspamd-Queue-Id: 03B661A4A0 X-Spamd-Result: default: False [0.00 / 100.00] X-Rspamd-Server: rspam03 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Mircea Cirjaliu Instead of having one big function to handle all cases of page unmapping, have multiple implementation-defined callbacks, each for its own VMA type. In the future, exotic VMA implementations won't have to bloat the unique zapping function with another case of mappings. Signed-off-by: Mircea Cirjaliu Signed-off-by: Adalbert Lazăr --- include/linux/mm.h | 16 ++++ mm/memory.c | 182 +++++++++++++++++++++++++-------------------- 2 files changed, 116 insertions(+), 82 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 1be4482a7b81..39e55467aa49 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -36,6 +36,7 @@ struct file_ra_state; struct user_struct; struct writeback_control; struct bdi_writeback; +struct zap_details; void init_mm_internals(void); @@ -601,6 +602,14 @@ struct vm_operations_struct { */ struct page *(*find_special_page)(struct vm_area_struct *vma, unsigned long addr); + + /* + * Called by zap_pte_range() for use by special VMAs that implement + * custom zapping behavior. + */ + int (*zap_pte)(struct vm_area_struct *vma, unsigned long addr, + pte_t *pte, int rss[], struct mmu_gather *tlb, + struct zap_details *details); }; static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm) @@ -1594,6 +1603,13 @@ static inline bool can_do_mlock(void) { return false; } extern int user_shm_lock(size_t, struct user_struct *); extern void user_shm_unlock(size_t, struct user_struct *); +/* + * Flags returned by zap_pte implementations + */ +#define ZAP_PTE_CONTINUE 0 +#define ZAP_PTE_FLUSH (1 << 0) /* Ask for TLB flush. */ +#define ZAP_PTE_BREAK (1 << 1) /* Break PTE iteration. */ + /* * Parameter block passed down to zap_pte_range in exceptional cases. */ diff --git a/mm/memory.c b/mm/memory.c index 8e78fb151f8f..a225bfd01417 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1031,18 +1031,109 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, return ret; } +static int zap_pte_common(struct vm_area_struct *vma, unsigned long addr, + pte_t *pte, int rss[], struct mmu_gather *tlb, + struct zap_details *details) +{ + struct mm_struct *mm = tlb->mm; + pte_t ptent = *pte; + swp_entry_t entry; + int flags = 0; + + if (pte_present(ptent)) { + struct page *page; + + page = vm_normal_page(vma, addr, ptent); + if (unlikely(details) && page) { + /* + * unmap_shared_mapping_pages() wants to + * invalidate cache without truncating: + * unmap shared but keep private pages. 
+ */ + if (details->check_mapping && + details->check_mapping != page_rmapping(page)) + return 0; + } + ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + tlb_remove_tlb_entry(tlb, pte, addr); + if (unlikely(!page)) + return 0; + + if (!PageAnon(page)) { + if (pte_dirty(ptent)) { + flags |= ZAP_PTE_FLUSH; + set_page_dirty(page); + } + if (pte_young(ptent) && + likely(!(vma->vm_flags & VM_SEQ_READ))) + mark_page_accessed(page); + } + rss[mm_counter(page)]--; + page_remove_rmap(page, false); + if (unlikely(page_mapcount(page) < 0)) + print_bad_pte(vma, addr, ptent, page); + if (unlikely(__tlb_remove_page(tlb, page))) + flags |= ZAP_PTE_FLUSH | ZAP_PTE_BREAK; + return flags; + } + + entry = pte_to_swp_entry(ptent); + if (non_swap_entry(entry) && is_device_private_entry(entry)) { + struct page *page = device_private_entry_to_page(entry); + + if (unlikely(details && details->check_mapping)) { + /* + * unmap_shared_mapping_pages() wants to + * invalidate cache without truncating: + * unmap shared but keep private pages. + */ + if (details->check_mapping != page_rmapping(page)) + return 0; + } + + pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); + rss[mm_counter(page)]--; + page_remove_rmap(page, false); + put_page(page); + return 0; + } + + /* If details->check_mapping, we leave swap entries. */ + if (unlikely(details)) + return 0; + + if (!non_swap_entry(entry)) + rss[MM_SWAPENTS]--; + else if (is_migration_entry(entry)) { + struct page *page; + + page = migration_entry_to_page(entry); + rss[mm_counter(page)]--; + } + if (unlikely(!free_swap_and_cache(entry))) + print_bad_pte(vma, addr, ptent, NULL); + pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); + + return flags; +} + static unsigned long zap_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, struct zap_details *details) { struct mm_struct *mm = tlb->mm; - int force_flush = 0; + int flags = 0; int rss[NR_MM_COUNTERS]; spinlock_t *ptl; pte_t *start_pte; pte_t *pte; - swp_entry_t entry; + + int (*zap_pte)(struct vm_area_struct *vma, unsigned long addr, + pte_t *pte, int rss[], struct mmu_gather *tlb, + struct zap_details *details) = zap_pte_common; + if (vma->vm_ops && vma->vm_ops->zap_pte) + zap_pte = vma->vm_ops->zap_pte; tlb_change_page_size(tlb, PAGE_SIZE); again: @@ -1058,92 +1149,19 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, if (!zap_is_atomic(details) && need_resched()) break; - - if (pte_present(ptent)) { - struct page *page; - - page = vm_normal_page(vma, addr, ptent); - if (unlikely(details) && page) { - /* - * unmap_shared_mapping_pages() wants to - * invalidate cache without truncating: - * unmap shared but keep private pages. 
- */ - if (details->check_mapping && - details->check_mapping != page_rmapping(page)) - continue; - } - ptent = ptep_get_and_clear_full(mm, addr, pte, - tlb->fullmm); - tlb_remove_tlb_entry(tlb, pte, addr); - if (unlikely(!page)) - continue; - - if (!PageAnon(page)) { - if (pte_dirty(ptent)) { - force_flush = 1; - set_page_dirty(page); - } - if (pte_young(ptent) && - likely(!(vma->vm_flags & VM_SEQ_READ))) - mark_page_accessed(page); - } - rss[mm_counter(page)]--; - page_remove_rmap(page, false); - if (unlikely(page_mapcount(page) < 0)) - print_bad_pte(vma, addr, ptent, page); - if (unlikely(__tlb_remove_page(tlb, page))) { - force_flush = 1; - addr += PAGE_SIZE; - break; - } - continue; - } - - entry = pte_to_swp_entry(ptent); - if (non_swap_entry(entry) && is_device_private_entry(entry)) { - struct page *page = device_private_entry_to_page(entry); - - if (unlikely(details && details->check_mapping)) { - /* - * unmap_shared_mapping_pages() wants to - * invalidate cache without truncating: - * unmap shared but keep private pages. - */ - if (details->check_mapping != - page_rmapping(page)) - continue; - } - - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); - rss[mm_counter(page)]--; - page_remove_rmap(page, false); - put_page(page); - continue; + if (flags & ZAP_PTE_BREAK) { + flags &= ~ZAP_PTE_BREAK; + break; } - /* If details->check_mapping, we leave swap entries. */ - if (unlikely(details)) - continue; - - if (!non_swap_entry(entry)) - rss[MM_SWAPENTS]--; - else if (is_migration_entry(entry)) { - struct page *page; - - page = migration_entry_to_page(entry); - rss[mm_counter(page)]--; - } - if (unlikely(!free_swap_and_cache(entry))) - print_bad_pte(vma, addr, ptent, NULL); - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); + flags |= zap_pte(vma, addr, pte, rss, tlb, details); } while (pte++, addr += PAGE_SIZE, addr != end); add_mm_rss_vec(mm, rss); arch_leave_lazy_mmu_mode(); /* Do the actual TLB flush before dropping ptl */ - if (force_flush) + if (flags & ZAP_PTE_FLUSH) tlb_flush_mmu_tlbonly(tlb); pte_unmap_unlock(start_pte, ptl); @@ -1153,8 +1171,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, * entries before releasing the ptl), free the batched * memory too. Restart if we didn't do everything. 
*/ - if (force_flush) { - force_flush = 0; + if (flags & ZAP_PTE_FLUSH) { + flags &= ~ZAP_PTE_FLUSH; tlb_flush_mmu(tlb); } From patchwork Fri Sep 4 11:31:14 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 11756763 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 9532E138C for ; Fri, 4 Sep 2020 11:31:17 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 67D98206B7 for ; Fri, 4 Sep 2020 11:31:16 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 67D98206B7 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=bitdefender.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 710598E0003; Fri, 4 Sep 2020 07:31:10 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 6727A900003; Fri, 4 Sep 2020 07:31:10 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 49B1D8E0006; Fri, 4 Sep 2020 07:31:10 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0072.hostedemail.com [216.40.44.72]) by kanga.kvack.org (Postfix) with ESMTP id 2CCD38E0003 for ; Fri, 4 Sep 2020 07:31:10 -0400 (EDT) Received: from smtpin07.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id D8DA040E1 for ; Fri, 4 Sep 2020 11:31:09 +0000 (UTC) X-FDA: 77225162658.07.elbow81_0113bb5270b1 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin07.hostedemail.com (Postfix) with ESMTP id AD36E1803F9B6 for ; Fri, 4 Sep 2020 11:31:09 +0000 (UTC) X-Spam-Summary: 1,0,0,75114902058c764a,d41d8cd98f00b204,alazar@bitdefender.com,,RULES_HIT:41:69:152:355:379:800:960:968:973:982:988:989:1260:1261:1277:1311:1313:1314:1345:1359:1431:1437:1500:1515:1516:1518:1535:1544:1593:1594:1676:1711:1730:1747:1777:1792:2393:2559:2562:2693:3138:3139:3140:3141:3142:3353:3865:3866:3868:3870:3871:4250:4321:4605:5007:6119:6120:6261:6742:7576:7901:7903:9010:9592:10004:11026:11473:11658:11914:12043:12295:12296:12297:12438:12517:12519:12555:12679:12986:13161:13229:13255:13894:13972:14096:14097:14394:14659:14721:21080:21324:21451:21611:21627:21987:21990:30029:30045:30054,0,RBL:91.199.104.161:@bitdefender.com:.lbl8.mailshell.net-62.2.8.100 64.100.201.201;04y8b8qhrwo1ms3wnymissyqizzfaopng3o6stpxzdqf6mui5afc51euqaqyakz.3ky5aha63u8uho6wjg3upqmxjwo7xfd63xqphb5wmey5hopwwap4a311nrr4chm.c-lbl8.mailshell.net-223.238.255.100,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fp,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:24,LUA _SUMMARY X-HE-Tag: elbow81_0113bb5270b1 X-Filterd-Recvd-Size: 5911 Received: from mx01.bbu.dsd.mx.bitdefender.com (mx01.bbu.dsd.mx.bitdefender.com [91.199.104.161]) by imf49.hostedemail.com (Postfix) with ESMTP for ; Fri, 4 Sep 2020 11:31:09 +0000 (UTC) Received: from smtp.bitdefender.com (smtp01.buh.bitdefender.com [10.17.80.75]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id 9B813307C934; Fri, 4 Sep 2020 14:31:07 +0300 (EEST) Received: from localhost.localdomain 
(unknown [195.189.155.252]) by smtp.bitdefender.com (Postfix) with ESMTPSA id D31BB3072785; Fri, 4 Sep 2020 14:31:06 +0300 (EEST) From: =?utf-8?q?Adalbert_Laz=C4=83r?= To: linux-mm@kvack.org Cc: linux-api@vger.kernel.org, Andrew Morton , Alexander Graf , Stefan Hajnoczi , Jerome Glisse , Paolo Bonzini , =?utf-8?q?Mihai_Don=C8=9Bu?= , Mircea Cirjaliu , Andy Lutomirski , Arnd Bergmann , Sargun Dhillon , Aleksa Sarai , Oleg Nesterov , Jann Horn , Kees Cook , Matthew Wilcox , Christian Brauner , =?utf-8?q?Adalbert_Laz?= =?utf-8?q?=C4=83r?= Subject: [RESEND RFC PATCH 3/5] mm/mmu_notifier: remove lockdep map, allow mmu notifier to be used in nested scenarios Date: Fri, 4 Sep 2020 14:31:14 +0300 Message-Id: <20200904113116.20648-4-alazar@bitdefender.com> In-Reply-To: <20200904113116.20648-1-alazar@bitdefender.com> References: <20200904113116.20648-1-alazar@bitdefender.com> MIME-Version: 1.0 X-Rspamd-Queue-Id: AD36E1803F9B6 X-Spamd-Result: default: False [0.00 / 100.00] X-Rspamd-Server: rspam01 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Mircea Cirjaliu The combination of remote mapping + KVM causes nested range invalidations, which reports lockdep warnings. Signed-off-by: Mircea Cirjaliu Signed-off-by: Adalbert Lazăr --- include/linux/mmu_notifier.h | 5 +---- mm/mmu_notifier.c | 19 ------------------- 2 files changed, 1 insertion(+), 23 deletions(-) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 736f6918335e..81ea457d41be 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -440,12 +440,10 @@ mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) { might_sleep(); - lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); if (mm_has_notifiers(range->mm)) { range->flags |= MMU_NOTIFIER_RANGE_BLOCKABLE; __mmu_notifier_invalidate_range_start(range); } - lock_map_release(&__mmu_notifier_invalidate_range_start_map); } static inline int @@ -453,12 +451,11 @@ mmu_notifier_invalidate_range_start_nonblock(struct mmu_notifier_range *range) { int ret = 0; - lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); if (mm_has_notifiers(range->mm)) { range->flags &= ~MMU_NOTIFIER_RANGE_BLOCKABLE; ret = __mmu_notifier_invalidate_range_start(range); } - lock_map_release(&__mmu_notifier_invalidate_range_start_map); + return ret; } diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 06852b896fa6..928751bd8630 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -22,12 +22,6 @@ /* global SRCU for all MMs */ DEFINE_STATIC_SRCU(srcu); -#ifdef CONFIG_LOCKDEP -struct lockdep_map __mmu_notifier_invalidate_range_start_map = { - .name = "mmu_notifier_invalidate_range_start" -}; -#endif - /* * The mmu_notifier_subscriptions structure is allocated and installed in * mm->notifier_subscriptions inside the mm_take_all_locks() protected @@ -242,8 +236,6 @@ mmu_interval_read_begin(struct mmu_interval_notifier *interval_sub) * will always clear the below sleep in some reasonable time as * subscriptions->invalidate_seq is even in the idle state. 
*/ - lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); - lock_map_release(&__mmu_notifier_invalidate_range_start_map); if (is_invalidating) wait_event(subscriptions->wq, READ_ONCE(subscriptions->invalidate_seq) != seq); @@ -572,13 +564,11 @@ void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range, struct mmu_notifier_subscriptions *subscriptions = range->mm->notifier_subscriptions; - lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); if (subscriptions->has_itree) mn_itree_inv_end(subscriptions); if (!hlist_empty(&subscriptions->list)) mn_hlist_invalidate_end(subscriptions, range, only_end); - lock_map_release(&__mmu_notifier_invalidate_range_start_map); } void __mmu_notifier_invalidate_range(struct mm_struct *mm, @@ -612,13 +602,6 @@ int __mmu_notifier_register(struct mmu_notifier *subscription, lockdep_assert_held_write(&mm->mmap_sem); BUG_ON(atomic_read(&mm->mm_users) <= 0); - if (IS_ENABLED(CONFIG_LOCKDEP)) { - fs_reclaim_acquire(GFP_KERNEL); - lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); - lock_map_release(&__mmu_notifier_invalidate_range_start_map); - fs_reclaim_release(GFP_KERNEL); - } - if (!mm->notifier_subscriptions) { /* * kmalloc cannot be called under mm_take_all_locks(), but we @@ -1062,8 +1045,6 @@ void mmu_interval_notifier_remove(struct mmu_interval_notifier *interval_sub) * The possible sleep on progress in the invalidation requires the * caller not hold any locks held by invalidation callbacks. */ - lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); - lock_map_release(&__mmu_notifier_invalidate_range_start_map); if (seq) wait_event(subscriptions->wq, READ_ONCE(subscriptions->invalidate_seq) != seq); From patchwork Fri Sep 4 11:31:15 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?= X-Patchwork-Id: 11756767 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id D53B6138C for ; Fri, 4 Sep 2020 11:31:21 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 696BD206B7 for ; Fri, 4 Sep 2020 11:31:21 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 696BD206B7 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=bitdefender.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 9CD29900005; Fri, 4 Sep 2020 07:31:12 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 95F52900003; Fri, 4 Sep 2020 07:31:12 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 67998900006; Fri, 4 Sep 2020 07:31:12 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0154.hostedemail.com [216.40.44.154]) by kanga.kvack.org (Postfix) with ESMTP id 2CBB1900003 for ; Fri, 4 Sep 2020 07:31:12 -0400 (EDT) Received: from smtpin25.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id CDAC18245571 for ; Fri, 4 Sep 2020 11:31:11 +0000 (UTC) X-FDA: 77225162742.25.sofa08_1d172a7270b1 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com 
[10.5.16.251]) by smtpin25.hostedemail.com (Postfix) with ESMTP id A5CCB1804E3A8 for ; Fri, 4 Sep 2020 11:31:11 +0000 (UTC) X-Spam-Summary: 1,0,0,f2324e2cd35e3737,d41d8cd98f00b204,alazar@bitdefender.com,,RULES_HIT:152:327:355:379:960:966:968:973:988:989:1260:1261:1277:1311:1313:1314:1345:1359:1431:1437:1500:1515:1516:1518:1593:1594:1605:1676:1730:1747:1777:1792:1801:1981:2194:2196:2198:2199:2200:2201:2393:2538:2559:2562:2693:2901:2902:2903:3138:3139:3140:3141:3142:3865:3866:3867:3868:3870:3871:3872:3874:4037:4250:4321:4385:4605:5007:6120:6261:6742:7576:7875:7901:7903:8603:8660:8957:10004:11026:11657:11914:12043:12219:12291:12296:12297:12438:12517:12519:12555:12679:12986:13148:13230:13255:13894:14394:14659:21080:21324:21433:21451:21611:21627:21740:21790:21810:21939:21987:21990:30003:30029:30030:30051:30054:30056:30067:30070:30074:30080:30090,0,RBL:91.199.104.161:@bitdefender.com:.lbl8.mailshell.net-64.100.201.201 62.2.8.100;04yg66pod6kp9o73a6opg9n6ogsxmop4oekkenff871je7houn7f1ax3hnqso8h.mybiitqyuntkhg33hci7ffyx6k6itx1ctfpbfxq3yp81he8d5f69ci71rhcik3x.y-lbl8.mailshell.net-223.238.255.100,CacheIP:none ,Bayesia X-HE-Tag: sofa08_1d172a7270b1 X-Filterd-Recvd-Size: 38858 Received: from mx01.bbu.dsd.mx.bitdefender.com (mx01.bbu.dsd.mx.bitdefender.com [91.199.104.161]) by imf36.hostedemail.com (Postfix) with ESMTP for ; Fri, 4 Sep 2020 11:31:10 +0000 (UTC) Received: from smtp.bitdefender.com (smtp01.buh.bitdefender.com [10.17.80.75]) by mx01.bbu.dsd.mx.bitdefender.com (Postfix) with ESMTPS id 837D3307C93E; Fri, 4 Sep 2020 14:31:08 +0300 (EEST) Received: from localhost.localdomain (unknown [195.189.155.252]) by smtp.bitdefender.com (Postfix) with ESMTPSA id A0DD33072786; Fri, 4 Sep 2020 14:31:07 +0300 (EEST) From: =?utf-8?q?Adalbert_Laz=C4=83r?= To: linux-mm@kvack.org Cc: linux-api@vger.kernel.org, Andrew Morton , Alexander Graf , Stefan Hajnoczi , Jerome Glisse , Paolo Bonzini , =?utf-8?q?Mihai_Don=C8=9Bu?= , Mircea Cirjaliu , Andy Lutomirski , Arnd Bergmann , Sargun Dhillon , Aleksa Sarai , Oleg Nesterov , Jann Horn , Kees Cook , Matthew Wilcox , Christian Brauner , =?utf-8?q?Adalbert_Laz?= =?utf-8?q?=C4=83r?= Subject: [RESEND RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process Date: Fri, 4 Sep 2020 14:31:15 +0300 Message-Id: <20200904113116.20648-5-alazar@bitdefender.com> In-Reply-To: <20200904113116.20648-1-alazar@bitdefender.com> References: <20200904113116.20648-1-alazar@bitdefender.com> MIME-Version: 1.0 X-Rspamd-Queue-Id: A5CCB1804E3A8 X-Spamd-Result: default: False [0.00 / 100.00] X-Rspamd-Server: rspam04 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Mircea Cirjaliu Remote mapping creates a mirror VMA that exposes memory owned by another process in a zero-copy manner. The pages are mapped in the current process' address space with no memory management involved and little impact on the remote process operation. Currently incompatible with THP. 
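
As an illustration of the intended call sequence (not part of this patch), a minimal userspace sketch against the character-device interface could look as follows. It assumes the misc device is exposed as /dev/mirror-proc, relies on the identity view installed by REMOTE_PROC_MAP (file offset equals the remote virtual address), hardcodes a 4 KiB page size, and omits most error handling:

/*
 * Illustrative only: map one page of an unrelated process (pid) at
 * remote virtual address 'remote_va' into our own address space via
 * the mirror-proc device added by this series.  The device node name
 * is an assumption.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/remote_mapping.h>

int main(int argc, char **argv)
{
	unsigned long remote_va;
	void *mirror;
	int fd, pid;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <pid> <remote-hex-address>\n", argv[0]);
		return 1;
	}
	pid = atoi(argv[1]);
	remote_va = strtoul(argv[2], NULL, 16);

	fd = open("/dev/mirror-proc", O_RDWR);	/* assumed device node */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Attach this file to the target process' mm (arg is the pid value). */
	if (ioctl(fd, REMOTE_PROC_MAP, pid) < 0) {
		perror("REMOTE_PROC_MAP");
		return 1;
	}

	/*
	 * With the identity view, the file offset equals the remote virtual
	 * address; the first access faults the remote page in.
	 */
	mirror = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
		      fd, remote_va & ~4095UL);
	if (mirror == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	printf("first byte of remote page: %#x\n", *(unsigned char *)mirror);

	munmap(mirror, 4096);
	close(fd);
	return 0;
}

The pidfd-oriented PIDFD_MEM_MAP/PIDFD_MEM_UNMAP ioctls provide the same mapping on a per-range basis once the control and map file descriptors for the target process have been obtained.
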
Signed-off-by: Mircea Cirjaliu Signed-off-by: Adalbert Lazăr --- include/linux/remote_mapping.h | 22 + include/uapi/linux/remote_mapping.h | 36 + mm/Kconfig | 11 + mm/Makefile | 1 + mm/remote_mapping.c | 1273 +++++++++++++++++++++++++++ 5 files changed, 1343 insertions(+) create mode 100644 include/linux/remote_mapping.h create mode 100644 include/uapi/linux/remote_mapping.h create mode 100644 mm/remote_mapping.c diff --git a/include/linux/remote_mapping.h b/include/linux/remote_mapping.h new file mode 100644 index 000000000000..5c1d43e8f669 --- /dev/null +++ b/include/linux/remote_mapping.h @@ -0,0 +1,22 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef _LINUX_REMOTE_MAPPING_H +#define _LINUX_REMOTE_MAPPING_H + +#include + +#ifdef CONFIG_REMOTE_MAPPING + +extern int task_remote_map(struct task_struct *task, int fds[]); + +#else /* CONFIG_REMOTE_MAPPING */ + +static inline int task_remote_map(struct task_struct *task, int fds[]) +{ + return -EINVAL; +} + +#endif /* CONFIG_REMOTE_MAPPING */ + + +#endif /* _LINUX_REMOTE_MAPPING_H */ diff --git a/include/uapi/linux/remote_mapping.h b/include/uapi/linux/remote_mapping.h new file mode 100644 index 000000000000..5d2828a6aa47 --- /dev/null +++ b/include/uapi/linux/remote_mapping.h @@ -0,0 +1,36 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ + +#ifndef __UAPI_REMOTE_MAPPING_H__ +#define __UAPI_REMOTE_MAPPING_H__ + +#include +#include + +// device file interface +#define REMOTE_PROC_MAP _IOW('r', 0x01, int) + + +// pidfd interface +#define PIDFD_IO_MAGIC 'p' + +struct pidfd_mem_map { + uint64_t address; + off_t offset; + off_t size; + int flags; + int padding[7]; +}; + +struct pidfd_mem_unmap { + off_t offset; + off_t size; +}; + +#define PIDFD_MEM_MAP _IOW(PIDFD_IO_MAGIC, 0x01, struct pidfd_mem_map) +#define PIDFD_MEM_UNMAP _IOW(PIDFD_IO_MAGIC, 0x02, struct pidfd_mem_unmap) +#define PIDFD_MEM_LOCK _IOW(PIDFD_IO_MAGIC, 0x03, int) + +#define PIDFD_MEM_REMAP _IOW(PIDFD_IO_MAGIC, 0x04, unsigned long) +// TODO: actually this is not for pidfd, find better names + +#endif /* __UAPI_REMOTE_MAPPING_H__ */ diff --git a/mm/Kconfig b/mm/Kconfig index c1acc34c1c35..0ecc3f41a98e 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -804,6 +804,17 @@ config HMM_MIRROR bool depends on MMU +config REMOTE_MAPPING + bool "Remote memory mapping" + depends on X86_64 && MMU && !TRANSPARENT_HUGEPAGE + select MMU_NOTIFIER + default n + help + Allows a client process to gain access to an unrelated process' + address space on a range-basis. The feature maps pages found at + the remote equivalent address in the current process' page tables + in a lightweight manner. + config DEVICE_PRIVATE bool "Unaddressable device memory (GPU memory, ...)" depends on ZONE_DEVICE diff --git a/mm/Makefile b/mm/Makefile index fccd3756b25f..ce1a00e7bc8c 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -112,3 +112,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o obj-$(CONFIG_PTDUMP_CORE) += ptdump.o obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o +obj-$(CONFIG_REMOTE_MAPPING) += remote_mapping.o diff --git a/mm/remote_mapping.c b/mm/remote_mapping.c new file mode 100644 index 000000000000..1dc53992424b --- /dev/null +++ b/mm/remote_mapping.c @@ -0,0 +1,1273 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Remote memory mapping. + * + * Copyright (C) 2017-2020 Bitdefender S.R.L. 
+ * + * Author: + * Mircea Cirjaliu + */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "internal.h" + +struct remote_file_context { + refcount_t refcount; + + struct srcu_struct fault_srcu; + struct mm_struct *mm; + + bool locked; + struct rb_root_cached rb_view; /* view offset tree */ + struct mutex views_lock; + +}; + +struct remote_view { + refcount_t refcount; + + unsigned long address; + unsigned long size; + unsigned long offset; + bool valid; + + struct rb_node target_rb; /* link for views tree */ + unsigned long rb_subtree_last; /* in remote_file_context */ + + struct mmu_interval_notifier mmin; + spinlock_t user_lock; + + /* + * interval tree for mapped ranges (indexed by source process HVA) + * because of GPA->HVA aliasing, multiple ranges may overlap + */ + struct rb_root_cached rb_rmap; /* rmap tree */ + struct rw_semaphore rmap_lock; +}; + +struct remote_vma_context { + struct vm_area_struct *vma; /* link back to VMA */ + struct remote_view *view; /* corresponding view */ + + struct rb_node rmap_rb; /* link for rmap tree */ + unsigned long rb_subtree_last; +}; + +/* view offset tree */ +static inline unsigned long view_start(struct remote_view *view) +{ + return view->offset + 1; +} + +static inline unsigned long view_last(struct remote_view *view) +{ + return view->offset + view->size - 1; +} + +INTERVAL_TREE_DEFINE(struct remote_view, target_rb, + unsigned long, rb_subtree_last, view_start, view_last, + static inline, view_interval_tree) + +#define view_tree_foreach(view, root, start, last) \ + for (view = view_interval_tree_iter_first(root, start, last); \ + view; view = view_interval_tree_iter_next(view, start, last)) + +/* rmap interval tree */ +static inline unsigned long ctx_start(struct remote_vma_context *ctx) +{ + struct vm_area_struct *vma = ctx->vma; + struct remote_view *view = ctx->view; + unsigned long offset = vma->vm_pgoff << PAGE_SHIFT; + + return offset - view->offset + view->address; +} + +static inline unsigned long ctx_last(struct remote_vma_context *ctx) +{ + struct vm_area_struct *vma = ctx->vma; + struct remote_view *view = ctx->view; + unsigned long offset; + + offset = (vma->vm_pgoff << PAGE_SHIFT) + (vma->vm_end - vma->vm_start); + + return offset - view->offset + view->address; +} + +static inline unsigned long ctx_rmap_start(struct remote_vma_context *ctx) +{ + return ctx_start(ctx) + 1; +} + +static inline unsigned long ctx_rmap_last(struct remote_vma_context *ctx) +{ + return ctx_last(ctx) - 1; +} + +INTERVAL_TREE_DEFINE(struct remote_vma_context, rmap_rb, + unsigned long, rb_subtree_last, ctx_rmap_start, ctx_rmap_last, + static inline, rmap_interval_tree) + +#define rmap_foreach(ctx, root, start, last) \ + for (ctx = rmap_interval_tree_iter_first(root, start, last); \ + ctx; ctx = rmap_interval_tree_iter_next(ctx, start, last)) + +static int mirror_zap_pte(struct vm_area_struct *vma, unsigned long addr, + pte_t *pte, int rss[], struct mmu_gather *tlb, + struct zap_details *details) +{ + pte_t ptent = *pte; + struct page *page; + int flags = 0; + + page = vm_normal_page(vma, addr, ptent); + //ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + ptent = ptep_clear_flush_notify(vma, addr, pte); + //tlb_remove_tlb_entry(tlb, pte, addr); + + if (pte_dirty(ptent)) { + flags |= ZAP_PTE_FLUSH; + 
set_page_dirty(page); + } + + return flags; +} + +static void +zap_remote_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end, + bool atomic) +{ + struct mmu_notifier_range range; + struct mmu_gather tlb; + struct zap_details details = { + .atomic = atomic, + }; + + pr_debug("%s: vma %lx-%lx, zap range %lx-%lx\n", + __func__, vma->vm_start, vma->vm_end, start, end); + + tlb_gather_mmu(&tlb, vma->vm_mm, start, end); + + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, + vma, vma->vm_mm, start, end); + if (atomic) + mmu_notifier_invalidate_range_start_nonblock(&range); + else + mmu_notifier_invalidate_range_start(&range); + + unmap_page_range(&tlb, vma, start, end, &details); + + mmu_notifier_invalidate_range_end(&range); + tlb_finish_mmu(&tlb, start, end); +} + +static bool +mirror_clear_view(struct remote_view *view, + unsigned long start, unsigned long last, bool atomic) +{ + struct remote_vma_context *ctx; + unsigned long src_start, src_last; + unsigned long vma_start, vma_last; + + pr_debug("%s: view %p [%lx-%lx), range [%lx-%lx)", __func__, view, + view->offset, view->offset + view->size, start, last); + + if (likely(!atomic)) + down_read(&view->rmap_lock); + else if (!down_read_trylock(&view->rmap_lock)) + return false; + + rmap_foreach(ctx, &view->rb_rmap, start, last) { + struct vm_area_struct *vma = ctx->vma; + + // intersect intervals (source process address range) + src_start = max(start, ctx_start(ctx)); + src_last = min(last, ctx_last(ctx)); + + // translate to destination process address range + vma_start = vma->vm_start + (src_start - ctx_start(ctx)); + vma_last = vma->vm_end - (ctx_last(ctx) - src_last); + + zap_remote_range(vma, vma_start, vma_last, atomic); + } + + up_read(&view->rmap_lock); + + return true; +} + +static bool mmin_invalidate(struct mmu_interval_notifier *interval_sub, + const struct mmu_notifier_range *range, + unsigned long cur_seq) +{ + struct remote_view *view = + container_of(interval_sub, struct remote_view, mmin); + + pr_debug("%s: reason %d, range [%lx-%lx)\n", __func__, + range->event, range->start, range->end); + + spin_lock(&view->user_lock); + mmu_interval_set_seq(interval_sub, cur_seq); + spin_unlock(&view->user_lock); + + /* mark view as invalid before zapping the page tables */ + if (range->event == MMU_NOTIFY_RELEASE) + WRITE_ONCE(view->valid, false); + + return mirror_clear_view(view, range->start, range->end, + !mmu_notifier_range_blockable(range)); +} + +static const struct mmu_interval_notifier_ops mmin_ops = { + .invalidate = mmin_invalidate, +}; + +static void view_init(struct remote_view *view) +{ + refcount_set(&view->refcount, 1); + view->valid = true; + RB_CLEAR_NODE(&view->target_rb); + view->rb_rmap = RB_ROOT_CACHED; + init_rwsem(&view->rmap_lock); + spin_lock_init(&view->user_lock); +} + +/* return working view or reason why it failed */ +static struct remote_view * +view_alloc(struct mm_struct *mm, unsigned long address, unsigned long size, unsigned long offset) +{ + struct remote_view *view; + int result; + + view = kzalloc(sizeof(*view), GFP_KERNEL); + if (!view) + return ERR_PTR(-ENOMEM); + + view_init(view); + + view->address = address; + view->size = size; + view->offset = offset; + + pr_debug("%s: view %p [%lx-%lx)", __func__, view, + view->offset, view->offset + view->size); + + result = mmu_interval_notifier_insert(&view->mmin, mm, address, size, &mmin_ops); + if (result) { + kfree(view); + return ERR_PTR(result); + } + + return view; +} + +static void +view_insert(struct remote_file_context 
*fctx, struct remote_view *view) +{ + view_interval_tree_insert(view, &fctx->rb_view); + refcount_inc(&view->refcount); +} + +static struct remote_view * +view_search_get(struct remote_file_context *fctx, + unsigned long start, unsigned long last) +{ + struct remote_view *view; + + lockdep_assert_held(&fctx->views_lock); + + /* + * loop & return the first view intersecting interval + * further checks will be done down the road + */ + view_tree_foreach(view, &fctx->rb_view, start, last) + break; + + if (view) + refcount_inc(&view->refcount); + + return view; +} + +static void +view_put(struct remote_view *view) +{ + if (refcount_dec_and_test(&view->refcount)) { + pr_debug("%s: view %p [%lx-%lx) bye bye", __func__, view, + view->offset, view->offset + view->size); + + mmu_interval_notifier_remove(&view->mmin); + kfree(view); + } +} + +static void +view_remove(struct remote_file_context *fctx, struct remote_view *view) +{ + view_interval_tree_remove(view, &fctx->rb_view); + RB_CLEAR_NODE(&view->target_rb); + view_put(view); +} + +static bool +view_overlaps(struct remote_file_context *fctx, + unsigned long start, unsigned long last) +{ + struct remote_view *view; + + view_tree_foreach(view, &fctx->rb_view, start, last) + return true; + + return false; +} + +static struct remote_view * +alloc_identity_view(struct mm_struct *mm) +{ + return view_alloc(mm, 0, ULONG_MAX, 0); +} + +static void remote_file_context_init(struct remote_file_context *fctx) +{ + refcount_set(&fctx->refcount, 1); + init_srcu_struct(&fctx->fault_srcu); + fctx->locked = false; + fctx->rb_view = RB_ROOT_CACHED; + mutex_init(&fctx->views_lock); +} + +static struct remote_file_context *remote_file_context_alloc(void) +{ + struct remote_file_context *fctx; + + fctx = kzalloc(sizeof(*fctx), GFP_KERNEL); + if (fctx) + remote_file_context_init(fctx); + + pr_debug("%s: fctx %p\n", __func__, fctx); + + return fctx; +} + +static void remote_file_context_get(struct remote_file_context *fctx) +{ + refcount_inc(&fctx->refcount); +} + +static void remote_file_context_put(struct remote_file_context *fctx) +{ + struct remote_view *view, *n; + + if (refcount_dec_and_test(&fctx->refcount)) { + pr_debug("%s: fctx %p\n", __func__, fctx); + + rbtree_postorder_for_each_entry_safe(view, n, + &fctx->rb_view.rb_root, target_rb) + view_put(view); + + if (fctx->mm) + mmdrop(fctx->mm); + + kfree(fctx); + } +} + +static void remote_vma_context_init(struct remote_vma_context *ctx) +{ + RB_CLEAR_NODE(&ctx->rmap_rb); +} + +static struct remote_vma_context *remote_vma_context_alloc(void) +{ + struct remote_vma_context *ctx; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (ctx) + remote_vma_context_init(ctx); + + return ctx; +} + +static void remote_vma_context_free(struct remote_vma_context *ctx) +{ + kfree(ctx); +} + +static int mirror_dev_open(struct inode *inode, struct file *file) +{ + struct remote_file_context *fctx; + + pr_debug("%s: file %p\n", __func__, file); + + fctx = remote_file_context_alloc(); + if (!fctx) + return -ENOMEM; + file->private_data = fctx; + + return 0; +} + +static int do_remote_proc_map(struct file *file, int pid) +{ + struct remote_file_context *fctx = file->private_data; + struct task_struct *req_task; + struct mm_struct *req_mm; + struct remote_view *id; + int result = 0; + + pr_debug("%s: pid %d\n", __func__, pid); + + req_task = find_get_task_by_vpid(pid); + if (!req_task) + return -ESRCH; + + req_mm = get_task_mm(req_task); + put_task_struct(req_task); + if (!req_mm) + return -EINVAL; + + /* error on 2nd call or 
multithreaded race */ + if (cmpxchg(&fctx->mm, (struct mm_struct *)NULL, req_mm) != NULL) { + result = -EALREADY; + goto out; + } else + mmgrab(req_mm); + + id = alloc_identity_view(req_mm); + if (IS_ERR(id)) { + mmdrop(req_mm); + result = PTR_ERR(id); + goto out; + } + + /* one view only, don't need to take mutex */ + view_insert(fctx, id); + view_put(id); /* usage reference */ + +out: + mmput(req_mm); + + return result; +} + +static long mirror_dev_ioctl(struct file *file, unsigned int ioctl, + unsigned long arg) +{ + long result; + + switch (ioctl) { + case REMOTE_PROC_MAP: { + int pid = (int)arg; + + result = do_remote_proc_map(file, pid); + break; + } + + default: + pr_debug("%s: ioctl %x not implemented\n", __func__, ioctl); + result = -ENOTTY; + } + + return result; +} + +/* + * This is called after all reference to the file have been dropped, + * including mmap()s, even if the file is close()d first. + */ +static int mirror_dev_release(struct inode *inode, struct file *file) +{ + struct remote_file_context *fctx = file->private_data; + + pr_debug("%s: file %p\n", __func__, file); + + remote_file_context_put(fctx); + + return 0; +} + +static struct page *mm_remote_get_page(struct mm_struct *req_mm, + unsigned long address, unsigned int flags) +{ + struct page *req_page = NULL; + long nrpages; + + might_sleep(); + + flags |= FOLL_ANON | FOLL_MIGRATION; + + /* get host page corresponding to requested address */ + nrpages = get_user_pages_remote(NULL, req_mm, address, 1, + flags, &req_page, NULL, NULL); + if (unlikely(nrpages == 0)) { + pr_err("no page at %lx\n", address); + return ERR_PTR(-ENOENT); + } + if (IS_ERR_VALUE(nrpages)) { + pr_err("get_user_pages_remote() failed: %d\n", (int)nrpages); + return ERR_PTR(nrpages); + } + + /* limit introspection to anon memory (this also excludes zero-page) */ + if (!PageAnon(req_page)) { + put_page(req_page); + pr_err("page at %lx not anon\n", address); + return ERR_PTR(-EINVAL); + } + + return req_page; +} + +/* + * avoid PTE allocation in this function for 2 reasons: + * - it runs under user_lock, which is a spinlock and can't sleep + * (user_lock can be a mutex is allocation is needed) + * - PTE allocation triggers reclaim, which causes a possible deadlock warning + */ +static vm_fault_t remote_map_page(struct vm_fault *vmf, struct page *page) +{ + struct vm_area_struct *vma = vmf->vma; + pte_t entry; + + if (vmf->prealloc_pte) { + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); + if (unlikely(!pmd_none(*vmf->pmd))) { + spin_unlock(vmf->ptl); + goto map_pte; + } + + mm_inc_nr_ptes(vma->vm_mm); + pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); + spin_unlock(vmf->ptl); + vmf->prealloc_pte = NULL; + } else { + BUG_ON(pmd_none(*vmf->pmd)); + } + +map_pte: + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); + + if (!pte_none(*vmf->pte)) + goto out_unlock; + + entry = mk_pte(page, vma->vm_page_prot); + set_pte_at_notify(vma->vm_mm, vmf->address, vmf->pte, entry); + +out_unlock: + pte_unmap_unlock(vmf->pte, vmf->ptl); + return VM_FAULT_NOPAGE; +} + +static vm_fault_t mirror_vm_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + struct mm_struct *mm = vma->vm_mm; + struct remote_vma_context *ctx = vma->vm_private_data; + struct remote_view *view = ctx->view; + struct file *file = vma->vm_file; + struct remote_file_context *fctx = file->private_data; + unsigned long req_addr; + unsigned int gup_flags; + struct page *req_page; + vm_fault_t result = VM_FAULT_SIGBUS; + struct mm_struct *src_mm = 
fctx->mm; + unsigned long seq; + int idx; + +fault_retry: + seq = mmu_interval_read_begin(&view->mmin); + + idx = srcu_read_lock(&fctx->fault_srcu); + + /* check if view was invalidated */ + if (unlikely(!READ_ONCE(view->valid))) { + pr_debug("%s: region [%lx-%lx) was invalidated!!\n", __func__, + view->offset, view->offset + view->size); + goto out_invalid; /* VM_FAULT_SIGBUS */ + } + + /* drop current mm semapchore */ + up_read(¤t->mm->mmap_sem); + + /* take remote mm semaphore */ + if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) { + if (!down_read_trylock(&src_mm->mmap_sem)) { + pr_debug("%s: couldn't take source semaphore!!\n", __func__); + goto out_retry; + } + } else + down_read(&src_mm->mmap_sem); + + /* set GUP flags depending on the VMA */ + gup_flags = 0; + if (vma->vm_flags & VM_WRITE) + gup_flags |= FOLL_WRITE | FOLL_FORCE; + + /* translate file offset to source process HVA */ + req_addr = (vmf->pgoff << PAGE_SHIFT) - view->offset + view->address; + req_page = mm_remote_get_page(src_mm, req_addr, gup_flags); + + /* check for validity of the page */ + if (IS_ERR_OR_NULL(req_page)) { + up_read(&src_mm->mmap_sem); + + if (PTR_ERR(req_page) == -ERESTARTSYS || + PTR_ERR(req_page) == -EBUSY) { + goto out_retry; + } else + goto out_err; /* VM_FAULT_SIGBUS */ + } + + up_read(&src_mm->mmap_sem); + + /* retake current mm semapchore */ + down_read(¤t->mm->mmap_sem); + + /* expedite retry */ + if (mmu_interval_check_retry(&view->mmin, seq)) { + put_page(req_page); + + srcu_read_unlock(&fctx->fault_srcu, idx); + + goto fault_retry; + } + + /* make sure the VMA hasn't gone away */ + vma = find_vma(current->mm, vmf->address); + if (vma == vmf->vma) { + spin_lock(&view->user_lock); + + if (mmu_interval_read_retry(&view->mmin, seq)) { + spin_unlock(&view->user_lock); + + put_page(req_page); + + srcu_read_unlock(&fctx->fault_srcu, idx); + + goto fault_retry; + } + + result = remote_map_page(vmf, req_page); /* install PTE here */ + + spin_unlock(&view->user_lock); + } + + put_page(req_page); + + srcu_read_unlock(&fctx->fault_srcu, idx); + + return result; + +out_err: + /* retake current mm semapchore */ + down_read(¤t->mm->mmap_sem); +out_invalid: + srcu_read_unlock(&fctx->fault_srcu, idx); + + return result; + +out_retry: + /* retake current mm semapchore */ + down_read(¤t->mm->mmap_sem); + + srcu_read_unlock(&fctx->fault_srcu, idx); + + /* TODO: some optimizations work here when we arrive with FAULT_FLAG_ALLOW_RETRY */ + /* TODO: mmap_sem doesn't need to be taken, then dropped */ + + /* + * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released + * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is + * not set. + * + * If FAULT_FLAG_ALLOW_RETRY is set but FAULT_FLAG_KILLABLE is not + * set, VM_FAULT_RETRY can still be returned if and only if there are + * fatal_signal_pending()s, and the mmap_sem must be released before + * returning it. + */ + if (vmf->flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_TRIED)) { + if (!(vmf->flags & FAULT_FLAG_KILLABLE)) + if (current && fatal_signal_pending(current)) { + up_read(¤t->mm->mmap_sem); + return VM_FAULT_RETRY; + } + + if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) + up_read(&mm->mmap_sem); + + return VM_FAULT_RETRY; + } else + return VM_FAULT_SIGBUS; +} + +/* + * This is called in remove_vma() at the end of __do_munmap() after the address + * space has been unmapped and the page tables have been freed. 
+ */ +static void mirror_vm_close(struct vm_area_struct *vma) +{ + struct remote_vma_context *ctx = vma->vm_private_data; + struct remote_view *view = ctx->view; + + pr_debug("%s: VMA %lx-%lx (%lu bytes)\n", __func__, + vma->vm_start, vma->vm_end, vma->vm_end - vma->vm_start); + + /* will wait for any running invalidate notifiers to finish */ + down_write(&view->rmap_lock); + rmap_interval_tree_remove(ctx, &view->rb_rmap); + up_write(&view->rmap_lock); + view_put(view); + + remote_vma_context_free(ctx); +} + +/* prevent partial unmap of destination VMA */ +static int mirror_vm_split(struct vm_area_struct *area, unsigned long addr) +{ + return -EINVAL; +} + +static const struct vm_operations_struct mirror_vm_ops = { + .close = mirror_vm_close, + .fault = mirror_vm_fault, + .split = mirror_vm_split, + .zap_pte = mirror_zap_pte, +}; + +static bool is_mirror_vma(struct vm_area_struct *vma) +{ + return vma->vm_ops == &mirror_vm_ops; +} + +static struct remote_view * +getme_matching_view(struct remote_file_context *fctx, + unsigned long start, unsigned long last) +{ + struct remote_view *view; + + /* lookup view for the VMA offset range */ + view = view_search_get(fctx, start, last); + if (!view) + return NULL; + + /* make sure the interval we're after is contained in the view */ + if (start < view->offset || last > view->offset + view->size) { + view_put(view); + return NULL; + } + + return view; +} + +static struct remote_view * +getme_exact_view(struct remote_file_context *fctx, + unsigned long start, unsigned long last) +{ + struct remote_view *view; + + /* lookup view for the VMA offset range */ + view = view_search_get(fctx, start, last); + if (!view) + return NULL; + + /* make sure the interval we're after is contained in the view */ + if (start != view->offset || last != view->offset + view->size) { + view_put(view); + return NULL; + } + + return view; +} + +static int mirror_dev_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct remote_file_context *fctx = file->private_data; + struct remote_vma_context *ctx; + unsigned long start, length, last; + struct remote_view *view; + + start = vma->vm_pgoff << PAGE_SHIFT; + length = vma->vm_end - vma->vm_start; + last = start + length; + + pr_debug("%s: VMA %lx-%lx (%lu bytes), offsets %lx-%lx\n", __func__, + vma->vm_start, vma->vm_end, length, start, last); + + if (!(vma->vm_flags & VM_SHARED)) { + pr_debug("%s: VMA is not shared\n", __func__); + return -EINVAL; + } + + /* prepare the context */ + ctx = remote_vma_context_alloc(); + if (!ctx) + return -ENOMEM; + + /* lookup view for the VMA offset range */ + mutex_lock(&fctx->views_lock); + view = getme_matching_view(fctx, start, last); + mutex_unlock(&fctx->views_lock); + if (!view) { + pr_debug("%s: no view for range %lx-%lx\n", __func__, start, last); + remote_vma_context_free(ctx); + return -EINVAL; + } + + /* VMA must be linked to ctx before adding to rmap tree !! 
*/ + vma->vm_private_data = ctx; + ctx->vma = vma; + + /* view may already be invalidated by the time it's linked */ + down_write(&view->rmap_lock); + ctx->view = view; /* view reference goes here */ + rmap_interval_tree_insert(ctx, &view->rb_rmap); + up_write(&view->rmap_lock); + + /* set basic VMA properties */ + vma->vm_flags |= VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND; + vma->vm_ops = &mirror_vm_ops; + + return 0; +} + +static const struct file_operations mirror_ops = { + .open = mirror_dev_open, + .unlocked_ioctl = mirror_dev_ioctl, + .compat_ioctl = mirror_dev_ioctl, + .llseek = no_llseek, + .mmap = mirror_dev_mmap, + .release = mirror_dev_release, +}; + +static struct miscdevice mirror_dev = { + .minor = MISC_DYNAMIC_MINOR, + .name = "mirror-proc", + .fops = &mirror_ops, +}; + +builtin_misc_device(mirror_dev); + +static int pidfd_mem_remap(struct remote_file_context *fctx, unsigned long address) +{ + struct vm_area_struct *vma; + unsigned long start, last; + struct remote_vma_context *ctx; + struct remote_view *view, *new_view; + int result = 0; + + pr_debug("%s: address %lx\n", __func__, address); + + down_write(¤t->mm->mmap_sem); + + vma = find_vma(current->mm, address); + if (!vma || !is_mirror_vma(vma)) { + result = -EINVAL; + goto out_vma; + } + + ctx = vma->vm_private_data; + view = ctx->view; + + if (view->valid) + goto out_vma; + + start = vma->vm_pgoff << PAGE_SHIFT; + last = start + (vma->vm_end - vma->vm_start); + + /* lookup view for the VMA offset range */ + mutex_lock(&fctx->views_lock); + new_view = getme_matching_view(fctx, start, last); + mutex_unlock(&fctx->views_lock); + if (!new_view) { + result = -EINVAL; + goto out_vma; + } + /* do not link to another invalid view */ + if (!new_view->valid) { + view_put(new_view); + result = -EINVAL; + goto out_vma; + } + + /* we have current->mm->mmap_sem in write mode, so no faults going on */ + down_write(&view->rmap_lock); + rmap_interval_tree_remove(ctx, &view->rb_rmap); + up_write(&view->rmap_lock); + view_put(view); /* ctx reference */ + + /* replace with the new view */ + down_write(&new_view->rmap_lock); + ctx->view = new_view; /* new view reference goes here */ + rmap_interval_tree_insert(ctx, &new_view->rb_rmap); + up_write(&new_view->rmap_lock); + +out_vma: + up_write(¤t->mm->mmap_sem); + + return result; +} + +static long +pidfd_mem_map_ioctl(struct file *file, unsigned int ioctl, unsigned long arg) +{ + struct remote_file_context *fctx = file->private_data; + long result = 0; + + switch (ioctl) { + case PIDFD_MEM_REMAP: + result = pidfd_mem_remap(fctx, arg); + break; + + default: + pr_debug("%s: ioctl %x not implemented\n", __func__, ioctl); + result = -ENOTTY; + } + + return result; +} + +static void pidfd_mem_lock(struct remote_file_context *fctx) +{ + pr_debug("%s:\n", __func__); + + mutex_lock(&fctx->views_lock); + fctx->locked = true; + mutex_unlock(&fctx->views_lock); +} + +static int pidfd_mem_map(struct remote_file_context *fctx, struct pidfd_mem_map *map) +{ + struct remote_view *view; + int result = 0; + + pr_debug("%s: offset %lx, size %lx, address %lx\n", + __func__, map->offset, map->size, (long)map->address); + + if (!PAGE_ALIGNED(map->offset)) + return -EINVAL; + if (!PAGE_ALIGNED(map->size)) + return -EINVAL; + if (!PAGE_ALIGNED(map->address)) + return -EINVAL; + + /* make sure we're creating the view for a valid address space */ + if (!mmget_not_zero(fctx->mm)) + return -EINVAL; + + view = view_alloc(fctx->mm, map->address, map->size, map->offset); + if (IS_ERR(view)) { + result = PTR_ERR(view); 
+		goto out_mm;
+	}
+
+	mutex_lock(&fctx->views_lock);
+
+	/* locked? */
+	if (unlikely(fctx->locked)) {
+		pr_debug("%s: views locked\n", __func__);
+		result = -EINVAL;
+		goto out;
+	}
+
+	/* overlaps another view? */
+	if (view_overlaps(fctx, map->offset, map->offset + map->size)) {
+		pr_debug("%s: range overlaps\n", __func__);
+		result = -EALREADY;
+		goto out;
+	}
+
+	view_insert(fctx, view);
+
+out:
+	mutex_unlock(&fctx->views_lock);
+
+	view_put(view);			/* usage reference */
+out_mm:
+	mmput(fctx->mm);
+
+	return result;
+}
+
+static int pidfd_mem_unmap(struct remote_file_context *fctx, struct pidfd_mem_unmap *unmap)
+{
+	struct remote_view *view;
+
+	pr_debug("%s: offset %lx, size %lx\n",
+		 __func__, unmap->offset, unmap->size);
+
+	if (!PAGE_ALIGNED(unmap->offset))
+		return -EINVAL;
+	if (!PAGE_ALIGNED(unmap->size))
+		return -EINVAL;
+
+	mutex_lock(&fctx->views_lock);
+
+	if (unlikely(fctx->locked)) {
+		mutex_unlock(&fctx->views_lock);
+		return -EINVAL;
+	}
+
+	view = getme_exact_view(fctx, unmap->offset, unmap->offset + unmap->size);
+	if (!view) {
+		mutex_unlock(&fctx->views_lock);
+		return -EINVAL;
+	}
+
+	view_remove(fctx, view);
+
+	mutex_unlock(&fctx->views_lock);
+
+	/*
+	 * The view may still be referenced by a mapping VMA, so dropping
+	 * a reference here may not delete it. The view will be marked as
+	 * invalid, together with all the VMAs linked to it.
+	 */
+	WRITE_ONCE(view->valid, false);
+
+	/* wait until local faults finish */
+	synchronize_srcu(&fctx->fault_srcu);
+
+	/*
+	 * because the view is marked as invalid, faults will not succeed, so
+	 * we don't have to worry about synchronizing invalidations/faults
+	 */
+	mirror_clear_view(view, 0, ULONG_MAX, false);
+
+	view_put(view);			/* usage reference */
+
+	return 0;
+}
+
+static long
+pidfd_mem_ctrl_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
+{
+	struct remote_file_context *fctx = file->private_data;
+	void __user *argp = (void __user *)arg;
+	long result = 0;
+
+	switch (ioctl) {
+	case PIDFD_MEM_MAP: {
+		struct pidfd_mem_map map;
+
+		result = -EINVAL;
+		if (copy_from_user(&map, argp, sizeof(map)))
+			return result;
+
+		result = pidfd_mem_map(fctx, &map);
+		break;
+	}
+
+	case PIDFD_MEM_UNMAP: {
+		struct pidfd_mem_unmap unmap;
+
+		result = -EINVAL;
+		if (copy_from_user(&unmap, argp, sizeof(unmap)))
+			return result;
+
+		result = pidfd_mem_unmap(fctx, &unmap);
+		break;
+	}
+
+	case PIDFD_MEM_LOCK:
+		pidfd_mem_lock(fctx);
+		break;
+
+	default:
+		pr_debug("%s: ioctl %x not implemented\n", __func__, ioctl);
+		result = -ENOTTY;
+	}
+
+	return result;
+}
+
+static int pidfd_mem_ctrl_release(struct inode *inode, struct file *file)
+{
+	struct remote_file_context *fctx = file->private_data;
+
+	pr_debug("%s: file %p\n", __func__, file);
+
+	remote_file_context_put(fctx);
+
+	return 0;
+}
+
+static const struct file_operations pidfd_mem_ctrl_ops = {
+	.owner = THIS_MODULE,
+	.unlocked_ioctl = pidfd_mem_ctrl_ioctl,
+	.compat_ioctl = pidfd_mem_ctrl_ioctl,
+	.llseek = no_llseek,
+	.release = pidfd_mem_ctrl_release,
+};
+
+static unsigned long
+pidfd_mem_get_unmapped_area(struct file *file, unsigned long addr,
+			    unsigned long len, unsigned long pgoff, unsigned long flags)
+{
+	struct remote_file_context *fctx = file->private_data;
+	unsigned long start = pgoff << PAGE_SHIFT;
+	unsigned long last = start + len;
+	unsigned long remote_addr, align_offset;
+	struct remote_view *view;
+	struct vm_area_struct *vma;
+	unsigned long result;
+
+	pr_debug("%s: addr %lx, len %lx, pgoff %lx, flags %lx\n",
+		 __func__, addr, len, pgoff, flags);
+
+	if (flags & MAP_FIXED) {
+		if (addr == 0)
+			return -ENOMEM;
+		else
+			return addr;
+	}
+
+	// TODO: elaborate on this case, we must still have alignment!
+	// TODO: only if THP enabled
+	if (addr == 0)
+		return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+
+	/* use this backing VMA */
+	vma = find_vma(current->mm, addr);
+	if (!vma) {
+		pr_debug("%s: no VMA found at %lx\n", __func__, addr);
+		return -EINVAL;
+	}
+
+	/* the backing VMA must have been mapped with PROT_NONE */
+	if (vma_is_accessible(vma)) {
+		pr_debug("%s: VMA at %lx is not a backing VMA\n", __func__, addr);
+		return -EINVAL;
+	}
+
+	/*
+	 * if the view somehow gets removed afterwards, we will create a
+	 * VMA for which there's no backing view, so mmap() will fail
+	 */
+	mutex_lock(&fctx->views_lock);
+	view = getme_matching_view(fctx, start, last);
+	mutex_unlock(&fctx->views_lock);
+	if (!view) {
+		pr_debug("%s: no view for range %lx-%lx\n", __func__, start, last);
+		return -EINVAL;
+	}
+
+	/* this should be enough to ensure VMA alignment */
+	remote_addr = start - view->offset + view->address;
+	align_offset = remote_addr % PMD_SIZE;
+
+	if (addr % PMD_SIZE <= align_offset)
+		result = (addr & PMD_MASK) + align_offset;
+	else
+		result = (addr & PMD_MASK) + align_offset + PMD_SIZE;
+
+	view_put(view);			/* usage reference */
+
+	return result;
+}
+
+static const struct file_operations pidfd_mem_map_fops = {
+	.owner = THIS_MODULE,
+	.mmap = mirror_dev_mmap,
+	.get_unmapped_area = pidfd_mem_get_unmapped_area,
+	.unlocked_ioctl = pidfd_mem_map_ioctl,
+	.compat_ioctl = pidfd_mem_map_ioctl,
+	.llseek = no_llseek,
+	.release = mirror_dev_release,
+};
+
+int task_remote_map(struct task_struct *task, int fds[])
+{
+	struct mm_struct *mm;
+	struct remote_file_context *fctx;
+	struct file *ctrl, *map;
+	int ret;
+
+	// allocate common file context
+	fctx = remote_file_context_alloc();
+	if (!fctx)
+		return -ENOMEM;
+
+	// create these 2 fds
+	fds[0] = fds[1] = -1;
+
+	fds[0] = anon_inode_getfd("[pidfd_mem.ctrl]", &pidfd_mem_ctrl_ops, fctx,
+				  O_RDWR | O_CLOEXEC);
+	if (fds[0] < 0) {
+		ret = fds[0];
+		goto out;
+	}
+	remote_file_context_get(fctx);
+
+	ctrl = fget(fds[0]);
+	ctrl->f_mode |= FMODE_WRITE_IOCTL;
+	fput(ctrl);
+
+	fds[1] = anon_inode_getfd("[pidfd_mem.map]", &pidfd_mem_map_fops, fctx,
+				  O_RDWR | O_CLOEXEC | O_LARGEFILE);
+	if (fds[1] < 0) {
+		ret = fds[1];
+		goto out;
+	}
+	remote_file_context_get(fctx);
+
+	map = fget(fds[1]);
+	map->f_mode |= FMODE_LSEEK | FMODE_UNSIGNED_OFFSET | FMODE_RANDOM;
+	fput(map);
+
+	mm = get_task_mm(task);
+	if (!mm) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* reference this mm in fctx */
+	mmgrab(mm);
+	fctx->mm = mm;
+
+	mmput(mm);
+	remote_file_context_put(fctx);	/* usage reference */
+
+	return 0;
+
+out:
+	if (fds[0] != -1) {
+		__close_fd(current->files, fds[0]);
+		remote_file_context_put(fctx);
+	}
+
+	if (fds[1] != -1) {
+		__close_fd(current->files, fds[1]);
+		remote_file_context_put(fctx);
+	}
+
+	// TODO: using __close_fd() does not guarantee success, use other means
+	// for file allocation & error recovery
+
+	remote_file_context_put(fctx);
+
+	return ret;
+}
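A minimal userspace sketch of how the two descriptors produced by task_remote_map() are meant to be driven: the control fd takes PIDFD_MEM_MAP/PIDFD_MEM_UNMAP/PIDFD_MEM_LOCK, while the access fd is mmap()ed and takes PIDFD_MEM_REMAP. The ioctl request codes and the struct pidfd_mem_map/pidfd_mem_unmap layouts are assumed here (they would come from a uapi header elsewhere in this series and are only placeholders below); the field names mirror the kernel code above, and the descriptors themselves come from the pidfd_mem() system call added in the next patch.

/* Illustrative only: request codes and struct layouts are assumptions, not uapi. */
#include <stdint.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

struct pidfd_mem_map {
	uint64_t address;	/* page-aligned address in the remote process */
	uint64_t size;		/* page-aligned length of the view */
	uint64_t offset;	/* page-aligned offset into the access fd */
};

struct pidfd_mem_unmap {
	uint64_t offset;
	uint64_t size;
};

/* placeholder request codes; the real ones live in the series' uapi header */
#define PIDFD_MEM_MAP	_IOW('m', 0x01, struct pidfd_mem_map)
#define PIDFD_MEM_UNMAP	_IOW('m', 0x02, struct pidfd_mem_unmap)
#define PIDFD_MEM_LOCK	_IO('m', 0x03)
#define PIDFD_MEM_REMAP	_IOW('m', 0x04, unsigned long)

/* expose [remote_addr, remote_addr + len) of the target at a local address */
static void *map_remote_range(int ctrl_fd, int access_fd,
			      uint64_t remote_addr, size_t len)
{
	struct pidfd_mem_map map = {
		.address = remote_addr,	/* where the view points in the target */
		.size    = len,
		.offset  = 0,		/* where the view sits in the access fd */
	};

	/* control fd: establish the offset <-> remote address mapping (a view) */
	if (ioctl(ctrl_fd, PIDFD_MEM_MAP, &map) < 0)
		return MAP_FAILED;

	/* optional: refuse any further PIDFD_MEM_MAP/UNMAP on this context */
	ioctl(ctrl_fd, PIDFD_MEM_LOCK, 0);

	/* access fd: mmap() the offset range that was just established */
	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, access_fd, 0);
}

If the view backing such a mapping is later torn down with PIDFD_MEM_UNMAP and a new view is established over the same offset range, the existing mapping can be re-attached in place with ioctl(access_fd, PIDFD_MEM_REMAP, (unsigned long)addr), which is what pidfd_mem_remap() above implements.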
From patchwork Fri Sep 4 11:31:16 2020
X-Patchwork-Submitter: =?utf-8?q?Adalbert_Laz=C4=83r?=
X-Patchwork-Id: 11756765
From: =?utf-8?q?Adalbert_Laz=C4=83r?=
To: linux-mm@kvack.org
Cc: linux-api@vger.kernel.org, Andrew Morton, Alexander Graf,
    Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini,
    =?utf-8?q?Mihai_Don=C8=9Bu?=, Mircea Cirjaliu, Andy Lutomirski,
    Arnd Bergmann, Sargun Dhillon, Aleksa Sarai, Oleg Nesterov,
    Jann Horn, Kees Cook, Matthew Wilcox, Christian Brauner,
    =?utf-8?q?Adalbert_Laz=C4=83r?=
Subject: [RESEND RFC PATCH 5/5] pidfd_mem: implemented remote memory mapping system call
Date: Fri, 4 Sep 2020 14:31:16 +0300
Message-Id: <20200904113116.20648-6-alazar@bitdefender.com>
In-Reply-To: <20200904113116.20648-1-alazar@bitdefender.com>
References: <20200904113116.20648-1-alazar@bitdefender.com>

From: Mircea Cirjaliu

This system call returns two file descriptors for inspecting the address
space of a remote process: one for control and one for access. Use them
according to the remote mapping specification.

Cc: Christian Brauner
Signed-off-by: Mircea Cirjaliu
Signed-off-by: Adalbert Lazăr
---
 arch/x86/entry/syscalls/syscall_32.tbl |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 include/linux/pid.h                    |  1 +
 include/linux/syscalls.h               |  1 +
 include/uapi/asm-generic/unistd.h      |  2 +
 kernel/exit.c                          |  2 +-
 kernel/pid.c                           | 55 ++++++++++++++++++++++++++
 7 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 54581ac671b4..ca1b5a32dbc5 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -440,5 +440,6 @@
 433	i386	fspick		sys_fspick
 434	i386	pidfd_open	sys_pidfd_open
 435	i386	clone3		sys_clone3
+436	i386	pidfd_mem	sys_pidfd_mem
 437	i386	openat2		sys_openat2
 438	i386	pidfd_getfd	sys_pidfd_getfd
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 37b844f839bc..6138d3d023f8 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -357,6 +357,7 @@
 433	common	fspick		sys_fspick
 434	common	pidfd_open	sys_pidfd_open
 435	common	clone3		sys_clone3
+436	common	pidfd_mem	sys_pidfd_mem
 437	common	openat2		sys_openat2
 438	common	pidfd_getfd	sys_pidfd_getfd
diff --git a/include/linux/pid.h b/include/linux/pid.h
index cc896f0fc4e3..9ec23ab23fd4 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -76,6 +76,7 @@ extern const struct file_operations pidfd_fops;
 struct file;
+extern struct pid *pidfd_get_pid(unsigned int fd);
 extern struct pid *pidfd_pid(const struct file *file);
 static inline struct pid *get_pid(struct pid *pid)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 1815065d52f3..621f3d52ed4e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -934,6 +934,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
 asmlinkage long sys_syncfs(int fd);
 asmlinkage long sys_setns(int fd, int nstype);
 asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
+asmlinkage long sys_pidfd_mem(int pidfd, int __user *fds, unsigned int flags);
 asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
 			     unsigned int vlen, unsigned flags);
 asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 3a3201e4618e..2663afc03c86 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -850,6 +850,8 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 #define __NR_clone3 435
 __SYSCALL(__NR_clone3, sys_clone3)
 #endif
+#define __NR_pidfd_mem 436
+__SYSCALL(__NR_pidfd_mem, sys_pidfd_mem)
 #define __NR_openat2 437
 __SYSCALL(__NR_openat2, sys_openat2)
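Since a brand-new system call has no libc wrapper, userspace would reach pidfd_mem() through syscall(2) using the number reserved above (436, with pidfd_open() at 434 as already listed in the tables). A minimal, hypothetical caller follows; the fd ordering (fds[0] = control, fds[1] = access) follows task_remote_map() from the previous patch.

#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open	434	/* as listed in the syscall tables above */
#endif
#ifndef __NR_pidfd_mem
#define __NR_pidfd_mem	436	/* number added by this patch */
#endif

int main(int argc, char **argv)
{
	pid_t pid;
	int pidfd, fds[2];

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	pid = (pid_t)atoi(argv[1]);

	pidfd = syscall(__NR_pidfd_open, pid, 0);
	if (pidfd < 0) {
		perror("pidfd_open");
		return 1;
	}

	/* flags must be 0; needs CAP_SYS_PTRACE unless the target is ourselves */
	if (syscall(__NR_pidfd_mem, pidfd, fds, 0) < 0) {
		perror("pidfd_mem");
		return 1;
	}

	printf("control fd: %d, access fd: %d\n", fds[0], fds[1]);
	return 0;
}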
diff --git a/kernel/exit.c b/kernel/exit.c
index 389a88cb3081..37cd8949e606 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1464,7 +1464,7 @@ static long do_wait(struct wait_opts *wo)
 	return retval;
 }
-static struct pid *pidfd_get_pid(unsigned int fd)
+struct pid *pidfd_get_pid(unsigned int fd)
 {
 	struct fd f;
 	struct pid *pid;
diff --git a/kernel/pid.c b/kernel/pid.c
index c835b844aca7..c9c49edf4a8a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -42,6 +42,7 @@
 #include
 #include
 #include
+#include
 struct pid init_struct_pid = {
 	.count = REFCOUNT_INIT(1),
@@ -565,6 +566,60 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
 	return fd;
 }
+/**
+ * pidfd_mem() - Allow access to process address space.
+ *
+ * @pidfd: pid file descriptor for the target process
+ * @fds: array where the control and access file descriptors are returned
+ * @flags: flags to pass
+ *
+ * This creates a pair of file descriptors used to gain access to the
+ * target process memory. The control fd is used to establish a linear
+ * mapping between an offset range and a userspace address range.
+ * The access fd is used to mmap(offset range) on the client side.
+ *
+ * Return: On success, 0 is returned.
+ * On error, a negative errno number will be returned.
+ */
+SYSCALL_DEFINE3(pidfd_mem, int, pidfd, int __user *, fds, unsigned int, flags)
+{
+	struct pid *pid;
+	struct task_struct *task;
+	int ret_fds[2];
+	int ret;
+
+	if (pidfd < 0)
+		return -EINVAL;
+	if (!fds)
+		return -EINVAL;
+	if (flags)
+		return -EINVAL;
+
+	pid = pidfd_get_pid(pidfd);
+	if (IS_ERR(pid))
+		return PTR_ERR(pid);
+
+	task = get_pid_task(pid, PIDTYPE_PID);
+	put_pid(pid);
+	if (IS_ERR(task))
+		return PTR_ERR(task);
+
+	ret = -EPERM;
+	if (unlikely(task == current) || capable(CAP_SYS_PTRACE))
+		ret = task_remote_map(task, ret_fds);
+	put_task_struct(task);
+	if (IS_ERR_VALUE((long)ret))
+		return ret;
+
+	if (copy_to_user(fds, ret_fds, sizeof(ret_fds))) {
+		put_unused_fd(ret_fds[0]);
+		put_unused_fd(ret_fds[1]);
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
 void __init pid_idr_init(void)
 {
 	/* Verify no one has done anything silly: */