From patchwork Fri Mar 18 09:55:29 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Xiaoguang Wang X-Patchwork-Id: 12785092 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6D717C433FE for ; Fri, 18 Mar 2022 09:55:52 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E6A058D0003; Fri, 18 Mar 2022 05:55:51 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id E196F8D0001; Fri, 18 Mar 2022 05:55:51 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id CE0B88D0003; Fri, 18 Mar 2022 05:55:51 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (relay.hostedemail.com [64.99.140.27]) by kanga.kvack.org (Postfix) with ESMTP id C093A8D0001 for ; Fri, 18 Mar 2022 05:55:51 -0400 (EDT) Received: from smtpin11.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 96076220BE for ; Fri, 18 Mar 2022 09:55:51 +0000 (UTC) X-FDA: 79257050502.11.4E1771D Received: from out30-132.freemail.mail.aliyun.com (out30-132.freemail.mail.aliyun.com [115.124.30.132]) by imf20.hostedemail.com (Postfix) with ESMTP id A283C1C0031 for ; Fri, 18 Mar 2022 09:55:50 +0000 (UTC) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R141e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04426;MF=xiaoguang.wang@linux.alibaba.com;NM=1;PH=DS;RN=6;SR=0;TI=SMTPD_---0V7W8jbJ_1647597346; Received: from localhost(mailfrom:xiaoguang.wang@linux.alibaba.com fp:SMTPD_---0V7W8jbJ_1647597346) by smtp.aliyun-inc.com(127.0.0.1); Fri, 18 Mar 2022 17:55:47 +0800 From: Xiaoguang Wang To: linux-mm@kvack.org, target-devel@vger.kernel.org, linux-scsi@vger.kernel.org Cc: linux-block@vger.kernel.org, 
xuyu@linux.alibaba.com, bostroesser@gmail.com Subject: [RFC 1/3] mm/memory.c: introduce vm_insert_page(s)_mkspecial Date: Fri, 18 Mar 2022 17:55:29 +0800 Message-Id: <20220318095531.15479-2-xiaoguang.wang@linux.alibaba.com> X-Mailer: git-send-email 2.17.2 In-Reply-To: <20220318095531.15479-1-xiaoguang.wang@linux.alibaba.com> References: <20220318095531.15479-1-xiaoguang.wang@linux.alibaba.com> X-Rspamd-Queue-Id: A283C1C0031 X-Rspam-User: Authentication-Results: imf20.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=alibaba.com; spf=pass (imf20.hostedemail.com: domain of xiaoguang.wang@linux.alibaba.com designates 115.124.30.132 as permitted sender) smtp.mailfrom=xiaoguang.wang@linux.alibaba.com X-Stat-Signature: hxz6ydqkikscxmhsm83t3ym8xmgir57f X-Rspamd-Server: rspam04 X-HE-Tag: 1647597350-845731 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Xu Yu This adds the ability to insert anonymous pages or file pages, used for direct IO or buffered IO respectively, into a user VM. The intention is to facilitate mapping the pages of IO requests into user space, which is usually the backend of a remote block device. This keeps the advantage of vm_insert_pages (batching the pmd lock) and eliminates the track_pfn_remap overhead of remap_pfn_range, since the pages to be inserted are always RAM. NOTE that a file page used in buffered IO is either locked (read) or under writeback (sync), while an anonymous page used in direct IO is pinned. Given this premise, such pages can be inserted as special PTEs without increasing the page refcount or mapcount. Conversely, such pages are unlocked, have writeback cleared, or are unpinned at end of IO, by which time the special mapping in user space has been zapped (which is, of course, the caller's responsibility).
Signed-off-by: Xu Yu --- include/linux/mm.h | 2 + mm/memory.c | 182 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 184 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 213cc569b192..0d660139b29e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2870,6 +2870,8 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr, int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *); int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr, struct page **pages, unsigned long *num); +int vm_insert_pages_mkspecial(struct vm_area_struct *vma, unsigned long addr, + struct page **pages, unsigned long *num); int vm_map_pages(struct vm_area_struct *vma, struct page **pages, unsigned long num); int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages, diff --git a/mm/memory.c b/mm/memory.c index c125c4969913..1f745e4d11c2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2510,6 +2510,188 @@ int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long } EXPORT_SYMBOL(vm_iomap_memory); +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL +static int insert_page_into_pte_locked_mkspecial(struct mm_struct *mm, pte_t *pte, + unsigned long addr, struct page *page, pgprot_t prot) +{ + /* + * The page to be inserted should be either anonymous page or file page. + * The anonymous page used in dio should be pinned, while the file page + * used in buffer IO is either locked (read) or writeback (sync). 
+ */ + if (PageAnon(page)) { + int extra = 0; + + if (PageSwapCache(page)) + extra += 1 + page_has_private(page); + + if ((page_count(page) - extra) <= page_mapcount(page)) + return -EINVAL; + } else if (page_is_file_lru(page)) { + if (!PageLocked(page) && !PageWriteback(page)) + return -EINVAL; + } else + return -EINVAL; + + flush_dcache_page(page); + + if (!pte_none(*pte)) + return -EBUSY; + set_pte_at(mm, addr, pte, pte_mkspecial(mk_pte(page, prot))); + return 0; +} + +static int insert_page_mkspecial(struct vm_area_struct *vma, unsigned long addr, + struct page *page, pgprot_t prot) +{ + struct mm_struct *mm = vma->vm_mm; + int retval; + pte_t *pte; + spinlock_t *ptl; + + retval = -ENOMEM; + pte = get_locked_pte(mm, addr, &ptl); + if (!pte) + goto out; + retval = insert_page_into_pte_locked_mkspecial(mm, pte, addr, page, prot); + pte_unmap_unlock(pte, ptl); +out: + return retval; +} + +/* + * vm_insert_page_mkspecial - variant of vm_insert_page, where pte is inserted + * with special bit set. + * + * Different from vm_insert_page(), @page in vm_insert_page_mkspecial() can + * either be anonymous page or file page, used for direct IO or buffer IO, + * respectively. + */ +int vm_insert_page_mkspecial(struct vm_area_struct *vma, unsigned long addr, struct page *page) +{ + if (addr < vma->vm_start || addr >= vma->vm_end) + return -EFAULT; + if (!(vma->vm_flags & VM_MIXEDMAP)) { + BUG_ON(mmap_read_trylock(vma->vm_mm)); + BUG_ON(vma->vm_flags & VM_PFNMAP); + vma->vm_flags |= VM_MIXEDMAP; + } + return insert_page_mkspecial(vma, addr, page, vma->vm_page_prot); +} + +#ifdef pte_index +/* + * insert_pages_mkspecial() amortizes the cost of spinlock operations + * when inserting pages in a loop. Arch *must* define pte_index. 
+ */ +static int insert_pages_mkspecial(struct vm_area_struct *vma, unsigned long addr, + struct page **pages, unsigned long *num, pgprot_t prot) +{ + pmd_t *pmd = NULL; + pte_t *start_pte, *pte; + spinlock_t *pte_lock; + struct mm_struct *const mm = vma->vm_mm; + unsigned long curr_page_idx = 0; + unsigned long remaining_pages_total = *num; + unsigned long pages_to_write_in_pmd; + int ret; +more: + ret = -EFAULT; + pmd = walk_to_pmd(mm, addr); + if (!pmd) + goto out; + + pages_to_write_in_pmd = min_t(unsigned long, + remaining_pages_total, PTRS_PER_PTE - pte_index(addr)); + + /* Allocate the PTE if necessary; takes PMD lock once only. */ + ret = -ENOMEM; + if (pte_alloc(mm, pmd)) + goto out; + + while (pages_to_write_in_pmd) { + int pte_idx = 0; + const int batch_size = min_t(int, pages_to_write_in_pmd, 8); + + start_pte = pte_offset_map_lock(mm, pmd, addr, &pte_lock); + for (pte = start_pte; pte_idx < batch_size; ++pte, ++pte_idx) { + int err = insert_page_into_pte_locked_mkspecial(mm, pte, + addr, pages[curr_page_idx], prot); + if (unlikely(err)) { + pte_unmap_unlock(start_pte, pte_lock); + ret = err; + remaining_pages_total -= pte_idx; + goto out; + } + addr += PAGE_SIZE; + ++curr_page_idx; + } + pte_unmap_unlock(start_pte, pte_lock); + pages_to_write_in_pmd -= batch_size; + remaining_pages_total -= batch_size; + } + if (remaining_pages_total) + goto more; + ret = 0; +out: + *num = remaining_pages_total; + return ret; +} +#endif /* pte_index */ + +/* + * vm_insert_pages_mkspecial - variant of vm_insert_pages, where pte is inserted + * with special bit set. + * + * Different from vm_insert_pages(), @pages in vm_insert_pages_mkspecial() can + * either be anonymous page or file page, used for direct IO or buffer IO, + * respectively. + * + * The main purpose of vm_insert_pages_mkspecial is to combine the advantages of + * vm_insert_pages (batching the pmd lock) and remap_pfn_range_notrack (skipping + * track_pfn_remap).
+ */ +int vm_insert_pages_mkspecial(struct vm_area_struct *vma, unsigned long addr, + struct page **pages, unsigned long *num) +{ +#ifdef pte_index + const unsigned long end_addr = addr + (*num * PAGE_SIZE) - 1; + + if (addr < vma->vm_start || end_addr >= vma->vm_end) + return -EFAULT; + if (!(vma->vm_flags & VM_MIXEDMAP)) { + BUG_ON(mmap_read_trylock(vma->vm_mm)); + BUG_ON(vma->vm_flags & VM_PFNMAP); + vma->vm_flags |= VM_MIXEDMAP; + } + return insert_pages_mkspecial(vma, addr, pages, num, vma->vm_page_prot); +#else + unsigned long idx = 0, pgcount = *num; + int err = -EINVAL; + + for (; idx < pgcount; ++idx) { + err = vm_insert_page_mkspecial(vma, addr + (PAGE_SIZE * idx), pages[idx]); + if (err) + break; + } + *num = pgcount - idx; + return err; +#endif /* pte_index */ +} +#else +int vm_insert_page_mkspecial(struct vm_area_struct *vma, unsigned long addr, struct page *page) +{ + return -EINVAL; +} +int vm_insert_pages_mkspecial(struct vm_area_struct *vma, unsigned long addr, + struct page **pages, unsigned long *num) +{ + return -EINVAL; +} +#endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */ +EXPORT_SYMBOL(vm_insert_page_mkspecial); +EXPORT_SYMBOL(vm_insert_pages_mkspecial); + static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, unsigned long end, pte_fn_t fn, void *data, bool create, From patchwork Fri Mar 18 09:55:30 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Xiaoguang Wang X-Patchwork-Id: 12785091 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 845D9C433EF for ; Fri, 18 Mar 2022 09:55:54 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 05F278D0005; Fri, 18 Mar 2022 05:55:54 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 00F568D0001; Fri, 18 Mar 2022 
05:55:53 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E400D8D0005; Fri, 18 Mar 2022 05:55:53 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0083.hostedemail.com [216.40.44.83]) by kanga.kvack.org (Postfix) with ESMTP id D51478D0001 for ; Fri, 18 Mar 2022 05:55:53 -0400 (EDT) Received: from smtpin28.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 908A0181CC40D for ; Fri, 18 Mar 2022 09:55:53 +0000 (UTC) X-FDA: 79257050586.28.9E6B817 Received: from out30-43.freemail.mail.aliyun.com (out30-43.freemail.mail.aliyun.com [115.124.30.43]) by imf09.hostedemail.com (Postfix) with ESMTP id A76F9140022 for ; Fri, 18 Mar 2022 09:55:52 +0000 (UTC) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R111e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04395;MF=xiaoguang.wang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0V7W7z1S_1647597348; Received: from localhost(mailfrom:xiaoguang.wang@linux.alibaba.com fp:SMTPD_---0V7W7z1S_1647597348) by smtp.aliyun-inc.com(127.0.0.1); Fri, 18 Mar 2022 17:55:49 +0800 From: Xiaoguang Wang To: linux-mm@kvack.org, target-devel@vger.kernel.org, linux-scsi@vger.kernel.org Cc: linux-block@vger.kernel.org, xuyu@linux.alibaba.com, bostroesser@gmail.com, Xiaoguang Wang Subject: [RFC 2/3] mm: export zap_page_range() Date: Fri, 18 Mar 2022 17:55:30 +0800 Message-Id: <20220318095531.15479-3-xiaoguang.wang@linux.alibaba.com> X-Mailer: git-send-email 2.17.2 In-Reply-To: <20220318095531.15479-1-xiaoguang.wang@linux.alibaba.com> References: <20220318095531.15479-1-xiaoguang.wang@linux.alibaba.com> X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: A76F9140022 X-Stat-Signature: ny9c3hu1echzmpm7fdajwnn3bamnzxfr X-Rspam-User: Authentication-Results: imf09.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=alibaba.com; spf=pass (imf09.hostedemail.com: domain of 
xiaoguang.wang@linux.alibaba.com designates 115.124.30.43 as permitted sender) smtp.mailfrom=xiaoguang.wang@linux.alibaba.com X-HE-Tag: 1647597352-647890 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000006, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The target_core_user module will use it to implement its zero-copy feature. Signed-off-by: Xiaoguang Wang --- mm/memory.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/memory.c b/mm/memory.c index 1f745e4d11c2..9974d0406dad 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1664,6 +1664,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start, mmu_notifier_invalidate_range_end(&range); tlb_finish_mmu(&tlb); } +EXPORT_SYMBOL_GPL(zap_page_range); /** * zap_page_range_single - remove user pages in a given range From patchwork Fri Mar 18 09:55:31 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Xiaoguang Wang X-Patchwork-Id: 12785093 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0E9EEC433F5 for ; Fri, 18 Mar 2022 09:55:57 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 9AF058D0006; Fri, 18 Mar 2022 05:55:56 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 95DD78D0001; Fri, 18 Mar 2022 05:55:56 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 827508D0006; Fri, 18 Mar 2022 05:55:56 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0065.hostedemail.com [216.40.44.65]) by kanga.kvack.org (Postfix) with ESMTP id 73A338D0001 for ; Fri, 18 Mar 2022 05:55:56 -0400 (EDT) Received: from smtpin20.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by
forelay02.hostedemail.com (Postfix) with ESMTP id 22D0EA4A57 for ; Fri, 18 Mar 2022 09:55:56 +0000 (UTC) X-FDA: 79257050712.20.F9474B3 Received: from out30-54.freemail.mail.aliyun.com (out30-54.freemail.mail.aliyun.com [115.124.30.54]) by imf04.hostedemail.com (Postfix) with ESMTP id EDEC940028 for ; Fri, 18 Mar 2022 09:55:54 +0000 (UTC) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R771e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04407;MF=xiaoguang.wang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0V7W8jcH_1647597351; Received: from localhost(mailfrom:xiaoguang.wang@linux.alibaba.com fp:SMTPD_---0V7W8jcH_1647597351) by smtp.aliyun-inc.com(127.0.0.1); Fri, 18 Mar 2022 17:55:52 +0800 From: Xiaoguang Wang To: linux-mm@kvack.org, target-devel@vger.kernel.org, linux-scsi@vger.kernel.org Cc: linux-block@vger.kernel.org, xuyu@linux.alibaba.com, bostroesser@gmail.com, Xiaoguang Wang Subject: [RFC 3/3] scsi: target: tcmu: Support zero copy Date: Fri, 18 Mar 2022 17:55:31 +0800 Message-Id: <20220318095531.15479-4-xiaoguang.wang@linux.alibaba.com> X-Mailer: git-send-email 2.17.2 In-Reply-To: <20220318095531.15479-1-xiaoguang.wang@linux.alibaba.com> References: <20220318095531.15479-1-xiaoguang.wang@linux.alibaba.com> X-Rspamd-Server: rspam11 X-Rspamd-Queue-Id: EDEC940028 Authentication-Results: imf04.hostedemail.com; dkim=none; spf=pass (imf04.hostedemail.com: domain of xiaoguang.wang@linux.alibaba.com designates 115.124.30.54 as permitted sender) smtp.mailfrom=xiaoguang.wang@linux.alibaba.com; dmarc=pass (policy=none) header.from=alibaba.com X-Rspam-User: X-Stat-Signature: 4pbpwew49g3xht5uj8agfnb1zqjndbua X-HE-Tag: 1647597354-945171 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Currently in tcmu, for READ commands, it copies user space backstore's data buffer to tcmu internal data area, then copies data in data area to READ commands 
sgl pages. For WRITE commands, tcmu copies the sgl pages to its internal data area, then copies the data in the data area to the user space backstore. In both cases there is obvious copy overhead, which hurts IO throughput, especially for large IO sizes. To mitigate this, we implement a zero-copy feature in tcmu, which maps the sgl pages into the user space backstore's address space. Currently a command takes the tcmu zero-copy path only if the offset and length of each of its sgl pages are aligned to the page size. Signed-off-by: Xiaoguang Wang --- drivers/target/target_core_user.c | 257 +++++++++++++++++++++++++++++++++----- 1 file changed, 229 insertions(+), 28 deletions(-) diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c index 7b2a89a67cdb..4314e0c00f8e 100644 --- a/drivers/target/target_core_user.c +++ b/drivers/target/target_core_user.c @@ -16,6 +16,8 @@ #include #include #include +#include +#include #include #include #include @@ -72,6 +74,7 @@ */ #define DATA_PAGES_PER_BLK_DEF 1 #define DATA_AREA_PAGES_DEF (256 * 1024) +#define ZC_DATA_AREA_PAGES_DEF (256 * 1024) #define TCMU_MBS_TO_PAGES(_mbs) ((size_t)_mbs << (20 - PAGE_SHIFT)) #define TCMU_PAGES_TO_MBS(_pages) (_pages >> (20 - PAGE_SHIFT)) @@ -145,9 +148,20 @@ struct tcmu_dev { struct list_head qfull_queue; struct list_head tmr_queue; + /* For zero copy handle */ + int zc_data_area_mb; + uint32_t zc_max_blocks; uint32_t dbi_max; uint32_t dbi_thresh; unsigned long *data_bitmap; + + struct mm_struct *vma_vm_mm; + struct vm_area_struct *vma; + + uint32_t zc_dbi_max; + uint32_t zc_dbi_thresh; + unsigned long *zc_data_bitmap; + struct xarray data_pages; uint32_t data_pages_per_blk; uint32_t data_blk_size; @@ -177,6 +191,12 @@ struct tcmu_cmd { struct tcmu_dev *tcmu_dev; struct list_head queue_entry; + /* For zero_copy handle */ + struct mm_struct *vma_vm_mm; + struct vm_area_struct *vma; + struct iovec *iov; + int iov_cnt; + uint16_t cmd_id; /* Can't use se_cmd when cleaning up expired cmds, because if @@
-192,6 +212,7 @@ struct tcmu_cmd { #define TCMU_CMD_BIT_EXPIRED 0 #define TCMU_CMD_BIT_KEEP_BUF 1 +#define TCMU_CMD_BIT_ZEROCOPY 2 unsigned long flags; }; @@ -495,10 +516,16 @@ static struct genl_family tcmu_genl_family __ro_after_init = { static void tcmu_cmd_free_data(struct tcmu_cmd *tcmu_cmd, uint32_t len) { struct tcmu_dev *udev = tcmu_cmd->tcmu_dev; + unsigned long *data_bitmap; uint32_t i; + if (test_bit(TCMU_CMD_BIT_ZEROCOPY, &tcmu_cmd->flags)) + data_bitmap = udev->zc_data_bitmap; + else + data_bitmap = udev->data_bitmap; + for (i = 0; i < len; i++) - clear_bit(tcmu_cmd->dbi[i], udev->data_bitmap); + clear_bit(tcmu_cmd->dbi[i], data_bitmap); } static inline int tcmu_get_empty_block(struct tcmu_dev *udev, @@ -549,8 +576,30 @@ static inline int tcmu_get_empty_block(struct tcmu_dev *udev, return i == page_cnt ? dbi : -1; } +static inline int tcmu_get_zc_empty_block(struct tcmu_dev *udev, + struct tcmu_cmd *tcmu_cmd, int prev_dbi, + int *iov_cnt) +{ + int dbi; + + dbi = find_first_zero_bit(udev->zc_data_bitmap, udev->zc_dbi_thresh); + if (dbi == udev->zc_dbi_thresh) + return -1; + + if (dbi > udev->zc_dbi_max) + udev->zc_dbi_max = dbi; + + set_bit(dbi, udev->zc_data_bitmap); + tcmu_cmd_set_dbi(tcmu_cmd, dbi); + + if (dbi != prev_dbi + 1) + *iov_cnt += 1; + return dbi; +} + static int tcmu_get_empty_blocks(struct tcmu_dev *udev, - struct tcmu_cmd *tcmu_cmd, int length) + struct tcmu_cmd *tcmu_cmd, int length, + bool zero_copy) { /* start value of dbi + 1 must not be a valid dbi */ int dbi = -2; @@ -559,16 +608,111 @@ static int tcmu_get_empty_blocks(struct tcmu_dev *udev, for (; length > 0; length -= blk_size) { blk_data_len = min_t(uint32_t, length, blk_size); - dbi = tcmu_get_empty_block(udev, tcmu_cmd, dbi, blk_data_len, - &iov_cnt); + if (zero_copy) { + dbi = tcmu_get_zc_empty_block(udev, tcmu_cmd, dbi, + &iov_cnt); + } else { + dbi = tcmu_get_empty_block(udev, tcmu_cmd, dbi, + blk_data_len, &iov_cnt); + } if (dbi < 0) return -1; } return iov_cnt; } 
+#define TCMU_ZEROCOPY_PAGE_BATCH 32 + +static inline void tcmu_zerocopy_one_seg(struct iovec *iov, + struct vm_area_struct *vma, + struct sg_page_iter *sgiter) +{ + struct page *pages[TCMU_ZEROCOPY_PAGE_BATCH]; + unsigned int len = iov->iov_len; + unsigned long address = (unsigned long)iov->iov_base + vma->vm_start; + unsigned long pages_remaining, pg_index = 0; + struct page *page; + + while (len > 0) { + __sg_page_iter_next(sgiter); + page = sg_page_iter_page(sgiter); + pages[pg_index++] = page; + len -= PAGE_SIZE; + if (pg_index == TCMU_ZEROCOPY_PAGE_BATCH || !len) { + pages_remaining = pg_index; + vm_insert_pages_mkspecial(vma, address, pages, &pages_remaining); + address = address + pg_index * PAGE_SIZE; + pg_index = 0; + } + } +} + +static long tcmu_cmd_zerocopy_map(struct tcmu_dev *udev, + struct tcmu_cmd *cmd, + struct iovec *iov, + int iov_cnt) +{ + struct se_cmd *se_cmd = cmd->se_cmd; + struct scatterlist *data_sg; + unsigned int data_nents; + struct iovec *tiov; + struct sg_page_iter sgiter; + struct vm_area_struct *vma = udev->vma; + int i, ret = 0; + + mmap_read_lock(udev->vma_vm_mm); + data_sg = se_cmd->t_data_sg; + data_nents = se_cmd->t_data_nents; + __sg_page_iter_start(&sgiter, data_sg, data_nents, 0); + tiov = iov; + for (i = 0; i < iov_cnt; i++) { + tcmu_zerocopy_one_seg(tiov, vma, &sgiter); + tiov++; + } + cmd->iov = iov; + cmd->iov_cnt = iov_cnt; + cmd->vma_vm_mm = vma->vm_mm; + cmd->vma = vma; + mmgrab(cmd->vma_vm_mm); + mmap_read_unlock(udev->vma_vm_mm); + return ret; +} + +static void tcmu_cmd_zerocopy_unmap(struct tcmu_cmd *cmd) +{ + struct mm_struct *mm; + struct vm_area_struct *vma; + struct iovec *iov; + unsigned long address; + int i; + + mm = cmd->vma_vm_mm; + if (!mm) + return; + + vma = cmd->vma; + iov = cmd->iov; + if (mmget_not_zero(mm)) { + mmap_read_lock(mm); + for (i = 0; i < cmd->iov_cnt; i++) { + address = (unsigned long)iov->iov_base + vma->vm_start; + zap_page_range(vma, address, iov->iov_len); + iov++; + } + 
mmap_read_unlock(mm); + mmput(mm); + } + + cmd->vma_vm_mm = NULL; + cmd->vma = NULL; + mmdrop(mm); +} + static inline void tcmu_free_cmd(struct tcmu_cmd *tcmu_cmd) { + if (test_bit(TCMU_CMD_BIT_ZEROCOPY, &tcmu_cmd->flags)) + tcmu_cmd_zerocopy_unmap(tcmu_cmd); + kfree(tcmu_cmd->dbi); kmem_cache_free(tcmu_cmd_cache, tcmu_cmd); } @@ -850,37 +994,57 @@ static bool is_ring_space_avail(struct tcmu_dev *udev, size_t cmd_size) * Called with ring lock held. */ static int tcmu_alloc_data_space(struct tcmu_dev *udev, struct tcmu_cmd *cmd, - int *iov_bidi_cnt) + int *iov_bidi_cnt, bool zero_copy) { int space, iov_cnt = 0, ret = 0; if (!cmd->dbi_cnt) goto wr_iov_cnts; - /* try to check and get the data blocks as needed */ - space = spc_bitmap_free(udev->data_bitmap, udev->dbi_thresh); - if (space < cmd->dbi_cnt) { - unsigned long blocks_left = - (udev->max_blocks - udev->dbi_thresh) + space; + if (!zero_copy) { + /* try to check and get the data blocks as needed */ + space = spc_bitmap_free(udev->data_bitmap, udev->dbi_thresh); + if (space < cmd->dbi_cnt) { + unsigned long blocks_left = + (udev->max_blocks - udev->dbi_thresh) + space; + + if (blocks_left < cmd->dbi_cnt) { + pr_debug("no data space: only %lu available, but ask for %u\n", + blocks_left * udev->data_blk_size, + cmd->dbi_cnt * udev->data_blk_size); + return -1; + } - if (blocks_left < cmd->dbi_cnt) { - pr_debug("no data space: only %lu available, but ask for %u\n", - blocks_left * udev->data_blk_size, - cmd->dbi_cnt * udev->data_blk_size); - return -1; + udev->dbi_thresh += cmd->dbi_cnt; + if (udev->dbi_thresh > udev->max_blocks) + udev->dbi_thresh = udev->max_blocks; } + } else { + /* try to check and get the data blocks as needed */ + space = spc_bitmap_free(udev->zc_data_bitmap, udev->zc_dbi_thresh); + if (space < cmd->dbi_cnt) { + unsigned long blocks_left = + (udev->zc_max_blocks - udev->zc_dbi_thresh) + space; + + if (blocks_left < cmd->dbi_cnt) { + pr_debug("no data space: only %lu available, but ask for 
%u\n", + blocks_left * udev->data_blk_size, + cmd->dbi_cnt * udev->data_blk_size); + return -1; + } - udev->dbi_thresh += cmd->dbi_cnt; - if (udev->dbi_thresh > udev->max_blocks) - udev->dbi_thresh = udev->max_blocks; + udev->zc_dbi_thresh += cmd->dbi_cnt; + if (udev->zc_dbi_thresh > udev->zc_max_blocks) + udev->zc_dbi_thresh = udev->zc_max_blocks; + } } - iov_cnt = tcmu_get_empty_blocks(udev, cmd, cmd->se_cmd->data_length); + iov_cnt = tcmu_get_empty_blocks(udev, cmd, cmd->se_cmd->data_length, zero_copy); if (iov_cnt < 0) return -1; if (cmd->dbi_bidi_cnt) { - ret = tcmu_get_empty_blocks(udev, cmd, cmd->data_len_bidi); + ret = tcmu_get_empty_blocks(udev, cmd, cmd->data_len_bidi, zero_copy); if (ret < 0) return -1; } @@ -1021,6 +1185,7 @@ static int queue_cmd_ring(struct tcmu_cmd *tcmu_cmd, sense_reason_t *scsi_err) uint32_t blk_size = udev->data_blk_size; /* size of data buffer needed */ size_t data_length = (size_t)tcmu_cmd->dbi_cnt * blk_size; + bool zero_copy = false; *scsi_err = TCM_NO_SENSE; @@ -1044,7 +1209,22 @@ static int queue_cmd_ring(struct tcmu_cmd *tcmu_cmd, sense_reason_t *scsi_err) return -1; } - iov_cnt = tcmu_alloc_data_space(udev, tcmu_cmd, &iov_bidi_cnt); + if (!(se_cmd->se_cmd_flags & SCF_BIDI) && se_cmd->data_length && + IS_ALIGNED(se_cmd->data_length, PAGE_SIZE)) { + struct scatterlist *data_sg = se_cmd->t_data_sg, *sg; + unsigned int data_nents = se_cmd->t_data_nents; + int i; + + for_each_sg(data_sg, sg, data_nents, i) { + if (!((!sg->offset || IS_ALIGNED(sg->offset, PAGE_SIZE)) && + IS_ALIGNED(sg->length, PAGE_SIZE))) + break; + } + if (i == data_nents) + zero_copy = true; + } + + iov_cnt = tcmu_alloc_data_space(udev, tcmu_cmd, &iov_bidi_cnt, zero_copy); if (iov_cnt < 0) goto free_and_queue; @@ -1093,7 +1273,7 @@ static int queue_cmd_ring(struct tcmu_cmd *tcmu_cmd, sense_reason_t *scsi_err) tcmu_cmd_reset_dbi_cur(tcmu_cmd); iov = &entry->req.iov[0]; - if (se_cmd->data_direction == DMA_TO_DEVICE || + if (((se_cmd->data_direction == 
DMA_TO_DEVICE) && !zero_copy) || se_cmd->se_cmd_flags & SCF_BIDI) scatter_data_area(udev, tcmu_cmd, &iov); else @@ -1111,6 +1291,11 @@ static int queue_cmd_ring(struct tcmu_cmd *tcmu_cmd, sense_reason_t *scsi_err) tcmu_setup_cmd_timer(tcmu_cmd, udev->cmd_time_out, &udev->cmd_timer); entry->hdr.cmd_id = tcmu_cmd->cmd_id; + if (zero_copy) { + iov = &entry->req.iov[0]; + tcmu_cmd_zerocopy_map(udev, tcmu_cmd, iov, entry->req.iov_cnt); + set_bit(TCMU_CMD_BIT_ZEROCOPY, &tcmu_cmd->flags); + } tcmu_hdr_set_len(&entry->hdr.len_op, command_size); @@ -1366,7 +1551,10 @@ static bool tcmu_handle_completion(struct tcmu_cmd *cmd, else se_cmd->se_cmd_flags |= SCF_TREAT_READ_AS_NORMAL; } - if (se_cmd->se_cmd_flags & SCF_BIDI) { + + if (test_bit(TCMU_CMD_BIT_ZEROCOPY, &cmd->flags)) { + tcmu_cmd_zerocopy_unmap(cmd); + } else if (se_cmd->se_cmd_flags & SCF_BIDI) { /* Get Data-In buffer before clean up */ gather_data_area(udev, cmd, true, read_len); } else if (se_cmd->data_direction == DMA_FROM_DEVICE) { @@ -1520,6 +1708,8 @@ static void tcmu_check_expired_ring_cmd(struct tcmu_cmd *cmd) if (!time_after_eq(jiffies, cmd->deadline)) return; + if (test_bit(TCMU_CMD_BIT_ZEROCOPY, &cmd->flags)) + tcmu_cmd_zerocopy_unmap(cmd); set_bit(TCMU_CMD_BIT_EXPIRED, &cmd->flags); list_del_init(&cmd->queue_entry); se_cmd = cmd->se_cmd; @@ -1618,6 +1808,8 @@ static struct se_device *tcmu_alloc_device(struct se_hba *hba, const char *name) udev->data_pages_per_blk = DATA_PAGES_PER_BLK_DEF; udev->max_blocks = DATA_AREA_PAGES_DEF / udev->data_pages_per_blk; udev->data_area_mb = TCMU_PAGES_TO_MBS(DATA_AREA_PAGES_DEF); + udev->zc_max_blocks = ZC_DATA_AREA_PAGES_DEF / udev->data_pages_per_blk; + udev->zc_data_area_mb = TCMU_PAGES_TO_MBS(ZC_DATA_AREA_PAGES_DEF); mutex_init(&udev->cmdr_lock); @@ -1841,6 +2033,9 @@ static void tcmu_vma_open(struct vm_area_struct *vma) pr_debug("vma_open\n"); + udev->vma_vm_mm = vma->vm_mm; + udev->vma = vma; + mmgrab(udev->vma_vm_mm); kref_get(&udev->kref); } @@ -1850,6 +2045,10 
@@ static void tcmu_vma_close(struct vm_area_struct *vma) pr_debug("vma_close\n"); + mmdrop(udev->vma_vm_mm); + udev->vma_vm_mm = NULL; + udev->vma = NULL; + /* release ref from tcmu_vma_open */ kref_put(&udev->kref, tcmu_dev_kref_release); } @@ -1901,7 +2100,7 @@ static int tcmu_mmap(struct uio_info *info, struct vm_area_struct *vma) { struct tcmu_dev *udev = container_of(info, struct tcmu_dev, uio_info); - vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; + vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP | VM_MIXEDMAP; vma->vm_ops = &tcmu_vm_ops; vma->vm_private_data = udev; @@ -2172,7 +2371,7 @@ static int tcmu_configure_device(struct se_device *dev) struct tcmu_dev *udev = TCMU_DEV(dev); struct uio_info *info; struct tcmu_mailbox *mb; - size_t data_size; + size_t data_size, zc_data_size; int ret = 0; ret = tcmu_update_uio_info(udev); @@ -2183,8 +2382,9 @@ static int tcmu_configure_device(struct se_device *dev) mutex_lock(&udev->cmdr_lock); udev->data_bitmap = bitmap_zalloc(udev->max_blocks, GFP_KERNEL); + udev->zc_data_bitmap = bitmap_zalloc(udev->zc_max_blocks, GFP_KERNEL); mutex_unlock(&udev->cmdr_lock); - if (!udev->data_bitmap) { + if (!udev->data_bitmap || !udev->zc_data_bitmap) { ret = -ENOMEM; goto err_bitmap_alloc; } @@ -2201,7 +2401,8 @@ static int tcmu_configure_device(struct se_device *dev) udev->cmdr_size = CMDR_SIZE; udev->data_off = MB_CMDR_SIZE; data_size = TCMU_MBS_TO_PAGES(udev->data_area_mb) << PAGE_SHIFT; - udev->mmap_pages = (data_size + MB_CMDR_SIZE) >> PAGE_SHIFT; + zc_data_size = TCMU_MBS_TO_PAGES(udev->zc_data_area_mb) << PAGE_SHIFT; + udev->mmap_pages = (data_size + zc_data_size + MB_CMDR_SIZE) >> PAGE_SHIFT; udev->data_blk_size = udev->data_pages_per_blk * PAGE_SIZE; udev->dbi_thresh = 0; /* Default in Idle state */ @@ -2221,7 +2422,7 @@ static int tcmu_configure_device(struct se_device *dev) info->mem[0].name = "tcm-user command & data buffer"; info->mem[0].addr = (phys_addr_t)(uintptr_t)udev->mb_addr; - info->mem[0].size = data_size + 
MB_CMDR_SIZE; + info->mem[0].size = data_size + zc_data_size + MB_CMDR_SIZE; info->mem[0].memtype = UIO_MEM_NONE; info->irqcontrol = tcmu_irqcontrol;
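Note on the zero-copy eligibility rule from patch 3/3: the check that queue_cmd_ring() performs before choosing the zero-copy path can be sketched in user space as below. This is a minimal sketch under stated assumptions, not the driver code: struct seg, SKETCH_PAGE_SIZE, page_aligned() and zc_eligible() are hypothetical names standing in for struct scatterlist, PAGE_SIZE, IS_ALIGNED() and the inline for_each_sg() loop in the patch, and a 4 KiB page size is assumed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4 KiB pages; the kernel uses the architecture's PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical stand-in for the offset/length fields of struct scatterlist. */
struct seg {
	unsigned int offset;
	unsigned int length;
};

/* True when x is a multiple of the (power-of-two) page size. */
static bool page_aligned(unsigned int x)
{
	return (x & (SKETCH_PAGE_SIZE - 1)) == 0;
}

/*
 * Mirrors the condition in the patch: a non-BIDI command with a non-zero,
 * page-aligned data length qualifies only if the offset and length of
 * every segment are page-aligned as well.
 */
static bool zc_eligible(bool bidi, unsigned int data_length,
			const struct seg *segs, size_t nents)
{
	size_t i;

	if (bidi || data_length == 0 || !page_aligned(data_length))
		return false;

	for (i = 0; i < nents; i++) {
		if (!page_aligned(segs[i].offset) ||
		    !page_aligned(segs[i].length))
			return false;
	}
	return true;
}
```

A command that fails this test simply falls back to the existing copy path, so the rule only decides which data area (and bitmap) the command's blocks are taken from.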