From patchwork Sun Mar 10 08:26:04 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Gibson X-Patchwork-Id: 10846131 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id DDD8F139A for ; Sun, 10 Mar 2019 08:32:08 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C971928E38 for ; Sun, 10 Mar 2019 08:32:08 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id BE01328EA1; Sun, 10 Mar 2019 08:32:08 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.7 required=2.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,MAILING_LIST_MULTI autolearn=ham version=3.3.1 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 2EE7928E40 for ; Sun, 10 Mar 2019 08:32:08 +0000 (UTC) Received: from localhost ([127.0.0.1]:41741 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1h2tsN-0004qc-Bj for patchwork-qemu-devel@patchwork.kernel.org; Sun, 10 Mar 2019 04:32:07 -0400 Received: from eggs.gnu.org ([209.51.188.92]:42422) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1h2tno-0000dj-RV for qemu-devel@nongnu.org; Sun, 10 Mar 2019 04:27:25 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1h2tnk-00014M-UH for qemu-devel@nongnu.org; Sun, 10 Mar 2019 04:27:23 -0400 Received: from ozlabs.org ([2401:3900:2:1::2]:49257) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1h2tnk-000129-7C; Sun, 10 Mar 2019 04:27:20 -0400 Received: by ozlabs.org (Postfix, from userid 1007) id 44HDqj5Wwvz9sCJ; Sun, 10 Mar 2019 19:27:13 +1100 (AEDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=gibson.dropbear.id.au; s=201602; t=1552206433; bh=CtJZnPdT0k7AZ4hgqR3mpNV0QNnTmlYWSScqch58bfo=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=pCNPtk4Xq+9cinksTOu2gAlmqqDmcjx2uzNMlodm60VmpMfq31cWxajj964Kbb0+8 0oe/TZsZoOtgd9eVqzLJt33e+uyg5CYhsrVDb7AN8DPRLeGFzKOcBYZ4QcBfIotIBG 3dyf1Zs4IiZwsmOBhz3qekUiUkfiFLxdKd5TwaD0= From: David Gibson To: peter.maydell@linaro.org Date: Sun, 10 Mar 2019 19:26:04 +1100 Message-Id: <20190310082703.1245-2-david@gibson.dropbear.id.au> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20190310082703.1245-1-david@gibson.dropbear.id.au> References: <20190310082703.1245-1-david@gibson.dropbear.id.au> MIME-Version: 1.0 X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2401:3900:2:1::2 Subject: [Qemu-devel] [PULL 01/60] vfio/spapr: Fix indirect levels calculation X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: lvivier@redhat.com, Alexey Kardashevskiy , groug@kaod.org, qemu-devel@nongnu.org, qemu-ppc@nongnu.org, David Gibson Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" X-Virus-Scanned: ClamAV using ClamSMTP From: Alexey Kardashevskiy The current code assumes that we can address more bits on a PCI bus for DMA than we really can but there is no way knowing the actual limit. This makes a better guess for the number of levels and if the kernel fails to allocate that, this increases the level numbers till succeeded or reached the 64bit limit. This adds levels to the trace point. This may cause the kernel to warn about failed allocation: [65122.837458] Failed to allocate a TCE memory, level shift=28 which might happen if MAX_ORDER is not large enough as it can vary: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/Kconfig?h=v5.0-rc2#n727 Signed-off-by: Alexey Kardashevskiy Message-Id: <20190227085149.38596-3-aik@ozlabs.ru> Signed-off-by: David Gibson --- hw/vfio/spapr.c | 43 +++++++++++++++++++++++++++++++++---------- hw/vfio/trace-events | 2 +- 2 files changed, 34 insertions(+), 11 deletions(-) diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c index becf71a3fc..88437a79e6 100644 --- a/hw/vfio/spapr.c +++ b/hw/vfio/spapr.c @@ -143,10 +143,10 @@ int vfio_spapr_create_window(VFIOContainer *container, MemoryRegionSection *section, hwaddr *pgsize) { - int ret; + int ret = 0; IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr); uint64_t pagesize = memory_region_iommu_get_min_page_size(iommu_mr); - unsigned entries, pages; + unsigned entries, bits_total, bits_per_level, max_levels; struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) }; long systempagesize = qemu_getrampagesize(); @@ -176,16 +176,38 @@ int vfio_spapr_create_window(VFIOContainer *container, create.window_size = int128_get64(section->size); create.page_shift = ctz64(pagesize); /* - * SPAPR host supports multilevel TCE tables, there is some - * heuristic to decide how many levels we want for our table: - * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4 + * SPAPR host supports multilevel TCE tables. We try to guess optimal + * levels number and if this fails (for example due to the host memory + * fragmentation), we increase levels. The DMA address structure is: + * rrrrrrrr rxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx iiiiiiii + * where: + * r = reserved (bits >= 55 are reserved in the existing hardware) + * i = IOMMU page offset (64K in this example) + * x = bits to index a TCE which can be split to equal chunks to index + * within the level. + * The aim is to split "x" to smaller possible number of levels. */ entries = create.window_size >> create.page_shift; - pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1); - pages = MAX(pow2ceil(pages), 1); /* Round up */ - create.levels = ctz64(pages) / 6 + 1; - - ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create); + /* bits_total is number of "x" needed */ + bits_total = ctz64(entries * sizeof(uint64_t)); + /* + * bits_per_level is a safe guess of how much we can allocate per level: + * 8 is the current minimum for CONFIG_FORCE_MAX_ZONEORDER and MAX_ORDER + * is usually bigger than that. + * Below we look at getpagesize() as TCEs are allocated from system pages. + */ + bits_per_level = ctz64(getpagesize()) + 8; + create.levels = bits_total / bits_per_level; + if (bits_total % bits_per_level) { + ++create.levels; + } + max_levels = (64 - create.page_shift) / ctz64(getpagesize()); + for ( ; create.levels <= max_levels; ++create.levels) { + ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create); + if (!ret) { + break; + } + } if (ret) { error_report("Failed to create a window, ret = %d (%m)", ret); return -errno; @@ -200,6 +222,7 @@ int vfio_spapr_create_window(VFIOContainer *container, return -EINVAL; } trace_vfio_spapr_create_window(create.page_shift, + create.levels, create.window_size, create.start_addr); *pgsize = pagesize; diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events index ed2f333ad7..cf1e886818 100644 --- a/hw/vfio/trace-events +++ b/hw/vfio/trace-events @@ -129,6 +129,6 @@ vfio_prereg_listener_region_add_skip(uint64_t start, uint64_t end) "0x%"PRIx64" vfio_prereg_listener_region_del_skip(uint64_t start, uint64_t end) "0x%"PRIx64" - 0x%"PRIx64 vfio_prereg_register(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d" vfio_prereg_unregister(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d" -vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64 +vfio_spapr_create_window(int ps, unsigned int levels, uint64_t ws, uint64_t off) "pageshift=0x%x levels=%u winsize=0x%"PRIx64" offset=0x%"PRIx64 vfio_spapr_remove_window(uint64_t off) "offset=0x%"PRIx64 vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to liobn fd %d"