From patchwork Wed Feb 27 08:51:44 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 10831417 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id CD77613B5 for ; Wed, 27 Feb 2019 08:53:38 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B5E6D2CD7F for ; Wed, 27 Feb 2019 08:53:38 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id A48732CD7D; Wed, 27 Feb 2019 08:53:38 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI autolearn=ham version=3.3.1 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 409712CD7D for ; Wed, 27 Feb 2019 08:53:38 +0000 (UTC) Received: from localhost ([127.0.0.1]:40374 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gyuy9-0005JH-I3 for patchwork-qemu-devel@patchwork.kernel.org; Wed, 27 Feb 2019 03:53:37 -0500 Received: from eggs.gnu.org ([209.51.188.92]:36325) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gyuwf-00044h-Ew for qemu-devel@nongnu.org; Wed, 27 Feb 2019 03:52:06 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gyuwe-0004Bv-CB for qemu-devel@nongnu.org; Wed, 27 Feb 2019 03:52:05 -0500 Received: from ozlabs.ru ([107.173.13.209]:35275) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gyuwa-00048Y-IJ; Wed, 27 Feb 2019 03:52:02 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id 4ABE5AE800EC; Wed, 27 Feb 2019 03:51:54 -0500 (EST) From: Alexey Kardashevskiy To: qemu-devel@nongnu.org Date: Wed, 27 Feb 2019 19:51:44 +1100 Message-Id: <20190227085149.38596-2-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190227085149.38596-1-aik@ozlabs.ru> References: <20190227085149.38596-1-aik@ozlabs.ru> X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x X-Received-From: 107.173.13.209 Subject: [Qemu-devel] [PATCH qemu v3 1/6] pci: Move NVIDIA vendor id to the rest of ids X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Jose Ricardo Ziviani , Alexey Kardashevskiy , Daniel Henrique Barboza , Alex Williamson , Sam Bobroff , Piotr Jaroszynski , qemu-ppc@nongnu.org, =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_Garcia?= , David Gibson Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" X-Virus-Scanned: ClamAV using ClamSMTP sPAPR code will use it too so move it from VFIO to the common code. Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson Reviewed-by: Alistair Francis --- include/hw/pci/pci_ids.h | 2 ++ hw/vfio/pci-quirks.c | 2 -- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/include/hw/pci/pci_ids.h b/include/hw/pci/pci_ids.h index eeb3301..0abe27a 100644 --- a/include/hw/pci/pci_ids.h +++ b/include/hw/pci/pci_ids.h @@ -271,4 +271,6 @@ #define PCI_VENDOR_ID_SYNOPSYS 0x16C3 +#define PCI_VENDOR_ID_NVIDIA 0x10de + #endif diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c index eae31c7..40a1200 100644 --- a/hw/vfio/pci-quirks.c +++ b/hw/vfio/pci-quirks.c @@ -526,8 +526,6 @@ static void vfio_probe_ati_bar2_quirk(VFIOPCIDevice *vdev, int nr) * note it for future reference. */ -#define PCI_VENDOR_ID_NVIDIA 0x10de - /* * Nvidia has several different methods to get to config space, the * nouveu project has several of these documented here: From patchwork Wed Feb 27 08:51:45 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 10831457 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1217F13B5 for ; Wed, 27 Feb 2019 08:58:21 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id F0E6F2B002 for ; Wed, 27 Feb 2019 08:58:20 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id E46EF2C162; Wed, 27 Feb 2019 08:58:20 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI autolearn=ham version=3.3.1 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 66F0E2B002 for ; Wed, 27 Feb 2019 08:58:20 +0000 (UTC) Received: from localhost ([127.0.0.1]:40448 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gyv2h-0000le-QD for patchwork-qemu-devel@patchwork.kernel.org; Wed, 27 Feb 2019 03:58:19 -0500 Received: from eggs.gnu.org ([209.51.188.92]:36533) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gyuxg-0004w8-F8 for qemu-devel@nongnu.org; Wed, 27 Feb 2019 03:53:09 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gyuxf-0004m0-5T for qemu-devel@nongnu.org; Wed, 27 Feb 2019 03:53:08 -0500 Received: from ozlabs.ru ([107.173.13.209]:35353) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gyuxe-0004P4-QD; Wed, 27 Feb 2019 03:53:07 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id 45EE5AE80215; Wed, 27 Feb 2019 03:51:57 -0500 (EST) From: Alexey Kardashevskiy To: qemu-devel@nongnu.org Date: Wed, 27 Feb 2019 19:51:45 +1100 Message-Id: <20190227085149.38596-3-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190227085149.38596-1-aik@ozlabs.ru> References: <20190227085149.38596-1-aik@ozlabs.ru> X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x X-Received-From: 107.173.13.209 Subject: [Qemu-devel] [PATCH qemu v3 2/6] vfio/spapr: Fix indirect levels calculation X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Jose Ricardo Ziviani , Alexey Kardashevskiy , Daniel Henrique Barboza , Alex Williamson , Sam Bobroff , Piotr Jaroszynski , qemu-ppc@nongnu.org, =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_Garcia?= , David Gibson Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" X-Virus-Scanned: ClamAV using ClamSMTP The current code assumes that we can address more bits on a PCI bus for DMA than we really can but there is no way knowing the actual limit. This makes a better guess for the number of levels and if the kernel fails to allocate that, this increases the level numbers till succeeded or reached the 64bit limit. This adds levels to the trace point. This may cause the kernel to warn about failed allocation: [65122.837458] Failed to allocate a TCE memory, level shift=28 which might happen if MAX_ORDER is not large enough as it can vary: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/Kconfig?h=v5.0-rc2#n727 Signed-off-by: Alexey Kardashevskiy --- Changes: v2: * replace systempagesize with getpagesize() when calculating bits_per_level/max_levels --- hw/vfio/spapr.c | 43 +++++++++++++++++++++++++++++++++---------- hw/vfio/trace-events | 2 +- 2 files changed, 34 insertions(+), 11 deletions(-) diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c index becf71a..88437a7 100644 --- a/hw/vfio/spapr.c +++ b/hw/vfio/spapr.c @@ -143,10 +143,10 @@ int vfio_spapr_create_window(VFIOContainer *container, MemoryRegionSection *section, hwaddr *pgsize) { - int ret; + int ret = 0; IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr); uint64_t pagesize = memory_region_iommu_get_min_page_size(iommu_mr); - unsigned entries, pages; + unsigned entries, bits_total, bits_per_level, max_levels; struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) }; long systempagesize = qemu_getrampagesize(); @@ -176,16 +176,38 @@ int vfio_spapr_create_window(VFIOContainer *container, create.window_size = int128_get64(section->size); create.page_shift = ctz64(pagesize); /* - * SPAPR host supports multilevel TCE tables, there is some - * heuristic to decide how many levels we want for our table: - * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4 + * SPAPR host supports multilevel TCE tables. We try to guess optimal + * levels number and if this fails (for example due to the host memory + * fragmentation), we increase levels. The DMA address structure is: + * rrrrrrrr rxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx iiiiiiii + * where: + * r = reserved (bits >= 55 are reserved in the existing hardware) + * i = IOMMU page offset (64K in this example) + * x = bits to index a TCE which can be split to equal chunks to index + * within the level. + * The aim is to split "x" to smaller possible number of levels. */ entries = create.window_size >> create.page_shift; - pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1); - pages = MAX(pow2ceil(pages), 1); /* Round up */ - create.levels = ctz64(pages) / 6 + 1; - - ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create); + /* bits_total is number of "x" needed */ + bits_total = ctz64(entries * sizeof(uint64_t)); + /* + * bits_per_level is a safe guess of how much we can allocate per level: + * 8 is the current minimum for CONFIG_FORCE_MAX_ZONEORDER and MAX_ORDER + * is usually bigger than that. + * Below we look at getpagesize() as TCEs are allocated from system pages. + */ + bits_per_level = ctz64(getpagesize()) + 8; + create.levels = bits_total / bits_per_level; + if (bits_total % bits_per_level) { + ++create.levels; + } + max_levels = (64 - create.page_shift) / ctz64(getpagesize()); + for ( ; create.levels <= max_levels; ++create.levels) { + ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create); + if (!ret) { + break; + } + } if (ret) { error_report("Failed to create a window, ret = %d (%m)", ret); return -errno; @@ -200,6 +222,7 @@ int vfio_spapr_create_window(VFIOContainer *container, return -EINVAL; } trace_vfio_spapr_create_window(create.page_shift, + create.levels, create.window_size, create.start_addr); *pgsize = pagesize; diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events index ed2f333..cf1e886 100644 --- a/hw/vfio/trace-events +++ b/hw/vfio/trace-events @@ -129,6 +129,6 @@ vfio_prereg_listener_region_add_skip(uint64_t start, uint64_t end) "0x%"PRIx64" vfio_prereg_listener_region_del_skip(uint64_t start, uint64_t end) "0x%"PRIx64" - 0x%"PRIx64 vfio_prereg_register(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d" vfio_prereg_unregister(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d" -vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64 +vfio_spapr_create_window(int ps, unsigned int levels, uint64_t ws, uint64_t off) "pageshift=0x%x levels=%u winsize=0x%"PRIx64" offset=0x%"PRIx64 vfio_spapr_remove_window(uint64_t off) "offset=0x%"PRIx64 vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to liobn fd %d" From patchwork Wed Feb 27 08:51:46 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 10831415 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E39C313B5 for ; Wed, 27 Feb 2019 08:53:24 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id CF4D02CD7D for ; Wed, 27 Feb 2019 08:53:24 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id C36AA2CD87; Wed, 27 Feb 2019 08:53:24 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI autolearn=ham version=3.3.1 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 646D72CD7F for ; Wed, 27 Feb 2019 08:53:24 +0000 (UTC) Received: from localhost ([127.0.0.1]:40372 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gyuxv-00057R-IF for patchwork-qemu-devel@patchwork.kernel.org; Wed, 27 Feb 2019 03:53:23 -0500 Received: from eggs.gnu.org ([209.51.188.92]:36323) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gyuwf-00044g-Ct for qemu-devel@nongnu.org; Wed, 27 Feb 2019 03:52:06 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gyuwe-0004CH-Ki for qemu-devel@nongnu.org; Wed, 27 Feb 2019 03:52:05 -0500 Received: from ozlabs.ru ([107.173.13.209]:35293) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gyuwe-0004BB-Eh; Wed, 27 Feb 2019 03:52:04 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id 42415AE807E8; Wed, 27 Feb 2019 03:52:00 -0500 (EST) From: Alexey Kardashevskiy To: qemu-devel@nongnu.org Date: Wed, 27 Feb 2019 19:51:46 +1100 Message-Id: <20190227085149.38596-4-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190227085149.38596-1-aik@ozlabs.ru> References: <20190227085149.38596-1-aik@ozlabs.ru> X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x X-Received-From: 107.173.13.209 Subject: [Qemu-devel] [PATCH qemu v3 3/6] vfio/spapr: Rename local systempagesize variable X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Jose Ricardo Ziviani , Alexey Kardashevskiy , Daniel Henrique Barboza , Alex Williamson , Sam Bobroff , Piotr Jaroszynski , qemu-ppc@nongnu.org, =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_Garcia?= , David Gibson Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" X-Virus-Scanned: ClamAV using ClamSMTP The "systempagesize" name suggests that it is the host system page size while it is the smallest page size of memory backing the guest RAM so let's rename it to stop confusion. This should cause no behavioral change. Signed-off-by: Alexey Kardashevskiy --- hw/vfio/spapr.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c index 88437a7..57fe758 100644 --- a/hw/vfio/spapr.c +++ b/hw/vfio/spapr.c @@ -148,14 +148,14 @@ int vfio_spapr_create_window(VFIOContainer *container, uint64_t pagesize = memory_region_iommu_get_min_page_size(iommu_mr); unsigned entries, bits_total, bits_per_level, max_levels; struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) }; - long systempagesize = qemu_getrampagesize(); + long rampagesize = qemu_getrampagesize(); /* * The host might not support the guest supported IOMMU page size, * so we will use smaller physical IOMMU pages to back them. */ - if (pagesize > systempagesize) { - pagesize = systempagesize; + if (pagesize > rampagesize) { + pagesize = rampagesize; } pagesize = 1ULL << (63 - clz64(container->pgsizes & (pagesize | (pagesize - 1)))); From patchwork Wed Feb 27 08:51:47 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 10831421 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B5A9013B5 for ; Wed, 27 Feb 2019 08:55:54 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id A1E2C2B002 for ; Wed, 27 Feb 2019 08:55:54 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 958172C162; Wed, 27 Feb 2019 08:55:54 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI autolearn=ham version=3.3.1 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 282522B002 for ; Wed, 27 Feb 2019 08:55:54 +0000 (UTC) Received: from localhost ([127.0.0.1]:40425 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gyv0L-0007AP-Gr for patchwork-qemu-devel@patchwork.kernel.org; Wed, 27 Feb 2019 03:55:53 -0500 Received: from eggs.gnu.org ([209.51.188.92]:36475) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gyuxe-0004uV-Gu for qemu-devel@nongnu.org; Wed, 27 Feb 2019 03:53:07 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gyuxX-0004de-DU for qemu-devel@nongnu.org; Wed, 27 Feb 2019 03:53:02 -0500 Received: from ozlabs.ru ([107.173.13.209]:35365) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gyuxL-0004RR-E9; Wed, 27 Feb 2019 03:52:49 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id 0C066AE807F0; Wed, 27 Feb 2019 03:52:02 -0500 (EST) From: Alexey Kardashevskiy To: qemu-devel@nongnu.org Date: Wed, 27 Feb 2019 19:51:47 +1100 Message-Id: <20190227085149.38596-5-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190227085149.38596-1-aik@ozlabs.ru> References: <20190227085149.38596-1-aik@ozlabs.ru> X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x X-Received-From: 107.173.13.209 Subject: [Qemu-devel] [PATCH qemu v3 4/6] spapr_iommu: Do not replay mappings from just created DMA window X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Jose Ricardo Ziviani , Alexey Kardashevskiy , Daniel Henrique Barboza , Alex Williamson , Sam Bobroff , Piotr Jaroszynski , qemu-ppc@nongnu.org, =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_Garcia?= , David Gibson Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" X-Virus-Scanned: ClamAV using ClamSMTP On sPAPR vfio_listener_region_add() is called in 2 situations: 1. a new listener is registered from vfio_connect_container(); 2. a new IOMMU Memory Region is added from rtas_ibm_create_pe_dma_window(). In both cases vfio_listener_region_add() calls memory_region_iommu_replay() to notify newly registered IOMMU notifiers about existing mappings which is totally desirable for case 1. However for case 2 it is nothing but noop as the window has just been created and has no valid mappings so replaying those does not do anything. It is barely noticeable with usual guests but if the window happens to be really big, such no-op replay might take minutes and trigger RCU stall warnings in the guest. For example, a upcoming GPU RAM memory region mapped at 64TiB (right after SPAPR_PCI_LIMIT) causes a 64bit DMA window to be at least 128TiB which is (128<<40)/0x10000=2.147.483.648 TCEs to replay. This mitigates the problem by adding an "skipping_replay" flag to sPAPRTCETable and defining sPAPR own IOMMU MR replay() hook which does exactly the same thing as the generic one except it returns early if @skipping_replay==true. When "ibm,create-pe-dma-window" is complete, the guest will map only required regions of the huge DMA window. Signed-off-by: Alexey Kardashevskiy --- include/hw/ppc/spapr.h | 1 + hw/ppc/spapr_iommu.c | 31 +++++++++++++++++++++++++++++++ hw/ppc/spapr_rtas_ddw.c | 7 +++++++ 3 files changed, 39 insertions(+) diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h index 86b0488..358bb38 100644 --- a/include/hw/ppc/spapr.h +++ b/include/hw/ppc/spapr.h @@ -727,6 +727,7 @@ struct sPAPRTCETable { uint64_t *mig_table; bool bypass; bool need_vfio; + bool skipping_replay; int fd; MemoryRegion root; IOMMUMemoryRegion iommu; diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c index 37e98f9..8f23179 100644 --- a/hw/ppc/spapr_iommu.c +++ b/hw/ppc/spapr_iommu.c @@ -141,6 +141,36 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(IOMMUMemoryRegion *iommu, return ret; } +static void spapr_tce_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n) +{ + MemoryRegion *mr = MEMORY_REGION(iommu_mr); + IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr); + hwaddr addr, granularity; + IOMMUTLBEntry iotlb; + sPAPRTCETable *tcet = container_of(iommu_mr, sPAPRTCETable, iommu); + + if (tcet->skipping_replay) { + return; + } + + granularity = memory_region_iommu_get_min_page_size(iommu_mr); + + for (addr = 0; addr < memory_region_size(mr); addr += granularity) { + iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx); + if (iotlb.perm != IOMMU_NONE) { + n->notify(n, &iotlb); + } + + /* + * if (2^64 - MR size) < granularity, it's possible to get an + * infinite loop here. This should catch such a wraparound. + */ + if ((addr + granularity) < addr) { + break; + } + } +} + static int spapr_tce_table_pre_save(void *opaque) { sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque); @@ -659,6 +689,7 @@ static void spapr_iommu_memory_region_class_init(ObjectClass *klass, void *data) IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_CLASS(klass); imrc->translate = spapr_tce_translate_iommu; + imrc->replay = spapr_tce_replay; imrc->get_min_page_size = spapr_tce_get_min_page_size; imrc->notify_flag_changed = spapr_tce_notify_flag_changed; imrc->get_attr = spapr_tce_get_attr; diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c index cb8a410..9cc020d 100644 --- a/hw/ppc/spapr_rtas_ddw.c +++ b/hw/ppc/spapr_rtas_ddw.c @@ -171,8 +171,15 @@ static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu, } win_addr = (windows == 0) ? sphb->dma_win_addr : sphb->dma64_win_addr; + /* + * We have just created a window, we know for the fact that it is empty, + * use a hack to avoid iterating over the table as it is quite possible + * to have billions of TCEs, all empty. + */ + tcet->skipping_replay = true; spapr_tce_table_enable(tcet, page_shift, win_addr, 1ULL << (window_shift - page_shift)); + tcet->skipping_replay = false; if (!tcet->nb_table) { goto hw_error_exit; } From patchwork Wed Feb 27 08:51:48 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 10831419 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 44B5F1390 for ; Wed, 27 Feb 2019 08:55:35 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 309AB2AF92 for ; Wed, 27 Feb 2019 08:55:35 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 247902BB07; Wed, 27 Feb 2019 08:55:35 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI autolearn=ham version=3.3.1 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id CB1872AF92 for ; Wed, 27 Feb 2019 08:55:34 +0000 (UTC) Received: from localhost ([127.0.0.1]:40423 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gyv02-0006v8-4i for patchwork-qemu-devel@patchwork.kernel.org; Wed, 27 Feb 2019 03:55:34 -0500 Received: from eggs.gnu.org ([209.51.188.92]:36384) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gyuwj-000482-TO for qemu-devel@nongnu.org; Wed, 27 Feb 2019 03:52:10 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gyuwj-0004FE-4l for qemu-devel@nongnu.org; Wed, 27 Feb 2019 03:52:09 -0500 Received: from ozlabs.ru ([107.173.13.209]:35318) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gyuwi-0004F5-Uw; Wed, 27 Feb 2019 03:52:09 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id 0B02CAE800EC; Wed, 27 Feb 2019 03:52:05 -0500 (EST) From: Alexey Kardashevskiy To: qemu-devel@nongnu.org Date: Wed, 27 Feb 2019 19:51:48 +1100 Message-Id: <20190227085149.38596-6-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190227085149.38596-1-aik@ozlabs.ru> References: <20190227085149.38596-1-aik@ozlabs.ru> X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x X-Received-From: 107.173.13.209 Subject: [Qemu-devel] [PATCH qemu v3 5/6] vfio: Make vfio_get_region_info_cap public X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Jose Ricardo Ziviani , Alexey Kardashevskiy , Daniel Henrique Barboza , Alex Williamson , Sam Bobroff , Piotr Jaroszynski , qemu-ppc@nongnu.org, =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_Garcia?= , David Gibson Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" X-Virus-Scanned: ClamAV using ClamSMTP This makes vfio_get_region_info_cap() to be used in quirks. Signed-off-by: Alexey Kardashevskiy --- include/hw/vfio/vfio-common.h | 2 ++ hw/vfio/common.c | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 7624c9f..fbf0966 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -189,6 +189,8 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index, int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, uint32_t subtype, struct vfio_region_info **info); bool vfio_has_region_cap(VFIODevice *vbasedev, int region, uint16_t cap_type); +struct vfio_info_cap_header * +vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id); #endif extern const MemoryListener vfio_prereg_listener; diff --git a/hw/vfio/common.c b/hw/vfio/common.c index df2b472..4374cc6 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -729,7 +729,7 @@ static void vfio_listener_release(VFIOContainer *container) } } -static struct vfio_info_cap_header * +struct vfio_info_cap_header * vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id) { struct vfio_info_cap_header *hdr; From patchwork Wed Feb 27 08:51:49 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 10831455 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 15F9813B5 for ; Wed, 27 Feb 2019 08:58:09 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id F00862B002 for ; Wed, 27 Feb 2019 08:58:08 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id E09182C162; Wed, 27 Feb 2019 08:58:08 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI autolearn=ham version=3.3.1 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id EF9EC2B002 for ; Wed, 27 Feb 2019 08:58:06 +0000 (UTC) Received: from localhost ([127.0.0.1]:40446 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gyv2U-0000aR-9S for patchwork-qemu-devel@patchwork.kernel.org; Wed, 27 Feb 2019 03:58:06 -0500 Received: from eggs.gnu.org ([209.51.188.92]:36478) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gyuxe-0004uX-Hq for qemu-devel@nongnu.org; Wed, 27 Feb 2019 03:53:09 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gyuxX-0004dd-DV for qemu-devel@nongnu.org; Wed, 27 Feb 2019 03:53:05 -0500 Received: from ozlabs.ru ([107.173.13.209]:35375) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gyuxL-0004Tp-JT; Wed, 27 Feb 2019 03:52:49 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id C7142AE807F3; Wed, 27 Feb 2019 03:52:08 -0500 (EST) From: Alexey Kardashevskiy To: qemu-devel@nongnu.org Date: Wed, 27 Feb 2019 19:51:49 +1100 Message-Id: <20190227085149.38596-7-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190227085149.38596-1-aik@ozlabs.ru> References: <20190227085149.38596-1-aik@ozlabs.ru> X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x X-Received-From: 107.173.13.209 Subject: [Qemu-devel] [PATCH qemu v3 6/6] spapr: Support NVIDIA V100 GPU with NVLink2 X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Jose Ricardo Ziviani , Alexey Kardashevskiy , Daniel Henrique Barboza , Alex Williamson , Sam Bobroff , Piotr Jaroszynski , qemu-ppc@nongnu.org, =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_Garcia?= , David Gibson Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" X-Virus-Scanned: ClamAV using ClamSMTP NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver implements special regions for such GPUs and emulates an NVLink bridge. NVLink2-enabled POWER9 CPUs also provide address translation services which includes an ATS shootdown (ATSD) register exported via the NVLink bridge device. This adds a quirk to VFIO to map the GPU memory and create an MR; the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses this to get the MR and map it to the system address space. Another quirk does the same for ATSD. This adds additional steps to sPAPR PHB setup: 1. Search for specific GPUs and NPUs, collect findings in sPAPRPHBState::nvgpus, manage system address space mappings; 2. Add device-specific properties such as "ibm,npu", "ibm,gpu", "memory-block", "link-speed" to advertise the NVLink2 function to the guest; 3. Add "mmio-atsd" to vPHB to advertise the ATSD capability; 4. Add new memory blocks (with extra "linux,memory-usable" to prevent the guest OS from accessing the new memory until it is onlined) and npuphb# nodes representing an NPU unit for every vPHB as the GPU driver uses it for link discovery. This allocates space for GPU RAM and ATSD like we do for MMIOs by adding 2 new parameters to the phb_placement() hook. Older machine types set these to zero. This puts new memory nodes in a separate NUMA node to replicate the host system setup as the GPU driver relies on this. This adds requirement similar to EEH - one IOMMU group per vPHB. The reason for this is that ATSD registers belong to a physical NPU so they cannot invalidate translations on GPUs attached to another NPU. It is guaranteed by the host platform as it does not mix NVLink bridges or GPUs from different NPU in the same IOMMU group. If more than one IOMMU group is detected on a vPHB, this disables ATSD support for that vPHB and prints a warning. Signed-off-by: Alexey Kardashevskiy --- Changes: v3: * moved GPU RAM above PCI MMIO limit * renamed QOM property to nvlink2-tgt * moved nvlink2 code to its own file --- The example command line for redbud system: pbuild/qemu-aiku1804le-ppc64/ppc64-softmmu/qemu-system-ppc64 \ -nodefaults \ -chardev stdio,id=STDIO0,signal=off,mux=on \ -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \ -mon id=MON0,chardev=STDIO0,mode=readline -nographic -vga none \ -enable-kvm -m 384G \ -chardev socket,id=SOCKET0,server,nowait,host=localhost,port=40000 \ -mon chardev=SOCKET0,mode=control \ -smp 80,sockets=1,threads=4 \ -netdev "tap,id=TAP0,helper=/home/aik/qemu-bridge-helper --br=br0" \ -device "virtio-net-pci,id=vnet0,mac=52:54:00:12:34:56,netdev=TAP0" \ img/vdisk0.img \ -device "vfio-pci,id=vfio0004_04_00_0,host=0004:04:00.0" \ -device "vfio-pci,id=vfio0006_00_00_0,host=0006:00:00.0" \ -device "vfio-pci,id=vfio0006_00_00_1,host=0006:00:00.1" \ -device "vfio-pci,id=vfio0006_00_00_2,host=0006:00:00.2" \ -device "vfio-pci,id=vfio0004_05_00_0,host=0004:05:00.0" \ -device "vfio-pci,id=vfio0006_00_01_0,host=0006:00:01.0" \ -device "vfio-pci,id=vfio0006_00_01_1,host=0006:00:01.1" \ -device "vfio-pci,id=vfio0006_00_01_2,host=0006:00:01.2" \ -device spapr-pci-host-bridge,id=phb1,index=1 \ -device "vfio-pci,id=vfio0035_03_00_0,host=0035:03:00.0" \ -device "vfio-pci,id=vfio0007_00_00_0,host=0007:00:00.0" \ -device "vfio-pci,id=vfio0007_00_00_1,host=0007:00:00.1" \ -device "vfio-pci,id=vfio0007_00_00_2,host=0007:00:00.2" \ -device "vfio-pci,id=vfio0035_04_00_0,host=0035:04:00.0" \ -device "vfio-pci,id=vfio0007_00_01_0,host=0007:00:01.0" \ -device "vfio-pci,id=vfio0007_00_01_1,host=0007:00:01.1" \ -device "vfio-pci,id=vfio0007_00_01_2,host=0007:00:01.2" -snapshot \ -machine pseries \ -L /home/aik/t/qemu-ppc64-bios/ -d guest_errors Note that QEMU attaches PCI devices to the last added vPHB so first 8 devices - 4:04:00.0 till 6:00:01.2 - go to the default vPHB, and 35:03:00.0..7:00:01.2 to the vPHB with id=phb1. --- hw/ppc/Makefile.objs | 2 +- hw/vfio/pci.h | 2 + include/hw/pci-host/spapr.h | 41 ++++ include/hw/ppc/spapr.h | 3 +- hw/ppc/spapr.c | 29 ++- hw/ppc/spapr_pci.c | 8 + hw/ppc/spapr_pci_nvlink2.c | 419 ++++++++++++++++++++++++++++++++++++ hw/vfio/pci-quirks.c | 120 +++++++++++ hw/vfio/pci.c | 14 ++ hw/vfio/trace-events | 4 + 10 files changed, 637 insertions(+), 5 deletions(-) create mode 100644 hw/ppc/spapr_pci_nvlink2.c diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs index 1111b21..636e717 100644 --- a/hw/ppc/Makefile.objs +++ b/hw/ppc/Makefile.objs @@ -9,7 +9,7 @@ obj-$(CONFIG_SPAPR_RNG) += spapr_rng.o # IBM PowerNV obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy) -obj-y += spapr_pci_vfio.o +obj-y += spapr_pci_vfio.o spapr_pci_nvlink2.o endif obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o # PowerPC 4xx boards diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h index b1ae4c0..706c304 100644 --- a/hw/vfio/pci.h +++ b/hw/vfio/pci.h @@ -194,6 +194,8 @@ int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp); int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev, struct vfio_region_info *info, Error **errp); +int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp); +int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp); void vfio_display_reset(VFIOPCIDevice *vdev); int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp); diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h index ab0e3a0..e791dd4 100644 --- a/include/hw/pci-host/spapr.h +++ b/include/hw/pci-host/spapr.h @@ -87,6 +87,9 @@ struct sPAPRPHBState { uint32_t mig_liobn; hwaddr mig_mem_win_addr, mig_mem_win_size; hwaddr mig_io_win_addr, mig_io_win_size; + hwaddr nv2_gpa_win_addr; + hwaddr nv2_atsd_win_addr; + struct spapr_phb_pci_nvgpu_config *nvgpus; }; #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL @@ -105,6 +108,23 @@ struct sPAPRPHBState { #define SPAPR_PCI_MSI_WINDOW 0x40000000000ULL +#define SPAPR_PCI_NV2RAM64_WIN_BASE SPAPR_PCI_LIMIT +#define SPAPR_PCI_NV2RAM64_WIN_SIZE 0x10000000000ULL /* 1 TiB for all 6xGPUs */ + +/* Max number of these GPUs per a physical box */ +#define NVGPU_MAX_NUM 6 +/* + * One NVLink bridge provides one ATSD register so it should be 18. + * In practice though since we allow only one group per vPHB which equals + * to an NPU2 which has maximum 6 NVLink bridges. + */ +#define NVGPU_MAX_ATSD 6 + +#define SPAPR_PCI_NV2ATSD_WIN_BASE (SPAPR_PCI_NV2RAM64_WIN_BASE + \ + SPAPR_PCI_NV2RAM64_WIN_SIZE * \ + NVGPU_MAX_NUM) +#define SPAPR_PCI_NV2ATSD_WIN_SIZE (NVGPU_MAX_ATSD * 0x10000) + static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin) { sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine()); @@ -135,6 +155,11 @@ int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state); int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option); int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb); void spapr_phb_vfio_reset(DeviceState *qdev); +void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb); +void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off); +void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt); +void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset, + sPAPRPHBState *sphb); #else static inline bool spapr_phb_eeh_available(sPAPRPHBState *sphb) { @@ -161,6 +186,22 @@ static inline int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb) static inline void spapr_phb_vfio_reset(DeviceState *qdev) { } +static inline void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb) +{ +} +static inline void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, + int bus_off) +{ +} +static inline void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, + void *fdt) +{ +} +static inline void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, + int offset, + sPAPRPHBState *sphb) +{ +} #endif void spapr_phb_dma_reset(sPAPRPHBState *sphb); diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h index 358bb38..9acf867 100644 --- a/include/hw/ppc/spapr.h +++ b/include/hw/ppc/spapr.h @@ -113,7 +113,8 @@ struct sPAPRMachineClass { void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index, uint64_t *buid, hwaddr *pio, hwaddr *mmio32, hwaddr *mmio64, - unsigned n_dma, uint32_t *liobns, Error **errp); + unsigned n_dma, uint32_t *liobns, hwaddr *nv2gpa, + hwaddr *nv2atsd, Error **errp); sPAPRResizeHPT resize_hpt_default; sPAPRCapabilities default_caps; sPAPRIrq *irq; diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c index 74c9b07..fda6e7e 100644 --- a/hw/ppc/spapr.c +++ b/hw/ppc/spapr.c @@ -3929,7 +3929,9 @@ static void spapr_phb_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev, smc->phb_placement(spapr, sphb->index, &sphb->buid, &sphb->io_win_addr, &sphb->mem_win_addr, &sphb->mem64_win_addr, - windows_supported, sphb->dma_liobn, errp); + windows_supported, sphb->dma_liobn, + &sphb->nv2_gpa_win_addr, &sphb->nv2_atsd_win_addr, + errp); } static void spapr_phb_plug(HotplugHandler *hotplug_dev, DeviceState *dev, @@ -4129,7 +4131,8 @@ static const CPUArchIdList *spapr_possible_cpu_arch_ids(MachineState *machine) static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index, uint64_t *buid, hwaddr *pio, hwaddr *mmio32, hwaddr *mmio64, - unsigned n_dma, uint32_t *liobns, Error **errp) + unsigned n_dma, uint32_t *liobns, + hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp) { /* * New-style PHB window placement. @@ -4174,6 +4177,9 @@ static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index, *pio = SPAPR_PCI_BASE + index * SPAPR_PCI_IO_WIN_SIZE; *mmio32 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM32_WIN_SIZE; *mmio64 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM64_WIN_SIZE; + + *nv2gpa = SPAPR_PCI_NV2RAM64_WIN_BASE + index * SPAPR_PCI_NV2RAM64_WIN_SIZE; + *nv2atsd = SPAPR_PCI_NV2ATSD_WIN_BASE + index * SPAPR_PCI_NV2ATSD_WIN_SIZE; } static ICSState *spapr_ics_get(XICSFabric *dev, int irq) @@ -4376,6 +4382,18 @@ DEFINE_SPAPR_MACHINE(4_0, "4.0", true); /* * pseries-3.1 */ +static void phb_placement_3_1(sPAPRMachineState *spapr, uint32_t index, + uint64_t *buid, hwaddr *pio, + hwaddr *mmio32, hwaddr *mmio64, + unsigned n_dma, uint32_t *liobns, + hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp) +{ + spapr_phb_placement(spapr, index, buid, pio, mmio32, mmio64, n_dma, liobns, + nv2gpa, nv2atsd, errp); + *nv2gpa = 0; + *nv2atsd = 0; +} + static void spapr_machine_3_1_class_options(MachineClass *mc) { sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc); @@ -4391,6 +4409,7 @@ static void spapr_machine_3_1_class_options(MachineClass *mc) mc->default_cpu_type = POWERPC_CPU_TYPE_NAME("power8_v2.0"); smc->update_dt_enabled = false; smc->dr_phb_enabled = false; + smc->phb_placement = phb_placement_3_1; } DEFINE_SPAPR_MACHINE(3_1, "3.1", false); @@ -4522,7 +4541,8 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", false); static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index, uint64_t *buid, hwaddr *pio, hwaddr *mmio32, hwaddr *mmio64, - unsigned n_dma, uint32_t *liobns, Error **errp) + unsigned n_dma, uint32_t *liobns, + hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp) { /* Legacy PHB placement for pseries-2.7 and earlier machine types */ const uint64_t base_buid = 0x800000020000000ULL; @@ -4566,6 +4586,9 @@ static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index, * fallback behaviour of automatically splitting a large "32-bit" * window into contiguous 32-bit and 64-bit windows */ + + *nv2gpa = 0; + *nv2atsd = 0; } static void spapr_machine_2_7_class_options(MachineClass *mc) diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c index 06a5ffd..f076462 100644 --- a/hw/ppc/spapr_pci.c +++ b/hw/ppc/spapr_pci.c @@ -1355,6 +1355,8 @@ static void spapr_populate_pci_child_dt(PCIDevice *dev, void *fdt, int offset, if (sphb->pcie_ecs && pci_is_express(dev)) { _FDT(fdt_setprop_cell(fdt, offset, "ibm,pci-config-space-type", 0x1)); } + + spapr_phb_nvgpu_populate_pcidev_dt(dev, fdt, offset, sphb); } /* create OF node for pci device and required OF DT properties */ @@ -1878,6 +1880,7 @@ static void spapr_phb_reset(DeviceState *qdev) sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev); spapr_phb_dma_reset(sphb); + spapr_phb_nvgpu_setup(sphb); /* Reset the IOMMU state */ object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL); @@ -1910,6 +1913,8 @@ static Property spapr_phb_properties[] = { pre_2_8_migration, false), DEFINE_PROP_BOOL("pcie-extended-configuration-space", sPAPRPHBState, pcie_ecs, true), + DEFINE_PROP_UINT64("gpa", sPAPRPHBState, nv2_gpa_win_addr, 0), + DEFINE_PROP_UINT64("atsd", sPAPRPHBState, nv2_atsd_win_addr, 0), DEFINE_PROP_END_OF_LIST(), }; @@ -2282,6 +2287,9 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb, uint32_t intc_phandle, void *fdt, return ret; } + spapr_phb_nvgpu_populate_dt(phb, fdt, bus_off); + spapr_phb_nvgpu_ram_populate_dt(phb, fdt); + return 0; } diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c new file mode 100644 index 0000000..965a6be --- /dev/null +++ b/hw/ppc/spapr_pci_nvlink2.c @@ -0,0 +1,419 @@ +/* + * QEMU sPAPR PCI for NVLink2 pass through + * + * Copyright (c) 2019 Alexey Kardashevskiy, IBM Corporation. + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the "Software"), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ +#include "qemu/osdep.h" +#include "qapi/error.h" +#include "qemu-common.h" +#include "hw/pci/pci.h" +#include "hw/pci-host/spapr.h" +#include "qemu/error-report.h" +#include "hw/ppc/fdt.h" +#include "hw/pci/pci_bridge.h" + +#define PHANDLE_PCIDEV(phb, pdev) (0x12000000 | \ + (((phb)->index) << 16) | ((pdev)->devfn)) +#define PHANDLE_GPURAM(phb, n) (0x110000FF | ((n) << 8) | \ + (((phb)->index) << 16)) +/* NVLink2 wants a separate NUMA node for its RAM */ +#define GPURAM_ASSOCIATIVITY(phb, n) (255 - ((phb)->index * 3 + (n))) +#define PHANDLE_NVLINK(phb, gn, nn) (0x00130000 | (((phb)->index) << 8) | \ + ((gn) << 4) | (nn)) + +/* Max number of NVLinks per GPU in any physical box */ +#define NVGPU_MAX_LINKS 3 + +struct spapr_phb_pci_nvgpu_config { + uint64_t nv2_ram_current; + uint64_t nv2_atsd_current; + int num; /* number of non empty (i.e. tgt!=0) entries in slots[] */ + struct spapr_phb_pci_nvgpu_slot { + uint64_t tgt; + uint64_t gpa; + PCIDevice *gpdev; + int linknum; + struct { + uint64_t atsd_gpa; + PCIDevice *npdev; + uint32_t link_speed; + } links[NVGPU_MAX_LINKS]; + } slots[NVGPU_MAX_NUM]; +}; + +static int spapr_pci_nvgpu_get_slot(struct spapr_phb_pci_nvgpu_config *nvgpus, + uint64_t tgt) +{ + int i; + + /* Search for partially collected "slot" */ + for (i = 0; i < nvgpus->num; ++i) { + if (nvgpus->slots[i].tgt == tgt) { + return i; + } + } + + if (nvgpus->num == ARRAY_SIZE(nvgpus->slots)) { + warn_report("Found too many NVLink bridges per GPU"); + return -1; + } + + i = nvgpus->num; + nvgpus->slots[i].tgt = tgt; + ++nvgpus->num; + + return i; +} + +static void spapr_pci_collect_nvgpu(struct spapr_phb_pci_nvgpu_config *nvgpus, + PCIDevice *pdev, uint64_t tgt, + MemoryRegion *mr) +{ + int i = spapr_pci_nvgpu_get_slot(nvgpus, tgt); + + if (i < 0) { + return; + } + g_assert(!nvgpus->slots[i].gpdev); + nvgpus->slots[i].gpdev = pdev; + + nvgpus->slots[i].gpa = nvgpus->nv2_ram_current; + nvgpus->nv2_ram_current += memory_region_size(mr); +} + +static void spapr_pci_collect_nvnpu(struct spapr_phb_pci_nvgpu_config *nvgpus, + PCIDevice *pdev, uint64_t tgt, + MemoryRegion *mr) +{ + int i = spapr_pci_nvgpu_get_slot(nvgpus, tgt), j; + struct spapr_phb_pci_nvgpu_slot *nvslot; + + if (i < 0) { + return; + } + + nvslot = &nvgpus->slots[i]; + j = nvslot->linknum; + if (j == ARRAY_SIZE(nvslot->links)) { + warn_report("Found too many NVLink2 bridges"); + return; + } + ++nvslot->linknum; + + g_assert(!nvslot->links[j].npdev); + nvslot->links[j].npdev = pdev; + nvslot->links[j].atsd_gpa = nvgpus->nv2_atsd_current; + nvgpus->nv2_atsd_current += memory_region_size(mr); + nvslot->links[j].link_speed = + object_property_get_uint(OBJECT(pdev), "nvlink2-link-speed", NULL); +} + +static void spapr_phb_pci_collect_nvgpu(PCIBus *bus, PCIDevice *pdev, + void *opaque) +{ + PCIBus *sec_bus; + Object *po = OBJECT(pdev); + uint64_t tgt = object_property_get_uint(po, "nvlink2-tgt", NULL); + + if (tgt) { + Object *mr_gpu = object_property_get_link(po, "nvlink2-mr[0]", NULL); + Object *mr_npu = object_property_get_link(po, "nvlink2-atsd-mr[0]", + NULL); + + if (mr_gpu) { + spapr_pci_collect_nvgpu(opaque, pdev, tgt, MEMORY_REGION(mr_gpu)); + } else if (mr_npu) { + spapr_pci_collect_nvnpu(opaque, pdev, tgt, MEMORY_REGION(mr_npu)); + } else { + warn_report("Unexpected device with \"nvlink2-tgt\""); + } + } + if ((pci_default_read_config(pdev, PCI_HEADER_TYPE, 1) != + PCI_HEADER_TYPE_BRIDGE)) { + return; + } + + sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(pdev)); + if (!sec_bus) { + return; + } + + pci_for_each_device(sec_bus, pci_bus_num(sec_bus), + spapr_phb_pci_collect_nvgpu, opaque); +} + +void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb) +{ + int i, j, valid_gpu_num; + + /* If there are existing NVLink2 MRs, unmap those before recreating */ + if (sphb->nvgpus) { + for (i = 0; i < sphb->nvgpus->num; ++i) { + struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i]; + Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev), + "nvlink2-mr[0]", NULL); + + if (nv_mrobj) { + memory_region_del_subregion(get_system_memory(), + MEMORY_REGION(nv_mrobj)); + } + for (j = 0; j < nvslot->linknum; ++j) { + PCIDevice *npdev = nvslot->links[j].npdev; + Object *atsd_mrobj; + atsd_mrobj = object_property_get_link(OBJECT(npdev), + "nvlink2-atsd-mr[0]", + NULL); + if (atsd_mrobj) { + memory_region_del_subregion(get_system_memory(), + MEMORY_REGION(atsd_mrobj)); + } + } + } + g_free(sphb->nvgpus); + sphb->nvgpus = NULL; + } + + /* Search for GPUs and NPUs */ + if (sphb->nv2_gpa_win_addr && sphb->nv2_atsd_win_addr) { + PCIBus *bus = PCI_HOST_BRIDGE(sphb)->bus; + + sphb->nvgpus = g_new0(struct spapr_phb_pci_nvgpu_config, 1); + sphb->nvgpus->nv2_ram_current = sphb->nv2_gpa_win_addr; + sphb->nvgpus->nv2_atsd_current = sphb->nv2_atsd_win_addr; + + pci_for_each_device(bus, pci_bus_num(bus), + spapr_phb_pci_collect_nvgpu, sphb->nvgpus); + } + + /* Add found GPU RAM and ATSD MRs if found */ + for (i = 0, valid_gpu_num = 0; i < sphb->nvgpus->num; ++i) { + Object *nvmrobj; + struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i]; + + if (!nvslot->gpdev) { + continue; + } + nvmrobj = object_property_get_link(OBJECT(nvslot->gpdev), + "nvlink2-mr[0]", NULL); + /* ATSD is pointless without GPU RAM MR so skip those */ + if (!nvmrobj) { + continue; + } + + ++valid_gpu_num; + memory_region_add_subregion(get_system_memory(), nvslot->gpa, + MEMORY_REGION(nvmrobj)); + + for (j = 0; j < nvslot->linknum; ++j) { + Object *atsdmrobj; + + atsdmrobj = object_property_get_link(OBJECT(nvslot->links[j].npdev), + "nvlink2-atsd-mr[0]", + NULL); + if (!atsdmrobj) { + continue; + } + memory_region_add_subregion(get_system_memory(), + nvslot->links[j].atsd_gpa, + MEMORY_REGION(atsdmrobj)); + } + } + + if (!valid_gpu_num) { + /* We did not find any interesting GPU */ + g_free(sphb->nvgpus); + sphb->nvgpus = NULL; + } +} + +void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off) +{ + int i, j, atsdnum = 0; + uint64_t atsd[8]; /* The existing limitation of known guests */ + + if (!sphb->nvgpus) { + return; + } + + for (i = 0; (i < sphb->nvgpus->num) && (atsdnum < ARRAY_SIZE(atsd)); ++i) { + struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i]; + + if (!nvslot->gpdev) { + continue; + } + for (j = 0; j < nvslot->linknum; ++j) { + if (!nvslot->links[j].atsd_gpa) { + continue; + } + + if (atsdnum == ARRAY_SIZE(atsd)) { + warn_report("Only %ld ATSD registers allowed", + ARRAY_SIZE(atsd)); + break; + } + atsd[atsdnum] = cpu_to_be64(nvslot->links[j].atsd_gpa); + ++atsdnum; + } + } + + if (!atsdnum) { + warn_report("No ATSD registers found"); + } else if (!spapr_phb_eeh_available(sphb)) { + /* + * ibm,mmio-atsd contains ATSD registers; these belong to an NPU PHB + * which we do not emulate as a separate device. Instead we put + * ibm,mmio-atsd to the vPHB with GPU and make sure that we do not + * put GPUs from different IOMMU groups to the same vPHB to ensure + * that the guest will use ATSDs from the corresponding NPU. + */ + warn_report("ATSD requires separate vPHB per GPU IOMMU group"); + } else { + _FDT((fdt_setprop(fdt, bus_off, "ibm,mmio-atsd", + atsd, atsdnum * sizeof(atsd[0])))); + } +} + +void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt) +{ + int i, j, linkidx, npuoff; + char *npuname; + + if (!sphb->nvgpus) { + return; + } + + npuname = g_strdup_printf("npuphb%d", sphb->index); + npuoff = fdt_add_subnode(fdt, 0, npuname); + _FDT(npuoff); + _FDT(fdt_setprop_cell(fdt, npuoff, "#address-cells", 1)); + _FDT(fdt_setprop_cell(fdt, npuoff, "#size-cells", 0)); + /* Advertise NPU as POWER9 so the guest can enable NPU2 contexts */ + _FDT((fdt_setprop_string(fdt, npuoff, "compatible", "ibm,power9-npu"))); + g_free(npuname); + + for (i = 0, linkidx = 0; i < sphb->nvgpus->num; ++i) { + for (j = 0; j < sphb->nvgpus->slots[i].linknum; ++j) { + char *linkname = g_strdup_printf("link@%d", linkidx); + int off = fdt_add_subnode(fdt, npuoff, linkname); + + _FDT(off); + /* _FDT((fdt_setprop_cell(fdt, off, "reg", linkidx))); */ + _FDT((fdt_setprop_string(fdt, off, "compatible", + "ibm,npu-link"))); + _FDT((fdt_setprop_cell(fdt, off, "phandle", + PHANDLE_NVLINK(sphb, i, j)))); + _FDT((fdt_setprop_cell(fdt, off, "ibm,npu-link-index", linkidx))); + g_free(linkname); + ++linkidx; + } + } + + /* Add memory nodes for GPU RAM and mark them unusable */ + for (i = 0; i < sphb->nvgpus->num; ++i) { + struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i]; + Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev), + "nvlink2-mr[0]", NULL); + uint32_t at = cpu_to_be32(GPURAM_ASSOCIATIVITY(sphb, i)); + uint32_t associativity[] = { cpu_to_be32(0x4), at, at, at, at }; + uint64_t size = object_property_get_uint(nv_mrobj, "size", NULL); + uint64_t mem_reg[2] = { cpu_to_be64(nvslot->gpa), cpu_to_be64(size) }; + char *mem_name = g_strdup_printf("memory@%lx", nvslot->gpa); + int off = fdt_add_subnode(fdt, 0, mem_name); + + _FDT(off); + _FDT((fdt_setprop_string(fdt, off, "device_type", "memory"))); + _FDT((fdt_setprop(fdt, off, "reg", mem_reg, sizeof(mem_reg)))); + _FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity, + sizeof(associativity)))); + + _FDT((fdt_setprop_string(fdt, off, "compatible", + "ibm,coherent-device-memory"))); + + mem_reg[1] = cpu_to_be64(0); + _FDT((fdt_setprop(fdt, off, "linux,usable-memory", mem_reg, + sizeof(mem_reg)))); + _FDT((fdt_setprop_cell(fdt, off, "phandle", + PHANDLE_GPURAM(sphb, i)))); + g_free(mem_name); + } + +} + +void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset, + sPAPRPHBState *sphb) +{ + int i, j; + + if (!sphb->nvgpus) { + return; + } + + for (i = 0; i < sphb->nvgpus->num; ++i) { + struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i]; + + /* Skip "slot" without attached GPU */ + if (!nvslot->gpdev) { + continue; + } + if (dev == nvslot->gpdev) { + uint32_t npus[nvslot->linknum]; + + for (j = 0; j < nvslot->linknum; ++j) { + PCIDevice *npdev = nvslot->links[j].npdev; + + npus[j] = cpu_to_be32(PHANDLE_PCIDEV(sphb, npdev)); + } + _FDT(fdt_setprop(fdt, offset, "ibm,npu", npus, + j * sizeof(npus[0]))); + _FDT((fdt_setprop_cell(fdt, offset, "phandle", + PHANDLE_PCIDEV(sphb, dev)))); + continue; + } + + for (j = 0; j < nvslot->linknum; ++j) { + if (dev != nvslot->links[j].npdev) { + continue; + } + + _FDT((fdt_setprop_cell(fdt, offset, "phandle", + PHANDLE_PCIDEV(sphb, dev)))); + _FDT(fdt_setprop_cell(fdt, offset, "ibm,gpu", + PHANDLE_PCIDEV(sphb, nvslot->gpdev))); + _FDT((fdt_setprop_cell(fdt, offset, "ibm,nvlink", + PHANDLE_NVLINK(sphb, i, j)))); + /* + * If we ever want to emulate GPU RAM at the same location as on + * the host - here is the encoding GPA->TGT: + * + * gta = ((sphb->nv2_gpa >> 42) & 0x1) << 42; + * gta |= ((sphb->nv2_gpa >> 45) & 0x3) << 43; + * gta |= ((sphb->nv2_gpa >> 49) & 0x3) << 45; + * gta |= sphb->nv2_gpa & ((1UL << 43) - 1); + */ + _FDT(fdt_setprop_cell(fdt, offset, "memory-region", + PHANDLE_GPURAM(sphb, i))); + _FDT(fdt_setprop_u64(fdt, offset, "ibm,device-tgt-addr", + nvslot->tgt)); + _FDT(fdt_setprop_cell(fdt, offset, "ibm,nvlink-speed", + nvslot->links[j].link_speed)); + } + } +} diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c index 40a1200..15ec0b4 100644 --- a/hw/vfio/pci-quirks.c +++ b/hw/vfio/pci-quirks.c @@ -2180,3 +2180,123 @@ int vfio_add_virt_caps(VFIOPCIDevice *vdev, Error **errp) return 0; } + +static void vfio_pci_nvlink2_get_tgt(Object *obj, Visitor *v, + const char *name, + void *opaque, Error **errp) +{ + uint64_t tgt = (uint64_t) opaque; + visit_type_uint64(v, name, &tgt, errp); +} + +static void vfio_pci_nvlink2_get_link_speed(Object *obj, Visitor *v, + const char *name, + void *opaque, Error **errp) +{ + uint32_t link_speed = (uint32_t)(uint64_t) opaque; + visit_type_uint32(v, name, &link_speed, errp); +} + +int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp) +{ + int ret; + void *p; + struct vfio_region_info *nv2region = NULL; + struct vfio_info_cap_header *hdr; + MemoryRegion *nv2mr = g_malloc0(sizeof(*nv2mr)); + + ret = vfio_get_dev_region_info(&vdev->vbasedev, + VFIO_REGION_TYPE_PCI_VENDOR_TYPE | + PCI_VENDOR_ID_NVIDIA, + VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM, + &nv2region); + if (ret) { + return ret; + } + + p = mmap(NULL, nv2region->size, PROT_READ | PROT_WRITE | PROT_EXEC, + MAP_SHARED, vdev->vbasedev.fd, nv2region->offset); + + if (!p) { + return -errno; + } + + memory_region_init_ram_ptr(nv2mr, OBJECT(vdev), "nvlink2-mr", + nv2region->size, p); + + hdr = vfio_get_region_info_cap(nv2region, + VFIO_REGION_INFO_CAP_NVLINK2_SSATGT); + if (hdr) { + struct vfio_region_info_cap_nvlink2_ssatgt *cap = (void *) hdr; + + object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64", + vfio_pci_nvlink2_get_tgt, NULL, NULL, + (void *) cap->tgt, NULL); + trace_vfio_pci_nvidia_gpu_setup_quirk(vdev->vbasedev.name, cap->tgt, + nv2region->size); + } + g_free(nv2region); + + return 0; +} + +int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp) +{ + int ret; + void *p; + struct vfio_region_info *atsd_region = NULL; + struct vfio_info_cap_header *hdr; + + ret = vfio_get_dev_region_info(&vdev->vbasedev, + VFIO_REGION_TYPE_PCI_VENDOR_TYPE | + PCI_VENDOR_ID_IBM, + VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD, + &atsd_region); + if (ret) { + return ret; + } + + /* Some NVLink bridges come without assigned ATSD, skip MR part */ + if (atsd_region->size) { + MemoryRegion *atsd_mr = g_malloc0(sizeof(*atsd_mr)); + + p = mmap(NULL, atsd_region->size, PROT_READ | PROT_WRITE | PROT_EXEC, + MAP_SHARED, vdev->vbasedev.fd, atsd_region->offset); + + if (!p) { + return -errno; + } + + memory_region_init_ram_device_ptr(atsd_mr, OBJECT(vdev), + "nvlink2-atsd-mr", + atsd_region->size, + p); + } + + hdr = vfio_get_region_info_cap(atsd_region, + VFIO_REGION_INFO_CAP_NVLINK2_SSATGT); + if (hdr) { + struct vfio_region_info_cap_nvlink2_ssatgt *cap = (void *) hdr; + + object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64", + vfio_pci_nvlink2_get_tgt, NULL, NULL, + (void *) cap->tgt, NULL); + trace_vfio_pci_nvlink2_setup_quirk_ssatgt(vdev->vbasedev.name, cap->tgt, + atsd_region->size); + } + + hdr = vfio_get_region_info_cap(atsd_region, + VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD); + if (hdr) { + struct vfio_region_info_cap_nvlink2_lnkspd *cap = (void *) hdr; + + object_property_add(OBJECT(vdev), "nvlink2-link-speed", "uint32", + vfio_pci_nvlink2_get_link_speed, NULL, NULL, + (void *) (uint64_t) cap->link_speed, NULL); + trace_vfio_pci_nvlink2_setup_quirk_lnkspd(vdev->vbasedev.name, + cap->link_speed); + } + g_free(atsd_region); + + return 0; +} diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index dd12f36..07aa141 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -3069,6 +3069,20 @@ static void vfio_realize(PCIDevice *pdev, Error **errp) goto out_teardown; } + if (vdev->vendor_id == PCI_VENDOR_ID_NVIDIA) { + ret = vfio_pci_nvidia_v100_ram_init(vdev, errp); + if (ret && ret != -ENODEV) { + error_report("Failed to setup NVIDIA V100 GPU RAM"); + } + } + + if (vdev->vendor_id == PCI_VENDOR_ID_IBM) { + ret = vfio_pci_nvlink2_init(vdev, errp); + if (ret && ret != -ENODEV) { + error_report("Failed to setup NVlink2 bridge"); + } + } + vfio_register_err_notifier(vdev); vfio_register_req_notifier(vdev); vfio_setup_resetfn_quirk(vdev); diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events index cf1e886..88841e9 100644 --- a/hw/vfio/trace-events +++ b/hw/vfio/trace-events @@ -87,6 +87,10 @@ vfio_pci_igd_opregion_enabled(const char *name) "%s" vfio_pci_igd_host_bridge_enabled(const char *name) "%s" vfio_pci_igd_lpc_bridge_enabled(const char *name) "%s" +vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64 +vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64 +vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x" + # hw/vfio/common.c vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)" vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64