[kernel,9/9] KVM: PPC: VFIO device: support SPAPR TCE

Message ID	1457322077-26640-10-git-send-email-aik@ozlabs.ru (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> Gateway: Authorized Use Only! Violators will be prosecuted for <kvm@vger.kernel.org> from <aik@ozlabs.ru>; Mon, 7 Mar 2016 13:42:45 +1000 Gateway: Authorized Use Only! Violators will be prosecuted; Mon, 7 Mar 2016 13:42:42 +1000 From: Alexey Kardashevskiy <aik@ozlabs.ru> To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy <aik@ozlabs.ru>, Paul Mackerras <paulus@samba.org>, Alex Williamson <alex.williamson@redhat.com>, David Gibson <david@gibson.dropbear.id.au>, kvm-ppc@vger.kernel.org, kvm@vger.kernel.org Subject: [PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE Date: Mon, 7 Mar 2016 14:41:17 +1100 Message-Id: <1457322077-26640-10-git-send-email-aik@ozlabs.ru> In-Reply-To: <1457322077-26640-1-git-send-email-aik@ozlabs.ru> References: <1457322077-26640-1-git-send-email-aik@ozlabs.ru> Sender: kvm-owner@vger.kernel.org Precedence: bulk

Alexey Kardashevskiy March 7, 2016, 3:41 a.m. UTC

sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
identifier. LIOBNs are made up, advertised to guest systems and
linked to IOMMU groups by the user space.
In order to enable acceleration for IOMMU operations in KVM, we need
to tell KVM the information about the LIOBN-to-group mapping.

For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
is added which accepts:
- a VFIO group fd and IO base address to find the actual hardware
TCE table;
- a LIOBN to assign to the found table.

Before notifying KVM about new link, this check the group for being
registered with KVM device in order to release them at unexpected KVM
finish.

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

While we are here, this also fixes VFIO KVM device compiling to let it
link to a KVM module.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
 arch/powerpc/kvm/Kconfig                   |   1 +
 arch/powerpc/kvm/Makefile                  |   5 +-
 arch/powerpc/kvm/powerpc.c                 |   1 +
 include/uapi/linux/kvm.h                   |   9 +++
 virt/kvm/vfio.c                            | 106 +++++++++++++++++++++++++++++
 6 files changed, 140 insertions(+), 3 deletions(-)

David Gibson March 9, 2016, 5:45 a.m. UTC | #1

On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
> identifier. LIOBNs are made up, advertised to guest systems and
> linked to IOMMU groups by the user space.
> In order to enable acceleration for IOMMU operations in KVM, we need
> to tell KVM the information about the LIOBN-to-group mapping.
> 
> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
> is added which accepts:
> - a VFIO group fd and IO base address to find the actual hardware
> TCE table;
> - a LIOBN to assign to the found table.
> 
> Before notifying KVM about new link, this check the group for being
> registered with KVM device in order to release them at unexpected KVM
> finish.
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> While we are here, this also fixes VFIO KVM device compiling to let it
> link to a KVM module.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
>  arch/powerpc/kvm/Kconfig                   |   1 +
>  arch/powerpc/kvm/Makefile                  |   5 +-
>  arch/powerpc/kvm/powerpc.c                 |   1 +
>  include/uapi/linux/kvm.h                   |   9 +++
>  virt/kvm/vfio.c                            | 106 +++++++++++++++++++++++++++++
>  6 files changed, 140 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740..c0d3eb7 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,24 @@ Groups:
>  
>  KVM_DEV_VFIO_GROUP attributes:
>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.

AFAICT these changes are accurate for VFIO as it is already, in which
case it might be clearer to put them in a separate patch.

>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
>  
> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> -for the VFIO group.
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
> +	kvm_device_attr.addr points to a struct:
> +		struct kvm_vfio_spapr_tce_liobn {
> +			__u32	argsz;
> +			__s32	fd;
> +			__u32	liobn;
> +			__u8	pad[4];
> +			__u64	start_addr;
> +		};
> +		where
> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
> +		@fd is a file descriptor for a VFIO group;
> +		@liobn is a logical bus id to be associated with the group;
> +		@start_addr is a DMA window offset on the IO (PCI) bus

For the cause of DDW and multiple windows, I'm assuming you can call
this multiple times with different LIOBNs and the same IOMMU group?

> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> index 1059846..dfa3488 100644
> --- a/arch/powerpc/kvm/Kconfig
> +++ b/arch/powerpc/kvm/Kconfig
> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
>  	select KVM
>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
> +	select KVM_VFIO if VFIO
>  	---help---
>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
>  	  in virtual machines on book3s_64 host processors.
> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> index 7f7b6d8..71f577c 100644
> --- a/arch/powerpc/kvm/Makefile
> +++ b/arch/powerpc/kvm/Makefile
> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
>  KVM := ../../../virt/kvm
>  
>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> -		$(KVM)/eventfd.o $(KVM)/vfio.o
> +		$(KVM)/eventfd.o

Please don't disable the VFIO device for the non-book3s case.  I added
it (even though it didn't do anything until now) so that libvirt
wouldn't choke when it finds it's not available.  Obviously the new
ioctl needs to be only for the right IOMMU setup, but the device
itself should be available always.

>  CFLAGS_e500_mmu.o := -I.
>  CFLAGS_e500_mmu_host.o := -I.
> @@ -87,6 +87,9 @@ endif
>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>  	book3s_xics.o
>  
> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
> +	$(KVM)/vfio.o \
> +
>  kvm-book3s_64-module-objs += \
>  	$(KVM)/kvm_main.o \
>  	$(KVM)/eventfd.o \
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 19aa59b..63f188d 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  #ifdef CONFIG_PPC_BOOK3S_64
>  	case KVM_CAP_SPAPR_TCE:
>  	case KVM_CAP_SPAPR_TCE_64:
> +	case KVM_CAP_SPAPR_TCE_VFIO:
>  	case KVM_CAP_PPC_ALLOC_HTAB:
>  	case KVM_CAP_PPC_RTAS:
>  	case KVM_CAP_PPC_FIXUP_HCALL:
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 080ffbf..f1abbea 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
>  #define  KVM_DEV_VFIO_GROUP			1
>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
>  
>  enum kvm_device_type {
>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> +struct kvm_vfio_spapr_tce_liobn {
> +	__u32	argsz;
> +	__s32	fd;
> +	__u32	liobn;
> +	__u8	pad[4];
> +	__u64	start_addr;
> +};
> +
>  /*
>   * ioctls for VM fds
>   */
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index 1dd087d..87c771e 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -20,6 +20,10 @@
>  #include <linux/vfio.h>
>  #include "vfio.h"
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +#include <asm/kvm_ppc.h>
> +#endif
> +
>  struct kvm_vfio_group {
>  	struct list_head node;
>  	struct vfio_group *vfio_group;
> @@ -60,6 +64,22 @@ static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>  	symbol_put(vfio_group_put_external_user);
>  }
>  
> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> +{
> +	int (*fn)(struct vfio_group *);
> +	int ret = -1;

Should this be -ESOMETHING?

> +	fn = symbol_get(vfio_external_user_iommu_id);
> +	if (!fn)
> +		return ret;
> +
> +	ret = fn(vfio_group);
> +
> +	symbol_put(vfio_external_user_iommu_id);
> +
> +	return ret;
> +}
> +
>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>  {
>  	long (*fn)(struct vfio_group *, unsigned long);
> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
>  	mutex_unlock(&kv->lock);
>  }
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> +		struct vfio_group *vfio_group)


Shouldn't this go in the same patch that introduced the attach
function?

> +{
> +	int group_id;
> +	struct iommu_group *grp;
> +
> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> +	grp = iommu_group_get_by_id(group_id);
> +	if (grp) {
> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> +		iommu_group_put(grp);
> +	}
> +}
> +#endif
> +
>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  {
>  	struct kvm_vfio *kv = dev->private;
> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  				continue;
>  
>  			list_del(&kvg->node);
> +#ifdef CONFIG_SPAPR_TCE_IOMMU

Better to make a no-op version of the call than have to #ifdef at the
callsite.

> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
> +					kvg->vfio_group);
> +#endif
>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
>  			kfree(kvg);
>  			ret = 0;
> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  		kvm_vfio_update_coherency(dev);
>  
>  		return ret;
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
> +		struct kvm_vfio_spapr_tce_liobn param;
> +		unsigned long minsz;
> +		struct kvm_vfio *kv = dev->private;
> +		struct vfio_group *vfio_group;
> +		struct kvm_vfio_group *kvg;
> +		struct fd f;
> +
> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
> +				start_addr);
> +
> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (param.argsz < minsz)
> +			return -EINVAL;
> +
> +		f = fdget(param.fd);
> +		if (!f.file)
> +			return -EBADF;
> +
> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> +		fdput(f);
> +
> +		if (IS_ERR(vfio_group))
> +			return PTR_ERR(vfio_group);
> +
> +		ret = -ENOENT;

Shouldn't there be some runtime test for the type of the IOMMU?  It's
possible a kernel could be built for a platform supporting multiple
IOMMU types.

> +		mutex_lock(&kv->lock);
> +
> +		list_for_each_entry(kvg, &kv->group_list, node) {
> +			int group_id;
> +			struct iommu_group *grp;
> +
> +			if (kvg->vfio_group != vfio_group)
> +				continue;
> +
> +			group_id = kvm_vfio_external_user_iommu_id(
> +					kvg->vfio_group);
> +			grp = iommu_group_get_by_id(group_id);
> +			if (!grp) {
> +				ret = -EFAULT;
> +				break;
> +			}
> +
> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> +					param.liobn, param.start_addr,
> +					grp);
> +			if (ret)
> +				iommu_group_put(grp);
> +			break;
> +		}
> +
> +		mutex_unlock(&kv->lock);
> +
> +		kvm_vfio_group_put_external_user(vfio_group);
> +
> +		return ret;
> +	}
> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>  	}
>  
>  	return -ENXIO;
> @@ -225,6 +328,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>  		switch (attr->attr) {
>  		case KVM_DEV_VFIO_GROUP_ADD:
>  		case KVM_DEV_VFIO_GROUP_DEL:
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN:
> +#endif
>  			return 0;
>  		}
>

Alexey Kardashevskiy March 9, 2016, 9:20 a.m. UTC | #2

On 03/09/2016 04:45 PM, David Gibson wrote:
> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
>> identifier. LIOBNs are made up, advertised to guest systems and
>> linked to IOMMU groups by the user space.
>> In order to enable acceleration for IOMMU operations in KVM, we need
>> to tell KVM the information about the LIOBN-to-group mapping.
>>
>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
>> is added which accepts:
>> - a VFIO group fd and IO base address to find the actual hardware
>> TCE table;
>> - a LIOBN to assign to the found table.
>>
>> Before notifying KVM about new link, this check the group for being
>> registered with KVM device in order to release them at unexpected KVM
>> finish.
>>
>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>> space.
>>
>> While we are here, this also fixes VFIO KVM device compiling to let it
>> link to a KVM module.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>   Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
>>   arch/powerpc/kvm/Kconfig                   |   1 +
>>   arch/powerpc/kvm/Makefile                  |   5 +-
>>   arch/powerpc/kvm/powerpc.c                 |   1 +
>>   include/uapi/linux/kvm.h                   |   9 +++
>>   virt/kvm/vfio.c                            | 106 +++++++++++++++++++++++++++++
>>   6 files changed, 140 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>> index ef51740..c0d3eb7 100644
>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>> @@ -16,7 +16,24 @@ Groups:
>>
>>   KVM_DEV_VFIO_GROUP attributes:
>>     KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>
> AFAICT these changes are accurate for VFIO as it is already, in which
> case it might be clearer to put them in a separate patch.
>
>>     KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>> +	kvm_device_attr.addr points to an int32_t file descriptor
>> +	for the VFIO group.
>>
>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>> -for the VFIO group.
>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
>> +	kvm_device_attr.addr points to a struct:
>> +		struct kvm_vfio_spapr_tce_liobn {
>> +			__u32	argsz;
>> +			__s32	fd;
>> +			__u32	liobn;
>> +			__u8	pad[4];
>> +			__u64	start_addr;
>> +		};
>> +		where
>> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
>> +		@fd is a file descriptor for a VFIO group;
>> +		@liobn is a logical bus id to be associated with the group;
>> +		@start_addr is a DMA window offset on the IO (PCI) bus
>
> For the cause of DDW and multiple windows, I'm assuming you can call
> this multiple times with different LIOBNs and the same IOMMU group?


Yes. It is called twice per each group (when DDW is activated) - for 32bit 
and 64bit windows, this is why @start_addr is there.


>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>> index 1059846..dfa3488 100644
>> --- a/arch/powerpc/kvm/Kconfig
>> +++ b/arch/powerpc/kvm/Kconfig
>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
>>   	select KVM
>>   	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
>>   	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
>> +	select KVM_VFIO if VFIO
>>   	---help---
>>   	  Support running unmodified book3s_64 and book3s_32 guest kernels
>>   	  in virtual machines on book3s_64 host processors.
>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
>> index 7f7b6d8..71f577c 100644
>> --- a/arch/powerpc/kvm/Makefile
>> +++ b/arch/powerpc/kvm/Makefile
>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
>>   KVM := ../../../virt/kvm
>>
>>   common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
>> +		$(KVM)/eventfd.o
>
> Please don't disable the VFIO device for the non-book3s case.  I added
> it (even though it didn't do anything until now) so that libvirt
> wouldn't choke when it finds it's not available.  Obviously the new
> ioctl needs to be only for the right IOMMU setup, but the device
> itself should be available always.

Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.


>>   CFLAGS_e500_mmu.o := -I.
>>   CFLAGS_e500_mmu_host.o := -I.
>> @@ -87,6 +87,9 @@ endif
>>   kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>>   	book3s_xics.o
>>
>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
>> +	$(KVM)/vfio.o \
>> +
>>   kvm-book3s_64-module-objs += \
>>   	$(KVM)/kvm_main.o \
>>   	$(KVM)/eventfd.o \
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index 19aa59b..63f188d 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>   #ifdef CONFIG_PPC_BOOK3S_64
>>   	case KVM_CAP_SPAPR_TCE:
>>   	case KVM_CAP_SPAPR_TCE_64:
>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>   	case KVM_CAP_PPC_ALLOC_HTAB:
>>   	case KVM_CAP_PPC_RTAS:
>>   	case KVM_CAP_PPC_FIXUP_HCALL:
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 080ffbf..f1abbea 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
>>   #define  KVM_DEV_VFIO_GROUP			1
>>   #define   KVM_DEV_VFIO_GROUP_ADD			1
>>   #define   KVM_DEV_VFIO_GROUP_DEL			2
>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
>>
>>   enum kvm_device_type {
>>   	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
>>   	KVM_DEV_TYPE_MAX,
>>   };
>>
>> +struct kvm_vfio_spapr_tce_liobn {
>> +	__u32	argsz;
>> +	__s32	fd;
>> +	__u32	liobn;
>> +	__u8	pad[4];
>> +	__u64	start_addr;
>> +};
>> +
>>   /*
>>    * ioctls for VM fds
>>    */
>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>> index 1dd087d..87c771e 100644
>> --- a/virt/kvm/vfio.c
>> +++ b/virt/kvm/vfio.c
>> @@ -20,6 +20,10 @@
>>   #include <linux/vfio.h>
>>   #include "vfio.h"
>>
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +#include <asm/kvm_ppc.h>
>> +#endif
>> +
>>   struct kvm_vfio_group {
>>   	struct list_head node;
>>   	struct vfio_group *vfio_group;
>> @@ -60,6 +64,22 @@ static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>>   	symbol_put(vfio_group_put_external_user);
>>   }
>>
>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>> +{
>> +	int (*fn)(struct vfio_group *);
>> +	int ret = -1;
>
> Should this be -ESOMETHING?
>
>> +	fn = symbol_get(vfio_external_user_iommu_id);
>> +	if (!fn)
>> +		return ret;
>> +
>> +	ret = fn(vfio_group);
>> +
>> +	symbol_put(vfio_external_user_iommu_id);
>> +
>> +	return ret;
>> +}
>> +
>>   static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>>   {
>>   	long (*fn)(struct vfio_group *, unsigned long);
>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
>>   	mutex_unlock(&kv->lock);
>>   }
>>
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
>> +		struct vfio_group *vfio_group)
>
>
> Shouldn't this go in the same patch that introduced the attach
> function?

Having less patches which touch different maintainers areas is better. I 
cannot avoid touching both PPC KVM and VFIO in this patch but I can in 
"[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE 
table".


>
>> +{
>> +	int group_id;
>> +	struct iommu_group *grp;
>> +
>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
>> +	grp = iommu_group_get_by_id(group_id);
>> +	if (grp) {
>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
>> +		iommu_group_put(grp);
>> +	}
>> +}
>> +#endif
>> +
>>   static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>   {
>>   	struct kvm_vfio *kv = dev->private;
>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>   				continue;
>>
>>   			list_del(&kvg->node);
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>
> Better to make a no-op version of the call than have to #ifdef at the
> callsite.

It is questionable. A x86 reader may decide that 
KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get confused.


>
>> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
>> +					kvg->vfio_group);
>> +#endif
>>   			kvm_vfio_group_put_external_user(kvg->vfio_group);
>>   			kfree(kvg);
>>   			ret = 0;
>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>   		kvm_vfio_update_coherency(dev);
>>
>>   		return ret;
>> +
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
>> +		struct kvm_vfio_spapr_tce_liobn param;
>> +		unsigned long minsz;
>> +		struct kvm_vfio *kv = dev->private;
>> +		struct vfio_group *vfio_group;
>> +		struct kvm_vfio_group *kvg;
>> +		struct fd f;
>> +
>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
>> +				start_addr);
>> +
>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (param.argsz < minsz)
>> +			return -EINVAL;
>> +
>> +		f = fdget(param.fd);
>> +		if (!f.file)
>> +			return -EBADF;
>> +
>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>> +		fdput(f);
>> +
>> +		if (IS_ERR(vfio_group))
>> +			return PTR_ERR(vfio_group);
>> +
>> +		ret = -ENOENT;
>
> Shouldn't there be some runtime test for the type of the IOMMU?  It's
> possible a kernel could be built for a platform supporting multiple
> IOMMU types.

Well, may make sense but I do not know to test that. The IOMMU type is a 
VFIO container property, not a group property and here (KVM) we only have 
groups.

And calling  iommu_group_get_iommudata() is quite useless as it returns a 
void pointer... I could probably check that the release() callback is the 
one I set via iommu_group_set_iommudata() but there is no API to get it 
from a group.

And I cannot really imagine a kernel with CONFIG_PPC_BOOK3S_64 (and 
therefore KVM_CAP_SPAPR_TCE_VFIO enabled) with different IOMMU types. Can 
the same kernel binary image work on both BOOK3S and embedded PPC? Where 
these other types can come from?


>
>> +		mutex_lock(&kv->lock);
>> +
>> +		list_for_each_entry(kvg, &kv->group_list, node) {
>> +			int group_id;
>> +			struct iommu_group *grp;
>> +
>> +			if (kvg->vfio_group != vfio_group)
>> +				continue;
>> +
>> +			group_id = kvm_vfio_external_user_iommu_id(
>> +					kvg->vfio_group);
>> +			grp = iommu_group_get_by_id(group_id);
>> +			if (!grp) {
>> +				ret = -EFAULT;
>> +				break;
>> +			}
>> +
>> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
>> +					param.liobn, param.start_addr,
>> +					grp);
>> +			if (ret)
>> +				iommu_group_put(grp);
>> +			break;
>> +		}
>> +
>> +		mutex_unlock(&kv->lock);
>> +
>> +		kvm_vfio_group_put_external_user(vfio_group);
>> +
>> +		return ret;
>> +	}
>> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>>   	}
>>
>>   	return -ENXIO;
>> @@ -225,6 +328,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>>   		switch (attr->attr) {
>>   		case KVM_DEV_VFIO_GROUP_ADD:
>>   		case KVM_DEV_VFIO_GROUP_DEL:
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN:
>> +#endif
>>   			return 0;
>>   		}
>>
>

David Gibson March 10, 2016, 5:21 a.m. UTC | #3

On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
> On 03/09/2016 04:45 PM, David Gibson wrote:
> >On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
> >>sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
> >>via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
> >>identifier. LIOBNs are made up, advertised to guest systems and
> >>linked to IOMMU groups by the user space.
> >>In order to enable acceleration for IOMMU operations in KVM, we need
> >>to tell KVM the information about the LIOBN-to-group mapping.
> >>
> >>For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
> >>is added which accepts:
> >>- a VFIO group fd and IO base address to find the actual hardware
> >>TCE table;
> >>- a LIOBN to assign to the found table.
> >>
> >>Before notifying KVM about new link, this check the group for being
> >>registered with KVM device in order to release them at unexpected KVM
> >>finish.
> >>
> >>This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >>space.
> >>
> >>While we are here, this also fixes VFIO KVM device compiling to let it
> >>link to a KVM module.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>---
> >>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
> >>  arch/powerpc/kvm/Kconfig                   |   1 +
> >>  arch/powerpc/kvm/Makefile                  |   5 +-
> >>  arch/powerpc/kvm/powerpc.c                 |   1 +
> >>  include/uapi/linux/kvm.h                   |   9 +++
> >>  virt/kvm/vfio.c                            | 106 +++++++++++++++++++++++++++++
> >>  6 files changed, 140 insertions(+), 3 deletions(-)
> >>
> >>diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >>index ef51740..c0d3eb7 100644
> >>--- a/Documentation/virtual/kvm/devices/vfio.txt
> >>+++ b/Documentation/virtual/kvm/devices/vfio.txt
> >>@@ -16,7 +16,24 @@ Groups:
> >>
> >>  KVM_DEV_VFIO_GROUP attributes:
> >>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >>+	kvm_device_attr.addr points to an int32_t file descriptor
> >>+	for the VFIO group.
> >
> >AFAICT these changes are accurate for VFIO as it is already, in which
> >case it might be clearer to put them in a separate patch.
> >
> >>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >>+	kvm_device_attr.addr points to an int32_t file descriptor
> >>+	for the VFIO group.
> >>
> >>-For each, kvm_device_attr.addr points to an int32_t file descriptor
> >>-for the VFIO group.
> >>+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
> >>+	kvm_device_attr.addr points to a struct:
> >>+		struct kvm_vfio_spapr_tce_liobn {
> >>+			__u32	argsz;
> >>+			__s32	fd;
> >>+			__u32	liobn;
> >>+			__u8	pad[4];
> >>+			__u64	start_addr;
> >>+		};
> >>+		where
> >>+		@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >>+		@fd is a file descriptor for a VFIO group;
> >>+		@liobn is a logical bus id to be associated with the group;
> >>+		@start_addr is a DMA window offset on the IO (PCI) bus
> >
> >For the cause of DDW and multiple windows, I'm assuming you can call
> >this multiple times with different LIOBNs and the same IOMMU group?
> 
> 
> Yes. It is called twice per each group (when DDW is activated) - for 32bit
> and 64bit windows, this is why @start_addr is there.
> 
> 
> >>diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> >>index 1059846..dfa3488 100644
> >>--- a/arch/powerpc/kvm/Kconfig
> >>+++ b/arch/powerpc/kvm/Kconfig
> >>@@ -65,6 +65,7 @@ config KVM_BOOK3S_64
> >>  	select KVM
> >>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
> >>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
> >>+	select KVM_VFIO if VFIO
> >>  	---help---
> >>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
> >>  	  in virtual machines on book3s_64 host processors.
> >>diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> >>index 7f7b6d8..71f577c 100644
> >>--- a/arch/powerpc/kvm/Makefile
> >>+++ b/arch/powerpc/kvm/Makefile
> >>@@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
> >>  KVM := ../../../virt/kvm
> >>
> >>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> >>-		$(KVM)/eventfd.o $(KVM)/vfio.o
> >>+		$(KVM)/eventfd.o
> >
> >Please don't disable the VFIO device for the non-book3s case.  I added
> >it (even though it didn't do anything until now) so that libvirt
> >wouldn't choke when it finds it's not available.  Obviously the new
> >ioctl needs to be only for the right IOMMU setup, but the device
> >itself should be available always.
> 
> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
> 
> 
> >>  CFLAGS_e500_mmu.o := -I.
> >>  CFLAGS_e500_mmu_host.o := -I.
> >>@@ -87,6 +87,9 @@ endif
> >>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
> >>  	book3s_xics.o
> >>
> >>+kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
> >>+	$(KVM)/vfio.o \
> >>+
> >>  kvm-book3s_64-module-objs += \
> >>  	$(KVM)/kvm_main.o \
> >>  	$(KVM)/eventfd.o \
> >>diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >>index 19aa59b..63f188d 100644
> >>--- a/arch/powerpc/kvm/powerpc.c
> >>+++ b/arch/powerpc/kvm/powerpc.c
> >>@@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>  #ifdef CONFIG_PPC_BOOK3S_64
> >>  	case KVM_CAP_SPAPR_TCE:
> >>  	case KVM_CAP_SPAPR_TCE_64:
> >>+	case KVM_CAP_SPAPR_TCE_VFIO:
> >>  	case KVM_CAP_PPC_ALLOC_HTAB:
> >>  	case KVM_CAP_PPC_RTAS:
> >>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >>diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>index 080ffbf..f1abbea 100644
> >>--- a/include/uapi/linux/kvm.h
> >>+++ b/include/uapi/linux/kvm.h
> >>@@ -1056,6 +1056,7 @@ struct kvm_device_attr {
> >>  #define  KVM_DEV_VFIO_GROUP			1
> >>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >>+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
> >>
> >>  enum kvm_device_type {
> >>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >>@@ -1075,6 +1076,14 @@ enum kvm_device_type {
> >>  	KVM_DEV_TYPE_MAX,
> >>  };
> >>
> >>+struct kvm_vfio_spapr_tce_liobn {
> >>+	__u32	argsz;
> >>+	__s32	fd;
> >>+	__u32	liobn;
> >>+	__u8	pad[4];
> >>+	__u64	start_addr;
> >>+};
> >>+
> >>  /*
> >>   * ioctls for VM fds
> >>   */
> >>diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >>index 1dd087d..87c771e 100644
> >>--- a/virt/kvm/vfio.c
> >>+++ b/virt/kvm/vfio.c
> >>@@ -20,6 +20,10 @@
> >>  #include <linux/vfio.h>
> >>  #include "vfio.h"
> >>
> >>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>+#include <asm/kvm_ppc.h>
> >>+#endif
> >>+
> >>  struct kvm_vfio_group {
> >>  	struct list_head node;
> >>  	struct vfio_group *vfio_group;
> >>@@ -60,6 +64,22 @@ static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >>  	symbol_put(vfio_group_put_external_user);
> >>  }
> >>
> >>+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >>+{
> >>+	int (*fn)(struct vfio_group *);
> >>+	int ret = -1;
> >
> >Should this be -ESOMETHING?
> >
> >>+	fn = symbol_get(vfio_external_user_iommu_id);
> >>+	if (!fn)
> >>+		return ret;
> >>+
> >>+	ret = fn(vfio_group);
> >>+
> >>+	symbol_put(vfio_external_user_iommu_id);
> >>+
> >>+	return ret;
> >>+}
> >>+
> >>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>  {
> >>  	long (*fn)(struct vfio_group *, unsigned long);
> >>@@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
> >>  	mutex_unlock(&kv->lock);
> >>  }
> >>
> >>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>+static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >>+		struct vfio_group *vfio_group)
> >
> >
> >Shouldn't this go in the same patch that introduced the attach
> >function?
> 
> Having less patches which touch different maintainers areas is better. I
> cannot avoid touching both PPC KVM and VFIO in this patch but I can in
> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
> table".
> 
> 
> >
> >>+{
> >>+	int group_id;
> >>+	struct iommu_group *grp;
> >>+
> >>+	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> >>+	grp = iommu_group_get_by_id(group_id);
> >>+	if (grp) {
> >>+		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> >>+		iommu_group_put(grp);
> >>+	}
> >>+}
> >>+#endif
> >>+
> >>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>  {
> >>  	struct kvm_vfio *kv = dev->private;
> >>@@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>  				continue;
> >>
> >>  			list_del(&kvg->node);
> >>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >
> >Better to make a no-op version of the call than have to #ifdef at the
> >callsite.
> 
> It is questionable. A x86 reader may decide that
> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
> confused.
> 
> 
> >
> >>+			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
> >>+					kvg->vfio_group);
> >>+#endif
> >>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>  			kfree(kvg);
> >>  			ret = 0;
> >>@@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>  		kvm_vfio_update_coherency(dev);
> >>
> >>  		return ret;
> >>+
> >>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
> >>+		struct kvm_vfio_spapr_tce_liobn param;
> >>+		unsigned long minsz;
> >>+		struct kvm_vfio *kv = dev->private;
> >>+		struct vfio_group *vfio_group;
> >>+		struct kvm_vfio_group *kvg;
> >>+		struct fd f;
> >>+
> >>+		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
> >>+				start_addr);
> >>+
> >>+		if (copy_from_user(&param, (void __user *)arg, minsz))
> >>+			return -EFAULT;
> >>+
> >>+		if (param.argsz < minsz)
> >>+			return -EINVAL;
> >>+
> >>+		f = fdget(param.fd);
> >>+		if (!f.file)
> >>+			return -EBADF;
> >>+
> >>+		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >>+		fdput(f);
> >>+
> >>+		if (IS_ERR(vfio_group))
> >>+			return PTR_ERR(vfio_group);
> >>+
> >>+		ret = -ENOENT;
> >
> >Shouldn't there be some runtime test for the type of the IOMMU?  It's
> >possible a kernel could be built for a platform supporting multiple
> >IOMMU types.
> 
> Well, may make sense but I do not know to test that. The IOMMU type is a
> VFIO container property, not a group property and here (KVM) we only have
> groups.

Which, as mentioned previously, is broken.

> And calling  iommu_group_get_iommudata() is quite useless as it returns a
> void pointer... I could probably check that the release() callback is the
> one I set via iommu_group_set_iommudata() but there is no API to get it from
> a group.
> 
> And I cannot really imagine a kernel with CONFIG_PPC_BOOK3S_64 (and
> therefore KVM_CAP_SPAPR_TCE_VFIO enabled) with different IOMMU types. Can
> the same kernel binary image work on both BOOK3S and embedded PPC? Where
> these other types can come from?
> 
> 
> >
> >>+		mutex_lock(&kv->lock);
> >>+
> >>+		list_for_each_entry(kvg, &kv->group_list, node) {
> >>+			int group_id;
> >>+			struct iommu_group *grp;
> >>+
> >>+			if (kvg->vfio_group != vfio_group)
> >>+				continue;
> >>+
> >>+			group_id = kvm_vfio_external_user_iommu_id(
> >>+					kvg->vfio_group);
> >>+			grp = iommu_group_get_by_id(group_id);
> >>+			if (!grp) {
> >>+				ret = -EFAULT;
> >>+				break;
> >>+			}
> >>+
> >>+			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> >>+					param.liobn, param.start_addr,
> >>+					grp);
> >>+			if (ret)
> >>+				iommu_group_put(grp);
> >>+			break;
> >>+		}
> >>+
> >>+		mutex_unlock(&kv->lock);
> >>+
> >>+		kvm_vfio_group_put_external_user(vfio_group);
> >>+
> >>+		return ret;
> >>+	}
> >>+#endif /* CONFIG_SPAPR_TCE_IOMMU */
> >>  	}
> >>
> >>  	return -ENXIO;
> >>@@ -225,6 +328,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
> >>  		switch (attr->attr) {
> >>  		case KVM_DEV_VFIO_GROUP_ADD:
> >>  		case KVM_DEV_VFIO_GROUP_DEL:
> >>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>+		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN:
> >>+#endif
> >>  			return 0;
> >>  		}
> >>
> >
> 
>

Alexey Kardashevskiy March 10, 2016, 11:09 p.m. UTC | #4

On 03/10/2016 04:21 PM, David Gibson wrote:
> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
>> On 03/09/2016 04:45 PM, David Gibson wrote:
>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
>>>> identifier. LIOBNs are made up, advertised to guest systems and
>>>> linked to IOMMU groups by the user space.
>>>> In order to enable acceleration for IOMMU operations in KVM, we need
>>>> to tell KVM the information about the LIOBN-to-group mapping.
>>>>
>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
>>>> is added which accepts:
>>>> - a VFIO group fd and IO base address to find the actual hardware
>>>> TCE table;
>>>> - a LIOBN to assign to the found table.
>>>>
>>>> Before notifying KVM about new link, this check the group for being
>>>> registered with KVM device in order to release them at unexpected KVM
>>>> finish.
>>>>
>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>>>> space.
>>>>
>>>> While we are here, this also fixes VFIO KVM device compiling to let it
>>>> link to a KVM module.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>   Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
>>>>   arch/powerpc/kvm/Kconfig                   |   1 +
>>>>   arch/powerpc/kvm/Makefile                  |   5 +-
>>>>   arch/powerpc/kvm/powerpc.c                 |   1 +
>>>>   include/uapi/linux/kvm.h                   |   9 +++
>>>>   virt/kvm/vfio.c                            | 106 +++++++++++++++++++++++++++++
>>>>   6 files changed, 140 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
>>>> index ef51740..c0d3eb7 100644
>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>>>> @@ -16,7 +16,24 @@ Groups:
>>>>
>>>>   KVM_DEV_VFIO_GROUP attributes:
>>>>     KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>> +	for the VFIO group.
>>>
>>> AFAICT these changes are accurate for VFIO as it is already, in which
>>> case it might be clearer to put them in a separate patch.
>>>
>>>>     KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>> +	for the VFIO group.
>>>>
>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>>>> -for the VFIO group.
>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
>>>> +	kvm_device_attr.addr points to a struct:
>>>> +		struct kvm_vfio_spapr_tce_liobn {
>>>> +			__u32	argsz;
>>>> +			__s32	fd;
>>>> +			__u32	liobn;
>>>> +			__u8	pad[4];
>>>> +			__u64	start_addr;
>>>> +		};
>>>> +		where
>>>> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
>>>> +		@fd is a file descriptor for a VFIO group;
>>>> +		@liobn is a logical bus id to be associated with the group;
>>>> +		@start_addr is a DMA window offset on the IO (PCI) bus
>>>
>>> For the cause of DDW and multiple windows, I'm assuming you can call
>>> this multiple times with different LIOBNs and the same IOMMU group?
>>
>>
>> Yes. It is called twice per each group (when DDW is activated) - for 32bit
>> and 64bit windows, this is why @start_addr is there.
>>
>>
>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>>>> index 1059846..dfa3488 100644
>>>> --- a/arch/powerpc/kvm/Kconfig
>>>> +++ b/arch/powerpc/kvm/Kconfig
>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
>>>>   	select KVM
>>>>   	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
>>>>   	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
>>>> +	select KVM_VFIO if VFIO
>>>>   	---help---
>>>>   	  Support running unmodified book3s_64 and book3s_32 guest kernels
>>>>   	  in virtual machines on book3s_64 host processors.
>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
>>>> index 7f7b6d8..71f577c 100644
>>>> --- a/arch/powerpc/kvm/Makefile
>>>> +++ b/arch/powerpc/kvm/Makefile
>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
>>>>   KVM := ../../../virt/kvm
>>>>
>>>>   common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
>>>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
>>>> +		$(KVM)/eventfd.o
>>>
>>> Please don't disable the VFIO device for the non-book3s case.  I added
>>> it (even though it didn't do anything until now) so that libvirt
>>> wouldn't choke when it finds it's not available.  Obviously the new
>>> ioctl needs to be only for the right IOMMU setup, but the device
>>> itself should be available always.
>>
>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
>>
>>
>>>>   CFLAGS_e500_mmu.o := -I.
>>>>   CFLAGS_e500_mmu_host.o := -I.
>>>> @@ -87,6 +87,9 @@ endif
>>>>   kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>>>>   	book3s_xics.o
>>>>
>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
>>>> +	$(KVM)/vfio.o \
>>>> +
>>>>   kvm-book3s_64-module-objs += \
>>>>   	$(KVM)/kvm_main.o \
>>>>   	$(KVM)/eventfd.o \
>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>>> index 19aa59b..63f188d 100644
>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>>>   #ifdef CONFIG_PPC_BOOK3S_64
>>>>   	case KVM_CAP_SPAPR_TCE:
>>>>   	case KVM_CAP_SPAPR_TCE_64:
>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>>>   	case KVM_CAP_PPC_ALLOC_HTAB:
>>>>   	case KVM_CAP_PPC_RTAS:
>>>>   	case KVM_CAP_PPC_FIXUP_HCALL:
>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>> index 080ffbf..f1abbea 100644
>>>> --- a/include/uapi/linux/kvm.h
>>>> +++ b/include/uapi/linux/kvm.h
>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
>>>>   #define  KVM_DEV_VFIO_GROUP			1
>>>>   #define   KVM_DEV_VFIO_GROUP_ADD			1
>>>>   #define   KVM_DEV_VFIO_GROUP_DEL			2
>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
>>>>
>>>>   enum kvm_device_type {
>>>>   	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
>>>>   	KVM_DEV_TYPE_MAX,
>>>>   };
>>>>
>>>> +struct kvm_vfio_spapr_tce_liobn {
>>>> +	__u32	argsz;
>>>> +	__s32	fd;
>>>> +	__u32	liobn;
>>>> +	__u8	pad[4];
>>>> +	__u64	start_addr;
>>>> +};
>>>> +
>>>>   /*
>>>>    * ioctls for VM fds
>>>>    */
>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>>>> index 1dd087d..87c771e 100644
>>>> --- a/virt/kvm/vfio.c
>>>> +++ b/virt/kvm/vfio.c
>>>> @@ -20,6 +20,10 @@
>>>>   #include <linux/vfio.h>
>>>>   #include "vfio.h"
>>>>
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +#include <asm/kvm_ppc.h>
>>>> +#endif
>>>> +
>>>>   struct kvm_vfio_group {
>>>>   	struct list_head node;
>>>>   	struct vfio_group *vfio_group;
>>>> @@ -60,6 +64,22 @@ static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>>>>   	symbol_put(vfio_group_put_external_user);
>>>>   }
>>>>
>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>>>> +{
>>>> +	int (*fn)(struct vfio_group *);
>>>> +	int ret = -1;
>>>
>>> Should this be -ESOMETHING?
>>>
>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
>>>> +	if (!fn)
>>>> +		return ret;
>>>> +
>>>> +	ret = fn(vfio_group);
>>>> +
>>>> +	symbol_put(vfio_external_user_iommu_id);
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>>   static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>>>>   {
>>>>   	long (*fn)(struct vfio_group *, unsigned long);
>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
>>>>   	mutex_unlock(&kv->lock);
>>>>   }
>>>>
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
>>>> +		struct vfio_group *vfio_group)
>>>
>>>
>>> Shouldn't this go in the same patch that introduced the attach
>>> function?
>>
>> Having less patches which touch different maintainers areas is better. I
>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in
>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
>> table".
>>
>>
>>>
>>>> +{
>>>> +	int group_id;
>>>> +	struct iommu_group *grp;
>>>> +
>>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
>>>> +	grp = iommu_group_get_by_id(group_id);
>>>> +	if (grp) {
>>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
>>>> +		iommu_group_put(grp);
>>>> +	}
>>>> +}
>>>> +#endif
>>>> +
>>>>   static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>   {
>>>>   	struct kvm_vfio *kv = dev->private;
>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>   				continue;
>>>>
>>>>   			list_del(&kvg->node);
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>
>>> Better to make a no-op version of the call than have to #ifdef at the
>>> callsite.
>>
>> It is questionable. A x86 reader may decide that
>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
>> confused.
>>
>>
>>>
>>>> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
>>>> +					kvg->vfio_group);
>>>> +#endif
>>>>   			kvm_vfio_group_put_external_user(kvg->vfio_group);
>>>>   			kfree(kvg);
>>>>   			ret = 0;
>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>   		kvm_vfio_update_coherency(dev);
>>>>
>>>>   		return ret;
>>>> +
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
>>>> +		struct kvm_vfio_spapr_tce_liobn param;
>>>> +		unsigned long minsz;
>>>> +		struct kvm_vfio *kv = dev->private;
>>>> +		struct vfio_group *vfio_group;
>>>> +		struct kvm_vfio_group *kvg;
>>>> +		struct fd f;
>>>> +
>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
>>>> +				start_addr);
>>>> +
>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>>>> +			return -EFAULT;
>>>> +
>>>> +		if (param.argsz < minsz)
>>>> +			return -EINVAL;
>>>> +
>>>> +		f = fdget(param.fd);
>>>> +		if (!f.file)
>>>> +			return -EBADF;
>>>> +
>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>>>> +		fdput(f);
>>>> +
>>>> +		if (IS_ERR(vfio_group))
>>>> +			return PTR_ERR(vfio_group);
>>>> +
>>>> +		ret = -ENOENT;
>>>
>>> Shouldn't there be some runtime test for the type of the IOMMU?  It's
>>> possible a kernel could be built for a platform supporting multiple
>>> IOMMU types.
>>
>> Well, may make sense but I do not know to test that. The IOMMU type is a
>> VFIO container property, not a group property and here (KVM) we only have
>> groups.
>
> Which, as mentioned previously, is broken.

Which I am failing to follow you on this.

What I am trying to achieve here is pretty much referencing a group so it 
cannot be reused. Plus LIOBNs. Passing a container fd does not make much 
sense here as the VFIO device would walk through groups, reference them and 
that is it, there is no locking on VFIO containters and so far there was no 
need to teach KVM about containers.

What do I miss now?



>
>> And calling  iommu_group_get_iommudata() is quite useless as it returns a
>> void pointer... I could probably check that the release() callback is the
>> one I set via iommu_group_set_iommudata() but there is no API to get it from
>> a group.
>>
>> And I cannot really imagine a kernel with CONFIG_PPC_BOOK3S_64 (and
>> therefore KVM_CAP_SPAPR_TCE_VFIO enabled) with different IOMMU types. Can
>> the same kernel binary image work on both BOOK3S and embedded PPC? Where
>> these other types can come from?
>>
>>
>>>
>>>> +		mutex_lock(&kv->lock);
>>>> +
>>>> +		list_for_each_entry(kvg, &kv->group_list, node) {
>>>> +			int group_id;
>>>> +			struct iommu_group *grp;
>>>> +
>>>> +			if (kvg->vfio_group != vfio_group)
>>>> +				continue;
>>>> +
>>>> +			group_id = kvm_vfio_external_user_iommu_id(
>>>> +					kvg->vfio_group);
>>>> +			grp = iommu_group_get_by_id(group_id);
>>>> +			if (!grp) {
>>>> +				ret = -EFAULT;
>>>> +				break;
>>>> +			}
>>>> +
>>>> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
>>>> +					param.liobn, param.start_addr,
>>>> +					grp);
>>>> +			if (ret)
>>>> +				iommu_group_put(grp);
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		mutex_unlock(&kv->lock);
>>>> +
>>>> +		kvm_vfio_group_put_external_user(vfio_group);
>>>> +
>>>> +		return ret;
>>>> +	}
>>>> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>>>>   	}
>>>>
>>>>   	return -ENXIO;
>>>> @@ -225,6 +328,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>>>>   		switch (attr->attr) {
>>>>   		case KVM_DEV_VFIO_GROUP_ADD:
>>>>   		case KVM_DEV_VFIO_GROUP_DEL:
>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN:
>>>> +#endif
>>>>   			return 0;
>>>>   		}
>>>>
>>>
>>
>>
>

David Gibson March 15, 2016, 6:04 a.m. UTC | #5

On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
> On 03/10/2016 04:21 PM, David Gibson wrote:
> >On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
> >>On 03/09/2016 04:45 PM, David Gibson wrote:
> >>>On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
> >>>>sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
> >>>>via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
> >>>>identifier. LIOBNs are made up, advertised to guest systems and
> >>>>linked to IOMMU groups by the user space.
> >>>>In order to enable acceleration for IOMMU operations in KVM, we need
> >>>>to tell KVM the information about the LIOBN-to-group mapping.
> >>>>
> >>>>For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
> >>>>is added which accepts:
> >>>>- a VFIO group fd and IO base address to find the actual hardware
> >>>>TCE table;
> >>>>- a LIOBN to assign to the found table.
> >>>>
> >>>>Before notifying KVM about new link, this check the group for being
> >>>>registered with KVM device in order to release them at unexpected KVM
> >>>>finish.
> >>>>
> >>>>This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >>>>space.
> >>>>
> >>>>While we are here, this also fixes VFIO KVM device compiling to let it
> >>>>link to a KVM module.
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>---
> >>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
> >>>>  arch/powerpc/kvm/Kconfig                   |   1 +
> >>>>  arch/powerpc/kvm/Makefile                  |   5 +-
> >>>>  arch/powerpc/kvm/powerpc.c                 |   1 +
> >>>>  include/uapi/linux/kvm.h                   |   9 +++
> >>>>  virt/kvm/vfio.c                            | 106 +++++++++++++++++++++++++++++
> >>>>  6 files changed, 140 insertions(+), 3 deletions(-)
> >>>>
> >>>>diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>index ef51740..c0d3eb7 100644
> >>>>--- a/Documentation/virtual/kvm/devices/vfio.txt
> >>>>+++ b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>@@ -16,7 +16,24 @@ Groups:
> >>>>
> >>>>  KVM_DEV_VFIO_GROUP attributes:
> >>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >>>>+	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>+	for the VFIO group.
> >>>
> >>>AFAICT these changes are accurate for VFIO as it is already, in which
> >>>case it might be clearer to put them in a separate patch.
> >>>
> >>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >>>>+	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>+	for the VFIO group.
> >>>>
> >>>>-For each, kvm_device_attr.addr points to an int32_t file descriptor
> >>>>-for the VFIO group.
> >>>>+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
> >>>>+	kvm_device_attr.addr points to a struct:
> >>>>+		struct kvm_vfio_spapr_tce_liobn {
> >>>>+			__u32	argsz;
> >>>>+			__s32	fd;
> >>>>+			__u32	liobn;
> >>>>+			__u8	pad[4];
> >>>>+			__u64	start_addr;
> >>>>+		};
> >>>>+		where
> >>>>+		@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >>>>+		@fd is a file descriptor for a VFIO group;
> >>>>+		@liobn is a logical bus id to be associated with the group;
> >>>>+		@start_addr is a DMA window offset on the IO (PCI) bus
> >>>
> >>>For the cause of DDW and multiple windows, I'm assuming you can call
> >>>this multiple times with different LIOBNs and the same IOMMU group?
> >>
> >>
> >>Yes. It is called twice per each group (when DDW is activated) - for 32bit
> >>and 64bit windows, this is why @start_addr is there.
> >>
> >>
> >>>>diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> >>>>index 1059846..dfa3488 100644
> >>>>--- a/arch/powerpc/kvm/Kconfig
> >>>>+++ b/arch/powerpc/kvm/Kconfig
> >>>>@@ -65,6 +65,7 @@ config KVM_BOOK3S_64
> >>>>  	select KVM
> >>>>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
> >>>>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
> >>>>+	select KVM_VFIO if VFIO
> >>>>  	---help---
> >>>>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
> >>>>  	  in virtual machines on book3s_64 host processors.
> >>>>diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> >>>>index 7f7b6d8..71f577c 100644
> >>>>--- a/arch/powerpc/kvm/Makefile
> >>>>+++ b/arch/powerpc/kvm/Makefile
> >>>>@@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
> >>>>  KVM := ../../../virt/kvm
> >>>>
> >>>>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> >>>>-		$(KVM)/eventfd.o $(KVM)/vfio.o
> >>>>+		$(KVM)/eventfd.o
> >>>
> >>>Please don't disable the VFIO device for the non-book3s case.  I added
> >>>it (even though it didn't do anything until now) so that libvirt
> >>>wouldn't choke when it finds it's not available.  Obviously the new
> >>>ioctl needs to be only for the right IOMMU setup, but the device
> >>>itself should be available always.
> >>
> >>Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
> >>
> >>
> >>>>  CFLAGS_e500_mmu.o := -I.
> >>>>  CFLAGS_e500_mmu_host.o := -I.
> >>>>@@ -87,6 +87,9 @@ endif
> >>>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
> >>>>  	book3s_xics.o
> >>>>
> >>>>+kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
> >>>>+	$(KVM)/vfio.o \
> >>>>+
> >>>>  kvm-book3s_64-module-objs += \
> >>>>  	$(KVM)/kvm_main.o \
> >>>>  	$(KVM)/eventfd.o \
> >>>>diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >>>>index 19aa59b..63f188d 100644
> >>>>--- a/arch/powerpc/kvm/powerpc.c
> >>>>+++ b/arch/powerpc/kvm/powerpc.c
> >>>>@@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>>>  #ifdef CONFIG_PPC_BOOK3S_64
> >>>>  	case KVM_CAP_SPAPR_TCE:
> >>>>  	case KVM_CAP_SPAPR_TCE_64:
> >>>>+	case KVM_CAP_SPAPR_TCE_VFIO:
> >>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
> >>>>  	case KVM_CAP_PPC_RTAS:
> >>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >>>>diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>>index 080ffbf..f1abbea 100644
> >>>>--- a/include/uapi/linux/kvm.h
> >>>>+++ b/include/uapi/linux/kvm.h
> >>>>@@ -1056,6 +1056,7 @@ struct kvm_device_attr {
> >>>>  #define  KVM_DEV_VFIO_GROUP			1
> >>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >>>>+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
> >>>>
> >>>>  enum kvm_device_type {
> >>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >>>>@@ -1075,6 +1076,14 @@ enum kvm_device_type {
> >>>>  	KVM_DEV_TYPE_MAX,
> >>>>  };
> >>>>
> >>>>+struct kvm_vfio_spapr_tce_liobn {
> >>>>+	__u32	argsz;
> >>>>+	__s32	fd;
> >>>>+	__u32	liobn;
> >>>>+	__u8	pad[4];
> >>>>+	__u64	start_addr;
> >>>>+};
> >>>>+
> >>>>  /*
> >>>>   * ioctls for VM fds
> >>>>   */
> >>>>diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >>>>index 1dd087d..87c771e 100644
> >>>>--- a/virt/kvm/vfio.c
> >>>>+++ b/virt/kvm/vfio.c
> >>>>@@ -20,6 +20,10 @@
> >>>>  #include <linux/vfio.h>
> >>>>  #include "vfio.h"
> >>>>
> >>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>+#include <asm/kvm_ppc.h>
> >>>>+#endif
> >>>>+
> >>>>  struct kvm_vfio_group {
> >>>>  	struct list_head node;
> >>>>  	struct vfio_group *vfio_group;
> >>>>@@ -60,6 +64,22 @@ static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >>>>  	symbol_put(vfio_group_put_external_user);
> >>>>  }
> >>>>
> >>>>+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >>>>+{
> >>>>+	int (*fn)(struct vfio_group *);
> >>>>+	int ret = -1;
> >>>
> >>>Should this be -ESOMETHING?
> >>>
> >>>>+	fn = symbol_get(vfio_external_user_iommu_id);
> >>>>+	if (!fn)
> >>>>+		return ret;
> >>>>+
> >>>>+	ret = fn(vfio_group);
> >>>>+
> >>>>+	symbol_put(vfio_external_user_iommu_id);
> >>>>+
> >>>>+	return ret;
> >>>>+}
> >>>>+
> >>>>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>>>  {
> >>>>  	long (*fn)(struct vfio_group *, unsigned long);
> >>>>@@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
> >>>>  	mutex_unlock(&kv->lock);
> >>>>  }
> >>>>
> >>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>+static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >>>>+		struct vfio_group *vfio_group)
> >>>
> >>>
> >>>Shouldn't this go in the same patch that introduced the attach
> >>>function?
> >>
> >>Having less patches which touch different maintainers areas is better. I
> >>cannot avoid touching both PPC KVM and VFIO in this patch but I can in
> >>"[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
> >>table".
> >>
> >>
> >>>
> >>>>+{
> >>>>+	int group_id;
> >>>>+	struct iommu_group *grp;
> >>>>+
> >>>>+	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> >>>>+	grp = iommu_group_get_by_id(group_id);
> >>>>+	if (grp) {
> >>>>+		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> >>>>+		iommu_group_put(grp);
> >>>>+	}
> >>>>+}
> >>>>+#endif
> >>>>+
> >>>>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>  {
> >>>>  	struct kvm_vfio *kv = dev->private;
> >>>>@@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>  				continue;
> >>>>
> >>>>  			list_del(&kvg->node);
> >>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>
> >>>Better to make a no-op version of the call than have to #ifdef at the
> >>>callsite.
> >>
> >>It is questionable. A x86 reader may decide that
> >>KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
> >>confused.
> >>
> >>
> >>>
> >>>>+			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
> >>>>+					kvg->vfio_group);
> >>>>+#endif
> >>>>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>>>  			kfree(kvg);
> >>>>  			ret = 0;
> >>>>@@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>  		kvm_vfio_update_coherency(dev);
> >>>>
> >>>>  		return ret;
> >>>>+
> >>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
> >>>>+		struct kvm_vfio_spapr_tce_liobn param;
> >>>>+		unsigned long minsz;
> >>>>+		struct kvm_vfio *kv = dev->private;
> >>>>+		struct vfio_group *vfio_group;
> >>>>+		struct kvm_vfio_group *kvg;
> >>>>+		struct fd f;
> >>>>+
> >>>>+		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
> >>>>+				start_addr);
> >>>>+
> >>>>+		if (copy_from_user(&param, (void __user *)arg, minsz))
> >>>>+			return -EFAULT;
> >>>>+
> >>>>+		if (param.argsz < minsz)
> >>>>+			return -EINVAL;
> >>>>+
> >>>>+		f = fdget(param.fd);
> >>>>+		if (!f.file)
> >>>>+			return -EBADF;
> >>>>+
> >>>>+		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >>>>+		fdput(f);
> >>>>+
> >>>>+		if (IS_ERR(vfio_group))
> >>>>+			return PTR_ERR(vfio_group);
> >>>>+
> >>>>+		ret = -ENOENT;
> >>>
> >>>Shouldn't there be some runtime test for the type of the IOMMU?  It's
> >>>possible a kernel could be built for a platform supporting multiple
> >>>IOMMU types.
> >>
> >>Well, may make sense but I do not know to test that. The IOMMU type is a
> >>VFIO container property, not a group property and here (KVM) we only have
> >>groups.
> >
> >Which, as mentioned previously, is broken.
> 
> Which I am failing to follow you on this.
> 
> What I am trying to achieve here is pretty much referencing a group so it
> cannot be reused. Plus LIOBNs.

"Plus LIOBNs" is not a trivial change.  You are establishing a linkage
from LIOBNs to groups.  But that doesn't make sense; if mapping in one
(guest) LIOBN affects a group it must affect all groups in the
container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
to group.

> Passing a container fd does not make much
> sense here as the VFIO device would walk through groups, reference them and
> that is it, there is no locking on VFIO containters and so far there was no
> need to teach KVM about containers.
> 
> What do I miss now?

Referencing the groups is essentially just a useful side effect.  The
important functionality is informing VFIO of the guest LIOBNs; and
LIOBNs map to containers, not groups.

Alexey Kardashevskiy March 22, 2016, 12:34 a.m. UTC | #6

Uff, lost cc: list. Added back. Some comments below.


On 03/21/2016 04:19 PM, David Gibson wrote:
> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote:
>> On March 15, 2016 17:29:26 David Gibson <david@gibson.dropbear.id.au> wrote:
>>
>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
>>>> On 03/10/2016 04:21 PM, David Gibson wrote:
>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote:
>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and
>>>>>>>> linked to IOMMU groups by the user space.
>>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we need
>>>>>>>> to tell KVM the information about the LIOBN-to-group mapping.
>>>>>>>>
>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
>>>>>>>> is added which accepts:
>>>>>>>> - a VFIO group fd and IO base address to find the actual hardware
>>>>>>>> TCE table;
>>>>>>>> - a LIOBN to assign to the found table.
>>>>>>>>
>>>>>>>> Before notifying KVM about new link, this check the group for being
>>>>>>>> registered with KVM device in order to release them at unexpected KVM
>>>>>>>> finish.
>>>>>>>>
>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>>>>>>>> space.
>>>>>>>>
>>>>>>>> While we are here, this also fixes VFIO KVM device compiling to let it
>>>>>>>> link to a KVM module.
>>>>>>>>
>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>> ---
>>>>>>>>   Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
>>>>>>>>   arch/powerpc/kvm/Kconfig                   |   1 +
>>>>>>>>   arch/powerpc/kvm/Makefile                  |   5 +-
>>>>>>>>   arch/powerpc/kvm/powerpc.c                 |   1 +
>>>>>>>>   include/uapi/linux/kvm.h                   |   9 +++
>>>>>>>>   virt/kvm/vfio.c                            | 106
>>>> +++++++++++++++++++++++++++++
>>>>>>>>   6 files changed, 140 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt
>>>> b/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>> index ef51740..c0d3eb7 100644
>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>> @@ -16,7 +16,24 @@ Groups:
>>>>>>>>
>>>>>>>>   KVM_DEV_VFIO_GROUP attributes:
>>>>>>>>     KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>> +	for the VFIO group.
>>>>>>>
>>>>>>> AFAICT these changes are accurate for VFIO as it is already, in which
>>>>>>> case it might be clearer to put them in a separate patch.
>>>>>>>
>>>>>>>>     KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device
>>>> tracking
>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>> +	for the VFIO group.
>>>>>>>>
>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>> -for the VFIO group.
>>>>>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
>>>>>>>> +	kvm_device_attr.addr points to a struct:
>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn {
>>>>>>>> +			__u32	argsz;
>>>>>>>> +			__s32	fd;
>>>>>>>> +			__u32	liobn;
>>>>>>>> +			__u8	pad[4];
>>>>>>>> +			__u64	start_addr;
>>>>>>>> +		};
>>>>>>>> +		where
>>>>>>>> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
>>>>>>>> +		@fd is a file descriptor for a VFIO group;
>>>>>>>> +		@liobn is a logical bus id to be associated with the group;
>>>>>>>> +		@start_addr is a DMA window offset on the IO (PCI) bus
>>>>>>>
>>>>>>> For the cause of DDW and multiple windows, I'm assuming you can call
>>>>>>> this multiple times with different LIOBNs and the same IOMMU group?
>>>>>>
>>>>>>
>>>>>> Yes. It is called twice per each group (when DDW is activated) - for 32bit
>>>>>> and 64bit windows, this is why @start_addr is there.
>>>>>>
>>>>>>
>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>>>>>>>> index 1059846..dfa3488 100644
>>>>>>>> --- a/arch/powerpc/kvm/Kconfig
>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig
>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
>>>>>>>>   	select KVM
>>>>>>>>   	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
>>>>>>>>   	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
>>>>>>>> +	select KVM_VFIO if VFIO
>>>>>>>>   	---help---
>>>>>>>>   	  Support running unmodified book3s_64 and book3s_32 guest kernels
>>>>>>>>   	  in virtual machines on book3s_64 host processors.
>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
>>>>>>>> index 7f7b6d8..71f577c 100644
>>>>>>>> --- a/arch/powerpc/kvm/Makefile
>>>>>>>> +++ b/arch/powerpc/kvm/Makefile
>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
>>>>>>>>   KVM := ../../../virt/kvm
>>>>>>>>
>>>>>>>>   common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
>>>>>>>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
>>>>>>>> +		$(KVM)/eventfd.o
>>>>>>>
>>>>>>> Please don't disable the VFIO device for the non-book3s case.  I added
>>>>>>> it (even though it didn't do anything until now) so that libvirt
>>>>>>> wouldn't choke when it finds it's not available.  Obviously the new
>>>>>>> ioctl needs to be only for the right IOMMU setup, but the device
>>>>>>> itself should be available always.
>>>>>>
>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
>>>>>>
>>>>>>
>>>>>>>>   CFLAGS_e500_mmu.o := -I.
>>>>>>>>   CFLAGS_e500_mmu_host.o := -I.
>>>>>>>> @@ -87,6 +87,9 @@ endif
>>>>>>>>   kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>>>>>>>>   	book3s_xics.o
>>>>>>>>
>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
>>>>>>>> +	$(KVM)/vfio.o \
>>>>>>>> +
>>>>>>>>   kvm-book3s_64-module-objs += \
>>>>>>>>   	$(KVM)/kvm_main.o \
>>>>>>>>   	$(KVM)/eventfd.o \
>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>>>>>>> index 19aa59b..63f188d 100644
>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm
>>>> *kvm, long ext)
>>>>>>>>   #ifdef CONFIG_PPC_BOOK3S_64
>>>>>>>>   	case KVM_CAP_SPAPR_TCE:
>>>>>>>>   	case KVM_CAP_SPAPR_TCE_64:
>>>>>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>>>>>>>   	case KVM_CAP_PPC_ALLOC_HTAB:
>>>>>>>>   	case KVM_CAP_PPC_RTAS:
>>>>>>>>   	case KVM_CAP_PPC_FIXUP_HCALL:
>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>>>>>> index 080ffbf..f1abbea 100644
>>>>>>>> --- a/include/uapi/linux/kvm.h
>>>>>>>> +++ b/include/uapi/linux/kvm.h
>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
>>>>>>>>   #define  KVM_DEV_VFIO_GROUP			1
>>>>>>>>   #define   KVM_DEV_VFIO_GROUP_ADD			1
>>>>>>>>   #define   KVM_DEV_VFIO_GROUP_DEL			2
>>>>>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
>>>>>>>>
>>>>>>>>   enum kvm_device_type {
>>>>>>>>   	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
>>>>>>>>   	KVM_DEV_TYPE_MAX,
>>>>>>>>   };
>>>>>>>>
>>>>>>>> +struct kvm_vfio_spapr_tce_liobn {
>>>>>>>> +	__u32	argsz;
>>>>>>>> +	__s32	fd;
>>>>>>>> +	__u32	liobn;
>>>>>>>> +	__u8	pad[4];
>>>>>>>> +	__u64	start_addr;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>>   /*
>>>>>>>>    * ioctls for VM fds
>>>>>>>>    */
>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>>>>>>>> index 1dd087d..87c771e 100644
>>>>>>>> --- a/virt/kvm/vfio.c
>>>>>>>> +++ b/virt/kvm/vfio.c
>>>>>>>> @@ -20,6 +20,10 @@
>>>>>>>>   #include <linux/vfio.h>
>>>>>>>>   #include "vfio.h"
>>>>>>>>
>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>> +#include <asm/kvm_ppc.h>
>>>>>>>> +#endif
>>>>>>>> +
>>>>>>>>   struct kvm_vfio_group {
>>>>>>>>   	struct list_head node;
>>>>>>>>   	struct vfio_group *vfio_group;
>>>>>>>> @@ -60,6 +64,22 @@ static void
>>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>>>>>>>>   	symbol_put(vfio_group_put_external_user);
>>>>>>>>   }
>>>>>>>>
>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>>>>>>>> +{
>>>>>>>> +	int (*fn)(struct vfio_group *);
>>>>>>>> +	int ret = -1;
>>>>>>>
>>>>>>> Should this be -ESOMETHING?
>>>>>>>
>>>>>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
>>>>>>>> +	if (!fn)
>>>>>>>> +		return ret;
>>>>>>>> +
>>>>>>>> +	ret = fn(vfio_group);
>>>>>>>> +
>>>>>>>> +	symbol_put(vfio_external_user_iommu_id);
>>>>>>>> +
>>>>>>>> +	return ret;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>   static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>>>>>>>>   {
>>>>>>>>   	long (*fn)(struct vfio_group *, unsigned long);
>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct
>>>> kvm_device *dev)
>>>>>>>>   	mutex_unlock(&kv->lock);
>>>>>>>>   }
>>>>>>>>
>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
>>>>>>>> +		struct vfio_group *vfio_group)
>>>>>>>
>>>>>>>
>>>>>>> Shouldn't this go in the same patch that introduced the attach
>>>>>>> function?
>>>>>>
>>>>>> Having less patches which touch different maintainers areas is better. I
>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in
>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
>>>>>> table".
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +{
>>>>>>>> +	int group_id;
>>>>>>>> +	struct iommu_group *grp;
>>>>>>>> +
>>>>>>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
>>>>>>>> +	grp = iommu_group_get_by_id(group_id);
>>>>>>>> +	if (grp) {
>>>>>>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
>>>>>>>> +		iommu_group_put(grp);
>>>>>>>> +	}
>>>>>>>> +}
>>>>>>>> +#endif
>>>>>>>> +
>>>>>>>>   static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>>>>>   {
>>>>>>>>   	struct kvm_vfio *kv = dev->private;
>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device
>>>> *dev, long attr, u64 arg)
>>>>>>>>   				continue;
>>>>>>>>
>>>>>>>>   			list_del(&kvg->node);
>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>
>>>>>>> Better to make a no-op version of the call than have to #ifdef at the
>>>>>>> callsite.
>>>>>>
>>>>>> It is questionable. A x86 reader may decide that
>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
>>>>>> confused.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
>>>>>>>> +					kvg->vfio_group);
>>>>>>>> +#endif
>>>>>>>>   			kvm_vfio_group_put_external_user(kvg->vfio_group);
>>>>>>>>   			kfree(kvg);
>>>>>>>>   			ret = 0;
>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device
>>>> *dev, long attr, u64 arg)
>>>>>>>>   		kvm_vfio_update_coherency(dev);
>>>>>>>>
>>>>>>>>   		return ret;
>>>>>>>> +
>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn param;
>>>>>>>> +		unsigned long minsz;
>>>>>>>> +		struct kvm_vfio *kv = dev->private;
>>>>>>>> +		struct vfio_group *vfio_group;
>>>>>>>> +		struct kvm_vfio_group *kvg;
>>>>>>>> +		struct fd f;
>>>>>>>> +
>>>>>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
>>>>>>>> +				start_addr);
>>>>>>>> +
>>>>>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>>>>>>>> +			return -EFAULT;
>>>>>>>> +
>>>>>>>> +		if (param.argsz < minsz)
>>>>>>>> +			return -EINVAL;
>>>>>>>> +
>>>>>>>> +		f = fdget(param.fd);
>>>>>>>> +		if (!f.file)
>>>>>>>> +			return -EBADF;
>>>>>>>> +
>>>>>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>>>>>>>> +		fdput(f);
>>>>>>>> +
>>>>>>>> +		if (IS_ERR(vfio_group))
>>>>>>>> +			return PTR_ERR(vfio_group);
>>>>>>>> +
>>>>>>>> +		ret = -ENOENT;
>>>>>>>
>>>>>>> Shouldn't there be some runtime test for the type of the IOMMU?  It's
>>>>>>> possible a kernel could be built for a platform supporting multiple
>>>>>>> IOMMU types.
>>>>>>
>>>>>> Well, may make sense but I do not know to test that. The IOMMU type is a
>>>>>> VFIO container property, not a group property and here (KVM) we only have
>>>>>> groups.
>>>>>
>>>>> Which, as mentioned previously, is broken.
>>>>
>>>> Which I am failing to follow you on this.
>>>>
>>>> What I am trying to achieve here is pretty much referencing a group so it
>>>> cannot be reused. Plus LIOBNs.
>>>
>>> "Plus LIOBNs" is not a trivial change.  You are establishing a linkage
>> >from LIOBNs to groups.  But that doesn't make sense; if mapping in one
>>> (guest) LIOBN affects a group it must affect all groups in the
>>> container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
>>> to group.
>>
>> I can see your point but i don't see how to proceed now, I'm totally stuck.
>> Pass container fd and then implement new api to lock containers somehow and
>
> I'm not really understanding what the question is about locking containers.
>
>> enumerate groups when updating TCE table (including real mode)?
>
> Why do you need to enumerate groups?  The groups within the container
> share a TCE table, so can't you just update that once?

Well, they share a TCE table but they do not share TCE Kill (TCE cache 
invalidate) register address, it is still per PE but this does not matter 
here (pnv_pci_link_table_and_group() does that), just mentioned to complete 
the picture.


>> Plus new API when we remove a group from a container as the result of guest
>> PCI hot unplug?
>
> I assume you mean a kernel internal API, since it shouldn't need
> anything else visible to userspace.  Won't this happen naturally?
> When the group is removed from the container, it will get its own TCE
> table instead of the previously shared one.
 >
>>>> Passing a container fd does not make much
>>>> sense here as the VFIO device would walk through groups, reference them and
>>>> that is it, there is no locking on VFIO containters and so far there was no
>>>> need to teach KVM about containers.
>>>>
>>>> What do I miss now?
>>>
>>> Referencing the groups is essentially just a useful side effect.  The
>>> important functionality is informing VFIO of the guest LIOBNs; and
>>> LIOBNs map to containers, not groups.
>>
>> No. One liobn maps to one KVM-allocated TCE table, not a container. There
>> can be one or many or none containers per liobn.
>
> Ah, true.

So I need to add new kernel API for KVM to get table(s) from VFIO 
container(s). And invent some locking mechanism to prevent table(s) (or 
associated container(s)) from going away, like 
vfio_group_get_external_user/vfio_group_put_external_user but for 
containers. Correct?

David Gibson March 23, 2016, 3:03 a.m. UTC | #7

On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote:
> Uff, lost cc: list. Added back. Some comments below.
> 
> 
> On 03/21/2016 04:19 PM, David Gibson wrote:
> >On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote:
> >>On March 15, 2016 17:29:26 David Gibson <david@gibson.dropbear.id.au> wrote:
> >>
> >>>On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
> >>>>On 03/10/2016 04:21 PM, David Gibson wrote:
> >>>>>On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>On 03/09/2016 04:45 PM, David Gibson wrote:
> >>>>>>>On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
> >>>>>>>>via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
> >>>>>>>>identifier. LIOBNs are made up, advertised to guest systems and
> >>>>>>>>linked to IOMMU groups by the user space.
> >>>>>>>>In order to enable acceleration for IOMMU operations in KVM, we need
> >>>>>>>>to tell KVM the information about the LIOBN-to-group mapping.
> >>>>>>>>
> >>>>>>>>For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
> >>>>>>>>is added which accepts:
> >>>>>>>>- a VFIO group fd and IO base address to find the actual hardware
> >>>>>>>>TCE table;
> >>>>>>>>- a LIOBN to assign to the found table.
> >>>>>>>>
> >>>>>>>>Before notifying KVM about new link, this check the group for being
> >>>>>>>>registered with KVM device in order to release them at unexpected KVM
> >>>>>>>>finish.
> >>>>>>>>
> >>>>>>>>This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >>>>>>>>space.
> >>>>>>>>
> >>>>>>>>While we are here, this also fixes VFIO KVM device compiling to let it
> >>>>>>>>link to a KVM module.
> >>>>>>>>
> >>>>>>>>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>>>---
> >>>>>>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
> >>>>>>>>  arch/powerpc/kvm/Kconfig                   |   1 +
> >>>>>>>>  arch/powerpc/kvm/Makefile                  |   5 +-
> >>>>>>>>  arch/powerpc/kvm/powerpc.c                 |   1 +
> >>>>>>>>  include/uapi/linux/kvm.h                   |   9 +++
> >>>>>>>>  virt/kvm/vfio.c                            | 106
> >>>>+++++++++++++++++++++++++++++
> >>>>>>>>  6 files changed, 140 insertions(+), 3 deletions(-)
> >>>>>>>>
> >>>>>>>>diff --git a/Documentation/virtual/kvm/devices/vfio.txt
> >>>>b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>index ef51740..c0d3eb7 100644
> >>>>>>>>--- a/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>+++ b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>@@ -16,7 +16,24 @@ Groups:
> >>>>>>>>
> >>>>>>>>  KVM_DEV_VFIO_GROUP attributes:
> >>>>>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >>>>>>>>+	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>+	for the VFIO group.
> >>>>>>>
> >>>>>>>AFAICT these changes are accurate for VFIO as it is already, in which
> >>>>>>>case it might be clearer to put them in a separate patch.
> >>>>>>>
> >>>>>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device
> >>>>tracking
> >>>>>>>>+	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>+	for the VFIO group.
> >>>>>>>>
> >>>>>>>>-For each, kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>-for the VFIO group.
> >>>>>>>>+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
> >>>>>>>>+	kvm_device_attr.addr points to a struct:
> >>>>>>>>+		struct kvm_vfio_spapr_tce_liobn {
> >>>>>>>>+			__u32	argsz;
> >>>>>>>>+			__s32	fd;
> >>>>>>>>+			__u32	liobn;
> >>>>>>>>+			__u8	pad[4];
> >>>>>>>>+			__u64	start_addr;
> >>>>>>>>+		};
> >>>>>>>>+		where
> >>>>>>>>+		@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >>>>>>>>+		@fd is a file descriptor for a VFIO group;
> >>>>>>>>+		@liobn is a logical bus id to be associated with the group;
> >>>>>>>>+		@start_addr is a DMA window offset on the IO (PCI) bus
> >>>>>>>
> >>>>>>>For the cause of DDW and multiple windows, I'm assuming you can call
> >>>>>>>this multiple times with different LIOBNs and the same IOMMU group?
> >>>>>>
> >>>>>>
> >>>>>>Yes. It is called twice per each group (when DDW is activated) - for 32bit
> >>>>>>and 64bit windows, this is why @start_addr is there.
> >>>>>>
> >>>>>>
> >>>>>>>>diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> >>>>>>>>index 1059846..dfa3488 100644
> >>>>>>>>--- a/arch/powerpc/kvm/Kconfig
> >>>>>>>>+++ b/arch/powerpc/kvm/Kconfig
> >>>>>>>>@@ -65,6 +65,7 @@ config KVM_BOOK3S_64
> >>>>>>>>  	select KVM
> >>>>>>>>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
> >>>>>>>>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
> >>>>>>>>+	select KVM_VFIO if VFIO
> >>>>>>>>  	---help---
> >>>>>>>>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
> >>>>>>>>  	  in virtual machines on book3s_64 host processors.
> >>>>>>>>diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> >>>>>>>>index 7f7b6d8..71f577c 100644
> >>>>>>>>--- a/arch/powerpc/kvm/Makefile
> >>>>>>>>+++ b/arch/powerpc/kvm/Makefile
> >>>>>>>>@@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
> >>>>>>>>  KVM := ../../../virt/kvm
> >>>>>>>>
> >>>>>>>>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> >>>>>>>>-		$(KVM)/eventfd.o $(KVM)/vfio.o
> >>>>>>>>+		$(KVM)/eventfd.o
> >>>>>>>
> >>>>>>>Please don't disable the VFIO device for the non-book3s case.  I added
> >>>>>>>it (even though it didn't do anything until now) so that libvirt
> >>>>>>>wouldn't choke when it finds it's not available.  Obviously the new
> >>>>>>>ioctl needs to be only for the right IOMMU setup, but the device
> >>>>>>>itself should be available always.
> >>>>>>
> >>>>>>Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
> >>>>>>
> >>>>>>
> >>>>>>>>  CFLAGS_e500_mmu.o := -I.
> >>>>>>>>  CFLAGS_e500_mmu_host.o := -I.
> >>>>>>>>@@ -87,6 +87,9 @@ endif
> >>>>>>>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
> >>>>>>>>  	book3s_xics.o
> >>>>>>>>
> >>>>>>>>+kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
> >>>>>>>>+	$(KVM)/vfio.o \
> >>>>>>>>+
> >>>>>>>>  kvm-book3s_64-module-objs += \
> >>>>>>>>  	$(KVM)/kvm_main.o \
> >>>>>>>>  	$(KVM)/eventfd.o \
> >>>>>>>>diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >>>>>>>>index 19aa59b..63f188d 100644
> >>>>>>>>--- a/arch/powerpc/kvm/powerpc.c
> >>>>>>>>+++ b/arch/powerpc/kvm/powerpc.c
> >>>>>>>>@@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm
> >>>>*kvm, long ext)
> >>>>>>>>  #ifdef CONFIG_PPC_BOOK3S_64
> >>>>>>>>  	case KVM_CAP_SPAPR_TCE:
> >>>>>>>>  	case KVM_CAP_SPAPR_TCE_64:
> >>>>>>>>+	case KVM_CAP_SPAPR_TCE_VFIO:
> >>>>>>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
> >>>>>>>>  	case KVM_CAP_PPC_RTAS:
> >>>>>>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >>>>>>>>diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>>>>>>index 080ffbf..f1abbea 100644
> >>>>>>>>--- a/include/uapi/linux/kvm.h
> >>>>>>>>+++ b/include/uapi/linux/kvm.h
> >>>>>>>>@@ -1056,6 +1056,7 @@ struct kvm_device_attr {
> >>>>>>>>  #define  KVM_DEV_VFIO_GROUP			1
> >>>>>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>>>>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >>>>>>>>+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
> >>>>>>>>
> >>>>>>>>  enum kvm_device_type {
> >>>>>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >>>>>>>>@@ -1075,6 +1076,14 @@ enum kvm_device_type {
> >>>>>>>>  	KVM_DEV_TYPE_MAX,
> >>>>>>>>  };
> >>>>>>>>
> >>>>>>>>+struct kvm_vfio_spapr_tce_liobn {
> >>>>>>>>+	__u32	argsz;
> >>>>>>>>+	__s32	fd;
> >>>>>>>>+	__u32	liobn;
> >>>>>>>>+	__u8	pad[4];
> >>>>>>>>+	__u64	start_addr;
> >>>>>>>>+};
> >>>>>>>>+
> >>>>>>>>  /*
> >>>>>>>>   * ioctls for VM fds
> >>>>>>>>   */
> >>>>>>>>diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >>>>>>>>index 1dd087d..87c771e 100644
> >>>>>>>>--- a/virt/kvm/vfio.c
> >>>>>>>>+++ b/virt/kvm/vfio.c
> >>>>>>>>@@ -20,6 +20,10 @@
> >>>>>>>>  #include <linux/vfio.h>
> >>>>>>>>  #include "vfio.h"
> >>>>>>>>
> >>>>>>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>+#include <asm/kvm_ppc.h>
> >>>>>>>>+#endif
> >>>>>>>>+
> >>>>>>>>  struct kvm_vfio_group {
> >>>>>>>>  	struct list_head node;
> >>>>>>>>  	struct vfio_group *vfio_group;
> >>>>>>>>@@ -60,6 +64,22 @@ static void
> >>>>kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >>>>>>>>  	symbol_put(vfio_group_put_external_user);
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>>+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >>>>>>>>+{
> >>>>>>>>+	int (*fn)(struct vfio_group *);
> >>>>>>>>+	int ret = -1;
> >>>>>>>
> >>>>>>>Should this be -ESOMETHING?
> >>>>>>>
> >>>>>>>>+	fn = symbol_get(vfio_external_user_iommu_id);
> >>>>>>>>+	if (!fn)
> >>>>>>>>+		return ret;
> >>>>>>>>+
> >>>>>>>>+	ret = fn(vfio_group);
> >>>>>>>>+
> >>>>>>>>+	symbol_put(vfio_external_user_iommu_id);
> >>>>>>>>+
> >>>>>>>>+	return ret;
> >>>>>>>>+}
> >>>>>>>>+
> >>>>>>>>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>>>>>>>  {
> >>>>>>>>  	long (*fn)(struct vfio_group *, unsigned long);
> >>>>>>>>@@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct
> >>>>kvm_device *dev)
> >>>>>>>>  	mutex_unlock(&kv->lock);
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>+static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >>>>>>>>+		struct vfio_group *vfio_group)
> >>>>>>>
> >>>>>>>
> >>>>>>>Shouldn't this go in the same patch that introduced the attach
> >>>>>>>function?
> >>>>>>
> >>>>>>Having less patches which touch different maintainers areas is better. I
> >>>>>>cannot avoid touching both PPC KVM and VFIO in this patch but I can in
> >>>>>>"[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
> >>>>>>table".
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>>+{
> >>>>>>>>+	int group_id;
> >>>>>>>>+	struct iommu_group *grp;
> >>>>>>>>+
> >>>>>>>>+	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> >>>>>>>>+	grp = iommu_group_get_by_id(group_id);
> >>>>>>>>+	if (grp) {
> >>>>>>>>+		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> >>>>>>>>+		iommu_group_put(grp);
> >>>>>>>>+	}
> >>>>>>>>+}
> >>>>>>>>+#endif
> >>>>>>>>+
> >>>>>>>>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>>>>>  {
> >>>>>>>>  	struct kvm_vfio *kv = dev->private;
> >>>>>>>>@@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device
> >>>>*dev, long attr, u64 arg)
> >>>>>>>>  				continue;
> >>>>>>>>
> >>>>>>>>  			list_del(&kvg->node);
> >>>>>>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>
> >>>>>>>Better to make a no-op version of the call than have to #ifdef at the
> >>>>>>>callsite.
> >>>>>>
> >>>>>>It is questionable. A x86 reader may decide that
> >>>>>>KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
> >>>>>>confused.
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>>+			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
> >>>>>>>>+					kvg->vfio_group);
> >>>>>>>>+#endif
> >>>>>>>>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>>>>>>>  			kfree(kvg);
> >>>>>>>>  			ret = 0;
> >>>>>>>>@@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device
> >>>>*dev, long attr, u64 arg)
> >>>>>>>>  		kvm_vfio_update_coherency(dev);
> >>>>>>>>
> >>>>>>>>  		return ret;
> >>>>>>>>+
> >>>>>>>>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>+	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
> >>>>>>>>+		struct kvm_vfio_spapr_tce_liobn param;
> >>>>>>>>+		unsigned long minsz;
> >>>>>>>>+		struct kvm_vfio *kv = dev->private;
> >>>>>>>>+		struct vfio_group *vfio_group;
> >>>>>>>>+		struct kvm_vfio_group *kvg;
> >>>>>>>>+		struct fd f;
> >>>>>>>>+
> >>>>>>>>+		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
> >>>>>>>>+				start_addr);
> >>>>>>>>+
> >>>>>>>>+		if (copy_from_user(&param, (void __user *)arg, minsz))
> >>>>>>>>+			return -EFAULT;
> >>>>>>>>+
> >>>>>>>>+		if (param.argsz < minsz)
> >>>>>>>>+			return -EINVAL;
> >>>>>>>>+
> >>>>>>>>+		f = fdget(param.fd);
> >>>>>>>>+		if (!f.file)
> >>>>>>>>+			return -EBADF;
> >>>>>>>>+
> >>>>>>>>+		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >>>>>>>>+		fdput(f);
> >>>>>>>>+
> >>>>>>>>+		if (IS_ERR(vfio_group))
> >>>>>>>>+			return PTR_ERR(vfio_group);
> >>>>>>>>+
> >>>>>>>>+		ret = -ENOENT;
> >>>>>>>
> >>>>>>>Shouldn't there be some runtime test for the type of the IOMMU?  It's
> >>>>>>>possible a kernel could be built for a platform supporting multiple
> >>>>>>>IOMMU types.
> >>>>>>
> >>>>>>Well, may make sense but I do not know to test that. The IOMMU type is a
> >>>>>>VFIO container property, not a group property and here (KVM) we only have
> >>>>>>groups.
> >>>>>
> >>>>>Which, as mentioned previously, is broken.
> >>>>
> >>>>Which I am failing to follow you on this.
> >>>>
> >>>>What I am trying to achieve here is pretty much referencing a group so it
> >>>>cannot be reused. Plus LIOBNs.
> >>>
> >>>"Plus LIOBNs" is not a trivial change.  You are establishing a linkage
> >>>from LIOBNs to groups.  But that doesn't make sense; if mapping in one
> >>>(guest) LIOBN affects a group it must affect all groups in the
> >>>container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
> >>>to group.
> >>
> >>I can see your point but i don't see how to proceed now, I'm totally stuck.
> >>Pass container fd and then implement new api to lock containers somehow and
> >
> >I'm not really understanding what the question is about locking containers.
> >
> >>enumerate groups when updating TCE table (including real mode)?
> >
> >Why do you need to enumerate groups?  The groups within the container
> >share a TCE table, so can't you just update that once?
> 
> Well, they share a TCE table but they do not share TCE Kill (TCE cache
> invalidate) register address, it is still per PE but this does not matter
> here (pnv_pci_link_table_and_group() does that), just mentioned to complete
> the picture.

True, you'll need to enumerate the groups for invalidates.  But you
need that already, right.

> >>Plus new API when we remove a group from a container as the result of guest
> >>PCI hot unplug?
> >
> >I assume you mean a kernel internal API, since it shouldn't need
> >anything else visible to userspace.  Won't this happen naturally?
> >When the group is removed from the container, it will get its own TCE
> >table instead of the previously shared one.
> >
> >>>>Passing a container fd does not make much
> >>>>sense here as the VFIO device would walk through groups, reference them and
> >>>>that is it, there is no locking on VFIO containters and so far there was no
> >>>>need to teach KVM about containers.
> >>>>
> >>>>What do I miss now?
> >>>
> >>>Referencing the groups is essentially just a useful side effect.  The
> >>>important functionality is informing VFIO of the guest LIOBNs; and
> >>>LIOBNs map to containers, not groups.
> >>
> >>No. One liobn maps to one KVM-allocated TCE table, not a container. There
> >>can be one or many or none containers per liobn.
> >
> >Ah, true.
> 
> So I need to add new kernel API for KVM to get table(s) from VFIO
> container(s). And invent some locking mechanism to prevent table(s) (or
> associated container(s)) from going away, like
> vfio_group_get_external_user/vfio_group_put_external_user but for
> containers. Correct?

Well, the container is attached to an fd, so if you get a reference on
the file* that should do it.

Alexey Kardashevskiy April 8, 2016, 9:13 a.m. UTC | #8

On 03/09/2016 04:45 PM, David Gibson wrote:

>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
>> index 7f7b6d8..71f577c 100644
>> --- a/arch/powerpc/kvm/Makefile
>> +++ b/arch/powerpc/kvm/Makefile
>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
>>   KVM := ../../../virt/kvm
>>
>>   common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
>> +		$(KVM)/eventfd.o
>
> Please don't disable the VFIO device for the non-book3s case.  I added
> it (even though it didn't do anything until now) so that libvirt
> wouldn't choke when it finds it's not available.  Obviously the new
> ioctl needs to be only for the right IOMMU setup, but the device
> itself should be available always.


After having a closer look, the statement above does not enable VFIO KVM 
device on book3s but does for everything else:


common-objs-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o
[...]
kvm-e500-objs := \
         $(common-objs-y) \
[...]
kvm-objs-$(CONFIG_KVM_E500V2) := $(kvm-e500-objs)
[...]
kvm-e500mc-objs := \
         $(common-objs-y) \
[...]
kvm-objs-$(CONFIG_KVM_E500MC) := $(kvm-e500mc-objs)
[...]
kvm-book3s_32-objs := \
         $(common-objs-y) \
[...]
kvm-objs-$(CONFIG_KVM_BOOK3S_32) := $(kvm-book3s_32-objs)


This is becaise CONFIG_KVM_BOOK3S_64 does not use "common-objs-y":

kvm-objs-$(CONFIG_KVM_BOOK3S_64) := $(kvm-book3s_64-module-objs)


So I will keep vfio.o in the "common-objs-y" list and add:

+kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
+	$(KVM)/vfio.o

David Gibson April 11, 2016, 3:36 a.m. UTC | #9

On Fri, Apr 08, 2016 at 07:13:06PM +1000, Alexey Kardashevskiy wrote:
> On 03/09/2016 04:45 PM, David Gibson wrote:
> 
> >>diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> >>index 7f7b6d8..71f577c 100644
> >>--- a/arch/powerpc/kvm/Makefile
> >>+++ b/arch/powerpc/kvm/Makefile
> >>@@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
> >>  KVM := ../../../virt/kvm
> >>
> >>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> >>-		$(KVM)/eventfd.o $(KVM)/vfio.o
> >>+		$(KVM)/eventfd.o
> >
> >Please don't disable the VFIO device for the non-book3s case.  I added
> >it (even though it didn't do anything until now) so that libvirt
> >wouldn't choke when it finds it's not available.  Obviously the new
> >ioctl needs to be only for the right IOMMU setup, but the device
> >itself should be available always.
> 
> 
> After having a closer look, the statement above does not enable VFIO KVM
> device on book3s but does for everything else:
> 
> 
> common-objs-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o
> [...]
> kvm-e500-objs := \
>         $(common-objs-y) \
> [...]
> kvm-objs-$(CONFIG_KVM_E500V2) := $(kvm-e500-objs)
> [...]
> kvm-e500mc-objs := \
>         $(common-objs-y) \
> [...]
> kvm-objs-$(CONFIG_KVM_E500MC) := $(kvm-e500mc-objs)
> [...]
> kvm-book3s_32-objs := \
>         $(common-objs-y) \
> [...]
> kvm-objs-$(CONFIG_KVM_BOOK3S_32) := $(kvm-book3s_32-objs)
> 
> 
> This is becaise CONFIG_KVM_BOOK3S_64 does not use "common-objs-y":

Oh, good grief.

> kvm-objs-$(CONFIG_KVM_BOOK3S_64) := $(kvm-book3s_64-module-objs)
> 
> 
> So I will keep vfio.o in the "common-objs-y" list and add:
> 
> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
> +	$(KVM)/vfio.o

Ok.

Alexey Kardashevskiy June 9, 2016, 6:47 a.m. UTC | #10

On 23/03/16 14:03, David Gibson wrote:
> On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote:
>> Uff, lost cc: list. Added back. Some comments below.
>>
>>
>> On 03/21/2016 04:19 PM, David Gibson wrote:
>>> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote:
>>>> On March 15, 2016 17:29:26 David Gibson <david@gibson.dropbear.id.au> wrote:
>>>>
>>>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
>>>>>> On 03/10/2016 04:21 PM, David Gibson wrote:
>>>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote:
>>>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
>>>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
>>>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and
>>>>>>>>>> linked to IOMMU groups by the user space.
>>>>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we need
>>>>>>>>>> to tell KVM the information about the LIOBN-to-group mapping.
>>>>>>>>>>
>>>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
>>>>>>>>>> is added which accepts:
>>>>>>>>>> - a VFIO group fd and IO base address to find the actual hardware
>>>>>>>>>> TCE table;
>>>>>>>>>> - a LIOBN to assign to the found table.
>>>>>>>>>>
>>>>>>>>>> Before notifying KVM about new link, this check the group for being
>>>>>>>>>> registered with KVM device in order to release them at unexpected KVM
>>>>>>>>>> finish.
>>>>>>>>>>
>>>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>>>>>>>>>> space.
>>>>>>>>>>
>>>>>>>>>> While we are here, this also fixes VFIO KVM device compiling to let it
>>>>>>>>>> link to a KVM module.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>> ---
>>>>>>>>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
>>>>>>>>>>  arch/powerpc/kvm/Kconfig                   |   1 +
>>>>>>>>>>  arch/powerpc/kvm/Makefile                  |   5 +-
>>>>>>>>>>  arch/powerpc/kvm/powerpc.c                 |   1 +
>>>>>>>>>>  include/uapi/linux/kvm.h                   |   9 +++
>>>>>>>>>>  virt/kvm/vfio.c                            | 106
>>>>>> +++++++++++++++++++++++++++++
>>>>>>>>>>  6 files changed, 140 insertions(+), 3 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt
>>>>>> b/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>> index ef51740..c0d3eb7 100644
>>>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>> @@ -16,7 +16,24 @@ Groups:
>>>>>>>>>>
>>>>>>>>>>  KVM_DEV_VFIO_GROUP attributes:
>>>>>>>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>> +	for the VFIO group.
>>>>>>>>>
>>>>>>>>> AFAICT these changes are accurate for VFIO as it is already, in which
>>>>>>>>> case it might be clearer to put them in a separate patch.
>>>>>>>>>
>>>>>>>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device
>>>>>> tracking
>>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>> +	for the VFIO group.
>>>>>>>>>>
>>>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>> -for the VFIO group.
>>>>>>>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
>>>>>>>>>> +	kvm_device_attr.addr points to a struct:
>>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn {
>>>>>>>>>> +			__u32	argsz;
>>>>>>>>>> +			__s32	fd;
>>>>>>>>>> +			__u32	liobn;
>>>>>>>>>> +			__u8	pad[4];
>>>>>>>>>> +			__u64	start_addr;
>>>>>>>>>> +		};
>>>>>>>>>> +		where
>>>>>>>>>> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
>>>>>>>>>> +		@fd is a file descriptor for a VFIO group;
>>>>>>>>>> +		@liobn is a logical bus id to be associated with the group;
>>>>>>>>>> +		@start_addr is a DMA window offset on the IO (PCI) bus
>>>>>>>>>
>>>>>>>>> For the cause of DDW and multiple windows, I'm assuming you can call
>>>>>>>>> this multiple times with different LIOBNs and the same IOMMU group?
>>>>>>>>
>>>>>>>>
>>>>>>>> Yes. It is called twice per each group (when DDW is activated) - for 32bit
>>>>>>>> and 64bit windows, this is why @start_addr is there.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>>>>>>>>>> index 1059846..dfa3488 100644
>>>>>>>>>> --- a/arch/powerpc/kvm/Kconfig
>>>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig
>>>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
>>>>>>>>>>  	select KVM
>>>>>>>>>>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
>>>>>>>>>>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
>>>>>>>>>> +	select KVM_VFIO if VFIO
>>>>>>>>>>  	---help---
>>>>>>>>>>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
>>>>>>>>>>  	  in virtual machines on book3s_64 host processors.
>>>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
>>>>>>>>>> index 7f7b6d8..71f577c 100644
>>>>>>>>>> --- a/arch/powerpc/kvm/Makefile
>>>>>>>>>> +++ b/arch/powerpc/kvm/Makefile
>>>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
>>>>>>>>>>  KVM := ../../../virt/kvm
>>>>>>>>>>
>>>>>>>>>>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
>>>>>>>>>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
>>>>>>>>>> +		$(KVM)/eventfd.o
>>>>>>>>>
>>>>>>>>> Please don't disable the VFIO device for the non-book3s case.  I added
>>>>>>>>> it (even though it didn't do anything until now) so that libvirt
>>>>>>>>> wouldn't choke when it finds it's not available.  Obviously the new
>>>>>>>>> ioctl needs to be only for the right IOMMU setup, but the device
>>>>>>>>> itself should be available always.
>>>>>>>>
>>>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
>>>>>>>>
>>>>>>>>
>>>>>>>>>>  CFLAGS_e500_mmu.o := -I.
>>>>>>>>>>  CFLAGS_e500_mmu_host.o := -I.
>>>>>>>>>> @@ -87,6 +87,9 @@ endif
>>>>>>>>>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>>>>>>>>>>  	book3s_xics.o
>>>>>>>>>>
>>>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
>>>>>>>>>> +	$(KVM)/vfio.o \
>>>>>>>>>> +
>>>>>>>>>>  kvm-book3s_64-module-objs += \
>>>>>>>>>>  	$(KVM)/kvm_main.o \
>>>>>>>>>>  	$(KVM)/eventfd.o \
>>>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>>>>>>>>> index 19aa59b..63f188d 100644
>>>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm
>>>>>> *kvm, long ext)
>>>>>>>>>>  #ifdef CONFIG_PPC_BOOK3S_64
>>>>>>>>>>  	case KVM_CAP_SPAPR_TCE:
>>>>>>>>>>  	case KVM_CAP_SPAPR_TCE_64:
>>>>>>>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>>>>>>>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
>>>>>>>>>>  	case KVM_CAP_PPC_RTAS:
>>>>>>>>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
>>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>>>>>>>> index 080ffbf..f1abbea 100644
>>>>>>>>>> --- a/include/uapi/linux/kvm.h
>>>>>>>>>> +++ b/include/uapi/linux/kvm.h
>>>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
>>>>>>>>>>  #define  KVM_DEV_VFIO_GROUP			1
>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>>>>>>>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
>>>>>>>>>>
>>>>>>>>>>  enum kvm_device_type {
>>>>>>>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>>>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
>>>>>>>>>>  	KVM_DEV_TYPE_MAX,
>>>>>>>>>>  };
>>>>>>>>>>
>>>>>>>>>> +struct kvm_vfio_spapr_tce_liobn {
>>>>>>>>>> +	__u32	argsz;
>>>>>>>>>> +	__s32	fd;
>>>>>>>>>> +	__u32	liobn;
>>>>>>>>>> +	__u8	pad[4];
>>>>>>>>>> +	__u64	start_addr;
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>>  /*
>>>>>>>>>>   * ioctls for VM fds
>>>>>>>>>>   */
>>>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>>>>>>>>>> index 1dd087d..87c771e 100644
>>>>>>>>>> --- a/virt/kvm/vfio.c
>>>>>>>>>> +++ b/virt/kvm/vfio.c
>>>>>>>>>> @@ -20,6 +20,10 @@
>>>>>>>>>>  #include <linux/vfio.h>
>>>>>>>>>>  #include "vfio.h"
>>>>>>>>>>
>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>> +#include <asm/kvm_ppc.h>
>>>>>>>>>> +#endif
>>>>>>>>>> +
>>>>>>>>>>  struct kvm_vfio_group {
>>>>>>>>>>  	struct list_head node;
>>>>>>>>>>  	struct vfio_group *vfio_group;
>>>>>>>>>> @@ -60,6 +64,22 @@ static void
>>>>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>>>>>>>>>>  	symbol_put(vfio_group_put_external_user);
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>>>>>>>>>> +{
>>>>>>>>>> +	int (*fn)(struct vfio_group *);
>>>>>>>>>> +	int ret = -1;
>>>>>>>>>
>>>>>>>>> Should this be -ESOMETHING?
>>>>>>>>>
>>>>>>>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
>>>>>>>>>> +	if (!fn)
>>>>>>>>>> +		return ret;
>>>>>>>>>> +
>>>>>>>>>> +	ret = fn(vfio_group);
>>>>>>>>>> +
>>>>>>>>>> +	symbol_put(vfio_external_user_iommu_id);
>>>>>>>>>> +
>>>>>>>>>> +	return ret;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>>>>>>>>>>  {
>>>>>>>>>>  	long (*fn)(struct vfio_group *, unsigned long);
>>>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct
>>>>>> kvm_device *dev)
>>>>>>>>>>  	mutex_unlock(&kv->lock);
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
>>>>>>>>>> +		struct vfio_group *vfio_group)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Shouldn't this go in the same patch that introduced the attach
>>>>>>>>> function?
>>>>>>>>
>>>>>>>> Having less patches which touch different maintainers areas is better. I
>>>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in
>>>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
>>>>>>>> table".
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +{
>>>>>>>>>> +	int group_id;
>>>>>>>>>> +	struct iommu_group *grp;
>>>>>>>>>> +
>>>>>>>>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
>>>>>>>>>> +	grp = iommu_group_get_by_id(group_id);
>>>>>>>>>> +	if (grp) {
>>>>>>>>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
>>>>>>>>>> +		iommu_group_put(grp);
>>>>>>>>>> +	}
>>>>>>>>>> +}
>>>>>>>>>> +#endif
>>>>>>>>>> +
>>>>>>>>>>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>>>>>>>  {
>>>>>>>>>>  	struct kvm_vfio *kv = dev->private;
>>>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device
>>>>>> *dev, long attr, u64 arg)
>>>>>>>>>>  				continue;
>>>>>>>>>>
>>>>>>>>>>  			list_del(&kvg->node);
>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>
>>>>>>>>> Better to make a no-op version of the call than have to #ifdef at the
>>>>>>>>> callsite.
>>>>>>>>
>>>>>>>> It is questionable. A x86 reader may decide that
>>>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
>>>>>>>> confused.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
>>>>>>>>>> +					kvg->vfio_group);
>>>>>>>>>> +#endif
>>>>>>>>>>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
>>>>>>>>>>  			kfree(kvg);
>>>>>>>>>>  			ret = 0;
>>>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device
>>>>>> *dev, long attr, u64 arg)
>>>>>>>>>>  		kvm_vfio_update_coherency(dev);
>>>>>>>>>>
>>>>>>>>>>  		return ret;
>>>>>>>>>> +
>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
>>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn param;
>>>>>>>>>> +		unsigned long minsz;
>>>>>>>>>> +		struct kvm_vfio *kv = dev->private;
>>>>>>>>>> +		struct vfio_group *vfio_group;
>>>>>>>>>> +		struct kvm_vfio_group *kvg;
>>>>>>>>>> +		struct fd f;
>>>>>>>>>> +
>>>>>>>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
>>>>>>>>>> +				start_addr);
>>>>>>>>>> +
>>>>>>>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>>>>>>>>>> +			return -EFAULT;
>>>>>>>>>> +
>>>>>>>>>> +		if (param.argsz < minsz)
>>>>>>>>>> +			return -EINVAL;
>>>>>>>>>> +
>>>>>>>>>> +		f = fdget(param.fd);
>>>>>>>>>> +		if (!f.file)
>>>>>>>>>> +			return -EBADF;
>>>>>>>>>> +
>>>>>>>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>>>>>>>>>> +		fdput(f);
>>>>>>>>>> +
>>>>>>>>>> +		if (IS_ERR(vfio_group))
>>>>>>>>>> +			return PTR_ERR(vfio_group);
>>>>>>>>>> +
>>>>>>>>>> +		ret = -ENOENT;
>>>>>>>>>
>>>>>>>>> Shouldn't there be some runtime test for the type of the IOMMU?  It's
>>>>>>>>> possible a kernel could be built for a platform supporting multiple
>>>>>>>>> IOMMU types.
>>>>>>>>
>>>>>>>> Well, may make sense but I do not know to test that. The IOMMU type is a
>>>>>>>> VFIO container property, not a group property and here (KVM) we only have
>>>>>>>> groups.
>>>>>>>
>>>>>>> Which, as mentioned previously, is broken.
>>>>>>
>>>>>> Which I am failing to follow you on this.
>>>>>>
>>>>>> What I am trying to achieve here is pretty much referencing a group so it
>>>>>> cannot be reused. Plus LIOBNs.
>>>>>
>>>>> "Plus LIOBNs" is not a trivial change.  You are establishing a linkage
>>>> >from LIOBNs to groups.  But that doesn't make sense; if mapping in one
>>>>> (guest) LIOBN affects a group it must affect all groups in the
>>>>> container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
>>>>> to group.
>>>>
>>>> I can see your point but i don't see how to proceed now, I'm totally stuck.
>>>> Pass container fd and then implement new api to lock containers somehow and
>>>
>>> I'm not really understanding what the question is about locking containers.
>>>
>>>> enumerate groups when updating TCE table (including real mode)?
>>>
>>> Why do you need to enumerate groups?  The groups within the container
>>> share a TCE table, so can't you just update that once?
>>
>> Well, they share a TCE table but they do not share TCE Kill (TCE cache
>> invalidate) register address, it is still per PE but this does not matter
>> here (pnv_pci_link_table_and_group() does that), just mentioned to complete
>> the picture.
> 
> True, you'll need to enumerate the groups for invalidates.  But you
> need that already, right.
> 
>>>> Plus new API when we remove a group from a container as the result of guest
>>>> PCI hot unplug?
>>>
>>> I assume you mean a kernel internal API, since it shouldn't need
>>> anything else visible to userspace.  Won't this happen naturally?
>>> When the group is removed from the container, it will get its own TCE
>>> table instead of the previously shared one.
>>>
>>>>>> Passing a container fd does not make much
>>>>>> sense here as the VFIO device would walk through groups, reference them and
>>>>>> that is it, there is no locking on VFIO containters and so far there was no
>>>>>> need to teach KVM about containers.
>>>>>>
>>>>>> What do I miss now?
>>>>>
>>>>> Referencing the groups is essentially just a useful side effect.  The
>>>>> important functionality is informing VFIO of the guest LIOBNs; and
>>>>> LIOBNs map to containers, not groups.
>>>>
>>>> No. One liobn maps to one KVM-allocated TCE table, not a container. There
>>>> can be one or many or none containers per liobn.
>>>
>>> Ah, true.
>>
>> So I need to add new kernel API for KVM to get table(s) from VFIO
>> container(s). And invent some locking mechanism to prevent table(s) (or
>> associated container(s)) from going away, like
>> vfio_group_get_external_user/vfio_group_put_external_user but for
>> containers. Correct?
> 
> Well, the container is attached to an fd, so if you get a reference on
> the file* that should do it.

I am still trying to think of how to implement this suggestion.

I need a way to tell KVM about IOMMU groups. vfio-pci driver is not right
interface as it knows nothing about KVM. There is VFIO-KVM device but it
does not have idea about containers.

So I have to:

Wenever a container is created or removed, notify the VFIO-KVM device by
passing there a container fd. ok.

Then VFIO-KVM device needs to tell KVM about what iommu_table belongs to
what LIOBN so the realmode handlers could do the job. The real mode TCE
handlers get LIOBN, find a guest view table and update it. Now I want to
update the hardware table which is iommu_table attached to LIOBN.

I did pass an IOMMU group fd to VFIO-KVM device. You suggested a container fd.

Now VFIO-KVM device needs to extract iommu_table's from the container.
These iommu_table pointers are stored in "struct tce_container" which is
local to drivers/vfio/vfio_iommu_spapr_tce.c and not exported anyhow. So I
cannot export and use that.

The other way to go would be adding API to VFIO to enumerate IOMMU groups
in a container and use iommu_table pointers stored in iommu_table_group of
each group (in fact the very first group will be enough as multiple groups
in a container share the table). Adding vfio_container_get_groups() when
only first one is needed is quite tricky in terms of maintainers approvals.

So what would be the right course of action? Thanks.

David Gibson June 10, 2016, 6:50 a.m. UTC | #11

On Thu, Jun 09, 2016 at 04:47:59PM +1000, Alexey Kardashevskiy wrote:
> On 23/03/16 14:03, David Gibson wrote:
> > On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote:
> >> Uff, lost cc: list. Added back. Some comments below.
> >>
> >>
> >> On 03/21/2016 04:19 PM, David Gibson wrote:
> >>> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote:
> >>>> On March 15, 2016 17:29:26 David Gibson <david@gibson.dropbear.id.au> wrote:
> >>>>
> >>>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
> >>>>>> On 03/10/2016 04:21 PM, David Gibson wrote:
> >>>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote:
> >>>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
> >>>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
> >>>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and
> >>>>>>>>>> linked to IOMMU groups by the user space.
> >>>>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we need
> >>>>>>>>>> to tell KVM the information about the LIOBN-to-group mapping.
> >>>>>>>>>>
> >>>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
> >>>>>>>>>> is added which accepts:
> >>>>>>>>>> - a VFIO group fd and IO base address to find the actual hardware
> >>>>>>>>>> TCE table;
> >>>>>>>>>> - a LIOBN to assign to the found table.
> >>>>>>>>>>
> >>>>>>>>>> Before notifying KVM about new link, this check the group for being
> >>>>>>>>>> registered with KVM device in order to release them at unexpected KVM
> >>>>>>>>>> finish.
> >>>>>>>>>>
> >>>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >>>>>>>>>> space.
> >>>>>>>>>>
> >>>>>>>>>> While we are here, this also fixes VFIO KVM device compiling to let it
> >>>>>>>>>> link to a KVM module.
> >>>>>>>>>>
> >>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>>>>> ---
> >>>>>>>>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
> >>>>>>>>>>  arch/powerpc/kvm/Kconfig                   |   1 +
> >>>>>>>>>>  arch/powerpc/kvm/Makefile                  |   5 +-
> >>>>>>>>>>  arch/powerpc/kvm/powerpc.c                 |   1 +
> >>>>>>>>>>  include/uapi/linux/kvm.h                   |   9 +++
> >>>>>>>>>>  virt/kvm/vfio.c                            | 106
> >>>>>> +++++++++++++++++++++++++++++
> >>>>>>>>>>  6 files changed, 140 insertions(+), 3 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>> b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>>> index ef51740..c0d3eb7 100644
> >>>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>>> @@ -16,7 +16,24 @@ Groups:
> >>>>>>>>>>
> >>>>>>>>>>  KVM_DEV_VFIO_GROUP attributes:
> >>>>>>>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>>> +	for the VFIO group.
> >>>>>>>>>
> >>>>>>>>> AFAICT these changes are accurate for VFIO as it is already, in which
> >>>>>>>>> case it might be clearer to put them in a separate patch.
> >>>>>>>>>
> >>>>>>>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device
> >>>>>> tracking
> >>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>>> +	for the VFIO group.
> >>>>>>>>>>
> >>>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>>> -for the VFIO group.
> >>>>>>>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
> >>>>>>>>>> +	kvm_device_attr.addr points to a struct:
> >>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn {
> >>>>>>>>>> +			__u32	argsz;
> >>>>>>>>>> +			__s32	fd;
> >>>>>>>>>> +			__u32	liobn;
> >>>>>>>>>> +			__u8	pad[4];
> >>>>>>>>>> +			__u64	start_addr;
> >>>>>>>>>> +		};
> >>>>>>>>>> +		where
> >>>>>>>>>> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >>>>>>>>>> +		@fd is a file descriptor for a VFIO group;
> >>>>>>>>>> +		@liobn is a logical bus id to be associated with the group;
> >>>>>>>>>> +		@start_addr is a DMA window offset on the IO (PCI) bus
> >>>>>>>>>
> >>>>>>>>> For the cause of DDW and multiple windows, I'm assuming you can call
> >>>>>>>>> this multiple times with different LIOBNs and the same IOMMU group?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Yes. It is called twice per each group (when DDW is activated) - for 32bit
> >>>>>>>> and 64bit windows, this is why @start_addr is there.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> >>>>>>>>>> index 1059846..dfa3488 100644
> >>>>>>>>>> --- a/arch/powerpc/kvm/Kconfig
> >>>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig
> >>>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
> >>>>>>>>>>  	select KVM
> >>>>>>>>>>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
> >>>>>>>>>>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
> >>>>>>>>>> +	select KVM_VFIO if VFIO
> >>>>>>>>>>  	---help---
> >>>>>>>>>>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
> >>>>>>>>>>  	  in virtual machines on book3s_64 host processors.
> >>>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> >>>>>>>>>> index 7f7b6d8..71f577c 100644
> >>>>>>>>>> --- a/arch/powerpc/kvm/Makefile
> >>>>>>>>>> +++ b/arch/powerpc/kvm/Makefile
> >>>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
> >>>>>>>>>>  KVM := ../../../virt/kvm
> >>>>>>>>>>
> >>>>>>>>>>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> >>>>>>>>>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
> >>>>>>>>>> +		$(KVM)/eventfd.o
> >>>>>>>>>
> >>>>>>>>> Please don't disable the VFIO device for the non-book3s case.  I added
> >>>>>>>>> it (even though it didn't do anything until now) so that libvirt
> >>>>>>>>> wouldn't choke when it finds it's not available.  Obviously the new
> >>>>>>>>> ioctl needs to be only for the right IOMMU setup, but the device
> >>>>>>>>> itself should be available always.
> >>>>>>>>
> >>>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>>  CFLAGS_e500_mmu.o := -I.
> >>>>>>>>>>  CFLAGS_e500_mmu_host.o := -I.
> >>>>>>>>>> @@ -87,6 +87,9 @@ endif
> >>>>>>>>>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
> >>>>>>>>>>  	book3s_xics.o
> >>>>>>>>>>
> >>>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
> >>>>>>>>>> +	$(KVM)/vfio.o \
> >>>>>>>>>> +
> >>>>>>>>>>  kvm-book3s_64-module-objs += \
> >>>>>>>>>>  	$(KVM)/kvm_main.o \
> >>>>>>>>>>  	$(KVM)/eventfd.o \
> >>>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >>>>>>>>>> index 19aa59b..63f188d 100644
> >>>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c
> >>>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c
> >>>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm
> >>>>>> *kvm, long ext)
> >>>>>>>>>>  #ifdef CONFIG_PPC_BOOK3S_64
> >>>>>>>>>>  	case KVM_CAP_SPAPR_TCE:
> >>>>>>>>>>  	case KVM_CAP_SPAPR_TCE_64:
> >>>>>>>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
> >>>>>>>>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
> >>>>>>>>>>  	case KVM_CAP_PPC_RTAS:
> >>>>>>>>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>>>>>>>> index 080ffbf..f1abbea 100644
> >>>>>>>>>> --- a/include/uapi/linux/kvm.h
> >>>>>>>>>> +++ b/include/uapi/linux/kvm.h
> >>>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
> >>>>>>>>>>  #define  KVM_DEV_VFIO_GROUP			1
> >>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >>>>>>>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
> >>>>>>>>>>
> >>>>>>>>>>  enum kvm_device_type {
> >>>>>>>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >>>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
> >>>>>>>>>>  	KVM_DEV_TYPE_MAX,
> >>>>>>>>>>  };
> >>>>>>>>>>
> >>>>>>>>>> +struct kvm_vfio_spapr_tce_liobn {
> >>>>>>>>>> +	__u32	argsz;
> >>>>>>>>>> +	__s32	fd;
> >>>>>>>>>> +	__u32	liobn;
> >>>>>>>>>> +	__u8	pad[4];
> >>>>>>>>>> +	__u64	start_addr;
> >>>>>>>>>> +};
> >>>>>>>>>> +
> >>>>>>>>>>  /*
> >>>>>>>>>>   * ioctls for VM fds
> >>>>>>>>>>   */
> >>>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >>>>>>>>>> index 1dd087d..87c771e 100644
> >>>>>>>>>> --- a/virt/kvm/vfio.c
> >>>>>>>>>> +++ b/virt/kvm/vfio.c
> >>>>>>>>>> @@ -20,6 +20,10 @@
> >>>>>>>>>>  #include <linux/vfio.h>
> >>>>>>>>>>  #include "vfio.h"
> >>>>>>>>>>
> >>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>> +#include <asm/kvm_ppc.h>
> >>>>>>>>>> +#endif
> >>>>>>>>>> +
> >>>>>>>>>>  struct kvm_vfio_group {
> >>>>>>>>>>  	struct list_head node;
> >>>>>>>>>>  	struct vfio_group *vfio_group;
> >>>>>>>>>> @@ -60,6 +64,22 @@ static void
> >>>>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >>>>>>>>>>  	symbol_put(vfio_group_put_external_user);
> >>>>>>>>>>  }
> >>>>>>>>>>
> >>>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >>>>>>>>>> +{
> >>>>>>>>>> +	int (*fn)(struct vfio_group *);
> >>>>>>>>>> +	int ret = -1;
> >>>>>>>>>
> >>>>>>>>> Should this be -ESOMETHING?
> >>>>>>>>>
> >>>>>>>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
> >>>>>>>>>> +	if (!fn)
> >>>>>>>>>> +		return ret;
> >>>>>>>>>> +
> >>>>>>>>>> +	ret = fn(vfio_group);
> >>>>>>>>>> +
> >>>>>>>>>> +	symbol_put(vfio_external_user_iommu_id);
> >>>>>>>>>> +
> >>>>>>>>>> +	return ret;
> >>>>>>>>>> +}
> >>>>>>>>>> +
> >>>>>>>>>>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>>>>>>>>>  {
> >>>>>>>>>>  	long (*fn)(struct vfio_group *, unsigned long);
> >>>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct
> >>>>>> kvm_device *dev)
> >>>>>>>>>>  	mutex_unlock(&kv->lock);
> >>>>>>>>>>  }
> >>>>>>>>>>
> >>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >>>>>>>>>> +		struct vfio_group *vfio_group)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Shouldn't this go in the same patch that introduced the attach
> >>>>>>>>> function?
> >>>>>>>>
> >>>>>>>> Having less patches which touch different maintainers areas is better. I
> >>>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in
> >>>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
> >>>>>>>> table".
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> +{
> >>>>>>>>>> +	int group_id;
> >>>>>>>>>> +	struct iommu_group *grp;
> >>>>>>>>>> +
> >>>>>>>>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> >>>>>>>>>> +	grp = iommu_group_get_by_id(group_id);
> >>>>>>>>>> +	if (grp) {
> >>>>>>>>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> >>>>>>>>>> +		iommu_group_put(grp);
> >>>>>>>>>> +	}
> >>>>>>>>>> +}
> >>>>>>>>>> +#endif
> >>>>>>>>>> +
> >>>>>>>>>>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>>>>>>>  {
> >>>>>>>>>>  	struct kvm_vfio *kv = dev->private;
> >>>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device
> >>>>>> *dev, long attr, u64 arg)
> >>>>>>>>>>  				continue;
> >>>>>>>>>>
> >>>>>>>>>>  			list_del(&kvg->node);
> >>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>
> >>>>>>>>> Better to make a no-op version of the call than have to #ifdef at the
> >>>>>>>>> callsite.
> >>>>>>>>
> >>>>>>>> It is questionable. A x86 reader may decide that
> >>>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
> >>>>>>>> confused.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
> >>>>>>>>>> +					kvg->vfio_group);
> >>>>>>>>>> +#endif
> >>>>>>>>>>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>>>>>>>>>  			kfree(kvg);
> >>>>>>>>>>  			ret = 0;
> >>>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device
> >>>>>> *dev, long attr, u64 arg)
> >>>>>>>>>>  		kvm_vfio_update_coherency(dev);
> >>>>>>>>>>
> >>>>>>>>>>  		return ret;
> >>>>>>>>>> +
> >>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
> >>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn param;
> >>>>>>>>>> +		unsigned long minsz;
> >>>>>>>>>> +		struct kvm_vfio *kv = dev->private;
> >>>>>>>>>> +		struct vfio_group *vfio_group;
> >>>>>>>>>> +		struct kvm_vfio_group *kvg;
> >>>>>>>>>> +		struct fd f;
> >>>>>>>>>> +
> >>>>>>>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
> >>>>>>>>>> +				start_addr);
> >>>>>>>>>> +
> >>>>>>>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> >>>>>>>>>> +			return -EFAULT;
> >>>>>>>>>> +
> >>>>>>>>>> +		if (param.argsz < minsz)
> >>>>>>>>>> +			return -EINVAL;
> >>>>>>>>>> +
> >>>>>>>>>> +		f = fdget(param.fd);
> >>>>>>>>>> +		if (!f.file)
> >>>>>>>>>> +			return -EBADF;
> >>>>>>>>>> +
> >>>>>>>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >>>>>>>>>> +		fdput(f);
> >>>>>>>>>> +
> >>>>>>>>>> +		if (IS_ERR(vfio_group))
> >>>>>>>>>> +			return PTR_ERR(vfio_group);
> >>>>>>>>>> +
> >>>>>>>>>> +		ret = -ENOENT;
> >>>>>>>>>
> >>>>>>>>> Shouldn't there be some runtime test for the type of the IOMMU?  It's
> >>>>>>>>> possible a kernel could be built for a platform supporting multiple
> >>>>>>>>> IOMMU types.
> >>>>>>>>
> >>>>>>>> Well, may make sense but I do not know to test that. The IOMMU type is a
> >>>>>>>> VFIO container property, not a group property and here (KVM) we only have
> >>>>>>>> groups.
> >>>>>>>
> >>>>>>> Which, as mentioned previously, is broken.
> >>>>>>
> >>>>>> Which I am failing to follow you on this.
> >>>>>>
> >>>>>> What I am trying to achieve here is pretty much referencing a group so it
> >>>>>> cannot be reused. Plus LIOBNs.
> >>>>>
> >>>>> "Plus LIOBNs" is not a trivial change.  You are establishing a linkage
> >>>> >from LIOBNs to groups.  But that doesn't make sense; if mapping in one
> >>>>> (guest) LIOBN affects a group it must affect all groups in the
> >>>>> container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
> >>>>> to group.
> >>>>
> >>>> I can see your point but i don't see how to proceed now, I'm totally stuck.
> >>>> Pass container fd and then implement new api to lock containers somehow and
> >>>
> >>> I'm not really understanding what the question is about locking containers.
> >>>
> >>>> enumerate groups when updating TCE table (including real mode)?
> >>>
> >>> Why do you need to enumerate groups?  The groups within the container
> >>> share a TCE table, so can't you just update that once?
> >>
> >> Well, they share a TCE table but they do not share TCE Kill (TCE cache
> >> invalidate) register address, it is still per PE but this does not matter
> >> here (pnv_pci_link_table_and_group() does that), just mentioned to complete
> >> the picture.
> > 
> > True, you'll need to enumerate the groups for invalidates.  But you
> > need that already, right.
> > 
> >>>> Plus new API when we remove a group from a container as the result of guest
> >>>> PCI hot unplug?
> >>>
> >>> I assume you mean a kernel internal API, since it shouldn't need
> >>> anything else visible to userspace.  Won't this happen naturally?
> >>> When the group is removed from the container, it will get its own TCE
> >>> table instead of the previously shared one.
> >>>
> >>>>>> Passing a container fd does not make much
> >>>>>> sense here as the VFIO device would walk through groups, reference them and
> >>>>>> that is it, there is no locking on VFIO containters and so far there was no
> >>>>>> need to teach KVM about containers.
> >>>>>>
> >>>>>> What do I miss now?
> >>>>>
> >>>>> Referencing the groups is essentially just a useful side effect.  The
> >>>>> important functionality is informing VFIO of the guest LIOBNs; and
> >>>>> LIOBNs map to containers, not groups.
> >>>>
> >>>> No. One liobn maps to one KVM-allocated TCE table, not a container. There
> >>>> can be one or many or none containers per liobn.
> >>>
> >>> Ah, true.
> >>
> >> So I need to add new kernel API for KVM to get table(s) from VFIO
> >> container(s). And invent some locking mechanism to prevent table(s) (or
> >> associated container(s)) from going away, like
> >> vfio_group_get_external_user/vfio_group_put_external_user but for
> >> containers. Correct?
> > 
> > Well, the container is attached to an fd, so if you get a reference on
> > the file* that should do it.
> 
> I am still trying to think of how to implement this suggestion.
> 
> I need a way to tell KVM about IOMMU groups. vfio-pci driver is not right
> interface as it knows nothing about KVM. There is VFIO-KVM device but it
> does not have idea about containers.
> 
> So I have to:
> 
> Wenever a container is created or removed, notify the VFIO-KVM device by
> passing there a container fd. ok.

Actually, I don't think the vfio-kvm device is really useful here.  It
was designed as a hack for a particular problem on x86.  It certainly
could be extended to cover the information we need here, but I don't
think it's a particularly natural way of doing so.

The problem is that conveying the information from the vfio-kvm device
to the real mode H_PUT_TCE handler, which is what really needs it,
isn't particularly simpler than conveying that information from
anywhere else.

> Then VFIO-KVM device needs to tell KVM about what iommu_table belongs to
> what LIOBN so the realmode handlers could do the job. The real mode TCE
> handlers get LIOBN, find a guest view table and update it. Now I want to
> update the hardware table which is iommu_table attached to LIOBN.
> 
> I did pass an IOMMU group fd to VFIO-KVM device. You suggested a container fd.
> 
> Now VFIO-KVM device needs to extract iommu_table's from the container.
> These iommu_table pointers are stored in "struct tce_container" which is
> local to drivers/vfio/vfio_iommu_spapr_tce.c and not exported anyhow. So I
> cannot export and use that.
> 
> The other way to go would be adding API to VFIO to enumerate IOMMU groups
> in a container and use iommu_table pointers stored in iommu_table_group of
> each group (in fact the very first group will be enough as multiple groups
> in a container share the table). Adding vfio_container_get_groups() when
> only first one is needed is quite tricky in terms of maintainers approvals.
> 
> So what would be the right course of action? Thanks.

So, from the user side, you need to be able to bind a VFIO backend to
a particular guest IOMMU.  This suggests a new ioctl() used in
conjunction with KVM_CREATE_SPAPR_TCE.  Let's call it
KVM_SPAPR_TCE_BIND_VFIO.  You'd use KVM_CREATE_SPAPR_TCE to make the
kernel aware of a LIOBN in the first place, then use
KVM_SPAPR_TCE_BIND_VFIO to associate a VFIO container with that LIOBN.
So it would be a VM ioctl, taking a LIOBN and a container fd.  You'd
need a capability to go with it, and some way to unbind as well.

To implement that, the ioctl() would need to use a new vfio (kernel
internal) interface - which can be specific to only the spapr TCE
type.  That would take the container fd, and return the list of
iommu_tables in some form or other (or various error conditions,
obviously).

So, when qemu creates the PHB, it uses KVM_CREATE_SPAPR_TCE to inform
the kernel of the LIOBN.  When the VFIO device is attached to the PHB,
it uses KVM_SPAPR_TCE_BIND_VFIO to connect the VFIO container to the
LIOBN.  The ioctl() implementation uses the new special interface into
the spapr_tce vfio backend to get the list of iommu tables, which it
stores in some private format.  The H_PUT_TCE implementation uses that
stored list of iommu tables to translate H_PUT_TCEs from the guest
into actions on the host IOMMU tables.

And, yes, the special interface to the spapr TCE vfio back end is kind
of a hack.  That's what you get when you need to link to separate
kernel subsystems for performance reasons.

Alexey Kardashevskiy June 14, 2016, 3:30 a.m. UTC | #12

On 10/06/16 16:50, David Gibson wrote:
> On Thu, Jun 09, 2016 at 04:47:59PM +1000, Alexey Kardashevskiy wrote:
>> On 23/03/16 14:03, David Gibson wrote:
>>> On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote:
>>>> Uff, lost cc: list. Added back. Some comments below.
>>>>
>>>>
>>>> On 03/21/2016 04:19 PM, David Gibson wrote:
>>>>> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote:
>>>>>> On March 15, 2016 17:29:26 David Gibson <david@gibson.dropbear.id.au> wrote:
>>>>>>
>>>>>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
>>>>>>>> On 03/10/2016 04:21 PM, David Gibson wrote:
>>>>>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote:
>>>>>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
>>>>>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
>>>>>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and
>>>>>>>>>>>> linked to IOMMU groups by the user space.
>>>>>>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we need
>>>>>>>>>>>> to tell KVM the information about the LIOBN-to-group mapping.
>>>>>>>>>>>>
>>>>>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
>>>>>>>>>>>> is added which accepts:
>>>>>>>>>>>> - a VFIO group fd and IO base address to find the actual hardware
>>>>>>>>>>>> TCE table;
>>>>>>>>>>>> - a LIOBN to assign to the found table.
>>>>>>>>>>>>
>>>>>>>>>>>> Before notifying KVM about new link, this check the group for being
>>>>>>>>>>>> registered with KVM device in order to release them at unexpected KVM
>>>>>>>>>>>> finish.
>>>>>>>>>>>>
>>>>>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>>>>>>>>>>>> space.
>>>>>>>>>>>>
>>>>>>>>>>>> While we are here, this also fixes VFIO KVM device compiling to let it
>>>>>>>>>>>> link to a KVM module.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>>> ---
>>>>>>>>>>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
>>>>>>>>>>>>  arch/powerpc/kvm/Kconfig                   |   1 +
>>>>>>>>>>>>  arch/powerpc/kvm/Makefile                  |   5 +-
>>>>>>>>>>>>  arch/powerpc/kvm/powerpc.c                 |   1 +
>>>>>>>>>>>>  include/uapi/linux/kvm.h                   |   9 +++
>>>>>>>>>>>>  virt/kvm/vfio.c                            | 106
>>>>>>>> +++++++++++++++++++++++++++++
>>>>>>>>>>>>  6 files changed, 140 insertions(+), 3 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>> b/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>>>> index ef51740..c0d3eb7 100644
>>>>>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>>>> @@ -16,7 +16,24 @@ Groups:
>>>>>>>>>>>>
>>>>>>>>>>>>  KVM_DEV_VFIO_GROUP attributes:
>>>>>>>>>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>>>>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>>>> +	for the VFIO group.
>>>>>>>>>>>
>>>>>>>>>>> AFAICT these changes are accurate for VFIO as it is already, in which
>>>>>>>>>>> case it might be clearer to put them in a separate patch.
>>>>>>>>>>>
>>>>>>>>>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device
>>>>>>>> tracking
>>>>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>>>> +	for the VFIO group.
>>>>>>>>>>>>
>>>>>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>>>> -for the VFIO group.
>>>>>>>>>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
>>>>>>>>>>>> +	kvm_device_attr.addr points to a struct:
>>>>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn {
>>>>>>>>>>>> +			__u32	argsz;
>>>>>>>>>>>> +			__s32	fd;
>>>>>>>>>>>> +			__u32	liobn;
>>>>>>>>>>>> +			__u8	pad[4];
>>>>>>>>>>>> +			__u64	start_addr;
>>>>>>>>>>>> +		};
>>>>>>>>>>>> +		where
>>>>>>>>>>>> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
>>>>>>>>>>>> +		@fd is a file descriptor for a VFIO group;
>>>>>>>>>>>> +		@liobn is a logical bus id to be associated with the group;
>>>>>>>>>>>> +		@start_addr is a DMA window offset on the IO (PCI) bus
>>>>>>>>>>>
>>>>>>>>>>> For the cause of DDW and multiple windows, I'm assuming you can call
>>>>>>>>>>> this multiple times with different LIOBNs and the same IOMMU group?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes. It is called twice per each group (when DDW is activated) - for 32bit
>>>>>>>>>> and 64bit windows, this is why @start_addr is there.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>>>>>>>>>>>> index 1059846..dfa3488 100644
>>>>>>>>>>>> --- a/arch/powerpc/kvm/Kconfig
>>>>>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig
>>>>>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
>>>>>>>>>>>>  	select KVM
>>>>>>>>>>>>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
>>>>>>>>>>>>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
>>>>>>>>>>>> +	select KVM_VFIO if VFIO
>>>>>>>>>>>>  	---help---
>>>>>>>>>>>>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
>>>>>>>>>>>>  	  in virtual machines on book3s_64 host processors.
>>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
>>>>>>>>>>>> index 7f7b6d8..71f577c 100644
>>>>>>>>>>>> --- a/arch/powerpc/kvm/Makefile
>>>>>>>>>>>> +++ b/arch/powerpc/kvm/Makefile
>>>>>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
>>>>>>>>>>>>  KVM := ../../../virt/kvm
>>>>>>>>>>>>
>>>>>>>>>>>>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
>>>>>>>>>>>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
>>>>>>>>>>>> +		$(KVM)/eventfd.o
>>>>>>>>>>>
>>>>>>>>>>> Please don't disable the VFIO device for the non-book3s case.  I added
>>>>>>>>>>> it (even though it didn't do anything until now) so that libvirt
>>>>>>>>>>> wouldn't choke when it finds it's not available.  Obviously the new
>>>>>>>>>>> ioctl needs to be only for the right IOMMU setup, but the device
>>>>>>>>>>> itself should be available always.
>>>>>>>>>>
>>>>>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>  CFLAGS_e500_mmu.o := -I.
>>>>>>>>>>>>  CFLAGS_e500_mmu_host.o := -I.
>>>>>>>>>>>> @@ -87,6 +87,9 @@ endif
>>>>>>>>>>>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>>>>>>>>>>>>  	book3s_xics.o
>>>>>>>>>>>>
>>>>>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
>>>>>>>>>>>> +	$(KVM)/vfio.o \
>>>>>>>>>>>> +
>>>>>>>>>>>>  kvm-book3s_64-module-objs += \
>>>>>>>>>>>>  	$(KVM)/kvm_main.o \
>>>>>>>>>>>>  	$(KVM)/eventfd.o \
>>>>>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>>>>>>>>>>> index 19aa59b..63f188d 100644
>>>>>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>>>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>>>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm
>>>>>>>> *kvm, long ext)
>>>>>>>>>>>>  #ifdef CONFIG_PPC_BOOK3S_64
>>>>>>>>>>>>  	case KVM_CAP_SPAPR_TCE:
>>>>>>>>>>>>  	case KVM_CAP_SPAPR_TCE_64:
>>>>>>>>>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>>>>>>>>>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
>>>>>>>>>>>>  	case KVM_CAP_PPC_RTAS:
>>>>>>>>>>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
>>>>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>>>>>>>>>> index 080ffbf..f1abbea 100644
>>>>>>>>>>>> --- a/include/uapi/linux/kvm.h
>>>>>>>>>>>> +++ b/include/uapi/linux/kvm.h
>>>>>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
>>>>>>>>>>>>  #define  KVM_DEV_VFIO_GROUP			1
>>>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>>>>>>>>>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
>>>>>>>>>>>>
>>>>>>>>>>>>  enum kvm_device_type {
>>>>>>>>>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>>>>>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
>>>>>>>>>>>>  	KVM_DEV_TYPE_MAX,
>>>>>>>>>>>>  };
>>>>>>>>>>>>
>>>>>>>>>>>> +struct kvm_vfio_spapr_tce_liobn {
>>>>>>>>>>>> +	__u32	argsz;
>>>>>>>>>>>> +	__s32	fd;
>>>>>>>>>>>> +	__u32	liobn;
>>>>>>>>>>>> +	__u8	pad[4];
>>>>>>>>>>>> +	__u64	start_addr;
>>>>>>>>>>>> +};
>>>>>>>>>>>> +
>>>>>>>>>>>>  /*
>>>>>>>>>>>>   * ioctls for VM fds
>>>>>>>>>>>>   */
>>>>>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>>>>>>>>>>>> index 1dd087d..87c771e 100644
>>>>>>>>>>>> --- a/virt/kvm/vfio.c
>>>>>>>>>>>> +++ b/virt/kvm/vfio.c
>>>>>>>>>>>> @@ -20,6 +20,10 @@
>>>>>>>>>>>>  #include <linux/vfio.h>
>>>>>>>>>>>>  #include "vfio.h"
>>>>>>>>>>>>
>>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>>>> +#include <asm/kvm_ppc.h>
>>>>>>>>>>>> +#endif
>>>>>>>>>>>> +
>>>>>>>>>>>>  struct kvm_vfio_group {
>>>>>>>>>>>>  	struct list_head node;
>>>>>>>>>>>>  	struct vfio_group *vfio_group;
>>>>>>>>>>>> @@ -60,6 +64,22 @@ static void
>>>>>>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>>>>>>>>>>>>  	symbol_put(vfio_group_put_external_user);
>>>>>>>>>>>>  }
>>>>>>>>>>>>
>>>>>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +	int (*fn)(struct vfio_group *);
>>>>>>>>>>>> +	int ret = -1;
>>>>>>>>>>>
>>>>>>>>>>> Should this be -ESOMETHING?
>>>>>>>>>>>
>>>>>>>>>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
>>>>>>>>>>>> +	if (!fn)
>>>>>>>>>>>> +		return ret;
>>>>>>>>>>>> +
>>>>>>>>>>>> +	ret = fn(vfio_group);
>>>>>>>>>>>> +
>>>>>>>>>>>> +	symbol_put(vfio_external_user_iommu_id);
>>>>>>>>>>>> +
>>>>>>>>>>>> +	return ret;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>>>>>>>>>>>>  {
>>>>>>>>>>>>  	long (*fn)(struct vfio_group *, unsigned long);
>>>>>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct
>>>>>>>> kvm_device *dev)
>>>>>>>>>>>>  	mutex_unlock(&kv->lock);
>>>>>>>>>>>>  }
>>>>>>>>>>>>
>>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
>>>>>>>>>>>> +		struct vfio_group *vfio_group)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Shouldn't this go in the same patch that introduced the attach
>>>>>>>>>>> function?
>>>>>>>>>>
>>>>>>>>>> Having less patches which touch different maintainers areas is better. I
>>>>>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in
>>>>>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
>>>>>>>>>> table".
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> +{
>>>>>>>>>>>> +	int group_id;
>>>>>>>>>>>> +	struct iommu_group *grp;
>>>>>>>>>>>> +
>>>>>>>>>>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
>>>>>>>>>>>> +	grp = iommu_group_get_by_id(group_id);
>>>>>>>>>>>> +	if (grp) {
>>>>>>>>>>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
>>>>>>>>>>>> +		iommu_group_put(grp);
>>>>>>>>>>>> +	}
>>>>>>>>>>>> +}
>>>>>>>>>>>> +#endif
>>>>>>>>>>>> +
>>>>>>>>>>>>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>>>>>>>>>  {
>>>>>>>>>>>>  	struct kvm_vfio *kv = dev->private;
>>>>>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device
>>>>>>>> *dev, long attr, u64 arg)
>>>>>>>>>>>>  				continue;
>>>>>>>>>>>>
>>>>>>>>>>>>  			list_del(&kvg->node);
>>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>>>
>>>>>>>>>>> Better to make a no-op version of the call than have to #ifdef at the
>>>>>>>>>>> callsite.
>>>>>>>>>>
>>>>>>>>>> It is questionable. A x86 reader may decide that
>>>>>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
>>>>>>>>>> confused.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
>>>>>>>>>>>> +					kvg->vfio_group);
>>>>>>>>>>>> +#endif
>>>>>>>>>>>>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
>>>>>>>>>>>>  			kfree(kvg);
>>>>>>>>>>>>  			ret = 0;
>>>>>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device
>>>>>>>> *dev, long attr, u64 arg)
>>>>>>>>>>>>  		kvm_vfio_update_coherency(dev);
>>>>>>>>>>>>
>>>>>>>>>>>>  		return ret;
>>>>>>>>>>>> +
>>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
>>>>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn param;
>>>>>>>>>>>> +		unsigned long minsz;
>>>>>>>>>>>> +		struct kvm_vfio *kv = dev->private;
>>>>>>>>>>>> +		struct vfio_group *vfio_group;
>>>>>>>>>>>> +		struct kvm_vfio_group *kvg;
>>>>>>>>>>>> +		struct fd f;
>>>>>>>>>>>> +
>>>>>>>>>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
>>>>>>>>>>>> +				start_addr);
>>>>>>>>>>>> +
>>>>>>>>>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>>>>>>>>>>>> +			return -EFAULT;
>>>>>>>>>>>> +
>>>>>>>>>>>> +		if (param.argsz < minsz)
>>>>>>>>>>>> +			return -EINVAL;
>>>>>>>>>>>> +
>>>>>>>>>>>> +		f = fdget(param.fd);
>>>>>>>>>>>> +		if (!f.file)
>>>>>>>>>>>> +			return -EBADF;
>>>>>>>>>>>> +
>>>>>>>>>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>>>>>>>>>>>> +		fdput(f);
>>>>>>>>>>>> +
>>>>>>>>>>>> +		if (IS_ERR(vfio_group))
>>>>>>>>>>>> +			return PTR_ERR(vfio_group);
>>>>>>>>>>>> +
>>>>>>>>>>>> +		ret = -ENOENT;
>>>>>>>>>>>
>>>>>>>>>>> Shouldn't there be some runtime test for the type of the IOMMU?  It's
>>>>>>>>>>> possible a kernel could be built for a platform supporting multiple
>>>>>>>>>>> IOMMU types.
>>>>>>>>>>
>>>>>>>>>> Well, may make sense but I do not know to test that. The IOMMU type is a
>>>>>>>>>> VFIO container property, not a group property and here (KVM) we only have
>>>>>>>>>> groups.
>>>>>>>>>
>>>>>>>>> Which, as mentioned previously, is broken.
>>>>>>>>
>>>>>>>> Which I am failing to follow you on this.
>>>>>>>>
>>>>>>>> What I am trying to achieve here is pretty much referencing a group so it
>>>>>>>> cannot be reused. Plus LIOBNs.
>>>>>>>
>>>>>>> "Plus LIOBNs" is not a trivial change.  You are establishing a linkage
>>>>>> >from LIOBNs to groups.  But that doesn't make sense; if mapping in one
>>>>>>> (guest) LIOBN affects a group it must affect all groups in the
>>>>>>> container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
>>>>>>> to group.
>>>>>>
>>>>>> I can see your point but i don't see how to proceed now, I'm totally stuck.
>>>>>> Pass container fd and then implement new api to lock containers somehow and
>>>>>
>>>>> I'm not really understanding what the question is about locking containers.
>>>>>
>>>>>> enumerate groups when updating TCE table (including real mode)?
>>>>>
>>>>> Why do you need to enumerate groups?  The groups within the container
>>>>> share a TCE table, so can't you just update that once?
>>>>
>>>> Well, they share a TCE table but they do not share TCE Kill (TCE cache
>>>> invalidate) register address, it is still per PE but this does not matter
>>>> here (pnv_pci_link_table_and_group() does that), just mentioned to complete
>>>> the picture.
>>>
>>> True, you'll need to enumerate the groups for invalidates.  But you
>>> need that already, right.
>>>
>>>>>> Plus new API when we remove a group from a container as the result of guest
>>>>>> PCI hot unplug?
>>>>>
>>>>> I assume you mean a kernel internal API, since it shouldn't need
>>>>> anything else visible to userspace.  Won't this happen naturally?
>>>>> When the group is removed from the container, it will get its own TCE
>>>>> table instead of the previously shared one.
>>>>>
>>>>>>>> Passing a container fd does not make much
>>>>>>>> sense here as the VFIO device would walk through groups, reference them and
>>>>>>>> that is it, there is no locking on VFIO containters and so far there was no
>>>>>>>> need to teach KVM about containers.
>>>>>>>>
>>>>>>>> What do I miss now?
>>>>>>>
>>>>>>> Referencing the groups is essentially just a useful side effect.  The
>>>>>>> important functionality is informing VFIO of the guest LIOBNs; and
>>>>>>> LIOBNs map to containers, not groups.
>>>>>>
>>>>>> No. One liobn maps to one KVM-allocated TCE table, not a container. There
>>>>>> can be one or many or none containers per liobn.
>>>>>
>>>>> Ah, true.
>>>>
>>>> So I need to add new kernel API for KVM to get table(s) from VFIO
>>>> container(s). And invent some locking mechanism to prevent table(s) (or
>>>> associated container(s)) from going away, like
>>>> vfio_group_get_external_user/vfio_group_put_external_user but for
>>>> containers. Correct?
>>>
>>> Well, the container is attached to an fd, so if you get a reference on
>>> the file* that should do it.
>>
>> I am still trying to think of how to implement this suggestion.
>>
>> I need a way to tell KVM about IOMMU groups. vfio-pci driver is not right
>> interface as it knows nothing about KVM. There is VFIO-KVM device but it
>> does not have idea about containers.
>>
>> So I have to:
>>
>> Wenever a container is created or removed, notify the VFIO-KVM device by
>> passing there a container fd. ok.
> 
> Actually, I don't think the vfio-kvm device is really useful here.  It
> was designed as a hack for a particular problem on x86.  It certainly
> could be extended to cover the information we need here, but I don't
> think it's a particularly natural way of doing so.
> 
> The problem is that conveying the information from the vfio-kvm device
> to the real mode H_PUT_TCE handler, which is what really needs it,
> isn't particularly simpler than conveying that information from
> anywhere else.
> 
>> Then VFIO-KVM device needs to tell KVM about what iommu_table belongs to
>> what LIOBN so the realmode handlers could do the job. The real mode TCE
>> handlers get LIOBN, find a guest view table and update it. Now I want to
>> update the hardware table which is iommu_table attached to LIOBN.
>>
>> I did pass an IOMMU group fd to VFIO-KVM device. You suggested a container fd.
>>
>> Now VFIO-KVM device needs to extract iommu_table's from the container.
>> These iommu_table pointers are stored in "struct tce_container" which is
>> local to drivers/vfio/vfio_iommu_spapr_tce.c and not exported anyhow. So I
>> cannot export and use that.
>>
>> The other way to go would be adding API to VFIO to enumerate IOMMU groups
>> in a container and use iommu_table pointers stored in iommu_table_group of
>> each group (in fact the very first group will be enough as multiple groups
>> in a container share the table). Adding vfio_container_get_groups() when
>> only first one is needed is quite tricky in terms of maintainers approvals.
>>
>> So what would be the right course of action? Thanks.
> 
> So, from the user side, you need to be able to bind a VFIO backend to
> a particular guest IOMMU.  This suggests a new ioctl() used in
> conjunction with KVM_CREATE_SPAPR_TCE.  Let's call it
> KVM_SPAPR_TCE_BIND_VFIO.  You'd use KVM_CREATE_SPAPR_TCE to make the
> kernel aware of a LIOBN in the first place, then use
> KVM_SPAPR_TCE_BIND_VFIO to associate a VFIO container with that LIOBN.
> So it would be a VM ioctl, taking a LIOBN and a container fd.  You'd
> need a capability to go with it, and some way to unbind as well.


This is what I had in the first place some years ago. And after 5-6 reviews
I was told that there is a VFIO KVM and I should use it.


> To implement that, the ioctl() would need to use a new vfio (kernel
> internal) interface - which can be specific to only the spapr TCE
> type.  That would take the container fd, and return the list of
> iommu_tables in some form or other (or various error conditions,
> obviously).
> 
> So, when qemu creates the PHB, it uses KVM_CREATE_SPAPR_TCE to inform
> the kernel of the LIOBN.  When the VFIO device is attached to the PHB,
> it uses KVM_SPAPR_TCE_BIND_VFIO to connect the VFIO container to the
> LIOBN.  The ioctl() implementation uses the new special interface into
> the spapr_tce vfio backend to get the list of iommu tables, which it
> stores in some private format.

Getting just a list of IOMMU groups would do too. Pushing such API is a
problem, this is how I ended up with the current design.


> The H_PUT_TCE implementation uses that
> stored list of iommu tables to translate H_PUT_TCEs from the guest
> into actions on the host IOMMU tables.
> 
> And, yes, the special interface to the spapr TCE vfio back end is kind
> of a hack.  That's what you get when you need to link to separate
> kernel subsystems for performance reasons.

One can argue if it is a hack, how is this hack better that the existing
approach? :)

Alex, could you please comment on David's suggestion? Thanks!

David Gibson June 15, 2016, 4:43 a.m. UTC | #13

On Tue, Jun 14, 2016 at 01:30:53PM +1000, Alexey Kardashevskiy wrote:
> On 10/06/16 16:50, David Gibson wrote:
> > On Thu, Jun 09, 2016 at 04:47:59PM +1000, Alexey Kardashevskiy wrote:
> >> On 23/03/16 14:03, David Gibson wrote:
> >>> On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote:
> >>>> Uff, lost cc: list. Added back. Some comments below.
> >>>>
> >>>>
> >>>> On 03/21/2016 04:19 PM, David Gibson wrote:
> >>>>> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote:
> >>>>>> On March 15, 2016 17:29:26 David Gibson <david@gibson.dropbear.id.au> wrote:
> >>>>>>
> >>>>>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>> On 03/10/2016 04:21 PM, David Gibson wrote:
> >>>>>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote:
> >>>>>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
> >>>>>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
> >>>>>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and
> >>>>>>>>>>>> linked to IOMMU groups by the user space.
> >>>>>>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we need
> >>>>>>>>>>>> to tell KVM the information about the LIOBN-to-group mapping.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
> >>>>>>>>>>>> is added which accepts:
> >>>>>>>>>>>> - a VFIO group fd and IO base address to find the actual hardware
> >>>>>>>>>>>> TCE table;
> >>>>>>>>>>>> - a LIOBN to assign to the found table.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Before notifying KVM about new link, this check the group for being
> >>>>>>>>>>>> registered with KVM device in order to release them at unexpected KVM
> >>>>>>>>>>>> finish.
> >>>>>>>>>>>>
> >>>>>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >>>>>>>>>>>> space.
> >>>>>>>>>>>>
> >>>>>>>>>>>> While we are here, this also fixes VFIO KVM device compiling to let it
> >>>>>>>>>>>> link to a KVM module.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>>>>>>> ---
> >>>>>>>>>>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
> >>>>>>>>>>>>  arch/powerpc/kvm/Kconfig                   |   1 +
> >>>>>>>>>>>>  arch/powerpc/kvm/Makefile                  |   5 +-
> >>>>>>>>>>>>  arch/powerpc/kvm/powerpc.c                 |   1 +
> >>>>>>>>>>>>  include/uapi/linux/kvm.h                   |   9 +++
> >>>>>>>>>>>>  virt/kvm/vfio.c                            | 106
> >>>>>>>> +++++++++++++++++++++++++++++
> >>>>>>>>>>>>  6 files changed, 140 insertions(+), 3 deletions(-)
> >>>>>>>>>>>>
> >>>>>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>> b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>>>>> index ef51740..c0d3eb7 100644
> >>>>>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >>>>>>>>>>>> @@ -16,7 +16,24 @@ Groups:
> >>>>>>>>>>>>
> >>>>>>>>>>>>  KVM_DEV_VFIO_GROUP attributes:
> >>>>>>>>>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >>>>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>>>>> +	for the VFIO group.
> >>>>>>>>>>>
> >>>>>>>>>>> AFAICT these changes are accurate for VFIO as it is already, in which
> >>>>>>>>>>> case it might be clearer to put them in a separate patch.
> >>>>>>>>>>>
> >>>>>>>>>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device
> >>>>>>>> tracking
> >>>>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>>>>> +	for the VFIO group.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >>>>>>>>>>>> -for the VFIO group.
> >>>>>>>>>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
> >>>>>>>>>>>> +	kvm_device_attr.addr points to a struct:
> >>>>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn {
> >>>>>>>>>>>> +			__u32	argsz;
> >>>>>>>>>>>> +			__s32	fd;
> >>>>>>>>>>>> +			__u32	liobn;
> >>>>>>>>>>>> +			__u8	pad[4];
> >>>>>>>>>>>> +			__u64	start_addr;
> >>>>>>>>>>>> +		};
> >>>>>>>>>>>> +		where
> >>>>>>>>>>>> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
> >>>>>>>>>>>> +		@fd is a file descriptor for a VFIO group;
> >>>>>>>>>>>> +		@liobn is a logical bus id to be associated with the group;
> >>>>>>>>>>>> +		@start_addr is a DMA window offset on the IO (PCI) bus
> >>>>>>>>>>>
> >>>>>>>>>>> For the cause of DDW and multiple windows, I'm assuming you can call
> >>>>>>>>>>> this multiple times with different LIOBNs and the same IOMMU group?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Yes. It is called twice per each group (when DDW is activated) - for 32bit
> >>>>>>>>>> and 64bit windows, this is why @start_addr is there.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> >>>>>>>>>>>> index 1059846..dfa3488 100644
> >>>>>>>>>>>> --- a/arch/powerpc/kvm/Kconfig
> >>>>>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig
> >>>>>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
> >>>>>>>>>>>>  	select KVM
> >>>>>>>>>>>>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
> >>>>>>>>>>>>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
> >>>>>>>>>>>> +	select KVM_VFIO if VFIO
> >>>>>>>>>>>>  	---help---
> >>>>>>>>>>>>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
> >>>>>>>>>>>>  	  in virtual machines on book3s_64 host processors.
> >>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> >>>>>>>>>>>> index 7f7b6d8..71f577c 100644
> >>>>>>>>>>>> --- a/arch/powerpc/kvm/Makefile
> >>>>>>>>>>>> +++ b/arch/powerpc/kvm/Makefile
> >>>>>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
> >>>>>>>>>>>>  KVM := ../../../virt/kvm
> >>>>>>>>>>>>
> >>>>>>>>>>>>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> >>>>>>>>>>>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
> >>>>>>>>>>>> +		$(KVM)/eventfd.o
> >>>>>>>>>>>
> >>>>>>>>>>> Please don't disable the VFIO device for the non-book3s case.  I added
> >>>>>>>>>>> it (even though it didn't do anything until now) so that libvirt
> >>>>>>>>>>> wouldn't choke when it finds it's not available.  Obviously the new
> >>>>>>>>>>> ioctl needs to be only for the right IOMMU setup, but the device
> >>>>>>>>>>> itself should be available always.
> >>>>>>>>>>
> >>>>>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>>  CFLAGS_e500_mmu.o := -I.
> >>>>>>>>>>>>  CFLAGS_e500_mmu_host.o := -I.
> >>>>>>>>>>>> @@ -87,6 +87,9 @@ endif
> >>>>>>>>>>>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
> >>>>>>>>>>>>  	book3s_xics.o
> >>>>>>>>>>>>
> >>>>>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
> >>>>>>>>>>>> +	$(KVM)/vfio.o \
> >>>>>>>>>>>> +
> >>>>>>>>>>>>  kvm-book3s_64-module-objs += \
> >>>>>>>>>>>>  	$(KVM)/kvm_main.o \
> >>>>>>>>>>>>  	$(KVM)/eventfd.o \
> >>>>>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >>>>>>>>>>>> index 19aa59b..63f188d 100644
> >>>>>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c
> >>>>>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c
> >>>>>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm
> >>>>>>>> *kvm, long ext)
> >>>>>>>>>>>>  #ifdef CONFIG_PPC_BOOK3S_64
> >>>>>>>>>>>>  	case KVM_CAP_SPAPR_TCE:
> >>>>>>>>>>>>  	case KVM_CAP_SPAPR_TCE_64:
> >>>>>>>>>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
> >>>>>>>>>>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
> >>>>>>>>>>>>  	case KVM_CAP_PPC_RTAS:
> >>>>>>>>>>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
> >>>>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>>>>>>>>>> index 080ffbf..f1abbea 100644
> >>>>>>>>>>>> --- a/include/uapi/linux/kvm.h
> >>>>>>>>>>>> +++ b/include/uapi/linux/kvm.h
> >>>>>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
> >>>>>>>>>>>>  #define  KVM_DEV_VFIO_GROUP			1
> >>>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
> >>>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> >>>>>>>>>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
> >>>>>>>>>>>>
> >>>>>>>>>>>>  enum kvm_device_type {
> >>>>>>>>>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> >>>>>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
> >>>>>>>>>>>>  	KVM_DEV_TYPE_MAX,
> >>>>>>>>>>>>  };
> >>>>>>>>>>>>
> >>>>>>>>>>>> +struct kvm_vfio_spapr_tce_liobn {
> >>>>>>>>>>>> +	__u32	argsz;
> >>>>>>>>>>>> +	__s32	fd;
> >>>>>>>>>>>> +	__u32	liobn;
> >>>>>>>>>>>> +	__u8	pad[4];
> >>>>>>>>>>>> +	__u64	start_addr;
> >>>>>>>>>>>> +};
> >>>>>>>>>>>> +
> >>>>>>>>>>>>  /*
> >>>>>>>>>>>>   * ioctls for VM fds
> >>>>>>>>>>>>   */
> >>>>>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >>>>>>>>>>>> index 1dd087d..87c771e 100644
> >>>>>>>>>>>> --- a/virt/kvm/vfio.c
> >>>>>>>>>>>> +++ b/virt/kvm/vfio.c
> >>>>>>>>>>>> @@ -20,6 +20,10 @@
> >>>>>>>>>>>>  #include <linux/vfio.h>
> >>>>>>>>>>>>  #include "vfio.h"
> >>>>>>>>>>>>
> >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>>>> +#include <asm/kvm_ppc.h>
> >>>>>>>>>>>> +#endif
> >>>>>>>>>>>> +
> >>>>>>>>>>>>  struct kvm_vfio_group {
> >>>>>>>>>>>>  	struct list_head node;
> >>>>>>>>>>>>  	struct vfio_group *vfio_group;
> >>>>>>>>>>>> @@ -60,6 +64,22 @@ static void
> >>>>>>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> >>>>>>>>>>>>  	symbol_put(vfio_group_put_external_user);
> >>>>>>>>>>>>  }
> >>>>>>>>>>>>
> >>>>>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> >>>>>>>>>>>> +{
> >>>>>>>>>>>> +	int (*fn)(struct vfio_group *);
> >>>>>>>>>>>> +	int ret = -1;
> >>>>>>>>>>>
> >>>>>>>>>>> Should this be -ESOMETHING?
> >>>>>>>>>>>
> >>>>>>>>>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
> >>>>>>>>>>>> +	if (!fn)
> >>>>>>>>>>>> +		return ret;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +	ret = fn(vfio_group);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +	symbol_put(vfio_external_user_iommu_id);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +	return ret;
> >>>>>>>>>>>> +}
> >>>>>>>>>>>> +
> >>>>>>>>>>>>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
> >>>>>>>>>>>>  {
> >>>>>>>>>>>>  	long (*fn)(struct vfio_group *, unsigned long);
> >>>>>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct
> >>>>>>>> kvm_device *dev)
> >>>>>>>>>>>>  	mutex_unlock(&kv->lock);
> >>>>>>>>>>>>  }
> >>>>>>>>>>>>
> >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
> >>>>>>>>>>>> +		struct vfio_group *vfio_group)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Shouldn't this go in the same patch that introduced the attach
> >>>>>>>>>>> function?
> >>>>>>>>>>
> >>>>>>>>>> Having less patches which touch different maintainers areas is better. I
> >>>>>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in
> >>>>>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
> >>>>>>>>>> table".
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> +{
> >>>>>>>>>>>> +	int group_id;
> >>>>>>>>>>>> +	struct iommu_group *grp;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
> >>>>>>>>>>>> +	grp = iommu_group_get_by_id(group_id);
> >>>>>>>>>>>> +	if (grp) {
> >>>>>>>>>>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
> >>>>>>>>>>>> +		iommu_group_put(grp);
> >>>>>>>>>>>> +	}
> >>>>>>>>>>>> +}
> >>>>>>>>>>>> +#endif
> >>>>>>>>>>>> +
> >>>>>>>>>>>>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>>>>>>>>>  {
> >>>>>>>>>>>>  	struct kvm_vfio *kv = dev->private;
> >>>>>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device
> >>>>>>>> *dev, long attr, u64 arg)
> >>>>>>>>>>>>  				continue;
> >>>>>>>>>>>>
> >>>>>>>>>>>>  			list_del(&kvg->node);
> >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>>>
> >>>>>>>>>>> Better to make a no-op version of the call than have to #ifdef at the
> >>>>>>>>>>> callsite.
> >>>>>>>>>>
> >>>>>>>>>> It is questionable. A x86 reader may decide that
> >>>>>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
> >>>>>>>>>> confused.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
> >>>>>>>>>>>> +					kvg->vfio_group);
> >>>>>>>>>>>> +#endif
> >>>>>>>>>>>>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
> >>>>>>>>>>>>  			kfree(kvg);
> >>>>>>>>>>>>  			ret = 0;
> >>>>>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device
> >>>>>>>> *dev, long attr, u64 arg)
> >>>>>>>>>>>>  		kvm_vfio_update_coherency(dev);
> >>>>>>>>>>>>
> >>>>>>>>>>>>  		return ret;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>>>>>>>>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
> >>>>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn param;
> >>>>>>>>>>>> +		unsigned long minsz;
> >>>>>>>>>>>> +		struct kvm_vfio *kv = dev->private;
> >>>>>>>>>>>> +		struct vfio_group *vfio_group;
> >>>>>>>>>>>> +		struct kvm_vfio_group *kvg;
> >>>>>>>>>>>> +		struct fd f;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
> >>>>>>>>>>>> +				start_addr);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> >>>>>>>>>>>> +			return -EFAULT;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		if (param.argsz < minsz)
> >>>>>>>>>>>> +			return -EINVAL;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		f = fdget(param.fd);
> >>>>>>>>>>>> +		if (!f.file)
> >>>>>>>>>>>> +			return -EBADF;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> >>>>>>>>>>>> +		fdput(f);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		if (IS_ERR(vfio_group))
> >>>>>>>>>>>> +			return PTR_ERR(vfio_group);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +		ret = -ENOENT;
> >>>>>>>>>>>
> >>>>>>>>>>> Shouldn't there be some runtime test for the type of the IOMMU?  It's
> >>>>>>>>>>> possible a kernel could be built for a platform supporting multiple
> >>>>>>>>>>> IOMMU types.
> >>>>>>>>>>
> >>>>>>>>>> Well, may make sense but I do not know to test that. The IOMMU type is a
> >>>>>>>>>> VFIO container property, not a group property and here (KVM) we only have
> >>>>>>>>>> groups.
> >>>>>>>>>
> >>>>>>>>> Which, as mentioned previously, is broken.
> >>>>>>>>
> >>>>>>>> Which I am failing to follow you on this.
> >>>>>>>>
> >>>>>>>> What I am trying to achieve here is pretty much referencing a group so it
> >>>>>>>> cannot be reused. Plus LIOBNs.
> >>>>>>>
> >>>>>>> "Plus LIOBNs" is not a trivial change.  You are establishing a linkage
> >>>>>> >from LIOBNs to groups.  But that doesn't make sense; if mapping in one
> >>>>>>> (guest) LIOBN affects a group it must affect all groups in the
> >>>>>>> container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
> >>>>>>> to group.
> >>>>>>
> >>>>>> I can see your point but i don't see how to proceed now, I'm totally stuck.
> >>>>>> Pass container fd and then implement new api to lock containers somehow and
> >>>>>
> >>>>> I'm not really understanding what the question is about locking containers.
> >>>>>
> >>>>>> enumerate groups when updating TCE table (including real mode)?
> >>>>>
> >>>>> Why do you need to enumerate groups?  The groups within the container
> >>>>> share a TCE table, so can't you just update that once?
> >>>>
> >>>> Well, they share a TCE table but they do not share TCE Kill (TCE cache
> >>>> invalidate) register address, it is still per PE but this does not matter
> >>>> here (pnv_pci_link_table_and_group() does that), just mentioned to complete
> >>>> the picture.
> >>>
> >>> True, you'll need to enumerate the groups for invalidates.  But you
> >>> need that already, right.
> >>>
> >>>>>> Plus new API when we remove a group from a container as the result of guest
> >>>>>> PCI hot unplug?
> >>>>>
> >>>>> I assume you mean a kernel internal API, since it shouldn't need
> >>>>> anything else visible to userspace.  Won't this happen naturally?
> >>>>> When the group is removed from the container, it will get its own TCE
> >>>>> table instead of the previously shared one.
> >>>>>
> >>>>>>>> Passing a container fd does not make much
> >>>>>>>> sense here as the VFIO device would walk through groups, reference them and
> >>>>>>>> that is it, there is no locking on VFIO containters and so far there was no
> >>>>>>>> need to teach KVM about containers.
> >>>>>>>>
> >>>>>>>> What do I miss now?
> >>>>>>>
> >>>>>>> Referencing the groups is essentially just a useful side effect.  The
> >>>>>>> important functionality is informing VFIO of the guest LIOBNs; and
> >>>>>>> LIOBNs map to containers, not groups.
> >>>>>>
> >>>>>> No. One liobn maps to one KVM-allocated TCE table, not a container. There
> >>>>>> can be one or many or none containers per liobn.
> >>>>>
> >>>>> Ah, true.
> >>>>
> >>>> So I need to add new kernel API for KVM to get table(s) from VFIO
> >>>> container(s). And invent some locking mechanism to prevent table(s) (or
> >>>> associated container(s)) from going away, like
> >>>> vfio_group_get_external_user/vfio_group_put_external_user but for
> >>>> containers. Correct?
> >>>
> >>> Well, the container is attached to an fd, so if you get a reference on
> >>> the file* that should do it.
> >>
> >> I am still trying to think of how to implement this suggestion.
> >>
> >> I need a way to tell KVM about IOMMU groups. vfio-pci driver is not right
> >> interface as it knows nothing about KVM. There is VFIO-KVM device but it
> >> does not have idea about containers.
> >>
> >> So I have to:
> >>
> >> Wenever a container is created or removed, notify the VFIO-KVM device by
> >> passing there a container fd. ok.
> > 
> > Actually, I don't think the vfio-kvm device is really useful here.  It
> > was designed as a hack for a particular problem on x86.  It certainly
> > could be extended to cover the information we need here, but I don't
> > think it's a particularly natural way of doing so.
> > 
> > The problem is that conveying the information from the vfio-kvm device
> > to the real mode H_PUT_TCE handler, which is what really needs it,
> > isn't particularly simpler than conveying that information from
> > anywhere else.
> > 
> >> Then VFIO-KVM device needs to tell KVM about what iommu_table belongs to
> >> what LIOBN so the realmode handlers could do the job. The real mode TCE
> >> handlers get LIOBN, find a guest view table and update it. Now I want to
> >> update the hardware table which is iommu_table attached to LIOBN.
> >>
> >> I did pass an IOMMU group fd to VFIO-KVM device. You suggested a container fd.
> >>
> >> Now VFIO-KVM device needs to extract iommu_table's from the container.
> >> These iommu_table pointers are stored in "struct tce_container" which is
> >> local to drivers/vfio/vfio_iommu_spapr_tce.c and not exported anyhow. So I
> >> cannot export and use that.
> >>
> >> The other way to go would be adding API to VFIO to enumerate IOMMU groups
> >> in a container and use iommu_table pointers stored in iommu_table_group of
> >> each group (in fact the very first group will be enough as multiple groups
> >> in a container share the table). Adding vfio_container_get_groups() when
> >> only first one is needed is quite tricky in terms of maintainers approvals.
> >>
> >> So what would be the right course of action? Thanks.
> > 
> > So, from the user side, you need to be able to bind a VFIO backend to
> > a particular guest IOMMU.  This suggests a new ioctl() used in
> > conjunction with KVM_CREATE_SPAPR_TCE.  Let's call it
> > KVM_SPAPR_TCE_BIND_VFIO.  You'd use KVM_CREATE_SPAPR_TCE to make the
> > kernel aware of a LIOBN in the first place, then use
> > KVM_SPAPR_TCE_BIND_VFIO to associate a VFIO container with that LIOBN.
> > So it would be a VM ioctl, taking a LIOBN and a container fd.  You'd
> > need a capability to go with it, and some way to unbind as well.
> 
> This is what I had in the first place some years ago. And after 5-6 reviews
> I was told that there is a VFIO KVM and I should use it.

I suspect that's because Alex didn't fully understand what we required
here.  The primary thing here is that we need to link guest visible
LIOBNs to host-visible VFIO containers.  Your comments tended to
emphasise the fact of giving KVM a list of VFIO groups, which is a
side effect of the above, but not really the main point - it does,
however, sound misleadingly like what the kvm-vfio device already does.


> > To implement that, the ioctl() would need to use a new vfio (kernel
> > internal) interface - which can be specific to only the spapr TCE
> > type.  That would take the container fd, and return the list of
> > iommu_tables in some form or other (or various error conditions,
> > obviously).
> > 
> > So, when qemu creates the PHB, it uses KVM_CREATE_SPAPR_TCE to inform
> > the kernel of the LIOBN.  When the VFIO device is attached to the PHB,
> > it uses KVM_SPAPR_TCE_BIND_VFIO to connect the VFIO container to the
> > LIOBN.  The ioctl() implementation uses the new special interface into
> > the spapr_tce vfio backend to get the list of iommu tables, which it
> > stores in some private format.
> 
> Getting just a list of IOMMU groups would do too. Pushing such API is a
> problem, this is how I ended up with the current design.
> 
> 
> > The H_PUT_TCE implementation uses that
> > stored list of iommu tables to translate H_PUT_TCEs from the guest
> > into actions on the host IOMMU tables.
> > 
> > And, yes, the special interface to the spapr TCE vfio back end is kind
> > of a hack.  That's what you get when you need to link to separate
> > kernel subsystems for performance reasons.
> 
> One can argue if it is a hack, how is this hack better that the existing
> approach? :)

Because this hack is only on a kernel internal interface, rather than
impactin the user visible interface.

> Alex, could you please comment on David's suggestion? Thanks!

[kernel,9/9] KVM: PPC: VFIO device: support SPAPR TCE

Commit Message

Comments

Patch