[RFC,16/16] lpfc: vmid: Introducing vmid in io path.

Message ID	1596507196-27417-17-git-send-email-muneendra.kumar@broadcom.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=uCqz=BO=vger.kernel.org=linux-block-owner@kernel.org> From: Muneendra <muneendra.kumar@broadcom.com> To: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org Cc: pbonzini@redhat.com, emilne@redhat.com, mkumar@redhat.com, Gaurav Srivastava <gaurav.srivastava@broadcom.com>, James Smart <jsmart2021@gmail.com>, Muneendra <muneendra.kumar@broadcom.com> Subject: [RFC 16/16] lpfc: vmid: Introducing vmid in io path. Date: Tue, 4 Aug 2020 07:43:16 +0530 Message-Id: <1596507196-27417-17-git-send-email-muneendra.kumar@broadcom.com> In-Reply-To: <1596507196-27417-1-git-send-email-muneendra.kumar@broadcom.com> References: <1596507196-27417-1-git-send-email-muneendra.kumar@broadcom.com> Sender: linux-block-owner@vger.kernel.org Precedence: bulk
Series	Application specific identification support \| expand [RFC,00/16] Application specific identification support [RFC,01/16] blkcg:Introduce blkio.app_identifier knob to blkio controller [RFC,02/16] lpfc: vmid: Add the datastructure for supporting VMID in lpfc [RFC,03/16] lpfc: vmid: API to check if VMID is enabled. [RFC,04/16] lpfc: vmid: Supplementary data structures for vmid [RFC,05/16] lpfc: vmid: Forward declarations for APIs [RFC,06/16] lpfc: vmid: Add support for vmid in mailbox command [RFC,07/16] lpfc: vmid: VMID params initialization [RFC,08/16] lpfc: vmid: vmid resource allocation [RFC,09/16] lpfc: vmid: cleanup vmid resources [RFC,10/16] lpfc: vmid: Implements ELS commands for appid patch [RFC,11/16] lpfc: vmid: Functions to manage vmids [RFC,12/16] lpfc: vmid: Implements CT commands for appid. [RFC,13/16] lpfc: vmid: Appends the vmid in the wqe before sending request [RFC,14/16] lpfc: vmid: Timeout implementation for vmid [RFC,15/16] lpfc: vmid: Adding qfpa and vmid timeout check in worker thread [RFC,16/16] lpfc: vmid: Introducing vmid in io path.

Muneendra Kumar Aug. 4, 2020, 2:13 a.m. UTC

From: Gaurav Srivastava <gaurav.srivastava@broadcom.com>

The patch introduces the vmid in the io path. It checks if the vmid is
enabled and if io belongs to a vm or not and acts accordingly. Other
supporing APIs are also included in the patch.

Signed-off-by: Gaurav Srivastava  <gaurav.srivastava@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Muneendra <muneendra.kumar@broadcom.com>
---
 drivers/scsi/lpfc/lpfc_scsi.c | 163 ++++++++++++++++++++++++++++++++++
 1 file changed, 163 insertions(+)

Hannes Reinecke Aug. 5, 2020, 7:16 a.m. UTC | #1

On 8/4/20 4:13 AM, Muneendra wrote:
> From: Gaurav Srivastava <gaurav.srivastava@broadcom.com>
> 
> The patch introduces the vmid in the io path. It checks if the vmid is
> enabled and if io belongs to a vm or not and acts accordingly. Other
> supporing APIs are also included in the patch.
> 
> Signed-off-by: Gaurav Srivastava  <gaurav.srivastava@broadcom.com>
> Signed-off-by: James Smart <jsmart2021@gmail.com>
> Signed-off-by: Muneendra <muneendra.kumar@broadcom.com>
> ---
>   drivers/scsi/lpfc/lpfc_scsi.c | 163 ++++++++++++++++++++++++++++++++++
>   1 file changed, 163 insertions(+)
> 
> diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c
> index e5a1056cc575..3c7af8064b12 100644
> --- a/drivers/scsi/lpfc/lpfc_scsi.c
> +++ b/drivers/scsi/lpfc/lpfc_scsi.c
> @@ -4624,6 +4624,149 @@ void lpfc_vmid_assign_cs_ctl(struct lpfc_vport *vport, struct lpfc_vmid *vmid)
>   	}
>   }
>   
> +/*
> + * lpfc_vmid_get_appid- get the vmid associated with the uuid
> + * @vport: The virtual port for which this call is being executed.
> + * @uuid: uuid associated with the VE
> + * @cmd: address of scsi cmmd descriptor
> + * @tag: VMID tag
> + * Returns status of the function
> + */
> +static int lpfc_vmid_get_appid(struct lpfc_vport *vport, char *uuid, struct
> +			       scsi_cmnd * cmd, union lpfc_vmid_io_tag *tag)
> +{
> +	struct lpfc_vmid *vmp = NULL;
> +	int hash, len, rc = 1, i;
> +	u8 pending = 0;
> +
> +	/* check if QFPA is complete */
> +	if (lpfc_vmid_is_type_priority_tag(vport) && !(vport->vmid_flag &
> +	      LPFC_VMID_QFPA_CMPL))
> +		return 1;
> +
> +	/* search if the uuid has already been mapped to the vmid */
> +	len = strlen(uuid);
> +	hash = lpfc_vmid_hash_fn(uuid, len);
> +
> +	/* search for the VMID in the table */
> +	read_lock(&vport->vmid_lock);
> +	vmp = lpfc_get_vmid_from_hastable(vport, hash, uuid);
> +	read_unlock(&vport->vmid_lock);
> +
> +	/* if found, check if its already registered  */
> +	if (vmp  && vmp->flag & LPFC_VMID_REGISTERED) {
> +		lpfc_vmid_update_entry(vport, cmd, vmp, tag);
> +		rc = 0;
> +	} else if (vmp && (vmp->flag & LPFC_VMID_REQ_REGISTER ||
> +			   vmp->flag & LPFC_VMID_DE_REGISTER)) {
> +		/* else if register or dereg request has already been sent */
> +		/* Hence vmid tag will not be added for this IO */
> +		rc = 1;
> +	} else {
> +		/* else, start the process to obtain one as per the */
> +		/* switch connected */
> +		write_lock(&vport->vmid_lock);
> +		vmp = lpfc_get_vmid_from_hastable(vport, hash, uuid);
> +
> +		/* while the read lock was released, in case the entry was */
> +		/* added by other context or is in process of being added */
> +		if (vmp && vmp->flag & LPFC_VMID_REGISTERED) {
> +			lpfc_vmid_update_entry(vport, cmd, vmp, tag);
> +			write_unlock(&vport->vmid_lock);
> +			return 0;
> +		} else if (vmp && vmp->flag & LPFC_VMID_REQ_REGISTER) {
> +			write_unlock(&vport->vmid_lock);
> +			return 1;
> +		}
> +
> +		/* else search and allocate a free slot in the hash table */
> +		if (vport->cur_vmid_cnt < vport->max_vmid) {
> +			for (i = 0; i < vport->max_vmid; ++i) {
> +				vmp = vport->vmid + i;
> +				if (vmp->flag == LPFC_VMID_SLOT_FREE) {
> +					vmp = vport->vmid + i;
> +					break;
> +				}
> +			}
> +		} else {
> +			write_unlock(&vport->vmid_lock);
> +			return 1;
> +		}
> +
> +		if (vmp && (vmp->flag == LPFC_VMID_SLOT_FREE)) {
> +			vmp->vmid_len = len;
> +
> +			/* Add the vmid and register  */
> +			memcpy(vmp->host_vmid, uuid, vmp->vmid_len);
> +			vmp->io_rd_cnt = 0;
> +			vmp->io_wr_cnt = 0;
> +			vmp->flag = LPFC_VMID_SLOT_USED;
> +			lpfc_put_vmid_in_hashtable(vport, hash, vmp);
> +
> +			vmp->delete_inactive =
> +			    vport->vmid_inactivity_timeout ? 1 : 0;
> +
> +			/* if type priority tag, get next available vmid */
> +			if (lpfc_vmid_is_type_priority_tag(vport))
> +				lpfc_vmid_assign_cs_ctl(vport, vmp);
> +
> +			/* allocate the per cpu variable for holding */
> +			/* the last access time stamp only if vmid is enabled */
> +			if (!vmp->last_io_time)
> +				vmp->last_io_time =
> +				    __alloc_percpu(sizeof(u64),
> +						   __alignof__(struct
> +							       lpfc_vmid));
> +
> +			/* registration pending */
> +			pending = 1;
> +			rc = 1;
> +		}
> +		write_unlock(&vport->vmid_lock);
> +
> +		/* complete transaction with switch */
> +		if (pending) {
> +			if (lpfc_vmid_is_type_priority_tag(vport))
> +				rc = lpfc_vmid_uvem(vport, vmp, true);
> +			else
> +				rc = lpfc_vmid_cmd(vport,
> +						   SLI_CTAS_RAPP_IDENT,
> +						   vmp);
> +			if (!rc) {
> +				write_lock(&vport->vmid_lock);
> +				vport->cur_vmid_cnt++;
> +				vmp->flag |= LPFC_VMID_REQ_REGISTER;
> +				write_unlock(&vport->vmid_lock);
> +			}
> +		}
> +
> +		/* finally, enable the idle timer once */
> +		if (!(vport->phba->pport->vmid_flag & LPFC_VMID_TIMER_ENBLD)) {
> +			mod_timer(&vport->phba->inactive_vmid_poll,
> +				  jiffies +
> +				  msecs_to_jiffies(1000 * LPFC_VMID_TIMER));
> +			vport->phba->pport->vmid_flag |= LPFC_VMID_TIMER_ENBLD;
> +		}
> +	}
> +	return rc;
> +}
> +
> +/*
> + * lpfc_is_command_vm_io - get the uuid from blk cgroup
> + * @cmd:Pointer to scsi_cmnd data structure
> + * Returns uuid if present if not null
> + */
> +static char *lpfc_is_command_vm_io(struct scsi_cmnd *cmd)
> +{
> +	char *uuid = NULL;
> +
> +	if (cmd->request) {
> +		if (cmd->request->bio)
> +			uuid = blkcg_get_app_identifier(cmd->request->bio);
> +	}
> +	return uuid;
> +}
> +
>   /**
>    * lpfc_queuecommand - scsi_host_template queuecommand entry point
>    * @cmnd: Pointer to scsi_cmnd data structure.
> @@ -4649,6 +4792,7 @@ lpfc_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *cmnd)
>   	int err, idx;
>   #ifdef CONFIG_SCSI_LPFC_DEBUG_FS
>   	uint64_t start = 0L;
> +	u8 *uuid = NULL;
>   
>   	if (phba->ktime_on)
>   		start = ktime_get_ns();
> @@ -4772,6 +4916,25 @@ lpfc_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *cmnd)
>   
>   	lpfc_scsi_prep_cmnd(vport, lpfc_cmd, ndlp);
>   
> +	/* check the necessary and sufficient condition to support VMID */
> +	if (lpfc_is_vmid_enabled(phba) &&
> +	    (ndlp->vmid_support ||
> +	     phba->pport->vmid_priority_tagging ==
> +	     LPFC_VMID_PRIO_TAG_ALL_TARGETS)) {
> +		/* is the IO generated by a VM, get the associated virtual */
> +		/* entity id */
> +		uuid = lpfc_is_command_vm_io(cmnd);
> +
> +		if (uuid) {
> +			err = lpfc_vmid_get_appid(vport, uuid, cmnd,
> +				(union lpfc_vmid_io_tag *)
> +					&lpfc_cmd->cur_iocbq.vmid_tag);
> +			if (!err)
> +				lpfc_cmd->cur_iocbq.iocb_flag |= LPFC_IO_VMID;
> +		}
> +	}
> +
> +	atomic_inc(&ndlp->cmd_pending);
>   #ifdef CONFIG_SCSI_LPFC_DEBUG_FS
>   	if (unlikely(phba->hdwqstat_on & LPFC_CHECK_SCSI_IO))
>   		this_cpu_inc(phba->sli4_hba.c_stat->xmt_io);
> 
Well.

Creating a VMID in the hotpath with a while() loop will be bogging down 
performance to no end.
I'd rather have restricted the ->queuecommand() function to a direct lookup.
If that fails (as the VMID isn't registered) we should kicking of a 
workqueue for registering the VMID and return BUSY.
Or tweak blkcg to register the VMID directly, and reject the command if 
the VMID isn't registered :-)

Cheers,

Hannes

James Smart Aug. 5, 2020, 11:38 p.m. UTC | #2

On 8/5/2020 12:16 AM, Hannes Reinecke wrote:
> Well.
>
> Creating a VMID in the hotpath with a while() loop will be bogging 
> down performance to no end.
> I'd rather have restricted the ->queuecommand() function to a direct 
> lookup.
> If that fails (as the VMID isn't registered) we should kicking of a 
> workqueue for registering the VMID and return BUSY.
> Or tweak blkcg to register the VMID directly, and reject the command 
> if the VMID isn't registered :-)
>
> Cheers,
>
> Hannes

That's actually what's supposed to be happening. fastpath uses the uuid 
to look up a vmid tag. If no vmid tag, kick off the fabric traffic that 
will get one but don't wait for it to complete. Any io issued while that 
process is occurring will be not be vmid tagged.    I'll circle back on 
lpfc to make sure this is happening.

In the mean time - the most important patch to review is the cgroup 
patch - patch1.

If we wanted to speed the driver's io path up, one thing to consider is 
adding a driver-settable value on the blkcg structure.  Once the fabric 
traffic obtained the vmid, the driver would set the blkcg structure with 
the value.  In this scenario though, as the vmid is destroyed as of link 
down, the driver needs a way, independent of an io, to reach into the 
blkcg struct to clear the vmid value.  We also need to be sure the blkcg 
struct won't be on top of a multipath device or something such that the 
blkcg struct may be referenced by a different scsi host - I assume we're 
good in that area.

-- james

Muneendra Kumar Aug. 6, 2020, 12:34 p.m. UTC | #3

Hi James,

As Tejun and Ming is suggesting to have interface close to the usage as
fabric infrasture I was thinking of this approach from the inputs provided
by you in the below mail.
Please provide your inputs on this approach.

1)	blkcg will have a new field to store driver specific information as
"blkio_cg_ priv_data"(in the current patch it is app_identifier) as Tejun
said he doesn’t mind cgroup data structs carrying extra bits for stuff.
2)	scsi transport will provide a new interface(sysfs) as register_vm_fabric
3)	As part of this interface user/deamon will provide the details of VM such
as UUID,PID on VM creation to the transport .
4)	With VM PID information we need to find the associated blkcg and needs to
update the UUID info in blkio_cg_ priv_data.
5)	Once we update the blkio_cg_ priv_data with vmid all the io’s issued from
VM will have the UUID info as part of blkcg.

With this  we can also address the concerns raised by Tejun and the
existing lpfc patches still holds good.

Regards,
Muneendra.

-----Original Message-----
From: James Smart [mailto:james.smart@broadcom.com]
Sent: Thursday, August 6, 2020 5:08 AM
To: Hannes Reinecke <hare@suse.de>; Muneendra
<muneendra.kumar@broadcom.com>; linux-block@vger.kernel.org;
linux-scsi@vger.kernel.org
Cc: pbonzini@redhat.com; emilne@redhat.com; mkumar@redhat.com; Gaurav
Srivastava <gaurav.srivastava@broadcom.com>; James Smart
<jsmart2021@gmail.com>
Subject: Re: [RFC 16/16] lpfc: vmid: Introducing vmid in io path.


That's actually what's supposed to be happening. fastpath uses the uuid to
look up a vmid tag. If no vmid tag, kick off the fabric traffic that will
get one but don't wait for it to complete. Any io issued while that process
is occurring will be not be vmid tagged.    I'll circle back on lpfc to make
sure this is happening.

In the mean time - the most important patch to review is the cgroup patch -
patch1.

If we wanted to speed the driver's io path up, one thing to consider is
adding a driver-settable value on the blkcg structure.  Once the fabric
traffic obtained the vmid, the driver would set the blkcg structure with the
value.  In this scenario though, as the vmid is destroyed as of link down,
the driver needs a way, independent of an io, to reach into the blkcg struct
to clear the vmid value.  We also need to be sure the blkcg struct won't be
on top of a multipath device or something such that the blkcg struct may be
referenced by a different scsi host - I assume we're good in that area.

-- james

Paolo Bonzini Aug. 6, 2020, 2:32 p.m. UTC | #4

On 06/08/20 14:34, Muneendra Kumar M wrote:
> 3)	As part of this interface user/deamon will provide the details of VM such
> as UUID,PID on VM creation to the transport .

The VM process, or the container process, is likely to be unprivileged
and cannot obtain the permissions needed to do this; therefore, you need
to cope with the situation where there is no PID yet in the cgroup,
because the tool that created the VM or container might be initializing
the cgroup, but it might not have started the VM yet.  In that case
there would be no PID.

Would it be possible to pass a file descriptor for the cgroup directory
in sysfs, instead of the PID?

Also what would the kernel API look like for this?  Would it have to be
driver-specific?

Paolo

> 4)	With VM PID information we need to find the associated blkcg and needs to
> update the UUID info in blkio_cg_ priv_data.
> 5)	Once we update the blkio_cg_ priv_data with vmid all the io’s issued from
> VM will have the UUID info as part of blkcg.

Tejun Heo Aug. 6, 2020, 2:41 p.m. UTC | #5

Hello,

On Thu, Aug 06, 2020 at 06:04:36PM +0530, Muneendra Kumar M wrote:
> 1)	blkcg will have a new field to store driver specific information as
> "blkio_cg_ priv_data"(in the current patch it is app_identifier) as Tejun
> said he doesn’t mind cgroup data structs carrying extra bits for stuff.

I'd make it something more specific - lpfc_app_id or something along that
line.

> 2)	scsi transport will provide a new interface(sysfs) as register_vm_fabric
> 3)	As part of this interface user/deamon will provide the details of VM such
> as UUID,PID on VM creation to the transport .
> 4)	With VM PID information we need to find the associated blkcg and needs to
> update the UUID info in blkio_cg_ priv_data.

You can pass in cgroup ID or open fd to uniquely identify a cgroup.

Thanks.

Paolo Bonzini Aug. 6, 2020, 2:46 p.m. UTC | #6

On 06/08/20 16:41, Tejun Heo wrote:
>> 1)	blkcg will have a new field to store driver specific information as
>> "blkio_cg_ priv_data"(in the current patch it is app_identifier) as Tejun
>> said he doesn’t mind cgroup data structs carrying extra bits for stuff.
> 
> I'd make it something more specific - lpfc_app_id or something along that
> line.

Note that there will be support in other drivers in all likelihood.

Paolo

Tejun Heo Aug. 6, 2020, 2:48 p.m. UTC | #7

On Thu, Aug 06, 2020 at 04:46:45PM +0200, Paolo Bonzini wrote:
> On 06/08/20 16:41, Tejun Heo wrote:
> >> 1)	blkcg will have a new field to store driver specific information as
> >> "blkio_cg_ priv_data"(in the current patch it is app_identifier) as Tejun
> >> said he doesn’t mind cgroup data structs carrying extra bits for stuff.
> > 
> > I'd make it something more specific - lpfc_app_id or something along that
> > line.
> 
> Note that there will be support in other drivers in all likelihood.

Yeah, make it specific to the scope, whatever that may be. Just avoid names
like priv which doesn't indicate anything about the scope or usage.

Thanks.

Paolo Bonzini Aug. 6, 2020, 2:54 p.m. UTC | #8

On 06/08/20 16:48, Tejun Heo wrote:
>>> I'd make it something more specific - lpfc_app_id or something along that
>>> line.
>> Note that there will be support in other drivers in all likelihood.
> Yeah, make it specific to the scope, whatever that may be. Just avoid names
> like priv which doesn't indicate anything about the scope or usage.

So I guess fc_app_id.

If I understand correctly, your only objection is that you'd rather not
have it specified with a file under /sys/kernel/cgroup, and instead you
would prefer to have it implemented as a ioctl for a magic file
somewhere else in sysfs?  I don't think there is any precedent for this,
and I'm not even sure where that sysfs file would be.

Paolo

Tejun Heo Aug. 6, 2020, 2:59 p.m. UTC | #9

Hello, Paolo.

On Thu, Aug 06, 2020 at 04:54:21PM +0200, Paolo Bonzini wrote:
> If I understand correctly, your only objection is that you'd rather not
> have it specified with a file under /sys/kernel/cgroup, and instead you
> would prefer to have it implemented as a ioctl for a magic file
> somewhere else in sysfs?  I don't think there is any precedent for this,
> and I'm not even sure where that sysfs file would be.

It just doesn't fit in the cgroupfs. I don't know where it should go for
this specific case. That's for you guys to figure out. There are multiple
precedences - e.g. how perf or bpf hooks into cgroup and others that I can't
remember off the top of my head.

Thanks.

Muneendra Kumar Aug. 6, 2020, 4:26 p.m. UTC | #10

Hi Paolo,

>3.As part of this interface user/deamon will provide the details of VM such
> as UUID,PID on VM creation to the transport .
>The VM process, or the container process, is likely to be unprivileged and
>cannot obtain the permissions needed to do this; therefore, you need to
>cope with the situation where there is no PID yet in the cgroup, because
>the tool >that created the VM or container might be initializing the
>cgroup, but it might not have started the VM yet.  In that case there would
>be no PID.

Agreed.A
small doubt. If the VM is started (running)then we can have the PID and   we
can use the  PID?

>Would it be possible to pass a file descriptor for the cgroup directory in
>sysfs, instead of the PID?
Yes we can do that.
>Also what would the kernel API look like for this?  Would it have to be
>driver-specific?

The API should be generic and it should not be driver-specific.


Regards,
Muneendra.

Paolo Bonzini Aug. 6, 2020, 6:39 p.m. UTC | #11

On 06/08/20 16:59, Tejun Heo wrote:
>> If I understand correctly, your only objection is that you'd rather not
>> have it specified with a file under /sys/kernel/cgroup, and instead you
>> would prefer to have it implemented as a ioctl for a magic file
>> somewhere else in sysfs?  I don't think there is any precedent for this,
>> and I'm not even sure where that sysfs file would be.
> It just doesn't fit in the cgroupfs. I don't know where it should go for
> this specific case. That's for you guys to figure out. There are multiple
> precedences - e.g. how perf or bpf hooks into cgroup and others that I can't
> remember off the top of my head.

perf and bpf have file descriptors, system calls and data structures of
their own, here there is simply none: it's just an array of chars.  Can
you explain _why_ it doesn't fit in the cgroupfs?

Paolo

Paolo Bonzini Aug. 6, 2020, 6:41 p.m. UTC | #12

On 06/08/20 18:26, Muneendra Kumar M wrote:
> Hi Paolo,
> 
>> 3.As part of this interface user/deamon will provide the details of VM such
>> as UUID,PID on VM creation to the transport .
>> The VM process, or the container process, is likely to be unprivileged and
>> cannot obtain the permissions needed to do this; therefore, you need to
>> cope with the situation where there is no PID yet in the cgroup, because
>> the tool >that created the VM or container might be initializing the
>> cgroup, but it might not have started the VM yet.  In that case there would
>> be no PID.
> 
> Agreed.A
> small doubt. If the VM is started (running)then we can have the PID and   we
> can use the  PID?

Yes, but it's too late when the VM is started.  In general there's no
requirement that a cgroup is setup shortly before it is populated.

>> Would it be possible to pass a file descriptor for the cgroup directory in
>> sysfs, instead of the PID?
> Yes we can do that.
>> Also what would the kernel API look like for this?  Would it have to be
>> driver-specific?
> 
> The API should be generic and it should not be driver-specific.

So it would be a new file in /dev, whose only function is to set up a
UUID for a cgroup?

Paolo

Tejun Heo Aug. 6, 2020, 6:49 p.m. UTC | #13

On Thu, Aug 06, 2020 at 08:39:26PM +0200, Paolo Bonzini wrote:
> On 06/08/20 16:59, Tejun Heo wrote:
> >> If I understand correctly, your only objection is that you'd rather not
> >> have it specified with a file under /sys/kernel/cgroup, and instead you
> >> would prefer to have it implemented as a ioctl for a magic file
> >> somewhere else in sysfs?  I don't think there is any precedent for this,
> >> and I'm not even sure where that sysfs file would be.
> > It just doesn't fit in the cgroupfs. I don't know where it should go for
> > this specific case. That's for you guys to figure out. There are multiple
> > precedences - e.g. how perf or bpf hooks into cgroup and others that I can't
> > remember off the top of my head.
> 
> perf and bpf have file descriptors, system calls and data structures of
> their own, here there is simply none: it's just an array of chars.  Can
> you explain _why_ it doesn't fit in the cgroupfs?

What's the hierarchical or delegation behavior? Why do the vast majority of
people who don't have the hardware or feature need to see it? We can argue
but I can pretty much guarantee that the conclusion is gonna be the same and
it's gonna be a waste of time and energy for both of us.

Thanks.

Paolo Bonzini Aug. 6, 2020, 7:20 p.m. UTC | #14

On 06/08/20 20:49, Tejun Heo wrote:
>> perf and bpf have file descriptors, system calls and data structures of
>> their own, here there is simply none: it's just an array of chars.  Can
>> you explain _why_ it doesn't fit in the cgroupfs?
> What's the hierarchical or delegation behavior?

If a cgroup does not have an app identifier the driver should use the
one from the closes parent that has one.

> Why do the vast majority of
> people who don't have the hardware or feature need to see it? We can argue
> but I can pretty much guarantee that the conclusion is gonna be the same and
> it's gonna be a waste of time and energy for both of us.

I don't want to argue, I want to understand.  My standard is that a
maintainer that rejects code explains a plan for integrating with his
subsystem and/or points to existing code that does something similar,
rather than handwaving it away as something "that I can't remember off
the top of my head".

Thanks,

Paolo

Tejun Heo Aug. 6, 2020, 7:32 p.m. UTC | #15

On Thu, Aug 06, 2020 at 09:20:38PM +0200, Paolo Bonzini wrote:
> On 06/08/20 20:49, Tejun Heo wrote:
> >> perf and bpf have file descriptors, system calls and data structures of
> >> their own, here there is simply none: it's just an array of chars.  Can
> >> you explain _why_ it doesn't fit in the cgroupfs?
> > What's the hierarchical or delegation behavior?
> 
> If a cgroup does not have an app identifier the driver should use the
> one from the closes parent that has one.
> 
> > Why do the vast majority of
> > people who don't have the hardware or feature need to see it? We can argue
> > but I can pretty much guarantee that the conclusion is gonna be the same and
> > it's gonna be a waste of time and energy for both of us.
> 
> I don't want to argue, I want to understand.  My standard is that a
> maintainer that rejects code explains a plan for integrating with his
> subsystem and/or points to existing code that does something similar,
> rather than handwaving it away as something "that I can't remember off
> the top of my head".

I could be misreading you but my general sense is that you tend to be
antagonistic in sometimes underhanded way, like above - you lifted that part
from a sentence which was listing two examples prior to that phrase. That's
an disingenious way to discuss regardless of the topic. To put in your way,
I find your way of communication failing to meet my standard for productive
discussions.

I've given brief reasons why I don't think it fits. If anyone else wants to
discuss that further, please raise it. I'd be happy to discuss. Paolo, if
I'm misunderstanding you, my aplogies but otherwise please do better.

Thanks.

Muneendra Kumar Aug. 7, 2020, 11:24 a.m. UTC | #16

Hi Paolo,
Below are my replies.

>> 3.As part of this interface user/deamon will provide the details of
>> VM such as UUID,PID on VM creation to the transport .
>> The VM process, or the container process, is likely to be
>> unprivileged and cannot obtain the permissions needed to do this;
>> therefore, you need to cope with the situation where there is no PID
>> yet in the cgroup, because the tool >that created the VM or container
>> might be initializing the cgroup, but it might not have started the
>> VM yet.  In that case there would be no PID.
>
> Agreed.A
> small doubt. If the VM is started (running)then we can have the PID and
> we
> can use the  PID?
>Yes, but it's too late when the VM is started.  In general there's no
>requirement that a cgroup is setup shortly before it is populated.
This should be ok .
The fabric  interface just provides a mechanism to store user specific data
into a pid blkcg
Before the daemon issues the UUID and pid to the fabric interface, it needs
to check whether the VM is in running state or not.
If it the VM is in running state then only it issues the VM details.
And if the  cgroup's are not setup as you mentioned the interface will
return a failure(with a proper logs) and the daemon will retry after some
time.
And this also helps us to keep track of PID to UUID mapping at daemon level.


>> Also what would the kernel API look like for this?  Would it have to
>> be driver-specific?
>
> The API should be generic and it should not be driver-specific.

>So it would be a new file in /dev, whose only function is to set up a UUID
>for a cgroup?

I will work with James Smart and check whether we can use any of the
existing interface for the same and will share the details.

>Paolo

Regards,
Muneendra.

Paolo Bonzini Aug. 7, 2020, 11:38 a.m. UTC | #17

On 07/08/20 13:24, Muneendra Kumar M wrote:
> Hi Paolo,
> Below are my replies.
> 
>>> 3.As part of this interface user/deamon will provide the details of
>>> VM such as UUID,PID on VM creation to the transport .
>>> The VM process, or the container process, is likely to be
>>> unprivileged and cannot obtain the permissions needed to do this;
>>> therefore, you need to cope with the situation where there is no PID
>>> yet in the cgroup, because the tool >that created the VM or container
>>> might be initializing the cgroup, but it might not have started the
>>> VM yet.  In that case there would be no PID.
>>
>> Agreed.A
>> small doubt. If the VM is started (running)then we can have the PID and
>> we
>> can use the  PID?
>> Yes, but it's too late when the VM is started.  In general there's no
>> requirement that a cgroup is setup shortly before it is populated.
>
> This should be ok .
>
> The fabric  interface just provides a mechanism to store user
> specific data into a pid blkcg
>
> Before the daemon issues the UUID and pid to the fabric interface, it needs
> to check whether the VM is in running state or not.
>
> If it the VM is in running state then only it issues the VM details.
>
> And if the  cgroup's are not setup as you mentioned the interface will
> return a failure(with a proper logs) and the daemon will retry after some
> time.
>
> And this also helps us to keep track of PID to UUID mapping at daemon level.

Why would that be useful?  Again, the mapping of the UUID is _not_ to a
PID, it is to a cgroup.  There is no concept of a VM PID; you could
legitimately have I/O in a separate process than say the QEMU process,
and that I/O process could legitimately reside in a separate blkcg than
QEMU.

There is no need for any daemon, and I'm not even sure which daemon
would be handling this.  App identifier should be purely a kernel
concept with no userspace state.

Paolo

Paolo Bonzini Aug. 7, 2020, 12:14 p.m. UTC | #18

On 06/08/20 21:32, Tejun Heo wrote:
> On Thu, Aug 06, 2020 at 09:20:38PM +0200, Paolo Bonzini wrote:
>> On 06/08/20 20:49, Tejun Heo wrote:
>>>> perf and bpf have file descriptors, system calls and data structures of
>>>> their own, here there is simply none: it's just an array of chars.  Can
>>>> you explain _why_ it doesn't fit in the cgroupfs?
>>> What's the hierarchical or delegation behavior?
>>
>> If a cgroup does not have an app identifier the driver should use the
>> one from the closes parent that has one.
>>
>>> Why do the vast majority of
>>> people who don't have the hardware or feature need to see it? We can argue
>>> but I can pretty much guarantee that the conclusion is gonna be the same and
>>> it's gonna be a waste of time and energy for both of us.
>>
>> I don't want to argue, I want to understand.  My standard is that a
>> maintainer that rejects code explains a plan for integrating with his
>> subsystem and/or points to existing code that does something similar,
>> rather than handwaving it away as something "that I can't remember off
>> the top of my head".
> 
> I could be misreading you but my general sense is that you tend to be
> antagonistic in sometimes underhanded way, like above - you lifted that part
> from a sentence which was listing two examples prior to that phrase.

Yes, I did, but I already explained that they are irrelevant.  If I ask
you to buy me something, asking an extra question and saying "you might
want to ask XYZ instead" would make me happier than "you might want to
buy that on Amazon or ask someone else I can't remember off the top of
my head".  Perhaps I needed icecream that Amazon doesn't even sell. :)

I do apologize for being too blunt, but here are the reasons you brought up:

- "I'm not sure it makes sense to introduce custom IDs for these given
that there already are unique per-host cgroup IDs which aren't recycled"
(VM IDs are recycled, that's the idea, so I don't think this applies?)

- "There are basic problems with e.g. delegation as these things are at
odds with everything else sharing the interface" (this I cannot parse at
all: what is delegation, who is the everything else that's sharing the
interface?  why would the problem not be there if you put the interface
somewhere in the SCSI layer?)

- "people who don't need app_identifier don't see them and preferably
can disable them" (ok this I get, but then the same applies to a lot of
stuff in sysfs).

- "the proposed is akin to adding per-thread application ID to procfs
[...] which wouldn't fly well either given the specialized and (at least
for now) niche nature of the feature" (so it is basically that it is too
specialized; cache allocation technology comes to mind then, it is also
very specialized but we _did_ add resctrl hooks into procfs).


My feeling (it's a feeling---it doesn't have to be that way!) was that
you didn't try too hard to understand the problem and to help the
submitter reframe it.  As you said very properly, apologies if I'm
misunderstanding you, but I would like to be clear about what I perceived.

Thanks,

Paolo

Muneendra Kumar Aug. 7, 2020, 12:17 p.m. UTC | #19

>
> Before the daemon issues the UUID and pid to the fabric interface, it
> needs to check whether the VM is in running state or not.
>
> If it the VM is in running state then only it issues the VM details.
>
> And if the  cgroup's are not setup as you mentioned the interface will
> return a failure(with a proper logs) and the daemon will retry after
> some time.
>
> And this also helps us to keep track of PID to UUID mapping at daemon
> level.

>Why would that be useful?  Again, the mapping of the UUID is _not_ to a
>PID, it is to a cgroup.  There is no concept of a VM PID; you could
>legitimately have I/O in a separate process than say the QEMU process, and
>that I/O >process could legitimately reside in a separate blkcg than QEMU.
Agreed .
When I run  ps  -aef  | grep <VMname > we got a pid.

ps -aef | grep mmkvm1
root      3627     1  0 04:20 ?        00:00:34 /usr/libexec/qemu-kvm -name
testmmkvm1 -S

And I was talking about at the below one PIDS(3627) .And with the help of
these PIDS I was able to reach blkcg.
Correct me if iam going in a wrong direction.
And even when I run the below command it pointed me to the same pid.
cat
/sys/fs/cgroup/blkio/machine.slice/machine-qemu\\x2d7\\x2dtestmmkvm1.scope/tasks
3627  -->/usr/libexec/qemu-kvm -name testmmkvm1
3655 --> [kvm-nx-lpage-re]
3658--> vhost-3627

And I was talking about the above PIDS(3627) to be passed to the interface
along with UUID.

>There is no need for any daemon, and I'm not even sure which daemon would
>be handling this.  App identifier should be purely a kernel concept with no
>userspace state.
Agreed App identifier is a pure kernel concept.
And the user can only provide the info what needs to be filled in it using
an interface.
Iam talking about FC transport daemon.
One of the feature of this daemon is to track all the running VM's and push
the appid information to the blk cgroup via the interface provided by the
fabric
And this feature is under development.


Paolo

Muneendra Kumar Aug. 7, 2020, 12:32 p.m. UTC | #20

Hi Paolo,


>Why would that be useful?  Again, the mapping of the UUID is _not_ to a
>PID, it is to a cgroup.  There is no concept of a VM PID; you could
>legitimately have I/O in a separate process than say the QEMU process, and
>that I/O >process could legitimately reside in a separate blkcg than QEMU.
>Agreed .
>When I run  ps  -aef  | grep <VMname > we got a pid.

>ps -aef | grep mmkvm1
root      3627     1  0 04:20 ?        00:00:34 /usr/libexec/qemu-kvm -name
testmmkvm1 -S

>And I was talking about at the below one PIDS(3627) .And with the help of
>these PIDS I was able to reach blkcg.
>Correct me if iam going in a wrong direction.

Also cross checked the same with the below comand
cat /proc/3627/cgroup
10:hugetlb:/
9:devices:/machine.slice/machine-qemu\x2d7\x2dtestmmkvm1.scope
8:cpuset:/machine.slice/machine-qemu\x2d7\x2dtestmmkvm1.scope/emulator
7:blkio:/machine.slice/machine-qemu\x2d7\x2dtestmmkvm1.scope
6:freezer:/machine.slice/machine-qemu\x2d7\x2dtestmmkvm1.scope
5:perf_event:/machine.slice/machine-qemu\x2d7\x2dtestmmkvm1.scope
4:memory:/machine.slice/machine-qemu\x2d7\x2dtestmmkvm1.scope
3:net_cls:/machine.slice/machine-qemu\x2d7\x2dtestmmkvm1.scope
2:cpu,cpuacct:/machine.slice/machine-qemu\x2d7\x2dtestmmkvm1.scope/emulator
1:name=systemd:/machine.slice/machine-qemu\x2d7\x2dtestmmkvm1.scope

Regards,
Muneendra.

Paolo Bonzini Aug. 10, 2020, 9:03 a.m. UTC | #21

On 07/08/20 14:17, Muneendra Kumar M wrote:
> And I was talking about at the below one PIDS(3627) .And with the help of
> these PIDS I was able to reach blkcg.
> Correct me if iam going in a wrong direction.
> And even when I run the below command it pointed me to the same pid.
> cat
> /sys/fs/cgroup/blkio/machine.slice/machine-qemu\\x2d7\\x2dtestmmkvm1.scope/tasks
> 3627  -->/usr/libexec/qemu-kvm -name testmmkvm1
> 3655 --> [kvm-nx-lpage-re]
> 3658--> vhost-3627
> 
> And I was talking about the above PIDS(3627) to be passed to the interface
> along with UUID.

The cgroup exists even before the VM is started and waiting for the PID
to appear would be racy.  The PID is *not* a representation of the VM,
the cgroup is (or at least it's the closest thing).

Using the PID would lead to an API that is easy to misuse.  For example
say you have a random QEMU process that has not been placed in a cgroup,
for whatever reason.  If someone doesn't understand that the API uses
the PID just as a proxy for the cgroup, they could end up classifying
*all traffic from the host* with the VMID.  If the API uses cgroups as
the fundamental concept instead, it's much harder to make this mistake.

>> There is no need for any daemon, and I'm not even sure which daemon would
>> be handling this.
>
> Iam talking about FC transport daemon.
> One of the feature of this daemon is to track all the running VM's and push
> the appid information to the blk cgroup via the interface provided by the
> fabric

There is no need for this daemon.  The cgroups can be initialized by
whatever takes care of starting the VM, for example libvirt.  How would
the daemon learn about the VM PIDs and UUIDs, besides?

Paolo

Muneendra Kumar Aug. 10, 2020, 12:13 p.m. UTC | #22

> And I was talking about the above PIDS(3627) to be passed to the
> interface along with UUID.

>The cgroup exists even before the VM is started and waiting for the PID to
>appear would be racy.  The PID is *not* a representation of the VM, the
>cgroup is (or at least it's the closest thing).

>Using the PID would lead to an API that is easy to misuse.  For example say
>you have a random QEMU process that has not been placed in a cgroup, for
>whatever reason.  If someone doesn't understand that the API uses the PID
> >just as a proxy for the cgroup, they could end up classifying *all traffic
>from the host* with the VMID.  If the API uses cgroups as the fundamental
>concept instead, it's much harder to make this mistake.

Agreed:
So from the user we need to provide UUID and the cgroup associated info with
VM to the kernel interface. Is this correct?
There is no issues with UUID  passing as one of the arg.
Coming to the other cgroup associated VM here are the options which we can
send

1)openfd:
We need a utility which opens the cgroup path and pass the fd details to the
interface.
And we can use the cgroup_get_from_fd() utility to get the associated cgroup
in the kernel.
Dependent on utilty.

2)cgroupid:From the  userspace iam not sure if there is any syfs/proc/sytem
call interface which give us the cgroup id directly.
Tejun correct me if iam wrong.If there is any such interface please let me
know so that I can pass the same.


3)give the complete cgroup path associated with VM and the kernel interface
will get the associated blkcg.
The user needs to pass the path and the uuid info to the sysfs interface
provided by the fabric interface
The interface will  write the uuid info in blkcg associated with cgroup path
with the help of cgroup_get_from_path().
And there is no dependency on any utility. The user can simply pass the
details using sysfs.

Need your inputs on the same which one to use (openfd/cgroupid/path) .


Regards,
Muneendra.

James Smart Aug. 11, 2020, 11:48 p.m. UTC | #23

On 8/6/2020 9:26 AM, Muneendra Kumar M wrote:
> Hi Paolo,
> 
>> 3.As part of this interface user/deamon will provide the details of VM such
>> as UUID,PID on VM creation to the transport .
>> The VM process, or the container process, is likely to be unprivileged and
>> cannot obtain the permissions needed to do this; therefore, you need to
>> cope with the situation where there is no PID yet in the cgroup, because
>> the tool >that created the VM or container might be initializing the
>> cgroup, but it might not have started the VM yet.  In that case there would
>> be no PID.
> 
> Agreed.A
> small doubt. If the VM is started (running)then we can have the PID and   we
> can use the  PID?
Note: I'm not worried about this. The transport/lldd would be fine with 
the device being used for a while with no blkcg attribute set. When it 
finally does get set, it would initiate the creation of the 
corresponding fc wire app_id.  This will be something normal.

-- james

Paolo Bonzini Aug. 12, 2020, 7:54 a.m. UTC | #24

On 10/08/20 14:13, Muneendra Kumar M wrote:
> Agreed:
> So from the user we need to provide UUID and the cgroup associated info with
> VM to the kernel interface. Is this correct?

Yes.

> There is no issues with UUID  passing as one of the arg.
> Coming to the other cgroup associated VM here are the options which we can
> send
> 
> 1)openfd:
> We need a utility which opens the cgroup path and pass the fd details to the
> interface.
> And we can use the cgroup_get_from_fd() utility to get the associated cgroup
> in the kernel.
> Dependent on utilty.

This looks good to me.

Paolo

Muneendra Kumar Aug. 12, 2020, 12:16 p.m. UTC | #25

Hi Paolo/Tejun,
From the overall conversation the approach will be as below.
Please let me know if I miss anything or any issue with the same.

1)      blkcg will have a new field fc_app_id to store UUID /driver specific
information.
2)      scsi transport will provide a new interface(sysfs) as
register_vm_fabric
3)      As part of this interface user will provide the details of UUID and
the  open fd (cgroup associated
 info with VM) to the new interface.
4)      With VM provided cgroup info we need to find the associated blkcg
and needs to
update the UUID info in the fc_app_id.
5)      Once we update the fc_app_id  all the io’s issued from VM will have
the UUID info as part of blkcg.

If this approach is fine then I will make the necessary changes in my next
version.

Regards,
Muneendra.

-----Original Message-----
From: Paolo Bonzini [mailto:pbonzini@redhat.com]
Sent: Wednesday, August 12, 2020 1:25 PM
To: Muneendra Kumar M <muneendra.kumar@broadcom.com>; James Smart
<james.smart@broadcom.com>; Hannes Reinecke <hare@suse.de>;
linux-block@vger.kernel.org; linux-scsi@vger.kernel.org
Cc: emilne@redhat.com; mkumar@redhat.com; Gaurav Srivastava
<gaurav.srivastava@broadcom.com>; James Smart <jsmart2021@gmail.com>; Ming
Lei <tom.leiming@gmail.com>; Tejun Heo <tj@kernel.org>
Subject: Re: [RFC 16/16] lpfc: vmid: Introducing vmid in io path.

On 10/08/20 14:13, Muneendra Kumar M wrote:
> Agreed:
> So from the user we need to provide UUID and the cgroup associated
> info with VM to the kernel interface. Is this correct?

Yes.

> There is no issues with UUID  passing as one of the arg.
> Coming to the other cgroup associated VM here are the options which we
> can send
>
> 1)openfd:
> We need a utility which opens the cgroup path and pass the fd details
> to the interface.
> And we can use the cgroup_get_from_fd() utility to get the associated
> cgroup in the kernel.
> Dependent on utilty.

This looks good to me.

Paolo

[RFC,16/16] lpfc: vmid: Introducing vmid in io path.

Commit Message

Comments

Patch