diff mbox

[RFC,1/3] IB/IPoIB: Support SRIOV standard configuration

Message ID 1432225447-6536-2-git-send-email-ogerlitz@mellanox.com (mailing list archive)
State Changes Requested
Headers show

Commit Message

Or Gerlitz May 21, 2015, 4:24 p.m. UTC
Standard configuration of SRIOV VFs through the host is done over the
following chain of calls: libvirt --> netlink --> PF netdevice

When this comes to IB/IPoIB we should normalize this into the verbs
framework so we further go: PF IPoIB --> verbs API --> PF HW driver

Virtualization systems assign VMs 48 bits mac, to allow working with
non-modified SW layers (open-stack, libvirt, etc), we can safely
extend this mac to unique 64 bits GUID. Hence the IPoIB ndo_set_vf_mac
entry calls the set_vf_guid verb.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c |   39 +++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h                   |    4 +++
 2 files changed, 43 insertions(+), 0 deletions(-)

Comments

Jason Gunthorpe May 21, 2015, 6:46 p.m. UTC | #1
On Thu, May 21, 2015 at 07:24:05PM +0300, Or Gerlitz wrote:
> Standard configuration of SRIOV VFs through the host is done over the
> following chain of calls: libvirt --> netlink --> PF netdevice
> 
> When this comes to IB/IPoIB we should normalize this into the verbs
> framework so we further go: PF IPoIB --> verbs API --> PF HW driver
> 
> Virtualization systems assign VMs 48 bits mac, to allow working with
> non-modified SW layers (open-stack, libvirt, etc), we can safely
> extend this mac to unique 64 bits GUID. Hence the IPoIB ndo_set_vf_mac
> entry calls the set_vf_guid verb.
> 
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
>  drivers/infiniband/ulp/ipoib/ipoib_main.c |   39 +++++++++++++++++++++++++++++
>  include/rdma/ib_verbs.h                   |    4 +++
>  2 files changed, 43 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index 9e1b203..8f82870 100644
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -1357,6 +1357,43 @@ void ipoib_dev_cleanup(struct net_device *dev)
>  	priv->tx_ring = NULL;
>  }
>  
> +static int ipoib_get_vf_config(struct net_device *dev, int vf, struct ifla_vf_info *ivf)
> +{
> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> +
> +	if (priv->ca->get_vf_config)
> +		return priv->ca->get_vf_config(priv->ca, priv->port, vf, ivf);
> +	else
> +		return -EINVAL;
> +}
> +
> +static int ipoib_set_vf_mac(struct net_device *dev, int queue, u8 *mac)
> +{
> +	char *raw_guid;
> +	u64 guid = 0;

This doesn't seem right at all.

It makes no sense that a IPoIB interface with a 20 byte LLADDR would
accept an 8 byte LLADDR only for 'ip link set vf mac'

The definition of the netlink struct seems to confirm this:

struct ifla_vf_mac {
        __u32 vf;
        __u8 mac[32]; /* MAX_ADDR_LEN */
};

If it was really just ever a mac it would really only be 6 bytes.
[Honestly, this whole feature seems very inconistent with the rest of
 the design of ip net link, so who knows]

If the ifla_vf_mac had been variable-sized (like every other address
related attribute) then sure, auto detect the length and do the right
thing.

But with this API, I think you have no choice, 'ip set vf mac LLADDR'
can only be the 20 byte address.

If you really, really want this: Get someone from iproute or netdev to
sign off that they really mean that 'ip set vf mac LLADDR' is always a
6 byte mac.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz May 21, 2015, 8:05 p.m. UTC | #2
On Thu, May 21, 2015, Jason Gunthorpe  wrote:

>> +static int ipoib_set_vf_mac(struct net_device *dev, int queue, u8 *mac)
>> +{
>> +     char *raw_guid;
>> +     u64 guid = 0;
>
> This doesn't seem right at all.
>
> It makes no sense that a IPoIB interface with a 20 byte LLADDR would
> accept an 8 byte LLADDR only for 'ip link set vf mac'
>
> The definition of the netlink struct seems to confirm this:
>
> struct ifla_vf_mac {
>         __u32 vf;
>         __u8 mac[32]; /* MAX_ADDR_LEN */
> };
>
> If it was really just ever a mac it would really only be 6 bytes.
> [Honestly, this whole feature seems very inconistent with the rest of
>  the design of ip net link, so who knows]
>
> If the ifla_vf_mac had been variable-sized (like every other address
> related attribute) then sure, auto detect the length and do the right
> thing.
>
> But with this API, I think you have no choice, 'ip set vf mac LLADDR'
> can only be the 20 byte address.

You can't enforce 20 byte address on ipoib instance b/c the driver
can't dictate the QPN of their UD QP. Also, you don't need to force
that, b/c what the virtualization system want to provision relates to
a VM ID which is their mac (--> guid).

If the ifla_vf_mac had been variable-sized we could have the ipoib
set_vf_mac implementation to check if user-space provided 48 bits
(mac), 64bits (guid) or even 128bits (whole gid), but it doesn't and I
would like to either use the lowest common denominator (48bits) or
just take always 64bits which could have two zero bytes (e.g when
libvirt calls into that api through netlink).
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe May 21, 2015, 9:01 p.m. UTC | #3
On Thu, May 21, 2015 at 11:05:32PM +0300, Or Gerlitz wrote:
> On Thu, May 21, 2015, Jason Gunthorpe  wrote:
> 
> >> +static int ipoib_set_vf_mac(struct net_device *dev, int queue, u8 *mac)
> >> +{
> >> +     char *raw_guid;
> >> +     u64 guid = 0;
> >
> > This doesn't seem right at all.
> >
> > It makes no sense that a IPoIB interface with a 20 byte LLADDR would
> > accept an 8 byte LLADDR only for 'ip link set vf mac'
> >
> > The definition of the netlink struct seems to confirm this:
> >
> > struct ifla_vf_mac {
> >         __u32 vf;
> >         __u8 mac[32]; /* MAX_ADDR_LEN */
> > };
> >
> > If it was really just ever a mac it would really only be 6 bytes.
> > [Honestly, this whole feature seems very inconistent with the rest of
> >  the design of ip net link, so who knows]
> >
> > If the ifla_vf_mac had been variable-sized (like every other address
> > related attribute) then sure, auto detect the length and do the right
> > thing.
> >
> > But with this API, I think you have no choice, 'ip set vf mac LLADDR'
> > can only be the 20 byte address.
> 
> You can't enforce 20 byte address on ipoib instance b/c the driver
> can't dictate the QPN of their UD QP.

I can think of several ways off the top of my head to deal with
that. Don't see it as a problem.

> Also, you don't need to force that, b/c what the virtualization
> system want to provision relates to a VM ID which is their mac (-->
> guid).

No, I do need to force that because that is how netlink works: each
device has a defined hw address length, and all netlink messages
dealing with LLADDR must use an identical format for the hw
address. For IPoIB that is 20 byte.

It is a simple matter of UAPI sanity, it is not the kernel's problem
that some userspace tools assume a 6 byte LLADDR for all devices -
they are broken.

> (mac), 64bits (guid) or even 128bits (whole gid), but it doesn't and I
> would like to either use the lowest common denominator (48bits) or
> just take always 64bits which could have two zero bytes (e.g when
> libvirt calls into that api through netlink).

Then get the iproute and netdev people to sign off on using
ifla_vf_mac with something other than the defined 20 byte IPoIB HW
address. Maybe you can work out some scheme with them. Making the
IFLA_VF_MAC properly variable sized might be a good angle.

NACK otherwise.

The rest of the series's idea seems sound.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Ledford May 21, 2015, 9:05 p.m. UTC | #4
On Thu, 2015-05-21 at 12:46 -0600, Jason Gunthorpe wrote:
> On Thu, May 21, 2015 at 07:24:05PM +0300, Or Gerlitz wrote:
> > Standard configuration of SRIOV VFs through the host is done over the
> > following chain of calls: libvirt --> netlink --> PF netdevice
> > 
> > When this comes to IB/IPoIB we should normalize this into the verbs
> > framework so we further go: PF IPoIB --> verbs API --> PF HW driver
> > 
> > Virtualization systems assign VMs 48 bits mac, to allow working with
> > non-modified SW layers (open-stack, libvirt, etc), we can safely
> > extend this mac to unique 64 bits GUID. Hence the IPoIB ndo_set_vf_mac
> > entry calls the set_vf_guid verb.
> > 
> > Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> >  drivers/infiniband/ulp/ipoib/ipoib_main.c |   39 +++++++++++++++++++++++++++++
> >  include/rdma/ib_verbs.h                   |    4 +++
> >  2 files changed, 43 insertions(+), 0 deletions(-)
> > 
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > index 9e1b203..8f82870 100644
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > @@ -1357,6 +1357,43 @@ void ipoib_dev_cleanup(struct net_device *dev)
> >  	priv->tx_ring = NULL;
> >  }
> >  
> > +static int ipoib_get_vf_config(struct net_device *dev, int vf, struct ifla_vf_info *ivf)
> > +{
> > +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> > +
> > +	if (priv->ca->get_vf_config)
> > +		return priv->ca->get_vf_config(priv->ca, priv->port, vf, ivf);
> > +	else
> > +		return -EINVAL;
> > +}
> > +
> > +static int ipoib_set_vf_mac(struct net_device *dev, int queue, u8 *mac)
> > +{
> > +	char *raw_guid;
> > +	u64 guid = 0;
> 
> This doesn't seem right at all.
> 
> It makes no sense that a IPoIB interface with a 20 byte LLADDR would
> accept an 8 byte LLADDR only for 'ip link set vf mac'

It has to be the 8byte guid, the next 8bytes are the subnet prefix which
is set by the SM and not under driver control, and the final 4bytes are
specific to the IPoIB connection.  You could pass in all 20bytes, but
you would only be able to set 8 of them.
Jason Gunthorpe May 21, 2015, 9:21 p.m. UTC | #5
On Thu, May 21, 2015 at 05:05:31PM -0400, Doug Ledford wrote:
> It has to be the 8byte guid, the next 8bytes are the subnet prefix which
> is set by the SM and not under driver control, and the final 4bytes are
> specific to the IPoIB connection.  You could pass in all 20bytes, but
> you would only be able to set 8 of them.

Of course you can't set all 20 bytes, but you still have to provide
them, that is the defined LLADDR format for this device.

1) Kernel demands immutable bits are zero and fills them in
   automatically
2) Userspace copies the current LLADDR and only modifies the 8 bytes
   of the GUID, kernel demands immutable bits are unchanged
3) Who is to say the QPN can't be made settable someday? People have
   complained about the unstable QPN in the past.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 9e1b203..8f82870 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1357,6 +1357,43 @@  void ipoib_dev_cleanup(struct net_device *dev)
 	priv->tx_ring = NULL;
 }
 
+static int ipoib_get_vf_config(struct net_device *dev, int vf, struct ifla_vf_info *ivf)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	if (priv->ca->get_vf_config)
+		return priv->ca->get_vf_config(priv->ca, priv->port, vf, ivf);
+	else
+		return -EINVAL;
+}
+
+static int ipoib_set_vf_mac(struct net_device *dev, int queue, u8 *mac)
+{
+	char *raw_guid;
+	u64 guid = 0;
+
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	raw_guid = (char *)&guid;
+	raw_guid[0] = mac[0];
+	raw_guid[1] = mac[1];
+	raw_guid[2] = mac[2];
+	raw_guid[3] = 0xff;
+	raw_guid[4] = 0xfe;
+	raw_guid[5] = mac[3];
+	raw_guid[6] = mac[4];
+	raw_guid[7] = mac[5];
+
+	guid &= ~(cpu_to_be64(1ULL << 56));
+	guid |= cpu_to_be64(1ULL << 57);
+
+	if (priv->ca->set_vf_guid)
+		return priv->ca->set_vf_guid(priv->ca, priv->port, queue, guid);
+	else
+		return -EINVAL;
+}
+
+
 static const struct header_ops ipoib_header_ops = {
 	.create	= ipoib_hard_header,
 };
@@ -1371,6 +1408,8 @@  static const struct net_device_ops ipoib_netdev_ops = {
 	.ndo_tx_timeout		 = ipoib_timeout,
 	.ndo_set_rx_mode	 = ipoib_set_mcast_list,
 	.ndo_get_iflink		 = ipoib_get_iflink,
+	.ndo_get_vf_config	 = ipoib_get_vf_config,
+	.ndo_set_vf_mac	 	 = ipoib_set_vf_mac,
 };
 
 void ipoib_setup(struct net_device *dev)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 65994a1..6589520 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -53,6 +53,7 @@ 
 #include <linux/atomic.h>
 #include <linux/mmu_notifier.h>
 #include <asm/uaccess.h>
+#include <linux/if_link.h>
 
 extern struct workqueue_struct *ib_wq;
 
@@ -1653,6 +1654,9 @@  struct ib_device {
 	int			   (*check_mr_status)(struct ib_mr *mr, u32 check_mask,
 						      struct ib_mr_status *mr_status);
 
+	int 			  (*set_vf_guid)  (struct ib_device *device, int port, int vf, u64 guid);
+	int 			  (*get_vf_config)(struct ib_device *device, int port, int vf, struct ifla_vf_info *ivf);
+
 	struct ib_dma_mapping_ops   *dma_ops;
 
 	struct module               *owner;