Message ID | 1368498506-25857-7-git-send-email-yinghai@kernel.org (mailing list archive)
---|---
State | New, archived |
Delegated to: | Bjorn Helgaas |
> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of Yinghai Lu
> Sent: Tuesday, May 14, 2013 05:31
> To: Bjorn Helgaas
> Cc: Gu Zheng; linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Yinghai Lu; netdev@vger.kernel.org
> Subject: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's
>
> Found the kernel trying to load the mlx4 driver for VFs before the
> PF's driver has really loaded, when the drivers are built in and the
> kernel command line includes probe_vfs=63, num_vfs=63.
>
> It turns out this also happens on the hotadd path even when the
> drivers are compiled as modules and are loaded, especially when some
> VFs share the same driver with the PF.
>
> calling path:
>     device driver probe
>       ==> pci_enable_sriov
>         ==> virtfn_add
>           ==> pci_dev_add
>             ==> pci_bus_device_add
> when pci_bus_device_add is called, the VF's driver will be attached,
> and at that time the PF's driver has not finished yet.
>
> Need to move pci_bus_device_add out of virtfn_add and call it later.
> Fix the problem for two paths:
>  1. hotadd path: use device_schedule_callback.
>  2. booting path: use an initcall to call that for all VFs.
>
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> Cc: netdev@vger.kernel.org
>
> ---
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c    |    7 +
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c      |    4 -
>  drivers/net/ethernet/cisco/enic/enic_main.c          |    2
>  drivers/net/ethernet/emulex/benet/be_main.c          |    4 +
>  drivers/net/ethernet/intel/igb/igb_main.c            |   11 ++
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c        |    2
>  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c       |    9 +-
>  drivers/net/ethernet/mellanox/mlx4/main.c            |    2
>  drivers/net/ethernet/neterion/vxge/vxge-main.c       |    3
>  drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c |    5 +
>  drivers/net/ethernet/sfc/efx.c                       |    1
>  drivers/pci/iov.c                                    |   73 +++++++++++++++++--
>  drivers/scsi/lpfc/lpfc_init.c                        |    2
>  include/linux/pci.h                                  |    4 +
>  14 files changed, 115 insertions(+), 14 deletions(-)
>
> Index: linux-2.6/drivers/net/ethernet/mellanox/mlx4/main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/mellanox/mlx4/main.c
> +++ linux-2.6/drivers/net/ethernet/mellanox/mlx4/main.c
> @@ -2308,6 +2308,8 @@ slave_start:
>          priv->pci_dev_data = pci_dev_data;
>          pci_set_drvdata(pdev, dev);
>
> +        pci_bus_add_device_vfs(pdev);
> +

This seems wrong, since pci_bus_add_device_vfs() is being called for
VFs as well as for the PF when SRIOV is not enabled.  I can see that
there are sanity checks in pci_bus_add_vf() that would prevent this
code from doing damage, but why schedule a callback that will do
nothing if we can prevent scheduling altogether?  It would probably be
better to do the following:

+        if (dev->flags & MLX4_FLAG_SRIOV)
+                pci_bus_add_device_vfs(pdev);
+

>          return 0;
>
>  err_port:
> -----Original Message-----
> From: linux-pci-owner@vger.kernel.org [mailto:linux-pci-owner@vger.kernel.org] On Behalf Of Yinghai Lu
>
> [commit message snipped -- quoted in full above]
>
> ...
>
> Index: linux-2.6/drivers/net/ethernet/emulex/benet/be_main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/emulex/benet/be_main.c
> +++ linux-2.6/drivers/net/ethernet/emulex/benet/be_main.c
> @@ -4119,6 +4119,7 @@ static int lancer_recover_func(struct be
>          if (status)
>                  goto err;
>
> +        pci_bus_add_device_vfs(adapter->pdev);
>          if (netif_running(adapter->netdev)) {
>                  status = be_open(adapter->netdev);
>                  if (status)
> @@ -4335,6 +4336,8 @@ static int be_probe(struct pci_dev *pdev
>          dev_info(&pdev->dev, "%s: %s %s port %c\n", nic_name(pdev),
>                   func_name(adapter), mc_name(adapter), port_name);
>
> +        pci_bus_add_device_vfs(pdev);
> +
>          return 0;
>
> unsetup:
> @@ -4406,6 +4409,7 @@ static int be_resume(struct pci_dev *pde
>                  rtnl_unlock();
>          }
>
> +        pci_bus_add_device_vfs(adapter->pdev);
>          schedule_delayed_work(&adapter->func_recovery_work,
>                                msecs_to_jiffies(1000));
>          netif_device_attach(netdev);

I fixed this issue in be2net with the following patch (commit b4c1df93):
http://marc.info/?l=linux-netdev&m=136801459808765&w=2

-Sathya
On Tue, May 14, 2013 at 2:46 AM, Perla, Sathya <Sathya.Perla@emulex.com> wrote:
>
>> -----Original Message-----
>> From: linux-pci-owner@vger.kernel.org [mailto:linux-pci-owner@vger.kernel.org] On Behalf Of Yinghai Lu
>>
>> [commit message and be_main.c hunks snipped -- quoted in full above]
>
> I fixed this issue in be2net with the following patch (commit b4c1df93):
> http://marc.info/?l=linux-netdev&m=136801459808765&w=2

We should fix that in a generic way.

BTW, can you please implement sriov_configure, like the Intel igb/ixgbe
and QLogic qlcnic drivers do?  A driver really should not call
pci_enable_sriov in its probe path.

Yinghai
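For reference, sriov_configure here refers to the .sriov_configure hook
on struct pci_driver, which lets the administrator create VFs from sysfs
instead of unconditionally at probe time.  A minimal sketch of such a
callback in the igb/ixgbe style -- the mydrv names are hypothetical; only
the callback shape comes from the PCI core:

#include <linux/pci.h>

/* Hypothetical sriov_configure callback: the PCI core calls this when
 * userspace writes to sriov_numvfs; return the number of VFs actually
 * enabled, 0 on disable, or a negative errno. */
static int mydrv_sriov_configure(struct pci_dev *pdev, int num_vfs)
{
        int err;

        if (num_vfs == 0) {
                pci_disable_sriov(pdev);
                return 0;
        }

        err = pci_enable_sriov(pdev, num_vfs);
        if (err)
                return err;

        return num_vfs;
}

static struct pci_driver mydrv_driver = {
        .name            = "mydrv",
        /* .id_table, .probe, .remove, ... */
        .sriov_configure = mydrv_sriov_configure,
};

With that hook in place, VFs are instantiated on demand via
echo N > /sys/bus/pci/devices/.../sriov_numvfs, well after the PF's
probe has completed.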
On Tue, May 14, 2013 at 1:58 AM, Yan Burman <yanb@mellanox.com> wrote:
>
>> -----Original Message-----
>> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of Yinghai Lu
>> Sent: Tuesday, May 14, 2013 05:31
>> Subject: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's
>>
>> [commit message snipped -- quoted in full above]
>>
>> Index: linux-2.6/drivers/net/ethernet/mellanox/mlx4/main.c
>> ===================================================================
>> --- linux-2.6.orig/drivers/net/ethernet/mellanox/mlx4/main.c
>> +++ linux-2.6/drivers/net/ethernet/mellanox/mlx4/main.c
>> @@ -2308,6 +2308,8 @@ slave_start:
>>          priv->pci_dev_data = pci_dev_data;
>>          pci_set_drvdata(pdev, dev);
>>
>> +        pci_bus_add_device_vfs(pdev);
>> +
>
> This seems wrong, since pci_bus_add_device_vfs() is being called for
> VFs as well as for the PF when SRIOV is not enabled.  I can see that
> there are sanity checks in pci_bus_add_vf() that would prevent this
> code from doing damage, but why schedule a callback that will do
> nothing if we can prevent scheduling altogether?  It would probably be
> better to do the following:
>
> +        if (dev->flags & MLX4_FLAG_SRIOV)
> +                pci_bus_add_device_vfs(pdev);
> +

Yes, we can add that check.

BTW, do you have any plan to make mlx4 support sriov_configure via sysfs?

Yinghai
On 05/13/2013 07:28 PM, Yinghai Lu wrote:
> Found the kernel trying to load the mlx4 driver for VFs before the
> PF's driver has really loaded, when the drivers are built in and the
> kernel command line includes probe_vfs=63, num_vfs=63.
>
> [rest of commit message snipped -- quoted in full above]

I'm sorry, but what is the point of this patch?  With device assignment
it is always possible to have VFs loaded and the PF driver unloaded,
since you cannot remove the VFs if they are assigned to a VM.

If there is a driver that has to have the PF driver fully loaded before
it instantiates the VFs, then it sounds like a buggy driver to me.  The
VF driver should be able to be loaded when the PF driver is not
present.  We handle it in igb and ixgbe, last I checked, and I don't
see any reason why it cannot be handled in all other VF drivers.  I'm
not saying the VF has to be fully functional, but it should be able to
detect the PF becoming enabled and then bring itself to a fully
functional state.  To not handle that case is a bug.

Thanks,

Alex
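As a rough sketch of the behavior Alex describes (igbvf/ixgbevf style),
assuming hypothetical myvf names and a mailbox-based PF handshake: the
VF probe succeeds with the link reported down, and a periodic task keeps
retrying the PF until it appears:

#include <linux/errno.h>
#include <linux/jiffies.h>
#include <linux/workqueue.h>

struct myvf_adapter {
        struct delayed_work watchdog;
        bool pf_ready;
};

/* Hypothetical mailbox ping; returns 0 once the PF answers. */
static int myvf_try_pf_handshake(struct myvf_adapter *adapter)
{
        return -ENODEV;         /* stub: PF not present yet */
}

static void myvf_watchdog(struct work_struct *work)
{
        struct myvf_adapter *adapter =
                container_of(work, struct myvf_adapter, watchdog.work);

        /* Keep reporting link down until the PF answers, then bring
         * the VF up to a fully functional state. */
        if (!adapter->pf_ready && myvf_try_pf_handshake(adapter) == 0)
                adapter->pf_ready = true;

        schedule_delayed_work(&adapter->watchdog, HZ);
}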
On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
<alexander.h.duyck@intel.com> wrote:
> On 05/13/2013 07:28 PM, Yinghai Lu wrote:
>> [commit message snipped -- quoted in full above]
>
> I'm sorry, but what is the point of this patch?  With device assignment
> it is always possible to have VFs loaded and the PF driver unloaded,
> since you cannot remove the VFs if they are assigned to a VM.

Doesn't unloading the PF driver call pci_disable_sriov?

> If there is a driver that has to have the PF driver fully loaded before
> it instantiates the VFs, then it sounds like a buggy driver to me.  The
> VF driver should be able to be loaded when the PF driver is not
> present.  We handle it in igb and ixgbe, last I checked, and I don't
> see any reason why it cannot be handled in all other VF drivers.  I'm
> not saying the VF has to be fully functional, but it should be able to
> detect the PF becoming enabled and then bring itself to a fully
> functional state.  To not handle that case is a bug.

More than that: there is a work_on_cpu nested-lock problem, from calling
pci_bus_add_device in a driver's PCI probe function:

[ 181.938110] mlx4_core 0000:02:00.0: Started init_resource_tracker: 80 slaves
[ 181.938759] alloc irq_desc for 1170 on node 0
[ 181.949104] mlx4_core 0000:02:00.0: irq 1170 for MSI-X
[ 181.949404] alloc irq_desc for 1171 on node 0
[ 181.949741] mlx4_core 0000:02:00.0: irq 1171 for MSI-X
[ 181.969253] alloc irq_desc for 1172 on node 0
[ 181.969564] mlx4_core 0000:02:00.0: irq 1172 for MSI-X
[ 181.989137] alloc irq_desc for 1173 on node 0
[ 181.989485] mlx4_core 0000:02:00.0: irq 1173 for MSI-X
[ 182.033789] mlx4_core 0000:02:00.0: NOP command IRQ test passed
[ 182.035380]
[ 182.035473] =============================================
[ 182.049065] [ INFO: possible recursive locking detected ]
[ 182.049349] 3.10.0-rc1-yh-00114-gf59c98e-dirty #1588 Not tainted
[ 182.069079] ---------------------------------------------
[ 182.069354] kworker/0:1/2285 is trying to acquire lock:
[ 182.089080]  ((&wfc.work)){+.+.+.}, at: [<ffffffff810ab745>] flush_work+0x5/0x280
[ 182.089500]
[ 182.089500] but task is already holding lock:
[ 182.109671]  ((&wfc.work)){+.+.+.}, at: [<ffffffff810aabe2>] process_one_work+0x202/0x490
[ 182.129097]
[ 182.129097] other info that might help us debug this:
[ 182.129415]  Possible unsafe locking scenario:
[ 182.129415]
[ 182.149275]        CPU0
[ 182.149386]        ----
[ 182.149513]   lock((&wfc.work));
[ 182.149705]   lock((&wfc.work));
[ 182.169391]
[ 182.169391]  *** DEADLOCK ***
[ 182.169391]
[ 182.169722]  May be due to missing lock nesting notation
[ 182.169722]
[ 182.189461] 3 locks held by kworker/0:1/2285:
[ 182.189664]  #0:  (events){.+.+.+}, at: [<ffffffff810aabe2>] process_one_work+0x202/0x490
[ 182.209468]  #1:  ((&wfc.work)){+.+.+.}, at: [<ffffffff810aabe2>] process_one_work+0x202/0x490
[ 182.229176]  #2:  (&__lockdep_no_validate__){......}, at: [<ffffffff81765eea>] device_attach+0x2a/0xc0
[ 182.249108]
[ 182.249108] stack backtrace:
[ 182.249362] CPU: 0 PID: 2285 Comm: kworker/0:1 Not tainted 3.10.0-rc1-yh-00114-gf59c98e-dirty #1588
[ 182.269258] Hardware name: Oracle Corporation unknown / , BIOS 11016600 05/17/2011
[ 182.289141] Workqueue: events work_for_cpu_fn
[ 182.289410]  ffffffff83350bc0 ffff881025c11778 ffffffff82093a74 ffff881025c11838
[ 182.309167]  ffffffff810ed194 ffff881025c117b8 ffff881025c38000 0000b787702301dc
[ 182.309587]  ffff881000000000 0000000000000002 ffffffff8322cba0 ffff881025c11878
[ 182.329524] Call Trace:
[ 182.329669]  [<ffffffff82093a74>] dump_stack+0x19/0x1b
[ 182.349365]  [<ffffffff810ed194>] validate_chain.isra.19+0x8f4/0x1210
[ 182.349720]  [<ffffffff810ed3b6>] ? validate_chain.isra.19+0xb16/0x1210
[ 182.369261]  [<ffffffff810eacf8>] ? trace_hardirqs_off_caller+0x28/0x160
[ 182.389069]  [<ffffffff810f0c40>] __lock_acquire+0xac0/0xce0
[ 182.389330]  [<ffffffff810f150a>] lock_acquire+0xda/0x130
[ 182.409077]  [<ffffffff810ab745>] ? flush_work+0x5/0x280
[ 182.409320]  [<ffffffff810ab78c>] flush_work+0x4c/0x280
[ 182.409595]  [<ffffffff810ab745>] ? flush_work+0x5/0x280
[ 182.429306]  [<ffffffff810ee506>] ? mark_held_locks+0x136/0x150
[ 182.429634]  [<ffffffff820a67fb>] ? _raw_spin_unlock+0x2b/0x40
[ 182.449352]  [<ffffffff810aa5a5>] ? queue_work_on+0x75/0xa0
[ 182.469088]  [<ffffffff810ee78d>] ? trace_hardirqs_on+0xd/0x10
[ 182.469352]  [<ffffffff810aba42>] work_on_cpu+0x82/0x90
[ 182.489073]  [<ffffffff810a7940>] ? find_worker_executing_work+0x90/0x90
[ 182.489426]  [<ffffffff8151e290>] ? pci_device_shutdown+0x70/0x70
[ 182.509188]  [<ffffffff8151ebcf>] pci_device_probe+0xaf/0x110
[ 182.509448]  [<ffffffff8176608d>] driver_probe_device+0xdd/0x220
[ 182.529193]  [<ffffffff81766280>] ? __driver_attach+0xb0/0xb0
[ 182.529516]  [<ffffffff817662b3>] __device_attach+0x33/0x50
[ 182.549222]  [<ffffffff817640b6>] bus_for_each_drv+0x56/0xa0
[ 182.549503]  [<ffffffff81765f48>] device_attach+0x88/0xc0
[ 182.569215]  [<ffffffff81515b49>] pci_bus_add_device+0x39/0x60
[ 182.569513]  [<ffffffff81540605>] pci_bus_add_vf+0x25/0x40
[ 182.589239]  [<ffffffff81540834>] pci_bus_add_device_vfs+0xa4/0xe0
[ 182.589618]  [<ffffffff81c1faa6>] __mlx4_init_one+0xa96/0xc90
[ 182.609273]  [<ffffffff81c1fd0d>] mlx4_init_one+0x4d/0x60
[ 182.609588]  [<ffffffff8151e2db>] local_pci_probe+0x4b/0x80
[ 182.629584]  [<ffffffff810a7958>] work_for_cpu_fn+0x18/0x30
[ 182.629869]  [<ffffffff810aac6b>] process_one_work+0x28b/0x490
[ 182.649313]  [<ffffffff810aabe2>] ? process_one_work+0x202/0x490
[ 182.649608]  [<ffffffff810abf68>] ? worker_thread+0x48/0x370
[ 182.669325]  [<ffffffff810aae9c>] process_scheduled_works+0x2c/0x40
[ 182.690446]  [<ffffffff810ac158>] worker_thread+0x238/0x370
[ 182.690712]  [<ffffffff810ee78d>] ? trace_hardirqs_on+0xd/0x10
[ 182.709143]  [<ffffffff810abf20>] ? manage_workers.isra.18+0x330/0x330
[ 182.709499]  [<ffffffff810b2e78>] kthread+0xe8/0xf0
On 05/14/2013 11:44 AM, Yinghai Lu wrote:
> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
> <alexander.h.duyck@intel.com> wrote:
>> I'm sorry, but what is the point of this patch?  With device assignment
>> it is always possible to have VFs loaded and the PF driver unloaded,
>> since you cannot remove the VFs if they are assigned to a VM.
> Doesn't unloading the PF driver call pci_disable_sriov?

You cannot call pci_disable_sriov because you will panic all of the
guests that have devices assigned.

>> If there is a driver that has to have the PF driver fully loaded before
>> it instantiates the VFs, then it sounds like a buggy driver to me.  The
>> VF driver should be able to be loaded when the PF driver is not
>> present.  We handle it in igb and ixgbe, last I checked, and I don't
>> see any reason why it cannot be handled in all other VF drivers.  I'm
>> not saying the VF has to be fully functional, but it should be able to
>> detect the PF becoming enabled and then bring itself to a fully
>> functional state.  To not handle that case is a bug.
> More than that: there is a work_on_cpu nested-lock problem, from calling
> pci_bus_add_device in a driver's PCI probe function:
>
> [lockdep trace snipped -- quoted in full above]

So how does your patch actually fix this problem?  It seems like it is
just avoiding it.

From what I can tell your problem is originating in pci_call_probe.  I
believe it is calling work_on_cpu and that doesn't seem correct since
the work should be taking place on a CPU already local to the PF.  You
might want to look there to see why you are trying to schedule work on a
CPU which should be perfectly fine for you to already be doing your work
on.

Thanks,

Alex
On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
<alexander.h.duyck@intel.com> wrote:
> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
>> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
>> <alexander.h.duyck@intel.com> wrote:
>>> I'm sorry, but what is the point of this patch?  With device assignment
>>> it is always possible to have VFs loaded and the PF driver unloaded,
>>> since you cannot remove the VFs if they are assigned to a VM.
>> Doesn't unloading the PF driver call pci_disable_sriov?
>
> You cannot call pci_disable_sriov because you will panic all of the
> guests that have devices assigned.

ixgbe_remove did call pci_disable_sriov...

As for the guest panic, that is another problem -- just like doing PCI
passthrough with a real PCI device and hot-removing the card in the
host.

...

> So how does your patch actually fix this problem?  It seems like it is
> just avoiding it.

Yes, until the first one is done.

> From what I can tell your problem is originating in pci_call_probe.  I
> believe it is calling work_on_cpu and that doesn't seem correct since
> the work should be taking place on a CPU already local to the PF.  You
> might want to look there to see why you are trying to schedule work on a
> CPU which should be perfectly fine for you to already be doing your work
> on.

It always tries to go with a local CPU in the same PXM.
On 05/14/2013 12:59 PM, Yinghai Lu wrote:
> On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
> <alexander.h.duyck@intel.com> wrote:
>> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
>>> Doesn't unloading the PF driver call pci_disable_sriov?
>> You cannot call pci_disable_sriov because you will panic all of the
>> guests that have devices assigned.
> ixgbe_remove did call pci_disable_sriov...
>
> As for the guest panic, that is another problem -- just like doing PCI
> passthrough with a real PCI device and hot-removing the card in the
> host.
>
> ...

I suggest you take another look.  In ixgbe_disable_sriov, which is the
function that is called, we do a check for assigned VFs.  If they are
assigned then we do not call pci_disable_sriov.

>> So how does your patch actually fix this problem?  It seems like it is
>> just avoiding it.
> Yes, until the first one is done.

Avoiding the issue doesn't fix the underlying problem and instead you
are likely just introducing more bugs as a result.

>> From what I can tell your problem is originating in pci_call_probe.  I
>> believe it is calling work_on_cpu and that doesn't seem correct since
>> the work should be taking place on a CPU already local to the PF.  You
>> might want to look there to see why you are trying to schedule work on a
>> CPU which should be perfectly fine for you to already be doing your work
>> on.
> It always tries to go with a local CPU in the same PXM.

The problem is we really shouldn't be calling work_on_cpu in this case
since we are already on the correct CPU.  What probably should be
happening is that pci_call_probe should be doing a check to see if the
current CPU is already contained within the device node map, and if so
just call local_pci_probe directly.  That way you can avoid deadlocking
the system by trying to flush the queue of the CPU you are currently on.

Thanks,

Alex
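For illustration, the check Alex suggests could look roughly like this
inside pci_call_probe() in drivers/pci/pci-driver.c, reusing the
drv_dev_and_id and local_pci_probe helpers already defined there.  This
is a sketch of the idea from the thread, not necessarily the exact patch
that was eventually merged:

static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
                          const struct pci_device_id *id)
{
        int error, node;
        struct drv_dev_and_id ddi = { drv, dev, id };

        /* Run the probe on a CPU local to the device's node, but only
         * when we are not already on that node; probing locally avoids
         * work_on_cpu() flushing the very queue we are running on. */
        node = dev_to_node(&dev->dev);
        if (node >= 0 && node != numa_node_id()) {
                int cpu;

                get_online_cpus();
                cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
                if (cpu < nr_cpu_ids)
                        error = work_on_cpu(cpu, local_pci_probe, &ddi);
                else
                        error = local_pci_probe(&ddi);
                put_online_cpus();
        } else
                error = local_pci_probe(&ddi);
        return error;
}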
On Tue, May 14, 2013 at 11:43 AM, Yinghai Lu <yinghai@kernel.org> wrote:
> BTW, do you have any plan to make mlx4 support sriov_configure via sysfs?
Yes we do; we're waiting for a firmware change that will allow such a
patch to work.
Please note that this whole lockdep warning was identified as a false
positive by Tejun: http://www.spinics.net/lists/linux-pci/msg21568.html
Or.
On Wed, May 15, 2013 at 9:00 PM, Or Gerlitz <or.gerlitz@gmail.com> wrote:
> On Tue, May 14, 2013 at 11:43 AM, Yinghai Lu <yinghai@kernel.org> wrote:
>
>> BTW, do you have any plan to make mlx4 support sriov_configure via sysfs?
>
> Yes we do; we're waiting for a firmware change that will allow such a
> patch to work.
>
> Please note that this whole lockdep warning was identified as a false
> positive by Tejun: http://www.spinics.net/lists/linux-pci/msg21568.html

No -- at least one time I noticed a real hang happen.
On Thu, May 16, 2013 at 12:39 AM, Yinghai Lu <yinghai@kernel.org> wrote:
> On Wed, May 15, 2013 at 9:00 PM, Or Gerlitz <or.gerlitz@gmail.com> wrote:
>> On Tue, May 14, 2013 at 11:43 AM, Yinghai Lu <yinghai@kernel.org> wrote:
>>
>>> BTW, do you have any plan to make mlx4 support sriov_configure via sysfs?
>>
>> Yes we do; we're waiting for a firmware change that will allow such a
>> patch to work.
>>
>> Please note that this whole lockdep warning was identified as a false
>> positive by Tejun: http://www.spinics.net/lists/linux-pci/msg21568.html
>
> No -- at least one time I noticed a real hang happen.

Tejun, so what's your thinking here?
On Mon, May 13, 2013 at 07:28:25PM -0700, Yinghai Lu wrote:
> Found the kernel trying to load the mlx4 driver for VFs before the
> PF's driver has really loaded, when the drivers are built in and the
> kernel command line includes probe_vfs=63, num_vfs=63.
>
> It turns out this also happens on the hotadd path even when the
> drivers are compiled as modules and are loaded, especially when some
> VFs share the same driver with the PF.
>
> calling path:
>     device driver probe
>       ==> pci_enable_sriov
>         ==> virtfn_add
>           ==> pci_dev_add
>             ==> pci_bus_device_add
> when pci_bus_device_add is called, the VF's driver will be attached,
> and at that time the PF's driver has not finished yet.

Okay, but why is this a problem?

> Need to move pci_bus_device_add out of virtfn_add and call it later.
> Fix the problem for two paths:
>  1. hotadd path: use device_schedule_callback.
>  2. booting path: use an initcall to call that for all VFs.
>
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> Cc: netdev@vger.kernel.org
>
> ---
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c    |    7 +
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c      |    4 -
>  drivers/net/ethernet/cisco/enic/enic_main.c          |    2
>  drivers/net/ethernet/emulex/benet/be_main.c          |    4 +
>  drivers/net/ethernet/intel/igb/igb_main.c            |   11 ++
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c        |    2
>  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c       |    9 +-
>  drivers/net/ethernet/mellanox/mlx4/main.c            |    2
>  drivers/net/ethernet/neterion/vxge/vxge-main.c       |    3
>  drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c |    5 +
>  drivers/net/ethernet/sfc/efx.c                       |    1
>  drivers/pci/iov.c                                    |   73 +++++++++++++++++--
>  drivers/scsi/lpfc/lpfc_init.c                        |    2
>  include/linux/pci.h                                  |    4 +
>  14 files changed, 115 insertions(+), 14 deletions(-)
>
> Index: linux-2.6/drivers/net/ethernet/mellanox/mlx4/main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/mellanox/mlx4/main.c
> +++ linux-2.6/drivers/net/ethernet/mellanox/mlx4/main.c
> @@ -2308,6 +2308,8 @@ slave_start:
>          priv->pci_dev_data = pci_dev_data;
>          pci_set_drvdata(pdev, dev);
>
> +        pci_bus_add_device_vfs(pdev);
> +
>          return 0;
>
>  err_port:
> Index: linux-2.6/drivers/pci/iov.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/iov.c
> +++ linux-2.6/drivers/pci/iov.c
> @@ -66,7 +66,8 @@ static void virtfn_remove_bus(struct pci
>          pci_remove_bus(child);
>  }
>
> -static int virtfn_add(struct pci_dev *dev, int id, int reset)
> +static int virtfn_add(struct pci_dev *dev, int id, int reset,
> +                        struct pci_dev **ret)
>  {
>          int i;
>          int rc;
> @@ -116,7 +117,6 @@ static int virtfn_add(struct pci_dev *de
>          pci_device_add(virtfn, virtfn->bus);
>          mutex_unlock(&iov->dev->sriov->lock);
>
> -        rc = pci_bus_add_device(virtfn);
>          sprintf(buf, "virtfn%u", id);
>          rc = sysfs_create_link(&dev->dev.kobj, &virtfn->dev.kobj, buf);
>          if (rc)
> @@ -127,6 +127,8 @@ static int virtfn_add(struct pci_dev *de
>
>          kobject_uevent(&virtfn->dev.kobj, KOBJ_CHANGE);
>
> +        if (ret)
> +                *ret = virtfn;
>          return 0;
>
>  failed2:
> @@ -141,6 +143,55 @@ failed1:
>          return rc;
>  }
>
> +static void pci_bus_add_vf(struct pci_dev *dev)
> +{
> +        int rc;
> +
> +        if (!dev || !dev->is_virtfn || dev->is_added)
> +                return;
> +
> +        rc = pci_bus_add_device(dev);
> +}
> +
> +static void bus_add_vfs(struct device *device)
> +{
> +        struct pci_dev *dev = to_pci_dev(device);
> +        int i, num_vfs = pci_num_vf(dev);
> +
> +        for (i = 0; i < num_vfs; i++) {
> +                struct pci_bus *bus;
> +                struct pci_dev *virtfn;
> +
> +                bus = pci_find_bus(pci_domain_nr(dev->bus), virtfn_bus(dev, i));
> +                if (!bus)
> +                        continue;
> +
> +                virtfn = pci_get_slot(bus, virtfn_devfn(dev, i));
> +                pci_bus_add_vf(virtfn);
> +                pci_dev_put(virtfn);
> +        }
> +}
> +void pci_bus_add_device_vfs(struct pci_dev *pdev)
> +{
> +        if (system_state == SYSTEM_BOOTING)
> +                return;
> +
> +        device_schedule_callback(&pdev->dev, bus_add_vfs);
> +}
> +EXPORT_SYMBOL_GPL(pci_bus_add_device_vfs);
> +
> +/* Make sure all VFs get added before pci_sysfs_init */
> +static int __init pci_bus_add_device_vfs_booting(void)
> +{
> +        struct pci_dev *dev = NULL;
> +
> +        for_each_pci_dev(dev)
> +                pci_bus_add_vf(dev);
> +
> +        return 0;
> +}
> +device_initcall_sync(pci_bus_add_device_vfs_booting);
> +
>  static void virtfn_remove(struct pci_dev *dev, int id, int reset)
>  {
>          char buf[VIRTFN_ID_LEN];
> @@ -213,14 +264,22 @@ static void sriov_migration_task(struct
>                  if (state == PCI_SRIOV_VFM_MI) {
>                          writeb(PCI_SRIOV_VFM_AV, iov->mstate + i);
>                          state = readb(iov->mstate + i);
> -                        if (state == PCI_SRIOV_VFM_AV)
> -                                virtfn_add(iov->self, i, 1);
> +                        if (state == PCI_SRIOV_VFM_AV) {
> +                                struct pci_dev *virtfn = NULL;
> +
> +                                virtfn_add(iov->self, i, 1, &virtfn);
> +                                pci_bus_add_vf(virtfn);
> +                        }
>                  } else if (state == PCI_SRIOV_VFM_MO) {
>                          virtfn_remove(iov->self, i, 1);
>                          writeb(PCI_SRIOV_VFM_UA, iov->mstate + i);
>                          state = readb(iov->mstate + i);
> -                        if (state == PCI_SRIOV_VFM_AV)
> -                                virtfn_add(iov->self, i, 0);
> +                        if (state == PCI_SRIOV_VFM_AV) {
> +                                struct pci_dev *virtfn = NULL;
> +
> +                                virtfn_add(iov->self, i, 0, &virtfn);
> +                                pci_bus_add_vf(virtfn);
> +                        }
>                  }
>          }
>
> @@ -356,7 +415,7 @@ static int sriov_enable(struct pci_dev *
>          initial = nr_virtfn;
>
>          for (i = 0; i < initial; i++) {
> -                rc = virtfn_add(dev, i, 0);
> +                rc = virtfn_add(dev, i, 0, NULL);
>                  if (rc)
>                          goto failed;
>          }
> Index: linux-2.6/include/linux/pci.h
> ===================================================================
> --- linux-2.6.orig/include/linux/pci.h
> +++ linux-2.6/include/linux/pci.h
> @@ -1659,6 +1659,7 @@ void __iomem *pci_ioremap_bar(struct pci
>
>  #ifdef CONFIG_PCI_IOV
>  int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
> +void pci_bus_add_device_vfs(struct pci_dev *dev);
>  void pci_disable_sriov(struct pci_dev *dev);
>  irqreturn_t pci_sriov_migration(struct pci_dev *dev);
>  int pci_num_vf(struct pci_dev *dev);
> @@ -1670,6 +1671,9 @@ static inline int pci_enable_sriov(struc
>  {
>          return -ENODEV;
>  }
> +static inline void pci_bus_add_device_vfs(struct pci_dev *dev)
> +{
> +}
>  static inline void pci_disable_sriov(struct pci_dev *dev)
>  {
>  }
> Index: linux-2.6/drivers/net/ethernet/intel/igb/igb_main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/intel/igb/igb_main.c
> +++ linux-2.6/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -2366,6 +2366,9 @@ static int igb_probe(struct pci_dev *pde
>          }
>
>          pm_runtime_put_noidle(&pdev->dev);
> +
> +        pci_bus_add_device_vfs(pdev);
> +
>          return 0;
>
>  err_register:
> @@ -7278,8 +7281,12 @@ static int igb_pci_sriov_configure(struc
>  #ifdef CONFIG_PCI_IOV
>          if (num_vfs == 0)
>                  return igb_pci_disable_sriov(dev);
> -        else
> -                return igb_pci_enable_sriov(dev, num_vfs);
> +        else {
> +                int ret = igb_pci_enable_sriov(dev, num_vfs);
> +                if (ret > 0)
> +                        pci_bus_add_device_vfs(dev);
> +                return ret;
> +        }
>  #endif
>          return 0;
>  }
> Index: linux-2.6/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> +++ linux-2.6/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> @@ -346,8 +346,13 @@ int ixgbe_pci_sriov_configure(struct pci
>  {
>          if (num_vfs == 0)
>                  return ixgbe_pci_sriov_disable(dev);
> -        else
> -                return ixgbe_pci_sriov_enable(dev, num_vfs);
> +        else {
> +                int ret = ixgbe_pci_sriov_enable(dev, num_vfs);
> +
> +                if (ret > 0)
> +                        pci_bus_add_device_vfs(dev);
> +                return ret;
> +        }
>  }
>
>  static int ixgbe_set_vf_multicasts(struct ixgbe_adapter *adapter,
> Index: linux-2.6/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ linux-2.6/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -7658,6 +7658,8 @@ skip_sriov:
>                     IXGBE_LINK_SPEED_10GB_FULL | IXGBE_LINK_SPEED_1GB_FULL,
>                     true);
>
> +        pci_bus_add_device_vfs(pdev);
> +
>          return 0;
>
>  err_register:
> Index: linux-2.6/drivers/scsi/lpfc/lpfc_init.c
> ===================================================================
> --- linux-2.6.orig/drivers/scsi/lpfc/lpfc_init.c
> +++ linux-2.6/drivers/scsi/lpfc/lpfc_init.c
> @@ -10582,6 +10582,8 @@ lpfc_pci_probe_one(struct pci_dev *pdev,
>          else
>                  rc = lpfc_pci_probe_one_s3(pdev, pid);
>
> +        pci_bus_add_device_vfs(pdev);
> +
>          return rc;
>  }
> Index: linux-2.6/drivers/net/ethernet/emulex/benet/be_main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/emulex/benet/be_main.c
> +++ linux-2.6/drivers/net/ethernet/emulex/benet/be_main.c
> @@ -4119,6 +4119,7 @@ static int lancer_recover_func(struct be
>          if (status)
>                  goto err;
>
> +        pci_bus_add_device_vfs(adapter->pdev);
>          if (netif_running(adapter->netdev)) {
>                  status = be_open(adapter->netdev);
>                  if (status)
> @@ -4335,6 +4336,8 @@ static int be_probe(struct pci_dev *pdev
>          dev_info(&pdev->dev, "%s: %s %s port %c\n", nic_name(pdev),
>                   func_name(adapter), mc_name(adapter), port_name);
>
> +        pci_bus_add_device_vfs(pdev);
> +
>          return 0;
>
> unsetup:
> @@ -4406,6 +4409,7 @@ static int be_resume(struct pci_dev *pde
>                  rtnl_unlock();
>          }
>
> +        pci_bus_add_device_vfs(adapter->pdev);
>          schedule_delayed_work(&adapter->func_recovery_work,
>                                msecs_to_jiffies(1000));
>          netif_device_attach(netdev);
> Index: linux-2.6/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
> +++ linux-2.6/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
> @@ -568,8 +568,11 @@ int qlcnic_pci_sriov_configure(struct pc
>
>          if (num_vfs == 0)
>                  err = qlcnic_pci_sriov_disable(adapter);
> -        else
> +        else {
>                  err = qlcnic_pci_sriov_enable(adapter, num_vfs);
> +                if (err > 0)
> +                        pci_bus_add_device_vfs(dev);
> +        }
>
>          clear_bit(__QLCNIC_RESETTING, &adapter->state);
>          return err;
> Index: linux-2.6/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
> +++ linux-2.6/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
> @@ -3048,7 +3048,12 @@ int bnx2x_sriov_configure(struct pci_dev
>                  pci_disable_sriov(dev);
>                  return 0;
>          } else {
> -                return bnx2x_enable_sriov(bp);
> +                int ret = bnx2x_enable_sriov(bp);
> +
> +                if (ret > 0)
> +                        pci_bus_add_device_vfs(dev);
> +
> +                return 0;
>          }
>  }
> Index: linux-2.6/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
> +++ linux-2.6/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
> @@ -5749,10 +5749,12 @@ static int init_one(struct pci_dev *pdev
>  sriov:
>  #ifdef CONFIG_PCI_IOV
>          if (func < ARRAY_SIZE(num_vf) && num_vf[func] > 0)
> -                if (pci_enable_sriov(pdev, num_vf[func]) == 0)
> +                if (pci_enable_sriov(pdev, num_vf[func]) == 0) {
>                          dev_info(&pdev->dev,
>                                   "instantiated %u virtual functions\n",
>                                   num_vf[func]);
> +                        pci_bus_add_device_vfs(pdev);
> +                }
>  #endif
>          return 0;
> Index: linux-2.6/drivers/net/ethernet/cisco/enic/enic_main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/cisco/enic/enic_main.c
> +++ linux-2.6/drivers/net/ethernet/cisco/enic/enic_main.c
> @@ -2524,6 +2524,8 @@ static int enic_probe(struct pci_dev *pd
>                  goto err_out_dev_deinit;
>          }
>
> +        pci_bus_add_device_vfs(pdev);
> +
>          return 0;
>
>  err_out_dev_deinit:
> Index: linux-2.6/drivers/net/ethernet/neterion/vxge/vxge-main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/neterion/vxge/vxge-main.c
> +++ linux-2.6/drivers/net/ethernet/neterion/vxge/vxge-main.c
> @@ -4731,6 +4731,9 @@ vxge_probe(struct pci_dev *pdev, const s
>                          vxge_hw_device_trace_level_get(hldev));
>
>          kfree(ll_config);
> +
> +        pci_bus_add_device_vfs(pdev);
> +
>          return 0;
>
> _exit6:
> Index: linux-2.6/drivers/net/ethernet/sfc/efx.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/sfc/efx.c
> +++ linux-2.6/drivers/net/ethernet/sfc/efx.c
> @@ -2822,6 +2822,7 @@ static int efx_pci_probe(struct pci_dev
>                  netif_warn(efx, probe, efx->net_dev,
>                             "pci_enable_pcie_error_reporting failed (%d)\n", rc);
>
> +        pci_bus_add_device_vfs(pci_dev);
>          return 0;
>
>  fail4:
On Thu, May 16, 2013 at 12:56:42AM -0400, Or Gerlitz wrote:
>>> Please note that this whole lockdep warning was identified as a false
>>> positive by Tejun: http://www.spinics.net/lists/linux-pci/msg21568.html
>>
>> No -- at least one time I noticed a real hang happen.
>
> Tejun, so what's your thinking here?

Can't really tell much without more details.  Yinghai, do you have any
logs from such a hang?  Are you sure it's from this?

Thanks.
On Thu, May 16, 2013 at 10:53 AM, Tejun Heo <tj@kernel.org> wrote:
> On Thu, May 16, 2013 at 12:56:42AM -0400, Or Gerlitz wrote:
>>>> Please note that this whole lockdep warning was identified as a false
>>>> positive by Tejun: http://www.spinics.net/lists/linux-pci/msg21568.html
>>>
>>> No -- at least one time I noticed a real hang happen.
>>
>> Tejun, so what's your thinking here?
>
> Can't really tell much without more details.  Yinghai, do you have any
> logs from such a hang?  Are you sure it's from this?

I did not keep that log.

I will remove AlexD's patch and try several times to see if I can
duplicate it again.

Thanks

Yinghai
On Thu, May 16, 2013 at 9:36 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> On Thu, May 16, 2013 at 10:53 AM, Tejun Heo <tj@kernel.org> wrote:
>> On Thu, May 16, 2013 at 12:56:42AM -0400, Or Gerlitz wrote:
>>>>> Please note that this whole lockdep warning was identified as a false
>>>>> positive by Tejun: http://www.spinics.net/lists/linux-pci/msg21568.html
>>>> No -- at least one time I noticed a real hang happen.
>>> Tejun, so what's your thinking here?
>> Can't really tell much without more details.  Yinghai, do you have any
>> logs from such a hang?  Are you sure it's from this?
> I did not keep that log.
>
> I will remove AlexD's patch and try several times to see if I can
> duplicate it again.

Any news here?

Or.
On 05/14/2013 05:39 PM, Alexander Duyck wrote:
> On 05/14/2013 12:59 PM, Yinghai Lu wrote:
>> [earlier exchange snipped -- quoted in full above]
>
> I suggest you take another look.  In ixgbe_disable_sriov, which is the
> function that is called, we do a check for assigned VFs.  If they are
> assigned then we do not call pci_disable_sriov.
>
> [...]
>
> The problem is we really shouldn't be calling work_on_cpu in this case
> since we are already on the correct CPU.  What probably should be
> happening is that pci_call_probe should be doing a check to see if the
> current CPU is already contained within the device node map, and if so
> just call local_pci_probe directly.  That way you can avoid deadlocking
> the system by trying to flush the queue of the CPU you are currently on.

That's the patch that Michael Tsirkin posted for a fix, but it was noted
that if you have the case where the _same_ driver is used for VF & PF,
other deadlocks may occur.  It would work in the case of ixgbe/ixgbevf,
but not for something like the Mellanox PF/VF driver (which is the same).
On 05/21/2013 05:30 PM, Don Dutile wrote:
> On 05/14/2013 05:39 PM, Alexander Duyck wrote:
>> [...]
>
> That's the patch that Michael Tsirkin posted for a fix, but it was noted
> that if you have the case where the _same_ driver is used for VF & PF,
> other deadlocks may occur.  It would work in the case of ixgbe/ixgbevf,
> but not for something like the Mellanox PF/VF driver (which is the same).

Apologies; here's the thread that discussed the issue:
https://patchwork.kernel.org/patch/2458681/
On Tue, May 21, 2013 at 05:30:32PM -0400, Don Dutile wrote:
> On 05/14/2013 05:39 PM, Alexander Duyck wrote:
>> [earlier exchange snipped -- quoted in full above]
>>
>> The problem is we really shouldn't be calling work_on_cpu in this case
>> since we are already on the correct CPU.  What probably should be
>> happening is that pci_call_probe should be doing a check to see if the
>> current CPU is already contained within the device node map, and if so
>> just call local_pci_probe directly.  That way you can avoid deadlocking
>> the system by trying to flush the queue of the CPU you are currently on.
>
> That's the patch that Michael Tsirkin posted for a fix, but it was noted
> that if you have the case where the _same_ driver is used for VF & PF,
> other deadlocks may occur.  It would work in the case of ixgbe/ixgbevf,
> but not for something like the Mellanox PF/VF driver (which is the same).

I think our conclusion was that this is a false positive for Mellanox.
If not, we need to understand the deadlock better.
On 05/21/2013 02:31 PM, Don Dutile wrote:
> On 05/21/2013 05:30 PM, Don Dutile wrote:
>> On 05/14/2013 05:39 PM, Alexander Duyck wrote:
[...]
>> That's the patch that Michael Tsirkin posted for a fix,
>> but it was noted that if you have the case where the _same_ driver is
>> used for vf & pf, other deadlocks may occur.
>> It would work in the case of ixgbe/ixgbevf, but not for something like
>> the Mellanox pf/vf driver (which is the same).
>>
> apologies; here's the thread that discussed the issue:
> https://patchwork.kernel.org/patch/2458681/

I found out about that patch after I submitted one that was similar.
The only real complaint I had about his patch was that it was only
looking at the CPU, and he could save himself some trouble by just doing
the work locally if we were on the correct NUMA node. For example, if
the system only has one node in it, what is the point of scheduling all
of the work on CPU 0? My alternative patch can be found at:
https://patchwork.kernel.org/patch/2568881/

As far as the inter-driver locking issues for the same driver, I don't
think that is really any kind of issue. Most drivers shouldn't be
holding any big locks when they call pci_enable_sriov. If I am not
mistaken, the follow-on patch I submitted, which was similar to
Michael's, was reported to have resolved the issue.

As far as the Mellanox PF/VF, the bigger issue is that when they call
pci_enable_sriov they are not ready to handle VFs. There have been
several suggestions on how to resolve it, including -EPROBE_DEFER or
the igbvf/ixgbevf approach of just bringing up the device in a "link
down" state.

Thanks,

Alex
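[The fix Alex outlines above -- probe locally when the probing CPU is
already on the device's NUMA node, and only bounce through work_on_cpu()
otherwise -- would look roughly like the following sketch against
pci_call_probe() in drivers/pci/pci-driver.c. This is a sketch of the
proposal only, assuming the 3.9-era helpers already in that file
(struct drv_dev_and_id and local_pci_probe()), not the final committed
code.]

static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
			  const struct pci_device_id *id)
{
	int error, node;
	struct drv_dev_and_id ddi = { drv, dev, id };

	/*
	 * Probe on a CPU local to the device's NUMA node so the driver
	 * allocates its memory close to the device -- but only bounce
	 * through work_on_cpu() when we are not already on such a CPU.
	 * A probe that is already running on the right node (e.g. a VF
	 * probe nested under its PF's probe) then runs locally instead
	 * of trying to flush its own CPU's workqueue.
	 */
	node = dev_to_node(&dev->dev);
	if (node >= 0 && node != numa_node_id()) {
		int cpu;

		get_online_cpus();
		cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
		if (cpu < nr_cpu_ids)
			error = work_on_cpu(cpu, local_pci_probe, &ddi);
		else
			error = local_pci_probe(&ddi);
		put_online_cpus();
	} else
		error = local_pci_probe(&ddi);
	return error;
}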
On 05/21/2013 02:49 PM, Michael S. Tsirkin wrote:
> On Tue, May 21, 2013 at 05:30:32PM -0400, Don Dutile wrote:
>> On 05/14/2013 05:39 PM, Alexander Duyck wrote:
[...]
>> That's the patch that Michael Tsirkin posted for a fix,
>> but it was noted that if you have the case where the _same_ driver is
>> used for vf & pf, other deadlocks may occur.
>> It would work in the case of ixgbe/ixgbevf, but not for something like
>> the Mellanox pf/vf driver (which is the same).
>>
> I think our conclusion was this is a false positive for Mellanox.
> If not, we need to understand what the deadlock is better.

As I understand the issue, the problem is not a deadlock for Mellanox
(at least with either your patch or mine applied); the issue is that the
PF is not ready to handle VFs when pci_enable_sriov is called, due to
some firmware issues.

Thanks,

Alex
On 05/21/2013 05:58 PM, Alexander Duyck wrote:
> On 05/21/2013 02:31 PM, Don Dutile wrote:
>> On 05/21/2013 05:30 PM, Don Dutile wrote:
[...]
>
> I found out about that patch after I submitted one that was similar.
> The only real complaint I had about his patch was that it was only
> looking at the CPU, and he could save himself some trouble by just doing
> the work locally if we were on the correct NUMA node. For example, if
> the system only has one node in it, what is the point of scheduling all
> of the work on CPU 0? My alternative patch can be found at:
> https://patchwork.kernel.org/patch/2568881/
>
> As far as the inter-driver locking issues for the same driver, I don't
> think that is really any kind of issue. Most drivers shouldn't be
> holding any big locks when they call pci_enable_sriov. If I am not
> mistaken, the follow-on patch I submitted, which was similar to
> Michael's, was reported to have resolved the issue.
>
You mean the above patchwork patch, or another one?

> As far as the Mellanox PF/VF, the bigger issue is that when they call
> pci_enable_sriov they are not ready to handle VFs. There have been
> several suggestions on how to resolve it, including -EPROBE_DEFER or
> the igbvf/ixgbevf approach of just bringing up the device in a "link
> down" state.
>
thanks for the summary. I was backlogged on email and responding as I
read them; I should have read through the whole thread before chiming in.
On Tue, May 21, 2013 at 03:01:08PM -0700, Alexander Duyck wrote:
> On 05/21/2013 02:49 PM, Michael S. Tsirkin wrote:
>> On Tue, May 21, 2013 at 05:30:32PM -0400, Don Dutile wrote:
[...]
>>
>> I think our conclusion was this is a false positive for Mellanox.
>> If not, we need to understand what the deadlock is better.
>
> As I understand the issue, the problem is not a deadlock for Mellanox
> (at least with either your patch or mine applied); the issue is that the
> PF is not ready to handle VFs when pci_enable_sriov is called, due to
> some firmware issues.
>
> Thanks,
>
> Alex

I haven't seen Mellanox guys say anything like this on the list.
Pointers?
All I saw is some lockdep warnings and Tejun says they are bogus ...
On 05/21/2013 03:09 PM, Don Dutile wrote:
> On 05/21/2013 05:58 PM, Alexander Duyck wrote:
>> On 05/21/2013 02:31 PM, Don Dutile wrote:
[...]
>>
>> I found out about that patch after I submitted one that was similar.
>> The only real complaint I had about his patch was that it was only
>> looking at the CPU, and he could save himself some trouble by just doing
>> the work locally if we were on the correct NUMA node. For example, if
>> the system only has one node in it, what is the point of scheduling all
>> of the work on CPU 0? My alternative patch can be found at:
>> https://patchwork.kernel.org/patch/2568881/
>>
>> As far as the inter-driver locking issues for the same driver, I don't
>> think that is really any kind of issue. Most drivers shouldn't be
>> holding any big locks when they call pci_enable_sriov. If I am not
>> mistaken, the follow-on patch I submitted, which was similar to
>> Michael's, was reported to have resolved the issue.
>>
> You mean the above patchwork patch, or another one?

Well I know the above patchwork patch resolves it, but I am assuming
Michael's would probably work as well since it resolves the underlying
issue.

>> As far as the Mellanox PF/VF, the bigger issue is that when they call
>> pci_enable_sriov they are not ready to handle VFs. There have been
>> several suggestions on how to resolve it, including -EPROBE_DEFER or
>> the igbvf/ixgbevf approach of just bringing up the device in a "link
>> down" state.
>>
> thanks for the summary. I was backlogged on email and responding as I
> read them; I should have read through the whole thread before chiming in.
>
No problem. My main concern at this point is that we should probably get
either Michael's patch or mine pulled in, since the potential for
deadlock is still there.

Thanks,

Alex
On 05/21/2013 03:11 PM, Michael S. Tsirkin wrote:
> On Tue, May 21, 2013 at 03:01:08PM -0700, Alexander Duyck wrote:
>> On 05/21/2013 02:49 PM, Michael S. Tsirkin wrote:
[...]
>>
>> As I understand the issue, the problem is not a deadlock for Mellanox
>> (at least with either your patch or mine applied); the issue is that the
>> PF is not ready to handle VFs when pci_enable_sriov is called, due to
>> some firmware issues.
>
> I haven't seen Mellanox guys say anything like this on the list.
> Pointers?
> All I saw is some lockdep warnings and Tejun says they are bogus ...

Actually the patch I submitted is at:
https://patchwork.kernel.org/patch/2568881/

It was in response to:
https://patchwork.kernel.org/patch/2562471/

Basically the patch I was responding to was supposed to address both the
lockdep issue and a problem with mlx4 not being able to support the VFs
when pci_enable_sriov is called. Yinghai had specifically called out the
work_on_cpu lockdep issue that you also submitted a patch for.

As per the feedback from Yinghai, it seems like my patch does resolve
the lockdep issue that was seen. The other half of the issue is what we
have been discussing with Or with regard to delaying VF driver init via
something like -EPROBE_DEFER instead of trying to split up
pci_enable_sriov and VF probe.

Thanks,

Alex
On Wed, May 22, 2013 at 1:30 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On 05/21/2013 03:11 PM, Michael S. Tsirkin wrote:
[...]
>> I haven't seen Mellanox guys say anything like this on the list.
>> Pointers?
>> All I saw is some lockdep warnings and Tejun says they are bogus ...
>
> Actually the patch I submitted is at:
> https://patchwork.kernel.org/patch/2568881/
>
> It was in response to:
> https://patchwork.kernel.org/patch/2562471/
>
> Basically the patch I was responding to was supposed to address both the
> lockdep issue and a problem with mlx4 not being able to support the VFs
> when pci_enable_sriov is called. Yinghai had specifically called out the
> work_on_cpu lockdep issue that you also submitted a patch for.
>
> As per the feedback from Yinghai, it seems like my patch does resolve
> the lockdep issue that was seen. The other half of the issue is what we
> have been discussing with Or with regard to delaying VF driver init via
> something like -EPROBE_DEFER instead of trying to split up
> pci_enable_sriov and VF probe.

Hi Alex, all, so to clarify:

1. currently, due to a firmware limitation, we must call
pci_enable_sriov before the PF ends the initialization sequence done in
its PCI probe callback, hence

2. we can't move to the new sysfs API for enabling SRIOV

3. as of 3.9-rc1 we see these nested probes; bisected that to commit
90888ac01d059e38ffe77a2291d44cafa9016fb "driver core: fix possible
missing of device probe". But we didn't reach consensus with the author
on whether this was impossible before the commit, nor on whether this is
something that needs to be avoided, see
http://marc.info/?t=136249697200007&r=1&w=2

4. I am not sure if/how we can modify the PF code to support the case
where VFs are probed and start their initialization sequence before the
PF is done with its initialization

5. etc.

All in all, we will look into returning -EPROBE_DEFER from the VFs when
they identify the problematic situation -- so for how much time is this
deferred? Or, if this isn't time based, what is the logical condition
which, once met, causes the VF probe to be attempted again?

Or.
On 05/22/2013 04:16 PM, Or Gerlitz wrote:
> On Wed, May 22, 2013 at 1:30 AM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
[...]
>
> Hi Alex, all, so to clarify:
>
> 1. currently, due to a firmware limitation, we must call
> pci_enable_sriov before the PF ends the initialization sequence done in
> its PCI probe callback, hence
>
> 2. we can't move to the new sysfs API for enabling SRIOV
>
> 3. as of 3.9-rc1 we see these nested probes; bisected that to commit
> 90888ac01d059e38ffe77a2291d44cafa9016fb "driver core: fix possible
> missing of device probe". But we didn't reach consensus with the author
> on whether this was impossible before the commit, nor on whether this is
> something that needs to be avoided, see
> http://marc.info/?t=136249697200007&r=1&w=2
>
> 4. I am not sure if/how we can modify the PF code to support the case
> where VFs are probed and start their initialization sequence before the
> PF is done with its initialization
>
> 5. etc.
>
> All in all, we will look into returning -EPROBE_DEFER from the VFs when
> they identify the problematic situation -- so for how much time is this
> deferred? Or, if this isn't time based, what is the logical condition
> which, once met, causes the VF probe to be attempted again?
>
ah, sounds device specific.... i.e., it goes back to PF probe....

So, I'm assuming some sort of init/info-exchange is done between VF & PF,
and it has to be done to some level before the PF can continue its
pci-probe operation. In that case, have the VF & PF done sufficient
init/info-exchange on the first call that the PF can continue, and then
queue up an sriov-enable at the end of PF probe?
On Wed, 2013-05-22 at 23:16 +0300, Or Gerlitz wrote:
[...]
> All in all, we will look into returning -EPROBE_DEFER from the VFs when
> they identify the problematic situation -- so for how much time is this
> deferred? Or, if this isn't time based, what is the logical condition
> which, once met, causes the VF probe to be attempted again?

My reading is that it will be retried as soon as the PF probe returns.
But I've never tried using this feature myself.

Ben.
On Thu, May 23, 2013 at 2:45 AM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
> On Wed, 2013-05-22 at 23:16 +0300, Or Gerlitz wrote:
> [...]
>> All in all, we will look into returning -EPROBE_DEFER from the VFs when
>> they identify the problematic situation -- so for how much time is this
>> deferred? Or, if this isn't time based, what is the logical condition
>> which, once met, causes the VF probe to be attempted again?
>
> My reading is that it will be retried as soon as the PF probe returns.
> But I've never tried using this feature myself.

Yes, empirically this is what happens: the VF probe detects that the PF
isn't ready yet and returns an error. A few seconds later the VF is
probed again and this time it works. Today we return -EIO, so we need to
change that to -EPROBE_DEFER to make this more robust.

Or.
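[The change Or describes would sit in the VF path of the mlx4 PCI probe
and look roughly like the sketch below. This is a hedged sketch, not the
actual mlx4 code: mlx4_pf_is_ready() is a hypothetical helper standing
in for whatever handshake the real VF probe performs to detect that the
PF has finished initializing.]

static int mlx4_vf_probe_one(struct pci_dev *pdev)
{
	/*
	 * If the PF has not completed its own initialization yet, ask
	 * the driver core to retry this probe later instead of failing
	 * outright.  -EPROBE_DEFER puts the device on the deferred
	 * probe list, and the probe is retried when another driver's
	 * probe completes (e.g. when the PF probe returns).
	 */
	if (!mlx4_pf_is_ready(pdev))	/* hypothetical readiness check */
		return -EPROBE_DEFER;	/* previously returned -EIO */

	/* ... normal VF initialization continues here ... */
	return 0;
}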
On Thu, May 23, 2013 at 12:40 AM, Don Dutile <ddutile@redhat.com> wrote:
> On 05/22/2013 04:16 PM, Or Gerlitz wrote:
[...]
> ah, sounds device specific.... i.e., it goes back to PF probe....
>
> So, I'm assuming some sort of init/info-exchange is done between VF & PF,
> and it has to be done to some level before the PF can continue its
> pci-probe operation. In that case, have the VF & PF done sufficient
> init/info-exchange on the first call that the PF can continue, and then
> queue up an sriov-enable at the end of PF probe?

Not exactly (sorry if I am repeating myself) -- currently we must call
pci_enable_sriov before the PF ends the initialization sequence done in
its probe function. As soon as pci_enable_sriov is called, VFs start to
get probed, and they detect that the PF has not done its initialization,
so the VF init/info-exchange can't be done. As to your question, we
can't allow VFs to start their probing before the PF did sriov-enable
because of that very same limitation.

Summing up my recollection of the 2-3 related threads that were around
lately:

1. when the VFs see they can't probe, we will return -EPROBE_DEFER --
this will fix the mlx4-specific issue

2. once the limitation is removed, mlx4 will implement the sysfs method,
so doing sriov-enable will be decoupled from PF probing (see the sketch
after this message)

3. since nested probing is more general, we still need to see if/which
of Michael's/Tejun's/Alex's patches needs to go in

makes sense?

Or.
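[For reference, the "sysfs method" in point 2 is the .sriov_configure
hook in struct pci_driver, which the PCI core invokes when userspace
writes a VF count to /sys/bus/pci/devices/.../sriov_numvfs. Below is a
minimal sketch of how a driver could wire it up once SR-IOV enablement
no longer has to happen inside PF probe; my_sriov_configure and
my_pf_driver are illustrative names, not mlx4 code.]

static int my_sriov_configure(struct pci_dev *pdev, int num_vfs)
{
	int ret;

	if (num_vfs == 0) {
		/* Userspace wrote 0: tear the VFs back down. */
		pci_disable_sriov(pdev);
		return 0;
	}

	/*
	 * pci_enable_sriov() returns 0 on success; the sysfs hook is
	 * expected to return the number of VFs actually enabled.
	 */
	ret = pci_enable_sriov(pdev, num_vfs);
	return ret ? ret : num_vfs;
}

static struct pci_driver my_pf_driver = {
	.name		 = "my_pf",
	/* .id_table, .probe, .remove, ... */
	.sriov_configure = my_sriov_configure,
};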
Index: linux-2.6/drivers/net/ethernet/mellanox/mlx4/main.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/mellanox/mlx4/main.c
+++ linux-2.6/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2308,6 +2308,8 @@ slave_start:
 	priv->pci_dev_data = pci_dev_data;
 	pci_set_drvdata(pdev, dev);
 
+	pci_bus_add_device_vfs(pdev);
+
 	return 0;
 
 err_port:
Index: linux-2.6/drivers/pci/iov.c
===================================================================
--- linux-2.6.orig/drivers/pci/iov.c
+++ linux-2.6/drivers/pci/iov.c
@@ -66,7 +66,8 @@ static void virtfn_remove_bus(struct pci
 	pci_remove_bus(child);
 }
 
-static int virtfn_add(struct pci_dev *dev, int id, int reset)
+static int virtfn_add(struct pci_dev *dev, int id, int reset,
+		      struct pci_dev **ret)
 {
 	int i;
 	int rc;
@@ -116,7 +117,6 @@ static int virtfn_add(struct pci_dev *de
 	pci_device_add(virtfn, virtfn->bus);
 	mutex_unlock(&iov->dev->sriov->lock);
 
-	rc = pci_bus_add_device(virtfn);
 	sprintf(buf, "virtfn%u", id);
 	rc = sysfs_create_link(&dev->dev.kobj, &virtfn->dev.kobj, buf);
 	if (rc)
@@ -127,6 +127,8 @@ static int virtfn_add(struct pci_dev *de
 
 	kobject_uevent(&virtfn->dev.kobj, KOBJ_CHANGE);
 
+	if (ret)
+		*ret = virtfn;
 	return 0;
 
 failed2:
@@ -141,6 +143,55 @@ failed1:
 	return rc;
 }
 
+static void pci_bus_add_vf(struct pci_dev *dev)
+{
+	int rc;
+
+	if (!dev || !dev->is_virtfn || dev->is_added)
+		return;
+
+	rc = pci_bus_add_device(dev);
+}
+
+static void bus_add_vfs(struct device *device)
+{
+	struct pci_dev *dev = to_pci_dev(device);
+	int i, num_vfs = pci_num_vf(dev);
+
+	for (i = 0; i < num_vfs; i++) {
+		struct pci_bus *bus;
+		struct pci_dev *virtfn;
+
+		bus = pci_find_bus(pci_domain_nr(dev->bus), virtfn_bus(dev, i));
+		if (!bus)
+			continue;
+
+		virtfn = pci_get_slot(bus, virtfn_devfn(dev, i));
+		pci_bus_add_vf(virtfn);
+		pci_dev_put(virtfn);
+	}
+}
+void pci_bus_add_device_vfs(struct pci_dev *pdev)
+{
+	if (system_state == SYSTEM_BOOTING)
+		return;
+
+	device_schedule_callback(&pdev->dev, bus_add_vfs);
+}
+EXPORT_SYMBOL_GPL(pci_bus_add_device_vfs);
+
+/* Make sure all VFs get added before pci_sysfs_init */
+static int __init pci_bus_add_device_vfs_booting(void)
+{
+	struct pci_dev *dev = NULL;
+
+	for_each_pci_dev(dev)
+		pci_bus_add_vf(dev);
+
+	return 0;
+}
+device_initcall_sync(pci_bus_add_device_vfs_booting);
+
 static void virtfn_remove(struct pci_dev *dev, int id, int reset)
 {
 	char buf[VIRTFN_ID_LEN];
@@ -213,14 +264,22 @@ static void sriov_migration_task(struct
 		if (state == PCI_SRIOV_VFM_MI) {
 			writeb(PCI_SRIOV_VFM_AV, iov->mstate + i);
 			state = readb(iov->mstate + i);
-			if (state == PCI_SRIOV_VFM_AV)
-				virtfn_add(iov->self, i, 1);
+			if (state == PCI_SRIOV_VFM_AV) {
+				struct pci_dev *virtfn = NULL;
+
+				virtfn_add(iov->self, i, 1, &virtfn);
+				pci_bus_add_vf(virtfn);
+			}
 		} else if (state == PCI_SRIOV_VFM_MO) {
 			virtfn_remove(iov->self, i, 1);
 			writeb(PCI_SRIOV_VFM_UA, iov->mstate + i);
 			state = readb(iov->mstate + i);
-			if (state == PCI_SRIOV_VFM_AV)
-				virtfn_add(iov->self, i, 0);
+			if (state == PCI_SRIOV_VFM_AV) {
+				struct pci_dev *virtfn = NULL;
+
+				virtfn_add(iov->self, i, 0, &virtfn);
+				pci_bus_add_vf(virtfn);
+			}
 		}
 	}
@@ -356,7 +415,7 @@ static int sriov_enable(struct pci_dev *
 	initial = nr_virtfn;
 
 	for (i = 0; i < initial; i++) {
-		rc = virtfn_add(dev, i, 0);
+		rc = virtfn_add(dev, i, 0, NULL);
 		if (rc)
 			goto failed;
 	}
Index: linux-2.6/include/linux/pci.h
===================================================================
--- linux-2.6.orig/include/linux/pci.h
+++ linux-2.6/include/linux/pci.h
@@ -1659,6 +1659,7 @@ void __iomem *pci_ioremap_bar(struct pci
 #ifdef CONFIG_PCI_IOV
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
+void pci_bus_add_device_vfs(struct pci_dev *dev);
 void pci_disable_sriov(struct pci_dev *dev);
 irqreturn_t pci_sriov_migration(struct pci_dev *dev);
 int pci_num_vf(struct pci_dev *dev);
@@ -1670,6 +1671,9 @@ static inline int pci_enable_sriov(struc
 {
 	return -ENODEV;
 }
+static inline void pci_bus_add_device_vfs(struct pci_dev *dev)
+{
+}
 static inline void pci_disable_sriov(struct pci_dev *dev)
 {
 }
Index: linux-2.6/drivers/net/ethernet/intel/igb/igb_main.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/intel/igb/igb_main.c
+++ linux-2.6/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2366,6 +2366,9 @@ static int igb_probe(struct pci_dev *pde
 	}
 
 	pm_runtime_put_noidle(&pdev->dev);
+
+	pci_bus_add_device_vfs(pdev);
+
 	return 0;
 
 err_register:
@@ -7278,8 +7281,12 @@ static int igb_pci_sriov_configure(struc
 #ifdef CONFIG_PCI_IOV
 	if (num_vfs == 0)
 		return igb_pci_disable_sriov(dev);
-	else
-		return igb_pci_enable_sriov(dev, num_vfs);
+	else {
+		int ret = igb_pci_enable_sriov(dev, num_vfs);
+		if (ret > 0)
+			pci_bus_add_device_vfs(dev);
+		return ret;
+	}
 #endif
 	return 0;
 }
Index: linux-2.6/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ linux-2.6/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -346,8 +346,13 @@ int ixgbe_pci_sriov_configure(struct pci
 {
 	if (num_vfs == 0)
 		return ixgbe_pci_sriov_disable(dev);
-	else
-		return ixgbe_pci_sriov_enable(dev, num_vfs);
+	else {
+		int ret = ixgbe_pci_sriov_enable(dev, num_vfs);
+
+		if (ret > 0)
+			pci_bus_add_device_vfs(dev);
+		return ret;
+	}
 }
 
 static int ixgbe_set_vf_multicasts(struct ixgbe_adapter *adapter,
Index: linux-2.6/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ linux-2.6/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7658,6 +7658,8 @@ skip_sriov:
 			   IXGBE_LINK_SPEED_10GB_FULL | IXGBE_LINK_SPEED_1GB_FULL,
 			   true);
 
+	pci_bus_add_device_vfs(pdev);
+
 	return 0;
 
 err_register:
Index: linux-2.6/drivers/scsi/lpfc/lpfc_init.c
===================================================================
--- linux-2.6.orig/drivers/scsi/lpfc/lpfc_init.c
+++ linux-2.6/drivers/scsi/lpfc/lpfc_init.c
@@ -10582,6 +10582,8 @@ lpfc_pci_probe_one(struct pci_dev *pdev,
 	else
 		rc = lpfc_pci_probe_one_s3(pdev, pid);
 
+	pci_bus_add_device_vfs(pdev);
+
 	return rc;
 }
Index: linux-2.6/drivers/net/ethernet/emulex/benet/be_main.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/emulex/benet/be_main.c
+++ linux-2.6/drivers/net/ethernet/emulex/benet/be_main.c
@@ -4119,6 +4119,7 @@ static int lancer_recover_func(struct be
 	if (status)
 		goto err;
 
+	pci_bus_add_device_vfs(adapter->pdev);
 	if (netif_running(adapter->netdev)) {
 		status = be_open(adapter->netdev);
 		if (status)
@@ -4335,6 +4336,8 @@ static int be_probe(struct pci_dev *pdev
 	dev_info(&pdev->dev, "%s: %s %s port %c\n", nic_name(pdev),
 		 func_name(adapter), mc_name(adapter), port_name);
 
+	pci_bus_add_device_vfs(pdev);
+
 	return 0;
 
 unsetup:
@@ -4406,6 +4409,7 @@ static int be_resume(struct pci_dev *pde
 		rtnl_unlock();
 	}
 
+	pci_bus_add_device_vfs(adapter->pdev);
 	schedule_delayed_work(&adapter->func_recovery_work,
 			      msecs_to_jiffies(1000));
 	netif_device_attach(netdev);
Index: linux-2.6/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
+++ linux-2.6/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
@@ -568,8 +568,11 @@ int qlcnic_pci_sriov_configure(struct pc
 
 	if (num_vfs == 0)
 		err = qlcnic_pci_sriov_disable(adapter);
-	else
+	else {
 		err = qlcnic_pci_sriov_enable(adapter, num_vfs);
+		if (err > 0)
+			pci_bus_add_device_vfs(dev);
+	}
 
 	clear_bit(__QLCNIC_RESETTING, &adapter->state);
 	return err;
Index: linux-2.6/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
+++ linux-2.6/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
@@ -3048,7 +3048,12 @@ int bnx2x_sriov_configure(struct pci_dev
 		pci_disable_sriov(dev);
 		return 0;
 	} else {
-		return bnx2x_enable_sriov(bp);
+		int ret = bnx2x_enable_sriov(bp);
+
+		if (ret > 0)
+			pci_bus_add_device_vfs(dev);
+
+		return 0;
 	}
 }
Index: linux-2.6/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ linux-2.6/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -5749,10 +5749,12 @@ static int init_one(struct pci_dev *pdev
 sriov:
 #ifdef CONFIG_PCI_IOV
 	if (func < ARRAY_SIZE(num_vf) && num_vf[func] > 0)
-		if (pci_enable_sriov(pdev, num_vf[func]) == 0)
+		if (pci_enable_sriov(pdev, num_vf[func]) == 0) {
 			dev_info(&pdev->dev,
 				 "instantiated %u virtual functions\n",
 				 num_vf[func]);
+			pci_bus_add_device_vfs(pdev);
+		}
 #endif
 
 	return 0;
Index: linux-2.6/drivers/net/ethernet/cisco/enic/enic_main.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/cisco/enic/enic_main.c
+++ linux-2.6/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -2524,6 +2524,8 @@ static int enic_probe(struct pci_dev *pd
 		goto err_out_dev_deinit;
 	}
 
+	pci_bus_add_device_vfs(pdev);
+
 	return 0;
 
 err_out_dev_deinit:
Index: linux-2.6/drivers/net/ethernet/neterion/vxge/vxge-main.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/neterion/vxge/vxge-main.c
+++ linux-2.6/drivers/net/ethernet/neterion/vxge/vxge-main.c
@@ -4731,6 +4731,9 @@ vxge_probe(struct pci_dev *pdev, const s
 		vxge_hw_device_trace_level_get(hldev));
 
 	kfree(ll_config);
+
+	pci_bus_add_device_vfs(pdev);
+
 	return 0;
 
 _exit6:
Index: linux-2.6/drivers/net/ethernet/sfc/efx.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/sfc/efx.c
+++ linux-2.6/drivers/net/ethernet/sfc/efx.c
@@ -2822,6 +2822,7 @@ static int efx_pci_probe(struct pci_dev
 		netif_warn(efx, probe, efx->net_dev,
 			   "pci_enable_pcie_error_reporting failed (%d)\n", rc);
 
+	pci_bus_add_device_vfs(pci_dev);
 	return 0;
 
 fail4:
Found that the kernel tries to load the mlx4 driver for VFs before the
PF's driver is fully loaded, when the drivers are built in and the
kernel command line includes probe_vfs=63, num_vfs=63.

It turns out this also happens on the hotadd path, even when the drivers
are compiled as modules and are loaded, especially when a VF shares the
same driver with the PF.

calling path:
	device driver probe
		==> pci_enable_sriov
			==> virtfn_add
				==> pci_dev_add
				==> pci_bus_device_add
When pci_bus_device_add is called, the VF's driver will be attached, and
at that time the PF's driver has not finished yet.

Need to move pci_bus_device_add out of virtfn_add and call it later. Fix
the problem for two paths:
1. hotadd path: use device_schedule_callback.
2. booting path: use an initcall to call that for all VFs.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: netdev@vger.kernel.org

---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c    |    7 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c      |    4 -
 drivers/net/ethernet/cisco/enic/enic_main.c          |    2
 drivers/net/ethernet/emulex/benet/be_main.c          |    4 +
 drivers/net/ethernet/intel/igb/igb_main.c            |   11 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c        |    2
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c       |    9 +-
 drivers/net/ethernet/mellanox/mlx4/main.c            |    2
 drivers/net/ethernet/neterion/vxge/vxge-main.c       |    3
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c |    5 +
 drivers/net/ethernet/sfc/efx.c                       |    1
 drivers/pci/iov.c                                    |   73 +++++++++++++++++--
 drivers/scsi/lpfc/lpfc_init.c                        |    2
 include/linux/pci.h                                  |    4 +
 14 files changed, 115 insertions(+), 14 deletions(-)