diff mbox series

driver/pci: reduce the single block time in pci_read_config

Message ID 20200824052025.48362-1-benbjiang@tencent.com (mailing list archive)
State Superseded, archived
Delegated to: Bjorn Helgaas
Headers show
Series driver/pci: reduce the single block time in pci_read_config | expand

Commit Message

Jiang Biao Aug. 24, 2020, 5:20 a.m. UTC
From: Jiang Biao <benbjiang@tencent.com>

pci_read_config() could block several ms in kernel space, mainly
caused by the while loop to call pci_user_read_config_dword().
Singel pci_user_read_config_dword() loop could consume 130us+,
              |    pci_user_read_config_dword() {
              |      _raw_spin_lock_irq() {
! 136.698 us  |        native_queued_spin_lock_slowpath();
! 137.582 us  |      }
              |      pci_read() {
              |        raw_pci_read() {
              |          pci_conf1_read() {
  0.230 us    |            _raw_spin_lock_irqsave();
  0.035 us    |            _raw_spin_unlock_irqrestore();
  8.476 us    |          }
  8.790 us    |        }
  9.091 us    |      }
! 147.263 us  |    }
and dozens of the loop could consume ms+.

If we execute some lspci commands concurrently, ms+ scheduling
latency could be detected.

Add scheduling chance in the loop to improve the latency.

Reported-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jiang Biao <benbjiang@tencent.com>
---
 drivers/pci/pci-sysfs.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Jiang Biao Aug. 27, 2020, 9:49 a.m. UTC | #1
kindly ping :)

On Mon, 24 Aug 2020 at 13:20, Jiang Biao <benbjiang@gmail.com> wrote:
>
> From: Jiang Biao <benbjiang@tencent.com>
>
> pci_read_config() could block several ms in kernel space, mainly
> caused by the while loop to call pci_user_read_config_dword().
> Singel pci_user_read_config_dword() loop could consume 130us+,
>               |    pci_user_read_config_dword() {
>               |      _raw_spin_lock_irq() {
> ! 136.698 us  |        native_queued_spin_lock_slowpath();
> ! 137.582 us  |      }
>               |      pci_read() {
>               |        raw_pci_read() {
>               |          pci_conf1_read() {
>   0.230 us    |            _raw_spin_lock_irqsave();
>   0.035 us    |            _raw_spin_unlock_irqrestore();
>   8.476 us    |          }
>   8.790 us    |        }
>   9.091 us    |      }
> ! 147.263 us  |    }
> and dozens of the loop could consume ms+.
>
> If we execute some lspci commands concurrently, ms+ scheduling
> latency could be detected.
>
> Add scheduling chance in the loop to improve the latency.
>
> Reported-by: Bin Lai <robinlai@tencent.com>
> Signed-off-by: Jiang Biao <benbjiang@tencent.com>
> ---
>  drivers/pci/pci-sysfs.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 6d78df9..3b9f63d 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -708,6 +708,7 @@ static ssize_t pci_read_config(struct file *filp, struct kobject *kobj,
>                 data[off - init_off + 3] = (val >> 24) & 0xff;
>                 off += 4;
>                 size -= 4;
> +               cond_resched();
>         }
>
>         if (size >= 2) {
> --
> 1.8.3.1
>
Bjorn Helgaas Sept. 10, 2020, 1:25 a.m. UTC | #2
On Mon, Aug 24, 2020 at 01:20:25PM +0800, Jiang Biao wrote:
> From: Jiang Biao <benbjiang@tencent.com>
> 
> pci_read_config() could block several ms in kernel space, mainly
> caused by the while loop to call pci_user_read_config_dword().
> Singel pci_user_read_config_dword() loop could consume 130us+,
>               |    pci_user_read_config_dword() {
>               |      _raw_spin_lock_irq() {
> ! 136.698 us  |        native_queued_spin_lock_slowpath();
> ! 137.582 us  |      }
>               |      pci_read() {
>               |        raw_pci_read() {
>               |          pci_conf1_read() {
>   0.230 us    |            _raw_spin_lock_irqsave();
>   0.035 us    |            _raw_spin_unlock_irqrestore();
>   8.476 us    |          }
>   8.790 us    |        }
>   9.091 us    |      }
> ! 147.263 us  |    }
> and dozens of the loop could consume ms+.
> 
> If we execute some lspci commands concurrently, ms+ scheduling
> latency could be detected.
> 
> Add scheduling chance in the loop to improve the latency.

Thanks for the patch, this makes a lot of sense.

Shouldn't we do the same in pci_write_config()?

> Reported-by: Bin Lai <robinlai@tencent.com>
> Signed-off-by: Jiang Biao <benbjiang@tencent.com>
> ---
>  drivers/pci/pci-sysfs.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 6d78df9..3b9f63d 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -708,6 +708,7 @@ static ssize_t pci_read_config(struct file *filp, struct kobject *kobj,
>  		data[off - init_off + 3] = (val >> 24) & 0xff;
>  		off += 4;
>  		size -= 4;
> +		cond_resched();
>  	}
>  
>  	if (size >= 2) {
> -- 
> 1.8.3.1
>
Jiang Biao Sept. 10, 2020, 1:54 a.m. UTC | #3
Hi,

On Thu, 10 Sep 2020 at 09:25, Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Mon, Aug 24, 2020 at 01:20:25PM +0800, Jiang Biao wrote:
> > From: Jiang Biao <benbjiang@tencent.com>
> >
> > pci_read_config() could block several ms in kernel space, mainly
> > caused by the while loop to call pci_user_read_config_dword().
> > Singel pci_user_read_config_dword() loop could consume 130us+,
> >               |    pci_user_read_config_dword() {
> >               |      _raw_spin_lock_irq() {
> > ! 136.698 us  |        native_queued_spin_lock_slowpath();
> > ! 137.582 us  |      }
> >               |      pci_read() {
> >               |        raw_pci_read() {
> >               |          pci_conf1_read() {
> >   0.230 us    |            _raw_spin_lock_irqsave();
> >   0.035 us    |            _raw_spin_unlock_irqrestore();
> >   8.476 us    |          }
> >   8.790 us    |        }
> >   9.091 us    |      }
> > ! 147.263 us  |    }
> > and dozens of the loop could consume ms+.
> >
> > If we execute some lspci commands concurrently, ms+ scheduling
> > latency could be detected.
> >
> > Add scheduling chance in the loop to improve the latency.
>
> Thanks for the patch, this makes a lot of sense.
>
> Shouldn't we do the same in pci_write_config()?
Yes, IMHO, that could be helpful too.
I'll send v2 to include that. :)
Thanks a lot for your comment.

Regards,
Jiang
Jiang Biao Sept. 13, 2020, 4:27 a.m. UTC | #4
Hi, Bjorn

On Thu, 10 Sep 2020 at 09:59, Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Sep 10, 2020 at 09:54:02AM +0800, Jiang Biao wrote:
> > Hi,
> >
> > On Thu, 10 Sep 2020 at 09:25, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > On Mon, Aug 24, 2020 at 01:20:25PM +0800, Jiang Biao wrote:
> > > > From: Jiang Biao <benbjiang@tencent.com>
> > > >
> > > > pci_read_config() could block several ms in kernel space, mainly
> > > > caused by the while loop to call pci_user_read_config_dword().
> > > > Singel pci_user_read_config_dword() loop could consume 130us+,
> > > >               |    pci_user_read_config_dword() {
> > > >               |      _raw_spin_lock_irq() {
> > > > ! 136.698 us  |        native_queued_spin_lock_slowpath();
> > > > ! 137.582 us  |      }
> > > >               |      pci_read() {
> > > >               |        raw_pci_read() {
> > > >               |          pci_conf1_read() {
> > > >   0.230 us    |            _raw_spin_lock_irqsave();
> > > >   0.035 us    |            _raw_spin_unlock_irqrestore();
> > > >   8.476 us    |          }
> > > >   8.790 us    |        }
> > > >   9.091 us    |      }
> > > > ! 147.263 us  |    }
> > > > and dozens of the loop could consume ms+.
> > > >
> > > > If we execute some lspci commands concurrently, ms+ scheduling
> > > > latency could be detected.
> > > >
> > > > Add scheduling chance in the loop to improve the latency.
> > >
> > > Thanks for the patch, this makes a lot of sense.
> > >
> > > Shouldn't we do the same in pci_write_config()?
> > Yes, IMHO, that could be helpful too.
>
> If it's feasible, it would be nice to actually verify that it makes a
> difference.  I know config writes should be faster than reads, but
> they're certainly not as fast as a CPU can pump out data, so there
> must be *some* mechanism that slows the CPU down.
>
> Bjorn
We failed to build a test case to produce the latency by setpci command,
AFAIU, setpci could be much less frequently realistically used than lspci.
So, the latency from pci_write_config() path could not be verified for now,
could we apply this patch alone to erase the verified latency introduced
by pci_read_config() path? :)

Thanks a lot.
Regards,
Jiang
Bjorn Helgaas Sept. 16, 2020, 4:56 p.m. UTC | #5
On Sun, Sep 13, 2020 at 12:27:09PM +0800, Jiang Biao wrote:
> On Thu, 10 Sep 2020 at 09:59, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, Sep 10, 2020 at 09:54:02AM +0800, Jiang Biao wrote:
> > > On Thu, 10 Sep 2020 at 09:25, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Mon, Aug 24, 2020 at 01:20:25PM +0800, Jiang Biao wrote:
> > > > > From: Jiang Biao <benbjiang@tencent.com>
> > > > >
> > > > > pci_read_config() could block several ms in kernel space, mainly
> > > > > caused by the while loop to call pci_user_read_config_dword().
> > > > > Singel pci_user_read_config_dword() loop could consume 130us+,
> > > > >               |    pci_user_read_config_dword() {
> > > > >               |      _raw_spin_lock_irq() {
> > > > > ! 136.698 us  |        native_queued_spin_lock_slowpath();
> > > > > ! 137.582 us  |      }
> > > > >               |      pci_read() {
> > > > >               |        raw_pci_read() {
> > > > >               |          pci_conf1_read() {
> > > > >   0.230 us    |            _raw_spin_lock_irqsave();
> > > > >   0.035 us    |            _raw_spin_unlock_irqrestore();
> > > > >   8.476 us    |          }
> > > > >   8.790 us    |        }
> > > > >   9.091 us    |      }
> > > > > ! 147.263 us  |    }
> > > > > and dozens of the loop could consume ms+.
> > > > >
> > > > > If we execute some lspci commands concurrently, ms+ scheduling
> > > > > latency could be detected.
> > > > >
> > > > > Add scheduling chance in the loop to improve the latency.
> > > >
> > > > Thanks for the patch, this makes a lot of sense.
> > > >
> > > > Shouldn't we do the same in pci_write_config()?
> > >
> > > Yes, IMHO, that could be helpful too.
> >
> > If it's feasible, it would be nice to actually verify that it makes a
> > difference.  I know config writes should be faster than reads, but
> > they're certainly not as fast as a CPU can pump out data, so there
> > must be *some* mechanism that slows the CPU down.
> >
> We failed to build a test case to produce the latency by setpci command,
> AFAIU, setpci could be much less frequently realistically used than lspci.
> So, the latency from pci_write_config() path could not be verified for now,
> could we apply this patch alone to erase the verified latency introduced
> by pci_read_config() path? :)

Thanks for trying!  I'll apply the patch as-is.  I'd like to include a
note in the commit log about the user-visible effect of this.  I
looked through recent similar commits:

  928da37a229f ("RDMA/umem: Add a schedule point in ib_umem_get()")
  47aaabdedf36 ("fanotify: Avoid softlockups when reading many events")
  9f47eb5461aa ("fs/btrfs: Add cond_resched() for try_release_extent_mapping() stalls")
  0a3b3c253a1e ("mm/mmap.c: Add cond_resched() for exit_mmap() CPU stalls")
  b7e3debdd040 ("mm/memory_hotplug.c: fix false softlockup during pfn range removal")
  d35bd764e689 ("dm writecache: add cond_resched to loop in persistent_memory_claim()")
  da97f2d56bbd ("mm: call cond_resched() from deferred_init_memmap()")
  ab8b65be1831 ("KVM: PPC: Book3S: Fix some RCU-list locks")
  48c963e31bc6 ("KVM: arm/arm64: Release kvm->mmu_lock in loop to prevent starvation")
  e84fe99b68ce ("mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()")
  4005f5c3c9d0 ("wireguard: send/receive: cond_resched() when processing worker ringbuffers")
  3fd44c86711f ("io_uring: use cond_resched() in io_ring_ctx_wait_and_kill()")
  7979457b1d3a ("net: bridge: vlan: Add a schedule point during VLAN processing")
  2ed6edd33a21 ("perf: Add cond_resched() to task_function_call()")
  1edaa447d958 ("dm writecache: add cond_resched to avoid CPU hangs")
  ce9a4186f9ac ("macvlan: add cond_resched() during multicast processing")
  7be1b9b8e9d1 ("drm/mm: Break long searches in fragmented address spaces")
  bb699a793110 ("drm/i915/gem: Break up long lists of object reclaim")
  9424ef56e13a ("ext4: add cond_resched() to __ext4_find_entry()")

and many of them mention softlockups, RCU CPU stall warnings, or
watchdogs triggering.  Are you seeing one of those, or are you
measuring latency some other way?

Bjorn
Jiang Biao Sept. 17, 2020, 12:22 a.m. UTC | #6
Hi,

On Thu, 17 Sep 2020 at 00:56, Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Sun, Sep 13, 2020 at 12:27:09PM +0800, Jiang Biao wrote:
> > On Thu, 10 Sep 2020 at 09:59, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Thu, Sep 10, 2020 at 09:54:02AM +0800, Jiang Biao wrote:
> > > > On Thu, 10 Sep 2020 at 09:25, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Mon, Aug 24, 2020 at 01:20:25PM +0800, Jiang Biao wrote:
> > > > > > From: Jiang Biao <benbjiang@tencent.com>
> > > > > >
> > > > > > pci_read_config() could block several ms in kernel space, mainly
> > > > > > caused by the while loop to call pci_user_read_config_dword().
> > > > > > Singel pci_user_read_config_dword() loop could consume 130us+,
> > > > > >               |    pci_user_read_config_dword() {
> > > > > >               |      _raw_spin_lock_irq() {
> > > > > > ! 136.698 us  |        native_queued_spin_lock_slowpath();
> > > > > > ! 137.582 us  |      }
> > > > > >               |      pci_read() {
> > > > > >               |        raw_pci_read() {
> > > > > >               |          pci_conf1_read() {
> > > > > >   0.230 us    |            _raw_spin_lock_irqsave();
> > > > > >   0.035 us    |            _raw_spin_unlock_irqrestore();
> > > > > >   8.476 us    |          }
> > > > > >   8.790 us    |        }
> > > > > >   9.091 us    |      }
> > > > > > ! 147.263 us  |    }
> > > > > > and dozens of the loop could consume ms+.
> > > > > >
> > > > > > If we execute some lspci commands concurrently, ms+ scheduling
> > > > > > latency could be detected.
> > > > > >
> > > > > > Add scheduling chance in the loop to improve the latency.
> > > > >
> > > > > Thanks for the patch, this makes a lot of sense.
> > > > >
> > > > > Shouldn't we do the same in pci_write_config()?
> > > >
> > > > Yes, IMHO, that could be helpful too.
> > >
> > > If it's feasible, it would be nice to actually verify that it makes a
> > > difference.  I know config writes should be faster than reads, but
> > > they're certainly not as fast as a CPU can pump out data, so there
> > > must be *some* mechanism that slows the CPU down.
> > >
> > We failed to build a test case to produce the latency by setpci command,
> > AFAIU, setpci could be much less frequently realistically used than lspci.
> > So, the latency from pci_write_config() path could not be verified for now,
> > could we apply this patch alone to erase the verified latency introduced
> > by pci_read_config() path? :)
>
> Thanks for trying!  I'll apply the patch as-is.  I'd like to include a
Thanks. :)

> note in the commit log about the user-visible effect of this.  I
> looked through recent similar commits:
>
>   928da37a229f ("RDMA/umem: Add a schedule point in ib_umem_get()")
>   47aaabdedf36 ("fanotify: Avoid softlockups when reading many events")
>   9f47eb5461aa ("fs/btrfs: Add cond_resched() for try_release_extent_mapping() stalls")
>   0a3b3c253a1e ("mm/mmap.c: Add cond_resched() for exit_mmap() CPU stalls")
>   b7e3debdd040 ("mm/memory_hotplug.c: fix false softlockup during pfn range removal")
>   d35bd764e689 ("dm writecache: add cond_resched to loop in persistent_memory_claim()")
>   da97f2d56bbd ("mm: call cond_resched() from deferred_init_memmap()")
>   ab8b65be1831 ("KVM: PPC: Book3S: Fix some RCU-list locks")
>   48c963e31bc6 ("KVM: arm/arm64: Release kvm->mmu_lock in loop to prevent starvation")
>   e84fe99b68ce ("mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()")
>   4005f5c3c9d0 ("wireguard: send/receive: cond_resched() when processing worker ringbuffers")
>   3fd44c86711f ("io_uring: use cond_resched() in io_ring_ctx_wait_and_kill()")
>   7979457b1d3a ("net: bridge: vlan: Add a schedule point during VLAN processing")
>   2ed6edd33a21 ("perf: Add cond_resched() to task_function_call()")
>   1edaa447d958 ("dm writecache: add cond_resched to avoid CPU hangs")
>   ce9a4186f9ac ("macvlan: add cond_resched() during multicast processing")
>   7be1b9b8e9d1 ("drm/mm: Break long searches in fragmented address spaces")
>   bb699a793110 ("drm/i915/gem: Break up long lists of object reclaim")
>   9424ef56e13a ("ext4: add cond_resched() to __ext4_find_entry()")
>
> and many of them mention softlockups, RCU CPU stall warnings, or
> watchdogs triggering.  Are you seeing one of those, or are you
No softlockup or RCU stall warnings.

> measuring latency some other way?
We test the scheduling latency by cyclictest benchmark, the max
latency could be more than 5ms without this patch. The ftrace log
shows pci_read_config is the main cause, and the 5ms+ latency
disappeared with this patch applied.

Thanks a lot.
Regards,
Jiang
Bjorn Helgaas Sept. 17, 2020, 5:32 p.m. UTC | #7
On Mon, Aug 24, 2020 at 01:20:25PM +0800, Jiang Biao wrote:
> From: Jiang Biao <benbjiang@tencent.com>
> 
> pci_read_config() could block several ms in kernel space, mainly
> caused by the while loop to call pci_user_read_config_dword().
> Singel pci_user_read_config_dword() loop could consume 130us+,
>               |    pci_user_read_config_dword() {
>               |      _raw_spin_lock_irq() {
> ! 136.698 us  |        native_queued_spin_lock_slowpath();
> ! 137.582 us  |      }
>               |      pci_read() {
>               |        raw_pci_read() {
>               |          pci_conf1_read() {
>   0.230 us    |            _raw_spin_lock_irqsave();
>   0.035 us    |            _raw_spin_unlock_irqrestore();
>   8.476 us    |          }
>   8.790 us    |        }
>   9.091 us    |      }
> ! 147.263 us  |    }
> and dozens of the loop could consume ms+.
> 
> If we execute some lspci commands concurrently, ms+ scheduling
> latency could be detected.
> 
> Add scheduling chance in the loop to improve the latency.
> 
> Reported-by: Bin Lai <robinlai@tencent.com>
> Signed-off-by: Jiang Biao <benbjiang@tencent.com>

Applied to pci/enumeration for v5.10, thanks!

> ---
>  drivers/pci/pci-sysfs.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 6d78df9..3b9f63d 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -708,6 +708,7 @@ static ssize_t pci_read_config(struct file *filp, struct kobject *kobj,
>  		data[off - init_off + 3] = (val >> 24) & 0xff;
>  		off += 4;
>  		size -= 4;
> +		cond_resched();
>  	}
>  
>  	if (size >= 2) {
> -- 
> 1.8.3.1
>
diff mbox series

Patch

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 6d78df9..3b9f63d 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -708,6 +708,7 @@  static ssize_t pci_read_config(struct file *filp, struct kobject *kobj,
 		data[off - init_off + 3] = (val >> 24) & 0xff;
 		off += 4;
 		size -= 4;
+		cond_resched();
 	}
 
 	if (size >= 2) {