diff mbox

[v1,05/47] pci: add pci_iomap_wc() variants

Message ID 1426893517-2511-6-git-send-email-mcgrof@do-not-panic.com (mailing list archive)
State New, archived
Headers show

Commit Message

Luis R. Rodriguez March 20, 2015, 11:17 p.m. UTC
From: "Luis R. Rodriguez" <mcgrof@suse.com>

This allows drivers to take advantage of write-combining
when possible. Ideally we'd have pci_read_bases() just
peg an IORESOURCE_WC flag for us but where exactly
video devices memory lie varies *largely* and at times things
are mixed with MMIO registers, sometimes we can address
the changes in drivers, other times the change requires
intrusive changes.

Although there is also arch_phys_wc_add() that makes use of
architecture specific write-combinging alternatives (MTRR on
x86 when a system does not have PAT) we void polluting
pci_iomap() space with it and force drivers and subsystems
that want to use it to be explicit.

There are a few motivations for this:

a) Take advantage of PAT when available

b) Help bury MTRR code away, MTRR is architecture specific and on
   x86 its replaced by PAT

c) Help with the goal of eventually using _PAGE_CACHE_UC over
   _PAGE_CACHE_UC_MINUS on x86 on ioremap_nocache() (de33c442e)

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Antonino Daplas <adaplas@gmail.com>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: venkatesh.pallipadi@intel.com
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: konrad.wilk@oracle.com
Cc: ville.syrjala@linux.intel.com
Cc: david.vrabel@citrix.com
Cc: jbeulich@suse.com
Cc: toshi.kani@hp.com
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: linux-fbdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: xen-devel@lists.xensource.com
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
---
 include/asm-generic/pci_iomap.h | 14 ++++++++++
 lib/pci_iomap.c                 | 61 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 75 insertions(+)

Comments

Bjorn Helgaas March 23, 2015, 5:20 p.m. UTC | #1
Hi Luis,

This seems OK to me, but I'm curious about a few things.

On Fri, Mar 20, 2015 at 6:17 PM, Luis R. Rodriguez
<mcgrof@do-not-panic.com> wrote:
> From: "Luis R. Rodriguez" <mcgrof@suse.com>
>
> This allows drivers to take advantage of write-combining
> when possible. Ideally we'd have pci_read_bases() just
> peg an IORESOURCE_WC flag for us

We do set IORESOURCE_PREFETCH.  Do you mean something different?

>  but where exactly
> video devices memory lie varies *largely* and at times things
> are mixed with MMIO registers, sometimes we can address
> the changes in drivers, other times the change requires
> intrusive changes.

What does a video device address have to do with this?  I do see that
if a BAR maps only a frame buffer, the device might be able to mark it
prefetchable, while if the BAR mapped both a frame buffer and some
registers, it might not be able to make it prefetchable.  But that
doesn't seem like it depends on the *address*.

pci_iomap_range() already makes a cacheable mapping if
IORESOURCE_CACHEABLE; I'm guessing that you would like it to
automatically use WC if the BAR if IORESOURCE_PREFETCH, e.g.,

  if (flags & IORESOURCE_CACHEABLE)
    return ioremap(start, len);
  if (flags & IORESOURCE_PREFETCH)
    return ioremap_wc(start, len);
  return ioremap_nocache(start, len);

Is there a reason not to do that?

> Although there is also arch_phys_wc_add() that makes use of
> architecture specific write-combinging alternatives (MTRR on
> x86 when a system does not have PAT) we void polluting
> pci_iomap() space with it and force drivers and subsystems
> that want to use it to be explicit.
>
> There are a few motivations for this:
>
> a) Take advantage of PAT when available
>
> b) Help bury MTRR code away, MTRR is architecture specific and on
>    x86 its replaced by PAT
>
> c) Help with the goal of eventually using _PAGE_CACHE_UC over
>    _PAGE_CACHE_UC_MINUS on x86 on ioremap_nocache() (de33c442e)
> ...

> +void __iomem *pci_iomap_wc_range(struct pci_dev *dev,
> +                                int bar,
> +                                unsigned long offset,
> +                                unsigned long maxlen)
> +{
> +       resource_size_t start = pci_resource_start(dev, bar);
> +       resource_size_t len = pci_resource_len(dev, bar);
> +       unsigned long flags = pci_resource_flags(dev, bar);
> +
> +       if (len <= offset || !start)
> +               return NULL;
> +       len -= offset;
> +       start += offset;
> +       if (maxlen && len > maxlen)
> +               len = maxlen;
> +       if (flags & IORESOURCE_IO)
> +               return __pci_ioport_map(dev, start, len);
> +       if (flags & IORESOURCE_MEM)

Should we log a note in dmesg if the BAR is *not* IORESOURCE_PREFETCH?
 I know the driver might know it's safe even if the device didn't mark
the BAR as prefetchable, but it does seem like an easy way for a
driver to shoot itself in the foot.

> +               return ioremap_wc(start, len);
> +       /* What? */
> +       return NULL;
> +}
--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Konrad Rzeszutek Wilk March 25, 2015, 8:07 p.m. UTC | #2
On Fri, Mar 20, 2015 at 04:17:55PM -0700, Luis R. Rodriguez wrote:
> From: "Luis R. Rodriguez" <mcgrof@suse.com>
> 
> This allows drivers to take advantage of write-combining
> when possible. Ideally we'd have pci_read_bases() just
> peg an IORESOURCE_WC flag for us but where exactly
> video devices memory lie varies *largely* and at times things
> are mixed with MMIO registers, sometimes we can address
> the changes in drivers, other times the change requires
> intrusive changes.
> 
> Although there is also arch_phys_wc_add() that makes use of
> architecture specific write-combinging alternatives (MTRR on

combinging?

> x86 when a system does not have PAT) we void polluting
> pci_iomap() space with it and force drivers and subsystems
> that want to use it to be explicit.
> 
> There are a few motivations for this:
> 
> a) Take advantage of PAT when available
> 
> b) Help bury MTRR code away, MTRR is architecture specific and on
>    x86 its replaced by PAT
> 
> c) Help with the goal of eventually using _PAGE_CACHE_UC over
>    _PAGE_CACHE_UC_MINUS on x86 on ioremap_nocache() (de33c442e)
> 
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Suresh Siddha <suresh.b.siddha@intel.com>
> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Antonino Daplas <adaplas@gmail.com>
> Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
> Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: venkatesh.pallipadi@intel.com
> Cc: Stefan Bader <stefan.bader@canonical.com>
> Cc: konrad.wilk@oracle.com
> Cc: ville.syrjala@linux.intel.com
> Cc: david.vrabel@citrix.com
> Cc: jbeulich@suse.com
> Cc: toshi.kani@hp.com
> Cc: Roger Pau Monné <roger.pau@citrix.com>
> Cc: linux-fbdev@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: xen-devel@lists.xensource.com
> Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
> ---
>  include/asm-generic/pci_iomap.h | 14 ++++++++++
>  lib/pci_iomap.c                 | 61 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 75 insertions(+)
> 
> diff --git a/include/asm-generic/pci_iomap.h b/include/asm-generic/pci_iomap.h
> index 7389c87..b1e17fc 100644
> --- a/include/asm-generic/pci_iomap.h
> +++ b/include/asm-generic/pci_iomap.h
> @@ -15,9 +15,13 @@ struct pci_dev;
>  #ifdef CONFIG_PCI
>  /* Create a virtual mapping cookie for a PCI BAR (memory or IO) */
>  extern void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long max);
> +extern void __iomem *pci_iomap_wc(struct pci_dev *dev, int bar, unsigned long max);
>  extern void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
>  				     unsigned long offset,
>  				     unsigned long maxlen);
> +extern void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
> +					unsigned long offset,
> +					unsigned long maxlen);
>  /* Create a virtual mapping cookie for a port on a given PCI device.
>   * Do not call this directly, it exists to make it easier for architectures
>   * to override */
> @@ -34,12 +38,22 @@ static inline void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned lon
>  	return NULL;
>  }
>  
> +static inline void __iomem *pci_iomap_wc(struct pci_dev *dev, int bar, unsigned long max)
> +{
> +	return NULL;
> +}
>  static inline void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
>  					    unsigned long offset,
>  					    unsigned long maxlen)
>  {
>  	return NULL;
>  }
> +static inline void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
> +					       unsigned long offset,
> +					       unsigned long maxlen)
> +{
> +	return NULL;
> +}
>  #endif
>  
>  #endif /* __ASM_GENERIC_IO_H */
> diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c
> index bcce5f1..30b65ae 100644
> --- a/lib/pci_iomap.c
> +++ b/lib/pci_iomap.c
> @@ -52,6 +52,46 @@ void __iomem *pci_iomap_range(struct pci_dev *dev,
>  EXPORT_SYMBOL(pci_iomap_range);
>  
>  /**
> + * pci_iomap_wc_range - create a virtual WC mapping cookie for a PCI BAR
> + * @dev: PCI device that owns the BAR
> + * @bar: BAR number
> + * @offset: map memory at the given offset in BAR
> + * @maxlen: max length of the memory to map
> + *
> + * Using this function you will get a __iomem address to your device BAR.
> + * You can access it using ioread*() and iowrite*(). These functions hide
> + * the details if this is a MMIO or PIO address space and will just do what
> + * you expect from them in the correct way. When possible write combining
> + * is used.
> + *
> + * @maxlen specifies the maximum length to map. If you want to get access to
> + * the complete BAR from offset to the end, pass %0 here.

s/%0/0 ? Or is that some special syntax?

> + * */
> +void __iomem *pci_iomap_wc_range(struct pci_dev *dev,
> +				 int bar,
> +				 unsigned long offset,
> +				 unsigned long maxlen)
> +{
> +	resource_size_t start = pci_resource_start(dev, bar);
> +	resource_size_t len = pci_resource_len(dev, bar);
> +	unsigned long flags = pci_resource_flags(dev, bar);
> +
> +	if (len <= offset || !start)
> +		return NULL;
> +	len -= offset;
> +	start += offset;
> +	if (maxlen && len > maxlen)
> +		len = maxlen;
> +	if (flags & IORESOURCE_IO)
> +		return __pci_ioport_map(dev, start, len);
> +	if (flags & IORESOURCE_MEM)
> +		return ioremap_wc(start, len);
> +	/* What? */
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(pci_iomap_wc_range);
> +
> +/**
>   * pci_iomap - create a virtual mapping cookie for a PCI BAR
>   * @dev: PCI device that owns the BAR
>   * @bar: BAR number
> @@ -70,4 +110,25 @@ void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long maxlen)
>  	return pci_iomap_range(dev, bar, 0, maxlen);
>  }
>  EXPORT_SYMBOL(pci_iomap);
> +
> +/**
> + * pci_iomap_wc - create a virtual WC mapping cookie for a PCI BAR
> + * @dev: PCI device that owns the BAR
> + * @bar: BAR number
> + * @maxlen: length of the memory to map
> + *
> + * Using this function you will get a __iomem address to your device BAR.
> + * You can access it using ioread*() and iowrite*(). These functions hide
> + * the details if this is a MMIO or PIO address space and will just do what
> + * you expect from them in the correct way. When possible write combining
> + * is used.
> + *
> + * @maxlen specifies the maximum length to map. If you want to get access to
> + * the complete BAR without checking for its length first, pass %0 here.
> + * */
> +void __iomem *pci_iomap_wc(struct pci_dev *dev, int bar, unsigned long maxlen)
> +{
> +	return pci_iomap_wc_range(dev, bar, 0, maxlen);
> +}
> +EXPORT_SYMBOL_GPL(pci_iomap_wc);
>  #endif /* CONFIG_PCI */
> -- 
> 2.3.2.209.gd67f9d5.dirty
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Luis Chamberlain March 26, 2015, 3 a.m. UTC | #3
On Mon, Mar 23, 2015 at 12:20:47PM -0500, Bjorn Helgaas wrote:
> Hi Luis,
> 
> This seems OK to me, 

Great.

> but I'm curious about a few things.
> 
> On Fri, Mar 20, 2015 at 6:17 PM, Luis R. Rodriguez
> <mcgrof@do-not-panic.com> wrote:
> > From: "Luis R. Rodriguez" <mcgrof@suse.com>
> >
> > This allows drivers to take advantage of write-combining
> > when possible. Ideally we'd have pci_read_bases() just
> > peg an IORESOURCE_WC flag for us
> 
> We do set IORESOURCE_PREFETCH.  Do you mean something different?

I did not think we had a WC IORESOURCE flag. Are you saying that we can use
IORESOURCE_PREFETCH for that purpose? If so then great.  As I read a PCI BAR
can have PCI_BASE_ADDRESS_MEM_PREFETCH and when that's the case we peg
IORESOURCE_PREFETCH. That seems to be what I want indeed. Questions below.

> >  but where exactly
> > video devices memory lie varies *largely* and at times things
> > are mixed with MMIO registers, sometimes we can address
> > the changes in drivers, other times the change requires
> > intrusive changes.
> 
> What does a video device address have to do with this?  I do see that
> if a BAR maps only a frame buffer, the device might be able to mark it
> prefetchable, while if the BAR mapped both a frame buffer and some
> registers, it might not be able to make it prefetchable.  But that
> doesn't seem like it depends on the *address*.

I meant the offsets for each of those, either registers or framebuffer,
and that typically they are mixed (primarily on older devices), so indeed your
summary of the problem is what I meant. Let's remember that we are trying to
take advantage of PAT here when available and avoid MTRR in that case, do we
know that the same PCI BARs that have always historically used MTRRs had
IORESOURCE_PREFETCH set, is that a fair assumption ? I realize they are
different things -- but its precisely why I ask.

> pci_iomap_range() already makes a cacheable mapping if
> IORESOURCE_CACHEABLE; I'm guessing that you would like it to
> automatically use WC if the BAR if IORESOURCE_PREFETCH, e.g.,
> 
>   if (flags & IORESOURCE_CACHEABLE)
>     return ioremap(start, len);
>   if (flags & IORESOURCE_PREFETCH)
>     return ioremap_wc(start, len);
>   return ioremap_nocache(start, len);

Indeed, that's exactly what I think we should strive towards.

> Is there a reason not to do that?

This depends on the exact defintion of IORESOURCE_PREFETCH and
PCI_BASE_ADDRESS_MEM_PREFETCH and how they are used all over and
accross *all devices*. This didn't look promising for starters:

include/uapi/linux/pci_regs.h:#define  PCI_BASE_ADDRESS_MEM_PREFETCH    0x08    /* prefetchable? */

PCI_BASE_ADDRESS_MEM_PREFETCH seems to be BAR specific, so a few questions:

1) Can we rest assured for instance that if we check for
PCI_BASE_ADDRESS_MEM_PREFETCH and if set that it will *only* be set on a full
PCI BAR if the full PCI BAR does want WC? If not this can regress
functionality. That seems risky. It however would not be risky if we used
another API that did look for IORESOURCE_PREFETCH and if so use ioremap_wc() --
that way only drivers we know that do use the full PCI bar would use this API.
There's a bit of a problem with this though:

2) Do we know that if a *full PCI BAR* is used for WC that
PCI_BASE_ADDRESS_MEM_PREFETCH *was* definitely set for the PCI BAR? If so then
the API usage would be restricted only to devices that we know *do* adhere to
this. That reduces the possible uses for older drivers and can create
regressions if used loosely without verification... but..

3) If from now on we get folks to commit to uset PCI_BASE_ADDRESS_MEM_PREFETCH
for full PCI BARs that do want WC perhaps newer devices / drivers will use
this very consistently ? Can we bank on that and is it worth it ?

4) If a PCI BAR *does not* have PCI_BASE_ADDRESS_MEM_PREFETCH do we know it
must not never want WC ?

If we don't have certainty on any of the above I'm afraid we can't do much
right now but perhaps we can push towards better use of PCI_BASE_ADDRESS_MEM_PREFETCH
and hope folks will only use this for the full PCI BAR only if WC is desired.

Thoughts?

> > Although there is also arch_phys_wc_add() that makes use of
> > architecture specific write-combinging alternatives (MTRR on
> > x86 when a system does not have PAT) we void polluting
> > pci_iomap() space with it and force drivers and subsystems
> > that want to use it to be explicit.
> >
> > There are a few motivations for this:
> >
> > a) Take advantage of PAT when available
> >
> > b) Help bury MTRR code away, MTRR is architecture specific and on
> >    x86 its replaced by PAT
> >
> > c) Help with the goal of eventually using _PAGE_CACHE_UC over
> >    _PAGE_CACHE_UC_MINUS on x86 on ioremap_nocache() (de33c442e)
> > ...
> 
> > +void __iomem *pci_iomap_wc_range(struct pci_dev *dev,
> > +                                int bar,
> > +                                unsigned long offset,
> > +                                unsigned long maxlen)
> > +{
> > +       resource_size_t start = pci_resource_start(dev, bar);
> > +       resource_size_t len = pci_resource_len(dev, bar);
> > +       unsigned long flags = pci_resource_flags(dev, bar);
> > +
> > +       if (len <= offset || !start)
> > +               return NULL;
> > +       len -= offset;
> > +       start += offset;
> > +       if (maxlen && len > maxlen)
> > +               len = maxlen;
> > +       if (flags & IORESOURCE_IO)
> > +               return __pci_ioport_map(dev, start, len);
> > +       if (flags & IORESOURCE_MEM)
> 
> Should we log a note in dmesg if the BAR is *not* IORESOURCE_PREFETCH?
>  I know the driver might know it's safe even if the device didn't mark
> the BAR as prefetchable, but it does seem like an easy way for a
> driver to shoot itself in the foot.

You tell me. I would fear this may not be consistent and we'd end up
having bug reports open for something that has historically been a
non-issue. The above questions can help us gauge the risk of this.

  Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Luis Chamberlain March 27, 2015, 6:40 p.m. UTC | #4
On Wed, Mar 25, 2015 at 04:07:43PM -0400, Konrad Rzeszutek Wilk wrote:
> On Fri, Mar 20, 2015 at 04:17:55PM -0700, Luis R. Rodriguez wrote:
> > From: "Luis R. Rodriguez" <mcgrof@suse.com>
> > 
> > This allows drivers to take advantage of write-combining
> > when possible. Ideally we'd have pci_read_bases() just
> > peg an IORESOURCE_WC flag for us but where exactly
> > video devices memory lie varies *largely* and at times things
> > are mixed with MMIO registers, sometimes we can address
> > the changes in drivers, other times the change requires
> > intrusive changes.
> > 
> > Although there is also arch_phys_wc_add() that makes use of
> > architecture specific write-combinging alternatives (MTRR on
> 
> combinging?

Amended.

> > diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c
> > index bcce5f1..30b65ae 100644
> > --- a/lib/pci_iomap.c
> > +++ b/lib/pci_iomap.c
> > @@ -52,6 +52,46 @@ void __iomem *pci_iomap_range(struct pci_dev *dev,
> >  EXPORT_SYMBOL(pci_iomap_range);
> >  
> >  /**
> > + * pci_iomap_wc_range - create a virtual WC mapping cookie for a PCI BAR
> > + * @dev: PCI device that owns the BAR
> > + * @bar: BAR number
> > + * @offset: map memory at the given offset in BAR
> > + * @maxlen: max length of the memory to map
> > + *
> > + * Using this function you will get a __iomem address to your device BAR.
> > + * You can access it using ioread*() and iowrite*(). These functions hide
> > + * the details if this is a MMIO or PIO address space and will just do what
> > + * you expect from them in the correct way. When possible write combining
> > + * is used.
> > + *
> > + * @maxlen specifies the maximum length to map. If you want to get access to
> > + * the complete BAR from offset to the end, pass %0 here.
> 
> s/%0/0 ? Or is that some special syntax?

This copies the syntax of pci_iomap_range() which also uses %0, and as per
Documentation/kernel-doc-nano-HOWTO.txt % is used for constants. See:

scripts/kernel-doc -man -function pci_iomap_range lib/pci_iomap.c | nroff -man | less

  Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Toshi Kani March 27, 2015, 7:18 p.m. UTC | #5
On Mon, 2015-03-23 at 12:20 -0500, Bjorn Helgaas wrote:
 :
> pci_iomap_range() already makes a cacheable mapping if
> IORESOURCE_CACHEABLE; I'm guessing that you would like it to
> automatically use WC if the BAR if IORESOURCE_PREFETCH, e.g.,
> 
>   if (flags & IORESOURCE_CACHEABLE)
>     return ioremap(start, len);

Is this supposed to be ioremap_cache()?  ioremap() is the same as
ioremap_nocache() at least on x86 per arch/x86/include/asm/io.h.

>   if (flags & IORESOURCE_PREFETCH)
>     return ioremap_wc(start, len);
>   return ioremap_nocache(start, len);
> 

-Toshi


--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Luis Chamberlain April 21, 2015, 5:52 p.m. UTC | #6
On Thu, Mar 26, 2015 at 04:00:54AM +0100, Luis R. Rodriguez wrote:
> On Mon, Mar 23, 2015 at 12:20:47PM -0500, Bjorn Helgaas wrote:
> > Hi Luis,
> > 
> > This seems OK to me, 
> 
> Great.
> 
> > but I'm curious about a few things.
> > 
> > On Fri, Mar 20, 2015 at 6:17 PM, Luis R. Rodriguez
> > <mcgrof@do-not-panic.com> wrote:
> > > From: "Luis R. Rodriguez" <mcgrof@suse.com>
> > >
> > > This allows drivers to take advantage of write-combining
> > > when possible. Ideally we'd have pci_read_bases() just
> > > peg an IORESOURCE_WC flag for us
> > 
> > We do set IORESOURCE_PREFETCH.  Do you mean something different?
> 
> I did not think we had a WC IORESOURCE flag. Are you saying that we can use
> IORESOURCE_PREFETCH for that purpose? If so then great.  As I read a PCI BAR
> can have PCI_BASE_ADDRESS_MEM_PREFETCH and when that's the case we peg
> IORESOURCE_PREFETCH. That seems to be what I want indeed. Questions below.
> 
> > >  but where exactly
> > > video devices memory lie varies *largely* and at times things
> > > are mixed with MMIO registers, sometimes we can address
> > > the changes in drivers, other times the change requires
> > > intrusive changes.
> > 
> > What does a video device address have to do with this?  I do see that
> > if a BAR maps only a frame buffer, the device might be able to mark it
> > prefetchable, while if the BAR mapped both a frame buffer and some
> > registers, it might not be able to make it prefetchable.  But that
> > doesn't seem like it depends on the *address*.
> 
> I meant the offsets for each of those, either registers or framebuffer,
> and that typically they are mixed (primarily on older devices), so indeed your
> summary of the problem is what I meant. Let's remember that we are trying to
> take advantage of PAT here when available and avoid MTRR in that case, do we
> know that the same PCI BARs that have always historically used MTRRs had
> IORESOURCE_PREFETCH set, is that a fair assumption ? I realize they are
> different things -- but its precisely why I ask.
> 
> > pci_iomap_range() already makes a cacheable mapping if
> > IORESOURCE_CACHEABLE; I'm guessing that you would like it to
> > automatically use WC if the BAR if IORESOURCE_PREFETCH, e.g.,
> > 
> >   if (flags & IORESOURCE_CACHEABLE)
> >     return ioremap(start, len);
> >   if (flags & IORESOURCE_PREFETCH)
> >     return ioremap_wc(start, len);
> >   return ioremap_nocache(start, len);
> 
> Indeed, that's exactly what I think we should strive towards.
> 
> > Is there a reason not to do that?
> 
> This depends on the exact defintion of IORESOURCE_PREFETCH and
> PCI_BASE_ADDRESS_MEM_PREFETCH and how they are used all over and
> accross *all devices*. This didn't look promising for starters:
> 
> include/uapi/linux/pci_regs.h:#define  PCI_BASE_ADDRESS_MEM_PREFETCH    0x08    /* prefetchable? */
> 
> PCI_BASE_ADDRESS_MEM_PREFETCH seems to be BAR specific, so a few questions:
> 
> 1) Can we rest assured for instance that if we check for
> PCI_BASE_ADDRESS_MEM_PREFETCH and if set that it will *only* be set on a full
> PCI BAR if the full PCI BAR does want WC? If not this can regress
> functionality. That seems risky. It however would not be risky if we used
> another API that did look for IORESOURCE_PREFETCH and if so use ioremap_wc() --
> that way only drivers we know that do use the full PCI bar would use this API.
> There's a bit of a problem with this though:
> 
> 2) Do we know that if a *full PCI BAR* is used for WC that
> PCI_BASE_ADDRESS_MEM_PREFETCH *was* definitely set for the PCI BAR? If so then
> the API usage would be restricted only to devices that we know *do* adhere to
> this. That reduces the possible uses for older drivers and can create
> regressions if used loosely without verification... but..
> 
> 3) If from now on we get folks to commit to uset PCI_BASE_ADDRESS_MEM_PREFETCH
> for full PCI BARs that do want WC perhaps newer devices / drivers will use
> this very consistently ? Can we bank on that and is it worth it ?
> 
> 4) If a PCI BAR *does not* have PCI_BASE_ADDRESS_MEM_PREFETCH do we know it
> must not never want WC ?
> 
> If we don't have certainty on any of the above I'm afraid we can't do much
> right now but perhaps we can push towards better use of PCI_BASE_ADDRESS_MEM_PREFETCH
> and hope folks will only use this for the full PCI BAR only if WC is desired.
> 
> Thoughts?

Bjorn, now that you're done schooling me on English, any thoughts on the above?

> > > Although there is also arch_phys_wc_add() that makes use of
> > > architecture specific write-combinging alternatives (MTRR on
> > > x86 when a system does not have PAT) we void polluting
> > > pci_iomap() space with it and force drivers and subsystems
> > > that want to use it to be explicit.
> > >
> > > There are a few motivations for this:
> > >
> > > a) Take advantage of PAT when available
> > >
> > > b) Help bury MTRR code away, MTRR is architecture specific and on
> > >    x86 its replaced by PAT
> > >
> > > c) Help with the goal of eventually using _PAGE_CACHE_UC over
> > >    _PAGE_CACHE_UC_MINUS on x86 on ioremap_nocache() (de33c442e)
> > > ...
> > 
> > > +void __iomem *pci_iomap_wc_range(struct pci_dev *dev,
> > > +                                int bar,
> > > +                                unsigned long offset,
> > > +                                unsigned long maxlen)
> > > +{
> > > +       resource_size_t start = pci_resource_start(dev, bar);
> > > +       resource_size_t len = pci_resource_len(dev, bar);
> > > +       unsigned long flags = pci_resource_flags(dev, bar);
> > > +
> > > +       if (len <= offset || !start)
> > > +               return NULL;
> > > +       len -= offset;
> > > +       start += offset;
> > > +       if (maxlen && len > maxlen)
> > > +               len = maxlen;
> > > +       if (flags & IORESOURCE_IO)
> > > +               return __pci_ioport_map(dev, start, len);
> > > +       if (flags & IORESOURCE_MEM)
> > 
> > Should we log a note in dmesg if the BAR is *not* IORESOURCE_PREFETCH?
> >  I know the driver might know it's safe even if the device didn't mark
> > the BAR as prefetchable, but it does seem like an easy way for a
> > driver to shoot itself in the foot.
> 
> You tell me. I would fear this may not be consistent and we'd end up
> having bug reports open for something that has historically been a
> non-issue. The above questions can help us gauge the risk of this.

Now, I'll tell you what I *think* but these are just guestimates (TM):

  * Likely PCI_BASE_ADDRESS_MEM_PREFETCH can implate IORESOURCE_PREFETCH
    and use of ioremap_wc() on a full PCI BAR only, but this strict
    definition likely cannot be 100% guaranteed and could break some
    devices. We need something a bit more concrete and well known so
    that next generation industry standards embrace and let us in
    the kernel automatically detect specific ranges and their respective
    page attribute requirements. Might be good to address here x86 and
    ARM families

Curious: Sarah, how does USB address these different different page attribute
needs on USB 3.0?

  Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin April 21, 2015, 6:46 p.m. UTC | #7
On Tue, Apr 21, 2015 at 07:52:49PM +0200, Luis R. Rodriguez wrote:
> On Thu, Mar 26, 2015 at 04:00:54AM +0100, Luis R. Rodriguez wrote:
> > On Mon, Mar 23, 2015 at 12:20:47PM -0500, Bjorn Helgaas wrote:
> > > Hi Luis,
> > > 
> > > This seems OK to me, 
> > 
> > Great.
> > 
> > > but I'm curious about a few things.
> > > 
> > > On Fri, Mar 20, 2015 at 6:17 PM, Luis R. Rodriguez
> > > <mcgrof@do-not-panic.com> wrote:
> > > > From: "Luis R. Rodriguez" <mcgrof@suse.com>
> > > >
> > > > This allows drivers to take advantage of write-combining
> > > > when possible. Ideally we'd have pci_read_bases() just
> > > > peg an IORESOURCE_WC flag for us
> > > 
> > > We do set IORESOURCE_PREFETCH.  Do you mean something different?
> > 
> > I did not think we had a WC IORESOURCE flag. Are you saying that we can use
> > IORESOURCE_PREFETCH for that purpose? If so then great.  As I read a PCI BAR
> > can have PCI_BASE_ADDRESS_MEM_PREFETCH and when that's the case we peg
> > IORESOURCE_PREFETCH. That seems to be what I want indeed. Questions below.
> > 
> > > >  but where exactly
> > > > video devices memory lie varies *largely* and at times things
> > > > are mixed with MMIO registers, sometimes we can address
> > > > the changes in drivers, other times the change requires
> > > > intrusive changes.
> > > 
> > > What does a video device address have to do with this?  I do see that
> > > if a BAR maps only a frame buffer, the device might be able to mark it
> > > prefetchable, while if the BAR mapped both a frame buffer and some
> > > registers, it might not be able to make it prefetchable.  But that
> > > doesn't seem like it depends on the *address*.
> > 
> > I meant the offsets for each of those, either registers or framebuffer,
> > and that typically they are mixed (primarily on older devices), so indeed your
> > summary of the problem is what I meant. Let's remember that we are trying to
> > take advantage of PAT here when available and avoid MTRR in that case, do we
> > know that the same PCI BARs that have always historically used MTRRs had
> > IORESOURCE_PREFETCH set, is that a fair assumption ? I realize they are
> > different things -- but its precisely why I ask.
> > 
> > > pci_iomap_range() already makes a cacheable mapping if
> > > IORESOURCE_CACHEABLE; I'm guessing that you would like it to
> > > automatically use WC if the BAR if IORESOURCE_PREFETCH, e.g.,
> > > 
> > >   if (flags & IORESOURCE_CACHEABLE)
> > >     return ioremap(start, len);
> > >   if (flags & IORESOURCE_PREFETCH)
> > >     return ioremap_wc(start, len);
> > >   return ioremap_nocache(start, len);
> > 
> > Indeed, that's exactly what I think we should strive towards.
> > 
> > > Is there a reason not to do that?
> > 
> > This depends on the exact defintion of IORESOURCE_PREFETCH and
> > PCI_BASE_ADDRESS_MEM_PREFETCH and how they are used all over and
> > accross *all devices*. This didn't look promising for starters:
> > 
> > include/uapi/linux/pci_regs.h:#define  PCI_BASE_ADDRESS_MEM_PREFETCH    0x08    /* prefetchable? */
> > 
> > PCI_BASE_ADDRESS_MEM_PREFETCH seems to be BAR specific, so a few questions:
> > 
> > 1) Can we rest assured for instance that if we check for
> > PCI_BASE_ADDRESS_MEM_PREFETCH and if set that it will *only* be set on a full
> > PCI BAR if the full PCI BAR does want WC? If not this can regress
> > functionality. That seems risky. It however would not be risky if we used
> > another API that did look for IORESOURCE_PREFETCH and if so use ioremap_wc() --
> > that way only drivers we know that do use the full PCI bar would use this API.
> > There's a bit of a problem with this though:
> > 
> > 2) Do we know that if a *full PCI BAR* is used for WC that
> > PCI_BASE_ADDRESS_MEM_PREFETCH *was* definitely set for the PCI BAR? If so then
> > the API usage would be restricted only to devices that we know *do* adhere to
> > this. That reduces the possible uses for older drivers and can create
> > regressions if used loosely without verification... but..
> > 

In theory, PCI spec says this about prefetch memory:
	Bridges are permitted to merge writes into this range (refer to Section 3.2.6).

Exceptions could be:
	- devices not behind a bridge (e.g. intergrated in a root
	  complex)
	- devices behind a virtual bridge from same vendor
	  (which know bridge won't prefetch)

I worry that WC might also cause more reordering though.  I don't
remember this is true, off-hand.  Bridges can only reorder transactions
according to very specific rules.

> > 3) If from now on we get folks to commit to uset PCI_BASE_ADDRESS_MEM_PREFETCH
> > for full PCI BARs that do want WC perhaps newer devices / drivers will use
> > this very consistently ? Can we bank on that and is it worth it ?

Unfortunately there's a separate good reason to set memory as prefetcheable:
it's the only way to get 64 bit addresses for devices behind bridges.
So WC might be *safe* for prefetch BARs, but might not be a good idea.

> > 
> > 4) If a PCI BAR *does not* have PCI_BASE_ADDRESS_MEM_PREFETCH do we know it
> > must not never want WC ?

That's not true I think. It means device can't allow prefetch but maybe
it does allow combining.

> > 
> > If we don't have certainty on any of the above I'm afraid we can't do much
> > right now but perhaps we can push towards better use of PCI_BASE_ADDRESS_MEM_PREFETCH
> > and hope folks will only use this for the full PCI BAR only if WC is desired.
> > 
> > Thoughts?
> 
> Bjorn, now that you're done schooling me on English, any thoughts on the above?
>
> > > > Although there is also arch_phys_wc_add() that makes use of
> > > > architecture specific write-combinging alternatives (MTRR on
> > > > x86 when a system does not have PAT) we void polluting
> > > > pci_iomap() space with it and force drivers and subsystems
> > > > that want to use it to be explicit.
> > > >
> > > > There are a few motivations for this:
> > > >
> > > > a) Take advantage of PAT when available
> > > >
> > > > b) Help bury MTRR code away, MTRR is architecture specific and on
> > > >    x86 its replaced by PAT
> > > >
> > > > c) Help with the goal of eventually using _PAGE_CACHE_UC over
> > > >    _PAGE_CACHE_UC_MINUS on x86 on ioremap_nocache() (de33c442e)
> > > > ...
> > > 
> > > > +void __iomem *pci_iomap_wc_range(struct pci_dev *dev,
> > > > +                                int bar,
> > > > +                                unsigned long offset,
> > > > +                                unsigned long maxlen)
> > > > +{
> > > > +       resource_size_t start = pci_resource_start(dev, bar);
> > > > +       resource_size_t len = pci_resource_len(dev, bar);
> > > > +       unsigned long flags = pci_resource_flags(dev, bar);
> > > > +
> > > > +       if (len <= offset || !start)
> > > > +               return NULL;
> > > > +       len -= offset;
> > > > +       start += offset;
> > > > +       if (maxlen && len > maxlen)
> > > > +               len = maxlen;
> > > > +       if (flags & IORESOURCE_IO)
> > > > +               return __pci_ioport_map(dev, start, len);
> > > > +       if (flags & IORESOURCE_MEM)
> > > 
> > > Should we log a note in dmesg if the BAR is *not* IORESOURCE_PREFETCH?
> > >  I know the driver might know it's safe even if the device didn't mark
> > > the BAR as prefetchable, but it does seem like an easy way for a
> > > driver to shoot itself in the foot.
> > 
> > You tell me. I would fear this may not be consistent and we'd end up
> > having bug reports open for something that has historically been a
> > non-issue. The above questions can help us gauge the risk of this.
> 
> Now, I'll tell you what I *think* but these are just guestimates (TM):
> 
>   * Likely PCI_BASE_ADDRESS_MEM_PREFETCH can implate IORESOURCE_PREFETCH
>     and use of ioremap_wc() on a full PCI BAR only, but this strict
>     definition likely cannot be 100% guaranteed and could break some
>     devices. We need something a bit more concrete and well known so
>     that next generation industry standards embrace and let us in
>     the kernel automatically detect specific ranges and their respective
>     page attribute requirements. Might be good to address here x86 and
>     ARM families
> 
> Curious: Sarah, how does USB address these different different page attribute
> needs on USB 3.0?
> 
>   Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin April 21, 2015, 7:25 p.m. UTC | #8
On Mon, Mar 23, 2015 at 12:20:47PM -0500, Bjorn Helgaas wrote:
> pci_iomap_range() already makes a cacheable mapping if
> IORESOURCE_CACHEABLE; I'm guessing that you would like it to
> automatically use WC if the BAR if IORESOURCE_PREFETCH, e.g.,
> 
>   if (flags & IORESOURCE_CACHEABLE)
>     return ioremap(start, len);
>   if (flags & IORESOURCE_PREFETCH)
>     return ioremap_wc(start, len);
>   return ioremap_nocache(start, len);
> 
> Is there a reason not to do that?

I think that's wrong and will break a bunch of things.
PCI prefetch bit merely means bridges can combine writes and prefetch
reads.  Prefetch does not affect ordering rules and does not allow
writes to be collapsed.

WC is stronger: it allows collapsing and changes ordering rules.

WC can also hurt latency as small writes are buffered.

To summarise, driver needs to know what it's doing,
we can't set WC in the pci core automatically.
Luis R. Rodriguez April 21, 2015, 7:27 p.m. UTC | #9
On Tue, Apr 21, 2015 at 12:25 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> To summarise, driver needs to know what it's doing,
> we can't set WC in the pci core automatically.

Thanks, I'll document this and proceed with device driver helpers to
aid with this.

 Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/asm-generic/pci_iomap.h b/include/asm-generic/pci_iomap.h
index 7389c87..b1e17fc 100644
--- a/include/asm-generic/pci_iomap.h
+++ b/include/asm-generic/pci_iomap.h
@@ -15,9 +15,13 @@  struct pci_dev;
 #ifdef CONFIG_PCI
 /* Create a virtual mapping cookie for a PCI BAR (memory or IO) */
 extern void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long max);
+extern void __iomem *pci_iomap_wc(struct pci_dev *dev, int bar, unsigned long max);
 extern void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
 				     unsigned long offset,
 				     unsigned long maxlen);
+extern void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
+					unsigned long offset,
+					unsigned long maxlen);
 /* Create a virtual mapping cookie for a port on a given PCI device.
  * Do not call this directly, it exists to make it easier for architectures
  * to override */
@@ -34,12 +38,22 @@  static inline void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned lon
 	return NULL;
 }
 
+static inline void __iomem *pci_iomap_wc(struct pci_dev *dev, int bar, unsigned long max)
+{
+	return NULL;
+}
 static inline void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
 					    unsigned long offset,
 					    unsigned long maxlen)
 {
 	return NULL;
 }
+static inline void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
+					       unsigned long offset,
+					       unsigned long maxlen)
+{
+	return NULL;
+}
 #endif
 
 #endif /* __ASM_GENERIC_IO_H */
diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c
index bcce5f1..30b65ae 100644
--- a/lib/pci_iomap.c
+++ b/lib/pci_iomap.c
@@ -52,6 +52,46 @@  void __iomem *pci_iomap_range(struct pci_dev *dev,
 EXPORT_SYMBOL(pci_iomap_range);
 
 /**
+ * pci_iomap_wc_range - create a virtual WC mapping cookie for a PCI BAR
+ * @dev: PCI device that owns the BAR
+ * @bar: BAR number
+ * @offset: map memory at the given offset in BAR
+ * @maxlen: max length of the memory to map
+ *
+ * Using this function you will get a __iomem address to your device BAR.
+ * You can access it using ioread*() and iowrite*(). These functions hide
+ * the details if this is a MMIO or PIO address space and will just do what
+ * you expect from them in the correct way. When possible write combining
+ * is used.
+ *
+ * @maxlen specifies the maximum length to map. If you want to get access to
+ * the complete BAR from offset to the end, pass %0 here.
+ * */
+void __iomem *pci_iomap_wc_range(struct pci_dev *dev,
+				 int bar,
+				 unsigned long offset,
+				 unsigned long maxlen)
+{
+	resource_size_t start = pci_resource_start(dev, bar);
+	resource_size_t len = pci_resource_len(dev, bar);
+	unsigned long flags = pci_resource_flags(dev, bar);
+
+	if (len <= offset || !start)
+		return NULL;
+	len -= offset;
+	start += offset;
+	if (maxlen && len > maxlen)
+		len = maxlen;
+	if (flags & IORESOURCE_IO)
+		return __pci_ioport_map(dev, start, len);
+	if (flags & IORESOURCE_MEM)
+		return ioremap_wc(start, len);
+	/* What? */
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(pci_iomap_wc_range);
+
+/**
  * pci_iomap - create a virtual mapping cookie for a PCI BAR
  * @dev: PCI device that owns the BAR
  * @bar: BAR number
@@ -70,4 +110,25 @@  void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long maxlen)
 	return pci_iomap_range(dev, bar, 0, maxlen);
 }
 EXPORT_SYMBOL(pci_iomap);
+
+/**
+ * pci_iomap_wc - create a virtual WC mapping cookie for a PCI BAR
+ * @dev: PCI device that owns the BAR
+ * @bar: BAR number
+ * @maxlen: length of the memory to map
+ *
+ * Using this function you will get a __iomem address to your device BAR.
+ * You can access it using ioread*() and iowrite*(). These functions hide
+ * the details if this is a MMIO or PIO address space and will just do what
+ * you expect from them in the correct way. When possible write combining
+ * is used.
+ *
+ * @maxlen specifies the maximum length to map. If you want to get access to
+ * the complete BAR without checking for its length first, pass %0 here.
+ * */
+void __iomem *pci_iomap_wc(struct pci_dev *dev, int bar, unsigned long maxlen)
+{
+	return pci_iomap_wc_range(dev, bar, 0, maxlen);
+}
+EXPORT_SYMBOL_GPL(pci_iomap_wc);
 #endif /* CONFIG_PCI */