[1/3] pfn_t: Change the encoding

Message ID 1457730784-9890-2-git-send-email-matthew.r.wilcox@intel.com (mailing list archive)
State New, archived

Commit Message

Wilcox, Matthew R March 11, 2016, 9:13 p.m. UTC
By moving the flag bits to the bottom, we encourage commonality
between SGs with pages and those using pfn_t.  We can also then insert
a pfn_t into a radix tree, as it uses the same two bits for indirect &
exceptional indicators.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 include/linux/pfn_t.h | 41 ++++++++++++++++++++++++++++++-----------
 1 file changed, 30 insertions(+), 11 deletions(-)

Comments

Dan Williams March 11, 2016, 9:40 p.m. UTC | #1
On Fri, Mar 11, 2016 at 1:13 PM, Matthew Wilcox
<matthew.r.wilcox@intel.com> wrote:
> By moving the flag bits to the bottom, we encourage commonality
> between SGs with pages and those using pfn_t.  We can also then insert
> a pfn_t into a radix tree, as it uses the same two bits for indirect &
> exceptional indicators.

It's not immediately clear to me what we gain with SG entry
commonality.  The down side is that we lose the property that
pfn_to_pfn_t() is a nop.  This was Dave's suggestion so that the
nominal case did not change the binary layout of a typical pfn.

Can we just bit swizzle a pfn_t on insertion/retrieval from the radix?
Matthew Wilcox March 12, 2016, 6:30 p.m. UTC | #2
On Fri, Mar 11, 2016 at 01:40:20PM -0800, Dan Williams wrote:
> On Fri, Mar 11, 2016 at 1:13 PM, Matthew Wilcox
> <matthew.r.wilcox@intel.com> wrote:
> > By moving the flag bits to the bottom, we encourage commonality
> > between SGs with pages and those using pfn_t.  We can also then insert
> > a pfn_t into a radix tree, as it uses the same two bits for indirect &
> > exceptional indicators.
> 
> It's not immediately clear to me what we gain with SG entry
> commonality.  The down side is that we lose the property that
> pfn_to_pfn_t() is a nop.  This was Dave's suggestion so that the
> nominal case did not change the binary layout of a typical pfn.

I understand that motivation!

> Can we just bit swizzle a pfn_t on insertion/retrieval from the radix?

Of course we *can*, but we end up doing more swizzling that way than we
do this way.  In the Brave New Future where we're storing pfn_t in the
radix tree, on a page fault we find the pfn_t in the radix tree then
we want to insert it into the page tables.  So DAX would first have to
convert the radix tree entry to a pfn_t, then the page table code has to
convert the pfn_t into a pte/pmd/pud (which we currently do by converting
a pfn_t to a pfn, then converting the pfn to a pte/pmd/pud, but I assume
that either the compiler optimises that into a single conversion, or we'll
add pfn_t_pte to each architecture in future if it's actually a problem).

Much easier to look up a pfn_t in the radix tree and pass it directly
to vm_insert_mixed().

If there's any part of the kernel that is doing a *lot* of conversion
between pfn_t and pfn, that surely indicates a place in the kernel where
we need to convert an interface from pfn to pfn_t.

(It occurs to me we can make the code simpler on architectures that
don't support PUDs.  The PFN_HUGE bit is still available to distinguish
between PMDs and PTEs, but we won't need to clear the bottom bit of the
PFN if PFN_HUGE is set, since nobody can add a PUD pfn to the radix tree).
Dan Williams March 13, 2016, 11:09 p.m. UTC | #3
On Sat, Mar 12, 2016 at 10:30 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
> On Fri, Mar 11, 2016 at 01:40:20PM -0800, Dan Williams wrote:
>> On Fri, Mar 11, 2016 at 1:13 PM, Matthew Wilcox
>> <matthew.r.wilcox@intel.com> wrote:
>> > By moving the flag bits to the bottom, we encourage commonality
>> > between SGs with pages and those using pfn_t.  We can also then insert
>> > a pfn_t into a radix tree, as it uses the same two bits for indirect &
>> > exceptional indicators.
>>
>> It's not immediately clear to me what we gain with SG entry
>> commonality.  The down side is that we lose the property that
>> pfn_to_pfn_t() is a nop.  This was Dave's suggestion so that the
>> nominal case did not change the binary layout of a typical pfn.
>
> I understand that motivation!
>
>> Can we just bit swizzle a pfn_t on insertion/retrieval from the radix?
>
> Of course we *can*, but we end up doing more swizzling that way than we
> do this way.  In the Brave New Future where we're storing pfn_t in the
> radix tree, on a page fault we find the pfn_t in the radix tree then
> we want to insert it into the page tables.  So DAX would first have to
> convert the radix tree entry to a pfn_t, then the page table code has to
> convert the pfn_t into a pte/pmd/pud (which we currently do by converting
> a pfn_t to a pfn, then converting the pfn to a pte/pmd/pud, but I assume
> that either the compiler optimises that into a single conversion, or we'll
> add pfn_t_pte to each architecture in future if it's actually a problem).
>
> Much easier to look up a pfn_t in the radix tree and pass it directly
> to vm_insert_mixed().
>
> If there's any part of the kernel that is doing a *lot* of conversion
> between pfn_t and pfn, that surely indicates a place in the kernel where
> we need to convert an interface from pfn to pfn_t.

So this is dependent on where pfn_t gets pushed in the future.  For
example, if we revive using a pfn_t in a bio then I think the
pfn_to_pfn_t() conversions will be more prevalent than the fs/dax.c
radix usages.
Matthew Wilcox March 14, 2016, 3 p.m. UTC | #4
On Sun, Mar 13, 2016 at 04:09:38PM -0700, Dan Williams wrote:
> On Sat, Mar 12, 2016 at 10:30 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
> > On Fri, Mar 11, 2016 at 01:40:20PM -0800, Dan Williams wrote:
> >> Can we just bit swizzle a pfn_t on insertion/retrieval from the radix?
> >
> > Of course we *can*, but we end up doing more swizzling that way than we
> > do this way.  In the Brave New Future where we're storing pfn_t in the
> > radix tree, on a page fault we find the pfn_t in the radix tree then
> > we want to insert it into the page tables.  So DAX would first have to
> > convert the radix tree entry to a pfn_t, then the page table code has to
> > convert the pfn_t into a pte/pmd/pud (which we currently do by converting
> > a pfn_t to a pfn, then converting the pfn to a pte/pmd/pud, but I assume
> > that either the compiler optimises that into a single conversion, or we'll
> > add pfn_t_pte to each architecture in future if it's actually a problem).
> >
> > Much easier to look up a pfn_t in the radix tree and pass it directly
> > to vm_insert_mixed().
> >
> > If there's any part of the kernel that is doing a *lot* of conversion
> > between pfn_t and pfn, that surely indicates a place in the kernel where
> > we need to convert an interface from pfn to pfn_t.
> 
> So this is dependent on where pfn_t gets pushed in the future.  For
> example, if we revive using a pfn_t in a bio then I think the
> pfn_to_pfn_t() conversions will be more prevalent than the fs/dax.c
> radix usages.

Yes, we'll be converting to a pfn_t in more places than we are now
...  but what do we do with that pfn_t once we've got it into a bio?
Except for some rare cases (brd, maybe pmem), it gets converted into an
sg list which then gets DMA mapped, then the DMA addresses are converted
into whatever format the hardware wants.  As long as we convert the sg
list before we convert the bio, there aren't going to be any additional
conversions from pfn_t to pfn.  So I don't see this showing up as an
additional per-I/O cost.

Patch

diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
index 9499481..37c596c 100644
--- a/include/linux/pfn_t.h
+++ b/include/linux/pfn_t.h
@@ -8,16 +8,34 @@ 
  * PFN_SG_LAST - pfn references a page and is the last scatterlist entry
  * PFN_DEV - pfn is not covered by system memmap by default
  * PFN_MAP - pfn has a dynamic page mapping established by a device driver
+ *
+ * Note that the bottom two bits in the pfn_t match the bottom two bits in the
+ * scatterlist so sg_is_chain() and sg_is_last() work.  These bits are also
+ * used by the radix tree for its own purposes, but a PFN cannot be in both a
+ * radix tree and a scatterlist simultaneously.  If a PFN is moved between the
+ * two usages, care should be taken to clear/set these bits appropriately.
  */
-#define PFN_FLAGS_MASK (((u64) ~PAGE_MASK) << (BITS_PER_LONG_LONG - PAGE_SHIFT))
-#define PFN_SG_CHAIN (1ULL << (BITS_PER_LONG_LONG - 1))
-#define PFN_SG_LAST (1ULL << (BITS_PER_LONG_LONG - 2))
-#define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))
-#define PFN_MAP (1ULL << (BITS_PER_LONG_LONG - 4))
+#define PFN_FLAG_BITS	4
+#define PFN_FLAGS_MASK	((1 << PFN_FLAG_BITS) - 1)
+#define __PFN_MAX	((1UL << (BITS_PER_LONG - PFN_FLAG_BITS)) - 1)
+#define PFN_SG_CHAIN	0x01UL
+#define PFN_SG_LAST	0x02UL
+#define PFN_SG_MASK	(PFN_SG_CHAIN | PFN_SG_LAST)
+#define PFN_DEV		0x04UL
+#define PFN_MAP		0x08UL
+
+#if 0
+#define PFN_T_BUG_ON(x)	BUG_ON(x)
+#else
+#define PFN_T_BUG_ON(x)	BUILD_BUG_ON_INVALID(x)
+#endif
 
 static inline pfn_t __pfn_to_pfn_t(unsigned long pfn, u64 flags)
 {
-	pfn_t pfn_t = { .val = pfn | (flags & PFN_FLAGS_MASK), };
+	pfn_t pfn_t = { .val = (pfn << PFN_FLAG_BITS) | flags };
+
+	PFN_T_BUG_ON(pfn & ~__PFN_MAX);
+	PFN_T_BUG_ON(flags & ~PFN_FLAGS_MASK);
 
 	return pfn_t;
 }
@@ -28,6 +46,12 @@  static inline pfn_t pfn_to_pfn_t(unsigned long pfn)
 	return __pfn_to_pfn_t(pfn, 0);
 }
 
+static inline unsigned long pfn_t_to_pfn(pfn_t pfn)
+{
+	unsigned long v = pfn.val;
+	return v >> PFN_FLAG_BITS;
+}
+
 extern pfn_t phys_to_pfn_t(phys_addr_t addr, u64 flags);
 
 static inline bool pfn_t_has_page(pfn_t pfn)
@@ -35,11 +59,6 @@  static inline bool pfn_t_has_page(pfn_t pfn)
 	return (pfn.val & PFN_MAP) == PFN_MAP || (pfn.val & PFN_DEV) == 0;
 }
 
-static inline unsigned long pfn_t_to_pfn(pfn_t pfn)
-{
-	return pfn.val & ~PFN_FLAGS_MASK;
-}
-
 static inline struct page *pfn_t_to_page(pfn_t pfn)
 {
 	if (pfn_t_has_page(pfn))