diff mbox series

pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache

Message ID alpine.LRH.2.02.2009160649560.20720@file01.intranet.prod.int.rdu2.redhat.com
State New
Headers show
Series pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache | expand

Commit Message

Mikulas Patocka Sept. 16, 2020, 10:57 a.m. UTC
On Tue, 15 Sep 2020, Mikulas Patocka wrote:

> 
> 
> On Tue, 15 Sep 2020, Mikulas Patocka wrote:
> 
> > > > - __copy_from_user_inatomic_nocache doesn't flush cache for leading and
> > > > trailing bytes.
> > > 
> > > You want copy_user_flushcache(). See how fs/dax.c arranges for
> > > dax_copy_from_iter() to route to pmem_copy_from_iter().
> > 
> > Is it something new for the kernel 5.10? I see only __copy_user_flushcache 
> > that is implemented just for x86 and arm64.
> > 
> > There is __copy_from_user_flushcache implemented for x86, arm64 and power. 
> > It is used in lib/iov_iter.c under
> > #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE - so should I use this?
> > 
> > Mikulas
> 
> ... and __copy_user_flushcache is not exported for modules. So, I am stuck 
> with __copy_from_user_inatomic_nocache.
> 
> Mikulas

I'm submitting this patch that adds the required exports (so that we could 
use __copy_from_user_flushcache on x86, arm64 and powerpc). Please, queue 
it for the next merge window.


From: Mikulas Patocka <mpatocka@redhat.com>

Export the symbols __copy_user_flushcache and __copy_from_user_flushcache,
so that modules can use this functionality.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 arch/arm64/lib/uaccess_flushcache.c |    1 +
 arch/powerpc/lib/pmem.c             |    1 +
 arch/x86/lib/usercopy_64.c          |    1 +
 3 files changed, 3 insertions(+)

Comments

Mikulas Patocka Sept. 16, 2020, 5:24 p.m. UTC | #1
On Wed, 16 Sep 2020, Dan Williams wrote:

> On Wed, Sep 16, 2020 at 3:57 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> >
> > I'm submitting this patch that adds the required exports (so that we could
> > use __copy_from_user_flushcache on x86, arm64 and powerpc). Please, queue
> > it for the next merge window.
> 
> Why? This should go with the first user, and it's not clear that it
> needs to be relative to the current dax_operations export scheme.

Before nvfs gets included in the kernel, I need to distribute it as a 
module. So, it would make my maintenance easier. But if you don't want to 
export it now, no problem, I can just copy __copy_user_flushcache from the 
kernel to the module.

> My first question about nvfs is how it compares to a daxfs with
> executables and other binaries configured to use page cache with the
> new per-file dax facility?

nvfs is faster than dax-based filesystems on metadata-heavy operations 
because it doesn't have the overhead of the buffer cache and bios. See 
this: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS

Mikulas
Dan Williams Sept. 16, 2020, 5:40 p.m. UTC | #2
On Wed, Sep 16, 2020 at 10:24 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
>
> On Wed, 16 Sep 2020, Dan Williams wrote:
>
> > On Wed, Sep 16, 2020 at 3:57 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > >
> > >
> > >
> > > I'm submitting this patch that adds the required exports (so that we could
> > > use __copy_from_user_flushcache on x86, arm64 and powerpc). Please, queue
> > > it for the next merge window.
> >
> > Why? This should go with the first user, and it's not clear that it
> > needs to be relative to the current dax_operations export scheme.
>
> Before nvfs gets included in the kernel, I need to distribute it as a
> module. So, it would make my maintenance easier. But if you don't want to
> export it now, no problem, I can just copy __copy_user_flushcache from the
> kernel to the module.

That sounds a better plan than exporting symbols with no in-kernel consumer.

> > My first question about nvfs is how it compares to a daxfs with
> > executables and other binaries configured to use page cache with the
> > new per-file dax facility?
>
> nvfs is faster than dax-based filesystems on metadata-heavy operations
> because it doesn't have the overhead of the buffer cache and bios. See
> this: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS

...and that metadata problem is intractable upstream? Christoph poked
at bypassing the block layer for xfs metadata operations [1], I just
have not had time to carry that further.

[1]: "xfs: use dax_direct_access for log writes", although it seems
he's dropped that branch from his xfs.git
Christoph Hellwig Sept. 17, 2020, 6:50 a.m. UTC | #3
On Wed, Sep 16, 2020 at 10:40:13AM -0700, Dan Williams wrote:
> > Before nvfs gets included in the kernel, I need to distribute it as a
> > module. So, it would make my maintenance easier. But if you don't want to
> > export it now, no problem, I can just copy __copy_user_flushcache from the
> > kernel to the module.
> 
> That sounds a better plan than exporting symbols with no in-kernel consumer.

Exporting symbols without a user is a complete no-go.

> > > My first question about nvfs is how it compares to a daxfs with
> > > executables and other binaries configured to use page cache with the
> > > new per-file dax facility?
> >
> > nvfs is faster than dax-based filesystems on metadata-heavy operations
> > because it doesn't have the overhead of the buffer cache and bios. See
> > this: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS
> 
> ...and that metadata problem is intractable upstream? Christoph poked
> at bypassing the block layer for xfs metadata operations [1], I just
> have not had time to carry that further.
> 
> [1]: "xfs: use dax_direct_access for log writes", although it seems
> he's dropped that branch from his xfs.git

I've pushed the old branch out again:

    http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-log-dax

The main sticking points here are:

 - currently all our nvdimm/DAX code does totally pointless pmem_flush
   calls just to be on the safe side.  That probably is one of the big
   speedups of nova and other academic snake oil projects over our
   stack.  We need to handle this properly
 - what do we do about write error handling?  That is the other big
   thing in the pmem/dax stack that all of the direct writers (including
   MAP_SYNC mmaps) pretty much ignore

Once that is sorted out we can not just put the log changes like above
in, but also move the buffer cache over to do a direct access and
basically stop using the block layer for a pure DAX XFS.
Mikulas Patocka Sept. 21, 2020, 4:20 p.m. UTC | #4
On Wed, 16 Sep 2020, Mikulas Patocka wrote:

> 
> 
> On Wed, 16 Sep 2020, Dan Williams wrote:
> 
> > On Wed, Sep 16, 2020 at 10:24 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > >
> > > > My first question about nvfs is how it compares to a daxfs with
> > > > executables and other binaries configured to use page cache with the
> > > > new per-file dax facility?
> > >
> > > nvfs is faster than dax-based filesystems on metadata-heavy operations
> > > because it doesn't have the overhead of the buffer cache and bios. See
> > > this: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS
> > 
> > ...and that metadata problem is intractable upstream? Christoph poked
> > at bypassing the block layer for xfs metadata operations [1], I just
> > have not had time to carry that further.
> > 
> > [1]: "xfs: use dax_direct_access for log writes", although it seems
> > he's dropped that branch from his xfs.git
> 
> XFS is very big. I wanted to create something small.

And the another difference is that XFS metadata are optimized for disks 
and SSDs.

On disks and SSDs, reading one byte is as costly as reading a full block. 
So we must put as much information to a block as possible. XFS uses 
b+trees for file block mapping and for directories - it is reasonable 
decision because b+trees minimize the number of disk accesses.

On persistent memory, each access has its own cost, so NVFS uses metadata 
structures that minimize the number of cache lines accessed (rather than 
the number of blocks accessed). For block mapping, NVFS uses the classic 
unix dierct/indirect blocks - if a file block is mapped by a 3-rd level 
indirect block, we do just three memory accesses and we are done. If we 
used b+trees, the number of accesses would be much larger than 3 (we would 
have to do binary search in the b+tree nodes).

The same for directories - NVFS hashes the file name and uses radix-tree 
to locate a directory page where the directory entry is located. XFS 
b+trees would result in much more accesses than the radix-tree.

Regarding journaling - NVFS doesn't do it because persistent memory is so 
fast that we can just check it in the case of crash. NVFS has a 
multithreaded fsck that can do 3 million inodes per second. XFS does 
journaling (it was reasonable decision for disks where fsck took hours) 
and it will cause overhead for all the filesystem operations.

Mikulas
Dave Chinner Sept. 22, 2020, 5:03 a.m. UTC | #5
Hi Mikulas,

I'll say up front that I think you're barking up the wrong tree
trying to knock down XFS and ext4 to justify NVFS. NVFS will stand
or fall on it's own merits, not on how you think it's better than
other filesystems...

I have some fundamental concerns about the NVFS integrity model,
they came out as I wrote this response to your characterisations of
XFS and journalling filesysetms. Maybe I'm missing something about
NVFS that isn't clearly explained....

On Mon, Sep 21, 2020 at 12:20:42PM -0400, Mikulas Patocka wrote:
> On Wed, 16 Sep 2020, Mikulas Patocka wrote:
> > On Wed, 16 Sep 2020, Dan Williams wrote:
> > > On Wed, Sep 16, 2020 at 10:24 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > >
> > > > > My first question about nvfs is how it compares to a daxfs with
> > > > > executables and other binaries configured to use page cache with the
> > > > > new per-file dax facility?
> > > >
> > > > nvfs is faster than dax-based filesystems on metadata-heavy operations
> > > > because it doesn't have the overhead of the buffer cache and bios. See
> > > > this: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS
> > > 
> > > ...and that metadata problem is intractable upstream? Christoph poked
> > > at bypassing the block layer for xfs metadata operations [1], I just
> > > have not had time to carry that further.
> > > 
> > > [1]: "xfs: use dax_direct_access for log writes", although it seems
> > > he's dropped that branch from his xfs.git
> > 
> > XFS is very big. I wanted to create something small.
> 
> And the another difference is that XFS metadata are optimized for disks 
> and SSDs.

Ah, that old chestnut. :)

> On disks and SSDs, reading one byte is as costly as reading a full block. 
> So we must put as much information to a block as possible. XFS uses 
> b+trees for file block mapping and for directories - it is reasonable 
> decision because b+trees minimize the number of disk accesses.

Actually, no, that wasn't why XFS was implemented using btrees. The
btrees are an implementation detail, not a design requirement to
minimise the number of disk accesses.

XFS was intended for huge disk arrays (hundreds to thousands on
individual spindles) and so no attempt was made in the design to
minimise disk accesses. There was -always- going to be a huge number
of IOPS available in the intended large scale deployments, so
concurrency and *efficiency at scale* was far more important at the
design level than minimising the number of disk ops for any given
operation.

To that end, simulations were done that showed that extent based
trees were much more CPU, memory and IO efficient than bitmaps,
hybrid tree-bitmaps or multi-layer indirect block referencing when
trying to index and access large amounts of data.

To be efficient at scale, all operations need to be O(log N) or
better, and extent based encoding is much, more compact than direct
block indexing. Extent trees are also much more effective for finding
exact fits, identifying large contiguous spaces, and manipulating
large ranges of indexed data than other filesystem indexing
mechanisms.  They are also not bound by alignment restrictions like
hierarchical binary/buddy bitmap schemes are, and their maximum size
is bounded and can be calculated at runtime.

IOWs, extent based trees were chosen because of scalability,
efficiency, and flexibility reasons before the actual tree structure
that it would be implemented with was decided on.  b+trees were used
in the implementation because one tree implementation could do
everything as all that needed to change btree trees was the pointer
and record format.

The result of this is that we have made -zero- changes to the XFS
structure and algorithms for SSDs. We don't do different things
based on the blkdev rotational flag, or anything like that. XFS
behaves exactly the same on spinning disks as it does SSDs as it
does PMEM and it performs well on all of them. And that performance
doesn't drop away as you increase the scale and capability of the
underlying storage.

That's what happens when storage algorithms are designed for
concurrency and efficiency at scale rather than optimising for a
specific storage characteristic.

NVFS is optimised for a specific storage characteristic (i.e. low
latency synchronous storage), so I would absolutely expect it to be
faster than XFS on that specific storage. However, claims like this:

> On persistent memory, each access has its own cost, so NVFS uses metadata 
> structures that minimize the number of cache lines accessed (rather than 
> the number of blocks accessed). For block mapping, NVFS uses the classic 
> unix dierct/indirect blocks - if a file block is mapped by a 3-rd level 
> indirect block, we do just three memory accesses and we are done. If we 
> used b+trees, the number of accesses would be much larger than 3 (we would 
> have to do binary search in the b+tree nodes).

... are kinda naive, because you're clearly optimising the wrong
aspect of block mapping. Extents solve the block indexing overhead
problem; optimising the type of tree you use to index the indirect
blocks doesn't avoid the overhead of having to iterate every block
for range operations.

IOWs, we use extents because they are space and time efficient for
the general use cases. XFS can map 2^21 blocks into a single 16 byte
extent record (8GiB file mapping for 4k block size) and so the vast
majority of files in a filesystem are mapped with a single extent.
The NVFS indirect block tree has a fan-out of 16, mapping 2^21
blocks requires a 5 level indirect tree. Whcih one if going to be
faster to truncate away - a single record or 2 million individual
blocks?

IOWs, we can take afford to take an extra cacheline miss or two on a
tree block search, because we're accessing and managing orders of
magnitude fewer records in the mapping tree than an indirect block
tree.

PMEM doesn't change this: extents are more time and space efficient
at scale for mapping trees than indirect block trees regardless
of the storage medium in use.

> The same for directories - NVFS hashes the file name and uses radix-tree 
> to locate a directory page where the directory entry is located. XFS 
> b+trees would result in much more accesses than the radix-tree.

That's like me saying that XFS hashes the file name and uses a btree
to index the directory block where the dirent is located, so it will
be faster than a radix tree.

It's completely bogus.

It ignores the fact that both filesysetms use the same hash based
lookup indexing algorithms and use O(log N) trees for the name hash.
IOWs, the only real difference is the fan-out and depths of the
tree. The end result is that algorithmic performance of name ->
dirent lookups are going to be *very similar* and, as such, the
performance differential is going to be dominated by other
implementation differences.

Such as the fact that XFS has to buffer the directory metadata,
hence that the initial directory block lookup cost is higher than
NVFS. Subsequent block lookups hit the buffer cache, so that
caching overhead is somewhat amortised over multiple directory
accesses, but it doesn't get rid of it.

IOWs, difference in memory accesses between a radix tree and btree
for this algorithm is largely irrelevant, and even your tests
indicate that.  The modification tests show that metadata lookup
*and journalling* overhead is not really that significant as the
number of directory entries increase:

dir-test /mnt/test/linux-2.6 63000 1048576
nvfs		6.6s
ext4 dax	8.4s
xfs dax		12.2s


dir-test /mnt/test/linux-2.6 63000 1048576 link
nvfs		4.7s
ext4 dax	5.6s
xfs dax		7.8s

dir-test /mnt/test/linux-2.6 63000 1048576 dir
nvfs		8.2s
ext4 dax	15.1s
xfs dax		11.8s

Yes, nvfs is faster than both ext4 and XFS on DAX, but it's  not a
huge difference - it's not orders of magnitude faster.
If XFS and ext4 directory structures were substantially less
efficient than the NVFS structure, then NVFS should absolutely
-slaughter- ext4 and XFS in directory modification intensive
microbenchmarks like this. Especially considering that difference in
performance also includes the journalling overhead.

IOWs, the differences in performance are not a result of the
directory structures or the number of memory fetches their indexing
algorithms require to do the work - the differences come from
structural features: ext4 and XFS have more work to do per
directory operation thanks to their metadata buffer and journalling
management requirements.

Also, keep in mind that much of the complexity in the XFS directory
structure doesn't make XFS go faster at small directory sizes. They
actually slow it down at small sizes, but they also stop performance
from falling off a cliff at scale. Hence results might be quite
different if you are running with millions of directory entries in
the directories rather that a few thousand....

> Regarding journaling - NVFS doesn't do it because persistent memory is so 
> fast that we can just check it in the case of crash. NVFS has a 
> multithreaded fsck that can do 3 million inodes per second.

Scanning speed means little when it comes to integrity checking.

Fast storage can hide a multitude of sins, the least of which is
inefficient algorithms. More importantly, it can hide crash related
inconsistencies, because timing the crash to land between two
specific modifications is much harder on fast storage than it is
slow storage.

IOWs, just because fsck can iterate inodes at 3M a second doesn't
mean the filesystem code is crash safe and correct, nor that fsck
can detect and correct all the inconsistencies that the
crash left behind and need fixing. More on this later....

> XFS does 
> journaling (it was reasonable decision for disks where fsck took hours) 
> and it will cause overhead for all the filesystem operations.

Fundamentally, journalling provides guarantees much more important
than than "does not need fsck". Journalling provides -atomic
metadata changes-, and that's something you can't do without some
variant of journalled, log structured or COW metadata. This is
important, because atomicity of metadata changes is something users
actually expect from filesystems.

Take, for example, truncate. If you punch out the space on storage
before you change the inode size on storage and then crash
in-between the punch and the inode size reduction, the user file is
left with a bunch of zeros in it over the range between the new EOF
and the old EOF. Users will see partial completion state.

IOWs, the NVFS truncate operation as implemented:

        if (attr->ia_valid & ATTR_SIZE) {
                WARN_ON(!inode_is_locked(&nmi->vfs_inode));
                if (attr->ia_size != nmi->vfs_inode.i_size) {
                        r = nvfs_punch_hole(nmi, attr->ia_size, LLONG_MAX - ((1UL << nvs->log2_page_size) - 1));
                        if (unlikely(r))
                                return r;
                        nvfs_set_inode_size(nmi, attr->ia_size);
                }
        }

is not atomic from a user crash recovery perspective as it exposes
partially complete state to the user. For it to be atomic from the
user perspective, on truncate down the inode size must be changed on
disk first, then the space beyond the new EOF needs to get punched
out. hence if we crash while the punching is occurring, users will
not see that after remount because the inconsistent state is beyond
the range they can access. IOWs, they see the file as if the
truncate down is fully complete, regardless of whether it is
actually complete or not.

However, that then potentially breaks truncate up because,
conversely, truncate up requires that any blocks already allocated
beyond EOF needs to be zeroed before the inode size is changed so
stale data is not exposed between the old EOF and the new EOF....

Yes, this can be fixed by changing the order of the punch vs the
inode size change based on what type of operation is actually going
to be performed, but this is just an example of the complexity
problem "soft updates" bring to otherwise "simple" operations. I
haven't even mentioned freeing indirect blocks also updates free
space, inode i_blocks counters, potentially multiple indirect
blocks, etc. If the order is wrong, then all bets are off for what
fsck will actually do with the inode when it scans it and finds
partially complete state. And even if it gets corrected, there's no
guarantee taht the entire operation was completed from the
perspective of the user.

Rename is another operation that has specific "operation has atomic
behaviour" expectations. I haven't looked at how you've
implementated that yet, but I suspect it also is extremely difficult
to implement in an atomic manner using direct pmem updates to the
directory structures.

AFAICS, it is easy to get metadata update ordering wrong, and
without a formal proof that every single write to storage is
correctly ordered I can't see how this model can be validated and
reviewed. It is made exceedingly complex by direct storage access.
instead of doing all the changes to a single block in memory and
then only having to order block writes to stable storage correctly
(as per the BSD filesystem that used soft updates), every direct
pmem modification to an object needs to be ordered correctly against
all other direct pmem modifications, both within the object and
across objects.

And this brings me back to modification atomicity - soft update
ordering across objects is hard enough, but how do you perform
dependent modifications to a single object atomically?

e.g. Looking at nvfs_setattr, why is it correct to write timestamp
changes to pmem before the truncate is done?

And if that is correct behavour, then why is it correct to
change UID/GID _before_ updating the timestamp?

And if that is also correct behaviour, then why are mode changes
done _after both_ uid/gid and timestamp changes? What happens if
setattr is asked to change ATTR_UID|ATTR_GID|ATTR_MODE as an atomic
change and we crash before the mode has been changed?

I think the only right answer can be "->setattr changes are atomic
at the stable storage level", in which case the way NVFS is updating
stable storage breaking all sorts of assumptions and expectations
for changing security and ownership attributes of the inode.

And to bring this right back to "fsck is fast so we don't need
journalling": how would running fsck before mounting detect that
these compound object updates were not fully completed? How does
running fsck after the crash guarantee compound object modifications
it knows nothing about are executed atomically?

That, too me, looks like a fundamental, unfixable flaw in this
approach...

I can see how "almost in place" modification can be done by having
two copies side by side and updating one while the other is the
active copy and switching atomically between the two objects. That
way a traditional soft-update algorithm would work because the
exposure of the changes is via ordering the active copy switches.
That would come at a cost, though, both in metadata footprint and
CPU overhead.

So, what have I missed about the way metadata is updated in the pmem
that allows non-atomic updates to work reliably?

Cheers,

Dave.
Matthew Wilcox Sept. 22, 2020, 12:28 p.m. UTC | #6
On Mon, Sep 21, 2020 at 12:20:42PM -0400, Mikulas Patocka wrote:
> The same for directories - NVFS hashes the file name and uses radix-tree 
> to locate a directory page where the directory entry is located. XFS 
> b+trees would result in much more accesses than the radix-tree.

What?  Radix trees behave _horribly_ badly when indexed by a hash.
If you have a 64-bit hash and use 8 bits per level of the tree, you have
to traverse 8 pointers to get to your destination.  You might as well
use a linked list!
Mikulas Patocka Sept. 22, 2020, 12:39 p.m. UTC | #7
On Tue, 22 Sep 2020, Matthew Wilcox wrote:

> On Mon, Sep 21, 2020 at 12:20:42PM -0400, Mikulas Patocka wrote:
> > The same for directories - NVFS hashes the file name and uses radix-tree 
> > to locate a directory page where the directory entry is located. XFS 
> > b+trees would result in much more accesses than the radix-tree.
> 
> What?  Radix trees behave _horribly_ badly when indexed by a hash.
> If you have a 64-bit hash and use 8 bits per level of the tree, you have
> to traverse 8 pointers to get to your destination.  You might as well
> use a linked list!

In NVFS, radix trees are cut off - they have only as much internal levels, 
as is needed to disambiguate the directory entries.

Read this document: http://people.redhat.com/~mpatocka/nvfs/INTERNALS
the section "DIRECTORIES".

Perhaps, I should call it differently than "radix-trees", but I don't 
really know what is the official name for this data structure.

Mikulas
Mikulas Patocka Sept. 22, 2020, 4:46 p.m. UTC | #8
Hi

Thanks for reviewing NVFS.


On Tue, 22 Sep 2020, Dave Chinner wrote:

> Hi Mikulas,
> 
> I'll say up front that I think you're barking up the wrong tree
> trying to knock down XFS and ext4 to justify NVFS. NVFS will stand
> or fall on it's own merits, not on how you think it's better than
> other filesystems...
> 
> I have some fundamental concerns about the NVFS integrity model,
> they came out as I wrote this response to your characterisations of
> XFS and journalling filesysetms. Maybe I'm missing something about
> NVFS that isn't clearly explained....
> 
> On Mon, Sep 21, 2020 at 12:20:42PM -0400, Mikulas Patocka wrote:
> > On Wed, 16 Sep 2020, Mikulas Patocka wrote:
> > > On Wed, 16 Sep 2020, Dan Williams wrote:
> > > > On Wed, Sep 16, 2020 at 10:24 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > > >
> > > > > > My first question about nvfs is how it compares to a daxfs with
> > > > > > executables and other binaries configured to use page cache with the
> > > > > > new per-file dax facility?
> > > > >
> > > > > nvfs is faster than dax-based filesystems on metadata-heavy operations
> > > > > because it doesn't have the overhead of the buffer cache and bios. See
> > > > > this: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS
> > > > 
> > > > ...and that metadata problem is intractable upstream? Christoph poked
> > > > at bypassing the block layer for xfs metadata operations [1], I just
> > > > have not had time to carry that further.
> > > > 
> > > > [1]: "xfs: use dax_direct_access for log writes", although it seems
> > > > he's dropped that branch from his xfs.git
> > > 
> > > XFS is very big. I wanted to create something small.
> > 
> > And the another difference is that XFS metadata are optimized for disks 
> > and SSDs.
> 
> Ah, that old chestnut. :)
> 
> > On disks and SSDs, reading one byte is as costly as reading a full block. 
> > So we must put as much information to a block as possible. XFS uses 
> > b+trees for file block mapping and for directories - it is reasonable 
> > decision because b+trees minimize the number of disk accesses.
> 
> Actually, no, that wasn't why XFS was implemented using btrees. The
> btrees are an implementation detail, not a design requirement to
> minimise the number of disk accesses.
> 
> XFS was intended for huge disk arrays (hundreds to thousands on
> individual spindles) and so no attempt was made in the design to
> minimise disk accesses. There was -always- going to be a huge number
> of IOPS available in the intended large scale deployments, so
> concurrency and *efficiency at scale* was far more important at the
> design level than minimising the number of disk ops for any given
> operation.
> 
> To that end, simulations were done that showed that extent based
> trees were much more CPU, memory and IO efficient than bitmaps,
> hybrid tree-bitmaps or multi-layer indirect block referencing when
> trying to index and access large amounts of data.
> 
> To be efficient at scale, all operations need to be O(log N) or
> better, and extent based encoding is much, more compact than direct
> block indexing. Extent trees are also much more effective for finding
> exact fits, identifying large contiguous spaces, and manipulating
> large ranges of indexed data than other filesystem indexing
> mechanisms.  They are also not bound by alignment restrictions like
> hierarchical binary/buddy bitmap schemes are, and their maximum size
> is bounded and can be calculated at runtime.
> 
> IOWs, extent based trees were chosen because of scalability,
> efficiency, and flexibility reasons before the actual tree structure
> that it would be implemented with was decided on.  b+trees were used
> in the implementation because one tree implementation could do
> everything as all that needed to change btree trees was the pointer
> and record format.

I agree that the b+tree were a good choice for XFS.

In RAM-based maps, red-black trees or avl trees are used often. In 
disk-based maps, btrees or b+trees are used. That's because in RAM, you 
are optimizing for the number of cache lines accessed, and on the disk, 
you are optimizing for the number of blocks accessed.

> The result of this is that we have made -zero- changes to the XFS
> structure and algorithms for SSDs. We don't do different things
> based on the blkdev rotational flag, or anything like that. XFS
> behaves exactly the same on spinning disks as it does SSDs as it
> does PMEM and it performs well on all of them. And that performance
> doesn't drop away as you increase the scale and capability of the
> underlying storage.
> 
> That's what happens when storage algorithms are designed for
> concurrency and efficiency at scale rather than optimising for a
> specific storage characteristic.
> 
> NVFS is optimised for a specific storage characteristic (i.e. low
> latency synchronous storage), so I would absolutely expect it to be
> faster than XFS on that specific storage. However, claims like this:
> 
> > On persistent memory, each access has its own cost, so NVFS uses metadata 
> > structures that minimize the number of cache lines accessed (rather than 
> > the number of blocks accessed). For block mapping, NVFS uses the classic 
> > unix dierct/indirect blocks - if a file block is mapped by a 3-rd level 
> > indirect block, we do just three memory accesses and we are done. If we 
> > used b+trees, the number of accesses would be much larger than 3 (we would 
> > have to do binary search in the b+tree nodes).
> 
> ... are kinda naive, because you're clearly optimising the wrong
> aspect of block mapping. Extents solve the block indexing overhead
> problem; optimising the type of tree you use to index the indirect
> blocks doesn't avoid the overhead of having to iterate every block
> for range operations.
> 
> IOWs, we use extents because they are space and time efficient for
> the general use cases. XFS can map 2^21 blocks into a single 16 byte
> extent record (8GiB file mapping for 4k block size) and so the vast
> majority of files in a filesystem are mapped with a single extent.

BTW. How does XFS "predict" the file size? - so that it allocates extent 
of proper size without knowing how big the file will be?

> The NVFS indirect block tree has a fan-out of 16,

No. The top level in the inode contains 16 blocks (11 direct and 5 
indirect). And each indirect block can have 512 pointers (4096/8). You can 
format the device with larger block size and this increases the fanout 
(the NVFS block size must be greater or equal than the system page size).

2 levels can map 1GiB (4096*512^2), 3 levels can map 512 GiB, 4 levels can 
map 256 TiB and 5 levels can map 128 PiB.

> mapping 2^21 blocks requires a 5 level indirect tree. Which one if going 
> to be faster to truncate away - a single record or 2 million individual 
> blocks?
> 
> IOWs, we can take afford to take an extra cacheline miss or two on a
> tree block search, because we're accessing and managing orders of
> magnitude fewer records in the mapping tree than an indirect block
> tree.
> 
> PMEM doesn't change this: extents are more time and space efficient
> at scale for mapping trees than indirect block trees regardless
> of the storage medium in use.

PMEM doesn't have to be read linearly, so the attempts to allocate large 
linear space are not needed. They won't harm but they won't help either.

That's why NVFS has very simple block allocation alrogithm - it uses a 
per-cpu pointer and tries to allocate by a bit scan from this pointer. If 
the group is full, it tries a random group with above-average number of 
free blocks.

EXT4 uses bit scan for allocations and people haven't complained that it's 
inefficient, so it is probably OK.

> > The same for directories - NVFS hashes the file name and uses radix-tree 
> > to locate a directory page where the directory entry is located. XFS 
> > b+trees would result in much more accesses than the radix-tree.
> 
> That's like me saying that XFS hashes the file name and uses a btree
> to index the directory block where the dirent is located, so it will
> be faster than a radix tree.
> 
> It's completely bogus.
> 
> It ignores the fact that both filesysetms use the same hash based
> lookup indexing algorithms and use O(log N) trees for the name hash.
> IOWs, the only real difference is the fan-out and depths of the
> tree.

NVFS has fanout of 512 pointers per block.

> The end result is that algorithmic performance of name ->
> dirent lookups are going to be *very similar* and, as such, the
> performance differential is going to be dominated by other
> implementation differences.
> 
> Such as the fact that XFS has to buffer the directory metadata,
> hence that the initial directory block lookup cost is higher than
> NVFS. Subsequent block lookups hit the buffer cache, so that
> caching overhead is somewhat amortised over multiple directory
> accesses, but it doesn't get rid of it.
> 
> IOWs, difference in memory accesses between a radix tree and btree
> for this algorithm is largely irrelevant, and even your tests
> indicate that.  The modification tests show that metadata lookup
> *and journalling* overhead is not really that significant as the
> number of directory entries increase:
> 
> dir-test /mnt/test/linux-2.6 63000 1048576
> nvfs		6.6s
> ext4 dax	8.4s
> xfs dax		12.2s
> 
> 
> dir-test /mnt/test/linux-2.6 63000 1048576 link
> nvfs		4.7s
> ext4 dax	5.6s
> xfs dax		7.8s
> 
> dir-test /mnt/test/linux-2.6 63000 1048576 dir
> nvfs		8.2s
> ext4 dax	15.1s
> xfs dax		11.8s
> 
> Yes, nvfs is faster than both ext4 and XFS on DAX, but it's  not a
> huge difference - it's not orders of magnitude faster.

If I increase the size of the test directory, NVFS is order of magnitude 
faster:

time dir-test /mnt/test/ 2000000 2000000
NVFS: 0m29,395s
XFS:  1m59,523s
EXT4: 1m14,176s

time dir-test /mnt/test/ 8000000 8000000
NVFS: 2m13,507s
XFS: 14m31,261s
EXT4: reports "file 1976882 can't be created: No space left on device", 
	(although there are free blocks and inodes)
	Is it a bug or expected behavior?

> If XFS and ext4 directory structures were substantially less
> efficient than the NVFS structure, then NVFS should absolutely
> -slaughter- ext4 and XFS in directory modification intensive
> microbenchmarks like this. Especially considering that difference in
> performance also includes the journalling overhead.
> 
> IOWs, the differences in performance are not a result of the
> directory structures or the number of memory fetches their indexing
> algorithms require to do the work - the differences come from
> structural features: ext4 and XFS have more work to do per
> directory operation thanks to their metadata buffer and journalling
> management requirements.
> 
> Also, keep in mind that much of the complexity in the XFS directory
> structure doesn't make XFS go faster at small directory sizes. They
> actually slow it down at small sizes, but they also stop performance
> from falling off a cliff at scale. Hence results might be quite
> different if you are running with millions of directory entries in
> the directories rather that a few thousand....

See above - the ratio between NVFS and XFS grows as we increase directory 
size.

> > Regarding journaling - NVFS doesn't do it because persistent memory is so 
> > fast that we can just check it in the case of crash. NVFS has a 
> > multithreaded fsck that can do 3 million inodes per second.
> 
> Scanning speed means little when it comes to integrity checking.
> 
> Fast storage can hide a multitude of sins, the least of which is
> inefficient algorithms. More importantly, it can hide crash related
> inconsistencies, because timing the crash to land between two
> specific modifications is much harder on fast storage than it is
> slow storage.
> 
> IOWs, just because fsck can iterate inodes at 3M a second doesn't
> mean the filesystem code is crash safe and correct, nor that fsck
> can detect and correct all the inconsistencies that the
> crash left behind and need fixing. More on this later....
> 
> > XFS does 
> > journaling (it was reasonable decision for disks where fsck took hours) 
> > and it will cause overhead for all the filesystem operations.
> 
> Fundamentally, journalling provides guarantees much more important
> than than "does not need fsck". Journalling provides -atomic
> metadata changes-, and that's something you can't do without some
> variant of journalled, log structured or COW metadata. This is
> important, because atomicity of metadata changes is something users
> actually expect from filesystems.
> 
> Take, for example, truncate. If you punch out the space on storage
> before you change the inode size on storage and then crash
> in-between the punch and the inode size reduction, the user file is
> left with a bunch of zeros in it over the range between the new EOF
> and the old EOF. Users will see partial completion state.
> 
> IOWs, the NVFS truncate operation as implemented:
> 
>         if (attr->ia_valid & ATTR_SIZE) {
>                 WARN_ON(!inode_is_locked(&nmi->vfs_inode));
>                 if (attr->ia_size != nmi->vfs_inode.i_size) {
>                         r = nvfs_punch_hole(nmi, attr->ia_size, LLONG_MAX - ((1UL << nvs->log2_page_size) - 1));
>                         if (unlikely(r))
>                                 return r;
>                         nvfs_set_inode_size(nmi, attr->ia_size);
>                 }
>         }
> 
> is not atomic from a user crash recovery perspective as it exposes
> partially complete state to the user. For it to be atomic from the
> user perspective, on truncate down the inode size must be changed on
> disk first, then the space beyond the new EOF needs to get punched
> out. hence if we crash while the punching is occurring, users will
> not see that after remount because the inconsistent state is beyond
> the range they can access. IOWs, they see the file as if the
> truncate down is fully complete, regardless of whether it is
> actually complete or not.

You are right - these operations should be swapped.

BTW. XFS also had (or still has?) a misfeature that it was leaving 
zero-filed files after a crash.

> However, that then potentially breaks truncate up because,
> conversely, truncate up requires that any blocks already allocated
> beyond EOF needs to be zeroed before the inode size is changed so
> stale data is not exposed between the old EOF and the new EOF....
> 
> Yes, this can be fixed by changing the order of the punch vs the
> inode size change based on what type of operation is actually going
> to be performed, but this is just an example of the complexity
> problem "soft updates" bring to otherwise "simple" operations. I
> haven't even mentioned freeing indirect blocks also updates free
> space, inode i_blocks counters, potentially multiple indirect
> blocks, etc. If the order is wrong, then all bets are off for what
> fsck will actually do with the inode when it scans it and finds
> partially complete state. And even if it gets corrected, there's no
> guarantee taht the entire operation was completed from the
> perspective of the user.
> 
> Rename is another operation that has specific "operation has atomic
> behaviour" expectations. I haven't looked at how you've
> implementated that yet, but I suspect it also is extremely difficult
> to implement in an atomic manner using direct pmem updates to the
> directory structures.

There is a small window when renamed inode is neither in source nor in 
target directory. Fsck will reclaim such inode and add it to lost+found - 
just like on EXT2.

> AFAICS, it is easy to get metadata update ordering wrong, and
> without a formal proof that every single write to storage is
> correctly ordered I can't see how this model can be validated and
> reviewed. It is made exceedingly complex by direct storage access.
> instead of doing all the changes to a single block in memory and
> then only having to order block writes to stable storage correctly
> (as per the BSD filesystem that used soft updates), every direct
> pmem modification to an object needs to be ordered correctly against
> all other direct pmem modifications, both within the object and
> across objects.
> 
> And this brings me back to modification atomicity - soft update
> ordering across objects is hard enough, but how do you perform
> dependent modifications to a single object atomically?
> 
> e.g. Looking at nvfs_setattr, why is it correct to write timestamp
> changes to pmem before the truncate is done?
> 
> And if that is correct behavour, then why is it correct to
> change UID/GID _before_ updating the timestamp?
> 
> And if that is also correct behaviour, then why are mode changes
> done _after both_ uid/gid and timestamp changes? What happens if
> setattr is asked to change ATTR_UID|ATTR_GID|ATTR_MODE as an atomic
> change and we crash before the mode has been changed?

There are syscalls chmod,chown,chgrp that update only one of the flag. I 
don't know if something updates all of them.

Even on journaling filesystems, these updates are cached in memory and the 
journal is flushed asynchronously at some time later. Unless the user 
calls fsync(), it is really unspecified when will the changed attributes 
hit the stable storage.

> I think the only right answer can be "->setattr changes are atomic
> at the stable storage level", in which case the way NVFS is updating
> stable storage breaking all sorts of assumptions and expectations
> for changing security and ownership attributes of the inode.
> 
> And to bring this right back to "fsck is fast so we don't need
> journalling": how would running fsck before mounting detect that
> these compound object updates were not fully completed? How does
> running fsck after the crash guarantee compound object modifications
> it knows nothing about are executed atomically?
> 
> That, too me, looks like a fundamental, unfixable flaw in this
> approach...
> 
> I can see how "almost in place" modification can be done by having
> two copies side by side and updating one while the other is the
> active copy and switching atomically between the two objects. That

NVFS uses this "two copies" mechanism to deal with directories. The inode 
has an entry "__le64 i_radix[2][NVFS_RADIX_BLOCKS];" and a flag 
"NVFS_FLAG_SECOND_DIR_BANK" - when the flag is clear, i_radix[0] is valid, 
when the flag is set, i_radix[1] is valid.

Obviously, it is possible to use this method for updating UID, GID and 
MODE as well, but I am not convinced that they need to be updated 
atomically.

Regarding the timestamps - AFAIK POSIX allow them to be updated lazily.

> way a traditional soft-update algorithm would work because the
> exposure of the changes is via ordering the active copy switches.
> That would come at a cost, though, both in metadata footprint and
> CPU overhead.
> 
> So, what have I missed about the way metadata is updated in the pmem
> that allows non-atomic updates to work reliably?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

If you think that the lack of journaling is show-stopper, I can implement 
it. But then, I'll have something that has complexity of EXT4 and 
performance of EXT4. So that there will no longer be any reason why to use 
NVFS over EXT4. Without journaling, it will be faster than EXT4 and it may 
attract some users who want good performance and who don't care about GID 
and UID being updated atomically, etc.

Another possibility would be to implement only light journaling that deals 
with rename and setattr - that wouldn't harm because these operations are 
not performance-critical - and leave the rest of the code as is - and 
carefully review it for pmem_wmb() placement. (could you list other 
operations that must be done atomically?)

Mikulas
Matthew Wilcox Sept. 22, 2020, 5:25 p.m. UTC | #9
On Tue, Sep 22, 2020 at 12:46:05PM -0400, Mikulas Patocka wrote:
> I agree that the b+tree were a good choice for XFS.
> 
> In RAM-based maps, red-black trees or avl trees are used often. In 
> disk-based maps, btrees or b+trees are used. That's because in RAM, you 
> are optimizing for the number of cache lines accessed, and on the disk, 
> you are optimizing for the number of blocks accessed.

That's because software hasn't yet caught up to modern hardware.  An
in-memory btree outperforms an rbtree for several reasons:

 - rbtrees are taller; the average height of a b-tree with even just
   8 pointers per level is about half of an rbtree.
 - Not all cachelines are created equal.  The subsequent cacheline
   of this cacheline is probably already in cache.  If not, it's on its
   way and will be here shortly.  So a forward scan of this node will be
   quicker than finding another level of tree.  Our experiments have found
   a performance improvement with 256-byte B-tree nodes over an rbtree.
 - btrees are (or can be) RCU-safe.  It's very hard to make an rbtree
   RCU safe.  You can do a seqlock to get most of the benefit, but btrees
   let you allocate a new node and copy into it, while rbtrees embed the
   node in the object.

The downside, of course, is that b-trees require external allocations
while rbtrees embed the node in the object.  So you may need to do
more allocations.  But filesystems love doing additional allocations;
they make journalling so much easier.

> BTW. How does XFS "predict" the file size? - so that it allocates extent 
> of proper size without knowing how big the file will be?

XFS does delayed allocation.  The page cache absorbs the writes and then
at writeback time, XFS chooses where on storage to put it.  Some things
break this like O_SYNC and, er, DAX, but it's very effective.

> > The NVFS indirect block tree has a fan-out of 16,
> 
> No. The top level in the inode contains 16 blocks (11 direct and 5 
> indirect). And each indirect block can have 512 pointers (4096/8). You can 
> format the device with larger block size and this increases the fanout 
> (the NVFS block size must be greater or equal than the system page size).
> 
> 2 levels can map 1GiB (4096*512^2), 3 levels can map 512 GiB, 4 levels can 
> map 256 TiB and 5 levels can map 128 PiB.

But compare to an unfragmented file ... you can map the entire thing with
a single entry.  Even if you have to use a leaf node, you can get four
extents in a single cacheline (and that's a fairly naive leaf node layout;
I don't know exactly what XFS uses)

> > Rename is another operation that has specific "operation has atomic
> > behaviour" expectations. I haven't looked at how you've
> > implementated that yet, but I suspect it also is extremely difficult
> > to implement in an atomic manner using direct pmem updates to the
> > directory structures.
> 
> There is a small window when renamed inode is neither in source nor in 
> target directory. Fsck will reclaim such inode and add it to lost+found - 
> just like on EXT2.

... ouch.  If you have to choose, it'd be better to link it to the second
directory then unlink it from the first one.  Then your fsck can detect
it has the wrong count and fix up the count (ie link it into both
directories rather than neither).

> If you think that the lack of journaling is show-stopper, I can implement 
> it. But then, I'll have something that has complexity of EXT4 and 
> performance of EXT4. So that there will no longer be any reason why to use 
> NVFS over EXT4. Without journaling, it will be faster than EXT4 and it may 
> attract some users who want good performance and who don't care about GID 
> and UID being updated atomically, etc.

Well, what's your intent with nvfs?  Do you already have customers in mind
who want to use this in production, or is this somewhere to play with and
develop concepts that might make it into one of the longer-established
filesystems?
Dave Chinner Sept. 23, 2020, 2:45 a.m. UTC | #10
On Tue, Sep 22, 2020 at 12:46:05PM -0400, Mikulas Patocka wrote:
> Thanks for reviewing NVFS.

Not a review - I've just had a cursory look and not looked any
deeper after I'd noticed various red flags...

> On Tue, 22 Sep 2020, Dave Chinner wrote:
> > IOWs, extent based trees were chosen because of scalability,
> > efficiency, and flexibility reasons before the actual tree structure
> > that it would be implemented with was decided on.  b+trees were used
> > in the implementation because one tree implementation could do
> > everything as all that needed to change btree trees was the pointer
> > and record format.
> 
> I agree that the b+tree were a good choice for XFS.
> 
> In RAM-based maps, red-black trees or avl trees are used often. In 
> disk-based maps, btrees or b+trees are used. That's because in RAM, you 
> are optimizing for the number of cache lines accessed, and on the disk, 
> you are optimizing for the number of blocks accessed.

https://lore.kernel.org/linux-fsdevel/20190416122240.GN29573@dread.disaster.area/

"FWIW, I'm not convinced about the scalability of the rb/interval
tree, to tell you the truth. We got rid of the rbtree in XFS for
cache indexing because the multi-level pointer chasing was just too
expensive to do under a spinlock - it's just not a cache efficient
structure for random index object storage."

All the work I've done since has reinforced this - small node
RCU-aware btrees (4 cachelines per node) scale much, much better
than rbtrees, and they can be made lockless, too.

One of the reasons that btrees are more efficient in memory is the
behaviour of modern CPUs and their hardware prefetchers. It is
actually more time efficient to do a linear search of a small node
and then move to another small node than it is to do a binary search
of a large node in memory.  The CPU usage trade-off between linear
search overhead and chasing another pointer is currently somewhere
between 4 and 8 cachelines or pointers/records in a node on modern
x86-64 CPUs.

SO, yeah, btrees are actually very efficient for in-memory indexes
for the same reasons they are efficient for on-disk structures -
they pack more information per node than a binary structure, and
it's faster to search within a node than is to fetch another node...

> > The result of this is that we have made -zero- changes to the XFS
> > structure and algorithms for SSDs. We don't do different things
> > based on the blkdev rotational flag, or anything like that. XFS
> > behaves exactly the same on spinning disks as it does SSDs as it
> > does PMEM and it performs well on all of them. And that performance
> > doesn't drop away as you increase the scale and capability of the
> > underlying storage.
> > 
> > That's what happens when storage algorithms are designed for
> > concurrency and efficiency at scale rather than optimising for a
> > specific storage characteristic.
> > 
> > NVFS is optimised for a specific storage characteristic (i.e. low
> > latency synchronous storage), so I would absolutely expect it to be
> > faster than XFS on that specific storage. However, claims like this:
> > 
> > > On persistent memory, each access has its own cost, so NVFS uses metadata 
> > > structures that minimize the number of cache lines accessed (rather than 
> > > the number of blocks accessed). For block mapping, NVFS uses the classic 
> > > unix dierct/indirect blocks - if a file block is mapped by a 3-rd level 
> > > indirect block, we do just three memory accesses and we are done. If we 
> > > used b+trees, the number of accesses would be much larger than 3 (we would 
> > > have to do binary search in the b+tree nodes).
> > 
> > ... are kinda naive, because you're clearly optimising the wrong
> > aspect of block mapping. Extents solve the block indexing overhead
> > problem; optimising the type of tree you use to index the indirect
> > blocks doesn't avoid the overhead of having to iterate every block
> > for range operations.
> > 
> > IOWs, we use extents because they are space and time efficient for
> > the general use cases. XFS can map 2^21 blocks into a single 16 byte
> > extent record (8GiB file mapping for 4k block size) and so the vast
> > majority of files in a filesystem are mapped with a single extent.
> 
> BTW. How does XFS "predict" the file size? - so that it allocates extent 
> of proper size without knowing how big the file will be?

Oh, there's probably 10-15,000 lines of code involved in getting
that right. There's delayed allocation, speculative preallocation,
extent size hints, about 10 distinct allocation policies including
"allocate exactly at this block or fail" that allow complex
poilicies with multiple fallback conditions to select the best
possible allocation for the given state, there's locality separation
that tries to keep individual workloads in different large
contiguous free spaces, etc.

> > The NVFS indirect block tree has a fan-out of 16,
> 
> No. The top level in the inode contains 16 blocks (11 direct and 5 
> indirect). And each indirect block can have 512 pointers (4096/8). You can 
> format the device with larger block size and this increases the fanout 
> (the NVFS block size must be greater or equal than the system page size).
> 
> 2 levels can map 1GiB (4096*512^2), 3 levels can map 512 GiB, 4 levels can 
> map 256 TiB and 5 levels can map 128 PiB.

Ok, so that's not clear from the docco or the code. But it just
means the fanout is the same as a btree block in a 4kB block size
filesystem. It doesn't really change anything....

> > mapping 2^21 blocks requires a 5 level indirect tree. Which one if going 
> > to be faster to truncate away - a single record or 2 million individual 
> > blocks?
> > 
> > IOWs, we can take afford to take an extra cacheline miss or two on a
> > tree block search, because we're accessing and managing orders of
> > magnitude fewer records in the mapping tree than an indirect block
> > tree.
> > 
> > PMEM doesn't change this: extents are more time and space efficient
> > at scale for mapping trees than indirect block trees regardless
> > of the storage medium in use.
> 
> PMEM doesn't have to be read linearly, so the attempts to allocate large 
> linear space are not needed. They won't harm but they won't help either.

I beg to differ. If the application wants to map huge pages (2MB or
1GB) because tehy get a major improvement in performance by reducing
TLB thrashing, then the filesystem needs to be able to allocate
large contiguous ranges and to be able to do it reliably and
repeatedly.

This is one of the things that DAX can do already (i.e. support huge
pages for mmap() file data) that the page cache can't do. You can
use this with XFS e.g. via extent size hints, using the RT device
with a fixed 2MB block size, etc. This really is a major feature
that users want in any DAX capable filesystem....

> That's why NVFS has very simple block allocation alrogithm - it uses a 
> per-cpu pointer and tries to allocate by a bit scan from this pointer. If 
> the group is full, it tries a random group with above-average number of 
> free blocks.
> 
> EXT4 uses bit scan for allocations and people haven't complained that it's 
> inefficient, so it is probably OK.

You haven't been paying attention :)

The ext4 bitmap allocator algorithms fall in a hole as soon as all
the block group bitmaps get space allocated in them. Then the
allocator has to start searching block groups for the "best free
space" to allocate out of and that has substantial overhead. The
larger the filesystem, the bigger the hole it can fall into....

> > > The same for directories - NVFS hashes the file name and uses radix-tree 
> > > to locate a directory page where the directory entry is located. XFS 
> > > b+trees would result in much more accesses than the radix-tree.
> > 
> > That's like me saying that XFS hashes the file name and uses a btree
> > to index the directory block where the dirent is located, so it will
> > be faster than a radix tree.
> > 
> > It's completely bogus.
> > 
> > It ignores the fact that both filesysetms use the same hash based
> > lookup indexing algorithms and use O(log N) trees for the name hash.
> > IOWs, the only real difference is the fan-out and depths of the
> > tree.
> 
> NVFS has fanout of 512 pointers per block.

Sure, but the difference is still only "the fan-out and depths of
the tree". The directory structure algorithms are still the same.

> > IOWs, difference in memory accesses between a radix tree and btree
> > for this algorithm is largely irrelevant, and even your tests
> > indicate that.  The modification tests show that metadata lookup
> > *and journalling* overhead is not really that significant as the
> > number of directory entries increase:
> > 
> > dir-test /mnt/test/linux-2.6 63000 1048576
> > nvfs		6.6s
> > ext4 dax	8.4s
> > xfs dax		12.2s
> > 
> > 
> > dir-test /mnt/test/linux-2.6 63000 1048576 link
> > nvfs		4.7s
> > ext4 dax	5.6s
> > xfs dax		7.8s
> > 
> > dir-test /mnt/test/linux-2.6 63000 1048576 dir
> > nvfs		8.2s
> > ext4 dax	15.1s
> > xfs dax		11.8s
> > 
> > Yes, nvfs is faster than both ext4 and XFS on DAX, but it's  not a
> > huge difference - it's not orders of magnitude faster.
> 
> If I increase the size of the test directory, NVFS is order of magnitude 
> faster:
> 
> time dir-test /mnt/test/ 2000000 2000000
> NVFS: 0m29,395s
> XFS:  1m59,523s
> EXT4: 1m14,176s

What happened to NVFS there? The runtime went up by a factor of 5,
even though the number of ops performed only doubled.

> time dir-test /mnt/test/ 8000000 8000000
> NVFS: 2m13,507s
> XFS: 14m31,261s
> EXT4: reports "file 1976882 can't be created: No space left on device", 
> 	(although there are free blocks and inodes)
> 	Is it a bug or expected behavior?

Exponential increase in runtime for a workload like this indicates
the XFS journal is too small to run large scale operations. I'm
guessing you're just testing on a small device?

That guess is backed up by the ext4 error  - it indicates a
relatively small device/filesystem is being used because it ran out
of inodes.  2 million inodes would indicate a 32GB filesystem with
mkfs.ext4 defaults, IIRC.

In which case, you'd get a 16MB log for XFS, which is tiny and most
definitely will limit performance of any large scale metadta
operation. Performance should improve significantly for large scale
operations with a much larger log, and that should bring the XFS
runtimes down significantly.

However, I suspect that the journalling overhead of randomly modifying
a gigabyte of directory metadata is going to be the dominating
performance factor here for both XFS and ext4. As I said, I expect
that something like NVFS should -slaughter- both ext4 and XFS on
workloads like this....

> > If XFS and ext4 directory structures were substantially less
> > efficient than the NVFS structure, then NVFS should absolutely
> > -slaughter- ext4 and XFS in directory modification intensive
> > microbenchmarks like this. Especially considering that difference in
> > performance also includes the journalling overhead.
> > 
> > IOWs, the differences in performance are not a result of the
> > directory structures or the number of memory fetches their indexing
> > algorithms require to do the work - the differences come from
> > structural features: ext4 and XFS have more work to do per
> > directory operation thanks to their metadata buffer and journalling
> > management requirements.
> > 
> > Also, keep in mind that much of the complexity in the XFS directory
> > structure doesn't make XFS go faster at small directory sizes. They
> > actually slow it down at small sizes, but they also stop performance
> > from falling off a cliff at scale. Hence results might be quite
> > different if you are running with millions of directory entries in
> > the directories rather that a few thousand....
> 
> See above - the ratio between NVFS and XFS grows as we increase directory 
> size.

Yup, journal size limitations will do that :/

> > Take, for example, truncate. If you punch out the space on storage
> > before you change the inode size on storage and then crash
> > in-between the punch and the inode size reduction, the user file is
> > left with a bunch of zeros in it over the range between the new EOF
> > and the old EOF. Users will see partial completion state.
> > 
> > IOWs, the NVFS truncate operation as implemented:
> > 
> >         if (attr->ia_valid & ATTR_SIZE) {
> >                 WARN_ON(!inode_is_locked(&nmi->vfs_inode));
> >                 if (attr->ia_size != nmi->vfs_inode.i_size) {
> >                         r = nvfs_punch_hole(nmi, attr->ia_size, LLONG_MAX - ((1UL << nvs->log2_page_size) - 1));
> >                         if (unlikely(r))
> >                                 return r;
> >                         nvfs_set_inode_size(nmi, attr->ia_size);
> >                 }
> >         }
> > 
> > is not atomic from a user crash recovery perspective as it exposes
> > partially complete state to the user. For it to be atomic from the
> > user perspective, on truncate down the inode size must be changed on
> > disk first, then the space beyond the new EOF needs to get punched
> > out. hence if we crash while the punching is occurring, users will
> > not see that after remount because the inconsistent state is beyond
> > the range they can access. IOWs, they see the file as if the
> > truncate down is fully complete, regardless of whether it is
> > actually complete or not.
> 
> You are right - these operations should be swapped.
> 
> BTW. XFS also had (or still has?) a misfeature that it was leaving 
> zero-filed files after a crash.

That's entirely irrelevant to NVFS and this discussion.

Mikulas, I prefixed my previous comments with "you gain nothing by
putting other filesytsems down to try to make NVFS look good".  You
haven't gained anything by making this comment - it lost you lost a
lot of my goodwill, though...

FYI, that XFS bug was fixed in -2006-.

> > Rename is another operation that has specific "operation has atomic
> > behaviour" expectations. I haven't looked at how you've
> > implementated that yet, but I suspect it also is extremely difficult
> > to implement in an atomic manner using direct pmem updates to the
> > directory structures.
> 
> There is a small window when renamed inode is neither in source nor in 
> target directory. Fsck will reclaim such inode and add it to lost+found - 
> just like on EXT2.

That might have been good enough for the mid 1990s (it wasn't, given
that's when XFS was designed) but it's not good enough for a new
filesystem in 2020.

> > AFAICS, it is easy to get metadata update ordering wrong, and
> > without a formal proof that every single write to storage is
> > correctly ordered I can't see how this model can be validated and
> > reviewed. It is made exceedingly complex by direct storage access.
> > instead of doing all the changes to a single block in memory and
> > then only having to order block writes to stable storage correctly
> > (as per the BSD filesystem that used soft updates), every direct
> > pmem modification to an object needs to be ordered correctly against
> > all other direct pmem modifications, both within the object and
> > across objects.
> > 
> > And this brings me back to modification atomicity - soft update
> > ordering across objects is hard enough, but how do you perform
> > dependent modifications to a single object atomically?
> > 
> > e.g. Looking at nvfs_setattr, why is it correct to write timestamp
> > changes to pmem before the truncate is done?
> > 
> > And if that is correct behavour, then why is it correct to
> > change UID/GID _before_ updating the timestamp?
> > 
> > And if that is also correct behaviour, then why are mode changes
> > done _after both_ uid/gid and timestamp changes? What happens if
> > setattr is asked to change ATTR_UID|ATTR_GID|ATTR_MODE as an atomic
> > change and we crash before the mode has been changed?
> 
> There are syscalls chmod,chown,chgrp that update only one of the flag. I 
> don't know if something updates all of them.

You've missed the point completely.

My point is that anything -external- to the filesystem could build
an arbitrary set of flags and will rightfully expect them to be
changed -atomically- by the filesystem. NVFS does not apply them
atomically to stable storage, so it would appear to violate the
assumption that callers of ->setattr make.

Further, the set of flags or the combinations can change from kernel
to kernel. Hence if you order the operations according to what you
see syscalls currently do, then any future change to the way
->setattr is called can break your integrity model. You might not
even be aware of these changes because they are in compeltely
different parts of the kernel.

IOws, you can't say "this order of non-atomic changes is correct"
because a) you can't control the combinations of changes sent to the
filesystem, and b) you do not know what future changes might occur
to the VFS API that may completely invalidate your inode update
ordering assumptions. IOWs, your integrity model cannot rely on the
external filesystem APIs always behaving as they do in the current
kernel.

> Even on journaling filesystems, these updates are cached in memory and the 
> journal is flushed asynchronously at some time later. Unless the user 
> calls fsync(), it is really unspecified when will the changed attributes 
> hit the stable storage.

That's also missing the point. When journalling, the change is
atomic in memory until the journal is flushed, and then it is
-atomic on stable storage- because the entire set of changes are
contained within a single atomic transaction unit stored in said
stable storage. At no point in time does the modification appear to
either the user or in stable storage in a partial or incomplete
state.

The problem I'm pointing out is that NVFS is not making these
multiple changes in a single atomic operation - it is making them in
multiple disjoint operations and hence creating conditions where
partially modified state can be exposed if the system crashes.  This
state my be undetectable from a filesystem consistency point of
view, but from a system/application point of view it might well be
very incorrect...

> > I think the only right answer can be "->setattr changes are atomic
> > at the stable storage level", in which case the way NVFS is updating
> > stable storage breaking all sorts of assumptions and expectations
> > for changing security and ownership attributes of the inode.
> > 
> > And to bring this right back to "fsck is fast so we don't need
> > journalling": how would running fsck before mounting detect that
> > these compound object updates were not fully completed? How does
> > running fsck after the crash guarantee compound object modifications
> > it knows nothing about are executed atomically?
> > 
> > That, too me, looks like a fundamental, unfixable flaw in this
> > approach...
> > 
> > I can see how "almost in place" modification can be done by having
> > two copies side by side and updating one while the other is the
> > active copy and switching atomically between the two objects. That
> 
> NVFS uses this "two copies" mechanism to deal with directories. The inode 
> has an entry "__le64 i_radix[2][NVFS_RADIX_BLOCKS];" and a flag 
> "NVFS_FLAG_SECOND_DIR_BANK" - when the flag is clear, i_radix[0] is valid, 
> when the flag is set, i_radix[1] is valid.

Great.

But nobody knows this because it's not in your documentation, and
there isn't a single comment in the entire directory implementation
that explains how it works. IOWs, the NVFS code is -unreviewable-
because nobody knows anything about the design it is supposed to be
implementing and the implementation itself is entirely
undocumented...

FWIW, if you duplicated the whole inode instead of just the indirect
block tree, then you could do the "bank switch" at the inode level.
Then all inode updates could be made atomic, along with the indirect
block mapping updates, etc.

> Obviously, it is possible to use this method for updating UID, GID and 
> MODE as well, but I am not convinced that they need to be updated 
> atomically.
> Regarding the timestamps - AFAIK POSIX allow them to be updated lazily.

Again, focusing on a specific exmaple misses the entire point I was
making - inode updates need to be atomic and ordered against things
that the inode update may be dependent on.

> If you think that the lack of journaling is show-stopper, I can implement 
> it.

I did not say that. My comments are about the requirement for
atomicity of object changes, not journalling. Journalling is an
-implementation that can provide change atomicity-, it is not a
design constraint for metadata modification algorithms.

Really, you can chose how to do object update however you want. What
I want to review is the design documentation and a correctness proof
for whatever mechanism you choose to use. Without that information,
we have absolutely no chance of reviewing the filesystem
implementation for correctness. We don't need a proof for something
that uses journalling (because we all know how that works), but for
something that uses soft updates we most definitely need the proof
of correctness for the update algorithm before we can determine if
the implementation is good...

Cheers,

Dave.
Mikulas Patocka Sept. 23, 2020, 9:20 a.m. UTC | #11
Hi

There seems to be a bug in ext4 - when I create very large directory, ext4 
fails with -ENOSPC despite the fact that there is plenty of free space and 
free inodes on the filesystem.

How to reproduce:
download the program dir-test: 
http://people.redhat.com/~mpatocka/benchmarks/dir-test.c

# modprobe brd rd_size=67108864
# mkfs.ext4 /dev/ram0
# mount -t ext4 /dev/ram0 /mnt/test
# dir-test /mnt/test/ 8000000 8000000
deleting: 7999000
2540000
file 2515327 can't be created: No space left on device
# df /mnt/test
/dev/ram0        65531436 633752 61525860   2% /mnt/test
# df -i /mnt/test
/dev/ram0        4194304 1881547 2312757   45% /mnt/test

(I tried to increase journal size, but it has no effect on this bug)

Mikulas
Jan Kara Sept. 23, 2020, 9:44 a.m. UTC | #12
Hi!

On Wed 23-09-20 05:20:55, Mikulas Patocka wrote:
> There seems to be a bug in ext4 - when I create very large directory, ext4 
> fails with -ENOSPC despite the fact that there is plenty of free space and 
> free inodes on the filesystem.
> 
> How to reproduce:
> download the program dir-test: 
> http://people.redhat.com/~mpatocka/benchmarks/dir-test.c
> 
> # modprobe brd rd_size=67108864
> # mkfs.ext4 /dev/ram0
> # mount -t ext4 /dev/ram0 /mnt/test
> # dir-test /mnt/test/ 8000000 8000000
> deleting: 7999000
> 2540000
> file 2515327 can't be created: No space left on device
> # df /mnt/test
> /dev/ram0        65531436 633752 61525860   2% /mnt/test
> # df -i /mnt/test
> /dev/ram0        4194304 1881547 2312757   45% /mnt/test

Yeah, you likely run out of space in ext4 directory h-tree. You can enable
higher depth h-trees with large_dir feature (mkfs.ext4 -O large_dir). Does
that help?

								Honza
Jan Kara Sept. 23, 2020, 9:57 a.m. UTC | #13
On Tue 22-09-20 12:46:05, Mikulas Patocka wrote:
> > mapping 2^21 blocks requires a 5 level indirect tree. Which one if going 
> > to be faster to truncate away - a single record or 2 million individual 
> > blocks?
> > 
> > IOWs, we can take afford to take an extra cacheline miss or two on a
> > tree block search, because we're accessing and managing orders of
> > magnitude fewer records in the mapping tree than an indirect block
> > tree.
> > 
> > PMEM doesn't change this: extents are more time and space efficient
> > at scale for mapping trees than indirect block trees regardless
> > of the storage medium in use.
> 
> PMEM doesn't have to be read linearly, so the attempts to allocate large 
> linear space are not needed. They won't harm but they won't help either.
> 
> That's why NVFS has very simple block allocation alrogithm - it uses a 
> per-cpu pointer and tries to allocate by a bit scan from this pointer. If 
> the group is full, it tries a random group with above-average number of 
> free blocks.

I agree with Dave here. People are interested in 2MB or 1GB contiguous
allocations for DAX so that files can be mapped at PMD or event PUD levels
thus saving a lot of CPU time on page faults and TLB.

> EXT4 uses bit scan for allocations and people haven't complained that it's 
> inefficient, so it is probably OK.

Yes, it is more or less OK but once you get to 1TB filesystem size and
larger, the number of block groups grows enough that it isn't that great
anymore. We are actually considering new allocation schemes for ext4 for
this large filesystems...

> If you think that the lack of journaling is show-stopper, I can implement 
> it. But then, I'll have something that has complexity of EXT4 and 
> performance of EXT4. So that there will no longer be any reason why to use 
> NVFS over EXT4. Without journaling, it will be faster than EXT4 and it may 
> attract some users who want good performance and who don't care about GID 
> and UID being updated atomically, etc.

I'd hope that your filesystem offers more performance benefits than just
what you can get from a lack of journalling :). ext4 can be configured to
run without a journal as well - mkfs.ext4 -O ^has_journal. And yes, it does
significantly improve performance for some workloads but you have to have
some way to recover from crashes so it's mostly used for scratch
filesystems (e.g. in build systems, Google uses this feature a lot for some
of their infrastructure as well).

								Honza
Mikulas Patocka Sept. 23, 2020, 12:46 p.m. UTC | #14
On Wed, 23 Sep 2020, Jan Kara wrote:

> Hi!
> 
> On Wed 23-09-20 05:20:55, Mikulas Patocka wrote:
> > There seems to be a bug in ext4 - when I create very large directory, ext4 
> > fails with -ENOSPC despite the fact that there is plenty of free space and 
> > free inodes on the filesystem.
> > 
> > How to reproduce:
> > download the program dir-test: 
> > http://people.redhat.com/~mpatocka/benchmarks/dir-test.c
> > 
> > # modprobe brd rd_size=67108864
> > # mkfs.ext4 /dev/ram0
> > # mount -t ext4 /dev/ram0 /mnt/test
> > # dir-test /mnt/test/ 8000000 8000000
> > deleting: 7999000
> > 2540000
> > file 2515327 can't be created: No space left on device
> > # df /mnt/test
> > /dev/ram0        65531436 633752 61525860   2% /mnt/test
> > # df -i /mnt/test
> > /dev/ram0        4194304 1881547 2312757   45% /mnt/test
> 
> Yeah, you likely run out of space in ext4 directory h-tree. You can enable
> higher depth h-trees with large_dir feature (mkfs.ext4 -O large_dir). Does
> that help?

Yes, this helps.

Mikulas

> 
> 								Honza
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Mikulas Patocka Sept. 23, 2020, 1:11 p.m. UTC | #15
On Wed, 23 Sep 2020, Jan Kara wrote:

> On Tue 22-09-20 12:46:05, Mikulas Patocka wrote:
> > > mapping 2^21 blocks requires a 5 level indirect tree. Which one if going 
> > > to be faster to truncate away - a single record or 2 million individual 
> > > blocks?
> > > 
> > > IOWs, we can take afford to take an extra cacheline miss or two on a
> > > tree block search, because we're accessing and managing orders of
> > > magnitude fewer records in the mapping tree than an indirect block
> > > tree.
> > > 
> > > PMEM doesn't change this: extents are more time and space efficient
> > > at scale for mapping trees than indirect block trees regardless
> > > of the storage medium in use.
> > 
> > PMEM doesn't have to be read linearly, so the attempts to allocate large 
> > linear space are not needed. They won't harm but they won't help either.
> > 
> > That's why NVFS has very simple block allocation alrogithm - it uses a 
> > per-cpu pointer and tries to allocate by a bit scan from this pointer. If 
> > the group is full, it tries a random group with above-average number of 
> > free blocks.
> 
> I agree with Dave here. People are interested in 2MB or 1GB contiguous
> allocations for DAX so that files can be mapped at PMD or event PUD levels
> thus saving a lot of CPU time on page faults and TLB.

NVFS has upper limit on block size 1MB. So, should raise it to 2MB? Will 
2MB blocks be useful to someone?

Is there some API how userspace can ask the kernel for aligned allocation? 
fallocate() doesn't seem to offer an option for alignment.

> > EXT4 uses bit scan for allocations and people haven't complained that it's 
> > inefficient, so it is probably OK.
> 
> Yes, it is more or less OK but once you get to 1TB filesystem size and
> larger, the number of block groups grows enough that it isn't that great
> anymore. We are actually considering new allocation schemes for ext4 for
> this large filesystems...

NVFS can run with block size larger than page size, so you can reduce the 
number of block groups by increasing block size.

(ext4 also has bigalloc feature that will do it)

> > If you think that the lack of journaling is show-stopper, I can implement 
> > it. But then, I'll have something that has complexity of EXT4 and 
> > performance of EXT4. So that there will no longer be any reason why to use 
> > NVFS over EXT4. Without journaling, it will be faster than EXT4 and it may 
> > attract some users who want good performance and who don't care about GID 
> > and UID being updated atomically, etc.
> 
> I'd hope that your filesystem offers more performance benefits than just
> what you can get from a lack of journalling :). ext4 can be configured to

I also don't know how to implement journling on persistent memory :) On 
EXT4 or XFS you can pin dirty buffers in memory until the journal is 
flushed. This is obviously impossible on persistent memory. So, I'm 
considering implementing only some lightweight journaling that will 
guarantee atomicity between just a few writes.

> run without a journal as well - mkfs.ext4 -O ^has_journal. And yes, it does
> significantly improve performance for some workloads but you have to have
> some way to recover from crashes so it's mostly used for scratch
> filesystems (e.g. in build systems, Google uses this feature a lot for some
> of their infrastructure as well).
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

I've run "dir-test /mnt/test/ 8000000 8000000" and the result is:
EXT4 with journal	- 5m54,019s
EXT4 without journal	- 4m4,444s
NVFS			- 2m9,482s

Mikulas
Matthew Wilcox Sept. 23, 2020, 3:04 p.m. UTC | #16
On Wed, Sep 23, 2020 at 09:11:43AM -0400, Mikulas Patocka wrote:
> I also don't know how to implement journling on persistent memory :) On 
> EXT4 or XFS you can pin dirty buffers in memory until the journal is 
> flushed. This is obviously impossible on persistent memory. So, I'm 
> considering implementing only some lightweight journaling that will 
> guarantee atomicity between just a few writes.

That's a bit disappointing considering people have been publishing
papers on how to do umpteen different variations on persistent memory
journalling for the last five years.

https://www.google.com/search?q=intel+persistent+memory+atomic+updates

for example
Mikulas Patocka Sept. 23, 2020, 5:19 p.m. UTC | #17
On Wed, 23 Sep 2020, Dave Chinner wrote:

> > > dir-test /mnt/test/linux-2.6 63000 1048576
> > > nvfs		6.6s
> > > ext4 dax	8.4s
> > > xfs dax		12.2s
> > > 
> > > 
> > > dir-test /mnt/test/linux-2.6 63000 1048576 link
> > > nvfs		4.7s
> > > ext4 dax	5.6s
> > > xfs dax		7.8s
> > > 
> > > dir-test /mnt/test/linux-2.6 63000 1048576 dir
> > > nvfs		8.2s
> > > ext4 dax	15.1s
> > > xfs dax		11.8s
> > > 
> > > Yes, nvfs is faster than both ext4 and XFS on DAX, but it's  not a
> > > huge difference - it's not orders of magnitude faster.
> > 
> > If I increase the size of the test directory, NVFS is order of magnitude 
> > faster:
> > 
> > time dir-test /mnt/test/ 2000000 2000000
> > NVFS: 0m29,395s
> > XFS:  1m59,523s
> > EXT4: 1m14,176s
> 
> What happened to NVFS there? The runtime went up by a factor of 5,
> even though the number of ops performed only doubled.

This test is from a different machine (K10 Opteron) than the above test 
(Skylake Xeon). I borrowed the Xeon for a short time and I no longer have 
access to it.

> > time dir-test /mnt/test/ 8000000 8000000
> > NVFS: 2m13,507s
> > XFS: 14m31,261s
> > EXT4: reports "file 1976882 can't be created: No space left on device", 
> > 	(although there are free blocks and inodes)
> > 	Is it a bug or expected behavior?
> 
> Exponential increase in runtime for a workload like this indicates
> the XFS journal is too small to run large scale operations. I'm
> guessing you're just testing on a small device?

In this test, the pmem device had 64GiB.

I've created 1TiB ramdisk, formatted it with XFS and ran dir-test 8000000 
on it, however it wasn't much better - it took 14m8,824s.

> In which case, you'd get a 16MB log for XFS, which is tiny and most
> definitely will limit performance of any large scale metadta
> operation. Performance should improve significantly for large scale
> operations with a much larger log, and that should bring the XFS
> runtimes down significantly.

Is there some mkfs.xfs option that can increase log size?

> > If you think that the lack of journaling is show-stopper, I can implement 
> > it.
> 
> I did not say that. My comments are about the requirement for
> atomicity of object changes, not journalling. Journalling is an
> -implementation that can provide change atomicity-, it is not a
> design constraint for metadata modification algorithms.
> 
> Really, you can chose how to do object update however you want. What
> I want to review is the design documentation and a correctness proof
> for whatever mechanism you choose to use. Without that information,
> we have absolutely no chance of reviewing the filesystem
> implementation for correctness. We don't need a proof for something
> that uses journalling (because we all know how that works), but for
> something that uses soft updates we most definitely need the proof
> of correctness for the update algorithm before we can determine if
> the implementation is good...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

I am thinking about this: I can implement lightweight journaling that will 
journal just a few writes - I'll allocate some small per-cpu intent log 
for that.

For example, in nvfs_rename, we call nvfs_delete_de and nvfs_finish_add - 
these functions are very simple, both of them write just one word - so we 
can add these two words to the intent log. The same for setattr requesting 
simultaneous uid/gid/mode change - they are small, so they'll fit into the 
intent log well.

Regarding verifiability, I can do this - the writes to pmem are wrapped in 
a macro nv_store. So, I can modify this macro so that it logs all 
modifications. Then I take the log, cut it at random time, reorder the 
entries (to simulate reordering in the CPU write-combining buffers), 
replay it, run nvfsck on it and mount it. This way, we can verify that no 
matter where the crash happened, either an old file or a new file is 
present in a directory.

Do you agree with that?

Mikulas
Andreas Dilger Sept. 23, 2020, 8:20 p.m. UTC | #18
[cut down CC list, since I don't think most people care about this detail]

On Sep 23, 2020, at 3:44 AM, Jan Kara <jack@suse.cz> wrote:
> 
> On Wed 23-09-20 05:20:55, Mikulas Patocka wrote:
>> There seems to be a bug in ext4 - when I create very large directory, ext4
>> fails with -ENOSPC despite the fact that there is plenty of free space and
>> free inodes on the filesystem.
>> 
>> How to reproduce:
>> download the program dir-test:
>> http://people.redhat.com/~mpatocka/benchmarks/dir-test.c
>> 
>> # modprobe brd rd_size=67108864
>> # mkfs.ext4 /dev/ram0
>> # mount -t ext4 /dev/ram0 /mnt/test
>> # dir-test /mnt/test/ 8000000 8000000
>> deleting: 7999000
>> 2540000
>> file 2515327 can't be created: No space left on device
>> # df /mnt/test
>> /dev/ram0        65531436 633752 61525860   2% /mnt/test
>> # df -i /mnt/test
>> /dev/ram0        4194304 1881547 2312757   45% /mnt/test
> 
> Yeah, you likely run out of space in ext4 directory h-tree. You can enable
> higher depth h-trees with large_dir feature (mkfs.ext4 -O large_dir). Does
> that help?

You can also enable this feature on an existing filesystem by running
"tune2fs -O large_dir /dev/sdX".  It might need to be unmounted, not sure.

Cheers, Andreas
Mikulas Patocka Sept. 24, 2020, 3 p.m. UTC | #19
On Tue, 22 Sep 2020, Matthew Wilcox wrote:

> > > The NVFS indirect block tree has a fan-out of 16,
> > 
> > No. The top level in the inode contains 16 blocks (11 direct and 5 
> > indirect). And each indirect block can have 512 pointers (4096/8). You can 
> > format the device with larger block size and this increases the fanout 
> > (the NVFS block size must be greater or equal than the system page size).
> > 
> > 2 levels can map 1GiB (4096*512^2), 3 levels can map 512 GiB, 4 levels can 
> > map 256 TiB and 5 levels can map 128 PiB.
> 
> But compare to an unfragmented file ... you can map the entire thing with
> a single entry.  Even if you have to use a leaf node, you can get four
> extents in a single cacheline (and that's a fairly naive leaf node layout;
> I don't know exactly what XFS uses)

But the benchmarks show that it is comparable to extent-based filesystems.

> > > Rename is another operation that has specific "operation has atomic
> > > behaviour" expectations. I haven't looked at how you've
> > > implementated that yet, but I suspect it also is extremely difficult
> > > to implement in an atomic manner using direct pmem updates to the
> > > directory structures.
> > 
> > There is a small window when renamed inode is neither in source nor in 
> > target directory. Fsck will reclaim such inode and add it to lost+found - 
> > just like on EXT2.
> 
> ... ouch.  If you have to choose, it'd be better to link it to the second
> directory then unlink it from the first one.  Then your fsck can detect
> it has the wrong count and fix up the count (ie link it into both
> directories rather than neither).

I admit that this is lame and I'll fix it. Rename is not so 
performance-critical, so I can add a small journal for this.

> > If you think that the lack of journaling is show-stopper, I can implement 
> > it. But then, I'll have something that has complexity of EXT4 and 
> > performance of EXT4. So that there will no longer be any reason why to use 
> > NVFS over EXT4. Without journaling, it will be faster than EXT4 and it may 
> > attract some users who want good performance and who don't care about GID 
> > and UID being updated atomically, etc.
> 
> Well, what's your intent with nvfs?  Do you already have customers in mind
> who want to use this in production, or is this somewhere to play with and
> develop concepts that might make it into one of the longer-established
> filesystems?

I develop it just because I thought it may be interesting. So far, it 
doesn't have any serious users (the physical format is still changing). I 
hope that it could be useable as a general purpose root filesystem when 
Optane DIMMs become common.

Mikulas
Mikulas Patocka Sept. 28, 2020, 3:22 p.m. UTC | #20
On Thu, 24 Sep 2020, Mikulas Patocka wrote:

> On Tue, 22 Sep 2020, Matthew Wilcox wrote:
> 
> > > There is a small window when renamed inode is neither in source nor in 
> > > target directory. Fsck will reclaim such inode and add it to lost+found - 
> > > just like on EXT2.
> > 
> > ... ouch.  If you have to choose, it'd be better to link it to the second
> > directory then unlink it from the first one.  Then your fsck can detect
> > it has the wrong count and fix up the count (ie link it into both
> > directories rather than neither).
> 
> I admit that this is lame and I'll fix it. Rename is not so 
> performance-critical, so I can add a small journal for this.

Hi

I have implmemented transactions in nvfs and I use them for rename, 
setattr, atomic xattr replacement and for RENAME_EXCHANGE.

You can download the current version here:
git://leontynka.twibright.com/nvfs.git

Mikulas
diff mbox series

Patch

Index: linux-2.6/arch/arm64/lib/uaccess_flushcache.c
===================================================================
--- linux-2.6.orig/arch/arm64/lib/uaccess_flushcache.c	2020-09-16 12:44:15.068038000 +0200
+++ linux-2.6/arch/arm64/lib/uaccess_flushcache.c	2020-09-16 12:44:15.068038000 +0200
@@ -38,3 +38,4 @@  unsigned long __copy_user_flushcache(voi
 	__clean_dcache_area_pop(to, n - rc);
 	return rc;
 }
+EXPORT_SYMBOL_GPL(__copy_user_flushcache);
Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c	2020-09-16 12:44:15.068038000 +0200
+++ linux-2.6/arch/x86/lib/usercopy_64.c	2020-09-16 12:44:15.068038000 +0200
@@ -134,6 +134,7 @@  long __copy_user_flushcache(void *dst, c
 
 	return rc;
 }
+EXPORT_SYMBOL_GPL(__copy_user_flushcache);
 
 void __memcpy_flushcache(void *_dst, const void *_src, size_t size)
 {
Index: linux-2.6/arch/powerpc/lib/pmem.c
===================================================================
--- linux-2.6.orig/arch/powerpc/lib/pmem.c	2020-09-16 12:44:15.068038000 +0200
+++ linux-2.6/arch/powerpc/lib/pmem.c	2020-09-16 12:44:15.068038000 +0200
@@ -75,6 +75,7 @@  long __copy_from_user_flushcache(void *d
 
 	return copied;
 }
+EXPORT_SYMBOL_GPL(__copy_from_user_flushcache);
 
 void memcpy_flushcache(void *dest, const void *src, size_t size)
 {