[00/10] btrfs: Support for DAX devices

Message ID 20181205122835.19290-1-rgoldwyn@suse.de (mailing list archive)

Message

Goldwyn Rodrigues Dec. 5, 2018, 12:28 p.m. UTC
This series adds support for DAX in btrfs. I understand there have been
previous attempts at it. However, I wanted to make sure copy-on-write
(COW) works on DAX as well.

Before I present this to the FS folks I wanted to run it past the
btrfs folks. Much as I would like to, I cannot get it right the first time
around :/. Here are some questions for which I need suggestions:

Questions:
1. I have been unable to do checksumming for DAX devices. While
checksumming can be done for reads and writes, it is a problem when mmap
is involved because the btrfs kernel module does not get control back
after an mmap() write. Any ideas are appreciated; otherwise we would have
to set nodatasum when dax is enabled.

2. Currently, a user can continue writing to the "old" extents of an mmapped
file after a snapshot has been created. How can we enforce writes to be
directed to new extents after snapshots have been created? Do we keep a
list of all mmap()s, and re-mmap them after a snapshot?
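
To make question 1 concrete, here is a minimal userspace sketch, not btrfs
code. It assumes plain CRC-32 from Python's zlib as a stand-in for the btrfs
data checksum and an ordinary temporary file in place of a DAX file;
stale_checksum_demo is a hypothetical name. The write() path gives the
"filesystem" a chance to record a checksum, while the store through the
shared mapping changes the data without any checksum code running.

```python
import mmap
import os
import tempfile
import zlib

PAGE = mmap.PAGESIZE


def stale_checksum_demo():
    """Return (recorded, actual) checksums after a store through a mapping."""
    fd, path = tempfile.mkstemp()
    try:
        # write() path: the data passes through code that can checksum it.
        data = b"A" * PAGE
        os.write(fd, data)
        recorded = zlib.crc32(data)

        # mmap path: a plain memory store; nothing updates `recorded`.
        with mmap.mmap(fd, PAGE, access=mmap.ACCESS_WRITE) as m:
            m[0:4] = b"BBBB"
            m.flush()

        os.lseek(fd, 0, os.SEEK_SET)
        actual = zlib.crc32(os.read(fd, PAGE))
        return recorded, actual
    finally:
        os.close(fd)
        os.unlink(path)


recorded, actual = stale_checksum_demo()
print(recorded != actual)  # True: the recorded checksum is now stale
```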

Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
command line parameter.
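
For reference, memmap=nn[KMG]!ss[KMG] marks the physical range from ss to
ss+nn as protected (persistent) memory. A small sketch of that arithmetic
follows; parse_memmap_pmem is a hypothetical helper that handles only the
suffixed decimal form used here, not everything the kernel accepts.

```python
import re

UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_memmap_pmem(arg: str):
    """Parse 'memmap=nn[KMG]!ss[KMG]' and return (start, end) in bytes."""
    m = re.fullmatch(r"memmap=(\d+)([KMG])!(\d+)([KMG])", arg)
    if m is None:
        raise ValueError(f"unsupported memmap argument: {arg!r}")
    size = int(m.group(1)) * UNITS[m.group(2)]
    start = int(m.group(3)) * UNITS[m.group(4)]
    return start, start + size


start, end = parse_memmap_pmem("memmap=2G!4G")
print(start, end)  # 4294967296 6442450944
```

So the test setup carves a 2 GiB pmem region out of RAM starting at the
4 GiB physical address.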


[PATCH 01/10] btrfs: create a mount option for dax
[PATCH 02/10] btrfs: basic dax read
[PATCH 03/10] btrfs: dax: read zeros from holes
[PATCH 04/10] Rename __endio_write_update_ordered() to
[PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
[PATCH 06/10] btrfs: dax write support
[PATCH 07/10] dax: export functions for use with btrfs
[PATCH 08/10] btrfs: dax add read mmap path
[PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
[PATCH 10/10] btrfs: dax mmap write

 fs/btrfs/Makefile   |    1 
 fs/btrfs/ctree.h    |   17 ++
 fs/btrfs/dax.c      |  303 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/file.c     |   29 ++++
 fs/btrfs/inode.c    |   54 +++++----
 fs/btrfs/ioctl.c    |    5 
 fs/btrfs/super.c    |   15 ++
 fs/dax.c            |   35 ++++--
 include/linux/dax.h |   16 ++
 9 files changed, 430 insertions(+), 45 deletions(-)

Comments

Qu Wenruo Dec. 5, 2018, 1:03 p.m. UTC | #1
On 2018/12/5 8:28 PM, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs. I understand there have been
> previous attempts at it. However, I wanted to make sure copy-on-write
> (COW) works on dax as well.
> 
> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:
> 
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because btrfs kernel module does not get back control after
> an mmap() writes. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

I'm not familiar with DAX, so it's completely possible I'm talking like
an idiot.

If btrfs_page_mkwrite() can't provide enough control, then I have a
crazy idea.

Force a page fault for every mmap() read/write (completely disabling the
page cache, like DIO), so that we get some control: we're informed when
the page is read and can do some hacks there.

Thanks,
Qu
Adam Borowski Dec. 5, 2018, 1:57 p.m. UTC | #2
On Wed, Dec 05, 2018 at 06:28:25AM -0600, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs.

Yay!

> I understand there have been previous attempts at it.  However, I wanted
> to make sure copy-on-write (COW) works on dax as well.

btrfs' usual use of CoW and DAX are thoroughly in conflict.

The very point of DAX is to have writes not go through the kernel: you
mmap the file then do all writes right to the pmem, flushing when needed
(without hitting the kernel) and having the processor+memory persist what
you wrote.

CoW via page faults is fine -- pmem is closer to memory than disk, and this
means the kernel will ask the filesystem for an extent to place the new page
in, copy the contents and let the process play with it.  But real btrfs CoW
would mean we'd need to page fault on ᴇᴠᴇʀʏ ꜱɪɴɢʟᴇ ᴡʀɪᴛᴇ.

Delaying CoW until the next commit doesn't help -- you'd need to store the
dirty page in DRAM then write it, which goes against the whole concept of
DAX.

The only way I see would be to CoW once then pretend the page is nodatacow
until the next commit, when we checksum it, add to the metadata trees, and
mark for CoWing on the next write.  Lots of complexity, and you still need to
copy the whole thing every commit (so no gain).

I.e., we're in nodatacow land.  CoW for metadata is fine.
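
A toy model of the "CoW once per commit" scheme described above, purely as
an illustration. CowOncePage is a hypothetical class and zlib's CRC-32
stands in for the btrfs checksum. The first write after a commit takes one
fault and copies the whole page; later writes are direct stores; commit
checksums the page and write-protects it again.

```python
import zlib


class CowOncePage:
    """One CoW fault per commit interval, direct writes afterwards."""

    def __init__(self, data: bytes):
        self.committed = bytes(data)   # extent referenced by the last commit
        self.working = None            # private copy made by the first fault
        self.csum = zlib.crc32(self.committed)
        self.faults = 0

    def write(self, offset: int, payload: bytes) -> None:
        if self.working is None:       # page is write-protected: fault
            self.faults += 1
            self.working = bytearray(self.committed)  # copy the whole page
        self.working[offset:offset + len(payload)] = payload  # direct store

    def commit(self) -> None:
        if self.working is not None:
            self.committed = bytes(self.working)
            self.csum = zlib.crc32(self.committed)
            self.working = None        # mark for CoW on the next write


page = CowOncePage(b"\0" * 4096)
for i in range(1000):
    page.write(i, b"x")                # 1000 stores, a single fault
page.commit()
page.write(0, b"y")                    # first store after commit faults again
print(page.faults)                     # 2
```

Every commit interval that touches the page still pays for a full page
copy, which is the "no gain" part.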

> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:
> 
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because btrfs kernel module does not get back control after
> an mmap() writes. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

Per the above, it sounds like nodatacow (i.e., "cow once") would be needed.

> 2. Currently, a user can continue writing on "old" extents of an mmaped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?

Same as for any other memory that's shared: when a new instance of sharing
is added (a snapshot/reflink in our case), you deny writes, causing a page
fault on the next attempt.  "pmem" is named "ᴘersistent ᴍᴇᴍory" for a
reason...
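
A sketch of that rule as a toy model, not kernel code. Extent, Mapping and
snapshot are hypothetical names, and the writable flag stands in for the
write bit in the page tables. Adding a sharer clears the flag; the next
store faults and, because the extent is shared, is redirected to a fresh
copy.

```python
class Extent:
    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.refs = 1


class Mapping:
    """Toy mmap of a one-extent file."""

    def __init__(self, extent: Extent):
        self.extent = extent
        self.writable = True
        self.faults = 0

    def store(self, offset: int, payload: bytes) -> None:
        if not self.writable:
            self._write_fault()
        self.extent.data[offset:offset + len(payload)] = payload

    def _write_fault(self) -> None:
        self.faults += 1
        if self.extent.refs > 1:       # shared: CoW into a new extent
            self.extent.refs -= 1
            self.extent = Extent(bytes(self.extent.data))
        self.writable = True


def snapshot(mapping: Mapping) -> Extent:
    """Share the extent with a snapshot and deny further direct writes."""
    mapping.extent.refs += 1
    mapping.writable = False
    return mapping.extent              # the snapshot's view of the data


m = Mapping(Extent(b"old!"))
m.store(0, b"o")                       # no fault, nothing is shared yet
snap = snapshot(m)
m.store(0, b"new")                     # faults once, lands in a new extent
m.store(3, b"?")                       # no further fault
print(bytes(snap.data), bytes(m.extent.data), m.faults)  # b'old!' b'new?' 1
```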

> Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
> command line parameter.

Might be more useful to use a bigger piece of the "disk" than 2G, it's not
in the danger area though.

Also note that it's utterly pointless to use any RAID modes; multi-dev
single is fine, DUP counts as RAID here.
* RAID0 is already done better in hardware (interleave)
* RAID1 would require hardware support, replication isn't easy
* RAID5/6 

What would make sense is disabling dax for any files that are not marked as
nodatacow.  This way, unrelated files can still use checksums or
compression, while only files meant as a pmempool or otherwise by a
pmem-aware program would have dax writes (you can still give read-only pages
that CoW to DRAM).  This way we can have write dax for only a subset of
files, and full set of btrfs features for the rest.  Write dax is dangerous
for programs that have no specific support: the vast majority of
database-like programs rely on page-level atomicity while pmem gives you
cacheline/word atomicity only; torn writes mean data loss.
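
To illustrate the torn-write hazard, a small simulation follows. It assumes
64-byte cache lines and a 4096-byte page, and it assumes lines reach pmem in
address order, which real hardware does not promise.
crash_during_page_write is a hypothetical helper.

```python
CACHELINE = 64   # assumed cache line size
PAGE = 4096      # assumed page size


def crash_during_page_write(old: bytes, new: bytes, lines_persisted: int) -> bytes:
    """Page contents if power is lost after `lines_persisted` lines hit pmem."""
    if len(old) != PAGE or len(new) != PAGE:
        raise ValueError("both images must be exactly one page")
    cut = min(lines_persisted * CACHELINE, PAGE)
    return new[:cut] + old[cut:]


old = b"A" * PAGE
new = b"B" * PAGE
torn = crash_during_page_write(old, new, lines_persisted=10)
print(torn == old, torn == new, torn.count(b"B"))  # False False 640
```

A program that expects to find either the old or the new page after a crash
finds neither.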


Meow!
Jeff Mahoney Dec. 5, 2018, 9:36 p.m. UTC | #3
On 12/5/18 8:03 AM, Qu Wenruo wrote:
> 
> 
> On 2018/12/5 8:28 PM, Goldwyn Rodrigues wrote:
>> This is a support for DAX in btrfs. I understand there have been
>> previous attempts at it. However, I wanted to make sure copy-on-write
>> (COW) works on dax as well.
>>
>> Before I present this to the FS folks I wanted to run this through the
>> btrfs. Even though I wish, I cannot get it correct the first time
>> around :/.. Here are some questions for which I need suggestions:
>>
>> Questions:
>> 1. I have been unable to do checksumming for DAX devices. While
>> checksumming can be done for reads and writes, it is a problem when mmap
>> is involved because btrfs kernel module does not get back control after
>> an mmap() writes. Any ideas are appreciated, or we would have to set
>> nodatasum when dax is enabled.
> 
> I'm not familar with DAX, so it's completely possible I'm talking like
> an idiot.

The general idea is:

1) there is no page cache involved. read() and write() are like direct
i/o in concept.  The user buffer is written directly (via what is
essentially a specialized memcpy) to the NVDIMM.
2) for mmap, once the mapping is established, the file system
is not involved.  The application writes directly to the memory as it
would with a normal mmap, except it's persistent.  All that's required to
ensure persistence is a CPU cache flush.  The only way the file system
is involved again is if some operation has occurred to reset the WP bit.

> If btrfs_page_mkwrite() can't provide enough control, then I have a
> crazy idea.

It can't, because it is only invoked on the page fault path and we want 
to try to limit those as much as possible.

> Forcing page fault for every mmap() read/write (completely disable page
> cache like DIO).
> So that we could get some control since we're informed to read the page
> and do some hacks there.

There's no way to force a page fault for every mmap read/write.  Even if
there were, we wouldn't want that.  No user would turn that on when they
can just make similar guarantees in their app (these are typically apps
that do this already) and not pay any performance penalty.   The idea
with DAX mmap is that the file system manages the namespace, space
allocation, and permissions.  Otherwise we stay out of the way.

-Jeff
Jeff Mahoney Dec. 5, 2018, 9:37 p.m. UTC | #4
On 12/5/18 7:28 AM, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs. I understand there have been
> previous attempts at it. However, I wanted to make sure copy-on-write
> (COW) works on dax as well.
> 
> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:
> 
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because btrfs kernel module does not get back control after
> an mmap() writes. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

Yep.  It has to be nodatasum, at least within the confines of datasum 
today.  DAX mmap writes are essentially in the same situation as with 
direct i/o when another thread modifies the buffer being submitted. 
Except rather than it being a race, it happens every time.  An 
alternative here could be to add the ability to mark a crc as unreliable
and then go back and update it once the last DAX mmap reference is
dropped on a range.  There's no reason to make this a requirement of the
initial implementation, though.
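
A rough sketch of the "unreliable crc" alternative, as a toy model only.
RangeCsum and its methods are hypothetical, and zlib's CRC-32 stands in for
the real checksum. The checksum is flagged unreliable while any writable DAX
mapping exists and is recomputed when the last one goes away.

```python
import zlib


class RangeCsum:
    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.csum = zlib.crc32(self.data)
        self.reliable = True
        self.dax_writers = 0

    def mmap_write(self) -> bytearray:
        """Hand out a writable view; the checksum can no longer be trusted."""
        self.dax_writers += 1
        self.reliable = False
        return self.data

    def munmap(self) -> None:
        self.dax_writers -= 1
        if self.dax_writers == 0:      # last reference dropped: fix up the crc
            self.csum = zlib.crc32(self.data)
            self.reliable = True

    def verify(self):
        """True/False when the crc can be checked, None when it is skipped."""
        if not self.reliable:
            return None
        return zlib.crc32(self.data) == self.csum


r = RangeCsum(b"hello world")
view_a = r.mmap_write()
view_b = r.mmap_write()
view_a[0:5] = b"HELLO"
print(r.verify())   # None: checksum marked unreliable
r.munmap()
print(r.verify())   # None: one mapping is still open
r.munmap()
print(r.verify())   # True: recomputed after the last unmap
```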

> 2. Currently, a user can continue writing on "old" extents of an mmaped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?

It's the second question that's the hard part.  As Adam describes later, 
setting each pfn read-only will ensure page faults cause the remapping.

The high level idea that Jan Kara and I came up with in our conversation 
at Labs conf is pretty expensive.  We'd need to set a flag that pauses 
new page faults, set the WP bit on affected ranges, do the snapshot, 
commit, clear the flag, and wake up the waiting threads.  Neither of us 
had any concrete idea of how well that would perform and it still 
depends on finding a good way to resolve all open mmap ranges on a 
subvolume.  Perhaps using the address_space->private_list anchored on 
each root would work.
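
The sequence above can be modelled in userspace as follows. This is a sketch
under stated assumptions: Subvolume and its methods are hypothetical, a
threading.Condition stands in for whatever would pause faults in the kernel,
and a list of booleans stands in for the WP state of each mapped range.

```python
import threading


class Subvolume:
    def __init__(self, n_ranges: int):
        self.cond = threading.Condition()
        self.snapshotting = False            # "pause new page faults" flag
        self.writable = [True] * n_ranges    # stand-in for per-range WP bits
        self.generation = 0
        self.log = []

    def write_fault(self, idx: int) -> None:
        """A store hit a write-protected range."""
        with self.cond:
            while self.snapshotting:         # wait for the snapshot to commit
                self.cond.wait()
            self.writable[idx] = True        # real code would CoW here
            self.log.append(("fault", idx, self.generation))

    def begin_snapshot(self) -> None:
        with self.cond:
            self.snapshotting = True                      # pause new faults
            self.writable = [False] * len(self.writable)  # set WP on ranges

    def finish_snapshot(self) -> None:
        with self.cond:
            self.generation += 1                          # snapshot + commit
            self.log.append(("commit", self.generation))
            self.snapshotting = False                     # clear the flag
            self.cond.notify_all()                        # wake the waiters


s = Subvolume(4)
s.begin_snapshot()
t = threading.Thread(target=s.write_fault, args=(2,))
t.start()                  # blocks (or arrives late); either way it runs
s.finish_snapshot()        # only after the commit
t.join()
print(s.log)               # [('commit', 1), ('fault', 2, 1)]
```

The model says nothing about cost; finding every open mmap range on the
subvolume is the part it simply assumes away.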

-Jeff

Robert White Dec. 6, 2018, 7:40 a.m. UTC | #5
On 12/5/18 9:37 PM, Jeff Mahoney wrote:
> The high level idea that Jan Kara and I came up with in our conversation 
> at Labs conf is pretty expensive.  We'd need to set a flag that pauses 
> new page faults, set the WP bit on affected ranges, do the snapshot, 
> commit, clear the flag, and wake up the waiting threads.  Neither of us 
> had any concrete idea of how well that would perform and it still 
> depends on finding a good way to resolve all open mmap ranges on a 
> subvolume.  Perhaps using the address_space->private_list anchored on 
> each root would work.

This is a potentially wild idea, so "grain of salt" and all that. I may 
misuse the exact wording.

So the essential problem of DAX is basically the opposite of 
data-deduplication. Instead of merging two duplicate data regions, you 
want to mark regions as at-risk while keeping the original content 
intact if there are snapshots in conflict.

So suppose you _require_ data checksums and data mode of "dup" or mirror 
or one of the other fault tolerant layouts.

By definition any block that gets written with content that it didn't 
have before will now have a bad checksum.

If the inode is flagged for direct IO that's an indication that the 
block has been updated.

At this point you really just need to do the opposite of deduplication, 
as in find/recover the original contents and assign/leave assigned those 
to the old/other snapshots, then compute the new checksum on the 
"original block" and assign it to the active subvolume.

So when a region is mapped for direct IO, and its refcount is greater
than one, and you get to a sync or close event, you "recover" the old 
contents into a new location and assign those to "all the other users". 
Now that original storage region has only one user, so on sync or close 
you fix its checksums on the cheap.

Instead of the new data being a small rock sitting over a large rug to 
make a lump, the new data is like a rock being slid under the rug to 
make a lump.

So the first write to an extent creates a burdensome copy to retain the 
old contents, but second and subsequent writes to the same extent only 
have the cost of an _eventual_ checksum of the original block list.
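
A toy model of this, with loud assumptions: DupExtent and sync are
hypothetical names, zlib's CRC-32 stands in for the data checksum, and the
old contents are taken from the untouched second ("dup") copy, which is my
reading of where they would be recovered from.

```python
import zlib


class DupExtent:
    """Extent stored twice; DAX stores only hit the primary copy."""

    def __init__(self, data: bytes):
        self.primary = bytearray(data)
        self.mirror = bytes(data)
        self.csum = zlib.crc32(self.mirror)
        self.refs = 1                  # active subvolume plus snapshots

    def dirty(self) -> bool:
        return zlib.crc32(self.primary) != self.csum


def sync(extent: DupExtent):
    """Return the extent split off for the other users, or None."""
    if not extent.dirty():
        return None
    split = None
    if extent.refs > 1:
        # First write since sharing: move the old contents to a new
        # location and hand that to all the other users.
        split = DupExtent(extent.mirror)
        split.refs = extent.refs - 1
        extent.refs = 1
    # Cheap part: the original location has one user, fix mirror and csum.
    extent.mirror = bytes(extent.primary)
    extent.csum = zlib.crc32(extent.primary)
    return split


e = DupExtent(b"v1 data")
e.refs = 2                             # shared with one snapshot
e.primary[0:2] = b"v2"                 # DAX store, no kernel involvement
old = sync(e)
print(old.mirror, bytes(e.primary), e.refs)  # b'v1 data' b'v2 data' 1
e.primary[0:2] = b"v3"
print(sync(e))                         # None: only a checksum update
```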

Maybe if the data isn't already duplicated then the write mapping or the
DAX open or the setting of the S_DUP flag could force the file into an 
extent block that _is_ duplicated.

The mental leap required is that the new blocks don't need to belong to 
the new state being created. The new blocks can be associated to the 
snapshots since data copy is idempotent.

The side note is that it only ever matters if the usage count is greater 
than one, so at worst taking a snapshot, which is already a _little_ 
racy anyway, would/could trigger a semi-lightweight copy of any S_DAX files:

if S_DAX:
    if checksum invalid:
        copy data as-is and checksum, store in snapshot
    else:
        look for duplicate checksum
        if duplicate found:
            assign that extent to the snapshot
        else:
            if file opened for writing and has any mmaps for write:
                copy extent and assign to new snapshot
            else:
                increment usage count and assign current block to snapshot

Anyway, I only know enough of the internals to be dangerous.

Since the real goal of mmap is speed during actual update, this idea is 
basically about amortizing the copy costs into the task of maintaining 
the snapshots instead of leaving them in the immediate hands of the 
time-critical updater.

The flush, munmap, or close by the user, or a system-wide sync event,
are also good points to expense the bookkeeping time.
Johannes Thumshirn Dec. 6, 2018, 10:07 a.m. UTC | #6
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs. I understand there have been
> previous attempts at it. However, I wanted to make sure copy-on-write
> (COW) works on dax as well.
> 
> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:

Hi Goldwyn,

I've thrown your patches (from your git tree) onto one of my pmem test
machines with this pmem config:

mayhem:~/:[0]# ndctl list
[
  {
    "dev":"namespace1.0",
    "mode":"fsdax",
    "map":"dev",
    "size":792721358848,
    "uuid":"3fd4ab18-5145-4675-85a0-e05e6f9bcee4",
    "raw_uuid":"49264743-2351-41c5-9db9-38534813df61",
    "sector_size":512,
    "blockdev":"pmem1",
    "numa_node":1
  },
  {
    "dev":"namespace0.0",
    "mode":"fsdax",
    "map":"dev",
    "size":792721358848,
    "uuid":"dd0aec3c-7721-4621-8898-e50684a371b5",
    "raw_uuid":"84ff5463-f76e-4ddf-a248-85122541e909",
    "sector_size":4096,
    "blockdev":"pmem0",
    "numa_node":0
  }
]

Unfortunately I hit a btrfs_panic() with btrfs/002.
export TEST_DEV=/dev/pmem0
export SCRATCH_DEV=/dev/pmem1
export MOUNT_OPTIONS="-o dax"
./check
[...]
[  178.173113] run fstests btrfs/002 at 2018-12-06 10:55:43
[  178.357044] BTRFS info (device pmem0): disk space caching is enabled
[  178.357047] BTRFS info (device pmem0): has skinny extents
[  178.360042] BTRFS info (device pmem0): enabling ssd optimizations
[  178.475918] BTRFS: device fsid ee888255-7f4a-4bf7-af65-e8a6a354aca8
devid 1 transid 3 /dev/pmem1
[  178.505717] BTRFS info (device pmem1): disk space caching is enabled
[  178.513593] BTRFS info (device pmem1): has skinny extents
[  178.520384] BTRFS info (device pmem1): flagging fs with big metadata
feature
[  178.530997] BTRFS info (device pmem1): enabling ssd optimizations
[  178.538331] BTRFS info (device pmem1): creating UUID tree
[  178.587200] BTRFS critical (device pmem1): panic in
ordered_data_tree_panic:57: Inconsistency in ordered tree at offset 0
(errno=-17 Object already exists)
[  178.603129] ------------[ cut here ]------------
[  178.608667] kernel BUG at fs/btrfs/ordered-data.c:57!
[  178.614333] invalid opcode: 0000 [#1] SMP PTI
[  178.619295] CPU: 87 PID: 8225 Comm: dd Kdump: loaded Tainted: G
      E     4.20.0-rc5-default-btrfs-dax #920
[  178.630090] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS
SE5C620.86B.0D.01.0010.072020182008 07/20/2018
[  178.640626] RIP: 0010:__btrfs_add_ordered_extent+0x325/0x410 [btrfs]
[  178.647404] Code: 28 4d 89 f1 49 c7 c0 90 9c 57 c0 b9 ef ff ff ff ba
39 00 00 00 48 c7 c6 10 fe 56 c0 48 8b b8 d8 03 00 00 31 c0 e8 e2 99 06
00 <0f> 0b 65 8b 05 d2 e4 b0 3f 89 c0 48 0f a3 05 78 5e cf c2 0f 92 c0
[  178.667019] RSP: 0018:ffffa3e3674c7ba8 EFLAGS: 00010096
[  178.672684] RAX: 000000000000008f RBX: ffff9770c2ac5748 RCX:
0000000000000000
[  178.680254] RDX: ffff97711f9dee80 RSI: ffff97711f9d6868 RDI:
ffff97711f9d6868
[  178.687831] RBP: ffff97711d523000 R08: 0000000000000000 R09:
000000000000065a
[  178.695411] R10: 00000000000003ff R11: 0000000000000001 R12:
ffff97710d66da70
[  178.702993] R13: ffff9770c2ac5600 R14: 0000000000000000 R15:
ffff97710d66d9c0
[  178.710573] FS:  00007fe11ef90700(0000) GS:ffff97711f9c0000(0000)
knlGS:0000000000000000
[  178.719122] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  178.725380] CR2: 000000000156a000 CR3: 000000eb30dfc006 CR4:
00000000007606e0
[  178.732999] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  178.740574] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[  178.748147] PKRU: 55555554
[  178.751297] Call Trace:
[  178.754230]  btrfs_add_ordered_extent_dio+0x1d/0x30 [btrfs]
[  178.760269]  btrfs_create_dio_extent+0x79/0xe0 [btrfs]
[  178.765930]  btrfs_get_extent_map_write+0x1a9/0x2b0 [btrfs]
[  178.771959]  btrfs_file_dax_write+0x1f8/0x4f0 [btrfs]
[  178.777508]  ? current_time+0x3f/0x70
[  178.781672]  btrfs_file_write_iter+0x384/0x580 [btrfs]
[  178.787265]  ? pipe_read+0x243/0x2a0
[  178.791298]  __vfs_write+0xee/0x170
[  178.795241]  vfs_write+0xad/0x1a0
[  178.799008]  ? vfs_read+0x111/0x130
[  178.802949]  ksys_write+0x42/0x90
[  178.806712]  do_syscall_64+0x5b/0x180
[  178.810829]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  178.816334] RIP: 0033:0x7fe11eabb3d0
[  178.820364] Code: 73 01 c3 48 8b 0d b8 ea 2b 00 f7 d8 64 89 01 48 83
c8 ff c3 66 0f 1f 44 00 00 83 3d b9 43 2c 00 00 75 10 b8 01 00 00 00 0f
05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 2e 90 01 00 48 89 04 24
[  178.840052] RSP: 002b:00007ffec969d978 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[  178.848100] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
00007fe11eabb3d0
[  178.855715] RDX: 0000000000000400 RSI: 000000000156a000 RDI:
0000000000000001
[  178.863326] RBP: 0000000000000400 R08: 0000000000000003 R09:
00007fe11ed7a698
[  178.870928] R10: 0000000010a8b550 R11: 0000000000000246 R12:
000000000156a000
[  178.878529] R13: 0000000000000000 R14: 000000000156a000 R15:
00007ffec969e9f1
[  178.886177] Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E)
nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) fscache(E) devlink(E)
ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E)
iptable_filter(E) ip_tables(E) x_tables(E) rpcrdma(E) sunrpc(E)
rdma_ucm(E) ib_uverbs(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E)
intel_rapl(E) libiscsi(E) af_packet(E) scsi_transport_iscsi(E)
skx_edac(E) configfs(E) x86_pkg_temp_thermal(E) intel_powerclamp(E)
iscsi_ibft(E) coretemp(E) iscsi_boot_sysfs(E) ipmi_ssif(E) kvm(E) msr(E)
i40iw(E) ib_core(E) ext4(E) nls_iso8859_1(E) nls_cp437(E) crc16(E)
mbcache(E) vfat(E) irqbypass(E) crc32_pclmul(E) ghash_clmulni_intel(E)
jbd2(E) joydev(E) fat(E) i40e(E) aesni_intel(E) iTCO_wdt(E) ptp(E)
aes_x86_64(E) iTCO_vendor_support(E) mei_me(E) crypto_simd(E) ipmi_si(E)
pps_core(E) lpc_ich(E) ioatdma(E) dax_pmem(E) ipmi_devintf(E) nd_pmem(E)
cryptd(E) glue_helper(E) pcspkr(E) mfd_core(E) i2c_i801(E) device_dax(E)
ipmi_msghandler(E) mei(E) nd_btt(E) dca(E)
[  178.886201]  pcc_cpufreq(E) acpi_pad(E) btrfs(E) libcrc32c(E) xor(E)
zstd_decompress(E) zstd_compress(E) xxhash(E) raid6_pq(E) hid_generic(E)
usbhid(E) sd_mod(E) sr_mod(E) cdrom(E) ast(E) i2c_algo_bit(E)
drm_kms_helper(E) syscopyarea(E) ahci(E) sysfillrect(E) xhci_pci(E)
sysimgblt(E) fb_sys_fops(E) libahci(E) xhci_hcd(E) ttm(E)
crc32c_intel(E) drm(E) libata(E) usbcore(E) wmi(E) nfit(E) libnvdimm(E)
button(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E)
scsi_dh_alua(E) scsi_mod(E) efivarfs(E) autofs4(E)
Goldwyn Rodrigues Dec. 6, 2018, 11:47 a.m. UTC | #7
On 11:07 06/12, Johannes Thumshirn wrote:
> On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> > This is a support for DAX in btrfs. I understand there have been
> > previous attempts at it. However, I wanted to make sure copy-on-write
> > (COW) works on dax as well.
> > 
> > Before I present this to the FS folks I wanted to run this through the
> > btrfs. Even though I wish, I cannot get it correct the first time
> > around :/.. Here are some questions for which I need suggestions:
> 
> Hi Goldwyn,
> 
> I've thrown your patches (from your git tree) onto one of my pmem test
> machines with this pmem config:

Thanks. I will look into this. Ordered extents have been a pain for me to
deal with (though mainly because of my incorrect usage).

> -- 
> Johannes Thumshirn                            SUSE Labs Filesystems
> jthumshirn@suse.de                                +49 911 74053 689
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850